
Data Preprocessing in ML: The Complete Practical Guide

In Plain English 🔥
Imagine you're baking a cake and your recipe calls for cups of flour, but someone gave you the flour in grams, some of it is wet, and a handful of raisins are still in the bag mixed in. Before you can bake anything, you have to fix the ingredients first. Data preprocessing is exactly that — cleaning, converting, and organising your raw data so a machine learning model can actually learn from it. Garbage in, garbage out.

Every ML tutorial starts with a clean, perfectly formatted dataset. Real life never does. In the real world, data comes from messy CSV exports, broken sensors, rushed data-entry clerks, and legacy databases that mix text and numbers in the same column. The gap between raw data and model-ready data is where most ML projects actually live — and die. Skipping preprocessing is the single biggest reason a model that looked great in a notebook performs terribly in production.

Preprocessing solves three fundamental problems: data your model can't read (wrong types, text categories), data your model misreads (wildly different scales that trick distance-based algorithms), and data that simply isn't there (missing values that silently corrupt your results). Each of these problems has a well-understood solution, but the order and method you choose matter enormously depending on your data and your model.

By the end of this article you'll be able to audit a raw dataset, choose the right strategy for missing values, encode categorical features correctly, scale numerical features without leaking information from your test set, and wire everything together in a reproducible scikit-learn Pipeline. You'll also know the three mistakes that trip up intermediate practitioners — not just beginners.

Handling Missing Values — Why 'Just Drop Them' Is Usually Wrong

Missing data isn't random noise you can ignore. It's a signal. A missing income field in a loan application might mean the applicant refused to share it — which is itself predictive. Blindly dropping rows throws away that signal and shrinks your training set.

There are three strategies: deletion, imputation, and indicator flags. Deletion (dropping rows or columns) only makes sense when less than 5% of a column is missing AND missingness is truly random. Imputation replaces missing values with something plausible — the mean or median for numerical data, the most frequent value for categorical data, or a model-predicted value for high-stakes features.
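A quick audit step helps you pick between these strategies: measure the missing fraction per column before committing to anything. The sketch below uses made-up toy columns for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns with gaps
df = pd.DataFrame({
    'age':    [25, np.nan, 47, 31, 52, 29, 40, 33],
    'income': [48, 72, np.nan, 61, np.nan, 43, np.nan, 55],
})

# isnull().mean() gives the fraction of missing values per column
missing_frac = df.isnull().mean()
print(missing_frac['age'])     # 0.125
print(missing_frac['income'])  # 0.375
```

Both columns are above the ~5% rule of thumb, so imputation rather than deletion would be the sensible default here.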

The best practice for production is to combine imputation with a binary indicator column: a new column that says 'this value was missing' lets the model learn from the missingness pattern itself. Scikit-learn's SimpleImputer handles the replacement; you add the flag column manually before imputing.
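If you'd rather not build the flag columns by hand, SimpleImputer can also append them itself via its add_indicator=True option. A minimal sketch on made-up ages:

```python
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[25.0], [np.nan], [47.0], [31.0]])

# add_indicator=True appends a binary missing-value indicator column
# alongside the imputed values, in a single transformer.
imputer = SimpleImputer(strategy='median', add_indicator=True)
result = imputer.fit_transform(ages)

print(result)
# [[25.  0.]
#  [31.  1.]   <- median of 25, 47, 31 is 31; the flag marks it as imputed
#  [47.  0.]
#  [31.  0.]]
```

The manual approach in the full example below is still worth knowing, because it keeps the flag columns as named DataFrame columns rather than anonymous array positions.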

Crucially, you must fit your imputer on training data only, then transform both train and test. Fitting on the full dataset leaks future information into your model — a subtle bug that inflates validation scores.

handle_missing_values.py · PYTHON
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# --- Simulate a messy real-world dataset ---
np.random.seed(42)
raw_data = pd.DataFrame({
    'age':    [25, np.nan, 47, 31, np.nan, 52, 29, 40],
    'salary': [48000, 72000, np.nan, 61000, 58000, np.nan, 43000, 95000],
    'bought': [0, 1, 1, 0, 1, 1, 0, 1]   # target label
})

print("Raw data:")
print(raw_data)
print(f"\nMissing counts:\n{raw_data.isnull().sum()}")

# --- Step 1: Add binary indicator columns BEFORE imputing ---
# These columns tell the model WHERE data was missing — that pattern has meaning.
for col in ['age', 'salary']:
    raw_data[f'{col}_was_missing'] = raw_data[col].isnull().astype(int)

# --- Step 2: Split BEFORE fitting the imputer ---
# Fitting on all data first would leak test-set statistics into training.
features = raw_data.drop(columns='bought')
labels   = raw_data['bought']

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42
)

# --- Step 3: Fit imputer on TRAINING data only ---
# strategy='median' is more robust to outliers than 'mean'
numeric_cols = ['age', 'salary']
imputer = SimpleImputer(strategy='median')

# fit_transform on train, transform-only on test
X_train[numeric_cols] = imputer.fit_transform(X_train[numeric_cols])
X_test[numeric_cols]  = imputer.transform(X_test[numeric_cols])   # NO fit here!

print("\nTraining set after imputation:")
print(X_train.round(1))
print("\nImputer medians learned from training data:", imputer.statistics_)
▶ Output
Raw data:
    age   salary  bought
0  25.0  48000.0       0
1   NaN  72000.0       1
2  47.0      NaN       1
3  31.0  61000.0       0
4   NaN  58000.0       1
5  52.0      NaN       1
6  29.0  43000.0       0
7  40.0  95000.0       1

Missing counts:
age       2
salary    2
bought    0
dtype: int64

Training set after imputation:
    age   salary  age_was_missing  salary_was_missing
5  52.0  61000.0                0                   1
1  36.0  72000.0                1                   0
...

Imputer medians learned from training data: [36. 61000.]
⚠️
Watch Out: Train-Test Leakage
If you call imputer.fit_transform(X) on your full dataset before splitting, the imputer's median includes test-set values. Your model has 'seen' the test data through its statistics. Validation scores look better than they are, and production performance tanks. Always split first, then fit preprocessors.

Encoding Categorical Features — Choosing Between Label, Ordinal, and One-Hot

Machine learning models are fundamentally mathematical. They multiply, add, and compare numbers. When your data has a column called 'City' with values like 'London', 'Paris', 'Tokyo', the model can't do anything with strings — you have to convert them.

The wrong choice here actively hurts your model. Label encoding assigns integers arbitrarily: London=0, Paris=1, Tokyo=2. That implies Tokyo > Paris > London mathematically, which is nonsense. Any model using arithmetic on those integers — linear regression, neural nets, SVMs — will learn a false relationship.
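You can see the arbitrary ordering directly: LabelEncoder sorts the class names alphabetically before assigning integers, so the "order" is an accident of spelling.

```python
from sklearn.preprocessing import LabelEncoder

cities = ['London', 'Paris', 'Tokyo', 'London']

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# Classes are sorted alphabetically, so the integers imply an order
# ('Tokyo' > 'Paris' > 'London') that the data never had.
print(list(encoder.classes_))  # ['London', 'Paris', 'Tokyo']
print(codes.tolist())          # [0, 1, 2, 0]
```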

One-Hot Encoding (OHE) is the correct fix for nominal categories (no natural order). It creates a new binary column per category: is_London, is_Paris, is_Tokyo. No false ordering. The trade-off is that high-cardinality columns (e.g. 500 cities) explode your feature space — in that case, target encoding or embedding layers are better alternatives.

Ordinal encoding IS appropriate when the order genuinely matters: ['cold', 'warm', 'hot'] → [0, 1, 2] is correct because hot > warm > cold is real. Use OrdinalEncoder for these, not LabelEncoder (which is meant for target labels only).

Always handle unseen categories in your test set. A category that appears in production but wasn't in training will crash a naive encoder.

encode_categorical_features.py · PYTHON
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# --- Dataset with two types of categorical columns ---
clothing_data = pd.DataFrame({
    'city':        ['London', 'Paris', 'Tokyo', 'London', 'Berlin', 'Paris'],
    'size':        ['small', 'large', 'medium', 'large', 'small', 'medium'],
    'price_usd':   [120, 85, 200, 115, 95, 90]
})

print("Original data:")
print(clothing_data)

# 'city' is NOMINAL — no natural order, use One-Hot Encoding
# 'size' is ORDINAL — small < medium < large, order is meaningful

nominal_features = ['city']
ordinal_features  = ['size']
numeric_features  = ['price_usd']

# Define the correct ordering for the ordinal column explicitly
size_order = [['small', 'medium', 'large']]

preprocessor = ColumnTransformer(transformers=[
    ('one_hot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False), nominal_features),
    ('ordinal',  OrdinalEncoder(categories=size_order),                       ordinal_features),
    ('passthrough', 'passthrough',                                             numeric_features)
])

# fit_transform on the full set here just for demonstration
encoded_array = preprocessor.fit_transform(clothing_data)

# Recover column names for readability
ohe_feature_names = preprocessor.named_transformers_['one_hot'].get_feature_names_out(nominal_features)
all_column_names  = list(ohe_feature_names) + ordinal_features + numeric_features

encoded_df = pd.DataFrame(encoded_array, columns=all_column_names)

print("\nEncoded data:")
print(encoded_df)
print("\n'size' ordinal mapping — small=0, medium=1, large=2 (correct order preserved)")
▶ Output
Original data:
     city    size  price_usd
0  London   small        120
1   Paris   large         85
2   Tokyo  medium        200
3  London   large        115
4  Berlin   small         95
5   Paris  medium         90

Encoded data:
   city_Berlin  city_London  city_Paris  city_Tokyo  size  price_usd
0          0.0          1.0         0.0         0.0   0.0      120.0
1          0.0          0.0         1.0         0.0   2.0       85.0
2          0.0          0.0         0.0         1.0   1.0      200.0
3          0.0          1.0         0.0         0.0   2.0      115.0
4          1.0          0.0         0.0         0.0   0.0       95.0
5          0.0          0.0         1.0         0.0   1.0       90.0

'size' ordinal mapping — small=0, medium=1, large=2 (correct order preserved)
⚠️
Pro Tip: handle_unknown='ignore'
Always set handle_unknown='ignore' in OneHotEncoder when used in a Pipeline. If a new city appears in production data that wasn't in training, this setting outputs a zero row instead of crashing. Without it, your deployed model throws a ValueError on live traffic — a painful bug to debug at 2am.

Feature Scaling — Why Your Algorithm's Math Demands It

Picture two features: age (18–65) and annual salary (30,000–150,000). The salary values are roughly a thousand times larger. Any algorithm that computes distances or uses gradient descent treats salary as roughly a thousand times more important — purely because of measurement units, not because it actually matters more.

This kills k-Nearest Neighbours (distances dominated by salary), SVMs, and gradient descent convergence in neural nets. Tree-based models like Random Forest and XGBoost are the exception — they split on thresholds and don't care about absolute scale.
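That claim is easy to verify: train the same decision tree on raw and rescaled copies of a toy dataset and compare predictions. The data below is synthetic, for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Same data, but the second feature inflated by a factor of 1000
X_big = X.copy()
X_big[:, 1] *= 1000.0

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_big = DecisionTreeClassifier(random_state=0).fit(X_big, y)

# The split thresholds move, but the resulting predictions do not
print(np.array_equal(tree_raw.predict(X), tree_big.predict(X_big)))
```

Rescaling a feature by a positive constant preserves the ordering of its values, so the tree chooses the same splits (at rescaled thresholds) and makes the same predictions.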

Two scalers solve this in different ways. StandardScaler subtracts the mean and divides by standard deviation, producing a distribution centred at 0 with unit variance. Use it when your data is roughly Gaussian or when the algorithm assumes it (linear/logistic regression, PCA, SVMs).
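A three-value example makes the formula concrete. Note that StandardScaler uses the population standard deviation (ddof=0), which matches NumPy's default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[20.0], [30.0], [40.0]])

z = StandardScaler().fit_transform(ages)

# Manual check: mean = 30, population std = sqrt((100 + 0 + 100) / 3) ~ 8.165
manual = (ages - ages.mean()) / ages.std()

print(np.allclose(z, manual))  # True
print(z.round(3).ravel())      # [-1.225  0.     1.225]
```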

MinMaxScaler compresses values into a fixed range, typically [0, 1]. Use it when you need bounded outputs — for example, feeding pixel values into a neural network, or when the algorithm explicitly requires [0,1] input. Its weakness: a single extreme outlier squashes all other values into a tiny range.

RobustScaler uses the median and interquartile range instead of mean and standard deviation. It's your best friend when data has significant outliers — a faulty sensor reading of 999999 won't ruin your entire scaling.

compare_feature_scalers.py · PYTHON
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# --- Sensor readings dataset with one outlier row (row index 5) ---
sensor_data = pd.DataFrame({
    'temperature_c': [22.1, 23.5, 21.8, 24.0, 22.9, 999.9, 23.1],  # 999.9 is a broken sensor
    'humidity_pct':  [45,   50,   43,   55,   48,   47,    51  ]
})

print("Original sensor readings:")
print(sensor_data)
print()

# Fit each scaler independently on the same data for comparison
standard_scaler = StandardScaler()
minmax_scaler   = MinMaxScaler()
robust_scaler   = RobustScaler()   # uses median + IQR, ignores outliers effectively

std_scaled    = standard_scaler.fit_transform(sensor_data)
minmax_scaled = minmax_scaler.fit_transform(sensor_data)
robust_scaled = robust_scaler.fit_transform(sensor_data)

# Package results for easy comparison
result = pd.DataFrame({
    'temp_original':  sensor_data['temperature_c'],
    'temp_standard':  std_scaled[:, 0].round(3),
    'temp_minmax':    minmax_scaled[:, 0].round(3),
    'temp_robust':    robust_scaled[:, 0].round(3),
})

print("Scaling comparison (temperature column only):")
print(result)
print()
print("Notice how MinMaxScaler crushes all normal readings to near-zero")
print("because the outlier 999.9 dominates the range.")
print("RobustScaler keeps normal readings well spread — outlier is still large but harmless.")
▶ Output
Original sensor readings:
   temperature_c  humidity_pct
0           22.1            45
1           23.5            50
2           21.8            43
3           24.0            55
4           22.9            48
5          999.9            47
6           23.1            51

Scaling comparison (temperature column only):
   temp_original  temp_standard  temp_minmax  temp_robust
0           22.1         -0.411        0.000        -0.80
1           23.5         -0.406        0.002         0.32
2           21.8         -0.411        0.000        -1.04
3           24.0         -0.405        0.002         0.72
4           22.9         -0.408        0.001        -0.16
5          999.9          2.449        1.000        781.44
6           23.1         -0.408        0.001         0.00

Notice how MinMaxScaler crushes all normal readings to near-zero
because the outlier 999.9 dominates the range.
RobustScaler keeps normal readings well spread — outlier is still large but harmless.
🔥
Interview Gold: Tree Models Don't Need Scaling
Decision trees, Random Forests, and XGBoost are scale-invariant. They pick split thresholds, so it doesn't matter if salary is in dollars or thousands of dollars. If an interviewer asks why your Random Forest pipeline doesn't include a scaler, this is the answer. Knowing WHEN to skip a step shows real understanding.

Wiring It All Together With a scikit-learn Pipeline

You've now got individual tools for missing values, encoding, and scaling. The temptation is to apply them manually one by one in a sequence of function calls. Don't. Manual preprocessing has two fatal flaws: you'll inevitably leak training statistics into your test set (because it's easy to forget to split first), and you can't reliably reproduce or deploy the same sequence.

Scikit-learn's Pipeline chains transformers and a final estimator into a single object. When you call pipeline.fit(X_train, y_train), every transformer is fit on training data only and then applied in sequence. When you call pipeline.predict(X_test), transformers are applied using the already-fitted parameters — no leakage possible.

ColumnTransformer lets you apply different preprocessing to different columns inside the same Pipeline step. Numeric columns get imputed then scaled; categorical columns get imputed then one-hot encoded. Everything stays in sync.

This pattern also makes deployment trivial. You save one pipeline object with joblib. You load it in production. You call predict on raw, unprocessed input. The pipeline handles everything. No separate preprocessing script to maintain.

full_preprocessing_pipeline.py · PYTHON
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib

# --- Realistic loan application dataset ---
np.random.seed(0)
loan_data = pd.DataFrame({
    'age':         [34, np.nan, 28, 45, 52, 31, np.nan, 40, 37, 29],
    'income':      [52000, 81000, np.nan, 120000, 95000, 43000, 67000, np.nan, 58000, 74000],
    'employment':  ['employed', 'self-employed', 'employed', np.nan,
                    'employed', 'unemployed', 'employed', 'self-employed', np.nan, 'employed'],
    'city':        ['London', 'Manchester', 'London', 'Birmingham',
                    'London', 'Manchester', 'Birmingham', 'London', 'Manchester', 'London'],
    'loan_approved': [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]   # target
})

features = loan_data.drop(columns='loan_approved')
labels   = loan_data['loan_approved']

# Identify which columns need which treatment
numeric_cols     = ['age', 'income']
categorical_cols = ['employment', 'city']

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

# --- Build the numeric sub-pipeline ---
# Impute first (median is robust to outliers), then scale
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

# --- Build the categorical sub-pipeline ---
# Impute with most frequent value, then one-hot encode
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# --- Combine into one ColumnTransformer ---
preprocessor = ColumnTransformer(transformers=[
    ('numeric',      numeric_pipeline,      numeric_cols),
    ('categorical',  categorical_pipeline,  categorical_cols)
])

# --- Full pipeline: preprocessing + model ---
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier',   LogisticRegression(random_state=42, max_iter=500))
])

# One call to fit — everything happens in the correct order, no leakage
full_pipeline.fit(X_train, y_train)

predictions = full_pipeline.predict(X_test)
accuracy    = accuracy_score(y_test, predictions)

print(f"Test accuracy: {accuracy:.2f}")
print(f"Predictions:   {predictions}")
print(f"Actual labels: {y_test.values}")

# --- Save the entire pipeline for deployment ---
# In production: load this file and call full_pipeline.predict(raw_input_df)
joblib.dump(full_pipeline, 'loan_approval_pipeline.pkl')
print("\nPipeline saved to loan_approval_pipeline.pkl")
print("In production, load with: pipeline = joblib.load('loan_approval_pipeline.pkl')")
▶ Output
Test accuracy: 1.00
Predictions: [1 0 1]
Actual labels: [1 0 1]

Pipeline saved to loan_approval_pipeline.pkl
In production, load with: pipeline = joblib.load('loan_approval_pipeline.pkl')
⚠️
Pro Tip: Cross-Validate the Whole Pipeline
Pass your full Pipeline to cross_val_score() instead of just the model. This guarantees that preprocessing is re-fit on each fold's training data, not on all the data before folding. Without this, cross-validation silently leaks, and your CV scores overestimate real-world performance. Pipeline makes this trivially safe.
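A sketch of that pattern, using a simplified numeric-only pipeline on made-up data (the values are hypothetical, for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = pd.DataFrame({
    'age':    [25, np.nan, 47, 31, 52, 29, 40, 33, np.nan, 45],
    'income': [48, 72, np.nan, 61, 58, 43, 95, 55, 66, np.nan],
})
y = pd.Series([0, 1, 1, 0, 1, 0, 1, 0, 1, 1])

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('model',   LogisticRegression(max_iter=500)),
])

# cross_val_score clones and re-fits the WHOLE pipeline inside every fold,
# so the imputer's median and the scaler's mean come only from that fold's
# training split — never from its validation split.
scores = cross_val_score(pipe, X, y, cv=3)
print(len(scores))  # 3
```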
Aspect                  | StandardScaler                            | MinMaxScaler                                     | RobustScaler
------------------------|-------------------------------------------|--------------------------------------------------|--------------------------------------------
Formula                 | (x - mean) / std                          | (x - min) / (max - min)                          | (x - median) / IQR
Output range            | Unbounded (roughly -3 to 3)               | Exactly [0, 1]                                   | Unbounded, centred on median
Outlier sensitivity     | High — outliers shift mean and std        | Very high — single outlier dominates range       | Low — uses median and IQR, ignores tails
Best for                | Gaussian data, PCA, linear models, SVMs   | Neural nets needing bounded input, image pixels  | Data with known outliers (sensors, finance)
Loses interpretability? | Yes — values no longer in original units  | Partially — proportional but shifted             | Yes — relative to median not mean

🎯 Key Takeaways

  • Split your data BEFORE fitting any preprocessor — fitting on the full dataset leaks test-set statistics into training, silently inflating your validation scores.
  • Use OneHotEncoder for nominal categories (no natural order), OrdinalEncoder with explicit ordering for ordinal categories, and never LabelEncoder on input features.
  • RobustScaler is your go-to when data contains outliers — it uses median and IQR instead of mean and std, so a single bad sensor reading won't crush your entire feature range.
  • A scikit-learn Pipeline with ColumnTransformer isn't just convenience — it's the only production-safe way to guarantee that preprocessing is applied identically at train time and predict time without manual error.

⚠ Common Mistakes to Avoid

  • Mistake 1: Fitting preprocessors on the full dataset before train/test split — Your imputer or scaler learns statistics from test data (e.g. the test set's median salary), which then influences training. Validation accuracy looks great but production performance is lower than expected. Fix: always call train_test_split first, then fit any preprocessor exclusively on X_train.
  • Mistake 2: Using LabelEncoder on input features instead of OrdinalEncoder — LabelEncoder encodes alphabetically (Berlin=0, London=1, Paris=2, Tokyo=3), implying Tokyo is mathematically 'greater than' London. Linear models and SVMs learn this false relationship and produce nonsense coefficients. Fix: use OneHotEncoder for nominal categories and OrdinalEncoder (with explicit category ordering) for ordinal ones. Reserve LabelEncoder for the target label only.
  • Mistake 3: Not handling unseen categories in production — A city or product type that never appeared during training causes OneHotEncoder to raise a ValueError at prediction time, crashing your API. Fix: always set handle_unknown='ignore' in OneHotEncoder when building production pipelines. The encoder will output an all-zero row for unknown categories instead of throwing an exception.

Interview Questions on This Topic

  • Q: Why must you fit your preprocessing transformers only on training data? What specifically goes wrong if you fit on the full dataset?
  • Q: When would you choose RobustScaler over StandardScaler, and can you give a concrete industry example where the difference matters?
  • Q: Why do tree-based models like Random Forest not require feature scaling, while logistic regression and SVMs do? What property of these algorithms explains this?

Frequently Asked Questions

What is the correct order for data preprocessing steps in machine learning?

Split into train and test first, then in this order: handle missing values (imputation), encode categorical features, scale numerical features, and finally feed into your model. Splitting first is non-negotiable — every other step must be fit on training data only. Wrapping all steps in a scikit-learn Pipeline enforces this order automatically.

Do I always need to scale features for every machine learning model?

No. Tree-based models like Decision Trees, Random Forests, and XGBoost are scale-invariant because they split on value thresholds — the absolute magnitude of a feature doesn't affect which split is chosen. Scaling is essential for distance-based algorithms (k-NN, SVM), linear models (logistic/linear regression), and neural networks, where large-valued features dominate gradient updates or distance calculations.

What's the difference between imputing missing values and just dropping rows with NaN?

Dropping rows is fast but wastes data and introduces bias if missingness isn't random — for example, if low-income people consistently skip the income field, dropping them removes a real pattern your model should learn. Imputation preserves all rows and, when combined with a 'was_missing' indicator column, actually lets the model learn from the fact that data was absent. Use deletion only when less than ~5% of a column is missing and you're confident the missingness is completely random.

Written and reviewed by the TheCodeForge Editorial Team.
