Data Preprocessing in ML: The Complete Practical Guide
Every ML tutorial starts with a clean, perfectly formatted dataset. Real life never does. In the real world, data comes from messy CSV exports, broken sensors, rushed data-entry clerks, and legacy databases that mix text and numbers in the same column. The gap between raw data and model-ready data is where most ML projects actually live — and die. Skipping preprocessing is the single biggest reason a model that looked great in a notebook performs terribly in production.
Preprocessing solves three fundamental problems: data your model can't read (wrong types, text categories), data your model misreads (wildly different scales that trick distance-based algorithms), and data that simply isn't there (missing values that silently corrupt your results). Each of these problems has a well-understood solution, but the order and method you choose matter enormously depending on your data and your model.
By the end of this article you'll be able to audit a raw dataset, choose the right strategy for missing values, encode categorical features correctly, scale numerical features without leaking information from your test set, and wire everything together in a reproducible scikit-learn Pipeline. You'll also know the three mistakes that trip up intermediate practitioners — not just beginners.
Handling Missing Values — Why 'Just Drop Them' Is Usually Wrong
Missing data isn't random noise you can ignore. It's a signal. A missing income field in a loan application might mean the applicant refused to share it — which is itself predictive. Blindly dropping rows throws away that signal and shrinks your training set.
There are three strategies: deletion, imputation, and indicator flags. Deletion (dropping rows or columns) only makes sense when less than 5% of a column is missing AND missingness is truly random. Imputation replaces missing values with something plausible — the mean or median for numerical data, the most frequent value for categorical data, or a model-predicted value for high-stakes features.
The best practice for production is to combine imputation with a binary indicator column: a new column that says 'this value was missing' lets the model learn from the missingness pattern itself. Scikit-learn's SimpleImputer handles the replacement; you add the flag column manually before imputing.
Crucially, you must fit your imputer on training data only, then transform both train and test. Fitting on the full dataset leaks future information into your model — a subtle bug that inflates validation scores.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# --- Simulate a messy real-world dataset ---
np.random.seed(42)
raw_data = pd.DataFrame({
    'age':    [25, np.nan, 47, 31, np.nan, 52, 29, 40],
    'salary': [48000, 72000, np.nan, 61000, 58000, np.nan, 43000, 95000],
    'bought': [0, 1, 1, 0, 1, 1, 0, 1]  # target label
})

print("Raw data:")
print(raw_data)
print(f"\nMissing counts:\n{raw_data.isnull().sum()}")

# --- Step 1: Add binary indicator columns BEFORE imputing ---
# These columns tell the model WHERE data was missing — that pattern has meaning.
for col in ['age', 'salary']:
    raw_data[f'{col}_was_missing'] = raw_data[col].isnull().astype(int)

# --- Step 2: Split BEFORE fitting the imputer ---
# Fitting on all data first would leak test-set statistics into training.
features = raw_data.drop(columns='bought')
labels = raw_data['bought']
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42
)

# --- Step 3: Fit imputer on TRAINING data only ---
# strategy='median' is more robust to outliers than 'mean'
numeric_cols = ['age', 'salary']
imputer = SimpleImputer(strategy='median')

# fit_transform on train, transform-only on test
X_train[numeric_cols] = imputer.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = imputer.transform(X_test[numeric_cols])  # NO fit here!

print("\nTraining set after imputation:")
print(X_train.round(1))
print("\nImputer medians learned from training data:", imputer.statistics_)
age salary bought
0 25.0 48000.0 0
1 NaN 72000.0 1
2 47.0 NaN 1
3 31.0 61000.0 0
4 NaN 58000.0 1
5 52.0 NaN 1
6 29.0 43000.0 0
7 40.0 95000.0 1
Missing counts:
age 2
salary 2
bought 0
dtype: int64
Training set after imputation:
age salary age_was_missing salary_was_missing
5 52.0 61000.0 0 1
1 36.0 72000.0 1 0
...
Imputer medians learned from training data: [36. 61000.]
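Incidentally, you don't have to add the flag columns by hand: SimpleImputer accepts an add_indicator parameter that appends one binary missingness column per affected feature in a single step. A minimal sketch (toy data, not the dataset above):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({'age':    [25.0, np.nan, 47.0, 31.0],
                  'salary': [48000.0, 72000.0, np.nan, 61000.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values, next to the imputed values themselves.
imputer = SimpleImputer(strategy='median', add_indicator=True)
result = imputer.fit_transform(X)

print(result.shape)  # (4, 4): 2 imputed columns + 2 indicator columns
print(result)
```

The trade-off versus manual flags is control over column names; the upside is that the indicator logic lives inside the fitted transformer, so it survives pipeline serialization automatically.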
Encoding Categorical Features — Choosing Between Label, Ordinal, and One-Hot
Machine learning models are fundamentally mathematical. They multiply, add, and compare numbers. When your data has a column called 'City' with values like 'London', 'Paris', 'Tokyo', the model can't do anything with strings — you have to convert them.
The wrong choice here actively hurts your model. Label encoding assigns integers arbitrarily: London=0, Paris=1, Tokyo=2. That implies Tokyo > Paris > London mathematically, which is nonsense. Any model using arithmetic on those integers — linear regression, neural nets, SVMs — will learn a false relationship.
One-Hot Encoding (OHE) is the correct fix for nominal categories (no natural order). It creates a new binary column per category: is_London, is_Paris, is_Tokyo. No false ordering. The trade-off is that high-cardinality columns (e.g. 500 cities) explode your feature space — in that case, target encoding or embedding layers are better alternatives.
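For the high-cardinality case, here is a minimal sketch of mean target encoding in plain pandas (column names are illustrative; in practice you would compute the means on training folds only, or use a library such as category_encoders, to avoid target leakage):

```python
import pandas as pd

df = pd.DataFrame({
    'city':    ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'London'],
    'clicked': [1, 0, 1, 1, 0, 0]  # binary target
})

# Mean target encoding: replace each category with the mean of the
# target over that category — one numeric column regardless of cardinality.
city_means = df.groupby('city')['clicked'].mean()
df['city_encoded'] = df['city'].map(city_means)
print(df[['city', 'city_encoded']])
```

Here London becomes 2/3 (two clicks out of three rows), Paris 0.0, Tokyo 1.0 — 500 cities would still produce a single column instead of 500 one-hot columns.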
Ordinal encoding IS appropriate when the order genuinely matters: ['cold', 'warm', 'hot'] → [0, 1, 2] is correct because hot > warm > cold is real. Use OrdinalEncoder for these, not LabelEncoder (which is meant for target labels only).
Always handle unseen categories in your test set. A category that appears in production but wasn't in training will crash a naive encoder.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# --- Dataset with two types of categorical columns ---
clothing_data = pd.DataFrame({
    'city': ['London', 'Paris', 'Tokyo', 'London', 'Berlin', 'Paris'],
    'size': ['small', 'large', 'medium', 'large', 'small', 'medium'],
    'price_usd': [120, 85, 200, 115, 95, 90]
})

print("Original data:")
print(clothing_data)

# 'city' is NOMINAL — no natural order, use One-Hot Encoding
# 'size' is ORDINAL — small < medium < large, order is meaningful
nominal_features = ['city']
ordinal_features = ['size']
numeric_features = ['price_usd']

# Define the correct ordering for the ordinal column explicitly
size_order = [['small', 'medium', 'large']]

preprocessor = ColumnTransformer(transformers=[
    ('one_hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False), nominal_features),
    ('ordinal', OrdinalEncoder(categories=size_order), ordinal_features),
    ('passthrough', 'passthrough', numeric_features)
])

# fit_transform on the full set here just for demonstration
encoded_array = preprocessor.fit_transform(clothing_data)

# Recover column names for readability
ohe_feature_names = preprocessor.named_transformers_['one_hot'].get_feature_names_out(nominal_features)
all_column_names = list(ohe_feature_names) + ordinal_features + numeric_features
encoded_df = pd.DataFrame(encoded_array, columns=all_column_names)

print("\nEncoded data:")
print(encoded_df)
print("\n'size' ordinal mapping — small=0, medium=1, large=2 (correct order preserved)")
city size price_usd
0 London small 120
1 Paris large 85
2 Tokyo medium 200
3 London large 115
4 Berlin small 95
5 Paris medium 90
Encoded data:
city_Berlin city_London city_Paris city_Tokyo size price_usd
0 0.0 1.0 0.0 0.0 0.0 120.0
1 0.0 0.0 1.0 0.0 2.0 85.0
2 0.0 0.0 0.0 1.0 1.0 200.0
3 0.0 1.0 0.0 0.0 2.0 115.0
4 1.0 0.0 0.0 0.0 0.0 95.0
5 0.0 0.0 1.0 0.0 1.0 90.0
'size' ordinal mapping — small=0, medium=1, large=2 (correct order preserved)
Feature Scaling — Why Your Algorithm's Math Demands It
Picture two features: age (18–65) and annual salary (30,000–150,000). The salary values are roughly two thousand times larger. Any algorithm that computes distances or uses gradient descent treats salary as thousands of times more important — purely because of measurement units, not because it actually matters more.
This kills k-Nearest Neighbours (distances dominated by salary), SVMs, and gradient descent convergence in neural nets. Tree-based models like Random Forest and XGBoost are the exception — they split on thresholds and don't care about absolute scale.
Two scalers solve this in different ways. StandardScaler subtracts the mean and divides by standard deviation, producing a distribution centred at 0 with unit variance. Use it when your data is roughly Gaussian or when the algorithm assumes it (linear/logistic regression, PCA, SVMs).
MinMaxScaler compresses values into a fixed range, typically [0, 1]. Use it when you need bounded outputs — for example, feeding pixel values into a neural network, or when the algorithm explicitly requires [0,1] input. Its weakness: a single extreme outlier squashes all other values into a tiny range.
RobustScaler uses the median and interquartile range instead of mean and standard deviation. It's your best friend when data has significant outliers — a faulty sensor reading of 999999 won't ruin your entire scaling.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# --- Sensor readings dataset with one outlier row (row index 5) ---
sensor_data = pd.DataFrame({
    'temperature_c': [22.1, 23.5, 21.8, 24.0, 22.9, 999.9, 23.1],  # 999.9 is a broken sensor
    'humidity_pct':  [45, 50, 43, 55, 48, 47, 51]
})

print("Original sensor readings:")
print(sensor_data)
print()

# Fit each scaler independently on the same data for comparison
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()
robust_scaler = RobustScaler()  # uses median + IQR, resists outliers

std_scaled = standard_scaler.fit_transform(sensor_data)
minmax_scaled = minmax_scaler.fit_transform(sensor_data)
robust_scaled = robust_scaler.fit_transform(sensor_data)

# Package results for easy comparison
result = pd.DataFrame({
    'temp_original': sensor_data['temperature_c'],
    'temp_standard': std_scaled[:, 0].round(3),
    'temp_minmax':   minmax_scaled[:, 0].round(3),
    'temp_robust':   robust_scaled[:, 0].round(3),
})

print("Scaling comparison (temperature column only):")
print(result)
print()
print("Notice how MinMaxScaler crushes all normal readings to near-zero")
print("because the outlier 999.9 dominates the range.")
print("RobustScaler keeps normal readings well spread — outlier is still large but harmless.")
temperature_c humidity_pct
0 22.1 45
1 23.5 50
2 21.8 43
3 24.0 55
4 22.9 48
5 999.9 47
6 23.1 51
Scaling comparison (temperature column only):
   temp_original  temp_standard  temp_minmax  temp_robust
0           22.1         -0.411        0.000       -0.800
1           23.5         -0.406        0.002        0.320
2           21.8         -0.411        0.000       -1.040
3           24.0         -0.405        0.002        0.720
4           22.9         -0.408        0.001       -0.160
5          999.9          2.449        1.000      781.440
6           23.1         -0.408        0.001        0.000
Notice how MinMaxScaler crushes all normal readings to near-zero
because the outlier 999.9 dominates the range.
RobustScaler keeps normal readings well spread — outlier is still large but harmless.
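A practical footnote on the interpretability cost of scaling: when you need results back in original units (say, reporting a learned threshold in °C to stakeholders), every scikit-learn scaler provides inverse_transform. A quick self-contained sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

readings = np.array([[22.1], [23.5], [21.8], [24.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(readings)

# inverse_transform undoes the (x - mean) / std mapping exactly,
# recovering the original measurement units.
recovered = scaler.inverse_transform(scaled)
print(np.allclose(recovered, readings))  # True
```

The same method exists on MinMaxScaler and RobustScaler, so a scaled prediction or decision boundary can always be translated back for reporting.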
Wiring It All Together With a scikit-learn Pipeline
You've now got individual tools for missing values, encoding, and scaling. The temptation is to apply them manually one by one in a sequence of function calls. Don't. Manual preprocessing has two fatal flaws: you'll inevitably leak training statistics into your test set (because it's easy to forget to split first), and you can't reliably reproduce or deploy the same sequence.
Scikit-learn's Pipeline chains transformers and a final estimator into a single object. When you call pipeline.fit(X_train, y_train), every transformer is fit on training data only and then applied in sequence. When you call pipeline.predict(X_test), transformers are applied using the already-fitted parameters — no leakage possible.
ColumnTransformer lets you apply different preprocessing to different columns inside the same Pipeline step. Numeric columns get imputed then scaled; categorical columns get imputed then one-hot encoded. Everything stays in sync.
This pattern also makes deployment trivial. You save one pipeline object with joblib. You load it in production. You call predict on raw, unprocessed input. The pipeline handles everything. No separate preprocessing script to maintain.
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib

# --- Realistic loan application dataset ---
np.random.seed(0)
loan_data = pd.DataFrame({
    'age': [34, np.nan, 28, 45, 52, 31, np.nan, 40, 37, 29],
    'income': [52000, 81000, np.nan, 120000, 95000, 43000, 67000, np.nan, 58000, 74000],
    'employment': ['employed', 'self-employed', 'employed', np.nan, 'employed',
                   'unemployed', 'employed', 'self-employed', np.nan, 'employed'],
    'city': ['London', 'Manchester', 'London', 'Birmingham', 'London',
             'Manchester', 'Birmingham', 'London', 'Manchester', 'London'],
    'loan_approved': [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]  # target
})

features = loan_data.drop(columns='loan_approved')
labels = loan_data['loan_approved']

# Identify which columns need which treatment
numeric_cols = ['age', 'income']
categorical_cols = ['employment', 'city']

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42
)

# --- Build the numeric sub-pipeline ---
# Impute first (median is robust to outliers), then scale
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# --- Build the categorical sub-pipeline ---
# Impute with most frequent value, then one-hot encode
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# --- Combine into one ColumnTransformer ---
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_pipeline, numeric_cols),
    ('categorical', categorical_pipeline, categorical_cols)
])

# --- Full pipeline: preprocessing + model ---
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=500))
])

# One call to fit — everything happens in the correct order, no leakage
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Test accuracy: {accuracy:.2f}")
print(f"Predictions: {predictions}")
print(f"Actual labels: {y_test.values}")

# --- Save the entire pipeline for deployment ---
# In production: load this file and call full_pipeline.predict(raw_input_df)
joblib.dump(full_pipeline, 'loan_approval_pipeline.pkl')
print("\nPipeline saved to loan_approval_pipeline.pkl")
print("In production, load with: pipeline = joblib.load('loan_approval_pipeline.pkl')")
Test accuracy: 1.00
Predictions: [1 0 1]
Actual labels: [1 0 1]
Pipeline saved to loan_approval_pipeline.pkl
In production, load with: pipeline = joblib.load('loan_approval_pipeline.pkl')
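The save/load round trip is worth seeing end to end. Here is a minimal, self-contained sketch with a toy single-feature pipeline (the file name is illustrative): the loaded object accepts raw, unscaled input because the fitted scaler travels inside it.

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data: one raw (unscaled) feature, binary label
X = np.array([[20.0], [30.0], [40.0], [50.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X, y)

# Persist the WHOLE pipeline — fitted scaler included — then reload it
joblib.dump(pipe, 'tiny_pipeline.pkl')
loaded = joblib.load('tiny_pipeline.pkl')

# Raw input goes straight in; the pipeline scales it internally
print(loaded.predict(np.array([[45.0], [25.0]])))  # → [1 0]
```

Because the scaler's learned mean and standard deviation are serialized with the model, there is no separate "remember to scale the input" step to get wrong in production.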
| Aspect | StandardScaler | MinMaxScaler | RobustScaler |
|---|---|---|---|
| Formula | (x - mean) / std | (x - min) / (max - min) | (x - median) / IQR |
| Output range | Unbounded (~-3 to 3) | Exactly [0, 1] | Unbounded, centred on median |
| Outlier sensitivity | High — outliers shift mean and std | Very high — single outlier dominates range | Low — uses median and IQR, ignores tails |
| Best for | Gaussian data, PCA, linear models, SVMs | Neural nets needing bounded input, image pixels | Data with known outliers (sensors, finance) |
| Loses interpretability? | Yes — values no longer in original units | Partially — proportional but shifted | Yes — relative to median not mean |
🎯 Key Takeaways
- Split your data BEFORE fitting any preprocessor — fitting on the full dataset leaks test-set statistics into training, silently inflating your validation scores.
- Use OneHotEncoder for nominal categories (no natural order), OrdinalEncoder with explicit ordering for ordinal categories, and never LabelEncoder on input features.
- RobustScaler is your go-to when data contains outliers — it uses median and IQR instead of mean and std, so a single bad sensor reading won't crush your entire feature range.
- A scikit-learn Pipeline with ColumnTransformer isn't just convenience — it's the only production-safe way to guarantee that preprocessing is applied identically at train time and predict time without manual error.
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Fitting preprocessors on the full dataset before train/test split — Your imputer or scaler learns statistics from test data (e.g. the test set's median salary), which then influences training. Validation accuracy looks great but production performance is lower than expected. Fix: always call train_test_split first, then fit any preprocessor exclusively on X_train.
- ✕ Mistake 2: Using LabelEncoder on input features instead of OrdinalEncoder — LabelEncoder encodes alphabetically (Berlin=0, London=1, Paris=2, Tokyo=3), implying Tokyo is mathematically 'greater than' London. Linear models and SVMs learn this false relationship and produce nonsense coefficients. Fix: use OneHotEncoder for nominal categories and OrdinalEncoder (with explicit category ordering) for ordinal ones. Reserve LabelEncoder for the target label only.
- ✕ Mistake 3: Not handling unseen categories in production — A city or product type that never appeared during training causes OneHotEncoder to raise a ValueError at prediction time, crashing your API. Fix: always set handle_unknown='ignore' in OneHotEncoder when building production pipelines. The encoder will output an all-zero row for unknown categories instead of throwing an exception.
Interview Questions on This Topic
- Q: Why must you fit your preprocessing transformers only on training data? What specifically goes wrong if you fit on the full dataset?
- Q: When would you choose RobustScaler over StandardScaler, and can you give a concrete industry example where the difference matters?
- Q: Why do tree-based models like Random Forest not require feature scaling, while logistic regression and SVMs do? What property of these algorithms explains this?
Frequently Asked Questions
What is the correct order for data preprocessing steps in machine learning?
Split into train and test first, then in this order: handle missing values (imputation), encode categorical features, scale numerical features, and finally feed into your model. Splitting first is non-negotiable — every other step must be fit on training data only. Wrapping all steps in a scikit-learn Pipeline enforces this order automatically.
Do I always need to scale features for every machine learning model?
No. Tree-based models like Decision Trees, Random Forests, and XGBoost are scale-invariant because they split on value thresholds — the absolute magnitude of a feature doesn't affect which split is chosen. Scaling is essential for distance-based algorithms (k-NN, SVM), linear models (logistic/linear regression), and neural networks, where large-valued features dominate gradient updates or distance calculations.
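You can check this scale-invariance empirically: fit the same decision tree on raw and standardized versions of a dataset with wildly different feature scales and compare the predictions. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1, 1000, 0.01]  # wildly different scales
y = (X[:, 0] + X[:, 1] / 1000 > 0).astype(int)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    StandardScaler().fit_transform(X), y)

# Scaling is a monotonic per-feature transform, so the same splits are
# chosen (with rescaled thresholds) and the predictions agree.
same = (tree_raw.predict(X) ==
        tree_scaled.predict(StandardScaler().fit_transform(X))).all()
print(same)
```

Running the equivalent comparison with k-NN or logistic regression would generally show the opposite: predictions change once the salary-sized feature stops dominating.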
What's the difference between imputing missing values and just dropping rows with NaN?
Dropping rows is fast but wastes data and introduces bias if missingness isn't random — for example, if low-income people consistently skip the income field, dropping them removes a real pattern your model should learn. Imputation preserves all rows and, when combined with a 'was_missing' indicator column, actually lets the model learn from the fact that data was absent. Use deletion only when less than ~5% of a column is missing and you're confident the missingness is completely random.
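As a quick audit before choosing between dropping and imputing, the fraction of missing values per column is one line of pandas (the data below is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'income': [52000, np.nan, 61000, np.nan, 58000, 43000, 67000, 74000],
    'age':    [34, 29, 28, 45, 52, 31, 44, 40]
})

# isnull() gives a boolean mask; mean() of booleans = fraction missing
missing_frac = df.isnull().mean()
print(missing_frac)
# income is 25% missing — well above the ~5% rule of thumb, so impute
```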
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.