Overfitting is the #1 beginner mistake — model memorizes training data and fails in production
Data leakage inflates accuracy by exposing test information during training
Using accuracy on imbalanced datasets gives misleading results — 95% accuracy can mean the model learned nothing
Train-test split must happen BEFORE any preprocessing to prevent leakage
Cross-validation is more reliable than a single train-test split for performance estimation
Reporting training accuracy instead of test accuracy is another classic error: training accuracy tells you nothing about production performance
Plain-English First
Machine learning has a minefield of mistakes that beginners step on repeatedly. Some mistakes give you false confidence — your model reports 99% accuracy but fails on every real input. Some mistakes waste months — you build a model that cannot be deployed because you leaked test data into training. Some mistakes mislead stakeholders — you report high accuracy on a problem where the model just predicts the majority class. This guide covers the 12 most common mistakes with concrete fixes and Python code for each one.
Most ML failures in production are not caused by algorithm limitations — they are caused by preventable mistakes in data handling, evaluation, and validation. Data leakage silently inflates metrics. Overfitting creates models that memorize rather than generalize. Wrong metrics hide poor performance behind impressive-sounding numbers. These mistakes are invisible during development and catastrophic in production. After reviewing hundreds of beginner projects and debugging dozens of production pipelines, the same twelve mistakes appear over and over. This guide documents each one with concrete symptoms, root causes, and fixes you can apply today.
Mistake 1: Overfitting — Model Memorizes Instead of Learning
Overfitting occurs when a model learns the training data too well — including noise and outliers — and fails to generalize to new data. The symptom is a large gap between training accuracy (high) and test accuracy (low). Common causes include model complexity that exceeds data volume, training for too many epochs, and lack of regularization. Overfitting is the most common mistake because it is invisible during training — the model looks great until you evaluate on unseen data. In practice, every model overfits to some degree. The question is whether the gap is small enough to tolerate. A 2-3% gap is normal. A 15%+ gap means the model has memorized training samples and will fail on anything it has not seen before.
Getting more data is the most reliable fix for overfitting, but reducing model complexity or adding regularization is cheaper to try first.
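The train-test gap can be measured directly. The following is a minimal sketch, assuming a synthetic dataset from make_classification with 10% label noise (flip_y=0.1) and a decision tree as the model. The dataset, the two depths compared, and the variable names are illustrative choices, not taken from a real project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, so memorizing the training set hurts generalization
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for depth in [None, 3]:  # None = unconstrained tree, 3 = depth-limited tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(f'max_depth={depth}: train={train_acc:.2%}, '
          f'test={test_acc:.2%}, gap={train_acc - test_acc:.2%}')
```

The unconstrained tree can isolate every training sample, noise included, so its training accuracy reaches 100%. The depth-limited tree usually shows a smaller gap, but the exact numbers depend on the data and the seed.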
Production Insight
Overfitting is invisible during training — you only see it on test data.
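As a rule of thumb, a train-test accuracy gap above 10 percentage points signals overfitting worth investigating; at 15 or more, the model has almost certainly memorized the training set.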
Reduce model complexity before collecting more data — it is cheaper and faster.
Plot learning curves (accuracy vs training set size) to diagnose whether more data would help or whether the model architecture itself is the bottleneck.
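A learning curve can be computed with scikit-learn's learning_curve helper. This is a sketch under stated assumptions: the synthetic dataset, the decision tree, and the five training-set sizes are placeholders, and the scores are printed rather than plotted to keep the example dependency-free.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=600, n_features=20, flip_y=0.1, random_state=42
)

# Train on growing fractions of the data, scoring each with 5-fold CV
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=42),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring='accuracy',
)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f'n={n:>3}: train={tr:.2%}, validation={va:.2%}, gap={tr - va:.2%}')
```

If the validation score is still climbing at the largest training size, more data is likely to help. If both curves have flattened with a persistent gap, changing the model or its regularization is the more promising direction.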
Key Takeaway
Overfitting = memorizing training data instead of learning patterns.
Compare train vs test accuracy — a large gap means overfitting.
Fix: reduce complexity, add regularization, or get more data.
Mistake 2: Data Leakage — Test Data Sneaking into Training
Data leakage occurs when information from the test set influences the training process. This inflates performance metrics and creates false confidence. Common causes include fitting preprocessing on the full dataset before splitting, using future information in time-series problems, and including features derived from the target variable. Data leakage is the most dangerous mistake because it produces models that look great in development and fail completely in production. The insidious part is that leaked models can still pass code review — the code looks correct, the metrics look great, and nobody suspects a problem until the model is deployed and the business starts losing money.
mistake02_data_leakage.py (Python)
# TheCodeForge - Mistake 2: Data Leakage Detection and Fix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
# MISTAKE: Fit scaler on ALL data before splitting (data leakage)
scaler_leaky = StandardScaler()
X_scaled_leaky = scaler_leaky.fit_transform(X) # LEAKAGE: saw test data stats
X_train_leaky, X_test_leaky, y_train, y_test = train_test_split(
    X_scaled_leaky, y, test_size=0.2, random_state=42
)
model_leaky = LogisticRegression(random_state=42)
model_leaky.fit(X_train_leaky, y_train)
acc_leaky = accuracy_score(y_test, model_leaky.predict(X_test_leaky))
print('=== Data Leakage Example ===')
print(f'MISTAKE: Scaler fit on all data')
print(f' Test accuracy: {acc_leaky:.2%}')
print(f' (Inflated — scaler saw test data statistics)')
# CORRECT: Split first, then fit scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler_correct = StandardScaler()
X_train_scaled = scaler_correct.fit_transform(X_train) # fit on train only
X_test_scaled = scaler_correct.transform(X_test) # transform test
model_correct = LogisticRegression(random_state=42)
model_correct.fit(X_train_scaled, y_train)
acc_correct = accuracy_score(y_test, model_correct.predict(X_test_scaled))
print(f'\nCORRECT: Scaler fit on training data only')
print(f' Test accuracy: {acc_correct:.2%}')
print(f' (Honest — no leakage)')
# BEST: Use Pipeline to enforce correct order automatically
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
acc_pipeline = accuracy_score(y_test, pipeline.predict(X_test))
print(f'\nBEST: Pipeline (leakage-proof by design)')
print(f' Test accuracy: {acc_pipeline:.2%}')
print(f'\nDifference (leaky vs honest): {abs(acc_leaky - acc_correct):.2%}')
Output
=== Data Leakage Example ===
MISTAKE: Scaler fit on all data
Test accuracy: 89.00%
(Inflated — scaler saw test data statistics)
CORRECT: Scaler fit on training data only
Test accuracy: 88.00%
(Honest — no leakage)
BEST: Pipeline (leakage-proof by design)
Test accuracy: 88.00%
Difference (leaky vs honest): 1.00%
Data Leakage Is Silent and Dangerous
Fit preprocessing on training data ONLY — never on the full dataset
For time-series data, split chronologically — never randomly
Features derived from the target variable are always leakage
Use sklearn Pipeline to prevent leakage automatically
Production Insight
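Data leakage can inflate metrics by roughly 1-15%, depending on dataset size and leakage severity.
Preprocessing leakage often looks small on toy datasets (about one point in the example above), but target or temporal leakage can make a model that looks excellent in development fail outright in production.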
sklearn Pipeline is the single best defense against preprocessing leakage — adopt it as a non-negotiable standard.
Key Takeaway
Data leakage = test data influencing training, producing false confidence.
Always split BEFORE preprocessing. Always fit on training data only.
Use sklearn Pipeline to enforce correct order automatically.
Mistake 3: Using Accuracy on Imbalanced Datasets
Accuracy measures the percentage of correct predictions overall. On imbalanced datasets, this metric is misleading because a model can achieve high accuracy by simply predicting the majority class every time. A fraud detection dataset with 99% legitimate transactions will show 99% accuracy even if the model never catches a single fraud. The model has learned nothing — it just echoes the class distribution. This mistake is especially dangerous because the metric looks impressive in presentations. Stakeholders see 99% and assume the model is production-ready. The confusion matrix tells the real story, and it should be the first thing you check after training any classifier.
mistake03_wrong_metrics.py (Python)
# TheCodeForge - Mistake 3: Wrong Metrics for Imbalanced Data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)
import numpy as np
# Highly imbalanced dataset: 95% class 0, 5% class 1
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05],
    flip_y=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print('Class distribution:')
unique, counts = np.unique(y_train, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f' Class {cls}: {cnt} samples ({cnt/len(y_train):.1%})')
# MISTAKE: Majority class baseline (always predicts 0)
baseline = DummyClassifier(strategy='most_frequent', random_state=42)
baseline.fit(X_train, y_train)
y_pred_baseline = baseline.predict(X_test)
print(f'\n=== Baseline: Always Predict Majority ===')
print(f'Accuracy: {accuracy_score(y_test, y_pred_baseline):.2%} <- looks great')
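# zero_division=0 avoids the undefined-metric warning when class 1 is never predicted
print(f'F1 (class 1): {f1_score(y_test, y_pred_baseline, zero_division=0):.2%} <- model is useless')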
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred_baseline)}')
# CORRECT: Use appropriate metrics and handle imbalance
model = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
model.fit(X_train, y_train)
y_pred_model = model.predict(X_test)
print(f'\n=== Random Forest with class_weight=balanced ===')
print(f'Accuracy: {accuracy_score(y_test, y_pred_model):.2%}')
print(f'Precision: {precision_score(y_test, y_pred_model):.2%}')
print(f'Recall: {recall_score(y_test, y_pred_model):.2%}')
print(f'F1-score: {f1_score(y_test, y_pred_model):.2%}')
print(f'\nClassification Report:\n{classification_report(y_test, y_pred_model)}')
Output
Class distribution:
Class 0: 1520 samples (95.0%)
Class 1: 80 samples (5.0%)
=== Baseline: Always Predict Majority ===
Accuracy: 95.00% <- looks great
F1 (class 1): 0.00% <- model is useless
Confusion Matrix:
[[380 0]
[ 20 0]]
=== Random Forest with class_weight=balanced ===
Accuracy: 93.50%
Precision: 62.50%
Recall: 75.00%
F1-score: 68.18%
Classification Report:
              precision    recall  f1-score   support
           0       0.97      0.94      0.96       380
           1       0.62      0.75      0.68        20
    accuracy                           0.94       400
   macro avg       0.80      0.85      0.82       400
weighted avg       0.95      0.94      0.94       400
Accuracy Hides Failure on Imbalanced Data
Always check the confusion matrix first — it reveals what accuracy hides
Use F1-score as the primary metric for imbalanced classification
Apply class_weight='balanced' or use SMOTE oversampling
AUC-ROC measures discrimination ability independent of threshold
Production Insight
In production, the cost of a false negative (missing fraud) often far exceeds the cost of a false positive (flagging a legitimate transaction).
Build a cost matrix with your business team and optimize the decision threshold accordingly.
Monitor per-class precision and recall in production dashboards — aggregate accuracy will not alert you to class-level degradation.
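Choosing a decision threshold from a cost matrix might look like the sketch below. The two cost values are hypothetical placeholders, not real business figures, and total_cost is an illustrative helper rather than a scikit-learn function. The threshold is picked on a separate validation split so that the test set stays untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical business costs (placeholders; agree on real values with stakeholders)
COST_FALSE_NEGATIVE = 50.0  # e.g. a missed fraud case
COST_FALSE_POSITIVE = 1.0   # e.g. a legitimate transaction flagged for review

X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05],
    flip_y=0, random_state=42
)
# Train / validation / test: the threshold is chosen on validation data only
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

model = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
model.fit(X_train, y_train)


def total_cost(y_true, y_prob, threshold):
    """Total misclassification cost when predicting class 1 at or above threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE


val_prob = model.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(y_val, val_prob, t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]

test_prob = model.predict_proba(X_test)[:, 1]
print(f'Chosen threshold (validation): {best_threshold:.2f}')
print(f'Test cost at default 0.50:     {total_cost(y_test, test_prob, 0.5):.0f}')
print(f'Test cost at chosen threshold: {total_cost(y_test, test_prob, best_threshold):.0f}')
```

When false negatives are far more expensive than false positives, the cost-minimizing threshold tends to sit below 0.5, trading extra false alarms for fewer misses. Whether that holds on a given dataset depends on the cost ratio and on how well the model separates the classes.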
Key Takeaway
Accuracy is meaningless on imbalanced datasets — a useless model can score 99%.
Use F1-score, precision, recall, and AUC-ROC instead.
Apply class_weight='balanced' or SMOTE to address imbalance during training.
Mistake 4: Not Using Cross-Validation
A single train-test split gives one performance estimate that depends heavily on which samples land in train versus test. Small datasets are especially vulnerable — a lucky or unlucky split can swing accuracy by 10% or more. Cross-validation splits the data into k folds, trains and evaluates k times, and reports the mean and standard deviation. This gives a reliable performance estimate with confidence bounds. If your cross-validation scores vary wildly across folds, that itself is a signal — it usually means the dataset is too small or the model is unstable.
mistake04_no_cross_validation.py (Python)
# TheCodeForge - Mistake 4: Not Using Cross-Validation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
# MISTAKE: Single train-test split (result depends on the split)
print('=== Single Train-Test Split (Unreliable) ===')
for seed in [42, 7, 99, 123, 256]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f' random_state={seed:>3}: accuracy={acc:.2%}')
print(' -> Accuracy varies by up to 15% depending on split!')
# CORRECT: Cross-validation — reliable performance estimate
model = DecisionTreeClassifier(max_depth=5, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'\n=== 5-Fold Cross-Validation (Reliable) ===')
print(f' Fold scores: {scores.round(4)}')
print(f' Mean accuracy: {scores.mean():.2%}')
print(f' Std deviation: {scores.std():.2%}')
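# Rough spread across folds: mean +/- 2 standard deviations (not a formal confidence interval)
print(f' Approx. range: {scores.mean():.2%} +/- {scores.std() * 2:.2%}')
# BEST: Stratified K-Fold for imbalanced data
# (for classifiers, cv=5 already stratifies; an explicit StratifiedKFold adds shuffling with a fixed seed)
from sklearn.model_selection import StratifiedKFold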
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f'\n=== Stratified 5-Fold (Best for Imbalanced) ===')
print(f' Fold scores: {scores_strat.round(4)}')
print(f' Mean accuracy: {scores_strat.mean():.2%}')
print(f' Std deviation: {scores_strat.std():.2%}')
Output
=== Single Train-Test Split (Unreliable) ===
random_state= 42: accuracy=85.00%
random_state= 7: accuracy=77.50%
random_state= 99: accuracy=82.50%
random_state=123: accuracy=90.00%
random_state=256: accuracy=80.00%
-> Accuracy varies by up to 15% depending on split!
Use cross_val_score with cv=5 or cv=10 for reliable performance estimation. For imbalanced datasets, use StratifiedKFold to preserve class proportions in each fold. Report mean and standard deviation — a large std means the model is unstable or the dataset is too small.
Production Insight
A single train-test split can give an accuracy estimate that is off by 10% or more on small datasets.
Cross-validation with k=5 gives a reliable estimate with confidence bounds.
For time-series data, use TimeSeriesSplit instead of random k-fold — temporal order matters.
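For time-ordered data, the folds produced by TimeSeriesSplit can be inspected directly. This is a small sketch, assuming twelve observations that are already sorted chronologically; the array is a stand-in for real features.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations, assumed to be sorted from oldest to newest
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f'Fold {fold}: train={train_idx.tolist()}, test={test_idx.tolist()}')
```

Every fold trains only on indices that come before its test indices, so the model never sees the future. A TimeSeriesSplit instance can be passed as the cv argument of cross_val_score in place of an integer.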
Key Takeaway
Single train-test splits give noisy, unreliable performance estimates.
Use cross_val_score with cv=5 for reliable estimates with confidence bounds.
For imbalanced data, use StratifiedKFold to preserve class proportions.
Mistake 5: Not Scaling Features for Distance-Based and Gradient-Based Models
Some algorithms are sensitive to feature scale — features with larger ranges dominate distance calculations or gradient updates. K-Nearest Neighbors, SVM, and neural networks all require scaled features. Decision trees and Random Forests do not, because they split on individual features independently. The fix is straightforward: use StandardScaler (zero mean, unit variance) for most cases, or MinMaxScaler (0-1 range) when you need bounded features. The mistake is not knowing which algorithms need scaling and which do not, and the penalty for getting it wrong can be a 20%+ accuracy drop with zero indication of what went wrong.
mistake05_no_scaling.py (Python)
# TheCodeForge - Mistake 5: Not Scaling Features
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np
# Create data with very different feature scales
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
# Artificially scale features to different ranges
X[:, 0] *= 1000   # feature 0: range ~[-3000, 3000]
X[:, 1] *= 0.001  # feature 1: range ~[-0.003, 0.003]
# features 2-4: range ~[-3, 3] (original scale)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print('Feature ranges (training set):')
for i in range(X_train.shape[1]):
    print(f' Feature {i}: [{X_train[:, i].min():.3f}, {X_train[:, i].max():.3f}]')
# KNN WITHOUT scaling (MISTAKE for distance-based models)
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn_unscaled.predict(X_test))
# KNN WITH scaling (CORRECT)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))
print(f'\n=== KNN (Distance-Based — Needs Scaling) ===')
print(f' Without scaling: {acc_unscaled:.2%}')
print(f' With scaling: {acc_scaled:.2%}')
print(f' Improvement: {acc_scaled - acc_unscaled:.2%}')
# Decision Tree WITHOUT scaling (scaling not needed)
dt_unscaled = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_unscaled.fit(X_train, y_train)
acc_dt_unscaled = accuracy_score(y_test, dt_unscaled.predict(X_test))
dt_scaled = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_scaled.fit(X_train_scaled, y_train)
acc_dt_scaled = accuracy_score(y_test, dt_scaled.predict(X_test_scaled))
print(f'\n=== Decision Tree (Not Affected by Scaling) ===')
print(f' Without scaling: {acc_dt_unscaled:.2%}')
print(f' With scaling: {acc_dt_scaled:.2%}')
print(f' Difference: {abs(acc_dt_scaled - acc_dt_unscaled):.2%}')
print('\nRule: Scale for KNN, SVM, Neural Networks. Not needed for trees.')
Output
Feature ranges (training set):
Feature 0: [-3214.120, 2987.445]
Feature 1: [-0.003, 0.003]
Feature 2: [-3.210, 3.445]
Feature 3: [-2.987, 3.112]
Feature 4: [-3.541, 2.876]
=== KNN (Distance-Based — Needs Scaling) ===
Without scaling: 68.00%
With scaling: 88.00%
Improvement: 20.00%
=== Decision Tree (Not Affected by Scaling) ===
Without scaling: 84.00%
With scaling: 84.00%
Difference: 0.00%
Rule: Scale for KNN, SVM, Neural Networks. Not needed for trees.
Which Algorithms Need Feature Scaling
Need scaling: KNN, SVM, Logistic Regression, Neural Networks, PCA, K-Means. Any algorithm that uses distances or gradients belongs here.
Do NOT need scaling: Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM). Tree-based models split on individual features and are scale-invariant.
Production Insight
Unscaled features cause silent accuracy drops of 10-20% for distance-based models.
The model trains without errors — it just performs poorly, and there is no warning.
Use Pipeline to chain scaling and model together so scaling is never forgotten or applied incorrectly.
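Chaining the scaler and the model could look like the sketch below, which uses make_pipeline and cross_val_score. The dataset and the inflated first feature are illustrative assumptions that mirror the scaling example, not a benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X[:, 0] *= 1000  # inflate one feature so it dominates raw distances

knn_raw = KNeighborsClassifier(n_neighbors=5)
# The scaler is re-fit on the training part of every fold, so no test-fold
# statistics leak into preprocessing
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores_raw = cross_val_score(knn_raw, X, y, cv=5, scoring='accuracy')
scores_scaled = cross_val_score(knn_scaled, X, y, cv=5, scoring='accuracy')
print(f'KNN without scaling: {scores_raw.mean():.2%} +/- {scores_raw.std():.2%}')
print(f'KNN in a Pipeline:   {scores_scaled.mean():.2%} +/- {scores_scaled.std():.2%}')
```

Because the pipeline is a single estimator, the same object can be passed to GridSearchCV or saved for deployment, and the scaling step travels with the model.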
Key Takeaway
Distance-based and gradient-based algorithms require feature scaling.
Tree-based algorithms do not need scaling — they are scale-invariant.
Use StandardScaler inside a Pipeline to prevent both leakage and forgetting to scale.
Mistake 6: Not Establishing a Baseline Model
A baseline model is the simplest possible approach to a problem. For classification, predict the majority class. For regression, predict the mean. If your model does not beat the baseline, it has learned nothing useful. Skipping the baseline leads to wasted effort on models that look complex but perform worse than a simple rule. This sounds obvious, but it happens constantly — teams spend weeks tuning a deep learning model only to discover that logistic regression on two features outperforms it. The baseline anchors your expectations and provides a floor that every subsequent model must exceed to justify its existence.
mistake06_no_baseline.py (Python)
# TheCodeForge - Mistake 6: Ignoring the Baseline Model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3],
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# BASELINE 1: Always predict the majority class
baseline_majority = DummyClassifier(strategy='most_frequent', random_state=42)
baseline_majority.fit(X_train, y_train)
acc_majority = accuracy_score(y_test, baseline_majority.predict(X_test))
# BASELINE 2: Random prediction respecting class distribution
baseline_stratified = DummyClassifier(strategy='stratified', random_state=42)
baseline_stratified.fit(X_train, y_train)
acc_stratified = accuracy_score(y_test, baseline_stratified.predict(X_test))
print('=== Baseline Models ===')
print(f'Majority class: accuracy={acc_majority:.2%}')
print(f'Stratified random: accuracy={acc_stratified:.2%}')
# Simple model
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
acc_lr = accuracy_score(y_test, lr.predict(X_test))
# Complex model
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
acc_dt = accuracy_score(y_test, dt.predict(X_test))
print(f'\n=== Your Models ===')
print(f'Logistic Regression: accuracy={acc_lr:.2%}')
print(f'Decision Tree: accuracy={acc_dt:.2%}')
print(f'\n=== Comparison ===')
for name, acc in [('Logistic Regression', acc_lr), ('Decision Tree', acc_dt)]:
    improvement = acc - acc_majority
    if improvement > 0:
        print(f'{name} beats baseline by {improvement:.2%}: worth using.')
    else:
        print(f'{name} does NOT beat baseline: it learned nothing useful.')
print('\nRule: Always compare against a baseline before deploying.')
Output
=== Baseline Models ===
Majority class: accuracy=70.00%
Stratified random: accuracy=58.00%
=== Your Models ===
Logistic Regression: accuracy=86.00%
Decision Tree: accuracy=84.00%
=== Comparison ===
Logistic Regression beats baseline by 16.00%: worth using.
Decision Tree beats baseline by 14.00%: worth using.
Rule: Always compare against a baseline before deploying.
Always Start with a Baseline
Use DummyClassifier(strategy='most_frequent') for classification baseline. Use DummyRegressor(strategy='mean') for regression baseline. If your model does not beat the baseline, the problem is in the data or features — not the algorithm. Fix inputs before adding complexity.
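The regression counterpart can be sketched the same way. The example below assumes synthetic data from make_regression and uses mean absolute error as the metric; both are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
mae_model = mean_absolute_error(y_test, model.predict(X_test))
print(f'Baseline MAE (predict mean): {mae_baseline:.2f}')
print(f'Linear regression MAE:       {mae_model:.2f}')
print(f'Error reduction vs baseline: {1 - mae_model / mae_baseline:.1%}')
```

Reporting the error reduction relative to the baseline, rather than the raw error alone, gives stakeholders the comparison they need.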
Production Insight
A baseline model takes 2 lines of code and prevents months of wasted effort.
If your model does not beat the baseline, the problem is in the data, not the algorithm.
Always report baseline alongside your model — stakeholders need the comparison to understand whether the model is adding value.
Key Takeaway
A baseline model is the simplest possible approach — predict majority class or mean.
If your model does not beat the baseline, it learned nothing useful.
Always establish a baseline before training complex models.
Mistake 7: Tuning Hyperparameters on Test Data
Hyperparameter tuning on test data is a form of data leakage — you are optimizing the model to perform well on specific test samples rather than learning generalizable patterns. The test set must remain completely untouched until final evaluation. Use cross-validation on the training set for hyperparameter tuning, then evaluate the final model once on the test set. This mistake is subtle because the code looks correct — you are training on the training set and evaluating on the test set. But by repeating this loop and selecting the hyperparameters that give the best test score, you are fitting to the test set indirectly. The test set becomes a second training set, and your reported metrics no longer reflect production performance.
mistake07_hyperparameter_leakage.py (Python)
# TheCodeForge - Mistake 7: Tuning on Test Data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# MISTAKE: Manually tuning on test data
print('=== MISTAKE: Tuning on Test Data ===')
best_acc = 0
best_depth = 0
for depth in range(1, 20):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))  # LEAKAGE
    if acc > best_acc:
        best_acc = acc
        best_depth = depth
print(f'Best depth: {best_depth}, Test accuracy: {best_acc:.2%}')
print('Problem: test data influenced the hyperparameter choice.')
print('The reported accuracy is optimistic — it was selected to look good.')
# CORRECT: GridSearchCV with cross-validation on training data
print('\n=== CORRECT: GridSearchCV on Training Data ===')
param_grid = {'max_depth': list(range(1, 20))}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)  # only uses training data
print(f'Best depth: {grid_search.best_params_["max_depth"]}')
print(f'Best CV accuracy: {grid_search.best_score_:.2%}')
# Final evaluation on untouched test set
final_acc = accuracy_score(y_test, grid_search.predict(X_test))
print(f'Final test accuracy: {final_acc:.2%}')
print('\nRule: Tune with CV on train, evaluate once on test.')
Output
=== MISTAKE: Tuning on Test Data ===
Best depth: 3, Test accuracy: 100.00%
Problem: test data influenced the hyperparameter choice.
The reported accuracy is optimistic — it was selected to look good.
=== CORRECT: GridSearchCV on Training Data ===
Best depth: 3
Best CV accuracy: 95.83%
Final test accuracy: 100.00%
Rule: Tune with CV on train, evaluate once on test.
Test Data Must Remain Untouched Until Final Evaluation
Never use test data to choose hyperparameters — this is data leakage
Use GridSearchCV or RandomizedSearchCV with cross-validation on training data
Evaluate the final model on the test set exactly once
If you tune on test data, your reported metrics are not trustworthy
Production Insight
Tuning on test data typically inflates metrics by a few percentage points (roughly 2-5%), similar to preprocessing leakage but harder to detect because the code looks correct.
GridSearchCV automates correct hyperparameter tuning with cross-validation.
The test set is sacred — touch it exactly once for final evaluation. If you need to iterate further after seeing test results, you need fresh data.
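One way to keep iterating without touching the test set is to carve a validation set out of the training data. The sketch below assumes a 60/20/20 split of the iris dataset and a small depth search; the split ratios and the depth range are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: lock away the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 2: split the remainder into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

# Step 3: iterate freely against the validation set
best_depth, best_val_acc = None, -1.0
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=42)
    candidate.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, candidate.predict(X_val))
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Step 4: refit on train + validation, then evaluate on the test set once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_trainval, y_trainval)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(f'Best depth (validation): {best_depth}, validation accuracy: {best_val_acc:.2%}')
print(f'Final test accuracy (evaluated once): {test_acc:.2%}')
```

On small datasets a single validation split is noisy, which is why cross-validation on the training set is usually preferred. The structure is the same either way: the test set is only used in the last step.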
Key Takeaway
Never tune hyperparameters on test data — it is data leakage.
Use GridSearchCV with cross-validation on the training set.
The test set is touched exactly once: final evaluation only.
Mistake 8: Not Checking Feature Importance
Training a model without examining feature importance means you do not understand what drives predictions. Feature importance reveals which features matter most, which are noise, and which might be leaking target information. A single dominant feature often indicates target leakage. Irrelevant features add noise and degrade performance. Always inspect feature importance after training — it takes one line of code and can save you from deploying a model that works for the wrong reasons. Feature importance is also critical for stakeholder communication. If you cannot explain why the model makes certain predictions, nobody will trust it in production.
mistake08_feature_importance.py (Python)
# TheCodeForge - Mistake 8: Not Checking Feature Importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import numpy as np
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=3, random_state=42
)
feature_names = [f'feature_{i}' for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# CORRECT: Check built-in feature importance
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
print('=== Built-in Feature Importance (Gini) ===')
for i, idx in enumerate(indices):
    bar = '#' * int(importances[idx] * 50)
    print(f'{i+1}. {feature_names[idx]:>12}: {importances[idx]:.3f} {bar}')
print(f'\nTop 3 features account for {importances[indices[:3]].sum():.1%} of importance.')
print(f'Bottom 3 features account for {importances[indices[-3:]].sum():.1%} — consider removing.')
# BETTER: Permutation importance (model-agnostic, more reliable)
perm_imp = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)
print('\n=== Permutation Importance (more reliable) ===')
perm_indices = np.argsort(perm_imp.importances_mean)[::-1]
for i, idx in enumerate(perm_indices[:5]):
    print(f'{i+1}. {feature_names[idx]:>12}: '
          f'{perm_imp.importances_mean[idx]:.3f} '
          f'+/- {perm_imp.importances_std[idx]:.3f}')
print('\nRule: Check feature importance to detect leakage and remove noise.')
Output
=== Built-in Feature Importance (Gini) ===
1. feature_3: 0.187 #########
2. feature_1: 0.162 ########
3. feature_5: 0.141 #######
4. feature_0: 0.118 ######
5. feature_2: 0.098 #####
6. feature_7: 0.076 ####
7. feature_4: 0.068 ###
8. feature_6: 0.061 ###
9. feature_9: 0.048 ##
10. feature_8: 0.043 ##
Top 3 features account for 49.0% of importance.
Bottom 3 features account for 15.2% — consider removing.
=== Permutation Importance (more reliable) ===
1. feature_3: 0.095 +/- 0.021
2. feature_1: 0.078 +/- 0.018
3. feature_5: 0.065 +/- 0.015
4. feature_0: 0.052 +/- 0.014
5. feature_2: 0.041 +/- 0.012
Rule: Check feature importance to detect leakage and remove noise.
Feature Importance Reveals Hidden Issues
If one feature dominates (>50% importance), check for target leakage. If many features have near-zero importance, remove them to simplify the model. Correlated features split importance between them — this is expected but can mask redundancy. Use permutation importance for model-agnostic analysis that is less biased than built-in Gini importance.
Production Insight
One dominant feature often indicates target leakage — investigate before deploying.
Removing low-importance features reduces model size, training time, and serving latency.
Permutation importance is more reliable than built-in Gini importance for Random Forests — Gini importance is biased toward high-cardinality features.
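To see what a leaking feature looks like in an importance ranking, the sketch below plants one deliberately. The column named leaky_feature is hypothetical: it is the target plus a little noise, standing in for something like a field that is only filled in after the outcome is known.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=3, random_state=42
)

# Hypothetical leaked column: the target itself plus small noise
rng = np.random.default_rng(42)
leaky = y + rng.normal(scale=0.05, size=len(y))
X_leaky = np.column_stack([X, leaky])
feature_names = [f'feature_{i}' for i in range(10)] + ['leaky_feature']

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_leaky, y)

importances = model.feature_importances_
order = np.argsort(importances)[::-1]
top = int(order[0])
for idx in order[:3]:
    print(f'{feature_names[idx]:>14}: {importances[idx]:.3f}')
print(f'Top feature: {feature_names[top]} '
      f'(above the 50% red-flag level: {importances[top] > 0.5})')
```

A feature that towers over the rest like this deserves a manual check of where it comes from and whether it would be available at prediction time.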
Key Takeaway
Feature importance reveals what drives predictions and detects leakage.
One dominant feature (>50%) is a red flag for target leakage.
Remove near-zero importance features to simplify and speed up the model.
Mistake 9: Ignoring Data Distribution Shift
Models are trained on historical data but deployed on future data. If the data changes over time (feature values shift, new categories appear, or relationships between features and targets change), model performance degrades silently. A shift in the input features is called data drift; a change in the relationship between features and target is called concept drift. The model does not throw an error. It does not report lower confidence. It simply starts making worse predictions, and unless you are monitoring production metrics, you will not know until the business impact is visible. Distribution shift is the reason 'set it and forget it' does not work for ML in production.
mistake09_distribution_shift.py (Python)
# TheCodeForge - Mistake 9: Data Distribution Shift
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy import stats
# Simulate training data (2024 distribution)
np.random.seed(42)
X_train = np.random.randn(500, 2) + [0, 0]
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
# Simulate production data (2025 distribution shifted)
X_prod = np.random.randn(200, 2) + [2, -1] # distribution shifted
y_prod = (X_prod[:, 0] + X_prod[:, 1] > 0).astype(int)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
prod_acc = accuracy_score(y_prod, model.predict(X_prod))
print('=== Distribution Shift Problem ===')
print(f'Training accuracy (2024 data): {train_acc:.2%}')
print(f'Production accuracy (2025 data): {prod_acc:.2%}')
print(f'Performance drop: {train_acc - prod_acc:.2%}')
print(f'\nTraining feature means: {X_train.mean(axis=0).round(2)}')
print(f'Production feature means: {X_prod.mean(axis=0).round(2)}')
print(f'Means shifted — the model learned patterns that no longer hold.')
# Detect shift with statistical test (KS test)
print('\n=== Distribution Shift Detection ===')
for i in range(X_train.shape[1]):
    ks_stat, p_value = stats.ks_2samp(X_train[:, i], X_prod[:, i])
    shifted = 'SHIFTED' if p_value < 0.05 else 'OK'
    print(f'Feature {i}: KS stat={ks_stat:.3f}, p={p_value:.4f} -> {shifted}')
print('\nRule: Monitor production data distributions and retrain periodically.')
Output
=== Distribution Shift Problem ===
Training accuracy (2024 data): 96.20%
Production accuracy (2025 data): 68.00%
Performance drop: 28.20%
Training feature means: [0.02 0.03]
Production feature means: [ 1.97 -1.01]
Means shifted — the model learned patterns that no longer hold.
=== Distribution Shift Detection ===
Feature 0: KS stat=0.872, p=0.0000 -> SHIFTED
Feature 1: KS stat=0.635, p=0.0000 -> SHIFTED
Rule: Monitor production data distributions and retrain periodically.
Models Degrade Over Time
Data distributions change — user behavior, market conditions, seasonality
Monitor feature distributions in production — alert on significant shifts using KS test or PSI
Retrain on recent data periodically — monthly or quarterly depending on drift speed
Use A/B testing to validate new models against the current production model before swapping
Production Insight
In fast-changing domains, distribution shift can cost a model 10-30% of its accuracy within 6 months. Treat that range as a rule of thumb and measure the decay on your own data.
Monitor feature means, variances, and distributions in production dashboards.
Automate retraining pipelines that trigger on drift detection — do not rely on calendar schedules alone.
The Kolmogorov-Smirnov test and Population Stability Index (PSI) are the two most commonly used drift detectors.
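The Population Stability Index is named above but not defined. A minimal sketch of one common formulation follows, assuming bins taken from quantiles of the training sample. The function name `population_stability_index`, the bin count, and the 0.1 / 0.25 levels mentioned in the comments are illustrative conventions (a widely quoted rule of thumb), not values from this guide.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (expected) and a newer sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges come from quantiles of the reference sample only
    interior = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    # searchsorted maps every value to one of n_bins buckets (0 .. n_bins - 1)
    exp_idx = np.searchsorted(interior, expected, side='right')
    act_idx = np.searchsorted(interior, actual, side='right')
    exp_frac = np.bincount(exp_idx, minlength=n_bins) / expected.size
    act_frac = np.bincount(act_idx, minlength=n_bins) / actual.size
    # Clip so that an empty bucket does not produce log(0)
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 5000)     # reference (training) sample
same_feature = rng.normal(0.0, 1.0, 5000)      # new sample, same distribution
shifted_feature = rng.normal(2.0, 1.0, 5000)   # new sample, mean shifted by 2

psi_same = population_stability_index(train_feature, same_feature)
psi_shifted = population_stability_index(train_feature, shifted_feature)

# Rule of thumb (assumption): < 0.1 stable, 0.1 to 0.25 watch, > 0.25 investigate
print(f'PSI (same distribution): {psi_same:.3f}')
print(f'PSI (mean shifted by 2): {psi_shifted:.3f}')
```

Computing the bin edges from the training sample alone keeps the reference fixed, so PSI values from different production batches stay comparable with each other.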
Key Takeaway
Data distributions shift over time — models trained on old data degrade silently.
Monitor production feature distributions and alert on significant changes.
Retrain on recent data periodically to maintain model performance.
Mistake 10: Using the Wrong Loss Function
The loss function defines what the model optimizes for. Using the wrong loss function means the model optimizes the wrong objective. For classification, use cross-entropy loss — not mean squared error. For regression with outliers, use Huber loss — not mean squared error. For imbalanced classification, use weighted cross-entropy or focal loss. The loss function must match the problem type and business objective. This mistake is especially common when beginners copy code from tutorials without understanding why a particular loss function was chosen. MSE penalizes outliers quadratically, which makes the model chase extreme values. Huber loss transitions from quadratic (near zero error) to linear (large error), making it robust to outliers.
mistake10_wrong_loss.py (Python)
# TheCodeForge - Mistake 10: Wrong Loss Function
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Create regression data with outliers
np.random.seed(42)
X = np.random.randn(200, 1)
y = 3 * X.squeeze() + np.random.randn(200) * 0.5
# Add outliers: every 10th point has a large error
y[::10] += 10
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# MISTAKE: MSE loss with outliers — model chases extreme values
lr = LinearRegression()
lr.fit(X_train, y_train)
pred_mse = lr.predict(X_test)
print('=== MSE Loss (MISTAKE for outlier data) ===')
print(f'MSE: {mean_squared_error(y_test, pred_mse):.2f}')
print(f'MAE: {mean_absolute_error(y_test, pred_mse):.2f}')
print(f'Coefficient: {lr.coef_[0]:.2f} (true value: 3.00)')
print(f'Intercept: {lr.intercept_:.2f} (true value: 0.00)')
# CORRECT: Huber loss — robust to outliers
huber = HuberRegressor(epsilon=1.35) # default epsilon
huber.fit(X_train, y_train)
pred_huber = huber.predict(X_test)
print(f'\n=== Huber Loss (CORRECT for outlier data) ===')
print(f'MSE: {mean_squared_error(y_test, pred_huber):.2f}')
print(f'MAE: {mean_absolute_error(y_test, pred_huber):.2f}')
print(f'Coefficient: {huber.coef_[0]:.2f} (true value: 3.00)')
print(f'Intercept: {huber.intercept_:.2f} (true value: 0.00)')
print(f'\n=== Loss Function Guide ===')
print(f'Classification: cross-entropy (log_loss)')
print(f'Regression (clean): MSE')
print(f'Regression (outliers): Huber or MAE')
print(f'Imbalanced classes: weighted cross-entropy or focal loss')
Output
=== MSE Loss (MISTAKE for outlier data) ===
MSE: 12.45
MAE: 2.18
Coefficient: 3.42 (true value: 3.00)
Intercept: 0.95 (true value: 0.00)
=== Huber Loss (CORRECT for outlier data) ===
MSE: 8.72
MAE: 1.65
Coefficient: 3.12 (true value: 3.00)
Intercept: 0.35 (true value: 0.00)
=== Loss Function Guide ===
Classification: cross-entropy (log_loss)
Regression (clean): MSE
Regression (outliers): Huber or MAE
Imbalanced classes: weighted cross-entropy or focal loss
Match the Loss Function to the Problem
MSE penalizes large errors quadratically — outliers dominate the optimization
Huber loss transitions from quadratic to linear — robust to outliers
Cross-entropy is correct for classification — MSE is not
Weighted loss functions handle class imbalance during training
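To make the quadratic-to-linear transition concrete, here is a small sketch of the textbook Huber loss next to squared error. The threshold name `delta` and the helper names are assumptions chosen for illustration; sklearn's `HuberRegressor` exposes a related but differently scaled parameter, `epsilon`, which is applied to residuals divided by an estimated scale.

```python
import numpy as np

def squared_loss(residual):
    # Grows quadratically: a residual of 10 costs 100x a residual of 1
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.0):
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2           # used when |residual| <= delta
    linear = delta * (abs_r - 0.5 * delta)    # used when |residual| > delta
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 10.0])
print('squared:', squared_loss(residuals))   # [ 0.125  0.5   50.  ]
print('huber:  ', huber_loss(residuals))     # [0.125 0.5   9.5  ]
```

For the outlier-sized residual of 10, squared error contributes 50 while Huber contributes 9.5, which is why a handful of extreme points pulls an MSE fit much harder than a Huber fit.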
Production Insight
The wrong loss function quietly biases the model toward outliers or the wrong objective.
Always visualize residuals after training regression models — patterns indicate a loss function mismatch.
For business-critical applications, define a custom loss function that reflects the actual cost of different error types.
Key Takeaway
The loss function defines what the model optimizes — choose it deliberately.
Use Huber loss for regression with outliers. Use cross-entropy for classification.
A mismatched loss function silently degrades predictions without raising errors.
Mistake 11: Not Using sklearn Pipeline
Manual preprocessing — scaling, encoding, feature selection — is error-prone and the most common source of data leakage in production. A sklearn Pipeline chains preprocessing steps and the model into a single object. The pipeline ensures preprocessing is fit on training data only and applied consistently to test and production data. It also simplifies hyperparameter tuning and deployment. Without a Pipeline, you must remember to apply every preprocessing step in the correct order to every new dataset. Miss one step, apply them in the wrong order, or accidentally fit a scaler on test data, and your predictions are silently wrong. The Pipeline eliminates this entire class of bugs by design.
mistake11_no_pipeline.py (Python)
# TheCodeForge - Mistake 11: Not Using sklearn Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import numpy as np
import joblib
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# MISTAKE: Manual preprocessing (error-prone, leakage risk)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)
model = LogisticRegression(random_state=42)
model.fit(X_train_pca, y_train)
# Must remember to apply same transforms to test data
X_test_scaled = scaler.transform(X_test)
X_test_pca = pca.transform(X_test_scaled)
print('=== Manual Preprocessing (MISTAKE) ===')
print(f'Accuracy: {model.score(X_test_pca, y_test):.2%}')
print('Problem: easy to forget steps, apply in wrong order, or fit on wrong data.')
# CORRECT: Pipeline — leakage-proof by design
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('classifier', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
print(f'\n=== sklearn Pipeline (CORRECT) ===')
print(f'Accuracy: {pipeline.score(X_test, y_test):.2%}')
# Cross-validation works seamlessly with Pipeline
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f'CV accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')
# Deployment: serialize the entire pipeline — one file, one predict call
joblib.dump(pipeline, 'model_pipeline.joblib')
loaded_pipeline = joblib.load('model_pipeline.joblib')
print(f'\nLoaded pipeline accuracy: {loaded_pipeline.score(X_test, y_test):.2%}')
print('Deployment: one file contains scaler + PCA + model.')
print('\nRule: Always use Pipeline — it prevents leakage and simplifies deployment.')
Output
=== Manual Preprocessing (MISTAKE) ===
Accuracy: 88.00%
Problem: easy to forget steps, apply in wrong order, or fit on wrong data.
=== sklearn Pipeline (CORRECT) ===
Accuracy: 88.00%
CV accuracy: 87.20% (+/- 2.14%)
Loaded pipeline accuracy: 88.00%
Deployment: one file contains scaler + PCA + model.
Rule: Always use Pipeline — it prevents leakage and simplifies deployment.
Pipeline Automates Correct Preprocessing
sklearn Pipeline ensures preprocessing is fit on training data only and applied consistently. It prevents data leakage, simplifies cross-validation, and makes deployment trivial — serialize the entire pipeline as one object with joblib. One file, one predict call, zero risk of preprocessing mismatch between training and serving.
Production Insight
Manual preprocessing is the #1 source of data leakage and training-serving skew in production.
Pipeline ensures consistent preprocessing between training and serving — this alone prevents entire categories of production bugs.
Serialize the entire pipeline with joblib — one file contains everything needed for prediction. No separate scaler files, no manual transform steps.
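The section mentions that a Pipeline also simplifies hyperparameter tuning. A minimal sketch of what that can look like is below, assuming the same scaler + PCA + logistic regression steps; the grid values are arbitrary illustrations, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000)),
])

# Step parameters are addressed as '<step name>__<parameter name>'
param_grid = {
    'pca__n_components': [5, 10, 15],
    'classifier__C': [0.1, 1.0, 10.0],
}

# Within each CV fold the scaler and PCA are refit on that fold's training part only
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(f'Best params: {search.best_params_}')
print(f'Best CV accuracy: {search.best_score_:.2%}')
print(f'Test accuracy: {search.score(X_test, y_test):.2%}')
```

Because the search wraps the whole pipeline, preprocessing choices such as the number of PCA components are tuned under the same leakage-free conditions as the classifier itself.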
Key Takeaway
sklearn Pipeline prevents data leakage by chaining preprocessing and model.
Manual preprocessing is error-prone — Pipeline automates the correct order.
Serialize the entire pipeline for deployment — one file, one predict call.
Mistake 12: Not Validating with Domain Experts
Technical metrics do not guarantee business value. A model can achieve high accuracy while making predictions that are nonsensical to domain experts. Feature importance can reveal that the model relies on features that should not predict the target. Clusters can be statistically valid but business-meaningless. Always validate model outputs with domain experts before deployment, because they catch errors that metrics miss. This is a process step rather than a technical one, and skipping it is one of the most expensive mistakes in ML. A model that makes technically correct but domain-inappropriate predictions will erode stakeholder trust faster than a model that makes honest errors.
mistake12_no_domain_validation.py (Python)
# TheCodeForge - Mistake 12: No Domain Expert Validation
# Example: A model predicts house prices using zip code as a numeric feature
# The model achieves high R-squared but makes nonsensical predictions
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
# Simulated data: house prices
np.random.seed(42)
n_samples = 500
zip_code = np.random.randint(10000, 99999, n_samples)
sqft = np.random.randint(500, 5000, n_samples)
price = sqft * 150 + np.random.randn(n_samples) * 10000
X = np.column_stack([zip_code, sqft])
X_train, X_test, y_train, y_test = train_test_split(
X, price, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print('=== Technical Metrics Look Good ===')
print(f'R-squared: {r2_score(y_test, predictions):.2%}')
print(f'MAE: ${mean_absolute_error(y_test, predictions):,.0f}')
print(f'\nModel coefficients:')
print(f' zip_code: {model.coef_[0]:.4f} per unit')
print(f' sqft: {model.coef_[1]:.2f} per unit')
print(f'\n=== Domain Expert Would Catch This ===')
print(f'The model treats zip_code as a continuous number.')
print(f'Zip code 99998 is not "worth more" than zip code 10001.')
print(f'This is nonsensical — zip code is categorical, not numeric.')
print(f'Fix: one-hot encode zip_code or use target encoding.')
# What a domain expert review should include:
print(f'\n=== Domain Expert Review Checklist ===')
print(f'1. Are feature types correct? (categorical vs numeric)')
print(f'2. Do feature importances make domain sense?')
print(f'3. Do sample predictions pass the sanity test?')
print(f'4. Are there edge cases the model handles incorrectly?')
print(f'5. Would you trust this prediction if it were your money?')
print('\nRule: Always validate predictions with domain experts before deployment.')
Output
=== Technical Metrics Look Good ===
R-squared: 95.42%
MAE: $8,234
Model coefficients:
zip_code: 0.0234 per unit
sqft: 150.03 per unit
=== Domain Expert Would Catch This ===
The model treats zip_code as a continuous number.
Zip code 99998 is not "worth more" than zip code 10001.
This is nonsensical — zip code is categorical, not numeric.
Fix: one-hot encode zip_code or use target encoding.
=== Domain Expert Review Checklist ===
1. Are feature types correct? (categorical vs numeric)
2. Do feature importances make domain sense?
3. Do sample predictions pass the sanity test?
4. Are there edge cases the model handles incorrectly?
5. Would you trust this prediction if it were your money?
Rule: Always validate predictions with domain experts before deployment.
Metrics Do Not Guarantee Business Value
High accuracy does not mean the model is correct. Domain experts catch issues that metrics miss: nonsensical feature usage, predictions that violate business rules, and edge cases that training data did not cover. Always validate with humans before deploying. A 30-minute review with a domain expert can save months of production debugging.
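The fix named in the example, one-hot encoding zip_code, can be sketched as follows. This is an illustration under stated assumptions: the five zip codes and their price premiums are hypothetical, chosen so that the premium does not follow the numeric order of the codes.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)
n_samples = 500

# Hypothetical zip codes and per-zip premiums (not ordered by zip number)
zip_values = np.array([10001, 30301, 60601, 94105, 98101])
zip_premium = np.array([300_000, 0, 100_000, 400_000, 50_000])

zip_idx = rng.integers(0, len(zip_values), n_samples)
sqft = rng.integers(500, 5000, n_samples)
price = sqft * 150 + zip_premium[zip_idx] + rng.normal(0, 10_000, n_samples)

X = np.column_stack([zip_values[zip_idx], sqft])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Zip code as a number: the model can only fit a single slope to it
numeric_model = LinearRegression().fit(X_train, y_train)
r2_numeric = r2_score(y_test, numeric_model.predict(X_test))

# Zip code as a category: column 0 is one-hot encoded, sqft passes through
encoded_model = Pipeline([
    ('encode', ColumnTransformer(
        [('zip', OneHotEncoder(handle_unknown='ignore'), [0])],
        remainder='passthrough',
    )),
    ('regressor', LinearRegression()),
])
encoded_model.fit(X_train, y_train)
r2_onehot = r2_score(y_test, encoded_model.predict(X_test))

print(f'R-squared, zip as number:   {r2_numeric:.3f}')
print(f'R-squared, zip as category: {r2_onehot:.3f}')
```

With many distinct zip codes, one-hot encoding produces a very wide matrix, which is where the target encoding mentioned above becomes the more practical option.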
Production Insight
Technical metrics do not guarantee business value — domain experts catch what metrics miss.
Feature importance review with domain experts prevents nonsensical predictions and catches encoding mistakes.
Always run a validation step with stakeholders before production deployment — show them sample predictions and ask if they make sense.
Key Takeaway
Technical metrics do not guarantee business value — domain experts catch what metrics miss.
Validate feature importance and predictions with domain experts before deployment.
A model that makes nonsensical predictions is useless regardless of accuracy.
Production incident · Post-mortem · Severity: high
Fraud Detection Model Reports 99.5% Accuracy — Catches Zero Fraud
Symptom
Model accuracy was 99.5% on the test set. After deployment, recall for the fraud class was 0%: no fraudulent transactions were flagged in 30 days. The business lost $2.3M to undetected fraud.
Assumption
The team assumed 99.5% accuracy meant the model was excellent. They did not check per-class metrics. They did not understand that accuracy is misleading on imbalanced datasets.
Root cause
The dataset had 99.5% legitimate and 0.5% fraudulent transactions. The model learned to always predict 'legitimate' and achieved 99.5% accuracy by never detecting fraud. This is the majority class bias problem — accuracy is meaningless on imbalanced data. The team needed to use precision, recall, F1-score, and AUC-ROC instead of accuracy.
Fix
1. Replaced accuracy with F1-score and AUC-ROC as primary metrics
2. Applied SMOTE oversampling to balance the training set
3. Used class_weight='balanced' in the classifier
4. Set a decision threshold based on business cost of false negatives vs false positives
5. Added per-class metrics monitoring in production dashboards
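Fix step 4, choosing a decision threshold from business costs, can be sketched like this. The synthetic data, the two cost constants, and the threshold grid are all assumptions made for illustration; real costs would come from the business.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.02).astype(int)   # roughly 2% "fraud"
X = rng.normal(size=(n, 2))
X[y == 1] += 3.0                          # fraud rows are shifted so they are learnable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(class_weight='balanced', random_state=0)
model.fit(X_train, y_train)
fraud_proba = model.predict_proba(X_test)[:, 1]

COST_FALSE_NEGATIVE = 500   # assumed cost of one missed fraud
COST_FALSE_POSITIVE = 5     # assumed cost of reviewing one legitimate transaction

def total_cost(threshold):
    """Business cost of the errors made when flagging at this threshold."""
    pred = (fraud_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE

thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = float(thresholds[np.argmin(costs)])

best_pred = (fraud_proba >= best_threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, best_pred, labels=[0, 1]).ravel()
recall = tp / (tp + fn)

print(f'Lowest-cost threshold: {best_threshold:.2f}')
print(f'Fraud recall at that threshold: {recall:.2%}')
print(f'False positives: {fp}, false negatives: {fn}')
```

In practice the threshold would be chosen on a validation split rather than the final test set, so that the test set is still evaluated only once.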
Key lesson
Accuracy is meaningless on imbalanced datasets — always check per-class metrics
A model that always predicts the majority class achieves high accuracy but zero value
Use F1-score, precision, recall, and AUC-ROC for imbalanced classification
Production debug guide: symptom-to-action mapping for common beginner mistakes (6 entries)
Symptom · 01
Training accuracy is 99% but test accuracy is 60%
→
Fix
Overfitting detected. Reduce model complexity (fewer layers/trees), add regularization (L1/L2), increase training data, or use dropout for neural networks. Plot learning curves to confirm — if the training curve is flat at 99% and the validation curve plateaus far below, the model is memorizing.
Symptom · 02
Test accuracy is suspiciously high (99%+ on first try)
→
Fix
Possible data leakage. Check if test data leaked into training via preprocessing, feature engineering, or temporal ordering. Re-split data BEFORE any transformations. Inspect feature importance — a single dominant feature often indicates target leakage.
Symptom · 03
Accuracy is 95% but model predicts the same class for everything
→
Fix
Imbalanced dataset. Check class distribution with np.unique(y, return_counts=True). Replace accuracy with F1-score, precision, recall. Apply class_weight='balanced' or use SMOTE oversampling. Print the confusion matrix — it will show all predictions in one column.
Symptom · 04
Model performs well locally but fails in production
→
Fix
Training-serving skew. Check if preprocessing steps differ between training and production. Verify feature distributions match using statistical tests (KS test, PSI). Retrain on recent data and use sklearn Pipeline to guarantee consistent transforms.
Symptom · 05
Model accuracy changes significantly between runs
→
Fix
Unstable validation. Use cross-validation instead of a single train-test split. Set random_state for reproducibility in train_test_split, model constructors, and any sampling steps. Increase test set size if the dataset is small.
Symptom · 06
Feature importance shows one feature dominates everything
→
Fix
Possible target leakage. Check if the feature is derived from or correlated with the target variable. Remove features that would not be available at prediction time. Retrain without the feature and compare — if accuracy drops dramatically, the feature was almost certainly leaking.
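The learning-curve check suggested for Symptom 01 can be sketched with sklearn's `learning_curve`. The noisy synthetic dataset and the unrestricted decision tree are assumptions picked to make the overfitting pattern obvious.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(600) < 0.2      # flip about 20% of labels to simulate noise
y[flip] = 1 - y[flip]

# An unrestricted tree can memorize every training row, noise included
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f'{n:4d} samples: train={tr:.2%}  validation={va:.2%}  gap={tr - va:.2%}')
```

A training score pinned near 100% with a validation score that stays well below it at every training size is the memorization signature described in the entry; limiting `max_depth` is one way to see the gap shrink.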
ML Mistake Quick Diagnostics: immediate checks to detect common ML mistakes
If any class < 10%, apply SMOTE or class_weight='balanced'
Need to check for data leakage
Immediate action
Verify preprocessing was fit on training data only
Commands
# Check: was StandardScaler.fit() called on X_train only?
# WRONG: scaler.fit(X) then train_test_split
# RIGHT: train_test_split then scaler.fit(X_train)
python -c "print('Verify split happens BEFORE preprocessing')"
# Check for temporal leakage: does test data come AFTER train data?
# If time-series: sort by date, split chronologically
python -c "print('For time-series: split chronologically, not randomly')"
Fix now
Always split first, then fit preprocessing on training data only
ML Mistakes — Impact and Fix Summary
Mistake | Category | Symptom | Impact | Fix
Overfitting | Model | Train acc >> Test acc | Model fails on new data | Reduce complexity, add regularization
Data Leakage | Data | Suspiciously high accuracy | False confidence, production failure | Split before preprocessing, use Pipeline
Wrong Metrics | Evaluation | High accuracy, no business value | Stakeholder trust loss | Use F1, precision, recall, AUC-ROC
No Cross-Validation | Evaluation | Accuracy varies between runs | Unreliable performance estimate | Use cross_val_score with cv=5
No Feature Scaling | Preprocessing | Poor convergence, biased distances | Degraded model performance | Scale for distance/gradient algorithms
No Baseline | Evaluation | Model looks good but beats nothing | Wasted engineering effort | Compare against DummyClassifier
Tuning on Test Data | Validation | Inflated test accuracy | Data leakage, false confidence | Use GridSearchCV on training data
No Feature Importance | Interpretability | Do not understand predictions | Missed leakage, noise features | Inspect feature_importances_
Distribution Shift | Production | Performance degrades over time | Silent model failure | Monitor distributions, retrain periodically
Wrong Loss Function | Training | Model optimizes wrong objective | Suboptimal predictions | Match loss to problem type and data quality
No Pipeline | Code Quality | Preprocessing errors, leakage | Inconsistent train/serving | Use sklearn Pipeline
No Domain Validation | Process | Nonsensical predictions | Business value loss | Validate with domain experts before deploy
Key takeaways
1. Overfitting is the #1 mistake: always compare train vs test accuracy to detect it
2. Data leakage silently inflates metrics: always split BEFORE preprocessing and use sklearn Pipeline
Always establish a baseline before training complex models: if you cannot beat DummyClassifier, fix the data
6. Feature importance reveals leakage and noise: inspect it after every training run
7. Monitor production data distributions: models degrade silently as data drifts over time
Common mistakes to avoid
6 patterns
×
Reporting training accuracy instead of test accuracy
Symptom
Model shows 99% accuracy during development but fails on every real-world input. Stakeholders lose trust when production performance does not match reported metrics.
Fix
Always report test accuracy or cross-validation accuracy. Never report training accuracy — it measures memorization, not generalization. Use model.score(X_test, y_test), not model.score(X_train, y_train). Better yet, report cross-validation scores with standard deviation.
×
Fitting preprocessing on the full dataset before splitting
Symptom
Test accuracy is 2-10% higher than it should be. The model performs well locally but fails in production because the preprocessing saw test data during training.
Fix
Always split data BEFORE preprocessing. Use sklearn Pipeline to enforce the correct order: split first, then fit scaler on training data, then transform both train and test. This is non-negotiable for any production system.
×
Using accuracy as the primary metric for imbalanced classification
Symptom
Model reports 95% accuracy but predicts the same class for every input. The majority class dominates and the model never learns to detect the minority class.
Fix
Use F1-score, precision, recall, and AUC-ROC for imbalanced datasets. Apply class_weight='balanced' or use SMOTE oversampling. Always check the confusion matrix — it reveals what accuracy hides.
×
Not setting random_state for reproducibility
Symptom
Model accuracy changes every time you run the script. Results are not reproducible. You cannot compare models because the train-test split changes each run.
Fix
Set random_state=42 (or any fixed integer) in train_test_split, model constructors, and cross-validation. This ensures identical results every run and makes debugging possible.
×
Using a complex model without trying a simple baseline first
Symptom
Spent weeks building a neural network that achieves 80% accuracy. A simple logistic regression achieves 85% on the same data. The complex model was unnecessary and harder to maintain.
Fix
Always start with a baseline: DummyClassifier for classification, DummyRegressor for regression. Then try simple models (logistic regression, decision tree) before complex ones. Complexity must be justified by measurable improvement.
×
Not checking for target leakage in features
Symptom
Model achieves 99% accuracy with one feature dominating importance. The feature is derived from or highly correlated with the target variable. The model cheats by using future information.
Fix
Inspect feature_importances_ after training. If one feature dominates (>50%), investigate for leakage. Remove features that would not be available at prediction time in production. Ask: could I know this feature's value BEFORE the event I am trying to predict?
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 04 · JUNIOR
What is the difference between overfitting and underfitting, and how do you detect each?
ANSWER
Overfitting occurs when a model learns training data too well — including noise — and fails on new data. The symptom is high training accuracy with low test accuracy (large gap). Underfitting occurs when a model is too simple to capture the underlying pattern. The symptom is low accuracy on both training and test data. Detection: compare training and test accuracy. If train >> test, overfitting. If both are low, underfitting. If both are high and similar, good fit. Fix overfitting by reducing complexity, adding regularization, or getting more data. Fix underfitting by increasing model complexity or adding more informative features.
Q02 of 04 · SENIOR
Explain data leakage with a real-world example and how to prevent it.
ANSWER
Data leakage occurs when information from the test set influences training, inflating metrics. Example: fitting a StandardScaler on the full dataset before train-test split. The scaler computes mean and std using test data, so the model indirectly sees test information during training. Another example: in a medical study, including lab results that are only available after diagnosis as features to predict the diagnosis — the model uses future information. Prevention: (1) Always split before preprocessing. (2) Use sklearn Pipeline to enforce correct order. (3) For time-series, split chronologically — never use future data to predict the past. (4) Check for target leakage — features derived from or correlated with the target variable. The key rule: the test set must remain completely untouched until final evaluation.
Q03 of 04 · SENIOR
Why is accuracy a bad metric for imbalanced datasets, and what should you use instead?
ANSWER
Accuracy measures the percentage of correct predictions overall. On imbalanced datasets, a model that always predicts the majority class achieves high accuracy while being completely useless. Example: 95% legitimate transactions, 5% fraud — a model predicting 'legitimate' always achieves 95% accuracy but catches zero fraud. Better metrics: (1) Precision — of predicted positives, how many were correct. (2) Recall — of actual positives, how many were found. (3) F1-score — harmonic mean of precision and recall, balances both. (4) AUC-ROC — measures discrimination ability across all thresholds, independent of class distribution. For imbalanced problems, F1-score or AUC-ROC should be the primary metric. Additionally, always inspect the confusion matrix to understand per-class performance.
Q04 of 04 · SENIOR
How do you design a robust ML evaluation pipeline that prevents all common mistakes?
ANSWER
A robust pipeline has 5 layers: (1) Data split — train/test split with stratify=y BEFORE any preprocessing, using a fixed random_state for reproducibility. (2) Pipeline — sklearn Pipeline chains preprocessing and model to prevent leakage and ensure consistent transforms. (3) Cross-validation — GridSearchCV tunes hyperparameters on training data using cv=5, never touching the test set. (4) Metrics — use problem-appropriate metrics (F1 for imbalanced, RMSE for regression, AUC-ROC for ranking) and always compare against a baseline model. (5) Final evaluation — evaluate the tuned model on the untouched test set exactly once and report with confidence intervals. Additionally: check feature importance for leakage, validate predictions with domain experts, monitor production data distributions for drift, and set up automated retraining triggers.
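The five layers in this answer can be put together in one short script. This is a sketch under assumptions: the synthetic dataset, the single tuned hyperparameter, and the choice of F1 as the metric are placeholders, and confidence intervals are left out to keep it brief.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# (1) Stratified split with a fixed seed, before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# (2) Preprocessing and model chained in one Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000)),
])

# (3) Hyperparameters tuned with 5-fold CV on the training set only
search = GridSearchCV(
    pipeline, {'classifier__C': [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring='f1'
)
search.fit(X_train, y_train)

# (4) A baseline trained on the same data, scored with the same metric
baseline = DummyClassifier(strategy='stratified', random_state=42)
baseline.fit(X_train, y_train)
baseline_f1 = f1_score(y_test, baseline.predict(X_test))

# (5) One final evaluation on the untouched test set
final_f1 = f1_score(y_test, search.predict(X_test))

print(f'Best C: {search.best_params_["classifier__C"]}')
print(f'CV F1 (training folds): {search.best_score_:.2%}')
print(f'Baseline F1 (test): {baseline_f1:.2%}')
print(f'Tuned model F1 (test): {final_f1:.2%}')
```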
FAQ · 7 QUESTIONS
Frequently Asked Questions
01
How do I know if my model is overfitting?
Compare training accuracy to test accuracy. If training accuracy is significantly higher (>10% gap), the model is overfitting. For example, 99% training accuracy with 80% test accuracy indicates severe overfitting. The model memorized the training data instead of learning generalizable patterns. Fix: reduce model complexity (fewer layers, lower max_depth, fewer estimators), add regularization (L1, L2, dropout), or collect more training data. Plot learning curves — if test accuracy plateaus while training accuracy keeps climbing, the model needs regularization, not more training.
02
What is the simplest way to prevent data leakage?
Use sklearn Pipeline. A Pipeline chains preprocessing steps and the model into a single object. When you call pipeline.fit(X_train, y_train), the pipeline automatically fits preprocessing on training data only. When you call pipeline.predict(X_test), it applies the same preprocessing to test data without refitting. This eliminates the most common source of leakage: fitting preprocessing on the full dataset before splitting. For time-series data, additionally ensure you split chronologically using TimeSeriesSplit, not randomly.
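For the time-series case, a minimal sketch of `TimeSeriesSplit` is below. The twelve-row array stands in for observations that are already sorted from oldest to newest, which is an assumption the splitter relies on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations, already sorted from oldest (0) to newest (11)
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f'Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}')
```

Every fold trains on a prefix of the timeline and tests on the rows that follow it, so no future row is ever used to predict an earlier one.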
03
Should I always use cross-validation instead of a single train-test split?
Use cross-validation for model evaluation and hyperparameter tuning — it gives a more reliable performance estimate with confidence bounds. Use a single train-test split for final evaluation — it simulates production conditions where you evaluate on truly unseen data. The standard approach: split data into train/test (80/20), use cross-validation on the training set for model selection and tuning, then evaluate the final model on the untouched test set once. For small datasets (<1000 samples), cross-validation is especially important because a single split can be highly unrepresentative.
04
How do I handle imbalanced datasets without collecting more data?
Three approaches, from simplest to most involved: (1) Class weights — set class_weight='balanced' in sklearn classifiers. This penalizes misclassifying the minority class more heavily during training. Requires no data modification. (2) Oversampling — use SMOTE (from imbalanced-learn library) to generate synthetic minority class samples. Creates new training samples by interpolating between existing minority samples. (3) Undersampling — randomly remove majority class samples to balance the dataset. Simple but loses information. Combine any of these with appropriate metrics (F1-score, AUC-ROC) instead of accuracy. Start with class weights — it works well in most cases and adds zero complexity.
05
How often should I retrain my production model?
It depends on how fast your data distribution changes. For stable distributions (medical imaging, physics simulations), retrain quarterly or when new data is available. For moderately changing distributions (e-commerce recommendations, marketing), retrain monthly. For rapidly changing distributions (news classification, social media trending, financial markets), retrain weekly or daily. The key is monitoring: track feature distributions and model performance metrics in production. When performance drops below a threshold or feature distributions shift significantly (detected via KS test or PSI), trigger retraining. Automated drift detection is better than fixed calendar schedules.
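As an illustration of the drift checks mentioned above, here is a minimal sketch that compares a reference (training-time) sample of one feature against a production sample using scipy's two-sample KS test and a hand-written PSI. The helper name population_stability_index, the 10-bin quantile scheme, and the simulated data are all assumptions for the example. PSI alert thresholds such as 0.2 are common rules of thumb, not fixed standards.

```python
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI over quantile bins of the reference sample (hypothetical helper)."""
    # Interior bin edges taken from reference quantiles; outer bins are open-ended.
    interior_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_bins = np.searchsorted(interior_edges, reference, side="right")
    cur_bins = np.searchsorted(interior_edges, current, side="right")
    ref_pct = np.bincount(ref_bins, minlength=n_bins) / len(reference)
    cur_pct = np.bincount(cur_bins, minlength=n_bins) / len(current)
    # Clip to avoid log(0) when a bin is empty.
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)  # training-time feature values
stable = rng.normal(loc=0.0, scale=1.0, size=5000)     # production, no drift
shifted = rng.normal(loc=1.0, scale=1.0, size=5000)    # production, mean has drifted

ks_stable = ks_2samp(reference, stable)
ks_shifted = ks_2samp(reference, shifted)
psi_stable = population_stability_index(reference, stable)
psi_shifted = population_stability_index(reference, shifted)

print("Stable : KS stat %.3f, PSI %.4f" % (ks_stable.statistic, psi_stable))
print("Shifted: KS stat %.3f, p-value %.2e, PSI %.4f"
      % (ks_shifted.statistic, ks_shifted.pvalue, psi_shifted))
```

In a monitoring job, a check like this would run per feature on a schedule, with retraining triggered when the statistic crosses a threshold you have chosen for your own data.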
06
What is the difference between a validation set and a test set?
A validation set is used during model development for hyperparameter tuning and model selection — you evaluate on it repeatedly. A test set is used exactly once for final evaluation — it provides an unbiased estimate of production performance. In practice, cross-validation on the training set replaces the need for a separate validation set — you tune on cross-validation folds within the training data and evaluate on the untouched test set. The three-way split (train/validation/test) is more common in deep learning where cross-validation is computationally expensive due to long training times.
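For the three-way split, a small sketch under assumed 60/20/20 proportions, with a decision tree and a hypothetical list of candidate depths standing in for a real model and search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First split: hold out 20% as the test set.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Second split: 25% of the remaining 80% becomes the validation set (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0
)

# The validation set is evaluated repeatedly while choosing a hyperparameter.
val_scores = {}
for depth in (2, 4, 8):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_scores[depth] = candidate.score(X_val, y_val)
best_depth = max(val_scores, key=val_scores.get)

# The test set is used once, after all tuning decisions are final.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print("Validation scores by depth:", val_scores)
print("Chosen depth:", best_depth, "| test accuracy:", round(test_accuracy, 3))
```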
07
How do I know if a feature is causing data leakage?
Three indicators: (1) Feature importance: if one feature dominates (>50% of total importance), investigate whether it is derived from the target or contains future information. (2) Temporal availability: ask yourself whether this feature would be available at prediction time in production. If not, it is leaking. (3) Suspicious accuracy: if removing a single feature drops accuracy by more than 20 percentage points, treat it as a likely leak and check how the feature is computed, since a legitimately strong predictor can also cause a large drop. Example: a 'days_since_last_purchase' feature in a churn prediction model that is calculated using the churn date itself encodes the target directly. Always ask: could I compute this feature BEFORE the event I am trying to predict?
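Indicators (1) and (3) can be checked in code. The sketch below builds a synthetic dataset in which one column, named leaky_feature here purely for illustration, is computed from the target itself, then measures its importance share and the accuracy drop when it is removed. The random forest settings (including max_features=None so every split can see every feature) are assumptions chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples = 1000

# Four honest features; only f0 carries a weak signal about the target.
honest = rng.normal(size=(n_samples, 4))
y = (honest[:, 0] + rng.normal(scale=2.0, size=n_samples) > 0).astype(int)

# A leaky feature derived directly from the target (plus a little noise).
leaky = y + rng.normal(scale=0.05, size=n_samples)
X = np.column_stack([honest, leaky])
feature_names = ["f0", "f1", "f2", "f3", "leaky_feature"]

model = RandomForestClassifier(n_estimators=50, max_features=None, random_state=0)
model.fit(X, y)

# Indicator 1: does one feature dominate total importance?
importance_share = model.feature_importances_ / model.feature_importances_.sum()
top_index = int(np.argmax(importance_share))
top_feature = feature_names[top_index]
top_share = float(importance_share[top_index])
print("Top feature: %s (%.0f%% of importance)" % (top_feature, 100 * top_share))

# Indicator 3: how much does accuracy fall when that feature is removed?
acc_with = cross_val_score(model, X, y, cv=5).mean()
acc_without = cross_val_score(model, np.delete(X, top_index, axis=1), y, cv=5).mean()
accuracy_drop = acc_with - acc_without
print("Accuracy with: %.3f, without: %.3f, drop: %.3f"
      % (acc_with, acc_without, accuracy_drop))

if top_share > 0.5 or accuracy_drop > 0.2:
    print("Investigate '%s': is it available before the predicted event?" % top_feature)
```

Neither check proves leakage on its own. Both only flag a feature for the temporal-availability question in indicator (2).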