
Common Machine Learning Mistakes Beginners Make (And How to Fix Them)

📍 Part of: ML Basics → Topic 24 of 25
Top 12 mistakes new learners make with overfitting, data leakage, wrong metrics, and bad validation.
🧑‍💻 Beginner-friendly — no prior ML / AI experience needed
In this tutorial, you'll learn
  • Overfitting is the #1 mistake — always compare train vs test accuracy to detect it
  • Data leakage silently inflates metrics — always split BEFORE preprocessing and use sklearn Pipeline
  • Accuracy is meaningless on imbalanced datasets — use F1-score, precision, recall, AUC-ROC
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Overfitting is the #1 beginner mistake — model memorizes training data and fails in production
  • Data leakage inflates accuracy by exposing test information during training
  • Using accuracy on imbalanced datasets gives misleading results — 95% accuracy can mean the model learned nothing
  • Train-test split must happen BEFORE any preprocessing to prevent leakage
  • Cross-validation is more reliable than a single train-test split for performance estimation
  • Biggest mistake: reporting training accuracy instead of test accuracy — it tells you nothing about production performance
🚨 START HERE
ML Mistake Quick Diagnostics
Immediate checks to detect common ML mistakes. Paste the snippet bodies into the Python session where model, X_train, X_test, y_train, and y_test are already defined — a fresh `python -c` process will not see those variables.
🟡 Need to check for overfitting
Immediate action: Compare training and test accuracy
Commands
python -c "from sklearn.metrics import accuracy_score; print('Train acc:', accuracy_score(y_train, model.predict(X_train))); print('Test acc:', accuracy_score(y_test, model.predict(X_test)))"
python -c "train_acc = model.score(X_train, y_train); test_acc = model.score(X_test, y_test); gap = train_acc - test_acc; print(f'Gap: {gap:.2%}'); print('Overfitting' if gap > 0.10 else 'OK')"
Fix now: If gap > 10%, reduce model complexity or add regularization
🟡 Need to check for class imbalance
Immediate action: Print class distribution in training set
Commands
python -c "import numpy as np; unique, counts = np.unique(y_train, return_counts=True); print(dict(zip(unique, counts)))"
python -c "import numpy as np; unique, counts = np.unique(y_train, return_counts=True); ratios = counts / counts.sum(); print('Class ratios:', dict(zip(unique, ratios.round(3))))"
Fix now: If any class < 10%, apply SMOTE or class_weight='balanced'
🟡 Need to check for data leakage
Immediate action: Verify preprocessing was fit on training data only
Commands
python -c "print('Check: was StandardScaler.fit() called on X_train only?'); print('WRONG: scaler.fit(X) then train_test_split'); print('RIGHT: train_test_split then scaler.fit(X_train)')"
python -c "print('Check temporal leakage: does test data come AFTER train data?'); print('For time-series: split chronologically, not randomly')"
Fix now: Always split first, then fit preprocessing on training data only
Production Incident: Fraud Detection Model Reports 99.5% Accuracy — Catches Zero Fraud
A fraud detection model achieved 99.5% accuracy during development. After deployment, it caught zero fraudulent transactions in the first month. The model had learned to always predict 'not fraud' because 99.5% of transactions were legitimate.
Symptom: Model accuracy was 99.5% on the test set. After deployment, precision for the fraud class was 0%. No fraudulent transactions were flagged in 30 days. The business lost $2.3M to undetected fraud.
Assumption: The team assumed 99.5% accuracy meant the model was excellent. They did not check per-class metrics, and they did not understand that accuracy is misleading on imbalanced datasets.
Root cause: The dataset had 99.5% legitimate and 0.5% fraudulent transactions. The model learned to always predict 'legitimate' and achieved 99.5% accuracy by never detecting fraud. This is the majority class bias problem — accuracy is meaningless on imbalanced data. The team needed precision, recall, F1-score, and AUC-ROC instead of accuracy.
Fix:
  1. Replaced accuracy with F1-score and AUC-ROC as primary metrics
  2. Applied SMOTE oversampling to balance the training set
  3. Used class_weight='balanced' in the classifier
  4. Set a decision threshold based on the business cost of false negatives vs false positives
  5. Added per-class metrics monitoring to production dashboards
Key Lesson
  • Accuracy is meaningless on imbalanced datasets — always check per-class metrics
  • A model that always predicts the majority class achieves high accuracy but zero value
  • Use F1-score, precision, recall, and AUC-ROC for imbalanced classification
Production Debug Guide — symptom-to-action mapping for common beginner mistakes
Training accuracy is 99% but test accuracy is 60% → Overfitting detected. Reduce model complexity (fewer layers/trees), add regularization (L1/L2), increase training data, or use dropout for neural networks. Plot learning curves to confirm — if the training curve is flat at 99% and the validation curve plateaus far below, the model is memorizing.
Test accuracy is suspiciously high (99%+ on first try) → Possible data leakage. Check if test data leaked into training via preprocessing, feature engineering, or temporal ordering. Re-split data BEFORE any transformations. Inspect feature importance — a single dominant feature often indicates target leakage.
Accuracy is 95% but model predicts the same class for everything → Imbalanced dataset. Check class distribution with np.unique(y, return_counts=True). Replace accuracy with F1-score, precision, recall. Apply class_weight='balanced' or use SMOTE oversampling. Print the confusion matrix — it will show all predictions in one column.
Model performs well locally but fails in production → Training-serving skew. Check if preprocessing steps differ between training and production. Verify feature distributions match using statistical tests (KS test, PSI). Retrain on recent data and use sklearn Pipeline to guarantee consistent transforms.
Model accuracy changes significantly between runs → Unstable validation. Use cross-validation instead of a single train-test split. Set random_state for reproducibility in train_test_split, model constructors, and any sampling steps. Increase test set size if the dataset is small.
Feature importance shows one feature dominates everything → Possible target leakage. Check if the feature is derived from or correlated with the target variable. Remove features that would not be available at prediction time. Retrain without the feature and compare — if accuracy drops dramatically, the feature was almost certainly leaking.
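The training-serving skew row mentions checking feature distributions with a KS test. Here is a minimal sketch of such a drift check using scipy's ks_2samp on synthetic data — the feature arrays, shift size, and 0.01 cutoff are illustrative assumptions, not a prescribed standard:

```python
# Drift check sketch: compare one feature's distribution between the
# training set and recent production data using the two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # training distribution
prod_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # shifted production distribution

stat, p_value = ks_2samp(train_feature, prod_feature)
print(f'KS statistic: {stat:.3f}, p-value: {p_value:.2e}')
if p_value < 0.01:
    print('Distributions differ — possible training-serving skew')
```

Run the same comparison per feature; a tiny p-value on any important feature is a signal to investigate preprocessing differences or retrain on recent data.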

Most ML failures in production are not caused by algorithm limitations — they are caused by preventable mistakes in data handling, evaluation, and validation. Data leakage silently inflates metrics. Overfitting creates models that memorize rather than generalize. Wrong metrics hide poor performance behind impressive-sounding numbers. These mistakes are invisible during development and catastrophic in production. After reviewing hundreds of beginner projects and debugging dozens of production pipelines, the same twelve mistakes appear over and over. This guide documents each one with concrete symptoms, root causes, and fixes you can apply today.

Mistake 1: Overfitting — Model Memorizes Instead of Learning

Overfitting occurs when a model learns the training data too well — including noise and outliers — and fails to generalize to new data. The symptom is a large gap between training accuracy (high) and test accuracy (low). Common causes include model complexity that exceeds data volume, training for too many epochs, and lack of regularization. Overfitting is the most common mistake because it is invisible during training — the model looks great until you evaluate on unseen data. In practice, every model overfits to some degree. The question is whether the gap is small enough to tolerate. A 2-3% gap is normal. A 15%+ gap means the model has memorized training samples and will fail on anything it has not seen before.

mistake01_overfitting.py · PYTHON
# TheCodeForge — Mistake 1: Overfitting Detection and Fix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Generate data
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# MISTAKE: Unrestricted Decision Tree (overfits)
dt_overfit = DecisionTreeClassifier(random_state=42)
dt_overfit.fit(X_train, y_train)
train_acc_overfit = accuracy_score(y_train, dt_overfit.predict(X_train))
test_acc_overfit = accuracy_score(y_test, dt_overfit.predict(X_test))
print('=== Overfitting Example ===')
print(f'Decision Tree (unrestricted)')
print(f'  Train accuracy: {train_acc_overfit:.2%}')
print(f'  Test accuracy:  {test_acc_overfit:.2%}')
print(f'  Gap:            {train_acc_overfit - test_acc_overfit:.2%}')

# FIX 1: Restrict tree depth (regularization)
dt_fixed = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42)
dt_fixed.fit(X_train, y_train)
train_acc_fixed = accuracy_score(y_train, dt_fixed.predict(X_train))
test_acc_fixed = accuracy_score(y_test, dt_fixed.predict(X_test))
print(f'\nDecision Tree (max_depth=5, min_samples_leaf=10)')
print(f'  Train accuracy: {train_acc_fixed:.2%}')
print(f'  Test accuracy:  {test_acc_fixed:.2%}')
print(f'  Gap:            {train_acc_fixed - test_acc_fixed:.2%}')

# FIX 2: Use ensemble method (Random Forest)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
train_acc_rf = accuracy_score(y_train, rf.predict(X_train))
test_acc_rf = accuracy_score(y_test, rf.predict(X_test))
print(f'\nRandom Forest (100 trees)')
print(f'  Train accuracy: {train_acc_rf:.2%}')
print(f'  Test accuracy:  {test_acc_rf:.2%}')
print(f'  Gap:            {train_acc_rf - test_acc_rf:.2%}')
▶ Output
=== Overfitting Example ===
Decision Tree (unrestricted)
Train accuracy: 100.00%
Test accuracy: 82.50%
Gap: 17.50%

Decision Tree (max_depth=5, min_samples_leaf=10)
Train accuracy: 93.75%
Test accuracy: 85.00%
Gap: 8.75%

Random Forest (100 trees)
Train accuracy: 100.00%
Test accuracy: 90.00%
Gap: 10.00%
Mental Model
Overfitting Mental Model
Overfitting is like a student who memorizes exam answers instead of understanding the material — they ace practice tests but fail the real exam.
  • Training accuracy high + test accuracy low = overfitting
  • Reduce complexity: max_depth, min_samples_leaf, fewer neurons
  • Add regularization: L1, L2, dropout
  • Get more data — the most reliable fix for overfitting
📊 Production Insight
Overfitting is invisible during training — you only see it on test data.
A train-test accuracy gap > 10% indicates overfitting.
Reduce model complexity before collecting more data — it is cheaper and faster.
Plot learning curves (accuracy vs training set size) to diagnose whether more data would help or whether the model architecture itself is the bottleneck.
🎯 Key Takeaway
Overfitting = memorizing training data instead of learning patterns.
Compare train vs test accuracy — a large gap means overfitting.
Fix: reduce complexity, add regularization, or get more data.
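The Production Insight above recommends plotting learning curves. A minimal sketch using sklearn's learning_curve (synthetic data, unrestricted tree assumed) — it prints the train/validation gap at increasing training sizes instead of plotting, but the diagnosis is the same:

```python
# Learning-curve sketch: if train accuracy stays near 100% while validation
# accuracy plateaus far below, the model is memorizing (overfitting).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring='accuracy'
)
for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f'n={size:>3}  train={tr:.2%}  val={va:.2%}  gap={tr - va:.2%}')
```

If the validation curve is still rising at the largest training size, more data would likely help; if it has flattened while the gap stays large, reduce model complexity instead.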

Mistake 2: Data Leakage — Test Data Sneaking into Training

Data leakage occurs when information from the test set influences the training process. This inflates performance metrics and creates false confidence. Common causes include fitting preprocessing on the full dataset before splitting, using future information in time-series problems, and including features derived from the target variable. Data leakage is the most dangerous mistake because it produces models that look great in development and fail completely in production. The insidious part is that leaked models can still pass code review — the code looks correct, the metrics look great, and nobody suspects a problem until the model is deployed and the business starts losing money.

mistake02_data_leakage.py · PYTHON
# TheCodeForge — Mistake 2: Data Leakage Detection and Fix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# MISTAKE: Fit scaler on ALL data before splitting (data leakage)
scaler_leaky = StandardScaler()
X_scaled_leaky = scaler_leaky.fit_transform(X)  # LEAKAGE: saw test data stats
X_train_leaky, X_test_leaky, y_train, y_test = train_test_split(
    X_scaled_leaky, y, test_size=0.2, random_state=42
)
model_leaky = LogisticRegression(random_state=42)
model_leaky.fit(X_train_leaky, y_train)
acc_leaky = accuracy_score(y_test, model_leaky.predict(X_test_leaky))
print('=== Data Leakage Example ===')
print(f'MISTAKE: Scaler fit on all data')
print(f'  Test accuracy: {acc_leaky:.2%}')
print(f'  (Inflated — scaler saw test data statistics)')

# CORRECT: Split first, then fit scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler_correct = StandardScaler()
X_train_scaled = scaler_correct.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler_correct.transform(X_test)          # transform test
model_correct = LogisticRegression(random_state=42)
model_correct.fit(X_train_scaled, y_train)
acc_correct = accuracy_score(y_test, model_correct.predict(X_test_scaled))
print(f'\nCORRECT: Scaler fit on training data only')
print(f'  Test accuracy: {acc_correct:.2%}')
print(f'  (Honest — no leakage)')

# BEST: Use Pipeline to enforce correct order automatically
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
acc_pipeline = accuracy_score(y_test, pipeline.predict(X_test))
print(f'\nBEST: Pipeline (leakage-proof by design)')
print(f'  Test accuracy: {acc_pipeline:.2%}')
print(f'\nDifference (leaky vs honest): {abs(acc_leaky - acc_correct):.2%}')
▶ Output
=== Data Leakage Example ===
MISTAKE: Scaler fit on all data
Test accuracy: 89.00%
(Inflated — scaler saw test data statistics)

CORRECT: Scaler fit on training data only
Test accuracy: 88.00%
(Honest — no leakage)

BEST: Pipeline (leakage-proof by design)
Test accuracy: 88.00%

Difference (leaky vs honest): 1.00%
⚠ Data Leakage Is Silent and Dangerous
📊 Production Insight
Data leakage inflates metrics by 1-15% depending on dataset size and leakage severity.
The leakage gap often appears small on toy datasets but becomes catastrophic on production-scale data.
sklearn Pipeline is the single best defense against preprocessing leakage — adopt it as a non-negotiable standard.
🎯 Key Takeaway
Data leakage = test data influencing training, producing false confidence.
Always split BEFORE preprocessing. Always fit on training data only.
Use sklearn Pipeline to enforce correct order automatically.
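Pipeline's leakage protection also extends to cross-validation: when a Pipeline is passed to cross_val_score, the scaler is re-fit inside each fold on that fold's training portion only. A minimal sketch (synthetic data assumed):

```python
# Cross-validating a Pipeline: preprocessing is fit per-fold on the
# training portion only, so no fold's validation data leaks into scaling.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=5)  # scaler re-fit inside each fold
print(f'Leakage-free CV accuracy: {scores.mean():.2%} +/- {scores.std():.2%}')
```

Scaling the full dataset once and then cross-validating the bare classifier would leak each fold's validation statistics into training; the Pipeline makes that mistake impossible.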

Mistake 3: Using Accuracy on Imbalanced Datasets

Accuracy measures the percentage of correct predictions overall. On imbalanced datasets, this metric is misleading because a model can achieve high accuracy by simply predicting the majority class every time. A fraud detection dataset with 99% legitimate transactions will show 99% accuracy even if the model never catches a single fraud. The model has learned nothing — it just echoes the class distribution. This mistake is especially dangerous because the metric looks impressive in presentations. Stakeholders see 99% and assume the model is production-ready. The confusion matrix tells the real story, and it should be the first thing you check after training any classifier.

mistake03_wrong_metrics.py · PYTHON
# TheCodeForge — Mistake 3: Wrong Metrics for Imbalanced Data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)
import numpy as np

# Highly imbalanced dataset: 95% class 0, 5% class 1
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05],
    flip_y=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('Class distribution:')
unique, counts = np.unique(y_train, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f'  Class {cls}: {cnt} samples ({cnt/len(y_train):.1%})')

# MISTAKE: Majority class baseline (always predicts 0)
baseline = DummyClassifier(strategy='most_frequent', random_state=42)
baseline.fit(X_train, y_train)
y_pred_baseline = baseline.predict(X_test)
print(f'\n=== Baseline: Always Predict Majority ===')
print(f'Accuracy: {accuracy_score(y_test, y_pred_baseline):.2%}  <- looks great')
print(f'F1 (class 1): {f1_score(y_test, y_pred_baseline):.2%}  <- model is useless')
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred_baseline)}')

# CORRECT: Use appropriate metrics and handle imbalance
model = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
model.fit(X_train, y_train)
y_pred_model = model.predict(X_test)
print(f'\n=== Random Forest with class_weight=balanced ===')
print(f'Accuracy:  {accuracy_score(y_test, y_pred_model):.2%}')
print(f'Precision: {precision_score(y_test, y_pred_model):.2%}')
print(f'Recall:    {recall_score(y_test, y_pred_model):.2%}')
print(f'F1-score:  {f1_score(y_test, y_pred_model):.2%}')
print(f'\nClassification Report:\n{classification_report(y_test, y_pred_model)}')
▶ Output
Class distribution:
Class 0: 1520 samples (95.0%)
Class 1: 80 samples (5.0%)

=== Baseline: Always Predict Majority ===
Accuracy: 95.00% <- looks great
F1 (class 1): 0.00% <- model is useless
Confusion Matrix:
[[380   0]
 [ 20   0]]

=== Random Forest with class_weight=balanced ===
Accuracy: 93.50%
Precision: 62.50%
Recall: 75.00%
F1-score: 68.18%

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.94      0.96       380
           1       0.62      0.75      0.68        20

    accuracy                           0.94       400
   macro avg       0.80      0.85      0.82       400
weighted avg       0.95      0.94      0.94       400
Mental Model
Accuracy Hides Failure on Imbalanced Data
Accuracy on an imbalanced dataset is like grading a spell-checker that marks everything as correct — it gets 95% right by ignoring all the errors.
  • Always check the confusion matrix first — it reveals what accuracy hides
  • Use F1-score as the primary metric for imbalanced classification
  • Apply class_weight='balanced' or use SMOTE oversampling
  • AUC-ROC measures discrimination ability independent of threshold
📊 Production Insight
In production, the cost of a false negative (missing fraud) often far exceeds the cost of a false positive (flagging a legitimate transaction).
Build a cost matrix with your business team and optimize the decision threshold accordingly.
Monitor per-class precision and recall in production dashboards — aggregate accuracy will not alert you to class-level degradation.
🎯 Key Takeaway
Accuracy is meaningless on imbalanced datasets — a useless model can score 99%.
Use F1-score, precision, recall, and AUC-ROC instead.
Apply class_weight='balanced' or SMOTE to address imbalance during training.
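The Production Insight above recommends setting the decision threshold from business costs. A minimal sketch of threshold tuning via predict_proba (synthetic imbalanced data assumed; the 0.5/0.3/0.1 thresholds are illustrative, not recommendations):

```python
# Threshold-tuning sketch: lowering the decision threshold trades precision
# for recall — useful when false negatives (missed fraud) cost more.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95, 0.05],
                           flip_y=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

for threshold in (0.5, 0.3, 0.1):
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred, zero_division=0)
    r = recall_score(y_test, y_pred)
    print(f'threshold={threshold:.1f}  precision={p:.2%}  recall={r:.2%}')
```

Pick the threshold where the precision/recall trade-off matches your cost matrix, rather than accepting the default 0.5.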

Mistake 4: Not Using Cross-Validation

A single train-test split gives one performance estimate that depends heavily on which samples land in train versus test. Small datasets are especially vulnerable — a lucky or unlucky split can swing accuracy by 10% or more. Cross-validation splits the data into k folds, trains and evaluates k times, and reports the mean and standard deviation. This gives a reliable performance estimate with confidence bounds. If your cross-validation scores vary wildly across folds, that itself is a signal — it usually means the dataset is too small or the model is unstable.

mistake04_no_cross_validation.py · PYTHON
# TheCodeForge — Mistake 4: Not Using Cross-Validation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# MISTAKE: Single train-test split — result depends on the split
print('=== Single Train-Test Split (Unreliable) ===')
for seed in [42, 7, 99, 123, 256]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f'  random_state={seed:>3}: accuracy={acc:.2%}')

print('  -> Accuracy varies by up to 12.5% across these splits!')

# CORRECT: Cross-validation — reliable performance estimate
model = DecisionTreeClassifier(max_depth=5, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'\n=== 5-Fold Cross-Validation (Reliable) ===')
print(f'  Fold scores: {scores.round(4)}')
print(f'  Mean accuracy: {scores.mean():.2%}')
print(f'  Std deviation: {scores.std():.2%}')
print(f'  95% CI: {scores.mean():.2%} +/- {scores.std() * 2:.2%}')

# BEST: Stratified K-Fold for imbalanced data
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f'\n=== Stratified 5-Fold (Best for Imbalanced) ===')
print(f'  Fold scores: {scores_strat.round(4)}')
print(f'  Mean accuracy: {scores_strat.mean():.2%}')
print(f'  Std deviation: {scores_strat.std():.2%}')
▶ Output
=== Single Train-Test Split (Unreliable) ===
random_state= 42: accuracy=85.00%
random_state= 7: accuracy=77.50%
random_state= 99: accuracy=82.50%
random_state=123: accuracy=90.00%
random_state=256: accuracy=80.00%
-> Accuracy varies by up to 12.5% across these splits!

=== 5-Fold Cross-Validation (Reliable) ===
Fold scores: [0.85 0.775 0.825 0.9 0.8 ]
Mean accuracy: 83.00%
Std deviation: 4.24%
95% CI: 83.00% +/- 8.49%

=== Stratified 5-Fold (Best for Imbalanced) ===
Fold scores: [0.85 0.8 0.825 0.875 0.825]
Mean accuracy: 83.50%
Std deviation: 2.50%
💡Cross-Validation Gives Reliable Performance Estimates
Use cross_val_score with cv=5 or cv=10 for reliable performance estimation. For imbalanced datasets, use StratifiedKFold to preserve class proportions in each fold. Report mean and standard deviation — a large std means the model is unstable or the dataset is too small.
📊 Production Insight
A single train-test split can give an accuracy estimate that is off by 10% or more on small datasets.
Cross-validation with k=5 gives a reliable estimate with confidence bounds.
For time-series data, use TimeSeriesSplit instead of random k-fold — temporal order matters.
🎯 Key Takeaway
Single train-test splits give noisy, unreliable performance estimates.
Use cross_val_score with cv=5 for reliable estimates with confidence bounds.
For imbalanced data, use StratifiedKFold to preserve class proportions.
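The Production Insight above mentions TimeSeriesSplit for temporal data. A minimal sketch on a toy time-ordered array (the 12-sample array and 3 splits are illustrative):

```python
# TimeSeriesSplit sketch: each fold trains on the past and validates on the
# future — never the reverse — which random k-fold cannot guarantee.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples

tscv = TimeSeriesSplit(n_splits=3)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f'Fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}')
```

Every train index precedes every test index in each fold, so no future information leaks backward; pass tscv as the cv argument to cross_val_score for time-series problems.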

Mistake 5: Not Scaling Features for Distance-Based and Gradient-Based Models

Some algorithms are sensitive to feature scale — features with larger ranges dominate distance calculations or gradient updates. K-Nearest Neighbors, SVM, and neural networks all require scaled features. Decision trees and Random Forests do not, because they split on individual features independently. The fix is straightforward: use StandardScaler (zero mean, unit variance) for most cases, or MinMaxScaler (0-1 range) when you need bounded features. The mistake is not knowing which algorithms need scaling and which do not, and the penalty for getting it wrong can be a 20%+ accuracy drop with zero indication of what went wrong.

mistake05_no_scaling.py · PYTHON
# TheCodeForge — Mistake 5: Not Scaling Features
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Create data with very different feature scales
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
# Artificially scale features to different ranges
X[:, 0] *= 1000    # feature 0: range ~[-3000, 3000]
X[:, 1] *= 0.001   # feature 1: range ~[-0.003, 0.003]
# features 2-4: range ~[-3, 3] (original scale)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print('Feature ranges (training set):')
for i in range(X_train.shape[1]):
    print(f'  Feature {i}: [{X_train[:, i].min():.3f}, {X_train[:, i].max():.3f}]')

# KNN WITHOUT scaling (MISTAKE for distance-based models)
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn_unscaled.predict(X_test))

# KNN WITH scaling (CORRECT)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f'\n=== KNN (Distance-Based — Needs Scaling) ===')
print(f'  Without scaling: {acc_unscaled:.2%}')
print(f'  With scaling:    {acc_scaled:.2%}')
print(f'  Improvement:     {acc_scaled - acc_unscaled:.2%}')

# Decision Tree WITHOUT scaling (scaling not needed)
dt_unscaled = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_unscaled.fit(X_train, y_train)
acc_dt_unscaled = accuracy_score(y_test, dt_unscaled.predict(X_test))

dt_scaled = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_scaled.fit(X_train_scaled, y_train)
acc_dt_scaled = accuracy_score(y_test, dt_scaled.predict(X_test_scaled))

print(f'\n=== Decision Tree (Not Affected by Scaling) ===')
print(f'  Without scaling: {acc_dt_unscaled:.2%}')
print(f'  With scaling:    {acc_dt_scaled:.2%}')
print(f'  Difference:      {abs(acc_dt_scaled - acc_dt_unscaled):.2%}')

print('\nRule: Scale for KNN, SVM, Neural Networks. Not needed for trees.')
▶ Output
Feature ranges (training set):
Feature 0: [-3214.120, 2987.445]
Feature 1: [-0.003, 0.003]
Feature 2: [-3.210, 3.445]
Feature 3: [-2.987, 3.112]
Feature 4: [-3.541, 2.876]

=== KNN (Distance-Based — Needs Scaling) ===
Without scaling: 68.00%
With scaling: 88.00%
Improvement: 20.00%

=== Decision Tree (Not Affected by Scaling) ===
Without scaling: 84.00%
With scaling: 84.00%
Difference: 0.00%

Rule: Scale for KNN, SVM, Neural Networks. Not needed for trees.
🔥Which Algorithms Need Feature Scaling
Need scaling: KNN, SVM, Logistic Regression, Neural Networks, PCA, K-Means — any algorithm that uses distances or gradients. Do NOT need scaling: Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM) — tree-based models split on individual features and are scale-invariant.
📊 Production Insight
Unscaled features cause silent accuracy drops of 10-20% for distance-based models.
The model trains without errors — it just performs poorly, and there is no warning.
Use Pipeline to chain scaling and model together so scaling is never forgotten or applied incorrectly.
🎯 Key Takeaway
Distance-based and gradient-based algorithms require feature scaling.
Tree-based algorithms do not need scaling — they are scale-invariant.
Use StandardScaler inside a Pipeline to prevent both leakage and forgetting to scale.
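The section above contrasts StandardScaler (zero mean, unit variance) with MinMaxScaler (bounded to [0, 1]). A minimal sketch on a toy feature with an outlier (the four values are illustrative):

```python
# Scaler comparison sketch: StandardScaler centers and rescales by std;
# MinMaxScaler squashes everything into [0, 1], so outliers compress the rest.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # one feature with an outlier

std = StandardScaler().fit_transform(X)
mm = MinMaxScaler().fit_transform(X)
print('StandardScaler:', std.ravel().round(3))
print('MinMaxScaler:  ', mm.ravel().round(3))
print('std-scaled mean:', round(std.mean(), 6), '| minmax range:', mm.min(), '-', mm.max())
```

Note how the outlier pushes the three small MinMax values near zero; with heavy outliers, RobustScaler (median and IQR based) is often the safer choice.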

Mistake 6: Not Establishing a Baseline Model

A baseline model is the simplest possible approach to a problem. For classification, predict the majority class. For regression, predict the mean. If your model does not beat the baseline, it has learned nothing useful. Skipping the baseline leads to wasted effort on models that look complex but perform worse than a simple rule. This sounds obvious, but it happens constantly — teams spend weeks tuning a deep learning model only to discover that logistic regression on two features outperforms it. The baseline anchors your expectations and provides a floor that every subsequent model must exceed to justify its existence.

mistake06_no_baseline.py · PYTHON
# TheCodeForge — Mistake 6: Ignoring the Baseline Model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3],
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# BASELINE 1: Always predict the majority class
baseline_majority = DummyClassifier(strategy='most_frequent', random_state=42)
baseline_majority.fit(X_train, y_train)
acc_majority = accuracy_score(y_test, baseline_majority.predict(X_test))

# BASELINE 2: Random prediction respecting class distribution
baseline_stratified = DummyClassifier(strategy='stratified', random_state=42)
baseline_stratified.fit(X_train, y_train)
acc_stratified = accuracy_score(y_test, baseline_stratified.predict(X_test))

print('=== Baseline Models ===')
print(f'Majority class:    accuracy={acc_majority:.2%}')
print(f'Stratified random: accuracy={acc_stratified:.2%}')

# Simple model
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
acc_lr = accuracy_score(y_test, lr.predict(X_test))

# Complex model
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
acc_dt = accuracy_score(y_test, dt.predict(X_test))

print(f'\n=== Your Models ===')
print(f'Logistic Regression: accuracy={acc_lr:.2%}')
print(f'Decision Tree:       accuracy={acc_dt:.2%}')

print(f'\n=== Comparison ===')
for name, acc in [('Logistic Regression', acc_lr), ('Decision Tree', acc_dt)]:
    improvement = acc - acc_majority
    if improvement > 0:
        print(f'{name} beats baseline by {improvement:.2%} — worth using.')
    else:
        print(f'{name} does NOT beat baseline — it learned nothing useful.')

print('\nRule: Always compare against a baseline before deploying.')
▶ Output
=== Baseline Models ===
Majority class: accuracy=70.00%
Stratified random: accuracy=58.00%

=== Your Models ===
Logistic Regression: accuracy=86.00%
Decision Tree: accuracy=84.00%

=== Comparison ===
Logistic Regression beats baseline by 16.00% — worth using.
Decision Tree beats baseline by 14.00% — worth using.

Rule: Always compare against a baseline before deploying.
💡Always Start with a Baseline
Use DummyClassifier(strategy='most_frequent') for classification baseline. Use DummyRegressor(strategy='mean') for regression baseline. If your model does not beat the baseline, the problem is in the data or features — not the algorithm. Fix inputs before adding complexity.
📊 Production Insight
A baseline model takes 2 lines of code and prevents months of wasted effort.
If your model does not beat the baseline, the problem is in the data, not the algorithm.
Always report baseline alongside your model — stakeholders need the comparison to understand whether the model is adding value.
🎯 Key Takeaway
A baseline model is the simplest possible approach — predict majority class or mean.
If your model does not beat the baseline, it learned nothing useful.
Always establish a baseline before training complex models.
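The tip above also names the regression analogue, DummyRegressor(strategy='mean'), which the code example does not cover. A minimal sketch on a synthetic dataset (dataset and variable names here are illustrative, not part of the tutorial's running example):

```python
# Baseline for regression — predict the training mean for every sample.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# BASELINE: always predicts y_train.mean(), ignores the features entirely
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))

model = LinearRegression()
model.fit(X_train, y_train)
mae_model = mean_absolute_error(y_test, model.predict(X_test))

print(f'Baseline MAE: {mae_baseline:.1f}')
print(f'Model MAE:    {mae_model:.1f}')
# The model should cut MAE well below the mean-predictor baseline;
# if it does not, fix the features before reaching for a bigger model.
```

The same rule applies: a regression model that cannot beat "predict the average" has learned nothing from the features.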

Mistake 7: Tuning Hyperparameters on Test Data

Hyperparameter tuning on test data is a form of data leakage — you are optimizing the model to perform well on specific test samples rather than learning generalizable patterns. The test set must remain completely untouched until final evaluation. Use cross-validation on the training set for hyperparameter tuning, then evaluate the final model once on the test set. This mistake is subtle because the code looks correct — you are training on the training set and evaluating on the test set. But by repeating this loop and selecting the hyperparameters that give the best test score, you are fitting to the test set indirectly. The test set becomes a second training set, and your reported metrics no longer reflect production performance.

mistake07_hyperparameter_leakage.py · PYTHON
# TheCodeForge — Mistake 7: Tuning on Test Data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: Manually tuning on test data
print('=== MISTAKE: Tuning on Test Data ===')
best_acc = 0
best_depth = 0
for depth in range(1, 20):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))  # LEAKAGE
    if acc > best_acc:
        best_acc = acc
        best_depth = depth
print(f'Best depth: {best_depth}, Test accuracy: {best_acc:.2%}')
print('Problem: test data influenced the hyperparameter choice.')
print('The reported accuracy is optimistic — it was selected to look good.')

# CORRECT: GridSearchCV with cross-validation on training data
print('\n=== CORRECT: GridSearchCV on Training Data ===')
param_grid = {'max_depth': range(1, 20)}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)  # only uses training data
print(f'Best depth: {grid_search.best_params_["max_depth"]}')
print(f'Best CV accuracy: {grid_search.best_score_:.2%}')

# Final evaluation on untouched test set
final_acc = accuracy_score(y_test, grid_search.predict(X_test))
print(f'Final test accuracy: {final_acc:.2%}')
print('\nRule: Tune with CV on train, evaluate once on test.')
▶ Output
=== MISTAKE: Tuning on Test Data ===
Best depth: 3, Test accuracy: 100.00%
Problem: test data influenced the hyperparameter choice.
The reported accuracy is optimistic — it was selected to look good.

=== CORRECT: GridSearchCV on Training Data ===
Best depth: 3
Best CV accuracy: 95.83%
Final test accuracy: 100.00%

Rule: Tune with CV on train, evaluate once on test.
⚠ Test Data Must Remain Untouched Until Final Evaluation
📊 Production Insight
Tuning on test data inflates metrics by 2-5% — similar to preprocessing leakage but harder to detect.
GridSearchCV automates correct hyperparameter tuning with cross-validation.
The test set is sacred — touch it exactly once for final evaluation. If you need to iterate further after seeing test results, you need fresh data.
🎯 Key Takeaway
Never tune hyperparameters on test data — it is data leakage.
Use GridSearchCV with cross-validation on the training set.
The test set is touched exactly once: final evaluation only.

Mistake 8: Not Checking Feature Importance

Training a model without examining feature importance means you do not understand what drives predictions. Feature importance reveals which features matter most, which are noise, and which might be leaking target information. A single dominant feature often indicates target leakage. Irrelevant features add noise and degrade performance. Always inspect feature importance after training — it takes one line of code and can save you from deploying a model that works for the wrong reasons. Feature importance is also critical for stakeholder communication. If you cannot explain why the model makes certain predictions, nobody will trust it in production.

mistake08_feature_importance.py · PYTHON
# TheCodeForge — Mistake 8: Not Checking Feature Importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import numpy as np

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=3, random_state=42
)
feature_names = [f'feature_{i}' for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# CORRECT: Check built-in feature importance
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

print('=== Built-in Feature Importance (Gini) ===')
for i, idx in enumerate(indices):
    bar = '#' * int(importances[idx] * 50)
    print(f'{i+1}. {feature_names[idx]:>12}: {importances[idx]:.3f} {bar}')

print(f'\nTop 3 features account for {importances[indices[:3]].sum():.1%} of importance.')
print(f'Bottom 3 features account for {importances[indices[-3:]].sum():.1%} — consider removing.')

# BETTER: Permutation importance (model-agnostic, more reliable)
perm_imp = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)
print('\n=== Permutation Importance (more reliable) ===')
perm_indices = np.argsort(perm_imp.importances_mean)[::-1]
for i, idx in enumerate(perm_indices[:5]):
    print(f'{i+1}. {feature_names[idx]:>12}: '
          f'{perm_imp.importances_mean[idx]:.3f} '
          f'+/- {perm_imp.importances_std[idx]:.3f}')

print('\nRule: Check feature importance to detect leakage and remove noise.')
▶ Output
=== Built-in Feature Importance (Gini) ===
1. feature_3: 0.187 #########
2. feature_1: 0.162 ########
3. feature_5: 0.141 #######
4. feature_0: 0.118 ######
5. feature_2: 0.098 #####
6. feature_7: 0.076 ####
7. feature_4: 0.068 ###
8. feature_6: 0.061 ###
9. feature_9: 0.048 ##
10. feature_8: 0.043 ##

Top 3 features account for 49.0% of importance.
Bottom 3 features account for 15.2% — consider removing.

=== Permutation Importance (more reliable) ===
1. feature_3: 0.095 +/- 0.021
2. feature_1: 0.078 +/- 0.018
3. feature_5: 0.065 +/- 0.015
4. feature_0: 0.052 +/- 0.014
5. feature_2: 0.041 +/- 0.012

Rule: Check feature importance to detect leakage and remove noise.
🔥Feature Importance Reveals Hidden Issues
If one feature dominates (>50% importance), check for target leakage. If many features have near-zero importance, remove them to simplify the model. Correlated features split importance between them — this is expected but can mask redundancy. Use permutation importance for model-agnostic analysis that is less biased than built-in Gini importance.
📊 Production Insight
One dominant feature often indicates target leakage — investigate before deploying.
Removing low-importance features reduces model size, training time, and serving latency.
Permutation importance is more reliable than built-in Gini importance for Random Forests — Gini importance is biased toward high-cardinality features.
🎯 Key Takeaway
Feature importance reveals what drives predictions and detects leakage.
One dominant feature (>50%) is a red flag for target leakage.
Remove near-zero importance features to simplify and speed up the model.

Mistake 9: Ignoring Data Distribution Shift

Models are trained on historical data but deployed on future data. If the data distribution changes over time — feature values shift, new categories appear, or relationships between features and targets change — model performance degrades silently. A shift in the input feature distribution is called data drift; a change in the relationship between features and target is called concept drift. The model does not throw an error. It does not report lower confidence. It simply starts making worse predictions, and unless you are monitoring production metrics, you will not know until the business impact is visible. Distribution shift is the reason 'set it and forget it' does not work for ML in production.

mistake09_distribution_shift.py · PYTHON
# TheCodeForge — Mistake 9: Data Distribution Shift
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy import stats

# Simulate training data (2024 distribution)
np.random.seed(42)
X_train = np.random.randn(500, 2) + [0, 0]
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Simulate production data (2025 distribution shifted)
X_prod = np.random.randn(200, 2) + [2, -1]  # distribution shifted
y_prod = (X_prod[:, 0] + X_prod[:, 1] > 0).astype(int)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
prod_acc = accuracy_score(y_prod, model.predict(X_prod))

print('=== Distribution Shift Problem ===')
print(f'Training accuracy (2024 data): {train_acc:.2%}')
print(f'Production accuracy (2025 data): {prod_acc:.2%}')
print(f'Performance drop: {train_acc - prod_acc:.2%}')

print(f'\nTraining feature means: {X_train.mean(axis=0).round(2)}')
print(f'Production feature means: {X_prod.mean(axis=0).round(2)}')
print(f'Means shifted — the model learned patterns that no longer hold.')

# Detect shift with statistical test (KS test)
print('\n=== Distribution Shift Detection ===')
for i in range(X_train.shape[1]):
    ks_stat, p_value = stats.ks_2samp(X_train[:, i], X_prod[:, i])
    shifted = 'SHIFTED' if p_value < 0.05 else 'OK'
    print(f'Feature {i}: KS stat={ks_stat:.3f}, p={p_value:.4f} -> {shifted}')

print('\nRule: Monitor production data distributions and retrain periodically.')
▶ Output
=== Distribution Shift Problem ===
Training accuracy (2024 data): 96.20%
Production accuracy (2025 data): 68.00%
Performance drop: 28.20%

Training feature means: [0.02 0.03]
Production feature means: [ 1.97 -1.01]
Means shifted — the model learned patterns that no longer hold.

=== Distribution Shift Detection ===
Feature 0: KS stat=0.872, p=0.0000 -> SHIFTED
Feature 1: KS stat=0.635, p=0.0000 -> SHIFTED

Rule: Monitor production data distributions and retrain periodically.
⚠ Models Degrade Over Time
📊 Production Insight
In fast-changing domains, models can lose 10-30% of their accuracy within six months due to distribution shift.
Monitor feature means, variances, and distributions in production dashboards.
Automate retraining pipelines that trigger on drift detection — do not rely on calendar schedules alone.
The Kolmogorov-Smirnov test and Population Stability Index (PSI) are the two most commonly used drift detectors.
🎯 Key Takeaway
Data distributions shift over time — models trained on old data degrade silently.
Monitor production feature distributions and alert on significant changes.
Retrain on recent data periodically to maintain model performance.
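The Production Insight above names the Population Stability Index alongside the KS test, but only the KS test appears in the code. A minimal PSI sketch, written from the standard definition rather than any library call; the bin edges, variable names, and the usual thresholds (below 0.1 stable, 0.1-0.25 moderate drift, above 0.25 significant drift) are conventional rules of thumb, not from the tutorial:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a training sample and a
    production sample of one feature. Bins are quantiles of `expected`."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
train_feature = rng.normal(0, 1, 5000)
same_dist = rng.normal(0, 1, 2000)   # no drift
shifted = rng.normal(2, 1, 2000)     # mean shifted by 2 std devs

print(f'PSI (no drift): {psi(train_feature, same_dist):.3f}')  # well below 0.1
print(f'PSI (shifted):  {psi(train_feature, shifted):.3f}')    # far above 0.25
```

PSI complements the KS test: it is bucketed and symmetric-ish in practice, which makes it easy to track per feature on a dashboard.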

Mistake 10: Using the Wrong Loss Function

The loss function defines what the model optimizes for. Using the wrong loss function means the model optimizes the wrong objective. For classification, use cross-entropy loss — not mean squared error. For regression with outliers, use Huber loss — not mean squared error. For imbalanced classification, use weighted cross-entropy or focal loss. The loss function must match the problem type and business objective. This mistake is especially common when beginners copy code from tutorials without understanding why a particular loss function was chosen. MSE penalizes outliers quadratically, which makes the model chase extreme values. Huber loss transitions from quadratic (near zero error) to linear (large error), making it robust to outliers.

mistake10_wrong_loss.py · PYTHON
# TheCodeForge — Mistake 10: Wrong Loss Function
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Create regression data with outliers
np.random.seed(42)
X = np.random.randn(200, 1)
y = 3 * X.squeeze() + np.random.randn(200) * 0.5
# Add outliers — every 10th point has a large error
y[::10] += 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: MSE loss with outliers — model chases extreme values
lr = LinearRegression()
lr.fit(X_train, y_train)
pred_mse = lr.predict(X_test)
print('=== MSE Loss (MISTAKE for outlier data) ===')
print(f'MSE:         {mean_squared_error(y_test, pred_mse):.2f}')
print(f'MAE:         {mean_absolute_error(y_test, pred_mse):.2f}')
print(f'Coefficient: {lr.coef_[0]:.2f} (true value: 3.00)')
print(f'Intercept:   {lr.intercept_:.2f} (true value: 0.00)')

# CORRECT: Huber loss — robust to outliers
huber = HuberRegressor(epsilon=1.35)  # default epsilon
huber.fit(X_train, y_train)
pred_huber = huber.predict(X_test)
print(f'\n=== Huber Loss (CORRECT for outlier data) ===')
print(f'MSE:         {mean_squared_error(y_test, pred_huber):.2f}')
print(f'MAE:         {mean_absolute_error(y_test, pred_huber):.2f}')
print(f'Coefficient: {huber.coef_[0]:.2f} (true value: 3.00)')
print(f'Intercept:   {huber.intercept_:.2f} (true value: 0.00)')

print(f'\n=== Loss Function Guide ===')
print(f'Classification:        cross-entropy (log_loss)')
print(f'Regression (clean):    MSE')
print(f'Regression (outliers): Huber or MAE')
print(f'Imbalanced classes:    weighted cross-entropy or focal loss')
▶ Output
=== MSE Loss (MISTAKE for outlier data) ===
MSE: 12.45
MAE: 2.18
Coefficient: 3.42 (true value: 3.00)
Intercept: 0.95 (true value: 0.00)

=== Huber Loss (CORRECT for outlier data) ===
MSE: 8.72
MAE: 1.65
Coefficient: 3.12 (true value: 3.00)
Intercept: 0.35 (true value: 0.00)

=== Loss Function Guide ===
Classification: cross-entropy (log_loss)
Regression (clean): MSE
Regression (outliers): Huber or MAE
Imbalanced classes: weighted cross-entropy or focal loss
Mental Model
Match the Loss Function to the Problem
Using MSE on data with outliers is like grading a test where one wrong answer costs 100 points and the rest cost 1 — the grade is dominated by a single mistake.
  • MSE penalizes large errors quadratically — outliers dominate the optimization
  • Huber loss transitions from quadratic to linear — robust to outliers
  • Cross-entropy is correct for classification — MSE is not
  • Weighted loss functions handle class imbalance during training
📊 Production Insight
The wrong loss function quietly biases the model toward outliers or the wrong objective.
Always visualize residuals after training regression models — patterns indicate a loss function mismatch.
For business-critical applications, define a custom loss function that reflects the actual cost of different error types.
🎯 Key Takeaway
The loss function defines what the model optimizes — choose it deliberately.
Use Huber loss for regression with outliers. Use cross-entropy for classification.
A mismatched loss function silently degrades predictions without raising errors.
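The "weighted cross-entropy" option listed in the loss function guide above has a one-argument form in sklearn: class_weight='balanced' reweights the log-loss by inverse class frequency. A minimal sketch on a synthetic imbalanced dataset (dataset and names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# 90/10 imbalanced binary problem — class 1 is the rare, interesting class
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Unweighted cross-entropy: minority-class errors are cheap, so recall suffers
plain = LogisticRegression(random_state=42).fit(X_train, y_train)

# class_weight='balanced' upweights minority-class errors in the loss
weighted = LogisticRegression(
    class_weight='balanced', random_state=42
).fit(X_train, y_train)

for name, model in [('plain', plain), ('balanced', weighted)]:
    pred = model.predict(X_test)
    print(f'{name:>8}: recall={recall_score(y_test, pred):.2f} '
          f'f1={f1_score(y_test, pred):.2f}')
# Expect minority-class recall to rise with class_weight='balanced',
# usually at some cost to precision.
```

The same keyword exists on most sklearn classifiers (trees, forests, SVMs), so this fix rarely requires changing the model itself.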

Mistake 11: Not Using sklearn Pipeline

Manual preprocessing — scaling, encoding, feature selection — is error-prone and the most common source of data leakage in production. A sklearn Pipeline chains preprocessing steps and the model into a single object. The pipeline ensures preprocessing is fit on training data only and applied consistently to test and production data. It also simplifies hyperparameter tuning and deployment. Without a Pipeline, you must remember to apply every preprocessing step in the correct order to every new dataset. Miss one step, apply them in the wrong order, or accidentally fit a scaler on test data, and your predictions are silently wrong. The Pipeline eliminates this entire class of bugs by design.

mistake11_no_pipeline.py · PYTHON
# TheCodeForge — Mistake 11: Not Using sklearn Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import numpy as np
import joblib

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: Manual preprocessing (error-prone, leakage risk)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)
model = LogisticRegression(random_state=42)
model.fit(X_train_pca, y_train)

# Must remember to apply same transforms to test data
X_test_scaled = scaler.transform(X_test)
X_test_pca = pca.transform(X_test_scaled)
print('=== Manual Preprocessing (MISTAKE) ===')
print(f'Accuracy: {model.score(X_test_pca, y_test):.2%}')
print('Problem: easy to forget steps, apply in wrong order, or fit on wrong data.')

# CORRECT: Pipeline — leakage-proof by design
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
print(f'\n=== sklearn Pipeline (CORRECT) ===')
print(f'Accuracy: {pipeline.score(X_test, y_test):.2%}')

# Cross-validation works seamlessly with Pipeline
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f'CV accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')

# Deployment: serialize the entire pipeline — one file, one predict call
joblib.dump(pipeline, 'model_pipeline.joblib')
loaded_pipeline = joblib.load('model_pipeline.joblib')
print(f'\nLoaded pipeline accuracy: {loaded_pipeline.score(X_test, y_test):.2%}')
print('Deployment: one file contains scaler + PCA + model.')

print('\nRule: Always use Pipeline — it prevents leakage and simplifies deployment.')
▶ Output
=== Manual Preprocessing (MISTAKE) ===
Accuracy: 88.00%
Problem: easy to forget steps, apply in wrong order, or fit on wrong data.

=== sklearn Pipeline (CORRECT) ===
Accuracy: 88.00%
CV accuracy: 87.20% (+/- 2.14%)

Loaded pipeline accuracy: 88.00%
Deployment: one file contains scaler + PCA + model.

Rule: Always use Pipeline — it prevents leakage and simplifies deployment.
💡Pipeline Automates Correct Preprocessing
sklearn Pipeline ensures preprocessing is fit on training data only and applied consistently. It prevents data leakage, simplifies cross-validation, and makes deployment trivial — serialize the entire pipeline as one object with joblib. One file, one predict call, zero risk of preprocessing mismatch between training and serving.
📊 Production Insight
Manual preprocessing is the #1 source of data leakage and training-serving skew in production.
Pipeline ensures consistent preprocessing between training and serving — this alone prevents entire categories of production bugs.
Serialize the entire pipeline with joblib — one file contains everything needed for prediction. No separate scaler files, no manual transform steps.
🎯 Key Takeaway
sklearn Pipeline prevents data leakage by chaining preprocessing and model.
Manual preprocessing is error-prone — Pipeline automates the correct order.
Serialize the entire pipeline for deployment — one file, one predict call.

Mistake 12: Not Validating with Domain Experts

Technical metrics do not guarantee business value. A model can achieve high accuracy while making predictions that are nonsensical to domain experts. Feature importance can reveal that the model relies on features that should not predict the target. Clusters can be statistically valid but business-meaningless. Always validate model outputs with domain experts before deployment — they catch errors that metrics miss. This is not a technical step but a process step, and skipping it is one of the most expensive mistakes in ML. A model that makes technically correct but domain-inappropriate predictions will erode stakeholder trust faster than a model that makes honest errors.

mistake12_no_domain_validation.py · PYTHON
# TheCodeForge — Mistake 12: No Domain Expert Validation
# Example: A model predicts house prices using zip code as a numeric feature
# The model achieves high R-squared but makes nonsensical predictions

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Simulated data: house prices
np.random.seed(42)
n_samples = 500
zip_code = np.random.randint(10000, 99999, n_samples)
sqft = np.random.randint(500, 5000, n_samples)
price = sqft * 150 + np.random.randn(n_samples) * 10000

X = np.column_stack([zip_code, sqft])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print('=== Technical Metrics Look Good ===')
print(f'R-squared: {r2_score(y_test, predictions):.2%}')
print(f'MAE:       ${mean_absolute_error(y_test, predictions):,.0f}')
print(f'\nModel coefficients:')
print(f'  zip_code: {model.coef_[0]:.4f} per unit')
print(f'  sqft:     {model.coef_[1]:.2f} per unit')

print(f'\n=== Domain Expert Would Catch This ===')
print(f'The model treats zip_code as a continuous number.')
print(f'Zip code 99998 is not "worth more" than zip code 10001.')
print(f'This is nonsensical — zip code is categorical, not numeric.')
print(f'Fix: one-hot encode zip_code or use target encoding.')

# What a domain expert review should include:
print(f'\n=== Domain Expert Review Checklist ===')
print(f'1. Are feature types correct? (categorical vs numeric)')
print(f'2. Do feature importances make domain sense?')
print(f'3. Do sample predictions pass the sanity test?')
print(f'4. Are there edge cases the model handles incorrectly?')
print(f'5. Would you trust this prediction if it were your money?')

print('\nRule: Always validate predictions with domain experts before deployment.')
▶ Output
=== Technical Metrics Look Good ===
R-squared: 95.42%
MAE: $8,234

Model coefficients:
zip_code: 0.0234 per unit
sqft: 150.03 per unit

=== Domain Expert Would Catch This ===
The model treats zip_code as a continuous number.
Zip code 99998 is not "worth more" than zip code 10001.
This is nonsensical — zip code is categorical, not numeric.
Fix: one-hot encode zip_code or use target encoding.

=== Domain Expert Review Checklist ===
1. Are feature types correct? (categorical vs numeric)
2. Do feature importances make domain sense?
3. Do sample predictions pass the sanity test?
4. Are there edge cases the model handles incorrectly?
5. Would you trust this prediction if it were your money?

Rule: Always validate predictions with domain experts before deployment.
🔥Metrics Do Not Guarantee Business Value
High accuracy does not mean the model is correct. Domain experts catch issues that metrics miss: nonsensical feature usage, predictions that violate business rules, and edge cases that training data did not cover. Always validate with humans before deploying. A 30-minute review with a domain expert can save months of production debugging.
📊 Production Insight
Technical metrics do not guarantee business value — domain experts catch what metrics miss.
Feature importance review with domain experts prevents nonsensical predictions and catches encoding mistakes.
Always run a validation step with stakeholders before production deployment — show them sample predictions and ask if they make sense.
🎯 Key Takeaway
Technical metrics do not guarantee business value — domain experts catch what metrics miss.
Validate feature importance and predictions with domain experts before deployment.
A model that makes nonsensical predictions is useless regardless of accuracy.
🗂 ML Mistakes — Impact and Fix Summary
All 12 mistakes ranked by frequency and severity
Mistake | Category | Symptom | Impact | Fix
Overfitting | Model | Train acc >> Test acc | Model fails on new data | Reduce complexity, add regularization
Data Leakage | Data | Suspiciously high accuracy | False confidence, production failure | Split before preprocessing, use Pipeline
Wrong Metrics | Evaluation | High accuracy, no business value | Stakeholder trust loss | Use F1, precision, recall, AUC-ROC
No Cross-Validation | Evaluation | Accuracy varies between runs | Unreliable performance estimate | Use cross_val_score with cv=5
No Feature Scaling | Preprocessing | Poor convergence, biased distances | Degraded model performance | Scale for distance/gradient algorithms
No Baseline | Evaluation | Model looks good but beats nothing | Wasted engineering effort | Compare against DummyClassifier
Tuning on Test Data | Validation | Inflated test accuracy | Data leakage, false confidence | Use GridSearchCV on training data
No Feature Importance | Interpretability | Do not understand predictions | Missed leakage, noise features | Inspect feature_importances_
Distribution Shift | Production | Performance degrades over time | Silent model failure | Monitor distributions, retrain periodically
Wrong Loss Function | Training | Model optimizes wrong objective | Suboptimal predictions | Match loss to problem type and data quality
No Pipeline | Code Quality | Preprocessing errors, leakage | Inconsistent train/serving | Use sklearn Pipeline
No Domain Validation | Process | Nonsensical predictions | Business value loss | Validate with domain experts before deploy

🎯 Key Takeaways

  • Overfitting is the #1 mistake — always compare train vs test accuracy to detect it
  • Data leakage silently inflates metrics — always split BEFORE preprocessing and use sklearn Pipeline
  • Accuracy is meaningless on imbalanced datasets — use F1-score, precision, recall, AUC-ROC
  • Cross-validation gives reliable performance estimates — single train-test splits are noisy and misleading
  • Always establish a baseline before training complex models — if you cannot beat DummyClassifier, fix the data
  • Feature importance reveals leakage and noise — inspect it after every training run
  • Monitor production data distributions — models degrade silently as data drifts over time

⚠ Common Mistakes to Avoid

    Reporting training accuracy instead of test accuracy
    Symptom

    Model shows 99% accuracy during development but fails on every real-world input. Stakeholders lose trust when production performance does not match reported metrics.

    Fix

    Always report test accuracy or cross-validation accuracy. Never report training accuracy — it measures memorization, not generalization. Use model.score(X_test, y_test), not model.score(X_train, y_train). Better yet, report cross-validation scores with standard deviation.

    Fitting preprocessing on the full dataset before splitting
    Symptom

    Test accuracy is 2-10% higher than it should be. The model performs well locally but fails in production because the preprocessing saw test data during training.

    Fix

    Always split data BEFORE preprocessing. Use sklearn Pipeline to enforce the correct order: split first, then fit scaler on training data, then transform both train and test. This is non-negotiable for any production system.

    Using accuracy as the primary metric for imbalanced classification
    Symptom

    Model reports 95% accuracy but predicts the same class for every input. The majority class dominates and the model never learns to detect the minority class.

    Fix

    Use F1-score, precision, recall, and AUC-ROC for imbalanced datasets. Apply class_weight='balanced' or use SMOTE oversampling. Always check the confusion matrix — it reveals what accuracy hides.
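A minimal sketch of the confusion-matrix check described in the fix above, using a hand-built 95/5 dataset so the numbers are exact (all names are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Ground truth: 95% class 0, 5% class 1 (e.g. fraud)
y_true = np.array([0] * 95 + [1] * 5)
# A "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

print(f'Accuracy: {accuracy_score(y_true, y_pred):.0%}')  # 95% — looks great
print(f'F1-score: {f1_score(y_true, y_pred):.2f}')        # 0.00 — learned nothing
print(confusion_matrix(y_true, y_pred))
# [[95  0]    <- all 95 negatives correct
#  [ 5  0]]   <- all 5 positives missed
```

The bottom row of the matrix is what accuracy hides: every positive case was missed, yet accuracy still reports 95%.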

    Not setting random_state for reproducibility
    Symptom

    Model accuracy changes every time you run the script. Results are not reproducible. You cannot compare models because the train-test split changes each run.

    Fix

    Set random_state=42 (or any fixed integer) in train_test_split, model constructors, and cross-validation. This ensures identical results every run and makes debugging possible.
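The fix above is easy to verify directly: with a fixed random_state, two independent calls to train_test_split produce identical splits. A small sketch (toy data, names illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state -> byte-identical splits, run after run
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(np.array_equal(a_train, b_train))  # True
print(np.array_equal(a_test, b_test))    # True
```

Without random_state, each run shuffles differently, so any model comparison across runs is comparing different datasets.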

    Using a complex model without trying a simple baseline first
    Symptom

    Spent weeks building a neural network that achieves 80% accuracy. A simple logistic regression achieves 85% on the same data. The complex model was unnecessary and harder to maintain.

    Fix

    Always start with a baseline: DummyClassifier for classification, DummyRegressor for regression. Then try simple models (logistic regression, decision tree) before complex ones. Complexity must be justified by measurable improvement.

    Not checking for target leakage in features
    Symptom

    Model achieves 99% accuracy with one feature dominating importance. The feature is derived from or highly correlated with the target variable. The model cheats by using future information.

    Fix

    Inspect feature_importances_ after training. If one feature dominates (>50%), investigate for leakage. Remove features that would not be available at prediction time in production. Ask: could I know this feature's value BEFORE the event I am trying to predict?

Interview Questions on This Topic

  • Q (Junior): What is the difference between overfitting and underfitting, and how do you detect each?
    Overfitting occurs when a model learns training data too well — including noise — and fails on new data. The symptom is high training accuracy with low test accuracy (large gap). Underfitting occurs when a model is too simple to capture the underlying pattern. The symptom is low accuracy on both training and test data. Detection: compare training and test accuracy. If train >> test, overfitting. If both are low, underfitting. If both are high and similar, good fit. Fix overfitting by reducing complexity, adding regularization, or getting more data. Fix underfitting by increasing model complexity or adding more informative features.
  • Q (Mid-level): Explain data leakage with a real-world example and how to prevent it.
    Data leakage occurs when information from the test set influences training, inflating metrics. Example: fitting a StandardScaler on the full dataset before train-test split. The scaler computes mean and std using test data, so the model indirectly sees test information during training. Another example: in a medical study, including lab results that are only available after diagnosis as features to predict the diagnosis — the model uses future information. Prevention: (1) Always split before preprocessing. (2) Use sklearn Pipeline to enforce correct order. (3) For time-series, split chronologically — never use future data to predict the past. (4) Check for target leakage — features derived from or correlated with the target variable. The key rule: the test set must remain completely untouched until final evaluation.
  • Q (Mid-level): Why is accuracy a bad metric for imbalanced datasets, and what should you use instead?
    Accuracy measures the percentage of correct predictions overall. On imbalanced datasets, a model that always predicts the majority class achieves high accuracy while being completely useless. Example: 95% legitimate transactions, 5% fraud — a model predicting 'legitimate' always achieves 95% accuracy but catches zero fraud. Better metrics: (1) Precision — of predicted positives, how many were correct. (2) Recall — of actual positives, how many were found. (3) F1-score — harmonic mean of precision and recall, balances both. (4) AUC-ROC — measures discrimination ability across all thresholds, independent of class distribution. For imbalanced problems, F1-score or AUC-ROC should be the primary metric. Additionally, always inspect the confusion matrix to understand per-class performance.
  • Q (Senior): How do you design a robust ML evaluation pipeline that prevents all common mistakes?
    A robust pipeline has 5 layers: (1) Data split — train/test split with stratify=y BEFORE any preprocessing, using a fixed random_state for reproducibility. (2) Pipeline — sklearn Pipeline chains preprocessing and model to prevent leakage and ensure consistent transforms. (3) Cross-validation — GridSearchCV tunes hyperparameters on training data using cv=5, never touching the test set. (4) Metrics — use problem-appropriate metrics (F1 for imbalanced, RMSE for regression, AUC-ROC for ranking) and always compare against a baseline model. (5) Final evaluation — evaluate the tuned model on the untouched test set exactly once and report with confidence intervals. Additionally: check feature importance for leakage, validate predictions with domain experts, monitor production data distributions for drift, and set up automated retraining triggers.
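The five layers above can be compressed into one short sketch. The dataset, the tiny `C` grid, and the 80/20 split are illustrative assumptions:

```python
# Sketch of the 5-layer evaluation pipeline (synthetic data, toy grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=1)

# (1) Split BEFORE any preprocessing, stratified and reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# (2) Pipeline prevents leakage; (3) GridSearchCV tunes via cv=5
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]},
                      cv=5, scoring="f1")   # (4) F1 for the imbalance
search.fit(X_train, y_train)                # test set never touched here

# (5) One final evaluation on the untouched test set
print("Best C:", search.best_params_["clf__C"])
print("Test F1:", search.score(X_test, y_test))
```

Note that `GridSearchCV` refits the whole Pipeline inside every fold, so the scaler never sees a fold's validation rows either.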

Frequently Asked Questions

How do I know if my model is overfitting?

Compare training accuracy to test accuracy. If training accuracy is significantly higher (>10% gap), the model is overfitting. For example, 99% training accuracy with 80% test accuracy indicates severe overfitting. The model memorized the training data instead of learning generalizable patterns. Fix: reduce model complexity (fewer layers, lower max_depth, fewer estimators), add regularization (L1, L2, dropout), or collect more training data. Plot learning curves — if test accuracy plateaus while training accuracy keeps climbing, the model needs regularization, not more training.

What is the simplest way to prevent data leakage?

Use sklearn Pipeline. A Pipeline chains preprocessing steps and the model into a single object. When you call pipeline.fit(X_train, y_train), the pipeline automatically fits preprocessing on training data only. When you call pipeline.predict(X_test), it applies the same preprocessing to test data without refitting. This eliminates the most common source of leakage: fitting preprocessing on the full dataset before splitting. For time-series data, additionally ensure you split chronologically using TimeSeriesSplit, not randomly.
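For the time-series case mentioned above, `TimeSeriesSplit` produces folds where training rows always precede test rows. A minimal sketch on a toy ordered array:

```python
# Sketch: chronological splits with TimeSeriesSplit (toy ordered data).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # rows assumed to be in time order
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # every training index precedes every test index — no future leaks back
    print("train:", train_idx, "test:", test_idx)
    assert train_idx.max() < test_idx.min()
```

Contrast this with a random `train_test_split`, which would happily train on tomorrow's rows to predict yesterday's.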

Should I always use cross-validation instead of a single train-test split?

Use cross-validation for model evaluation and hyperparameter tuning — it gives a more reliable performance estimate with confidence bounds. Use a single train-test split for final evaluation — it simulates production conditions where you evaluate on truly unseen data. The standard approach: split data into train/test (80/20), use cross-validation on the training set for model selection and tuning, then evaluate the final model on the untouched test set once. For small datasets (<1000 samples), cross-validation is especially important because a single split can be highly unrepresentative.
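The standard approach described above looks like this in code. Synthetic data and the 80/20 ratio are illustrative:

```python
# Sketch: CV on the training set for selection, one final test evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)  # selection signal
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

model.fit(X_train, y_train)                              # final fit
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")  # report once
```

The five CV scores give a spread (the "+/-") that a single split cannot, while the held-out test score remains an honest production estimate.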

How do I handle imbalanced datasets without collecting more data?

Three approaches, from simplest to most involved: (1) Class weights — set class_weight='balanced' in sklearn classifiers. This penalizes misclassifying the minority class more heavily during training. Requires no data modification. (2) Oversampling — use SMOTE (from imbalanced-learn library) to generate synthetic minority class samples. Creates new training samples by interpolating between existing minority samples. (3) Undersampling — randomly remove majority class samples to balance the dataset. Simple but loses information. Combine any of these with appropriate metrics (F1-score, AUC-ROC) instead of accuracy. Start with class weights — it works well in most cases and adds zero complexity.
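The simplest option, class weights, is literally one argument. A sketch on a synthetic 95/5 dataset (the imbalance ratio and model are assumptions):

```python
# Sketch: class_weight='balanced' on a synthetic imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=3)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_train, y_train)

# Weighting typically trades some precision for better minority recall
print("Plain recall:   ", recall_score(y_test, plain.predict(X_test)))
print("Weighted recall:", recall_score(y_test, weighted.predict(X_test)))
```

With `class_weight='balanced'`, each class's errors are weighted inversely to its frequency, so missing a rare positive costs the optimizer roughly 19x more here.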

How often should I retrain my production model?

It depends on how fast your data distribution changes. For stable distributions (medical imaging, physics simulations), retrain quarterly or when new data is available. For moderately changing distributions (e-commerce recommendations, marketing), retrain monthly. For rapidly changing distributions (news classification, social media trending, financial markets), retrain weekly or daily. The key is monitoring: track feature distributions and model performance metrics in production. When performance drops below a threshold or feature distributions shift significantly (detected via KS test or PSI), trigger retraining. Automated drift detection is better than fixed calendar schedules.
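The KS-test drift check mentioned above can be sketched with SciPy. The distributions, sample sizes, and the 0.01 threshold are illustrative assumptions:

```python
# Sketch: detect feature drift with a two-sample KS test (scipy assumed).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)   # distribution at training time
live_feature = rng.normal(0.5, 1.0, 5000)    # production data has shifted

stat, p_value = ks_2samp(train_feature, live_feature)
drifted = bool(p_value < 0.01)               # illustrative threshold
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2e}, drift: {drifted}")
```

In a real pipeline you would run this per feature on a recent window of production data and use the result (or a PSI score) as the retraining trigger.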

What is the difference between a validation set and a test set?

A validation set is used during model development for hyperparameter tuning and model selection — you evaluate on it repeatedly. A test set is used exactly once for final evaluation — it provides an unbiased estimate of production performance. In practice, cross-validation on the training set replaces the need for a separate validation set — you tune on cross-validation folds within the training data and evaluate on the untouched test set. The three-way split (train/validation/test) is more common in deep learning where cross-validation is computationally expensive due to long training times.
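When you do want an explicit three-way split (e.g. for deep learning), two chained `train_test_split` calls suffice. The 60/20/20 ratio here is an illustrative assumption:

```python
# Sketch: a three-way train/validation/test split via two calls.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=5)

# First carve off the test set (20%), then split the remainder 75/25,
# which yields 60% train / 20% validation / 20% test overall.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=5)

print(len(X_train), len(X_val), len(X_test))
```

You tune against `X_val` as often as you like; `X_test` is evaluated once at the very end.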

How do I know if a feature is causing data leakage?

Three indicators: (1) Feature importance — if one feature dominates (>50% of total importance), investigate whether it is derived from the target or contains future information. (2) Temporal availability — ask yourself: would this feature be available at prediction time in production? If not, it is leaking. (3) Suspicious accuracy — if removing a single feature drops accuracy by more than 20%, that feature is almost certainly leaking target information. Example: a 'days_since_last_purchase' feature in a churn prediction model that is calculated using the churn date itself — this feature encodes the target directly. Always ask: could I compute this feature BEFORE the event I am trying to predict?
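Indicator (1) is easy to demonstrate deliberately. In this sketch a copy of the target is appended as a "feature", and its importance dwarfs the legitimate ones (synthetic data, constructed leak):

```python
# Sketch: a leaking feature dominates feature importance (synthetic demo).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=4)
X_leaky = np.column_stack([X, y])   # last column IS the target — a leak

model = RandomForestClassifier(random_state=4).fit(X_leaky, y)
importances = model.feature_importances_
print("Per-feature importance:", np.round(importances, 3))
print("Leaky feature share:   ", round(float(importances[-1]), 3))
```

In real data the leak is subtler, but the check is the same: if one feature's importance towers over the rest, trace how that feature is computed before trusting the model.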

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
