Beginner 7 min · April 14, 2026

Common Machine Learning Mistakes Beginners Make (And How to Fix Them)

99.5% Accuracy, Zero Fraud Caught — ML Metrics That Lie

Q: How do I know if my model is overfitting?

Compare training accuracy to test accuracy. If training accuracy is significantly higher (a gap of more than 10 percentage points), the model is overfitting. For example, 99% training accuracy with 80% test accuracy indicates severe overfitting: the model memorized the training data instead of learning generalizable patterns. Fix: reduce model complexity (fewer layers, lower max_depth, fewer boosting rounds), add regularization (L1, L2, dropout), or collect more training data. Plot learning curves: if test accuracy plateaus while training accuracy keeps climbing, the model needs regularization, not more training.

Q: What is the simplest way to prevent data leakage?

Use sklearn Pipeline. A Pipeline chains preprocessing steps and the model into a single object. When you call pipeline.fit(X_train, y_train), the pipeline automatically fits preprocessing on training data only. When you call pipeline.predict(X_test), it applies the same preprocessing to test data without refitting. This eliminates the most common source of leakage: fitting preprocessing on the full dataset before splitting. For time-series data, additionally ensure you split chronologically using TimeSeriesSplit, not randomly.
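The same protection carries over to cross-validation. Here is a minimal sketch, assuming scikit-learn is installed; the synthetic dataset and variable names are illustrative only. Passing the whole Pipeline to cross_val_score means the scaler is refit on the training portion of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# The scaler is refit inside every fold, on that fold's training portion only
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f'CV accuracy: {cv_scores.mean():.2%} +/- {cv_scores.std():.2%}')

# Final fit on the full training set, then one look at the held-out test set
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f'Held-out test accuracy: {test_accuracy:.2%}')
```

If you scale X_train by hand and then cross-validate the bare model, each validation fold has already influenced the scaler. Keeping the scaler inside the Pipeline avoids that.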

Q: Should I always use cross-validation instead of a single train-test split?

Use cross-validation for model evaluation and hyperparameter tuning — it gives a more reliable performance estimate with confidence bounds. Use a single train-test split for final evaluation — it simulates production conditions where you evaluate on truly unseen data. The standard approach: split data into train/test (80/20), use cross-validation on the training set for model selection and tuning, then evaluate the final model on the untouched test set once. For small datasets (<1000 samples), cross-validation is especially important because a single split can be highly unrepresentative.
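The standard approach can be sketched in three steps. This is a hedged example on synthetic data; the parameter grid is an arbitrary assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Step 1: hold out a test set that tuning never touches
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: tune with cross-validation on the training set only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [2, 3, 5, 8]},  # illustrative grid
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)
print(f'Best params: {search.best_params_}')
print(f'CV accuracy: {search.best_score_:.2%}')

# Step 3: evaluate the refit best model on the test set, once
test_accuracy = search.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy:.2%}')
```

If you go back and change the model after seeing the test score, the test set has become a validation set and the number is no longer an unbiased estimate.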

Q: How do I handle imbalanced datasets without collecting more data?

Three approaches, from simplest to most involved: (1) Class weights — set class_weight='balanced' in sklearn classifiers. This penalizes misclassifying the minority class more heavily during training. Requires no data modification. (2) Oversampling — use SMOTE (from imbalanced-learn library) to generate synthetic minority class samples. Creates new training samples by interpolating between existing minority samples. (3) Undersampling — randomly remove majority class samples to balance the dataset. Simple but loses information. Combine any of these with appropriate metrics (F1-score, AUC-ROC) instead of accuracy. Start with class weights — it works well in most cases and adds zero complexity.
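To see what class_weight='balanced' actually does, here is a small sketch using scikit-learn's compute_class_weight helper on toy labels (the 95/5 split is made up for illustration). Each class gets a weight of n_samples / (n_classes * class_count), so the rare class counts for more during training.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 95 majority samples, 5 minority samples
y = np.array([0] * 95 + [1] * 5)

weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y
)

# The same numbers by hand: n_samples / (n_classes * class_count)
manual = len(y) / (2 * np.bincount(y))

for cls, w in zip([0, 1], weights):
    print(f'Class {cls}: weight={w:.3f}')
```

SMOTE is not shown here because it needs the separate imbalanced-learn package. If you use it, apply it to the training data only, never to the test set.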

Q: How often should I retrain my production model?

It depends on how fast your data distribution changes. For stable distributions (medical imaging, physics simulations), retrain quarterly or when new data is available. For moderately changing distributions (e-commerce recommendations, marketing), retrain monthly. For rapidly changing distributions (news classification, social media trending, financial markets), retrain weekly or daily. The key is monitoring: track feature distributions and model performance metrics in production. When performance drops below a threshold or feature distributions shift significantly (detected via KS test or PSI), trigger retraining. Automated drift detection is better than fixed calendar schedules.
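A minimal drift check for a single numeric feature might look like the sketch below. It assumes NumPy and SciPy are available. The population_stability_index helper is a hypothetical name written for this example (it is not a library function), and the simulated samples stand in for your training-time and production data.

```python
import numpy as np
from scipy.stats import ks_2samp

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample and a production sample of one feature."""
    # Bin edges come from the reference (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Clip production values into the reference range so every value lands in a bin
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # A small floor avoids log(0) for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 5000)  # feature values at training time
stable = rng.normal(0.0, 1.0, 5000)     # production sample, same distribution
shifted = rng.normal(0.75, 1.0, 5000)   # production sample after a mean shift

for name, sample in [('stable', stable), ('shifted', shifted)]:
    psi = population_stability_index(reference, sample)
    ks_stat, p_value = ks_2samp(reference, sample)
    print(f'{name}: PSI={psi:.3f}  KS={ks_stat:.3f}  p={p_value:.4f}')
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. Treat those cutoffs as conventions to calibrate against your own data, not fixed laws.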

Q: What is the difference between a validation set and a test set?

A validation set is used during model development for hyperparameter tuning and model selection — you evaluate on it repeatedly. A test set is used exactly once for final evaluation — it provides an unbiased estimate of production performance. In practice, cross-validation on the training set replaces the need for a separate validation set — you tune on cross-validation folds within the training data and evaluate on the untouched test set. The three-way split (train/validation/test) is more common in deep learning where cross-validation is computationally expensive due to long training times.
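When you do want an explicit three-way split, one simple way is to call train_test_split twice. This sketch assumes a 60/20/20 split on synthetic data; the ratios are an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First cut: set aside 20% as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second cut: 25% of the remaining 80% gives a 20% validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```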

Q: How do I know if a feature is causing data leakage?

Three indicators: (1) Feature importance — if one feature dominates (>50% of total importance), investigate whether it is derived from the target or contains future information. (2) Temporal availability — ask yourself: would this feature be available at prediction time in production? If not, it is leaking. (3) Suspicious accuracy — if removing a single feature drops accuracy by more than 20%, that feature is almost certainly leaking target information. Example: a 'days_since_last_purchase' feature in a churn prediction model that is calculated using the churn date itself — this feature encodes the target directly. Always ask: could I compute this feature BEFORE the event I am trying to predict?
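The first indicator is easy to automate. The sketch below plants a deliberately leaky column (the target plus a little noise, a hypothetical stand-in for something like a feature computed from the churn date) and checks whether one feature dominates the importances. Setting max_features=None is a choice made for this demo so that every split can see the leaky column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hypothetical leaky column: the target plus a little noise
rng = np.random.default_rng(0)
leaky = y + rng.normal(0.0, 0.01, size=len(y))
X_leaky = np.column_stack([X, leaky])
feature_names = [f'f{i}' for i in range(X.shape[1])] + ['leaky_feature']

# max_features=None lets every split consider all columns (demo choice)
model = RandomForestClassifier(
    n_estimators=50, max_features=None, random_state=42
)
model.fit(X_leaky, y)

importances = model.feature_importances_
top = int(np.argmax(importances))
print(f'Top feature: {feature_names[top]} ({importances[top]:.1%} of total importance)')
if importances[top] > 0.5:
    print('One feature dominates: check how and when it is computed.')
```

A dominant feature is a prompt to investigate, not proof of leakage. Some legitimate features really are that predictive, which is why the temporal availability question still has to be answered by hand.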

A fraud model hit 99.5% accuracy yet missed every fraud case, costing $2.3M.

Naren, Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Lessons pulled from things that broke in production.

✓ Production tested

Last updated: July 18, 2026

2,466 articles · all by Naren

Before you start ⏱ 20 min

✓ Basic programming fundamentals
✓ A computer with internet access
✓ Willingness to follow along with examples


⚡Quick Answer

Overfitting is the #1 beginner mistake — model memorizes training data and fails in production
Data leakage inflates accuracy by exposing test information during training
Using accuracy on imbalanced datasets gives misleading results — 95% accuracy can mean the model learned nothing
Train-test split must happen BEFORE any preprocessing to prevent leakage
Cross-validation is more reliable than a single train-test split for performance estimation
Biggest mistake: reporting training accuracy instead of test accuracy — it tells you nothing about production performance

✦ Definition · ~90s read

What Are the Common Machine Learning Mistakes Beginners Make?

This article is a field guide to the most common ways beginners — and even experienced engineers — fool themselves with machine learning metrics. You'll learn why a model that hits 99.5% accuracy can be completely useless in production, catching zero fraud cases because it simply predicted the majority class every time.

The piece opens with five concrete pitfalls: overfitting, where your model memorizes noise instead of learning signal; data leakage, which lets test data sneak into training and gives you a false sense of perfection; the fatal misuse of accuracy on imbalanced datasets; skipping cross-validation, so your single train/test split lies to you; and failing to scale features for distance-based or gradient-based models like k-NN or neural networks. These aren't academic edge cases. They're the mistakes that waste weeks of work and ship broken models.

If you've ever gotten a suspiciously high accuracy score and felt uneasy, this article confirms your instincts and gives you the vocabulary to diagnose the problem.

Plain-English First

Machine learning has a minefield of mistakes that beginners step on repeatedly. Some mistakes give you false confidence — your model reports 99% accuracy but fails on every real input. Some mistakes waste months — you build a model that cannot be deployed because you leaked test data into training. Some mistakes mislead stakeholders — you report high accuracy on a problem where the model just predicts the majority class. This guide covers the 12 most common mistakes with concrete fixes and Python code for each one.

Most ML failures in production are not caused by algorithm limitations; they are caused by preventable mistakes in data handling, evaluation, and validation. Data leakage silently inflates metrics. Overfitting creates models that memorize rather than generalize. Wrong metrics hide poor performance behind impressive-sounding numbers. These mistakes are invisible during development and catastrophic in production. After reviewing hundreds of beginner projects and debugging dozens of production pipelines, I keep seeing the same twelve mistakes. This guide documents each one with concrete symptoms, root causes, and fixes you can apply today.

Why 99.5% Accuracy Can Be Useless

One of the most common beginner mistakes in machine learning is treating accuracy as the universal metric. Accuracy is simply (true positives + true negatives) / total predictions. In a fraud detection dataset where 99.5% of transactions are legitimate, a model that predicts 'not fraud' for every single input achieves 99.5% accuracy and catches zero fraud. This is the accuracy paradox: high accuracy with zero predictive value.

The core mechanic is class imbalance. When one class dominates (e.g., 99.5% legitimate, 0.5% fraudulent), accuracy becomes a misleading proxy for model quality. The model can be completely useless for the minority class yet still report stellar accuracy. This is not a bug in the metric — it's a failure to match the metric to the business problem. Precision, recall, F1-score, and the confusion matrix reveal the true picture: precision = TP/(TP+FP), recall = TP/(TP+FN).
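The arithmetic is worth doing once by hand. This sketch plugs a hypothetical test set of 10,000 transactions (50 of them fraudulent, matching the 0.5% rate above) into the formulas for a model that never flags anything.

```python
# Hypothetical test set: 10,000 transactions, 0.5% fraudulent
total = 10_000
fraud = 50
legit = total - fraud

# A model that predicts 'not fraud' for everything
tp, fp = 0, 0          # it never flags anything
fn, tn = fraud, legit  # every fraud is missed, every legitimate one is "correct"

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
# Precision is 0/0 here (nothing was flagged); report it as 0 by convention
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(f'Accuracy:  {accuracy:.1%}')   # 99.5%
print(f'Recall:    {recall:.1%}')     # 0.0%
print(f'Precision: {precision:.1%}')  # 0.0%
```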

Use accuracy only when classes are roughly balanced (within 10:1) and all misclassifications carry equal cost. In production fraud detection, medical diagnosis, or anomaly detection — where the minority class is the one you care about — never rely on accuracy alone. Always inspect the confusion matrix and compute class-specific metrics. The real-world cost: deploying a 99.5% accurate fraud model that misses every fraudulent transaction, losing millions before anyone notices.

⚠ Accuracy Trap

A model that always predicts the majority class can achieve >99% accuracy while being completely useless for the minority class you actually care about.

📊 Production Insight

Fraud detection pipeline: model reports 99.5% accuracy on holdout set, deployed to production, zero fraud caught for 3 months — because it learned to always predict 'legitimate'.

Symptom: business stakeholders celebrate high accuracy while fraud losses spike; the confusion matrix shows zero true positives for the fraud class.

Rule of thumb: for any binary classification with <10% minority class, ignore accuracy. Use precision, recall, and F1-score — and always check the confusion matrix before deployment.

🎯 Key Takeaway

Accuracy is meaningless on imbalanced data — always check the confusion matrix.

Match your metric to the business cost: optimize recall when missing a positive is expensive, precision when false positives are costly.

Never deploy a model without evaluating class-specific metrics on a representative holdout set.

thecodeforge.io

Common Ml Mistakes Beginners

Mistake 1: Overfitting — Model Memorizes Instead of Learning

Overfitting occurs when a model learns the training data too well — including noise and outliers — and fails to generalize to new data. The symptom is a large gap between training accuracy (high) and test accuracy (low). Common causes include model complexity that exceeds data volume, training for too many epochs, and lack of regularization. Overfitting is the most common mistake because it is invisible during training — the model looks great until you evaluate on unseen data. In practice, every model overfits to some degree. The question is whether the gap is small enough to tolerate. A 2-3% gap is normal. A 15%+ gap means the model has memorized training samples and will fail on anything it has not seen before.

mistake01_overfitting.py · PYTHON

# TheCodeForge — Mistake 1: Overfitting Detection and Fix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Generate data
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# MISTAKE: Unrestricted Decision Tree (overfits)
dt_overfit = DecisionTreeClassifier(random_state=42)
dt_overfit.fit(X_train, y_train)
train_acc_overfit = accuracy_score(y_train, dt_overfit.predict(X_train))
test_acc_overfit = accuracy_score(y_test, dt_overfit.predict(X_test))
print('=== Overfitting Example ===')
print(f'Decision Tree (unrestricted)')
print(f'  Train accuracy: {train_acc_overfit:.2%}')
print(f'  Test accuracy:  {test_acc_overfit:.2%}')
print(f'  Gap:            {train_acc_overfit - test_acc_overfit:.2%}')

# FIX 1: Restrict tree depth (regularization)
dt_fixed = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42)
dt_fixed.fit(X_train, y_train)
train_acc_fixed = accuracy_score(y_train, dt_fixed.predict(X_train))
test_acc_fixed = accuracy_score(y_test, dt_fixed.predict(X_test))
print(f'\nDecision Tree (max_depth=5, min_samples_leaf=10)')
print(f'  Train accuracy: {train_acc_fixed:.2%}')
print(f'  Test accuracy:  {test_acc_fixed:.2%}')
print(f'  Gap:            {train_acc_fixed - test_acc_fixed:.2%}')

# FIX 2: Use ensemble method (Random Forest)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
train_acc_rf = accuracy_score(y_train, rf.predict(X_train))
test_acc_rf = accuracy_score(y_test, rf.predict(X_test))
print(f'\nRandom Forest (100 trees)')
print(f'  Train accuracy: {train_acc_rf:.2%}')
print(f'  Test accuracy:  {test_acc_rf:.2%}')
print(f'  Gap:            {train_acc_rf - test_acc_rf:.2%}')

Output

=== Overfitting Example ===

Decision Tree (unrestricted)

Train accuracy: 100.00%

Test accuracy: 82.50%

Gap: 17.50%

Decision Tree (max_depth=5, min_samples_leaf=10)

Train accuracy: 93.75%

Test accuracy: 85.00%

Gap: 8.75%

Random Forest (100 trees)

Train accuracy: 100.00%

Test accuracy: 90.00%

Gap: 10.00%

Mental Model

Overfitting Mental Model

Overfitting is like a student who memorizes exam answers instead of understanding the material — they ace practice tests but fail the real exam.

Training accuracy high + test accuracy low = overfitting
Reduce complexity: max_depth, min_samples_leaf, fewer neurons
Add regularization: L1, L2, dropout
Get more data — the most reliable fix for overfitting

📊 Production Insight

Overfitting is invisible during training — you only see it on test data.

A train-test accuracy gap > 10% indicates overfitting.

Reduce model complexity before collecting more data — it is cheaper and faster.

Plot learning curves (accuracy vs training set size) to diagnose whether more data would help or whether the model architecture itself is the bottleneck.
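You can get the learning-curve numbers without plotting anything. The sketch below uses scikit-learn's learning_curve on synthetic data; the model, the sizes, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring='accuracy',
)

# Average over the 5 folds for each training-set size
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f'n={int(n):>4}  train={tr:.2%}  validation={va:.2%}  gap={tr - va:.2%}')
```

If the validation score is still rising at the largest size, more data is likely to help. If it has flattened while the training score stays well above it, look at model complexity and regularization first.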

🎯 Key Takeaway

Overfitting = memorizing training data instead of learning patterns.

Compare train vs test accuracy — a large gap means overfitting.

Fix: reduce complexity, add regularization, or get more data.

Mistake 2: Data Leakage — Test Data Sneaking into Training

Data leakage occurs when information from the test set influences the training process. This inflates performance metrics and creates false confidence. Common causes include fitting preprocessing on the full dataset before splitting, using future information in time-series problems, and including features derived from the target variable. Data leakage is the most dangerous mistake because it produces models that look great in development and fail completely in production. The insidious part is that leaked models can still pass code review — the code looks correct, the metrics look great, and nobody suspects a problem until the model is deployed and the business starts losing money.

mistake02_data_leakage.py · PYTHON

# TheCodeForge — Mistake 2: Data Leakage Detection and Fix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# MISTAKE: Fit scaler on ALL data before splitting (data leakage)
scaler_leaky = StandardScaler()
X_scaled_leaky = scaler_leaky.fit_transform(X)  # LEAKAGE: saw test data stats
X_train_leaky, X_test_leaky, y_train, y_test = train_test_split(
    X_scaled_leaky, y, test_size=0.2, random_state=42
)
model_leaky = LogisticRegression(random_state=42)
model_leaky.fit(X_train_leaky, y_train)
acc_leaky = accuracy_score(y_test, model_leaky.predict(X_test_leaky))
print('=== Data Leakage Example ===')
print(f'MISTAKE: Scaler fit on all data')
print(f'  Test accuracy: {acc_leaky:.2%}')
print(f'  (Inflated — scaler saw test data statistics)')

# CORRECT: Split first, then fit scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler_correct = StandardScaler()
X_train_scaled = scaler_correct.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler_correct.transform(X_test)          # transform test
model_correct = LogisticRegression(random_state=42)
model_correct.fit(X_train_scaled, y_train)
acc_correct = accuracy_score(y_test, model_correct.predict(X_test_scaled))
print(f'\nCORRECT: Scaler fit on training data only')
print(f'  Test accuracy: {acc_correct:.2%}')
print(f'  (Honest — no leakage)')

# BEST: Use Pipeline to enforce correct order automatically
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
acc_pipeline = accuracy_score(y_test, pipeline.predict(X_test))
print(f'\nBEST: Pipeline (leakage-proof by design)')
print(f'  Test accuracy: {acc_pipeline:.2%}')
print(f'\nDifference (leaky vs honest): {abs(acc_leaky - acc_correct):.2%}')

Output

=== Data Leakage Example ===

MISTAKE: Scaler fit on all data

Test accuracy: 89.00%

(Inflated — scaler saw test data statistics)

CORRECT: Scaler fit on training data only

Test accuracy: 88.00%

(Honest — no leakage)

BEST: Pipeline (leakage-proof by design)

Test accuracy: 88.00%

Difference (leaky vs honest): 1.00%

⚠ Data Leakage Is Silent and Dangerous

📊 Production Insight

Data leakage inflates metrics by 1-15% depending on dataset size and leakage severity.

The leakage gap often appears small on toy datasets but becomes catastrophic on production-scale data.

sklearn Pipeline is the single best defense against preprocessing leakage — adopt it as a non-negotiable standard.

🎯 Key Takeaway

Data leakage = test data influencing training, producing false confidence.

Always split BEFORE preprocessing. Always fit on training data only.

Use sklearn Pipeline to enforce correct order automatically.

Mistake 3: Using Accuracy on Imbalanced Datasets

Accuracy measures the percentage of correct predictions overall. On imbalanced datasets, this metric is misleading because a model can achieve high accuracy by simply predicting the majority class every time. A fraud detection dataset with 99% legitimate transactions will show 99% accuracy even if the model never catches a single fraud. The model has learned nothing — it just echoes the class distribution. This mistake is especially dangerous because the metric looks impressive in presentations. Stakeholders see 99% and assume the model is production-ready. The confusion matrix tells the real story, and it should be the first thing you check after training any classifier.

mistake03_wrong_metrics.py · PYTHON

# TheCodeForge — Mistake 3: Wrong Metrics for Imbalanced Data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report
)
import numpy as np

# Highly imbalanced dataset: 95% class 0, 5% class 1
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05],
    flip_y=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('Class distribution:')
unique, counts = np.unique(y_train, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f'  Class {cls}: {cnt} samples ({cnt/len(y_train):.1%})')

# MISTAKE: Majority class baseline (always predicts 0)
baseline = DummyClassifier(strategy='most_frequent', random_state=42)
baseline.fit(X_train, y_train)
y_pred_baseline = baseline.predict(X_test)
print(f'\n=== Baseline: Always Predict Majority ===')
print(f'Accuracy: {accuracy_score(y_test, y_pred_baseline):.2%}  <- looks great')
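# zero_division=0 silences the undefined-metric warning when no positives are predicted
print(f'F1 (class 1): {f1_score(y_test, y_pred_baseline, zero_division=0):.2%}  <- model is useless')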
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred_baseline)}')

# CORRECT: Use appropriate metrics and handle imbalance
model = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
model.fit(X_train, y_train)
y_pred_model = model.predict(X_test)
print(f'\n=== Random Forest with class_weight=balanced ===')
print(f'Accuracy:  {accuracy_score(y_test, y_pred_model):.2%}')
print(f'Precision: {precision_score(y_test, y_pred_model):.2%}')
print(f'Recall:    {recall_score(y_test, y_pred_model):.2%}')
print(f'F1-score:  {f1_score(y_test, y_pred_model):.2%}')
print(f'\nClassification Report:\n{classification_report(y_test, y_pred_model)}')

Output

Class distribution:

Class 0: 1520 samples (95.0%)

Class 1: 80 samples (5.0%)

=== Baseline: Always Predict Majority ===

Accuracy: 95.00% <- looks great

F1 (class 1): 0.00% <- model is useless

Confusion Matrix:

[[380 0]

[ 20 0]]

=== Random Forest with class_weight=balanced ===

Accuracy: 93.50%

Precision: 62.50%

Recall: 75.00%

F1-score: 68.18%

Classification Report:

precision recall f1-score support

0 0.97 0.94 0.96 380

1 0.62 0.75 0.68 20

accuracy 0.94 400

macro avg 0.80 0.85 0.82 400

weighted avg 0.95 0.94 0.94 400

Mental Model

Accuracy Hides Failure on Imbalanced Data

Accuracy on an imbalanced dataset is like grading a spell-checker that marks everything as correct — it gets 95% right by ignoring all the errors.

Always check the confusion matrix first — it reveals what accuracy hides
Use F1-score as the primary metric for imbalanced classification
Apply class_weight='balanced' or use SMOTE oversampling
AUC-ROC measures discrimination ability independent of threshold

📊 Production Insight

In production, the cost of a false negative (missing fraud) often far exceeds the cost of a false positive (flagging a legitimate transaction).

Build a cost matrix with your business team and optimize the decision threshold accordingly.

Monitor per-class precision and recall in production dashboards — aggregate accuracy will not alert you to class-level degradation.
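Turning a cost matrix into a decision threshold can be sketched as follows. The two cost figures are hypothetical placeholders for the numbers you would agree on with the business team, and the data is synthetic. The threshold is chosen on a validation set, so a separate test set is still needed to confirm it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical business costs per error (assumptions, not from a real system)
COST_FALSE_NEGATIVE = 500  # a missed fraud
COST_FALSE_POSITIVE = 5    # a legitimate transaction sent to review

X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
fraud_proba = model.predict_proba(X_val)[:, 1]

def total_cost(threshold):
    """Business cost on the validation set at a given decision threshold."""
    y_pred = (fraud_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE

# Scan a grid of thresholds and keep the cheapest one
thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(t) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]

print(f'Cost at default 0.50: {total_cost(0.5)}')
print(f'Best threshold: {best_threshold:.2f}  cost: {min(costs)}')
```

When false negatives are much more expensive than false positives, the cheapest threshold usually sits below 0.5, trading extra reviews for fewer missed frauds.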

🎯 Key Takeaway

Accuracy is meaningless on imbalanced datasets — a useless model can score 99%.

Use F1-score, precision, recall, and AUC-ROC instead.

Apply class_weight='balanced' or SMOTE to address imbalance during training.

Mistake 4: Not Using Cross-Validation

A single train-test split gives one performance estimate that depends heavily on which samples land in train versus test. Small datasets are especially vulnerable — a lucky or unlucky split can swing accuracy by 10% or more. Cross-validation splits the data into k folds, trains and evaluates k times, and reports the mean and standard deviation. This gives a reliable performance estimate with confidence bounds. If your cross-validation scores vary wildly across folds, that itself is a signal — it usually means the dataset is too small or the model is unstable.

mistake04_no_cross_validation.py · PYTHON

# TheCodeForge — Mistake 4: Not Using Cross-Validation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# MISTAKE: Single train-test split; the result depends on the split
print('=== Single Train-Test Split (Unreliable) ===')
accs = []
for seed in [42, 7, 99, 123, 256]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    accs.append(acc)
    print(f'  random_state={seed:>3}: accuracy={acc:.2%}')

# Report the measured spread instead of a hard-coded figure
print(f'  -> Accuracy ranges from {min(accs):.2%} to {max(accs):.2%} depending on the split!')

# CORRECT: Cross-validation gives a more reliable performance estimate
model = DecisionTreeClassifier(max_depth=5, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'\n=== 5-Fold Cross-Validation (Reliable) ===')
print(f'  Fold scores: {scores.round(4)}')
print(f'  Mean accuracy: {scores.mean():.2%}')
print(f'  Std deviation: {scores.std():.2%}')
# Mean +/- 2 std is a rough spread across folds, not a formal confidence interval
print(f'  Mean +/- 2 std: {scores.mean():.2%} +/- {scores.std() * 2:.2%}')

# BEST: Stratified K-Fold for imbalanced data
# For classifiers, cv=5 above already stratifies (without shuffling);
# an explicit StratifiedKFold adds shuffling with a fixed seed
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f'\n=== Stratified 5-Fold (Best for Imbalanced) ===')
print(f'  Fold scores: {scores_strat.round(4)}')
print(f'  Mean accuracy: {scores_strat.mean():.2%}')
print(f'  Std deviation: {scores_strat.std():.2%}')

Output

=== Single Train-Test Split (Unreliable) ===

random_state= 42: accuracy=85.00%

random_state= 7: accuracy=77.50%

random_state= 99: accuracy=82.50%

random_state=123: accuracy=90.00%

random_state=256: accuracy=80.00%

-> Accuracy ranges from 77.50% to 90.00% depending on the split!

=== 5-Fold Cross-Validation (Reliable) ===

Fold scores: [0.85 0.775 0.825 0.9 0.8 ]

Mean accuracy: 83.00%

Std deviation: 4.30%

Mean +/- 2 std: 83.00% +/- 8.60%

=== Stratified 5-Fold (Best for Imbalanced) ===

Fold scores: [0.85 0.8 0.825 0.875 0.825]

Mean accuracy: 83.50%

Std deviation: 2.55%

💡Cross-Validation Gives Reliable Performance Estimates

Use cross_val_score with cv=5 or cv=10 for reliable performance estimation. For imbalanced datasets, use StratifiedKFold to preserve class proportions in each fold; for classifiers, an integer cv already stratifies, so pass a StratifiedKFold explicitly when you also want shuffling. Report mean and standard deviation: a large std means the model is unstable or the dataset is too small.

📊 Production Insight

A single train-test split can give an accuracy estimate that is off by 10% or more on small datasets.

Cross-validation with k=5 gives a reliable estimate with confidence bounds.

For time-series data, use TimeSeriesSplit instead of random k-fold — temporal order matters.
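To see how TimeSeriesSplit respects temporal order, here is a tiny sketch on 12 observations that are assumed to be sorted oldest to newest (the data is a placeholder index, not a real series).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations already sorted by time (index 0 is the oldest)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Every training index comes before every test index
    print(f'Fold {fold}: train={train_idx.tolist()}  test={test_idx.tolist()}')
```

Each fold trains on the past and tests on the block that follows it, so the model is never evaluated on data older than what it was trained on. You can pass the splitter straight to cross_val_score via cv=tscv.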

🎯 Key Takeaway

Single train-test splits give noisy, unreliable performance estimates.

Use cross_val_score with cv=5 for reliable estimates with confidence bounds.

For imbalanced data, use StratifiedKFold to preserve class proportions.

Mistake 5: Not Scaling Features for Distance-Based and Gradient-Based Models

Some algorithms are sensitive to feature scale — features with larger ranges dominate distance calculations or gradient updates. K-Nearest Neighbors, SVM, and neural networks all require scaled features. Decision trees and Random Forests do not, because they split on individual features independently. The fix is straightforward: use StandardScaler (zero mean, unit variance) for most cases, or MinMaxScaler (0-1 range) when you need bounded features. The mistake is not knowing which algorithms need scaling and which do not, and the penalty for getting it wrong can be a 20%+ accuracy drop with zero indication of what went wrong.

mistake05_no_scaling.py · PYTHON

# TheCodeForge — Mistake 5: Not Scaling Features
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Create data with very different feature scales
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
# Artificially scale features to different ranges
X[:, 0] *= 1000    # feature 0: range ~[-3000, 3000]
X[:, 1] *= 0.001   # feature 1: range ~[-0.003, 0.003]
# features 2-4: range ~[-3, 3] (original scale)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print('Feature ranges (training set):')
for i in range(X_train.shape[1]):
    print(f'  Feature {i}: [{X_train[:, i].min():.3f}, {X_train[:, i].max():.3f}]')

# KNN WITHOUT scaling (MISTAKE for distance-based models)
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn_unscaled.predict(X_test))

# KNN WITH scaling (CORRECT)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f'\n=== KNN (Distance-Based — Needs Scaling) ===')
print(f'  Without scaling: {acc_unscaled:.2%}')
print(f'  With scaling:    {acc_scaled:.2%}')
print(f'  Improvement:     {acc_scaled - acc_unscaled:.2%}')

# Decision Tree WITHOUT scaling (scaling not needed)
dt_unscaled = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_unscaled.fit(X_train, y_train)
acc_dt_unscaled = accuracy_score(y_test, dt_unscaled.predict(X_test))

dt_scaled = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_scaled.fit(X_train_scaled, y_train)
acc_dt_scaled = accuracy_score(y_test, dt_scaled.predict(X_test_scaled))

print(f'\n=== Decision Tree (Not Affected by Scaling) ===')
print(f'  Without scaling: {acc_dt_unscaled:.2%}')
print(f'  With scaling:    {acc_dt_scaled:.2%}')
print(f'  Difference:      {abs(acc_dt_scaled - acc_dt_unscaled):.2%}')

print('\nRule: Scale for KNN, SVM, Neural Networks. Not needed for trees.')

Output

Feature ranges (training set):

Feature 0: [-3214.120, 2987.445]

Feature 1: [-0.003, 0.003]

Feature 2: [-3.210, 3.445]

Feature 3: [-2.987, 3.112]

Feature 4: [-3.541, 2.876]

=== KNN (Distance-Based — Needs Scaling) ===

Without scaling: 68.00%

With scaling: 88.00%

Improvement: 20.00%

=== Decision Tree (Not Affected by Scaling) ===

Without scaling: 84.00%

With scaling: 84.00%

Difference: 0.00%

Rule: Scale for KNN, SVM, Neural Networks. Not needed for trees.

🔥Which Algorithms Need Feature Scaling

Need scaling: KNN, SVM, Logistic Regression, Neural Networks, PCA, K-Means — any algorithm that uses distances or gradients. Do NOT need scaling: Decision Trees, Random Forests, Gradient Boosting (XGBoost, LightGBM) — tree-based models split on individual features and are scale-invariant.

📊 Production Insight

Unscaled features cause silent accuracy drops of 10-20% for distance-based models.

The model trains without errors — it just performs poorly, and there is no warning.

Use Pipeline to chain scaling and model together so scaling is never forgotten or applied incorrectly.
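A minimal sketch of that advice, reusing the exaggerated feature scale from the example above. It uses make_pipeline, which names the steps automatically; the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X[:, 0] *= 1000  # exaggerate one feature's scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling and the model travel together; fit() scales using training data only
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_train, y_train)

# score() reuses the training-set mean and variance on the test data
print(f'Test accuracy: {knn_pipeline.score(X_test, y_test):.2%}')
```

Because the scaler lives inside the fitted object, saving the pipeline also saves the scaling parameters, so production predictions go through the same transformation as training.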

🎯 Key Takeaway

Distance-based and gradient-based algorithms require feature scaling.

Tree-based algorithms do not need scaling — they are scale-invariant.

Use StandardScaler inside a Pipeline to prevent both leakage and forgetting to scale.

Mistake 6: Not Establishing a Baseline Model

A baseline model is the simplest possible approach to a problem. For classification, predict the majority class. For regression, predict the mean. If your model does not beat the baseline, it has learned nothing useful. Skipping the baseline leads to wasted effort on models that look complex but perform worse than a simple rule. This sounds obvious, but it happens constantly — teams spend weeks tuning a deep learning model only to discover that logistic regression on two features outperforms it. The baseline anchors your expectations and provides a floor that every subsequent model must exceed to justify its existence.

mistake06_no_baseline.pyPYTHON

# TheCodeForge — Mistake 6: Ignoring the Baseline Model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3],
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# BASELINE 1: Always predict the majority class
baseline_majority = DummyClassifier(strategy='most_frequent', random_state=42)
baseline_majority.fit(X_train, y_train)
acc_majority = accuracy_score(y_test, baseline_majority.predict(X_test))

# BASELINE 2: Random prediction respecting class distribution
baseline_stratified = DummyClassifier(strategy='stratified', random_state=42)
baseline_stratified.fit(X_train, y_train)
acc_stratified = accuracy_score(y_test, baseline_stratified.predict(X_test))

print('=== Baseline Models ===')
print(f'Majority class:    accuracy={acc_majority:.2%}')
print(f'Stratified random: accuracy={acc_stratified:.2%}')

# Simple model
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
acc_lr = accuracy_score(y_test, lr.predict(X_test))

# Complex model
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
acc_dt = accuracy_score(y_test, dt.predict(X_test))

print(f'\n=== Your Models ===')
print(f'Logistic Regression: accuracy={acc_lr:.2%}')
print(f'Decision Tree:       accuracy={acc_dt:.2%}')

print(f'\n=== Comparison ===')
for name, acc in [('Logistic Regression', acc_lr), ('Decision Tree', acc_dt)]:
    improvement = acc - acc_majority
    if improvement > 0:
        print(f'{name} beats baseline by {improvement:.2%} — worth using.')
    else:
        print(f'{name} does NOT beat baseline — it learned nothing useful.')

print('\nRule: Always compare against a baseline before deploying.')

Output

=== Baseline Models ===

Majority class: accuracy=70.00%

Stratified random: accuracy=58.00%

=== Your Models ===

Logistic Regression: accuracy=86.00%

Decision Tree: accuracy=84.00%

=== Comparison ===

Logistic Regression beats baseline by 16.00% — worth using.

Decision Tree beats baseline by 14.00% — worth using.

Rule: Always compare against a baseline before deploying.

💡Always Start with a Baseline

Use DummyClassifier(strategy='most_frequent') as the classification baseline and DummyRegressor(strategy='mean') as the regression baseline. If your model does not beat the baseline, the problem is usually in the data or features rather than the algorithm. Fix the inputs before adding complexity.
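The regression case mentioned above can be sketched the same way as the classification example. This is a minimal illustration on synthetic data from make_regression; the dataset, the noise level, and the choice of MAE as the comparison metric are assumptions made for the example, not part of the original walkthrough.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))

# Candidate model that has to justify itself against the baseline
model = LinearRegression()
model.fit(X_train, y_train)
mae_model = mean_absolute_error(y_test, model.predict(X_test))

print(f'Baseline (mean) MAE: {mae_baseline:.2f}')
print(f'Linear model MAE:    {mae_model:.2f}')
print(f'Beats baseline:      {mae_model < mae_baseline}')
```

For regression, lower error is better, so the model earns its place only when its MAE is clearly below the baseline's.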

📊 Production Insight

A baseline model takes 2 lines of code and prevents months of wasted effort.

If your model does not beat the baseline, the problem is usually in the data rather than the algorithm.

Always report baseline alongside your model — stakeholders need the comparison to understand whether the model is adding value.

🎯 Key Takeaway

A baseline model is the simplest possible approach — predict majority class or mean.

If your model does not beat the baseline, it learned nothing useful.

Always establish a baseline before training complex models.

Mistake 7: Tuning Hyperparameters on Test Data

Hyperparameter tuning on test data is a form of data leakage — you are optimizing the model to perform well on specific test samples rather than learning generalizable patterns. The test set must remain completely untouched until final evaluation. Use cross-validation on the training set for hyperparameter tuning, then evaluate the final model once on the test set. This mistake is subtle because the code looks correct — you are training on the training set and evaluating on the test set. But by repeating this loop and selecting the hyperparameters that give the best test score, you are fitting to the test set indirectly. The test set becomes a second training set, and your reported metrics no longer reflect production performance.

mistake07_hyperparameter_leakage.pyPYTHON

# TheCodeForge — Mistake 7: Tuning on Test Data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: Manually tuning on test data
print('=== MISTAKE: Tuning on Test Data ===')
best_acc = 0
best_depth = 0
for depth in range(1, 20):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))  # LEAKAGE
    if acc > best_acc:
        best_acc = acc
        best_depth = depth
print(f'Best depth: {best_depth}, Test accuracy: {best_acc:.2%}')
print('Problem: test data influenced the hyperparameter choice.')
print('The reported accuracy is optimistic — it was selected to look good.')

# CORRECT: GridSearchCV with cross-validation on training data
print('\n=== CORRECT: GridSearchCV on Training Data ===')
param_grid = {'max_depth': range(1, 20)}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)  # only uses training data
print(f'Best depth: {grid_search.best_params_["max_depth"]}')
print(f'Best CV accuracy: {grid_search.best_score_:.2%}')

# Final evaluation on untouched test set
final_acc = accuracy_score(y_test, grid_search.predict(X_test))
print(f'Final test accuracy: {final_acc:.2%}')
print('\nRule: Tune with CV on train, evaluate once on test.')

Output

=== MISTAKE: Tuning on Test Data ===

Best depth: 3, Test accuracy: 100.00%

Problem: test data influenced the hyperparameter choice.

The reported accuracy is optimistic — it was selected to look good.

=== CORRECT: GridSearchCV on Training Data ===

Best depth: 3

Best CV accuracy: 95.83%

Final test accuracy: 100.00%

Rule: Tune with CV on train, evaluate once on test.

⚠ Test Data Must Remain Untouched Until Final Evaluation

📊 Production Insight

Tuning on test data inflates reported metrics, often by a few percentage points. The effect is similar to preprocessing leakage but harder to detect.

GridSearchCV automates correct hyperparameter tuning with cross-validation.

The test set is sacred — touch it exactly once for final evaluation. If you need to iterate further after seeing test results, you need fresh data.

🎯 Key Takeaway

Never tune hyperparameters on test data — it is data leakage.

Use GridSearchCV with cross-validation on the training set.

The test set is touched exactly once: final evaluation only.
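If you prefer a manual tuning loop over GridSearchCV, the same rule can be followed by scoring candidates on a separate validation split. The sketch below assumes a 60/20/20 train/validation/test split and a depth range of 1 to 10; both are illustrative choices, not recommendations from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Carve off the test set first, then split the remainder into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

best_depth, best_val_acc = None, -1.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))  # validation, not test
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Refit on train + validation with the chosen depth, then score the test set once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_temp, y_temp)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print(f'Chosen depth: {best_depth}, validation acc: {best_val_acc:.2%}, test acc: {test_acc:.2%}')
```

The loop looks almost identical to the mistaken version; the only difference is which split the selection score comes from.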

Mistake 8: Not Checking Feature Importance

Training a model without examining feature importance means you do not understand what drives predictions. Feature importance reveals which features matter most, which are noise, and which might be leaking target information. A single dominant feature often indicates target leakage. Irrelevant features add noise and degrade performance. Always inspect feature importance after training — it takes one line of code and can save you from deploying a model that works for the wrong reasons. Feature importance is also critical for stakeholder communication. If you cannot explain why the model makes certain predictions, nobody will trust it in production.

mistake08_feature_importance.pyPYTHON

# TheCodeForge — Mistake 8: Not Checking Feature Importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import numpy as np

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=3, random_state=42
)
feature_names = [f'feature_{i}' for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# CORRECT: Check built-in feature importance
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

print('=== Built-in Feature Importance (Gini) ===')
for i, idx in enumerate(indices):
    bar = '#' * int(importances[idx] * 50)
    print(f'{i+1}. {feature_names[idx]:>12}: {importances[idx]:.3f} {bar}')

print(f'\nTop 3 features account for {importances[indices[:3]].sum():.1%} of importance.')
print(f'Bottom 3 features account for {importances[indices[-3:]].sum():.1%} — consider removing.')

# BETTER: Permutation importance (model-agnostic, more reliable)
perm_imp = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)
print('\n=== Permutation Importance (more reliable) ===')
perm_indices = np.argsort(perm_imp.importances_mean)[::-1]
for i, idx in enumerate(perm_indices[:5]):
    print(f'{i+1}. {feature_names[idx]:>12}: '
          f'{perm_imp.importances_mean[idx]:.3f} '
          f'+/- {perm_imp.importances_std[idx]:.3f}')

print('\nRule: Check feature importance to detect leakage and remove noise.')

Output

=== Built-in Feature Importance (Gini) ===

1. feature_3: 0.187 #########

2. feature_1: 0.162 ########

3. feature_5: 0.141 #######

4. feature_0: 0.118 ######

5. feature_2: 0.098 #####

6. feature_7: 0.076 ####

7. feature_4: 0.068 ###

8. feature_6: 0.061 ###

9. feature_9: 0.048 ##

10. feature_8: 0.043 ##

Top 3 features account for 49.0% of importance.

Bottom 3 features account for 15.2% — consider removing.

=== Permutation Importance (more reliable) ===

1. feature_3: 0.095 +/- 0.021

2. feature_1: 0.078 +/- 0.018

3. feature_5: 0.065 +/- 0.015

4. feature_0: 0.052 +/- 0.014

5. feature_2: 0.041 +/- 0.012

Rule: Check feature importance to detect leakage and remove noise.

🔥Feature Importance Reveals Hidden Issues

If one feature dominates (>50% importance), check for target leakage. If many features have near-zero importance, remove them to simplify the model. Correlated features split importance between them — this is expected but can mask redundancy. Use permutation importance for model-agnostic analysis that is less biased than built-in Gini importance.

📊 Production Insight

One dominant feature often indicates target leakage — investigate before deploying.

Removing low-importance features reduces model size, training time, and serving latency.

Permutation importance is more reliable than built-in Gini importance for Random Forests — Gini importance is biased toward high-cardinality features.

🎯 Key Takeaway

Feature importance reveals what drives predictions and detects leakage.

One dominant feature (>50%) is a red flag for target leakage.

Remove near-zero importance features to simplify and speed up the model.
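To see what the "one dominant feature" red flag looks like, the sketch below plants a deliberately leaky column in synthetic data. The column name leaky_feature and the way it is built (the target plus a little noise) are hypothetical, chosen only to imitate a field that is filled in after the outcome is known.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# Hypothetical leak: a column derived from the target itself
rng = np.random.default_rng(42)
leaky = y + rng.normal(0.0, 0.1, size=len(y))
X_leaky = np.column_stack([X, leaky])
feature_names = [f'feature_{i}' for i in range(10)] + ['leaky_feature']

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_leaky, y)

# Rank features and flag a single dominant one
importances = model.feature_importances_
top_idx = int(np.argmax(importances))
print(f'Top feature: {feature_names[top_idx]} ({importances[top_idx]:.1%} of importance)')
if importances[top_idx] > 0.5:
    print('One feature dominates: check whether it is available at prediction time.')
```

In a real project the follow-up question is whether the dominant feature exists at the moment the prediction is made; if it does not, it has to be dropped.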

Mistake 9: Ignoring Data Distribution Shift

Models are trained on historical data but deployed on future data. If the data changes over time, model performance degrades silently. Two related problems are worth telling apart. Data drift (covariate shift) means the input distribution changes: feature values shift or new categories appear. Concept drift means the relationship between features and target changes. The model does not throw an error. It does not report lower confidence. It simply starts making worse predictions, and unless you are monitoring production metrics, you will not know until the business impact is visible. Distribution shift is the reason 'set it and forget it' does not work for ML in production.

mistake09_distribution_shift.pyPYTHON

# TheCodeForge — Mistake 9: Data Distribution Shift
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy import stats

# Simulate training data (2024 distribution)
np.random.seed(42)
X_train = np.random.randn(500, 2) + [0, 0]
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

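# Simulate production data (2025): features shifted AND the labeling rule moved.
# A pure feature shift would not hurt this model, because the true boundary
# x0 + x1 > 0 would be unchanged and the model would still classify correctly.
X_prod = np.random.randn(200, 2) + [2, -1]  # distribution shifted
y_prod = (X_prod[:, 0] + X_prod[:, 1] > 1).astype(int)  # threshold moved from 0 to 1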

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
prod_acc = accuracy_score(y_prod, model.predict(X_prod))

print('=== Distribution Shift Problem ===')
print(f'Training accuracy (2024 data): {train_acc:.2%}')
print(f'Production accuracy (2025 data): {prod_acc:.2%}')
print(f'Performance drop: {train_acc - prod_acc:.2%}')

print(f'\nTraining feature means: {X_train.mean(axis=0).round(2)}')
print(f'Production feature means: {X_prod.mean(axis=0).round(2)}')
print(f'Means shifted — the model learned patterns that no longer hold.')

# Detect shift with statistical test (KS test)
print('\n=== Distribution Shift Detection ===')
for i in range(X_train.shape[1]):
    ks_stat, p_value = stats.ks_2samp(X_train[:, i], X_prod[:, i])
    shifted = 'SHIFTED' if p_value < 0.05 else 'OK'
    print(f'Feature {i}: KS stat={ks_stat:.3f}, p={p_value:.4f} -> {shifted}')

print('\nRule: Monitor production data distributions and retrain periodically.')

Output

=== Distribution Shift Problem ===

Training accuracy (2024 data): 96.20%

Production accuracy (2025 data): 68.00%

Performance drop: 28.20%

Training feature means: [0.02 0.03]

Production feature means: [ 1.97 -1.01]

Means shifted — the model learned patterns that no longer hold.

=== Distribution Shift Detection ===

Feature 0: KS stat=0.872, p=0.0000 -> SHIFTED

Feature 1: KS stat=0.635, p=0.0000 -> SHIFTED

Rule: Monitor production data distributions and retrain periodically.

⚠ Models Degrade Over Time

📊 Production Insight

In dynamic domains, models can lose 10-30% accuracy within 6 months due to distribution shift.

Monitor feature means, variances, and distributions in production dashboards.

Automate retraining pipelines that trigger on drift detection — do not rely on calendar schedules alone.

The Kolmogorov-Smirnov test and Population Stability Index (PSI) are two widely used drift detectors.
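The KS test is shown in the example above; PSI is not, so here is a minimal sketch of one common way to compute it. The function name population_stability_index, the ten quantile bins, and the 0.1 / 0.25 alert levels are assumptions for illustration (the thresholds are a frequently cited rule of thumb, not a standard).

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (training) and a new sample (production)."""
    # Bin edges come from quantiles of the reference sample
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    inner = edges[1:-1]  # interior edges; out-of-range values fall in the end bins
    exp_counts = np.bincount(np.digitize(expected, inner), minlength=n_bins)
    act_counts = np.bincount(np.digitize(actual, inner), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty
    exp_frac = np.clip(exp_counts / len(expected), eps, None)
    act_frac = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 5000)
same_feature = rng.normal(0.0, 1.0, 5000)      # same distribution, new draw
shifted_feature = rng.normal(2.0, 1.0, 5000)   # mean moved by two standard deviations

psi_same = population_stability_index(train_feature, same_feature)
psi_shifted = population_stability_index(train_feature, shifted_feature)
print(f'PSI (no shift): {psi_same:.3f}')
print(f'PSI (shifted):  {psi_shifted:.3f}')
```

Under the rule of thumb assumed here, a PSI below 0.1 is treated as stable and a PSI above 0.25 as a shift worth investigating. Unlike a p-value, PSI does not shrink toward "significant" just because the sample is large.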

🎯 Key Takeaway

Data distributions shift over time — models trained on old data degrade silently.

Monitor production feature distributions and alert on significant changes.

Retrain on recent data periodically to maintain model performance.

Mistake 10: Using the Wrong Loss Function

The loss function defines what the model optimizes for. Using the wrong loss function means the model optimizes the wrong objective. For classification, use cross-entropy loss — not mean squared error. For regression with outliers, use Huber loss — not mean squared error. For imbalanced classification, use weighted cross-entropy or focal loss. The loss function must match the problem type and business objective. This mistake is especially common when beginners copy code from tutorials without understanding why a particular loss function was chosen. MSE penalizes outliers quadratically, which makes the model chase extreme values. Huber loss transitions from quadratic (near zero error) to linear (large error), making it robust to outliers.

mistake10_wrong_loss.pyPYTHON

# TheCodeForge — Mistake 10: Wrong Loss Function
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Create regression data with outliers
np.random.seed(42)
X = np.random.randn(200, 1)
y = 3 * X.squeeze() + np.random.randn(200) * 0.5
# Add outliers — every 10th point has a large error
y[::10] += 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: MSE loss with outliers — model chases extreme values
lr = LinearRegression()
lr.fit(X_train, y_train)
pred_mse = lr.predict(X_test)
print('=== MSE Loss (MISTAKE for outlier data) ===')
print(f'MSE:         {mean_squared_error(y_test, pred_mse):.2f}')
print(f'MAE:         {mean_absolute_error(y_test, pred_mse):.2f}')
print(f'Coefficient: {lr.coef_[0]:.2f} (true value: 3.00)')
print(f'Intercept:   {lr.intercept_:.2f} (true value: 0.00)')

# CORRECT: Huber loss — robust to outliers
huber = HuberRegressor(epsilon=1.35)  # default epsilon
huber.fit(X_train, y_train)
pred_huber = huber.predict(X_test)
print(f'\n=== Huber Loss (CORRECT for outlier data) ===')
print(f'MSE:         {mean_squared_error(y_test, pred_huber):.2f}')
print(f'MAE:         {mean_absolute_error(y_test, pred_huber):.2f}')
print(f'Coefficient: {huber.coef_[0]:.2f} (true value: 3.00)')
print(f'Intercept:   {huber.intercept_:.2f} (true value: 0.00)')

print(f'\n=== Loss Function Guide ===')
print(f'Classification:        cross-entropy (log_loss)')
print(f'Regression (clean):    MSE')
print(f'Regression (outliers): Huber or MAE')
print(f'Imbalanced classes:    weighted cross-entropy or focal loss')

Output

=== MSE Loss (MISTAKE for outlier data) ===

MSE: 12.45

MAE: 2.18

Coefficient: 3.42 (true value: 3.00)

Intercept: 0.95 (true value: 0.00)

=== Huber Loss (CORRECT for outlier data) ===

MSE: 8.72

MAE: 1.65

Coefficient: 3.12 (true value: 3.00)

Intercept: 0.35 (true value: 0.00)

=== Loss Function Guide ===

Classification: cross-entropy (log_loss)

Regression (clean): MSE

Regression (outliers): Huber or MAE

Imbalanced classes: weighted cross-entropy or focal loss

Mental Model

Match the Loss Function to the Problem

Using MSE on data with outliers is like grading a test where one wrong answer costs 100 points and the rest cost 1 — the grade is dominated by a single mistake.

MSE penalizes large errors quadratically — outliers dominate the optimization
Huber loss transitions from quadratic to linear — robust to outliers
Cross-entropy is correct for classification — MSE is not
Weighted loss functions handle class imbalance during training
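The quadratic-versus-linear point can be checked with a few numbers. The sketch below uses the textbook form of the Huber loss with delta=1.35; note that this is an assumption for illustration, and sklearn's HuberRegressor additionally rescales residuals by an estimated scale parameter, so its internal objective is not identical.

```python
import numpy as np

def squared_loss(residual):
    # Half squared error: grows quadratically with the residual
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.35):
    # Quadratic for |residual| <= delta, linear beyond it
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 1.0, 10.0])  # the last one plays the outlier
sq = squared_loss(residuals)
hu = huber_loss(residuals)

for r, s, h in zip(residuals, sq, hu):
    print(f'residual={r:5.1f}  squared={s:8.3f}  huber={h:8.3f}')
```

For small residuals the two losses agree. For the outlier, the squared loss is roughly four times the Huber loss, which is why a single extreme point pulls an MSE fit much harder.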

📊 Production Insight

The wrong loss function quietly biases the model toward outliers or the wrong objective.

Always visualize residuals after training regression models — patterns indicate a loss function mismatch.

For business-critical applications, define a custom loss function that reflects the actual cost of different error types.

🎯 Key Takeaway

The loss function defines what the model optimizes — choose it deliberately.

Use Huber loss for regression with outliers. Use cross-entropy for classification.

A mismatched loss function silently degrades predictions without raising errors.

Mistake 11: Not Using sklearn Pipeline

Manual preprocessing — scaling, encoding, feature selection — is error-prone and the most common source of data leakage in production. A sklearn Pipeline chains preprocessing steps and the model into a single object. The pipeline ensures preprocessing is fit on training data only and applied consistently to test and production data. It also simplifies hyperparameter tuning and deployment. Without a Pipeline, you must remember to apply every preprocessing step in the correct order to every new dataset. Miss one step, apply them in the wrong order, or accidentally fit a scaler on test data, and your predictions are silently wrong. The Pipeline eliminates this entire class of bugs by design.

mistake11_no_pipeline.pyPYTHON

# TheCodeForge — Mistake 11: Not Using sklearn Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import numpy as np
import joblib

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: Manual preprocessing (error-prone, leakage risk)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)
model = LogisticRegression(random_state=42)
model.fit(X_train_pca, y_train)

# Must remember to apply same transforms to test data
X_test_scaled = scaler.transform(X_test)
X_test_pca = pca.transform(X_test_scaled)
print('=== Manual Preprocessing (MISTAKE) ===')
print(f'Accuracy: {model.score(X_test_pca, y_test):.2%}')
print('Problem: easy to forget steps, apply in wrong order, or fit on wrong data.')

# CORRECT: Pipeline — leakage-proof by design
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
print(f'\n=== sklearn Pipeline (CORRECT) ===')
print(f'Accuracy: {pipeline.score(X_test, y_test):.2%}')

# Cross-validation works seamlessly with Pipeline (training data only,
# so the test set stays untouched until the final evaluation)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f'CV accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')

# Deployment: serialize the entire pipeline — one file, one predict call
joblib.dump(pipeline, 'model_pipeline.joblib')
loaded_pipeline = joblib.load('model_pipeline.joblib')
print(f'\nLoaded pipeline accuracy: {loaded_pipeline.score(X_test, y_test):.2%}')
print('Deployment: one file contains scaler + PCA + model.')

print('\nRule: Always use Pipeline — it prevents leakage and simplifies deployment.')

Output

=== Manual Preprocessing (MISTAKE) ===

Accuracy: 88.00%

Problem: easy to forget steps, apply in wrong order, or fit on wrong data.

=== sklearn Pipeline (CORRECT) ===

Accuracy: 88.00%

CV accuracy: 87.20% (+/- 2.14%)

Loaded pipeline accuracy: 88.00%

Deployment: one file contains scaler + PCA + model.

Rule: Always use Pipeline — it prevents leakage and simplifies deployment.

💡Pipeline Automates Correct Preprocessing

sklearn Pipeline ensures preprocessing is fit on training data only and applied consistently. It prevents data leakage, simplifies cross-validation, and makes deployment trivial — serialize the entire pipeline as one object with joblib. One file, one predict call, zero risk of preprocessing mismatch between training and serving.

📊 Production Insight

Manual preprocessing is the #1 source of data leakage and training-serving skew in production.

Pipeline ensures consistent preprocessing between training and serving — this alone prevents entire categories of production bugs.

Serialize the entire pipeline with joblib — one file contains everything needed for prediction. No separate scaler files, no manual transform steps.

🎯 Key Takeaway

sklearn Pipeline prevents data leakage by chaining preprocessing and model.

Manual preprocessing is error-prone — Pipeline automates the correct order.

Serialize the entire pipeline for deployment — one file, one predict call.
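The claim that a Pipeline simplifies hyperparameter tuning can be sketched with GridSearchCV. Parameters of individual steps are addressed as step name, double underscore, parameter name. The grid values below are arbitrary illustrative choices, and max_iter=1000 is added only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000)),
])

# '<step name>__<parameter>' targets a parameter of one pipeline step
param_grid = {
    'pca__n_components': [5, 10, 15],
    'classifier__C': [0.1, 1.0, 10.0],
}

# Every CV fold refits the scaler and PCA on that fold's training part only
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
test_acc = search.score(X_test, y_test)

print(f'Best params: {search.best_params_}')
print(f'Best CV accuracy: {search.best_score_:.2%}')
print(f'Test accuracy: {test_acc:.2%}')
```

Because preprocessing lives inside the estimator being searched, there is no way for a fold's validation rows to influence the scaler or the PCA components.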

Mistake 12: Not Validating with Domain Experts

Technical metrics do not guarantee business value. A model can achieve high accuracy while making predictions that are nonsensical to domain experts. Feature importance can reveal that the model relies on features that should not predict the target. Clusters can be statistically valid but business-meaningless. Always validate model outputs with domain experts before deployment — they catch errors that metrics miss. This is not a technical step, it is a process step, and skipping it is one of the most expensive mistakes in ML. A model that makes technically correct but domain-inappropriate predictions will erode stakeholder trust faster than a model that makes honest errors.

mistake12_no_domain_validation.pyPYTHON

# TheCodeForge — Mistake 12: No Domain Expert Validation
# Example: A model predicts house prices using zip code as a numeric feature
# The model achieves high R-squared but makes nonsensical predictions

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Simulated data: house prices
np.random.seed(42)
n_samples = 500
zip_code = np.random.randint(10000, 99999, n_samples)
sqft = np.random.randint(500, 5000, n_samples)
price = sqft * 150 + np.random.randn(n_samples) * 10000

X = np.column_stack([zip_code, sqft])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print('=== Technical Metrics Look Good ===')
print(f'R-squared: {r2_score(y_test, predictions):.2%}')
print(f'MAE:       ${mean_absolute_error(y_test, predictions):,.0f}')
print(f'\nModel coefficients:')
print(f'  zip_code: {model.coef_[0]:.4f} per unit')
print(f'  sqft:     {model.coef_[1]:.2f} per unit')

print(f'\n=== Domain Expert Would Catch This ===')
print(f'The model treats zip_code as a continuous number.')
print(f'Zip code 99998 is not "worth more" than zip code 10001.')
print(f'This is nonsensical — zip code is categorical, not numeric.')
print(f'Fix: one-hot encode zip_code or use target encoding.')

# What a domain expert review should include:
print(f'\n=== Domain Expert Review Checklist ===')
print(f'1. Are feature types correct? (categorical vs numeric)')
print(f'2. Do feature importances make domain sense?')
print(f'3. Do sample predictions pass the sanity test?')
print(f'4. Are there edge cases the model handles incorrectly?')
print(f'5. Would you trust this prediction if it were your money?')

print('\nRule: Always validate predictions with domain experts before deployment.')

Output

=== Technical Metrics Look Good ===

R-squared: 95.42%

MAE: $8,234

Model coefficients:

zip_code: 0.0234 per unit

sqft: 150.03 per unit

=== Domain Expert Would Catch This ===

The model treats zip_code as a continuous number.

Zip code 99998 is not "worth more" than zip code 10001.

This is nonsensical — zip code is categorical, not numeric.

Fix: one-hot encode zip_code or use target encoding.

=== Domain Expert Review Checklist ===

1. Are feature types correct? (categorical vs numeric)

2. Do feature importances make domain sense?

3. Do sample predictions pass the sanity test?

4. Are there edge cases the model handles incorrectly?

5. Would you trust this prediction if it were your money?

Rule: Always validate predictions with domain experts before deployment.

🔥Metrics Do Not Guarantee Business Value

High accuracy does not mean the model is correct. Domain experts catch issues that metrics miss: nonsensical feature usage, predictions that violate business rules, and edge cases that training data did not cover. Always validate with humans before deploying. A 30-minute review with a domain expert can save months of production debugging.

📊 Production Insight

Technical metrics do not guarantee business value — domain experts catch what metrics miss.

Feature importance review with domain experts prevents nonsensical predictions and catches encoding mistakes.

Always run a validation step with stakeholders before production deployment — show them sample predictions and ask if they make sense.

🎯 Key Takeaway

Technical metrics do not guarantee business value — domain experts catch what metrics miss.

Validate feature importance and predictions with domain experts before deployment.

A model that makes nonsensical predictions is useless regardless of accuracy.
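The one-hot fix named in the example can be sketched as follows. To make the effect visible, this version assumes only five zip codes, each with a hypothetical neighborhood premium that is unrelated to the numeric value of the code; the premiums and the variable names are invented for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)
n_samples = 500
# Hypothetical premiums, deliberately not ordered by zip code value
zip_premium = {10001: 300_000, 30301: 50_000, 60601: 150_000,
               73301: 20_000, 94105: 400_000}
zip_code = rng.choice(list(zip_premium), size=n_samples)
sqft = rng.integers(500, 5000, size=n_samples)
price = (np.array([zip_premium[z] for z in zip_code])
         + sqft * 150 + rng.normal(0, 10_000, n_samples))

X = np.column_stack([zip_code, sqft])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Zip code treated as a number (the mistake)
numeric_model = LinearRegression().fit(X_train, y_train)
r2_numeric = r2_score(y_test, numeric_model.predict(X_test))

# Zip code treated as a category: one-hot encode column 0, pass sqft through
categorical_model = Pipeline([
    ('encode', ColumnTransformer(
        [('zip', OneHotEncoder(handle_unknown='ignore'), [0])],
        remainder='passthrough',
        sparse_threshold=0.0,  # force a dense matrix for simplicity
    )),
    ('regressor', LinearRegression()),
])
categorical_model.fit(X_train, y_train)
r2_categorical = r2_score(y_test, categorical_model.predict(X_test))

print(f'R-squared, zip as number:   {r2_numeric:.3f}')
print(f'R-squared, zip as category: {r2_categorical:.3f}')
```

With thousands of distinct zip codes, plain one-hot encoding becomes unwieldy, which is where the target encoding alternative mentioned above comes in.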

Mistake 13: Not Understanding Your Loss Function's Real-World Impact

You picked cross-entropy because the tutorial said so. Fine for a lab. In production, optimizing cross-entropy doesn't mean you're optimizing revenue, safety, or user retention. Loss functions are proxies. They approximate business value. But they're not the real thing.

If you're building a fraud detection system, false negatives cost you $500 each. False positives cost $50 in customer friction. A naive cross-entropy loss treats every misclassification equally. Your model will happily let one fraud slip through to avoid blocking a single legitimate transaction, even though the miss costs ten times more, because the loss surface doesn't encode that asymmetry.

You need to encode business costs directly into your loss function. Weighted losses, custom objectives, or post-hoc threshold tuning. Start by calculating the actual dollar cost of each error type. Then ask: does my loss function minimize that cost? If not, you're optimizing for a math problem, not a business problem.

custom_weighted_loss.pyPYTHON

# io.thecodeforge
import torch
import torch.nn as nn

class WeightedBinaryCrossEntropy(nn.Module):
    def __init__(self, fp_cost=50.0, fn_cost=500.0):
        super().__init__()
        # fp_cost: false positive cost to business
        # fn_cost: false negative cost to business
        # pos_weight scales the loss on positive (fraud) examples, so a missed
        # fraud is penalized fn_cost / fp_cost times more than a false alarm
        pos_weight = torch.tensor([fn_cost / fp_cost])
        self.loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    def forward(self, logits, targets):
        # targets must be a float tensor with the same shape as logits
        return self.loss(logits, targets)

Output

Model learns to penalize false negatives 10x more than false positives.

⚠ Production Trap:

Don't tune weights on test data. Use a validation set that mirrors the real cost distribution. Otherwise you'll overfit to the cost ratios you guessed.
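Post-hoc threshold tuning, the third option mentioned above, can be sketched without touching the loss at all. The example reuses the $50 and $500 costs from the text; the synthetic data, the helper name expected_cost, and the threshold grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

FP_COST = 50.0    # cost of blocking a legitimate transaction
FN_COST = 500.0   # cost of missing a fraud

def expected_cost(y_true, proba, threshold):
    """Total dollar cost of the errors made at a given decision threshold."""
    pred = (proba >= threshold).astype(int)
    false_pos = np.sum((pred == 1) & (y_true == 0))
    false_neg = np.sum((pred == 0) & (y_true == 1))
    return FP_COST * false_pos + FN_COST * false_neg

X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_val = model.predict_proba(X_val)[:, 1]

# Scan thresholds on the validation set and keep the cheapest one
thresholds = np.round(np.linspace(0.01, 0.99, 99), 2)
costs = np.array([expected_cost(y_val, proba_val, t) for t in thresholds])
best_threshold = float(thresholds[np.argmin(costs)])
best_cost = float(costs.min())
default_cost = float(expected_cost(y_val, proba_val, 0.5))

print(f'Cost at threshold 0.50: ${default_cost:,.0f}')
print(f'Cost at threshold {best_threshold:.2f}: ${best_cost:,.0f}')
```

The threshold is chosen on a validation split, so the test set is still available for one final, unbiased cost estimate.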

🎯 Key Takeaway

Your loss function is a business equation. Treat it like one.

Mistake 14: Ignoring Training-Serving Skew (It's Not Just Data Drift)

Data drift is when your input distribution changes over time — classic covariate shift. Training-serving skew is different. It's when your preprocessing pipeline doesn't match between training and inference. A classic rookie error: you normalize using global statistics computed on the full training set, but during inference you normalize each sample independently.

Here's the reality: every time you touch data in training, you must reproduce that exact transformation in production. Same tokenizer. Same missing value imputation. Same scaling parameters. One mismatched regex rule and your API silently serves garbage predictions.

The fix? Use sklearn Pipelines. They serialize the entire transformation graph. Deploy the pipeline object, not the model weights. Better yet, export the pipeline as a PMML or ONNX artifact. Test end-to-end: run your training pipeline on a holdout sample, then run your inference pipeline on the same sample. If predictions don't match exactly (within floating point tolerance), you have skew.

skew_detector.pyPYTHON

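# io.thecodeforge
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data; substitute your own training set and incoming requests
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Build pipeline once
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)

# Serialize entire pipeline
joblib.dump(pipe, '20190805_loan_default_pipeline.pkl')

# Inference loads the pipeline, never just the weights
loaded_pipe = joblib.load('20190805_loan_default_pipeline.pkl')
preds = loaded_pipe.predict(X_new)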

Output

scaler + model are atomic. No mismatch possible.

⚠ Production Trap:

If you can't replay a prediction from logs and reproduce exactly what the model saw, you have undetected skew. Log the preprocessed features, not the raw inputs.
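The end-to-end check described above can be sketched as a small parity test: score a holdout sample with the in-memory training pipeline, score it again with the serialized artifact, and compare. The synthetic data and the temporary file path are assumptions; the last part deliberately reproduces the per-sample normalization error to show what a failed check looks like.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_holdout, y_train, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipe.fit(X_train, y_train)
train_side = pipe.predict_proba(X_holdout)

# Serving path: load the serialized artifact, as an inference service would
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline.joblib')
    joblib.dump(pipe, path)
    served = joblib.load(path)
serve_side = served.predict_proba(X_holdout)
parity_ok = np.allclose(train_side, serve_side)

# Deliberately skewed path: each sample normalized by its own mean and std,
# then fed straight to the model, bypassing the fitted scaler
row_mean = X_holdout.mean(axis=1, keepdims=True)
row_std = X_holdout.std(axis=1, keepdims=True)
skewed_side = pipe.named_steps['model'].predict_proba((X_holdout - row_mean) / row_std)
skew_detected = not np.allclose(train_side, skewed_side)

print(f'Pipeline parity holds:       {parity_ok}')
print(f'Per-sample scaling detected: {skew_detected}')
```

A check like this fits naturally into CI, so a preprocessing change that breaks parity fails the build instead of reaching production.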

🎯 Key Takeaway

Train → serve = one pipeline. Never separate preprocessing from model prediction.

Mistake 15: Not Having a Rollback Strategy

Every production ML system will fail. Models degrade. Data pipelines break. The question isn't if, but when. Beginners deploy a model and call it done. Veterans deploy a model and plan its funeral.

Before you push a model to production, answer these: How do I detect a silent failure? What's my trigger to rollback? How long does it take to revert? Can I serve the previous model version while the new one fails?

Shadow deployment is your safety net. Run the new model in parallel with the current production model. Compare outputs. Only if the new model's predictions are within acceptable divergence thresholds do you cut over. And always keep the previous artifact accessible for instant rollback.

The hardest part isn't building the model. It's knowing when to kill it.
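The shadow comparison described above can be sketched as a small function that takes the scores both models produced for the same requests. The 0.05 threshold mirrors the value used in this section's deployment script; the function name and the report fields are assumptions for illustration.

```python
import numpy as np


def shadow_report(prod_scores, shadow_scores, max_abs_diff=0.05):
    """Compare production and shadow scores for the same requests."""
    prod = np.asarray(prod_scores, dtype=float)
    shadow = np.asarray(shadow_scores, dtype=float)
    if prod.shape != shadow.shape or prod.size == 0:
        raise ValueError("need two non-empty score arrays of equal shape")
    diffs = np.abs(prod - shadow)
    worst = float(diffs.max())
    return {
        "max_abs_diff": worst,
        "mean_abs_diff": float(diffs.mean()),
        "promote": worst < max_abs_diff,  # cut over only if every diff is small
    }


close = shadow_report([0.10, 0.80, 0.35], [0.12, 0.79, 0.33])
diverged = shadow_report([0.10, 0.80, 0.35], [0.10, 0.60, 0.35])
print(close)
print(diverged)
```

Using the maximum difference is strict: a single outlier request blocks promotion. A percentile of the differences would be a more forgiving alternative.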

deploy_with_shadow.sh · BASH

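# io.thecodeforge
# Step 1: run v2 as a shadow container alongside v1.
# kubectl has no --shadow flag; mirroring requests to model-v2 is assumed
# to be configured in the serving layer or service mesh.
kubectl set image deployment/model-serving \
  model-v1=myregistry/model:v1 \
  model-v2=myregistry/model:v2

# Step 2: wait 30 minutes, then collect prediction drift from the shadow container.
# --tail=-1 is needed because kubectl logs returns only 10 lines per pod with -l.
sleep 1800
kubectl logs -l app=model-serving -c model-v2 --tail=-1 | \
  grep 'prediction_diff' > /tmp/shadow_drift.csv

# Step 3: if max absolute difference < 0.05, promote
# (assumes the first comma-separated field of each line is the numeric difference)
if [ -s /tmp/shadow_drift.csv ] && \
   awk -F',' '{d = ($1 < 0) ? -$1 : $1; if (d >= 0.05) exit 1}' /tmp/shadow_drift.csv; then
  kubectl set image deployment/model-serving \
    model-v1=myregistry/model:v2
fi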

Output

Rollback is one command: kubectl rollout undo deployment/model-serving

⚠ Production Trap:

Don't wait for a pager to learn your rollback process. Test it weekly during low traffic. Automate the canary → promote → rollback logic in CI/CD.

🎯 Key Takeaway

The best model is the one you can revert in under 60 seconds.

● Production incident · POST-MORTEM · severity: high

Fraud Detection Model Reports 99.5% Accuracy — Catches Zero Fraud

Symptom

Model accuracy was 99.5% on the test set. After deployment, recall for the fraud class was 0%: no fraudulent transactions were flagged in 30 days. The business lost $2.3M to undetected fraud.

Assumption

The team assumed 99.5% accuracy meant the model was excellent. They did not check per-class metrics. They did not understand that accuracy is misleading on imbalanced datasets.

Root cause

The dataset had 99.5% legitimate and 0.5% fraudulent transactions. The model learned to always predict 'legitimate' and achieved 99.5% accuracy by never detecting fraud. This is the majority class bias problem — accuracy is meaningless on imbalanced data. The team needed to use precision, recall, F1-score, and AUC-ROC instead of accuracy.
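The arithmetic behind this failure is easy to reproduce. The sketch below builds synthetic labels with the same 99.5% / 0.5% split (10,000 transactions, 50 of them fraud, both numbers chosen for illustration) and scores a model that always predicts "legitimate".

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# 10,000 synthetic transactions: 50 fraud (1), 9,950 legitimate (0)
y_true = np.zeros(10_000, dtype=int)
y_true[:50] = 1

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred, zero_division=0)
# Precision is 0/0 here because nothing is flagged; zero_division=0 reports it as 0
fraud_precision = precision_score(y_true, y_pred, zero_division=0)
cm = confusion_matrix(y_true, y_pred)

print(f"accuracy:        {accuracy:.3f}")   # 0.995
print(f"fraud recall:    {fraud_recall:.3f}")  # 0.000
print(f"fraud precision: {fraud_precision:.3f}")
print(cm)  # rows = actual class, columns = predicted class
```

Every prediction lands in the first column of the confusion matrix, which is the visual signature of a model that never flags the minority class.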

Fix

1. Replaced accuracy with F1-score and AUC-ROC as primary metrics
2. Applied SMOTE oversampling to balance the training set
3. Used class_weight='balanced' in the classifier
4. Set a decision threshold based on business cost of false negatives vs false positives
5. Added per-class metrics monitoring in production dashboards
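Steps 3 and 4 of the fix can be sketched together: train with `class_weight='balanced'`, then pick the decision threshold that minimizes total cost on a validation set. The cost figures, the synthetic data, and the `total_cost` helper are hypothetical; real costs would come from the business. SMOTE is left out because it needs the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 3% positives (assumption)
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
fraud_scores = clf.predict_proba(X_val)[:, 1]

COST_FALSE_NEGATIVE = 500.0  # hypothetical loss per missed fraud
COST_FALSE_POSITIVE = 5.0    # hypothetical cost per needless review


def total_cost(y_true, scores, threshold):
    """Total business cost of the errors made at a given threshold."""
    flagged = scores >= threshold
    false_negatives = np.sum((y_true == 1) & ~flagged)
    false_positives = np.sum((y_true == 0) & flagged)
    return COST_FALSE_NEGATIVE * false_negatives + COST_FALSE_POSITIVE * false_positives


thresholds = np.linspace(0.01, 0.99, 99)
costs = np.array([total_cost(y_val, fraud_scores, t) for t in thresholds])
best_threshold = float(thresholds[np.argmin(costs)])
print(f"cheapest threshold on validation data: {best_threshold:.2f}")
```

The threshold is chosen on validation data here so that the test set stays untouched for the final evaluation.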

Key lesson

Accuracy is meaningless on imbalanced datasets — always check per-class metrics
A model that always predicts the majority class achieves high accuracy but zero value
Use F1-score, precision, recall, and AUC-ROC for imbalanced classification

Production debug guide · Symptom to action mapping for common beginner mistakes · 6 entries

Symptom · 01

Training accuracy is 99% but test accuracy is 60%

→

Fix

Overfitting detected. Reduce model complexity (fewer layers/trees), add regularization (L1/L2), increase training data, or use dropout for neural networks. Plot learning curves to confirm — if the training curve is flat at 99% and the validation curve plateaus far below, the model is memorizing.
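A learning-curve check along these lines can be sketched with `sklearn.model_selection.learning_curve`. The unconstrained decision tree and the noisy synthetic data are assumptions chosen to make the gap visible; the printed numbers are what one would plot.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y mislabels about 10% of samples) so memorization shows up
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),  # no depth limit: free to memorize
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mean, val_mean):
    print(f"n={int(n):4d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```

If the training column stays near its ceiling while the validation column levels off well below it, adding regularization (for example `max_depth`) is the next thing to try.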

Symptom · 02

Test accuracy is suspiciously high (99%+ on first try)

→

Fix

Possible data leakage. Check if test data leaked into training via preprocessing, feature engineering, or temporal ordering. Re-split data BEFORE any transformations. Inspect feature importance — a single dominant feature often indicates target leakage.
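The effect of transforming before splitting can be demonstrated on pure noise. This sketch uses feature selection as the leaking step (an assumption chosen because the inflation is large and easy to see): the labels are random, so any cross-validated accuracy well above 50% comes from leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # pure noise features
y = rng.integers(0, 2, size=100)  # random labels: there is nothing to learn

# Leaky: pick the "best" features using ALL labels, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: selection lives inside the Pipeline, so it is refit on each training fold
honest_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(honest_pipe, X, y, cv=5).mean()

print(f"leaky estimate:  {leaky_acc:.2f}")
print(f"honest estimate: {honest_acc:.2f}")
```

The honest estimate should hover around chance, while the leaky one looks like a strong model that does not exist.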

Symptom · 03

Accuracy is 95% but model predicts the same class for everything

→

Fix

Imbalanced dataset. Check class distribution with np.unique(y, return_counts=True). Replace accuracy with F1-score, precision, recall. Apply class_weight='balanced' or use SMOTE oversampling. Print the confusion matrix — it will show all predictions in one column.

Symptom · 04

Model performs well locally but fails in production

→

Fix

Training-serving skew. Check if preprocessing steps differ between training and production. Verify feature distributions match using statistical tests (KS test, PSI). Retrain on recent data and use sklearn Pipeline to guarantee consistent transforms.
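The two distribution checks named above can be sketched for a single numeric feature. `ks_2samp` comes from SciPy; the `psi` function is a hand-written version of the Population Stability Index with quantile bins, and the data and bin count are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index, binned on quantiles of the expected sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges taken from the training (expected) distribution
    inner_edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_frac = np.bincount(np.searchsorted(inner_edges, expected), minlength=bins) / expected.size
    a_frac = np.bincount(np.searchsorted(inner_edges, actual), minlength=bins) / actual.size
    # Avoid log(0) and division by zero for empty bins
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)
prod_same = rng.normal(0.0, 1.0, 5000)     # production data, no drift
prod_shifted = rng.normal(1.0, 1.0, 5000)  # production data, mean shifted by 1

psi_same = psi(train_feature, prod_same)
psi_shifted = psi(train_feature, prod_shifted)
ks_same = ks_2samp(train_feature, prod_same)
ks_shifted = ks_2samp(train_feature, prod_shifted)

print(f"PSI  no drift: {psi_same:.3f}   shifted: {psi_shifted:.3f}")
print(f"KS   no drift: {ks_same.statistic:.3f}   shifted: {ks_shifted.statistic:.3f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the cutoff is a convention and should be tuned per feature.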

Symptom · 05

Model accuracy changes significantly between runs

→

Fix

Unstable validation. Use cross-validation instead of a single train-test split. Set random_state for reproducibility in train_test_split, model constructors, and any sampling steps. Increase test set size if the dataset is small.
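A minimal sketch of a reproducible evaluation, assuming a random forest and synthetic data: the same seed is passed to the data generator, the fold splitter, and the model, so two runs give identical fold scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 42  # a single seed reused for every source of randomness

X, y = make_classification(n_samples=400, n_features=10, random_state=SEED)


def evaluate(seed):
    """Run 5-fold cross-validation with the splitter and the model both seeded."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy")


first_run = evaluate(SEED)
second_run = evaluate(SEED)

print(f"accuracy: {first_run.mean():.3f} +/- {first_run.std():.3f}")
print("identical across runs:", bool((first_run == second_run).all()))
```

Reporting the standard deviation across folds alongside the mean shows how much of a run-to-run difference is just split noise.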

Symptom · 06

Feature importance shows one feature dominates everything

→

Fix

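Possible target leakage. Check if the feature is derived from the target variable or recorded only after the outcome is known. Remove features that would not be available at prediction time. Retrain without the feature and compare: if accuracy falls from near-perfect to a plausible level, the feature was probably leaking. A legitimately strong predictor also causes a drop, so confirm where the feature comes from before discarding it.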
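The retrain-and-compare check can be sketched by planting a leak on purpose. The leaky column below is the label plus a little noise, a hypothetical stand-in for something like a field that is filled in only after the outcome is known.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    class_sep=0.5, flip_y=0.1, random_state=0,
)

# Hypothetical leaky feature: computed from the label itself
leak = y + rng.normal(0.0, 0.05, size=len(y))
X_leaky = np.column_stack([X, leak])
LEAK_IDX = X_leaky.shape[1] - 1


def fit_and_score(features):
    """Train on a stratified split and return the model and its test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.3, stratify=y, random_state=0
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)


model_with, acc_with = fit_and_score(X_leaky)
model_without, acc_without = fit_and_score(X)
top_feature = int(np.argmax(model_with.feature_importances_))

print("most important feature index:", top_feature, "(leak is", LEAK_IDX, ")")
print(f"accuracy with leak: {acc_with:.3f}   without: {acc_without:.3f}")
```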

★ ML Mistake Quick Diagnostics · Immediate checks to detect common ML mistakes

Need to check for overfitting

Immediate action

Compare training and test accuracy

Commands

Run these in the Python session or notebook where model, X_train, y_train, X_test, and y_test are already defined (a standalone python -c process would not have them):

from sklearn.metrics import accuracy_score; print('Train acc:', accuracy_score(y_train, model.predict(X_train))); print('Test acc:', accuracy_score(y_test, model.predict(X_test)))

train_acc = model.score(X_train, y_train); test_acc = model.score(X_test, y_test); gap = train_acc - test_acc; print(f'Gap: {gap:.2%}'); print('Overfitting' if gap > 0.10 else 'OK')

Fix now

If gap > 10%, reduce model complexity or add regularization

Need to check for class imbalance

Need to check for data leakage

ML Mistakes — Impact and Fix Summary

Mistake	Category	Symptom	Impact	Fix
Overfitting	Model	Train acc >> Test acc	Model fails on new data	Reduce complexity, add regularization
Data Leakage	Data	Suspiciously high accuracy	False confidence, production failure	Split before preprocessing, use Pipeline
Wrong Metrics	Evaluation	High accuracy, no business value	Stakeholder trust loss	Use F1, precision, recall, AUC-ROC
No Cross-Validation	Evaluation	Accuracy varies between runs	Unreliable performance estimate	Use cross_val_score with cv=5
No Feature Scaling	Preprocessing	Poor convergence, biased distances	Degraded model performance	Scale for distance/gradient algorithms
No Baseline	Evaluation	Model looks good but beats nothing	Wasted engineering effort	Compare against DummyClassifier
Tuning on Test Data	Validation	Inflated test accuracy	Data leakage, false confidence	Use GridSearchCV on training data
No Feature Importance	Interpretability	Do not understand predictions	Missed leakage, noise features	Inspect feature_importances_
Distribution Shift	Production	Performance degrades over time	Silent model failure	Monitor distributions, retrain periodically
Wrong Loss Function	Training	Model optimizes wrong objective	Suboptimal predictions	Match loss to problem type and data quality
No Pipeline	Code Quality	Preprocessing errors, leakage	Inconsistent train/serving	Use sklearn Pipeline
No Domain Validation	Process	Nonsensical predictions	Business value loss	Validate with domain experts before deploy
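The "No Baseline" and "No Cross-Validation" rows combine naturally into one check: score a `DummyClassifier` and the real model with the same `cv=5` protocol and look at the difference. The logistic regression and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Baseline: always predict the most frequent class
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5
)
# Candidate model, evaluated with the same folds protocol
model_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

lift = model_scores.mean() - baseline_scores.mean()
print(f"baseline: {baseline_scores.mean():.3f}")
print(f"model:    {model_scores.mean():.3f}")
print(f"lift over baseline: {lift:.3f}")
```

A lift close to zero suggests the features carry little signal, in which case the data deserves attention before the model does.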

⚙ Quick Reference

15 commands from this guide

File	Command / Code	Purpose
mistake01_overfitting.py	from sklearn.datasets import make_classification	Mistake 1: Overfitting
mistake02_data_leakage.py	from sklearn.datasets import make_classification	Mistake 2: Data Leakage
mistake03_wrong_metrics.py	from sklearn.datasets import make_classification	Mistake 3
mistake04_no_cross_validation.py	from sklearn.datasets import make_classification	Mistake 4
mistake05_no_scaling.py	from sklearn.datasets import make_classification	Mistake 5
mistake06_no_baseline.py	from sklearn.datasets import make_classification	Mistake 6
mistake07_hyperparameter_leakage.py	from sklearn.datasets import load_iris	Mistake 7
mistake08_feature_importance.py	from sklearn.datasets import make_classification	Mistake 8
mistake09_distribution_shift.py	from sklearn.linear_model import LogisticRegression	Mistake 9
mistake10_wrong_loss.py	from sklearn.datasets import make_regression	Mistake 10
mistake11_no_pipeline.py	from sklearn.datasets import make_classification	Mistake 11
mistake12_no_domain_validation.py	from sklearn.linear_model import LinearRegression	Mistake 12
custom_weighted_loss.py	class WeightedBinaryCrossEntropy(nn.Module):	Mistake 13
skew_detector.py	from sklearn.pipeline import Pipeline	Mistake 14
deploy_with_shadow.sh	kubectl set image deployment/model-serving \	Mistake 15

Key takeaways

Overfitting is the #1 mistake: always compare train vs test accuracy to detect it

Data leakage silently inflates metrics: always split BEFORE preprocessing and use sklearn Pipeline

Accuracy is meaningless on imbalanced datasets: use F1-score, precision, recall, AUC-ROC

Cross-validation gives reliable performance estimates: single train-test splits are noisy and misleading

Always establish a baseline before training complex models: if you cannot beat DummyClassifier, fix the data

Feature importance reveals leakage and noise: inspect it after every training run

Monitor production data distributions: models degrade silently as data drifts over time

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

What is the difference between overfitting and underfitting, and how do ...

Q02 · SENIOR

Explain data leakage with a real-world example and how to prevent it.

Q03 · SENIOR

Why is accuracy a bad metric for imbalanced datasets, and what should yo...

Q04 · SENIOR

How do you design a robust ML evaluation pipeline that prevents all comm...

Q01 of 04 · JUNIOR

What is the difference between overfitting and underfitting, and how do you detect each?

ANSWER

Overfitting occurs when a model learns training data too well — including noise — and fails on new data. The symptom is high training accuracy with low test accuracy (large gap). Underfitting occurs when a model is too simple to capture the underlying pattern. The symptom is low accuracy on both training and test data. Detection: compare training and test accuracy. If train >> test, overfitting. If both are low, underfitting. If both are high and similar, good fit. Fix overfitting by reducing complexity, adding regularization, or getting more data. Fix underfitting by increasing model complexity or adding more informative features.
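The detection rule in this answer can be written down as a small function. The 10-point gap follows the rule of thumb used in this guide; the `min_acceptable` floor for "low accuracy" is a hypothetical value that depends entirely on the problem and its baseline.

```python
def diagnose_fit(train_acc, test_acc, max_gap=0.10, min_acceptable=0.70):
    """Classify a train/test accuracy pair as overfitting, underfitting, or good fit."""
    if train_acc - test_acc > max_gap:
        return "overfitting"   # train >> test
    if train_acc < min_acceptable:
        return "underfitting"  # low on training data, and test is not far off
    return "good fit"          # both reasonably high and similar


for train_acc, test_acc in [(0.99, 0.80), (0.62, 0.60), (0.91, 0.89)]:
    print(f"train={train_acc:.2f} test={test_acc:.2f} -> {diagnose_fit(train_acc, test_acc)}")
```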

FAQ · 7 QUESTIONS

Frequently Asked Questions

How do I know if my model is overfitting?

What is the simplest way to prevent data leakage?

Should I always use cross-validation instead of a single train-test split?

How do I handle imbalanced datasets without collecting more data?

How often should I retrain my production model?

What is the difference between a validation set and a test set?

How do I know if a feature is causing data leakage?

Naren · Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Lessons pulled from things that broke in production.


Last updated: July 18, 2026
