Common Machine Learning Mistakes Beginners Make (And How to Fix Them)
- Overfitting is the #1 beginner mistake — the model memorizes training data; always compare train vs test accuracy to detect it
- Data leakage silently inflates metrics by exposing test information during training — always split BEFORE preprocessing and use a sklearn Pipeline
- Accuracy is misleading on imbalanced datasets — 95% accuracy can mean the model learned nothing; use F1-score, precision, recall, or AUC-ROC
- Cross-validation is more reliable than a single train-test split for performance estimation
- Reporting training accuracy instead of test accuracy tells you nothing about production performance
Production Debug Guide — symptom-to-action mapping for common beginner mistakes. These quick checks assume a live session (model, X_train, X_test, y_train, y_test already defined); paste them into your training script or notebook.

Need to check for overfitting:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    gap = train_acc - test_acc
    print(f'Train acc: {train_acc:.2%}, Test acc: {test_acc:.2%}, Gap: {gap:.2%}')
    print('Overfitting' if gap > 0.10 else 'OK')

Need to check for class imbalance:
    import numpy as np
    unique, counts = np.unique(y_train, return_counts=True)
    ratios = counts / counts.sum()
    print('Class ratios:', dict(zip(unique, ratios.round(3))))

Need to check for data leakage:
    # WRONG: scaler.fit(X) then train_test_split
    # RIGHT: train_test_split then scaler.fit(X_train)
    # For time-series: split chronologically, not randomly
Most ML failures in production are not caused by algorithm limitations — they are caused by preventable mistakes in data handling, evaluation, and validation. Data leakage silently inflates metrics. Overfitting creates models that memorize rather than generalize. Wrong metrics hide poor performance behind impressive-sounding numbers. These mistakes are invisible during development and catastrophic in production. After reviewing hundreds of beginner projects and debugging dozens of production pipelines, the same twelve mistakes appear over and over. This guide documents each one with concrete symptoms, root causes, and fixes you can apply today.
Mistake 1: Overfitting — Model Memorizes Instead of Learning
Overfitting occurs when a model learns the training data too well — including noise and outliers — and fails to generalize to new data. The symptom is a large gap between training accuracy (high) and test accuracy (low). Common causes include model complexity that exceeds data volume, training for too many epochs, and lack of regularization. Overfitting is the most common mistake because it is invisible during training — the model looks great until you evaluate on unseen data. In practice, every model overfits to some degree. The question is whether the gap is small enough to tolerate. A 2-3% gap is normal. A 15%+ gap means the model has memorized training samples and will fail on anything it has not seen before.
# TheCodeForge — Mistake 1: Overfitting Detection and Fix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Generate data
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# MISTAKE: Unrestricted Decision Tree (overfits)
dt_overfit = DecisionTreeClassifier(random_state=42)
dt_overfit.fit(X_train, y_train)
train_acc_overfit = accuracy_score(y_train, dt_overfit.predict(X_train))
test_acc_overfit = accuracy_score(y_test, dt_overfit.predict(X_test))
print('=== Overfitting Example ===')
print(f'Decision Tree (unrestricted)')
print(f'  Train accuracy: {train_acc_overfit:.2%}')
print(f'  Test accuracy: {test_acc_overfit:.2%}')
print(f'  Gap: {train_acc_overfit - test_acc_overfit:.2%}')

# FIX 1: Restrict tree depth (regularization)
dt_fixed = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=42)
dt_fixed.fit(X_train, y_train)
train_acc_fixed = accuracy_score(y_train, dt_fixed.predict(X_train))
test_acc_fixed = accuracy_score(y_test, dt_fixed.predict(X_test))
print(f'\nDecision Tree (max_depth=5, min_samples_leaf=10)')
print(f'  Train accuracy: {train_acc_fixed:.2%}')
print(f'  Test accuracy: {test_acc_fixed:.2%}')
print(f'  Gap: {train_acc_fixed - test_acc_fixed:.2%}')

# FIX 2: Use ensemble method (Random Forest)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
train_acc_rf = accuracy_score(y_train, rf.predict(X_train))
test_acc_rf = accuracy_score(y_test, rf.predict(X_test))
print(f'\nRandom Forest (100 trees)')
print(f'  Train accuracy: {train_acc_rf:.2%}')
print(f'  Test accuracy: {test_acc_rf:.2%}')
print(f'  Gap: {train_acc_rf - test_acc_rf:.2%}')
Decision Tree (unrestricted)
Train accuracy: 100.00%
Test accuracy: 82.50%
Gap: 17.50%
Decision Tree (max_depth=5, min_samples_leaf=10)
Train accuracy: 93.75%
Test accuracy: 85.00%
Gap: 8.75%
Random Forest (100 trees)
Train accuracy: 100.00%
Test accuracy: 90.00%
Gap: 10.00%
- Training accuracy high + test accuracy low = overfitting
- Reduce complexity: max_depth, min_samples_leaf, fewer neurons
- Add regularization: L1, L2, dropout
- Get more data — the most reliable fix for overfitting
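The takeaway list mentions L1/L2 regularization. As a minimal sketch (not from the original guide), LogisticRegression applies an L2 penalty whose strength is controlled by C — smaller C means stronger regularization and smaller coefficients, which typically narrows the train-test gap:

```python
# Sketch: shrinking C strengthens LogisticRegression's L2 penalty
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np

# Many features, few informative ones — a setup prone to overfitting
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for C in [100.0, 1.0, 0.01]:  # smaller C = stronger penalty
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    gap = model.score(X_train, y_train) - model.score(X_test, y_test)
    print(f'C={C:>6}: train-test gap = {gap:.2%}, '
          f'mean |coef| = {np.abs(model.coef_).mean():.3f}')
```

Stronger regularization shrinks the coefficients toward zero, trading a little training accuracy for better generalization.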
Mistake 2: Data Leakage — Test Data Sneaking into Training
Data leakage occurs when information from the test set influences the training process. This inflates performance metrics and creates false confidence. Common causes include fitting preprocessing on the full dataset before splitting, using future information in time-series problems, and including features derived from the target variable. Data leakage is the most dangerous mistake because it produces models that look great in development and fail completely in production. The insidious part is that leaked models can still pass code review — the code looks correct, the metrics look great, and nobody suspects a problem until the model is deployed and the business starts losing money.
# TheCodeForge — Mistake 2: Data Leakage Detection and Fix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# MISTAKE: Fit scaler on ALL data before splitting (data leakage)
scaler_leaky = StandardScaler()
X_scaled_leaky = scaler_leaky.fit_transform(X)  # LEAKAGE: saw test data stats
X_train_leaky, X_test_leaky, y_train, y_test = train_test_split(
    X_scaled_leaky, y, test_size=0.2, random_state=42
)
model_leaky = LogisticRegression(random_state=42)
model_leaky.fit(X_train_leaky, y_train)
acc_leaky = accuracy_score(y_test, model_leaky.predict(X_test_leaky))
print('=== Data Leakage Example ===')
print(f'MISTAKE: Scaler fit on all data')
print(f'  Test accuracy: {acc_leaky:.2%}')
print(f'  (Inflated — scaler saw test data statistics)')

# CORRECT: Split first, then fit scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler_correct = StandardScaler()
X_train_scaled = scaler_correct.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler_correct.transform(X_test)        # transform test
model_correct = LogisticRegression(random_state=42)
model_correct.fit(X_train_scaled, y_train)
acc_correct = accuracy_score(y_test, model_correct.predict(X_test_scaled))
print(f'\nCORRECT: Scaler fit on training data only')
print(f'  Test accuracy: {acc_correct:.2%}')
print(f'  (Honest — no leakage)')

# BEST: Use Pipeline to enforce correct order automatically
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)
acc_pipeline = accuracy_score(y_test, pipeline.predict(X_test))
print(f'\nBEST: Pipeline (leakage-proof by design)')
print(f'  Test accuracy: {acc_pipeline:.2%}')
print(f'\nDifference (leaky vs honest): {abs(acc_leaky - acc_correct):.2%}')
MISTAKE: Scaler fit on all data
Test accuracy: 89.00%
(Inflated — scaler saw test data statistics)
CORRECT: Scaler fit on training data only
Test accuracy: 88.00%
(Honest — no leakage)
BEST: Pipeline (leakage-proof by design)
Test accuracy: 88.00%
Difference (leaky vs honest): 1.00%
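For the time-series leakage mentioned above, sklearn's TimeSeriesSplit produces chronological folds in which training data always precedes test data — a small sketch on toy indices (not from the original code):

```python
# Sketch: chronological splits with TimeSeriesSplit avoid training on the future
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # samples already ordered by time
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f'Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}')
    # every test index comes strictly after every train index
    assert train_idx.max() < test_idx.min()
```

A random train_test_split on time-series data would mix future rows into training — exactly the temporal leakage this section warns about.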
Mistake 3: Using Accuracy on Imbalanced Datasets
Accuracy measures the percentage of correct predictions overall. On imbalanced datasets, this metric is misleading because a model can achieve high accuracy by simply predicting the majority class every time. A fraud detection dataset with 99% legitimate transactions will show 99% accuracy even if the model never catches a single fraud. The model has learned nothing — it just echoes the class distribution. This mistake is especially dangerous because the metric looks impressive in presentations. Stakeholders see 99% and assume the model is production-ready. The confusion matrix tells the real story, and it should be the first thing you check after training any classifier.
# TheCodeForge — Mistake 3: Wrong Metrics for Imbalanced Data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report
)
import numpy as np

# Highly imbalanced dataset: 95% class 0, 5% class 1
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05],
    flip_y=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('Class distribution:')
unique, counts = np.unique(y_train, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f'  Class {cls}: {cnt} samples ({cnt/len(y_train):.1%})')

# MISTAKE: Majority class baseline (always predicts 0)
baseline = DummyClassifier(strategy='most_frequent', random_state=42)
baseline.fit(X_train, y_train)
y_pred_baseline = baseline.predict(X_test)
print(f'\n=== Baseline: Always Predict Majority ===')
print(f'Accuracy: {accuracy_score(y_test, y_pred_baseline):.2%} <- looks great')
print(f'F1 (class 1): {f1_score(y_test, y_pred_baseline):.2%} <- model is useless')
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred_baseline)}')

# CORRECT: Use appropriate metrics and handle imbalance
model = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
model.fit(X_train, y_train)
y_pred_model = model.predict(X_test)
print(f'\n=== Random Forest with class_weight=balanced ===')
print(f'Accuracy: {accuracy_score(y_test, y_pred_model):.2%}')
print(f'Precision: {precision_score(y_test, y_pred_model):.2%}')
print(f'Recall: {recall_score(y_test, y_pred_model):.2%}')
print(f'F1-score: {f1_score(y_test, y_pred_model):.2%}')
print(f'\nClassification Report:\n{classification_report(y_test, y_pred_model)}')
Class 0: 1520 samples (95.0%)
Class 1: 80 samples (5.0%)
=== Baseline: Always Predict Majority ===
Accuracy: 95.00% <- looks great
F1 (class 1): 0.00% <- model is useless
Confusion Matrix:
[[380   0]
 [ 20   0]]
=== Random Forest with class_weight=balanced ===
Accuracy: 93.50%
Precision: 62.50%
Recall: 75.00%
F1-score: 68.18%
Classification Report:
              precision    recall  f1-score   support
           0       0.97      0.94      0.96       380
           1       0.62      0.75      0.68        20
    accuracy                           0.94       400
   macro avg       0.80      0.85      0.82       400
weighted avg       0.95      0.94      0.94       400
- Always check the confusion matrix first — it reveals what accuracy hides
- Use F1-score as the primary metric for imbalanced classification
- Apply class_weight='balanced' or use SMOTE oversampling
- AUC-ROC measures discrimination ability independent of threshold
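The AUC-ROC takeaway can be illustrated with roc_auc_score, which needs probability scores rather than hard labels. The sketch below reuses the same imbalanced setup (an assumption, not part of the original code): the majority-class baseline scores exactly 0.50 because it cannot discriminate at any threshold.

```python
# Sketch: AUC-ROC exposes a useless model that accuracy flatters
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], flip_y=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# AUC needs scores/probabilities for the positive class, not predicted labels
auc_dummy = roc_auc_score(y_test, dummy.predict_proba(X_test)[:, 1])
auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(f'Dummy AUC: {auc_dummy:.2f}')  # 0.50: no discrimination at all
print(f'RF AUC:    {auc_rf:.2f}')     # well above 0.50
```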
Mistake 4: Not Using Cross-Validation
A single train-test split gives one performance estimate that depends heavily on which samples land in train versus test. Small datasets are especially vulnerable — a lucky or unlucky split can swing accuracy by 10% or more. Cross-validation splits the data into k folds, trains and evaluates k times, and reports the mean and standard deviation. This gives a reliable performance estimate with confidence bounds. If your cross-validation scores vary wildly across folds, that itself is a signal — it usually means the dataset is too small or the model is unstable.
# TheCodeForge — Mistake 4: Not Using Cross-Validation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# MISTAKE: Single train-test split — result depends on the split
print('=== Single Train-Test Split (Unreliable) ===')
for seed in [42, 7, 99, 123, 256]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f'  random_state={seed:>3}: accuracy={acc:.2%}')
print('  -> Accuracy varies by up to 15% depending on split!')

# CORRECT: Cross-validation — reliable performance estimate
model = DecisionTreeClassifier(max_depth=5, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'\n=== 5-Fold Cross-Validation (Reliable) ===')
print(f'  Fold scores: {scores.round(4)}')
print(f'  Mean accuracy: {scores.mean():.2%}')
print(f'  Std deviation: {scores.std():.2%}')
print(f'  95% CI: {scores.mean():.2%} +/- {scores.std() * 2:.2%}')

# BEST: Stratified K-Fold for imbalanced data
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
print(f'\n=== Stratified 5-Fold (Best for Imbalanced) ===')
print(f'  Fold scores: {scores_strat.round(4)}')
print(f'  Mean accuracy: {scores_strat.mean():.2%}')
print(f'  Std deviation: {scores_strat.std():.2%}')
random_state= 42: accuracy=85.00%
random_state= 7: accuracy=77.50%
random_state= 99: accuracy=82.50%
random_state=123: accuracy=90.00%
random_state=256: accuracy=80.00%
-> Accuracy varies by up to 15% depending on split!
=== 5-Fold Cross-Validation (Reliable) ===
Fold scores: [0.85 0.775 0.825 0.9 0.8 ]
Mean accuracy: 83.00%
Std deviation: 4.24%
95% CI: 83.00% +/- 8.49%
=== Stratified 5-Fold (Best for Imbalanced) ===
Fold scores: [0.85 0.8 0.825 0.875 0.825]
Mean accuracy: 83.50%
Std deviation: 2.50%
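Cross-validation is not limited to one metric. As a small sketch along the same lines as the example above (cross_validate is an assumption here, not in the original code), sklearn's cross_validate reports several scorers per fold at once:

```python
# Sketch: multi-metric cross-validation with cross_validate
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=42)
model = DecisionTreeClassifier(max_depth=5, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One CV run, two metrics — accuracy and F1 — reported per fold
results = cross_validate(model, X, y, cv=cv, scoring=['accuracy', 'f1'])
for metric in ['test_accuracy', 'test_f1']:
    scores = results[metric]
    print(f'{metric}: {scores.mean():.2%} +/- {scores.std():.2%}')
```

This avoids re-running the folds once per metric, and keeps accuracy and F1 computed on exactly the same splits.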
Mistake 5: Not Scaling Features for Distance-Based and Gradient-Based Models
Some algorithms are sensitive to feature scale — features with larger ranges dominate distance calculations or gradient updates. K-Nearest Neighbors, SVM, and neural networks all require scaled features. Decision trees and Random Forests do not, because they split on individual features independently. The fix is straightforward: use StandardScaler (zero mean, unit variance) for most cases, or MinMaxScaler (0-1 range) when you need bounded features. The mistake is not knowing which algorithms need scaling and which do not, and the penalty for getting it wrong can be a 20%+ accuracy drop with zero indication of what went wrong.
# TheCodeForge — Mistake 5: Not Scaling Features
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Create data with very different feature scales
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
# Artificially scale features to different ranges
X[:, 0] *= 1000   # feature 0: range ~[-3000, 3000]
X[:, 1] *= 0.001  # feature 1: range ~[-0.003, 0.003]
# features 2-4: range ~[-3, 3] (original scale)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print('Feature ranges (training set):')
for i in range(X_train.shape[1]):
    print(f'  Feature {i}: [{X_train[:, i].min():.3f}, {X_train[:, i].max():.3f}]')

# KNN WITHOUT scaling (MISTAKE for distance-based models)
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
acc_unscaled = accuracy_score(y_test, knn_unscaled.predict(X_test))

# KNN WITH scaling (CORRECT)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(X_test_scaled))

print(f'\n=== KNN (Distance-Based — Needs Scaling) ===')
print(f'  Without scaling: {acc_unscaled:.2%}')
print(f'  With scaling: {acc_scaled:.2%}')
print(f'  Improvement: {acc_scaled - acc_unscaled:.2%}')

# Decision Tree WITHOUT scaling (scaling not needed)
dt_unscaled = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_unscaled.fit(X_train, y_train)
acc_dt_unscaled = accuracy_score(y_test, dt_unscaled.predict(X_test))
dt_scaled = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_scaled.fit(X_train_scaled, y_train)
acc_dt_scaled = accuracy_score(y_test, dt_scaled.predict(X_test_scaled))
print(f'\n=== Decision Tree (Not Affected by Scaling) ===')
print(f'  Without scaling: {acc_dt_unscaled:.2%}')
print(f'  With scaling: {acc_dt_scaled:.2%}')
print(f'  Difference: {abs(acc_dt_scaled - acc_dt_unscaled):.2%}')
print('\nRule: Scale for KNN, SVM, Neural Networks. Not needed for trees.')
Feature 0: [-3214.120, 2987.445]
Feature 1: [-0.003, 0.003]
Feature 2: [-3.210, 3.445]
Feature 3: [-2.987, 3.112]
Feature 4: [-3.541, 2.876]
=== KNN (Distance-Based — Needs Scaling) ===
Without scaling: 68.00%
With scaling: 88.00%
Improvement: 20.00%
=== Decision Tree (Not Affected by Scaling) ===
Without scaling: 84.00%
With scaling: 84.00%
Difference: 0.00%
Rule: Scale for KNN, SVM, Neural Networks. Not needed for trees.
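The section mentions MinMaxScaler for bounded features, but the example only uses StandardScaler. A tiny sketch contrasting the two (the matrix values are illustrative):

```python
# Sketch: StandardScaler vs MinMaxScaler on two very different columns
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])

X_std = StandardScaler().fit_transform(X)  # each column: zero mean, unit variance
X_mm = MinMaxScaler().fit_transform(X)     # each column mapped into [0, 1]

print('StandardScaler column means:', X_std.mean(axis=0).round(3))  # [0. 0.]
print('MinMaxScaler column mins: ', X_mm.min(axis=0))               # [0. 0.]
print('MinMaxScaler column maxes:', X_mm.max(axis=0))               # [1. 1.]
```

Use StandardScaler by default; reach for MinMaxScaler when a downstream step needs a bounded range (e.g. pixel-style inputs).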
Mistake 6: Not Establishing a Baseline Model
A baseline model is the simplest possible approach to a problem. For classification, predict the majority class. For regression, predict the mean. If your model does not beat the baseline, it has learned nothing useful. Skipping the baseline leads to wasted effort on models that look complex but perform worse than a simple rule. This sounds obvious, but it happens constantly — teams spend weeks tuning a deep learning model only to discover that logistic regression on two features outperforms it. The baseline anchors your expectations and provides a floor that every subsequent model must exceed to justify its existence.
# TheCodeForge — Mistake 6: Ignoring the Baseline Model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# BASELINE 1: Always predict the majority class
baseline_majority = DummyClassifier(strategy='most_frequent', random_state=42)
baseline_majority.fit(X_train, y_train)
acc_majority = accuracy_score(y_test, baseline_majority.predict(X_test))

# BASELINE 2: Random prediction respecting class distribution
baseline_stratified = DummyClassifier(strategy='stratified', random_state=42)
baseline_stratified.fit(X_train, y_train)
acc_stratified = accuracy_score(y_test, baseline_stratified.predict(X_test))

print('=== Baseline Models ===')
print(f'Majority class: accuracy={acc_majority:.2%}')
print(f'Stratified random: accuracy={acc_stratified:.2%}')

# Simple model
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
acc_lr = accuracy_score(y_test, lr.predict(X_test))

# Complex model
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
acc_dt = accuracy_score(y_test, dt.predict(X_test))

print(f'\n=== Your Models ===')
print(f'Logistic Regression: accuracy={acc_lr:.2%}')
print(f'Decision Tree: accuracy={acc_dt:.2%}')

print(f'\n=== Comparison ===')
for name, acc in [('Logistic Regression', acc_lr), ('Decision Tree', acc_dt)]:
    improvement = acc - acc_majority
    if improvement > 0:
        print(f'{name} beats baseline by {improvement:.2%} — worth using.')
    else:
        print(f'{name} does NOT beat baseline — it learned nothing useful.')

print('\nRule: Always compare against a baseline before deploying.')
Majority class: accuracy=70.00%
Stratified random: accuracy=58.00%
=== Your Models ===
Logistic Regression: accuracy=86.00%
Decision Tree: accuracy=84.00%
=== Comparison ===
Logistic Regression beats baseline by 16.00% — worth using.
Decision Tree beats baseline by 14.00% — worth using.
Rule: Always compare against a baseline before deploying.
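For regression, the text says the baseline is predicting the mean — sklearn's DummyRegressor implements exactly that. A minimal sketch (the dataset here is illustrative, not from the original):

```python
# Sketch: mean-prediction baseline for regression with DummyRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

# R^2: the mean baseline scores ~0 by construction; a real model must beat it
print(f'Baseline R^2: {baseline.score(X_test, y_test):.3f}')
print(f'Linear R^2:   {model.score(X_test, y_test):.3f}')
```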
Mistake 7: Tuning Hyperparameters on Test Data
Hyperparameter tuning on test data is a form of data leakage — you are optimizing the model to perform well on specific test samples rather than learning generalizable patterns. The test set must remain completely untouched until final evaluation. Use cross-validation on the training set for hyperparameter tuning, then evaluate the final model once on the test set. This mistake is subtle because the code looks correct — you are training on the training set and evaluating on the test set. But by repeating this loop and selecting the hyperparameters that give the best test score, you are fitting to the test set indirectly. The test set becomes a second training set, and your reported metrics no longer reflect production performance.
# TheCodeForge — Mistake 7: Tuning on Test Data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: Manually tuning on test data
print('=== MISTAKE: Tuning on Test Data ===')
best_acc = 0
best_depth = 0
for depth in range(1, 20):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))  # LEAKAGE
    if acc > best_acc:
        best_acc = acc
        best_depth = depth
print(f'Best depth: {best_depth}, Test accuracy: {best_acc:.2%}')
print('Problem: test data influenced the hyperparameter choice.')
print('The reported accuracy is optimistic — it was selected to look good.')

# CORRECT: GridSearchCV with cross-validation on training data
print('\n=== CORRECT: GridSearchCV on Training Data ===')
param_grid = {'max_depth': range(1, 20)}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_train, y_train)  # only uses training data
print(f'Best depth: {grid_search.best_params_["max_depth"]}')
print(f'Best CV accuracy: {grid_search.best_score_:.2%}')

# Final evaluation on untouched test set
final_acc = accuracy_score(y_test, grid_search.predict(X_test))
print(f'Final test accuracy: {final_acc:.2%}')
print('\nRule: Tune with CV on train, evaluate once on test.')
Best depth: 3, Test accuracy: 100.00%
Problem: test data influenced the hyperparameter choice.
The reported accuracy is optimistic — it was selected to look good.
=== CORRECT: GridSearchCV on Training Data ===
Best depth: 3
Best CV accuracy: 95.83%
Final test accuracy: 100.00%
Rule: Tune with CV on train, evaluate once on test.
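When data is plentiful, an alternative to GridSearchCV is an explicit three-way split: tune on a validation set and touch the test set exactly once. A sketch under common conventions (the 60/20/20 ratios are an assumption, not from the original guide):

```python
# Sketch: train / validation / test split for leakage-free manual tuning
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 60% train / 20% validation / 20% test via two successive splits
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=42)

best_depth, best_val = None, 0.0
for depth in range(1, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)  # tune on validation only
    if val_acc > best_val:
        best_val, best_depth = val_acc, depth

final = DecisionTreeClassifier(max_depth=best_depth, random_state=42).fit(X_train, y_train)
print(f'Chosen depth: {best_depth}, validation acc: {best_val:.2%}')
print(f'Test acc (touched once): {final.score(X_test, y_test):.2%}')
```

The loop looks just like the leaky version above, but it only ever reads the validation set; the test set stays sealed until the single final evaluation.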
Mistake 8: Not Checking Feature Importance
Training a model without examining feature importance means you do not understand what drives predictions. Feature importance reveals which features matter most, which are noise, and which might be leaking target information. A single dominant feature often indicates target leakage. Irrelevant features add noise and degrade performance. Always inspect feature importance after training — it takes one line of code and can save you from deploying a model that works for the wrong reasons. Feature importance is also critical for stakeholder communication. If you cannot explain why the model makes certain predictions, nobody will trust it in production.
# TheCodeForge — Mistake 8: Not Checking Feature Importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import numpy as np

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=3, random_state=42
)
feature_names = [f'feature_{i}' for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# CORRECT: Check built-in feature importance
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
print('=== Built-in Feature Importance (Gini) ===')
for i, idx in enumerate(indices):
    bar = '#' * int(importances[idx] * 50)
    print(f'{i+1}. {feature_names[idx]:>12}: {importances[idx]:.3f} {bar}')

print(f'\nTop 3 features account for {importances[indices[:3]].sum():.1%} of importance.')
print(f'Bottom 3 features account for {importances[indices[-3:]].sum():.1%} — consider removing.')

# BETTER: Permutation importance (model-agnostic, more reliable)
perm_imp = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)
print('\n=== Permutation Importance (more reliable) ===')
perm_indices = np.argsort(perm_imp.importances_mean)[::-1]
for i, idx in enumerate(perm_indices[:5]):
    print(f'{i+1}. {feature_names[idx]:>12}: '
          f'{perm_imp.importances_mean[idx]:.3f} '
          f'+/- {perm_imp.importances_std[idx]:.3f}')
print('\nRule: Check feature importance to detect leakage and remove noise.')
1. feature_3: 0.187 #########
2. feature_1: 0.162 ########
3. feature_5: 0.141 #######
4. feature_0: 0.118 ######
5. feature_2: 0.098 #####
6. feature_7: 0.076 ####
7. feature_4: 0.068 ###
8. feature_6: 0.061 ###
9. feature_9: 0.048 ##
10. feature_8: 0.043 ##
Top 3 features account for 49.0% of importance.
Bottom 3 features account for 15.2% — consider removing.
=== Permutation Importance (more reliable) ===
1. feature_3: 0.095 +/- 0.021
2. feature_1: 0.078 +/- 0.018
3. feature_5: 0.065 +/- 0.015
4. feature_0: 0.052 +/- 0.014
5. feature_2: 0.041 +/- 0.012
Rule: Check feature importance to detect leakage and remove noise.
Mistake 9: Ignoring Data Distribution Shift
Models are trained on historical data but deployed on future data. If the data distribution changes over time — feature values shift, new categories appear, or relationships between features and targets change — model performance degrades silently. This is called concept drift or data drift. The model does not throw an error. It does not report lower confidence. It simply starts making worse predictions, and unless you are monitoring production metrics, you will not know until the business impact is visible. Distribution shift is the reason 'set it and forget it' does not work for ML in production.
# TheCodeForge — Mistake 9: Data Distribution Shift
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy import stats

# Simulate training data (2024 distribution)
np.random.seed(42)
X_train = np.random.randn(500, 2) + [0, 0]
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Simulate production data (2025 distribution shifted)
X_prod = np.random.randn(200, 2) + [2, -1]  # distribution shifted
y_prod = (X_prod[:, 0] + X_prod[:, 1] > 0).astype(int)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
prod_acc = accuracy_score(y_prod, model.predict(X_prod))
print('=== Distribution Shift Problem ===')
print(f'Training accuracy (2024 data): {train_acc:.2%}')
print(f'Production accuracy (2025 data): {prod_acc:.2%}')
print(f'Performance drop: {train_acc - prod_acc:.2%}')
print(f'\nTraining feature means: {X_train.mean(axis=0).round(2)}')
print(f'Production feature means: {X_prod.mean(axis=0).round(2)}')
print(f'Means shifted — the model learned patterns that no longer hold.')

# Detect shift with statistical test (KS test)
print('\n=== Distribution Shift Detection ===')
for i in range(X_train.shape[1]):
    ks_stat, p_value = stats.ks_2samp(X_train[:, i], X_prod[:, i])
    shifted = 'SHIFTED' if p_value < 0.05 else 'OK'
    print(f'Feature {i}: KS stat={ks_stat:.3f}, p={p_value:.4f} -> {shifted}')
print('\nRule: Monitor production data distributions and retrain periodically.')
Training accuracy (2024 data): 96.20%
Production accuracy (2025 data): 68.00%
Performance drop: 28.20%
Training feature means: [0.02 0.03]
Production feature means: [ 1.97 -1.01]
Means shifted — the model learned patterns that no longer hold.
=== Distribution Shift Detection ===
Feature 0: KS stat=0.872, p=0.0000 -> SHIFTED
Feature 1: KS stat=0.635, p=0.0000 -> SHIFTED
Rule: Monitor production data distributions and retrain periodically.
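Alongside the KS test above, production teams often monitor the Population Stability Index (PSI). The implementation below is a sketch, not from the original guide, using the common rule of thumb that PSI < 0.1 is stable and > 0.25 signals a major shift:

```python
# Sketch: Population Stability Index (PSI) as a drift monitor
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between a reference sample and a new sample of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Pull out-of-range production values into the edge bins
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor proportions to avoid log(0)
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
train_feature = rng.normal(0, 1, 5000)   # reference distribution
same_dist = rng.normal(0, 1, 2000)       # production, no drift
shifted = rng.normal(2, 1, 2000)         # production, mean moved 2 std devs

print(f'PSI (no shift): {psi(train_feature, same_dist):.3f}')
print(f'PSI (shifted):  {psi(train_feature, shifted):.3f}')
```

Unlike a p-value, PSI gives a magnitude you can threshold and alert on per feature.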
Mistake 10: Using the Wrong Loss Function
The loss function defines what the model optimizes for. Using the wrong loss function means the model optimizes the wrong objective. For classification, use cross-entropy loss — not mean squared error. For regression with outliers, use Huber loss — not mean squared error. For imbalanced classification, use weighted cross-entropy or focal loss. The loss function must match the problem type and business objective. This mistake is especially common when beginners copy code from tutorials without understanding why a particular loss function was chosen. MSE penalizes outliers quadratically, which makes the model chase extreme values. Huber loss transitions from quadratic (near zero error) to linear (large error), making it robust to outliers.
# TheCodeForge — Mistake 10: Wrong Loss Function
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Create regression data with outliers
np.random.seed(42)
X = np.random.randn(200, 1)
y = 3 * X.squeeze() + np.random.randn(200) * 0.5
# Add outliers — every 10th point has a large error
y[::10] += 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: MSE loss with outliers — model chases extreme values
lr = LinearRegression()
lr.fit(X_train, y_train)
pred_mse = lr.predict(X_test)
print('=== MSE Loss (MISTAKE for outlier data) ===')
print(f'MSE: {mean_squared_error(y_test, pred_mse):.2f}')
print(f'MAE: {mean_absolute_error(y_test, pred_mse):.2f}')
print(f'Coefficient: {lr.coef_[0]:.2f} (true value: 3.00)')
print(f'Intercept: {lr.intercept_:.2f} (true value: 0.00)')

# CORRECT: Huber loss — robust to outliers
huber = HuberRegressor(epsilon=1.35)  # default epsilon
huber.fit(X_train, y_train)
pred_huber = huber.predict(X_test)
print(f'\n=== Huber Loss (CORRECT for outlier data) ===')
print(f'MSE: {mean_squared_error(y_test, pred_huber):.2f}')
print(f'MAE: {mean_absolute_error(y_test, pred_huber):.2f}')
print(f'Coefficient: {huber.coef_[0]:.2f} (true value: 3.00)')
print(f'Intercept: {huber.intercept_:.2f} (true value: 0.00)')

print(f'\n=== Loss Function Guide ===')
print(f'Classification: cross-entropy (log_loss)')
print(f'Regression (clean): MSE')
print(f'Regression (outliers): Huber or MAE')
print(f'Imbalanced classes: weighted cross-entropy or focal loss')
```text
=== MSE Loss (MISTAKE for outlier data) ===
MSE: 12.45
MAE: 2.18
Coefficient: 3.42 (true value: 3.00)
Intercept: 0.95 (true value: 0.00)

=== Huber Loss (CORRECT for outlier data) ===
MSE: 8.72
MAE: 1.65
Coefficient: 3.12 (true value: 3.00)
Intercept: 0.35 (true value: 0.00)

=== Loss Function Guide ===
Classification: cross-entropy (log_loss)
Regression (clean): MSE
Regression (outliers): Huber or MAE
Imbalanced classes: weighted cross-entropy or focal loss
```
- MSE penalizes large errors quadratically — outliers dominate the optimization
- Huber loss transitions from quadratic to linear — robust to outliers
- Cross-entropy is correct for classification — MSE is not
- Weighted loss functions handle class imbalance during training
Mistake 11: Not Using sklearn Pipeline
Manual preprocessing — scaling, encoding, feature selection — is error-prone and the most common source of data leakage in production. A sklearn Pipeline chains preprocessing steps and the model into a single object. The pipeline ensures preprocessing is fit on training data only and applied consistently to test and production data. It also simplifies hyperparameter tuning and deployment. Without a Pipeline, you must remember to apply every preprocessing step in the correct order to every new dataset. Miss one step, apply them in the wrong order, or accidentally fit a scaler on test data, and your predictions are silently wrong. The Pipeline eliminates this entire class of bugs by design.
```python
# TheCodeForge — Mistake 11: Not Using sklearn Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import joblib

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# MISTAKE: Manual preprocessing (error-prone, leakage risk)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train_scaled)
model = LogisticRegression(random_state=42)
model.fit(X_train_pca, y_train)

# Must remember to apply same transforms to test data
X_test_scaled = scaler.transform(X_test)
X_test_pca = pca.transform(X_test_scaled)

print('=== Manual Preprocessing (MISTAKE) ===')
print(f'Accuracy: {model.score(X_test_pca, y_test):.2%}')
print('Problem: easy to forget steps, apply in wrong order, or fit on wrong data.')

# CORRECT: Pipeline — leakage-proof by design
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(random_state=42))
])
pipeline.fit(X_train, y_train)

print(f'\n=== sklearn Pipeline (CORRECT) ===')
print(f'Accuracy: {pipeline.score(X_test, y_test):.2%}')

# Cross-validation works seamlessly with Pipeline
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f'CV accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')

# Deployment: serialize the entire pipeline — one file, one predict call
joblib.dump(pipeline, 'model_pipeline.joblib')
loaded_pipeline = joblib.load('model_pipeline.joblib')
print(f'\nLoaded pipeline accuracy: {loaded_pipeline.score(X_test, y_test):.2%}')
print('Deployment: one file contains scaler + PCA + model.')

print('\nRule: Always use Pipeline — it prevents leakage and simplifies deployment.')
```
```text
=== Manual Preprocessing (MISTAKE) ===
Accuracy: 88.00%
Problem: easy to forget steps, apply in wrong order, or fit on wrong data.

=== sklearn Pipeline (CORRECT) ===
Accuracy: 88.00%
CV accuracy: 87.20% (+/- 2.14%)

Loaded pipeline accuracy: 88.00%
Deployment: one file contains scaler + PCA + model.

Rule: Always use Pipeline — it prevents leakage and simplifies deployment.
```
Mistake 12: Not Validating with Domain Experts
Technical metrics do not guarantee business value. A model can achieve high accuracy while making predictions that are nonsensical to domain experts. Feature importance can reveal that the model relies on features that should not predict the target. Clusters can be statistically valid but business-meaningless. Always validate model outputs with domain experts before deployment — they catch errors that metrics miss. This is not a technical step but a process step, and skipping it is one of the most expensive mistakes in ML. A model that makes technically correct but domain-inappropriate predictions will erode stakeholder trust faster than a model that makes honest errors.
```python
# TheCodeForge — Mistake 12: No Domain Expert Validation
# Example: A model predicts house prices using zip code as a numeric feature
# The model achieves high R-squared but makes nonsensical predictions
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Simulated data: house prices
np.random.seed(42)
n_samples = 500
zip_code = np.random.randint(10000, 99999, n_samples)
sqft = np.random.randint(500, 5000, n_samples)
price = sqft * 150 + np.random.randn(n_samples) * 10000

X = np.column_stack([zip_code, sqft])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print('=== Technical Metrics Look Good ===')
print(f'R-squared: {r2_score(y_test, predictions):.2%}')
print(f'MAE: ${mean_absolute_error(y_test, predictions):,.0f}')
print(f'\nModel coefficients:')
print(f'  zip_code: {model.coef_[0]:.4f} per unit')
print(f'  sqft: {model.coef_[1]:.2f} per unit')

print(f'\n=== Domain Expert Would Catch This ===')
print(f'The model treats zip_code as a continuous number.')
print(f'Zip code 99998 is not "worth more" than zip code 10001.')
print(f'This is nonsensical — zip code is categorical, not numeric.')
print(f'Fix: one-hot encode zip_code or use target encoding.')

# What a domain expert review should include:
print(f'\n=== Domain Expert Review Checklist ===')
print(f'1. Are feature types correct? (categorical vs numeric)')
print(f'2. Do feature importances make domain sense?')
print(f'3. Do sample predictions pass the sanity test?')
print(f'4. Are there edge cases the model handles incorrectly?')
print(f'5. Would you trust this prediction if it were your money?')

print('\nRule: Always validate predictions with domain experts before deployment.')
```
```text
=== Technical Metrics Look Good ===
R-squared: 95.42%
MAE: $8,234

Model coefficients:
  zip_code: 0.0234 per unit
  sqft: 150.03 per unit

=== Domain Expert Would Catch This ===
The model treats zip_code as a continuous number.
Zip code 99998 is not "worth more" than zip code 10001.
This is nonsensical — zip code is categorical, not numeric.
Fix: one-hot encode zip_code or use target encoding.

=== Domain Expert Review Checklist ===
1. Are feature types correct? (categorical vs numeric)
2. Do feature importances make domain sense?
3. Do sample predictions pass the sanity test?
4. Are there edge cases the model handles incorrectly?
5. Would you trust this prediction if it were your money?

Rule: Always validate predictions with domain experts before deployment.
```
| Mistake | Category | Symptom | Impact | Fix |
|---|---|---|---|---|
| Overfitting | Model | Train acc >> Test acc | Model fails on new data | Reduce complexity, add regularization |
| Data Leakage | Data | Suspiciously high accuracy | False confidence, production failure | Split before preprocessing, use Pipeline |
| Wrong Metrics | Evaluation | High accuracy, no business value | Stakeholder trust loss | Use F1, precision, recall, AUC-ROC |
| No Cross-Validation | Evaluation | Accuracy varies between runs | Unreliable performance estimate | Use cross_val_score with cv=5 |
| No Feature Scaling | Preprocessing | Poor convergence, biased distances | Degraded model performance | Scale for distance/gradient algorithms |
| No Baseline | Evaluation | Model looks good but beats nothing | Wasted engineering effort | Compare against DummyClassifier |
| Tuning on Test Data | Validation | Inflated test accuracy | Data leakage, false confidence | Use GridSearchCV on training data |
| No Feature Importance | Interpretability | Do not understand predictions | Missed leakage, noise features | Inspect feature_importances_ |
| Distribution Shift | Production | Performance degrades over time | Silent model failure | Monitor distributions, retrain periodically |
| Wrong Loss Function | Training | Model optimizes wrong objective | Suboptimal predictions | Match loss to problem type and data quality |
| No Pipeline | Code Quality | Preprocessing errors, leakage | Inconsistent train/serving | Use sklearn Pipeline |
| No Domain Validation | Process | Nonsensical predictions | Business value loss | Validate with domain experts before deploy |
🎯 Key Takeaways
- Overfitting is the #1 mistake — always compare train vs test accuracy to detect it
- Data leakage silently inflates metrics — always split BEFORE preprocessing and use sklearn Pipeline
- Accuracy is meaningless on imbalanced datasets — use F1-score, precision, recall, AUC-ROC
- Cross-validation gives reliable performance estimates — single train-test splits are noisy and misleading
- Always establish a baseline before training complex models — if you cannot beat DummyClassifier, fix the data
- Feature importance reveals leakage and noise — inspect it after every training run
- Monitor production data distributions — models degrade silently as data drifts over time
Interview Questions on This Topic
- Q: What is the difference between overfitting and underfitting, and how do you detect each? (Junior)
- Q: Explain data leakage with a real-world example and how to prevent it. (Mid-level)
- Q: Why is accuracy a bad metric for imbalanced datasets, and what should you use instead? (Mid-level)
- Q: How do you design a robust ML evaluation pipeline that prevents all common mistakes? (Senior)
Frequently Asked Questions
How do I know if my model is overfitting?
Compare training accuracy to test accuracy. If training accuracy is significantly higher (>10% gap), the model is overfitting. For example, 99% training accuracy with 80% test accuracy indicates severe overfitting. The model memorized the training data instead of learning generalizable patterns. Fix: reduce model complexity (fewer layers, lower max_depth, fewer estimators), add regularization (L1, L2, dropout), or collect more training data. Plot learning curves — if test accuracy plateaus while training accuracy keeps climbing, the model needs regularization, not more training.
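A minimal sketch of this check, with an illustrative dataset and a deliberately unconstrained decision tree (the names and the 10% threshold are choices for this example, not from the article's own code):

```python
# Hypothetical sketch: measure the train-test gap to detect overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree memorizes the training set (train accuracy = 100%)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

gap = model.score(X_train, y_train) - model.score(X_test, y_test)
print(f'Train-test gap: {gap:.2%}')
print('Overfitting' if gap > 0.10 else 'OK')
```

Constraining the tree (for example `max_depth=3`) shrinks the gap, which is exactly the "reduce model complexity" fix described above.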
What is the simplest way to prevent data leakage?
Use sklearn Pipeline. A Pipeline chains preprocessing steps and the model into a single object. When you call pipeline.fit(X_train, y_train), the pipeline automatically fits preprocessing on training data only. When you call pipeline.predict(X_test), it applies the same preprocessing to test data without refitting. This eliminates the most common source of leakage: fitting preprocessing on the full dataset before splitting. For time-series data, additionally ensure you split chronologically using TimeSeriesSplit, not randomly.
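For the time-series case, a minimal sketch of a chronological split with TimeSeriesSplit (the toy array stands in for data already sorted by timestamp):

```python
# Hypothetical sketch: chronological splits — every test index comes after
# every train index, so no future information leaks into training.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # rows assumed sorted by timestamp
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print('train:', train_idx, 'test:', test_idx)
```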
Should I always use cross-validation instead of a single train-test split?
Use cross-validation for model evaluation and hyperparameter tuning — it gives a more reliable performance estimate with confidence bounds. Use a single train-test split for final evaluation — it simulates production conditions where you evaluate on truly unseen data. The standard approach: split data into train/test (80/20), use cross-validation on the training set for model selection and tuning, then evaluate the final model on the untouched test set once. For small datasets (<1000 samples), cross-validation is especially important because a single split can be highly unrepresentative.
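A minimal sketch of that standard workflow, on illustrative synthetic data:

```python
# Hypothetical sketch: cross-validate on the training set for model selection,
# then evaluate once on the untouched test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # tuning happens here
print(f'CV accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')

model.fit(X_train, y_train)
print(f'Final test accuracy: {model.score(X_test, y_test):.2%}')  # evaluated once
```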
How do I handle imbalanced datasets without collecting more data?
Three approaches, from simplest to most involved: (1) Class weights — set class_weight='balanced' in sklearn classifiers. This penalizes misclassifying the minority class more heavily during training. Requires no data modification. (2) Oversampling — use SMOTE (from imbalanced-learn library) to generate synthetic minority class samples. Creates new training samples by interpolating between existing minority samples. (3) Undersampling — randomly remove majority class samples to balance the dataset. Simple but loses information. Combine any of these with appropriate metrics (F1-score, AUC-ROC) instead of accuracy. Start with class weights — it works well in most cases and adds zero complexity.
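A minimal sketch of approach (1) on a synthetic 95/5 dataset; the exact scores depend on the data and are illustrative only:

```python
# Hypothetical sketch: class weights on an imbalanced dataset, scored with F1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(
    max_iter=1000, class_weight='balanced'
).fit(X_train, y_train)

f1_plain = f1_score(y_test, plain.predict(X_test))
f1_balanced = f1_score(y_test, balanced.predict(X_test))
print(f'F1 without weights: {f1_plain:.3f}')
print(f'F1 with class_weight="balanced": {f1_balanced:.3f}')
```

Note that `class_weight='balanced'` typically raises minority-class recall at some cost in precision, so always compare full precision/recall numbers, not a single score.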
How often should I retrain my production model?
It depends on how fast your data distribution changes. For stable distributions (medical imaging, physics simulations), retrain quarterly or when new data is available. For moderately changing distributions (e-commerce recommendations, marketing), retrain monthly. For rapidly changing distributions (news classification, social media trending, financial markets), retrain weekly or daily. The key is monitoring: track feature distributions and model performance metrics in production. When performance drops below a threshold or feature distributions shift significantly (detected via KS test or PSI), trigger retraining. Automated drift detection is better than fixed calendar schedules.
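A minimal sketch of KS-based drift detection with scipy; the simulated "production" feature has drifted by half a standard deviation, and the 0.01 threshold is an illustrative choice:

```python
# Hypothetical sketch: two-sample KS test to flag feature drift in production.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # training-time values
prod_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # production has drifted

stat, p_value = ks_2samp(train_feature, prod_feature)
print(f'KS statistic: {stat:.3f}, p-value: {p_value:.2e}')
if p_value < 0.01:
    print('Distribution shift detected — trigger retraining')
```

In practice you would run this per feature on a schedule and alert when any feature drifts.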
What is the difference between a validation set and a test set?
A validation set is used during model development for hyperparameter tuning and model selection — you evaluate on it repeatedly. A test set is used exactly once for final evaluation — it provides an unbiased estimate of production performance. In practice, cross-validation on the training set replaces the need for a separate validation set — you tune on cross-validation folds within the training data and evaluate on the untouched test set. The three-way split (train/validation/test) is more common in deep learning where cross-validation is computationally expensive due to long training times.
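For the deep-learning case, a minimal sketch of a 60/20/20 three-way split via two calls to train_test_split (the proportions are an illustrative convention):

```python
# Hypothetical sketch: train/validation/test split in two steps.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# First carve off the final test set (20% of the total)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Then split the remainder: 25% of the 80% = 20% of the total for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```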
How do I know if a feature is causing data leakage?
Three indicators: (1) Feature importance — if one feature dominates (>50% of total importance), investigate whether it is derived from the target or contains future information. (2) Temporal availability — ask yourself: would this feature be available at prediction time in production? If not, it is leaking. (3) Suspicious accuracy — if removing a single feature drops accuracy by more than 20%, that feature is almost certainly leaking target information. Example: a 'days_since_last_purchase' feature in a churn prediction model that is calculated using the churn date itself — this feature encodes the target directly. Always ask: could I compute this feature BEFORE the event I am trying to predict?
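A minimal sketch of indicator (1): appending a deliberately leaky feature (a noisy copy of the target) to an otherwise ordinary dataset typically makes it dominate `feature_importances_`:

```python
# Hypothetical sketch: a leaky feature shows up as an outsized importance score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
leaky = y + np.random.default_rng(42).normal(scale=0.05, size=len(y))  # ~target
X_leaky = np.column_stack([X, leaky])

model = RandomForestClassifier(random_state=42).fit(X_leaky, y)
importances = model.feature_importances_
print(f'Leaky feature importance: {importances[-1]:.2%}')  # typically the largest
```

In a real project the "leaky" column would be something plausible-looking, like the `days_since_last_purchase` example above, which is why the importance check matters.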
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.