Junior 7 min · March 06, 2026

Overfitting - When Fraud Detection Blocks All Transactions

Fraud alert rate jumped from 2% to 80% in hours due to overfitting on imbalanced data. Never trust training accuracy alone; use the production debug guide below.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

Production tested · Last updated May 23, 2026
Quick Answer
  • Overfitting: training error low, validation error high — model memorises noise
  • Underfitting: both training and validation errors high — model misses signal
  • Use learning curves: training/validation error vs training set size
  • Fix overfitting: regularisation, more data, simplify model, early stopping
  • Fix underfitting: increase complexity, add features, train longer
  • Biggest mistake: adding more data when the model is underfitting — it won't help
✦ Definition · ~90s read
What is Overfitting and Underfitting?

Overfitting in the context of fraud detection occurs when a machine learning model learns the training data too precisely, including its noise and outliers, to the point where it fails to generalize to new, unseen transactions. In the extreme case described—'blocks all transactions'—the model has become so narrowly tailored to the specific fraudulent patterns in its training set that it incorrectly classifies legitimate transactions as fraudulent, effectively shutting down all activity.

This is not a sign of a robust system but a failure of model validation and regularization, where the model's complexity exceeds what the underlying signal in the data can support.

This phenomenon exists because fraud detection models are often trained on highly imbalanced datasets with rare but varied fraudulent behaviors. To maximize recall on the training set, the model may memorize spurious correlations—such as specific merchant IDs, time-of-day quirks, or user-agent strings—that happen to align with fraud in the historical data but are irrelevant or misleading in production.

Without proper cross-validation, early stopping, or regularization techniques like dropout or L1/L2 penalties, the model's decision boundary becomes overly intricate, leading to a high false-positive rate that can paralyze transaction processing.

Overfitting fits into the broader machine learning lifecycle as a critical failure mode during the training-validation gap. It is the opposite of underfitting, where the model is too simple to capture patterns. In fraud detection specifically, overfitting undermines the core business goal: distinguishing genuine from fraudulent transactions with acceptable precision and recall.

Addressing it requires techniques such as simplifying the model architecture, increasing training data diversity, applying regularization, and rigorously testing on out-of-time or out-of-distribution samples to ensure the model remains robust in production.

Plain-English First

Imagine you're studying for a history exam. If you memorise every question from last year's paper word-for-word, you'll ace a re-run but completely blank on any new question — that's overfitting. If you barely glance at the textbook and just guess 'World War 2' for everything, you'll fail because you learned too little — that's underfitting. A great student learns the patterns and principles, not the exact answers. Your ML model needs to do exactly the same thing.

Every ML model you build has one job: make good predictions on data it has never seen before. It sounds simple, but the single biggest reason models fail in production isn't bad algorithms or messy data — it's getting the balance of learning wrong. A model that learns too much from its training data becomes obsessed with noise and quirks that don't generalise. A model that learns too little never captures the real signal in the first place. Both failures have names, both are measurable, and both are fixable once you understand what's actually happening inside the model.

Overfitting and underfitting sit at opposite ends of a spectrum called the bias-variance tradeoff. Understanding this tradeoff is what separates engineers who tune models by intuition from those who tune them systematically. When you know WHY a model overfits, you stop throwing random regularisation at it and start making deliberate, principled decisions about complexity, data size, and training strategy.

By the end of this article you'll be able to plot a learning curve and diagnose whether your model is overfitting or underfitting just by looking at it. You'll have working Python code that deliberately creates both problems and then fixes them — so the concepts stick in your hands, not just your head. And you'll walk away knowing exactly which levers to pull in each scenario.

Overfitting vs. Underfitting — The Two Failure Modes of Learning

Overfitting is when a model learns the training data too precisely, including its noise and outliers, resulting in poor generalization to unseen data. Underfitting is the opposite: the model fails to capture the underlying patterns, performing poorly even on the training set. The core mechanic is a mismatch between model capacity and data complexity — too much capacity relative to signal leads to memorization; too little leads to oversimplification.

In practice, overfitting manifests as high variance: small changes in the training sample cause large swings in the fitted model and its predictions. Underfitting shows high bias: the model consistently misses the true relationship. A classic symptom of overfitting is training accuracy near 100% while validation accuracy plateaus or drops. Underfitting shows both training and validation accuracy stuck below acceptable thresholds. Both degrade real-world performance, but they require opposite remedies: more regularization or more data for overfitting, more capacity or better features for underfitting.

You encounter these when deploying any learned model to production. For fraud detection, an overfit model might block 99% of legitimate transactions because it memorized rare fraud patterns from training, while an underfit model might let through obvious fraud because it never learned the signal. The practical goal is to find the sweet spot where validation error is minimized — typically by monitoring the gap between training and validation metrics and using techniques like cross-validation, early stopping, or pruning.

The Double Descent Trap
Modern deep networks can overfit even with massive data — but sometimes performance improves again after the overfitting peak. Never assume more parameters always cause overfitting.
Production Insight
Fraud detection model trained on last month's transactions blocks 40% of legitimate purchases on Black Friday because it overfit to seasonal patterns.
Symptom: validation accuracy drops sharply while training accuracy stays high — the classic generalization gap.
Rule of thumb: if your validation loss starts rising while training loss still falls, stop training immediately — you've entered the overfitting regime.
Key Takeaway
Overfitting is memorization, not learning — it fails on new data because it learned the noise.
Underfitting is ignorance — the model never captured the signal, so it fails everywhere.
The only reliable cure is to measure generalization error on a held-out set and stop before the gap widens.
[Figure: Overfitting vs Underfitting in Fraud Detection (thecodeforge.io). Panels: overfitting memorizes training data (high variance, low bias; blocks all transactions); underfitting won't learn (high bias, low variance; misses fraud patterns); validation curves detect overfitting early via the train/test gap; regularization and dropout penalize complexity; added complexity and features fix underfitting; production monitoring alerts on drift and performance degradation. Warning: an overfit model blocks all transactions silently, so monitor false positive rate and validation gap continuously.]

The Technical Root: Bias vs. Variance

To fix a failing model, you must diagnose its soul. Underfitting is caused by High Bias—the model makes simplistic assumptions about the data. Overfitting is caused by High Variance—the model is overly sensitive to small fluctuations in the training set. In a production environment, we use Learning Curves (plotting Error vs. Training Set Size) to visualize this struggle.

The bias-variance tradeoff is fundamental. A model with high bias pays little attention to data and consistently underfits. A model with high variance pays too much attention to data, including noise. Finding the sweet spot is what model tuning is all about.
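
The tradeoff can be written down for squared-error loss. This is the standard textbook decomposition, not something specific to this article, and it assumes data generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, where $\hat{f}$ is the model fitted on a randomly drawn training set:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Underfitting inflates the first term and overfitting inflates the second. No amount of tuning removes the third.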

In practice, you'll see three patterns:
  • Both curves high and close together: high bias (underfitting)
  • Training curve low, validation curve high with a gap: high variance (overfitting)
  • Both curves low and close together: good fit
model_diagnostics.py (Python)
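import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# io.thecodeforge approach: Systematic Diagnostic
def generate_learning_curve(model, X, y):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y, cv=5, scoring='neg_mean_squared_error'
    )

    # Scores are negated MSE: flip the sign and average across the CV folds
    train_mean = -np.mean(train_scores, axis=1)
    test_mean = -np.mean(test_scores, axis=1)

    return train_sizes, train_mean, test_mean

# Underfitting: Linear model on non-linear data
underfit_model = LinearRegression()

# Overfitting: High-degree polynomial
overfit_model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=15, include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
])

# Synthetic non-linear data (illustrative only) so the snippet runs end to end
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

for name, model in [("underfit", underfit_model), ("overfit", overfit_model)]:
    sizes, train_err, val_err = generate_learning_curve(model, X, y)
    print(f"{name}: train MSE {train_err[-1]:.3f}, validation MSE {val_err[-1]:.3f}")
Output
One line per model with the final training and validation MSE (exact values depend on the random draw).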
The Convergence Trap
In Underfitting, the training and validation curves converge quickly but at a high error rate. Adding more data won't help; you need a more complex model.
Production Insight
A common production mistake is training a linear model on non-linear data and then throwing more data at it.
The curves will converge at high error — clear high bias.
Rule: if both errors are high and close, increase model complexity, not data size.
Key Takeaway
Bias trades off against variance.
Underfitting = high bias (curves high and close).
Overfitting = high variance (curves diverge).
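
The three curve patterns above can be turned into a small check on the final points of a learning curve. This is a minimal sketch, not the article's pipeline: `diagnose_fit`, `acceptable_error`, and `gap_tolerance` are hypothetical names, and both thresholds are assumptions you would set per project.

```python
def diagnose_fit(train_error, val_error, acceptable_error, gap_tolerance):
    """Classify a fit from the last training and validation errors.

    All four arguments share the same units (for example MSE).
    acceptable_error and gap_tolerance are project-specific assumptions.
    """
    gap = val_error - train_error
    if train_error > acceptable_error and gap <= gap_tolerance:
        return "underfitting"   # both curves high and close together
    if gap > gap_tolerance:
        return "overfitting"    # training low, validation far above it
    return "good fit"           # both curves low and close together


print(diagnose_fit(0.50, 0.52, acceptable_error=0.10, gap_tolerance=0.05))
print(diagnose_fit(0.01, 0.40, acceptable_error=0.10, gap_tolerance=0.05))
print(diagnose_fit(0.04, 0.06, acceptable_error=0.10, gap_tolerance=0.05))
```

The three calls print `underfitting`, `overfitting`, and `good fit` in that order. A single pair of numbers is a coarse signal, so treat the label as a prompt to look at the full curve.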

Detecting Overfitting Early with Validation Curves

Validation curves show how model performance changes with a hyperparameter (e.g., polynomial degree, tree depth). They help you find the sweet spot before overfitting takes hold. Plotting validation curves during development is far cheaper than discovering the problem in production.

In production systems, we automate this with CI/CD pipelines that generate validation curves for every candidate model. The pipeline rejects any model where the validation curve shows a gap larger than a configurable threshold (typically 10-15% for classification accuracy).

Here's how to generate a validation curve for polynomial degree using scikit-learn:

validation_curve.py (Python)
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

def generate_validation_curve(X, y, param_range):
    model = Pipeline([
        ("poly", PolynomialFeatures(include_bias=False)),
        ("reg", LinearRegression()),
    ])
    train_scores, test_scores = validation_curve(
        model, X, y,
        param_name="poly__degree",
        param_range=param_range,
        cv=5,
        scoring="neg_mean_squared_error"
    )
    train_mean = -np.mean(train_scores, axis=1)
    test_mean = -np.mean(test_scores, axis=1)
    return param_range, train_mean, test_mean
Automated Thresholding
In our CI/CD pipelines, we set a max allowed gap of 0.15 (15%) between training and validation accuracy. Any model exceeding this is automatically rejected and logged for review.
Production Insight
Validation curves are not just for research — they catch overfitting before deployment.
Automate the gap check: if training accuracy - validation accuracy > 0.15, the model likely overfits.
Rule: reject models with high variance early, before they hit production.
Key Takeaway
Validation curves reveal the optimal hyperparameter range.
Stop before the validation error starts rising.
Automate the gap threshold check in your MLOps pipeline.
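
The gap check described in this section could look roughly like the sketch below. The function name `passes_gap_check`, the candidate names, and their scores are all made up for illustration; only the 0.15 threshold comes from the text.

```python
def passes_gap_check(train_accuracy, val_accuracy, max_gap=0.15):
    """Return True when the train/validation accuracy gap is within max_gap."""
    gap = train_accuracy - val_accuracy
    return gap <= max_gap


# Hypothetical candidates: (training accuracy, validation accuracy)
candidates = {
    "shallow_tree": (0.91, 0.88),
    "deep_net_500_epochs": (0.999, 0.71),
}

for name, (train_acc, val_acc) in candidates.items():
    verdict = "accept" if passes_gap_check(train_acc, val_acc) else "reject"
    print(f"{name}: gap={train_acc - val_acc:.3f} -> {verdict}")
```

In a CI job the `reject` branch would typically fail the build and log the candidate for review. On imbalanced fraud data you may prefer to apply the same check to precision, recall, or AUC instead of accuracy.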

Fixing Overfitting: Regularisation, Dropout, and Pruning

When you've confirmed overfitting, the toolkit is broad but principled. The most effective levers are:

  • L1/L2 Regularisation: Adds a penalty to large weights. L1 drives weights to zero (feature selection), L2 shrinks them.
  • Dropout: Randomly drops neurons during training — forces the network to learn redundant representations.
  • Early Stopping: Monitors validation loss and stops training when it starts increasing.
  • Reduce Model Complexity: Fewer layers, fewer trees, lower polynomial degree.
  • Increase Training Data: More data reduces the impact of noise.
  • Feature Selection: Remove irrelevant features that introduce noise.

Here's a Python example that combines three of these techniques (L2 regularisation, dropout, and early stopping):

io/thecodeforge/fix_overfitting.py (Python)
from tensorflow import keras
from tensorflow.keras import layers, regularizers, callbacks

def build_regularised_model(input_dim):
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        # L2 penalty on the weights of each hidden layer
        layers.Dense(64, activation='relu',
                     kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        layers.Dense(32, activation='relu',
                     kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.3),
        layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Early stopping callback: stop after 10 epochs without improvement in val_loss
early_stop = callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

# X_train, y_train, X_val, y_val are assumed to be pre-split NumPy arrays
model = build_regularised_model(X_train.shape[1])
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=200, callbacks=[early_stop], verbose=0)
Regularisation as a Constraint
  • L2 regularisation adds a penalty proportional to the square of weights — common in neural nets.
  • L1 regularisation adds penalty proportional to absolute weight — useful for feature selection.
  • Dropout randomly turns off neurons — like training an ensemble of simpler models.
  • Early stopping cuts training when validation stops improving — prevents memorisation.
Production Insight
Dropout should typically be 0.2-0.5 — too high and the model underfits.
L2 regularisation coefficient (λ) is often tuned via cross-validation; 0.01 is a common starting point.
Rule: always combine regularisation with early stopping in production pipelines.
Key Takeaway
Regularisation reduces variance.
Dropout, L2, and early stopping are the three pillars.
Always validate on a holdout set after applying them.
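
The difference between L1 and L2 penalties described in this section is easy to see on a linear model. The sketch below uses scikit-learn's `Lasso` (L1) and `Ridge` (L2) on synthetic data. The data, the coefficients, and both `alpha` values are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
# Only the first three features carry signal; the other 17 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.5 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print(f"Lasso zeroed {n_zero_lasso} of {n_features} coefficients")
print(f"Ridge zeroed {n_zero_ridge} of {n_features} coefficients")
```

Lasso should set most of the noise coefficients to exactly zero, which is why it doubles as feature selection. Ridge shrinks every coefficient but leaves them non-zero.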

Fixing Underfitting: Complexity, Features, and Training Time

Underfitting means your model is too simple. The solution is to give it more capacity. But adding complexity blindly can tip into overfitting — you need a controlled approach.

Common fixes for underfitting:
  • Increase Model Complexity: Use higher-degree polynomials, deeper trees, more layers.
  • Engineer Better Features: Interactions, polynomial features, domain-specific aggregations.
  • Reduce Regularisation: Too much regularisation can itself cause underfitting.
  • Train Longer: Sometimes the model just needs more epochs to converge.
  • Reduce Feature Noise: Remove irrelevant features that dilute signal.

Here's a Python workflow that detects underfitting and applies fixes systematically:

io/thecodeforge/fix_underfitting.py (Python)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

def diagnose_and_fix_underfitting(X, y):
    # Baseline linear model
    linear_model = LinearRegression()
    linear_score = -cross_val_score(linear_model, X, y, cv=5, 
                                    scoring='neg_mean_squared_error').mean()
    
    # Try polynomial features (degree 3)
    poly_model = Pipeline([
        ('poly', PolynomialFeatures(degree=3, include_bias=False)),
        ('scaler', StandardScaler()),
        ('reg', LinearRegression())
    ])
    poly_score = -cross_val_score(poly_model, X, y, cv=5, 
                                  scoring='neg_mean_squared_error').mean()
    
    if poly_score < linear_score * 0.8:
        print("Underfitting fixed with polynomial features. Using poly model.")
        return poly_model
    else:
        print("Still underfitting. Try feature engineering or more complex algorithm.")
        return None
The Complexity Trap
Don't jump from a linear model straight to a 10-layer neural net. Increase complexity step by step and validate at each step. Otherwise you'll overshoot and end up overfitting.
Production Insight
In production, I've seen teams add neural networks to solve underfitting when polynomial features would have done the job.
Start with the simplest fix (polynomial features, interaction terms) and escalate.
Rule: underfitting is often a feature engineering problem, not a model problem.
Key Takeaway
Underfitting = not enough capacity.
Fix: more features, more complexity, less regularisation.
Validate with cross-validation after each change.

Production-Grade Monitoring and Alerting for Model Drift

Once your model is in production, its performance can still degrade as data distributions change. Two related causes are covariate shift (the distribution of input features changes) and concept drift (the relationship between features and the target changes). You need automated monitoring for both.

At TheCodeForge, we deploy a model health monitor that checks:
  • Prediction distribution: Does the average prediction stay stable?
  • Feature distribution: Are incoming features within the training range?
  • Error rate over time: Is the model's error creeping up?

Here's a Java implementation that logs warnings when drift indicators trigger:

io/thecodeforge/ml/ModelHealthMonitor.java (Java)
package io.thecodeforge.ml;

import java.util.logging.Logger;

public class ModelHealthMonitor {
    private static final Logger logger = Logger.getLogger(ModelHealthMonitor.class.getName());
    private final double baselineErrorRate;
    private final double driftThreshold = 0.2; // alert on a 20% relative increase

    public ModelHealthMonitor(double baselineErrorRate) {
        // A zero or negative baseline would make the relative change undefined
        if (baselineErrorRate <= 0) {
            throw new IllegalArgumentException("baselineErrorRate must be positive");
        }
        this.baselineErrorRate = baselineErrorRate;
    }

    public void checkCurrentErrorRate(double currentErrorRate) {
        // Relative change against the baseline, e.g. 0.10 -> 0.18 is +80%
        double relativeChange = (currentErrorRate - baselineErrorRate) / baselineErrorRate;
        if (relativeChange > driftThreshold) {
            logger.warning("CRITICAL: Model error rate increased by " +
                           Math.round(relativeChange * 100) + "%. Overfitting? Underfitting? Check learning curves.");
        } else {
            logger.info("Model error rate within bounds.");
        }
    }

    public static void main(String[] args) {
        ModelHealthMonitor monitor = new ModelHealthMonitor(0.10);
        monitor.checkCurrentErrorRate(0.18);
    }
}
Output
CRITICAL: Model error rate increased by 80%. Overfitting? Underfitting? Check learning curves.
Retrain Cadence
Most models need retraining when error increases by more than 15-20% from baseline. Automate retraining triggers, but always validate on a holdout set before deploying the new model.
Production Insight
Drift detection is not optional — it's a production requirement.
I've seen teams lose millions because they didn't monitor for covariate shift.
Rule: deploy a monitoring endpoint that returns current error rates and distribution stats.
Key Takeaway
Models degrade in production.
Monitor error rates and feature distributions.
Alert when drift exceeds 20% and trigger retraining.
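
The feature distribution check from this section ("are incoming features within the training range?") can be sketched in a few lines of NumPy. The function name `out_of_range_fraction` and the two-feature example data are hypothetical. A real monitor would more likely compare full distributions than just min and max.

```python
import numpy as np


def out_of_range_fraction(train_features, live_features):
    """Per-feature fraction of live rows that fall outside the training min/max."""
    lo = train_features.min(axis=0)
    hi = train_features.max(axis=0)
    outside = (live_features < lo) | (live_features > hi)
    return outside.mean(axis=0)


# Hypothetical two-feature example: transaction amount and hour of day
train = np.array([[10.0, 9], [25.0, 13], [40.0, 18], [60.0, 22]])
live = np.array([[30.0, 12], [500.0, 14], [45.0, 3], [900.0, 20]])

fractions = out_of_range_fraction(train, live)
print(fractions)  # [0.5  0.25]
```

Here half of the live amounts and a quarter of the live hours lie outside what the model saw in training. A rising fraction is a cheap early hint of covariate shift, and it is available before any ground-truth labels arrive.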

What is Underfitting? — The Model That Won't Learn

Underfitting is what happens when your model is too stupid to see the patterns right in front of it. It performs like garbage on both training and test data. This isn't a generalization problem — it's a failure to learn in the first place.

The root cause is always the same: the model's capacity is too low for the complexity of the data. You're trying to fit a straight line through a sine wave. Common culprits include: a model that's too simple (think linear regression on a non-linear problem), brutal regularization that kills signal, weak or irrelevant features, or you just didn't train long enough for the optimizer to converge.

From the bias-variance lens, underfitting is pure high bias territory. The model makes strong, dumb assumptions before seeing any data — "all relationships are linear" — and it ignores real patterns to stick with those assumptions. Variance is low because the model gives the same crappy output no matter what data you throw at it. Underfitting = High Bias + Low Variance. Memorize that.

DetectUnderfitting.py (Python)
# io.thecodeforge: ml-ai tutorial

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic non-linear data
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.normal(size=100)

# Deliberately underfit with a linear model
model = LinearRegression()
model.fit(X, y)

y_train_pred = model.predict(X)
print(f"Train MSE: {mean_squared_error(y, y_train_pred):.4f}")

# Both training and test errors are high
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
y_test = np.sin(X_test).ravel() + 0.1 * np.random.normal(size=50)
y_test_pred = model.predict(X_test)
print(f"Test MSE:  {mean_squared_error(y_test, y_test_pred):.4f}")
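Output
Both MSE values land a little under 0.2 and close to each other (exact digits depend on the noise draw), far above the 0.01 noise variance.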
Production Trap:
Don't confuse underfitting with 'not enough data.' If your training loss has plateaued and you're still seeing high error on both splits, you need a more complex model, not more data. More data won't fix a linear model trying to learn a sine wave.
Key Takeaway
When training and test errors are both high and similar, you're underfitting. Add model capacity before you add data.

What is Overfitting? — The Model That Memorized Your Trash

Overfitting is the opposite failure mode: your model became a savant at memorizing the training data, including every outlier, every random noise spike, and every accidental duplicate. It nails training accuracy — 99% baby! — then falls apart on production data like a house of cards in a hurricane.

This happens when the model has too much capacity relative to the data. Too many parameters, too many features, too many layers, not enough regularization. The model finds patterns that don't exist. It learns the noise.

Bias-Variance breakdown: overfitting is high variance, low bias. The model makes almost no assumptions (low bias), so it contorts itself to fit every single training point. Variance is through the roof because if you train on a different sample, the model learns completely different noise patterns. Your predictions are all over the place. Overfitting = Low Bias + High Variance. If your validation loss starts going UP while training loss keeps dropping, you're here. Pull the cord.

SignalVsNoiseOverfit.py (Python)
# io.thecodeforge: ml-ai tutorial

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate data with intentional noise features
np.random.seed(42)
n = 50
X_signal = np.random.uniform(-1, 1, (n, 2))
y = 0.8 * X_signal[:, 0] + 0.2 * X_signal[:, 1] + 0.1 * np.random.normal(size=n)

# Add 50 random noise features — classic overfitting trap
X_noise = np.random.normal(size=(n, 50))
X_full = np.hstack([X_signal, X_noise])

# Deep forest with no regularization — pure overfit
model = RandomForestRegressor(n_estimators=500, max_depth=None, min_samples_leaf=1)
model.fit(X_full, y)

# Train performance deceptively good
train_pred = model.predict(X_full)
print(f"Train MSE: {mean_squared_error(y, train_pred):.4f}")

# Test on fresh data — watch the train wreck
X_test_signal = np.random.uniform(-1, 1, (n, 2))
y_test = 0.8 * X_test_signal[:, 0] + 0.2 * X_test_signal[:, 1] + 0.1 * np.random.normal(size=n)
X_test_noise = np.random.normal(size=(n, 50))
X_test_full = np.hstack([X_test_signal, X_test_noise])
test_pred = model.predict(X_test_full)
print(f"Test MSE:  {mean_squared_error(y_test, test_pred):.4f}")
Output
Exact values depend on the scikit-learn version and the random draw; expect a training MSE well below the test MSE.
Senior Shortcut:
Watch the ratio of test MSE to train MSE. Anything above 5x means your model is memorizing noise. Cut features, add regularization, or get more data. Don't chase 99% training accuracy — the market doesn't reward overconfident failures.
Key Takeaway
If your training error is near zero but test error sucks, you're overfitting. The gap between train and test performance is your alarm bell.
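
The "train on a different sample, learn different noise" claim can be measured directly. The sketch below refits a decision tree on bootstrap resamples and records how much its predictions move. The data, noise level, tree depths, and resample count are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.5 * rng.normal(size=100)
X_grid = np.linspace(-3, 3, 50).reshape(-1, 1)


def prediction_variance(max_depth, n_resamples=50):
    """Average variance of predictions across trees fitted on bootstrap resamples."""
    preds = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        preds.append(tree.fit(X[idx], y[idx]).predict(X_grid))
    return float(np.mean(np.var(np.array(preds), axis=0)))


var_deep = prediction_variance(max_depth=None)
var_shallow = prediction_variance(max_depth=2)
print(f"Unlimited-depth tree variance: {var_deep:.3f}")
print(f"Depth-2 tree variance:         {var_shallow:.3f}")
```

The unlimited-depth tree should show clearly higher variance, because each resample hands it different noise to memorise. The depth-2 tree averages many points per leaf and barely moves.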

Why Every Model Needs a Holdout Set From Day One

Self-deception starts the second you evaluate on training data. You don't need complex math; you need a wall between your model and its score. A holdout set is that wall. Split before you train, not after. The holdout score is the only number that approximates production; every score computed on data the model or the tuning loop has seen is optimistic.

In production, you can't trust a model that has never been scored on unseen data. Cross-validation scores used for tuning become optimistic once you have picked hyperparameters against them; a holdout set that played no part in tuning does not. It's the cheapest insurance you'll ever buy. If your pipeline doesn't enforce a fixed split (70/15/15 is a common choice) at the point where the dataset is constructed, you're shipping technical debt.

Senior teams force holdout sets into CI/CD. Data leakage kills models silently. Start with a held-back slice, never touch it during tuning, and only use it for the final go/no-go decision. That one slice prevents months of wasted retraining.

HoldoutSplitter.py (Python)
# io.thecodeforge: ml-ai tutorial

import numpy as np
from sklearn.model_selection import train_test_split

# Simulate real-world dataset with temporal ordering
np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0] + 0.5 * np.random.randn(1000) > 0).astype(int)

# Enforce split BEFORE any preprocessing; the most recent 15% becomes the holdout
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.15, shuffle=False
)

# Only train/val splits see tuning; keep time order here too
# (0.176 of the remaining 850 rows is about 15% of the full dataset)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.176, shuffle=False
)

print(f"Train: {X_train.shape[0]} | Val: {X_val.shape[0]} | Holdout: {X_holdout.shape[0]}")
Output
Train: 700 | Val: 150 | Holdout: 150
Production Trap:
Never shuffle temporal data. Leaking future into past kills time-series models. Use shuffle=False for any log-ordered dataset.
Key Takeaway
A holdout set is the only honest evaluation. Split before you touch any preprocessing.
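
"Split before you touch any preprocessing" matters because fitted preprocessing steps carry information too. A minimal sketch, assuming synthetic time-ordered data and an 85/15 split: the scaler learns its mean and standard deviation from the training rows only and is then applied unchanged to the holdout.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Time-ordered split: first 85% for training, last 15% held out
split = int(len(X) * 0.85)
X_train, X_holdout = X[:split], X[split:]

# Fit the scaler on training rows only, then reuse it unchanged on the holdout
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_holdout_scaled = scaler.transform(X_holdout)

print("Scaler mean (training rows only):", scaler.mean_.round(3))
print("Holdout mean after scaling:", X_holdout_scaled.mean(axis=0).round(3))
```

The scaled holdout mean will not be exactly zero, and that is the point: the holdout is being measured with statistics it had no part in producing. Calling `fit` on the full dataset before splitting would leak holdout information into training.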

The Learning Curve Lie: Why Accuracy Metrics Hide Overfitting

Accuracy on a training set is a vanity metric. It reliably hides the silent failure of overfitting: memorization of noise rather than learning signal. You deploy a model that scores 98% accuracy on validation, and within weeks, production predictions drift, user engagement tanks, and your feature flags get blamed. The root cause is almost always a model that learned the quirks of your static holdout set instead of the true underlying patterns.

Stop trusting accuracy alone. Instead, monitor the generalization gap—the divergence between training and validation loss. A widening gap is your earliest warning that your model is memorizing, not generalizing. Apply regularization, cross-validation, or simpler architectures early. In production, track rolling accuracy windows on live data; if it dips while your eval curve looks fine, you’re seeing the lie in real time. The fix is not better data—it’s better vigilance.

generalization_gap_monitor.py (Python)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100, max_depth=None)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
gap = train_acc - val_acc

print(f"Train acc: {train_acc:.3f}")
print(f"Val acc:   {val_acc:.3f}")
print(f"Gap:       {gap:.3f}")
if gap > 0.1:
    print("WARNING: Overfitting detected — reduce model complexity.")
Output (representative; no random seed is set, so exact digits vary per run)
Train acc: 1.000
Val acc: 0.530
Gap: 0.470
WARNING: Overfitting detected — reduce model complexity.
Production Trap:
Never deploy based on test accuracy alone. A 95%+ accuracy model that fails on 5% of edge cases can tank your entire pipeline. Always monitor the gap, not just the score.
Key Takeaway
If your training accuracy is significantly higher than validation accuracy, your model is memorizing, not learning. Shrink the gap, not the dataset.
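
Tracking a rolling accuracy window on live data, as suggested in this section, needs very little code. The sketch below is one possible shape for it, assuming ground-truth labels eventually arrive for each prediction. The class name `RollingAccuracy` and the tiny window size are illustrative.

```python
from collections import deque


class RollingAccuracy:
    """Accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window):
        self.outcomes = deque(maxlen=window)  # old entries fall off automatically

    def update(self, predicted, actual):
        self.outcomes.append(predicted == actual)
        return sum(self.outcomes) / len(self.outcomes)


monitor = RollingAccuracy(window=4)
# Hypothetical stream of (predicted, actual) pairs; the model degrades over time
stream = [(1, 1), (0, 0), (1, 0), (0, 0), (1, 0), (1, 0)]
history = [monitor.update(pred, actual) for pred, actual in stream]
print(history)
```

The rolling value falls from 1.0 to 0.25 over the six events, while an all-time average would still read 0.5. In fraud detection, labels often arrive late (chargebacks, manual review), so the window reflects the past, not the present.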
● Production incident · POST-MORTEM · severity: high

When a Fraud Detection Model Starts Calling Every Transaction Fraudulent

Symptom
Fraud alert rate jumped from 2% to 80% in hours. Customer complaints flooded in. The model's training accuracy was 99.9%.
Assumption
Higher training accuracy means better performance. The team assumed a more complex model would catch more fraud.
Root cause
The model was a deep neural network trained on historical data with a massive class imbalance. During retraining, the team added more layers and trained for 500 epochs without early stopping. The model memorised the exact fraud patterns from the training set, including noise, but failed on new transaction patterns.
Fix
Rolled back to the previous simpler model. Then applied L2 regularisation (λ=0.01), added dropout (0.5), and used early stopping with a patience of 10 epochs based on validation AUC. Retained only the top 50 features. The fraud alert rate dropped back to 2%.
Key lesson
  • Never trust training accuracy alone — especially on imbalanced data.
  • Always monitor validation metrics during retraining.
  • If the validation curve starts diverging from training, stop and check.
  • More complexity isn't better; it's just more dangerous.
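
The first lesson is worth seeing in numbers. The sketch below scores two degenerate models on a hypothetical batch with 2% fraud: one that never flags anything and one that flags everything, which is the failure in this incident. The batch size and fraud rate are assumptions chosen to match the 2% figure in the post-mortem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical batch: 1,000 transactions, 20 of them fraudulent (label 1)
y_true = np.zeros(1000, dtype=int)
y_true[:20] = 1

allow_everything = np.zeros(1000, dtype=int)  # never flags fraud
block_everything = np.ones(1000, dtype=int)   # flags every transaction

for name, y_pred in [("allow everything", allow_everything),
                     ("block everything", block_everything)]:
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred)
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

The model that catches no fraud at all scores 98% accuracy, and the model that blocks every customer scores perfect recall. Neither number alone says anything useful, which is why precision and recall (or AUC) belong next to accuracy on any imbalanced problem.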
Production debug guide
Use these symptom-action pairs when a model's performance degrades after deployment (4 entries)
Symptom · 01
Model accuracy is high on training but low on live traffic
Fix
Plot learning curves: training and validation error vs training set size. A large gap indicates overfitting.
Symptom · 02
Model is consistently wrong on both training and validation data
Fix
Check feature engineering — are you using enough features? Try a more complex model (e.g., RandomForest vs LinearReg). This is underfitting.
Symptom · 03
Validation error oscillates wildly from run to run
Fix
Likely high variance. Reduce model complexity or increase regularisation. Check if training set is too small.
Symptom · 04
Adding more training data barely changes validation accuracy
Fix
If both curves are flat and close together at high error, you have high bias. More data won't help — need better features or algorithm.
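The four symptom-action pairs above reduce to two questions: is the training error itself high, and how large is the gap to validation error? A minimal sketch of that triage follows. The function name `diagnose` and both thresholds are assumptions for illustration, not universal constants.

```python
def diagnose(train_err, val_err, high_error=0.30, max_gap=0.10):
    """Classify a fit from final training and validation error rates.

    high_error and max_gap are illustrative thresholds; pick values that
    match the metric and the business tolerance of your own system.
    """
    high_bias = train_err >= high_error
    high_variance = (val_err - train_err) > max_gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_variance:
        return "overfitting (high variance)"
    if high_bias:
        return "underfitting (high bias)"
    return "reasonable fit"

print(diagnose(train_err=0.01, val_err=0.35))  # large gap
print(diagnose(train_err=0.40, val_err=0.42))  # both high, close together
print(diagnose(train_err=0.08, val_err=0.11))  # low error, small gap
```

Running this check at several training set sizes gives a numeric version of the learning curve: a gap that narrows as data grows suggests more data will help, while two flat, high curves suggest it will not.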
★ Quick Debug Cheat Sheet for Overfitting/Underfitting
When you suspect your model is overfitting or underfitting, run these commands and checks immediately.
Training accuracy > 95%, validation accuracy < 70%
Immediate action
Stop training and inspect last epochs for divergence
Commands
print('Gap:', abs(train_acc[-1] - val_acc[-1]))  # run inside the training session; train_acc and val_acc are per-epoch accuracy lists
plot_learning_curve(model, X_train, y_train, X_val, y_val)
Fix now
Add L2 regularisation (alpha=0.01) or reduce number of layers/features
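To see what the L2 penalty does, here is a small sketch using closed-form ridge regression in NumPy on synthetic data. The data, the polynomial degree and the helper names `ridge_fit` and `poly_features` are all assumptions made for illustration; for simplicity the intercept is penalised along with the other weights.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def poly_features(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

# Synthetic noisy sine data: 15 training points, 50 validation points.
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 15)
x_val = np.linspace(-0.95, 0.95, 50)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, x_train.size)
y_val = np.sin(3 * x_val) + rng.normal(0, 0.2, x_val.size)

# A degree-9 polynomial is deliberately too flexible for 15 points.
X_train = poly_features(x_train, 9)
X_val = poly_features(x_val, 9)

alphas = (1e-6, 0.01, 1.0)
results = []
for alpha in alphas:
    w = ridge_fit(X_train, y_train, alpha)
    weight_norm = float(np.linalg.norm(w))
    train_mse = float(np.mean((X_train @ w - y_train) ** 2))
    val_mse = float(np.mean((X_val @ w - y_val) ** 2))
    results.append((weight_norm, train_mse, val_mse))
    print(f"alpha={alpha:g} |w|={weight_norm:.2f} "
          f"train_mse={train_mse:.4f} val_mse={val_mse:.4f}")
```

A larger alpha always shrinks the weight vector and never lowers training error. Whether validation error improves depends on the data, so alpha is best chosen on the validation set rather than fixed at 0.01.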
Training and validation accuracy both below 60%
Immediate action
Check if model is too simple (e.g., linear on non-linear data)
Commands
print('Training accuracy:', train_acc[-1], 'Validation:', val_acc[-1])
model_complexity_curve(model, X_train, y_train, X_val, y_val, param_range)
Fix now
Increase model complexity: add polynomial features or use a more powerful algorithm (tree-based)
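One way to picture a complexity sweep without any ML library is k-nearest-neighbours regression, where a small k means a very flexible model and a large k a very rigid one. This swaps in k-NN for the polynomial and tree-based models named above purely because it fits in a few lines of standard-library Python. The synthetic data and helper names are assumptions.

```python
import math
import random

def knn_predict(x_query, xs, ys, k):
    """Average the targets of the k training points nearest to x_query."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_query))[:k]
    return sum(ys[i] for i in nearest) / k

def mse(xs_eval, ys_eval, xs_train, ys_train, k):
    """Mean squared error of k-NN predictions on an evaluation set."""
    errors = [(knn_predict(x, xs_train, ys_train, k) - y) ** 2
              for x, y in zip(xs_eval, ys_eval)]
    return sum(errors) / len(errors)

def make_data(n):
    """Noisy samples of sin(x) on [0, 6]."""
    xs = [random.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys

random.seed(0)
x_train, y_train = make_data(60)
x_val, y_val = make_data(60)

curve = {}
for k in (1, 5, 15, 60):
    train_mse = mse(x_train, y_train, x_train, y_train, k)
    val_mse = mse(x_val, y_val, x_train, y_train, k)
    curve[k] = (train_mse, val_mse)
    print(f"k={k:2d} train_mse={train_mse:.3f} val_mse={val_mse:.3f}")
```

At k=1 every training point predicts itself, so training error is zero while validation error is not: the overfitting end of the curve. At k=60 the model predicts the global mean everywhere, so both errors are high: the underfitting end.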
Validation loss starts rising after epoch N
Immediate action
Stop training immediately and revert to the best checkpoint (epoch N, the last epoch before validation loss rose)
Commands
# Early stopping callback: stops after 5 epochs without improvement and restores the best weights
keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
print('Best epoch:', best_epoch, 'Val loss:', min(val_losses))
Fix now
Retrain with early stopping and reduce learning rate by factor of 0.5
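The bookkeeping behind early stopping is small enough to write out. The sketch below is framework-free and only illustrates the idea of patience and best-checkpoint tracking; the function name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Epochs are 0-indexed. Training stops once the loss has failed to
    improve for `patience` consecutive epochs. stop_epoch is None when
    that never happens within the recorded history.
    """
    best_loss = float("inf")
    best_epoch = None
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif best_epoch is not None and epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None

# Hypothetical history: loss bottoms out at epoch 3, then climbs.
val_losses = [0.70, 0.55, 0.48, 0.45, 0.47, 0.50, 0.52, 0.56, 0.61, 0.67]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print("Best epoch:", best_epoch, "Val loss:", val_losses[best_epoch])
print("Stopped at epoch:", stop_epoch)
```

The checkpoint to restore is the one from `best_epoch`, not the one from the epoch where training stopped.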
Overfitting vs Underfitting at a Glance
| Feature | Underfitting (High Bias) | Overfitting (High Variance) |
|---|---|---|
| Training Error | High | Very Low |
| Validation Error | High | High |
| Cause | Model is too simple (e.g., Linear for Non-linear) | Model is too complex (e.g., Deep Tree for noisy data) |
| Primary Fix | Increase complexity, add features | Regularization (L1/L2), Pruning, More Data |
| Learning Curve Pattern | Both curves high and close together | Training low, validation high with gap |
| Effect of More Data | Little to no improvement | Reduces gap, helps generalize |

Key takeaways

1
Underfitting = High Bias. The model is too dumb for the data.
2
Overfitting = High Variance. The model is too 'clever' and sees patterns in noise.
3
The validation set is your compass; never tune or select models against the test set.
4
Regularization (Dropout, L1, L2) is the primary weapon against overfitting.
5
For underfitting, increase model complexity or engineer better features, not more data.
6
Automate model monitoring in production to catch drift early.
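Alert rate is one of the cheapest signals to monitor. As a minimal sketch, assuming the 2% baseline from the incident above and a hypothetical threshold of three times that baseline, a check over a window of recent decisions could look like this (the name `alert_rate_check` is invented for the example):

```python
def alert_rate_check(flags, baseline_rate=0.02, max_ratio=3.0):
    """Return (observed_rate, breached) for a window of 0/1 alert flags.

    baseline_rate and max_ratio are illustrative; tune them to your traffic.
    """
    if not flags:
        raise ValueError("window is empty")
    observed = sum(flags) / len(flags)
    return observed, observed > baseline_rate * max_ratio

normal_window = [1] * 2 + [0] * 98      # 2% of transactions flagged
incident_window = [1] * 80 + [0] * 20   # 80% flagged, as in the incident

print(alert_rate_check(normal_window))
print(alert_rate_check(incident_window))
```

A check like this does not detect every kind of drift, but it would have fired within the first window of the incident described earlier.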

Common mistakes to avoid

5 patterns
×

Throwing more data at an underfitting model

Symptom
Validation error remains high despite increasing dataset size. Training and validation errors stay close and high.
Fix
Stop adding data. Instead, increase model complexity (e.g., add polynomial features, use a tree-based model) or engineer better features.
×

Ignoring the validation set until the end of the project

Symptom
Model achieves 99% accuracy on the test set but fails in production, because the test set was also used for hyperparameter tuning and leaked information into model selection.
Fix
Split data into training, validation, and test sets upfront. Never tune hyperparameters on the test set. Use cross-validation for tuning.
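A three-way split can be done once, up front, in a few lines. The sketch below uses only the standard library; the 70/15/15 proportions, the seed and the function name `three_way_split` are assumptions for illustration.

```python
import random

def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then carve out disjoint train, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Row ids stand in for real transactions here.
train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))
```

Tune on `val`, and touch `test` once at the end. For time-ordered data such as transactions, a chronological split is often more appropriate than a random shuffle, since shuffling lets the model train on records from after the ones it is tested on.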
×

Over-tuning hyperparameters on the test set (data leakage)

Symptom
Model performs great on held-out test set but fails on new data. Hidden overfitting to the test set.
Fix
Use a separate validation set for tuning. The test set should only be used once at the end to estimate real-world performance.
×

Using a high-degree polynomial on a small dataset without regularization

Symptom
Training error near zero, but validation error is enormous. The model follows every data point.
Fix
Reduce polynomial degree, add L2 regularization, or use cross-validation to select optimal degree.
×

Assuming more features always improve performance

Symptom
As features are added, model complexity grows and validation error rises because many of the new features carry noise rather than signal.
Fix
Perform feature selection (e.g., L1 regularization, mutual information) to keep only relevant features.
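Mutual information can be computed by hand for discrete features, which makes the ranking idea easy to see. The sketch below is standard-library Python on toy data; the feature names and values are invented, and real pipelines would normally use a library estimator that also handles continuous features.

```python
import math
from collections import Counter

def mutual_information(feature, label):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, label))
    count_x = Counter(feature)
    count_y = Counter(label)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi

# Toy data: 1 = fraud. One feature tracks the label, the other ignores it.
label = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "informative": [0, 0, 0, 1, 1, 1, 1, 1],
    "noise":       [0, 1, 0, 1, 0, 1, 0, 1],
}

scores = {name: mutual_information(values, label)
          for name, values in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.4f}")
```

Keeping only the top-ranked features removes inputs that are statistically independent of the label, which the model could otherwise fit as noise.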
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
How do you distinguish between high bias and high variance using a learn...
Q02 · JUNIOR
What is the 'Double Descent' phenomenon in modern Deep Learning?
Q03 · JUNIOR
Explain the role of Early Stopping in preventing overfitting.
Q04 · JUNIOR
How would you set up a CI/CD pipeline to automatically reject overfitted...
Q01 of 04 · JUNIOR

How do you distinguish between high bias and high variance using a learning curve?

ANSWER
High Bias is identified when both training and validation errors are high and close to each other, indicating the model hasn't captured the underlying trend. High Variance is identified by a large 'gap' between a low training error and a significantly higher validation error. In a production debugging scenario, you'd plot learning curves for different training set sizes and look for convergence or divergence.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Can a model be both overfit and underfit?
02
Why does regularization reduce overfitting?
03
Does increasing the number of features cause overfitting?
04
How do I know if I need more data or a different model?
05
What is the difference between cross-validation and a validation set?