Intermediate 6 min · March 06, 2026

Overfitting and Underfitting

Overfitting - When Fraud Detection Blocks All Transactions

Q: Can a model be both overfit and underfit?

Technically, yes, in complex multi-output models or deep neural networks where different layers or regions of the data exhibit different behaviors (e.g., overfitting on one class but underfitting on another due to class imbalance). Also in ensemble methods, some base models may overfit while others underfit.

Q: Why does regularization reduce overfitting?

Regularization adds a penalty term to the loss function based on the size of the model weights. This discourages the model from relying too heavily on any single feature, forcing it to find simpler, more robust patterns. From a Bayesian viewpoint, L2 regularisation corresponds to a Gaussian prior on the weights and shrinks them smoothly; L1 corresponds to a Laplace prior and encourages sparsity by driving some weights to exactly zero.

Q: Does increasing the number of features cause overfitting?

It can. Each added feature gives the model another opportunity to find accidental correlations that don't exist in the broader population, and with a fixed number of samples the data becomes increasingly sparse as dimensions grow (the 'Curse of Dimensionality'). Informative features can still help, so feature selection and regularisation are critical when dealing with high-dimensional data.
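
To make the "accidental correlations" point concrete, here is a minimal sketch that draws a random target and a growing number of pure-noise features, then reports the strongest correlation it finds. The sample size, the feature counts, and the helper names (`pearson`, `max_spurious_correlation`) are illustrative assumptions, not something taken from a real dataset.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def max_spurious_correlation(n_samples, feature_counts, seed=0):
    """Strongest |correlation| between a random target and the first k noise features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    features = [[rng.gauss(0, 1) for _ in range(n_samples)]
                for _ in range(max(feature_counts))]
    abs_corr = [abs(pearson(f, target)) for f in features]
    # The first k features are a prefix of the full set, so the maximum can only grow
    return {k: max(abs_corr[:k]) for k in feature_counts}

result = max_spurious_correlation(n_samples=20, feature_counts=[5, 50, 500])
for k, r in result.items():
    print(f"{k:>3} noise features -> strongest |r| = {r:.2f}")
```

None of these features carries any signal, yet the best-looking one appears more predictive as the pool grows. A flexible model will happily use it.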

Q: How do I know if I need more data or a different model?

Plot learning curves. If both training and validation errors are high and close together, the model is underfitting: you need a more complex model or more informative features, and more data will not help. If training error is low and validation error is high with a clear gap, the model is overfitting: adding more data will typically narrow the gap, as will regularisation or a simpler model.
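
The decision rule above can be sketched as a small helper. The function name `diagnose_fit` and both thresholds are assumptions for illustration; sensible values depend on the metric and the project.

```python
def diagnose_fit(train_error, val_error, acceptable_error, max_gap):
    """Classify a model from its final training and validation errors.

    acceptable_error and max_gap are project-specific thresholds (assumptions).
    """
    gap = val_error - train_error
    if train_error > acceptable_error and gap <= max_gap:
        return "underfitting: add capacity or better features"
    if gap > max_gap:
        return "overfitting: regularise, simplify, or add data"
    return "good fit"

# Both errors high and close together
print(diagnose_fit(0.30, 0.32, acceptable_error=0.10, max_gap=0.05))
# Low training error, high validation error
print(diagnose_fit(0.02, 0.25, acceptable_error=0.10, max_gap=0.05))
# Both errors low and close together
print(diagnose_fit(0.06, 0.08, acceptable_error=0.10, max_gap=0.05))
```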

Q: What is the difference between cross-validation and a validation set?

Cross-validation splits the training data into k folds and trains k models, each using a different fold for validation. It gives a more robust estimate of performance. A single validation set is a fixed holdout used for hyperparameter tuning. In production, we often use k-fold cross-validation for model selection and a separate test set for final evaluation.
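
For readers who want to see what "splitting into k folds" means mechanically, here is a dependency-free sketch. The helper `k_fold_indices` is hypothetical and does no shuffling; in practice a library splitter would normally be used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Each of the k models is trained on `train_idx` and scored on `val_idx`; averaging the k scores gives the cross-validated estimate, while the separate test set stays untouched.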

Fraud alert rate jumped from 2% to 80% in hours due to overfitting on imbalanced data; never trust training accuracy - use our production debug guide.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

✓ Production

production tested

July 19, 2026

last updated


Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts


⚡Quick Answer

Overfitting: training error low, validation error high — model memorises noise
Underfitting: both training and validation errors high — model misses signal
Use learning curves: training/validation error vs training set size
Fix overfitting: regularisation, more data, simplify model, early stopping
Fix underfitting: increase complexity, add features, train longer
Biggest mistake: adding more data when the model is underfitting — it won't help

✦ Definition~90s read

What is Overfitting and Underfitting?

Overfitting in the context of fraud detection occurs when a machine learning model learns the training data too precisely, including its noise and outliers, to the point where it fails to generalize to new, unseen transactions. In the extreme case described—'blocks all transactions'—the model has become so narrowly tailored to the specific fraudulent patterns in its training set that it incorrectly classifies legitimate transactions as fraudulent, effectively shutting down all activity.


This is not a sign of a robust system but a failure of model validation and regularization, where the model's complexity exceeds what the underlying signal in the data can support.

This phenomenon exists because fraud detection models are often trained on highly imbalanced datasets with rare but varied fraudulent behaviors. To maximize recall on the training set, the model may memorize spurious correlations—such as specific merchant IDs, time-of-day quirks, or user-agent strings—that happen to align with fraud in the historical data but are irrelevant or misleading in production.

Without proper cross-validation, early stopping, or regularization techniques like dropout or L1/L2 penalties, the model's decision boundary becomes overly intricate, leading to a high false-positive rate that can paralyze transaction processing.

Overfitting fits into the broader machine learning lifecycle as a critical failure mode during the training-validation gap. It is the opposite of underfitting, where the model is too simple to capture patterns. In fraud detection specifically, overfitting undermines the core business goal: distinguishing genuine from fraudulent transactions with acceptable precision and recall.

Addressing it requires techniques such as simplifying the model architecture, increasing training data diversity, applying regularization, and rigorously testing on out-of-time or out-of-distribution samples to ensure the model remains robust in production.

Plain-English First

Imagine you're studying for a history exam. If you memorise every question from last year's paper word-for-word, you'll ace a re-run but completely blank on any new question — that's overfitting. If you barely glance at the textbook and just guess 'World War 2' for everything, you'll fail because you learned too little — that's underfitting. A great student learns the patterns and principles, not the exact answers. Your ML model needs to do exactly the same thing.

Every ML model you build has one job: make good predictions on data it has never seen before. It sounds simple, but the single biggest reason models fail in production isn't bad algorithms or messy data — it's getting the balance of learning wrong. A model that learns too much from its training data becomes obsessed with noise and quirks that don't generalise. A model that learns too little never captures the real signal in the first place. Both failures have names, both are measurable, and both are fixable once you understand what's actually happening inside the model.

Overfitting and underfitting sit at opposite ends of a spectrum called the bias-variance tradeoff. Understanding this tradeoff is what separates engineers who tune models by intuition from those who tune them systematically. When you know WHY a model overfits, you stop throwing random regularisation at it and start making deliberate, principled decisions about complexity, data size, and training strategy.

By the end of this article you'll be able to plot a learning curve and diagnose whether your model is overfitting or underfitting just by looking at it. You'll have working Python code that deliberately creates both problems and then fixes them — so the concepts stick in your hands, not just your head. And you'll walk away knowing exactly which levers to pull in each scenario.

Overfitting vs. Underfitting — The Two Failure Modes of Learning

Overfitting is when a model learns the training data too precisely, including its noise and outliers, resulting in poor generalization to unseen data. Underfitting is the opposite: the model fails to capture the underlying patterns, performing poorly even on the training set. The core mechanic is a mismatch between model capacity and data complexity — too much capacity relative to signal leads to memorization; too little leads to oversimplification.

In practice, overfitting manifests as high variance: small changes in the training data cause large swings in the fitted model and its predictions. Underfitting shows high bias: the model consistently misses the true relationship. A classic symptom of overfitting is training accuracy near 100% while validation accuracy plateaus or drops. Underfitting shows both training and validation accuracy stuck below acceptable thresholds. The key property is that both degrade real-world performance, but they require opposite remedies: more regularization or more data for overfitting, more capacity or better features for underfitting.

You encounter these when deploying any learned model to production. For fraud detection, an overfit model might block 99% of legitimate transactions because it memorized rare fraud patterns from training, while an underfit model might let through obvious fraud because it never learned the signal. The practical goal is to find the sweet spot where validation error is minimized — typically by monitoring the gap between training and validation metrics and using techniques like cross-validation, early stopping, or pruning.

⚠ The Double Descent Trap

Modern deep networks can overfit even with massive data — but sometimes performance improves again after the overfitting peak. Never assume more parameters always cause overfitting.

📊 Production Insight

Fraud detection model trained on last month's transactions blocks 40% of legitimate purchases on Black Friday because it overfit to seasonal patterns.

Symptom: validation accuracy drops sharply while training accuracy stays high — the classic generalization gap.

Rule of thumb: if your validation loss starts rising while training loss still falls, stop training immediately — you've entered the overfitting regime.
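
Validation loss is noisy, so stopping on the very first uptick can end training too early. A common refinement is to tolerate a few non-improving epochs (a "patience" window). The sketch below assumes you already have a list of per-epoch validation losses; the function name and the numbers are illustrative.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; best_epoch is where the weights would be restored.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: training ran to the last epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop, best = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop}, restore weights from epoch {best}")
# prints: stop at epoch 6, restore weights from epoch 3
```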

🎯 Key Takeaway

Overfitting is memorization, not learning — it fails on new data because it learned the noise.

Underfitting is ignorance — the model never captured the signal, so it fails everywhere.

The only reliable cure is to measure generalization error on a held-out set and stop before the gap widens.

thecodeforge.io

Overfitting Underfitting

The Technical Root: Bias vs. Variance

To fix a failing model, you must diagnose its soul. Underfitting is caused by High Bias—the model makes simplistic assumptions about the data. Overfitting is caused by High Variance—the model is overly sensitive to small fluctuations in the training set. In a production environment, we use Learning Curves (plotting Error vs. Training Set Size) to visualize this struggle.

The bias-variance tradeoff is fundamental. A model with high bias pays little attention to data and consistently underfits. A model with high variance pays too much attention to data, including noise. Finding the sweet spot is what model tuning is all about.
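
Bias and variance can be estimated empirically by retraining on many resampled training sets and watching how the predictions move. The sketch below does this for two deliberately extreme toy models on a noisy sine wave: a constant predictor (high bias) and a 1-nearest-neighbour predictor (high variance). The data generator, sample sizes, and model choices are all assumptions made for illustration.

```python
import math
import random

def true_fn(x):
    return math.sin(x)

def sample_training_set(rng, n=30, noise=0.3):
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [true_fn(x) + rng.gauss(0, noise) for x in xs]
    return xs, ys

def constant_model(xs, ys):
    """High-bias model: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def nearest_neighbour_model(xs, ys):
    """High-variance model: copies the label of the closest training point."""
    def predict(x):
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[i]
    return predict

def bias_variance(make_model, n_rounds=200, seed=0):
    """Estimate average squared bias and variance over a grid of test points."""
    rng = random.Random(seed)
    test_xs = [-3 + 6 * i / 20 for i in range(21)]
    preds = [[] for _ in test_xs]
    for _ in range(n_rounds):
        model = make_model(*sample_training_set(rng))
        for k, x in enumerate(test_xs):
            preds[k].append(model(x))
    bias_sq = 0.0
    variance = 0.0
    for x, p in zip(test_xs, preds):
        mean_p = sum(p) / len(p)
        bias_sq += (mean_p - true_fn(x)) ** 2
        variance += sum((v - mean_p) ** 2 for v in p) / len(p)
    return bias_sq / len(test_xs), variance / len(test_xs)

const_bias, const_var = bias_variance(constant_model)
nn_bias, nn_var = bias_variance(nearest_neighbour_model)
print(f"constant model: bias^2={const_bias:.3f} variance={const_var:.3f}")
print(f"1-NN model:     bias^2={nn_bias:.3f} variance={nn_var:.3f}")
```

The constant model barely changes between training sets but is systematically wrong; the nearest-neighbour model is right on average but jumps around with every new sample.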

In practice, you'll see three patterns:

Both curves high and close together: high bias (underfitting)
Training curve low, validation curve high with a gap: high variance (overfitting)
Both curves low and close together: good fit

model_diagnostics.pyPYTHON

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# io.thecodeforge approach: Systematic Diagnostic
def generate_learning_curve(model, X, y):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y, cv=5, scoring='neg_mean_squared_error'
    )

    # Scores are negated MSE: flip the sign and average across the CV folds
    train_mean = -np.mean(train_scores, axis=1)
    test_mean = -np.mean(test_scores, axis=1)

    return train_sizes, train_mean, test_mean

# Synthetic non-linear data (illustrative): a noisy sine wave in random order
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# Underfitting: Linear model on non-linear data
underfit_model = LinearRegression()

# Overfitting: High-degree polynomial
overfit_model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=15, include_bias=False)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
])

# Run the diagnostic for both models
underfit_curve = generate_learning_curve(underfit_model, X, y)
overfit_curve = generate_learning_curve(overfit_model, X, y)
print("[Learning Curve Data Generated]")

Output

[Learning Curve Data Generated]

⚠ The Convergence Trap

In Underfitting, the training and validation curves converge quickly but at a high error rate. Adding more data won't help; you need a more complex model.

📊 Production Insight

A common production mistake is training a linear model on non-linear data and then throwing more data at it.

The curves will converge at high error — clear high bias.

Rule: if both errors are high and close, increase model complexity, not data size.

🎯 Key Takeaway

Bias trades off against variance.

Underfitting = high bias (curves high and close).

Overfitting = high variance (curves diverge).

Detecting Overfitting Early with Validation Curves

Validation curves show how model performance changes with a hyperparameter (e.g., polynomial degree, tree depth). They help you find the sweet spot before overfitting takes hold. Plotting validation curves during development is far cheaper than discovering the problem in production.

In production systems, we automate this with CI/CD pipelines that generate validation curves for every candidate model. The pipeline rejects any model where the validation curve shows a gap larger than a configurable threshold (typically 10-15% for classification accuracy).

Here's how to generate a validation curve for polynomial degree using scikit-learn:

validation_curve.pyPYTHON

from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

def generate_validation_curve(X, y, param_range):
    model = Pipeline([
        ("poly", PolynomialFeatures(include_bias=False)),
        ("reg", LinearRegression()),
    ])
    train_scores, test_scores = validation_curve(
        model, X, y,
        param_name="poly__degree",
        param_range=param_range,
        cv=5,
        scoring="neg_mean_squared_error"
    )
    train_mean = -np.mean(train_scores, axis=1)
    test_mean = -np.mean(test_scores, axis=1)
    return param_range, train_mean, test_mean

💡Automated Thresholding

In our CI/CD pipelines, we set a max allowed gap of 0.15 (15 percentage points) between training and validation accuracy. Any model exceeding this is automatically rejected and logged for review.
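
A gate like this can be very small. The sketch below is one possible shape for it, not the actual pipeline code: the function name `passes_gap_gate` and the candidate metrics are made up for illustration.

```python
def passes_gap_gate(train_acc, val_acc, max_gap=0.15):
    """Return True when the train/validation accuracy gap is within max_gap.

    Accuracies are fractions in [0, 1]; max_gap is in the same units,
    so 0.15 means 15 percentage points.
    """
    return (train_acc - val_acc) <= max_gap

candidates = {  # hypothetical candidate models: (train accuracy, validation accuracy)
    "shallow_tree": (0.86, 0.83),
    "deep_net": (0.999, 0.71),
}
for name, (train_acc, val_acc) in candidates.items():
    verdict = "accept" if passes_gap_gate(train_acc, val_acc) else "reject"
    print(f"{name}: gap={train_acc - val_acc:.3f} -> {verdict}")
```

In a CI job, a `reject` verdict would typically fail the build step and attach the metrics to the log for review.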

📊 Production Insight

Validation curves are not just for research — they catch overfitting before deployment.

Automate the gap check: if training accuracy - validation accuracy > 0.15, the model likely overfits.

Rule: reject models with high variance early, before they hit production.

🎯 Key Takeaway

Validation curves reveal the optimal hyperparameter range.

Stop before the validation error starts rising.

Automate the gap threshold check in your MLOps pipeline.


Fixing Overfitting: Regularisation, Dropout, and Pruning

When you've confirmed overfitting, the toolkit is broad but principled. The most effective levers are:

L1/L2 Regularisation: Adds a penalty to large weights. L1 drives weights to zero (feature selection), L2 shrinks them.
Dropout: Randomly drops neurons during training — forces the network to learn redundant representations.
Early Stopping: Monitors validation loss and stops training when it starts increasing.
Reduce Model Complexity: Fewer layers, fewer trees, lower polynomial degree.
Increase Training Data: More data reduces the impact of noise.
Feature Selection: Remove irrelevant features that introduce noise.
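
The difference between "drives weights to zero" and "shrinks them" is easiest to see on a single weight. Assuming a simplified one-dimensional problem where z is the unregularised least-squares weight, the two penalised solutions have closed forms, sketched below. The function names and the sample weights are illustrative.

```python
def l2_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + 0.5*lam*w**2: scales the weight down."""
    return z / (1.0 + lam)

def l1_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    magnitude = max(abs(z) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # small weights are removed entirely
    return magnitude if z > 0 else -magnitude

unregularised = [2.0, -0.8, 0.3, -0.05]
lam = 0.5
print("L2:", [round(l2_shrink(z, lam), 3) for z in unregularised])
print("L1:", [round(l1_shrink(z, lam), 3) for z in unregularised])
```

L2 leaves every weight non-zero but smaller; L1 subtracts a fixed amount and zeroes anything that falls below the threshold, which is why it doubles as feature selection.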

Here's a Python example that applies three of these levers together: L2 regularisation, dropout, and early stopping.

io/thecodeforge/fix_overfitting.pyPYTHON

from tensorflow import keras
from tensorflow.keras import layers, regularizers, callbacks

def build_regularised_model(input_dim):
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(64, activation='relu',
                     kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.5),
        layers.Dense(32, activation='relu',
                     kernel_regularizer=regularizers.l2(0.01)),
        layers.Dropout(0.3),
        layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Early stopping callback: stop after 10 epochs without improvement
# and roll back to the best weights seen so far
early_stop = callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

# X_train, y_train, X_val, y_val are assumed to be NumPy arrays prepared earlier
model = build_regularised_model(X_train.shape[1])
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=200, callbacks=[early_stop], verbose=0)

Mental Model

Regularisation as a Constraint

Think of regularisation as adding friction to your model's learning process — it can't swing its weights wildly.

L2 regularisation adds a penalty proportional to the square of weights — common in neural nets.
L1 regularisation adds penalty proportional to absolute weight — useful for feature selection.
Dropout randomly turns off neurons — like training an ensemble of simpler models.
Early stopping cuts training when validation stops improving — prevents memorisation.
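
To show what "randomly turns off neurons" means in practice, here is a dependency-free sketch of inverted dropout, the variant where surviving activations are scaled up during training so nothing needs rescaling at inference. The function below is an illustration, not a library API.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training
    and scale survivors by 1/(1 - rate) so the expected activation is unchanged.
    At inference time the activations pass through untouched."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10_000
dropped = dropout(layer_output, rate=0.5, rng=rng)
print("fraction zeroed:", dropped.count(0.0) / len(dropped))
print("mean activation:", sum(dropped) / len(dropped))
```

Roughly half the units are silenced on each pass, yet the mean activation stays near its original value, so downstream layers see inputs on the same scale in training and inference.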

📊 Production Insight

Dropout should typically be 0.2-0.5 — too high and the model underfits.

L2 regularisation coefficient (λ) is often tuned via cross-validation; 0.01 is a common starting point.

Rule: always combine regularisation with early stopping in production pipelines.

🎯 Key Takeaway

Regularisation reduces variance.

Dropout, L2, and early stopping are the three pillars.

Always validate on a holdout set after applying them.

Fixing Underfitting: Complexity, Features, and Training Time

Underfitting means your model is too simple. The solution is to give it more capacity. But adding complexity blindly can tip into overfitting — you need a controlled approach.

Common fixes for underfitting:

Increase Model Complexity: Use higher-degree polynomials, deeper trees, more layers.
Engineer Better Features: Interactions, polynomial features, domain-specific aggregations.
Reduce Regularisation: Too much regularisation can itself cause underfitting.
Train Longer: Sometimes the model just needs more epochs to converge.
Reduce Feature Noise: Remove irrelevant features that dilute signal.

Here's a Python workflow that detects underfitting and applies fixes systematically:

io/thecodeforge/fix_underfitting.pyPYTHON

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

def diagnose_and_fix_underfitting(X, y):
    # Baseline linear model
    linear_model = LinearRegression()
    linear_score = -cross_val_score(linear_model, X, y, cv=5, 
                                    scoring='neg_mean_squared_error').mean()
    
    # Try polynomial features (degree 3)
    poly_model = Pipeline([
        ('poly', PolynomialFeatures(degree=3, include_bias=False)),
        ('scaler', StandardScaler()),
        ('reg', LinearRegression())
    ])
    poly_score = -cross_val_score(poly_model, X, y, cv=5, 
                                  scoring='neg_mean_squared_error').mean()
    
    if poly_score < linear_score * 0.8:
        print("Underfitting fixed with polynomial features. Using poly model.")
        return poly_model
    else:
        print("Still underfitting. Try feature engineering or more complex algorithm.")
        return None

⚠ The Complexity Trap

Don't jump from a linear model straight to a 10-layer neural net. Increase complexity step by step and validate at each step. Otherwise you'll overshoot and end up overfitting.

📊 Production Insight

In production, I've seen teams add neural networks to solve underfitting when polynomial features would have done the job.

Start with the simplest fix (polynomial features, interaction terms) and escalate.

Rule: underfitting is often a feature engineering problem, not a model problem.

🎯 Key Takeaway

Underfitting = not enough capacity.

Fix: more features, more complexity, less regularisation.

Validate with cross-validation after each change.

Production-Grade Monitoring and Alerting for Model Drift

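Once your model is in production, its performance can still degrade as data distributions change. Covariate shift means the input feature distribution has moved away from the training data; concept drift means the relationship between the features and the target has changed. Either one can make a previously well-fit model behave as if it were overfit or underfit, so you need automated monitoring.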

At TheCodeForge, we deploy a model health monitor that checks:

Prediction distribution: Does the average prediction stay stable?
Feature distribution: Are incoming features within the training range?
Error rate over time: Is the model's error creeping up?
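
The first two checks in the list can be sketched in a few lines of Python. The helper names and the sample numbers below are hypothetical; a real monitor would read them from feature logs and would likely use a more robust statistic than min/max.

```python
def out_of_range_fraction(train_values, live_values):
    """Share of live feature values outside the [min, max] seen in training."""
    lo, hi = min(train_values), max(train_values)
    outside = sum(1 for v in live_values if v < lo or v > hi)
    return outside / len(live_values)

def mean_shift(baseline_preds, live_preds):
    """Relative change in the average prediction versus the baseline.

    Assumes the baseline average is non-zero.
    """
    base = sum(baseline_preds) / len(baseline_preds)
    live = sum(live_preds) / len(live_preds)
    return (live - base) / abs(base)

# Hypothetical data: transaction amounts and fraud scores
train_amounts = [12.0, 40.0, 55.0, 80.0, 120.0]
live_amounts = [30.0, 500.0, 75.0, 950.0]
baseline_scores = [0.01, 0.03, 0.02, 0.02]
live_scores = [0.40, 0.70, 0.90, 0.80]

print("out of range:", out_of_range_fraction(train_amounts, live_amounts))
print("mean shift:", round(mean_shift(baseline_scores, live_scores), 1))
```

Here half the incoming amounts fall outside anything seen in training and the average score has jumped many times over, both of which would warrant an alert before error rates are even available.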

Here's a Java implementation that logs warnings when drift indicators trigger:

io/thecodeforge/ml/ModelHealthMonitor.javaJAVA

package io.thecodeforge.ml;

import java.util.logging.Logger;
import java.time.Instant;

public class ModelHealthMonitor {
    private static final Logger logger = Logger.getLogger(ModelHealthMonitor.class.getName());
    private double baselineErrorRate;
    private double driftThreshold = 0.2; // 20% increase
    
    public ModelHealthMonitor(double baselineErrorRate) {
        this.baselineErrorRate = baselineErrorRate;
    }
    
    public void checkCurrentErrorRate(double currentErrorRate) {
        double relativeChange = (currentErrorRate - baselineErrorRate) / baselineErrorRate;
        if (relativeChange > driftThreshold) {
            logger.warning("CRITICAL: Model error rate increased by " + 
                           Math.round(relativeChange * 100) + "%. Overfitting? Underfitting? Check learning curves.");
        } else {
            logger.info("Model error rate within bounds.");
        }
    }
    
    public static void main(String[] args) {
        ModelHealthMonitor monitor = new ModelHealthMonitor(0.10);
        monitor.checkCurrentErrorRate(0.18);
    }
}

Output

CRITICAL: Model error rate increased by 80%. Overfitting? Underfitting? Check learning curves.

🔥Retrain Cadence

Most models need retraining when error increases by more than 15-20% from baseline. Automate retraining triggers, but always validate on a holdout set before deploying the new model.

📊 Production Insight

Drift detection is not optional — it's a production requirement.

I've seen teams lose millions because they didn't monitor for covariate shift.

Rule: deploy a monitoring endpoint that returns current error rates and distribution stats.

🎯 Key Takeaway

Models degrade in production.

Monitor error rates and feature distributions.

Alert when drift exceeds 20% and trigger retraining.

What is Underfitting? — The Model That Won't Learn

Underfitting is what happens when your model is too stupid to see the patterns right in front of it. It performs like garbage on both training and test data. This isn't a generalization problem — it's a failure to learn in the first place.

The root cause is always the same: the model's capacity is too low for the complexity of the data. You're trying to fit a straight line through a sine wave. Common culprits include: a model that's too simple (think linear regression on a non-linear problem), brutal regularization that kills signal, weak or irrelevant features, or you just didn't train long enough for the optimizer to converge.

From the bias-variance lens, underfitting is pure high bias territory. The model makes strong, dumb assumptions before seeing any data — "all relationships are linear" — and it ignores real patterns to stick with those assumptions. Variance is low because the model gives the same crappy output no matter what data you throw at it. Underfitting = High Bias + Low Variance. Memorize that.

DetectUnderfitting.pyPYTHON

# io.thecodeforge: ml-ai tutorial

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic non-linear data
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.normal(size=100)

# Deliberately underfit with a linear model
model = LinearRegression()
model.fit(X, y)

y_train_pred = model.predict(X)
print(f"Train MSE: {mean_squared_error(y, y_train_pred):.4f}")

# Both training and test errors are high
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
y_test = np.sin(X_test).ravel() + 0.1 * np.random.normal(size=50)
y_test_pred = model.predict(X_test)
print(f"Test MSE:  {mean_squared_error(y_test, y_test_pred):.4f}")

Output

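Train MSE: approximately 0.17

Test MSE: approximately 0.17

(Values are approximate; the exact digits depend on the noise draw. The point is that both errors sit at the same elevated level, far above the 0.01 noise variance.)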

⚠ Production Trap:

Don't confuse underfitting with 'not enough data.' If your training loss has plateaued and you're still seeing high error on both splits, you need a more complex model, not more data. More data won't fix a linear model trying to learn a sine wave.

🎯 Key Takeaway

When training and test errors are both high and similar, you're underfitting. Add model capacity before you add data.

What is Overfitting? — The Model That Memorized Your Trash

Overfitting is the opposite failure mode: your model became a savant at memorizing the training data, including every outlier, every random noise spike, and every accidental duplicate. It nails training accuracy — 99% baby! — then falls apart on production data like a house of cards in a hurricane.

This happens when the model has too much capacity relative to the data. Too many parameters, too many features, too many layers, not enough regularization. The model finds patterns that don't exist. It learns the noise.

Bias-Variance breakdown: overfitting is high variance, low bias. The model makes almost no assumptions (low bias), so it contorts itself to fit every single training point. Variance is through the roof because if you train on a different sample, the model learns completely different noise patterns. Your predictions are all over the place. Overfitting = Low Bias + High Variance. If your validation loss starts going UP while training loss keeps dropping, you're here. Pull the cord.

SignalVsNoiseOverfit.pyPYTHON

# io.thecodeforge: ml-ai tutorial

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate data with intentional noise features
np.random.seed(42)
n = 50
X_signal = np.random.uniform(-1, 1, (n, 2))
y = 0.8 * X_signal[:, 0] + 0.2 * X_signal[:, 1] + 0.1 * np.random.normal(size=n)

# Add 50 random noise features — classic overfitting trap
X_noise = np.random.normal(size=(n, 50))
X_full = np.hstack([X_signal, X_noise])

# Deep forest with no regularization — pure overfit
model = RandomForestRegressor(n_estimators=500, max_depth=None, min_samples_leaf=1)
model.fit(X_full, y)

# Train performance deceptively good
train_pred = model.predict(X_full)
print(f"Train MSE: {mean_squared_error(y, train_pred):.4f}")

# Test on fresh data — watch the train wreck
X_test_signal = np.random.uniform(-1, 1, (n, 2))
y_test = 0.8 * X_test_signal[:, 0] + 0.2 * X_test_signal[:, 1] + 0.1 * np.random.normal(size=n)
X_test_noise = np.random.normal(size=(n, 50))
X_test_full = np.hstack([X_test_signal, X_test_noise])
test_pred = model.predict(X_test_full)
print(f"Test MSE:  {mean_squared_error(y_test, test_pred):.4f}")

Output

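Train MSE: several times smaller than the test MSE

Test MSE: several times larger than the train MSE

(Exact figures depend on the scikit-learn version and the random draws; the signal to look for is the gap between the two.)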

🔥Senior Shortcut:

Watch the ratio of test MSE to train MSE. Anything above 5x means your model is memorizing noise. Cut features, add regularization, or get more data. Don't chase 99% training accuracy — the market doesn't reward overconfident failures.

🎯 Key Takeaway

If your training error is near zero but test error sucks, you're overfitting. The gap between train and test performance is your alarm bell.

Why Every Model Needs a Holdout Set From Day One

Overfitting starts the second you evaluate on training data. You don't need complex math—you need a wall between your model and its score. A holdout set is that wall. Split before you train, not after. The test set defines reality; everything else is a lie.

In production, you can't trust a model that's never seen unseen data. Validation curves hide behind averages. Holdout exposes per-sample failure. It's the cheapest insurance you'll ever buy. If your pipeline doesn't enforce a strict 70/15/15 split at the dataset constructor, you're shipping technical debt.

Senior teams force holdout sets into CI/CD. Data leakage kills models silently. Start with a held-back slice, never touch it during tuning, and only use it for final kill/no-kill decisions. That one slice prevents months of wasted retraining.

HoldoutSplitter.pyPYTHON

# io.thecodeforge: ml-ai tutorial

import numpy as np
from sklearn.model_selection import train_test_split

# Simulate real-world dataset with temporal ordering
np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0] + 0.5 * np.random.randn(1000) > 0).astype(int)

# Enforce split BEFORE any preprocessing
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.15, shuffle=False
)

# Only train/val splits see tuning; keep time order so validation is newer than training
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.176, shuffle=False
)

print(f"Train: {X_train.shape[0]} | Val: {X_val.shape[0]} | Holdout: {X_holdout.shape[0]}")

Output

Train: 700 | Val: 150 | Holdout: 150

⚠ Production Trap:

Never shuffle temporal data. Leaking future into past kills time-series models. Use shuffle=False for any log-ordered dataset.
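
When records do not arrive in time order, sorting by timestamp before slicing gives the same guarantee. A minimal sketch, assuming each record is a `(timestamp, payload)` tuple; the function name and the sample log are hypothetical.

```python
def chronological_split(records, train_frac=0.70, val_frac=0.15):
    """Split records into train/val/holdout by time, oldest first.

    Each record is a (timestamp, payload) tuple; the newest slice becomes
    the holdout so no future information leaks into training.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

# Hypothetical log: timestamps arrive out of order
timestamps = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0,
              15, 11, 19, 13, 17, 12, 18, 14, 16, 10]
records = [(t, f"txn-{t}") for t in timestamps]
train, val, holdout = chronological_split(records)
print(len(train), len(val), len(holdout))
```

Every training timestamp is older than every validation timestamp, and every validation timestamp is older than the holdout, which mirrors how the model will be used after deployment.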

🎯 Key Takeaway

A holdout set is the only honest evaluation. Split before you touch any preprocessing.

The Learning Curve Lie: Why Accuracy Metrics Hide Overfitting

Accuracy on a training set is a vanity metric. It reliably hides the silent failure of overfitting: memorization of noise rather than learning signal. You deploy a model that scores 98% accuracy on validation, and within weeks, production predictions drift, user engagement tanks, and your feature flags get blamed. The root cause is almost always a model that learned the quirks of your static holdout set instead of the true underlying patterns.

Stop trusting accuracy alone. Instead, monitor the generalization gap—the divergence between training and validation loss. A widening gap is your earliest warning that your model is memorizing, not generalizing. Apply regularization, cross-validation, or simpler architectures early. In production, track rolling accuracy windows on live data; if it dips while your eval curve looks fine, you’re seeing the lie in real time. The fix is not better data—it’s better vigilance.

generalization_gap_monitor.pyPYTHON

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100, max_depth=None)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
gap = train_acc - val_acc

print(f"Train acc: {train_acc:.3f}")
print(f"Val acc:   {val_acc:.3f}")
print(f"Gap:       {gap:.3f}")
if gap > 0.1:
    print("WARNING: Overfitting detected — reduce model complexity.")

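Output

(Representative run: the labels are random and no seed is set, so validation accuracy hovers around 0.5 and the exact figures change between runs.)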

Train acc: 1.000

Val acc: 0.530

Gap: 0.470

WARNING: Overfitting detected — reduce model complexity.

⚠ Production Trap:

Never deploy based on test accuracy alone. A 95%+ accuracy model that fails on 5% of edge cases can tank your entire pipeline. Always monitor the gap, not just the score.

🎯 Key Takeaway

If your training accuracy is significantly higher than validation accuracy, your model is memorizing, not learning. Shrink the gap, not the dataset.
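
The rolling accuracy window mentioned above can be kept with a fixed-length deque. This is a sketch under the assumption that ground-truth labels eventually arrive for live predictions; the class name and the sample stream are illustrative.

```python
from collections import deque

class RollingAccuracy:
    """Accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window):
        self.hits = deque(maxlen=window)  # old outcomes fall off automatically

    def update(self, predicted, actual):
        self.hits.append(predicted == actual)
        return sum(self.hits) / len(self.hits)

monitor = RollingAccuracy(window=4)
# Hypothetical live stream of (prediction, label): right at first, then degrading
stream = [(1, 1), (0, 0), (1, 1), (0, 0), (1, 0), (1, 0), (0, 1), (1, 0)]
history = [monitor.update(pred, actual) for pred, actual in stream]
print(history)
# prints: [1.0, 1.0, 1.0, 1.0, 0.75, 0.5, 0.25, 0.0]
```

A lifetime average over the same stream would still read 50% at the end; the short window is what makes the decline visible quickly.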

● Production incident · POST-MORTEM · severity: high

When a Fraud Detection Model Starts Calling Every Transaction Fraudulent

Symptom

Fraud alert rate jumped from 2% to 80% in hours. Customer complaints flooded in. The model's training accuracy was 99.9%.

Assumption

Higher training accuracy means better performance. The team assumed a more complex model would catch more fraud.

Root cause

The model was a deep neural network trained on historical data with a massive class imbalance. During retraining, the team added more layers and trained for 500 epochs without early stopping. The model memorised the exact fraud patterns from the training set, including noise, but failed on new transaction patterns.

Fix

Rolled back to the previous simpler model. Then applied L2 regularisation (λ=0.01), added dropout (0.5), and used early stopping with a patience of 10 epochs based on validation AUC. Retained only the top 50 features. The fraud alert rate dropped back to 2%.

Key lesson

Never trust training accuracy alone — especially on imbalanced data.
Always monitor validation metrics during retraining.
If the validation curve starts diverging from training, stop and check.
More complexity isn't better; it's just more dangerous.

Production debug guide

Use these symptom-action pairs when a model's performance degrades after deployment.

Symptom · 01

Model accuracy is high on training but low on live traffic

→

Fix

Plot learning curves: training and validation error vs training set size. A large gap indicates overfitting.
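As a rough sketch of this check, scikit-learn's `learning_curve` computes training and validation scores at several training set sizes. The dataset here is synthetic with 20% label noise, and the numbers are printed instead of plotted so the example has no plotting dependency; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic data with noisy labels (assumption: stands in for your real dataset).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=50, max_depth=None, random_state=0)

# Scores have shape (n_sizes, n_folds); average across the 5 CV folds.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

train_err = 1.0 - train_scores.mean(axis=1)
val_err = 1.0 - val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_err, val_err):
    print(f"n={n:4d}  train error={tr:.3f}  validation error={va:.3f}  gap={va - tr:.3f}")
```

A gap that stays large as `n` grows points to overfitting; a gap that closes while both errors stay high points to underfitting.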

Symptom · 02

Model is consistently wrong on both training and validation data

→

Fix

Check feature engineering: are the features informative enough? Try a more expressive model (e.g., RandomForestRegressor instead of LinearRegression). This is underfitting.

Symptom · 03

Validation error oscillates wildly from run to run

→

Fix

Likely high variance. Reduce model complexity or increase regularisation. Check if training set is too small.
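One way to put a number on "oscillates wildly" is to repeat k-fold cross-validation and look at the spread of scores. The sketch below assumes a small synthetic dataset and compares an unconstrained decision tree with a depth-limited one; the specific hyperparameters are illustrative, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small, noisy dataset (assumption) so that split-to-split variation is visible.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# 5 folds repeated 5 times gives 25 validation scores per model.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)

candidates = {
    "unconstrained tree": DecisionTreeClassifier(random_state=0),
    "depth-limited tree": DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                                 random_state=0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}  std={scores.std():.3f}  "
          f"min={scores.min():.3f}  max={scores.max():.3f}")
```

If the standard deviation is large compared with the differences between candidate models, the validation estimate itself is too noisy to trust, which usually means more data, more folds, or a simpler model.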

Symptom · 04

Adding more training data barely changes validation accuracy

→

Fix

If both curves are flat and close together at high error, you have high bias. More data won't help; you need better features or a more expressive algorithm.

★ Quick Debug Cheat Sheet for Overfitting/Underfitting

When you suspect your model is overfitting or underfitting, run these commands and checks immediately.

Training accuracy > 95%, validation accuracy < 70%

Immediate action

Stop training and inspect last epochs for divergence

Commands

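# Run inside the training session: train_acc and val_acc are assumed to be per-epoch accuracy lists
print('Gap:', abs(train_acc[-1] - val_acc[-1]))

# plot_learning_curve is assumed to be a helper defined in your project, not a library function
plot_learning_curve(model, X_train, y_train, X_val, y_val)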

Fix now

Add L2 regularisation (alpha=0.01) or reduce number of layers/features

Training and validation accuracy both below 60%

Validation loss starts rising after epoch N

Overfitting vs Underfitting at a Glance

Feature	Underfitting (High Bias)	Overfitting (High Variance)
Training Error	High	Very Low
Validation Error	High	High
Cause	Model is too simple (e.g., Linear for Non-linear)	Model is too complex (e.g., Deep Tree for noisy data)
Primary Fix	Increase complexity, add features	Regularization (L1/L2), Pruning, More Data
Learning Curve Pattern	Both curves high and close together	Training low, validation high with gap
Effect of More Data	Little to no improvement	Reduces gap, helps generalize
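Both columns of the table can be seen in a single sweep over model complexity. The sketch below uses scikit-learn's `validation_curve` to vary `max_depth` of a decision tree on synthetic noisy data; the dataset and the depth grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (assumption) so deep trees have noise to memorise.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

depths = [1, 2, 4, 8, 16, 32]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="accuracy")

# Average over the 5 CV folds for each depth.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for d, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={d:2d}  train acc={tr:.3f}  val acc={va:.3f}  gap={tr - va:.3f}")

best_depth = depths[int(np.argmax(val_mean))]
print("Depth with best validation accuracy:", best_depth)
```

Shallow trees sit in the underfitting column (both scores low, small gap); very deep trees sit in the overfitting column (training score near perfect, large gap).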

⚙ Quick Reference

9 commands from this guide

File	Command / Code	Purpose
model_diagnostics.py	from sklearn.pipeline import Pipeline	The Technical Root
validation_curve.py	from sklearn.model_selection import validation_curve	Detecting Overfitting Early with Validation Curves
io/thecodeforge/fix_overfitting.py	from sklearn.neural_network import MLPRegressor	Fixing Overfitting
io/thecodeforge/fix_underfitting.py	from sklearn.linear_model import LinearRegression	Fixing Underfitting
io/thecodeforge/ml/ModelHealthMonitor.java	public class ModelHealthMonitor {	Production-Grade Monitoring and Alerting for Model Drift
DetectUnderfitting.py	from sklearn.linear_model import LinearRegression	What is Underfitting?
SignalVsNoiseOverfit.py	from sklearn.ensemble import RandomForestRegressor	What is Overfitting?
HoldoutSplitter.py	from sklearn.model_selection import train_test_split	Why Every Model Needs a Holdout Set From Day One
generalization_gap_monitor.py	from sklearn.model_selection import train_test_split	The Learning Curve Lie

Key takeaways

Underfitting = High Bias. The model is too dumb for the data.

Overfitting = High Variance. The model is too 'clever' and sees patterns in noise.

The validation set is your compass; tune against it and keep the test set untouched for the final evaluation.

Regularization (Dropout, L1, L2) is the primary weapon against overfitting.

For underfitting, increase model complexity or engineer better features, not more data.

Automate model monitoring in production to catch drift early.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

How do you distinguish between high bias and high variance using a learn...

Q02 · JUNIOR

What is the 'Double Descent' phenomenon in modern Deep Learning?

Q03 · JUNIOR

Explain the role of Early Stopping in preventing overfitting.

Q04 · JUNIOR

How would you set up a CI/CD pipeline to automatically reject overfitted...

Q01 of 04 · JUNIOR

How do you distinguish between high bias and high variance using a learning curve?

ANSWER

High Bias is identified when both training and validation errors are high and close to each other, indicating the model hasn't captured the underlying trend. High Variance is identified by a large 'gap' between a low training error and a significantly higher validation error. In a production debugging scenario, you'd plot learning curves for different training set sizes and look for convergence or divergence.
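The rule in this answer can be written down as a small helper. This is a sketch only: the function name `diagnose` and the thresholds `target_error` and `max_gap` are hypothetical rule-of-thumb values that would need tuning per problem.

```python
def diagnose(train_error: float, val_error: float,
             target_error: float = 0.10, max_gap: float = 0.05) -> str:
    """Label a train/validation error pair using rule-of-thumb thresholds.

    target_error: training error above this counts as high bias (assumed value).
    max_gap: validation minus training error above this counts as high variance.
    """
    high_bias = train_error > target_error
    high_variance = (val_error - train_error) > max_gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_variance:
        return "high variance (overfitting)"
    if high_bias:
        return "high bias (underfitting)"
    return "acceptable fit"


print(diagnose(0.30, 0.32))  # both errors high and close together
print(diagnose(0.01, 0.25))  # low training error, large gap
print(diagnose(0.06, 0.08))  # low errors, small gap
```

Feed it the errors from the largest training size on your learning curve; the smaller sizes show whether the curves are still converging.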


Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.

Last updated: July 19, 2026
