Overfitting - When Fraud Detection Blocks All Transactions
Fraud alert rate jumped from 2% to 80% in hours due to overfitting on imbalanced data. Never trust training accuracy alone; use our production debug guide.
20+ years shipping production ML systems and the infrastructure behind them. Notes here come from systems that actually shipped.
- Overfitting: training error low, validation error high — model memorises noise
- Underfitting: both training and validation errors high — model misses signal
- Use learning curves: training/validation error vs training set size
- Fix overfitting: regularisation, more data, simplify model, early stopping
- Fix underfitting: increase complexity, add features, train longer
- Biggest mistake: adding more data when the model is underfitting — it won't help
Imagine you're studying for a history exam. If you memorise every question from last year's paper word-for-word, you'll ace a re-run but completely blank on any new question — that's overfitting. If you barely glance at the textbook and just guess 'World War 2' for everything, you'll fail because you learned too little — that's underfitting. A great student learns the patterns and principles, not the exact answers. Your ML model needs to do exactly the same thing.
Every ML model you build has one job: make good predictions on data it has never seen before. It sounds simple, but the single biggest reason models fail in production isn't bad algorithms or messy data — it's getting the balance of learning wrong. A model that learns too much from its training data becomes obsessed with noise and quirks that don't generalise. A model that learns too little never captures the real signal in the first place. Both failures have names, both are measurable, and both are fixable once you understand what's actually happening inside the model.
Overfitting and underfitting sit at opposite ends of a spectrum called the bias-variance tradeoff. Understanding this tradeoff is what separates engineers who tune models by intuition from those who tune them systematically. When you know WHY a model overfits, you stop throwing random regularisation at it and start making deliberate, principled decisions about complexity, data size, and training strategy.
By the end of this article you'll be able to plot a learning curve and diagnose whether your model is overfitting or underfitting just by looking at it. You'll have working Python code that deliberately creates both problems and then fixes them — so the concepts stick in your hands, not just your head. And you'll walk away knowing exactly which levers to pull in each scenario.
Overfitting vs. Underfitting — The Two Failure Modes of Learning
Overfitting is when a model learns the training data too precisely, including its noise and outliers, resulting in poor generalization to unseen data. Underfitting is the opposite: the model fails to capture the underlying patterns, performing poorly even on the training set. The core mechanic is a mismatch between model capacity and data complexity — too much capacity relative to signal leads to memorization; too little leads to oversimplification.
In practice, overfitting manifests as high variance: small changes in the training data cause large swings in the fitted model and its predictions. Underfitting shows high bias: the model consistently misses the true relationship. A classic symptom of overfitting is training accuracy near 100% while validation accuracy plateaus or drops. Underfitting shows both training and validation accuracy stuck below acceptable thresholds. The key property is that both degrade real-world performance, but they require opposite remedies: more regularization or more data for overfitting, more capacity or better features for underfitting.
You encounter these when deploying any learned model to production. For fraud detection, an overfit model might block 99% of legitimate transactions because it memorized rare fraud patterns from training, while an underfit model might let through obvious fraud because it never learned the signal. The practical goal is to find the sweet spot where validation error is minimized — typically by monitoring the gap between training and validation metrics and using techniques like cross-validation, early stopping, or pruning.
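Both failure modes are easy to reproduce on purpose. The following is a minimal sketch, assuming a synthetic noisy sine wave rather than real fraud data; the degrees 1, 3 and 15 and the sample sizes are illustrative choices, not values from any production system. It fits polynomials of increasing degree with NumPy and compares training and validation error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1] (synthetic, for illustration)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set makes overfitting easy
x_val, y_val = make_data(200)       # separate sample the model never sees

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 15):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))
    train_err, val_err = results[degree]
    print(f"degree={degree:2d}  train MSE={train_err:.4f}  val MSE={val_err:.4f}")
```

The straight line (degree 1) should show high error on both sets, while degree 15 should drive training error down and validation error up. The cubic sits between the two, which is the sweet spot the paragraph above describes.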
The Technical Root: Bias vs. Variance
To fix a failing model, you must diagnose its soul. Underfitting is caused by High Bias—the model makes simplistic assumptions about the data. Overfitting is caused by High Variance—the model is overly sensitive to small fluctuations in the training set. In a production environment, we use Learning Curves (plotting Error vs. Training Set Size) to visualize this struggle.
The bias-variance tradeoff is fundamental. A model with high bias pays little attention to data and consistently underfits. A model with high variance pays too much attention to data, including noise. Finding the sweet spot is what model tuning is all about.
- Both curves high and close together: high bias (underfitting)
- Training curve low, validation curve high with a gap: high variance (overfitting)
- Both curves low and close together: good fit
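The three reading rules above can be turned into a small check. This is a sketch under assumptions: `diagnose`, `learning_curve_points`, `target_err` and `gap_tol` are hypothetical names and thresholds chosen for illustration, and the data is a synthetic noisy sine wave.

```python
import numpy as np
from numpy.polynomial import Polynomial

def diagnose(train_err, val_err, target_err, gap_tol):
    """Classify the right-hand end of a learning curve (illustrative thresholds)."""
    gap = val_err - train_err
    if train_err > target_err and gap <= gap_tol:
        return "high bias (underfitting)"      # both curves high, close together
    if gap > gap_tol:
        return "high variance (overfitting)"   # training low, validation high
    return "good fit"                          # both low, close together

def learning_curve_points(degree, sizes, noise=0.3, seed=0):
    """Train/validation MSE of a polynomial fit at each training set size."""
    rng = np.random.default_rng(seed)

    def sample(n):
        x = rng.uniform(0.0, 1.0, n)
        return x, np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=n)

    x_val, y_val = sample(2000)
    x_pool, y_pool = sample(max(sizes))
    points = []
    for n in sizes:
        model = Polynomial.fit(x_pool[:n], y_pool[:n], degree)
        train_err = float(np.mean((model(x_pool[:n]) - y_pool[:n]) ** 2))
        val_err = float(np.mean((model(x_val) - y_val) ** 2))
        points.append((n, train_err, val_err))
    return points

sizes = [20, 40, 80, 160, 320]
verdicts = {}
for degree in (1, 5):
    curve = learning_curve_points(degree, sizes)
    for n, train_err, val_err in curve:
        print(f"degree={degree} n={n:3d} train={train_err:.3f} val={val_err:.3f}")
    _, final_train, final_val = curve[-1]
    verdicts[degree] = diagnose(final_train, final_val, target_err=0.15, gap_tol=0.10)
    print(f"degree={degree}: {verdicts[degree]}")
```

Plotting the `(n, train_err, val_err)` tuples gives the usual learning curve picture; the function only looks at the last point, so it is a coarse check rather than a replacement for reading the whole curve.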
Detecting Overfitting Early with Validation Curves
Validation curves show how model performance changes with a hyperparameter (e.g., polynomial degree, tree depth). They help you find the sweet spot before overfitting takes hold. Plotting validation curves during development is far cheaper than discovering the problem in production.
In production systems, we automate this with CI/CD pipelines that generate validation curves for every candidate model. The pipeline rejects any model whose training score exceeds its validation score by more than a configurable threshold (typically 10-15 percentage points for classification accuracy).
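The rejection rule itself is tiny. A minimal sketch, assuming accuracy scores in the range 0 to 1; `passes_gap_gate`, the candidate names and their scores are hypothetical, and 0.15 is simply the upper end of the range quoted above.

```python
def passes_gap_gate(train_acc, val_acc, max_gap=0.15):
    """Return True if training accuracy exceeds validation accuracy by at most max_gap."""
    return (train_acc - val_acc) <= max_gap

# Hypothetical candidates: (training accuracy, validation accuracy).
candidates = {
    "shallow_tree": (0.91, 0.88),
    "deep_tree": (0.999, 0.71),
}

accepted = [name for name, (train_acc, val_acc) in candidates.items()
            if passes_gap_gate(train_acc, val_acc)]
print("accepted:", accepted)
```

In a pipeline this check would typically run as a test step that fails the build when a candidate is rejected.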
Here's how to generate a validation curve for polynomial degree using scikit-learn:
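A minimal sketch, assuming a synthetic noisy sine dataset and a `PolynomialFeatures` plus `LinearRegression` pipeline; the data, the degree range 1 to 12 and the five folds are illustrative assumptions. It uses scikit-learn's `validation_curve` to score each degree with cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(80, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=80)

# make_pipeline names each step after its lowercased class name,
# so the degree is addressed as "polynomialfeatures__degree".
pipeline = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = np.arange(1, 13)

train_scores, val_scores = validation_curve(
    pipeline, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, one column per fold; average over folds.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
best_degree = int(degrees[np.argmin(val_mse)])

for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"degree={d:2d}  train MSE={tr:.4f}  val MSE={va:.4f}")
print("Best degree by validation MSE:", best_degree)
```

Plotting `train_mse` and `val_mse` against `degrees` gives the curve itself; the degree where validation error bottoms out is the candidate sweet spot.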
Fixing Overfitting: Regularisation, Dropout, and Pruning
When you've confirmed overfitting, the toolkit is broad but principled. The most effective levers are:
- L1/L2 Regularisation: Adds a penalty to large weights. L1 drives weights to zero (feature selection), L2 shrinks them.
- Dropout: Randomly drops neurons during training — forces the network to learn redundant representations.
- Early Stopping: Monitors validation loss and stops training when it starts increasing.
- Reduce Model Complexity: Fewer layers, fewer trees, lower polynomial degree.
- Increase Training Data: More data reduces the impact of noise.
- Feature Selection: Remove irrelevant features that introduce noise.
Here's a Python example that applies three of these techniques (L1/L2 regularisation, dropout, and early stopping):
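A minimal sketch of weight penalties, dropout and early stopping in one place, assuming a tiny one-hidden-layer network written directly in NumPy so that each mechanism is visible. The synthetic data, layer size, learning rate, penalty strengths and patience are all illustrative assumptions; a real system would normally rely on a deep learning framework instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data: 20 features, only 3 carry signal.
X = rng.normal(size=(400, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]
y = (X @ true_w + rng.normal(scale=0.5, size=400) > 0).astype(float)
X_train, y_train, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(p, y_true):
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def predict(params, X_in):
    """Inference pass: no dropout is applied."""
    W1, b1, W2, b2 = params
    return sigmoid(np.maximum(0.0, X_in @ W1 + b1) @ W2 + b2)

def train(hidden=32, lr=0.2, l1=1e-4, l2=1e-3, drop_rate=0.3,
          patience=20, max_epochs=500):
    W1 = rng.normal(scale=0.1, size=(X_train.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    best_val, best_params, bad_epochs = np.inf, None, 0
    n = len(y_train)
    for epoch in range(1, max_epochs + 1):
        # Forward pass with inverted dropout on the hidden layer.
        h = np.maximum(0.0, X_train @ W1 + b1)
        mask = (rng.random(h.shape) >= drop_rate) / (1.0 - drop_rate)
        h_drop = h * mask
        p = sigmoid(h_drop @ W2 + b2)
        # Backward pass for mean binary cross-entropy.
        dz2 = (p - y_train) / n
        dz1 = np.outer(dz2, W2) * mask * (h > 0)
        # L2 adds l2 * W to the gradient, L1 adds l1 * sign(W).
        grad_W2 = h_drop.T @ dz2 + l2 * W2 + l1 * np.sign(W2)
        grad_W1 = X_train.T @ dz1 + l2 * W1 + l1 * np.sign(W1)
        W2 -= lr * grad_W2
        b2 -= lr * dz2.sum()
        W1 -= lr * grad_W1
        b1 -= lr * dz1.sum(axis=0)
        # Early stopping: keep the best weights, stop after `patience` bad epochs.
        val_loss = log_loss(predict((W1, b1, W2, b2), X_val), y_val)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            best_params = (W1.copy(), b1.copy(), W2.copy(), b2)
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_params, best_val, epoch

best_params, best_val, epochs_run = train()
print(f"stopped after {epochs_run} epochs, best validation loss {best_val:.4f}")
```

Setting `drop_rate=0.0`, `l1=0.0` and `l2=0.0` turns the regularisers off, which makes it straightforward to compare validation loss with and without them.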
- L2 regularisation adds a penalty proportional to the square of weights — common in neural nets.
- L1 regularisation adds penalty proportional to absolute weight — useful for feature selection.
- Dropout randomly turns off neurons — like training an ensemble of simpler models.
- Early stopping cuts training when validation stops improving — prevents memorisation.
Fixing Underfitting: Complexity, Features, and Training Time
Underfitting means your model is too simple. The solution is to give it more capacity. But adding complexity blindly can tip into overfitting — you need a controlled approach.
- Increase Model Complexity: Use higher-degree polynomials, deeper trees, more layers.
- Engineer Better Features: Interactions, polynomial features, domain-specific aggregations.
- Reduce Regularisation: Too much regularisation can itself cause underfitting.
- Train Longer: Sometimes the model just needs more epochs to converge.
- Reduce Feature Noise: Remove irrelevant features that dilute signal.
Here's a Python workflow that detects underfitting and applies fixes systematically:
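A minimal sketch of such a workflow, assuming polynomial degree as the only complexity knob and a synthetic noisy sine wave as data. `is_underfitting`, `TARGET_ERR`, `GAP_TOL` and `MAX_DEGREE` are hypothetical names and thresholds; the other fixes in the list (features, regularisation strength, training time) would need their own loops.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def sample(n, noise=0.2):
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=n)

x_train, y_train = sample(400)
x_val, y_val = sample(1000)

def errors(degree):
    """Train and validation MSE for a polynomial of the given degree."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_err = float(np.mean((model(x_train) - y_train) ** 2))
    val_err = float(np.mean((model(x_val) - y_val) ** 2))
    return train_err, val_err

def is_underfitting(train_err, val_err, target_err, gap_tol):
    # Underfitting: training error itself is too high and the gap is small.
    return train_err > target_err and (val_err - train_err) <= gap_tol

TARGET_ERR = 0.06   # illustrative: a little above the noise variance of 0.04
GAP_TOL = 0.10
MAX_DEGREE = 10     # cap so added capacity cannot grow without bound

degree = 1
train_err, val_err = errors(degree)
history = [(degree, train_err, val_err)]

# Add capacity one step at a time, re-checking after every step.
while is_underfitting(train_err, val_err, TARGET_ERR, GAP_TOL) and degree < MAX_DEGREE:
    degree += 1
    train_err, val_err = errors(degree)
    history.append((degree, train_err, val_err))

for d, tr, va in history:
    print(f"degree={d:2d}  train MSE={tr:.4f}  val MSE={va:.4f}")
print("selected degree:", degree)
```

Stepping one degree at a time and stopping as soon as the underfitting test clears is the controlled approach: capacity is added only while the diagnosis still says it is needed.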
Production-Grade Monitoring and Alerting for Model Drift
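Once your model is in production, its performance can still degrade as data distributions change. A shift in the input features is called covariate shift; a change in the relationship between features and target is called concept drift. Either one can leave a once well-fitted model behaving like an overfit or underfit one, so you need automated monitoring.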
- Prediction distribution: Does the average prediction stay stable?
- Feature distribution: Are incoming features within the training range?
- Error rate over time: Is the model's error creeping up?
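The three checks above can be sketched in Python. This is an illustration under assumptions: `check_drift`, the logger name and all three tolerances are hypothetical, and `live_errors` is assumed to be a 0/1 array marking which labelled live predictions were wrong.

```python
import logging
import numpy as np

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("drift_monitor")  # hypothetical logger name

def check_drift(train_X, train_preds, live_X, live_preds, live_errors,
                baseline_error_rate, pred_shift_tol=0.10,
                out_of_range_tol=0.05, error_rise_tol=0.05):
    """Return a list of drift warnings; all tolerances are illustrative."""
    alerts = []

    # 1. Prediction distribution: has the average prediction moved?
    shift = abs(float(np.mean(live_preds)) - float(np.mean(train_preds)))
    if shift > pred_shift_tol:
        alerts.append(f"mean prediction shifted by {shift:.2f}")

    # 2. Feature distribution: share of live rows outside the training range.
    lo, hi = train_X.min(axis=0), train_X.max(axis=0)
    outside = float(np.mean(np.any((live_X < lo) | (live_X > hi), axis=1)))
    if outside > out_of_range_tol:
        alerts.append(f"{outside:.0%} of rows outside training feature range")

    # 3. Error rate over time: compare the live rate with the baseline.
    live_error_rate = float(np.mean(live_errors))
    if live_error_rate - baseline_error_rate > error_rise_tol:
        alerts.append(f"error rate rose to {live_error_rate:.2f}")

    for message in alerts:
        log.warning(message)
    return alerts

# Toy data: training-time predictions average 2%, as in the fraud example.
train_X = np.linspace(0.0, 1.0, 200).reshape(100, 2)
train_preds = np.full(100, 0.02)

stable = check_drift(train_X, train_preds, train_X, train_preds,
                     np.zeros(100), baseline_error_rate=0.0)
drifted = check_drift(train_X, train_preds, train_X + 2.0, np.full(100, 0.80),
                      np.ones(100), baseline_error_rate=0.0)
print("stable alerts:", stable)
print("drifted alerts:", drifted)
```

The min/max range check is deliberately crude; a distribution-level comparison would catch shifts that stay inside the training range.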
Here's a Java implementation that logs warnings when drift indicators trigger:
What is Underfitting? — The Model That Won't Learn
Underfitting is what happens when your model is too stupid to see the patterns right in front of it. It performs like garbage on both training and test data. This isn't a generalization problem — it's a failure to learn in the first place.
The root cause is almost always the same: the model's effective capacity is too low for the complexity of the data. You're trying to fit a straight line through a sine wave. Common culprits include a model that's too simple (think linear regression on a non-linear problem), brutal regularization that kills signal, weak or irrelevant features, or training that stopped before the optimizer converged.
From the bias-variance lens, underfitting is pure high bias territory. The model makes strong, dumb assumptions before seeing any data ("all relationships are linear") and it ignores real patterns to stick with those assumptions. Variance is low because the model gives nearly the same crappy output no matter which training sample you fit it on. Underfitting = High Bias + Low Variance. Memorize that.
What is Overfitting? — The Model That Memorized Your Trash
Overfitting is the opposite failure mode: your model became a savant at memorizing the training data, including every outlier, every random noise spike, and every accidental duplicate. It nails training accuracy — 99% baby! — then falls apart on production data like a house of cards in a hurricane.
This happens when the model has too much capacity relative to the data. Too many parameters, too many features, too many layers, not enough regularization. The model finds patterns that don't exist. It learns the noise.
Bias-Variance breakdown: overfitting is high variance, low bias. The model makes almost no assumptions (low bias), so it contorts itself to fit every single training point. Variance is through the roof because if you train on a different sample, the model learns completely different noise patterns. Your predictions are all over the place. Overfitting = Low Bias + High Variance. If your validation loss starts going UP while training loss keeps dropping, you're here. Pull the cord.
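Both summaries (Underfitting = High Bias + Low Variance, Overfitting = Low Bias + High Variance) can be checked empirically by refitting the same model on many resampled training sets. A minimal sketch, assuming a known true function (a sine wave) so that bias can be measured directly; the degrees, noise level and repeat count are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(7)
x_train = np.linspace(0.0, 1.0, 30)     # fixed inputs, fresh noise each repeat
x_test = np.linspace(0.0, 1.0, 101)
f_test = np.sin(2 * np.pi * x_test)     # the true function, known only in simulation

def bias_variance(degree, n_repeats=200, noise=0.3):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=noise, size=x_train.size)
        preds[i] = Polynomial.fit(x_train, y_train, degree)(x_test)
    # Bias: how far the average prediction is from the truth.
    bias_sq = float(np.mean((preds.mean(axis=0) - f_test) ** 2))
    # Variance: how much predictions move between training samples.
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

estimates = {degree: bias_variance(degree) for degree in (1, 12)}
for degree, (bias_sq, variance) in estimates.items():
    print(f"degree={degree:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

The straight line should show the larger squared bias and the degree-12 fit the larger variance, which is the tradeoff in numbers.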
Why Every Model Needs a Holdout Set From Day One
Overfitting starts the second you evaluate on training data. You don't need complex math—you need a wall between your model and its score. A holdout set is that wall. Split before you train, not after. The test set defines reality; everything else is a lie.
In production, you can't trust a model that has never been scored on unseen data. Validation scores that guided your tuning are optimistically biased; an untouched holdout set is not. It's the cheapest insurance you'll ever buy. If your pipeline doesn't enforce a strict split (for example 70/15/15 train/validation/test) at the dataset constructor, you're shipping technical debt.
Senior teams force holdout sets into CI/CD. Data leakage kills models silently. Start with a held-back slice, never touch it during tuning, and only use it for final go/no-go decisions. That one slice prevents months of wasted retraining.
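A split like this can be fixed once, by index, before any training code runs. A minimal sketch, assuming a plain random 70/15/15 split; `split_indices` is a hypothetical helper, and time-ordered or heavily imbalanced data may call for a chronological or stratified split instead.

```python
import numpy as np

def split_indices(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle row indices once and cut them into train/validation/test."""
    rng = np.random.default_rng(seed)   # fixed seed keeps the split reproducible
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))

# Guard against leakage: no row may appear in more than one split.
assert not set(train_idx) & set(val_idx)
assert not set(train_idx) & set(test_idx)
assert not set(val_idx) & set(test_idx)
```

Storing the index arrays alongside the dataset, rather than re-splitting in each experiment, is what keeps the test slice untouched during tuning.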
The Learning Curve Lie: Why Accuracy Metrics Hide Overfitting
Accuracy on a training set is a vanity metric. It reliably hides the silent failure of overfitting: memorization of noise rather than learning signal. You deploy a model that scores 98% accuracy on validation, and within weeks, production predictions drift, user engagement tanks, and your feature flags get blamed. The root cause is almost always a model that learned the quirks of your static holdout set instead of the true underlying patterns.
Stop trusting accuracy alone. Instead, monitor the generalization gap—the divergence between training and validation loss. A widening gap is your earliest warning that your model is memorizing, not generalizing. Apply regularization, cross-validation, or simpler architectures early. In production, track rolling accuracy windows on live data; if it dips while your eval curve looks fine, you’re seeing the lie in real time. The fix is not better data—it’s better vigilance.
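Both signals are cheap to compute. A minimal sketch, assuming per-epoch loss lists from training and a stream of labelled live predictions; `gap_is_widening`, `RollingAccuracy`, the window sizes and the example numbers are all hypothetical.

```python
from collections import deque

def gap_is_widening(train_losses, val_losses, window=3):
    """True if the validation-minus-training gap grew at each of the last `window` epochs."""
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    recent = gaps[-(window + 1):]
    return len(recent) == window + 1 and all(b > a for a, b in zip(recent, recent[1:]))

class RollingAccuracy:
    """Accuracy over the most recent `window_size` labelled live predictions."""

    def __init__(self, window_size=500):
        self.outcomes = deque(maxlen=window_size)

    def update(self, prediction, label):
        self.outcomes.append(prediction == label)
        return sum(self.outcomes) / len(self.outcomes)

# Training loss keeps falling while validation loss turns upward.
memorising = gap_is_widening([0.9, 0.6, 0.4, 0.3, 0.2, 0.1],
                             [0.95, 0.7, 0.55, 0.56, 0.6, 0.68])
# Both losses fall and the gap shrinks.
healthy = gap_is_widening([0.9, 0.6, 0.4, 0.3], [1.0, 0.65, 0.42, 0.31])
print("memorising:", memorising, "healthy:", healthy)

monitor = RollingAccuracy(window_size=4)
for prediction, label in [(1, 1), (0, 0), (1, 0), (1, 1), (0, 1)]:
    live_accuracy = monitor.update(prediction, label)
print("rolling accuracy:", live_accuracy)
```

The gap check belongs in the training loop; the rolling window belongs in serving, where labels arrive late and the window size has to match how quickly they come in.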
When a Fraud Detection Model Starts Calling Every Transaction Fraudulent
- Never trust training accuracy alone — especially on imbalanced data.
- Always monitor validation metrics during retraining.
- If the validation curve starts diverging from training, stop and check.
- More complexity isn't better; it's just more dangerous.
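The first lesson is easy to show with numbers. A minimal sketch, assuming 1,000 transactions of which 2% are fraud; the two "models" are hypothetical stand-ins, one that never raises an alert and one that flags 80% of traffic, as in the incident above.

```python
def report(y_true, y_pred):
    """Accuracy, precision, recall and alert rate for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "alert_rate": (tp + fp) / len(pairs),
    }

# 20 fraudulent transactions followed by 980 legitimate ones.
y_true = [1] * 20 + [0] * 980

never_flag = report(y_true, [0] * 1000)              # predicts "legitimate" for everything
flag_most = report(y_true, [1] * 800 + [0] * 200)    # flags 80% of all traffic

print("never flag:", never_flag)
print("flag most: ", flag_most)
```

The model that never flags anything scores 98% accuracy while catching no fraud, and the one that flags 80% of traffic has perfect recall with 2.5% precision. Neither number alone tells you the model is broken, which is why precision, recall and alert rate need to be tracked together on imbalanced data.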
python -c "import numpy as np; print('Gap:', np.abs(train_acc[-1] - val_acc[-1]))"
plot_learning_curve(model, X_train, y_train, X_val, y_val)
Key takeaways
Common mistakes to avoid
Throwing more data at an underfitting model
Ignoring the validation set until the end of the project
Over-tuning hyperparameters on the test set (data leakage)
Using a high-degree polynomial on a small dataset without regularization
Assuming more features always improve performance
Interview Questions on This Topic
How do you distinguish between high bias and high variance using a learning curve?