
Drift Detection — Covariate Drift Cost a Fraud Model $2M

📍 Part of: MLOps → Topic 8 of 9
Fraud model accuracy fell from 92% to 67% due to covariate drift.
🔥 Advanced — solid ML / AI foundation required
In this tutorial, you'll learn
  • Model monitoring is not optional — real data always drifts.
  • Detect drift using PSI for categorical, KS for continuous, KL for directional shifts.
  • Use a rolling 30-day window with confirmation before alerting to avoid false positives.
Quick Answer
  • Model monitoring tracks prediction quality and data distributions over time
  • Drift detection uses statistical tests like PSI, KL divergence, and KS test
  • Covariate drift = input distribution changes; concept drift = label relationship changes
  • PSI > 0.25 typically indicates significant drift in production
  • Production insight: most drift alert fatigue comes from testing on too-small windows
  • Biggest mistake: treating drift detection as a binary yes/no instead of a severity scale
🚨 START HERE

Drift Detection Quick Reference

Commands and actions for common drift scenarios.
🟡

Need to calculate PSI on a feature

Immediate Action: Compute expected and actual distribution bins
Commands
python -c "import numpy as np; e=np.array([0.2,0.3,0.5])+1e-6; a=np.array([0.1,0.3,0.6])+1e-6; print('PSI:', np.sum((a-e)*np.log(a/e)))"  # replace the example proportions with your own bins
python -c "import numpy as np, pandas as pd; e=pd.read_csv('train_features.csv')['amount']; a=pd.read_csv('production_features.csv')['amount']; edges=np.histogram_bin_edges(e, bins=10); p=np.histogram(e, bins=edges)[0]/len(e)+1e-6; q=np.histogram(a.clip(edges[0], edges[-1]), bins=edges)[0]/len(a)+1e-6; print('PSI:', np.sum((q-p)*np.log(q/p)))"
Fix Now: If PSI > 0.25, retrain the model with recent data and schedule a refresh of the monitoring baseline.
🟡

Need to compare two distributions with KS test

Immediate Action: Run two-sample KS test
Commands
python -c "from scipy.stats import ks_2samp; stat, p = ks_2samp(train_sample, prod_sample); print('KS statistic:', stat, 'p-value:', p)"
python -c "import numpy as np; from scipy.stats import ks_2samp; train=np.random.normal(0,1,1000); prod=np.random.normal(0.5,1,1000); print(ks_2samp(train, prod))"
Fix Now: If p-value < 0.05 and KS statistic > 0.1, investigate feature drift and consider retraining.
🟡

Need to detect concept drift without labels

Immediate Action: Monitor prediction distribution over time
Commands
python -c "from io.thecodeforge.monitoring import prediction_drift; drift_score = prediction_drift(current_predictions, baseline_predictions); print('Prediction drift score:', drift_score)"  # prediction_drift is a project-specific helper; substitute your own module path
python -c "import numpy as np; from scipy.stats import entropy; edges=np.linspace(0, 1, 11); current=np.histogram(predictions, bins=edges)[0]+1e-6; base=np.histogram(baseline_predictions, bins=edges)[0]+1e-6; print(entropy(current, base))"  # assumes predictions are probabilities in [0, 1]; both histograms share the same bin edges
Fix Now: If drift score > 0.2, flag the model for retraining and trigger a manual evaluation against ground truth.
Production Incident

The Silent Churn: How a Fraud Model Lost $2M Before Anyone Noticed

A production fraud detection model started classifying genuine high-value transactions as fraudulent after a gradual shift in transaction patterns over three months — no alert fired.
Symptom: Fraud detection accuracy dropped from 92% to 67% over three months. False positive rate tripled. No error logs, no downtime.
Assumption: The team assumed the model would maintain its performance because retraining happened monthly. They only monitored binary accuracy on a static holdout set.
Root cause: Covariate drift. The distribution of transaction amounts, merchant categories, and geolocation features shifted as the company expanded into a new market. The holdout set was never refreshed. Concept drift also occurred because fraudsters adapted to the model's patterns.
Fix: Implemented a monitoring pipeline that tracks per-feature PSI and monthly KS tests on production data. Added a dashboard with trend lines over 30-day rolling windows. Set up alerts triggered at PSI > 0.2 with a 7-day confirmation window to filter out noise.
Key Lesson
  • Monitor data distributions, not just accuracy: accuracy can stay high while the model misses critical segments.
  • Refresh holdout sets quarterly with current production data.
  • Combine covariate and concept drift detection: use PSI for inputs and prediction distribution comparison for labels.
  • Always confirm drift alerts over multiple windows before paging anyone.
Production Debug Guide

Trace the root cause when your model's predictions start degrading.

  • Model accuracy dropped but no feature changes → Check covariate drift: run PSI on each feature between training and production data for the last 7 days.
  • Prediction distribution shifted but feature stats look normal → Run concept drift detection: compare prediction vs actual label distributions using KS test.
  • Drift alerts firing every day → Increase the monitoring window from 1 day to 7 days and apply a severity threshold (e.g., PSI > 0.3). Check if the alert is driven by low-volume segments.
  • Model performs well on recent data but fails on new data → Verify train/test split recency: if your training data is older than 3 months, consider retraining with more recent samples.

Every ML model has an expiry date — you just don't know when it is. The moment you deploy a model to production, the clock starts ticking. Real-world data is a living thing: customer behaviour shifts, sensor calibrations drift, economic conditions flip, and language evolves. A model trained on yesterday's data makes yesterday's decisions, and in fast-moving domains that gap kills business value silently and expensively. Unlike a crashed server, a drifting model doesn't throw an error. It just quietly becomes wrong.

The core problem is that ML models are frozen snapshots of a world that keeps moving. Traditional software has deterministic logic you can test; a model's 'logic' is baked into millions of learned parameters that have no automatic self-correction mechanism. When the statistical relationship between your input features and your target label changes, the model has no way of knowing. It will keep producing confident predictions that are increasingly divorced from reality — and your monitoring stack needs to catch that before your users or your business does.

By the end of this article you'll be able to implement a production-grade monitoring pipeline that detects covariate drift, concept drift, and prediction drift using PSI, KL divergence, and the Kolmogorov-Smirnov test. You'll understand which detector to reach for in which situation, the statistical subtleties that trip up even experienced engineers, and how to wire all of it into an alerting workflow that won't wake you up for false positives at 3 a.m.

What Is Model Monitoring and Drift Detection?

Model monitoring is the practice of continuously observing a deployed ML model's performance and input data. Drift detection identifies when the statistical properties of the data or the relationship between inputs and outputs change from the training baseline. Without monitoring, you're flying blind: your model could be making decisions based on patterns that no longer exist.

Drift falls into three categories
  • Covariate drift: the distribution of input features changes (e.g., user age shifts from 25–35 to 35–45)
  • Concept drift: the relationship between features and target changes (e.g., what was considered 'fraud' looks different today)
  • Prediction drift: the distribution of model outputs shifts (can signal concept drift even without labels)

In production, you need to detect all three. Each requires a different statistical test and a different response.

drift_detection_basic.py · PYTHON
from typing import Optional
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_drift(
    train: np.ndarray,
    production: np.ndarray,
    threshold: float = 0.1
) -> Optional[float]:
    """Detect covariate drift using two-sample KS test.
    Namespace: io.thecodeforge.monitoring.drift
    """
    if len(train) == 0 or len(production) == 0:
        return None
    stat, p_value = ks_2samp(train, production)
    if p_value < 0.05 and stat > threshold:
        return stat
    return None
▶ Output
None
Mental Model
The River Crossing Metaphor
Think of training data as one bank of a river, and production data as the opposite bank. Drift tests measure how wide and fast the river is flowing.
  • Covariate drift = the water level changed (input distributions)
  • Concept drift = the river changed course (relationship changed)
  • Prediction drift = the bridge (model) is swaying (outputs shifted)
  • You need different tools for each: PSI or KS on features for the water level, prediction and label monitoring for a change of course
📊 Production Insight
Most teams monitor only accuracy — a lagging indicator.
By the time accuracy drops, drift has been present for weeks.
Detect drift upstream using feature distribution tests, not downstream metrics.
🎯 Key Takeaway
Drift detection is a leading indicator of model failure.
You monitor inputs, not just outputs.
Be proactive: test distributions, not just business metrics.

Statistical Tests: PSI, KL Divergence, and Kolmogorov-Smirnov

  1. Population Stability Index (PSI): Measures how much a variable's distribution has shifted between two samples. Formula: sum((actual_prop_i - expected_prop_i) * ln(actual_prop_i / expected_prop_i)). PSI < 0.1 = no shift, 0.1–0.25 = minor, > 0.25 = significant.
  2. KL Divergence: Measures the information lost when the expected distribution is used to approximate the actual one. It is asymmetric, so the order of the arguments matters. Use PSI for a symmetric stability measure and KL when the direction of the change matters.
  3. Kolmogorov-Smirnov (KS) Test: Non-parametric test comparing two empirical distributions. Returns a statistic (max difference) and a p-value. Works for continuous features. More sensitive than PSI for location shifts.

In practice, use PSI for categorical/binned features, KS for continuous. KL divergence is useful when you care about directionality of change.
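
To make the relationship between PSI and KL divergence concrete, here is a minimal sketch, assuming two already-binned distributions with no empty bins. The function names and the example proportions are illustrative, not taken from any library. It shows that KL divergence depends on argument order, while PSI, which works out to the sum of the two KL directions, does not.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two arrays of bin proportions (no zero bins assumed)."""
    return float(np.sum(p * np.log(p / q)))

def psi(expected: np.ndarray, actual: np.ndarray) -> float:
    """PSI as defined above: sum((actual - expected) * ln(actual / expected))."""
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Illustrative bin proportions (assumed values, each array sums to 1).
expected = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
actual = np.array([0.05, 0.15, 0.35, 0.25, 0.20])

kl_forward = kl_divergence(actual, expected)   # actual relative to expected
kl_reverse = kl_divergence(expected, actual)   # expected relative to actual

print(f"KL(actual || expected) = {kl_forward:.4f}")
print(f"KL(expected || actual) = {kl_reverse:.4f}")
print(f"PSI                    = {psi(expected, actual):.4f}")
```

Swapping `expected` and `actual` leaves the PSI unchanged but changes the KL value, which is why KL is the one to reach for when the direction of the shift matters.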

psi_and_ks.py · PYTHON
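import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-6) -> float:
    """PSI between two arrays of bin proportions, smoothed to avoid log(0)."""
    expected = expected + eps
    actual = actual + eps
    return float(np.sum((actual - expected) * np.log(actual / expected)))

def compute_drift_report(train: np.ndarray, prod: np.ndarray, bins: int = 10):
    # Derive bin edges from the training data and reuse them for production,
    # so both histograms describe the same intervals.
    edges = np.histogram_bin_edges(train, bins=bins)
    # Clip production values into the training range so out-of-range points
    # land in the first or last bin instead of being dropped.
    prod_clipped = np.clip(prod, edges[0], edges[-1])
    train_bins = np.histogram(train, bins=edges)[0] / len(train)
    prod_bins = np.histogram(prod_clipped, bins=edges)[0] / len(prod)
    psi_value = psi(train_bins, prod_bins)
    ks_stat, p_value = ks_2samp(train, prod)
    return {
        'psi': round(psi_value, 4),
        'ks_stat': round(float(ks_stat), 4),
        'ks_p_value': round(float(p_value), 6),
        'drift_detected': bool(psi_value > 0.25 or (p_value < 0.05 and ks_stat > 0.1))
    }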
▶ Output
{
'psi': 0.321,
'ks_stat': 0.184,
'ks_p_value': 0.0001,
'drift_detected': True
}
🔥When PSI Breaks
PSI works best when every bin holds at least about 5% of the expected population. If a bin has zero expected or actual count, the log term becomes infinite or undefined. Always smooth bin proportions with a small epsilon (1e-6) before computing.
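
The failure mode is easy to reproduce. The sketch below uses assumed bin proportions in which one bin never occurred in training; the `psi` helper is illustrative. Without smoothing the result is infinite, and with a small epsilon it becomes finite.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 0.0) -> float:
    """PSI over bin proportions; eps > 0 smooths empty bins before the log."""
    e = expected + eps
    a = actual + eps
    # Renormalise so the smoothed proportions still sum to 1.
    e, a = e / e.sum(), a / a.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.sum((a - e) * np.log(a / e)))

# Assumed example: the last bin never occurred in training.
expected = np.array([0.50, 0.30, 0.20, 0.00])
actual = np.array([0.45, 0.30, 0.20, 0.05])

raw = psi(expected, actual)                 # log(0.05 / 0) is infinite
smoothed = psi(expected, actual, eps=1e-6)  # finite, but still large

print("unsmoothed PSI:", raw)
print("smoothed PSI:  ", round(smoothed, 4))
```

The smoothed value depends heavily on the epsilon chosen, so a bin that was empty in training is worth investigating in its own right rather than reading the number at face value.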
📊 Production Insight
PSI threshold of 0.25 works for most business features but not for tail-heavy distributions.
For credit risk models, even 0.1 PSI triggers action.
Always calibrate thresholds on your own production data — never blindly copy Kaggle values.
🎯 Key Takeaway
PSI for stability, KS for sensitivity, KL for direction.
Always smooth bins before PSI.
Test on a rolling 30-day window, not a single day.
Which Test to Use
  • If the feature is continuous (e.g., age, income): use the KS test, which is more sensitive to location shifts.
  • If the feature is categorical or binned (e.g., region code, segment): use PSI, which captures proportional shifts.
  • If you need to measure information loss directionally: use KL divergence, which is asymmetric and good for detecting unexpected distributions.

Building a Production Monitoring Pipeline

  1. Data collection: Log model inputs and outputs to a time-series store (e.g., Kafka + InfluxDB). Store at least 30 days of raw feature vectors and predictions.
  2. Drift computation: Run scheduled jobs (e.g., Airflow DAG every 6 hours) that compute PSI, KS, and prediction drift for each feature vs. the training baseline. Store results in a separate metrics table.
  3. Alerting: Tiered alerts: INFO (PSI 0.1–0.2), WARNING (0.2–0.3), CRITICAL (>0.3). Confirm drift over at least two consecutive windows before paging. Avoid single-day spikes that are just noise.
  4. Retraining trigger: When drift exceeds threshold and is confirmed, automatically trigger a retraining job with the latest 30 days of production data. Validate on a recent holdout set before deploying.

This architecture separates detection from action — you can tune alerts without affecting retraining logic.
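
The tiering and confirmation logic from step 3 can be sketched in a few lines. This is a minimal illustration, not the article's `evaluate_alert` helper: the function names are hypothetical, and the handling of values that fall exactly on a tier boundary is an assumption.

```python
from typing import List

def severity(psi_value: float) -> str:
    """Map a PSI value to the tiers listed above (boundary handling assumed)."""
    if psi_value > 0.3:
        return "CRITICAL"
    if psi_value >= 0.2:
        return "WARNING"
    if psi_value >= 0.1:
        return "INFO"
    return "OK"

RANK = {"OK": 0, "INFO": 1, "WARNING": 2, "CRITICAL": 3}

def confirmed_severity(psi_history: List[float], windows: int = 2) -> str:
    """Return the highest tier sustained across the last `windows` checks.

    Taking the minimum tier over the recent windows means a single spike
    cannot escalate an alert on its own.
    """
    if len(psi_history) < windows:
        return "OK"
    recent = [severity(v) for v in psi_history[-windows:]]
    return min(recent, key=RANK.__getitem__)

print(confirmed_severity([0.05, 0.35]))        # one-off spike -> OK
print(confirmed_severity([0.05, 0.22, 0.27]))  # sustained shift -> WARNING
```

Keeping this function separate from whatever triggers retraining is one way to get the separation of detection from action described above: the thresholds and the `windows` parameter can be tuned without touching the retraining job.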

monitoring_pipeline.py · PYTHON
from datetime import datetime, timedelta, timezone
import pandas as pd
# The two imports below are project-specific helpers, not a published package.
# `io` is a standard-library module in Python, so this Java-style path will not
# import as written; point these at your own modules.
from io.thecodeforge.monitoring.drift import compute_all_drift
from io.thecodeforge.alerts import evaluate_alert

def run_monitoring_check():
    # load_features, trigger_retraining_job and log_metrics are assumed to be
    # defined elsewhere in your project.
    now = datetime.now(timezone.utc)  # timezone-aware; utcnow() is deprecated since Python 3.12
    window_start = now - timedelta(days=30)
    # Load production features from the last 30 days
    prod_data = load_features(start_time=window_start, end_time=now)
    # Load training baseline (stored as parquet)
    train = pd.read_parquet('s3://model-baselines/latest/train_features.parquet')
    # Compute drift for each feature
    drift_results = compute_all_drift(train, prod_data)
    # Evaluate alert severity
    alert = evaluate_alert(drift_results)
    if alert.severity in ['WARNING', 'CRITICAL']:
        trigger_retraining_job(reason=alert.summary)
    log_metrics(drift_results, alert)
▶ Output
2026-04-15 03:00:00 INFO: Drift check completed. 2 features PSI > 0.1. Alert severity: WARNING. Retraining triggered.
⚠ Don't Over-Alert
Single-day spikes in PSI are common after a data glitch or A/B test. Always confirm drift over at least two consecutive windows before firing Slack messages. Your team will thank you.
📊 Production Insight
Running drift checks every hour on 100 features costs ~$2/day in compute.
But missing a single drift that causes a 10% revenue drop costs $50k+/week.
Invest in the pipeline early — pay the compute cost, not the trust cost.
🎯 Key Takeaway
Separate detection from action.
Use tiered alerting with confirmation windows.
Automate retraining triggers but always require validation.

Common Pitfalls in Drift Detection

  1. Testing drift on the wrong baseline: Always compare against the training data distribution, not a previous production snapshot. Production distributions shift gradually — if you compare against last month, you'll miss long-term drift.
  2. Ignoring feature interactions: Drift in one feature may be harmless when another feature compensates. For example, if 'age' drifts up but 'income' drifts up proportionally, the model may still work. Single-feature drift tests alone can cause false alarms.
  3. Using only p-values: A tiny p-value with a tiny KS statistic (e.g., 0.02) may be statistically significant but practically irrelevant. Always check effect size alongside p-value.
  4. Not handling missing data: If a feature is missing in production and the pipeline fills the gaps with a default such as 0, the distribution collapses to a spike at that value, which looks like extreme drift. Handle missing values explicitly before computing tests.
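
Pitfall 3 is easy to demonstrate on synthetic data. The sketch below assumes a shift of 0.03 standard deviations, which is negligible in practice, observed on very large samples; the sample sizes and seed are arbitrary choices for illustration. A p-value-only rule flags it, while the effect-size gate does not.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Assumed synthetic data: a tiny location shift on very large samples.
train = rng.normal(loc=0.0, scale=1.0, size=500_000)
prod = rng.normal(loc=0.03, scale=1.0, size=500_000)

stat, p_value = ks_2samp(train, prod)

significant = bool(p_value < 0.05)  # what a p-value-only rule would flag
material = bool(stat > 0.1)         # effect-size gate used elsewhere in this article

print(f"KS statistic: {stat:.4f}, p-value: {p_value:.2e}")
print("flag on p-value alone:", significant)
print("flag with effect size:", significant and material)
```

With enough rows, almost any difference becomes statistically significant, so the KS statistic itself is the number that indicates whether the shift is large enough to matter.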
Mental Model
The Gardener Analogy
Monitoring a model is like tending a garden: you check the soil (input distributions) and the plants (outputs) regularly, but you don't yank every weed the moment it appears.
  • PSI threshold = how much weed you tolerate before acting
  • Confirmation window = wait a week before pulling
  • Feature interaction = some weeds help the soil
  • Missing data = a patch of bare dirt — fix the irrigation, don't just spray herbicide
📊 Production Insight
A team spent 3 months chasing 'drift' that was actually an ETL bug dropping NaN values.
Always validate your monitoring pipeline against known-good data first.
Drift detection is only as good as your data quality.
🎯 Key Takeaway
Test against training baseline, not previous production.
Consider feature interactions; use effect size not just p-value.
Handle missing data before computing drift.

Advanced: Multivariate Drift Detection and A/B Testing Integration

Single-feature tests scale linearly but miss interactions. For high-dimensional models (e.g., embeddings, tabular with 100+ features), use:

  • Maximum Mean Discrepancy (MMD): A kernel-based test that compares two high-dimensional distributions. More powerful than per-feature tests but computationally expensive.
  • Drift Detection on Model Embeddings: If your model has a latent layer (e.g., 64-dim), compute PSI on the embedding distribution. This catches joint shifts that single features miss.
  • A/B Test Validation: When you deploy a new model version, run both models in shadow mode for a week. Compute drift between the candidate's predictions and the champion's. Review the divergence between the two prediction distributions, and make an understood, acceptable level of divergence a prerequisite for go-live.

In production, combine single-feature tests for explainability with multivariate tests for sensitivity. This gives you both the 'what changed' and the 'where to look'.
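
As a rough illustration of how an MMD test can work, here is a minimal sketch that pairs the biased MMD² estimate with a permutation test, using scikit-learn's `pairwise_kernels`. The function name `mmd_permutation_test`, the synthetic embeddings, and the number of permutations are assumptions made for this example. It builds the full kernel matrix, so memory grows quadratically with the number of rows.

```python
import numpy as np
from sklearn.metrics import pairwise_kernels

def mmd_permutation_test(x, y, kernel="rbf", n_permutations=200, seed=0):
    """Biased MMD^2 estimate plus a permutation-test p-value."""
    z = np.vstack([x, y])
    n = len(x)
    # Compute the full kernel matrix once; permutations only re-index it.
    k = pairwise_kernels(z, metric=kernel)

    def mmd2(idx):
        a, b = idx[:n], idx[n:]
        return (k[np.ix_(a, a)].mean()
                + k[np.ix_(b, b)].mean()
                - 2.0 * k[np.ix_(a, b)].mean())

    rng = np.random.default_rng(seed)
    observed = mmd2(np.arange(len(z)))
    # Null distribution: shuffle which rows count as "train" vs "production".
    null = np.array([mmd2(rng.permutation(len(z))) for _ in range(n_permutations)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return float(observed), float(p_value)

# Assumed synthetic 8-dimensional embeddings for illustration.
rng = np.random.default_rng(1)
train_emb = rng.normal(0.0, 1.0, size=(200, 8))
same_emb = rng.normal(0.0, 1.0, size=(200, 8))
shifted_emb = rng.normal(1.0, 1.0, size=(200, 8))

print("no shift:", mmd_permutation_test(train_emb, same_emb))
print("shifted: ", mmd_permutation_test(train_emb, shifted_emb))
```

The permutation approach avoids distributional assumptions about the statistic, at the price of recomputing it a few hundred times, which is one reason to run it on a sample rather than on every row.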

multivariate_drift.py · PYTHON
import numpy as np
# mmd_test is a project-specific helper, not a published package. `io` is a
# standard-library module in Python, so this Java-style path will not import as
# written; point it at your own function that returns (statistic, p_value).
from io.thecodeforge.monitoring.mmd import mmd_test

def detect_embedding_drift(
    train_embeddings: np.ndarray,
    prod_embeddings: np.ndarray,
    kernel: str = 'rbf',
    threshold: float = 0.05
) -> bool:
    """Detect drift in high-dimensional embeddings using MMD."""
    stat, p_value = mmd_test(train_embeddings, prod_embeddings, kernel=kernel)
    return p_value < threshold  # significant drift
▶ Output
MMD statistic: 0.243, p-value: 0.001 -> Drift detected.
💡When to Use MMD
MMD is powerful but slow for 10M+ samples. Use per-feature PSI for daily checks, and run MMD on a 10% sample weekly for high-sensitivity areas like fraud or recommendation.
📊 Production Insight
A major e-commerce team used MMD on user embedding vectors and detected a drift no single feature caught: users from a new region had different browse-add-to-cart patterns.
Single-feature tests showed no drift in 'time_on_site' or 'cart_size'.
Multivariate tests caught the interaction.
🎯 Key Takeaway
Multivariate drift catches interactions that single-feature tests miss.
Use embedding drift for deep models, MMD for tabular.
Combine both for a complete picture.
🗂 Drift Detection Methods
When to use each statistical test in production
Method | Best For | Sensitivity | Compute Cost | Interpretability
PSI | Categorical / binned features | Moderate (proportional shifts) | Low | High (bins-based)
KL Divergence | Directional change detection | High (asymmetric) | Low | Moderate
KS Test | Continuous features | High (location shifts) | Low | High (max diff point)
MMD | High-dimensional / embeddings | Very High (joint shifts) | High (kernel matrix) | Low (black-box)

🎯 Key Takeaways

  • Model monitoring is not optional — real data always drifts.
  • Detect drift using PSI for categorical, KS for continuous, KL for directional shifts.
  • Use a rolling 30-day window with confirmation before alerting to avoid false positives.
  • Combine single-feature and multivariate tests for complete coverage.
  • Automate retraining triggers but always validate on recent holdout data.
  • Blindly trusting 'accuracy' will hide silent failures — monitor distributions.

⚠ Common Mistakes to Avoid

    Using only accuracy as a monitoring metric
    Symptom

    Model accuracy remains high while recall on the minority class collapses because the model stops predicting certain classes (e.g., it never predicts fraud anymore). On imbalanced data, accuracy hides this kind of failure.

    Fix

    Monitor per-class metrics (precision, recall, F1) AND data distributions. Use PSI on prediction probabilities to catch silent failures.

    Computing drift on a single day's data
    Symptom

    Daily drift alerts that are actually random noise. The ops team disables all alerts, and real drift goes undetected.

    Fix

    Always use a rolling window of at least 7 days (preferably 30) for drift computation. Confirm drift over 2+ consecutive windows before triggering an alert.

    Ignoring missing values in production data
    Symptom

    A feature that is missing 30% of values (e.g., due to a pipeline bug) shows a distribution spike at 0. PSI jumps to 0.8, triggering a false alarm.

    Fix

    Handle missing values explicitly: impute with training median or flag as a separate category. Monitor feature completeness separately from distribution drift.
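
    One way to keep completeness and distribution drift apart is sketched below, assuming missing values arrive as NaN. The function name and the synthetic data are illustrative. It reports the missing rate on its own and runs the KS test only on observed values, then contrasts that with the naive approach of filling NaN with 0.

```python
import numpy as np
from scipy.stats import ks_2samp

def completeness_and_drift(train: np.ndarray, prod: np.ndarray) -> dict:
    """Report the missing rate separately, then test only observed values."""
    prod_missing = np.isnan(prod)
    train_observed = train[~np.isnan(train)]
    prod_observed = prod[~prod_missing]
    stat, p_value = ks_2samp(train_observed, prod_observed)
    return {
        "prod_missing_rate": float(prod_missing.mean()),
        "ks_stat": float(stat),
        "ks_p_value": float(p_value),
    }

# Assumed synthetic feature: same distribution in training and production.
rng = np.random.default_rng(0)
train = rng.normal(100.0, 15.0, size=5_000)
prod = rng.normal(100.0, 15.0, size=5_000)
prod[rng.random(prod.size) < 0.30] = np.nan   # simulate a pipeline bug

# Naive approach: NaN silently filled with 0 before testing.
naive_stat, _ = ks_2samp(train, np.nan_to_num(prod, nan=0.0))
report = completeness_and_drift(train, prod)

print("KS with NaN filled as 0:", round(float(naive_stat), 3))
print(report)
```

    The naive statistic looks like severe drift even though the underlying distribution has not moved, while the separate missing rate points directly at the pipeline problem.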

    Not refreshing the baseline after retraining
    Symptom

    After retraining on new data, the drift detection still compares against the original 2023 baseline. Every feature shows drift because the model's world has already moved.

    Fix

    After each retraining, compute a new baseline from the training data used in that retraining. Store the baseline version alongside the model version.

Interview Questions on This Topic

  • Q: Explain the difference between covariate drift and concept drift. How would you detect each in a production ML system? (Senior)
    Covariate drift means the distribution of input features has changed (e.g., user age distribution shifted from 25–35 to 35–45). Concept drift means the relationship between features and the target has changed (e.g., what constituted 'fraud' in 2023 is different now). To detect covariate drift: use PSI or KS test comparing current feature distributions against the training data baseline. For concept drift: compare prediction distribution vs actual labels using KS test on prediction residuals, or monitor prediction drift over time (if you have ground truth with delay). In practice, run both tests in parallel. If only covariate drifts, you may need to retrain with recent data. If concept drifts, you need to re-engineer features or reconsider the business logic.
  • Q: What is PSI and how do you interpret its value? When would you choose KS test over PSI? (Mid-level)
    Population Stability Index (PSI) measures the shift in a variable's distribution between two samples (expected vs actual). Formula: sum((P_i - Q_i) * ln(P_i / Q_i)). Interpretation: <0.1 = no shift, 0.1–0.25 = minor, >0.25 = significant shift requiring investigation. Choose KS test over PSI when: the variable is continuous (PSI requires binning which loses information), you care about location shifts (KS gives max difference point), or you want a p-value for statistical significance. PSI is better for categorical/binned features and when you want a symmetric stability measure.
  • Q: Design a monitoring system for a credit scoring model that serves 100k predictions per day. What metrics would you track, what thresholds, and how would you alert? (Senior)
    I'd design a three-tier system:
    1. Data quality: Track missing rate per feature, value range checks, and data staleness (time since last batch). Alert if any metric exceeds 2x weekly average.
    2. Drift detection: Run daily PSI on 10 key features (income, age, loan amount, DTI ratio). Run KS on 3 continuous features (credit score, interest rate). Compare against training baseline from the last retraining (refreshed monthly). Thresholds: PSI > 0.15 -> WARNING, PSI > 0.25 -> CRITICAL. KS p-value < 0.05 with stat > 0.1 -> investigate.
    3. Business impact: Monitor mean predicted probability, approval rate, and default rate (when ground truth arrives). Compare to expected values. Alert if approval rate deviates >5% from expected.
    Alerting: WARNING -> Slack notification for data team; CRITICAL -> PagerDuty with 30-min response time. Use confirmation windows: require 2 out of 3 consecutive days above threshold before paging.

Frequently Asked Questions

How often should I run drift detection in production?

For most systems, once every 24 hours is sufficient. If your data flows in real-time and the business impact of delay is high (e.g., fraud detection), run every hour on a rolling 24-hour window. Batch systems can run daily after the batch completes.

What PSI threshold should I use for my model?

Start with 0.25 as a default, but calibrate on your own data. For high-stakes models (credit, healthcare), use 0.1. For low-stakes models (recommendation, content ranking), 0.25–0.3 is fine. Plot PSI over time for a month to understand your baseline noise level.
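
A quick way to get a feel for baseline noise before a month of production history exists is to simulate it. The sketch below is an illustration under stated assumptions: a synthetic normally distributed feature, equal-width bins taken from the training data, and hypothetical helper names. It draws "production" windows from the same distribution as training and reports how large PSI gets by chance at different window sizes.

```python
import numpy as np

def psi_from_samples(expected, actual, bins=10, eps=1e-6):
    """PSI between two raw samples, using bin edges derived from `expected`."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    actual = np.clip(actual, edges[0], edges[-1])
    e = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(3)
train = rng.normal(0.0, 1.0, size=50_000)   # assumed training feature

def noise_floor(window_size, trials=200):
    """95th percentile of PSI when production matches the training distribution."""
    values = [
        psi_from_samples(train, rng.normal(0.0, 1.0, size=window_size))
        for _ in range(trials)
    ]
    return float(np.percentile(values, 95))

for n in (200, 2_000, 20_000):
    print(f"window of {n:>6} rows: 95th percentile PSI with no drift = {noise_floor(n):.4f}")
```

Small windows produce noticeably higher PSI values even with no drift at all, so a threshold that is sensible for a large rolling window can fire constantly on a low-volume segment.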

Can drift detection work without labels?

Yes — covariate drift detection works purely on input features. Prediction drift (comparing output distributions) can hint at concept drift even without ground truth. For concept drift, you need delayed labels, but you can use proxy metrics (e.g., conversion rate) as a signal.
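
As a label-free check, prediction drift can be measured by comparing score distributions directly. The following is a minimal sketch, assuming the model outputs probabilities in [0, 1]; the function name and the synthetic beta-distributed scores are illustrative only.

```python
import numpy as np

def prediction_psi(baseline_scores, current_scores, bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between two sets of predicted probabilities, using shared bins on [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    base = np.histogram(baseline_scores, bins=edges)[0] / len(baseline_scores) + eps
    curr = np.histogram(current_scores, bins=edges)[0] / len(current_scores) + eps
    return float(np.sum((curr - base) * np.log(curr / base)))

rng = np.random.default_rng(7)
baseline = rng.beta(2, 8, size=10_000)   # assumed: scores at deployment time
stable = rng.beta(2, 8, size=10_000)     # same score distribution
shifted = rng.beta(4, 6, size=10_000)    # model now outputs higher scores

print("stable PSI: ", round(prediction_psi(baseline, stable), 4))
print("shifted PSI:", round(prediction_psi(baseline, shifted), 4))
```

A shift in scores does not prove the concept has changed, since covariate drift alone can move the outputs, but it is a cheap signal to prioritise a manual review until delayed labels arrive.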

Does retraining always fix drift?

No. If the drift is caused by a fundamental change in the data generation process (e.g., new product launch, regulatory change), retraining on the same features may not help. You may need to re-engineer features or add new data sources. Always investigate the root cause before retraining.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
