Senior 6 min · March 06, 2026

Drift Detection — Covariate Drift Cost a Fraud Model $2M

Fraud model accuracy fell from 92% to 67% due to covariate drift.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

Last updated: May 23, 2026
Quick Answer
  • Model monitoring tracks prediction quality and data distributions over time
  • Drift detection uses statistical tests like PSI, KL divergence, and KS test
  • Covariate drift = input distribution changes; concept drift = label relationship changes
  • PSI > 0.25 typically indicates significant drift in production
  • Production insight: most drift alert fatigue comes from testing on too-small windows
  • Biggest mistake: treating drift detection as a binary yes/no instead of a severity scale

Plain-English First

Imagine you trained a spam filter in 2020, and it worked perfectly. But by 2023, spammers started writing emails that sound like friendly messages — 'Hey buddy, check out this crypto opportunity!' Your filter never saw that style of spam, so it stops catching it. Your model didn't break. The world changed around it. Model monitoring is the alarm system that notices the world has changed. Drift detection is the tool that figures out exactly what changed and how badly.

Every ML model has an expiry date — you just don't know when it is. The moment you deploy a model to production, the clock starts ticking. Real-world data is a living thing: customer behaviour shifts, sensor calibrations drift, economic conditions flip, and language evolves. A model trained on yesterday's data makes yesterday's decisions, and in fast-moving domains that gap kills business value silently and expensively. Unlike a crashed server, a drifting model doesn't throw an error. It just quietly becomes wrong.

The core problem is that ML models are frozen snapshots of a world that keeps moving. Traditional software has deterministic logic you can test; a model's 'logic' is baked into millions of learned parameters that have no automatic self-correction mechanism. When the statistical relationship between your input features and your target label changes, the model has no way of knowing. It will keep producing confident predictions that are increasingly divorced from reality — and your monitoring stack needs to catch that before your users or your business does.

By the end of this article you'll be able to implement a production-grade monitoring pipeline that detects covariate drift, concept drift, and prediction drift using PSI, KL divergence, and the Kolmogorov-Smirnov test. You'll understand which detector to reach for in which situation, the statistical subtleties that trip up even experienced engineers, and how to wire all of it into an alerting workflow that won't wake you up for false positives at 3 a.m.

What Is Model Monitoring and Drift Detection?

Model monitoring is the practice of continuously observing a deployed ML model's performance and input data. Drift detection identifies when the statistical properties of the data or the relationship between inputs and outputs change from the training baseline. Without monitoring, you're flying blind: your model could be making decisions based on patterns that no longer exist.

Drift falls into three categories
  • Covariate drift: the distribution of input features changes (e.g., user age shifts from 25–35 to 35–45)
  • Concept drift: the relationship between features and target changes (e.g., what was considered 'fraud' looks different today)
  • Prediction drift: the distribution of model outputs shifts (can signal concept drift even without labels)

In production, you need to detect all three. Each requires a different statistical test and a different response.
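To make the first two categories concrete, here is a minimal synthetic sketch. It is not drawn from any production data: the toy labeling rule, the sample sizes, and the shift of 0.8 are all assumptions chosen for illustration. It suggests why a feature-level KS test flags covariate drift yet stays quiet under pure concept drift, where only the rule that produces labels has changed.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

def toy_label(x: np.ndarray, cutoff: float) -> np.ndarray:
    """Hypothetical labeling rule: positive when the feature exceeds a cutoff."""
    return (x > cutoff).astype(int)

# Training-time world: feature ~ N(0, 1), positives above 1.0
x_train = rng.normal(0.0, 1.0, 5000)
y_train = toy_label(x_train, cutoff=1.0)

# Covariate drift: the input distribution moves, the labeling rule does not
x_cov = rng.normal(0.8, 1.0, 5000)

# Concept drift: inputs look the same, but the rule that produces labels changed
x_con = rng.normal(0.0, 1.0, 5000)
y_con = toy_label(x_con, cutoff=0.0)

ks_cov = ks_2samp(x_train, x_cov)
ks_con = ks_2samp(x_train, x_con)

print(f"covariate drift: KS stat={ks_cov.statistic:.3f}")
print(f"concept drift:   KS stat={ks_con.statistic:.3f} (the feature test sees little)")
print(f"positive rate:   train={y_train.mean():.2f}, concept-drifted={y_con.mean():.2f}")
```

The positive rate moves sharply in the concept-drift case even though the feature distribution is unchanged, which is why label-based or prediction-based checks are needed alongside feature tests.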

drift_detection_basic.py (Python)
from typing import List, Optional
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_drift(
    train: np.ndarray,
    production: np.ndarray,
    threshold: float = 0.1
) -> Optional[float]:
    """Detect covariate drift using two-sample KS test.
    Namespace: io.thecodeforge.monitoring.drift
    """
    if len(train) == 0 or len(production) == 0:
        return None
    stat, p_value = ks_2samp(train, production)
    if p_value < 0.05 and stat > threshold:
        return stat
    return None
Output
None
The River Crossing Metaphor
  • Covariate drift = the water level changed (input distributions)
  • Concept drift = the river changed course (relationship changed)
  • Prediction drift = the bridge (model) is swaying (outputs shifted)
  • You need different tools for each: PSI or KS on input features for the water level, label-based checks (once ground truth arrives) for a change of course
Production Insight
Most teams monitor only accuracy — a lagging indicator.
By the time accuracy drops, drift has been present for weeks.
Detect drift upstream using feature distribution tests, not downstream metrics.
Key Takeaway
Drift detection is a leading indicator of model failure.
You monitor inputs, not just outputs.
Be proactive: test distributions, not just business metrics.
[Figure: Drift Detection Pipeline for Fraud Models. From monitoring to retraining: (1) model monitoring and drift detection, (2) statistical tests: PSI, KL, KS, (3) production monitoring pipeline, (4) multivariate drift and A/B testing, (5) retrain on drifted data. Warning shown in the figure: retraining on drifted data can amplify bias and degrade performance; always validate drift type (covariate vs concept) before retraining.]

Statistical Tests: PSI, KL Divergence, and Kolmogorov-Smirnov

  1. Population Stability Index (PSI): Measures how much a variable's distribution has shifted between two samples. Formula: sum((actual_prop_i - expected_prop_i) * ln(actual_prop_i / expected_prop_i)). PSI < 0.1 = no shift, 0.1–0.25 = minor, > 0.25 = significant.
  2. KL Divergence: Measures the information lost when using expected distribution to approximate actual. Asymmetric — order matters. Use PSI for symmetric stability, KL for asymmetrical change detection.
  3. Kolmogorov-Smirnov (KS) Test: Non-parametric test comparing two empirical distributions. Returns a statistic (max difference) and a p-value. Works for continuous features. More sensitive than PSI for location shifts.

In practice, use PSI for categorical/binned features, KS for continuous. KL divergence is useful when you care about directionality of change.
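The symmetric/asymmetric distinction can be checked numerically. The sketch below uses made-up bin proportions (an assumption, not data from the article) and `scipy.stats.entropy`, which returns KL(p || q) when given two arguments. Expanding the PSI formula term by term shows that PSI equals the sum of the two directed KL divergences, which is why swapping the arguments changes KL but not PSI.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical bin proportions for one feature (each sums to 1)
expected = np.array([0.50, 0.30, 0.15, 0.05])  # training baseline
actual = np.array([0.30, 0.30, 0.20, 0.20])    # production window

# scipy.stats.entropy(p, q) returns KL(p || q) in nats
kl_actual_vs_expected = entropy(actual, expected)
kl_expected_vs_actual = entropy(expected, actual)

# PSI computed directly from the formula given in the list above
psi_value = float(np.sum((actual - expected) * np.log(actual / expected)))

print(f"KL(actual || expected) = {kl_actual_vs_expected:.4f}")
print(f"KL(expected || actual) = {kl_expected_vs_actual:.4f}")
print(f"PSI                    = {psi_value:.4f}")
print(f"sum of both KL terms   = {kl_actual_vs_expected + kl_expected_vs_actual:.4f}")
```

The two KL values differ because each weights the log-ratio by a different distribution; PSI matches their sum.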

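psi_and_ks.py (Python)
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-6) -> float:
    # Local stand-in for the site's psi helper (io.thecodeforge cannot be imported
    # because io is a standard-library module). Empty bins are smoothed with eps.
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

def compute_drift_report(train: np.ndarray, prod: np.ndarray, bins: int = 10):
    # Bin edges come from the training data and are reused for production,
    # so both histograms describe the same intervals.
    edges = np.histogram_bin_edges(train, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # keep production values outside the training range
    train_bins = np.histogram(train, bins=edges)[0] / len(train)
    prod_bins = np.histogram(prod, bins=edges)[0] / len(prod)
    psi_value = psi(train_bins, prod_bins)
    ks_stat, p_value = ks_2samp(train, prod)
    return {
        'psi': round(psi_value, 4),
        'ks_stat': round(ks_stat, 4),
        'ks_p_value': round(p_value, 6),
        'drift_detected': psi_value > 0.25 or (p_value < 0.05 and ks_stat > 0.1)
    }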
Output
{
'psi': 0.321,
'ks_stat': 0.184,
'ks_p_value': 0.0001,
'drift_detected': True
}
When PSI Breaks
PSI assumes categorical bins with at least 5% expected proportion. If a bin has 0 expected count, the log blows up. Always smooth bins with a small epsilon (1e-6) before computing.
Production Insight
PSI threshold of 0.25 works for most business features but not for tail-heavy distributions.
For credit risk models, even 0.1 PSI triggers action.
Always calibrate thresholds on your own production data — never blindly copy Kaggle values.
Key Takeaway
PSI for stability, KS for sensitivity, KL for direction.
Always smooth bins before PSI.
Test on a rolling 30-day window, not a single day.
Which Test to Use
If: Feature is continuous with known distribution (e.g., age, income)
Use: KS test, which is more sensitive to location shifts
If: Feature is categorical or binned (e.g., region code, segment)
Use: PSI, which captures proportional shifts
If: You need to measure information loss directionally
Use: KL divergence, which is asymmetric and good for detecting unexpected distributions

Building a Production Monitoring Pipeline

  1. Data collection: Log model inputs and outputs to a time-series store (e.g., Kafka + InfluxDB). Store at least 30 days of raw feature vectors and predictions.
  2. Drift computation: Run scheduled jobs (e.g., Airflow DAG every 6 hours) that compute PSI, KS, and prediction drift for each feature vs. the training baseline. Store results in a separate metrics table.
  3. Alerting: Tiered alerts: INFO (PSI 0.1–0.2), WARNING (0.2–0.3), CRITICAL (>0.3). Confirm drift over at least two consecutive windows before paging. Avoid single-day spikes that are just noise.
  4. Retraining trigger: When drift exceeds threshold and is confirmed, automatically trigger a retraining job with the latest 30 days of production data. Validate on a recent holdout set before deploying.

This architecture separates detection from action — you can tune alerts without affecting retraining logic.
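The tiering and confirmation logic from step 3 can be sketched in a few lines. This is one possible reading, not the pipeline's actual implementation: the function names are hypothetical, and the handling of the exact boundaries (0.2 counted as WARNING, 0.3 as WARNING rather than CRITICAL) is an assumption, since the tiers above do not say which side the endpoints fall on.

```python
from typing import List

# Ordering used to compare severities
RANK = {"OK": 0, "INFO": 1, "WARNING": 2, "CRITICAL": 3}

def severity_for_psi(psi: float) -> str:
    """Map a single PSI value to the tiers listed above (boundary handling assumed)."""
    if psi > 0.3:
        return "CRITICAL"
    if psi >= 0.2:
        return "WARNING"
    if psi >= 0.1:
        return "INFO"
    return "OK"

def confirmed_severity(psi_history: List[float], windows: int = 2) -> str:
    """Return the highest severity sustained across the last `windows` checks."""
    if len(psi_history) < windows:
        return "OK"  # not enough history to confirm anything yet
    recent = [severity_for_psi(p) for p in psi_history[-windows:]]
    # The sustained level is the weakest severity within the confirmation span
    return min(recent, key=RANK.__getitem__)

print(confirmed_severity([0.05, 0.35, 0.06]))  # single spike  -> OK
print(confirmed_severity([0.05, 0.22, 0.27]))  # two in a row  -> WARNING
```

Taking the minimum over the confirmation span means a one-window spike never pages anyone, while a sustained shift is reported at the level it actually held.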

monitoring_pipeline.py (Python)
from datetime import datetime, timedelta, timezone
import pandas as pd
# compute_all_drift and evaluate_alert are project-internal helpers that this
# article does not show. The package is written as `thecodeforge` (an assumed
# name) because `io` is a standard-library module, so `io.thecodeforge` cannot
# be imported.
from thecodeforge.monitoring.drift import compute_all_drift
from thecodeforge.alerts import evaluate_alert

def run_monitoring_check():
    # load_features, trigger_retraining_job and log_metrics are also
    # project-specific helpers assumed to be defined elsewhere.
    now = datetime.now(timezone.utc)  # timezone-aware; utcnow() is deprecated
    window_start = now - timedelta(days=30)
    # Load production features from the last 30 days
    prod_data = load_features(start_time=window_start, end_time=now)
    # Load training baseline (stored as parquet)
    train = pd.read_parquet('s3://model-baselines/latest/train_features.parquet')
    # Compute drift for each feature
    drift_results = compute_all_drift(train, prod_data)
    # Evaluate alert severity
    alert = evaluate_alert(drift_results)
    if alert.severity in ['WARNING', 'CRITICAL']:
        trigger_retraining_job(reason=alert.summary)
    log_metrics(drift_results, alert)
Output
2026-04-15 03:00:00 INFO: Drift check completed. 2 features PSI > 0.1. Alert severity: WARNING. Retraining triggered.
Don't Over-Alert
Single-day spikes in PSI are common after a data glitch or A/B test. Always confirm drift over at least two consecutive windows before firing Slack messages. Your team will thank you.
Production Insight
Running drift checks every hour on 100 features costs ~$2/day in compute.
But missing a single drift that causes a 10% revenue drop costs $50k+/week.
Invest in the pipeline early — pay the compute cost, not the trust cost.
Key Takeaway
Separate detection from action.
Use tiered alerting with confirmation windows.
Automate retraining triggers but always require validation.

Common Pitfalls in Drift Detection

  1. Testing drift on the wrong baseline: Always compare against the training data distribution, not a previous production snapshot. Production distributions shift gradually — if you compare against last month, you'll miss long-term drift.
  2. Ignoring feature interactions: Drift in one feature may be harmless when another feature compensates. For example, if 'age' drifts up but 'income' drifts up proportionally, the model may still work. Single-feature drift tests alone can cause false alarms.
  3. Using only p-values: A tiny p-value with a tiny KS statistic (e.g., 0.02) may be statistically significant but practically irrelevant. Always check effect size alongside p-value.
  4. Not handling missing data: If production data is missing for a feature, the distribution collapses to a spike at 0, which looks like extreme drift. Handle missing values explicitly before computing tests.
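Pitfall 3 is easy to reproduce. In the sketch below (synthetic data; the sample size of 200,000 and the 0.05 standard deviation shift are assumptions picked to make the point), the KS test returns a vanishingly small p-value for a shift that is practically negligible. Requiring a minimum KS statistic alongside the p-value keeps such cases from alerting; the 0.1 floor mirrors the threshold used elsewhere in this article.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Two large samples whose means differ by only 0.05 standard deviations
train = rng.normal(0.00, 1.0, 200_000)
prod = rng.normal(0.05, 1.0, 200_000)

result = ks_2samp(train, prod)

# Hypothetical alert policy: require significance AND a minimum effect size
ALPHA = 0.05
MIN_KS_STAT = 0.1
significant = bool(result.pvalue < ALPHA)
material = bool(result.statistic > MIN_KS_STAT)

print(f"KS stat={result.statistic:.4f}, p-value={result.pvalue:.2e}")
print(f"significant={significant}, material={material}, alert={significant and material}")
```

With samples this large almost any difference is statistically significant, so the effect-size check is what decides whether the shift is worth acting on.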
The Gardener Analogy
  • PSI threshold = how much weed you tolerate before acting
  • Confirmation window = wait a week before pulling
  • Feature interaction = some weeds help the soil
  • Missing data = a patch of bare dirt — fix the irrigation, don't just spray herbicide
Production Insight
A team spent 3 months chasing 'drift' that was actually an ETL bug dropping NaN values.
Always validate your monitoring pipeline against known-good data first.
Drift detection is only as good as your data quality.
Key Takeaway
Test against training baseline, not previous production.
Consider feature interactions; use effect size not just p-value.
Handle missing data before computing drift.

Advanced: Multivariate Drift Detection and A/B Testing Integration

Single-feature tests scale linearly but miss interactions. For high-dimensional models (e.g., embeddings, tabular with 100+ features), use:

  • Maximum Mean Discrepancy (MMD): A kernel-based test that compares two high-dimensional distributions. More powerful than per-feature tests but computationally expensive.
  • Drift Detection on Model Embeddings: If your model has a latent layer (e.g., 64-dim), compute PSI on the embedding distribution. This catches joint shifts that single features miss.
  • A/B Test Validation: When you deploy a new model version, run both models in shadow mode for a week. Compute drift between the candidate's predictions and the champion's. Treat prediction distribution divergence as a prerequisite for go-live.

In production, combine single-feature tests for explainability with multivariate tests for sensitivity. This gives you both the 'what changed' and the 'where to look'.
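For readers who want to see what an MMD test involves, here is a minimal NumPy-only sketch of a permutation test with an RBF kernel. It is a generic textbook construction, not the implementation behind this article's examples: the function names, the biased MMD estimate, the `gamma = 1 / n_features` default, and the 200 permutations are all assumptions. The quadratic cost of the kernel matrices is visible in the code, which is why sampling matters at scale.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """RBF kernel matrix exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def mmd2(x: np.ndarray, y: np.ndarray, gamma: float) -> float:
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return float(
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )

def mmd_permutation_test(x, y, n_permutations=200, gamma=None, seed=0):
    """Return (observed MMD^2, permutation p-value)."""
    rng = np.random.default_rng(seed)
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default bandwidth heuristic
    observed = mmd2(x, y, gamma)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        # Under the null hypothesis, group membership is exchangeable
        if mmd2(pooled[perm[:n]], pooled[perm[n:]], gamma) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic 5-dimensional "embeddings" with a mean shift in production
rng = np.random.default_rng(1)
train_emb = rng.normal(0.0, 1.0, size=(200, 5))
prod_emb = rng.normal(0.8, 1.0, size=(200, 5))

stat, p_value = mmd_permutation_test(train_emb, prod_emb)
print(f"MMD^2={stat:.4f}, p-value={p_value:.4f}")
```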

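multivariate_drift.py (Python)
import numpy as np
# mmd_test is a project-internal helper that this article does not show. The
# package is written as `thecodeforge` (an assumed name) because `io` is a
# standard-library module, so `io.thecodeforge` cannot be imported.
from thecodeforge.monitoring.mmd import mmd_test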

def detect_embedding_drift(
    train_embeddings: np.ndarray,
    prod_embeddings: np.ndarray,
    kernel: str = 'rbf',
    threshold: float = 0.05
) -> bool:
    """Detect drift in high-dimensional embeddings using MMD."""
    stat, p_value = mmd_test(train_embeddings, prod_embeddings, kernel=kernel)
    return p_value < threshold  # significant drift
Output
MMD statistic: 0.243, p-value: 0.001 -> Drift detected.
When to Use MMD
MMD is powerful but slow for 10M+ samples. Use per-feature PSI for daily checks, and run MMD on a 10% sample weekly for high-sensitivity areas like fraud or recommendation.
Production Insight
A major e-commerce team used MMD on user embedding vectors and detected a drift no single feature caught: users from a new region had different browse-add-to-cart patterns.
Single-feature tests showed no drift in 'time_on_site' or 'cart_size'.
Multivariate tests caught the interaction.
Key Takeaway
Multivariate drift catches interactions that single-feature tests miss.
Use embedding drift for deep models, MMD for tabular.
Combine both for a complete picture.

Why You Monitor for Data Drift Before Concept Drift (And What Happens When You Don't)

New engineers always ask me: "Should I track data drift or concept drift first?" The answer is data drift, every time. Here's the cold logic: data drift breaks your input pipeline silently. Concept drift breaks your predictions. If you catch data drift first, you can alert before your model serves garbage. If you chase concept drift without monitoring data, you'll waste weeks debugging model architecture when the real culprit is a corrupted feature source.

I've seen teams deploy sophisticated concept drift detectors, only to discover their data pipeline had been feeding NaN-filled parquets for three days. The model wasn't drifting — it was starving. Data drift detection acts as the canary. It tells you when the world changed in ways your training distribution never saw. Only after confirming your inputs are valid should you look for changes in the relationship between features and labels.

The practical reality: deploy data drift monitors on every upstream feature. Use KS tests for continuous features, chi-square for categorical. Set alerting thresholds at p < 0.01 rather than 0.05: with many features tested on every run, the stricter cutoff cuts false alarms at the cost of a little sensitivity. When the alarm fires, check the pipeline before you touch the model.
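For the categorical half of that advice, a chi-square test of homogeneity on category counts is one common option. The sketch below is a minimal illustration under assumed inputs: the function name, the region codes, and the counts are hypothetical. It relies on `scipy.stats.chi2_contingency`, which takes a contingency table of observed counts.

```python
from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency

def categorical_drift(reference, production, alpha=0.01):
    """Chi-square test of homogeneity on category counts.

    Returns (drift_flag, chi2_statistic, p_value).
    """
    categories = sorted(set(reference) | set(production))
    ref_counts = Counter(reference)
    prod_counts = Counter(production)
    # 2 x k contingency table: one row per sample, one column per category
    table = np.array([
        [ref_counts[c] for c in categories],
        [prod_counts[c] for c in categories],
    ])
    chi2, p_value, _dof, _expected = chi2_contingency(table)
    return bool(p_value < alpha), float(chi2), float(p_value)

# Hypothetical example: a new merchant region appears in production
reference = ["US"] * 700 + ["UK"] * 300
production = ["US"] * 400 + ["UK"] * 300 + ["BR"] * 300

drifted, chi2, p_value = categorical_drift(reference, production)
print(f"drifted={drifted}, chi2={chi2:.1f}, p-value={p_value:.2e}")
```

The chi-square approximation becomes unreliable when expected counts per cell are very small, so rare categories are often merged into an "other" bucket before testing.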

DataDriftFirst.py (Python)
# io.thecodeforge: ml-ai tutorial

# Practical data drift monitor with alerting
import numpy as np
from scipy.stats import ks_2samp
import logging

logging.basicConfig(level=logging.WARNING)

def monitor_data_drift(reference_sample, production_sample, feature_name, alpha=0.01):
    """KS test for continuous features. Logs alert if drift detected."""
    statistic, p_value = ks_2samp(reference_sample, production_sample)
    
    if p_value < alpha:
        logging.warning(
            f"DRIFT DETECTED on feature '{feature_name}' | "
            f"KS stat={statistic:.4f}, p-value={p_value:.6f}"
        )
        return True
    return False

# Example: monitoring 'transaction_amount' in a fraud model
historical_amounts = np.random.exponential(scale=100, size=10000)
current_batch = np.random.exponential(scale=150, size=1000)  # drift introduced

if monitor_data_drift(historical_amounts, current_batch, "transaction_amount"):
    print("Pipeline check triggered: inspect upstream source")
Output
WARNING:root:DRIFT DETECTED on feature 'transaction_amount' | KS stat=0.1523, p-value=0.000000
Pipeline check triggered: inspect upstream source
Production Trap:
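Don't set alpha to 0.05 just because textbooks do. With dozens of features checked every few hours, 0.05 produces a steady stream of false alarms. 0.01 gives up a little sensitivity in exchange for alerts you can trust; pair it with an effect-size floor so large samples don't flag trivial shifts.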
Key Takeaway
Monitor data drift before concept drift — your input pipeline is the weakest link.

The Hidden Cost of Retraining On Drifted Data: Feedback Loops That Destroy Your Model

Here's the trap nobody talks about. You detect drift, you retrain your model on the new production data, and you deploy. Congratulations — you just locked in the drift as the new normal. If the drift was temporary (a holiday spike, a bot attack, a data pipeline glitch), you've now poisoned your model with garbage.

I consulted for a fintech startup that retrained their credit risk model every time they saw drift in application volumes. Three months later, the model started rejecting good applicants. Why? A promotional campaign caused a temporary spike in high-risk applications. The team retrained on that data, and the model learned to associate higher volume with higher risk. When the campaign ended, legitimate applicants got flagged. They spent two quarters unwinding that feedback loop.

The fix: never retrain blindly on drifted data. First, classify the drift. Is it temporary (seasonal, campaign-driven) or permanent (regulatory change, new user segment)? Use a drift classification model or heuristic rules. If temporary, keep the old model and suppress alerts. If permanent, retrain with a warm-start from the last stable checkpoint, then validate against a holdout set that spans before and after the drift onset. The holdout tells you if the retrain actually improved generalization or just memorized the noise.

FeedbackLoopGuard.py (Python)
# io.thecodeforge: ml-ai tutorial

# Detect and classify drift before retraining
import numpy as np

def classify_drift(values, drift_start_index, lookback_days=5):
    """Heuristic: if the most recent values are back in the pre-drift range, the drift was temporary."""
    pre_drift = values[:drift_start_index]
    recent = values[-lookback_days:]  # assumes one value per day

    pre_mean = np.mean(pre_drift)
    recent_mean = np.mean(recent)

    # Temporary if the recent mean is within 10% of the pre-drift mean
    if abs(recent_mean - pre_mean) < 0.1 * abs(pre_mean):
        return "temporary"
    return "permanent"

# Example usage: 20 stable days, a 5-day spike, then 5 days of recovery
values = [100] * 20 + [200] * 5 + [100] * 5

drift_type = classify_drift(values, drift_start_index=20)
print(f"Drift type: {drift_type}")  # temporary
Output
Drift type: temporary
Senior Shortcut:
Maintain a 'stable baseline' snapshot of your training data. When drift is detected, compare production data to this baseline, not to the latest retrain set. This prevents your model from drifting along with the data.
Key Takeaway
Never retrain on drifted data without classifying the drift first — or you'll bake temporary anomalies into permanent model degradation.
Production incident · Post-mortem · Severity: high

The Silent Churn: How a Fraud Model Lost $2M Before Anyone Noticed

Symptom
Fraud detection accuracy dropped from 92% to 67% over three months. False positive rate tripled. No error logs, no downtime.
Assumption
The team assumed the model would maintain its performance because retraining happened monthly. They only monitored binary accuracy on a static holdout set.
Root cause
Covariate drift: the distribution of transaction amounts, merchant categories, and geolocation features shifted as the company expanded into a new market. The holdout set was never refreshed. Concept drift also occurred because fraudsters adapted to the model's patterns.
Fix
Implemented a monitoring pipeline that tracks per-feature PSI and monthly KS tests on production data. Added a dashboard with trend lines over 30-day rolling windows. Set up alerts triggered at PSI > 0.2 with a 7-day confirmation window to filter out noise.
Key lesson
  • Monitor data distributions, not just accuracy — accuracy can stay high while the model misses critical segments.
  • Refresh holdout sets quarterly with current production data.
  • Combine covariate and concept drift detection: use PSI on input features, and compare prediction distributions over time as a proxy for concept drift until labels arrive.
  • Always confirm drift alerts over multiple windows before paging anyone.
Production debug guide: trace the root cause when your model's predictions start degrading.
Symptom · 01
Model accuracy dropped but no feature changes
Fix
Check covariate drift: run PSI on each feature between training and production data for the last 7 days.
Symptom · 02
Prediction distribution shifted but feature stats look normal
Fix
Run concept drift detection: compare prediction vs actual label distributions using KS test.
Symptom · 03
Drift alerts firing every day
Fix
Increase the monitoring window from 1 day to 7 days and apply a severity threshold (e.g., PSI > 0.3). Check if the alert is driven by low-volume segments.
Symptom · 04
Model performs well on the held-out test set but fails on new production data
Fix
Verify train/test split recency — if your training data is older than 3 months, consider retraining with more recent samples.
Drift Detection Quick Reference: commands and actions for common drift scenarios.
Need to calculate PSI on a feature
Immediate action
Compute expected and actual distribution bins
Commands
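python -c "import numpy as np; expected_bins=np.array([0.25,0.25,0.25,0.25]); actual_bins=np.array([0.10,0.20,0.30,0.40]); print(np.sum((actual_bins-expected_bins)*np.log(actual_bins/expected_bins)))"
python -c "import numpy as np, pandas as pd; expected=pd.read_csv('train_features.csv')['amount']; actual=pd.read_csv('production_features.csv')['amount']; edges=np.histogram_bin_edges(expected, bins=10); e=np.clip(np.histogram(expected, bins=edges)[0]/len(expected), 1e-6, None); a=np.clip(np.histogram(actual, bins=edges)[0]/len(actual), 1e-6, None); print(np.sum((a-e)*np.log(a/e)))"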
Fix now
If PSI > 0.25, retrain the model with recent data and schedule a refresh of the monitoring baseline.
Need to compare two distributions with KS test
Immediate action
Run two-sample KS test
Commands
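python -c "import pandas as pd; from scipy.stats import ks_2samp; train_sample=pd.read_csv('train_features.csv')['amount']; prod_sample=pd.read_csv('production_features.csv')['amount']; stat, p = ks_2samp(train_sample, prod_sample); print('KS statistic:', stat, 'p-value:', p)"
python -c "import numpy as np; from scipy.stats import ks_2samp; train=np.random.normal(0,1,1000); prod=np.random.normal(0.5,1,1000); print(ks_2samp(train, prod))"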
Fix now
If p-value < 0.05 and KS statistic > 0.1, investigate feature drift and consider retraining.
Need to detect concept drift without labels
Immediate action
Monitor prediction distribution over time
Commands
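(Synthetic beta-distributed scores stand in for your logged predictions; PSI stands in for the prediction_drift helper, which this article does not show.)
python -c "import numpy as np; rng=np.random.default_rng(0); baseline_predictions=rng.beta(2,5,10000); current_predictions=rng.beta(2,3,10000); edges=np.linspace(0,1,11); b=np.clip(np.histogram(baseline_predictions, bins=edges)[0]/10000, 1e-6, None); c=np.clip(np.histogram(current_predictions, bins=edges)[0]/10000, 1e-6, None); print('Prediction drift score (PSI):', np.sum((c-b)*np.log(c/b)))"
python -c "import numpy as np; from scipy.stats import entropy; rng=np.random.default_rng(0); baseline=rng.beta(2,5,10000); predictions=rng.beta(2,3,10000); edges=np.linspace(0,1,11); b=np.histogram(baseline, bins=edges)[0]+1e-6; c=np.histogram(predictions, bins=edges)[0]+1e-6; print(entropy(c, b))"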
Fix now
If drift score > 0.2, flag the model for retraining and trigger a manual evaluation against ground truth.
Drift Detection Methods
Method | Best For | Sensitivity | Compute Cost | Interpretability
PSI | Categorical / binned features | Moderate (proportional shifts) | Low | High (bins-based)
KL Divergence | Directional change detection | High (asymmetric) | Low | Moderate
KS Test | Continuous features | High (location shifts) | Low | High (max diff point)
MMD | High-dimensional / embeddings | Very High (joint shifts) | High (kernel matrix) | Low (black-box)

Key takeaways

  1. Model monitoring is not optional: real data always drifts.
  2. Detect drift using PSI for categorical, KS for continuous, KL for directional shifts.
  3. Use a rolling 30-day window with confirmation before alerting to avoid false positives.
  4. Combine single-feature and multivariate tests for complete coverage.
  5. Automate retraining triggers but always validate on recent holdout data.
  6. Blindly trusting 'accuracy' will hide silent failures: monitor distributions.

Common mistakes to avoid


Using only accuracy as a monitoring metric

Symptom
Model accuracy remains high while recall on the fraud class collapses because the model stops predicting certain classes (e.g., it never predicts fraud anymore). Accuracy hides class imbalance shift.
Fix
Monitor per-class metrics (precision, recall, F1) AND data distributions. Use PSI on prediction probabilities to catch silent failures.

Computing drift on a single day's data

Symptom
Daily drift alerts that are actually random noise. The ops team disables all alerts, and real drift goes undetected.
Fix
Always use a rolling window of at least 7 days (preferably 30) for drift computation. Confirm drift over 2+ consecutive windows before triggering an alert.

Ignoring missing values in production data

Symptom
A feature that is missing 30% of values (e.g., due to a pipeline bug) shows a distribution spike at 0. PSI jumps to 0.8, triggering a false alarm.
Fix
Handle missing values explicitly: impute with training median or flag as a separate category. Monitor feature completeness separately from distribution drift.

Not refreshing the baseline after retraining

Symptom
After retraining on new data, the drift detection still compares against the original 2023 baseline. Every feature shows drift because the model's world has already moved.
Fix
After each retraining, compute a new baseline from the training data used in that retraining. Store the baseline version alongside the model version.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 of 03 · SENIOR

Explain the difference between covariate drift and concept drift. How would you detect each in a production ML system?

ANSWER
Covariate drift means the distribution of input features has changed (e.g., user age distribution shifted from 25–35 to 35–45). Concept drift means the relationship between features and the target has changed (e.g., what constituted 'fraud' in 2023 is different now). To detect covariate drift: use PSI or a KS test comparing current feature distributions against the training data baseline. To detect concept drift: once delayed ground truth arrives, compare predictions against actual labels, for example with a KS test on prediction residuals between a reference window and the current window. Until labels arrive, monitor prediction drift over time as a proxy. In practice, run both tests in parallel. If only covariate drift is present, you may need to retrain with recent data. If concept drift is present, you need to re-engineer features or reconsider the business logic.

Q02 · SENIOR
What is PSI and how do you interpret its value? When would you choose KS...
Q03 · SENIOR
Design a monitoring system for a credit scoring model that serves 100k p...
Frequently Asked Questions

  1. How often should I run drift detection in production?
  2. What PSI threshold should I use for my model?
  3. Can drift detection work without labels?
  4. Does retraining always fix drift?