Advanced 5 min · March 06, 2026

Model Monitoring and Drift Detection

Drift Detection — Covariate Drift Cost a Fraud Model $2M

Q: How often should I run drift detection in production?

For most systems, once every 24 hours is sufficient. If your data flows in real-time and the business impact of delay is high (e.g., fraud detection), run every hour on a rolling 24-hour window. Batch systems can run daily after the batch completes.

Q: What PSI threshold should I use for my model?

Start with 0.25 as a default, but calibrate on your own data. For high-stakes models (credit, healthcare), use 0.1. For low-stakes models (recommendation, content ranking), 0.25–0.3 is fine. Plot PSI over time for a month to understand your baseline noise level.

Q: Can drift detection work without labels?

Yes — covariate drift detection works purely on input features. Prediction drift (comparing output distributions) can hint at concept drift even without ground truth. For concept drift, you need delayed labels, but you can use proxy metrics (e.g., conversion rate) as a signal.

Q: Does retraining always fix drift?

No. If the drift is caused by a fundamental change in the data generation process (e.g., new product launch, regulatory change), retraining on the same features may not help. You may need to re-engineer features or add new data sources. Always investigate the root cause before retraining.

Fraud model accuracy fell from 92% to 67% due to covariate drift.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

✓ Production

production tested

July 27, 2026

last updated

1,713

articles · all by Naren

Before you start · 30 min

✓ Deep production experience
✓ Understanding of internals and trade-offs
✓ Experience debugging complex systems


⚡Quick Answer

Model monitoring tracks prediction quality and data distributions over time
Drift detection uses statistical tests like PSI, KL divergence, and KS test
Covariate drift = input distribution changes; concept drift = label relationship changes
PSI > 0.25 typically indicates significant drift in production
Production insight: most drift alert fatigue comes from testing on too-small windows
Biggest mistake: treating drift detection as a binary yes/no instead of a severity scale

✦ Definition~90s read

What is Model Monitoring and Drift Detection?

Model monitoring is the practice of continuously observing a deployed ML model's performance and input data. Drift detection identifies when the statistical properties of the data or the relationship between inputs and outputs change from the training baseline. Without monitoring, you're flying blind: your model could be making decisions based on patterns that no longer exist.



Plain-English First

Imagine you trained a spam filter in 2020, and it worked perfectly. But by 2023, spammers started writing emails that sound like friendly messages — 'Hey buddy, check out this crypto opportunity!' Your filter never saw that style of spam, so it stops catching it. Your model didn't break. The world changed around it. Model monitoring is the alarm system that notices the world has changed. Drift detection is the tool that figures out exactly what changed and how badly.

Every ML model has an expiry date — you just don't know when it is. The moment you deploy a model to production, the clock starts ticking. Real-world data is a living thing: customer behaviour shifts, sensor calibrations drift, economic conditions flip, and language evolves. A model trained on yesterday's data makes yesterday's decisions, and in fast-moving domains that gap kills business value silently and expensively. Unlike a crashed server, a drifting model doesn't throw an error. It just quietly becomes wrong.

The core problem is that ML models are frozen snapshots of a world that keeps moving. Traditional software has deterministic logic you can test; a model's 'logic' is baked into millions of learned parameters that have no automatic self-correction mechanism. When the statistical relationship between your input features and your target label changes, the model has no way of knowing. It will keep producing confident predictions that are increasingly divorced from reality — and your monitoring stack needs to catch that before your users or your business does.

By the end of this article you'll be able to implement a production-grade monitoring pipeline that detects covariate drift, concept drift, and prediction drift using PSI, KL divergence, and the Kolmogorov-Smirnov test. You'll understand which detector to reach for in which situation, the statistical subtleties that trip up even experienced engineers, and how to wire all of it into an alerting workflow that won't wake you up for false positives at 3 a.m.

What Is Model Monitoring and Drift Detection?

Drift falls into three categories

Covariate drift: the distribution of input features changes (e.g., user age shifts from 25–35 to 35–45)
Concept drift: the relationship between features and target changes (e.g., what was considered 'fraud' looks different today)
Prediction drift: the distribution of model outputs shifts (can signal concept drift even without labels)

In production, you need to detect all three. Each requires a different statistical test and a different response.

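drift_detection_basic.py (Python)

from typing import Optional
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_drift(
    train: np.ndarray,
    production: np.ndarray,
    threshold: float = 0.1
) -> Optional[float]:
    """Detect covariate drift using a two-sample KS test.

    Returns the KS statistic when the shift is both statistically significant
    (p < 0.05) and larger than `threshold`; otherwise returns None.
    Namespace: io.thecodeforge.monitoring.drift
    """
    if len(train) == 0 or len(production) == 0:
        return None
    stat, p_value = ks_2samp(train, production)
    if p_value < 0.05 and stat > threshold:
        return stat
    return None

# Identical samples: the KS statistic is 0, so no drift is reported
baseline = np.random.default_rng(0).normal(size=1_000)
print(detect_covariate_drift(baseline, baseline))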

Output

None

Mental Model

The River Crossing Metaphor

Think of training data as one bank of a river, and production data as the opposite bank. Drift tests measure how wide and fast the river is flowing.

Covariate drift = the water level changed (input distributions)
Concept drift = the river changed course (relationship changed)
Prediction drift = the bridge (model) is swaying (outputs shifted)
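You need different tools for each: PSI or KS on input features for the water level, label-based performance checks for the course change, and PSI on prediction scores for the swaying bridge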

📊 Production Insight

Most teams monitor only accuracy — a lagging indicator.

By the time accuracy drops, drift has been present for weeks.

Detect drift upstream using feature distribution tests, not downstream metrics.

🎯 Key Takeaway

Drift detection is a leading indicator of model failure.

You monitor inputs, not just outputs.

Be proactive: test distributions, not just business metrics.


Statistical Tests: PSI, KL Divergence, and Kolmogorov-Smirnov

Three tests dominate production drift detection:

Population Stability Index (PSI): Measures how much a variable's distribution has shifted between two samples. Formula: sum((actual_prop_i - expected_prop_i) * ln(actual_prop_i / expected_prop_i)). PSI < 0.1 = no shift, 0.1–0.25 = minor, > 0.25 = significant.
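KL Divergence: Measures the information lost when using the expected distribution to approximate the actual one. Asymmetric: KL(actual ‖ expected) differs from KL(expected ‖ actual), so order matters. PSI is the symmetrized form (the sum of the two directions). Use PSI for symmetric stability checks, KL when the direction of change matters.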
Kolmogorov-Smirnov (KS) Test: Non-parametric test comparing two empirical distributions. Returns a statistic (max difference) and a p-value. Works for continuous features. More sensitive than PSI for location shifts.

In practice, use PSI for categorical/binned features, KS for continuous. KL divergence is useful when you care about directionality of change.
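The relationship between KL divergence and PSI can be checked numerically. The sketch below uses made-up bin proportions for a single feature (the `expected` and `actual` arrays are illustrative, not taken from any real model) and assumes every bin is non-empty, so no smoothing is needed.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for two vectors of strictly positive bin proportions."""
    return float(np.sum(p * np.log(p / q)))

def psi(expected: np.ndarray, actual: np.ndarray) -> float:
    """Population Stability Index computed over the same bins."""
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Hypothetical bin proportions for one feature (each vector sums to 1)
expected = np.array([0.50, 0.30, 0.20])  # training baseline
actual = np.array([0.30, 0.30, 0.40])    # production window

kl_actual_expected = kl_divergence(actual, expected)
kl_expected_actual = kl_divergence(expected, actual)
psi_value = psi(expected, actual)

print(f"KL(actual || expected) = {kl_actual_expected:.4f}")
print(f"KL(expected || actual) = {kl_expected_actual:.4f}")
print(f"PSI                    = {psi_value:.4f}")
```

The two KL values differ, which is the asymmetry described above, while PSI equals their sum and therefore gives the same answer regardless of which sample is treated as the baseline.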

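psi_and_ks.py (Python)

import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-6) -> float:
    """PSI between two vectors of bin proportions, floored at eps to avoid log(0)."""
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

def compute_drift_report(train: np.ndarray, prod: np.ndarray, bins: int = 10):
    # Take bin edges from the training data and reuse them for production,
    # so both histograms describe the same intervals.
    edges = np.histogram_bin_edges(train, bins=bins)
    train_bins = np.histogram(train, bins=edges)[0] / len(train)
    # Clip so production values outside the training range land in the end bins
    prod_bins = np.histogram(np.clip(prod, edges[0], edges[-1]), bins=edges)[0] / len(prod)
    psi_value = psi(train_bins, prod_bins)
    ks_stat, p_value = ks_2samp(train, prod)
    return {
        'psi': round(psi_value, 4),
        'ks_stat': round(float(ks_stat), 4),
        'ks_p_value': round(float(p_value), 6),
        'drift_detected': bool(psi_value > 0.25 or (p_value < 0.05 and ks_stat > 0.1))
    }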

Output

{'psi': 0.321, 'ks_stat': 0.184, 'ks_p_value': 0.0001, 'drift_detected': True}

🔥When PSI Breaks

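PSI is unreliable with sparse bins; a common rule of thumb is at least 5% expected proportion per bin. If a bin has zero expected or actual count, the log term is undefined (division by zero or log of zero). Always smooth bin proportions with a small epsilon (1e-6) before computing.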
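A minimal sketch of the smoothing step, assuming raw bin counts as input. The function name `psi_smoothed` and the counts are hypothetical; the last bin is deliberately empty in the baseline to show the case that would otherwise break the logarithm.

```python
import numpy as np

def psi_smoothed(expected_counts, actual_counts, eps: float = 1e-6) -> float:
    """PSI from raw bin counts, with proportions floored at eps to avoid log(0)."""
    expected = np.asarray(expected_counts, dtype=float)
    actual = np.asarray(actual_counts, dtype=float)
    expected_prop = np.clip(expected / expected.sum(), eps, None)
    actual_prop = np.clip(actual / actual.sum(), eps, None)
    return float(np.sum((actual_prop - expected_prop) * np.log(actual_prop / expected_prop)))

# Hypothetical counts: the last bin was never observed in training
expected_counts = [400, 350, 250, 0]
actual_counts = [380, 340, 250, 30]

value = psi_smoothed(expected_counts, actual_counts)
print(f"Smoothed PSI: {value:.4f}")
```

Note that the previously empty bin dominates the score, and the exact value depends on the chosen epsilon. Merging sparse bins before computing PSI is an alternative worth considering when that sensitivity matters.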

📊 Production Insight

PSI threshold of 0.25 works for most business features but not for tail-heavy distributions.

For credit risk models, even 0.1 PSI triggers action.

Always calibrate thresholds on your own production data — never blindly copy Kaggle values.

🎯 Key Takeaway

PSI for stability, KS for sensitivity, KL for direction.

Always smooth bins before PSI.

Test on a rolling 30-day window, not a single day.

Which Test to Use

If the feature is continuous (e.g., age, income) → use the KS test: it is more sensitive to location shifts.

If the feature is categorical or binned (e.g., region code, segment) → use PSI: it captures proportional shifts.

If you need to measure information loss directionally → use KL divergence: it is asymmetric, which helps flag unexpected distributions.

Building a Production Monitoring Pipeline

A robust monitoring pipeline has four layers:

Data collection: Log model inputs and outputs to a time-series store (e.g., Kafka + InfluxDB). Store at least 30 days of raw feature vectors and predictions.
Drift computation: Run scheduled jobs (e.g., Airflow DAG every 6 hours) that compute PSI, KS, and prediction drift for each feature vs. the training baseline. Store results in a separate metrics table.
Alerting: Tiered alerts: INFO (PSI 0.1–0.2), WARNING (0.2–0.3), CRITICAL (>0.3). Confirm drift over at least two consecutive windows before paging. Avoid single-day spikes that are just noise.
Retraining trigger: When drift exceeds threshold and is confirmed, automatically trigger a retraining job with the latest 30 days of production data. Validate on a recent holdout set before deploying.

This architecture separates detection from action — you can tune alerts without affecting retraining logic.
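The alerting layer described above can be sketched as two small functions: one maps a PSI value to a tier, the other only confirms a tier once it has been sustained. The function names and the handling of exact boundary values (0.2 counts as WARNING here) are assumptions for illustration; the tier ranges and the two-window confirmation come from the list above.

```python
from typing import List, Optional

def psi_severity(psi_value: float) -> Optional[str]:
    """Map a PSI value to an alert tier: INFO, WARNING, CRITICAL, or None."""
    if psi_value > 0.3:
        return "CRITICAL"
    if psi_value >= 0.2:
        return "WARNING"
    if psi_value >= 0.1:
        return "INFO"
    return None

SEVERITY_RANK = {None: 0, "INFO": 1, "WARNING": 2, "CRITICAL": 3}

def confirmed_severity(psi_history: List[float], windows: int = 2) -> Optional[str]:
    """Return the severity sustained over the last `windows` checks.

    The confirmed level is the lowest tier among the most recent windows,
    so a single spike cannot page anyone on its own.
    """
    if len(psi_history) < windows:
        return None
    recent = [psi_severity(v) for v in psi_history[-windows:]]
    return min(recent, key=lambda tier: SEVERITY_RANK[tier])

print(confirmed_severity([0.05, 0.35, 0.04]))  # one-window spike that already faded
print(confirmed_severity([0.05, 0.22, 0.34]))  # drift sustained over two windows
```

Keeping the tier mapping separate from the confirmation logic means thresholds can be tuned per feature without touching the paging rule.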

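monitoring_pipeline.py (Python)

from datetime import datetime, timedelta, timezone
import pandas as pd
# In-house helpers, not published packages. The original `io.thecodeforge...` path
# collides with Python's built-in `io` module, so the package is imported as `thecodeforge`.
from thecodeforge.monitoring.drift import compute_all_drift
from thecodeforge.alerts import evaluate_alert

# load_features, trigger_retraining_job and log_metrics are assumed to be
# defined elsewhere in the same codebase.

def run_monitoring_check():
    now = datetime.now(timezone.utc)  # timezone-aware; datetime.utcnow() is deprecated
    window_start = now - timedelta(days=30)
    # Load production features from the last 30 days
    prod_data = load_features(start_time=window_start, end_time=now)
    # Load training baseline (stored as parquet)
    train = pd.read_parquet('s3://model-baselines/latest/train_features.parquet')
    # Compute drift for each feature
    drift_results = compute_all_drift(train, prod_data)
    # Evaluate alert severity (evaluate_alert is expected to apply the confirmation window)
    alert = evaluate_alert(drift_results)
    if alert.severity in ['WARNING', 'CRITICAL']:
        trigger_retraining_job(reason=alert.summary)
    log_metrics(drift_results, alert)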

Output

2026-04-15 03:00:00 INFO: Drift check completed. 2 features PSI > 0.1. Alert severity: WARNING. Retraining triggered.

⚠ Don't Over-Alert

Single-day spikes in PSI are common after a data glitch or A/B test. Always confirm drift over at least two consecutive windows before firing Slack messages. Your team will thank you.

📊 Production Insight

Running drift checks every hour on 100 features costs ~$2/day in compute.

But missing a single drift that causes a 10% revenue drop costs $50k+/week.

Invest in the pipeline early — pay the compute cost, not the trust cost.

🎯 Key Takeaway

Separate detection from action.

Use tiered alerting with confirmation windows.

Automate retraining triggers but always require validation.


Common Pitfalls in Drift Detection

Even seasoned MLOps teams make these mistakes:

Testing drift on the wrong baseline: Always compare against the training data distribution, not a previous production snapshot. Production distributions shift gradually — if you compare against last month, you'll miss long-term drift.
Ignoring feature interactions: Drift in one feature may be harmless when another feature compensates. For example, if 'age' drifts up but 'income' drifts up proportionally, the model may still work. Single-feature drift tests alone can cause false alarms.
Using only p-values: A tiny p-value with a tiny KS statistic (e.g., 0.02) may be statistically significant but practically irrelevant. Always check effect size alongside p-value.
Not handling missing data: If missing production values are silently filled with 0, the feature's distribution collapses to a spike at 0, which looks like extreme drift. Handle missing values explicitly before computing tests.
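The p-value pitfall in the list above is easy to reproduce with synthetic data. In this sketch two large normal samples differ by a small mean shift (0.05 standard deviations, a value chosen only for illustration); the helper `is_meaningful_drift` and its `min_effect` cutoff of 0.1 are assumptions that mirror the thresholds used earlier in this article.

```python
import numpy as np
from scipy.stats import ks_2samp

def is_meaningful_drift(statistic: float, p_value: float,
                        alpha: float = 0.05, min_effect: float = 0.1) -> bool:
    """Require both statistical significance and a minimum KS effect size."""
    return bool(p_value < alpha and statistic > min_effect)

rng = np.random.default_rng(42)

# Two very large samples whose means differ by a small amount
reference = rng.normal(loc=0.0, scale=1.0, size=200_000)
production = rng.normal(loc=0.05, scale=1.0, size=200_000)

result = ks_2samp(reference, production)
significant = bool(result.pvalue < 0.05)
meaningful = is_meaningful_drift(result.statistic, result.pvalue)

print(f"KS statistic: {result.statistic:.4f}")
print(f"Statistically significant: {significant}")
print(f"Passes effect-size check:  {meaningful}")
```

With samples this large, almost any difference is statistically significant, so the effect-size check is what keeps the alert from firing on a shift too small to matter.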

Mental Model

The Gardener Analogy

Monitoring a model is like tending a garden: you check the soil (input distributions) and the plants (outputs) regularly, but you don't yank every weed the moment it appears.

PSI threshold = how much weed you tolerate before acting
Confirmation window = wait a week before pulling
Feature interaction = some weeds help the soil
Missing data = a patch of bare dirt — fix the irrigation, don't just spray herbicide

📊 Production Insight

A team spent 3 months chasing 'drift' that was actually an ETL bug dropping NaN values.

Always validate your monitoring pipeline against known-good data first.

Drift detection is only as good as your data quality.

🎯 Key Takeaway

Test against training baseline, not previous production.

Consider feature interactions; use effect size not just p-value.

Handle missing data before computing drift.
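One way to handle missing data before computing drift is to report completeness and distribution drift as two separate signals. The sketch below is a minimal illustration on synthetic data; the function name `drift_with_completeness` and the 5% missing-rate limit are assumptions, not values from a real pipeline.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_with_completeness(reference, production, max_missing_rate: float = 0.05) -> dict:
    """Report the missing-value rate and KS drift separately for one feature."""
    production = np.asarray(production, dtype=float)
    missing_rate = float(np.isnan(production).mean())
    observed = production[~np.isnan(production)]  # test only the values that arrived
    result = ks_2samp(reference, observed)
    return {
        "missing_rate": missing_rate,
        "completeness_alert": missing_rate > max_missing_rate,
        "ks_stat": float(result.statistic),
        "ks_p_value": float(result.pvalue),
    }

rng = np.random.default_rng(3)
reference = rng.normal(50.0, 10.0, size=5_000)
production = rng.normal(50.0, 10.0, size=5_000)
production[:1_500] = np.nan  # simulate a pipeline bug dropping 30% of values

report = drift_with_completeness(reference, production)
# What happens if the gaps are silently filled with 0 instead
naive = ks_2samp(reference, np.nan_to_num(production, nan=0.0))

print(f"Missing rate: {report['missing_rate']:.0%}, alert: {report['completeness_alert']}")
print(f"KS on observed values: {report['ks_stat']:.3f}")
print(f"KS after zero-filling: {naive.statistic:.3f}")
```

The zero-filled comparison shows a large KS statistic even though the underlying distribution never changed, while the split report points directly at the completeness problem.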

Advanced: Multivariate Drift Detection and A/B Testing Integration

Single-feature tests scale linearly but miss interactions. For high-dimensional models (e.g., embeddings, tabular with 100+ features), use:

Maximum Mean Discrepancy (MMD): A kernel-based test that compares two high-dimensional distributions. More powerful than per-feature tests but computationally expensive.
Drift Detection on Model Embeddings: If your model has a latent layer (e.g., 64-dim), run drift tests on the embedding distribution (per-dimension PSI, or MMD on the full vectors). This catches joint shifts that single features miss.
A/B Test Validation: When you deploy a new model version, run both models in shadow mode for a week. Compute drift between the candidate's predictions and the champion's. Treat an explained, acceptably small divergence between the two prediction distributions as a prerequisite for go-live.

In production, combine single-feature tests for explainability with multivariate tests for sensitivity. The multivariate test tells you that something changed; the per-feature tests tell you where to look.

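multivariate_drift.py (Python)

import numpy as np
# mmd_test is an in-house helper, not a published package. The original
# `io.thecodeforge...` path collides with Python's built-in `io` module,
# so the package is imported as `thecodeforge`.
from thecodeforge.monitoring.mmd import mmd_test

def detect_embedding_drift(
    train_embeddings: np.ndarray,
    prod_embeddings: np.ndarray,
    kernel: str = 'rbf',
    threshold: float = 0.05
) -> bool:
    """Detect drift in high-dimensional embeddings using MMD."""
    stat, p_value = mmd_test(train_embeddings, prod_embeddings, kernel=kernel)
    verdict = 'Drift detected.' if p_value < threshold else 'No drift.'
    print(f"MMD statistic: {stat:.3f}, p-value: {p_value:.3f} -> {verdict}")
    return p_value < threshold  # significant drift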

Output

MMD statistic: 0.243, p-value: 0.001 -> Drift detected.
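For readers without an MMD helper at hand, here is a minimal sketch of what such a test might look like: a biased MMD² estimate with a permutation p-value, built on scikit-learn's `pairwise_kernels`. The function name `mmd_permutation_test`, the permutation count, and the synthetic embeddings are assumptions for illustration; this is not the in-house `mmd_test` used above.

```python
import numpy as np
from sklearn.metrics import pairwise_kernels

def mmd_permutation_test(x, y, kernel="rbf", n_permutations=200, seed=0):
    """Return (biased MMD^2 estimate, permutation p-value) for two samples."""
    pooled = np.vstack([x, y])
    gram = pairwise_kernels(pooled, metric=kernel)  # computed once, reused per permutation
    n = len(x)

    def mmd2(order):
        a, b = order[:n], order[n:]
        return (gram[np.ix_(a, a)].mean()
                + gram[np.ix_(b, b)].mean()
                - 2.0 * gram[np.ix_(a, b)].mean())

    rng = np.random.default_rng(seed)
    indices = np.arange(len(pooled))
    observed = mmd2(indices)
    # Null distribution: shuffle which rows belong to which sample
    null = np.array([mmd2(rng.permutation(indices)) for _ in range(n_permutations)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return float(observed), float(p_value)

data_rng = np.random.default_rng(1)
train_embeddings = data_rng.normal(0.0, 1.0, size=(100, 8))
prod_embeddings = data_rng.normal(1.0, 1.0, size=(100, 8))  # every dimension shifted

stat, p_value = mmd_permutation_test(train_embeddings, prod_embeddings)
print(f"MMD^2: {stat:.3f}, p-value: {p_value:.3f}")
```

The kernel matrix grows quadratically with the pooled sample size, which is why subsampling is the usual way to keep this affordable on production volumes.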

💡When to Use MMD

MMD is powerful but slow for 10M+ samples. Use per-feature PSI for daily checks, and run MMD on a 10% sample weekly for high-sensitivity areas like fraud or recommendation.

📊 Production Insight

A major e-commerce team used MMD on user embedding vectors and detected a drift no single feature caught: users from a new region had different browse-add-to-cart patterns.

Single-feature tests showed no drift in 'time_on_site' or 'cart_size'.

Multivariate tests caught the interaction.

🎯 Key Takeaway

Multivariate drift catches interactions that single-feature tests miss.

Use embedding drift for deep models, MMD for tabular.

Combine both for a complete picture.

Why You Monitor for Data Drift Before Concept Drift (And What Happens When You Don't)

New engineers always ask me: "Should I track data drift or concept drift first?" The answer is data drift, every time. Here's the cold logic: data drift breaks your input pipeline silently. Concept drift breaks your predictions. If you catch data drift first, you can alert before your model serves garbage. If you chase concept drift without monitoring data, you'll waste weeks debugging model architecture when the real culprit is a corrupted feature source.

I've seen teams deploy sophisticated concept drift detectors, only to discover their data pipeline had been feeding NaN-filled parquets for three days. The model wasn't drifting — it was starving. Data drift detection acts as the canary. It tells you when the world changed in ways your training distribution never saw. Only after confirming your inputs are valid should you look for changes in the relationship between features and labels.

The practical reality: deploy data drift monitors on every upstream feature. Use KS tests for continuous features, chi-square for categorical. Set alerting thresholds at p < 0.01 rather than 0.05: with dozens of features tested on every run, the stricter cutoff keeps false alarms manageable, and large production samples still give the test plenty of power to flag real shifts. When the alarm fires, check the pipeline before you touch the model.

DataDriftFirst.py (Python)

# io.thecodeforge: ml-ai tutorial
# Practical data drift monitor with alerting
import numpy as np
from scipy.stats import ks_2samp
import logging

logging.basicConfig(level=logging.WARNING)

def monitor_data_drift(reference_sample, production_sample, feature_name, alpha=0.01):
    """KS test for continuous features. Logs alert if drift detected."""
    statistic, p_value = ks_2samp(reference_sample, production_sample)
    
    if p_value < alpha:
        logging.warning(
            f"DRIFT DETECTED on feature '{feature_name}' | "
            f"KS stat={statistic:.4f}, p-value={p_value:.6f}"
        )
        return True
    return False

# Example: monitoring 'transaction_amount' in a fraud model
historical_amounts = np.random.exponential(scale=100, size=10000)
current_batch = np.random.exponential(scale=150, size=1000)  # drift introduced

if monitor_data_drift(historical_amounts, current_batch, "transaction_amount"):
    print("Pipeline check triggered: inspect upstream source")

Output

WARNING:root:DRIFT DETECTED on feature 'transaction_amount' | KS stat=0.1523, p-value=0.000000

Pipeline check triggered: inspect upstream source
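The listing above covers continuous features. For categorical features, a chi-square test of homogeneity on per-category counts is one option. This is a sketch under stated assumptions: the function name `categorical_drift` and the merchant-category counts are made up for illustration, and both inputs are assumed to list the same categories in the same order.

```python
import numpy as np
from scipy.stats import chi2_contingency

def categorical_drift(reference_counts, production_counts, alpha: float = 0.01):
    """Chi-square test of homogeneity on per-category counts.

    Returns (drift_flag, chi-square statistic, p-value).
    """
    table = np.array([reference_counts, production_counts])
    # Drop categories absent from both samples; they would give zero expected counts
    table = table[:, table.sum(axis=0) > 0]
    statistic, p_value, _dof, _expected = chi2_contingency(table)
    return bool(p_value < alpha), float(statistic), float(p_value)

# Hypothetical merchant-category counts: [grocery, travel, electronics]
reference_counts = [5000, 3000, 2000]
production_counts = [400, 300, 300]

drifted, statistic, p_value = categorical_drift(reference_counts, production_counts)
print(f"Drift: {drifted}, chi2={statistic:.2f}, p-value={p_value:.6f}")
```

As with the KS test, very large samples make small proportional changes significant, so it is worth pairing the p-value with a look at how far the category shares actually moved.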

⚠ Production Trap:

Don't set alpha to 0.05 just because textbooks do. A lower alpha makes the test stricter, not more sensitive: 0.01 raises fewer false alarms across many features, and with production-sized samples it still catches subtle drift that compounds over time.

🎯 Key Takeaway

Monitor data drift before concept drift — your input pipeline is the weakest link.

The Hidden Cost of Retraining On Drifted Data: Feedback Loops That Destroy Your Model

Here's the trap nobody talks about. You detect drift, you retrain your model on the new production data, and you deploy. Congratulations — you just locked in the drift as the new normal. If the drift was temporary (a holiday spike, a bot attack, a data pipeline glitch), you've now poisoned your model with garbage.

I consulted for a fintech startup that retrained their credit risk model every time they saw drift in application volumes. Three months later, the model started rejecting good applicants. Why? A promotional campaign caused a temporary spike in high-risk applications. The team retrained on that data, and the model learned to associate higher volume with higher risk. When the campaign ended, legitimate applicants got flagged. They spent two quarters unwinding that feedback loop.

The fix: never retrain blindly on drifted data. First, classify the drift. Is it temporary (seasonal, campaign-driven) or permanent (regulatory change, new user segment)? Use a drift classification model or heuristic rules. If temporary, keep the old model and suppress alerts. If permanent, retrain with a warm-start from the last stable checkpoint, then validate against a holdout set that spans before and after the drift onset. The holdout tells you if the retrain actually improved generalization or just memorized the noise.

FeedbackLoopGuard.py (Python)

# io.thecodeforge: ml-ai tutorial
# Detect and classify drift before retraining
import numpy as np

def classify_drift(feature_values, drift_start_index, lookback_days=7):
    """Heuristic: if the most recent values are back in the pre-drift range, the drift was temporary.

    Assumes one value per day, ordered oldest to newest.
    """
    pre_drift = feature_values[:drift_start_index]
    recent = feature_values[-lookback_days:]

    pre_level = np.median(pre_drift)
    recent_level = np.median(recent)  # median is not skewed by the tail of the spike

    # Back within 10% of the pre-drift level counts as recovered
    if abs(recent_level - pre_level) < 0.1 * abs(pre_level):
        return "temporary"
    return "permanent"

# Example usage: 20 stable days, a 5-day spike, then 5 days of recovery
values = [100] * 20 + [200] * 5 + [100] * 5

drift_type = classify_drift(values, 20)
print(f"Drift type: {drift_type}")  # temporary

Output

Drift type: temporary

💡Senior Shortcut:

Maintain a 'stable baseline' snapshot of your training data. When drift is detected, compare production data to this baseline, not to the latest retrain set. This prevents your model from drifting along with the data.

🎯 Key Takeaway

Never retrain on drifted data without classifying the drift first — or you'll bake temporary anomalies into permanent model degradation.

● Production incident · POST-MORTEM · severity: high

The Silent Churn: How a Fraud Model Lost $2M Before Anyone Noticed

Symptom

Fraud detection accuracy dropped from 92% to 67% over three months. False positive rate tripled. No error logs, no downtime.

Assumption

The team assumed the model would maintain its performance because retraining happened monthly. They only monitored binary accuracy on a static holdout set.

Root cause

Covariate drift: the distribution of transaction amounts, merchant categories, and geolocation features shifted as the company expanded into a new market. The holdout set was never refreshed. Concept drift also occurred because fraudsters adapted to the model's patterns.

Fix

Implemented a monitoring pipeline that tracks per-feature PSI and monthly KS tests on production data. Added a dashboard with trend lines over 30-day rolling windows. Set up alerts triggered at PSI > 0.2 with a 7-day confirmation window to filter out noise.

Key lesson

Monitor data distributions, not just accuracy — accuracy can stay high while the model misses critical segments.
Refresh holdout sets quarterly with current production data.
Combine covariate and concept drift detection: use PSI for inputs, and compare prediction distributions as a proxy for concept drift until delayed labels arrive.
Always confirm drift alerts over multiple windows before paging anyone.
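The prediction-distribution comparison mentioned in the lessons above can be sketched as PSI over model scores. Because predicted probabilities live on [0, 1], fixed bin edges can be used instead of edges derived from data. The function name `prediction_psi` and the beta-distributed synthetic scores are assumptions chosen only to illustrate a stable and a shifted case.

```python
import numpy as np

def prediction_psi(baseline_scores, current_scores, bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between two sets of predicted probabilities, using fixed bins on [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    baseline = np.histogram(baseline_scores, bins=edges)[0] / len(baseline_scores)
    current = np.histogram(current_scores, bins=edges)[0] / len(current_scores)
    baseline = np.clip(baseline, eps, None)  # smooth empty bins
    current = np.clip(current, eps, None)
    return float(np.sum((current - baseline) * np.log(current / baseline)))

rng = np.random.default_rng(7)
baseline_scores = rng.beta(2, 8, size=20_000)  # mostly low fraud scores
stable_scores = rng.beta(2, 8, size=20_000)    # same distribution, new sample
shifted_scores = rng.beta(4, 6, size=20_000)   # scores pushed upward

stable_psi = prediction_psi(baseline_scores, stable_scores)
shifted_psi = prediction_psi(baseline_scores, shifted_scores)
print(f"PSI, stable window:  {stable_psi:.4f}")
print(f"PSI, shifted window: {shifted_psi:.4f}")
```

This check needs no labels, which makes it usable on the same schedule as the per-feature tests.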

Production debug guide

Trace the root cause when your model's predictions start degrading.

Symptom · 01

Model accuracy dropped but no feature changes

→

Fix

Check covariate drift: run PSI on each feature between training and production data for the last 7 days.

Symptom · 02

Prediction distribution shifted but feature stats look normal

→

Fix

Run concept drift detection: compare prediction vs actual label distributions using KS test.

Symptom · 03

Drift alerts firing every day

→

Fix

Increase the monitoring window from 1 day to 7 days and apply a severity threshold (e.g., PSI > 0.3). Check if the alert is driven by low-volume segments.

Symptom · 04

Model performs well on the holdout set but fails on new production data

→

Fix

Verify train/test split recency — if your training data is older than 3 months, consider retraining with more recent samples.

★ Drift Detection Quick Reference

Commands and actions for common drift scenarios.

Need to calculate PSI on a feature

Immediate action

Compute expected and actual distribution bins

Commands

python -c "import numpy as np, pandas as pd; e = pd.read_csv('train_features.csv')['amount']; a = pd.read_csv('production_features.csv')['amount']; edges = np.histogram_bin_edges(e, bins=10); p = np.clip(np.histogram(e, bins=edges)[0] / len(e), 1e-6, None); q = np.clip(np.histogram(a.clip(edges[0], edges[-1]), bins=edges)[0] / len(a), 1e-6, None); print(float(np.sum((q - p) * np.log(q / p))))"

Fix now

If PSI > 0.25 persists across consecutive windows, rule out pipeline bugs and temporary spikes first, then retrain the model with recent data and schedule a refresh of the monitoring baseline.


Drift Detection Methods

Method	Best For	Sensitivity	Compute Cost	Interpretability
PSI	Categorical / binned features	Moderate (proportional shifts)	Low	High (bins-based)
KL Divergence	Directional change detection	High (asymmetric)	Low	Moderate
KS Test	Continuous features	High (location shifts)	Low	High (max diff point)
MMD	High-dimensional / embeddings	Very High (joint shifts)	High (kernel matrix)	Low (black-box)


Key takeaways

Model monitoring is not optional: real data always drifts.

Detect drift using PSI for categorical, KS for continuous, KL for directional shifts.

Use a rolling 30-day window with confirmation before alerting to avoid false positives.

Combine single-feature and multivariate tests for complete coverage.

Automate retraining triggers but always validate on recent holdout data.

Blindly trusting 'accuracy' will hide silent failures: monitor distributions.

Common mistakes to avoid


Using only accuracy as a monitoring metric

Symptom

Model accuracy remains high while recall on the minority class collapses because the model stops predicting certain classes (e.g., it never predicts fraud anymore). On imbalanced data, accuracy hides this kind of shift.

Fix

Monitor per-class metrics (precision, recall, F1) AND data distributions. Use PSI on prediction probabilities to catch silent failures.

Computing drift on a single day's data

Symptom

Daily drift alerts that are actually random noise. The ops team disables all alerts, and real drift goes undetected.

Fix

Always use a rolling window of at least 7 days (preferably 30) for drift computation. Confirm drift over 2+ consecutive windows before triggering an alert.

Ignoring missing values in production data

Symptom

A feature that is missing 30% of values (e.g., due to a pipeline bug) shows a distribution spike at 0. PSI jumps to 0.8, triggering a false alarm.

Fix

Handle missing values explicitly: impute with training median or flag as a separate category. Monitor feature completeness separately from distribution drift.

Not refreshing the baseline after retraining

Symptom

After retraining on new data, the drift detection still compares against the original 2023 baseline. Every feature shows drift because the model's world has already moved.

Fix

After each retraining, compute a new baseline from the training data used in that retraining. Store the baseline version alongside the model version.
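The first pattern above, accuracy hiding a collapsed class, takes only a few lines to demonstrate. The labels below are hypothetical: 5% of cases are fraud, and the degraded model predicts "not fraud" for everything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 5% fraud, and a degraded model that never predicts fraud
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when the model makes no positive predictions
fraud_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
fraud_precision = precision_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"Accuracy:        {accuracy:.2f}")
print(f"Fraud recall:    {fraud_recall:.2f}")
print(f"Fraud precision: {fraud_precision:.2f}")
```

Accuracy stays at 0.95 while fraud recall is 0.00, which is why per-class metrics belong on the dashboard next to the distribution checks.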

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 of 03 · SENIOR

Explain the difference between covariate drift and concept drift. How would you detect each in a production ML system?

ANSWER

Covariate drift means the distribution of input features has changed (e.g., user age distribution shifted from 25–35 to 35–45). Concept drift means the relationship between features and the target has changed (e.g., what constituted 'fraud' in 2023 is different now).

To detect covariate drift: use PSI or KS test comparing current feature distributions against the training data baseline. For concept drift: compare prediction distribution vs actual labels using KS test on prediction residuals, or monitor prediction drift over time (if you have ground truth with delay).

In practice, run both tests in parallel. If only covariate drifts, you may need to retrain with recent data. If concept drifts, you need to re-engineer features or reconsider the business logic.

Q02 · SENIOR

What is PSI and how do you interpret its value? When would you choose KS...

Q03 · SENIOR

Design a monitoring system for a credit scoring model that serves 100k p...

