Mid-level 8 min · March 06, 2026

ML Workflow — When a Churn Model Stops Predicting

A churn model showed 94% accuracy while missing triple the new churners due to concept drift from a new pricing tier.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • The ML workflow is a repeatable pipeline: collect data, clean it, engineer features, train a model, evaluate, deploy, and monitor
  • Data quality dominates model quality — 80% of production ML failures trace back to bad data, not bad algorithms
  • Always split data into train/validation/test BEFORE fitting any imputer, scaler, or encoder to prevent data leakage. This is the most commonly violated rule in beginner ML projects
  • Evaluation must use a holdout set the model never saw during training — otherwise your metrics are fiction and your production performance will be worse than you think
  • Deployment is not the finish line — models degrade over time as real-world data drifts from training data, and they do it silently
  • The biggest production mistake: skipping the monitoring loop and assuming the model stays accurate forever
Plain-English First

Imagine you want to teach a friend to recognise spam emails. First you show them hundreds of examples — some spam, some not. They spot patterns: 'spam always mentions free money', 'real emails come from addresses I recognise', 'spam has five exclamation marks in the subject line'. Then you test them on emails they've never seen before. If they pass, you let them sort your inbox. Six months later you check: are they still catching spam? Or has spam evolved in ways they haven't seen yet? That entire process — collecting examples, finding patterns, testing, putting your friend to work, and checking they're still performing — IS the machine learning workflow. The computer is just the friend, and the model is what it learned. The part most tutorials skip is that last check. Your friend won't tell you when they stop being good at the job. You have to ask.

Every time Netflix recommends a show you actually want to watch, or your phone unlocks with your face, or Gmail catches a phishing email before you open it — machine learning is running quietly in the background. None of that happens by accident. Behind every useful prediction is a structured, repeatable process that engineers follow from raw data to production system. That process is the ML workflow, and understanding it is the single most important mental model you can build before writing a single line of ML code.

The problem most beginners run into is that they jump straight into code — loading a dataset, calling .fit() — without understanding why each step exists. That's like baking a cake by throwing ingredients in a bowl in whatever order feels right. You need to know why you preheat the oven, why you cream the butter before adding flour, and why you don't open the oven door mid-bake. The ML workflow is that recipe. Skip a step and your model either fails to learn properly, learns the wrong patterns entirely, or works perfectly on your laptop and falls apart the moment real users touch it.

This guide won't just show you the commands. It'll show you why each stage exists, what breaks when you skip it, and what the failure looks like in production so you can recognise it before it costs someone money. We'll build a complete example — predicting whether a bank customer will leave — so every concept is grounded in something concrete rather than abstract theory.

By the end you'll be able to describe every stage of the ML workflow in plain English, explain why each stage exists, write working Python code that walks through each stage end to end, and talk confidently about real production ML when an interviewer asks.

Stage 1 — Data Collection and Understanding

The ML workflow starts before any code runs, and it starts with a question most beginners skip: do I actually have the data I need to solve this problem? Data collection is the foundation, and it is where the majority of production ML failures originate — not in the model, not in the training loop, but in the data itself.

Data understanding means profiling your dataset systematically: checking distributions, identifying missing values, spotting class imbalances, looking for obvious quality issues, and understanding where the data came from and how it was collected. A column where 80% of values are null is rarely a usable feature; it is more likely to be noise that misleads your model. A target variable where 99% of records belong to one class is not a classification problem in the traditional sense. It is often better framed as anomaly detection, and treating it as a standard classification task tends to produce a model that scores high accuracy while being useless in practice.

The critical mistake beginners make is treating data as a given — something that arrives clean and complete. In production, data is messy, incomplete, inconsistently formatted, and changes without notice. A date column formatted as YYYY-MM-DD in January may become MM/DD/YYYY in March after someone updated a data export script. Your pipeline must handle schema evolution, missing fields, unexpected enum values, and data type changes gracefully — or your model will fail silently with garbage inputs while returning predictions that look plausible.

Data understanding also means understanding representativeness: was this data collected in a way that reflects the population you'll predict against? Training on US customer data and deploying on Indian customers will produce a model that is confidently wrong. The question is not just what is in the data, but what is missing from it and whether what's missing matters.
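One way to put a number on representativeness is to compare a feature's distribution in the training data against the population you plan to score. The sketch below uses the population stability index (PSI) on synthetic ages. The function name `population_stability_index`, the bin count, and the synthetic samples are illustrative assumptions, not part of the bank dataset used in this guide.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample (expected) and a new sample (actual)."""
    # Bin edges come from quantiles of the reference sample only
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Clip the new sample into the reference range so outliers land in the end bins
    actual = np.clip(actual, edges[0], edges[-1])
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)
    # A small floor avoids log(0) when a bin is empty
    exp_pct = np.clip(exp_counts / len(expected), 1e-6, None)
    act_pct = np.clip(act_counts / len(actual), 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(42)
train_age = rng.normal(39, 10, 5000)               # synthetic training population
same_population_age = rng.normal(39, 10, 5000)     # drawn from the same distribution
younger_population_age = rng.normal(29, 8, 5000)   # a different, younger population

psi_same = population_stability_index(train_age, same_population_age)
psi_shifted = population_stability_index(train_age, younger_population_age)
print(f'PSI, same population:    {psi_same:.4f}')
print(f'PSI, shifted population: {psi_shifted:.4f}')
```

A PSI near zero suggests the two samples look alike on that feature; a large value is a prompt to ask whether the training data still represents the people you are predicting for. Any cut-off you pick for "large" is a convention to validate on your own data rather than a guarantee.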

data_collection_and_understanding.py (Python)
import pandas as pd
import numpy as np

# Load the dataset
# In production this comes from a database query, S3 bucket, or API call
# For this guide: a bank customer churn dataset with 10,000 records
df = pd.read_csv('bank_customers.csv')

# ─────────────────────────────────────────
# STEP 1: Basic shape and data types
# Know what you're working with before anything else
# ─────────────────────────────────────────
print(f'Dataset shape: {df.shape}')  # (rows, columns)
print(f'Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB')
print(f'\nColumn types:\n{df.dtypes}')

# ─────────────────────────────────────────
# STEP 2: Missing value audit
# Always do this BEFORE assuming data is clean — it never is
# ─────────────────────────────────────────
missing       = df.isnull().sum()
missing_pct   = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({'missing_count': missing, 'missing_pct': missing_pct})
missing_report = missing_report[missing_report.missing_count > 0].sort_values('missing_pct', ascending=False)
print(f'\nMissing values:\n{missing_report}')

# Columns with >30% missing usually can't be reliably imputed — flag them
high_missing = missing_report[missing_report.missing_pct > 30].index.tolist()
if high_missing:
    print(f'WARNING: High missing rate in: {high_missing} — consider dropping or flagging')

# ─────────────────────────────────────────
# STEP 3: Target variable distribution
# If imbalanced, accuracy alone will be a misleading metric
# ─────────────────────────────────────────
print(f'\nTarget distribution (churn):')
print(df['churn'].value_counts(normalize=True).round(3))
# Example output: 0 (stayed) = 0.84, 1 (churned) = 0.16
# A model that ALWAYS predicts 'stayed' gets 84% accuracy
# but catches exactly zero churners — useless for the business
imbalance_ratio = df['churn'].value_counts().max() / df['churn'].value_counts().min()
if imbalance_ratio > 5:
    print(f'WARNING: Imbalance ratio {imbalance_ratio:.1f}:1 — accuracy will be misleading. Use AUC or F1.')

# ─────────────────────────────────────────
# STEP 4: Numeric feature distributions
# Look for outliers, skew, and values that break domain rules
# ─────────────────────────────────────────
print(f'\nNumeric summary:')
desc = df.describe().round(2)
print(desc)

# Flag features with extreme skew — log transformation often helps
skewness = df.select_dtypes(include=np.number).skew().round(2)
high_skew = skewness[abs(skewness) > 2]
if not high_skew.empty:
    print(f'\nHighly skewed features (|skew| > 2): {high_skew.to_dict()}')
    print('Consider log or Box-Cox transformation')

# ─────────────────────────────────────────
# STEP 5: Categorical feature cardinality
# High cardinality (>50 unique values) needs different encoding strategies
# ─────────────────────────────────────────
cat_cols = df.select_dtypes(include='object').columns
print('\nCategorical feature cardinality:')
for col in cat_cols:
    unique_count = df[col].nunique()
    print(f'  {col}: {unique_count} unique values')
    if unique_count > 50:
        print(f'    WARNING: High cardinality — consider target encoding or grouping rare values')
    elif unique_count == 1:
        print(f'    WARNING: Constant column — carries no information, drop it')

# ─────────────────────────────────────────
# STEP 6: Duplicate records check
# Exact duplicates in training data can inflate evaluation metrics
# ─────────────────────────────────────────
dup_count = df.duplicated().sum()
print(f'\nDuplicate rows: {dup_count} ({dup_count/len(df)*100:.2f}%)')
if dup_count > 0:
    print('Remove duplicates before splitting: df = df.drop_duplicates()')
Output
Dataset shape: (10000, 14)
Memory usage: 1.1 MB
Column types:
customer_id int64
credit_score int64
country object
gender object
age int64
tenure float64
balance float64
...
Missing values:
missing_count missing_pct
tenure 90 0.90
Target distribution (churn):
0 0.838
1 0.162
WARNING: Imbalance ratio 5.2:1 — accuracy will be misleading. Use AUC or F1.
Numeric summary:
credit_score age tenure balance ...
count 10000.0 10000 9910.0 10000.0
mean 650.5 38.9 5.0 76485.9
...
Highly skewed features (|skew| > 2): {'balance': 2.31}
Consider log or Box-Cox transformation
Categorical feature cardinality:
country: 3 unique values
gender: 2 unique values
Duplicate rows: 0 (0.00%)
Data Quality Mental Model
  • Garbage in, garbage out is not a cliché — it is the documented root cause of the majority of ML production failures
  • A model trained on US customer data will fail on Indian customer data without retraining — representativeness matters as much as cleanliness
  • Missing values are not random noise — the reason data is missing often carries signal itself (missing-not-at-random), and ignoring that loses information
  • Class imbalance means accuracy is a lie — a model that always predicts the majority class can score 84% accuracy while being completely useless
  • Data understanding takes 60-80% of total project time, and that is exactly where it should go — a well-understood dataset with a simple model beats a poorly understood dataset with a complex model every time
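The "accuracy is a lie" point can be checked in a few lines. This is a minimal sketch using synthetic labels that mirror the 83.8% / 16.2% split from the example output above; the feature matrix is a placeholder because the majority-class classifier ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 838 customers stayed (0), 162 churned (1)
y = np.array([0] * 838 + [1] * 162)
X = np.zeros((len(y), 1))  # placeholder features, unused by this classifier

# A "model" that always predicts the majority class
always_stayed = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = always_stayed.predict(X)

accuracy = accuracy_score(y, y_pred)
churn_recall = recall_score(y, y_pred)  # recall on class 1 (churned)
print(f'Accuracy:     {accuracy:.3f}')
print(f'Churn recall: {churn_recall:.3f}')
```

The accuracy looks respectable while recall on churners is zero, which is why the profiling script flags the imbalance and points to AUC or F1 instead.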
Production Insight
In production, data sources change schema without notice — a column renamed downstream, a new enum value added to a categorical, a date format changed when a third-party vendor updates their API.
If your preprocessing pipeline assumes fixed columns or fixed value ranges, it either crashes on the new shape or, worse, keeps running and produces garbage features that look valid to the serving layer.
Rule: validate schema at ingestion time using a schema registry or explicit column type checks — reject or quarantine records that don't match expected shapes rather than letting them corrupt your model's inputs.
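A minimal sketch of the explicit column type checks mentioned in the rule, assuming a small hypothetical schema. The names `EXPECTED_COLUMNS`, `ALLOWED_COUNTRIES`, `validate_batch`, and the age range of 18 to 120 are made up for illustration; a real pipeline would load these rules from a schema registry or a config file.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical schema: column name -> dtype check (assumed for this sketch)
EXPECTED_COLUMNS = {
    'customer_id': ptypes.is_integer_dtype,
    'country': lambda s: ptypes.is_string_dtype(s) or ptypes.is_object_dtype(s),
    'age': ptypes.is_numeric_dtype,
    'balance': ptypes.is_numeric_dtype,
}
ALLOWED_COUNTRIES = {'France', 'Germany', 'Spain'}  # assumed enum values

def validate_batch(batch):
    """Return (clean_rows, quarantined_rows); raise if the batch shape is wrong."""
    missing = set(EXPECTED_COLUMNS) - set(batch.columns)
    if missing:
        raise ValueError(f'Missing columns: {sorted(missing)}')
    for col, check in EXPECTED_COLUMNS.items():
        if not check(batch[col]):
            raise TypeError(f'Column {col!r} has unexpected dtype {batch[col].dtype}')
    # Row-level rules: quarantine bad rows instead of passing them to the model
    bad = ~batch['country'].isin(ALLOWED_COUNTRIES) | ~batch['age'].between(18, 120)
    return batch[~bad], batch[bad]

batch = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'country': ['France', 'Spain', 'Italy', 'Germany'],  # 'Italy' is an unseen enum value
    'age': [34, 51, 29, 430],                            # 430 breaks the domain rule
    'balance': [0.0, 1200.5, 88.0, 560.0],
})
clean, quarantined = validate_batch(batch)
print(f'Clean rows: {len(clean)} | Quarantined rows: {len(quarantined)}')
```

Structural problems (a missing column, a changed dtype) stop the batch outright, while row-level violations are set aside for inspection. Which failures should block and which should quarantine is a design choice that depends on how the downstream system tolerates gaps.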
Key Takeaway
Data quality dominates model quality — 80% of production ML failures trace back to bad data, not bad algorithms.
Profile your dataset systematically before touching a model: missing values, class balance, feature distributions, cardinality, and duplicates.
The model inherits every bias and gap in your training data — there is no algorithm magic that fixes upstream data problems.

Stage 2 — Data Preprocessing and Feature Engineering

Raw data is never model-ready. Preprocessing transforms messy, real-world data into clean numeric arrays that algorithms can consume. This includes handling missing values, encoding categorical variables, scaling numeric features, creating new features from existing ones, and crucially — splitting the data into training and test sets at the right moment.

Feature engineering is where domain knowledge becomes model performance. A raw transaction timestamp is useless to most algorithms. But 'days since last transaction' or 'number of transactions in the last 30 days' or 'average transaction value versus account average' can be among the strongest predictors in your model. The best ML practitioners are not the ones who know the most algorithms — they are the ones who extract the most signal from raw data by understanding what the numbers actually represent.
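To make the timestamp example concrete, here is a hedged sketch that derives the three features named above from a tiny transaction log. The table, its column names, and the snapshot date are hypothetical, and `recent_vs_avg_amount` is one possible reading of "average transaction value versus account average".

```python
import pandas as pd

# Hypothetical transaction log (not part of the bank churn dataset)
transactions = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime(
        ['2026-01-05', '2026-02-20', '2026-03-01', '2025-11-10', '2026-02-25']),
    'amount': [120.0, 80.0, 40.0, 300.0, 60.0],
})
snapshot_date = pd.Timestamp('2026-03-06')  # features may only use data up to this date

# Transactions inside the trailing 30-day window
recent = transactions[transactions['timestamp'] > snapshot_date - pd.Timedelta(days=30)]

features = transactions.groupby('customer_id').agg(
    last_txn=('timestamp', 'max'),
    avg_amount=('amount', 'mean'),
)
features['days_since_last_txn'] = (snapshot_date - features['last_txn']).dt.days
features['txn_count_30d'] = (
    recent.groupby('customer_id').size().reindex(features.index, fill_value=0)
)
# Recent average spend relative to the customer's all-time average
features['recent_vs_avg_amount'] = (
    recent.groupby('customer_id')['amount'].mean().reindex(features.index)
    / features['avg_amount']
)
features = features.drop(columns='last_txn')
print(features)
```

Fixing a snapshot date matters: every feature is computed from data available at that moment, which keeps information from after the prediction point out of the training rows.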

The rule that beginners get wrong most often: every preprocessing step that learns something from the data (imputation values, scaling statistics, category encodings) must be fitted AFTER the train/test split, on the training set only, and the test set must be transformed using those training-set statistics. Row-wise feature logic that looks at one record at a time, such as a ratio of two columns, can safely run before the split. If you compute the mean of the entire dataset before splitting and use that to impute missing values, you have leaked information from the test set into the training process. The model has seen the future. Your evaluation metrics are fiction. This is called data leakage, and it is one of the most common reasons ML models that look excellent in development perform disappointingly in production.

The solution is mechanical: split first, fit on train, transform both. No exceptions.
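The difference between the leaky and the correct order can be seen in what the imputer actually learns. This is a small sketch on a synthetic `balance` column (the data and the 10% missing rate are assumptions for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
balance = rng.lognormal(mean=10, sigma=1, size=(1000, 1))  # synthetic, skewed
balance[rng.random(1000) < 0.1] = np.nan                   # roughly 10% missing

X_train, X_test = train_test_split(balance, test_size=0.2, random_state=42)

# Wrong order: the imputer sees test rows, so its fill value depends on the test set
leaky = SimpleImputer(strategy='median').fit(balance)

# Right order: the fill value is learned from training rows only
clean = SimpleImputer(strategy='median').fit(X_train)
X_train_imputed = clean.transform(X_train)
X_test_imputed = clean.transform(X_test)  # reuses the training median

print(f'Median learned from all rows:   {leaky.statistics_[0]:.2f}')
print(f'Median learned from train only: {clean.statistics_[0]:.2f}')
```

On a single column the two medians are usually close, which is exactly why this mistake goes unnoticed. The damage accumulates across many features and transformers, and with target-dependent steps such as target encoding it can be severe.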

preprocessing_and_features.py (Python)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np

df = pd.read_csv('bank_customers.csv')
df = df.drop_duplicates()  # remove duplicates found in Stage 1

# ─────────────────────────────────────────
# FEATURE ENGINEERING — create signal from raw data
# Do this BEFORE splitting so feature logic is consistent
# but computed statistics (means, stds) must come from train only
# ─────────────────────────────────────────

# Ratio features often carry more signal than raw values
# +1 prevents division by zero on edge cases
df['balance_to_salary_ratio']    = df['balance'] / (df['estimated_salary'] + 1)
df['products_per_tenure_year']   = df['products_number'] / (df['tenure'] + 1)

# Binary flags capture categorical behavioral patterns without cardinality issues
df['is_zero_balance']            = (df['balance'] == 0).astype(int)
df['has_multiple_products']      = (df['products_number'] > 1).astype(int)
df['is_senior']                  = (df['age'] >= 60).astype(int)  # domain knowledge

# ─────────────────────────────────────────
# CRITICAL RULE: SPLIT BEFORE ANY FITTING
# Fitting any transformer before the split leaks test information
# into training — your metrics become fiction
# ─────────────────────────────────────────
X = df.drop(['customer_id', 'churn'], axis=1)
y = df['churn']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y   # stratify ensures train and test have the same churn ratio
)

print(f'Train size: {X_train.shape[0]:,} | Test size: {X_test.shape[0]:,}')
print(f'Train churn rate: {y_train.mean():.3f} | Test churn rate: {y_test.mean():.3f}')
# Both rates should match — stratify ensures this

# ─────────────────────────────────────────
# IMPUTE MISSING VALUES
# fit_transform on train: learns the median from training data
# transform on test: applies the SAME median — does not look at test data
# ─────────────────────────────────────────
num_cols = X_train.select_dtypes(include=np.number).columns.tolist()
cat_cols = X_train.select_dtypes(include='object').columns.tolist()

num_imputer = SimpleImputer(strategy='median')  # median is robust to outliers vs mean
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_test[num_cols]  = num_imputer.transform(X_test[num_cols])   # transform only — not fit

# ─────────────────────────────────────────
# ENCODE CATEGORICAL VARIABLES
# One-hot encoding for low cardinality categoricals
# ─────────────────────────────────────────
X_train = pd.get_dummies(X_train, columns=['country', 'gender'], drop_first=True)
X_test  = pd.get_dummies(X_test,  columns=['country', 'gender'], drop_first=True)

# Align columns: if the test set is missing a category seen in train, add it as zeros
# Caveat: with drop_first=True this is only safe when train and test contain the same
# first category; a OneHotEncoder fitted on train avoids that edge case
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# ─────────────────────────────────────────
# SCALE NUMERIC FEATURES
# StandardScaler: transforms each feature to mean=0, std=1
# Required for: Logistic Regression, SVM, KNN, Neural Networks
# NOT required for: Random Forest, XGBoost, LightGBM (tree-based, scale-invariant)
# ─────────────────────────────────────────
scaler = StandardScaler()
remaining_num_cols = X_train.select_dtypes(include=np.number).columns
X_train[remaining_num_cols] = scaler.fit_transform(X_train[remaining_num_cols])
X_test[remaining_num_cols]  = scaler.transform(X_test[remaining_num_cols])

print(f'\nPreprocessing complete.')
print(f'Final feature count: {X_train.shape[1]}')
print(f'New engineered features: balance_to_salary_ratio, products_per_tenure_year, is_zero_balance, has_multiple_products, is_senior')
Output
Train size: 8,000 | Test size: 2,000
Train churn rate: 0.162 | Test churn rate: 0.162
Preprocessing complete.
Final feature count: 17
New engineered features: balance_to_salary_ratio, products_per_tenure_year, is_zero_balance, has_multiple_products, is_senior
Data Leakage Warning — The Most Common Beginner Mistake
If you fit ANY transformer — a scaler, imputer, or encoder — on the full dataset before splitting into train and test, you leak test information into training. The model indirectly sees the future. Your test metrics look better than they deserve to be. And when the model hits production, it underperforms your metrics because the real world doesn't have that leaked information baked in. The fix is mechanical and non-negotiable: split first, fit on train only, transform both. In production, bundle the full preprocessing chain into an sklearn Pipeline so the same transforms run in both training and serving without any manual coordination.
Production Insight
Feature engineering done in a Jupyter notebook diverges from what runs in production more often than you would expect. Different code paths, different edge case handling, different library versions, and the notebook running with global state that the production API does not have.
The solution is sklearn Pipeline: bundle every preprocessing step and the model into one serialisable object. The exact same transform logic runs at training time and serving time because it is literally the same code path.
Rule: never deploy a model without its preprocessing pipeline attached. They are one deployable unit, not two.
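A sketch of what "one deployable unit" can look like with sklearn's `Pipeline` and `ColumnTransformer`. The data is a synthetic stand-in: column names follow this guide, but the values and the churn rule are random assumptions, and the variable names (`churn_pipeline`, `new_customer`) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the bank data (values are random, not real customers)
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    'age': rng.integers(18, 80, n).astype(float),
    'balance': rng.gamma(2.0, 40000.0, n),
    'tenure': rng.integers(0, 11, n).astype(float),
    'country': rng.choice(['France', 'Germany', 'Spain'], n),
})
X.loc[rng.random(n) < 0.05, 'tenure'] = np.nan          # inject some missing values
y = (rng.random(n) < 0.2 + 0.3 * (X['age'] > 55)).astype(int)

numeric = ['age', 'balance', 'tenure']
categorical = ['country']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])
churn_pipeline = Pipeline([('preprocess', preprocess),
                           ('model', LogisticRegression(max_iter=1000))])

# Each CV fold refits the imputer, scaler, and encoder on that fold's training rows only
scores = cross_val_score(churn_pipeline, X, y, cv=5, scoring='roc_auc')
print(f'CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')

# Serving: raw rows go in, including a missing value and an unseen country
churn_pipeline.fit(X, y)
new_customer = pd.DataFrame(
    {'age': [61.0], 'balance': [0.0], 'tenure': [np.nan], 'country': ['Italy']})
churn_probability = churn_pipeline.predict_proba(new_customer)[0, 1]
print(f'Churn probability for new customer: {churn_probability:.3f}')
```

Because the fitted pipeline is a single object, it can be persisted with `joblib.dump` and loaded by the serving process as-is. `handle_unknown='ignore'` encodes an unseen category as all zeros rather than raising, which is a reasonable default here but should be paired with monitoring so new categories do not go unnoticed.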
Key Takeaway
Feature engineering is where domain knowledge becomes model performance — raw columns are almost never what your model actually needs.
Split first, fit on train, transform both — this is the single most commonly violated rule in beginner ML, and it is the most consequential.
Data leakage from preprocessing before the split is the most reliable way to build a model that looks great in development and disappoints in production.

Stage 3 — Model Selection and Training

Model selection is not about picking the most sophisticated algorithm. It is about matching the right tool to the problem given your constraints: prediction accuracy, latency requirements, interpretability needs, training data size, and long-term maintenance cost. A logistic regression model that is interpretable and trains in seconds often beats a gradient boosting ensemble that is opaque and takes hours to retrain, especially when business stakeholders need to explain decisions to regulators or customers.

Always start with a simple baseline. This is not a compromise; it is a professional discipline. If your baseline achieves 0.76 AUC and a complex model achieves 0.78 AUC, ask whether that 0.02 gain justifies the added training time, serving latency, debugging difficulty, and retraining complexity. In many production systems it does not. Simpler models fail more predictably, are faster to serve, and are easier to diagnose when something goes wrong.

Training is not just calling .fit(). It involves cross-validation to get a robust estimate of real-world performance (not just memorisation of the training set), hyperparameter tuning to find the best configuration, and managing the bias-variance tradeoff. A model that memorises its training data is overfitting: it performs well on training data but poorly on anything new. A model that is too simple to capture real patterns is underfitting: it performs poorly everywhere. The goal is the sweet spot between them.
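Overfitting and underfitting show up as a pattern in train versus validation scores. The sketch below varies the depth of a decision tree on a synthetic dataset (generated with `make_classification`, an assumption for illustration) and records the gap between the two.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise so that memorisation cannot generalise
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)

gaps = {}
train_scores = {}
for depth in (1, 5, None):  # None lets the tree grow until it memorises the data
    cv = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=42),
        X, y, cv=5, scoring='accuracy', return_train_score=True,
    )
    train_acc = cv['train_score'].mean()
    val_acc = cv['test_score'].mean()
    train_scores[depth] = train_acc
    gaps[depth] = train_acc - val_acc
    print(f'max_depth={str(depth):>4} | train={train_acc:.3f} '
          f'| validation={val_acc:.3f} | gap={gaps[depth]:.3f}')
```

A low score on both sides points to underfitting; a near-perfect training score with a much lower validation score points to overfitting. The depth with the best validation score, not the best training score, is the one worth keeping.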

Cross-validation is the tool for this. Instead of training once and evaluating on one validation split, you train five times on five different portions of your training data and average the results. This gives you a much more reliable estimate of how the model will perform on unseen data, and it reveals whether your model's performance is consistent or just got lucky on one particular data split.
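Written out by hand, five-fold cross-validation is just a loop. This sketch uses synthetic data with roughly the same class balance as the churn example (an assumption) and compares the manual loop against `cross_val_score` on the same folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.84, 0.16], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_aucs = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), start=1):
    # A fresh model per fold: nothing learned from the held-out rows carries over
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    val_proba = model.predict_proba(X[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[val_idx], val_proba))
    print(f'Fold {fold}: AUC = {fold_aucs[-1]:.4f}')

manual_mean = float(np.mean(fold_aucs))
helper_mean = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                              cv=cv, scoring='roc_auc').mean()
print(f'Manual loop mean AUC:     {manual_mean:.4f}')
print(f'cross_val_score mean AUC: {helper_mean:.4f}')
```

The spread across folds is as informative as the mean: a model whose fold scores vary widely is sensitive to which rows it happened to train on.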

model_training.py (Python)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, roc_auc_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Assume X_train, X_test, y_train, y_test are from Stage 2

# ─────────────────────────────────────────
# STEP 1: BASELINE — always train this first
# Logistic Regression is fast, interpretable, and sets the benchmark to beat
# If a complex model doesn't clearly beat this, the complexity isn't worth it
# ─────────────────────────────────────────
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline     = LogisticRegression(max_iter=1000, random_state=42)
baseline_cv  = cross_val_score(baseline, X_train, y_train, cv=cv, scoring='roc_auc')
print(f'Baseline (Logistic Regression):')
print(f'  CV AUC: {baseline_cv.mean():.4f} +/- {baseline_cv.std():.4f}')
print(f'  This is your benchmark — any more complex model must beat this to justify the added cost')

# ─────────────────────────────────────────
# STEP 2: COMPARE — try more complex models
# Only add complexity if the baseline is genuinely insufficient
# ─────────────────────────────────────────
model_candidates = {
    'Random Forest':       RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42)
}

results = {}
for name, model in model_candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')
    results[name] = scores
    print(f'\n{name}:')
    print(f'  CV AUC: {scores.mean():.4f} +/- {scores.std():.4f}')
    improvement = scores.mean() - baseline_cv.mean()
    print(f'  Improvement over baseline: {improvement:+.4f}')

# ─────────────────────────────────────────
# STEP 3: SELECT — pick based on validation performance
# Note: we have NOT touched the test set yet — it stays sacred until Stage 4
# ─────────────────────────────────────────
best_model = GradientBoostingClassifier(
    n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42
)
best_model.fit(X_train, y_train)

print(f'\nSelected: Gradient Boosting — best CV AUC, improvement justifies complexity')

# ─────────────────────────────────────────
# STEP 4: FEATURE IMPORTANCE — understand what drives predictions
# Critical for stakeholder trust, model auditing, and debugging drift
# ─────────────────────────────────────────
importances = pd.Series(
    best_model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

print(f'\nTop 8 features by importance:')
for feature, importance in importances.head(8).items():
    bar = '█' * int(importance * 100)
    print(f'  {feature:<35} {bar} {importance:.4f}')
Output
Baseline (Logistic Regression):
CV AUC: 0.7621 +/- 0.0134
This is your benchmark — any more complex model must beat this to justify the added cost
Random Forest:
CV AUC: 0.8543 +/- 0.0098
Improvement over baseline: +0.0922
Gradient Boosting:
CV AUC: 0.8687 +/- 0.0112
Improvement over baseline: +0.1066
Selected: Gradient Boosting — best CV AUC, improvement justifies complexity
Top 8 features by importance:
products_number ████████████████████████████ 0.2841
age ██████████████ 0.1423
is_zero_balance █████████ 0.0987
active_member ████████ 0.0834
balance_to_salary_ratio ███████ 0.0712
credit_score ██████ 0.0651
is_senior █████ 0.0489
tenure ████ 0.0401
Pro Tip: The Baseline Rule in Practice
Always train a logistic regression baseline before trying anything more complex. If your complex model only marginally outperforms it, the complexity is not worth it — a logistic regression trains in seconds, explains its predictions through coefficients, and degrades gracefully when data drifts. In production, the maintenance cost of a complex model is often higher than the accuracy gain it provides. The baseline also gives you a meaningful benchmark: stakeholders and future engineers need to know what 'improvement' means relative to something concrete.
Production Insight
In production, the model with the highest AUC is not always the right model to deploy. Gradient boosting with 500 estimators might score 2% higher AUC than a 50-estimator version but take 150ms to serve a prediction against a 50ms SLA.
Latency, memory footprint, interpretability for audits, and retraining time all matter in production and are invisible in AUC comparisons.
Rule: evaluate candidate models on at least four axes — accuracy metric, serving latency on production hardware, memory usage, and retraining time — before making the deployment decision.
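Serving latency is straightforward to measure before the deployment decision. The following is a rough sketch on synthetic data; the helper `median_latency_ms` and the candidate names are hypothetical, and numbers measured on a laptop will not match production hardware.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=17, random_state=42)

def median_latency_ms(model, row, n_calls=100):
    """Median wall-clock time of single-row predict_proba calls, in milliseconds."""
    timings = []
    for _ in range(n_calls):
        start = time.perf_counter()
        model.predict_proba(row)
        timings.append((time.perf_counter() - start) * 1000)
    return float(np.median(timings))

candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'gbm_50_trees': GradientBoostingClassifier(n_estimators=50, random_state=42),
    'gbm_500_trees': GradientBoostingClassifier(n_estimators=500, random_state=42),
}

latency_ms = {}
single_row = X[:1]  # one request at a time, as a serving API would see it
for name, model in candidates.items():
    model.fit(X, y)
    latency_ms[name] = median_latency_ms(model, single_row)
    print(f'{name:<22} median latency: {latency_ms[name]:.3f} ms')
```

The median is used because single calls are noisy; for an SLA it is usually worth looking at a high percentile (for example the 99th) as well, measured on the hardware and request path the model will actually run on.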
Key Takeaway
Start with a simple baseline — complexity is a cost, not a virtue, and every increase in complexity must be justified by a clear accuracy improvement.
Cross-validation gives you a robust estimate of real-world performance; training accuracy tells you only how well the model memorised training data.
The best model is the one that balances accuracy, latency, interpretability, and maintainability for your specific production constraints — not the one that wins a benchmark in isolation.
Model Selection Decision Framework
If: You need interpretability, because stakeholders ask 'why did the model predict this?' or regulators require it
Use: Logistic Regression or a Decision Tree. Coefficients and rules are directly readable and defensible
If: You need maximum accuracy on tabular data with mixed numeric and categorical features
Use: Gradient boosting. XGBoost, LightGBM, or CatBoost consistently perform well on tabular data and handle mixed feature types well
If: You have image, text, audio, or sequential time-series data
Use: Deep learning. CNNs for images, Transformers for text and sequences, LSTMs or other recurrent models for shorter time series
If: You have fewer than 1,000 training samples
Use: Simple models with strong regularisation. Complex models will overfit on small data; logistic regression or SVM with cross-validation is often best
If: Latency requirement is under 10ms per prediction in a serving API
Use: Logistic regression, a single decision tree, or a distilled/quantised model. Avoid large ensembles: a gradient boosting model with 500 trees may take 30-50ms

Stage 4 — Evaluation and Validation

Evaluation answers one question: will this model work on data it has never seen before? Not data it trained on. Not data it was cross-validated on. Entirely new data from the real world. The test set is your proxy for that real world, which is why it must be held sacred: you look at it exactly once, after all model selection and hyperparameter tuning decisions are final, and you report what you see without going back to adjust.

If you evaluate on the test set, find the performance unsatisfactory, tune the model, and evaluate again — you have now used the test set as part of your training process. It is no longer a fair estimate of real-world performance. This is a common mistake and it produces models that look good on paper but disappoint in deployment.

But looking at aggregate test-set metrics is not enough. You need to understand where the model fails and whether those failures have a business cost. A model with 86% overall accuracy might have only 45% recall on the minority class, meaning it misses more than half of the customers who actually churn. That 55% miss rate is not an abstract metric. Each missed churner is a customer who leaves without a retention offer. The confusion matrix translates directly into revenue impact.
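The arithmetic behind the miss rate is worth doing once by hand. This worked example assumes 324 actual churners in a 2,000-customer test set (the support shown in this guide's example report) and an illustrative annual revenue of $500 per customer.

```python
# Worked example: from recall to missed customers and missed revenue
actual_churners = 324
recall = 0.45
avg_annual_revenue_per_customer = 500  # illustrative figure

caught = round(actual_churners * recall)   # true positives
missed = actual_churners - caught          # false negatives
miss_rate = missed / actual_churners       # equals 1 - recall, up to rounding
revenue_at_risk = missed * avg_annual_revenue_per_customer

print(f'Churners caught: {caught}')
print(f'Churners missed: {missed} ({miss_rate:.1%} of all churners)')
print(f'Annual revenue at risk from missed churners: ${revenue_at_risk:,}')
```

Framing recall as a head count and a dollar figure is usually what makes the trade-off legible to stakeholders who do not think in confusion matrices.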

Metrics must also match business objectives. If catching every churner matters more than minimising false alarms, you optimise for recall. If false alarms are expensive — say, each false alarm triggers a costly discount offer to a customer who was never going to leave — you optimise for precision. Accuracy as a sole metric on an imbalanced dataset is a guaranteed way to build a model that satisfies a metric while failing the business.
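The precision versus recall choice can be made explicit by scoring each candidate threshold in money rather than in F1. The sketch below is one possible cost model under stated assumptions: every flagged customer receives an offer costing $50, 30% of true churners who receive it stay, and a retained customer is worth $500 a year. The scores are synthetic and the function `net_value` is hypothetical.

```python
import numpy as np

# Illustrative business figures (assumptions, not measured values)
REVENUE_PER_CUSTOMER = 500
OFFER_COST = 50
ACCEPTANCE_RATE = 0.30

def net_value(y_true, y_proba, threshold):
    """Expected net value of sending an offer to everyone scored >= threshold."""
    flagged = y_proba >= threshold
    tp = np.sum(flagged & (y_true == 1))
    fp = np.sum(flagged & (y_true == 0))
    saved = tp * REVENUE_PER_CUSTOMER * ACCEPTANCE_RATE
    spent = (tp + fp) * OFFER_COST  # every flagged customer receives an offer
    return float(saved - spent)

# Synthetic validation scores: churners tend to score higher than stayers
rng = np.random.default_rng(42)
y_val = (rng.random(2000) < 0.16).astype(int)
proba_val = np.clip(rng.normal(0.25 + 0.35 * y_val, 0.15), 0, 1)

thresholds = np.linspace(0.05, 0.95, 19)
values = np.array([net_value(y_val, proba_val, t) for t in thresholds])
best_threshold = float(thresholds[np.argmax(values)])

print(f'Best threshold by net value: {best_threshold:.2f}')
print(f'Net value at that threshold: ${values.max():,.0f}')
print(f'Net value at 0.50:           ${net_value(y_val, proba_val, 0.5):,.0f}')
```

When offers are cheap relative to the revenue they protect, the best threshold drifts lower and favours recall; when offers are expensive, it drifts higher and favours precision. The threshold should be chosen on validation data so the test set stays untouched.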

evaluation_and_validation.py (Python)
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, precision_recall_curve,
    average_precision_score
)
import numpy as np
import pandas as pd

# Assume best_model, X_train, X_test, y_train, y_test are from Stage 3

# ─────────────────────────────────────────
# CRITICAL: Look at the test set ONCE
# All decisions were made using cross-validation on X_train
# This is the only honest measure of real-world performance
# ─────────────────────────────────────────
y_pred  = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print('=== FINAL TEST SET EVALUATION ===')
print(f'ROC AUC: {roc_auc_score(y_test, y_proba):.4f}')
print(f'Average Precision (PR AUC): {average_precision_score(y_test, y_proba):.4f}')
print(f'\nClassification Report (default threshold = 0.5):')
print(classification_report(y_test, y_pred, target_names=['stayed', 'churned']))

# ─────────────────────────────────────────
# CONFUSION MATRIX — see exactly where the model fails
# Each cell has a name and a business meaning
# ─────────────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
print('\nConfusion Matrix:')
print(f'  True Negatives  (correctly predicted stayed):  {cm[0][0]:>4}')
print(f'  False Positives (predicted churn, stayed):     {cm[0][1]:>4}')
print(f'  False Negatives (predicted stayed, churned):   {cm[1][0]:>4}  ← these hurt the most')
print(f'  True Positives  (correctly predicted churn):   {cm[1][1]:>4}')

# ─────────────────────────────────────────
# BUSINESS IMPACT — translate metrics to money
# This is what stakeholders actually care about
# ─────────────────────────────────────────
avg_annual_revenue_per_customer = 500
cost_of_retention_offer         = 50
retention_acceptance_rate       = 0.30  # 30% of offered customers accept and stay

tn, fp, fn, tp = cm.ravel()

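revenue_saved   = tp * avg_annual_revenue_per_customer * retention_acceptance_rate
wasted_offers   = fp * cost_of_retention_offer
revenue_missed  = fn * avg_annual_revenue_per_customer
# Simplification: offers sent to true positives also cost money (tp * cost_of_retention_offer);
# that cost is left out here, so net_impact is an upper bound
net_impact      = revenue_saved - wasted_offers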

print(f'\nBusiness Impact (batch of {len(y_test):,} customers):')
print(f'  Revenue saved from caught churners:     ${revenue_saved:>8,.0f}')
print(f'  Cost of false-alarm retention offers:   ${wasted_offers:>8,.0f}')
print(f'  Revenue missed from uncaught churners:  ${revenue_missed:>8,.0f}  ← biggest loss')
print(f'  Net value of model vs no model:         ${net_impact:>8,.0f}')

# ─────────────────────────────────────────
# THRESHOLD TUNING — 0.5 is almost never optimal
# Find the threshold that maximises F1 or business value
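# NOTE: picking the threshold on the test set makes the tuned report below
# optimistic. In a real project, choose it on a validation split or on
# out-of-fold predictions from X_train, then confirm it once on the test set.
# ─────────────────────────────────────────
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# precision and recall have one more entry than thresholds; drop the final (1, 0) point
f1_scores    = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
optimal_idx  = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]

# F1 of the churn class at the default 0.5 threshold, from the confusion matrix above
default_f1 = 2 * tp / (2 * tp + fp + fn)

print(f'\nThreshold Analysis:')
print(f'  Default threshold: 0.50')
print(f'  Optimal F1 threshold: {optimal_threshold:.3f}')
print(f'  F1 improvement: {f1_scores[optimal_idx] - default_f1:.4f}')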

# Apply optimal threshold and compare
y_pred_tuned = (y_proba >= optimal_threshold).astype(int)
print(f'\nOptimised Classification Report (threshold = {optimal_threshold:.3f}):')
print(classification_report(y_test, y_pred_tuned, target_names=['stayed', 'churned']))
Output
=== FINAL TEST SET EVALUATION ===
ROC AUC: 0.8634
Average Precision (PR AUC): 0.6891
Classification Report (default threshold = 0.5):
precision recall f1-score support
stayed 0.88 0.96 0.92 1676
churned 0.72 0.45 0.55 324
accuracy 0.86 2000
macro avg 0.80 0.71 0.74 2000
Confusion Matrix:
True Negatives (correctly predicted stayed): 1609
False Positives (predicted churn, stayed): 67
False Negatives (predicted stayed, churned): 178 ← these hurt the most
True Positives (correctly predicted churn): 146
Business Impact (batch of 2,000 customers):
Revenue saved from caught churners: $ 21,900
Cost of false-alarm retention offers: $ 3,350
Revenue missed from uncaught churners: $ 89,000 ← biggest loss
Net value of model vs no model: $ 18,550
Threshold Analysis:
Default threshold: 0.50
Optimal F1 threshold: 0.327
F1 improvement: 0.0612
Optimised Classification Report (threshold = 0.327):
precision recall f1-score support
stayed 0.91 0.91 0.91 1676
churned 0.53 0.62 0.57 324
accuracy 0.86 2000
macro avg 0.72 0.76 0.74 2000
Evaluation Mental Model
  • The test set is sacred — look at it once after all decisions are made, or your metrics are biased and you will overestimate real-world performance
  • Accuracy on imbalanced data is a vanity metric — a model predicting 'no churn' for everyone gets 84% accuracy but catches zero churners and has zero business value
  • The confusion matrix tells the business story: false negatives are customers who left without a retention attempt, false positives are wasted discount budget
  • Threshold tuning converts a probability model into a business decision tool — 0.5 is the statistical default, not the business-optimal choice
  • Always translate metrics to business impact — stakeholders do not care about AUC-ROC, they care about revenue saved and budget spent
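The accuracy trap in the second bullet can be checked with a few lines of arithmetic. The sketch below is dependency-free and uses the class counts from the test-set output above (1,676 stayed, 324 churned); the helper name `score_majority_baseline` is made up for illustration.

```python
def score_majority_baseline(n_stayed: int, n_churned: int) -> dict:
    """Score a 'model' that predicts 'stayed' for every customer."""
    total = n_stayed + n_churned
    true_negatives = n_stayed     # every stayer is predicted correctly
    true_positives = 0            # no churner is ever flagged
    false_negatives = n_churned   # every churner is missed
    return {
        'accuracy': (true_negatives + true_positives) / total,
        'churn_recall': true_positives / (true_positives + false_negatives),
    }


baseline = score_majority_baseline(n_stayed=1676, n_churned=324)
print(f"Accuracy:     {baseline['accuracy']:.1%}")      # 83.8%
print(f"Churn recall: {baseline['churn_recall']:.1%}")  # 0.0%
```

Any real model should be compared against this baseline: beating 84% accuracy means little unless churn recall also moves well above zero.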
Production Insight
The default classification threshold of 0.5 is optimal only if false positives and false negatives have exactly equal cost, which is almost never true in real business problems.
For churn prediction, the cost of missing a churner (lost customer revenue) is often several times the cost of a false alarm (wasted retention offer). In the example above it is 10x: $500 in annual revenue against a $50 offer.
Rule: tune the decision threshold using a precision-recall curve and a business cost matrix specific to your problem — never deploy a classification model with the default threshold without at least evaluating whether it is appropriate.
Key Takeaway
Accuracy on imbalanced data is meaningless — report precision, recall, F1, and AUC-ROC, and always include the confusion matrix.
The confusion matrix has a dollar value — translate each cell to business impact so stakeholders understand what the model actually does.
Threshold tuning is the bridge between model probability output and business decision — never accept the default 0.5 without analysis.
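One way to tune against business value rather than F1 is to score each candidate threshold in dollars. The following is a minimal sketch, not the article's pipeline: the function names `net_value` and `pick_threshold` and the toy data are invented for illustration, and the value formula simply mirrors the business-impact arithmetic used earlier (revenue saved from true positives minus offers wasted on false positives). It assumes you run it on validation data, not on the test set.

```python
def net_value(y_true, y_proba, threshold,
              revenue_per_customer=500, offer_cost=50, acceptance_rate=0.30):
    """Net value of acting on predictions at a given threshold."""
    tp = sum(1 for y, p in zip(y_true, y_proba) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_proba) if p >= threshold and y == 0)
    return tp * revenue_per_customer * acceptance_rate - fp * offer_cost


def pick_threshold(y_true, y_proba, candidates):
    """Return the candidate threshold with the highest net value."""
    return max(candidates, key=lambda t: net_value(y_true, y_proba, t))


# Toy validation data, invented for illustration (1 = churned)
y_val     = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
proba_val = [0.05, 0.10, 0.15, 0.20, 0.35, 0.40, 0.30, 0.55, 0.45, 0.80]

candidates = [0.2, 0.3, 0.4, 0.5, 0.6]
best = pick_threshold(y_val, proba_val, candidates)
for t in candidates:
    print(f'threshold={t:.1f}  net value=${net_value(y_val, proba_val, t):,.0f}')
print(f'Best threshold by net value: {best}')
```

On this toy data the best threshold is 0.3, below the default 0.5, because a missed churner costs far more than a wasted offer. With different costs the ordering can flip, which is the point of making the cost model explicit.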

Stage 5 — Deployment and Monitoring

A model that lives in a Jupyter notebook generates zero business value. Deployment means serving predictions to real users and downstream systems, typically via a REST API for online predictions, a batch scoring job for overnight processing, or an embedded library for edge devices. Getting the model into production is one engineering challenge. Keeping it working is a separate and ongoing challenge that most teams underinvest in.

Models degrade. The world changes and data with it. Customer behaviour shifts when you launch a new product. Feature distributions change when a data pipeline upstream gets modified. Economic conditions shift seasonality patterns. A data vendor changes their schema and suddenly a key feature is zero for every new record. None of these failures will raise an exception in your serving code. The model will continue returning predictions that look syntactically valid while being semantically wrong, and the first signal you'll see is a business metric moving in the wrong direction weeks after the root cause occurred.

The deployment stack must include model versioning so you can roll back to a known good state in minutes, not days. It must include shadow scoring so new model versions are validated against live traffic before they replace the production model. It must include feature drift detection so you know when the inputs to your model have shifted meaningfully from what it was trained on. And it must include business metric monitoring alongside the ML metrics — because sometimes the model is technically correct but the business outcomes are not.

None of this is optional in production. It is the difference between a model that gets deployed and forgotten and one that remains a reliable part of your system for years.
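Shadow scoring is easier to reason about with a concrete shape. The sketch below is a simplified, dependency-free illustration: the `ShadowScorer` class, its method names, and the two stand-in scoring functions are all hypothetical. The production model's score is the only one returned to the caller; the candidate's score is only logged for later comparison.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ShadowScorer:
    """Serve the production model; score the candidate silently alongside it."""
    production: Callable[[Dict], float]
    candidate: Callable[[Dict], float]
    threshold: float = 0.5
    log: List[Dict] = field(default_factory=list)

    def predict(self, features: Dict) -> float:
        prod_score = self.production(features)
        try:
            cand_score = self.candidate(features)
        except Exception:
            cand_score = None   # a broken candidate must never affect live traffic
        self.log.append({'production': prod_score, 'candidate': cand_score})
        return prod_score       # callers only ever see the production score

    def disagreement_rate(self) -> float:
        """Share of logged requests where the two models decide differently."""
        scored = [r for r in self.log if r['candidate'] is not None]
        if not scored:
            return 0.0
        differing = sum(
            (r['production'] >= self.threshold) != (r['candidate'] >= self.threshold)
            for r in scored
        )
        return differing / len(scored)


# Stand-in models, invented for illustration
def production_model(features):
    return 0.8 if features['months_inactive'] >= 3 else 0.2


def candidate_model(features):
    return 0.8 if features['months_inactive'] >= 2 else 0.2


scorer = ShadowScorer(production_model, candidate_model)
for months in [0, 1, 2, 3, 4]:
    scorer.predict({'months_inactive': months})

print(f'Disagreement rate: {scorer.disagreement_rate():.0%}')  # 20%
```

In a real system the log would go to durable storage, and the comparison would also use actual outcomes once labels arrive, not just the disagreement rate.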

deployment_and_monitoring.py · PYTHON
import joblib
from flask import Flask, request, jsonify
import pandas as pd
import numpy as np
from datetime import datetime

# ─────────────────────────────────────────
# SAVE: Serialize the FULL pipeline — not just the model
# The imputer, scaler, and model travel together as one deployable unit
# (add any fitted encoder to the artifact as well if your features need one)
# ─────────────────────────────────────────
def save_pipeline(model, scaler, imputer, feature_columns, metrics, path='churn_pipeline_v1.0.0.joblib'):
    pipeline_artifact = {
        'model':            model,
        'scaler':           scaler,
        'imputer':          imputer,
        'feature_columns':  feature_columns,
        'version':          '1.0.0',
        'trained_at':       datetime.now().isoformat(),
        'training_metrics': metrics,   # store test-time metrics for comparison in monitoring
        'decision_threshold': 0.327    # the tuned threshold from Stage 4
    }
    joblib.dump(pipeline_artifact, path)
    print(f'Pipeline saved: {path}')
    print(f'Artifact contains: {list(pipeline_artifact.keys())}')
    return path


# ─────────────────────────────────────────
# SERVE: Flask API for real-time predictions
# ─────────────────────────────────────────
app        = Flask(__name__)
pipeline   = joblib.load('churn_pipeline_v1.0.0.joblib')
pred_log   = []  # in-memory log for drift monitoring — use a database in production

@app.route('/predict', methods=['POST'])
def predict():
    try:
        data = request.get_json()
        if not data:
            return jsonify({'error': 'Request body must be JSON'}), 400

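        # Align to expected feature columns. Features missing from the request
        # become NaN so the fitted imputer fills them exactly as in training;
        # filling with 0 would silently invent values the model never saw.
        df = pd.DataFrame([data])
        df = df.reindex(columns=pipeline['feature_columns'])

        # Apply SAME preprocessing as training
        # (assumes imputer and scaler were fit on these numeric columns, in this order)
        num_cols = df.select_dtypes(include=np.number).columns
        df[num_cols] = pipeline['imputer'].transform(df[num_cols])
        df[num_cols] = pipeline['scaler'].transform(df[num_cols])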

        probability  = pipeline['model'].predict_proba(df)[0][1]
        prediction   = int(probability >= pipeline['decision_threshold'])

        # Log for drift monitoring — essential for Stage 6
        pred_log.append({
            'timestamp': datetime.now().isoformat(),
            'probability': float(probability),
            'prediction':  prediction,
            'features':    data
        })

        return jsonify({
            'churn_probability': round(probability, 4),
            'will_churn':        bool(prediction),
            'model_version':     pipeline['version'],
            'threshold_used':    pipeline['decision_threshold']
        })

    except Exception as e:
        # Log the error with context — never swallow exceptions silently
        print(f'Prediction error: {e} | Input: {request.get_json()}')
        return jsonify({'error': 'Prediction failed', 'detail': str(e)}), 500

@app.route('/health', methods=['GET'])
def health():
    return jsonify({
        'status':              'healthy',
        'model_version':       pipeline['version'],
        'predictions_served':  len(pred_log),
        'trained_at':          pipeline['trained_at']
    })

# ─────────────────────────────────────────
# MONITOR: Drift detection — run weekly
# Compare live feature distributions against training baselines
# ─────────────────────────────────────────
def calculate_psi(train_values, live_values, buckets=10):
    """Population Stability Index — measures how much a distribution has shifted.
    PSI < 0.1:  no significant shift
    PSI 0.1-0.2: moderate shift — monitor closely
    PSI > 0.2:   significant drift — trigger retraining
    """
    def get_distribution(values, buckets):
        percentiles   = np.linspace(0, 100, buckets + 1)
        boundaries    = np.percentile(train_values, percentiles)
        boundaries[0] = -np.inf
        boundaries[-1] = np.inf
        counts        = np.histogram(values, bins=boundaries)[0]
        proportions   = (counts + 1e-8) / len(values)  # +1e-8 avoids log(0)
        return proportions

    train_dist = get_distribution(train_values, buckets)
    live_dist  = get_distribution(live_values,  buckets)
    psi        = np.sum((live_dist - train_dist) * np.log(live_dist / train_dist))
    return round(psi, 4)

def run_drift_check(training_stats, recent_predictions, retraining_threshold=0.2):
    drifted_features = []
    for feature, train_values in training_stats.items():
        # Skip requests where the feature was absent or null, so None never reaches np.histogram
        live_values = [p['features'][feature] for p in recent_predictions if p['features'].get(feature) is not None]
        if len(live_values) < 100:
            continue  # not enough live data to calculate PSI reliably
        psi = calculate_psi(np.array(train_values), np.array(live_values))
        status = 'DRIFT' if psi > retraining_threshold else 'OK'
        print(f'  {feature:<30} PSI={psi:.4f}  [{status}]')
        if psi > retraining_threshold:
            drifted_features.append(feature)

    if drifted_features:
        print(f'\nWARNING: Drift detected in {len(drifted_features)} features: {drifted_features}')
        print('Action:  Trigger retraining pipeline. Do not wait for business metrics to degrade.')
    else:
        print('\nNo significant drift detected. Model inputs remain stable.')

    return drifted_features
Output
Pipeline saved: churn_pipeline_v1.0.0.joblib
Artifact contains: ['model', 'scaler', 'imputer', 'feature_columns', 'version', 'trained_at', 'training_metrics', 'decision_threshold']
Prediction API response:
{
"churn_probability": 0.7234,
"will_churn": true,
"model_version": "1.0.0",
"threshold_used": 0.327
}
Weekly drift check:
age PSI=0.0412 [OK]
products_number PSI=0.2341 [DRIFT]
balance_to_salary_ratio PSI=0.0891 [OK]
active_member PSI=0.2109 [DRIFT]
WARNING: Drift detected in 2 features: ['products_number', 'active_member']
Action: Trigger retraining pipeline. Do not wait for business metrics to degrade.
Deployment Without Monitoring Is a Ticking Time Bomb
A deployed model without monitoring is not a finished product — it is a liability. Feature distributions shift, user behaviour changes, data pipelines break and start feeding unexpected values. Without drift detection, your model degrades invisibly. By the time business metrics surface the problem, weeks of bad predictions have already reached customers. Monitor feature distributions weekly using PSI, track your model's predicted probability distribution over time, compare against actual outcomes when labels become available, and always maintain the previous model version for instant rollback.
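To make the PSI thresholds concrete, here is the arithmetic on hand-made bucket proportions. This is an illustrative sketch only: the function name `psi_from_proportions` and the three example distributions are invented, and it assumes the values have already been bucketed.

```python
import math


def psi_from_proportions(expected, actual, eps=1e-8):
    """Population Stability Index from two lists of bucket proportions.

    PSI = sum((actual - expected) * ln(actual / expected)) over buckets.
    eps guards against log(0) and division by zero for empty buckets.
    """
    if len(expected) != len(actual):
        raise ValueError('Both distributions need the same number of buckets')
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )


training = [0.25, 0.25, 0.25, 0.25]   # baseline: customers spread evenly over 4 buckets
stable   = [0.24, 0.26, 0.25, 0.25]   # small wobble
shifted  = [0.10, 0.20, 0.30, 0.40]   # population has moved to the upper buckets

print(f'stable  PSI = {psi_from_proportions(training, stable):.4f}')
print(f'shifted PSI = {psi_from_proportions(training, shifted):.4f}')
```

The stable case lands far below 0.1, while the shifted case crosses the 0.2 retraining threshold. The same function can be pointed at bucketed prediction probabilities to track the model's output distribution over time.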
Production Insight
The model and its preprocessing pipeline are one deployable unit — never treat them separately.
A model deployed without its exact preprocessing pipeline will receive raw, unscaled inputs unlike anything it saw during training, and it will produce wrong predictions silently and without errors.
Rule: serialise the complete pipeline — imputer, scaler, encoder, and model — as a single versioned artifact. Test it end-to-end on a known input before deploying. The serving code and the training code must use the exact same preprocessing logic, and the only reliable way to guarantee that is to make them the same object.
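The serving script above stores the fitted pieces in a dict. A stricter version of the same rule, assuming your preprocessing fits into scikit-learn transformers, is a single `Pipeline` object. The sketch below uses invented toy data and an invented file name; it is one possible shape, not the article's production code.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data, invented for illustration; np.nan marks a missing value
X_train = np.array([[25.0, 1.0], [40.0, 3.0], [np.nan, 2.0], [60.0, 4.0],
                    [30.0, 1.0], [55.0, np.nan], [45.0, 3.0], [35.0, 2.0]])
y_train = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Preprocessing and model live in ONE object, fitted in one call
churn_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('model',   LogisticRegression()),
])
churn_pipeline.fit(X_train, y_train)

# Serialise the whole pipeline, then reload it as the serving code would
path = os.path.join(tempfile.mkdtemp(), 'churn_pipeline_example.joblib')
joblib.dump(churn_pipeline, path)
served = joblib.load(path)

# End-to-end check on a known raw input (unscaled, with a missing value)
known_input = np.array([[50.0, np.nan]])
before = churn_pipeline.predict_proba(known_input)[0, 1]
after  = served.predict_proba(known_input)[0, 1]
print(f'Steps: {list(served.named_steps)}')
print(f'Same prediction after reload: {np.isclose(before, after)}')
```

Because the serving code calls `predict_proba` on the reloaded pipeline with raw input, there is no separate preprocessing code that could fall out of step with training.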
Key Takeaway
Deployment is the starting line, not the finish line — the model's operational life begins when it goes live, and it requires ongoing care.
Monitor feature drift weekly using PSI and retrain when thresholds are exceeded — do not wait for business metrics to degrade as your first signal.
The full pipeline is one artifact: serialize preprocessing and model together, version everything, and maintain rollback capability so you can recover in minutes when something goes wrong.
Deployment Architecture Decision
If: Predictions needed synchronously in under 100ms per request
Use: Deploy as a REST API using FastAPI or Flask behind a load balancer with horizontal auto-scaling; containerise with Docker for environment consistency
If: Predictions needed for bulk records overnight or on a schedule
Use: Deploy as a batch job using Airflow or Prefect that reads from the database, scores all records, writes results back, and logs timing and drift metrics
If: Model must run on mobile devices or embedded hardware with no network dependency
Use: Export to ONNX or TensorFlow Lite; optimise for model size and inference speed, and test on target hardware before shipping
If: Model is retrained frequently and multiple versions coexist in production
Use: Use a model registry such as MLflow or Weights & Biases to track versions, metrics, and lineage; implement shadow scoring before promoting new versions
● Production incident · POST-MORTEM · severity: high

The Silent Model Decay — When a Churn Predictor Stops Predicting

Symptom
Business stakeholders reported rising churn rates and escalating customer acquisition costs. The ML dashboard still showed 94% model accuracy. New customers were leaving at triple the historical rate, but the model flagged almost nobody as high-risk. Retention campaigns were not triggering because the model saw no one worth targeting.
Assumption
The team assumed model accuracy measured on a static historical test set would remain valid indefinitely. They had no monitoring for data drift or concept drift. The model was deployed once and never retrained. Nobody asked whether the data distribution from six months ago still described the customers the model was scoring today.
Root cause
Three months after deployment, the company launched a new pricing tier and changed its onboarding flow for enterprise accounts. The new customer segment behaved very differently (shorter session durations in the first 30 days, fewer feature adoptions in month one, a different geographic distribution), and the model had never seen this population during training. The input distributions shifted (data drift), and so did the relationship between features and the target variable (concept drift): the model kept applying its old learned patterns to a population it was never designed for. The 94% test set accuracy was measured against historical customers who no longer represented the majority of the active user base.
Fix
Implemented a monitoring pipeline that tracks feature distributions weekly using Population Stability Index (PSI) and compares live prediction probability distributions against training-time distributions. Added automated retraining triggers when PSI exceeds 0.2 on any tier-1 feature. Deployed shadow scoring — the retrained model runs in parallel with the production model for one full week before promotion, with both scores logged for comparison. Added a business metric crosscheck: if the model's predicted churn rate diverges from actual observed churn rate by more than 15% over a rolling 14-day window, a critical alert fires regardless of PSI values.
Key lesson
  • A model's test accuracy is a snapshot taken at a moment in time, not a guarantee — it reflects performance on data that may no longer represent the real-world population the model is scoring today
  • Always monitor for data drift (feature distributions shifting over time) and concept drift (the feature-to-target relationship changing) — these are different problems with different fixes
  • Set up automated retraining pipelines with drift-triggered conditions, not calendar schedules — retrain when drift is detected, not every Monday regardless of whether the data has changed
  • Shadow scoring before model promotion is non-negotiable for any model that influences business-critical decisions — a week of parallel scoring on live traffic catches distribution problems that test sets miss
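The business metric crosscheck from the fix can be sketched in a few lines. This is an assumption-laden illustration: the function names are invented, the incident report does not say whether "15%" is relative or absolute, and the sketch treats it as a relative gap. The sample window is made up to resemble the incident, where churn ran at triple the rate the model predicted.

```python
def churn_rate_gap(predicted_flags, actual_flags):
    """Relative gap between predicted and observed churn rate over one window."""
    if len(predicted_flags) != len(actual_flags) or not actual_flags:
        raise ValueError('Need two non-empty lists of equal length')
    predicted_rate = sum(predicted_flags) / len(predicted_flags)
    actual_rate = sum(actual_flags) / len(actual_flags)
    if actual_rate == 0:
        return 0.0 if predicted_rate == 0 else float('inf')
    return abs(predicted_rate - actual_rate) / actual_rate


def should_alert(predicted_flags, actual_flags, max_relative_gap=0.15):
    """True when the model's churn rate has drifted too far from reality."""
    return churn_rate_gap(predicted_flags, actual_flags) > max_relative_gap


# Invented 14-day window of 100 customers: the model flags 5, but 15 actually churn
predicted = [1] * 5 + [0] * 95
actual    = [1] * 15 + [0] * 85

print(f'Relative gap: {churn_rate_gap(predicted, actual):.0%}')
print(f'Fire critical alert: {should_alert(predicted, actual)}')
```

This check needs ground-truth labels, so it lags feature-level PSI, but it catches concept drift that leaves the input distributions looking normal.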
Production debug guide: symptom-driven actions for the most common ML pipeline failures (5 entries)
Symptom · 01
Model accuracy is suspiciously high (99%+) on test data but much lower on new production data
Fix
Check for data leakage — features that indirectly encode the target variable or were computed using future information. Common culprits: future-dated columns, ID fields that correlate with the target, target-encoded columns computed on the full dataset before splitting, or preprocessing fit on the full dataset rather than the training set alone. Run permutation importance on the test set and investigate any feature with disproportionate importance.
Symptom · 02
Model performs well in training but noticeably worse on validation or test data (overfitting)
Fix
Measure the gap: training accuracy minus validation accuracy. As a rough rule of thumb, a gap above 10 percentage points signals overfitting. Apply regularization (L1/L2 for linear models, min_samples_leaf for trees), reduce model complexity by lowering max_depth (or n_estimators for boosted ensembles), add dropout for neural networks, or collect more training data. Cross-validation will confirm whether the gap is consistent across folds.
Symptom · 03
Model predictions are dominated by one class — almost always predicts the majority class
Fix
Check the class distribution: df['target'].value_counts(normalize=True). If the minority class is under 10%, the model learned that always predicting the majority gives the lowest loss. Apply class_weight='balanced' in the model constructor, use SMOTE oversampling from imblearn, or switch your optimisation metric from accuracy to F1-score or AUC-PR. Report precision and recall separately rather than relying on accuracy.
Symptom · 04
Feature importance output shows one feature dominates everything else by a wide margin
Fix
Investigate whether that feature is a proxy for the target (data leakage). Check its correlation with the target: df.corr(numeric_only=True)['target'][suspicious_feature]. If correlation exceeds 0.95, remove it and retrain. If it's legitimate domain knowledge, verify the feature will be available at serving time with the same distribution. A feature that's clean in training but missing or calculated differently in production will cause serving failures.
Symptom · 05
Model retraining produces worse results than the previous version
Fix
Do not promote the new model automatically. Compare feature distributions between the old training set and the new training set using PSI — if a key feature has shifted significantly, the new training data may contain distribution problems. Check label quality in the new training data, and verify that preprocessing pipeline versions are pinned. Compare side-by-side predictions on a fixed evaluation set to isolate whether the degradation is in the data or the pipeline.
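The "do not promote automatically" rule in the last entry can be expressed as a small gate. The sketch below is hypothetical: the function `should_promote`, the choice of PR AUC as primary metric with recall as a guard, and the metric values are all assumptions made for illustration.

```python
def should_promote(current: dict, candidate: dict,
                   primary: str = 'pr_auc', guard: str = 'recall',
                   max_guard_drop: float = 0.02) -> tuple:
    """Decide whether a retrained model may replace the production model.

    Both dicts hold metrics measured on the SAME fixed evaluation set.
    Returns (decision, reason).
    """
    if candidate[primary] < current[primary]:
        return False, f'{primary} fell from {current[primary]:.3f} to {candidate[primary]:.3f}'
    if current[guard] - candidate[guard] > max_guard_drop:
        return False, f'{guard} dropped by more than {max_guard_drop:.2f}'
    return True, 'candidate is at least as good on the fixed evaluation set'


# Metric values invented for illustration
production = {'pr_auc': 0.689, 'recall': 0.62}
retrained  = {'pr_auc': 0.641, 'recall': 0.66}

promote, reason = should_promote(production, retrained)
print(f'Promote: {promote} ({reason})')
```

Here the retrained model is blocked even though its recall improved, because the primary metric fell. A failed gate is the cue to investigate the new training data rather than to ship.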
★ ML Pipeline Quick Debug Reference: commands to diagnose common ML workflow issues. No theory, just copy, paste, diagnose.
Suspect data leakage — accuracy too good to be true
Immediate action
Check feature correlations with the target variable before and after the split
Commands
df.corr(numeric_only=True)['target'].sort_values(ascending=False).head(10)
from sklearn.inspection import permutation_importance; result = permutation_importance(model, X_test, y_test, n_repeats=10); print(dict(zip(X_test.columns, result.importances_mean.round(4))))
Fix now
If any feature has correlation above 0.95 with the target, it is almost certainly leaking the answer. Remove it and retrain. Permutation importance on the test set is more reliable than built-in feature importance for detecting leakage.
Class imbalance — model predicts only the majority class
Immediate action
Check class distribution in training data and review per-class metrics
Commands
df['target'].value_counts(normalize=True)
from sklearn.metrics import classification_report; print(classification_report(y_test, y_pred, target_names=['stayed', 'churned']))
Fix now
If minority class is under 10%, pass class_weight='balanced' to your model constructor. For more aggressive correction, use SMOTE: from imblearn.over_sampling import SMOTE; X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
Model training is extremely slow or runs out of memory
Immediate action
Check dataset size, feature count, and memory usage before choosing a strategy
Commands
print(f'Shape: {df.shape}, Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB')
df.info(memory_usage='deep')
Fix now
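If over 1M rows, consider a histogram-based gradient boosting implementation such as LightGBM, which is typically much faster than sklearn's GradientBoostingClassifier on large datasets. If you use subsample=0.8, also set subsample_freq=1, because row subsampling only takes effect when subsample_freq is greater than zero. If over 1000 features, apply variance threshold or SelectFromModel before training. Downcast float columns: float_cols = df.select_dtypes('float').columns; df[float_cols] = df[float_cols].apply(pd.to_numeric, downcast='float')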
Predictions differ between training environment and production API
Immediate action
Compare library versions and verify the full pipeline is serialised, not just the model
Commands
import sklearn, joblib; print(f'sklearn {sklearn.__version__}, joblib {joblib.__version__}')
loaded = joblib.load('pipeline.joblib'); print(type(loaded)); print(loaded.named_steps.keys() if hasattr(loaded, 'named_steps') else 'WARNING: model only, no pipeline')
Fix now
Pin all library versions in requirements.txt and freeze them in your Docker image. Serialise using sklearn Pipeline that includes all preprocessing steps: imputer, scaler, encoder, and model as one object. A pipeline object guarantees training-time and serving-time preprocessing are identical.
Model performance degrades over weeks after deployment
Immediate action
Measure data drift using Population Stability Index on key features
Commands
pip install evidently
from evidently.report import Report; from evidently.metric_preset import DataDriftPreset; report = Report(metrics=[DataDriftPreset()]); report.run(reference_data=train_df, current_data=recent_df); report.show()
Fix now
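If PSI exceeds 0.2 on any tier-1 feature, trigger retraining. Note that Evidently's drift preset chooses a statistical test per column by default, which is not necessarily PSI, so compute PSI directly (or configure the test) if your rule is PSI-based. The imports shown match Evidently's older Report API; later releases reorganised these modules, so pin the version you tested. Set up a weekly scheduled job that runs this comparison automatically and fires an alert when the threshold is crossed. Do not wait for business metrics to degrade: drift detection should be your early warning system.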
ML Workflow Stages — Inputs, Outputs, and Common Failures
| Workflow Stage | Input | Output | Most Common Failure |
|---|---|---|---|
| Data Collection | Raw data sources: databases, APIs, CSV files, event streams | Profiled dataset with documented schema, class distribution, and missing value report | Missing values silently ignored; class imbalance not detected; data assumed clean without verification |
| Preprocessing | Raw dataset plus domain knowledge about what features mean | Numeric feature matrix split into train and test sets with consistent transforms applied | Data leakage: fitting transformers on the full dataset before splitting, so test metrics become fiction |
| Model Training | Preprocessed training set with feature matrix and target labels | Trained model artifact with cross-validation performance estimates and feature importances | Overfitting: high training AUC, meaningfully lower validation AUC (model memorised training data) |
| Evaluation | Trained model plus the held-out test set that the model has never seen | Honest performance metrics, confusion matrix, business impact translation, tuned decision threshold | Evaluating on training data; using accuracy on imbalanced data; repeatedly tuning against the test set |
| Deployment | Trained model plus complete preprocessing pipeline as one versioned artifact | Serving endpoint (REST API or batch job) with health check, version metadata, and prediction logging | Model deployed without preprocessing pipeline; no versioning; no rollback capability |
| Monitoring | Live prediction logs, live feature distributions, training baseline statistics, actual outcomes when available | Drift alerts, automated retraining triggers, model performance dashboards, rollback decisions | No monitoring at all: model silently degrades for months before business metrics surface the problem |

Key takeaways

1. The ML workflow is a repeatable six-stage pipeline: collect data, preprocess, train, evaluate, deploy, monitor. Skip any stage and the system breaks in predictable and sometimes invisible ways.
2. Data quality dominates model quality: 80% of production ML failures trace back to bad data, not bad algorithms. Profile your dataset before touching a model.
3. Always split before preprocessing. Fit on train, transform both. Data leakage from preprocessing before the split is the most common reason models overestimate their real-world performance.
4. Accuracy on imbalanced data is a vanity metric. Use precision, recall, F1, or AUC-ROC. Always translate metrics to business impact: stakeholders need to understand the cost of each error type.
5. Deployment is the starting line, not the finish line. Models degrade as the world changes. Monitor feature drift weekly and retrain when thresholds are exceeded; do not wait for business metrics to surface the problem.
6. Start with a simple baseline and only add complexity when the baseline is genuinely insufficient for business requirements. Complexity is a cost that must be justified by clear, material improvement.

Common mistakes to avoid

6 patterns
×

Fitting preprocessing transformers on the full dataset before the train/test split

Symptom
Model achieves suspiciously high accuracy during development — better than domain experts would expect. When deployed, production performance is notably worse than test metrics promised. The test metrics were optimistic because test-set statistics leaked into training.
Fix
Split data first. Then fit imputers, scalers, and encoders on the training set only. Apply those fitted transformers to the test set without refitting. Use sklearn Pipeline to enforce this pattern mechanically — it becomes impossible to accidentally fit on test data when the pipeline structure prevents it.
×

Using accuracy as the primary metric on imbalanced datasets

Symptom
Model reports 92% accuracy but catches only 10% of the minority class. Business stakeholders see no value from the model because it misses almost every case that actually matters.
Fix
Switch to precision, recall, F1-score, or AUC-ROC. Report the full confusion matrix alongside any aggregate metric. Tune the classification threshold using the precision-recall curve. Present the business cost of false negatives and false positives separately so stakeholders understand the trade-offs.
×

Deploying the model without its preprocessing pipeline

Symptom
Predictions in production differ from predictions in the notebook on identical input data. Debugging reveals that scaling, imputation, or encoding is missing or applied differently in the serving code.
Fix
Serialise the complete pipeline — imputer, scaler, encoder, feature engineering logic, and model — as a single artifact using sklearn Pipeline plus joblib. Version it with a semantic version number. Test end-to-end predictions on a known input before any deployment. Never separate the model from its preprocessing.
×

Deploying a model with no drift monitoring

Symptom
Model performance silently degrades over weeks or months. Business metrics worsen — churn increases, fraud goes undetected — but the ML dashboard still shows the historical test accuracy from deployment day. By the time the problem is visible, months of bad predictions have already impacted customers.
Fix
Implement weekly drift monitoring using Population Stability Index or Kolmogorov-Smirnov tests on key features. Compare live feature distributions against training baselines. Set automated retraining triggers when PSI exceeds 0.2. Cross-reference with business outcomes when ground truth labels become available.
×

Using the test set multiple times for model selection, hyperparameter tuning, and feature selection

Symptom
Model performs well on the held-out test set during development but underperforms on truly new production data. The test set was used iteratively, making it an implicit part of the training process.
Fix
Use the test set exactly once, after all decisions are final. Perform model selection and hyperparameter tuning exclusively on the training set using cross-validation. If iterative evaluation is needed, create a separate validation split — the test set remains sealed until the very end.
×

Jumping to complex models without establishing a baseline

Symptom
Team spends weeks tuning a gradient boosting model with 500 estimators that achieves an AUC of 0.87. A logistic regression baseline trained in five minutes achieves an AUC of 0.84. A 0.03 gain in AUC may not justify the added complexity, serving latency, and maintenance cost.
Fix
Always train a simple baseline first — logistic regression for classification, linear regression for regression problems. Document the baseline AUC as the explicit benchmark. Only invest in complexity when the improvement is material relative to business requirements and the added cost is justified.
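The first mistake in this list (fitting transformers before the split) is the easiest one to prevent mechanically. The sketch below shows the split-first pattern using only the standard library; the data is randomly generated for illustration and the `scale` helper is a stand-in for a fitted scaler, not a replacement for sklearn's `StandardScaler`.

```python
import random
import statistics

random.seed(42)
# Invented data: one numeric feature (monthly spend) for 20 customers
monthly_spend = [round(random.uniform(10, 200), 2) for _ in range(20)]

# 1. Split FIRST
random.shuffle(monthly_spend)
train, test = monthly_spend[:16], monthly_spend[16:]

# 2. Fit the scaler on the training rows only
train_mean = statistics.mean(train)
train_std = statistics.pstdev(train)


def scale(values):
    """Standardise using training statistics; never recompute them on test data."""
    return [(v - train_mean) / train_std for v in values]


# 3. Transform both splits with the SAME fitted statistics
train_scaled = scale(train)
test_scaled = scale(test)

print(f'train mean after scaling: {statistics.mean(train_scaled):.3f}')
print(f'test mean after scaling:  {statistics.mean(test_scaled):.3f}')
```

The scaled training mean is zero up to floating-point error, while the scaled test mean generally is not. That is expected and correct: the test set is being measured with the training set's ruler, exactly as live data will be.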
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Walk me through the ML workflow from raw data to a deployed model. What ...
Q02 · SENIOR
What is data leakage in ML, how does it happen, and how do you prevent i...
Q03 · SENIOR
You've deployed a churn prediction model. After three months, business s...
Q04 · JUNIOR
Why is accuracy a bad metric for imbalanced classification problems, and...
Q05 · SENIOR
What is the difference between a model and a pipeline in ML deployment, ...
Q01 of 05 · SENIOR

Walk me through the ML workflow from raw data to a deployed model. What happens at each stage, and what are the most common mistakes?

ANSWER
The ML workflow has six stages. First, data collection and understanding — profile the dataset, check missing values, class balance, distributions, and whether the data is representative of production. Second, preprocessing and feature engineering — handle missing values, encode categoricals, scale numerics, and create domain-driven features. The critical rule here: split data before any fitting to prevent data leakage. Third, model selection and training — start with a simple baseline, compare models using cross-validation on the training set only, select based on validation performance. Fourth, evaluation — use the held-out test set exactly once, report precision, recall, F1, and AUC-ROC rather than accuracy on imbalanced data, tune the decision threshold, and translate metrics to business impact. Fifth, deployment — serialise the complete preprocessing pipeline and model as one artifact, serve via API or batch job, version everything. Sixth, monitoring — track feature drift weekly using PSI, retrain when drift exceeds thresholds, and maintain rollback capability. The three most common production mistakes are: fitting preprocessing before the split (data leakage), reporting accuracy on imbalanced data (misleading metric), and deploying without any monitoring (silent model decay).
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. What is the ML workflow in simple terms?
02. Do I need to know math to follow the ML workflow?
03. How long does it take to go through the ML workflow for a real project?
04. What tools do I need to implement the ML workflow?