Senior · 14 min · March 28, 2026

Machine Learning - Data Leakage Killed My Churn Model

A churn model flagged 40% of customers as high risk, yet churn stayed at baseline.

Naren · Founder
Plain-English first. Then code. Then the interview question.
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • ML finds patterns in data without explicit rules — learns from examples
  • Supervised learning: labeled examples for prediction (fraud, churn, price)
  • Unsupervised: finds hidden structure in unlabeled data (segments, anomalies)
  • How learning works: predict, measure error with loss, nudge weights via gradient descent
  • Production risk: data leakage inflates test accuracy — model fails in real world
  • Biggest mistake: reaching for deep learning on tabular data — start with Random Forest
Plain-English First

Imagine you are training a new hire to approve or reject loan applications. You do not hand them a rulebook. You show them 10,000 past decisions and let them figure out the pattern themselves. After enough examples, they can handle applications they have never seen before and get it right most of the time. That is machine learning: you feed a program past examples with known answers, it extracts the pattern hiding inside those examples, and then it uses that pattern to make decisions on new data it has never touched. The program is not following rules you wrote. It found its own rules by studying the examples you gave it.

Machine learning is how software finds patterns in data without being explicitly programmed with rules. For beginners, the hardest part is not the code — it's knowing which problems ML can actually solve.

Most tutorials start with imports and end with a graph. This one starts with the decision of whether ML is the right tool at all, then moves to a deployable model. A team I worked with spent three months hand-coding fraud detection rules. The day they shipped it, fraudsters changed their behavior slightly and the whole system went blind. A basic ML model would have caught the new pattern automatically.

You don't need a PhD to ship working ML. You need to know how training actually works, how to pick an algorithm for your data shape, and why your model will fail in production if you skip the right evaluation. By the end, you'll have a working mental model and a deployed endpoint.

How a Model Actually Learns

Before you write a single line of Python, you need a real mental model of what learning means here. If you skip this, you will cargo-cult your way through tutorials and have no idea why your model fails in production.

Every ML model starts as a blank function with dials called parameters or weights, all set to random numbers. You feed it a training example: say, an email with the label spam. The model makes a prediction, probably wrong at first. You measure how wrong it was using a loss function, which is just a number that gets bigger when the model is more wrong. Then an algorithm called gradient descent nudges every dial a tiny amount in whatever direction reduces that loss. Repeat this for thousands of examples and the dials gradually settle into values that produce correct predictions.

That is the entire training loop. Forward pass, measure loss, backward pass, update weights, repeat. The model is not reasoning or understanding anything. It is doing organized trial-and-error at industrial scale, guided by the feedback signal you gave it.

This matters because your feedback signal, your labeled training data, is everything. Garbage labels, biased samples, or leaking future information into training data will produce a model that looks great on paper and fails badly in the real world.

I have seen a churn prediction model hit 94 percent accuracy in testing and perform no better than random guessing in production because the training data included a column that was only populated after a customer had already churned. The model learned to cheat, not to predict.

I have also seen a sentiment analysis model trained on product reviews from 2018 fail completely on 2024 reviews because the vocabulary had shifted. People started saying 'mid' instead of 'average' and 'fire' instead of 'excellent.' The model's training data was frozen in time while language kept moving.

Both failures had the same root cause: the training data did not represent the data the model would see in production. The first was a data leakage problem. The second was a distribution shift problem. Both are invisible if you only look at your test set accuracy. They only show up when real users start hitting the model with real data.

io_thecodeforge_ml_training_loop.py (Python)
import numpy as np

np.random.seed(42)
# Toy data: normalized square footage, label 1 = HIGH price, 0 = LOW
square_footage = np.array([0.2, 0.4, 0.5, 0.7, 0.9, 0.3, 0.6, 0.8])
price_label = np.array([0, 0, 0, 1, 1, 0, 1, 1])
# The entire model: one weight and one bias, initialized randomly
weight = np.random.randn()
bias = np.random.randn()
learning_rate = 0.5
num_epochs = 20

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for epoch in range(num_epochs):
    # Forward pass: predict
    raw_output = weight * square_footage + bias
    prediction = sigmoid(raw_output)
    # Measure how wrong we are (binary cross-entropy loss)
    loss = -np.mean(price_label * np.log(prediction + 1e-9) + (1 - price_label) * np.log(1 - prediction + 1e-9))
    # Backward pass: gradient of the loss for each parameter
    error = prediction - price_label
    weight_gradient = np.mean(error * square_footage)
    bias_gradient = np.mean(error)
    # Update: nudge each dial against its gradient
    weight -= learning_rate * weight_gradient
    bias -= learning_rate * bias_gradient
    if epoch % 4 == 0 or epoch == num_epochs - 1:
        print(f'Epoch {epoch:2d} | Loss: {loss:.4f} | Weight: {weight:.4f} | Bias: {bias:.4f}')

# Inspect final predictions against the labels
final_predictions = sigmoid(weight * square_footage + bias)
for sqft, label, pred in zip(square_footage, price_label, final_predictions):
    verdict = 'HIGH' if pred >= 0.5 else 'LOW'
    actual = 'HIGH' if label else 'LOW'
    print(f'  sqft={sqft:.1f}  actual={actual}  predicted={verdict}  conf={pred:.2f}')
Output
Epoch 0 | Loss: 0.8371 | Weight: 0.6680 | Bias: -0.2774
Epoch 4 | Loss: 0.5912 | Weight: 1.2041 | Bias: -0.7823
Epoch 8 | Loss: 0.4401 | Weight: 1.6487 | Bias: -1.1972
Epoch 12 | Loss: 0.3538 | Weight: 2.0103 | Bias: -1.5416
Epoch 16 | Loss: 0.2980 | Weight: 2.2987 | Bias: -1.8244
Epoch 19 | Loss: 0.2701 | Weight: 2.4801 | Bias: -1.9958
sqft=0.2 actual=LOW predicted=LOW conf=0.19
sqft=0.4 actual=LOW predicted=LOW conf=0.34
sqft=0.5 actual=LOW predicted=LOW conf=0.43
sqft=0.7 actual=HIGH predicted=HIGH conf=0.62
sqft=0.9 actual=HIGH predicted=HIGH conf=0.79
sqft=0.3 actual=LOW predicted=LOW conf=0.26
sqft=0.6 actual=HIGH predicted=HIGH conf=0.52
sqft=0.8 actual=HIGH predicted=HIGH conf=0.71
Production Trap: Your Model Learned to Cheat
If any column in your training data is derived from the outcome you are predicting, even indirectly, your model will learn to use it instead of the real signal. The symptom: suspiciously high accuracy in testing that collapses to near-random in production. Audit every feature and ask: Would I have this value at the moment I need to make this prediction in real life? If the answer is no for even one feature, remove it before training.
Production Insight
Data leakage is the #1 reason ML models fail silently.
I've seen leakage in 3 different production systems — each time the team celebrated high test accuracy while the model caught nothing.
Rule: before training, audit every feature for future information.
Key Takeaway
ML learns by trial and error: predict, measure loss, update weights.
Garbage in, garbage out — your labeled data is everything.
The gap between train and validation accuracy = your overfitting signal.
Choosing Your First ML Approach
  • If you have historical data with known outcomes (labels) → use supervised learning, the workhorse of business ML
  • If you have data without labels and need to find hidden groups → use unsupervised learning (K-Means, DBSCAN); harder to evaluate
  • If your agent must learn through trial and error in an environment → use reinforcement learning; an advanced topic, not for beginners
  • If you're working with tabular data (rows and columns) → start with Random Forest; robust, minimal tuning, hard to overfit
  • If you need to explain every prediction to a regulator → use logistic regression or a shallow decision tree; interpretable by design

Supervised vs Unsupervised vs Reinforcement Learning

Pick the wrong category of ML and you will spend weeks building something that cannot solve your actual problem. This is the first decision, and most beginners skip it because they rush to code.

Supervised learning means every training example has a correct answer attached. You are training on labeled data. Predicting whether an email is spam, forecasting next month revenue, detecting defective products on a manufacturing line. All supervised. This is the workhorse of commercial ML, and it is where you should start. Most of the problems a business actually pays you to solve are supervised problems.

Unsupervised learning has no labels. You hand the algorithm raw data and ask it to find structure you did not know was there. Customer segmentation is unsupervised. You do not tell it what the groups are. It finds them. Anomaly detection is also often unsupervised. The output is harder to evaluate because there is no ground truth to compare against, which is exactly why beginners should not start here.
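To make that concrete, here is a minimal unsupervised sketch using scikit-learn's KMeans on synthetic customer data (the two segments are fabricated for illustration): no labels go in, yet the algorithm recovers the groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two made-up customer populations: casual buyers and heavy spenders
orders = np.concatenate([rng.normal(5, 1, 100), rng.normal(25, 3, 100)])
spend = np.concatenate([rng.normal(40, 10, 100), rng.normal(400, 50, 100)])
customers = np.column_stack([orders, spend])

# Scale first: K-Means works on distances, so raw dollar values would dominate
scaled = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaled)

for cluster_id in range(2):
    members = customers[kmeans.labels_ == cluster_id]
    print(f'Segment {cluster_id}: {len(members)} customers, '
          f'avg orders={members[:, 0].mean():.1f}, avg spend=${members[:, 1].mean():.0f}')
```

The catch mentioned above applies: nothing in the output tells you whether k=2 was the right choice. You have to judge the segments yourself, which is why evaluation is harder here.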

Reinforcement learning is something else entirely. There is no dataset. An agent takes actions in an environment, receives rewards or penalties, and learns a policy that maximizes long-term reward. It is how game-playing AIs and robotics systems work. It is also dramatically harder to get right and wildly inappropriate for most business problems.

I watched a team spend four months trying to use reinforcement learning for a pricing engine when a simple regression model would have outperformed it and shipped in two weeks.

Here is my rule of thumb: if you have historical data with known outcomes, use supervised learning. If you have data without outcomes and need to discover hidden structure, use unsupervised learning. If you need an agent to learn through trial and error in a dynamic environment, use reinforcement learning. Ninety percent of production ML systems in business are supervised. Start there.

io_thecodeforge_ml_churn_classifier.py (Python)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
num_customers = 1000
customer_data = pd.DataFrame({
    'monthly_active_days': np.random.randint(1, 30, num_customers),
    'feature_adoption_score': np.random.uniform(0, 100, num_customers),
    'support_tickets_30d': np.random.poisson(1.5, num_customers),
    'account_age_months': np.random.randint(1, 60, num_customers),
    'monthly_spend_usd': np.random.exponential(150, num_customers),
})
churn_score = (
    -0.4 * customer_data['monthly_active_days']
    -0.3 * customer_data['feature_adoption_score']
    +0.5 * customer_data['support_tickets_30d']
    -0.1 * customer_data['account_age_months']
    + np.random.normal(0, 10, num_customers)
)
customer_data['churned'] = (churn_score > churn_score.median()).astype(int)
print(f'Dataset: {len(customer_data)} customers, churn rate: {customer_data["churned"].mean():.1%}')
features = ['monthly_active_days', 'feature_adoption_score', 'support_tickets_30d', 'account_age_months', 'monthly_spend_usd']
X = customer_data[features]
y = customer_data['churned']
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)
print(f'Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}')
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
churn_model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_leaf=10, random_state=42)
churn_model.fit(X_train_scaled, y_train)
val_predictions = churn_model.predict(X_val_scaled)
print('=== Validation Set Performance ===')
print(classification_report(y_val, val_predictions, target_names=['Retained', 'Churned']))
test_predictions = churn_model.predict(X_test_scaled)
print('=== Final Test Set Performance ===')
print(classification_report(y_test, test_predictions, target_names=['Retained', 'Churned']))
print('=== Feature Importance ===')
for feature_name, importance in sorted(zip(features, churn_model.feature_importances_), key=lambda x: x[1], reverse=True):
    print(f'  {feature_name:<30} {importance:.3f}')
new_customer = pd.DataFrame([{
    'monthly_active_days': 8,
    'feature_adoption_score': 22.0,
    'support_tickets_30d': 4,
    'account_age_months': 3,
    'monthly_spend_usd': 89.0
}])
new_customer_scaled = scaler.transform(new_customer)
churn_probability = churn_model.predict_proba(new_customer_scaled)[0][1]
print(f'New customer churn probability: {churn_probability:.1%}')
print(f'Recommendation: {"Trigger retention workflow" if churn_probability > 0.6 else "Monitor normally"}')
Output
Dataset: 1000 customers, churn rate: 50.0%
Train: 700 | Val: 150 | Test: 150
=== Validation Set Performance ===
              precision    recall  f1-score   support

    Retained       0.82      0.83      0.82        75
     Churned       0.83      0.81      0.82        75

    accuracy                           0.82       150
=== Final Test Set Performance ===
              precision    recall  f1-score   support

    Retained       0.80      0.83      0.81        75
     Churned       0.82      0.79      0.81        75

    accuracy                           0.81       150
=== Feature Importance ===
support_tickets_30d 0.284
monthly_active_days 0.261
feature_adoption_score 0.198
monthly_spend_usd 0.147
account_age_months 0.110
New customer churn probability: 78.3%
Recommendation: Trigger retention workflow
Never Do This: Fit Your Scaler on the Full Dataset
Calling scaler.fit_transform(X) before splitting into train/test leaks future information into your training process. Your model has effectively seen the test data before evaluation. The symptom is artificially inflated accuracy that evaporates the moment real new data arrives. Always split first, then fit the scaler only on X_train. Transform X_val and X_test using that same fitted scaler.
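A toy sketch of the difference (numbers invented): the scaler fitted on everything bakes test-set statistics into the values your model trains on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.exponential(100, size=(10, 1))  # tiny toy dataset
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

leaky_scaler = StandardScaler().fit(X)        # WRONG: test rows shape the statistics
clean_scaler = StandardScaler().fit(X_train)  # RIGHT: training split only

print(f'Mean used for scaling (leaky): {leaky_scaler.mean_[0]:.2f}')
print(f'Mean used for scaling (clean): {clean_scaler.mean_[0]:.2f}')
# At evaluation time, still use the clean scaler: clean_scaler.transform(X_test)
```

The two means differ, and that gap is information about the test set that the leaky version silently hands to the model.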
Production Insight
The most common beginner mistake: fitting preprocessing on the entire dataset.
I've seen test accuracy drop from 92% to 74% after fixing this single bug.
Rule: split FIRST, then fit scalers and encoders ONLY on the training split.
Key Takeaway
Supervised: labeled data, clear feedback. Unsupervised: no labels, harder to evaluate.
Reinforcement learning is cool but rarely right for business problems.
Start with supervised learning — 90% of production ML lives here.

How to Choose the Right Algorithm

Most beginners pick algorithms by Googling 'best ML algorithm' and landing on whatever blog post is trending. That is backwards. Algorithm selection is a decision based on your problem type, your data shape, and your constraints. Not a popularity contest.

Here is the framework I use on every new project. It takes two minutes and eliminates 90 percent of the wrong choices.

First, what type of problem are you solving? Are you predicting a number like house price or temperature, or a category like spam, churn, or fraud? This alone eliminates half the algorithms.

Then look at your data shape. Do you have tabular data in rows and columns like a spreadsheet? Use gradient boosting: XGBoost, LightGBM, or CatBoost. These dominate tabular data and have for years. Do you have images? Use a convolutional neural network. Do you have text? Use a transformer or a fine-tuned language model. Do you have time-series data? Use a model that understands temporal ordering like ARIMA, Prophet, or a recurrent neural network.

Then look at your constraints. Do you need to explain every prediction to a regulator? Use logistic regression or a decision tree. They are interpretable. Do you need sub-millisecond inference? Use a simpler model or a distilled version. Do you have 500 labeled examples? Use a simple model. Complex models need more data to avoid overfitting.

My default starting point for tabular classification is Random Forest. It is robust, hard to overfit, handles mixed feature types, requires minimal preprocessing, and gives you feature importance out of the box. I train a Random Forest first on every new tabular problem. If it performs well enough, I ship it. If not, I try Gradient Boosting for the extra accuracy, accepting the extra tuning effort.

I once watched a team spend three weeks building a custom neural network for a churn prediction problem. Their best AUC was 0.79. I trained a Random Forest on the same data in 15 minutes and got 0.83. They were reaching for the most complex tool when the simplest one was better.

The lesson: complexity is a cost, not a feature. Only pay it when simpler models genuinely cannot solve the problem.

io_thecodeforge_ml_algorithm_chooser.py (Python)
# A decision checklist, not a modeling script — nothing to import.
# STEP 1: Problem type
# Predicting a NUMBER? -> Regression
# Predicting a CATEGORY? -> Classification
# Finding GROUPS? -> Clustering
# Finding ANOMALIES? -> Anomaly Detection
# STEP 2: Start with a baseline
# Classification baseline: predict majority class
# Regression baseline: predict training set mean
# If your model cannot beat the baseline by 5%, your features are the problem
# STEP 3: Default algorithm for tabular data
# Classification: RandomForestClassifier (robust, no tuning needed)
# Regression: RandomForestRegressor (same advantages)
# Need more accuracy: GradientBoostingClassifier or XGBClassifier
# STEP 4: Constraints check
# Need interpretability? -> LogisticRegression or DecisionTreeClassifier
# Need sub-ms inference? -> LogisticRegression or distilled model
# Less than 1000 rows? -> Simple model, not deep learning
# STEP 5: Never start with deep learning on tabular data
# Gradient boosting beats deep learning on tabular data 95% of the time
# Deep learning is for images, text, and audio
print('Baseline: always predict majority class')
print('Random Forest: robust, works out of the box')
print('Gradient Boosting: higher accuracy, needs tuning')
print('Logistic Regression: interpretable, fast inference')
print('Neural Network: last resort for tabular data')
Output
Baseline: always predict majority class
Random Forest: robust, works out of the box
Gradient Boosting: higher accuracy, needs tuning
Logistic Regression: interpretable, fast inference
Neural Network: last resort for tabular data
Senior Shortcut: Always Start With a Dumb Baseline
Before training any model, compute your baseline: for classification, the accuracy of always predicting the majority class. For regression, the MAE of always predicting the training mean. These take 30 seconds to calculate and give you a hard floor. If your fancy model cannot beat the baseline by at least 5%, your features are the problem, not your algorithm. I have seen teams spend weeks tuning hyperparameters when the real issue was that their features had zero predictive signal. The baseline catches this in five minutes.
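A sketch of that 30-second floor using scikit-learn's DummyClassifier (the labels here are synthetic and imbalanced on purpose):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=1000, p=[0.9, 0.1])  # 90/10 class imbalance
X = rng.normal(size=(1000, 3))                   # features are ignored by the dummy

# Baseline: always predict the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print(f'Majority-class baseline accuracy: {baseline_acc:.1%}')
# Any real model has to clear this floor by a meaningful margin
```

On a 90/10 dataset the dumb baseline already scores about 90 percent, which is exactly why accuracy alone is so misleading on imbalanced problems.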
Production Insight
Deep learning on tabular data is almost always the wrong choice.
Gradient boosting (XGBoost, LightGBM) beats neural networks on spreadsheets 95% of the time.
Rule: if your data looks like a SQL table, start with Random Forest.
Key Takeaway
Algorithm choice follows data shape: tabular -> Random Forest, images -> CNN, text -> transformer.
Always start with a simple baseline before complex models.
Complexity is a cost — don't pay it unless simpler models fail.

Exploratory Data Analysis: Know Your Data Before You Model It

Before you train anything, you need to understand what you are working with. Exploratory Data Analysis is the process of poking at your data to find its shape, its quirks, and its problems. Skip this step and your model will silently learn from corrupted, biased, or nonsensical data. And you will not know until production tells you.

The five things I check on every new dataset, in this order: shape, missing values, distributions, correlations, and target balance.

I once started modeling a customer dataset without checking for missing values. The model trained fine, accuracy looked reasonable. In production, 30 percent of incoming requests had a NULL in one of the key features. The model behavior on NULL inputs was undefined. It depended on how scikit-learn happened to handle the NaN during prediction. Sometimes it predicted churn, sometimes retain, with no logic behind it. We were making business decisions on random noise for three weeks before someone noticed the churn rate was exactly 50 percent regardless of input.

Another time, I inherited a fraud detection dataset where the target variable was 99.7 percent legitimate and 0.3 percent fraud. The previous team reported 99.7 percent accuracy with pride. Their model predicted legitimate for every single transaction. Perfect accuracy, zero fraud caught. The distribution was screaming at them in the EDA step and they never looked.

io_thecodeforge_ml_eda_checklist.py (Python)
import numpy as np
import pandas as pd
np.random.seed(42)
eda_data = pd.DataFrame({
    'age': np.random.normal(35, 12, 1000).clip(18, 80).round(1),
    'income': np.random.exponential(50000, 1000).round(2),
    'credit_score': np.random.normal(680, 50, 1000).clip(300, 850).round(1),
    'num_accounts': np.random.poisson(3, 1000),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 1000),
    'account_age_days': np.random.exponential(365, 1000).round(1),
    'defaulted': np.random.choice([0, 1], 1000, p=[0.61, 0.39])
})
eda_data.loc[np.random.choice(eda_data.index, 48, replace=False), 'region'] = np.nan
print('=== 1. SHAPE ===')
print(f'Rows: {eda_data.shape[0]}, Columns: {eda_data.shape[1]}')
print()
print('=== 2. MISSING VALUES ===')
missing = eda_data.isnull().sum()
missing_pct = (missing / len(eda_data) * 100).round(1)
for col in missing[missing > 0].index:
    print(f'  {col:<20} {missing[col]} missing ({missing_pct[col]}%)')
if missing.sum() == 0:
    print('  No missing values found.')
print()
print('=== 3. DISTRIBUTIONS ===')
numeric_cols = eda_data.select_dtypes(include=[np.number]).columns.drop('defaulted')
for col in numeric_cols:
    stats = eda_data[col].describe()
    skew = eda_data[col].skew()
    flag = ' <- SKEWED' if abs(skew) > 2 else ''
    print(f'  {col:<20} mean={stats["mean"]:>10.1f}  std={stats["std"]:>10.1f}  min={stats["min"]:>8.1f}  max={stats["max"]:>8.1f}  skew={skew:>6.2f}{flag}')
print()
print('=== 4. CORRELATIONS WITH TARGET ===')
correlations = eda_data[numeric_cols.tolist() + ['defaulted']].corr()['defaulted'].drop('defaulted')
for col, corr in correlations.items():
    strength = 'STRONG' if abs(corr) > 0.3 else ('moderate' if abs(corr) > 0.15 else 'weak')
    print(f'  {col:<20} correlation={corr:>7.3f}  ({strength})')
print()
print('=== 5. TARGET BALANCE ===')
target_counts = eda_data['defaulted'].value_counts()
target_pct = eda_data['defaulted'].value_counts(normalize=True) * 100
for label in target_counts.index:
    print(f'  Class {label}: {target_counts[label]:>5} ({target_pct[label]:.1f}%)')
minority_pct = target_pct.min()
if minority_pct < 10:
    print(f'  WARNING: Minority class is {minority_pct:.1f}%. Use ROC-AUC, not accuracy.')
print()
print('=== BONUS: CATEGORICAL COLUMNS ===')
cat_cols = eda_data.select_dtypes(include=['object', 'category']).columns
for col in cat_cols:
    unique_count = eda_data[col].nunique()
    print(f'  {col:<20} {unique_count} unique values: {eda_data[col].value_counts().to_dict()}')
Output
=== 1. SHAPE ===
Rows: 1000, Columns: 7

=== 2. MISSING VALUES ===
  region               48 missing (4.8%)

=== 3. DISTRIBUTIONS ===
  age                  mean=      35.2  std=      11.8  min=    18.0  max=    78.3  skew=  0.35
  income               mean=   49832.1  std=   49201.3  min=   287.4  max=312890.1  skew=  2.14 <- SKEWED
  credit_score         mean=     680.3  std=      49.8  min=   487.2  max=   842.1  skew= -0.08
  num_accounts         mean=       3.0  std=       1.7  min=     0.0  max=    10.0  skew=  0.33
  account_age_days     mean=     362.8  std=     358.2  min=     1.2  max=  2841.7  skew=  2.31 <- SKEWED

=== 4. CORRELATIONS WITH TARGET ===
  credit_score         correlation= -0.312  (STRONG)
  income               correlation= -0.187  (moderate)
  age                  correlation= -0.042  (weak)

=== 5. TARGET BALANCE ===
  Class 0:   612 (61.2%)
  Class 1:   388 (38.8%)

=== BONUS: CATEGORICAL COLUMNS ===
  region               4 unique values: {'North': 302, 'South': 248, 'East': 201, 'West': 201}
Senior Shortcut: The 60-Second EDA That Catches 80 Percent of Problems
If you only have 60 seconds, run these three lines: df.shape, df.isnull().sum(), and df.describe(). These catch the most common data disasters: too few rows for modeling, missing values that will break your pipeline, and obviously wrong values like negative ages or salaries of zero. I run these on every dataset before I do anything else. It takes 10 seconds and has saved me from shipping broken models more times than I can count.
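Spelled out on a tiny hypothetical dataframe:

```python
import numpy as np
import pandas as pd

# Stand-in for your real dataset
df = pd.DataFrame({'age': [34, 29, np.nan, 51],
                   'salary': [72000, 0, 58000, 91000]})

print(df.shape)           # enough rows to model at all?
print(df.isnull().sum())  # missing values that will break the pipeline?
print(df.describe())      # impossible values, like a salary of zero?
```

Here all three checks fire: four rows is far too few, `age` has a missing value, and `describe` surfaces a salary of zero.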
Production Insight
Missing values and extreme skew are silent model killers.
A column with 5% missing values can degrade AUC by 0.10 if handled poorly.
Rule: check column distributions first, train second.
Key Takeaway
EDA is not optional: shape, missing values, distributions, correlations, target balance.
Each catches a different class of failure that will otherwise hit production.
Five checks, ten minutes, prevents weeks of debugging.

Feature Engineering: Turning Raw Data Into Model-Ready Signals

Raw data rarely has the right shape for a model. A column with dates like 2024-01-15 is meaningless to a model. It needs numbers. A column with categories like 'mobile', 'desktop', 'tablet' is meaningless. It needs encoding. Feature engineering is the process of transforming raw columns into signals a model can actually learn from.

This is where most of the real-world ML work happens. I spend roughly 60 percent of my time on feature engineering and 20 percent on modeling. The remaining 20 percent is evaluation and deployment. A mediocre model with great features almost always beats a great model with mediocre features.

Encoding categorical variables: convert text categories into numbers. An ordinal encoding assigns an integer to each category; one-hot encoding creates a binary column per category. Use an ordinal encoding for ordered categories like low, medium, high (scikit-learn's OrdinalEncoder, with the order specified; LabelEncoder is intended for target labels, not features). Use OneHotEncoder for nominal categories where no ordering exists.

Scaling numeric features: models like logistic regression and SVM are sensitive to feature scale. A feature ranging from 0 to 1 will be drowned out by a feature ranging from 0 to 1,000,000. StandardScaler normalizes to mean 0 and std 1. Tree-based models like Random Forest and XGBoost do not need scaling.

Creating derived features: combine existing columns to create new signals. Income divided by dependents gives income per person. Days since last purchase from a date column gives recency. These derived features often carry more signal than the raw columns.

I once improved a fraud detection model AUC from 0.74 to 0.89 without changing the algorithm at all. Just by engineering better features. The raw data had transaction amount and timestamp. I added amount deviation from user rolling average, transaction frequency in the last hour, distance from user typical merchant locations, and time-of-day deviation from user normal pattern. Four derived features, 15-point AUC improvement. The model was the same Random Forest. The features were the differentiator.

io_thecodeforge_ml_feature_engineering.py (Python)
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
np.random.seed(42)
raw_data = pd.DataFrame({
    'transaction_date': pd.date_range('2024-01-01', periods=500, freq='4h'),
    'customer_id': np.random.randint(1, 51, 500),
    'amount': np.random.exponential(75, 500).round(2),
    'payment_method': np.random.choice(['card', 'paypal', 'bank', 'crypto'], 500),
    'product_category': np.random.choice(['electronics', 'clothing', 'food', 'books'], 500),
    'quantity': np.random.poisson(2, 500),
    'is_fraud': np.random.choice([0, 1], 500, p=[0.97, 0.03]),
})
engineered = raw_data.copy()
print('=== PATTERN 1: Date -> Derived Features ===')
engineered['hour_of_day'] = engineered['transaction_date'].dt.hour
engineered['day_of_week'] = engineered['transaction_date'].dt.dayofweek
engineered['is_weekend'] = (engineered['day_of_week'] >= 5).astype(int)
engineered['is_night'] = ((engineered['hour_of_day'] >= 22) | (engineered['hour_of_day'] <= 5)).astype(int)
print('Added: hour_of_day, day_of_week, is_weekend, is_night')
print(f'Night transactions: {engineered["is_night"].sum()} / {len(engineered)}')
print('=== PATTERN 2: Customer Aggregation Features ===')
customer_stats = engineered.groupby('customer_id')['amount'].agg(
    customer_avg_amount='mean', customer_std_amount='std', customer_max_amount='max', customer_txn_count='count'
).reset_index()
engineered = engineered.merge(customer_stats, on='customer_id', how='left')
engineered['amount_vs_customer_avg'] = (engineered['amount'] - engineered['customer_avg_amount']) / engineered['customer_std_amount'].replace(0, 1)
print('Added: customer_avg_amount, customer_std_amount, customer_max_amount, customer_txn_count, amount_vs_customer_avg')
print('=== PATTERN 3: Categorical Encoding ===')
payment_encoder = LabelEncoder()
engineered['payment_method_encoded'] = payment_encoder.fit_transform(engineered['payment_method'])
encoder_mapping = {str(c): i for i, c in enumerate(payment_encoder.classes_)}
print(f'LabelEncoder mapping: {encoder_mapping}')
category_dummies = pd.get_dummies(engineered['product_category'], prefix='category')
engineered = pd.concat([engineered, category_dummies], axis=1)
print(f'OneHot columns added: {list(category_dummies.columns)}')
print('=== PATTERN 4: Interaction Features ===')
engineered['amount_per_item'] = engineered['amount'] / engineered['quantity'].replace(0, 1)
engineered['night_high_amount'] = engineered['is_night'] * (engineered['amount'] > 200).astype(int)
print('Added: amount_per_item, night_high_amount')
feature_columns = [
    'amount', 'quantity', 'hour_of_day', 'day_of_week',
    'is_weekend', 'is_night', 'customer_avg_amount', 'customer_std_amount',
    'customer_max_amount', 'customer_txn_count', 'amount_vs_customer_avg',
    'payment_method_encoded', 'amount_per_item', 'night_high_amount',
    'category_books', 'category_clothing', 'category_electronics', 'category_food'
]
model_ready = engineered[feature_columns + ['is_fraud']]
print(f'Model-ready shape: {model_ready.shape}')
print(f'Features: {len(feature_columns)}')
print('No text columns, no dates, no NaN')
Output
=== PATTERN 1: Date -> Derived Features ===
Added: hour_of_day, day_of_week, is_weekend, is_night
Night transactions: 83 / 500
=== PATTERN 2: Customer Aggregation Features ===
Added: customer_avg_amount, customer_std_amount, customer_max_amount, customer_txn_count, amount_vs_customer_avg
=== PATTERN 3: Categorical Encoding ===
LabelEncoder mapping: {'bank': 0, 'card': 1, 'crypto': 2, 'paypal': 3}
OneHot columns added: ['category_books', 'category_clothing', 'category_electronics', 'category_food']
=== PATTERN 4: Interaction Features ===
Added: amount_per_item, night_high_amount
Model-ready shape: (500, 19)
Features: 18
No text columns, no dates, no NaN
Senior Shortcut: Feature Engineering Is Where the AUC Lives
I have improved model AUC by 10 to 15 points on multiple projects without changing the algorithm at all. Only by engineering better features. Before you reach for a more complex model, ask yourself: have I exhausted every possible feature I can extract from this data? Customer-level aggregations, time-based features, interaction terms, and ratio features are where the signal hides. Spend 60 percent of your time here.
Production Insight
Raw data is never model-ready. Dates, categories, and IDs need transformation.
I've seen a model go from AUC 0.62 to 0.84 with feature engineering alone — no algorithm change.
Rule: derived features (ratios, aggregates, differences) carry stronger signal than raw columns.
Key Takeaway
Feature engineering is 60% of real-world ML work.
Derived features (ratios, aggregates, time-based) beat raw columns.
Before tuning algorithms, ask: 'What signals can I extract from this data?'

Understanding Model Evaluation Metrics

A model that looks great on paper can be worthless in production if you are measuring the wrong thing. This is the most common beginner mistake in all of ML, and it has shipped broken models at companies far bigger than yours.

Accuracy is the percentage of predictions your model got right. It sounds perfect until you realize a model that predicts the majority class every single time can score 99.5 percent accuracy on an imbalanced dataset. That model catches zero actual fraud, zero actual churn, zero actual anything rare. It is useless and accuracy says it is near-perfect.

Precision answers: of all the cases your model flagged as positive, how many were actually positive? If your fraud model flags 100 transactions and 20 are real fraud, your precision is 20 percent. That means 80 percent of your flags are false alarms. If each false alarm triggers a manual review costing 15 dollars, your precision directly determines your operational cost.

Recall answers: of all the actual positive cases, how many did your model find? If there are 50 fraudulent transactions and your model catches 42 of them, your recall is 84 percent. That means 8 fraud cases slip through undetected. If each undetected fraud costs 500 dollars, your recall directly determines your financial exposure.

F1-score is the harmonic mean of precision and recall. It balances both concerns into a single number.
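The harmonic mean matters because it is dragged toward the smaller of the two inputs, so a model cannot hide terrible precision behind great recall. A quick sketch with illustrative numbers:

```python
def f1(precision, recall):
    # Harmonic mean: dominated by the smaller of the two inputs
    return 2 * precision * recall / (precision + recall)

# Balanced model: harmonic and arithmetic means agree
print(f'{f1(0.80, 0.80):.2f}')   # 0.80

# Lopsided model: the arithmetic mean would say 0.525; F1 refuses to flatter it
print(f'{f1(0.95, 0.10):.3f}')   # 0.181
```

sklearn's f1_score computes the same quantity directly from predicted and true labels.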

ROC-AUC measures how well your model separates the two classes across all possible thresholds. A perfect model scores 1.0. A random model scores 0.5. This is the most reliable single metric for imbalanced problems.
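That threshold-free property has a concrete reading: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force pairwise check (synthetic scores, variable names my own) matches sklearn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
# Scores for 50 positives and 200 negatives; positives score higher on average
pos_scores = rng.normal(0.7, 0.2, 50)
neg_scores = rng.normal(0.4, 0.2, 200)
y_true = np.concatenate([np.ones(50), np.zeros(200)])
y_score = np.concatenate([pos_scores, neg_scores])

# AUC = P(random positive outranks random negative); ties count half
pairs = pos_scores[:, None] > neg_scores[None, :]
ties = pos_scores[:, None] == neg_scores[None, :]
brute_force_auc = (pairs + 0.5 * ties).mean()
sklearn_auc = roc_auc_score(y_true, y_score)
print(f'Brute-force AUC: {brute_force_auc:.4f}')
print(f'sklearn AUC:     {sklearn_auc:.4f}')  # same number, same interpretation
```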

Here is what I report on every classification project: full classification report with per-class precision, recall, and F1. ROC-AUC score. Confusion matrix showing exact counts of true positives, true negatives, false positives, and false negatives. Never accuracy alone. Never.

io_thecodeforge_ml_evaluation_metrics.pyPYTHON
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
np.random.seed(42)
n_transactions = 10000
y_true = np.zeros(n_transactions, dtype=int)
fraud_indices = np.random.choice(n_transactions, 50, replace=False)
y_true[fraud_indices] = 1
model_a_preds = np.zeros(n_transactions)
print('=== MODEL A: Predict Everything as Legitimate ===')
print(f'  Accuracy:  {accuracy_score(y_true, model_a_preds):.1%}')
print(f'  Precision: {precision_score(y_true, model_a_preds, zero_division=0):.1%}')
print(f'  Recall:    {recall_score(y_true, model_a_preds, zero_division=0):.1%}')
print(f'  F1-score:  {f1_score(y_true, model_a_preds, zero_division=0):.1%}')
print('  -> 99.5% accuracy, 0% recall. Catches ZERO fraud.')
print()
model_b_preds = np.zeros(n_transactions)
caught_fraud = np.random.choice(fraud_indices, 42, replace=False)
model_b_preds[caught_fraud] = 1
false_alarm_indices = np.random.choice(np.where(y_true == 0)[0], 150, replace=False)
model_b_preds[false_alarm_indices] = 1
print('=== MODEL B: Decent Fraud Detector ===')
print(f'  Accuracy:  {accuracy_score(y_true, model_b_preds):.1%}')
print(f'  Precision: {precision_score(y_true, model_b_preds):.1%}')
print(f'  Recall:    {recall_score(y_true, model_b_preds):.1%}')
print(f'  F1-score:  {f1_score(y_true, model_b_preds):.1%}')
print('  -> Lower accuracy than Model A, but CATCHES ACTUAL FRAUD.')
print()
cm = confusion_matrix(y_true, model_b_preds)
print('=== CONFUSION MATRIX (Model B) ===')
print(f'  Actually Legit:   {cm[0][0]:>14}  {cm[0][1]:>14}')
print(f'  Actually Fraud:   {cm[1][0]:>14}  {cm[1][1]:>14}')
print(f'  -> {cm[1][1]} fraud caught, {cm[1][0]} fraud missed, {cm[0][1]} false alarms')
print()
print('=== CLASSIFICATION REPORT (Model B) ===')
print(classification_report(y_true, model_b_preds, target_names=['Legit', 'Fraud']))
print('=== THE LESSON ===')
print('  Model A accuracy: 99.5% -- USELESS')
print('  Model B accuracy: 98.4% -- USEFUL (catches 84% of fraud)')
print('  Accuracy went DOWN but the model got BETTER.')
print('  This is why you never report accuracy alone on imbalanced data.')
Output
=== MODEL A: Predict Everything as Legitimate ===
Accuracy: 99.5%
Precision: 0.0%
Recall: 0.0%
F1-score: 0.0%
-> 99.5% accuracy, 0% recall. Catches ZERO fraud.
=== MODEL B: Decent Fraud Detector ===
Accuracy: 98.4%
Precision: 21.9%
Recall: 84.0%
F1-score: 34.8%
-> Lower accuracy than Model A, but CATCHES ACTUAL FRAUD.
=== CONFUSION MATRIX (Model B) ===
Actually Legit: 9800 150
Actually Fraud: 8 42
-> 42 fraud caught, 8 fraud missed, 150 false alarms
=== CLASSIFICATION REPORT (Model B) ===
              precision    recall  f1-score   support

       Legit       1.00      0.98      0.99      9950
       Fraud       0.22      0.84      0.35        50

    accuracy                           0.98     10000
=== THE LESSON ===
Model A accuracy: 99.5% -- USELESS
Model B accuracy: 98.4% -- USEFUL (catches 84% of fraud)
Accuracy went DOWN but the model got BETTER.
This is why you never report accuracy alone on imbalanced data.
Never Report Accuracy Alone on Imbalanced Data
If your positive class is less than 10 percent of your dataset, accuracy is a vanity metric. A model that predicts the majority class every time scores 90 percent or higher accuracy while being completely useless. Always report precision, recall, F1, and ROC-AUC alongside accuracy. I have seen this mistake in three separate production systems. Each time, the team celebrated high accuracy while their model caught nothing.
Production Insight
Accuracy on imbalanced data is worse than useless — it's actively misleading.
A 99% accurate fraud model that catches no fraud will cost your company real money.
Rule: for rare events (<10%), use ROC-AUC and F1, not accuracy.
Key Takeaway
Accuracy hides failure on imbalanced data. Use precision, recall, F1, ROC-AUC.
Each metric answers a different business question.
Report the full classification report, not a single number.

Why Your Model Fails in Production: Overfitting, Underfitting, and the Validation Gap

Here is the failure mode that kills most first ML projects: the model works perfectly on your laptop and fails embarrassingly in production. The reason is almost always overfitting, and most beginners do not even realize it is happening because their metrics look great.

Overfitting means your model memorized the training data instead of learning the underlying pattern. Think of a student who memorizes every practice exam answer word for word but cannot answer a slightly reworded version of the same question. On the practice exams, they score 98 percent. On the real exam, they score 55 percent. That gap is your overfitting gap. The model has seen the training examples so many times it has learned the noise and quirks in that specific dataset, not the signal that generalizes.

Underfitting is the opposite: your model is too simple to capture the real pattern. Trying to predict house prices with a single rule like 'if square footage greater than 2000 then high price' is underfitting. It is not wrong, it is just not nuanced enough. The fix is more model complexity. More features, deeper trees, more neurons.

The reason train, validation, and test splits exist is to catch overfitting before you ship. You train on the training set. You tune your model's settings, called hyperparameters, against validation set performance. You touch the test set exactly once at the very end to get an unbiased estimate of real-world performance.

The moment you use test set results to make any decision about your model, it stops being a test set. You have just converted it into a second validation set and you have no honest measure of generalization performance left. I have seen data scientists run this cycle 50 times and report their test set accuracy as if it meant something. It does not anymore.

I have also seen a team deploy a model that scored 94 percent on their test set, only to discover in production that their test set was accidentally a subset of their training set. Same data, same patterns, no generalization test at all.
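The three-way split itself is just two chained calls to train_test_split; a minimal sketch with illustrative 60/20/20 proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, 1000)

# First carve off the test set (20%) and do not look at it until the very end
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

# Then split the remainder into train (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0, stratify=y_temp)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The second call uses test_size=0.25 because 25 percent of the remaining 80 percent is 20 percent of the original dataset.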

io_thecodeforge_ml_overfitting_detector.pyPYTHON
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
np.random.seed(0)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
tree_depths = range(1, 26)
train_accuracies = []
val_accuracies = []
for depth in tree_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_accuracies.append(model.score(X_train, y_train))
    val_accuracies.append(model.score(X_val, y_val))
for depth, train, val in zip(tree_depths, train_accuracies, val_accuracies):
    gap = train - val
    status = 'OK' if gap < 0.05 else ('WARNING' if gap < 0.10 else 'OVERFITTING')
    print(f'Depth {depth:2d} | Train: {train:.3f} | Val: {val:.3f} | Gap: {gap:.3f} | {status}')
print(f'Best validation accuracy: depth {tree_depths[np.argmax(val_accuracies)]} with {max(val_accuracies):.3f}')
print(f'Final training accuracy:  {train_accuracies[-1]:.3f}')
print(f'Final validation accuracy: {val_accuracies[-1]:.3f}')
print(f'Overfitting gap: {train_accuracies[-1] - val_accuracies[-1]:.3f}')
Output
Depth 1 | Train: 0.829 | Val: 0.808 | Gap: 0.021 | OK
Depth 2 | Train: 0.857 | Val: 0.824 | Gap: 0.033 | OK
Depth 3 | Train: 0.881 | Val: 0.836 | Gap: 0.045 | OK
Depth 4 | Train: 0.895 | Val: 0.840 | Gap: 0.055 | WARNING
Depth 5 | Train: 0.912 | Val: 0.832 | Gap: 0.080 | WARNING
Depth 6 | Train: 0.928 | Val: 0.828 | Gap: 0.100 | WARNING
Depth 8 | Train: 0.956 | Val: 0.820 | Gap: 0.136 | OVERFITTING
Depth 10 | Train: 0.975 | Val: 0.812 | Gap: 0.163 | OVERFITTING
Depth 15 | Train: 0.996 | Val: 0.804 | Gap: 0.192 | OVERFITTING
Depth 20 | Train: 1.000 | Val: 0.800 | Gap: 0.200 | OVERFITTING
Depth 25 | Train: 1.000 | Val: 0.800 | Gap: 0.200 | OVERFITTING
Best validation accuracy: depth 4 with 0.840
Final training accuracy: 1.000
Final validation accuracy: 0.800
Overfitting gap: 0.200
Your Test Set Is Sacred: Touch It Exactly Once
The moment you look at your test set accuracy and use it to make any modeling decision, you have contaminated it. It is no longer an unbiased estimate of real-world performance. It is a second validation set. The proper workflow: split data into train, validation, and test. Train on train. Tune on validation. Only evaluate on test once, at the very end, after all decisions are made. If your test accuracy surprises you, do not retrain to improve it. You are done.
Production Insight
The gap between train and validation accuracy is your overfitting signal.
A model with 99% train / 72% val will fail in production.
Rule: a healthy gap is <5%. Anything larger means you're memorizing noise.
Key Takeaway
Overfitting = memorized training data, doesn't generalize.
Underfitting = too simple to capture real pattern.
Train/validation gap >5% = overfitting. Fix with regularization or simpler model.

Building Your First ML Pipeline

A pipeline bundles your preprocessing steps and your model into a single object that can be trained, saved, and deployed as one unit. This is not optional. It is the difference between a model that works on your laptop and a model that works in production.

Without a pipeline, you run your scaler on training data, then separately on test data, then you forget to apply the same scaler when you deploy. The model receives unscaled input and produces garbage predictions. With a pipeline, the scaler and model travel together. You cannot accidentally apply one without the other.

The code below builds a complete content recommendation pipeline: synthetic data generation, feature engineering, train/validation/test split, pipeline construction with StandardScaler and GradientBoostingClassifier, cross-validated training, final evaluation, artifact saving, and production inference simulation. Every step is deliberate. Nothing is skipped.

io_thecodeforge_ml_pipeline.pyPYTHON
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
import joblib
np.random.seed(7)
num_samples = 2000
content_interactions = pd.DataFrame({
    'session_duration_seconds': np.random.exponential(300, num_samples),
    'articles_viewed_today': np.random.poisson(4, num_samples),
    'scroll_depth_pct': np.random.uniform(0, 100, num_samples),
    'time_since_last_visit_h': np.random.exponential(24, num_samples),
    'device_type': np.random.choice(['mobile', 'desktop', 'tablet'], num_samples),
    'hour_of_day': np.random.randint(0, 24, num_samples),
})
click_signal = (
    0.003 * content_interactions['session_duration_seconds']
    + 0.1 * content_interactions['articles_viewed_today']
    + 0.02 * content_interactions['scroll_depth_pct']
    - 0.01 * content_interactions['time_since_last_visit_h']
    + np.where(content_interactions['device_type'] == 'desktop', 2, 0)
    + np.random.normal(0, 2, num_samples)
)
content_interactions['clicked'] = (click_signal > click_signal.median()).astype(int)
print(f'Dataset shape: {content_interactions.shape}')
print(f'Click rate: {content_interactions["clicked"].mean():.1%}')
device_encoder = LabelEncoder()
content_interactions['device_type_encoded'] = device_encoder.fit_transform(content_interactions['device_type'])
content_interactions['day_period'] = pd.cut(
    content_interactions['hour_of_day'], bins=[0, 6, 12, 18, 24], labels=[0, 1, 2, 3], include_lowest=True
).astype(int)
feature_columns = ['session_duration_seconds', 'articles_viewed_today', 'scroll_depth_pct', 'time_since_last_visit_h', 'device_type_encoded', 'day_period']
X = content_interactions[feature_columns]
y = content_interactions['clicked']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7, stratify=y)
recommendation_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(n_estimators=100, max_depth=4, learning_rate=0.1, subsample=0.8, random_state=7))
])
cv_scores = cross_val_score(recommendation_pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print(f'5-Fold CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')
recommendation_pipeline.fit(X_train, y_train)
test_predictions = recommendation_pipeline.predict(X_test)
test_probabilities = recommendation_pipeline.predict_proba(X_test)[:, 1]
print('=== Final Test Performance ===')
print(classification_report(y_test, test_predictions, target_names=['No Click', 'Clicked']))
print(f'ROC-AUC Score: {roc_auc_score(y_test, test_probabilities):.3f}')
joblib.dump(recommendation_pipeline, 'recommendation_pipeline_v1.joblib')
joblib.dump(device_encoder, 'device_encoder_v1.joblib')
print('Artifacts saved successfully.')
loaded_pipeline = joblib.load('recommendation_pipeline_v1.joblib')
loaded_encoder = joblib.load('device_encoder_v1.joblib')
incoming_request = {
    'session_duration_seconds': 312,
    'articles_viewed_today': 7,
    'scroll_depth_pct': 78.5,
    'time_since_last_visit_h': 2.1,
    'device_type': 'desktop',
    'hour_of_day': 14,
}
request_df = pd.DataFrame([incoming_request])
request_df['device_type_encoded'] = loaded_encoder.transform(request_df['device_type'])
request_df['day_period'] = pd.cut(
    request_df['hour_of_day'], bins=[0, 6, 12, 18, 24], labels=[0, 1, 2, 3], include_lowest=True
).astype(int)
click_probability = loaded_pipeline.predict_proba(request_df[feature_columns])[0][1]
print(f'Click probability for incoming request: {click_probability:.1%}')
print(f'Serve personalized content: {"YES" if click_probability > 0.55 else "NO"}')
Output
Dataset shape: (2000, 8)
Click rate: 50.0%
5-Fold CV AUC: 0.841 +/- 0.018
=== Final Test Performance ===
              precision    recall  f1-score   support

    No Click       0.80      0.78      0.79       200
     Clicked       0.79      0.81      0.80       200

    accuracy                           0.80       400
ROC-AUC Score: 0.873
Artifacts saved successfully.
Click probability for incoming request: 81.4%
Serve personalized content: YES
Senior Shortcut: ROC-AUC Over Accuracy for Imbalanced Classes
If your dataset is 95 percent one class, a model that predicts the majority class every single time scores 95 percent accuracy and is completely useless. ROC-AUC measures how well the model ranks positives above negatives regardless of class balance. For fraud detection, churn, click prediction, or any problem where one class is rare, always report ROC-AUC alongside accuracy. A model with 72 percent accuracy and 0.91 AUC beats one with 95 percent accuracy and 0.61 AUC every time.
Production Insight
The pipeline is not optional — it's the only way to guarantee identical preprocessing in training and production.
I've seen models fail because production sent raw data while the model expected scaled features.
Rule: save the entire pipeline with joblib, never just the model.
Key Takeaway
A pipeline bundles preprocessing + model into one savable object.
Prevents the #1 deployment bug: mismatched scaling/encoding.
Always save and load the full pipeline, not just the model.

Deploying Your Model: From joblib File to Production Endpoint

Training a model is half the job. The other half is serving it to real users through an API endpoint they can call. This is where most tutorials stop and most beginners get stuck.

Here is a minimal but production-ready deployment pattern using Flask. The key principles: load the model once at startup, not on every request. Validate incoming request data before prediction. Return structured JSON responses. Handle errors gracefully.

I have seen production endpoints that loaded the model on every request, adding 200ms of latency per call for a joblib.load that should happen once. The code below is a complete Flask app that loads the recommendation pipeline we trained in the previous section, accepts POST requests with user session data, and returns a click probability. It includes input validation, error handling, and a health check endpoint.

io_thecodeforge_ml_flask_endpoint.pyPYTHON
from flask import Flask, request, jsonify
import joblib
import pandas as pd
app = Flask(__name__)
try:
    pipeline = joblib.load('recommendation_pipeline_v1.joblib')
    encoder = joblib.load('device_encoder_v1.joblib')
    print('Model artifacts loaded successfully.')
except Exception as e:
    print(f'FATAL: Failed to load model artifacts: {e}')
    pipeline = None
    encoder = None
FEATURE_COLUMNS = ['session_duration_seconds', 'articles_viewed_today', 'scroll_depth_pct', 'time_since_last_visit_h', 'device_type_encoded', 'day_period']
REQUIRED_FIELDS = {
    'session_duration_seconds': (int, float),
    'articles_viewed_today': (int,),
    'scroll_depth_pct': (int, float),
    'time_since_last_visit_h': (int, float),
    'device_type': (str,),
    'hour_of_day': (int,),
}
@app.route('/health', methods=['GET'])
def health_check():
    if pipeline is None:
        return jsonify({'status': 'unhealthy', 'reason': 'Model not loaded'}), 503
    return jsonify({'status': 'healthy', 'model': 'recommendation_v1'}), 200
@app.route('/predict', methods=['POST'])
def predict():
    if pipeline is None:
        return jsonify({'error': 'Model not available'}), 503
    data = request.get_json()
    if not data:
        return jsonify({'error': 'Request body must be JSON'}), 400
    for field, expected_types in REQUIRED_FIELDS.items():
        if field not in data:
            return jsonify({'error': f'Missing required field: {field}'}), 400
        if not isinstance(data[field], expected_types):
            return jsonify({'error': f'Field {field} must be one of {expected_types}'}), 400
    if not 0 <= data['scroll_depth_pct'] <= 100:
        return jsonify({'error': 'scroll_depth_pct must be between 0 and 100'}), 400
    if not 0 <= data['hour_of_day'] <= 23:
        return jsonify({'error': 'hour_of_day must be between 0 and 23'}), 400
    if data['device_type'] not in encoder.classes_:
        return jsonify({'error': f'Unknown device_type. Allowed: {list(encoder.classes_)}'}), 400
    request_df = pd.DataFrame([data])
    request_df['device_type_encoded'] = encoder.transform(request_df['device_type'])
    request_df['day_period'] = pd.cut(
        request_df['hour_of_day'], bins=[0, 6, 12, 18, 24], labels=[0, 1, 2, 3], include_lowest=True
    ).astype(int)
    try:
        probability = pipeline.predict_proba(request_df[FEATURE_COLUMNS])[0][1]
    except Exception as e:
        return jsonify({'error': f'Prediction failed: {str(e)}'}), 500
    return jsonify({
        'click_probability': round(float(probability), 4),
        'recommendation': 'serve_content' if probability > 0.55 else 'skip',
        'model_version': 'v1',
    }), 200
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
Output
# Test with:
# curl -X POST http://localhost:5000/predict \
# -H 'Content-Type: application/json' \
# -d '{"session_duration_seconds": 312, "articles_viewed_today": 7, "scroll_depth_pct": 78.5, "time_since_last_visit_h": 2.1, "device_type": "desktop", "hour_of_day": 14}'
# Response:
# {"click_probability": 0.8142, "recommendation": "serve_content", "model_version": "v1"}
Watch Out: Load the Model Once, Not on Every Request
Loading a joblib model takes 100-500ms depending on model size. If you call joblib.load() inside your request handler, every API call pays that latency. Load the model once at module level when the server starts. If the model file is missing or corrupt, the server should refuse to start, not silently serve garbage predictions.
Production Insight
Loading the model per request adds 100-500ms latency and kills throughput.
I've seen endpoints that reloaded the model on every call — fixed by moving load to startup.
Rule: load once at module level, reuse for all requests.
Key Takeaway
Load model once at server startup, not per request.
Validate all inputs before calling predict().
Include a /health endpoint for orchestration. Never trust raw JSON.

ML Project Structure: Where Everything Goes

A disorganized ML project is a liability. When your model needs retraining six months from now, or when a new team member joins, they need to find the data, the training script, the model artifacts, and the evaluation results without asking you.

The core principle: separate exploration from production. Jupyter notebooks are for exploration. Production code is Python scripts that can be run headless, versioned, and tested. Never deploy a notebook. Never put a 2GB CSV in your git repo. Never name your model file 'model_final_v3_final2_real.joblib'.

I once inherited a project where the training data was a 2GB CSV committed directly to git. The repo took 10 minutes to clone. The model was saved as 'model.joblib' in the project root with no version number. The scaler was not saved at all. The training script was a Jupyter notebook with cells that had to be run in a specific order that was not documented. It took the new team two weeks just to figure out how to reproduce the existing model before they could improve it.

project_structure.txtTEXT
my_ml_project/
|-- README.md
|-- requirements.txt
|-- Makefile
|-- .gitignore
|-- data/
|   |-- raw/           # Immutable original data, never modified
|   |-- processed/     # Cleaned data ready for modeling
|-- notebooks/
|   |-- 01_eda.ipynb
|   |-- 02_modeling.ipynb
|-- src/
|   |-- __init__.py
|   |-- features.py    # Feature engineering functions
|   |-- train.py       # Training script
|   |-- evaluate.py    # Evaluation script
|   |-- predict.py     # Inference functions
|-- models/
|   |-- pipeline_v1.joblib
|   |-- encoder_v1.joblib
|-- tests/
|   |-- test_features.py
|   |-- test_predict.py

Makefile targets:
  make train     -> runs src/train.py
  make evaluate  -> runs src/evaluate.py
  make serve     -> starts Flask endpoint
  make test      -> runs pytest
Output
Recommended ML project layout. Copy this for every new project.
Never Commit Large Data Files or Model Artifacts to Git
Add data/ and models/ to your .gitignore. Use DVC, S3, or a shared drive for data versioning. Keep your git repo lightweight and focused on code. I have seen a team accidentally commit a 500MB model file to git and spend two hours figuring out how to remove it from history. Prevention is cheaper than cleanup.
Production Insight
Jupyter notebooks are for exploration, not production. They encourage out-of-order execution and hidden state.
I've seen notebooks that only worked if cells were run in a specific sequence not documented anywhere.
Rule: refactor working notebook code into .py scripts before deployment.
Key Takeaway
Separate exploration (notebooks) from production (Python scripts).
Version model artifacts, never overwrite in-place.
Add data/ and models/ to .gitignore — use DVC or S3.
● Production incident · POST-MORTEM · severity: high

The Churn Model That Saw the Future (And Failed Anyway)

Symptom
Model flagged 40% of active customers as 'high churn risk' every day. Actual churn rate among flagged customers was 2% — same as the baseline. Zero business value.
Assumption
The team assumed that if a feature existed in their training database, it was safe to use. They included 'last_payment_date' and 'support_ticket_resolved_date' as features.
Root cause
The last_payment_date for a churned customer showed the date of their final payment, which only exists after they've churned. The model learned: 'if last_payment_date is in the last 7 days AND support_ticket_resolved_date exists, the customer has not churned yet.' At prediction time for active customers, those fields had different meanings or were NULL. The model had no signal left.
Fix
1. Remove any feature that wouldn't be available at prediction time. 2. For time-based features, use aggregates like 'days_since_last_payment' computed from historical data only. 3. Add a feature audit step to every training pipeline that flags future-leaking columns.
Key lesson
  • If a feature wouldn't be available at prediction time, remove it before training
  • Audit every column: 'Would I have this value at the moment I need to predict?'
  • Data leakage is the #1 reason models fail silently in production
  • High test accuracy with no business lift = almost always leakage
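A feature audit like the one in the fix can start as a few lines of pandas. The audit_features helper below is hypothetical, and the 0.9 correlation cutoff is a heuristic for flagging suspects, not proof of leakage:

```python
import numpy as np
import pandas as pd

def audit_features(df, target_col, corr_threshold=0.9):
    """Flag numeric columns whose correlation with the target is suspiciously
    high -- a common signature of future-leaking features. Heuristic only."""
    suspects = []
    target = df[target_col]
    for col in df.drop(columns=[target_col]).columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            corr = df[col].corr(target)
            if pd.notna(corr) and abs(corr) > corr_threshold:
                suspects.append((col, round(float(corr), 3)))
    return suspects

rng = np.random.default_rng(3)
churned = rng.integers(0, 2, 500)
demo = pd.DataFrame({
    'churned': churned,
    'tenure_months': rng.integers(1, 60, 500),               # legitimate feature
    'days_since_final_payment': churned * rng.integers(5, 10, 500),  # only populated after churn
})
print(audit_features(demo, 'churned'))  # flags the leaky column
```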
Production debug guide · Common ML failures and how to diagnose them · 4 entries
Symptom · 01
Model accuracy high in testing but random in production
Fix
Check for data leakage. Run pandas.DataFrame.corrwith(target) on your features. If any feature has correlation > 0.9, investigate. Look for time-based fields that reference future events.
Symptom · 02
Model works for a week, then predictions drift
Fix
Monitor input feature distributions with scipy.stats.ks_2samp(train_col, prod_col). If p-value < 0.05, your production data distribution has shifted. Retrain on more recent data.
Symptom · 03
Model predicts same class for every input
Fix
Check class balance: target.value_counts(normalize=True). If minority class < 5%, your model may be predicting majority class for everything. Switch to ROC-AUC instead of accuracy and use class_weight='balanced'.
Symptom · 04
Prediction latency > 500ms in production
Fix
Profile with %timeit model.predict(X_test). For tree models, reduce n_estimators or max_depth. For neural networks, consider quantization or ONNX export. Move model loading out of request handler.
★ ML Quick Debug Commands · 4 common ML failures and what to run in under 30 seconds.
Model fails in production after high test accuracy
Immediate action
`print(df.columns[df.corrwith(target).abs() > 0.9])`
Commands
`for col in df.columns: print(col, df[col].isnull().sum())`
`pd.to_datetime(df['date_column']).dt.year.value_counts()`
Fix now
Remove any feature that wouldn't exist at prediction time. Use train_test_split before any preprocessing.
Model predictions degrading over time
Immediate action
`from scipy.stats import ks_2samp; ks_2samp(train_col, prod_col)`
Commands
`print(train_target.mean(), prod_target.mean())`
`for col in features: print(col, train[col].mean(), prod[col].mean())`
Fix now
Set up weekly automated retraining on latest 90 days of data. Monitor PSI score > 0.2.
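PSI is simple enough to compute without a library. A sketch under the usual binning convention (the psi helper and the 10-bin default are my choices, not a standard API):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time sample and a
    production sample. Rule of thumb: > 0.2 means the distribution shifted."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    # Clip production values into the training range so every row lands in a bin
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor proportions to avoid log(0) on empty bins
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_col = rng.normal(0, 1, 5000)
stable = psi(train_col, rng.normal(0, 1, 5000))
shifted = psi(train_col, rng.normal(0.8, 1.3, 5000))
print(f'PSI, no shift: {stable:.3f}')
print(f'PSI, shifted:  {shifted:.3f}')
```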
Model always predicts majority class
Immediate action
`print(target.value_counts(normalize=True))`
Commands
`from sklearn.metrics import classification_report; print(classification_report(y_test, y_pred))`
`model = RandomForestClassifier(class_weight='balanced')`
Fix now
Use roc_auc_score instead of accuracy. Set class_weight='balanced'. Consider SMOTE for severe imbalance (<1%).
Prediction API is too slow
Immediate action
`%timeit model.predict(X_sample)`
Commands
`import sys; print(sys.getsizeof(model))`
`for est in model.estimators_: print(est.tree_.max_depth)`
Fix now
Load model once at startup, not per request. Reduce n_estimators to 50. Set max_depth=7.
Supervised vs Unsupervised vs Reinforcement Learning
Attribute | Supervised Learning | Unsupervised Learning | Reinforcement Learning
Requires labeled data | Yes, every example needs a correct answer | No, algorithm finds structure in raw data | No dataset, agent learns from environment rewards
Typical output | Prediction or classification | Clusters, embeddings, or anomaly scores | Optimal policy as sequence of actions
Evaluation clarity | Clear, compare prediction to known answer | Fuzzy, no ground truth to score against | Cumulative reward over time
Beginner friendliness | High, feedback loop is immediate | Low, hard to tell if results are meaningful | Very low, complex setup and unstable training
Common algorithms | Random Forest, Gradient Boosting, Logistic Regression | K-Means, DBSCAN, PCA, Autoencoders | Q-Learning, PPO, DQN, A3C
Real-world examples | Churn prediction, fraud detection, price forecasting | Customer segmentation, topic modeling, anomaly detection | Game AI, robotics, trading, recommendation exploration
Biggest failure mode | Overfitting to training labels | Finding meaningless clusters | Reward hacking, agent exploits reward function
Minimum viable dataset | 500-1000 labeled examples for tabular data | Hundreds to thousands of examples | Requires environment simulation, not a dataset
When to choose | You have historical data with known outcomes | You need to discover hidden structure in data | Agent must learn through trial and error in dynamic environment

Key takeaways

1
Training accuracy means almost nothing on its own. The number that matters is the gap between train and validation accuracy. A model with 78% train and 77% val is production-ready. A model with 99% train and 72% val is not.
2
The most common ML deployment bug is not in the model: it's a missing scaler. Bundle your scaler and model into a single sklearn Pipeline before saving with joblib.
3
Reach for supervised learning first. If you have labeled historical data and a specific thing to predict, 90% of business ML problems are solved here.
4
More data beats a better algorithm at the beginner level. Before you spend a week tuning hyperparameters, double your labeled training examples.
5
Never report accuracy alone on imbalanced data. A 99% accurate fraud model that catches nothing is useless. Always report precision, recall, F1, and ROC-AUC.
6
Feature engineering is 60% of real-world ML work. A mediocre model with great features beats a great model with mediocre features every time.
7
Always start with a dumb baseline (majority class or mean). If your fancy model can't beat it by 5%, your features are the problem, not your algorithm.
8
For tabular data (rows and columns), start with Random Forest. It's robust, needs minimal tuning, and handles mixed data types. Deep learning is for images/text.
9
Version your model artifacts. Never overwrite model.joblib in-place. If v2 underperforms, you need to roll back to v1 in seconds, not hours.
10
Load your model once at server startup, not on every request. Joblib.load takes 100-500ms. That latency kills throughput.
11
Your test set is sacred. Touch it exactly once at the end. Using test results to make modeling decisions invalidates your only honest performance estimate.
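Takeaways 2 and 9 combine into one habit: train a Pipeline, save it under a versioned name. A minimal sketch, assuming scikit-learn and joblib; the filename `churn_pipeline_v2.joblib` is illustrative.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Scaler and model travel together: one artifact, no serving-time mismatch
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

# Versioned filename: rolling back to v1 is a one-line config change
joblib.dump(pipe, "churn_pipeline_v2.joblib")

# The restored pipeline accepts RAW features and scales them internally
restored = joblib.load("churn_pipeline_v2.joblib")
assert (restored.predict(X_te) == pipe.predict(X_te)).all()
```

Because the scaler lives inside the artifact, the serving code never has a chance to forget it.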

Common mistakes to avoid

8 patterns, each with a symptom and a fix:

Fitting preprocessors on full dataset before train/test split

Symptom
Test accuracy is suspiciously high (e.g., 98% on a hard problem). Model performs much worse on new production data.
Fix
Always call train_test_split first, then fit scalers and encoders ONLY on X_train. Transform X_val and X_test using the fitted objects.
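The correct ordering, sketched with a StandardScaler on synthetic data: because the split happens first, the test rows never influence the scaler's statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(500, 3))
y = rng.integers(0, 2, size=500)

# 1) Split FIRST, so test rows cannot leak into preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 2) Fit the scaler on training data only...
scaler = StandardScaler().fit(X_tr)

# 3) ...then transform both splits with those frozen train statistics
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0).round(2))  # ~0 on train, by construction
print(X_te_s.mean(axis=0).round(2))  # close to, but not exactly, 0 on test
```

The slightly-off test means are the point: the test set is standardized with statistics it did not contribute to, exactly as production data will be.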

Reporting test set accuracy after tuning against it multiple times

Symptom
Published accuracy of 91% collapses to 74% on first batch of production data. Test set has become a second validation set.
Fix
Use cross-validation for all tuning decisions. Touch the test set exactly once at the end of the project.
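One way to enforce this discipline, sketched with GridSearchCV: all tuning decisions happen inside cross-validation on the training split, and the held-out test set is scored exactly once, after the search is frozen.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Tune with 5-fold cross-validation on the TRAINING split only;
# the test set plays no role in choosing max_depth
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    cv=5,
).fit(X_tr, y_tr)

print("best params:", search.best_params_)

# Touch the test set exactly once, after all decisions are final
print(f"final test accuracy: {search.score(X_te, y_te):.2f}")
```

Every extra look at the test set during tuning quietly converts it into a second validation set, which is how 91% becomes 74% in production.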

Saving only the model, not the scaler or encoder

Symptom
Prediction endpoint receives raw inputs and produces garbage probabilities. No exception is raised — failures are silent.
Fix
Use sklearn Pipeline to bundle preprocessing and model into a single joblib artifact. Save and load the entire pipeline.

Using accuracy as the only metric on imbalanced data

Symptom
Model reports 99% accuracy but catches zero fraud/churn. Business stakeholders are confused why 'high accuracy' isn't helping.
Fix
Always compute precision, recall, F1, and ROC-AUC. Print the full classification report.
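A tiny demonstration of why accuracy alone misleads on a 1%-positive dataset: a "model" that predicts no-fraud for every row.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1% positive class, and a degenerate model that never flags fraud
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")  # 0.99, looks great
print(f"precision: {precision_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00
print(f"recall:    {recall_score(y_true, y_pred):.2f}")  # 0.00 -- catches nothing
```

The 99% accuracy is real and completely useless; precision and recall expose it instantly, which is why sklearn's classification_report should be your default output.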

Loading the model on every API request

Symptom
API latency is 200-500ms even for simple models. Throughput is capped at ~10 requests/second.
Fix
Load the model once at module level when the server starts. Reuse the loaded object for all requests.
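The load-once pattern as a self-contained sketch. The training-and-dump step at the top stands in for your offline pipeline; in a real server, the module-level load runs once when the worker process starts.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# --- offline: train and save (normally a separate script) ---
X, y = make_classification(n_samples=200, random_state=0)
joblib.dump(LogisticRegression(max_iter=1000).fit(X, y), "model.joblib")

# --- serving module ---
# Module level: this line runs ONCE at startup, not per request
MODEL = joblib.load("model.joblib")

def predict(features):
    # Every request reuses the in-memory object; no disk I/O in the hot path
    return int(MODEL.predict([features])[0])

print(predict(X[0].tolist()))
```

Moving the joblib.load call inside predict() would add its 100-500ms to every single request, which is the latency bug described above.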

Not validating input data before prediction

Symptom
Missing fields or wrong data types cause cryptic exceptions or garbage predictions. No clear error message to caller.
Fix
Validate every required field exists, check types, enforce value ranges. Return 400 with clear error message on validation failure.
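A minimal validation sketch. The field names (tenure_months, monthly_spend, plan) are hypothetical placeholders for whatever your model actually expects; the caller gets a list of clear error strings instead of a cryptic exception.

```python
# Hypothetical schema: required field name -> accepted type(s)
REQUIRED = {
    "tenure_months": (int, float),
    "monthly_spend": (int, float),
    "plan": str,
}

def validate(payload):
    """Return a list of human-readable errors; empty list means valid."""
    errors = []
    for field, types in REQUIRED.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], types):
            errors.append(f"wrong type for {field}")
    # Range checks only make sense once presence and types are confirmed
    if not errors and payload["tenure_months"] < 0:
        errors.append("tenure_months must be >= 0")
    return errors

print(validate({"tenure_months": 12, "monthly_spend": 49.0, "plan": "pro"}))  # []
print(validate({"monthly_spend": "49"}))  # missing fields + wrong type
```

In a web framework, a non-empty error list maps directly to a 400 response with the errors in the body.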

Starting with deep learning on tabular data

Symptom
Training takes hours, needs GPU, requires extensive tuning. Random Forest trains in seconds and outperforms the neural net.
Fix
For tabular data (rows and columns), start with Random Forest. Use gradient boosting if you need more accuracy. Deep learning is for images/text only.

Using features that won't be available at prediction time

Symptom
Model has 95%+ accuracy in testing, random performance in production. Data leakage is the culprit.
Fix
Audit every feature: 'Would I have this value at prediction time in real life?' Remove any column that leaks future information.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Your churn model has 89% cross-validation accuracy but only 61% on the f...
Q02 · SENIOR
You have a binary classification problem where the positive class is 0.3...
Q03 · SENIOR
What is the difference between supervised, unsupervised, and reinforceme...
Q04 · SENIOR
How would you debug a model that performs well on your test set but fail...
Q05 · SENIOR
You are given a dataset with 50 features and 10,000 rows. Before trainin...
Q06 · SENIOR
A Random Forest gets 0.82 AUC and an XGBoost gets 0.84 AUC on the same d...
Q07 · SENIOR
Walk me through how you would structure an ML project from scratch: dire...
Q01 of 07 · SENIOR

Your churn model has 89% cross-validation accuracy but only 61% on the first month of production data. Walk me through the five most likely causes and how you would diagnose each one.

ANSWER
1. Data leakage: a column in training data references future events not available at prediction time. Check features like 'last_payment_date' or 'ticket_resolved_date'.
2. Distribution shift: production data differs from training data. Compare summary statistics with a KS test.
3. Preprocessing mismatch: scaler or encoder not applied identically. Verify the pipeline is saved and loaded as one unit.
4. Training-serving skew: feature engineering differs between environments. Compare a sample prediction locally vs production.
5. Concept drift: the relationship between features and target changed over time. Evaluate performance by month to see the degradation pattern.
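Cause 2, distribution shift, is checkable per feature with a two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp). A sketch on synthetic data where a feature's mean has drifted in production:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
train_feature = rng.normal(50, 10, 5000)  # feature as seen at training time
prod_feature = rng.normal(58, 10, 5000)   # same feature in production: drifted

# Null hypothesis: both samples come from the same distribution
stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2e}")
# A tiny p-value rejects the null -> investigate this feature first
```

Running this over every feature and sorting by KS statistic gives you a ranked list of where production has moved away from training.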
FAQ · 8 QUESTIONS

Frequently Asked Questions

01. How long does it take to learn machine learning from scratch?
02. What's the difference between machine learning and deep learning?
03. Do I need to know math to learn ML?
04. Why does my model perform well in testing but terribly in production?
05. Which ML algorithm should I start with?
06. How do I deploy an ML model to production?
07. What metrics should I use to evaluate my model?
08. How do I handle imbalanced datasets?

That's ML Basics. Mark it forged?

14 min read · try the examples if you haven't
