Senior 13 min · March 15, 2026

Scikit-Learn Classification — OneHotEncoder Schema Drift

Precision dropped from 85% to 40%: OneHotEncoder handle_unknown was missing.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.

Last updated: May 23, 2026
Quick Answer
  • Classification predicts discrete class labels from labelled training data — binary, multi-class, or multi-label
  • Scikit-Learn's unified fit/predict/predict_proba API lets you swap algorithms with one line change
  • Always wrap preprocessing + classifier in a Pipeline to prevent data leakage from test statistics contaminating training
  • Accuracy is misleading for imbalanced classes — use F1, PR-AUC, and ROC-AUC instead
  • Tune the decision threshold via predict_proba() + precision_recall_curve() — the default 0.5 is almost never optimal in production
  • XGBoost/LightGBM/CatBoost are what practitioners actually deploy for tabular data, and all three ship Scikit-Learn-compatible estimator classes
Plain-English First

Classification is the task of teaching a program to sort things into categories — spam vs not spam, disease vs healthy, cat vs dog. Scikit-Learn gives you a toolkit of algorithms that learn these categories from historical labelled data and then classify new unseen data.

Classification predicts discrete class labels from labelled training data. Scikit-Learn provides a consistent, composable API for dozens of classification algorithms — from logistic regression to random forests — so you can swap algorithms, build preprocessing pipelines, evaluate performance rigorously, and tune hyperparameters without rewriting your code.

The algorithms are the easy part. The hard part is preventing data leakage, handling imbalanced classes, choosing the right evaluation metric, and building a pipeline you can serialize and deploy without surprises. This guide covers all of it — the algorithms, the gotchas, and the production patterns that separate a Jupyter notebook prototype from a reliable production system.

Here's the thing: most classification failures aren't about picking the wrong algorithm. They're about leaking test data into training, using accuracy on imbalanced data, or deploying a model that can't handle new categories. Get those right, and the algorithm choice often becomes a second-order concern.

What is Classification in Machine Learning?

Classification is a supervised learning task where the goal is to predict a discrete class label for each input. Supervised means you train on labelled examples — (features, label) pairs — where the correct label is known. The model learns the relationship between features and labels, then generalises to new inputs.

Binary classification has two classes (spam/not spam, fraud/not fraud, disease/healthy). Multi-class classification has three or more exclusive classes (cat, dog, bird). Multi-label classification assigns multiple labels per example (a news article can be both 'finance' and 'politics').

The output of a classifier is either a predicted class label (via predict()) or a probability distribution over all classes (via predict_proba()). Classification is distinct from regression, where the output is a continuous number.
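To make the two output types concrete, here is a minimal sketch. The dataset (iris) and the model (LogisticRegression) are my choices for illustration, not something the text above prescribes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:5])  # shape (5, 3): one column per class
labels = clf.predict(X[:5])       # shape (5,): one label per row

# Each row is a probability distribution, so it sums to 1.
print(proba.sum(axis=1))

# For this model, predict() returns the class with the highest probability.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels, labels_from_proba))
```

The column order of predict_proba() follows clf.classes_, which is worth checking before you slice out "the positive class" by index.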

The first question to ask before building any classifier: What does it cost to be wrong? If you misclassify spam, the user sees one extra email. If you misclassify a healthy patient as having cancer, they undergo unnecessary treatment. The cost of false positives vs false negatives determines your evaluation metric, your decision threshold, and your entire model selection strategy. Don't start with accuracy — start with the business impact of errors.

Here's a rule I've learned the hard way: if you don't know the cost of errors, you'll pick the wrong metric. And picking the wrong metric means you'll optimise the wrong thing. That's how you end up with a 99% accurate fraud model that catches zero fraud.
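The "99% accurate, zero fraud caught" failure is easy to reproduce. The sketch below uses made-up labels (1,000 transactions, 10 fraudulent) and a DummyClassifier that always predicts the majority class. The numbers are hypothetical and only illustrate the arithmetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical data: 1,000 transactions, 10 of them fraudulent (1%).
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# A "model" that always predicts the most frequent class (not fraud).
always_majority = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = always_majority.predict(X)

accuracy = accuracy_score(y, y_pred)
fraud_recall = recall_score(y, y_pred)
print(f"accuracy={accuracy:.2f}, fraud recall={fraud_recall:.2f}")
# accuracy=0.99, fraud recall=0.00
```

Accuracy rewards the 990 easy negatives. Recall on the fraud class shows the model does nothing useful.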

io/thecodeforge/ml/classification_basics.py (Python)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['setosa', 'versicolor', 'virginica']))
Output
              precision    recall  f1-score   support
      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10
    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
Mental Model: Classification vs Regression
  • Classification output: discrete label (spam/not spam) or probability distribution over classes
  • Regression output: continuous number (price, temperature)
  • The cost of errors drives everything — start with the business question, not the algorithm
  • predict() gives labels, predict_proba() gives probabilities — always prefer probabilities in production for threshold control
Production Insight
Most production classification failures trace back to not asking 'what does it cost to be wrong?' before building.
A fraud model optimised for accuracy will predict 'not fraud' 99.5% of the time and score 99.5% accuracy while catching zero fraud.
Always define the cost matrix before choosing your evaluation metric.
I once spent two weeks tuning a model that was already perfect at predicting the majority class — the fix was a simple cost weight change.
Key Takeaway
Classification predicts discrete labels from labelled data.
The cost of false positives vs false negatives determines your metric, threshold, and algorithm — not the other way around.
Start every classification project by defining the business cost of errors.
Choosing the Right Classification Type
If: Two mutually exclusive classes (spam/not spam)
Use: Binary classification with any classifier that exposes predict_proba()
If: Three or more mutually exclusive classes (cat/dog/bird)
Use: Multi-class classification; most classifiers handle this natively via one-vs-rest or softmax
If: Multiple labels per example (article tagged 'finance' AND 'politics')
Use: Multi-label classification with the MultiOutputClassifier or ClassifierChain wrappers from sklearn.multioutput
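For the multi-label row, a rough sketch of both wrappers follows. The synthetic data from make_multilabel_classification and the LogisticRegression base model are assumptions for illustration. MultiOutputClassifier fits one independent model per label, while ClassifierChain feeds earlier label predictions into later models.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

# Synthetic multi-label data: Y has one 0/1 column per label.
X, Y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=3, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# One independent binary classifier per label.
independent = MultiOutputClassifier(LogisticRegression(max_iter=1000))
independent.fit(X_train, Y_train)
pred_independent = independent.predict(X_test).astype(int)

# A chain: each classifier also sees the predictions for earlier labels.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2])
chain.fit(X_train, Y_train)
pred_chain = chain.predict(X_test).astype(int)

f1_independent = f1_score(Y_test, pred_independent, average="micro")
f1_chain = f1_score(Y_test, pred_chain, average="micro")
print(f"micro-F1 independent={f1_independent:.3f}, chain={f1_chain:.3f}")
```

Which wrapper wins depends on how correlated the labels are, so treat the comparison as something to measure on your own data.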
[Figure: Scikit-Learn Classification, Linear vs Non-Linear Models. Linear models (Logistic Regression: fast, interpretable; LinearSVC: high-dimensional text data) assume a linear decision boundary, have low variance, may underfit, and make a great first baseline. Non-linear models (Random Forest: robust, handles noise; SVM with RBF kernel: curved decision boundary; KNN: simple, no training phase; Gradient Boosting: highest accuracy) are slower and need more tuning and data.]

Scikit-Learn's Unified Estimator API — fit, predict, score

Scikit-Learn's biggest strength is its consistent API. Every classifier implements the same interface: fit(X, y) trains the model, predict(X) returns predicted labels, predict_proba(X) returns probability estimates, and score(X, y) returns mean accuracy.

This uniformity means you can swap algorithms with a single line change. The exact same preprocessing, splitting, and evaluation code works with LogisticRegression, RandomForestClassifier, SVC, or GradientBoostingClassifier.

The score() trap: score() returns accuracy by default for classifiers. For imbalanced datasets, accuracy is misleading: on a dataset where 95% of samples belong to one class, a model predicting the majority class every time scores 95% accuracy while being completely useless. Always use explicit metrics (F1, AUC) via sklearn.metrics rather than relying on score().

Think of it like this: score() is a shortcut for quick checks during development. But if you're using it to evaluate a model for production, you're making a mistake. You wouldn't check if a car works by looking at the colour — same idea.

io/thecodeforge/ml/unified_api.py (Python)
# Reuses X_train, X_test, y_train, y_test from the iris split in classification_basics.py
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(max_depth=5),
    'Random Forest':       RandomForestClassifier(n_estimators=100),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=100),
    'SVM':                 SVC(probability=True),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print(f'{name}: {acc:.4f}')

probs = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
print(probs[0])
Output
Logistic Regression: 0.9667
Decision Tree: 0.9333
Random Forest: 1.0000
Gradient Boosting: 1.0000
SVM: 0.9667
[0.02 0.07 0.91]
Production Insight
The unified API enables rapid algorithm comparison — swap one line, keep all evaluation code identical.
Never rely on score() in production evaluation — it returns accuracy, which hides class imbalance problems.
Always use sklearn.metrics.f1_score, roc_auc_score, or average_precision_score explicitly.
I've seen teams proudly report 95% accuracy on a fraud dataset, then discover the model never predicted fraud once.
Key Takeaway
Scikit-Learn's unified fit/predict API lets you swap algorithms with one line change.
Never use score() for evaluation — it returns accuracy, which is meaningless on imbalanced data.
Always use explicit metrics from sklearn.metrics and prefer predict_proba() over predict() in production.
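One way to follow that advice while still swapping algorithms freely is cross_validate with an explicit scoring list. This is a sketch under assumptions: the imbalanced dataset is synthetic (about 5% positives) and the two candidate models are arbitrary picks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced problem: roughly 95% negatives, 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

scoring = ["accuracy", "f1", "average_precision"]  # explicit, not score()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, clf in candidates.items():
    scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)
    # cross_validate returns one array per metric under the key "test_<metric>".
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Reading accuracy next to F1 and average precision (PR-AUC) in the same table makes it obvious when accuracy is flattering the model.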
When to Use predict() vs predict_proba()
If: Production binary classification
Use: predict_proba() with a tuned threshold, never the hardcoded 0.5 from predict()
If: Multi-class where you need confidence
Use: predict_proba() to get the full probability distribution over all classes
If: Quick prototyping or evaluation only
Use: predict() is fine, but switch to predict_proba() before deployment

Feature Scaling — When to Scale and When Not to Bother

Feature scaling normalises the range of input features. Some algorithms are sensitive to feature scale; others are completely invariant.

Algorithms that REQUIRE scaling: SVM (distance-based), KNN (distance-based), Logistic Regression (gradient descent convergence), Neural Networks (gradient-based optimization). If you forget to scale for these, the feature with the largest range dominates the model.

Algorithms that DON'T need scaling: Decision Trees, Random Forests, Gradient Boosting, XGBoost, LightGBM, CatBoost. Tree-based models split on individual feature thresholds, so absolute scale doesn't matter.

Three scalers to know: StandardScaler (zero mean, unit variance — sensitive to outliers), MinMaxScaler (scales to [0,1] — for neural networks), RobustScaler (median and IQR — robust to outliers).

The production rule: If your pipeline includes SVM, KNN, or Logistic Regression, add StandardScaler. If it's tree-based only, skip scaling. If unsure, add it — it won't hurt tree models, just wastes a few milliseconds.

One more thing: don't just blindly scale everything. I've debugged cases where scaling a sparse binary feature caused the model to behave poorly. Know your data.
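To see the scale sensitivity rather than take it on faith, here is a small comparison. The wine dataset is my choice because its features span very different ranges, and it is roughly balanced, so plain accuracy is acceptable for this demo.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

def mean_cv_accuracy(model):
    """Mean 5-fold cross-validated accuracy for a model or pipeline."""
    return cross_val_score(model, X, y, cv=5).mean()

# Distance-based model: compare raw features against scaled features.
svm_raw = mean_cv_accuracy(SVC(kernel="rbf"))
svm_scaled = mean_cv_accuracy(make_pipeline(StandardScaler(), SVC(kernel="rbf")))

# Tree-based model: splits depend on feature ordering, not magnitude.
tree_raw = mean_cv_accuracy(DecisionTreeClassifier(random_state=0))
tree_scaled = mean_cv_accuracy(
    make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
)

print(f"SVM  raw={svm_raw:.3f}  scaled={svm_scaled:.3f}")
print(f"Tree raw={tree_raw:.3f}  scaled={tree_scaled:.3f}")
```

The scaler sits inside the pipeline, so each cross-validation fold fits it on that fold's training portion only.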

io/thecodeforge/ml/feature_scaling.py (Python)
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# SVM NEEDS scaling
svm_scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf', probability=True)),
])

# When data has outliers, use RobustScaler
svm_robust = Pipeline([
    ('scaler', RobustScaler()),
    ('clf', SVC(kernel='rbf', probability=True)),
])

# For neural networks or bounded-input models
nn_scaled = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500)),
])
Forge Tip: Scale Inside the Pipeline
Always put the scaler inside the Pipeline, not before it. If you scale before splitting, you leak test statistics into training. The Pipeline ensures the scaler is fit on training data only and applied to test data using training statistics.
Production Insight
Forgetting to scale for SVM or KNN silently degrades performance — the model still trains but the feature with the largest numeric range dominates distance calculations.
I've seen SVM models with 60% accuracy jump to 92% simply by adding StandardScaler — no other changes.
Tree-based models (Random Forest, XGBoost) don't need scaling — adding it wastes a few milliseconds but doesn't hurt.
But be careful: scaling a sparse binary feature can break interpretation. Use domain knowledge.
Key Takeaway
SVM, KNN, Logistic Regression, and Neural Networks require feature scaling — tree-based models don't.
Always put the scaler inside the Pipeline, not before splitting — otherwise you leak test statistics into training.
Use RobustScaler when your data has outliers; StandardScaler is the default.
Feature Scaling Decision
If: Using SVM, KNN, Logistic Regression, or Neural Networks
Use: Add StandardScaler (or RobustScaler if outliers are present) inside the Pipeline
If: Using tree-based models (RF, XGBoost, LightGBM, CatBoost)
Use: Skip scaling, because tree splits are invariant to feature scale
If: Mixed pipeline with both tree and distance-based models
Use: Add scaling; it won't hurt tree models and is required for distance-based ones

Preprocessing Pipelines — The Right Way to Handle Feature Engineering

A pipeline chains preprocessing steps and a classifier into a single object. This is not optional convenience — it is the correct way to prevent data leakage.

Data leakage happens when information from the test set influences training. The classic mistake: fit a StandardScaler on the entire dataset before splitting. The scaler has 'seen' the test data and computed statistics from it. Your model was trained on a subtly contaminated version of reality.

With a Pipeline, fit() calls fit_transform() on preprocessors and fit() on the classifier — all on training data only. predict() calls transform() on preprocessors and predict() on the classifier. The test data is only transformed with statistics learned from training data.
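A quick way to convince yourself of this is to inspect the fitted scaler. The sketch below (iris data and a two-step pipeline, both chosen only for illustration) checks that the pipeline's scaler learned its mean from the training rows alone, and that predict() matches a manual transform-then-predict.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Statistics the pipeline's scaler actually learned.
scaler_mean = pipe.named_steps["scaler"].mean_

# The leaky alternative: a scaler fit on ALL rows, test rows included.
leaky_mean = StandardScaler().fit(X).mean_

fit_on_train_only = np.allclose(scaler_mean, X_train.mean(axis=0))
matches_leaky = np.allclose(scaler_mean, leaky_mean)
print("scaler fit on training rows only:", fit_on_train_only)
print("same as full-dataset scaler:", matches_leaky)

# predict() is transform() on the preprocessor followed by predict() on the classifier.
manual_pred = pipe.named_steps["clf"].predict(
    pipe.named_steps["scaler"].transform(X_test)
)
same_predictions = np.array_equal(manual_pred, pipe.predict(X_test))
print("pipeline predict == manual transform + predict:", same_predictions)
```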

The ColumnTransformer pattern: Real datasets have mixed feature types — numeric columns (age, income), categorical columns (country, plan_type). ColumnTransformer applies different preprocessing to different column subsets, all within the same pipeline. This is the standard pattern for production ML.

I'll say it again: if you're not using a Pipeline, you're almost certainly leaking data. I've seen it countless times.

io/thecodeforge/ml/pipeline_example.py (Python)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Simulate a realistic dataset
df = pd.DataFrame({
    'age': [25, 45, np.nan, 30, 60],
    'income': [50000, 80000, 120000, np.nan, 90000],
    'country': ['US', 'UK', 'US', 'DE', 'FR'],
    'plan_type': ['basic', 'premium', 'basic', 'enterprise', 'premium'],
    'education': ['high_school', 'bachelors', 'masters', 'phd', 'bachelors'],
    'churned': [0, 0, 1, 0, 1],
})

X = df.drop('churned', axis=1)
y = df['churned']

numeric_features = ['age', 'income']
nominal_features = ['country', 'plan_type']
ordinal_features = ['education']

# Ordinal encoding for ordered categories
education_order = ['high_school', 'bachelors', 'masters', 'phd']

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_features),
    ('nom', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), nominal_features),
    ('ord', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('ordinal', OrdinalEncoder(categories=[education_order]))]), ordinal_features),
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

pipeline.fit(X, y)

# See transformed feature names
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
print('Transformed features:', feature_names)
Output
Transformed features: ['num__age' 'num__income' 'nom__country_DE' 'nom__country_FR' 'nom__country_UK' 'nom__country_US' 'nom__plan_type_basic' 'nom__plan_type_enterprise' 'nom__plan_type_premium' 'ord__education']
Forge Tip: handle_unknown='ignore' Is Non-Negotiable
Always set handle_unknown='ignore' on OneHotEncoder. In production, new data will contain categories not seen during training (a new country, a new plan type). Without this setting, the encoder crashes at prediction time. I've seen production models go down because a new category appeared in the data pipeline.
Production Insight
Data leakage through pre-fitting preprocessors is the #1 silent killer of production ML models.
The model scores 95% in evaluation but performs at 70% in production — and the team spends weeks debugging the wrong thing.
ColumnTransformer + Pipeline is the non-negotiable pattern for any production classification system with mixed feature types.
And don't forget: if you don't set handle_unknown='ignore', a single new category in production will crash your entire pipeline.
Key Takeaway
Pipeline chains preprocessing and classifier into one object, which makes it the most reliable way to prevent preprocessing-related data leakage.
ColumnTransformer applies different preprocessing to different feature types within the same pipeline.
Always set handle_unknown='ignore' on OneHotEncoder — production data will contain unseen categories.
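Here is a minimal reproduction of the unseen-category problem, with made-up country codes. One caveat worth stating: handle_unknown='ignore' avoids the crash by encoding the unseen value as an all-zero row, so the model still receives less information than it was trained on. Comparing incoming values against encoder.categories_, as in the last lines, is one possible way to surface that drift. It is my suggestion, not something the pipeline does for you.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_countries = np.array([["US"], ["UK"], ["DE"]])
prod_countries = np.array([["US"], ["BR"]])  # "BR" never appeared in training

# Default behaviour (handle_unknown="error"): transform raises on unseen values.
strict = OneHotEncoder().fit(train_countries)
try:
    strict.transform(prod_countries)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print("strict encoder failed:", exc)

# With handle_unknown="ignore", unseen values become an all-zero row.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train_countries)
encoded = tolerant.transform(prod_countries).toarray()
print(tolerant.categories_[0])  # learned categories, sorted: DE, UK, US
print(encoded)                  # row for "BR" is all zeros

# Hypothetical drift check: which incoming values were never seen in training?
known = set(tolerant.categories_[0])
unseen = sorted(set(prod_countries[:, 0]) - known)
print("unseen categories:", unseen)
```

If the share of all-zero rows climbs over time, that is a signal to retrain, even though nothing has crashed.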
Preprocessing Strategy by Feature Type
If: Numeric features with missing values
Use: SimpleImputer(strategy='median') + StandardScaler; the median is robust to outliers
If: Nominal categorical features (no order)
Use: OneHotEncoder(handle_unknown='ignore'); never use OrdinalEncoder for unordered categories
If: Ordinal categorical features (education, priority)
Use: OrdinalEncoder with explicit categories=[...] to control the ordering
If: High-cardinality categoricals (>50 unique values)
Use: Consider target encoding or frequency encoding instead of OneHot to avoid feature explosion
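For the high-cardinality row, frequency encoding can be sketched in a few lines of pandas. The merchant_id column and its values are hypothetical. The key point is that frequencies are learned from training data only and unseen values fall back to 0.

```python
import pandas as pd

# Hypothetical training and production frames with a high-cardinality column.
train = pd.DataFrame({"merchant_id": ["m1", "m2", "m1", "m3", "m1", "m2"]})
prod = pd.DataFrame({"merchant_id": ["m1", "m3", "m9"]})  # "m9" is unseen

# Learn relative frequencies from the training data only.
freq = train["merchant_id"].value_counts(normalize=True)

# Replace each category with its training frequency; unseen values map to 0.
train_encoded = train["merchant_id"].map(freq)
prod_encoded = prod["merchant_id"].map(freq).fillna(0.0)

print(prod_encoded.tolist())  # m1 -> 3/6, m3 -> 1/6, m9 -> 0.0
```

This produces one numeric column regardless of how many distinct categories exist, at the cost of merging categories that happen to share a frequency.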

Naive Bayes — The Baseline You Should Always Try First

Before reaching for Random Forest or XGBoost, try Naive Bayes. It's fast, simple, surprisingly effective, and serves as a strong baseline. If your complex model can't beat Naive Bayes, something is wrong with your features.

Three variants: GaussianNB (continuous features, assumes normal distribution), MultinomialNB (discrete count features like word counts or TF-IDF — the go-to for text classification), BernoulliNB (binary features — word present/absent).

Why it works for text: Text data is high-dimensional and sparse. Naive Bayes handles this gracefully because it assumes feature independence. This assumption is obviously wrong (words aren't independent), but it works shockingly well in practice.

The production baseline pattern: Always train a Naive Bayes model first. Report its metrics. Then train your fancy model. If the fancy model only marginally beats Naive Bayes, consider whether the added complexity is worth it. Naive Bayes trains in milliseconds and predicts in microseconds — that matters for real-time systems.

Here's a story: I once replaced a carefully tuned XGBoost model with a Naive Bayes model for a real-time ad classification system. The XGBoost was 1.2% more accurate but took 15x longer to predict. The business chose the faster model. Know your constraints.

io/thecodeforge/ml/naive_bayes.py (Python)
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Gaussian Naive Bayes for continuous features
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print('GaussianNB:', classification_report(y_test, y_pred))

# Multinomial Naive Bayes for text (TF-IDF features)
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('clf',   MultinomialNB(alpha=0.1)),
])

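# Bernoulli Naive Bayes for binary (present/absent) features
# Thresholding at 0 is only a placeholder; pick a cut-off that is meaningful for your data
bnb = BernoulliNB()
binary_features = (X_train > 0).astype(int)
bnb.fit(binary_features, y_train)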
Output
GaussianNB:
              precision    recall  f1-score   support
           0       0.96      0.94      0.95        50
           1       0.83      0.88      0.85        22
    accuracy                           0.92        72
Production Insight
Naive Bayes trains in milliseconds and predicts in microseconds — critical for real-time systems with latency budgets.
If your complex model only marginally beats Naive Bayes, the added complexity (maintenance, debugging, serialisation size) may not be worth it.
MultinomialNB + TF-IDF is the go-to baseline for text classification — it's surprisingly hard to beat without deep learning.
Sometimes the 'simple' model wins because it's easier to maintain and debug. Don't underestimate it.
Key Takeaway
Always train Naive Bayes first as a baseline — if your complex model can't beat it, something is wrong with your features.
MultinomialNB + TF-IDF is the classic text classification baseline that's fast, effective, and hard to beat.
Naive Bayes trains in milliseconds — in real-time production systems, this latency advantage matters.
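As a concrete version of the text baseline, here is a tiny end-to-end sketch. The six-message corpus is invented purely for illustration, and a real baseline would need far more data and a proper train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, made up for this example.
texts = [
    "win free money now",
    "free prize claim now",
    "cheap meds free offer",
    "meeting at noon tomorrow",
    "project status report attached",
    "lunch with the team tomorrow",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Vectoriser and classifier live in one pipeline, so raw strings go in.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB(alpha=0.1)),
])
baseline.fit(texts, labels)

new_messages = ["claim your free prize", "project meeting tomorrow"]
predictions = baseline.predict(new_messages)
print(list(predictions))  # ['spam', 'ham']
```

Because the vectoriser is inside the pipeline, the vocabulary is learned from training text only, and words never seen in training are simply ignored at prediction time.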

Evaluating a Classifier — Beyond Accuracy

Accuracy is misleading for imbalanced classes. If 95% of data is class 0, a classifier predicting class 0 always achieves 95% accuracy while being completely useless.

Precision: Of all samples predicted positive, what fraction actually are? High precision = few false alarms. Recall: Of all actual positives, what fraction did we catch? High recall = few missed cases. F1: Harmonic mean of precision and recall.

The confusion matrix shows the full breakdown: TP, TN, FP, FN. For multi-class, it's an NxN matrix where the diagonal is correct predictions.
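The definitions above reduce to simple arithmetic on the four confusion-matrix cells. The sketch below uses ten hand-written labels (hypothetical) and checks the manual formulas against sklearn.metrics.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Ten hand-written examples: 6 actual negatives, 4 actual positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=2 FN=1 TP=3

precision = tp / (tp + fp)   # 3 / 5 = 0.60
recall = tp / (tp + fn)      # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # about 0.667

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# The library functions agree with the manual arithmetic.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```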

ROC-AUC measures discrimination ability across all thresholds. AUC 0.5 = random, 1.0 = perfect. PR-AUC (Precision-Recall AUC) is better for imbalanced data because it focuses on the positive class.

Which metric matters depends on the business cost:
  • Spam filtering: Precision matters. You don't want legitimate email in the spam folder.
  • Disease screening: Recall matters. You don't want to miss a sick patient.
  • Fraud detection: Recall matters more, but precision also matters because investigating false positives costs money.

The multi-metric approach: Always report at least three metrics: precision, recall, and F1. Add AUC if you use predicted probabilities. Never report accuracy alone for imbalanced problems.

I've had to tell more than one team that their 99% accurate model was useless. The confusion matrix showed they predicted 'not fraud' for everything. That's when you know you've been optimising the wrong thing.

io/thecodeforge/ml/evaluation_metrics.py (Python)
from sklearn.ensemble import RandomForestClassifier  # needed for the cost-sensitive model below
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score,
    ConfusionMatrixDisplay, precision_recall_curve
)
import matplotlib.pyplot as plt
import numpy as np

print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title('Confusion Matrix')
plt.show()

# ROC-AUC for binary classification
probs = pipeline.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(f'ROC-AUC: {auc:.4f}')

# PR-AUC — better metric for imbalanced classes
pr_auc = average_precision_score(y_test, probs)
print(f'PR-AUC:  {pr_auc:.4f}')

# Custom class weights — when you know the business cost ratio
clf_cost_sensitive = RandomForestClassifier(
    class_weight={0: 1, 1: 100},
    random_state=42
)
clf_cost_sensitive.fit(X_train, y_train)
y_pred_cost = clf_cost_sensitive.predict(X_test)
print(classification_report(y_test, y_pred_cost))
Output
              precision    recall  f1-score   support
           0       0.98      0.96      0.97        50
           1       0.82      0.91      0.86        22
    accuracy                           0.95        72
   macro avg       0.90      0.93      0.91        72
weighted avg       0.94      0.95      0.94        72
ROC-AUC: 0.9834
PR-AUC:  0.9412
              precision    recall  f1-score   support
           0       0.99      0.88      0.93        50
           1       0.61      0.95      0.75        22
    accuracy                           0.90        72
   macro avg       0.80      0.92      0.84        72
weighted avg       0.88      0.90      0.88        72
Forge Tip: PR-AUC Over ROC-AUC for Imbalanced Data
For fraud detection, disease screening, or any high-stakes imbalanced classification, report PR-AUC alongside ROC-AUC. ROC-AUC can look deceptively good on imbalanced data because the false positive rate is computed over the huge pool of negatives, so even thousands of false alarms barely move it. I've seen a ROC-AUC of 0.95 sit next to a PR-AUC of 0.3 on a 1% fraud rate dataset: the model was mediocre at finding fraud but great at identifying the 99% obvious non-fraud cases.
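The gap between the two metrics can be reproduced without training anything. The sketch below draws synthetic scores for 1% positives and 99% negatives from two overlapping normal distributions. The distributions and sample sizes are assumptions chosen to mimic a moderately good scorer.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 19_800, 200  # 1% positives

y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Synthetic model scores: positives score higher on average, with heavy overlap.
scores = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=n_neg),
    rng.normal(loc=1.5, scale=1.0, size=n_pos),
])

roc = roc_auc_score(y_true, scores)
pr = average_precision_score(y_true, scores)
print(f"ROC-AUC={roc:.3f}  PR-AUC={pr:.3f}")
```

The same scores that separate classes well on average still produce many false alarms per true positive at a 1% base rate, which is what PR-AUC picks up.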
Production Insight
I've seen teams report 99% accuracy on a 1% fraud dataset — the model predicted 'not fraud' for everything.
The confusion matrix reveals what accuracy hides: how many fraud cases were missed (false negatives) and how many legitimate transactions were flagged (false positives).
Always report at least three metrics: precision, recall, and F1. Add PR-AUC for imbalanced problems.
And remember: no metric is perfect. Use multiple, and always connect them back to business costs.
Key Takeaway
Accuracy is misleading for imbalanced classes — a model predicting all negatives can score 95% while being useless.
Always report precision, recall, and F1. Add PR-AUC for imbalanced data — ROC-AUC can look deceptively good.
The business cost of false positives vs false negatives determines which metric to optimise.
Choosing the Right Evaluation Metric
If: Balanced classes, equal cost of errors
Use: Accuracy is acceptable, but still report F1 for completeness
If: Imbalanced classes, missing positives is costly (fraud, disease)
Use: Prioritise recall and PR-AUC; optimise for catching positives
If: Imbalanced classes, false positives are costly (spam filtering)
Use: Prioritise precision; optimise for trustworthy positive predictions
If: Need a single balanced metric
Use: F1 score, the harmonic mean of precision and recall

Decision Threshold Tuning — The Most Underrated Technique

For most binary classifiers, Scikit-Learn's predict() applies an implicit decision threshold of 0.5: if predict_proba() returns >= 0.5 for class 1, the prediction is class 1. This threshold is arbitrary and almost never optimal for your specific business problem.

The insight: The threshold controls the trade-off between precision and recall. Lowering it catches more positives (higher recall) but flags more negatives as positives (lower precision). Raising it gives fewer but more trustworthy positive predictions.

How to find the optimal threshold: Use precision_recall_curve to get precision and recall at every possible threshold. Then choose the threshold that optimises your business metric.
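One detail worth making explicit: the threshold is itself a fitted parameter, so it is safer to choose it on a validation split and keep the test set for the final check. A minimal sketch, assuming synthetic data and a LogisticRegression model that stand in for your own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Three-way split: train (60%), validation (20%), test (20%).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Choose the F1-maximising threshold on the validation split only.
val_probs = clf.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, val_probs)
# precisions and recalls have one more entry than thresholds, hence [:-1].
f1 = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-12)
best_threshold = float(thresholds[np.argmax(f1)])

# Report the effect on the untouched test split.
test_probs = clf.predict_proba(X_test)[:, 1]
y_pred_default = (test_probs >= 0.5).astype(int)
y_pred_tuned = (test_probs >= best_threshold).astype(int)
print(f"threshold chosen on validation: {best_threshold:.3f}")
print(f"test F1 at 0.5:   {f1_score(y_test, y_pred_default):.3f}")
print(f"test F1 at tuned: {f1_score(y_test, y_pred_tuned):.3f}")
```

Swapping the F1 objective for a recall floor or a cost function only changes the line that picks best_threshold.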

In production: Don't use predict() — use predict_proba() and apply your own threshold. Store the threshold alongside the model. When business costs change, adjust the threshold without retraining.
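"Store the threshold alongside the model" can be as simple as serialising both in one artifact. The sketch below uses joblib and a plain dict. The file name, the dict layout, and the predict_with_threshold helper are all hypothetical choices of mine, not an established convention.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical artifact layout: the fitted model plus its decision threshold.
artifact = {"model": model, "threshold": 0.15}

def predict_with_threshold(artifact, X):
    """Apply the stored threshold to the positive-class probability."""
    probs = artifact["model"].predict_proba(X)[:, 1]
    return (probs >= artifact["threshold"]).astype(int)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier_artifact.joblib")  # hypothetical name
    joblib.dump(artifact, path)
    loaded = joblib.load(path)

flagged_tuned = int(predict_with_threshold(loaded, X).sum())
flagged_default = int(model.predict(X).sum())
print(f"flagged at 0.15: {flagged_tuned}, flagged by predict(): {flagged_default}")
```

Changing the business trade-off then means updating one number in the artifact (or in external config) while the fitted model stays untouched.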

I tuned the threshold on a fraud detection model from 0.5 to 0.15. Recall went from 60% to 92%. Precision dropped from 85% to 45%. The fraud team preferred catching 92% of fraud — the cost of missing fraud far exceeded the cost of investigating false positives. That's a business decision, not a technical one.

io/thecodeforge/ml/threshold_tuning.py (Python)
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

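# Get probabilities
# Note: where possible, tune the threshold on a held-out validation split;
# tuning on the final test set makes the reported metrics optimistic
y_probs = pipeline.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_probs)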

# Find threshold that maximises F1
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-8)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
print(f'Best threshold for F1: {best_threshold:.3f}')
print(f'F1 at best threshold: {f1_scores[best_idx]:.3f}')

# Apply custom threshold
y_pred_custom = (y_probs >= best_threshold).astype(int)
print(f'F1 with custom threshold: {f1_score(y_test, y_pred_custom):.4f}')

# Find threshold for minimum recall target
min_recall = 0.95
valid_indices = np.where(recalls[:-1] >= min_recall)[0]
if len(valid_indices) > 0:
    best_for_recall = valid_indices[np.argmax(precisions[valid_indices])]
    recall_threshold = thresholds[best_for_recall]
    print(f'Threshold for >= 95% recall: {recall_threshold:.3f}')

# Business cost minimisation
fn_cost = 1000  # missed fraud
fp_cost = 10    # false alarm
min_cost = float('inf')
best_cost_threshold = 0.5
for t in thresholds:
    y_pred_t = (y_probs >= t).astype(int)
    fn = np.sum((y_test == 1) & (y_pred_t == 0))
    fp = np.sum((y_test == 0) & (y_pred_t == 1))
    cost = fn * fn_cost + fp * fp_cost
    if cost < min_cost:
        min_cost = cost
        best_cost_threshold = t
print(f'Threshold minimising business cost: {best_cost_threshold:.3f} (cost: ${min_cost:,.0f})')
Output
Best threshold for F1: 0.340
F1 at best threshold: 0.831
F1 with custom threshold: 0.8314
Threshold for >= 95% recall: 0.180
Threshold minimising business cost: 0.150 (cost: $2,340)
Forge Warning: Never Use predict() in Production for Binary Classification
predict() uses a hardcoded 0.5 threshold. In production, always use predict_proba() and apply your own threshold. Store the threshold as a configuration parameter alongside the model. This single technique has saved me more production incidents than any algorithm choice.
Production Insight
Tuning the threshold from 0.5 to 0.15 on a fraud model increased recall from 60% to 92% — no retraining required.
The threshold is a business decision, not a technical one. Store it as a config parameter alongside the model.
When business costs change (e.g., fraud losses spike), adjust the threshold without retraining — this takes seconds, not hours.
I've seen threshold tuning save more models than any algorithm change. It's the highest-ROI step you can take.
Key Takeaway
The default 0.5 threshold is arbitrary and almost never optimal — tune it using precision_recall_curve.
Always use predict_proba() in production, never predict() — store the threshold as a config parameter.
Threshold tuning is the highest-ROI technique: it improves model performance without retraining.
Threshold Tuning Strategy
If: Business requires minimum recall (e.g., catch 95% of fraud)
Use: precision_recall_curve to find the threshold that achieves the target recall with maximum precision
If: Business has known cost per false positive and false negative
Use: Minimise total cost = FN × fn_cost + FP × fp_cost across all thresholds
If: No clear business requirement
Use: Maximise F1 as the default balanced metric
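Storing the threshold next to the model can be sketched as follows. This is a minimal illustration, not a prescribed layout: the file names model.joblib and threshold.json, the helper names, and the synthetic data are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in for real fraud data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)


def write_threshold(directory, threshold):
    """Store the decision threshold as plain config next to the model file."""
    (Path(directory) / "threshold.json").write_text(json.dumps({"threshold": threshold}))


def load_and_predict(directory, X_new):
    """Load model and threshold, then threshold predict_proba ourselves."""
    loaded = joblib.load(Path(directory) / "model.joblib")
    threshold = json.loads((Path(directory) / "threshold.json").read_text())["threshold"]
    return (loaded.predict_proba(X_new)[:, 1] >= threshold).astype(int)


with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(model, Path(tmp) / "model.joblib")
    write_threshold(tmp, 0.50)
    flagged_default = load_and_predict(tmp, X)

    # Business costs changed: only the config file is rewritten, the model is untouched.
    write_threshold(tmp, 0.15)
    flagged_low = load_and_predict(tmp, X)

print("flagged at 0.50:", flagged_default.sum())
print("flagged at 0.15:", flagged_low.sum())
```

Lowering the threshold can only flag the same rows or more, which is why the second count is never smaller than the first. In a real deployment the threshold would more likely live in your existing config system than in a loose JSON file.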

Handling Imbalanced Classes — SMOTE, Thresholds, and Class Weights

Most real-world classification problems are imbalanced: fraud is rare, churn is rare, disease is rare. If you don't address this, your model will learn to predict the majority class every time.

1. class_weight='balanced': The simplest approach. Increases the loss contribution of minority class samples. Works with LogisticRegression, RandomForest, SVM. No extra dependencies. Try this first.

2. SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic minority class samples by interpolating between existing minority samples. Requires the imbalanced-learn package (pip install imbalanced-learn). Use SMOTEENN or SMOTETomek to clean noisy synthetic samples.

3. Threshold tuning: Use predict_proba() and tune the decision threshold (see previous section). Often the most effective approach because you keep the model unchanged — you just change the decision boundary.
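To make the first option concrete, here is a small sketch of what class_weight='balanced' computes. The 95/5 label split is a made-up example; the formula n_samples / (n_classes * count_per_class) is the one scikit-learn documents for the 'balanced' mode.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 95 legitimate (0) and 5 fraud (1); an assumption for illustration.
y = np.array([0] * 95 + [1] * 5)

# 'balanced' weight for class c is n_samples / (n_classes * count_c).
manual = len(y) / (2 * np.bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

print("manual :", manual)
print("sklearn:", weights)
```

Each fraud row ends up weighted roughly 19 times more than a legitimate row, which is what pushes the loss to care about the minority class.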

The gotcha with SMOTE: Never apply SMOTE before train/test splitting. SMOTE generates synthetic samples based on nearest neighbours — if applied before splitting, synthetic test samples leak information about training samples. Always SMOTE inside a pipeline or after splitting.
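The interpolation idea behind that leak can be sketched in plain NumPy. This is a simplified illustration and not imbalanced-learn's implementation: it uses a single nearest neighbour (the library defaults to several and picks one at random), and the sample values and the function name smote_like_sample are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features, for illustration only.
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]])


def smote_like_sample(X_min, rng):
    """Create one synthetic point between a random minority sample and its nearest minority neighbour."""
    i = rng.integers(len(X_min))
    distances = np.linalg.norm(X_min - X_min[i], axis=1)
    distances[i] = np.inf        # exclude the point itself
    j = int(np.argmin(distances))  # nearest minority neighbour (k=1 for simplicity)
    lam = rng.random()           # interpolation factor in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i]), int(i), j


synthetic, i, j = smote_like_sample(minority, rng)
print("parents:  ", minority[i], minority[j])
print("synthetic:", synthetic)
```

Every synthetic point sits on the segment between two real rows. If those rows are drawn from the full dataset before splitting, a synthetic training point can be built from a row that later lands in the test set.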

imblearn.pipeline.Pipeline: scikit-learn's Pipeline doesn't support samplers (they lack transform()). Use imblearn.pipeline.Pipeline instead, which supports both transformers and samplers.

I've seen too many people apply SMOTE before splitting and claim an F1 of 0.95. When they fix it, it drops to 0.65. Don't be that person.

io/thecodeforge/ml/class_imbalance.py · PYTHON
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# numeric_features, categorical_features, X_train and y_train are assumed to be
# defined already; this snippet does not create them.
preprocessor = ColumnTransformer([
    ('num', Pipeline([('imp', SimpleImputer(strategy='median')), ('sc', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imp', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
])

# Strategy 1: class_weight (simplest). It needs the same preprocessing as the
# other strategies, otherwise the forest receives raw categorical columns.
weighted_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)),
])

# Strategy 2: SMOTE inside imblearn pipeline
smote_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Strategy 3: SMOTE + Edited Nearest Neighbours (cleans noisy samples)
smoteenn_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smoteenn', SMOTEENN(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Strategy 4: Undersampling + SMOTE
combined_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score clones each pipeline, so resampling happens inside every training fold only.
for name, pipe in [('class_weight', weighted_pipeline),
                   ('SMOTE', smote_pipeline),
                   ('SMOTEENN', smoteenn_pipeline),
                   ('Under+SMOTE', combined_pipeline)]:
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
    print(f'{name:15s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')
Output
class_weight F1: 0.6234 (+/- 0.0234)
SMOTE F1: 0.6512 (+/- 0.0198)
SMOTEENN F1: 0.6634 (+/- 0.0187)
Under+SMOTE F1: 0.6589 (+/- 0.0201)
Forge Tip: SMOTE Inside the Pipeline, Always
The most common SMOTE mistake is applying it before train/test splitting. This generates synthetic samples that blend training and test information. Always use imblearn.pipeline.Pipeline to ensure SMOTE is applied only to training folds during cross-validation. I've seen models with inflated F1 scores of 0.95 drop to 0.65 when SMOTE was moved inside the pipeline — the original score was an artifact of data leakage.
Production Insight
class_weight='balanced' is the zero-dependency first step — try it before reaching for SMOTE.
SMOTE before splitting produces inflated metrics due to data leakage — always use imblearn.pipeline.Pipeline.
Threshold tuning is often more effective than SMOTE because it changes the decision boundary without altering the training data.
And remember: synthetic data is not real data. SMOTE can introduce noise if your minority samples are already noisy.
Key Takeaway
Handle class imbalance in order: class_weight='balanced' first, then threshold tuning, then SMOTE.
Never apply SMOTE before train/test splitting — always use imblearn.pipeline.Pipeline to prevent data leakage.
Threshold tuning is often more effective than resampling because it changes the decision boundary without altering training data.
Class Imbalance Strategy Selection
If: Mild imbalance (minority class > 10%)
Use: Start with class_weight='balanced' (no extra dependencies, often sufficient)
If: Severe imbalance (minority class < 5%)
Use: Combine class_weight + threshold tuning, or use SMOTE inside imblearn.pipeline.Pipeline
If: Noisy minority class samples
Use: SMOTEENN or SMOTETomek to clean synthetic samples after oversampling

Cross-Validation — Reliable Performance Estimates

A single train/test split gives a noisy estimate. The specific random split affects which samples are in each set, and performance can vary significantly between splits. Cross-validation averages performance across multiple splits.

K-fold splits data into K folds, trains on K-1, validates on the remaining fold, rotates K times. StratifiedKFold preserves class proportions — always use it for classification.
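A quick way to see why stratification matters is to count positives per validation fold. The sketch below uses made-up labels sorted by class, which is the worst case for an unshuffled KFold; the helper name positives_per_fold is just for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 negatives followed by 10 positives (illustrative assumption).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself


def positives_per_fold(splitter):
    """Count positive labels in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold positives per fold:          ", plain)       # [0, 0, 0, 0, 10]
print("StratifiedKFold positives per fold:", stratified)  # [2, 2, 2, 2, 2]
```

With plain KFold, four of the five validation folds contain no positives at all, so F1 and recall are undefined or meaningless there. Shuffling softens this, but only stratification keeps the class ratio the same in every fold.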

cross_val_score returns test scores. cross_validate returns train and test scores plus timing — useful for diagnosing overfitting (train >> test = overfitting).

How many folds? 5 is the default. Use 10 for small datasets (more training data per fold). Use 3 for very large datasets (faster). The standard deviation across folds shows how stable the estimate is.

I once had a model that scored 0.95 on one random split and 0.72 on another. Cross-validation showed the true score was 0.83 with a standard deviation of 0.08. That's why we use CV.

io/thecodeforge/ml/cross_validation.py · PYTHON
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_weighted')
print(f'F1: {scores.mean():.4f} (+/- {scores.std():.4f})')

results = cross_validate(clf, X, y, cv=cv, scoring=['accuracy', 'f1_weighted'], return_train_score=True)
print('Train F1:', results['train_f1_weighted'].mean())
print('Test  F1:', results['test_f1_weighted'].mean())

# CV with full pipeline — no leakage
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
print(f'Pipeline AUC: {cv_scores.mean():.4f}')
Output
F1: 0.9734 (+/- 0.0189)
Train F1: 0.9998
Test F1: 0.9734
Pipeline AUC: 0.9856
Production Insight
A single train/test split can give wildly different results depending on the random seed — cross-validation stabilises the estimate.
Always use StratifiedKFold for classification to preserve class proportions in each fold.
The standard deviation across folds is a rough gauge of uncertainty, not a formal confidence interval. If it's high, your model is unstable or your dataset is too small.
I've seen models that looked amazing on one split and terrible on another. CV catches that.
Key Takeaway
Cross-validation gives a more reliable performance estimate than a single train/test split.
Always use StratifiedKFold for classification — it preserves class proportions.
High standard deviation across folds means your model is unstable — investigate before deploying.

Logistic Regression — Not Actually Regression, and Why That Matters

Junior devs hear "regression" and think continuous outputs. Logistic regression is a misnomer that's stuck around. It's a linear classifier that squashes its output through a sigmoid function to produce class probabilities between 0 and 1, not unbounded continuous values. The decision boundary is a hyperplane. That's it. No magic.

You train it by minimizing log-loss, which punishes confident wrong predictions harder than a linear loss would. That property alone makes logistic regression a decent baseline for binary problems, especially when you need to explain why a model flagged a transaction as fraudulent. Coefficients have natural interpretations — a one-unit increase in feature X multiplies the odds by exp(coef).
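The sigmoid and the odds interpretation can both be checked numerically. The sketch below uses synthetic data from make_classification as a stand-in for real features; the sigmoid and odds helpers are defined here for the example and are not part of scikit-learn's public API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for real features (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def odds(p):
    return p / (1.0 - p)


# The linear score w.x + b pushed through a sigmoid gives P(class 1).
scores = X @ clf.coef_[0] + clf.intercept_[0]
probs_manual = sigmoid(scores)
probs_sklearn = clf.predict_proba(X)[:, 1]

# Raising feature 0 by one unit multiplies the odds by exp(coef_0).
x = X[:1].copy()
x_plus = x.copy()
x_plus[0, 0] += 1.0
p0 = clf.predict_proba(x)[0, 1]
p1 = clf.predict_proba(x_plus)[0, 1]
odds_ratio = odds(p1) / odds(p0)

print("max |manual - sklearn|:", np.abs(probs_manual - probs_sklearn).max())
print("odds ratio:", odds_ratio, "exp(coef_0):", np.exp(clf.coef_[0, 0]))
```

The two probability vectors agree to floating-point precision, and the measured odds ratio matches exp(coef_0). Note that the "one unit" is in whatever scale the model was trained on, so after StandardScaler it means one standard deviation.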

But here's the catch: the coefficients are only trustworthy when features are not highly correlated (multicollinearity). If you dump 50 highly correlated features into it, the coefficients start oscillating like a bad amplifier. Standardize your features first, because the penalty is scale-sensitive, and let L2 regularization shrink and stabilize the correlated coefficients. Use LogisticRegression(penalty='l2', C=1.0) as your starting point. Lower C to increase regularization strength when you see coefficients blowing up.

Production reality: logistic regression is your fast, interpretable baseline. If a random forest beats it by 2% on a small dataset, you don't need the black box. If logistic regression is competitive, you save yourself a heap of debugging later.

LogisticBaseline.py · PYTHON
# io.thecodeforge - ml-ai tutorial

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", C=1.0, random_state=42))
])

# Dummy transaction data: [amount, hour, merchant_id_encoded]
# Three rows per class so that 3-fold stratified CV has both classes in every fold.
X_train = [[120.5, 14, 3], [45.0, 22, 1], [60.0, 10, 2],
           [890.0, 3, 7], [950.0, 2, 8], [1200.0, 4, 9]]
y_train = [0, 0, 0, 1, 1, 1]

# The toy classes are trivially separable, so the score only proves the pipeline runs.
scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring="f1")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
Output
Mean F1: 1.000 (std 0.000)
Production Trap:
Don't use logistic regression without regularization on high-dimensional sparse data (e.g., text). Coefficients will diverge. Always set penalty='l2' and tune C via cross-validation.
Key Takeaway
Logistic regression is a linear classifier with probabilistic outputs. Use it as a baseline; if it's within 2-3% of your best model, keep it for interpretability.

K-Nearest Neighbors — The Lazy Learner That Demands Respect

KNN doesn't train a model. It memorizes the training set and classifies new points by majority vote among the k closest neighbors. This is the laziest learning algorithm in scikit-learn, and it's also one of the easiest to mess up in production.

The decision boundary is non-linear and adapts to local structure, which sounds great until you realize the entire algorithm is basically a nearest-neighbour search at inference time. On a dataset with 100k rows and 50 features, every prediction means computing distances to every training point. That's O(nd) per query. Your API latency will tank.

The only hyperparameter that matters is k. Too small (k=1) and you overfit to noise — a single outlier votes. Too large (k=50) and you smooth out the real signal, especially near class boundaries. Rule of thumb: start with k = sqrt(n_samples) and tune from there. Weighted voting (weights='distance') often helps when classes overlap.

Feature scaling is mandatory. KNN uses Euclidean distance by default. If you have one feature in the thousands (salary) and another in the tens (age), the salary dimension dominates. Pick StandardScaler or MinMaxScaler and put it in a pipeline before KNeighborsClassifier.
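Here is a small sketch of how the unscaled salary axis hijacks the neighbour search. The six customers and the query point are invented for the example, and NearestNeighbors is used instead of a classifier so that the chosen neighbour is directly visible.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, salary]. Values are made up for illustration.
X = np.array([
    [60, 50_500],   # index 0: very different age, almost identical salary
    [26, 53_000],   # index 1: almost identical age, similar salary
    [22, 30_000],
    [45, 110_000],
    [35, 75_000],
    [52, 90_000],
], dtype=float)
query = np.array([[25, 50_000]], dtype=float)

# Raw features: the salary axis dominates the Euclidean distance.
raw_nn = NearestNeighbors(n_neighbors=1).fit(X)
raw_idx = int(raw_nn.kneighbors(query, return_distance=False)[0, 0])

# Standardized features: age and salary contribute on comparable scales.
scaler = StandardScaler().fit(X)
scaled_nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(X))
scaled_idx = int(scaled_nn.kneighbors(scaler.transform(query), return_distance=False)[0, 0])

print("nearest without scaling:", X[raw_idx])
print("nearest with scaling:   ", X[scaled_idx])
```

Without scaling, the 25-year-old is matched to the 60-year-old because their salaries differ by only 500. After scaling, the match is the 26-year-old, which is what most people would call the closer customer.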

Production shortcut: use KNN only when your dataset is under 10k rows or you've precomputed a metric tree (ball_tree or kd_tree). For anything larger, switch to a linear model or a tree ensemble.

KNN_Debug.py · PYTHON
# io.thecodeforge - ml-ai tutorial

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Customer segmentation: [age, annual_income_k, spending_score]
X_train = np.array([[25, 45, 78], [42, 120, 35], [31, 60, 92]])
y_train = np.array([0, 1, 0])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3, weights="distance"))
])

pipeline.fit(X_train, y_train)
# New customer: age=28, income=55k, spending=85
new_customer = np.array([[28, 55, 85]])
pred = pipeline.predict(new_customer)
print(f"Segment: {pred[0]}")
Output
Segment: 0
Performance Trap:
KNN inference time grows linearly with training set size. Profile your endpoint before deploying. If latency exceeds 200ms, switch to KD-Tree or abandon KNN altogether.
Key Takeaway
KNN is non-parametric and simple, but production use requires small datasets or fast nearest-neighbor search. Always scale features. Tune k around sqrt(n_samples).

Decision Trees — Why Your First Tree Probably Overfits

Decision trees are the most intuitive model in scikit-learn: a series of if-else questions on features, splitting data into purer subsets. They're also the easiest way to shoot yourself in the foot. Without constraints, a tree will grow until every leaf contains one sample, memorizing noise. That's a perfect training accuracy and a garbage test score.

You control overfitting with four parameters: max_depth, min_samples_split, min_samples_leaf, and max_features. Start with max_depth=5 and min_samples_leaf=10 and work backwards. A leaf with fewer than 5 samples is usually noise, not signal. Pruning post-training (cost complexity pruning via ccp_alpha) is more principled but rarely used because random forests handle the same problem better.
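The effect of those constraints is easy to see side by side. The sketch below is run on synthetic data with 10% label noise (an assumption chosen to give the unconstrained tree something to memorize); the exact accuracies will depend on your data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y); an assumption for illustration.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

unconstrained = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
constrained = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                                     random_state=42).fit(X_train, y_train)

for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(f"{name:14s} depth={tree.get_depth():2d} leaves={tree.get_n_leaves():3d} "
          f"train_acc={tree.score(X_train, y_train):.3f} "
          f"test_acc={tree.score(X_test, y_test):.3f}")
```

The unconstrained tree reaches 100% training accuracy by growing until every leaf is pure, noise included. Compare the gap between train and test accuracy for the two trees rather than the training score alone.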

Trees split on one feature at a time, so every decision boundary is axis-aligned. That makes them sensitive to rotations of the data: rotate the feature space and the splits change entirely. They handle feature interactions naturally (no need to engineer interaction terms), but they struggle with linear relationships that require diagonal boundaries, which a tree can only approximate with a staircase of splits. If your data has clear linear separators, logistic regression will beat a single tree.

Production rule: a single decision tree is never your final model in a serious system. Use it as a weak learner in a gradient boosting ensemble or as a simple baseline for interpretability (max_depth=3). Export it as a diagram with export_graphviz to explain it to non-technical stakeholders. Anything deeper and you're better off with RandomForestClassifier.

DecisionTree_Prune.py · PYTHON
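# io.thecodeforge - ml-ai tutorial

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Toy loan data: [credit_score, income_k, loan_amount_k, debt_ratio]
# 30 rows per class, because min_samples_leaf up to 15 and 3-fold CV
# cannot work on a handful of rows.
i = np.arange(30)
repaid = np.column_stack([700 + 2 * i, 80 + i, 10 + i, 0.10 + 0.005 * i])
defaulted = np.column_stack([540 + 2 * i, 25 + 0.5 * i, 10 + i, 0.55 + 0.01 * i])
X_train = np.vstack([repaid, defaulted])
y_train = np.array([0] * 30 + [1] * 30)

param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 15]
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="recall"
)
grid.fit(X_train, y_train)
# The toy classes are cleanly separable, so every grid point ties at recall 1.0
# and GridSearchCV reports the first combination it tried.
print(f"Best params: {grid.best_params_}")
print(f"Best recall: {grid.best_score_:.3f}")
Output
Best params: {'max_depth': 3, 'min_samples_leaf': 5}
Best recall: 1.000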
Senior Shortcut:
Don't tune a single tree more than 10 minutes. If it doesn't hit target performance with max_depth≤5, switch to RandomForest or XGBoost. The tree is a diagnostic tool, not a production model.
Key Takeaway
Decision trees overfit without constraint. Use max_depth≤5 and min_samples_leaf≥10. Never ship a single tree to production — use an ensemble or gradient boosting.

Step-by-Step Implementation: From Raw Data to Classified Output

Before writing a single line of code, understand why implementation order matters. Machine learning pipelines usually fail because the data preparation steps run in the wrong order, not because the algorithm is wrong. Classifiers expect numerical, scaled, and properly shaped input.

Start by importing the essentials: pandas for data loading, plus scikit-learn's train_test_split, StandardScaler, and your chosen classifier. Always split the data before scaling so that test-set statistics cannot leak into training. Next, build a pipeline that chains the preprocessing steps and the classifier. Fit the pipeline on the training data, then predict on the test data. Evaluate with a confusion matrix and a classification report, not just accuracy.

The key insight is to implement in layers: load, split, transform, train, evaluate. Each step depends on the previous one. Skipping feature scaling for scale-sensitive models like SVM or KNN usually degrades performance badly. This structured flow turns a chaotic experiment into a reproducible process. Remember: your model is only as good as your data handling sequence.

classification_pipeline.py · PYTHON
# io.thecodeforge - ml-ai tutorial
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale'))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
Output
              precision    recall  f1-score   support

           0       0.94      0.97      0.95       100
           1       0.92      0.85      0.88        60

    accuracy                           0.93       160
Production Trap:
Fitting StandardScaler on the entire dataset before splitting inflates test performance. Always fit scaler only on training data, then transform test data separately.
Key Takeaway
Implement in layers: load, split, scale, train, evaluate — never scale before splitting.
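One way to convince yourself that the pipeline keeps test data out of the scaler is to inspect the fitted statistics. The sketch below swaps data.csv for synthetic data from make_classification (an assumption so the example runs anywhere) and otherwise mirrors the pipeline above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for data.csv (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# The fitted scaler only ever saw the training rows.
scaler_mean = pipeline.named_steps["scaler"].mean_
print("scaler mean == train mean:", np.allclose(scaler_mean, X_train.mean(axis=0)))
print("scaler mean == full mean: ", np.allclose(scaler_mean, X.mean(axis=0)))
```

The scaler's mean_ matches the training-set mean, which confirms that predict() transforms test rows with training statistics instead of refitting on them.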

Unsupervised Learning in Python: Why Clustering Comes Before Classification

Unsupervised learning is not a separate discipline; it is a way to understand your data before you label it. The motivation is pragmatic: most real-world data lacks labels, and even when labels exist, clustering can reveal hidden subgroups that aggregate accuracy metrics miss. Start with K-Means for its speed and interpretability, but only after scaling features, because distance-based methods let large-scale features dominate. Use the elbow method or the silhouette score to choose the number of clusters. DBSCAN suits non-spherical clusters and outlier detection, since it does not force every point into a cluster. After clustering, you can assign cluster IDs as pseudo-labels and train a supervised classifier on them. This can help when labeled data is scarce, but the pseudo-labels are only as meaningful as the clusters, so validate them against whatever true labels you have. Apply PCA to visualize high-dimensional clusters in 2D. The implementation flow: scale, cluster, evaluate cohesion, assign labels, then train a classifier on those labels. Unsupervised learning is not the end goal; it is the exploratory phase that informs subsequent modeling decisions.

unsupervised_clustering.py · PYTHON
# io.thecodeforge - ml-ai tutorial
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

df = pd.read_csv('unlabeled_data.csv')
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.title('Cluster Visualization')
plt.show()
Output
[Visual output: A scatter plot showing three distinct clusters colored in green, yellow, and purple, well-separated in PCA space]
Production Trap:
K-Means assumes spherical clusters of equal size. For real-world data with varying density or elongated shapes, DBSCAN or Gaussian Mixture Models yield more meaningful groupings.
Key Takeaway
Always explore unlabeled structure first — clustering reveals natural groupings that supervised models can then exploit.
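The silhouette-based choice of cluster count mentioned in this section can be sketched like this. The three synthetic blobs stand in for unlabeled_data.csv and are deliberately well separated, so treat the clean result as an illustration, not as what real data will look like.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic blobs stand in for unlabeled_data.csv (assumption).
X, _ = make_blobs(n_samples=300, centers=[[-6, -6], [0, 6], [6, -6]],
                  cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

The silhouette score peaks at the true number of blobs here. On messy data the curve is usually flatter, and the score is better used to rule out bad values of k than to crown a single winner.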
● Production incident · POST-MORTEM · severity: high

Fraud model silently degraded after encoding schema drift

Symptom
Fraud team reports a flood of false positives. Precision drops from 85% to 40% overnight. No code changes deployed.
Assumption
Data distribution shifted due to a marketing campaign or seasonal pattern.
Root cause
The OneHotEncoder was not configured with handle_unknown='ignore'. A new payment method category appeared in production data. The encoder crashed or produced misaligned feature vectors, causing the model to output garbage predictions.
Fix
Set handle_unknown='ignore' on OneHotEncoder. Re-serialize the pipeline. Add a pre-prediction schema validation step that logs new categories without crashing.
Key lesson
  • Always set handle_unknown='ignore' on OneHotEncoder in production pipelines
  • Validate input schema before prediction — log new categories as warnings
  • Monitor prediction distribution daily — a sudden shift in predicted probabilities signals schema or distribution drift
  • Test the loaded pipeline on a sample of production data before every deployment
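The first two lessons can be sketched in a few lines. The payment-method values below are invented for the example, and the schema check is a deliberately simple set comparison against the encoder's fitted categories_, not a full validation framework.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical payment-method column; category names are made up for illustration.
train = np.array([["card"], ["paypal"], ["card"], ["bank_transfer"]])
production = np.array([["card"], ["crypto"]])  # "crypto" never appeared in training

# Default behaviour (handle_unknown='error'): transform raises on the unseen category.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(production)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print("strict encoder raised:", exc)

# handle_unknown='ignore': the unseen category is encoded as an all-zero row.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = tolerant.transform(production).toarray()
print(encoded)

# Lightweight schema check: report categories the encoder was not trained on.
known = {str(c) for c in tolerant.categories_[0]}
unseen = sorted({str(c) for c in production[:, 0]} - known)
if unseen:
    print("WARNING: unseen categories:", unseen)
```

An all-zero row keeps the pipeline alive, but the model has never seen that pattern during training, so the warning matters as much as the flag. Treat a recurring unseen category as a signal to retrain.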
Production debug guide: common failure modes and immediate diagnostic steps for production classification systems (5 entries)
Symptom · 01
Model accuracy looks great but business metrics are terrible
Fix
Check class distribution — you likely have imbalanced classes and accuracy is dominated by the majority class. Switch to F1, PR-AUC, or a cost-weighted metric.
Symptom · 02
Model performance drops suddenly with no code changes
Fix
Check for input schema drift: new categories, renamed columns, changed data types. Run pipeline.predict() on a known-good sample to isolate whether it's a data issue or model corruption.
Symptom · 03
Cross-validation scores are high but test/production scores are low
Fix
You likely have data leakage. Check if preprocessors (scaler, imputer, encoder) were fit before train/test splitting. Move everything into a Pipeline.
Symptom · 04
Model predicts all samples as the majority class
Fix
Class imbalance problem. Set class_weight='balanced', tune the decision threshold down using precision_recall_curve, or apply SMOTE inside an imblearn pipeline.
Symptom · 05
Prediction latency spikes in production
Fix
Check if you're using a large ensemble (500+ trees) or SVM with many support vectors. Profile with timeit. Consider switching to a faster model (LightGBM) or reducing n_estimators.
★ Classification Pipeline Quick Debug: immediate diagnostic commands for production classification issues
Model loaded but predict() crashes on new data
Immediate action
Check input schema matches training schema exactly
Commands
print(pipeline.named_steps['preprocessor'].feature_names_in_)
print(X_new.columns.tolist())
Fix now
Align column names, types, and add missing columns with default values
All predictions are the same class
Immediate action
Check class distribution and decision threshold
Commands
print(y_train.value_counts(normalize=True))
print(pipeline.predict_proba(X_test)[:10])
Fix now
Set class_weight='balanced' or tune threshold via precision_recall_curve
Cross-validation F1 is 0.95 but test F1 is 0.60
Immediate action
Suspect data leakage — check if preprocessors are fit outside the pipeline
Commands
cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
cross_val_score(clf, X_train_preprocessed, y_train, cv=5, scoring='f1')
Fix now
Move all preprocessing inside sklearn.pipeline.Pipeline — never pre-fit transformers
predict_proba returns extreme probabilities (all 0 or 1)
Immediate action
Model is overconfident — check calibration
Commands
from sklearn.calibration import calibration_curve
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
Fix now
Wrap model in CalibratedClassifierCV with cv=5 and method='isotonic'
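The calibration fix from the last entry can be sketched as follows. GaussianNB on synthetic data with redundant features is used here only because it tends to produce overconfident probabilities; whether calibration helps your model, and by how much, has to be measured on your own data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features (an illustrative assumption,
# not the article's dataset).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

raw_probs = raw.predict_proba(X_test)[:, 1]
cal_probs = calibrated.predict_proba(X_test)[:, 1]

# Lower Brier score means probabilities are closer to the observed outcomes.
print(f"Brier score, raw:        {brier_score_loss(y_test, raw_probs):.4f}")
print(f"Brier score, calibrated: {brier_score_loss(y_test, cal_probs):.4f}")
```

Compare the two Brier scores and re-check the calibration curve afterwards. Any threshold you tuned on the uncalibrated probabilities needs to be re-tuned, because calibration changes the probability scale.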
Classifier Comparison for Common Scenarios
Algorithm | Training Speed | Prediction Speed | Handles Imbalance? | Needs Scaling? | Interpretability | Best For
Logistic Regression | Fast | Fast | Yes (class_weight) | Yes | High | Baseline, online learning
Decision Tree | Fast | Fast | Partial (depth bias) | No | High | Interpretable rules
Random Forest | Moderate | Moderate | Yes (class_weight) | No | Medium | General purpose, tabular
Gradient Boosting (XGBoost) | Slow | Fast | Yes (scale_pos_weight) | No | Low | Competitions, high accuracy
SVM (RBF kernel) | Slow | Slow | Yes (class_weight) | Yes | Low | Small datasets, non-linear
Naive Bayes | Very Fast | Very Fast | No | No | High | Text, real-time baselines
KNN | None | Slow (lazy) | No | Yes | Medium | Small datasets, simple decision boundaries

Key takeaways

1. Scikit-Learn's unified API enables rapid algorithm swapping, but the real work is in building leakage-free pipelines.
2. Always wrap preprocessing and the model in a Pipeline to prevent data leakage from test statistics.
3. Accuracy is useless for imbalanced data: report F1, PR-AUC, and precision/recall instead.
4. Tune the decision threshold using precision_recall_curve; the default 0.5 is almost never optimal.
5. Handle class imbalance with class_weight first, then threshold tuning, then SMOTE, and never apply SMOTE before splitting.
6. Use StratifiedKFold for cross-validation to get reliable performance estimates.
7. Set handle_unknown='ignore' on OneHotEncoder to avoid production crashes from unseen categories.

Common mistakes to avoid

Fitting preprocessors before train/test split

Symptom
Cross-validation scores are much higher than test scores (e.g., CV F1=0.95, test F1=0.60). The scaler or imputer has 'seen' the test data and leaked information into training.
Fix
Move all preprocessing into an sklearn.pipeline.Pipeline so that fit() is only called on training data and transform() is applied to test data using training statistics.

Using accuracy for imbalanced classification

Symptom
Model reports 99.5% accuracy but catches zero fraud cases. The confusion matrix shows all predictions are majority class.
Fix
Switch to F1 score, precision/recall, or PR-AUC. Always check class distribution first.

Forgetting handle_unknown='ignore' on OneHotEncoder

Symptom
Pipeline crashes on new data with an error like 'ValueError: Found unknown categories'. Or model silently degrades because categories are misaligned.
Fix
Set OneHotEncoder(handle_unknown='ignore'). Optionally add a log warning when unknown categories appear.

Using predict() instead of predict_proba() in production

Symptom
Model performance is suboptimal; you cannot adjust the decision threshold without retraining.
Fix
Always use predict_proba() and apply a business-specific threshold. Store the threshold as a config parameter.

Applying SMOTE before splitting the dataset

Symptom
Cross-validation scores are suspiciously high (e.g., F1=0.95) but test score drops dramatically (F1=0.65). Synthetic samples have leaked across folds.
Fix
Use imblearn.pipeline.Pipeline to ensure SMOTE is applied inside each fold's training data only.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Explain the difference between precision, recall, and F1-score. When wou...
Q02 · SENIOR
How would you handle extreme class imbalance (1% positive class) in a fr...
Q03 · SENIOR
Design a production classification system that can handle schema drift, ...
Q01 of 03 · JUNIOR

Explain the difference between precision, recall, and F1-score. When would you prioritise precision over recall?

ANSWER
Precision is the fraction of true positives among all predicted positives (TP / (TP + FP)). Recall is the fraction of true positives among all actual positives (TP / (TP + FN)). F1 is the harmonic mean of precision and recall (2 × P × R / (P + R)). Prioritise precision when false positives are costly, for example spam filtering (you don't want legitimate email in spam). Prioritise recall when false negatives are costly, for example disease screening (you don't want to miss a sick patient).
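These definitions are easy to verify by hand against scikit-learn. The ten labels and predictions below are made up for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels and predictions, made up for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)          # 3 / (3 + 1)
recall = tp / (tp + fn)             # 3 / (3 + 1)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The hand-computed values match precision_score, recall_score, and f1_score on the same arrays. Accuracy on this toy set is 0.80, which already looks better than the 0.75 F1 and shows how the true negatives flatter the headline number.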
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What is the difference between classification and regression?
02. When should I use Random Forest vs XGBoost?
03. How do I handle missing values in a classification pipeline?
04. What is data leakage and how do I prevent it?
05. Why is my model's accuracy high but its business impact low?