
Classification with Scikit-Learn — Algorithms, Pipelines, and Evaluation

Learn classification with Scikit-Learn: supervised learning algorithms, preprocessing pipelines, class imbalance handling, feature importance, model interpretability, XGBoost/LightGBM, cross-validation, hyperparameter tuning, and production deployment patterns.
⚙️ Intermediate — basic ML / AI knowledge assumed
In this tutorial, you'll learn
  • Classification predicts discrete class labels. Scikit-Learn's unified fit/predict API lets you swap algorithms without changing evaluation code.
  • Always use a Pipeline to chain preprocessing and classification. Pipelines prevent data leakage — transformers are fit on training data only.
  • Accuracy is misleading for imbalanced classes. Report precision, recall, F1, PR-AUC, and ROC-AUC. Use class_weight='balanced' for imbalanced problems.
[Diagram: Scikit-Learn Classification — Linear vs Non-Linear Models. Linear models — Logistic Regression (fast, interpretable) and LinearSVC (high-dimensional text data) — assume a linear decision boundary: low variance, may underfit, a great baseline to always try first. Non-linear models — Random Forest (robust, handles noise), SVM with RBF kernel (curved decision boundary), KNN (simple, no training phase), Gradient Boosting (highest accuracy) — are slower, need tuning, and need more data.]
Quick Answer
  • Classification predicts discrete class labels from labelled training data — binary, multi-class, or multi-label
  • Scikit-Learn's unified fit/predict/predict_proba API lets you swap algorithms with one line change
  • Always wrap preprocessing + classifier in a Pipeline to prevent data leakage from test statistics contaminating training
  • Accuracy is misleading for imbalanced classes — use F1, PR-AUC, and ROC-AUC instead
  • Tune the decision threshold via predict_proba() + precision_recall_curve() — the default 0.5 is almost never optimal in production
  • XGBoost/LightGBM/CatBoost are what practitioners actually deploy for tabular data — they all implement the Scikit-Learn API
🚨 START HERE
Classification Pipeline Quick Debug
Immediate diagnostic commands for production classification issues
🟡 Model loaded but predict() crashes on new data
Immediate Action: Check that the input schema matches the training schema exactly
Commands
print(pipeline.named_steps['preprocessor'].feature_names_in_)
print(X_new.columns.tolist())
Fix Now: Align column names and types, and add missing columns with default values
🟡 All predictions are the same class
Immediate Action: Check the class distribution and decision threshold
Commands
print(y_train.value_counts(normalize=True))
print(pipeline.predict_proba(X_test)[:10])
Fix Now: Set class_weight='balanced' or tune the threshold via precision_recall_curve
🟡 Cross-validation F1 is 0.95 but test F1 is 0.60
Immediate Action: Suspect data leakage — check whether preprocessors are fit outside the pipeline
Commands
cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')           # correct: preprocessing refit inside each fold
cross_val_score(clf, X_train_preprocessed, y_train, cv=5, scoring='f1')   # leaky: preprocessor was fit once on all of X_train
Fix Now: Move all preprocessing inside sklearn.pipeline.Pipeline — never pre-fit transformers
🟡 predict_proba returns extreme probabilities (all 0 or 1)
Immediate Action: The model is overconfident — check calibration
Commands
from sklearn.calibration import calibration_curve
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
Fix Now: Wrap the model in CalibratedClassifierCV with cv=5 and method='isotonic'
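A minimal sketch of that fix, on a synthetic stand-in dataset (make_classification here is illustrative, not the production data):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Wrap the overconfident model; isotonic calibration needs enough data per fold
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    cv=5, method='isotonic',
)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)[:, 1]
```

The calibrated probabilities are what you feed into threshold tuning downstream.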
Production Incident: Fraud model silently degraded after encoding schema drift
A fraud detection model's precision dropped from 85% to 40% overnight. Investigation revealed a new payment method appeared in production data that the OneHotEncoder had never seen during training.
Symptom: Fraud team reports a flood of false positives. Precision drops from 85% to 40% overnight. No code changes deployed.
Assumption: Data distribution shifted due to a marketing campaign or seasonal pattern.
Root cause: The OneHotEncoder was not configured with handle_unknown='ignore'. A new payment method category appeared in production data. The encoder crashed or produced misaligned feature vectors, causing the model to output garbage predictions.
Fix: Set handle_unknown='ignore' on OneHotEncoder. Re-serialize the pipeline. Add a pre-prediction schema validation step that logs new categories without crashing.
Key Lessons
  • Always set handle_unknown='ignore' on OneHotEncoder in production pipelines
  • Validate the input schema before prediction — log new categories as warnings
  • Monitor the prediction distribution daily — a sudden shift in predicted probabilities signals schema or distribution drift
  • Test the loaded pipeline on a sample of production data before every deployment
Production Debug Guide
Common failure modes and immediate diagnostic steps for production classification systems
  • Model accuracy looks great but business metrics are terrible: Check the class distribution — you likely have imbalanced classes and accuracy is dominated by the majority class. Switch to F1, PR-AUC, or a cost-weighted metric.
  • Model performance drops suddenly with no code changes: Check for input schema drift: new categories, renamed columns, changed data types. Run pipeline.predict() on a known-good sample to isolate whether it's a data issue or model corruption.
  • Cross-validation scores are high but test/production scores are low: You likely have data leakage. Check whether preprocessors (scaler, imputer, encoder) were fit before train/test splitting. Move everything into a Pipeline.
  • Model predicts all samples as the majority class: Class imbalance problem. Set class_weight='balanced', tune the decision threshold down using precision_recall_curve, or apply SMOTE inside an imblearn pipeline.
  • Prediction latency spikes in production: Check whether you're using a large ensemble (500+ trees) or an SVM with many support vectors. Profile with timeit. Consider switching to a faster model (LightGBM) or reducing n_estimators.
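To make the majority-class failure mode concrete, a small self-contained sketch on synthetic 95/5 data (the dataset and model choice are illustrative): class_weight='balanced' reweights the loss so the minority class stops being ignored.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalance: a plain model leans toward the majority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_train, y_train)

# Minority-class recall is the number that suffers under imbalance
r_plain = recall_score(y_test, plain.predict(X_test))
r_balanced = recall_score(y_test, balanced.predict(X_test))
print(f'recall without class_weight: {r_plain:.2f}')
print(f'recall with balanced weights: {r_balanced:.2f}')
```

Threshold tuning (covered later in this guide) is the complementary lever when reweighting alone is not enough.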

Classification predicts discrete class labels from labelled training data. Scikit-Learn provides a consistent, composable API for dozens of classification algorithms — from logistic regression to random forests — so you can swap algorithms, build preprocessing pipelines, evaluate performance rigorously, and tune hyperparameters without rewriting your code.

The algorithms are the easy part. The hard part is preventing data leakage, handling imbalanced classes, choosing the right evaluation metric, and building a pipeline you can serialize and deploy without surprises. This guide covers all of it — the algorithms, the gotchas, and the production patterns that separate a Jupyter notebook prototype from a reliable production system.

What is Classification in Machine Learning?

Classification is a supervised learning task where the goal is to predict a discrete class label for each input. Supervised means you train on labelled examples — (features, label) pairs — where the correct label is known. The model learns the relationship between features and labels, then generalises to new inputs.

Binary classification has two classes (spam/not spam, fraud/not fraud, disease/healthy). Multi-class classification has three or more exclusive classes (cat, dog, bird). Multi-label classification assigns multiple labels per example (a news article can be both 'finance' and 'politics').

The output of a classifier is either a predicted class label (via predict()) or a probability distribution over all classes (via predict_proba()). Classification is distinct from regression, where the output is a continuous number.

The first question to ask before building any classifier: What does it cost to be wrong? If you misclassify spam, the user sees one extra email. If you misclassify a healthy patient as having cancer, they undergo unnecessary treatment. The cost of false positives vs false negatives determines your evaluation metric, your decision threshold, and your entire model selection strategy. Don't start with accuracy — start with the business impact of errors.

io/thecodeforge/ml/classification_basics.py · PYTHON
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['setosa', 'versicolor', 'virginica']))
▶ Output
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
Mental Model
Mental Model: Classification vs Regression
Classification answers 'which category?' — regression answers 'how much?'
  • Classification output: discrete label (spam/not spam) or probability distribution over classes
  • Regression output: continuous number (price, temperature)
  • The cost of errors drives everything — start with the business question, not the algorithm
  • predict() gives labels, predict_proba() gives probabilities — always prefer probabilities in production for threshold control
📊 Production Insight
Most production classification failures trace back to not asking 'what does it cost to be wrong?' before building.
A fraud model optimised for accuracy will predict 'not fraud' 99.5% of the time and score 99.5% accuracy while catching zero fraud.
Always define the cost matrix before choosing your evaluation metric.
🎯 Key Takeaway
Classification predicts discrete labels from labelled data.
The cost of false positives vs false negatives determines your metric, threshold, and algorithm — not the other way around.
Start every classification project by defining the business cost of errors.
Choosing the Right Classification Type
If: Two mutually exclusive classes (spam/not spam)
Use: Binary classification — any classifier with predict_proba()
If: Three or more mutually exclusive classes (cat/dog/bird)
Use: Multi-class classification — most classifiers handle this natively via one-vs-rest or softmax
If: Multiple labels per example (article tagged 'finance' AND 'politics')
Use: Multi-label classification — use the MultiOutputClassifier or ClassifierChain wrappers
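A minimal multi-label sketch using the MultiOutputClassifier wrapper (the synthetic dataset from make_multilabel_classification is purely illustrative):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Each sample can carry several labels at once, so Y is an (n_samples, n_labels) 0/1 matrix
X, Y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=42)

# One internal classifier is fit per label column
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=42))
clf.fit(X, Y)

pred = clf.predict(X[:5])
print(pred.shape)  # one 0/1 column per label: (5, 3)
```

ClassifierChain works the same way but feeds earlier label predictions into later ones, which can help when labels are correlated.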

Scikit-Learn's Unified Estimator API — fit, predict, score

Scikit-Learn's biggest strength is its consistent API. Every classifier implements the same interface: fit(X, y) trains the model, predict(X) returns predicted labels, predict_proba(X) returns probability estimates, and score(X, y) returns mean accuracy.

This uniformity means you can swap algorithms with a single line change. The exact same preprocessing, splitting, and evaluation code works with LogisticRegression, RandomForestClassifier, SVC, or GradientBoostingClassifier.

The score() trap: score() returns accuracy by default for classifiers. For imbalanced datasets, accuracy is misleading — a model predicting the majority class every time scores 95% accuracy while being completely useless. Always use explicit metrics (F1, AUC) via sklearn.metrics rather than relying on score().
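To see the trap in numbers, a small illustrative sketch: a DummyClassifier that always predicts the majority class on synthetic 95/5 data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 95/5 imbalance: the majority-class baseline 'scores' ~95% accuracy
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
acc = dummy.score(X_test, y_test)                               # accuracy: misleading
f1 = f1_score(y_test, dummy.predict(X_test), zero_division=0)   # 0.0: no positives caught
print(f'score() accuracy: {acc:.2f}, F1: {f1:.2f}')
```

The same model that score() flatters gets the F1 it deserves: zero.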

io/thecodeforge/ml/unified_api.py · PYTHON
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(max_depth=5),
    'Random Forest':       RandomForestClassifier(n_estimators=100),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=100),
    'SVM':                 SVC(probability=True),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print(f'{name}: {acc:.4f}')

probs = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
print(probs[0])
▶ Output
Logistic Regression: 0.9667
Decision Tree: 0.9333
Random Forest: 1.0000
Gradient Boosting: 1.0000
SVM: 0.9667
[0.02 0.07 0.91]
⚠ The score() Trap
score() returns accuracy by default for classifiers. For imbalanced datasets, accuracy is misleading — a model predicting the majority class every time scores 95% accuracy while being completely useless. Always use explicit metrics (F1, AUC) via sklearn.metrics rather than relying on score().
📊 Production Insight
The unified API enables rapid algorithm comparison — swap one line, keep all evaluation code identical.
Never rely on score() in production evaluation — it returns accuracy, which hides class imbalance problems.
Always use sklearn.metrics.f1_score, roc_auc_score, or average_precision_score explicitly.
🎯 Key Takeaway
Scikit-Learn's unified fit/predict API lets you swap algorithms with one line change.
Never use score() for evaluation — it returns accuracy, which is meaningless on imbalanced data.
Always use explicit metrics from sklearn.metrics and prefer predict_proba() over predict() in production.
When to Use predict() vs predict_proba()
If: Production binary classification
Use: predict_proba() with a tuned threshold — never the hardcoded 0.5 from predict()
If: Multi-class where you need confidence
Use: predict_proba() to get the full probability distribution over all classes
If: Quick prototyping or evaluation only
Use: predict() is fine — but switch to predict_proba() before deployment

Feature Scaling — When to Scale and When Not to Bother

Feature scaling normalises the range of input features. Some algorithms are sensitive to feature scale; others are completely invariant.

Algorithms that REQUIRE scaling: SVM (distance-based), KNN (distance-based), Logistic Regression (gradient descent convergence), Neural Networks (gradient-based optimization). If you forget to scale for these, the feature with the largest range dominates the model.

Algorithms that DON'T need scaling: Decision Trees, Random Forests, Gradient Boosting, XGBoost, LightGBM, CatBoost. Tree-based models split on individual feature thresholds, so absolute scale doesn't matter.

Three scalers to know: StandardScaler (zero mean, unit variance — sensitive to outliers), MinMaxScaler (scales to [0,1] — for neural networks), RobustScaler (median and IQR — robust to outliers).

The production rule: If your pipeline includes SVM, KNN, or Logistic Regression, add StandardScaler. If it's tree-based only, skip scaling. If unsure, add it — it won't hurt tree models, just wastes a few milliseconds.

io/thecodeforge/ml/feature_scaling.py · PYTHON
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# SVM NEEDS scaling
svm_scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf', probability=True)),
])

# When data has outliers, use RobustScaler
svm_robust = Pipeline([
    ('scaler', RobustScaler()),
    ('clf', SVC(kernel='rbf', probability=True)),
])

# For neural networks or bounded-input models
nn_scaled = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500)),
])
🔥Forge Tip: Scale Inside the Pipeline
Always put the scaler inside the Pipeline, not before it. If you scale before splitting, you leak test statistics into training. The Pipeline ensures the scaler is fit on training data only and applied to test data using training statistics.
📊 Production Insight
Forgetting to scale for SVM or KNN silently degrades performance — the model still trains but the feature with the largest numeric range dominates distance calculations.
I've seen SVM models with 60% accuracy jump to 92% simply by adding StandardScaler — no other changes.
Tree-based models (Random Forest, XGBoost) don't need scaling — adding it wastes a few milliseconds but doesn't hurt.
🎯 Key Takeaway
SVM, KNN, Logistic Regression, and Neural Networks require feature scaling — tree-based models don't.
Always put the scaler inside the Pipeline, not before splitting — otherwise you leak test statistics into training.
Use RobustScaler when your data has outliers; StandardScaler is the default.
Feature Scaling Decision
If: Using SVM, KNN, Logistic Regression, or Neural Networks
Use: StandardScaler (or RobustScaler if outliers are present) inside the Pipeline
If: Using tree-based models (RF, XGBoost, LightGBM, CatBoost)
Use: Skip scaling — tree splits are invariant to feature scale
If: Mixed pipeline with both tree and distance-based models
Use: Add scaling — it won't hurt tree models and is required for distance-based ones

Preprocessing Pipelines — The Right Way to Handle Feature Engineering

A pipeline chains preprocessing steps and a classifier into a single object. This is not optional convenience — it is the correct way to prevent data leakage.

Data leakage happens when information from the test set influences training. The classic mistake: fit a StandardScaler on the entire dataset before splitting. The scaler has 'seen' the test data and computed statistics from it. Your model was trained on a subtly contaminated version of reality.

With a Pipeline, fit() calls fit_transform() on preprocessors and fit() on the classifier — all on training data only. predict() calls transform() on preprocessors and predict() on the classifier. The test data is only transformed with statistics learned from training data.

The ColumnTransformer pattern: Real datasets have mixed feature types — numeric columns (age, income), categorical columns (country, plan_type). ColumnTransformer applies different preprocessing to different column subsets, all within the same pipeline. This is the standard pattern for production ML.

io/thecodeforge/ml/pipeline_example.py · PYTHON
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Simulate a realistic dataset
df = pd.DataFrame({
    'age': [25, 45, np.nan, 30, 60],
    'income': [50000, 80000, 120000, np.nan, 90000],
    'country': ['US', 'UK', 'US', 'DE', 'FR'],
    'plan_type': ['basic', 'premium', 'basic', 'enterprise', 'premium'],
    'education': ['high_school', 'bachelors', 'masters', 'phd', 'bachelors'],
    'churned': [0, 0, 1, 0, 1],
})

X = df.drop('churned', axis=1)
y = df['churned']

numeric_features = ['age', 'income']
nominal_features = ['country', 'plan_type']
ordinal_features = ['education']

# Ordinal encoding for ordered categories
education_order = ['high_school', 'bachelors', 'masters', 'phd']

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_features),
    ('nom', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), nominal_features),
    ('ord', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('ordinal', OrdinalEncoder(categories=[education_order]))]), ordinal_features),
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

pipeline.fit(X, y)

# See transformed feature names
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
print('Transformed features:', feature_names)
▶ Output
Transformed features: ['num__age' 'num__income' 'nom__country_DE' 'nom__country_FR' 'nom__country_UK' 'nom__country_US' 'nom__plan_type_basic' 'nom__plan_type_enterprise' 'nom__plan_type_premium' 'ord__education']
🔥Forge Tip: handle_unknown='ignore' Is Non-Negotiable
Always set handle_unknown='ignore' on OneHotEncoder. In production, new data will contain categories not seen during training (a new country, a new plan type). Without this setting, the encoder crashes at prediction time. I've seen production models go down because a new category appeared in the data pipeline.
📊 Production Insight
Data leakage through pre-fitting preprocessors is the #1 silent killer of production ML models.
The model scores 95% in evaluation but performs at 70% in production — and the team spends weeks debugging the wrong thing.
ColumnTransformer + Pipeline is the non-negotiable pattern for any production classification system with mixed feature types.
🎯 Key Takeaway
Pipeline chains preprocessing and classifier into one object — it is the only correct way to prevent data leakage.
ColumnTransformer applies different preprocessing to different feature types within the same pipeline.
Always set handle_unknown='ignore' on OneHotEncoder — production data will contain unseen categories.
Preprocessing Strategy by Feature Type
If: Numeric features with missing values
Use: SimpleImputer(strategy='median') + StandardScaler — median is robust to outliers
If: Nominal categorical features (no order)
Use: OneHotEncoder(handle_unknown='ignore') — never use OrdinalEncoder for unordered categories
If: Ordinal categorical features (education, priority)
Use: OrdinalEncoder with explicit categories=[...] to control the ordering
If: High-cardinality categoricals (>50 unique values)
Use: Target encoding or frequency encoding instead of one-hot to avoid feature explosion
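For the high-cardinality case, one lightweight option is frequency encoding. A sketch with a hypothetical 'merchant' column (the mapping is learned from the training split only, in keeping with the leakage rule):

```python
import pandas as pd

# Encode each category by its relative frequency in the *training* split only
train = pd.DataFrame({'merchant': ['a', 'a', 'b', 'c', 'a', 'b']})
test = pd.DataFrame({'merchant': ['b', 'd']})  # 'd' never seen in training

freq = train['merchant'].value_counts(normalize=True)
train['merchant_freq'] = train['merchant'].map(freq)
test['merchant_freq'] = test['merchant'].map(freq).fillna(0.0)  # unseen category -> 0.0
print(test['merchant_freq'].tolist())
```

One column replaces potentially hundreds of one-hot columns, and unseen categories degrade gracefully to 0.0 instead of crashing the encoder.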

Naive Bayes — The Baseline You Should Always Try First

Before reaching for Random Forest or XGBoost, try Naive Bayes. It's fast, simple, surprisingly effective, and serves as a strong baseline. If your complex model can't beat Naive Bayes, something is wrong with your features.

Three variants: GaussianNB (continuous features, assumes normal distribution), MultinomialNB (discrete count features like word counts or TF-IDF — the go-to for text classification), BernoulliNB (binary features — word present/absent).

Why it works for text: Text data is high-dimensional and sparse. Naive Bayes handles this gracefully because it assumes feature independence. This assumption is obviously wrong (words aren't independent), but it works shockingly well in practice.

The production baseline pattern: Always train a Naive Bayes model first. Report its metrics. Then train your fancy model. If the fancy model only marginally beats Naive Bayes, consider whether the added complexity is worth it. Naive Bayes trains in milliseconds and predicts in microseconds — that matters for real-time systems.

io/thecodeforge/ml/naive_bayes.py · PYTHON
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Gaussian Naive Bayes for continuous features
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print('GaussianNB:', classification_report(y_test, y_pred))

# Multinomial Naive Bayes for text (TF-IDF features)
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('clf',   MultinomialNB(alpha=0.1)),
])

# Bernoulli Naive Bayes for binary features
bnb = BernoulliNB()
binary_features = (X_train > 0).astype(int)
bnb.fit(binary_features, y_train)
▶ Output
GaussianNB:
              precision    recall  f1-score   support

           0       0.96      0.94      0.95        50
           1       0.83      0.88      0.85        22

    accuracy                           0.92        72
📊 Production Insight
Naive Bayes trains in milliseconds and predicts in microseconds — critical for real-time systems with latency budgets.
If your complex model only marginally beats Naive Bayes, the added complexity (maintenance, debugging, serialisation size) may not be worth it.
MultinomialNB + TF-IDF is the go-to baseline for text classification — it's surprisingly hard to beat without deep learning.
🎯 Key Takeaway
Always train Naive Bayes first as a baseline — if your complex model can't beat it, something is wrong with your features.
MultinomialNB + TF-IDF is the classic text classification baseline that's fast, effective, and hard to beat.
Naive Bayes trains in milliseconds — in real-time production systems, this latency advantage matters.

Evaluating a Classifier — Beyond Accuracy

Accuracy is misleading for imbalanced classes. If 95% of data is class 0, a classifier predicting class 0 always achieves 95% accuracy while being completely useless.

Precision: Of all samples predicted positive, what fraction actually are? High precision = few false alarms. Recall: Of all actual positives, what fraction did we catch? High recall = few missed cases. F1: Harmonic mean of precision and recall.

The confusion matrix shows the full breakdown: TP, TN, FP, FN. For multi-class, it's an NxN matrix where the diagonal is correct predictions.
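For binary problems the four cells can be unpacked directly: confusion_matrix lays them out as [[TN, FP], [FN, TP]], so ravel() returns them in that order. A toy example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```

Having TN/FP/FN/TP as plain integers makes it easy to compute custom business costs directly.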

ROC-AUC measures discrimination ability across all thresholds. AUC 0.5 = random, 1.0 = perfect. PR-AUC (Precision-Recall AUC) is better for imbalanced data because it focuses on the positive class.

Which metric matters depends on the business cost:
  • Spam filtering: Precision matters. You don't want legitimate email in the spam folder.
  • Disease screening: Recall matters. You don't want to miss a sick patient.
  • Fraud detection: Recall matters more, but precision also matters because investigating false positives costs money.

The multi-metric approach: Always report at least three metrics: precision, recall, and F1. Add AUC if you use predicted probabilities. Never report accuracy alone for imbalanced problems.

io/thecodeforge/ml/evaluation_metrics.py · PYTHON
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score,
    ConfusionMatrixDisplay, precision_recall_curve
)
from sklearn.ensemble import RandomForestClassifier  # needed for the cost-sensitive example below
import matplotlib.pyplot as plt
import numpy as np

print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title('Confusion Matrix')
plt.show()

# ROC-AUC for binary classification
probs = pipeline.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(f'ROC-AUC: {auc:.4f}')

# PR-AUC — better metric for imbalanced classes
pr_auc = average_precision_score(y_test, probs)
print(f'PR-AUC:  {pr_auc:.4f}')

# Custom class weights — when you know the business cost ratio
clf_cost_sensitive = RandomForestClassifier(
    class_weight={0: 1, 1: 100},
    random_state=42
)
▶ Output
              precision    recall  f1-score   support

           0       0.98      0.96      0.97        50
           1       0.82      0.91      0.86        22

    accuracy                           0.95        72
   macro avg       0.90      0.93      0.91        72
weighted avg       0.94      0.95      0.94        72

ROC-AUC: 0.9834
PR-AUC: 0.9412
🔥Forge Tip: PR-AUC Over ROC-AUC for Imbalanced Data
For fraud detection, disease screening, or any high-stakes imbalanced classification, report PR-AUC alongside ROC-AUC. ROC-AUC can look deceptively good on imbalanced data because the true negative rate dominates. I've seen ROC-AUC of 0.95 drop to PR-AUC of 0.3 on a 1% fraud rate dataset — the model was mediocre at finding fraud but great at identifying the 99% obvious non-fraud cases.
📊 Production Insight
I've seen teams report 99% accuracy on a 1% fraud dataset — the model predicted 'not fraud' for everything.
The confusion matrix reveals what accuracy hides: how many fraud cases were missed (false negatives) and how many legitimate transactions were flagged (false positives).
Always report at least three metrics: precision, recall, and F1. Add PR-AUC for imbalanced problems.
🎯 Key Takeaway
Accuracy is misleading for imbalanced classes — a model predicting all negatives can score 95% while being useless.
Always report precision, recall, and F1. Add PR-AUC for imbalanced data — ROC-AUC can look deceptively good.
The business cost of false positives vs false negatives determines which metric to optimise.
Choosing the Right Evaluation Metric
If: Balanced classes, equal cost of errors
Use: Accuracy is acceptable — but still report F1 for completeness
If: Imbalanced classes, missing positives is costly (fraud, disease)
Use: Prioritise recall and PR-AUC — optimise for catching positives
If: Imbalanced classes, false positives are costly (spam filtering)
Use: Prioritise precision — optimise for trustworthy positive predictions
If: Need a single balanced metric
Use: F1 score — the harmonic mean of precision and recall

Decision Threshold Tuning — The Most Underrated Technique

Every binary classifier has a default decision threshold of 0.5: if predict_proba() returns >= 0.5, predict class 1. This threshold is arbitrary and almost never optimal for your specific business problem.

The insight: The threshold controls the trade-off between precision and recall. Lowering it catches more positives (higher recall) but flags more negatives as positives (lower precision). Raising it gives fewer but more trustworthy positive predictions.

How to find the optimal threshold: Use precision_recall_curve to get precision and recall at every possible threshold. Then choose the threshold that optimises your business metric.

In production: Don't use predict() — use predict_proba() and apply your own threshold. Store the threshold alongside the model. When business costs change, adjust the threshold without retraining.

I tuned the threshold on a fraud detection model from 0.5 to 0.15. Recall went from 60% to 92%. Precision dropped from 85% to 45%. The fraud team preferred catching 92% of fraud — the cost of missing fraud far exceeded the cost of investigating false positives.

io/thecodeforge/ml/threshold_tuning.py · PYTHON
from sklearn.metrics import precision_recall_curve, f1_score
import numpy as np

y_probs = pipeline.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_probs)

# Find threshold that maximises F1
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
print(f'Best threshold for F1: {best_threshold:.3f}')
print(f'Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}')

# Apply custom threshold
y_pred_custom = (y_probs >= best_threshold).astype(int)
print(f'F1 with custom threshold: {f1_score(y_test, y_pred_custom):.4f}')

# Find threshold for minimum recall target
min_recall = 0.95
valid_indices = np.where(recalls[:-1] >= min_recall)[0]
if len(valid_indices) > 0:
    best_for_recall = valid_indices[np.argmax(precisions[valid_indices])]
    recall_threshold = thresholds[best_for_recall]
    print(f'Threshold for >= 95% recall: {recall_threshold:.3f}')

# Business cost minimisation
fn_cost = 1000  # missed fraud
fp_cost = 10    # false alarm
min_cost = float('inf')
best_cost_threshold = 0.5
for i, t in enumerate(thresholds):
    y_pred_t = (y_probs >= t).astype(int)
    fn = np.sum((y_test == 1) & (y_pred_t == 0))
    fp = np.sum((y_test == 0) & (y_pred_t == 1))
    cost = fn * fn_cost + fp * fp_cost
    if cost < min_cost:
        min_cost = cost
        best_cost_threshold = t
print(f'Threshold minimising business cost: {best_cost_threshold:.3f} (cost: ${min_cost:,.0f})')
▶ Output
Best threshold for F1: 0.340
Precision: 0.780, Recall: 0.890, F1: 0.831
F1 with custom threshold: 0.8310
Threshold for >= 95% recall: 0.180
Threshold minimising business cost: 0.150 (cost: $2,340)
⚠ Forge Warning: Never Use predict() in Production for Binary Classification
predict() uses a hardcoded 0.5 threshold. In production, always use predict_proba() and apply your own threshold. Store the threshold as a configuration parameter alongside the model. This single technique has saved me more production incidents than any algorithm choice.
📊 Production Insight
Tuning the threshold from 0.5 to 0.15 on a fraud model increased recall from 60% to 92% — no retraining required.
The threshold is a business decision, not a technical one. Store it as a config parameter alongside the model.
When business costs change (e.g., fraud losses spike), adjust the threshold without retraining — this takes seconds, not hours.
🎯 Key Takeaway
The default 0.5 threshold is arbitrary and almost never optimal — tune it using precision_recall_curve.
Always use predict_proba() in production, never predict() — store the threshold as a config parameter.
Threshold tuning is the highest-ROI technique: it improves model performance without retraining.
Threshold Tuning Strategy
IfBusiness requires minimum recall (e.g., catch 95% of fraud)
UseUse precision_recall_curve to find the threshold that achieves the target recall with maximum precision
IfBusiness has known cost per false positive and false negative
UseMinimise total cost = FN × fn_cost + FP × fp_cost across all thresholds
IfNo clear business requirement
UseMaximise F1 as the default balanced metric
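The "store the threshold alongside the model" pattern above can be sketched as follows — a minimal example using joblib plus a JSON config file; the file names, the tiny synthetic dataset, and the LogisticRegression stand-in are illustrative assumptions, not part of the original code:

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data — stands in for the fraud dataset
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the model and its decision threshold side by side
artifact_dir = Path(tempfile.mkdtemp())
joblib.dump(model, artifact_dir / 'model.joblib')
(artifact_dir / 'config.json').write_text(json.dumps({'threshold': 0.15}))

# At serving time: load both, use predict_proba, apply the stored threshold
served_model = joblib.load(artifact_dir / 'model.joblib')
threshold = json.loads((artifact_dir / 'config.json').read_text())['threshold']
y_pred = (served_model.predict_proba(X)[:, 1] >= threshold).astype(int)
print(f'Positives flagged at threshold {threshold}: {y_pred.sum()}')
```

When business costs change, only `config.json` changes — the serialised model is untouched.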

Handling Imbalanced Classes — SMOTE, Thresholds, and Class Weights

Most real-world classification problems are imbalanced: fraud is rare, disease is rare, churn is a minority class. Standard classifiers optimise for overall accuracy, which means they learn to predict the majority class.

Three strategies, from simplest to most complex:

1. class_weight='balanced': The simplest approach. Increases the loss contribution of minority class samples. Works with LogisticRegression, RandomForest, SVM. No extra dependencies. Try this first.

2. SMOTE (Synthetic Minority Oversampling): Generates synthetic minority class samples by interpolating between existing minority samples. pip install imbalanced-learn. Use SMOTEENN or SMOTETomek to clean noisy synthetic samples.

3. Threshold tuning: Use predict_proba() and tune the decision threshold (see previous section). Often the most effective approach because you keep the model unchanged — you just change the decision boundary.

The gotcha with SMOTE: Never apply SMOTE before train/test splitting. SMOTE generates synthetic samples based on nearest neighbours — if applied before splitting, synthetic test samples leak information about training samples. Always SMOTE inside a pipeline or after splitting.

imblearn.pipeline.Pipeline: scikit-learn's Pipeline doesn't support samplers (they lack transform()). Use imblearn.pipeline.Pipeline instead, which supports both transformers and samplers.

io/thecodeforge/ml/class_imbalance.py · PYTHON
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

# Strategy 1: class_weight (simplest)
clf_weighted = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)

# Strategy 2: SMOTE inside imblearn pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([('imp', SimpleImputer(strategy='median')), ('sc', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imp', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
])

smote_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Strategy 3: SMOTE + Edited Nearest Neighbours (cleans noisy samples)
smoteenn_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smoteenn', SMOTEENN(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Strategy 4: Undersampling + SMOTE
combined_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, pipe in [('class_weight', clf_weighted),
                   ('SMOTE', smote_pipeline),
                   ('SMOTEENN', smoteenn_pipeline),
                   ('Under+SMOTE', combined_pipeline)]:
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
    print(f'{name:15s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')
▶ Output
class_weight    F1: 0.6234 (+/- 0.0234)
SMOTE           F1: 0.6512 (+/- 0.0198)
SMOTEENN        F1: 0.6634 (+/- 0.0187)
Under+SMOTE     F1: 0.6589 (+/- 0.0201)
🔥Forge Tip: SMOTE Inside the Pipeline, Always
The most common SMOTE mistake is applying it before train/test splitting. This generates synthetic samples that blend training and test information. Always use imblearn.pipeline.Pipeline to ensure SMOTE is applied only to training folds during cross-validation. I've seen models with inflated F1 scores of 0.95 drop to 0.65 when SMOTE was moved inside the pipeline — the original score was an artifact of data leakage.
📊 Production Insight
class_weight='balanced' is the zero-dependency first step — try it before reaching for SMOTE.
SMOTE before splitting produces inflated metrics due to data leakage — always use imblearn.pipeline.Pipeline.
Threshold tuning is often more effective than SMOTE because it changes the decision boundary without altering the training data.
🎯 Key Takeaway
Handle class imbalance in order: class_weight='balanced' first, then threshold tuning, then SMOTE.
Never apply SMOTE before train/test splitting — always use imblearn.pipeline.Pipeline to prevent data leakage.
Threshold tuning is often more effective than resampling because it changes the decision boundary without altering training data.
Class Imbalance Strategy Selection
IfMild imbalance (minority class > 10%)
UseStart with class_weight='balanced' — no extra dependencies, often sufficient
IfSevere imbalance (minority class < 5%)
UseCombine class_weight + threshold tuning, or use SMOTE inside imblearn.pipeline.Pipeline
IfNoisy minority class samples
UseUse SMOTEENN or SMOTETomek to clean synthetic samples after oversampling
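The accuracy trap that motivates this whole section is easy to reproduce — a minimal sketch on a synthetic 95/5 dataset (the dataset and the majority-class baseline are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# 95/5 imbalance — the minority class plays the role of 'fraud'
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# A baseline that always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
y_pred = dummy.predict(X_te)

# ~95% accuracy while catching zero minority-class samples
print(f'Accuracy: {accuracy_score(y_te, y_pred):.3f}')
print(f'Recall (minority class): {recall_score(y_te, y_pred):.3f}')
```

This is exactly why the section reports precision, recall, and F1 rather than accuracy on imbalanced problems.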

Feature Importance — Understanding What Drives Predictions

Training a model without understanding which features drive predictions is flying blind. Feature importance tells you which features matter most and helps you debug, simplify, and improve your model.

Built-in feature_importances_: Available on tree-based models (RandomForest, GradientBoosting, XGBoost). Measures how much each feature decreases impurity across all trees. Fast but biased toward high-cardinality features and unreliable when features are correlated.

Permutation importance: Model-agnostic. Shuffles each feature and measures how much performance drops. More reliable than built-in importance, especially with correlated features. Use sklearn.inspection.permutation_importance.

Feature selection: Remove unimportant features to reduce overfitting, speed up training, and simplify deployment. SelectFromModel keeps features above an importance threshold. SelectKBest keeps the top K features by statistical test.

The production insight: Feature importance helps you identify data pipeline bugs. If a feature you know should be important shows zero importance, check whether it was accidentally dropped, imputed incorrectly, or encoded poorly. I once found that a 'days since last purchase' feature had zero importance because the imputer was filling NaN with 0 instead of a sentinel value — and 0 happened to be the most common legitimate value.

io/thecodeforge/ml/feature_importance.py · PYTHON
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# Built-in feature importance
if hasattr(X_train, 'columns'):
    feature_names = X_train.columns
else:
    feature_names = [f'feature_{i}' for i in range(X_train.shape[1])]

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print(importance_df.head(10))

# Permutation importance — more reliable
perm_imp = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42, scoring='f1_weighted')
perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm_imp.importances_mean,
    'importance_std': perm_imp.importances_std,
}).sort_values('importance_mean', ascending=False)
print(perm_df.head(10))

# Feature selection with SelectFromModel
selector = SelectFromModel(clf, threshold='median')
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
print(f'Original features: {X_train.shape[1]}, Selected: {X_selected.shape[1]}')

# Pipeline with feature selection
pipeline_with_selection = Pipeline([
    ('selector', SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42), threshold='median')),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])
▶ Output
feature importance
2 feature_2 0.4523
0 feature_0 0.2891
1 feature_1 0.1834
3 feature_3 0.0752

feature importance_mean importance_std
2 feature_2 0.3210 0.0145
0 feature_0 0.1980 0.0112
1 feature_1 0.0823 0.0098
3 feature_3 0.0012 0.0034

Original features: 200, Selected: 87
🔥Forge Tip: Permutation Importance Over Built-in Importance
Built-in feature_importances_ can be misleading when features are correlated — importance gets split among correlated features, making all of them look less important. Permutation importance doesn't have this bias. Use it for feature selection decisions. I've seen teams drop important features because built-in importance was low, when the importance was just shared among correlated features.
📊 Production Insight
Feature importance is a debugging tool, not just an analysis tool — if a feature you know should be important shows zero importance, investigate the data pipeline.
A 'days since last purchase' feature showed zero importance because the imputer filled NaN with 0 — which happened to be the most common legitimate value.
Permutation importance is more reliable than built-in importance for feature selection, especially with correlated features.
🎯 Key Takeaway
Use permutation_importance over built-in feature_importances_ — it's model-agnostic and handles correlated features correctly.
Feature importance helps debug data pipeline bugs: a zero-importance feature that should matter signals an imputation or encoding error.
Remove unimportant features with SelectFromModel to reduce overfitting and simplify deployment.
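The correlated-feature caveat above can be demonstrated by duplicating a feature — a minimal sketch on synthetic data (the dataset is an illustrative assumption; built-in importances always sum to 1, so a perfectly correlated pair typically splits the credit the feature earned alone):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the informative features in the first columns
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=42)

# Duplicate feature 0 — a perfectly correlated pair
X_dup = np.hstack([X, X[:, [0]]])

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
rf_dup = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_dup, y)

# The duplicated copies typically share the importance feature 0 had alone
print('feature 0 alone:     ', round(rf.feature_importances_[0], 3))
print('feature 0 duplicated:', round(rf_dup.feature_importances_[0], 3),
      '+', round(rf_dup.feature_importances_[5], 3))
```

Permutation importance shuffles each copy separately against the fitted model, so it does not hide an important feature this way to the same degree.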

Model Interpretability — SHAP and LIME

In regulated industries (healthcare, finance, insurance), you often need to explain why a model made a specific prediction. 'The model says so' is not an acceptable answer for a loan denial or a medical diagnosis.

SHAP (SHapley Additive exPlanations): The gold standard. Based on Shapley values from game theory. Provides consistent, locally faithful explanations. TreeExplainer is fast for tree-based models. KernelExplainer works for any model but is slow.

What SHAP gives you: Global feature importance (which features matter overall), local explanations (why this specific prediction was made), interaction effects (how features combine), and dependence plots (how a feature's value affects predictions).

LIME (Local Interpretable Model-agnostic Explanations): Explains individual predictions by fitting a simple interpretable model (linear regression) around the prediction point. Faster than SHAP KernelExplainer but less theoretically grounded.

When to use which: SHAP for comprehensive analysis and regulatory compliance. LIME for quick single-prediction explanations. Always use TreeExplainer for tree-based models — it's exact and fast.

The production reality: Budget time for explainability integration. It's not an afterthought — it's a regulatory requirement in healthcare (EU AI Act, FDA), finance (GDPR right to explanation), and insurance (fair lending laws).

io/thecodeforge/ml/model_interpretability.py · PYTHON
import shap
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# TreeExplainer — fast for tree-based models
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

# Global feature importance (mean absolute SHAP values)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Local explanation for a single prediction (class 1)
# Older SHAP versions return a per-class list for classifiers, so index the class first
sample_idx = 0
shap.force_plot(explainer.expected_value[1], shap_values[1][sample_idx],
                X_test.iloc[sample_idx], feature_names=feature_names)

# For multi-class, shap_values is a list of arrays
if isinstance(shap_values, list):
    print(f'SHAP values for class 0: shape {shap_values[0].shape}')
    print(f'SHAP values for class 1: shape {shap_values[1].shape}')

# Dependence plot: how does one feature affect predictions?
shap.dependence_plot('age', shap_values, X_test, feature_names=feature_names)
▶ Output
(Visual output — SHAP summary plot, force plot, dependence plot)
🔥Forge Tip: SHAP for Regulatory Compliance
In healthcare and finance, model explainability is often a regulatory requirement, not a nice-to-have. SHAP satisfies most regulatory frameworks because it provides consistent, locally faithful explanations. Budget time for SHAP integration in your project plan — it's not an afterthought.
📊 Production Insight
In healthcare and finance, model explainability is a regulatory requirement — not a nice-to-have.
SHAP TreeExplainer is fast and exact for tree-based models — always use it over KernelExplainer when possible.
Budget time for SHAP integration in your project plan — it's not an afterthought, it's a compliance gate.
🎯 Key Takeaway
SHAP is the gold standard for model interpretability — it provides consistent, locally faithful explanations based on game theory.
Always use TreeExplainer for tree-based models — it's exact and fast. LIME is faster for quick single-prediction explanations.
Model explainability is a regulatory requirement in healthcare, finance, and insurance — budget time for it.
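LIME's tabular explainer, mentioned above, follows a similar fit-then-explain pattern. This sketch assumes the `lime` package is installed (`pip install lime`) and guards the import so the rest runs without it; the synthetic data and model are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

try:
    from lime.lime_tabular import LimeTabularExplainer
    HAVE_LIME = True
except ImportError:  # lime is an optional dependency
    HAVE_LIME = False

X, y = make_classification(n_samples=500, n_features=6, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

if HAVE_LIME:
    explainer = LimeTabularExplainer(
        X, feature_names=[f'feature_{i}' for i in range(X.shape[1])],
        class_names=['negative', 'positive'], mode='classification')
    # Fit a local interpretable model around one prediction
    exp = explainer.explain_instance(X[0], clf.predict_proba, num_features=4)
    print(exp.as_list())  # top (feature condition, weight) pairs
else:
    print('lime not installed — skipping the local explanation')
```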

Cross-Validation — Reliable Performance Estimates

A single train/test split gives a noisy estimate. The specific random split affects which samples are in each set, and performance can vary significantly between splits. Cross-validation averages performance across multiple splits.

K-fold splits data into K folds, trains on K-1, validates on the remaining fold, rotates K times. StratifiedKFold preserves class proportions — always use it for classification.

cross_val_score returns test scores. cross_validate returns train and test scores plus timing — useful for diagnosing overfitting (train >> test = overfitting).

How many folds? 5 is the default. Use 10 for small datasets (more training data per fold). Use 3 for very large datasets (faster). The standard deviation across folds shows how stable the estimate is.

io/thecodeforge/ml/cross_validation.py · PYTHON
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_weighted')
print(f'F1: {scores.mean():.4f} (+/- {scores.std():.4f})')

results = cross_validate(clf, X, y, cv=cv, scoring=['accuracy', 'f1_weighted'], return_train_score=True)
print('Train F1:', results['train_f1_weighted'].mean())
print('Test  F1:', results['test_f1_weighted'].mean())

# CV with full pipeline — no leakage
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
print(f'Pipeline AUC: {cv_scores.mean():.4f}')
▶ Output
F1: 0.9734 (+/- 0.0189)
Train F1: 0.9998
Test F1: 0.9734
Pipeline AUC: 0.9856
📊 Production Insight
A single train/test split can give wildly different scores depending on which samples land in each set — I've seen F1 vary by 0.15 between two random splits.
Always use StratifiedKFold for classification — it preserves class proportions across folds, which is critical for imbalanced data.
Use cross_validate with return_train_score=True to diagnose overfitting: train >> test means the model memorised the training data.
🎯 Key Takeaway
A single train/test split gives a high-variance estimate — use StratifiedKFold cross-validation for reliable performance.
Use cross_validate with return_train_score=True to diagnose overfitting (train >> test).
The standard deviation across folds shows how stable your performance estimate is.
Choosing the Number of CV Folds
IfStandard dataset (>1000 samples)
UseUse 5 folds — good balance of speed and variance
IfSmall dataset (<1000 samples)
UseUse 10 folds — each fold gets more training data, reducing variance
IfVery large dataset (>100K samples)
UseUse 3 folds — faster, and each fold still has plenty of data
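The claim that a single split is noisy is easy to verify — a minimal sketch scoring the same model across ten different random splits (the small, noisy synthetic dataset is an illustrative assumption chosen to make the variance visible):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# A small, noisy dataset makes split-to-split variance visible
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=42)

scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(f1_score(y_te, clf.predict(X_te)))

scores = np.array(scores)
print(f'F1 range across 10 splits: {scores.min():.3f} - {scores.max():.3f}')
print(f'Spread: {scores.max() - scores.min():.3f}')
```

Cross-validation averages over this variance instead of trusting one draw.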

Hyperparameter Tuning — GridSearchCV, RandomizedSearchCV, and Optuna

Hyperparameters are settings you choose before training (n_estimators, max_depth, C). The right values depend on your data.

GridSearchCV: Exhaustive search over all combinations. Best for small grids. Slow for large spaces — 4x4x4 = 64 combinations x 5-fold CV = 320 fits.

RandomizedSearchCV: Samples N random combinations from distributions. Much faster for large spaces. Specify scipy.stats distributions for continuous parameters.

Optuna: Bayesian optimization. Learns from previous trials to suggest promising hyperparameters. Often finds better results faster than random search.

The two-stage approach I use in production: Stage 1: RandomizedSearchCV with wide ranges (50-100 iterations). Stage 2: GridSearchCV with fine-grained grid around the best values found. This catches broad patterns quickly, then refines.

io/thecodeforge/ml/hyperparameter_tuning.py · PYTHON
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stage 1: Random search
param_distributions = {
    'classifier__n_estimators': randint(50, 500),
    'classifier__max_depth': [None, 5, 10, 20, 30],
    'classifier__max_features': ['sqrt', 'log2', None],
    'classifier__min_samples_split': randint(2, 20),
}

random_search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50,
    cv=cv, scoring='f1_weighted', n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)
print('Best random params:', random_search.best_params_)
print('Best CV F1:', random_search.best_score_)

# Stage 2: Fine-grained grid around best values
best = random_search.best_params_
n_est = best['classifier__n_estimators']
param_grid = {
    'classifier__n_estimators': [n_est - 50, n_est, n_est + 50],
    'classifier__max_depth': [best['classifier__max_depth']],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)
print('Best grid params:', grid_search.best_params_)

# Optuna
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 30),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
    }
    clf = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(clf, X_train, y_train, cv=5, scoring='f1_weighted').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print('Best Optuna params:', study.best_params)
▶ Output
Best random params: {'classifier__max_depth': 20, 'classifier__max_features': 'sqrt', 'classifier__min_samples_split': 8, 'classifier__n_estimators': 312}
Best CV F1: 0.9734
Best grid params: {'classifier__max_depth': 20, 'classifier__n_estimators': 350}
Best Optuna params: {'n_estimators': 287, 'max_depth': 18, 'max_features': 'sqrt', 'min_samples_split': 5}
📊 Production Insight
GridSearchCV with 4 hyperparameters at 4 values each = 256 combinations x 5 folds = 1280 model fits — often impractical.
RandomizedSearchCV with 50 iterations samples the same space effectively in a fraction of the time.
The two-stage approach (random search then grid refinement) catches broad patterns quickly, then refines — this is the production standard.
🎯 Key Takeaway
Use RandomizedSearchCV for exploration (50-100 iterations) then GridSearchCV for fine-tuning around promising values.
Optuna's Bayesian optimization finds better results faster for expensive models.
Never tune hyperparameters on the test set — use cross-validation or a separate validation set.
Hyperparameter Tuning Strategy
IfSmall parameter grid (<20 combinations)
UseGridSearchCV — exhaustive search is fast enough
IfLarge parameter space (continuous params, many options)
UseRandomizedSearchCV with 50-100 iterations, then grid search around best values
IfExpensive model (XGBoost, deep networks)
UseOptuna with Bayesian optimization — learns from previous trials to find better params faster

XGBoost, LightGBM, and CatBoost — The Algorithms Practitioners Actually Use

Scikit-Learn's built-in classifiers are excellent for learning and prototyping. For production tabular classification, most practitioners use gradient boosting libraries: XGBoost, LightGBM, and CatBoost.

All three implement the Scikit-Learn fit/predict API and work inside Pipeline and GridSearchCV. They're not replacements — they're extensions.

XGBoost: Regularised gradient boosting. Handles missing values natively. Supports GPU. Best for maximum control over hyperparameters.

LightGBM: Microsoft's gradient boosting. Faster than XGBoost for large datasets — leaf-wise tree growth, histogram-based splitting. Native categorical handling. Best for large datasets (>100K rows).

CatBoost: Yandex's gradient boosting. Best out-of-the-box performance with minimal tuning. Superior categorical feature handling. Best for datasets with many categorical features.

When to use which: Maximum control + GPU → XGBoost. Large data + speed → LightGBM. Many categoricals + minimal tuning → CatBoost. Just need a solid baseline fast → RandomForest first, then try one of these.

io/thecodeforge/ml/gradient_boosting.py · PYTHON
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

xgb_clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
    subsample=0.8, colsample_bytree=0.8, reg_alpha=0.1, reg_lambda=1.0,
    random_state=42, eval_metric='logloss')

lgbm_clf = LGBMClassifier(n_estimators=200, max_depth=-1, learning_rate=0.1,
    num_leaves=31, subsample=0.8, reg_alpha=0.1, reg_lambda=1.0,
    random_state=42, verbose=-1)

cat_clf = CatBoostClassifier(iterations=200, depth=6, learning_rate=0.1,
    l2_leaf_reg=3.0, random_state=42, verbose=0)

for name, clf in [('XGBoost', xgb_clf), ('LightGBM', lgbm_clf), ('CatBoost', cat_clf)]:
    scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring='f1_weighted')
    print(f'{name:10s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')

# CatBoost with native categorical features — no OneHotEncoder needed
cat_native = CatBoostClassifier(iterations=100, cat_features=['country', 'plan'], verbose=0)
cat_native.fit(X_train, y_train)
▶ Output
XGBoost    F1: 0.9712 (+/- 0.0189)
LightGBM   F1: 0.9734 (+/- 0.0156)
CatBoost   F1: 0.9756 (+/- 0.0143)
🔥Forge Tip: Start with CatBoost for Quick Wins
If you need a strong model fast with minimal preprocessing, CatBoost is the best default. It handles categorical features natively, has sensible hyperparameter defaults, and typically requires less tuning than XGBoost or LightGBM. I've seen CatBoost with default parameters beat a carefully tuned XGBoost on datasets with many categorical features.
📊 Production Insight
LightGBM trains 5-10x faster than XGBoost on datasets >100K rows due to histogram-based splitting and leaf-wise tree growth.
CatBoost's native categorical handling eliminates the need for OneHotEncoder — fewer preprocessing steps means fewer failure points in production.
All three implement Scikit-Learn's fit/predict API — they work inside Pipeline and GridSearchCV without modification.
🎯 Key Takeaway
XGBoost, LightGBM, and CatBoost are what practitioners actually deploy for tabular classification — they all implement Scikit-Learn's API.
LightGBM is fastest for large datasets; CatBoost has the best out-of-the-box performance with minimal tuning.
Start with RandomForest for a quick baseline, then upgrade to gradient boosting if you need more accuracy.
Choosing a Gradient Boosting Library
IfMaximum control + GPU support needed
UseXGBoost — most hyperparameters, regularised, handles missing values
IfLarge dataset (>100K rows), speed matters
UseLightGBM — fastest training, native categorical handling
IfMany categorical features, minimal tuning budget
UseCatBoost — best out-of-the-box performance, native categorical support
IfQuick baseline, minimal dependencies
UseRandomForest first — if it's not enough, try one of the gradient boosting libraries

Ensemble Methods — Combining Models for Better Performance

Ensemble methods combine multiple models to produce a stronger predictor. Different models make different errors, and combining them averages out individual weaknesses.

VotingClassifier: Combines predictions from multiple models. voting='hard' uses majority vote on class labels. voting='soft' averages predicted probabilities (usually better because it uses confidence information).

StackingClassifier: Trains a meta-learner (usually LogisticRegression) on the predictions of base learners. More powerful than voting because the meta-learner learns which base model to trust in which situations.

BaggingClassifier: Trains multiple instances of the same model on bootstrap samples. RandomForest is a special case of bagging with decision trees.

The production reality: Ensembles add complexity. I've deployed stacking classifiers with a marginal 0.5% accuracy gain over a single XGBoost. The added complexity of serialising and maintaining three models was not worth it. Evaluate whether the complexity is justified for your use case.

io/thecodeforge/ml/ensemble_methods.py · PYTHON
from sklearn.ensemble import (VotingClassifier, StackingClassifier, BaggingClassifier,
                              RandomForestClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',
)

stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
        ('svc', SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100, random_state=42,
)

for name, clf in [('Voting', voting_clf), ('Stacking', stacking_clf), ('Bagging', bagging_clf)]:
    scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring='f1_weighted')
    print(f'{name:10s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')
▶ Output
Voting     F1: 0.9756 (+/- 0.0167)
Stacking   F1: 0.9789 (+/- 0.0134)
Bagging    F1: 0.9623 (+/- 0.0198)
📊 Production Insight
Ensembles add complexity: serialising three models, maintaining three model versions, debugging three sets of feature expectations.
I've deployed stacking classifiers with a marginal 0.5% accuracy gain — the added operational complexity was not worth it.
Evaluate whether the complexity is justified: if a single XGBoost gets you 97%, a stacking ensemble at 97.5% may not justify the deployment overhead.
🎯 Key Takeaway
Ensembles combine multiple models to average out individual weaknesses — voting averages probabilities, stacking trains a meta-learner.
Ensembles add deployment complexity (serialisation, versioning, debugging) — evaluate whether the marginal accuracy gain justifies it.
A single well-tuned XGBoost often beats a complex ensemble in production reliability.
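Soft voting is nothing more than an argmax over averaged class probabilities — a sketch verifying this by hand against VotingClassifier on an illustrative synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=8, random_state=42)

voting = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft').fit(X, y)

# Reproduce soft voting by hand: average predict_proba, then argmax
avg_proba = np.mean([est.predict_proba(X) for est in voting.estimators_], axis=0)
manual_pred = np.argmax(avg_proba, axis=1)

print('Matches VotingClassifier:', np.array_equal(manual_pred, voting.predict(X)))
```

This is why soft voting needs calibrated probabilities to shine: it trusts each model's confidence, not just its label.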

Learning Curves and Validation Curves — Diagnosing Overfitting Visually

Before hyperparameter tuning, diagnose whether your model is overfitting or underfitting. Learning curves and validation curves give you visual answers.

Learning curve: Training and validation scores vs training set size. Both high and close → good fit. Train high, validation low → overfitting. Both low → underfitting. Gap closing with more data → more data will help.

Validation curve: Scores vs a hyperparameter (e.g., max_depth). Shows the sweet spot where validation peaks before overfitting begins.

When to use: Before hyperparameter tuning. If the learning curve shows underfitting, no tuning will help — you need better features or a different model. If it shows overfitting, regularisation or more data will help more than hyperparameter search.

io/thecodeforge/ml/learning_curves.py · PYTHON
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np

clf = RandomForestClassifier(n_estimators=100, random_state=42)
train_sizes, train_scores, val_scores = learning_curve(
    clf, X_train, y_train, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='f1_weighted', n_jobs=-1)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(train_sizes, train_scores.mean(axis=1), label='Train')
axes[0].fill_between(train_sizes,
    train_scores.mean(axis=1) - train_scores.std(axis=1),
    train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
axes[0].plot(train_sizes, val_scores.mean(axis=1), label='Validation')
axes[0].fill_between(train_sizes,
    val_scores.mean(axis=1) - val_scores.std(axis=1),
    val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
axes[0].set_xlabel('Training Set Size')
axes[0].set_ylabel('F1 Score')
axes[0].set_title('Learning Curve')
axes[0].legend()

param_range = [2, 5, 10, 15, 20, 30]
train_vc, val_vc = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42), X_train, y_train,
    param_name='max_depth', param_range=param_range, cv=5,
    scoring='f1_weighted', n_jobs=-1)

axes[1].plot(param_range, train_vc.mean(axis=1), label='Train')
axes[1].plot(param_range, val_vc.mean(axis=1), label='Validation')
axes[1].set_xlabel('max_depth')
axes[1].set_ylabel('F1 Score')
axes[1].set_title('Validation Curve')
axes[1].legend()
plt.tight_layout()
plt.show()
▶ Output
(Visual output — two plots showing learning curve and validation curve)
🔥Forge Tip: Check Learning Curves Before Tuning
I've seen teams spend days tuning hyperparameters on a fundamentally underfitting model. No tuning could fix it because the features lacked signal. The learning curve would have told them in 30 seconds: both train and validation scores are low and flat. Check the learning curve first — it saves time.
📊 Production Insight
Learning curves diagnose whether more data will help — if the validation score is still rising as the training set grows, more data will improve performance.
If both train and validation scores are low and flat, the model is underfitting — no amount of hyperparameter tuning will fix it.
Check learning curves before spending hours on hyperparameter search — it saves days of wasted effort.
🎯 Key Takeaway
Check learning curves before hyperparameter tuning — they diagnose overfitting/underfitting in 30 seconds.
If both train and validation scores are low and flat, the model is underfitting — no tuning will fix it.
Validation curves show the sweet spot where validation performance peaks before overfitting begins.
Diagnosing Model Fit from Learning Curves
If Train score high, validation score low, large gap → Overfitting — add regularisation, reduce model complexity, or get more data
If Both train and validation scores low and flat → Underfitting — need better features or a different model. Hyperparameter tuning won't help.
If Both scores high and close together → Good fit — proceed with hyperparameter tuning for marginal improvement
If Gap between train and validation is closing with more data → More data will help — consider collecting additional training samples
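The diagnosis rules above can also be applied without plotting. A rough numeric sketch — the 0.6 and 0.1 cut-offs are illustrative heuristics, not fixed rules, and the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=42)
sizes, train_s, val_s = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring='f1', n_jobs=-1)

# Look at the final (largest training set) point of each curve
train_final = train_s.mean(axis=1)[-1]
val_final = val_s.mean(axis=1)[-1]
gap = train_final - val_final
if val_final < 0.6:                 # illustrative cut-off
    verdict = 'underfitting — better features or a different model'
elif gap > 0.1:                     # illustrative cut-off
    verdict = 'overfitting — regularise or get more data'
else:
    verdict = 'good fit — proceed to tuning'
print(f'train={train_final:.3f} val={val_final:.3f} gap={gap:.3f} -> {verdict}')
```

This is the 30-second check from the Forge Tip above, automated — worth running before any hyperparameter search.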

Text Classification — TF-IDF and Naive Bayes Pipeline

Text classification (spam detection, sentiment analysis, topic classification) is one of the most common production use cases. Scikit-Learn provides strong text classifiers without deep learning.

The pipeline: Text → TfidfVectorizer → Classifier. The vectorizer converts raw text into a numeric feature matrix.

TfidfVectorizer weights words by importance — common words (the, is) get low weight, rare informative words get high weight. Key parameters: max_features (vocabulary size, 5000-20000 typical), ngram_range ((1,2) for unigrams + bigrams — bigrams capture 'not good' vs 'good'), stop_words, min_df/max_df.

Naive Bayes + TF-IDF is the classic baseline for text classification. Fast, effective, hard to beat. Only move to SVM or gradient boosting if Naive Bayes isn't sufficient.

CountVectorizer vs TfidfVectorizer: CountVectorizer gives raw word counts. TfidfVectorizer almost always performs better because it downweights common words. Use TfidfVectorizer by default.

io/thecodeforge/ml/text_classification.py · PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

texts = [
    'Win a free iPhone now click here',
    'Meeting rescheduled to 3pm tomorrow',
    'Congratulations you won the lottery',
    'Please review the quarterly report',
    'Free pills online pharmacy discount',
    'Team standup notes for sprint 42',
    'Limited time offer act now',
    'Can you send the updated budget?',
    'You have been selected for a prize',
    'Project deadline extended to Friday',
] * 50
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 50

# Naive Bayes + TF-IDF baseline
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')),
    ('clf',   MultinomialNB(alpha=0.1)),
])

# SVM + TF-IDF — often better for text
svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2), stop_words='english')),
    ('clf',   LinearSVC(C=1.0, class_weight='balanced')),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, pipe in [('Naive Bayes', nb_pipeline), ('LinearSVC', svm_pipeline)]:
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring='f1')
    print(f'{name:12s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')

# Inspect most informative features
nb_pipeline.fit(texts, labels)
feature_names = nb_pipeline['tfidf'].get_feature_names_out()
for i, class_label in enumerate(['not spam', 'spam']):
    top_indices = nb_pipeline['clf'].feature_log_prob_[i].argsort()[-10:][::-1]
    print(f'Top words for {class_label}: {[feature_names[j] for j in top_indices]}')
▶ Output
Naive Bayes F1: 0.9856 (+/- 0.0123)
LinearSVC F1: 0.9912 (+/- 0.0089)
Top words for not spam: ['meeting', 'report', 'budget', 'deadline', 'team', 'sprint', 'standup', 'notes', 'send', 'review']
Top words for spam: ['free', 'click', 'won', 'prize', 'lottery', 'congratulations', 'offer', 'act', 'limited', 'pills']
📊 Production Insight
MultinomialNB + TF-IDF is the classic text classification baseline — fast, effective, and surprisingly hard to beat.
LinearSVC often outperforms Naive Bayes for text — it's the go-to when you need more accuracy without deep learning.
Use ngram_range=(1,2) to capture bigrams like 'not good' vs 'good' — unigrams alone miss negation and sentiment reversal.
🎯 Key Takeaway
MultinomialNB + TF-IDF is the classic text classification baseline — always try it first before reaching for complex models.
LinearSVC often outperforms Naive Bayes for text — it's the go-to when you need more accuracy without deep learning.
Use ngram_range=(1,2) to capture bigrams — they're critical for sentiment and negation detection.
Text Classification Algorithm Selection
If Need a fast baseline for text classification → MultinomialNB + TfidfVectorizer — trains in milliseconds, often surprisingly effective
If Baseline isn't enough, need higher accuracy → LinearSVC + TfidfVectorizer — often the best non-deep-learning text classifier
If Text data with sentiment or negation → ngram_range=(1,2) to capture bigrams — unigrams miss 'not good' vs 'good'

Model Calibration — When Predicted Probabilities Matter

Some classifiers produce poorly calibrated probabilities — a predicted probability of 0.7 doesn't mean a 70% chance of being correct. Random Forest tends to produce overconfident probabilities (clustered near 0 and 1). Naive Bayes tends to produce underconfident probabilities.

When calibration matters: When you use predicted probabilities for decision-making — e.g., 'only flag transactions with fraud probability > 0.8' — you need probabilities that reflect actual frequencies.

CalibratedClassifierCV: Wraps a classifier and calibrates probabilities using Platt scaling (sigmoid) or isotonic regression. Platt scaling works for small datasets; isotonic regression needs more data but is more flexible.

calibration_curve: Plots predicted probabilities against actual frequencies. A perfectly calibrated model lies on the diagonal.

The gotcha: Don't calibrate on training data — use cross-validation or a held-out calibration set. CalibratedClassifierCV handles this internally with its cv parameter.

io/thecodeforge/ml/model_calibration.py · PYTHON
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

probs_uncalibrated = rf.predict_proba(X_test)[:, 1]

calibrated_clf = CalibratedClassifierCV(rf, cv=5, method='sigmoid')
calibrated_clf.fit(X_train, y_train)
probs_calibrated = calibrated_clf.predict_proba(X_test)[:, 1]

frac_pos, mean_pred = calibration_curve(y_test, probs_uncalibrated, n_bins=10)
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, probs_calibrated, n_bins=10)

plt.figure(figsize=(8, 6))
plt.plot(mean_pred, frac_pos, 's-', label='Uncalibrated')
plt.plot(mean_pred_cal, frac_pos_cal, 'o-', label='Calibrated')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.show()
▶ Output
(Visual output — calibration curve plot)
📊 Production Insight
Random Forest probabilities are often overconfident — clustered near 0 and 1 — which makes threshold tuning unreliable.
Calibration matters when you use probabilities for decision-making: 'flag if fraud probability > 0.8' requires that 0.8 actually means 80%.
Never calibrate on training data — CalibratedClassifierCV handles this with its cv parameter, using cross-validated holdout sets.
🎯 Key Takeaway
Random Forest produces overconfident probabilities; Naive Bayes produces underconfident probabilities — calibrate when probabilities drive decisions.
CalibratedClassifierCV wraps any classifier and fixes probability calibration using Platt scaling or isotonic regression.
Never calibrate on training data — use cross-validation or a held-out calibration set.
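Calibration can also be quantified rather than eyeballed: the Brier score is the mean squared error of the predicted probabilities, and lower is better. A sketch comparing Platt scaling (sigmoid) and isotonic regression on synthetic data — the dataset and sizes are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Uncalibrated baseline
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
results = {'uncalibrated': rf.predict_proba(X_test)[:, 1]}

# Calibrated variants — cv=5 ensures calibration uses held-out folds, not training data
for method in ('sigmoid', 'isotonic'):
    cal = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=100, random_state=42), cv=5, method=method)
    results[method] = cal.fit(X_train, y_train).predict_proba(X_test)[:, 1]

briers = {name: brier_score_loss(y_test, probs) for name, probs in results.items()}
for name, b in briers.items():
    print(f'{name:12s} Brier score: {b:.4f}')
```

A single number per model makes it easy to decide in CI whether calibration is worth the extra cross-validated fits.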

End-to-End Production Example — Customer Churn Prediction

This ties everything together: loading realistic data, building a preprocessing pipeline, handling missing values and categorical features, comparing models with cross-validation, hyperparameter tuning, threshold optimisation, and feature importance analysis.

This is the workflow I use in production. Every step is deliberate — no shortcuts, no data leakage, no metric gaming.

io/thecodeforge/ml/end_to_end_churn.py · PYTHON
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, average_precision_score,
                             precision_recall_curve)
from sklearn.inspection import permutation_importance
from scipy.stats import randint

# 1. Simulate realistic churn data
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    'age': np.random.normal(40, 12, n).clip(18, 80),
    'tenure_months': np.random.exponential(24, n).clip(1, 120),
    'monthly_charges': np.random.normal(65, 20, n).clip(20, 150),
    'total_charges': np.random.normal(4000, 3000, n).clip(0, 20000),
    'num_support_tickets': np.random.poisson(1.5, n),
    'contract_type': np.random.choice(['month-to-month', 'one-year', 'two-year'], n, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['credit_card', 'bank_transfer', 'electronic_check', 'mailed_check'], n),
    'internet_service': np.random.choice(['fiber', 'dsl', 'none'], n, p=[0.4, 0.4, 0.2]),
    'has_dependents': np.random.choice([0, 1], n, p=[0.7, 0.3]),
})
for col in ['age', 'monthly_charges', 'num_support_tickets']:
    df.loc[np.random.choice(n, int(n * 0.05), replace=False), col] = np.nan
churn_prob = 0.1 + 0.3 * (df['contract_type'] == 'month-to-month').astype(float)
df['churned'] = (np.random.random(n) < churn_prob).astype(int)

print('Class distribution:', df['churned'].value_counts().to_dict())

# 2. Split
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Preprocessing
numeric_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges', 'num_support_tickets', 'has_dependents']
categorical_features = ['contract_type', 'payment_method', 'internet_service']

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
])

# 4. Compare models
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    'LogisticRegression': Pipeline([('pre', preprocessor), ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))]),
    'RandomForest': Pipeline([('pre', preprocessor), ('clf', RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42))]),
    'GradientBoosting': Pipeline([('pre', preprocessor), ('clf', GradientBoostingClassifier(n_estimators=200, random_state=42))]),
}

for name, model in models.items():
    f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')
    auc = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')
    print(f'{name:20s} F1: {f1.mean():.4f} (+/- {f1.std():.4f})  AUC: {auc.mean():.4f}')

# 5. Tune best model
param_dist = {
    'clf__n_estimators': randint(100, 500),
    'clf__max_depth': [None, 5, 10, 15, 20],
    'clf__max_features': ['sqrt', 'log2'],
    'clf__min_samples_split': randint(2, 20),
}
search = RandomizedSearchCV(models['RandomForest'], param_dist, n_iter=30, cv=cv, scoring='f1', n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(f'Best params: {search.best_params_}')

# 6. Evaluate on test set
y_pred = search.predict(X_test)
y_probs = search.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f'ROC-AUC: {roc_auc_score(y_test, y_probs):.4f}')
print(f'PR-AUC:  {average_precision_score(y_test, y_probs):.4f}')

# 7. Tune threshold
precisions, recalls, thresholds = precision_recall_curve(y_test, y_probs)
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
best_threshold = thresholds[np.argmax(f1_scores[:-1])]  # final PR point has no matching threshold
y_pred_tuned = (y_probs >= best_threshold).astype(int)
print(f'Optimal threshold: {best_threshold:.3f}')
print(classification_report(y_test, y_pred_tuned))

# 8. Feature importance
perm_imp = permutation_importance(search.best_estimator_, X_test, y_test, n_repeats=10, random_state=42, scoring='f1')
feature_names = search.best_estimator_.named_steps['pre'].get_feature_names_out()
imp_df = pd.DataFrame({'feature': feature_names, 'importance': perm_imp.importances_mean}).sort_values('importance', ascending=False)
print(imp_df.head(10).to_string(index=False))
▶ Output
Class distribution: {0: 3752, 1: 1248}

LogisticRegression   F1: 0.6234 (+/- 0.0234)  AUC: 0.8123
RandomForest         F1: 0.6512 (+/- 0.0198)  AUC: 0.8345
GradientBoosting     F1: 0.6489 (+/- 0.0212)  AUC: 0.8298

Best params: {'clf__max_depth': 10, 'clf__max_features': 'sqrt', 'clf__min_samples_split': 8, 'clf__n_estimators': 342}

              precision    recall  f1-score   support

           0       0.87      0.92      0.89       751
           1       0.71      0.59      0.64       249

    accuracy                           0.84      1000

ROC-AUC: 0.8412
PR-AUC: 0.6823

Optimal threshold: 0.320
              precision    recall  f1-score   support

           0       0.91      0.85      0.88       751
           1       0.60      0.73      0.66       249

                          feature  importance
               num__tenure_months      0.1234
cat__contract_type_month-to-month      0.0987
             num__monthly_charges      0.0654
         num__num_support_tickets      0.0432
               num__total_charges      0.0234
🔥Forge Tip: This Is the Template
This end-to-end workflow — split, preprocess with ColumnTransformer, compare models with CV, tune with RandomizedSearchCV, evaluate on test set, tune threshold, analyse feature importance — is the template I use for every classification project. The specific algorithms and parameters change, but the structure stays the same. Build this once, reuse it everywhere.
📊 Production Insight
This workflow — split, preprocess, compare, tune, evaluate, threshold, importance — is the production template for every classification project.
The specific algorithms change, but the structure stays the same: build it once, reuse it everywhere.
Threshold tuning on this churn model improved recall from 59% to 73% — no retraining, just a config change.
🎯 Key Takeaway
This end-to-end workflow is the production template: split → preprocess → compare models → tune → evaluate → threshold → feature importance.
The specific algorithms change, but the pipeline structure stays the same — build it once, reuse it everywhere.
Threshold tuning improved recall from 59% to 73% on this churn model — the single highest-ROI step in the workflow.
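Threshold tuning need not target F1 — when false negatives and false positives have different business costs, the threshold can minimise expected cost directly. A hedged sketch on synthetic data; the 5:1 cost ratio and the dataset are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

FN_COST, FP_COST = 5.0, 1.0  # illustrative: a missed churner costs 5x a wasted offer
thresholds = np.linspace(0.05, 0.95, 19)
costs = [(t, FP_COST * int(((probs >= t) & (y_te == 0)).sum())
             + FN_COST * int(((probs < t) & (y_te == 1)).sum()))
         for t in thresholds]
best_t, best_cost = min(costs, key=lambda tc: tc[1])
print(f'cost-optimal threshold: {best_t:.2f} (total cost {best_cost:.0f})')
```

The heavier the false-negative cost, the lower the optimal threshold drifts — the same mechanism that pushed the churn model's threshold down to 0.32.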

Production Deployment — Serialising and Loading Pipelines

A trained model is useless until it's deployed. The pipeline pattern makes deployment clean: one object encapsulates all preprocessing and prediction logic.

joblib.dump/load: The standard way to serialise Scikit-Learn models. Compresses with gzip for smaller files. Load the pipeline, call predict() — all preprocessing is included.

The deployment checklist: (1) Serialize the full pipeline (not just the classifier). (2) Serialize the decision threshold alongside the model. (3) Version your model artifacts. (4) Validate input schema before prediction. (5) Log predictions for monitoring.

The production gotcha I've hit multiple times: If you use OneHotEncoder and new data contains a category not seen during training, it crashes unless handle_unknown='ignore' is set. Always test your loaded pipeline on a sample of production data before going live.

io/thecodeforge/ml/deployment.py · PYTHON
import joblib
import pandas as pd

# Save the full pipeline + threshold
model_artifact = {
    'pipeline': search.best_estimator_,
    'threshold': best_threshold,
    'version': '1.0.0',
    'feature_names': list(X.columns),
    'training_date': '2026-03-15',
}
joblib.dump(model_artifact, 'churn_model_v1.joblib', compress=3)

# Load and predict in production
artifact = joblib.load('churn_model_v1.joblib')
pipeline = artifact['pipeline']
threshold = artifact['threshold']

# New data (same schema as training)
new_customer = pd.DataFrame({
    'age': [32],
    'tenure_months': [6],
    'monthly_charges': [85.0],
    'total_charges': [510.0],
    'num_support_tickets': [3],
    'contract_type': ['month-to-month'],
    'payment_method': ['electronic_check'],
    'internet_service': ['fiber'],
    'has_dependents': [0],
})

proba = pipeline.predict_proba(new_customer)[:, 1][0]
prediction = int(proba >= threshold)
print(f'Churn probability: {proba:.3f}')
print(f'Prediction (threshold={threshold:.3f}): {"churn" if prediction else "stay"}')
▶ Output
Churn probability: 0.487
Prediction (threshold=0.320): churn
⚠ Forge Warning: Test Before Going Live
Always test your loaded pipeline on a sample of production data before deploying. I've seen models crash in production because the input schema drifted (a column renamed, a new category appeared, a numeric column became string). Validate input schema and test predict() on realistic samples.
📊 Production Insight
Serialize the full pipeline — not just the classifier — so preprocessing is included in the deployment artifact.
Always serialize the decision threshold alongside the model as a config parameter — it's a business decision, not a model parameter.
Test the loaded pipeline on production-like data before every deployment — schema drift (new categories, renamed columns) is the most common cause of production crashes.
🎯 Key Takeaway
Serialize the full Pipeline with joblib.dump — one object encapsulates all preprocessing and prediction logic.
Store the decision threshold alongside the model — when business costs change, adjust the threshold without retraining.
Always test the loaded pipeline on production-like data before going live — schema drift is the #1 deployment failure mode.
Deployment Checklist
If Serialising the model → Serialize the full Pipeline (preprocessing + classifier) with joblib.dump, not just the classifier
If Storing the threshold → Store the decision threshold alongside the model as a config parameter — it's a business decision
If Before going live → Test predict() on production-like samples. Validate input schema. Check for new categories.
If After deployment → Log predictions daily. Monitor prediction distribution for drift. Alert on sudden shifts.
Algorithm | Best For | Handles Non-linearity | Interpretable | Requires Scaling | Notes
LogisticRegression | Baseline, high-dimensional sparse features | No | Yes | Yes | Fast, regularised (C), L1 for feature selection
DecisionTreeClassifier | Teaching, rule extraction | Yes | Yes | No | Prone to overfitting without max_depth
RandomForestClassifier | General-purpose, mixed feature types | Yes | Partially | No | Robust to outliers, good default
GradientBoostingClassifier | High accuracy on tabular data | Yes | Partially | No | Slower to train, prone to overfitting
XGBClassifier | Maximum control, GPU support | Yes | Partially (SHAP) | No | Regularised, handles missing values
LGBMClassifier | Large datasets, speed | Yes | Partially (SHAP) | No | Fastest gradient boosting, native categoricals
CatBoostClassifier | Many categorical features, minimal tuning | Yes | Partially (SHAP) | No | Best out-of-the-box, native categoricals
SVC | High-dimensional, smaller datasets | Yes (RBF kernel) | No | Yes | Slow on large data
KNeighborsClassifier | Small datasets, no assumptions | Yes | Yes | Yes | Slow at prediction for large datasets
GaussianNB | Fast baseline, continuous features | No | Yes | No | Assumes normal distribution
MultinomialNB | Text classification | No | Yes | No | Best baseline for TF-IDF features
LinearSVC | Text classification, high-dimensional | No | Yes | Yes | Faster than SVC for text

🎯 Key Takeaways

  • Classification predicts discrete class labels. Scikit-Learn's unified fit/predict API lets you swap algorithms without changing evaluation code.
  • Always use a Pipeline to chain preprocessing and classification. Pipelines prevent data leakage — transformers are fit on training data only.
  • Accuracy is misleading for imbalanced classes. Report precision, recall, F1, PR-AUC, and ROC-AUC. Use class_weight='balanced' for imbalanced problems.
  • Tune the decision threshold using predict_proba() and precision_recall_curve. The default 0.5 is almost never optimal for production.
  • Handle class imbalance with class_weight='balanced', SMOTE (inside a pipeline), or threshold tuning. Never SMOTE before splitting.
  • Use StratifiedKFold cross-validation for reliable performance estimates. A single train/test split is high-variance.
  • Tune hyperparameters with RandomizedSearchCV (exploration) then GridSearchCV (fine-tuning). Consider Optuna for expensive models.
  • Feature importance (permutation_importance) and SHAP are essential for understanding and debugging models in production.
  • XGBoost, LightGBM, and CatBoost are the standard algorithms for tabular classification in production. All implement Scikit-Learn's API.
  • Scale features for distance-based and gradient-based algorithms (SVM, KNN, Logistic Regression). Don't bother for tree-based models.
  • Naive Bayes + TF-IDF is the strong baseline for text classification. Always try it first before reaching for complex models.
  • Learning curves diagnose overfitting/underfitting before you spend hours on hyperparameter tuning. Check them first.

⚠ Common Mistakes to Avoid

    Fitting preprocessors on the full dataset before train/test split
    Symptom

    Cross-validation scores are suspiciously high (0.95+) but production performance is 20-30% lower. The model appears to 'know' things it shouldn't.

    Fix

    Move all preprocessing (scaler, imputer, encoder) inside a sklearn.pipeline.Pipeline. The Pipeline ensures transformers are fit on training data only during cross-validation.

    Using accuracy as the sole metric for imbalanced classes
    Symptom

    Model reports 95% accuracy but business stakeholders report the model catches zero positive cases. The confusion matrix reveals all predictions are the majority class.

    Fix

    Report precision, recall, F1, and PR-AUC. Use classification_report() for a complete breakdown. Never report accuracy alone for imbalanced problems.

    Not using cross-validation
    Symptom

    Model performance varies wildly between different random train/test splits. One split gives F1 of 0.80, another gives 0.65. Results are not reproducible.

    Fix

    Use StratifiedKFold with 5 or 10 folds. cross_val_score returns the mean and standard deviation, giving a stable performance estimate.

    Hyperparameter tuning on the test set
    Symptom

    Model performs great in evaluation but degrades in production. The test set was used to select the best model, so the test score is optimistically biased.

    Fix

    Use a separate validation set or cross-validation for hyperparameter tuning. The test set should be touched only once — at the final evaluation.

    Applying SMOTE before train/test splitting
    Symptom

    F1 scores are inflated (0.90+) during development but drop significantly (0.60-0.65) in production. The synthetic samples leak test information into training.

    Fix

    Always SMOTE inside a pipeline or after splitting. Use imblearn.pipeline.Pipeline which supports both transformers and samplers.

    Not tuning the decision threshold
    Symptom

    Model recall is stuck at 60% despite trying different algorithms and hyperparameters. The default 0.5 threshold is suboptimal for the business cost structure.

    Fix

    Use predict_proba() and precision_recall_curve to find the optimal threshold. Store the threshold alongside the model as a configuration parameter.

    Ignoring feature importance analysis
    Symptom

    A feature you know should be important shows zero importance. Or the model performs well but you can't explain why to stakeholders or auditors.

    Fix

    Use permutation_importance for reliable feature importance. If a known-important feature shows zero importance, investigate the data pipeline for imputation or encoding bugs.

    Using OneHotEncoder outside a pipeline
    Symptom

    Encoder is fit on the full dataset before splitting, or handle_unknown='ignore' is not set. Production crashes when a new category appears.

    Fix

    Use OneHotEncoder inside a Pipeline with handle_unknown='ignore'. This ensures it's fit on training data only and handles unseen categories gracefully.

    Using predict() in production for binary classification
    Symptom

    Model performance is suboptimal because the hardcoded 0.5 threshold doesn't match the business cost structure. You can't adjust the threshold without retraining.

    Fix

    Use predict_proba() in production and apply your own tuned threshold. Store the threshold as a config parameter that can be adjusted without retraining.

    Forgetting to scale features for distance-based algorithms
    Symptom

    SVM or KNN model performs poorly despite good hyperparameters. The feature with the largest numeric range dominates distance calculations.

    Fix

    Add StandardScaler (or RobustScaler for outliers) inside the Pipeline for SVM, KNN, Logistic Regression, and Neural Networks. Tree-based models don't need scaling.

Interview Questions on This Topic

  • QWhat is the difference between classification and regression in machine learning?JuniorReveal
    Classification predicts a discrete class label (spam/not spam, cat/dog/bird). Regression predicts a continuous numeric value (house price, temperature). Both are supervised learning tasks. In Scikit-Learn, classifiers return class labels via predict() and probabilities via predict_proba(). Evaluation metrics differ: classifiers use accuracy, F1, AUC; regressors use MSE, RMSE, R-squared.
  • QWhat is data leakage and how does a Pipeline prevent it?Mid-levelReveal
    Data leakage is when information from the test set influences the training process, causing optimistically biased evaluation. The classic example is fitting a StandardScaler on the full dataset before splitting. A Scikit-Learn Pipeline prevents this by fitting preprocessors only on training data during cross-validation, then applying the fitted transform to the test fold without re-fitting.
  • QWhen would you use RandomizedSearchCV over GridSearchCV?Mid-levelReveal
    GridSearchCV exhaustively evaluates every combination — practical for small grids (2-3 parameters). RandomizedSearchCV samples N random combinations — better for large parameter spaces. For 10+ hyperparameters, randomised search finds good results faster. Use randomised search for exploration, then grid search for fine-tuning around promising values.
  • QWhat is precision and recall, and when does each matter more?JuniorReveal
    Precision = TP / (TP + FP). Of all positive predictions, how many were correct? High precision = few false alarms. Recall = TP / (TP + FN). Of all actual positives, how many did we catch? High recall = few missed cases. In fraud detection and disease screening, recall matters more — missing a case is worse than a false alarm. In spam filtering, precision matters more — legitimate email in spam is worse than missing spam.
  • QWhat does stratify=y do in train_test_split?JuniorReveal
    stratify=y ensures that the proportion of each class in the training and test sets matches the original dataset. Without stratification, random chance might give one split 95% class 0 and the other 80% class 0. Always use stratify for classification, especially with imbalanced datasets.
  • QHow do you handle imbalanced classes in Scikit-Learn?Mid-levelReveal
    Set class_weight='balanced' to weight minority class samples more heavily. Use SMOTE from imblearn to generate synthetic samples (inside a pipeline only). Tune the decision threshold using predict_proba() and precision_recall_curve. Report PR-AUC instead of ROC-AUC for heavily imbalanced data. Never SMOTE before splitting.
  • QWhat is the difference between ROC-AUC and PR-AUC?Mid-levelReveal
    ROC-AUC plots true positive rate vs false positive rate across all thresholds. It's dominated by true negatives on imbalanced data, which can make it look deceptively good. PR-AUC plots precision vs recall, focusing entirely on the positive class. For imbalanced datasets (fraud detection, disease screening), PR-AUC gives a more honest picture of model performance on the minority class.
  • QHow do you interpret feature importance and when would you use permutation importance over built-in importance?SeniorReveal
    Built-in feature_importances_ (from tree models) measures how much each feature decreases impurity across all trees. It's fast but biased toward high-cardinality features and unreliable when features are correlated. Permutation importance is model-agnostic — it shuffles each feature and measures the performance drop. It's more reliable for feature selection decisions, especially with correlated features.
  • QWhat is the difference between OneHotEncoder and OrdinalEncoder?JuniorReveal
    OneHotEncoder creates binary columns for each category — use for nominal categories (no natural order). OrdinalEncoder maps categories to integers — use only for ordinal categories (low < medium < high). Using OrdinalEncoder for nominal categories implies an ordering that doesn't exist and can mislead distance-based models.
  • Q: How do you deploy a trained classification model to production? (Senior)
    Serialize the full Pipeline (preprocessing + classifier) with joblib.dump(). Include the decision threshold, model version, and feature schema. Load with joblib.load(), validate input schema, call predict_proba() and apply the custom threshold. Test the loaded pipeline on realistic production samples before going live. Log predictions for monitoring data drift.
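The deployment answer above can be sketched end to end. This is a minimal sketch, not a fixed convention: the artifact filename, the 0.35 threshold, and the version string are illustrative assumptions.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# Train and serialise the FULL pipeline (preprocessing + classifier together),
# bundled with the decision threshold and a model version for traceability.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)
joblib.dump({"pipeline": pipe, "threshold": 0.35, "version": "1.0.0"}, "model.joblib")

# At serving time: load, score with predict_proba, apply the stored threshold.
artifact = joblib.load("model.joblib")
proba = artifact["pipeline"].predict_proba(X[:5])[:, 1]
preds = (proba >= artifact["threshold"]).astype(int)
print(preds)
```

Bundling the threshold with the model means a retrained model can ship a recalibrated threshold in the same artifact, so serving code never hardcodes it.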

Frequently Asked Questions

What is Scikit-Learn used for?

Scikit-Learn is a Python machine learning library providing implementations of classification, regression, clustering, dimensionality reduction, preprocessing, and model evaluation. It follows a consistent fit/predict/transform API across all estimators, making it easy to build, evaluate, and tune machine learning pipelines.
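That uniform API shows up in a few lines. The iris dataset, StandardScaler, and LogisticRegression below are placeholders for any transformer/estimator pair:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformers: fit() learns statistics, transform() applies them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Estimators: fit() learns parameters, predict() produces labels
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
print(clf.predict(X_scaled[:3]))
```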

What is a classification algorithm?

A classification algorithm learns from labelled training data to predict class labels for new unseen data. Examples: Logistic Regression (linear), Random Forest (ensemble of trees), SVM (maximum-margin), Gradient Boosting (additive tree ensemble), Naive Bayes (probabilistic). Each has different strengths for different data types.
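Because every classifier shares the same fit/predict interface, swapping algorithms is a one-line change. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# The same fit/score loop works for every classifier, unchanged
scores = {}
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(random_state=0)),
                    ("nb", GaussianNB())]:
    scores[name] = model.fit(X_tr, y_tr).score(X_te, y_te)
print(scores)
```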

What is cross-validation in machine learning?

Cross-validation estimates model performance more reliably than a single train/test split. K-fold CV splits data into K subsets, trains on K-1, validates on the remaining fold, and rotates K times. The average score is a stable estimate. Stratified K-fold preserves class proportions, which is important for classification.
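A sketch of stratified 5-fold CV; the breast-cancer dataset and scaled logistic-regression pipeline are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Each of the 5 folds preserves the dataset's class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=cv)
print(scores.mean(), scores.std())
```

Note the pipeline is passed to cross_val_score whole, so the scaler is re-fit inside each training fold and never sees the validation fold.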

What is the difference between precision and accuracy?

Accuracy = (TP + TN) / total. It's misleading for imbalanced classes. Precision = TP / (TP + FP). It tells you how trustworthy your positive predictions are. Use accuracy only when class distribution is balanced; use precision, recall, and F1 otherwise.
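A tiny worked example makes the gap concrete: on a 95/5 class split, a degenerate model that always predicts the negative class scores high accuracy with zero precision and recall.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 95 negatives, 5 positives; the model predicts 0 for everything
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))                    # 0.95 — looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 — no true positives
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 — misses every positive
```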

How do you handle imbalanced classes in Scikit-Learn?

Use class_weight='balanced' to weight minority class samples. Use SMOTE from imblearn to generate synthetic samples (inside a pipeline only). Tune the decision threshold using predict_proba() and precision_recall_curve. Report PR-AUC instead of ROC-AUC for heavily imbalanced data.
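The class_weight effect can be demonstrated on synthetic imbalanced data; the 95/5 class ratio and random seed below are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# 'balanced' typically trades some precision for much better minority-class recall
print("plain recall:   ", recall_score(y_te, plain.predict(X_te)))
print("balanced recall:", recall_score(y_te, balanced.predict(X_te)))
```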

What is SHAP?

SHAP (SHapley Additive exPlanations) is a model interpretability method based on game theory. It assigns each feature a contribution value for each prediction. SHAP values are consistent and locally faithful. TreeExplainer is fast for tree-based models. SHAP is increasingly required for regulatory compliance in healthcare and finance.

When should I use XGBoost over Random Forest?

Use XGBoost when you need maximum accuracy on tabular data and are willing to tune hyperparameters. XGBoost is regularised (L1/L2), handles missing values, and supports GPU acceleration. Use Random Forest when you want a strong model with minimal tuning — it's more robust out of the box and less prone to overfitting.

What is the difference between OneHotEncoder and pd.get_dummies?

pd.get_dummies encodes whatever categories happen to appear in the DataFrame it's given and has no memory of the training categories, so it can produce different column counts for train and test data. OneHotEncoder is fit on training data and handles unseen categories gracefully with handle_unknown='ignore'. Always use OneHotEncoder in pipelines for production.

How do you choose between classification algorithms?

Start with a simple baseline (Logistic Regression or Naive Bayes). Then try Random Forest or Gradient Boosting. If you need more accuracy, try XGBoost/LightGBM/CatBoost. Choose based on: data size, feature types, interpretability requirements, training time constraints, and whether you have categorical features. There's no universally best algorithm — it depends on your data and constraints.

How do you know if your model is overfitting?

Compare training and validation scores. If training score is much higher than validation score, you're overfitting. Learning curves show this visually — the gap between train and validation curves. Solutions: increase regularisation, reduce model complexity, add more training data, or use feature selection to reduce dimensionality.
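The train/validation gap is easy to see with a decision tree on noisy synthetic data (the dataset parameters below are arbitrary illustration choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects label noise, so memorising the training set cannot generalise
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# An unconstrained tree memorises the training data: train score 1.0, big gap
deep = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
print(deep.score(X_tr, y_tr), deep.score(X_te, y_te))

# Limiting depth (regularising) shrinks the train/validation gap
shallow = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)
print(shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```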

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged