Classification with Scikit-Learn — Algorithms, Pipelines, and Evaluation
- Classification predicts discrete class labels. Scikit-Learn's unified fit/predict API lets you swap algorithms without changing evaluation code.
- Always use a Pipeline to chain preprocessing and classification. Pipelines prevent data leakage — transformers are fit on training data only.
- Accuracy is misleading for imbalanced classes. Report precision, recall, F1, PR-AUC, and ROC-AUC. Use class_weight='balanced' for imbalanced problems.
- Classification predicts discrete class labels from labelled training data — binary, multi-class, or multi-label
- Scikit-Learn's unified fit/predict/predict_proba API lets you swap algorithms with one line change
- Always wrap preprocessing + classifier in a Pipeline to prevent data leakage from test statistics contaminating training
- Accuracy is misleading for imbalanced classes — use F1, PR-AUC, and ROC-AUC instead
- Tune the decision threshold via predict_proba() + precision_recall_curve() — the default 0.5 is almost never optimal in production
- XGBoost/LightGBM/CatBoost are what practitioners actually deploy for tabular data — they all implement the Scikit-Learn API
Production Debug Guide
Common failure modes and immediate diagnostic steps for production classification systems:
- Model loaded but predict() crashes on new data — compare the columns the pipeline was fitted on with the incoming data:
  print(pipeline.named_steps['preprocessor'].feature_names_in_)
  print(X_new.columns.tolist())
- All predictions are the same class — check the training class balance and the predicted probabilities:
  print(y_train.value_counts(normalize=True))
  print(pipeline.predict_proba(X_test)[:10])
- Cross-validation F1 is 0.95 but test F1 is 0.60 — check for leakage by cross-validating the full pipeline rather than pre-preprocessed data:
  cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')            # correct: preprocessing refit per fold
  cross_val_score(clf, X_train_preprocessed, y_train, cv=5, scoring='f1')    # leaky: preprocessing saw all folds
- predict_proba returns extreme probabilities (all 0 or 1) — check calibration:
  from sklearn.calibration import calibration_curve
  frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
- Production incident — run pipeline.predict() on a known-good sample to isolate whether it's a data issue or model corruption.
Classification predicts discrete class labels from labelled training data. Scikit-Learn provides a consistent, composable API for dozens of classification algorithms — from logistic regression to random forests — so you can swap algorithms, build preprocessing pipelines, evaluate performance rigorously, and tune hyperparameters without rewriting your code.
The algorithms are the easy part. The hard part is preventing data leakage, handling imbalanced classes, choosing the right evaluation metric, and building a pipeline you can serialize and deploy without surprises. This guide covers all of it — the algorithms, the gotchas, and the production patterns that separate a Jupyter notebook prototype from a reliable production system.
What is Classification in Machine Learning?
Classification is a supervised learning task where the goal is to predict a discrete class label for each input. Supervised means you train on labelled examples — (features, label) pairs — where the correct label is known. The model learns the relationship between features and labels, then generalises to new inputs.
Binary classification has two classes (spam/not spam, fraud/not fraud, disease/healthy). Multi-class classification has three or more exclusive classes (cat, dog, bird). Multi-label classification assigns multiple labels per example (a news article can be both 'finance' and 'politics').
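To make the multi-label case concrete, here is a minimal sketch using synthetic data and MultiOutputClassifier (one illustrative way to do it — a plain LogisticRegression handles binary and multi-class directly, but multi-label targets need a wrapper like this):

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Multi-label: y is an (n_samples, n_labels) 0/1 indicator matrix,
# not a 1-D vector of exclusive class ids
X, Y = make_multilabel_classification(n_samples=100, n_classes=3, random_state=0)
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X[:2])
print(pred.shape)  # (2, 3) — one 0/1 decision per label, per sample
```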
The output of a classifier is either a predicted class label (via predict()) or a probability distribution over all classes (via predict_proba()). Classification is distinct from regression, where the output is a continuous number.
The first question to ask before building any classifier: What does it cost to be wrong? If you misclassify spam, the user sees one extra email. If you misclassify a healthy patient as having cancer, they undergo unnecessary treatment. The cost of false positives vs false negatives determines your evaluation metric, your decision threshold, and your entire model selection strategy. Don't start with accuracy — start with the business impact of errors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['setosa', 'versicolor', 'virginica']))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
- Classification output: discrete label (spam/not spam) or probability distribution over classes
- Regression output: continuous number (price, temperature)
- The cost of errors drives everything — start with the business question, not the algorithm
- predict() gives labels, predict_proba() gives probabilities — always prefer probabilities in production for threshold control

Scikit-Learn's Unified Estimator API — fit, predict, score
Scikit-Learn's biggest strength is its consistent API. Every classifier implements the same interface: fit(X, y) trains the model, predict(X) returns predicted labels, predict_proba(X) returns probability estimates, and score(X, y) returns mean accuracy.
This uniformity means you can swap algorithms with a single line change. The exact same preprocessing, splitting, and evaluation code works with LogisticRegression, RandomForestClassifier, SVC, or GradientBoostingClassifier.
The score() trap: score() returns accuracy by default for classifiers. For imbalanced datasets, accuracy is misleading — a model predicting the majority class every time scores 95% accuracy while being completely useless. Always use explicit metrics (F1, AUC) via sklearn.metrics rather than relying on score().
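The trap is easy to demonstrate. A sketch using DummyClassifier on a synthetic 95/5 imbalanced dataset (illustrative names and data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# 95% of samples are class 0 — a majority-class predictor looks great on accuracy
X = np.zeros((1000, 1))
y = np.array([0] * 950 + [1] * 50)

clf = DummyClassifier(strategy='most_frequent').fit(X, y)
print(clf.score(X, y))                               # 0.95 — "95% accurate"
print(f1_score(y, clf.predict(X), zero_division=0))  # 0.0 — useless on the minority class
```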
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100),
    'SVM': SVC(probability=True),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print(f'{name}: {acc:.4f}')

probs = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
print(probs[0])
Decision Tree: 0.9333
Random Forest: 1.0000
Gradient Boosting: 1.0000
SVM: 0.9667
[0.02 0.07 0.91]
- Never rely on score() for production evaluation — it returns accuracy, which hides class imbalance problems.
- Prefer predict_proba() with a tuned threshold over predict()'s hardcoded 0.5.
- Use predict_proba() when you need the full probability distribution over all classes.
- Check the calibration of predict_proba() before deployment.

Feature Scaling — When to Scale and When Not to Bother
Feature scaling normalises the range of input features. Some algorithms are sensitive to feature scale; others are completely invariant.
Algorithms that REQUIRE scaling: SVM (distance-based), KNN (distance-based), Logistic Regression (gradient descent convergence), Neural Networks (gradient-based optimization). If you forget to scale for these, the feature with the largest range dominates the model.
Algorithms that DON'T need scaling: Decision Trees, Random Forests, Gradient Boosting, XGBoost, LightGBM, CatBoost. Tree-based models split on individual feature thresholds, so absolute scale doesn't matter.
Three scalers to know: StandardScaler (zero mean, unit variance — sensitive to outliers), MinMaxScaler (scales to [0,1] — for neural networks), RobustScaler (median and IQR — robust to outliers).
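A quick sketch of how the three scalers react to the same column when it contains one extreme outlier (synthetic data, purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Four inliers plus one extreme outlier (1000)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x).ravel()
    print(type(scaler).__name__, np.round(scaled, 2))

# StandardScaler's mean/std are dragged toward the outlier, squeezing the inliers
# together; MinMaxScaler maps the inliers to near-zero; RobustScaler (median/IQR)
# keeps the inliers spread out and pushes only the outlier to an extreme value.
```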
The production rule: If your pipeline includes SVM, KNN, or Logistic Regression, add StandardScaler. If it's tree-based only, skip scaling. If unsure, add it — it won't hurt tree models, just wastes a few milliseconds.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# SVM NEEDS scaling
svm_scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf', probability=True)),
])

# When data has outliers, use RobustScaler
svm_robust = Pipeline([
    ('scaler', RobustScaler()),
    ('clf', SVC(kernel='rbf', probability=True)),
])

# For neural networks or bounded-input models
nn_scaled = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500)),
])
Preprocessing Pipelines — The Right Way to Handle Feature Engineering
A pipeline chains preprocessing steps and a classifier into a single object. This is not optional convenience — it is the correct way to prevent data leakage.
Data leakage happens when information from the test set influences training. The classic mistake: fit a StandardScaler on the entire dataset before splitting. The scaler has 'seen' the test data and computed statistics from it. Your model was trained on a subtly contaminated version of reality.
With a Pipeline, fit() calls fit_transform() on preprocessors and fit() on the classifier — all on training data only. predict() calls transform() on preprocessors and predict() on the classifier. The test data is only transformed with statistics learned from training data.
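The mechanics are easy to show side by side — a minimal sketch on synthetic data (names illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# WRONG: the scaler computes mean/std over rows that later land in the test set
X_leaky = StandardScaler().fit_transform(X)

# RIGHT: split first, fit the scaler on training rows only, transform both
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)      # statistics from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)         # test set never influences the fit
```

This is exactly what Pipeline.fit()/predict() do for you automatically, which is why wrapping the scaler in the pipeline is the safer default.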
The ColumnTransformer pattern: Real datasets have mixed feature types — numeric columns (age, income), categorical columns (country, plan_type). ColumnTransformer applies different preprocessing to different column subsets, all within the same pipeline. This is the standard pattern for production ML.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Simulate a realistic dataset
df = pd.DataFrame({
    'age': [25, 45, np.nan, 30, 60],
    'income': [50000, 80000, 120000, np.nan, 90000],
    'country': ['US', 'UK', 'US', 'DE', 'FR'],
    'plan_type': ['basic', 'premium', 'basic', 'enterprise', 'premium'],
    'education': ['high_school', 'bachelors', 'masters', 'phd', 'bachelors'],
    'churned': [0, 0, 1, 0, 1],
})
X = df.drop('churned', axis=1)
y = df['churned']

numeric_features = ['age', 'income']
nominal_features = ['country', 'plan_type']
ordinal_features = ['education']

# Ordinal encoding for ordered categories
education_order = ['high_school', 'bachelors', 'masters', 'phd']

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_features),
    ('nom', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), nominal_features),
    ('ord', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('ordinal', OrdinalEncoder(categories=[education_order]))]), ordinal_features),
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X, y)

# See transformed feature names
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
print('Transformed features:', feature_names)
Naive Bayes — The Baseline You Should Always Try First
Before reaching for Random Forest or XGBoost, try Naive Bayes. It's fast, simple, surprisingly effective, and serves as a strong baseline. If your complex model can't beat Naive Bayes, something is wrong with your features.
Three variants: GaussianNB (continuous features, assumes normal distribution), MultinomialNB (discrete count features like word counts or TF-IDF — the go-to for text classification), BernoulliNB (binary features — word present/absent).
Why it works for text: Text data is high-dimensional and sparse. Naive Bayes handles this gracefully because it assumes feature independence. This assumption is obviously wrong (words aren't independent), but it works shockingly well in practice.
The production baseline pattern: Always train a Naive Bayes model first. Report its metrics. Then train your fancy model. If the fancy model only marginally beats Naive Bayes, consider whether the added complexity is worth it. Naive Bayes trains in milliseconds and predicts in microseconds — that matters for real-time systems.
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Gaussian Naive Bayes for continuous features
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print('GaussianNB:', classification_report(y_test, y_pred))

# Multinomial Naive Bayes for text (TF-IDF features)
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('clf', MultinomialNB(alpha=0.1)),
])

# Bernoulli Naive Bayes for binary features
bnb = BernoulliNB()
binary_features = (X_train > 0).astype(int)
bnb.fit(binary_features, y_train)
              precision    recall  f1-score   support

           0       0.96      0.94      0.95        50
           1       0.83      0.88      0.85        22

    accuracy                           0.92        72
Evaluating a Classifier — Beyond Accuracy
Accuracy is misleading for imbalanced classes. If 95% of data is class 0, a classifier predicting class 0 always achieves 95% accuracy while being completely useless.
Precision: Of all samples predicted positive, what fraction actually are? High precision = few false alarms. Recall: Of all actual positives, what fraction did we catch? High recall = few missed cases. F1: Harmonic mean of precision and recall.
The confusion matrix shows the full breakdown: TP, TN, FP, FN. For multi-class, it's an NxN matrix where the diagonal is correct predictions.
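A tiny worked example makes the arithmetic concrete — computing precision and recall by hand from the confusion matrix counts, then checking against sklearn (toy data, illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)      # 5 1 1 3

print(tp / (tp + fp))      # precision = 3/4 = 0.75 (few false alarms?)
print(tp / (tp + fn))      # recall    = 3/4 = 0.75 (few missed positives?)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```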
ROC-AUC measures discrimination ability across all thresholds. AUC 0.5 = random, 1.0 = perfect. PR-AUC (Precision-Recall AUC) is better for imbalanced data because it focuses on the positive class.
Which metric matters depends on the business cost: - Spam filtering: Precision matters. You don't want legitimate email in the spam folder. - Disease screening: Recall matters. You don't want to miss a sick patient. - Fraud detection: Recall matters more, but precision also matters because investigating false positives costs money.
The multi-metric approach: Always report at least three metrics: precision, recall, and F1. Add AUC if you use predicted probabilities. Never report accuracy alone for imbalanced problems.
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    average_precision_score, ConfusionMatrixDisplay, precision_recall_curve
)
import matplotlib.pyplot as plt
import numpy as np

print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title('Confusion Matrix')
plt.show()

# ROC-AUC for binary classification
probs = pipeline.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(f'ROC-AUC: {auc:.4f}')

# PR-AUC — better metric for imbalanced classes
pr_auc = average_precision_score(y_test, probs)
print(f'PR-AUC: {pr_auc:.4f}')

# Custom class weights — when you know the business cost ratio
clf_cost_sensitive = RandomForestClassifier(
    class_weight={0: 1, 1: 100}, random_state=42
)
              precision    recall  f1-score   support

           0       0.98      0.96      0.97        50
           1       0.82      0.91      0.86        22

    accuracy                           0.95        72
   macro avg       0.90      0.93      0.91        72
weighted avg       0.94      0.95      0.94        72
ROC-AUC: 0.9834
PR-AUC: 0.9412
Decision Threshold Tuning — The Most Underrated Technique
Every binary classifier has a default decision threshold of 0.5: if predict_proba() returns >= 0.5, predict class 1. This threshold is arbitrary and almost never optimal for your specific business problem.
The insight: The threshold controls the trade-off between precision and recall. Lowering it catches more positives (higher recall) but flags more negatives as positives (lower precision). Raising it gives fewer but more trustworthy positive predictions.
How to find the optimal threshold: Use precision_recall_curve to get precision and recall at every possible threshold. Then choose the threshold that optimises your business metric.
In production: Don't use predict() — use predict_proba() and apply your own threshold. Store the threshold alongside the model. When business costs change, adjust the threshold without retraining.
I tuned the threshold on a fraud detection model from 0.5 to 0.15. Recall went from 60% to 92%. Precision dropped from 85% to 45%. The fraud team preferred catching 92% of fraud — the cost of missing fraud far exceeded the cost of investigating false positives.
from sklearn.metrics import precision_recall_curve, f1_score
import numpy as np

y_probs = pipeline.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_probs)

# Find threshold that maximises F1
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
print(f'Best threshold for F1: {best_threshold:.3f}')
print(f'Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}')

# Apply custom threshold
y_pred_custom = (y_probs >= best_threshold).astype(int)
print(f'F1 with custom threshold: {f1_score(y_test, y_pred_custom):.4f}')

# Find threshold for minimum recall target
min_recall = 0.95
valid_indices = np.where(recalls[:-1] >= min_recall)[0]
if len(valid_indices) > 0:
    best_for_recall = valid_indices[np.argmax(precisions[valid_indices])]
    recall_threshold = thresholds[best_for_recall]
    print(f'Threshold for >= 95% recall: {recall_threshold:.3f}')

# Business cost minimisation
fn_cost = 1000  # missed fraud
fp_cost = 10    # false alarm
min_cost = float('inf')
best_cost_threshold = 0.5
for i, t in enumerate(thresholds):
    y_pred_t = (y_probs >= t).astype(int)
    fn = np.sum((y_test == 1) & (y_pred_t == 0))
    fp = np.sum((y_test == 0) & (y_pred_t == 1))
    cost = fn * fn_cost + fp * fp_cost
    if cost < min_cost:
        min_cost = cost
        best_cost_threshold = t
print(f'Threshold minimising business cost: {best_cost_threshold:.3f} (cost: ${min_cost:,.0f})')
Precision: 0.780, Recall: 0.890, F1: 0.831
F1 with custom threshold: 0.8310
Threshold for >= 95% recall: 0.180
Threshold minimising business cost: 0.150 (cost: $2,340)
In production, call predict_proba() and apply your own threshold — never rely on predict(). Store the threshold as a configuration parameter alongside the model. This single technique has saved me more production incidents than any algorithm choice.

Handling Imbalanced Classes — SMOTE, Thresholds, and Class Weights
Most real-world classification problems are imbalanced: fraud is rare, disease is rare, churn is a minority class. Standard classifiers optimise for overall accuracy, which means they learn to predict the majority class.
Three strategies, from simplest to most complex:
1. class_weight='balanced': The simplest approach. Increases the loss contribution of minority class samples. Works with LogisticRegression, RandomForest, SVM. No extra dependencies. Try this first.
2. SMOTE (Synthetic Minority Oversampling): Generates synthetic minority class samples by interpolating between existing minority samples. pip install imbalanced-learn. Use SMOTEENN or SMOTETomek to clean noisy synthetic samples.
3. Threshold tuning: Use predict_proba() and tune the decision threshold (see previous section). Often the most effective approach because you keep the model unchanged — you just change the decision boundary.
The gotcha with SMOTE: Never apply SMOTE before train/test splitting. SMOTE generates synthetic samples based on nearest neighbours — if applied before splitting, synthetic test samples leak information about training samples. Always SMOTE inside a pipeline or after splitting.
imblearn.pipeline.Pipeline: scikit-learn's Pipeline doesn't support samplers (they lack transform()). Use imblearn.pipeline.Pipeline instead, which supports both transformers and samplers.
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

# Strategy 1: class_weight (simplest)
clf_weighted = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)

# Strategy 2: SMOTE inside imblearn pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([('imp', SimpleImputer(strategy='median')),
                      ('sc', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imp', SimpleImputer(strategy='most_frequent')),
                      ('ohe', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
])
smote_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Strategy 3: SMOTE + Edited Nearest Neighbours (cleans noisy samples)
smoteenn_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('smoteenn', SMOTEENN(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

# Strategy 4: Undersampling + SMOTE
combined_pipeline = ImbPipeline([
    ('preprocessor', preprocessor),
    ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, pipe in [('class_weight', clf_weighted), ('SMOTE', smote_pipeline),
                   ('SMOTEENN', smoteenn_pipeline), ('Under+SMOTE', combined_pipeline)]:
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
    print(f'{name:15s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')
SMOTE F1: 0.6512 (+/- 0.0198)
SMOTEENN F1: 0.6634 (+/- 0.0187)
Under+SMOTE F1: 0.6589 (+/- 0.0201)
Feature Importance — Understanding What Drives Predictions
Training a model without understanding which features drive predictions is flying blind. Feature importance tells you which features matter most and helps you debug, simplify, and improve your model.
Built-in feature_importances_: Available on tree-based models (RandomForest, GradientBoosting, XGBoost). Measures how much each feature decreases impurity across all trees. Fast but biased toward high-cardinality features and unreliable when features are correlated.
Permutation importance: Model-agnostic. Shuffles each feature and measures how much performance drops. More reliable than built-in importance, especially with correlated features. Use sklearn.inspection.permutation_importance.
Feature selection: Remove unimportant features to reduce overfitting, speed up training, and simplify deployment. SelectFromModel keeps features above an importance threshold. SelectKBest keeps the top K features by statistical test.
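A minimal SelectKBest sketch on synthetic data, since the larger example below focuses on SelectFromModel (all names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 10 features, only 3 informative — keep the top 3 by ANOVA F-score
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=42)

selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))  # indices of the 3 highest-scoring features
X_top = selector.transform(X)
print(X_top.shape)  # (200, 3)
```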
The production insight: Feature importance helps you identify data pipeline bugs. If a feature you know should be important shows zero importance, check whether it was accidentally dropped, imputed incorrectly, or encoded poorly. I once found that a 'days since last purchase' feature had zero importance because the imputer was filling NaN with 0 instead of a sentinel value — and 0 happened to be the most common legitimate value.
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# Built-in feature importance
if hasattr(X_train, 'columns'):
    feature_names = X_train.columns
else:
    feature_names = [f'feature_{i}' for i in range(X_train.shape[1])]

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': clf.feature_importances_
}).sort_values('importance', ascending=False)
print(importance_df.head(10))

# Permutation importance — more reliable
perm_imp = permutation_importance(clf, X_test, y_test, n_repeats=10,
                                  random_state=42, scoring='f1_weighted')
perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm_imp.importances_mean,
    'importance_std': perm_imp.importances_std,
}).sort_values('importance_mean', ascending=False)
print(perm_df.head(10))

# Feature selection with SelectFromModel
selector = SelectFromModel(clf, threshold='median')
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
print(f'Original features: {X_train.shape[1]}, Selected: {X_selected.shape[1]}')

# Pipeline with feature selection
pipeline_with_selection = Pipeline([
    ('selector', SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42),
                                 threshold='median')),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
])
     feature  importance
2  feature_2      0.4523
0  feature_0      0.2891
1  feature_1      0.1834
3  feature_3      0.0752
feature importance_mean importance_std
2 feature_2 0.3210 0.0145
0 feature_0 0.1980 0.0112
1 feature_1 0.0823 0.0098
3 feature_3 0.0012 0.0034
Original features: 200, Selected: 87
Model Interpretability — SHAP and LIME
In regulated industries (healthcare, finance, insurance), you often need to explain why a model made a specific prediction. 'The model says so' is not an acceptable answer for a loan denial or a medical diagnosis.
SHAP (SHapley Additive exPlanations): The gold standard. Based on game theory Shapley values. Provides consistent, locally faithful explanations. TreeExplainer is fast for tree-based models. KernelExplainer works for any model but is slow.
What SHAP gives you: Global feature importance (which features matter overall), local explanations (why this specific prediction was made), interaction effects (how features combine), and dependence plots (how a feature's value affects predictions).
LIME (Local Interpretable Model-agnostic Explanations): Explains individual predictions by fitting a simple interpretable model (linear regression) around the prediction point. Faster than SHAP KernelExplainer but less theoretically grounded.
When to use which: SHAP for comprehensive analysis and regulatory compliance. LIME for quick single-prediction explanations. Always use TreeExplainer for tree-based models — it's exact and fast.
The production reality: Budget time for explainability integration. It's not an afterthought — it's a regulatory requirement in healthcare (EU AI Act, FDA), finance (GDPR right to explanation), and insurance (fair lending laws).
import shap
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# TreeExplainer — fast for tree-based models
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

# Global feature importance (mean absolute SHAP values)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Local explanation for a single prediction
sample_idx = 0
shap.force_plot(explainer.expected_value, shap_values[sample_idx],
                X_test.iloc[sample_idx], feature_names=feature_names)

# For multi-class, shap_values is a list of arrays
if isinstance(shap_values, list):
    print(f'SHAP values for class 0: shape {shap_values[0].shape}')
    print(f'SHAP values for class 1: shape {shap_values[1].shape}')

# Dependence plot: how does one feature affect predictions?
shap.dependence_plot('age', shap_values, X_test, feature_names=feature_names)
Cross-Validation — Reliable Performance Estimates
A single train/test split gives a noisy estimate. The specific random split affects which samples are in each set, and performance can vary significantly between splits. Cross-validation averages performance across multiple splits.
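The noise is easy to measure: train the same model on the same data and vary only the split seed (synthetic data, illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Same data, same model — only the split seed changes
scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))
print(np.round(scores, 3))  # the spread across seeds is the noise CV averages out
```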
K-fold splits data into K folds, trains on K-1, validates on the remaining fold, rotates K times. StratifiedKFold preserves class proportions — always use it for classification.
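Why stratification matters is easiest to see on a sorted, imbalanced label vector — a sketch with synthetic data (illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)  # 10% minority class, labels sorted

# Unshuffled KFold on sorted labels produces folds with no minority samples at all;
# StratifiedKFold keeps ~10% positives in every test fold
for name, cv in [('KFold', KFold(n_splits=5)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5))]:
    ratios = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
    print(f'{name}: {np.round(ratios, 2)}')
```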
cross_val_score returns test scores. cross_validate returns train and test scores plus timing — useful for diagnosing overfitting (train >> test = overfitting).
How many folds? 5 is the default. Use 10 for small datasets (more training data per fold). Use 3 for very large datasets (faster). The standard deviation across folds shows how stable the estimate is.
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_weighted')
print(f'F1: {scores.mean():.4f} (+/- {scores.std():.4f})')

results = cross_validate(clf, X, y, cv=cv, scoring=['accuracy', 'f1_weighted'],
                         return_train_score=True)
print('Train F1:', results['train_f1_weighted'].mean())
print('Test F1:', results['test_f1_weighted'].mean())

# CV with full pipeline — no leakage
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
print(f'Pipeline AUC: {cv_scores.mean():.4f}')
Train F1: 0.9998
Test F1: 0.9734
Pipeline AUC: 0.9856
Hyperparameter Tuning — GridSearchCV, RandomizedSearchCV, and Optuna
Hyperparameters are settings you choose before training (n_estimators, max_depth, C). The right values depend on your data.
GridSearchCV: Exhaustive search over all combinations. Best for small grids. Slow for large spaces — 4x4x4 = 64 combinations x 5-fold CV = 320 fits.
RandomizedSearchCV: Samples N random combinations from distributions. Much faster for large spaces. Specify scipy.stats distributions for continuous parameters.
Optuna: Bayesian optimization. Learns from previous trials to suggest promising hyperparameters. Often finds better results faster than random search.
The two-stage approach I use in production: Stage 1: RandomizedSearchCV with wide ranges (50-100 iterations). Stage 2: GridSearchCV with fine-grained grid around the best values found. This catches broad patterns quickly, then refines.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stage 1: Random search
param_distributions = {
    'classifier__n_estimators': randint(50, 500),
    'classifier__max_depth': [None, 5, 10, 20, 30],
    'classifier__max_features': ['sqrt', 'log2', None],
    'classifier__min_samples_split': randint(2, 20),
}
random_search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50, cv=cv,
    scoring='f1_weighted', n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)
print('Best random params:', random_search.best_params_)
print('Best CV F1:', random_search.best_score_)

# Stage 2: Fine-grained grid around best values
best = random_search.best_params_
param_grid = {
    'classifier__n_estimators': [best['classifier__n_estimators'] - 50,
                                 best['classifier__n_estimators'],
                                 best['classifier__n_estimators'] + 50],
    'classifier__max_depth': [best['classifier__max_depth']],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='f1_weighted', n_jobs=-1)
grid_search.fit(X_train, y_train)
print('Best grid params:', grid_search.best_params_)

# Optuna
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 30),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2']),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
    }
    clf = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(clf, X_train, y_train, cv=5, scoring='f1_weighted').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print('Best Optuna params:', study.best_params)
Best CV F1: 0.9734
Best grid params: {'classifier__max_depth': 20, 'classifier__n_estimators': 350}
Best Optuna params: {'n_estimators': 287, 'max_depth': 18, 'max_features': 'sqrt', 'min_samples_split': 5}
XGBoost, LightGBM, and CatBoost — The Algorithms Practitioners Actually Use
Scikit-Learn's built-in classifiers are excellent for learning and prototyping. For production tabular classification, most practitioners use gradient boosting libraries: XGBoost, LightGBM, and CatBoost.
All three implement the Scikit-Learn fit/predict API and work inside Pipeline and GridSearchCV. They're not replacements — they're extensions.
XGBoost: Regularised gradient boosting. Handles missing values natively. Supports GPU. Best for maximum control over hyperparameters.
LightGBM: Microsoft's gradient boosting. Faster than XGBoost for large datasets — leaf-wise tree growth, histogram-based splitting. Native categorical handling. Best for large datasets (>100K rows).
CatBoost: Yandex's gradient boosting. Best out-of-the-box performance with minimal tuning. Superior categorical feature handling. Best for datasets with many categorical features.
When to use which: Maximum control + GPU → XGBoost. Large data + speed → LightGBM. Many categoricals + minimal tuning → CatBoost. Just need a solid baseline fast → RandomForest first, then try one of these.
```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# (use_label_encoder is deprecated and no longer needed in recent XGBoost)
xgb_clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                        subsample=0.8, colsample_bytree=0.8,
                        reg_alpha=0.1, reg_lambda=1.0,
                        random_state=42, eval_metric='logloss')
lgbm_clf = LGBMClassifier(n_estimators=200, max_depth=-1, learning_rate=0.1,
                          num_leaves=31, subsample=0.8,
                          reg_alpha=0.1, reg_lambda=1.0,
                          random_state=42, verbose=-1)
cat_clf = CatBoostClassifier(iterations=200, depth=6, learning_rate=0.1,
                             l2_leaf_reg=3.0, random_state=42, verbose=0)

for name, clf in [('XGBoost', xgb_clf), ('LightGBM', lgbm_clf), ('CatBoost', cat_clf)]:
    scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring='f1_weighted')
    print(f'{name:10s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')

# CatBoost with native categorical features — no OneHotEncoder needed
cat_native = CatBoostClassifier(iterations=100, cat_features=['country', 'plan'], verbose=0)
cat_native.fit(X_train, y_train)
```
LightGBM F1: 0.9734 (+/- 0.0156)
CatBoost F1: 0.9756 (+/- 0.0143)
Ensemble Methods — Combining Models for Better Performance
Ensemble methods combine multiple models to produce a stronger predictor. Different models make different errors, and combining them averages out individual weaknesses.
VotingClassifier: Combines predictions from multiple models. voting='hard' uses majority vote on class labels. voting='soft' averages predicted probabilities (usually better because it uses confidence information).
StackingClassifier: Trains a meta-learner (usually LogisticRegression) on the predictions of base learners. More powerful than voting because the meta-learner learns which base model to trust in which situations.
BaggingClassifier: Trains multiple instances of the same model on bootstrap samples. RandomForest is a special case of bagging with decision trees.
The production reality: Ensembles add complexity. I've deployed stacking classifiers with a marginal 0.5% accuracy gain over a single XGBoost. The added complexity of serialising and maintaining three models was not worth it. Evaluate whether the complexity is justified for your use case.
```python
from sklearn.ensemble import (VotingClassifier, StackingClassifier, BaggingClassifier,
                              RandomForestClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Note: scikit-learn estimators require keyword arguments (n_estimators=100, not 100)
voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',
)
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
        ('svc', SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42,
)

for name, clf in [('Voting', voting_clf), ('Stacking', stacking_clf), ('Bagging', bagging_clf)]:
    scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring='f1_weighted')
    print(f'{name:10s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')
```
Stacking F1: 0.9789 (+/- 0.0134)
Bagging F1: 0.9623 (+/- 0.0198)
Learning Curves and Validation Curves — Diagnosing Overfitting Visually
Before hyperparameter tuning, diagnose whether your model is overfitting or underfitting. Learning curves and validation curves give you visual answers.
Learning curve: Training and validation scores vs training set size. Both high and close → good fit. Train high, validation low → overfitting. Both low → underfitting. Gap closing with more data → more data will help.
Validation curve: Scores vs a hyperparameter (e.g., max_depth). Shows the sweet spot where validation peaks before overfitting begins.
When to use: Before hyperparameter tuning. If the learning curve shows underfitting, no tuning will help — you need better features or a different model. If it shows overfitting, regularisation or more data will help more than hyperparameter search.
```python
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np

clf = RandomForestClassifier(n_estimators=100, random_state=42)
train_sizes, train_scores, val_scores = learning_curve(
    clf, X_train, y_train, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='f1_weighted', n_jobs=-1)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(train_sizes, train_scores.mean(axis=1), label='Train')
axes[0].fill_between(train_sizes,
                     train_scores.mean(axis=1) - train_scores.std(axis=1),
                     train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
axes[0].plot(train_sizes, val_scores.mean(axis=1), label='Validation')
axes[0].fill_between(train_sizes,
                     val_scores.mean(axis=1) - val_scores.std(axis=1),
                     val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
axes[0].set_xlabel('Training Set Size')
axes[0].set_ylabel('F1 Score')
axes[0].set_title('Learning Curve')
axes[0].legend()

param_range = [2, 5, 10, 15, 20, 30]
train_vc, val_vc = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42), X_train, y_train,
    param_name='max_depth', param_range=param_range,
    cv=5, scoring='f1_weighted', n_jobs=-1)
axes[1].plot(param_range, train_vc.mean(axis=1), label='Train')
axes[1].plot(param_range, val_vc.mean(axis=1), label='Validation')
axes[1].set_xlabel('max_depth')
axes[1].set_ylabel('F1 Score')
axes[1].set_title('Validation Curve')
axes[1].legend()
plt.tight_layout()
plt.show()
```
Text Classification — TF-IDF and Naive Bayes Pipeline
Text classification (spam detection, sentiment analysis, topic classification) is one of the most common production use cases. Scikit-Learn provides strong text classifiers without deep learning.
The pipeline: Text → TfidfVectorizer → Classifier. The vectorizer converts raw text into a numeric feature matrix.
TfidfVectorizer weights words by importance — common words (the, is) get low weight, rare informative words get high weight. Key parameters: max_features (vocabulary size, 5000-20000 typical), ngram_range ((1,2) for unigrams + bigrams — bigrams capture 'not good' vs 'good'), stop_words, min_df/max_df.
Naive Bayes + TF-IDF is the classic baseline for text classification. Fast, effective, hard to beat. Only move to SVM or gradient boosting if Naive Bayes isn't sufficient.
CountVectorizer vs TfidfVectorizer: CountVectorizer gives raw word counts. TfidfVectorizer almost always performs better because it downweights common words. Use TfidfVectorizer by default.
```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

texts = [
    'Win a free iPhone now click here',
    'Meeting rescheduled to 3pm tomorrow',
    'Congratulations you won the lottery',
    'Please review the quarterly report',
    'Free pills online pharmacy discount',
    'Team standup notes for sprint 42',
    'Limited time offer act now',
    'Can you send the updated budget?',
    'You have been selected for a prize',
    'Project deadline extended to Friday',
] * 50
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 50

# Naive Bayes + TF-IDF baseline
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')),
    ('clf', MultinomialNB(alpha=0.1)),
])

# SVM + TF-IDF — often better for text
svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2), stop_words='english')),
    ('clf', LinearSVC(C=1.0, class_weight='balanced')),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, pipe in [('Naive Bayes', nb_pipeline), ('LinearSVC', svm_pipeline)]:
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring='f1')
    print(f'{name:12s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')

# Inspect most informative features
nb_pipeline.fit(texts, labels)
feature_names = nb_pipeline['tfidf'].get_feature_names_out()
for i, class_label in enumerate(['not spam', 'spam']):
    top_indices = nb_pipeline['clf'].feature_log_prob_[i].argsort()[-10:][::-1]
    print(f'Top words for {class_label}: {[feature_names[j] for j in top_indices]}')
```
LinearSVC F1: 0.9912 (+/- 0.0089)
Top words for not spam: ['meeting', 'report', 'budget', 'deadline', 'team', 'sprint', 'standup', 'notes', 'send', 'review']
Top words for spam: ['free', 'click', 'won', 'prize', 'lottery', 'congratulations', 'offer', 'act', 'limited', 'pills']
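Because the vectoriser lives inside the pipeline, raw strings go straight into `predict()` at inference time. A minimal sketch with its own tiny toy corpus (the example messages are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus: 1 = spam, 0 = ham
texts = ['win a free prize now', 'meeting at 3pm',
         'free lottery offer', 'quarterly report attached'] * 25
labels = [1, 0, 1, 0] * 25

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', MultinomialNB(alpha=0.1)),
])
pipe.fit(texts, labels)

# Raw strings go straight in -- vectorisation happens inside the pipeline
print(pipe.predict(['claim your free prize', 'agenda for the meeting']))
# On this toy corpus the first scores as spam (1), the second as ham (0)
```

Unseen words ('claim', 'agenda') are simply ignored by the fitted vocabulary; the known spam words ('free', 'prize') and ham words ('meeting') drive the prediction.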
Model Calibration — When Predicted Probabilities Matter
Some classifiers produce poorly calibrated probabilities — a predicted probability of 0.7 doesn't mean a 70% chance of being correct. Random Forest tends to produce overconfident probabilities (clustered near 0 and 1). Naive Bayes tends to produce underconfident probabilities.
When calibration matters: When you use predicted probabilities for decision-making — e.g., 'only flag transactions with fraud probability > 0.8' — you need probabilities that reflect actual frequencies.
CalibratedClassifierCV: Wraps a classifier and calibrates probabilities using Platt scaling (sigmoid) or isotonic regression. Platt scaling works for small datasets; isotonic regression needs more data but is more flexible.
calibration_curve: Plots predicted probabilities against actual frequencies. A perfectly calibrated model lies on the diagonal.
The gotcha: Don't calibrate on training data — use cross-validation or a held-out calibration set. CalibratedClassifierCV handles this internally with its cv parameter.
```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
probs_uncalibrated = rf.predict_proba(X_test)[:, 1]

calibrated_clf = CalibratedClassifierCV(rf, cv=5, method='sigmoid')
calibrated_clf.fit(X_train, y_train)
probs_calibrated = calibrated_clf.predict_proba(X_test)[:, 1]

frac_pos, mean_pred = calibration_curve(y_test, probs_uncalibrated, n_bins=10)
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, probs_calibrated, n_bins=10)

plt.figure(figsize=(8, 6))
plt.plot(mean_pred, frac_pos, 's-', label='Uncalibrated')
plt.plot(mean_pred_cal, frac_pos_cal, 'o-', label='Calibrated')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.show()
```
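Calibration can also be quantified in a single number with `brier_score_loss` (mean squared error between predicted probability and outcome; lower is better). A hedged sketch on synthetic data, since it needs a train/test split of its own:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    cv=5, method='sigmoid').fit(X_tr, y_tr)

# Brier score: squared error of the probabilities themselves, not the labels
scores = {}
for name, model in [('raw', rf), ('calibrated', cal)]:
    probs = model.predict_proba(X_te)[:, 1]
    scores[name] = brier_score_loss(y_te, probs)
    print(f'{name:10s} Brier: {scores[name]:.4f}')
```

For overconfident models, calibration typically lowers the Brier score; comparing the two numbers tells you whether calibration actually helped on your data.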
End-to-End Production Example — Customer Churn Prediction
This ties everything together: loading realistic data, building a preprocessing pipeline, handling missing values and categorical features, comparing models with cross-validation, hyperparameter tuning, threshold optimisation, and feature importance analysis.
This is the workflow I use in production. Every step is deliberate — no shortcuts, no data leakage, no metric gaming.
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     StratifiedKFold, RandomizedSearchCV)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score,
                             average_precision_score, precision_recall_curve)
from sklearn.inspection import permutation_importance
from scipy.stats import randint

# 1. Simulate realistic churn data
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    'age': np.random.normal(40, 12, n).clip(18, 80),
    'tenure_months': np.random.exponential(24, n).clip(1, 120),
    'monthly_charges': np.random.normal(65, 20, n).clip(20, 150),
    'total_charges': np.random.normal(4000, 3000, n).clip(0, 20000),
    'num_support_tickets': np.random.poisson(1.5, n),
    'contract_type': np.random.choice(['month-to-month', 'one-year', 'two-year'], n, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['credit_card', 'bank_transfer', 'electronic_check', 'mailed_check'], n),
    'internet_service': np.random.choice(['fiber', 'dsl', 'none'], n, p=[0.4, 0.4, 0.2]),
    'has_dependents': np.random.choice([0, 1], n, p=[0.7, 0.3]),
})
for col in ['age', 'monthly_charges', 'num_support_tickets']:
    df.loc[np.random.choice(n, int(n * 0.05), replace=False), col] = np.nan
churn_prob = 0.1 + 0.3 * (df['contract_type'] == 'month-to-month').astype(float)
df['churned'] = (np.random.random(n) < churn_prob).astype(int)
print('Class distribution:', df['churned'].value_counts().to_dict())

# 2. Split
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)

# 3. Preprocessing
numeric_features = ['age', 'tenure_months', 'monthly_charges', 'total_charges',
                    'num_support_tickets', 'has_dependents']
categorical_features = ['contract_type', 'payment_method', 'internet_service']
preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
])

# 4. Compare models
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    'LogisticRegression': Pipeline([('pre', preprocessor),
                                    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))]),
    'RandomForest': Pipeline([('pre', preprocessor),
                              ('clf', RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42))]),
    'GradientBoosting': Pipeline([('pre', preprocessor),
                                  ('clf', GradientBoostingClassifier(n_estimators=200, random_state=42))]),
}
for name, model in models.items():
    f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')
    auc = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')
    print(f'{name:20s} F1: {f1.mean():.4f} (+/- {f1.std():.4f}) AUC: {auc.mean():.4f}')

# 5. Tune best model
param_dist = {
    'clf__n_estimators': randint(100, 500),
    'clf__max_depth': [None, 5, 10, 15, 20],
    'clf__max_features': ['sqrt', 'log2'],
    'clf__min_samples_split': randint(2, 20),
}
search = RandomizedSearchCV(models['RandomForest'], param_dist, n_iter=30, cv=cv,
                            scoring='f1', n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(f'Best params: {search.best_params_}')

# 6. Evaluate on test set
y_pred = search.predict(X_test)
y_probs = search.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f'ROC-AUC: {roc_auc_score(y_test, y_probs):.4f}')
print(f'PR-AUC: {average_precision_score(y_test, y_probs):.4f}')

# 7. Tune threshold
precisions, recalls, thresholds = precision_recall_curve(y_test, y_probs)
f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
# thresholds has one fewer entry than precisions/recalls, so drop the last F1 value
best_threshold = thresholds[np.argmax(f1_scores[:-1])]
y_pred_tuned = (y_probs >= best_threshold).astype(int)
print(f'Optimal threshold: {best_threshold:.3f}')
print(classification_report(y_test, y_pred_tuned))

# 8. Feature importance
perm_imp = permutation_importance(search.best_estimator_, X_test, y_test,
                                  n_repeats=10, random_state=42, scoring='f1')
feature_names = search.best_estimator_.named_steps['pre'].get_feature_names_out()
imp_df = pd.DataFrame({'feature': feature_names,
                       'importance': perm_imp.importances_mean}).sort_values('importance', ascending=False)
print(imp_df.head(10).to_string(index=False))
```
LogisticRegression F1: 0.6234 (+/- 0.0234) AUC: 0.8123
RandomForest F1: 0.6512 (+/- 0.0198) AUC: 0.8345
GradientBoosting F1: 0.6489 (+/- 0.0212) AUC: 0.8298
Best params: {'clf__max_depth': 10, 'clf__max_features': 'sqrt', 'clf__min_samples_split': 8, 'clf__n_estimators': 342}
precision recall f1-score support
0 0.87 0.92 0.89 751
1 0.71 0.59 0.64 249
accuracy 0.84 1000
ROC-AUC: 0.8412
PR-AUC: 0.6823
Optimal threshold: 0.320
precision recall f1-score support
0 0.91 0.85 0.88 751
1 0.60 0.73 0.66 249
feature importance
num__tenure_months 0.1234
cat__contract_type_month-to-month 0.0987
num__monthly_charges 0.0654
num__num_support_tickets 0.0432
num__total_charges 0.0234
Production Deployment — Serialising and Loading Pipelines
A trained model is useless until it's deployed. The pipeline pattern makes deployment clean: one object encapsulates all preprocessing and prediction logic.
joblib.dump/load: The standard way to serialise Scikit-Learn models. Compresses with gzip for smaller files. Load the pipeline, call predict() — all preprocessing is included.
The deployment checklist: (1) Serialise the full pipeline (not just the classifier). (2) Serialise the decision threshold alongside the model. (3) Version your model artifacts. (4) Validate input schema before prediction. (5) Log predictions for monitoring.
The production gotcha I've hit multiple times: If you use OneHotEncoder and new data contains a category not seen during training, it crashes unless handle_unknown='ignore' is set. Always test your loaded pipeline on a sample of production data before going live.
```python
import joblib
import pandas as pd

# Save the full pipeline + threshold
model_artifact = {
    'pipeline': search.best_estimator_,
    'threshold': best_threshold,
    'version': '1.0.0',
    'feature_names': list(X.columns),
    'training_date': '2026-03-15',
}
joblib.dump(model_artifact, 'churn_model_v1.joblib', compress=3)

# Load and predict in production
artifact = joblib.load('churn_model_v1.joblib')
pipeline = artifact['pipeline']
threshold = artifact['threshold']

# New data (same schema as training)
new_customer = pd.DataFrame({
    'age': [32],
    'tenure_months': [6],
    'monthly_charges': [85.0],
    'total_charges': [510.0],
    'num_support_tickets': [3],
    'contract_type': ['month-to-month'],
    'payment_method': ['electronic_check'],
    'internet_service': ['fiber'],
    'has_dependents': [0],
})
proba = pipeline.predict_proba(new_customer)[:, 1][0]
prediction = int(proba >= threshold)
print(f'Churn probability: {proba:.3f}')
print(f'Prediction (threshold={threshold:.3f}): {"churn" if prediction else "stay"}')
```
Prediction (threshold=0.320): churn
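Checklist item (4), schema validation, can be a few lines of defensive code that fails fast with a clear message instead of a cryptic pipeline error. A minimal sketch; `validate_schema` is a hypothetical helper, not a Scikit-Learn API:

```python
import pandas as pd

def validate_schema(X_new: pd.DataFrame, expected_columns: list) -> None:
    """Hypothetical helper: reject frames whose columns don't match training."""
    missing = set(expected_columns) - set(X_new.columns)
    extra = set(X_new.columns) - set(expected_columns)
    if missing:
        raise ValueError(f'Missing columns: {sorted(missing)}')
    if extra:
        raise ValueError(f'Unexpected columns: {sorted(extra)}')

expected = ['age', 'tenure_months', 'monthly_charges']
good = pd.DataFrame({'age': [30], 'tenure_months': [12], 'monthly_charges': [70.0]})
validate_schema(good, expected)  # passes silently

bad = good.drop(columns=['age'])
try:
    validate_schema(bad, expected)
except ValueError as e:
    print(e)  # Missing columns: ['age']
```

In practice you would call this against `artifact['feature_names']` before every `predict_proba()`; libraries like pandera or pydantic do the same job with richer type checks.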
| Algorithm | Best For | Handles Non-linearity | Interpretable | Requires Scaling | Notes |
|---|---|---|---|---|---|
| LogisticRegression | Baseline, high-dimensional sparse features | No | Yes | Yes | Fast, regularised (C), L1 for feature selection |
| DecisionTreeClassifier | Teaching, rule extraction | Yes | Yes | No | Prone to overfitting without max_depth |
| RandomForestClassifier | General-purpose, mixed feature types | Yes | Partially | No | Robust to outliers, good default |
| GradientBoostingClassifier | High accuracy on tabular data | Yes | Partially | No | Slower to train, prone to overfitting |
| XGBClassifier | Maximum control, GPU support | Yes | Partially (SHAP) | No | Regularised, handles missing values |
| LGBMClassifier | Large datasets, speed | Yes | Partially (SHAP) | No | Fastest gradient boosting, native categoricals |
| CatBoostClassifier | Many categorical features, minimal tuning | Yes | Partially (SHAP) | No | Best out-of-the-box, native categoricals |
| SVC | High-dimensional, smaller datasets | Yes (RBF kernel) | No | Yes | Slow on large data |
| KNeighborsClassifier | Small datasets, no assumptions | Yes | Yes | Yes | Slow at prediction for large datasets |
| GaussianNB | Fast baseline, continuous features | No | Yes | No | Assumes normal distribution |
| MultinomialNB | Text classification | No | Yes | No | Best baseline for TF-IDF features |
| LinearSVC | Text classification, high-dimensional | No | Yes | Yes | Faster than SVC for text |
🎯 Key Takeaways
- Classification predicts discrete class labels. Scikit-Learn's unified fit/predict API lets you swap algorithms without changing evaluation code.
- Always use a Pipeline to chain preprocessing and classification. Pipelines prevent data leakage — transformers are fit on training data only.
- Accuracy is misleading for imbalanced classes. Report precision, recall, F1, PR-AUC, and ROC-AUC. Use class_weight='balanced' for imbalanced problems.
- Tune the decision threshold using predict_proba() and precision_recall_curve(). The default 0.5 is almost never optimal for production.
- Handle class imbalance with class_weight='balanced', SMOTE (inside a pipeline), or threshold tuning. Never apply SMOTE before splitting.
- Use StratifiedKFold cross-validation for reliable performance estimates. A single train/test split is high-variance.
- Tune hyperparameters with RandomizedSearchCV (exploration) then GridSearchCV (fine-tuning). Consider Optuna for expensive models.
- Feature importance (permutation_importance) and SHAP are essential for understanding and debugging models in production.
- XGBoost, LightGBM, and CatBoost are the standard algorithms for tabular classification in production. All implement Scikit-Learn's API.
- Scale features for distance-based and gradient-based algorithms (SVM, KNN, Logistic Regression). Don't bother for tree-based models.
- Naive Bayes + TF-IDF is the strong baseline for text classification. Always try it first before reaching for complex models.
- Learning curves diagnose overfitting/underfitting before you spend hours on hyperparameter tuning. Check them first.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- Q: What is the difference between classification and regression in machine learning? (Junior)
- Q: What is data leakage and how does a Pipeline prevent it? (Mid-level)
- Q: When would you use RandomizedSearchCV over GridSearchCV? (Mid-level)
- Q: What is precision and recall, and when does each matter more? (Junior)
- Q: What does stratify=y do in train_test_split? (Junior)
- Q: How do you handle imbalanced classes in Scikit-Learn? (Mid-level)
- Q: What is the difference between ROC-AUC and PR-AUC? (Mid-level)
- Q: How do you interpret feature importance and when would you use permutation importance over built-in importance? (Senior)
- Q: What is the difference between OneHotEncoder and OrdinalEncoder? (Junior)
- Q: How do you deploy a trained classification model to production? (Senior)
Frequently Asked Questions
What is Scikit-Learn used for?
Scikit-Learn is a Python machine learning library providing implementations of classification, regression, clustering, dimensionality reduction, preprocessing, and model evaluation. It follows a consistent fit/predict/transform API across all estimators, making it easy to build, evaluate, and tune machine learning pipelines.
What is a classification algorithm?
A classification algorithm learns from labelled training data to predict class labels for new unseen data. Examples: Logistic Regression (linear), Random Forest (ensemble of trees), SVM (maximum-margin), Gradient Boosting (additive tree ensemble), Naive Bayes (probabilistic). Each has different strengths for different data types.
What is cross-validation in machine learning?
Cross-validation estimates model performance more reliably than a single train/test split. K-fold CV splits data into K subsets, trains on K-1, validates on the remaining fold, and rotates K times. The average score is a stable estimate. Stratified K-fold preserves class proportions, which is important for classification.
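The "preserves class proportions" point is easy to verify directly. A small sketch with an invented 90/10 imbalanced label vector:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 90 negatives, 10 positives -- deliberately imbalanced
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features don't matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold of 20 samples keeps the original 10% positive rate
    print(f'fold {fold}: positives in val = {y[val_idx].sum()} / {len(val_idx)}')
```

With plain `KFold` on the same data, some folds could contain zero positives, making metrics like recall undefined for that fold.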
What is the difference between precision and accuracy?
Accuracy = (TP + TN) / total. It's misleading for imbalanced classes. Precision = TP / (TP + FP). It tells you how trustworthy your positive predictions are. Use accuracy only when class distribution is balanced; use precision, recall, and F1 otherwise.
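The formulas become concrete with a tiny invented example: 95 negatives, 5 positives, and a useless model that predicts "negative" every time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 95 negatives, 5 positives; the model never predicts the positive class
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print(recall_score(y_true, y_pred))     # 0.0  -- misses every positive
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 -- no positive predictions at all
```

Accuracy of 0.95 with zero recall is exactly the failure mode the article warns about: the model has learned nothing useful about the positive class.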
How do you handle imbalanced classes in Scikit-Learn?
Use class_weight='balanced' to weight minority class samples. Use SMOTE from imblearn to generate synthetic samples (inside a pipeline only). Tune the decision threshold using predict_proba() and precision_recall_curve. Report PR-AUC instead of ROC-AUC for heavily imbalanced data.
What is SHAP?
SHAP (SHapley Additive exPlanations) is a model interpretability method based on game theory. It assigns each feature a contribution value for each prediction. SHAP values are consistent and locally faithful. TreeExplainer is fast for tree-based models. SHAP is increasingly required for regulatory compliance in healthcare and finance.
When should I use XGBoost over Random Forest?
Use XGBoost when you need maximum accuracy on tabular data and are willing to tune hyperparameters. XGBoost is regularised (L1/L2), handles missing values, and supports GPU acceleration. Use Random Forest when you want a strong model with minimal tuning — it's more robust out of the box and less prone to overfitting.
What is the difference between OneHotEncoder and pd.get_dummies?
pd.get_dummies is applied at transform time and doesn't remember categories from training — it can produce different column counts for train and test data. OneHotEncoder is fit on training data and handles unseen categories gracefully with handle_unknown='ignore'. Always use OneHotEncoder in pipelines for production.
How do you choose between classification algorithms?
Start with a simple baseline (Logistic Regression or Naive Bayes). Then try Random Forest or Gradient Boosting. If you need more accuracy, try XGBoost/LightGBM/CatBoost. Choose based on: data size, feature types, interpretability requirements, training time constraints, and whether you have categorical features. There's no universally best algorithm — it depends on your data and constraints.
How do you know if your model is overfitting?
Compare training and validation scores. If training score is much higher than validation score, you're overfitting. Learning curves show this visually — the gap between train and validation curves. Solutions: increase regularisation, reduce model complexity, add more training data, or use feature selection to reduce dimensionality.
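The train/validation comparison is one `cross_validate` call with `return_train_score=True`. A sketch on synthetic data using an unpruned decision tree, which overfits by construction:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

# return_train_score exposes both sides of the gap in one call
res = cross_validate(DecisionTreeClassifier(random_state=42), X, y,
                     cv=5, return_train_score=True)
gap = res['train_score'].mean() - res['test_score'].mean()
print(f"train={res['train_score'].mean():.3f} "
      f"val={res['test_score'].mean():.3f} gap={gap:.3f}")
# An unpruned tree fits the training folds perfectly; a large gap signals overfitting
```

Setting `max_depth` on the same tree shrinks the gap, which is the regularisation remedy described above.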
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.