Scikit-Learn Classification — OneHotEncoder Schema Drift
- Classification predicts discrete class labels from labelled training data — binary, multi-class, or multi-label
- Scikit-Learn's unified fit/predict/predict_proba API lets you swap algorithms with one line change
- Always wrap preprocessing + classifier in a Pipeline to prevent data leakage from test statistics contaminating training
- Accuracy is misleading for imbalanced classes — use F1, PR-AUC, and ROC-AUC instead
- Tune the decision threshold via predict_proba() + precision_recall_curve() — the default 0.5 is almost never optimal in production
- XGBoost/LightGBM/CatBoost are what practitioners actually deploy for tabular data — they all implement the Scikit-Learn API
Classification Pipeline Quick Debug
- Model loaded but predict() crashes on new data: compare print(pipeline.named_steps['preprocessor'].feature_names_in_) with print(X_new.columns.tolist()).
- All predictions are the same class: check print(y_train.value_counts(normalize=True)) and print(pipeline.predict_proba(X_test)[:10]).
- Cross-validation F1 is 0.95 but test F1 is 0.60: compare cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1') with cross_val_score(clf, X_train_preprocessed, y_train, cv=5, scoring='f1').
- predict_proba returns extreme probabilities (all 0 or 1): run from sklearn.calibration import calibration_curve, then frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10).
Production Debug Guide: common failure modes and immediate diagnostic steps for production classification systems
- Run pipeline.predict() on a known-good sample to isolate whether it's a data issue or model corruption.
Classification predicts discrete class labels from labelled training data. Scikit-Learn provides a consistent, composable API for dozens of classification algorithms, from logistic regression to random forests, so you can swap algorithms, build preprocessing pipelines, evaluate performance rigorously, and tune hyperparameters without rewriting your code.
The algorithms are the easy part. The hard part is preventing data leakage, handling imbalanced classes, choosing the right evaluation metric, and building a pipeline you can serialize and deploy without surprises. This guide covers all of it — the algorithms, the gotchas, and the production patterns that separate a Jupyter notebook prototype from a reliable production system.
Here's the thing: most classification failures aren't about picking the wrong algorithm. They're about leaking test data into training, using accuracy on imbalanced data, or deploying a model that can't handle new categories. Get those right, and the algorithm choice often becomes a second-order concern.
What is Classification in Machine Learning?
Classification is a supervised learning task where the goal is to predict a discrete class label for each input. Supervised means you train on labelled examples — (features, label) pairs — where the correct label is known. The model learns the relationship between features and labels, then generalises to new inputs.
Binary classification has two classes (spam/not spam, fraud/not fraud, disease/healthy). Multi-class classification has three or more exclusive classes (cat, dog, bird). Multi-label classification assigns multiple labels per example (a news article can be both 'finance' and 'politics').
The output of a classifier is either a predicted class label (via predict()) or a probability distribution over all classes (via predict_proba()). Classification is distinct from regression, where the output is a continuous number.
The first question to ask before building any classifier: What does it cost to be wrong? If you misclassify spam, the user sees one extra email. If you misclassify a healthy patient as having cancer, they undergo unnecessary treatment. The cost of false positives vs false negatives determines your evaluation metric, your decision threshold, and your entire model selection strategy. Don't start with accuracy — start with the business impact of errors.
Here's a rule I've learned the hard way: if you don't know the cost of errors, you'll pick the wrong metric. And picking the wrong metric means you'll optimise the wrong thing. That's how you end up with a 99% accurate fraud model that catches zero fraud.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    X, y = load_iris(return_X_y=True)

    # Stratify so each class keeps the same proportion in train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(classification_report(
        y_test, y_pred, target_names=['setosa', 'versicolor', 'virginica']
    ))

Output:

                  precision    recall  f1-score   support
          setosa       1.00      1.00      1.00        10
      versicolor       1.00      1.00      1.00        10
       virginica       1.00      1.00      1.00        10
        accuracy                           1.00        30
       macro avg       1.00      1.00      1.00        30
    weighted avg       1.00      1.00      1.00        30
- Classification output: discrete label (spam/not spam) or probability distribution over classes
- Regression output: continuous number (price, temperature)
- The cost of errors drives everything — start with the business question, not the algorithm
- predict() gives labels, predict_proba() gives probabilities; always prefer probabilities in production for threshold control
Scikit-Learn's Unified Estimator API: fit, predict, score
Scikit-Learn's biggest strength is its consistent API. Every classifier implements the same interface: fit(X, y) trains the model, predict(X) returns predicted labels, predict_proba(X) returns probability estimates, and score(X, y) returns mean accuracy.
This uniformity means you can swap algorithms with a single line change. The exact same preprocessing, splitting, and evaluation code works with LogisticRegression, RandomForestClassifier, SVC, or GradientBoostingClassifier.
The score() trap: score() returns accuracy by default for classifiers. For imbalanced datasets, accuracy is misleading: on data where 95% of samples belong to one class, a model predicting that class every time scores 95% accuracy while being completely useless. Always use explicit metrics (F1, AUC) via sklearn.metrics rather than relying on score().
Think of it like this: score() is a shortcut for quick checks during development. But if you're using it to evaluate a model for production, you're making a mistake. You wouldn't check if a car works by looking at the colour — same idea.
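To make the score() trap concrete, here is a minimal sketch. It assumes a synthetic dataset from make_classification with roughly 95% negatives (the data and variable names are illustrative, not from the original example) and uses DummyClassifier as the "always predict the majority class" model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A "model" that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

accuracy = baseline.score(X_te, y_te)  # score() is plain accuracy
f1 = f1_score(y_te, baseline.predict(X_te), zero_division=0)

print(f"accuracy via score(): {accuracy:.3f}")
print(f"F1 on the positive class: {f1:.3f}")
```

The accuracy looks respectable while the F1 on the positive class is zero, because the baseline never predicts a positive. Any real model should be compared against this kind of baseline on the metric you actually care about.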
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.svm import SVC

    # Same fit/score calls for every estimator; only the constructor changes
    classifiers = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Decision Tree': DecisionTreeClassifier(max_depth=5),
        'Random Forest': RandomForestClassifier(n_estimators=100),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100),
        'SVM': SVC(probability=True),
    }

    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        acc = clf.score(X_test, y_test)
        print(f'{name}: {acc:.4f}')

    # Probability estimates for the first test sample, one value per class
    probs = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
    print(probs[0])
Decision Tree: 0.9333
Random Forest: 1.0000
Gradient Boosting: 1.0000
SVM: 0.9667
[0.02 0.07 0.91]
- Don't use score() in production evaluation: it returns accuracy, which hides class imbalance problems.
- Prefer predict_proba() over predict() in production.
- Use predict_proba() with a tuned threshold; never use the hardcoded 0.5 from predict().
- Use predict_proba() to get the full probability distribution over all classes.
Feature Scaling: When to Scale and When Not to Bother
Feature scaling normalises the range of input features. Some algorithms are sensitive to feature scale; others are completely invariant.
Algorithms that REQUIRE scaling: SVM (distance-based), KNN (distance-based), Logistic Regression (gradient descent convergence), Neural Networks (gradient-based optimization). If you forget to scale for these, the feature with the largest range dominates the model.
Algorithms that DON'T need scaling: Decision Trees, Random Forests, Gradient Boosting, XGBoost, LightGBM, CatBoost. Tree-based models split on individual feature thresholds, so absolute scale doesn't matter.
Three scalers to know: StandardScaler (zero mean, unit variance — sensitive to outliers), MinMaxScaler (scales to [0,1] — for neural networks), RobustScaler (median and IQR — robust to outliers).
The production rule: If your pipeline includes SVM, KNN, or Logistic Regression, add StandardScaler. If it's tree-based only, skip scaling. If unsure, add it — it won't hurt tree models, just wastes a few milliseconds.
One more thing: don't just blindly scale everything. I've debugged cases where scaling a sparse binary feature caused the model to behave poorly. Know your data.
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier

    # SVM NEEDS scaling
    svm_scaled = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', SVC(kernel='rbf', probability=True)),
    ])

    # When data has outliers, use RobustScaler
    svm_robust = Pipeline([
        ('scaler', RobustScaler()),
        ('clf', SVC(kernel='rbf', probability=True)),
    ])

    # For neural networks or bounded-input models
    nn_scaled = Pipeline([
        ('scaler', MinMaxScaler()),
        ('clf', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500)),
    ])
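The claim that the largest-range feature dominates a distance-based model can be checked with a small experiment. The sketch below assumes a synthetic two-feature dataset built for illustration: one informative feature with a small range and one pure-noise feature with a very large range. The classes are balanced, so accuracy from score() is an acceptable quick check here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Informative feature: around -1 for class 0 and +1 for class 1 (small range)
signal = np.where(y == 1, 1.0, -1.0) + rng.normal(0, 0.1, size=n)
# Uninformative feature with a huge range: it swamps Euclidean distances
noise = rng.normal(0, 10_000.0, size=n)
X = np.column_stack([signal, noise])

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_train, y_train)

acc_raw = knn_raw.score(X_test, y_test)
acc_scaled = knn_scaled.score(X_test, y_test)
print(f"KNN without scaling: {acc_raw:.2f}")
print(f"KNN with StandardScaler: {acc_scaled:.2f}")
```

Without scaling, the neighbours are chosen almost entirely by the noise feature, so the model performs near chance. With StandardScaler in the pipeline, both features contribute on the same scale and the informative one separates the classes.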
Preprocessing Pipelines — The Right Way to Handle Feature Engineering
A pipeline chains preprocessing steps and a classifier into a single object. This is not optional convenience — it is the correct way to prevent data leakage.
Data leakage happens when information from the test set influences training. The classic mistake: fit a StandardScaler on the entire dataset before splitting. The scaler has 'seen' the test data and computed statistics from it. Your model was trained on a subtly contaminated version of reality.
With a Pipeline, fit() calls fit_transform() on preprocessors and fit() on the classifier — all on training data only. predict() calls transform() on preprocessors and predict() on the classifier. The test data is only transformed with statistics learned from training data.
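A quick way to see this behaviour is to inspect the scaler inside a fitted Pipeline. The sketch below is illustrative: it uses the bundled breast cancer dataset and a LogisticRegression classifier (both choices are assumptions, not part of the original example) and compares the pipeline's scaler statistics with those of a scaler fitted on the full dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Leaky pattern: statistics computed from ALL rows, including the test set
leaky_scaler = StandardScaler().fit(X)

# Pipeline pattern: the scaler only ever sees what fit() is given
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
fitted_scaler = pipe.named_steps["scaler"]

train_mean = X_train.mean(axis=0)
print("pipeline scaler uses train-only means:",
      np.allclose(fitted_scaler.mean_, train_mean))
print("leaky scaler uses train-only means:",
      np.allclose(leaky_scaler.mean_, train_mean))
```

The pipeline's scaler holds exactly the training-set means, while the scaler fitted before splitting holds means that include test rows. That difference is the leak.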
The ColumnTransformer pattern: Real datasets have mixed feature types — numeric columns (age, income), categorical columns (country, plan_type). ColumnTransformer applies different preprocessing to different column subsets, all within the same pipeline. This is the standard pattern for production ML.
I'll say it again: if you're not using a Pipeline, you're almost certainly leaking data. I've seen it countless times.
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    import numpy as np
    import pandas as pd

    # Simulate a realistic dataset
    df = pd.DataFrame({
        'age': [25, 45, np.nan, 30, 60],
        'income': [50000, 80000, 120000, np.nan, 90000],
        'country': ['US', 'UK', 'US', 'DE', 'FR'],
        'plan_type': ['basic', 'premium', 'basic', 'enterprise', 'premium'],
        'education': ['high_school', 'bachelors', 'masters', 'phd', 'bachelors'],
        'churned': [0, 0, 1, 0, 1],
    })
    X = df.drop('churned', axis=1)
    y = df['churned']

    numeric_features = ['age', 'income']
    nominal_features = ['country', 'plan_type']
    ordinal_features = ['education']

    # Ordinal encoding for ordered categories
    education_order = ['high_school', 'bachelors', 'masters', 'phd']

    preprocessor = ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
        ]), numeric_features),
        ('nom', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ]), nominal_features),
        ('ord', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('ordinal', OrdinalEncoder(categories=[education_order])),
        ]), ordinal_features),
    ])

    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
    ])
    pipeline.fit(X, y)

    # See transformed feature names
    feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
    print('Transformed features:', feature_names)
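The handle_unknown='ignore' setting on OneHotEncoder is what protects a deployed pipeline when a category appears that was never seen in training. The following sketch isolates that behaviour on a single column; the country codes are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_countries = np.array([["US"], ["UK"], ["DE"]])
new_countries = np.array([["US"], ["BR"]])  # "BR" never appeared in training

# Default behaviour (handle_unknown="error"): unseen categories raise
strict = OneHotEncoder().fit(train_countries)
try:
    strict.transform(new_countries)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print("strict encoder raised:", exc)

# With handle_unknown="ignore", an unseen category becomes an all-zero row
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train_countries)
encoded = tolerant.transform(new_countries).toarray()

print("learned categories:", tolerant.categories_[0])
print(encoded)
```

The all-zero row keeps predict() from crashing, but it also silently hides the drift. In a production system it is worth logging or counting rows that contain unseen categories so that you notice when retraining is due.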
Naive Bayes — The Baseline You Should Always Try First
Before reaching for Random Forest or XGBoost, try Naive Bayes. It's fast, simple, surprisingly effective, and serves as a strong baseline. If your complex model can't beat Naive Bayes, something is wrong with your features.
Three variants: GaussianNB (continuous features, assumes each feature is normally distributed within a class), MultinomialNB (count features such as word counts; fractional TF-IDF weights also work well in practice, which makes it the go-to for text classification), BernoulliNB (binary features such as word present/absent).
Why it works for text: Text data is high-dimensional and sparse. Naive Bayes handles this gracefully because it assumes feature independence. This assumption is obviously wrong (words aren't independent), but it works shockingly well in practice.
The production baseline pattern: Always train a Naive Bayes model first. Report its metrics. Then train your fancy model. If the fancy model only marginally beats Naive Bayes, consider whether the added complexity is worth it. Naive Bayes trains in milliseconds and predicts in microseconds — that matters for real-time systems.
Here's a story: I once replaced a carefully tuned XGBoost model with a Naive Bayes model for a real-time ad classification system. The XGBoost was 1.2% more accurate but took 15x longer to predict. The business chose the faster model. Know your constraints.
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import classification_report

    # Gaussian Naive Bayes for continuous features
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)
    y_pred = gnb.predict(X_test)
    print('GaussianNB:', classification_report(y_test, y_pred))

    # Multinomial Naive Bayes for text (TF-IDF features)
    text_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
        ('clf', MultinomialNB(alpha=0.1)),
    ])

    # Bernoulli Naive Bayes for binary features
    bnb = BernoulliNB()
    binary_features = (X_train > 0).astype(int)
    bnb.fit(binary_features, y_train)
0 0.96 0.94 0.95 50
1 0.83 0.88 0.85 22
accuracy 0.92 72
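As a usage sketch for the text baseline, the example below fits a TF-IDF plus MultinomialNB pipeline on a tiny hand-written corpus. The six documents and their labels are invented for illustration; a real baseline needs far more data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (made up for this sketch)
texts = [
    "win free money now",
    "free prize claim now",
    "cheap money offer",
    "meeting at noon tomorrow",
    "project report attached",
    "lunch with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB(alpha=0.1)),
])
text_clf.fit(texts, labels)

new_texts = ["free money prize", "team meeting report"]
preds = text_clf.predict(new_texts)
probs = text_clf.predict_proba(new_texts)

for text, pred, prob in zip(new_texts, preds, probs):
    print(f"{text!r} -> {pred} (class order {list(text_clf.classes_)}: {prob.round(3)})")
```

Because the vectorizer lives inside the pipeline, raw strings go in at both fit and predict time, and the vocabulary is learned from training text only.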
Evaluating a Classifier — Beyond Accuracy
Accuracy is misleading for imbalanced classes. If 95% of data is class 0, a classifier predicting class 0 always achieves 95% accuracy while being completely useless.
Precision: Of all samples predicted positive, what fraction actually are? High precision = few false alarms. Recall: Of all actual positives, what fraction did we catch? High recall = few missed cases. F1: Harmonic mean of precision and recall.
The confusion matrix shows the full breakdown: TP, TN, FP, FN. For multi-class, it's an NxN matrix where the diagonal is correct predictions.
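The relationship between the confusion matrix and the metrics above can be worked through by hand. This sketch uses a small made-up set of ten labels and checks the hand-computed values against scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Ten illustrative samples: 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of predicted positives, how many are real
recall = tp / (tp + fn)     # of real positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Cross-check against the library implementations
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```

Here one positive is missed and one negative is falsely flagged, so precision and recall are both 3/4 and F1 is 0.75 as well, while accuracy would be 8/10.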
ROC-AUC measures discrimination ability across all thresholds. AUC 0.5 = random, 1.0 = perfect. PR-AUC (Precision-Recall AUC) is better for imbalanced data because it focuses on the positive class.
Which metric matters depends on the business cost:
- Spam filtering: Precision matters. You don't want legitimate email in the spam folder.
- Disease screening: Recall matters. You don't want to miss a sick patient.
- Fraud detection: Recall matters more, but precision also matters because investigating false positives costs money.
The multi-metric approach: Always report at least three metrics: precision, recall, and F1. Add AUC if you use predicted probabilities. Never report accuracy alone for imbalanced problems.
I've had to tell more than one team that their 99% accurate model was useless. The confusion matrix showed they predicted 'not fraud' for everything. That's when you know you've been optimising the wrong thing.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (
        classification_report, confusion_matrix, roc_auc_score,
        average_precision_score, ConfusionMatrixDisplay, precision_recall_curve
    )
    import matplotlib.pyplot as plt
    import numpy as np

    print(classification_report(y_test, y_pred))

    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.title('Confusion Matrix')
    plt.show()

    # ROC-AUC for binary classification (needs positive-class probabilities)
    probs = pipeline.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, probs)
    print(f'ROC-AUC: {auc:.4f}')

    # PR-AUC: better metric for imbalanced classes
    pr_auc = average_precision_score(y_test, probs)
    print(f'PR-AUC: {pr_auc:.4f}')

    # Custom class weights: when you know the business cost ratio
    clf_cost_sensitive = RandomForestClassifier(
        class_weight={0: 1, 1: 100}, random_state=42
    )
    clf_cost_sensitive.fit(X_train, y_train)
    y_pred_cost = clf_cost_sensitive.predict(X_test)
    print(classification_report(y_test, y_pred_cost))
0 0.98 0.96 0.97 50
1 0.82 0.91 0.86 22
accuracy 0.95 72
macro avg 0.90 0.93 0.91 72
weighted avg 0.94 0.95 0.94 72
ROC-AUC: 0.9834
PR-AUC: 0.9412
precision recall f1-score support
0 0.99 0.88 0.93 50
1 0.61 0.95 0.75 22
accuracy 0.90 72
macro avg 0.80 0.92 0.84 72
weighted avg 0.88 0.90 0.88 72
Decision Threshold Tuning — The Most Underrated Technique
Every binary classifier has a default decision threshold of 0.5: if predict_proba() returns >= 0.5, predict class 1. This threshold is arbitrary and almost never optimal for your specific business problem.
The insight: The threshold controls the trade-off between precision and recall. Lowering it catches more positives (higher recall) but flags more negatives as positives (lower precision). Raising it gives fewer but more trustworthy positive predictions.
How to find the optimal threshold: Use precision_recall_curve to get precision and recall at every possible threshold. Then choose the threshold that optimises your business metric.
In production: Don't use predict() — use predict_proba() and apply your own threshold. Store the threshold alongside the model. When business costs change, adjust the threshold without retraining.
I tuned the threshold on a fraud detection model from 0.5 to 0.15. Recall went from 60% to 92%. Precision dropped from 85% to 45%. The fraud team preferred catching 92% of fraud — the cost of missing fraud far exceeded the cost of investigating false positives. That's a business decision, not a technical one.
    import numpy as np
    from sklearn.metrics import precision_recall_curve, f1_score

    # Get probabilities
    y_probs = pipeline.predict_proba(X_test)[:, 1]
    precisions, recalls, thresholds = precision_recall_curve(y_test, y_probs)

    # Find threshold that maximises F1
    # (precisions/recalls have one more entry than thresholds, hence [:-1])
    f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-8)
    best_idx = np.argmax(f1_scores)
    best_threshold = thresholds[best_idx]
    print(f'Best threshold for F1: {best_threshold:.3f}')
    print(f'F1 at best threshold: {f1_scores[best_idx]:.3f}')

    # Apply custom threshold
    y_pred_custom = (y_probs >= best_threshold).astype(int)
    print(f'F1 with custom threshold: {f1_score(y_test, y_pred_custom):.4f}')

    # Find threshold for minimum recall target
    min_recall = 0.95
    valid_indices = np.where(recalls[:-1] >= min_recall)[0]
    if len(valid_indices) > 0:
        best_for_recall = valid_indices[np.argmax(precisions[valid_indices])]
        recall_threshold = thresholds[best_for_recall]
        print(f'Threshold for >= 95% recall: {recall_threshold:.3f}')

    # Business cost minimisation
    fn_cost = 1000  # missed fraud
    fp_cost = 10    # false alarm
    min_cost = float('inf')
    best_cost_threshold = 0.5
    for i, t in enumerate(thresholds):
        y_pred_t = (y_probs >= t).astype(int)
        fn = np.sum((y_test == 1) & (y_pred_t == 0))
        fp = np.sum((y_test == 0) & (y_pred_t == 1))
        cost = fn * fn_cost + fp * fp_cost
        if cost < min_cost:
            min_cost = cost
            best_cost_threshold = t
    print(f'Threshold minimising business cost: {best_cost_threshold:.3f} (cost: ${min_cost:,.0f})')
F1 at best threshold: 0.831
F1 with custom threshold: 0.8314
Threshold for >= 95% recall: 0.180
Threshold minimising business cost: 0.150 (cost: $2,340)
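One way to keep the threshold next to the model is to serialize them together as a single bundle. The sketch below is only one possible layout: the dictionary keys, the predict_with_threshold helper, the LogisticRegression model, the synthetic data, and the 0.15 value are all illustrative assumptions.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced data and model
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical bundle layout: the fitted model plus its tuned threshold
bundle = {"model": model, "threshold": 0.15}

# Round-trip through serialization so both pieces travel together
restored = pickle.loads(pickle.dumps(bundle))


def predict_with_threshold(bundle, X):
    """Label a sample positive when P(class 1) >= the stored threshold."""
    probs = bundle["model"].predict_proba(X)[:, 1]
    return (probs >= bundle["threshold"]).astype(int)


default_positives = int(model.predict(X).sum())
tuned_positives = int(predict_with_threshold(restored, X).sum())
print(f"positives at default 0.5: {default_positives}")
print(f"positives at stored threshold {restored['threshold']}: {tuned_positives}")
```

When business costs change, only the stored threshold value needs updating; the fitted model inside the bundle stays untouched.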
Use predict_proba() and apply your own threshold. Store the threshold as a configuration parameter alongside the model. This single technique has saved me more production incidents than any algorithm choice.
Handling Imbalanced Classes: SMOTE, Thresholds, and Class Weights
Most real-world classification problems are imbalanced: fraud is rare, churn is rare, disease is rare. If you don't address this, your model will learn to predict the majority class every time.
1. class_weight='balanced': The simplest approach. Increases the loss contribution of minority class samples. Works with LogisticRegression, RandomForest, SVM. No extra dependencies. Try this first.
2. SMOTE (Synthetic Minority Oversampling): Generates synthetic minority class samples by interpolating between existing minority samples. Requires the imbalanced-learn package (pip install imbalanced-learn). Use SMOTEENN or SMOTETomek to clean noisy synthetic samples.
3. Threshold tuning: Use predict_proba() and tune the decision threshold (see previous section). Often the most effective approach because you keep the model unchanged — you just change the decision boundary.
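For the first strategy, it helps to see what class_weight='balanced' actually computes. Scikit-learn documents the formula as n_samples / (n_classes * np.bincount(y)); the sketch below checks that on an illustrative label vector with a 90/10 split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# What estimators use internally for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The documented formula, computed by hand
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.4f}")
```

With a 90/10 split the minority class gets weight 5.0 and the majority class about 0.56, so each minority sample counts roughly nine times as much in the loss.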
The gotcha with SMOTE: Never apply SMOTE before train/test splitting. SMOTE generates synthetic samples based on nearest neighbours — if applied before splitting, synthetic test samples leak information about training samples. Always SMOTE inside a pipeline or after splitting.
imblearn.pipeline.Pipeline: scikit-learn's Pipeline doesn't support samplers (they lack transform()). Use imblearn.pipeline.Pipeline instead, which supports both transformers and samplers.
I've seen too many people apply SMOTE before splitting and claim an F1 of 0.95. When they fix it, it drops to 0.65. Don't be that person.
    from imblearn.over_sampling import SMOTE, ADASYN
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.combine import SMOTEENN
    from imblearn.pipeline import Pipeline as ImbPipeline
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, StratifiedKFold
    import numpy as np

    # Assumes numeric_features and categorical_features are lists of column
    # names already defined for your dataset.

    # Strategy 1: class_weight (simplest)
    clf_weighted = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)

    # Strategy 2: SMOTE inside imblearn pipeline
    preprocessor = ColumnTransformer([
        ('num', Pipeline([
            ('imp', SimpleImputer(strategy='median')),
            ('sc', StandardScaler()),
        ]), numeric_features),
        ('cat', Pipeline([
            ('imp', SimpleImputer(strategy='most_frequent')),
            ('ohe', OneHotEncoder(handle_unknown='ignore')),
        ]), categorical_features),
    ])
    smote_pipeline = ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
    ])

    # Strategy 3: SMOTE + Edited Nearest Neighbours (cleans noisy samples)
    smoteenn_pipeline = ImbPipeline([
        ('preprocessor', preprocessor),
        ('smoteenn', SMOTEENN(random_state=42)),
        ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
    ])

    # Strategy 4: Undersampling + SMOTE
    combined_pipeline = ImbPipeline([
        ('preprocessor', preprocessor),
        ('under', RandomUnderSampler(sampling_strategy=0.5, random_state=42)),
        ('smote', SMOTE(random_state=42)),
        ('classifier', RandomForestClassifier(n_estimators=200, random_state=42)),
    ])

    # Resampling happens inside each CV training fold only, never on validation folds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for name, pipe in [('class_weight', clf_weighted), ('SMOTE', smote_pipeline),
                       ('SMOTEENN', smoteenn_pipeline), ('Under+SMOTE', combined_pipeline)]:
        scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
        print(f'{name:15s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})')
SMOTE F1: 0.6512 (+/- 0.0198)
SMOTEENN F1: 0.6634 (+/- 0.0187)
Under+SMOTE F1: 0.6589 (+/- 0.0201)
Cross-Validation — Reliable Performance Estimates
A single train/test split gives a noisy estimate. The specific random split affects which samples are in each set, and performance can vary significantly between splits. Cross-validation averages performance across multiple splits.
K-fold splits data into K folds, trains on K-1, validates on the remaining fold, rotates K times. StratifiedKFold preserves class proportions — always use it for classification.
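To see why stratification matters, compare how many positives land in each validation fold with and without it. The sketch below uses an illustrative label vector with 10 positives out of 100, sorted so that all positives sit at the end (a deliberately bad case for unshuffled KFold).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Illustrative labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_counts = [int(y[test_idx].sum()) for _, test_idx in stratified.split(X, y)]

plain = KFold(n_splits=5)  # no shuffle: folds are contiguous slices
plain_counts = [int(y[test_idx].sum()) for _, test_idx in plain.split(X)]

print("positives per fold, StratifiedKFold:", strat_counts)
print("positives per fold, plain KFold:   ", plain_counts)
```

StratifiedKFold puts the same number of positives in every fold, so each fold's F1 or recall is computed on a comparable sample. Plain KFold on sorted data leaves most folds with no positives at all, which makes per-fold metrics meaningless.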
cross_val_score returns test scores. cross_validate returns train and test scores plus timing — useful for diagnosing overfitting (train >> test = overfitting).
How many folds? 5 is the default. Use 10 for small datasets (more training data per fold). Use 3 for very large datasets (faster). The standard deviation across folds shows how stable the estimate is.
I once had a model that scored 0.95 on one random split and 0.72 on another. Cross-validation showed the true score was 0.83 with a standard deviation of 0.08. That's why we use CV.
    from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
    from sklearn.ensemble import RandomForestClassifier

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)

    scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_weighted')
    print(f'F1: {scores.mean():.4f} (+/- {scores.std():.4f})')

    # cross_validate also returns train scores, useful for spotting overfitting
    results = cross_validate(clf, X, y, cv=cv,
                             scoring=['accuracy', 'f1_weighted'],
                             return_train_score=True)
    print('Train F1:', results['train_f1_weighted'].mean())
    print('Test F1:', results['test_f1_weighted'].mean())

    # CV with full pipeline: no leakage
    cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='roc_auc')
    print(f'Pipeline AUC: {cv_scores.mean():.4f}')
Train F1: 0.9998
Test F1: 0.9734
Pipeline AUC: 0.9856
| Algorithm | Training Speed | Prediction Speed | Handles Imbalance? | Needs Scaling? | Interpretability | Best For |
|---|---|---|---|---|---|---|
| Logistic Regression | Fast | Fast | Yes (class_weight) | Yes | High | Baseline, online learning |
| Decision Tree | Fast | Fast | Partial (depth bias) | No | High | Interpretable rules |
| Random Forest | Moderate | Moderate | Yes (class_weight) | No | Medium | General purpose, tabular |
| Gradient Boosting (XGBoost) | Slow | Fast | Yes (scale_pos_weight) | No | Low | Competitions, high accuracy |
| SVM (RBF kernel) | Slow | Slow | Yes (class_weight) | Yes | Low | Small datasets, non-linear |
| Naive Bayes | Very Fast | Very Fast | No | No | High | Text, real-time baselines |
| KNN | None | Slow (lazy) | No | Yes | Medium | Small datasets, simple decision boundaries |
🎯 Key Takeaways
- Scikit-Learn's unified API enables rapid algorithm swapping, but the real work is in building leakage-free pipelines.
- Always wrap preprocessing and the model in a Pipeline to prevent data leakage from test statistics.
- Accuracy is useless for imbalanced data—report F1, PR-AUC, and precision/recall instead.
- Tune the decision threshold using precision_recall_curve—the default 0.5 is almost never optimal.
- Handle class imbalance with class_weight first, then threshold tuning, then SMOTE—never before splitting.
- Use StratifiedKFold for cross-validation to get reliable performance estimates.
- Set handle_unknown='ignore' on OneHotEncoder to avoid production crashes from unseen categories.
Interview Questions on This Topic
- Explain the difference between precision, recall, and F1-score. When would you prioritise precision over recall? (Junior)
- How would you handle extreme class imbalance (1% positive class) in a fraud detection pipeline? Walk through the steps. (Mid-level)
- Design a production classification system that can handle schema drift, has interpretable predictions, and can be retrained without downtime. (Senior)
Frequently Asked Questions
What is the difference between classification and regression?
Classification predicts discrete class labels (e.g., spam or not spam). Regression predicts continuous values (e.g., house price). Scikit-Learn provides separate estimators for each task, but the API is identical (fit/predict/score).
When should I use Random Forest vs XGBoost?
Random Forest is simpler, trains faster, and has fewer hyperparameters to tune. Use it as a strong baseline. XGBoost typically achieves higher accuracy but requires hyperparameter tuning (learning rate, max_depth, subsample). In production, XGBoost is more common due to its performance on tabular data. Start with Random Forest, then try XGBoost if you need the extra edge.
How do I handle missing values in a classification pipeline?
Use SimpleImputer inside the Pipeline. For numeric features, strategy='median' is robust to outliers. For categorical features, strategy='most_frequent' fills with the most common category. Never drop rows with missing values in production—your pipeline must handle them gracefully.
What is data leakage and how do I prevent it?
Data leakage happens when information from the test set (or future data) influences training. Common causes: fitting a scaler on the whole dataset before splitting, applying SMOTE before splitting, or using future data for feature engineering. Prevent it by placing all preprocessing inside a Pipeline so that fit() is only called on the training set.
Why is my model's accuracy high but its business impact low?
Accuracy is dominated by the majority class. If 95% of your data is 'not fraud', a model that predicts 'not fraud' every time scores 95% accuracy but catches zero fraud. Switch to precision, recall, F1, or PR-AUC to evaluate performance on the minority class. Then tune the decision threshold to match business priorities.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.