Mid-level 6 min · March 06, 2026

Train Test Split & CV — The $50k Leakage Mistake

In production, accuracy dropped from 92% to 54% because StandardScaler leaked fold stats.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Everything here is grounded in real deployments.

Last updated: May 24, 2026
Quick Answer
  • Train/test split: one random 80/20 cut of your data. Cross validation: K rounds, each fold takes a turn as the test set.
  • Use train_test_split with stratify=targets for classification — keeps class proportions intact.
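  • K-Fold averages K fold-level performance estimates, which lowers the variance of the estimate compared with a single split. The folds share training data, so the estimates are correlated and the reduction is smaller than the 1/K you would get from K independent ones.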
  • Production trap: preprocessing before the split leaks test data into training — wrap scalers in a Pipeline.
  • Biggest mistake: tuning hyperparameters on the same CV scores you report — lock a held-out test set before tuning begins.
✦ Definition~90s read
What is Train Test Split and Cross Validation?

Train-test split and cross-validation are the foundational techniques for honestly evaluating machine learning models — they exist because the cardinal sin in ML is measuring performance on data the model has already seen. A train-test split carves your dataset into a training set (used to fit the model) and a held-out test set (used to estimate generalization error).

Cross-validation (CV) takes this further by repeatedly splitting the data into complementary subsets, training on most and validating on the remainder, then averaging the results. Without these, you're essentially grading your own homework: training accuracy is almost always misleadingly high, and deploying such a model in production will cost you real money — often $50k or more in misallocated resources, failed campaigns, or regulatory fines from overconfident predictions.

These techniques are your first defense against data leakage, where information from outside the training set inadvertently inflates performance. A single random split (e.g., 80/20) works for large, well-shuffled datasets, but it's fragile — a lucky split can overestimate performance, and an unlucky one can cause you to discard a good model.

K-fold CV (typically 5 or 10 folds) mitigates this by training and validating k times, giving you a more stable performance estimate plus a view of how much it varies across folds. For imbalanced classification, stratified splits preserve class proportions in each fold, preventing a rare class from vanishing entirely from the validation set.

When you need to tune hyperparameters, the gold standard is a three-way split (train/validation/test) or nested CV — an outer loop for test performance and an inner loop for hyperparameter selection — to avoid optimistic bias from using test data to guide model choices.

Time series data breaks the standard random-split assumption entirely: you cannot use future observations to predict the past. Time series CV (e.g., expanding window or rolling window) respects temporal order, training only on data before the validation point.

scikit-learn's TimeSeriesSplit handles this explicitly; GroupKFold solves a related problem by keeping all rows from one entity in the same fold. Alternatives exist, like holdout validation for massive datasets where CV is computationally prohibitive, or bootstrap methods for small samples, but train-test split and CV remain the workhorses.

Skip them, and you're not doing data science; you're doing data theater.

Plain-English First

Imagine you're studying for a final exam. If your teacher hands you the exact exam questions during practice, of course you'll ace it — but you haven't actually learned anything. Train/test split is the rule that says: practice on one set of questions, get tested on a completely different set. Cross validation takes it further — it's like sitting five different mini-exams in rotation so no single exam can trick you into thinking you're better (or worse) than you really are.

A single wrong split can poison your entire model. Train-test split and cross-validation are the guardrails that keep your evaluation honest, preventing you from mistaking memorization for generalization. Without them, you’re flying blind—validating on leakage, overfitting to noise, and shipping models that fail in production.

Why Train-Test Split & Cross-Validation Are Your First Defense Against Leakage

Train-test split and cross-validation are the two fundamental techniques for estimating how a model will perform on unseen data. The core mechanic: you partition your labeled dataset into disjoint subsets — one for training the model, one for testing it. Cross-validation extends this by repeating the split multiple times, averaging the results to reduce variance in the performance estimate. The critical rule: no data point used in training may ever appear in the test set, or you're measuring memorization, not generalization.

In practice, a simple train-test split (e.g., 80/20) needs only one model fit but yields a single estimate with high variance, especially on small datasets. K-fold cross-validation (e.g., 5-fold) splits data into k roughly equal folds, trains on k-1 folds, tests on the held-out fold, and repeats k times. This gives a more stable estimate but costs k model fits. Stratified variants preserve class proportions in each fold, critical for imbalanced classification.

Use train-test split for quick sanity checks and when you have abundant data (100k+ rows). Use cross-validation for hyperparameter tuning, model selection, and when data is scarce — it squeezes more signal from each sample. In production, the real cost of skipping proper validation is deploying a model that fails silently on new data, often due to temporal leakage or data snooping.

Leakage is silent and deadly
A common mistake: scaling the entire dataset before splitting. This leaks information from the test set into training and can inflate the reported accuracy; how much depends on the dataset, and the effect tends to be larger on small samples.
Production Insight
Teams building time-series models often shuffle all rows before splitting, breaking temporal order. The symptom: a model that predicts tomorrow's stock price using 'future' data, achieving 99% accuracy in validation but failing instantly in production. Rule of thumb: for any sequential data, always split by time — never shuffle.
Key Takeaway
Never let a single test data point influence training — that includes scaling, imputation, or feature selection.
Cross-validation is not a training technique; it's an evaluation technique — use it to tune hyperparameters, not to train the final model.
Always match the splitting strategy to the data structure: random for i.i.d., temporal for time series, grouped for clustered observations.
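To make the feature-selection case concrete, here is a minimal sketch on pure noise, where an honest accuracy should sit near 50%. The dataset shape, the choice of SelectKBest with k=20, and LogisticRegression are illustrative assumptions and are not taken from the incident described in this article.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))   # pure noise features
y = np.array([0, 1] * 50)              # labels unrelated to X
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# LEAKY: choose the 20 "best" features using every row, test folds included
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv
).mean()

# HONEST: the selector is refit on the training rows of each fold
honest_pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(honest_pipeline, X, y, cv=cv).mean()

print(f"Leaky CV accuracy : {leaky_acc:.3f}")
print(f"Honest CV accuracy: {honest_acc:.3f}")
```

The only difference between the two runs is where the selector is fitted. Since the labels carry no signal, any accuracy well above chance in the leaky run comes from the selection step having seen the test folds.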
Figure: Train-Test Split & CV: Avoiding Leakage. Proper evaluation workflow to prevent data leakage: Train-Test Split (hold out a test set for final evaluation); K-Fold Cross-Validation (multiple train/validation splits for stability); Stratified Splits (preserve class proportions in each fold); Time Series CV (use expanding window or sliding window); GroupKFold (keep groups intact across folds); Train/Validation/Test (gold standard: three-way split). Warning: evaluating on training data overestimates performance. Always use a held-out test set; never tune on test data.

Why Evaluating on Training Data Is a Silent Killer

When you train a model, it adjusts its internal parameters to minimize error on the data you gave it. The more complex the model, the more it can contort itself to fit every quirk, outlier, and noise spike in that training data. A decision tree with unlimited depth will reach 100% training accuracy on almost any dataset — it just memorizes every row. That's overfitting, and it's invisible unless you test the model on data it's never touched.

Here's the subtle danger: even experienced developers fall into this trap when they tune hyperparameters. Every time you check your model's test score and adjust something, you're indirectly leaking information about the test set into your decisions. This is why a held-out test set must be locked away and touched exactly once, at the very end.

Train/test split is the minimal viable defense. You take your full dataset, randomly shuffle it, and cut it into two non-overlapping pieces. The training set is the classroom. The test set is the final exam. The model never sees the exam questions until grading day. This one habit alone separates a trustworthy ML workflow from a broken one.

train_test_split_basics.py (Python)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# io.thecodeforge: Standardizing Evaluation splits
cancer_data = load_breast_cancer()
features = cancer_data.data
targets   = cancer_data.target

# Split: 80% train, 20% test
# stratify=targets ensures class balance is maintained
X_train, X_test, y_train, y_test = train_test_split(
    features,
    targets,
    test_size=0.20,
    random_state=42,
    stratify=targets
)

# Train the model
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

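# Evaluate on both sets so the train/test gap is visible
train_accuracy = accuracy_score(y_train, forest_model.predict(X_train))
test_accuracy  = accuracy_score(y_test,  forest_model.predict(X_test))

print(f"Training samples : {len(X_train)}")
print(f"Test samples : {len(X_test)}")
print(f"Train accuracy : {train_accuracy:.4f}")
print(f"Test accuracy  : {test_accuracy:.4f}")
print(f"Gap (overfit signal): {train_accuracy - test_accuracy:.4f}")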
Output
Training samples : 455
Test samples : 114
Train accuracy : 1.0000
Test accuracy : 0.9649
Gap (overfit signal): 0.0351
Watch Out: Always Use stratify on Classification Tasks
If your dataset has 90% class A and 10% class B, a random split can accidentally put most of class B in the training set and almost none in the test set. Your model looks great — but it's never been properly tested on the minority class. Pass stratify=targets to train_test_split every time on classification problems.
Production Insight
A gap of 0.0351 is modest: random forests routinely reach 100% training accuracy, so a perfect training score alone proves little. What matters is the size and the trend of the gap. Always compare train and test accuracy; a gap above roughly 5 percentage points is a red flag worth investigating.
Monitor the train-test gap over time. If it widens after retraining, something changed in the data or the split strategy.
Key Takeaway
A random split alone can hide overfitting. Always compare train and test scores.
stratify=targets is non-negotiable for classification.
Lock away your test set — touch it only once at the very end.

K-Fold Cross Validation — When One Test Split Isn't Enough

Here's the honest problem with a single train/test split: your result depends on luck. If the random split happens to put all the 'easy' examples in the test set, your model looks brilliant. If it puts all the hard ones there, your model looks terrible. With a small dataset (say, 500 rows), one unlucky split can swing your accuracy by 5–10 percentage points.

K-Fold cross validation fixes this by running the experiment K times, each time with a different fold acting as the test set. With K=5, your dataset is cut into 5 equal chunks. Round 1: train on folds 2–5, test on fold 1. Round 2: train on folds 1, 3–5, test on fold 2. And so on.
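The rotation described above can be checked on a toy array. This is a minimal sketch, assuming 10 rows and scikit-learn's KFold with its default of no shuffling; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # 10 toy rows
kfold = KFold(n_splits=5)          # default shuffle=False, so folds are contiguous

test_counts = np.zeros(len(X), dtype=int)
for round_no, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # Within a round, no row is in both the training and the test set
    assert set(train_idx).isdisjoint(test_idx)
    test_counts[test_idx] += 1
    print(f"Round {round_no}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Across the 5 rounds, every row served as test data exactly once
print(test_counts.tolist())
```

Each row is trained on four times and tested on once, which is why the averaged score uses every sample for evaluation without ever scoring a row the model was fitted on in that round.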

The cost is compute time — you're training K models instead of one. But the payoff is enormous: you use 100% of your data for evaluation (across all folds), and your performance estimate has far lower variance. For anything going into production, or any paper you're publishing, K-Fold is the standard.

kfold_cross_validation.py (Python)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# io.thecodeforge: Implementing robust Cross-Validation
cancer_data = load_breast_cancer()
features = cancer_data.data
targets   = cancer_data.target

# Pipeline prevents data leakage from scaler to validation fold
evaluation_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = cross_val_score(
    evaluation_pipeline,
    features,
    targets,
    cv=stratified_kfold,
    scoring='accuracy'
)

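# Report each fold as well as the summary statistics
print("Per-fold accuracy scores:")
for fold_number, fold_score in enumerate(fold_scores, start=1):
    print(f"Fold {fold_number}: {fold_score:.4f}")
print(f"Mean accuracy : {fold_scores.mean():.4f}")
print(f"Std deviation : {fold_scores.std():.4f}")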
Output
Per-fold accuracy scores:
Fold 1: 0.9737
Mean accuracy : 0.9649
Std deviation : 0.0082
Pro Tip: Always Put Preprocessing Inside a Pipeline
If you call StandardScaler().fit_transform(features) before cross_val_score, you've already computed the mean and variance of the entire dataset — including what would have been the test folds. That's data leakage. Wrapping the scaler in a Pipeline ensures it only ever sees the training fold during each cross-validation round.
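The two patterns can be put side by side. This is a sketch under assumptions: it swaps in LogisticRegression (a scale-sensitive model) for illustration, and on a clean dataset like this one the two scores may come out almost identical, because the leak matters most with small or shifting data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# LEAKY: the scaler's mean and variance are computed from all rows,
# including rows that later act as test folds
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled_all, y, cv=cv
)

# SAFE: the scaler is refit on the training part of every fold
safe_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(safe_pipeline, X, y, cv=cv)

print(f"Scaled before CV : {leaky_scores.mean():.4f}")
print(f"Scaled inside CV : {safe_scores.mean():.4f}")
```

Only the second number is an estimate you can defend, whatever the size of the difference on a given dataset.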
Production Insight
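The std deviation of 0.0082 measures how much accuracy varies from fold to fold. Treat roughly two standard deviations (±0.016) as an informal indication of spread, not a formal 95% confidence interval: fold scores share training data and are correlated. If you see std > 0.05, your dataset is too small or the folds are not stratified properly.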
In production, use K=5 or K=10. K=2 gives high variance; K=20 is computationally expensive and yields diminishing returns.
Key Takeaway
K-Fold reduces performance estimate variance dramatically.
Pipeline prevents the most common leakage: scaling before splitting.
Always pair StratifiedKFold with classification targets.

The Gold Standard: Train / Validation / Test and Nested CV

Once you add hyperparameter tuning to the picture, even K-Fold cross validation can leak. Here's why: if you run 50 hyperparameter combinations through the same CV folds and pick the best one, you've effectively optimised for those specific folds. The CV score of your winning model is now optimistically biased.

The production-grade solution is a three-way split: training set (model learns), validation set or inner CV (hyperparameters are tuned), and a completely held-out test set (touched exactly once for final reporting). Nested cross validation is the fully cross-validated version of the same idea: an inner loop for hyperparameter search and an outer loop, in place of the single held-out set, for unbiased performance estimation.
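Nested cross validation can be sketched by passing a GridSearchCV object to cross_val_score, so the whole search is repeated inside every outer fold. The grid, the fold counts, and the random seeds below are illustrative assumptions kept small so the example runs quickly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # reporting

# Inner loop: choose max_depth using only the outer-training rows
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid={"max_depth": [None, 5]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: each outer test fold is never seen by the search that is scored on it
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print(f"Nested CV accuracy: {nested_scores.mean():.4f} +/- {nested_scores.std():.4f}")
```

The outer scores describe the whole procedure (search plus fit), not one fixed set of hyperparameters. Different outer folds may select different settings.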

production_evaluation_pipeline.py (Python)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# io.thecodeforge: Production Evaluation Pipeline
cancer_data = load_breast_cancer()
features = cancer_data.data
targets   = cancer_data.target

# STEP 1: Sealed-envelope test set
X_develop, X_final_test, y_develop, y_final_test = train_test_split(
    features, targets,
    test_size=0.20,
    random_state=42,
    stratify=targets
)

# STEP 2: Pipeline construction
model_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# STEP 3: GridSearch (Inner CV)
hyperparam_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth':    [None, 10, 20]
}

grid_search = GridSearchCV(
    model_pipeline,
    hyperparam_grid,
    cv=StratifiedKFold(n_splits=5),
    scoring='accuracy'
)
grid_search.fit(X_develop, y_develop)

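# STEP 4: Final evaluation, run once on the sealed test set
best_model = grid_search.best_estimator_
print("Best hyperparameters found:")
print(grid_search.best_params_)
print(f"Final Test Accuracy: {best_model.score(X_final_test, y_final_test):.2f}")
# Per-class precision, recall and F1 on the same held-out rows
print(classification_report(y_final_test, best_model.predict(X_final_test)))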
Output
Best hyperparameters found:
{'classifier__max_depth': None, 'classifier__n_estimators': 200}
Final Test Accuracy: 0.96
Interview Gold: Why Does CV Score Sometimes Beat Final Test Score?
If your GridSearchCV best score is 0.967 but your final test score is 0.964, that's completely normal and healthy. The best CV score is the highest of all the configurations you tried, so it is slightly optimistic, and a test set of only 114 rows carries sampling noise of its own. If the CV score is significantly HIGHER than the test score (more than ~3-4 percentage points), suspect data leakage inside the CV pipeline or heavy over-tuning to the CV folds.
Production Insight
This three-level split is what every model that ships to production should follow. The development set (80%) is used for tuning via inner CV. The final test set (20%) is opened exactly once to report the model's expected real-world performance.
If the model will be deployed and monitored, consider one more held-out slice for calibration and threshold tuning, plus a batch of fresh production data to validate against after launch. A train/test pair is the minimum; a dedicated validation or calibration set on top of it is what production work calls for.
Key Takeaway
Hyperparameter tuning on CV folds makes CV scores optimistic.
Hold out a completely separate test set before any tuning.
Nested CV or train/validation/test split gives unbiased final estimates.

Stratified Splits for Imbalanced Data — Why Random Isn't Fair

When your target classes are imbalanced — say, 95% 'no churn' and 5% 'churn' — a random split can easily create a test set with zero churn examples. Your model would appear to have 95% accuracy by simply predicting 'no churn' every time. You'd ship a completely useless model.

Stratified splitting forces each fold and each split to mirror the original class proportions. In scikit-learn, train_test_split(..., stratify=targets) and StratifiedKFold(n_splits=5) handle this for you. StratifiedKFold expects class labels, so for regression tasks consider binning the target into quantiles and stratifying on the bin labels.
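The binning idea for regression can be sketched as follows. The synthetic skewed target, the choice of five quantile bins, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = rng.exponential(scale=10.0, size=200)   # skewed continuous target

# Four quantile edges split the target into 5 bins of roughly equal size
edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_bins = np.digitize(y, edges)              # integer bin labels 0..4

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_bin_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y_bins)):
    # Stratify on the bin labels, but train and score on the original y
    counts = np.bincount(y_bins[test_idx], minlength=5).tolist()
    fold_bin_counts.append(counts)
    print(f"Fold {fold}: test rows per target bin {counts}")
```

The bins are used only to build the folds. The model is still fitted and evaluated on the continuous target, so each test fold covers the low, middle, and high ends of the range.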

For extreme imbalance (e.g., <1% minority), even stratification can be fragile. Reduce K so each fold has at least a few minority samples, or use repeated stratified splits.
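Reducing K and repeating the split can be combined with RepeatedStratifiedKFold. This is a minimal sketch assuming a 1% minority class, two folds, and five repeats; the numbers are chosen only to make the counts easy to follow.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

y = np.array([0] * 990 + [1] * 10)   # 1% minority class
X = np.zeros((len(y), 1))            # placeholder features, unused by the splitter

# 2 folds x 5 repeats = 10 train/test rounds, each with a fresh shuffle
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)

minority_per_test_fold = [
    int(y[test_idx].sum()) for _, test_idx in rskf.split(X, y)
]
print(f"Rounds: {len(minority_per_test_fold)}")
print(f"Minority rows in each test fold: {minority_per_test_fold}")
```

With only two folds, every test fold keeps half of the minority rows, and the repeats recover some of the stability lost by using so few folds.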

stratified_splits_imbalanced.py (Python)
from sklearn.model_selection import StratifiedKFold, train_test_split
import numpy as np

# io.thecodeforge: Stratified splits for imbalanced data
# Simulate highly imbalanced dataset (5% positive)
y = np.array([0]*950 + [1]*50)
X = np.random.randn(1000, 10)

# Without stratification: the minority count in the test set is left to chance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Unstratified test set class counts: {np.bincount(y_test)}")  # Sums to 200; minority count depends on the shuffle

# With stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Stratified test set class counts: {np.bincount(y_test)}")  # [190  10], mirroring the original 95/5 ratio

# For K-Fold
skf = StratifiedKFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {i}: test class counts {np.bincount(y[test_idx])}")
Output
Unstratified test set class counts: (depends on the seed; the two counts sum to 200)
Stratified test set class counts: [190  10]
Fold 0: test class counts [190  10]
Fold 1: test class counts [190  10]
Fold 2: test class counts [190  10]
Fold 3: test class counts [190  10]
Fold 4: test class counts [190  10]
Don't Trust Accuracy Alone on Imbalanced Data
A model that predicts 'majority class' for every row can achieve 95% accuracy but is completely useless. Always pair accuracy with precision, recall, F1, and AUC-ROC when evaluating imbalanced datasets. The confusion matrix is your friend.
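A majority-class baseline makes the point in a few lines. This sketch assumes the same 95/5 class mix and uses scikit-learn's DummyClassifier as the "always predict the majority" model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

y = np.array([0] * 950 + [1] * 50)   # 5% positive class
X = np.zeros((len(y), 1))            # features are irrelevant to this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)
rec = recall_score(y, pred, zero_division=0)   # recall on the positive class
f1 = f1_score(y, pred, zero_division=0)

print(f"Accuracy: {acc:.2f}")
print(f"Recall  : {rec:.2f}")
print(f"F1      : {f1:.2f}")
```

Any real model should be compared against this baseline. If it cannot beat the baseline on recall or F1, its accuracy figure says nothing useful.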
Production Insight
In a fraud detection system, 0.1% of transactions are fraudulent. A single random split could put all fraud cases in training and none in test. You'd deploy a model that never flags anything — and lose millions.
Drop to K=3 with StratifiedKFold if a larger K would leave fewer than about 15 minority samples in each fold. For very small minorities, consider skipping CV entirely and using a simple train/test split with stratification plus bootstrapped confidence intervals.
Key Takeaway
Stratified splitting preserves class proportions in every fold.
Without stratification, imbalanced datasets produce misleading evaluation.
For extreme imbalance, reduce K or use resampling techniques.

Time Series Cross Validation — You Can't Use Future Data to Predict the Past

Standard K-Fold cross validation assumes data points are independent and identically distributed. For time series, that assumption is false. Observations are temporally dependent — using tomorrow's data to predict yesterday creates data leakage from the future.

Scikit-learn provides TimeSeriesSplit for exactly this situation. It uses an expanding window: training sets always precede test sets in time. For example, fold 1 might train on days 1-20 and test on days 21-40; fold 2 then trains on days 1-40 and tests on days 41-60, and so on. This mimics how a model would be used in production: trained on past data to predict the next period.

Never use KFold or ShuffleSplit on temporal data. With shuffling the time ordering is destroyed outright, and even unshuffled KFold trains on observations that come after the test fold. Both give you an unrealistically optimistic estimate.
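TimeSeriesSplit can also approximate a rolling (fixed-length) window through its max_train_size and test_size parameters. The window lengths below (30 training days, 10 test days) are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # 100 consecutive days

# Keep at most the 30 most recent days for training, test on the next 10
rolling = TimeSeriesSplit(n_splits=4, max_train_size=30, test_size=10)

windows = []
for train_idx, test_idx in rolling.split(X):
    window = (int(train_idx[0]), int(train_idx[-1]), int(test_idx[0]), int(test_idx[-1]))
    windows.append(window)
    print(f"train {window[0]}..{window[1]}  test {window[2]}..{window[3]}")
```

A rolling window suits data whose behaviour drifts, where very old observations may hurt more than they help. The expanding window keeps everything, which is usually preferable when history stays relevant.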

time_series_cv.py (Python)
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression

# io.thecodeforge: Time-series cross validation
# Simulate daily sales data (100 days)
dates = np.arange(100)
X = dates.reshape(-1, 1)  # Feature: day number
y = 2 * dates + np.random.randn(100) * 5  # Sales with trend

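tscv = TimeSeriesSplit(n_splits=4)  # 4 splits over 100 rows: 20-row test blocks

fold_scores = []
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {i}: train {train_idx[0]}..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
    # Fit on the past window only, then score on the block that follows it
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
    fold_scores.append(score)
    print(f"   R^2: {score:.4f}")

print(f"Mean R^2: {np.mean(fold_scores):.4f}")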
Output
Fold 0: train 0..19, test 20..39
R^2: 0.8765
Fold 1: train 0..39, test 40..59
R^2: 0.9210
Fold 2: train 0..59, test 60..79
R^2: 0.9034
Fold 3: train 0..79, test 80..99
R^2: 0.8892
Mean R^2: 0.8975
Key Difference: TimeSeriesSplit vs KFold
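TimeSeriesSplit does NOT shuffle the data. It always preserves order, and every training window ends before its test window starts. KFold does not shuffle by default either (shuffle=False), but most of its folds still train on rows that come after the test block, and with shuffle=True the temporal order is lost completely. Either way the scores tend to be overly optimistic. If your data has a time component, always use TimeSeriesSplit or a custom forward chaining strategy.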
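A quick way to check whether a splitter respects time order is to test whether any training index falls after the start of the test block. The helper name trains_on_future is hypothetical, introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # 20 time-ordered rows

def trains_on_future(splitter):
    """One flag per fold: True if any training row comes after the first test row."""
    return [
        bool(train_idx.max() > test_idx.min())
        for train_idx, test_idx in splitter.split(X)
    ]

kfold_flags = trains_on_future(KFold(n_splits=4))           # shuffle=False by default
tss_flags = trains_on_future(TimeSeriesSplit(n_splits=4))

print(f"KFold trains on future rows          : {kfold_flags}")
print(f"TimeSeriesSplit trains on future rows: {tss_flags}")
```

Only the last KFold round, where the test block sits at the end of the timeline, avoids future rows. Every TimeSeriesSplit round does.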
Production Insight
In inventory forecasting, using standard CV led to a 40% overestimation of accuracy. The model was learning from 'future' seasonal patterns that wouldn't be available at prediction time. Switching to TimeSeriesSplit caused the reported accuracy to drop — but the deployed model's error reduced by half.
Always compare the last time window as a final validation. The most recent period is the closest to what you'll see in production.
Key Takeaway
Standard K-Fold assumes independence — time series violates that assumption.
TimeSeriesSplit expands the training window forward, preserving temporal order.
Using future data in training folds creates silent, unrecoverable leakage.

The cross_validate Function — Don't Roll Your Own Metric Loops

Junior devs write for loops over folds. You shouldn't. Scikit-learn's cross_validate returns test scores, fit times, score times, and optionally train scores in one call. Why this matters: you want to monitor overfitting by comparing train vs. test scores across folds. A model that scores 0.99 on training but 0.72 on test is memorizing, not learning.

The function also supports multiple metrics at once (accuracy AND precision AND recall) without rewriting your validation pipeline. Pass a list of scorer names or a dict of scorers via the scoring parameter. Set return_train_score=True to catch overfitting early. The return value is a dict of arrays, one array per metric with one entry per fold, under keys such as test_accuracy and train_accuracy. Average them yourself, or better, look at the per-fold variance. High variance means your model is unstable, your data is too small, or your folds are misconfigured.

validate_pipeline.py (Python)
# io.thecodeforge
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_validate(
    model, X, y,
    cv=5,
    scoring=['accuracy', 'precision_macro'],
    return_train_score=True
)

print(f"Accuracy: {np.mean(scores['test_accuracy']):.3f} (+/- {np.std(scores['test_accuracy']):.3f})")
print(f"Train Accuracy: {np.mean(scores['train_accuracy']):.3f}")
Output
Accuracy: 0.970 (+/- 0.015)
Train Accuracy: 0.999
Production Trap:
Forgetting return_train_score=True means you never see that your model is memorizing. Always check train vs. test gap. A gap > 0.1 is a red flag you fix before deploy.
Key Takeaway
Use cross_validate with multiple scorers and train scores. Never trust a single test score in isolation.
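The scoring parameter also accepts a dict, which lets you name each metric and mix built-in scorer strings with custom ones. In this sketch the dict keys ("acc", "minority_recall"), the synthetic 80/20 dataset, and the LogisticRegression model are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Dict keys become part of the result keys: test_<key> and train_<key>
scoring = {
    "acc": "accuracy",
    "minority_recall": make_scorer(recall_score, pos_label=1),
}

results = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, scoring=scoring, return_train_score=True,
)

for name in scoring:
    train_mean = results[f"train_{name}"].mean()
    test_mean = results[f"test_{name}"].mean()
    print(f"{name}: train {train_mean:.3f}, test {test_mean:.3f}, gap {train_mean - test_mean:.3f}")
```

Naming the metrics yourself keeps the result keys stable when you later swap a built-in scorer for a custom one.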

GroupKFold — When Your Data Has Clusters, Not Independent Rows

Standard KFold assumes every row is independent. That's a lie in production. Same patient has multiple blood tests. Same user clicks on 50 ads. Same sensor logs 10,000 readings. If you split those rows across train and test, the model sees the same entity during training and evaluation. You're not measuring generalization — you're measuring memory. GroupKFold fixes this. Define a group array where each distinct entity gets a unique integer. Folds are built so that all rows from entity 1 stay together. The model never sees entity 1 during training when entity 1 is in the test fold. This is mandatory for fraud detection (same credit card), medical records (same patient), and time series with multiple series (same stock ticker). The tradeoff: you lose some effective fold size, but your metrics become honest.

group_kfold_demo.py (Python)
# io.thecodeforge
from sklearn.model_selection import GroupKFold
import numpy as np

# Simulate 3 patients, 4 samples each
X = np.random.rand(12, 5)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

gkf = GroupKFold(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    # No overlap is the core guarantee
    print(f"Fold {fold}: train groups {train_groups}, test groups {test_groups}, overlap? {train_groups & test_groups}")
Output
Fold 0: train groups {1, 2}, test groups {0}, overlap? set()
Fold 1: train groups {0, 2}, test groups {1}, overlap? set()
Fold 2: train groups {0, 1}, test groups {2}, overlap? set()
Classic Production Fail:
A team deployed a churn model that looked incredible in CV. Turned out one customer's 50 feature rows were split across train and test. The model just learned to recognize that customer's ID pattern. GroupKFold would have caught this immediately.
Key Takeaway
If your data has multiple rows per entity, use GroupKFold. Otherwise your cross-validation score is a lie.
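The size of the inflation can be sketched on synthetic data where each entity has a recognisable feature signature but its label is random, so no model can truly generalise to a new entity. The 40 groups, the noise level, and the 1-nearest-neighbour model are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_groups, rows_per_group = 40, 10
groups = np.repeat(np.arange(n_groups), rows_per_group)

# Rows of one entity cluster tightly around that entity's own centre
centres = rng.standard_normal((n_groups, 5))
X = centres[groups] + 0.05 * rng.standard_normal((len(groups), 5))

# One random label per entity, unrelated to the features
group_labels = np.array([0, 1] * (n_groups // 2))
rng.shuffle(group_labels)
y = group_labels[groups]

model = KNeighborsClassifier(n_neighbors=1)

# Row-level split: the same entity appears in both train and test
row_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)
# Group-level split: each entity is entirely in train or entirely in test
group_scores = cross_val_score(
    model, X, y, groups=groups, cv=GroupKFold(n_splits=5)
)

print(f"KFold accuracy     : {row_scores.mean():.3f}")
print(f"GroupKFold accuracy: {group_scores.mean():.3f}")
```

The row-level score rewards the model for recognising entities it has already seen. The group-level score asks how it fares on entities it has never seen, and for this data that should be close to chance.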
● Production incident · POST-MORTEM · severity: high

The Pipeline That Cost $50k in Bad Predictions

Symptom
In production, the model's accuracy dropped from 92% (CV) to 54% on the first month's new customers.
Assumption
The team assumed that because they were using cross validation, data leakage couldn't happen. They had a separate scaler object, but they called fit_transform on the whole dataset before the CV loop.
Root cause
StandardScaler.fit_transform() computed the mean and variance using all rows — including what would become validation folds. Inside each CV fold, the scaler had already seen the fold's distribution, leaking information. The model learned to rely on those leaked statistics and failed when real unseen data came with different means.
Fix
Moved the scaler into a scikit-learn Pipeline. The pipeline ensures that inside each CV fold, fit_transform is called only on the training fold, and transform is called on the validation fold using the training fold's parameters.
Key lesson
  • Preprocessing steps (scaling, imputation, encoding) must never see the entire dataset before splitting.
  • Wrap all preprocessing inside a Pipeline — it's the only reliable way to prevent cross-fold leakage.
  • Cross validation is leak-resistant, not leak-proof. Every transformation before the CV loop creates a potential leak.
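What the Pipeline does inside each fold can be written out by hand. This is a sketch of the mechanics on synthetic data, not the team's actual code; the array shapes and names are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

matches_train_only = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit: statistics come from the training fold only
    scaler = StandardScaler().fit(X[train_idx])
    X_train_scaled = scaler.transform(X[train_idx])
    # Transform: the validation fold reuses the training fold's mean and variance
    X_val_scaled = scaler.transform(X[val_idx])

    # The stored mean equals the training-fold mean, not the full-data mean
    matches_train_only.append(
        bool(np.allclose(scaler.mean_, X[train_idx].mean(axis=0)))
    )

print(matches_train_only)
```

Calling fit_transform on the full X before this loop would replace each fold's statistics with one global mean and variance, which is exactly the leak described in the root cause.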
Production debug guide: real symptoms you'll hit and the exact actions to take (5 entries)
Symptom · 01
Train accuracy is 1.0, test accuracy is significantly lower (gap > 5%).
Fix
That's overfitting. Reduce model complexity (for example, lower max_depth or raise min_samples_leaf) or increase regularization. Also check if you accidentally used the same data for training and testing: verify that the train and test indices don't overlap.
Symptom · 02
K-Fold CV scores vary wildly (std > 0.05 for accuracy).
Fix
Your dataset might be too small or too heterogeneous. Switch to repeated stratified K-Fold (RepeatedStratifiedKFold) to average over several reshuffles. Raising K to 10 gives each model more training data, but each test fold gets smaller, so individual fold scores get noisier. Also check if the folds have wildly different class distributions and make sure you're using StratifiedKFold.
Symptom · 03
CV score is much higher than final test score (gap > 3-4%).
Fix
You likely tuned hyperparameters based on CV scores, and those scores are now biased. The test set is the only honest evaluation. If the gap persists, suspect data leakage in the pipeline — check if preprocessing steps were applied before splitting.
Symptom · 04
Stratified split fails with 'least populated class' error.
Fix
Your minority class has fewer samples than the number of folds. Reduce K (e.g., use 3-fold) or use StratifiedShuffleSplit with a fixed number of splits instead of full K-Fold.
Symptom · 05
After fixing data leakage, model performance drops drastically.
Fix
That's actually a good sign — your previous score was inflated. Trust the new lower score and retune hyperparameters from scratch using the corrected pipeline.
★ Quick Debug Cheat Sheet for Train/Test Split & CV Issues: fast commands to diagnose common evaluation problems in scikit-learn workflows.
Suspected data leakage from preprocessing
Immediate action
Check if any transform (scale, impute, encode) was called on the full dataset before splitting.
Commands
print('Before split, data shape:', X.shape)  # Should be (N, F)
# If the scaler was fit on the full X, discard it and restart from the raw data.
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.2)
Fix now
Wrap scaler and model in a Pipeline: Pipeline([('scaler', StandardScaler()), ('clf', RandomForestClassifier())])
Unexpectedly high or low CV scores
Immediate action
Check the standard deviation of CV scores and inspect per-fold distributions.
Commands
from sklearn.model_selection import StratifiedKFold, cross_val_score
scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(5), scoring='accuracy')
print(scores, scores.mean(), scores.std())
# Check fold sizes and class proportions
from collections import Counter
for i, (train_idx, test_idx) in enumerate(StratifiedKFold(5).split(X, y)):
    print(f'Fold {i}: train {Counter(y[train_idx])}, test {Counter(y[test_idx])}')
Fix now
If variance is high, increase K or use RepeatedStratifiedKFold. If a fold has zero minority samples, reduce K or switch to StratifiedShuffleSplit.
Test set performance is worse than CV average
Immediate action
Verify that the test set was never touched during any fitting or tuning step.
Commands
# Check whether any model was refit on the entire train+val set before the final test.
# Your code should only call .fit(X_train, y_train) and then .predict(X_test).
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_test)))
Fix now
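If the test score is much lower, the model overfit to the CV folds. Retune with nested CV (GridSearchCV as the inner loop, wrapped in an outer cross-validation loop), or tune with GridSearchCV on the development set only and evaluate once on a separate holdout set.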
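A minimal nested CV sketch, assuming a synthetic dataset, an SVC classifier, and a small hypothetical grid over C. The inner loop picks hyperparameters; the outer loop scores the whole tuning procedure on folds the search never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; replace with your own X and y.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
param_grid = {"clf__C": [0.1, 1, 10]}  # illustrative grid, not a recommendation

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search on each outer training fold.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv)

# Outer loop: each score comes from data the search did not tune on.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```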
Train/Test Split vs K-Fold Cross Validation vs TimeSeriesSplit
| Aspect | Train/Test Split | K-Fold Cross Validation | TimeSeriesSplit |
| --- | --- | --- | --- |
| How it works | Single random split into two sets | K rounds, each fold acts as test set once | Expanding window, test always after train |
| Performance estimate variance | High: one unlucky split distorts results | Lower: averages K fold estimates (correlated, because training sets overlap) | Moderate: sensitive to window boundaries |
| Data efficiency | Test set never used for training | 100% of data used for evaluation across folds | Most of the data is evaluated (the first block is train-only); early folds train on less data |
| Compute cost | Train once: fast | Train K times: K× slower | Train K times: similar to K-Fold |
| Best used when | Large datasets (>50k rows), final holdout | Small/medium datasets, model selection, reporting | Time-series data with temporal dependencies |
| Works with pipelines? | Yes, via train_test_split + manual fit | Yes: Pipeline + cross_val_score handles it cleanly | Yes: Pipeline + cross_val_score with cv=TimeSeriesSplit |
| Handles imbalanced classes? | Yes, with stratify=targets | Yes, with StratifiedKFold | Stratification not directly supported; bin time windows |
| Suitable for time-series? | Only if split is chronological (e.g., first 80% vs last 20%) | No: trains on future data to predict the past | Yes: designed for temporal data |

Key takeaways

1. Evaluating on training data measures memorisation, not learning: your test set must be a wall the model never crosses during training or tuning.
2. For classification, always pass stratify=targets to train_test_split and use StratifiedKFold for cross validation: without it, class imbalance silently corrupts your results.
3. Preprocessing (scaling, imputation, encoding) must live inside a Pipeline so it only ever fits on the training fold: fitting it on the full dataset before splitting is data leakage, even if it looks harmless.
4. The standard production workflow is: lock away a true held-out test set first, tune with inner K-Fold CV on the rest, then report the held-out score exactly once, treating it like a sealed exam envelope you open only on the final day.
5. Time series data requires TimeSeriesSplit: never use standard K-Fold, which trains on future observations to predict earlier ones and overestimates performance.

Common mistakes to avoid

4 patterns

Scaling before splitting

Symptom
CV scores look great (e.g., 0.92) but production performance is much worse (0.54). The scaler has seen the full dataset's mean/variance, including test folds, leaking information.
Fix
Always wrap scalers, imputers, and encoders inside a scikit-learn Pipeline. The pipeline ensures fit_transform is called only on training folds during CV, and transform is applied using training-fold parameters.
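To check that a fitted pipeline really used training-only statistics, you can compare the scaler's learned mean against the training data and against the full dataset. This is a sketch on synthetic data; the logistic regression model and the 80/20 split are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own X and y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)  # the scaler is fit on X_train only

fitted_mean = pipeline.named_steps["scaler"].mean_
matches_train = bool(np.allclose(fitted_mean, X_train.mean(axis=0)))
matches_full = bool(np.allclose(fitted_mean, X.mean(axis=0)))

print("scaler mean equals training mean:", matches_train)
print("scaler mean equals full-dataset mean:", matches_full)
```

If the second check ever comes back true on real data, the scaler was probably fit before the split.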

Tuning hyperparameters then reporting CV score as final

Symptom
After GridSearchCV, the best CV score is 0.967 but final test score is 0.92. You've over-optimised to the specific CV folds.
Fix
Lock away a true held-out test set before any tuning. Use the development set for inner CV tuning, then evaluate exactly once on the holdout set. Report only the holdout score as final performance.
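The workflow in that fix can be sketched as follows. The dataset is synthetic, and the names X_dev and X_holdout plus the grid over C are hypothetical choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own X and y.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Step 1: lock away the holdout set before any tuning.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: tune with inner CV on the development set only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(
    pipeline,
    {"clf__C": [0.01, 0.1, 1, 10]},  # illustrative grid
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_dev, y_dev)

# Step 3: score the refit best model on the holdout set exactly once.
holdout_score = search.score(X_holdout, y_holdout)
print(f"best CV score (optimistic): {search.best_score_:.3f}")
print(f"holdout score (report this): {holdout_score:.3f}")
```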

Using plain KFold on imbalanced classification data

Symptom
One fold's test set has zero minority class samples. The model's performance on that fold is meaningless, and the CV average is misleading.
Fix
Always use StratifiedKFold for classification. It preserves class proportions in each fold. For very imbalanced data, consider using StratifiedKFold with fewer splits (e.g., K=3) so each fold contains enough minority samples.
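The sketch below shows the failure on a deliberately bad case: labels sorted by class with a 5% minority, which is an assumption chosen to make the effect obvious. Shuffled KFold can still produce empty folds by chance; it just does so less predictably.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (hypothetical): 95 majority samples followed by 5 minority samples.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((len(y), 1))


def minority_counts(splitter):
    """Count minority-class samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


plain = minority_counts(KFold(n_splits=5))
stratified = minority_counts(StratifiedKFold(n_splits=5))

print("KFold minority samples per test fold:          ", plain)
print("StratifiedKFold minority samples per test fold:", stratified)
```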

Using standard KFold on time series data

Symptom
CV score is artificially high because the model uses future data to predict past data (e.g., a stock price predictor with R² of 0.99 on CV but fails in production).
Fix
Use TimeSeriesSplit with an expanding window. Ensure that the training set always contains data strictly before the test set in time. Never shuffle time-indexed data.
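A small sketch of the expanding window, assuming 12 time-ordered observations and 3 splits (both numbers are arbitrary). It prints the index ranges so you can confirm every training index precedes every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations already sorted by time (index 0 is the oldest).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(splits):
    print(
        f"Fold {i}: train {train_idx.min()}..{train_idx.max()} "
        f"({len(train_idx)} rows), test {test_idx.min()}..{test_idx.max()}"
    )

# Sanity check: training data always ends before the test window starts.
train_before_test = all(tr.max() < te.min() for tr, te in splits)
print("train always precedes test:", train_before_test)
```

To use it for scoring, pass the splitter as cv=tscv to cross_val_score with your pipeline.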
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What is the mathematical justification for using K-1 folds for training ...
Q02 · SENIOR
Explain how data leakage can occur during Target Encoding or Imputation ...
Q03 · SENIOR
Why is Accuracy a potentially dangerous metric to evaluate on a test spl...
Q04 · SENIOR
What is nested cross validation and when would you use it?
Q05 · SENIOR
How would you handle cross validation for a very small dataset (e.g., 50...
Q01 of 05 · SENIOR

What is the mathematical justification for using K-1 folds for training in K-Fold Cross Validation?

ANSWER
In K-Fold CV, the dataset is split into K roughly equal folds. In each iteration, K-1 folds are used for training and one fold for validation, so every data point is used exactly once for validation and the model is trained on (K-1)/K of the data each round. Because each model sees less data than one trained on the full dataset, the averaged score is a slightly pessimistic estimate of full-data performance, but it has lower variance than a single split. As K increases, each training set grows and that bias shrinks. The cost is K model fits, and the fold estimates become more strongly correlated because the training sets overlap more, so the variance of the average stops improving and can rise (LOOCV is the extreme case). K=5 or K=10 is a common trade-off.
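The bookkeeping behind that answer can be checked directly. The helper fold_summary below is a hypothetical name introduced for this sketch; it reports the training fraction per round and whether every sample lands in a validation fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold


def fold_summary(k, n_samples=100):
    """Return (training fraction per round, each sample validated exactly once?)."""
    X = np.zeros((n_samples, 1))
    times_validated = np.zeros(n_samples, dtype=int)
    train_sizes = []
    for train_idx, val_idx in KFold(n_splits=k).split(X):
        times_validated[val_idx] += 1
        train_sizes.append(len(train_idx))
    train_fraction = float(np.mean(train_sizes)) / n_samples
    return train_fraction, bool(np.all(times_validated == 1))


for k in (2, 5, 10):
    fraction, once = fold_summary(k)
    print(f"K={k}: trains on {fraction:.0%} of the data, validated once each: {once}")
```

The training fraction is (K-1)/K, which is why larger K brings each fold's model closer to the one you would train on all the data.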
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. Does cross-validation prevent overfitting?
02. When should I use Leave-One-Out Cross-Validation (LOOCV)?
03. How do I handle time-series data with cross-validation?
04. What is the difference between validation set and test set?
05. Can I use cross-validation to select features?