
Train Test Split & Cross Validation Explained — With Real Code

In Plain English 🔥
Imagine you're studying for a final exam. If your teacher hands you the exact exam questions during practice, of course you'll ace it — but you haven't actually learned anything. Train/test split is the rule that says: practice on one set of questions, get tested on a completely different set. Cross validation takes it further — it's like sitting five different mini-exams in rotation so no single exam can trick you into thinking you're better (or worse) than you really are.

Every machine learning model you build is ultimately a gamble. You're betting that the patterns your model learned from historical data will hold up on data it's never seen — whether that's tomorrow's customer transactions, next month's medical scans, or a stock price six hours from now. If you measure your model's performance on the same data you trained it on, you're not measuring anything real. You're measuring how well it memorized the past, not how well it predicts the future.

The problem this solves has two names: overfitting and data leakage. A model that scores 99% on training data but 61% on new data hasn't learned — it's cheated. Train/test split and cross validation are the two foundational tools that force honest evaluation. They create a clear wall between what the model learns from and what it gets graded on. Without them, every accuracy score you report is fiction.

By the end of this article you'll understand exactly why naive evaluation is dangerous, how to implement a proper train/test split in scikit-learn, when to reach for K-Fold cross validation instead, and how to combine both for a production-grade evaluation pipeline. You'll also know the three mistakes that silently corrupt results for even experienced practitioners.

Why Evaluating on Training Data Is a Silent Killer

When you train a model, it adjusts its internal parameters to minimize error on the data you gave it. The more complex the model, the more it can contort itself to fit every quirk, outlier, and noise spike in that training data. A decision tree with unlimited depth will reach 100% training accuracy on almost any dataset — it just memorizes every row. That's overfitting, and it's invisible unless you test the model on data it's never touched.
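To see this concretely, here's a quick sketch (using the same breast cancer dataset as the examples below): an unrestricted decision tree hits perfect training accuracy yet scores visibly worse on held-out rows.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# max_depth=None lets the tree grow until every leaf is pure —
# i.e. until it has memorised the training set
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=42)
deep_tree.fit(X_train, y_train)

print(f"Train accuracy: {deep_tree.score(X_train, y_train):.4f}")  # 1.0000 — memorisation
print(f"Test accuracy : {deep_tree.score(X_test, y_test):.4f}")    # noticeably lower
```

The train score alone tells you nothing; only the gap between the two numbers reveals the memorisation.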

Here's the subtle danger: even experienced developers fall into this trap when they tune hyperparameters. Every time you check your model's score and adjust something, you're indirectly leaking information about the test set into your decisions. This is why a held-out test set must be locked away and touched exactly once — at the very end.

Train/test split is the minimal viable defense. You take your full dataset, randomly shuffle it, and cut it into two non-overlapping pieces. The training set is the classroom. The test set is the final exam. The model never sees the exam questions until grading day. This one habit alone separates a trustworthy ML workflow from a broken one.

train_test_split_basics.py · PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a real-world dataset: breast cancer classification
# 569 samples, 30 features, binary target (malignant / benign)
cancer_data = load_breast_cancer()
features = cancer_data.data      # shape: (569, 30)
targets   = cancer_data.target   # shape: (569,)

# Split: 80% train, 20% test
# random_state=42 locks the shuffle so results are reproducible
# stratify=targets ensures BOTH splits have the same class ratio
#   (critical for imbalanced datasets — without this, your test set
#    might accidentally contain mostly one class)
X_train, X_test, y_train, y_test = train_test_split(
    features,
    targets,
    test_size=0.20,
    random_state=42,
    stratify=targets
)

print(f"Training samples : {X_train.shape[0]}")
print(f"Test samples     : {X_test.shape[0]}")
print(f"Train class dist : {np.bincount(y_train) / len(y_train)}")
print(f"Test class dist  : {np.bincount(y_test)  / len(y_test)}")

# Train the model — it ONLY sees X_train and y_train
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

# Evaluate — the model sees X_test for the first time here
train_accuracy = accuracy_score(y_train, forest_model.predict(X_train))
test_accuracy  = accuracy_score(y_test,  forest_model.predict(X_test))

print(f"\nTrain accuracy : {train_accuracy:.4f}")
print(f"Test accuracy  : {test_accuracy:.4f}")
print(f"Gap (overfit signal): {train_accuracy - test_accuracy:.4f}")
▶ Output
Training samples : 455
Test samples : 114
Train class dist : [0.37142857 0.62857143]
Test class dist : [0.37719298 0.62280702]

Train accuracy : 1.0000
Test accuracy : 0.9649
Gap (overfit signal): 0.0351
⚠️
Watch Out: Always Use stratify on Classification Tasks
If your dataset has 90% class A and 10% class B, a random split can accidentally put most of class B in the training set and almost none in the test set. Your model looks great — but it's never been properly tested on the minority class. Pass stratify=targets to train_test_split every time on classification problems.
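A small demonstration of this point — the dataset here is hypothetical (90% class 0, 10% class 1), invented purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = np.array([0] * 450 + [1] * 50)   # 90% class 0, 10% class 1
X = rng.normal(size=(500, 4))        # feature values don't matter for this demo

# Plain split: the minority share of the test set drifts with the shuffle
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=7)
# Stratified split: both halves keep exactly the 10% minority share
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

print(f"Minority share, plain split     : {y_test_plain.mean():.2f}")
print(f"Minority share, stratified split: {y_test_strat.mean():.2f}")  # 0.10
```

Run the plain version with a few different seeds and you'll see the minority share wobble; the stratified version pins it at 0.10 every time.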

K-Fold Cross Validation — When One Test Split Isn't Enough

Here's the honest problem with a single train/test split: your result depends on luck. If the random split happens to put all the 'easy' examples in the test set, your model looks brilliant. If it puts all the hard ones there, your model looks terrible. With a small dataset (say, 500 rows), one unlucky split can swing your accuracy by 5–10 percentage points.
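You can measure this instability directly by repeating the split with different seeds. This sketch uses a scaled logistic regression rather than the article's random forest, purely to keep it fast:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = []
for seed in range(10):  # ten different shuffles of the same data
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

scores = np.array(scores)
print(f"Accuracy range across 10 splits: {scores.min():.4f} - {scores.max():.4f}")
print(f"Spread: {scores.max() - scores.min():.4f}")
```

Nothing about the model changed between runs — only the shuffle did. That spread is pure evaluation noise.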

K-Fold cross validation fixes this by running the experiment K times, each time with a different fold acting as the test set. With K=5, your dataset is cut into 5 equal chunks. Round 1: train on folds 2–5, test on fold 1. Round 2: train on folds 1, 3–5, test on fold 2. And so on. At the end you have 5 accuracy scores — average them for a stable, far more reliable estimate of real-world performance.
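The rotation is easiest to see on a toy example. This sketch just prints which sample indices land in each fold (shuffle is off so the folds stay contiguous and readable):

```python
import numpy as np
from sklearn.model_selection import KFold

sample_ids = np.arange(10)  # ten samples, K=5 folds of two samples each
kfold = KFold(n_splits=5, shuffle=False)

for round_number, (train_idx, test_idx) in enumerate(kfold.split(sample_ids), start=1):
    print(f"Round {round_number}: train on {train_idx.tolist()}, test on {test_idx.tolist()}")
```

Every sample appears in a test fold exactly once across the five rounds — that's where the "100% of your data for evaluation" claim comes from.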

The cost is compute time — you're training K models instead of one. But the payoff is enormous: you use 100% of your data for evaluation (across all folds), and your performance estimate has far lower variance. For anything going into production, or any paper you're publishing, K-Fold is the standard.

Stratified K-Fold is the version you should default to for classification — it preserves class ratios in every fold, just like stratify does in train_test_split.

kfold_cross_validation.py · PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

cancer_data = load_breast_cancer()
features = cancer_data.data
targets   = cancer_data.target

# Wrap scaler + model in a Pipeline so the scaler ONLY learns
# from training folds — never from the test fold.
# This is critical: scaling before splitting leaks test set statistics.
evaluation_pipeline = Pipeline([
    ('scaler', StandardScaler()),           # fit_transform on train fold, transform on test fold
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# StratifiedKFold preserves class balance in every fold
# shuffle=True prevents ordering bias (e.g., if data was sorted by class)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score handles the entire loop: split -> fit -> score -> repeat
fold_scores = cross_val_score(
    evaluation_pipeline,
    features,
    targets,
    cv=stratified_kfold,
    scoring='accuracy'
)

print("Per-fold accuracy scores:")
for fold_index, fold_score in enumerate(fold_scores, start=1):
    print(f"  Fold {fold_index}: {fold_score:.4f}")

print(f"\nMean accuracy : {fold_scores.mean():.4f}")
print(f"Std deviation : {fold_scores.std():.4f}")
print(f"95% CI estimate: ({fold_scores.mean() - 2*fold_scores.std():.4f}, "
      f"{fold_scores.mean() + 2*fold_scores.std():.4f})")

# High std deviation is a red flag — it means your model is unstable
# and performance depends heavily on which samples it sees
if fold_scores.std() > 0.03:
    print("\n⚠  High variance detected — consider more data or simpler model")
else:
    print("\n✓  Low variance — model generalises consistently across folds")
▶ Output
Per-fold accuracy scores:
Fold 1: 0.9737
Fold 2: 0.9561
Fold 3: 0.9649
Fold 4: 0.9561
Fold 5: 0.9737

Mean accuracy : 0.9649
Std deviation : 0.0082
95% CI estimate: (0.9485, 0.9813)

✓ Low variance — model generalises consistently across folds
⚠️
Pro Tip: Always Put Preprocessing Inside a Pipeline
If you call StandardScaler().fit_transform(features) before cross_val_score, you've already computed the mean and variance of the entire dataset — including what would have been the test folds. That's data leakage. Wrapping the scaler in a Pipeline ensures it only ever sees the training fold during each cross-validation round.
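To make the tip concrete, here's a side-by-side sketch of the leaky and the correct approach. On this particular dataset the score gap happens to be tiny, but the leaky version is still methodologically wrong — and on smaller or noisier data the inflation can be substantial:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# LEAKY: scaler sees every row up front, including future test folds
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=cv)

# CORRECT: the Pipeline refits the scaler inside each training fold
clean_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(clean_pipeline, X, y, cv=cv)

print(f"Leaky CV mean  : {leaky_scores.mean():.4f}")
print(f"Correct CV mean: {clean_scores.mean():.4f}")
```

The correct number is the one you can trust — the leaky one has quietly borrowed statistics from the folds it was graded on.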

The Gold Standard: Train / Validation / Test and Nested CV

Once you add hyperparameter tuning to the picture, even K-Fold cross validation can leak. Here's why: if you run 50 hyperparameter combinations through the same CV folds and pick the best one, you've effectively optimised for those specific folds. The CV score of your winning model is now optimistically biased.

The production-grade solution is a three-way split: training set (model learns), validation set or inner CV (hyperparameters are tuned), and a completely held-out test set (touched exactly once for final reporting). This is often called nested cross validation — an inner loop for hyperparameter search and an outer loop for unbiased performance estimation.
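A minimal sketch of fully nested CV — scikit-learn composes it naturally by passing a GridSearchCV object straight into cross_val_score. The grid and forest size here are deliberately tiny to keep the example fast:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)  # tunes hyperparameters
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # estimates performance

tuner = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    {'max_depth': [5, None]},  # deliberately tiny grid for speed
    cv=inner_cv,
)

# Each outer round re-runs the whole search on its own training folds,
# so the final estimate is never tuned on the data that scores it
nested_scores = cross_val_score(tuner, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.4f} +/- {nested_scores.std():.4f}")
```

Note the cost: every outer fold repeats the entire inner grid search, so nested CV trains (grid size × inner folds + 1) models per outer fold.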

For most real projects, a simpler version works fine: do an upfront 80/20 train/test split, lock away the 20%, then run K-Fold CV with hyperparameter tuning only on the 80%. Report the final score on the locked-away 20% at the very end. Never go back and re-tune after seeing that score.

production_evaluation_pipeline.py · PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (
    train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

cancer_data = load_breast_cancer()
features = cancer_data.data
targets   = cancer_data.target

# ── STEP 1 ──────────────────────────────────────────────────
# Lock away the final test set immediately. Do NOT touch it again
# until the very end. This is your "exam paper in a sealed envelope".
X_develop, X_final_test, y_develop, y_final_test = train_test_split(
    features, targets,
    test_size=0.20,
    random_state=42,
    stratify=targets
)

# ── STEP 2 ──────────────────────────────────────────────────
# Build pipeline (scaler must be inside to prevent leakage)
model_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# ── STEP 3 ──────────────────────────────────────────────────
# Hyperparameter search using inner 5-fold CV on the development set ONLY
hyperparam_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth':    [None, 10, 20]
}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    model_pipeline,
    hyperparam_grid,
    cv=inner_cv,          # inner CV: tunes hyperparameters
    scoring='accuracy',
    n_jobs=-1,
    refit=True            # refit best model on full development set
)
grid_search.fit(X_develop, y_develop)

print("Best hyperparameters found:")
print(f"  {grid_search.best_params_}")
print(f"\nBest CV score (development set): {grid_search.best_score_:.4f}")

# ── STEP 4 ──────────────────────────────────────────────────
# NOW — and only now — open the sealed envelope and evaluate
best_model = grid_search.best_estimator_
final_predictions = best_model.predict(X_final_test)

print("\n── Final Held-Out Test Set Results ──")
print(classification_report(
    y_final_test,
    final_predictions,
    target_names=cancer_data.target_names
))
print("This score is your honest, reportable model performance.")
▶ Output
Best hyperparameters found:
{'classifier__max_depth': None, 'classifier__n_estimators': 200}

Best CV score (development set): 0.9670

── Final Held-Out Test Set Results ──
              precision    recall  f1-score   support

   malignant       0.98      0.93      0.95        43
      benign       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

This score is your honest, reportable model performance.
🔥
Interview Gold: Why Does CV Score Sometimes Beat Final Test Score?
If your GridSearchCV best score is 0.967 but your final test score is 0.964, that's completely normal and healthy — the CV score was averaged over 5 folds of 80% of the data, each fold trained on slightly less data than the final model. If the CV score is significantly HIGHER than the test score (more than ~3-4%), suspect data leakage or that you tuned hyperparameters while peeking at the test set.
| Aspect | Train/Test Split | K-Fold Cross Validation |
| --- | --- | --- |
| How it works | Single random split into two non-overlapping sets | K rounds; each fold acts as test set once |
| Performance estimate variance | High — one unlucky split distorts results | Low — averages across K independent estimates |
| Data efficiency | Test set data never used for training | 100% of data used for evaluation across folds |
| Compute cost | Train once — fast | Train K times — K× slower |
| Best used when | Large datasets (>50k rows), final holdout | Small/medium datasets, model selection, reporting |
| Works with pipelines? | Yes, via train_test_split + manual fit | Yes — Pipeline + cross_val_score handles it cleanly |
| Handles imbalanced classes? | Yes, with stratify=targets | Yes, with StratifiedKFold |
| Suitable for time-series? | Yes, but split must be chronological | No — use TimeSeriesSplit instead of KFold |

🎯 Key Takeaways

  • Evaluating on training data measures memorisation, not learning — your test set must be a wall the model never crosses during training or tuning.
  • Always pass stratify=targets to train_test_split and use StratifiedKFold for cross validation — without it, class imbalance silently corrupts your results.
  • Preprocessing (scaling, imputation, encoding) must live inside a Pipeline so it only ever fits on the training fold — fitting it on the full dataset before splitting is data leakage, even if it looks harmless.
  • The standard production workflow is: lock away a true held-out test set first, tune with inner K-Fold CV on the rest, then report the held-out score exactly once — treating it like a sealed exam envelope you open only on the final day.

⚠ Common Mistakes to Avoid

  • Mistake 1: Scaling before splitting — Calling StandardScaler().fit_transform(features) on your full dataset before train_test_split means the scaler has already seen the mean and variance of your test rows. Your model has indirectly learned from the test set. Fix: always wrap your scaler in a Pipeline and let cross_val_score or fit/transform handle the split boundary.
  • Mistake 2: Tuning hyperparameters then reporting CV score as your final score — Every time you check CV results and adjust a hyperparameter, you're optimising for those folds. The CV score after tuning is biased upward. Fix: lock away a true held-out test set before any tuning begins, tune using inner CV on the development set only, and report the held-out test score exactly once at the very end.
  • Mistake 3: Using plain KFold on imbalanced classification data — With KFold, a fold might get very few examples of your minority class, making training unstable and metrics misleading. Fix: always use StratifiedKFold for classification so that each fold mirrors the full dataset's class distribution. Pass stratify=targets to train_test_split for the same reason.

Interview Questions on This Topic

  • Q: What is data leakage in the context of train/test split, and can you give a concrete example of how it sneaks in during preprocessing?
  • Q: If your model scores 98% accuracy on training data but 72% on the test set, walk me through the steps you'd take to diagnose and address this gap.
  • Q: You have a dataset of 400 samples and need to tune 20 hyperparameter combinations. Why is a single train/test split a poor choice here, and what would you do instead?

Frequently Asked Questions

What is the difference between train test split and cross validation?

Train/test split divides your data once into two fixed sets — one for training, one for evaluation. Cross validation repeats this process K times, rotating which portion acts as the test set each time. Cross validation gives a more reliable performance estimate because it averages across K independent evaluations rather than relying on one lucky (or unlucky) split.

What is a good train/test split ratio?

80/20 is the most common starting point and works well for datasets above a few thousand rows. For very small datasets (under 1000 rows), consider 90/10 or switch entirely to cross validation so you're not wasting too much training data. For very large datasets (millions of rows), even 95/5 or 99/1 gives a test set large enough to be statistically meaningful.

Why do I need to use a Pipeline instead of just scaling before splitting?

When you scale before splitting, the scaler computes statistics (mean, variance) from the entire dataset — including the rows that will become your test set. This means your test set is no longer truly unseen; it has subtly influenced your preprocessing. A Pipeline ensures the scaler's fit() call only ever sees training data, keeping the test set genuinely independent.

TheCodeForge Editorial Team

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
