
Train Test Split & Cross Validation Explained — With Real Code

📍 Part of: ML Basics → Topic 5 of 25
Train test split and cross validation explained clearly — why they exist, how to use them correctly in scikit-learn, and the mistakes that silently ruin your model.
⚙️ Intermediate — basic ML / AI knowledge assumed
In this tutorial, you'll learn
  • Evaluating on training data measures memorisation, not learning — your test set must be a wall the model never crosses during training or tuning.
  • Always pass stratify=targets to train_test_split and use StratifiedKFold for cross validation — without it, class imbalance silently corrupts your results.
  • Preprocessing (scaling, imputation, encoding) must live inside a Pipeline so it only ever fits on the training fold — fitting it on the full dataset before splitting is data leakage, even if it looks harmless.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer

Imagine you're studying for a final exam. If your teacher hands you the exact exam questions during practice, of course you'll ace it — but you haven't actually learned anything. Train/test split is the rule that says: practice on one set of questions, get tested on a completely different set. Cross validation takes it further — it's like sitting five different mini-exams in rotation so no single exam can trick you into thinking you're better (or worse) than you really are.

Every machine learning model you build is ultimately a gamble. You're betting that the patterns your model learned from historical data will hold up on data it's never seen — whether that's tomorrow's customer transactions, next month's medical scans, or a stock price six hours from now. If you measure your model's performance on the same data you trained it on, you're not measuring anything real. You're measuring how well it memorized the past, not how well it predicts the future.

The problem this solves has a name: data leakage and overfitting. A model that scores 99% on training data but 61% on new data hasn't learned — it's cheated. Train/test split and cross validation are the two foundational tools that force honest evaluation. They create a clear wall between what the model learns from and what it gets graded on. Without them, every accuracy score you report is fiction.

By the end of this article you'll understand exactly why naive evaluation is dangerous, how to implement a proper train/test split in scikit-learn, when to reach for K-Fold cross validation instead, and how to combine both for a production-grade evaluation pipeline. You'll also know the three mistakes that silently corrupt results for even experienced practitioners.

Why Evaluating on Training Data Is a Silent Killer

When you train a model, it adjusts its internal parameters to minimize error on the data you gave it. The more complex the model, the more it can contort itself to fit every quirk, outlier, and noise spike in that training data. A decision tree with unlimited depth will reach 100% training accuracy on almost any dataset — it just memorizes every row. That's overfitting, and it's invisible unless you test the model on data it's never touched.

Here's the subtle danger: even experienced developers fall into this trap when they tune hyperparameters. Every time you check your model's score and adjust something, you're indirectly leaking information about the test set into your decisions. This is why a held-out test set must be locked away and touched exactly once — at the very end.

Train/test split is the minimal viable defense. You take your full dataset, randomly shuffle it, and cut it into two non-overlapping pieces. The training set is the classroom. The test set is the final exam. The model never sees the exam questions until grading day. This one habit alone separates a trustworthy ML workflow from a broken one.

train_test_split_basics.py · PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# io.thecodeforge: Standardizing Evaluation splits
cancer_data = load_breast_cancer()
features = cancer_data.data
targets   = cancer_data.target

# Split: 80% train, 20% test
# stratify=targets ensures class balance is maintained
X_train, X_test, y_train, y_test = train_test_split(
    features,
    targets,
    test_size=0.20,
    random_state=42,
    stratify=targets
)

# Train the model
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

# Evaluate
print(f"Training samples : {X_train.shape[0]}")
print(f"Test samples     : {X_test.shape[0]}")

train_accuracy = accuracy_score(y_train, forest_model.predict(X_train))
test_accuracy  = accuracy_score(y_test,  forest_model.predict(X_test))

print(f"Train accuracy : {train_accuracy:.4f}")
print(f"Test accuracy  : {test_accuracy:.4f}")
print(f"Gap (overfit signal): {train_accuracy - test_accuracy:.4f}")
▶ Output
Training samples : 455
Test samples : 114
Train accuracy : 1.0000
Test accuracy : 0.9649
Gap (overfit signal): 0.0351
⚠ Watch Out: Always Use stratify on Classification Tasks
If your dataset has 90% class A and 10% class B, a random split can accidentally put most of class B in the training set and almost none in the test set. Your model looks great — but it's never been properly tested on the minority class. Pass stratify=targets to train_test_split every time on classification problems.
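A quick sketch of the difference (the 90/10 label array below is illustrative, not from the tutorial's dataset): a stratified split keeps the minority share identical on both sides, while an unstratified split leaves it to chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 90/10 imbalance: 900 rows of class 0, 100 rows of class 1
targets  = np.array([0] * 900 + [1] * 100)
features = np.arange(1000).reshape(-1, 1)

# Stratified split: both sides keep exactly the 10% minority share
_, _, _, y_test_strat = train_test_split(
    features, targets, test_size=0.20, random_state=0, stratify=targets
)

# Unstratified split: the minority share in the test set is left to chance
_, _, _, y_test_plain = train_test_split(
    features, targets, test_size=0.20, random_state=0
)

print(f"Stratified minority share  : {y_test_strat.mean():.3f}")  # exactly 0.100
print(f"Unstratified minority share: {y_test_plain.mean():.3f}")
```

With stratification, the 200-row test set always contains exactly 20 minority samples; without it, that count drifts with the random seed.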

K-Fold Cross Validation — When One Test Split Isn't Enough

Here's the honest problem with a single train/test split: your result depends on luck. If the random split happens to put all the 'easy' examples in the test set, your model looks brilliant. If it puts all the hard ones there, your model looks terrible. With a small dataset (say, 500 rows), one unlucky split can swing your accuracy by 5–10 percentage points.

K-Fold cross validation fixes this by running the experiment K times, each time with a different fold acting as the test set. With K=5, your dataset is cut into 5 equal chunks. Round 1: train on folds 2–5, test on fold 1. Round 2: train on folds 1, 3–5, test on fold 2. And so on.
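The rotation is easy to see on a toy array. This sketch uses plain KFold without shuffling, so the folds are consecutive chunks and each one takes a turn as the test set:

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)    # ten toy samples -> five folds of two
kfold = KFold(n_splits=5)  # no shuffle: folds are consecutive chunks

for round_number, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    print(f"Round {round_number}: test fold = {test_idx}, train folds = {train_idx}")
```

Round 1 tests on samples [0 1], round 2 on [2 3], and so on until every sample has been in a test fold exactly once.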

The cost is compute time — you're training K models instead of one. But the payoff is enormous: you use 100% of your data for evaluation (across all folds), and your performance estimate has far lower variance. For anything going into production, or any paper you're publishing, K-Fold is the standard.

kfold_cross_validation.py · PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# io.thecodeforge: Implementing robust Cross-Validation
cancer_data = load_breast_cancer()
features = cancer_data.data
targets   = cancer_data.target

# Pipeline prevents data leakage from scaler to validation fold
evaluation_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = cross_val_score(
    evaluation_pipeline,
    features,
    targets,
    cv=stratified_kfold,
    scoring='accuracy'
)

print("Per-fold accuracy scores:")
for fold_number, fold_accuracy in enumerate(fold_scores, start=1):
    print(f"Fold {fold_number}: {fold_accuracy:.4f}")

print(f"Mean accuracy : {fold_scores.mean():.4f}")
print(f"Std deviation : {fold_scores.std():.4f}")
▶ Output
Per-fold accuracy scores:
Fold 1: 0.9737
...
Mean accuracy : 0.9649
Std deviation : 0.0082
💡Pro Tip: Always Put Preprocessing Inside a Pipeline
If you call StandardScaler().fit_transform(features) before cross_val_score, you've already computed the mean and variance of the entire dataset — including what would have been the test folds. That's data leakage. Wrapping the scaler in a Pipeline ensures it only ever sees the training fold during each cross-validation round.
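Under the hood, the Pipeline is doing what this manual sketch does on a single split (the toy arrays are illustrative): fit the scaler on the training rows only, then reuse those statistics for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))  # toy training fold
X_test  = rng.normal(loc=5.0, scale=2.0, size=(20, 3))  # toy test fold

# Correct: the scaler's mean/variance come from the training rows only
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)  # test rows reuse train statistics

# Leaky: statistics computed over all 100 rows, test rows included
leaky_scaler = StandardScaler().fit(np.vstack([X_train, X_test]))

# The two scalers learn different statistics, so the leak is measurable
print("Mean shift caused by the leak:", np.abs(scaler.mean_ - leaky_scaler.mean_))
```

The correct scaler and the leaky one disagree on every feature's mean; that disagreement is exactly the test-set information that leaked in.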

The Gold Standard: Train / Validation / Test and Nested CV

Once you add hyperparameter tuning to the picture, even K-Fold cross validation can leak. Here's why: if you run 50 hyperparameter combinations through the same CV folds and pick the best one, you've effectively optimised for those specific folds. The CV score of your winning model is now optimistically biased.

The production-grade solution is a three-way split: training set (model learns), validation set or inner CV (hyperparameters are tuned), and a completely held-out test set (touched exactly once for final reporting). This is often called nested cross validation — an inner loop for hyperparameter search and an outer loop for unbiased performance estimation.
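Fully nested CV can be sketched in a few lines by handing a GridSearchCV object to cross_val_score: the outer splitter then scores a model that was tuned only on the outer training portion. The tiny grid and 3 splits here are just to keep the sketch fast, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: hyperparameter search (deliberately tiny to keep this quick)
inner_search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    {'max_depth': [5, None]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)

# Outer loop: each outer test fold grades a model tuned without ever seeing it
outer_scores = cross_val_score(
    inner_search, X, y,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=7),
)
print(f"Nested CV accuracy: {outer_scores.mean():.4f} +/- {outer_scores.std():.4f}")
```

Because the hyperparameter search reruns inside every outer fold, the outer mean is an unbiased estimate of the whole tuning procedure, not just of one lucky configuration.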

production_evaluation_pipeline.py · PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# io.thecodeforge: Production Evaluation Pipeline
cancer_data = load_breast_cancer()
features = cancer_data.data
targets   = cancer_data.target

# STEP 1: Sealed-envelope test set
X_develop, X_final_test, y_develop, y_final_test = train_test_split(
    features, targets,
    test_size=0.20,
    random_state=42,
    stratify=targets
)

# STEP 2: Pipeline construction
model_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# STEP 3: GridSearch (Inner CV)
hyperparam_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth':    [None, 10, 20]
}

grid_search = GridSearchCV(
    model_pipeline,
    hyperparam_grid,
    cv=StratifiedKFold(n_splits=5),
    scoring='accuracy'
)
grid_search.fit(X_develop, y_develop)

# STEP 4: Final Evaluation (the held-out set is touched exactly once)
best_model = grid_search.best_estimator_

print("Best hyperparameters found:")
print(grid_search.best_params_)
print(f"Final Test Accuracy: {best_model.score(X_final_test, y_final_test):.2f}")
print(classification_report(y_final_test, best_model.predict(X_final_test)))
▶ Output
Best hyperparameters found:
{'classifier__max_depth': None, 'classifier__n_estimators': 200}
Final Test Accuracy: 0.96
🔥Interview Gold: Why Does CV Score Sometimes Beat Final Test Score?
If your GridSearchCV best score is 0.967 but your final test score is 0.964, that's completely normal and healthy — the CV score was averaged over 5 folds of 80% of the data, each fold trained on slightly less data than the final model. If the CV score is significantly HIGHER than the test score (more than ~3-4%), suspect data leakage or that you tuned hyperparameters while peeking at the test set.
| Aspect | Train/Test Split | K-Fold Cross Validation |
|---|---|---|
| How it works | Single random split into two non-overlapping sets | K rounds, each fold acts as test set once |
| Performance estimate variance | High — one unlucky split distorts results | Low — averages across K independent estimates |
| Data efficiency | Test set data never used for training | 100% of data used for evaluation across folds |
| Compute cost | Train once — fast | Train K times — K× slower |
| Best used when | Large datasets (>50k rows), final holdout | Small/medium datasets, model selection, reporting |
| Works with pipelines? | Yes, via train_test_split + manual fit | Yes — Pipeline + cross_val_score handles it cleanly |
| Handles imbalanced classes? | Yes, with stratify=targets | Yes, with StratifiedKFold |
| Suitable for time-series? | Yes, but split must be chronological | No — use TimeSeriesSplit instead of KFold |

🎯 Key Takeaways

  • Evaluating on training data measures memorisation, not learning — your test set must be a wall the model never crosses during training or tuning.
  • Always pass stratify=targets to train_test_split and use StratifiedKFold for cross validation — without it, class imbalance silently corrupts your results.
  • Preprocessing (scaling, imputation, encoding) must live inside a Pipeline so it only ever fits on the training fold — fitting it on the full dataset before splitting is data leakage, even if it looks harmless.
  • The standard production workflow is: lock away a true held-out test set first, tune with inner K-Fold CV on the rest, then report the held-out score exactly once — treating it like a sealed exam envelope you open only on the final day.

⚠ Common Mistakes to Avoid

  • Scaling before splitting — Calling StandardScaler().fit_transform(features) on your full dataset before train_test_split means the scaler has already seen the mean and variance of your test rows. Your model has indirectly learned from the test set. Fix: always wrap your scaler in a Pipeline and let cross_val_score or fit/transform handle the split boundary.
  • Tuning hyperparameters, then reporting the CV score as your final score — Every time you check CV results and adjust a hyperparameter, you're optimising for those folds. The CV score after tuning is biased upward. Fix: lock away a true held-out test set before any tuning begins, tune using inner CV on the development set only, and report the held-out test score exactly once at the very end.
  • Using plain KFold on imbalanced classification data — With KFold, a fold might get very few examples of your minority class, making training unstable and metrics misleading. Fix: always use StratifiedKFold for classification so that each fold mirrors the full dataset's class distribution. Pass stratify=targets to train_test_split for the same reason.

Interview Questions on This Topic

  • Q: What is the mathematical justification for using $K-1$ folds for training in K-Fold Cross Validation?
  • Q: Explain how data leakage can occur during Target Encoding or Imputation if splits are handled incorrectly.
  • Q: Why is Accuracy a potentially dangerous metric to evaluate on a test split if the classes are highly imbalanced, and what should we use instead?

Frequently Asked Questions

Does cross-validation prevent overfitting?

Cross-validation does not directly prevent overfitting, but it makes it much easier to detect. By comparing the average training score across folds to the average validation score, you can see if the gap is widening—indicating the model is memorizing noise rather than general patterns.
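That widening gap is easy to surface with cross_validate and return_train_score=True. As a sketch, an unlimited-depth decision tree on the breast cancer dataset memorises every training fold, so its train score is perfect while its validation score lags behind:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Unlimited depth lets the tree memorise the training folds completely
results = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X, y, cv=5, return_train_score=True,
)

train_mean = results['train_score'].mean()
valid_mean = results['test_score'].mean()
print(f"Mean train score     : {train_mean:.4f}")  # memorisation: 1.0000
print(f"Mean validation score: {valid_mean:.4f}")
print(f"Overfit gap          : {train_mean - valid_mean:.4f}")
```

A gap near zero suggests the model generalises; a large gap like this one is the memorisation signal the article warns about.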

When should I use Leave-One-Out Cross-Validation (LOOCV)?

LOOCV is the extreme case where $K$ equals the number of samples in your dataset. Use it only for very small datasets (e.g., $N < 50$) where every single data point is precious. For larger sets, it is computationally prohibitive and can lead to high variance in your performance estimate.
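As a sketch (the 40-sample subset is artificial, carved out of iris just for illustration), LOOCV trains one model per sample:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

# Pretend we only have 40 precious samples (stratified subsample of iris)
X, y = load_iris(return_X_y=True)
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=40, stratify=y, random_state=0
)

# LeaveOneOut: K = N, so 40 models, each tested on one held-out sample
loocv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_small, y_small, cv=LeaveOneOut()
)
print(f"{len(loocv_scores)} models trained, mean accuracy {loocv_scores.mean():.4f}")
```

Each fold's score is 0 or 1 (a single sample is either right or wrong), which is exactly why LOOCV estimates are noisy even though every data point gets used.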

How do I handle time-series data with cross-validation?

Standard K-Fold is dangerous for time-series because it uses 'future' data to predict 'past' data. Instead, use TimeSeriesSplit, which uses an expanding window approach: Fold 1 trains on months 1-3 to predict month 4; Fold 2 trains on months 1-4 to predict month 5, and so on.
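The expanding window is visible on a toy array (test_size=1 is set here just so each fold predicts one "month"):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.arange(8)  # eight time-ordered observations
tss = TimeSeriesSplit(n_splits=4, test_size=1)

for train_idx, test_idx in tss.split(months):
    print(f"Train on {train_idx} -> predict {test_idx}")
```

Every training window ends strictly before its test point, so the model never uses the future to predict the past.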

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Overfitting and Underfitting · Next: Feature Engineering Basics →
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged