Train Test Split & Cross Validation Explained — With Real Code
Every machine learning model you build is ultimately a gamble. You're betting that the patterns your model learned from historical data will hold up on data it's never seen — whether that's tomorrow's customer transactions, next month's medical scans, or a stock price six hours from now. If you measure your model's performance on the same data you trained it on, you're not measuring anything real. You're measuring how well it memorized the past, not how well it predicts the future.
The problems this guards against have names: overfitting and data leakage. A model that scores 99% on training data but 61% on new data hasn't learned — it's cheated. Train/test split and cross validation are the two foundational tools that force honest evaluation. They create a clear wall between the data the model learns from and the data it gets graded on. Without them, every accuracy score you report is fiction.
By the end of this article you'll understand exactly why naive evaluation is dangerous, how to implement a proper train/test split in scikit-learn, when to reach for K-Fold cross validation instead, and how to combine both for a production-grade evaluation pipeline. You'll also know the three mistakes that silently corrupt results for even experienced practitioners.
Why Evaluating on Training Data Is a Silent Killer
When you train a model, it adjusts its internal parameters to minimize error on the data you gave it. The more complex the model, the more it can contort itself to fit every quirk, outlier, and noise spike in that training data. A decision tree with unlimited depth will reach 100% training accuracy on almost any dataset — it just memorizes every row. That's overfitting, and it's invisible unless you test the model on data it's never touched.
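To see that memorisation in action, here is a minimal sketch (using scikit-learn's bundled breast cancer dataset, the same one the examples below use): an unconstrained decision tree, fitted and scored on the very same rows, reaches essentially perfect training accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree keeps splitting until every training row
# is classified correctly, i.e. it memorises the data
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)
deep_tree.fit(X, y)

# Scored on its own training rows, accuracy is (near-)perfect
print(f"Training accuracy: {deep_tree.score(X, y):.4f}")
```

That number tells you nothing about generalisation; only a held-out set can.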
Here's the subtle danger: even experienced developers fall into this trap when they tune hyperparameters. Every time you check your model's score and adjust something, you're indirectly leaking information about the test set into your decisions. This is why a held-out test set must be locked away and touched exactly once — at the very end.
Train/test split is the minimal viable defense. You take your full dataset, randomly shuffle it, and cut it into two non-overlapping pieces. The training set is the classroom. The test set is the final exam. The model never sees the exam questions until grading day. This one habit alone separates a trustworthy ML workflow from a broken one.
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a real-world dataset: breast cancer classification
# 569 samples, 30 features, binary target (malignant / benign)
cancer_data = load_breast_cancer()
features = cancer_data.data    # shape: (569, 30)
targets = cancer_data.target   # shape: (569,)

# Split: 80% train, 20% test
# random_state=42 locks the shuffle so results are reproducible
# stratify=targets ensures BOTH splits have the same class ratio
# (critical for imbalanced datasets — without this, your test set
# might accidentally contain mostly one class)
X_train, X_test, y_train, y_test = train_test_split(
    features, targets,
    test_size=0.20,
    random_state=42,
    stratify=targets
)

print(f"Training samples : {X_train.shape[0]}")
print(f"Test samples     : {X_test.shape[0]}")
print(f"Train class dist : {np.bincount(y_train) / len(y_train)}")
print(f"Test class dist  : {np.bincount(y_test) / len(y_test)}")

# Train the model — it ONLY sees X_train and y_train
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

# Evaluate — the model sees X_test for the first time here
train_accuracy = accuracy_score(y_train, forest_model.predict(X_train))
test_accuracy = accuracy_score(y_test, forest_model.predict(X_test))

print(f"\nTrain accuracy : {train_accuracy:.4f}")
print(f"Test accuracy  : {test_accuracy:.4f}")
print(f"Gap (overfit signal): {train_accuracy - test_accuracy:.4f}")
```
```
Training samples : 455
Test samples     : 114
Train class dist : [0.37142857 0.62857143]
Test class dist  : [0.37719298 0.62280702]

Train accuracy : 1.0000
Test accuracy  : 0.9649
Gap (overfit signal): 0.0351
```
K-Fold Cross Validation — When One Test Split Isn't Enough
Here's the honest problem with a single train/test split: your result depends on luck. If the random split happens to put all the 'easy' examples in the test set, your model looks brilliant. If it puts all the hard ones there, your model looks terrible. With a small dataset (say, 500 rows), one unlucky split can swing your accuracy by 5–10 percentage points.
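You can watch that luck factor directly. The sketch below is a hypothetical experiment (logistic regression is a stand-in model; the seeds and the breast cancer dataset are arbitrary choices): it reruns the identical 80/20 split with ten different shuffle seeds and reports how far the best and worst test accuracies drift apart.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same model, same data, same 80/20 ratio — only the shuffle changes
scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=5000)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"Min: {min(scores):.4f}  Max: {max(scores):.4f}  "
      f"Spread: {max(scores) - min(scores):.4f}")
```

Any spread you see here is pure split luck; the model and data never changed.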
K-Fold cross validation fixes this by running the experiment K times, each time with a different fold acting as the test set. With K=5, your dataset is cut into 5 equal chunks. Round 1: train on folds 2–5, test on fold 1. Round 2: train on folds 1 and 3–5, test on fold 2. And so on. At the end you have 5 accuracy scores — average them for a far more stable estimate of real-world performance.
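The fold rotation is easiest to see on a toy example. This sketch uses scikit-learn's plain KFold on ten dummy samples so the train/test indices for each round are visible:

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # a tiny toy "dataset" of 10 samples
kfold = KFold(n_splits=5, shuffle=False)

# Each round, a different fifth of the data acts as the test fold
for round_number, (train_idx, test_idx) in enumerate(kfold.split(data), start=1):
    print(f"Round {round_number}: train={train_idx}, test={test_idx}")
```

Across the five rounds the test folds cover every sample exactly once, which is why K-Fold uses 100% of the data for evaluation.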
The cost is compute time — you're training K models instead of one. But the payoff is enormous: you use 100% of your data for evaluation (across all folds), and your performance estimate has far lower variance. For anything going into production, or any paper you're publishing, K-Fold is the standard.
Stratified K-Fold is the version you should default to for classification — it preserves class ratios in every fold, just like stratify does in train_test_split.
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

cancer_data = load_breast_cancer()
features = cancer_data.data
targets = cancer_data.target

# Wrap scaler + model in a Pipeline so the scaler ONLY learns
# from training folds — never from the test fold.
# This is critical: scaling before splitting leaks test set statistics.
evaluation_pipeline = Pipeline([
    ('scaler', StandardScaler()),  # fit_transform on train fold, transform on test fold
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# StratifiedKFold preserves class balance in every fold
# shuffle=True prevents ordering bias (e.g., if data was sorted by class)
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score handles the entire loop: split -> fit -> score -> repeat
fold_scores = cross_val_score(
    evaluation_pipeline, features, targets,
    cv=stratified_kfold,
    scoring='accuracy'
)

print("Per-fold accuracy scores:")
for fold_index, fold_score in enumerate(fold_scores, start=1):
    print(f"  Fold {fold_index}: {fold_score:.4f}")

print(f"\nMean accuracy  : {fold_scores.mean():.4f}")
print(f"Std deviation  : {fold_scores.std():.4f}")
print(f"95% CI estimate: ({fold_scores.mean() - 2*fold_scores.std():.4f}, "
      f"{fold_scores.mean() + 2*fold_scores.std():.4f})")

# High std deviation is a red flag — it means your model is unstable
# and performance depends heavily on which samples it sees
if fold_scores.std() > 0.03:
    print("\n⚠ High variance detected — consider more data or simpler model")
else:
    print("\n✓ Low variance — model generalises consistently across folds")
```
```
Per-fold accuracy scores:
  Fold 1: 0.9737
  Fold 2: 0.9561
  Fold 3: 0.9649
  Fold 4: 0.9561
  Fold 5: 0.9737

Mean accuracy  : 0.9649
Std deviation  : 0.0082
95% CI estimate: (0.9485, 0.9813)

✓ Low variance — model generalises consistently across folds
```
The Gold Standard: Train / Validation / Test and Nested CV
Once you add hyperparameter tuning to the picture, even K-Fold cross validation can leak. Here's why: if you run 50 hyperparameter combinations through the same CV folds and pick the best one, you've effectively optimised for those specific folds. The CV score of your winning model is now optimistically biased.
The production-grade solution is a three-way split: training set (model learns), validation set or inner CV (hyperparameters are tuned), and a completely held-out test set (touched exactly once for final reporting). This is often called nested cross validation — an inner loop for hyperparameter search and an outer loop for unbiased performance estimation.
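A minimal nested CV sketch, assuming scikit-learn and a deliberately tiny grid so it runs quickly: wrapping a GridSearchCV (the inner tuning loop) inside cross_val_score (the outer scoring loop) gives an unbiased estimate in a few lines.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: hyperparameter search (tiny grid for speed)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [50, 100]},
    cv=inner_cv,
)

# Outer loop: each outer fold scores a model whose hyperparameters
# were tuned only on that fold's training portion — the outer test
# fold never influences the tuning
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

print(f"Nested CV accuracy: {nested_scores.mean():.4f} "
      f"(std {nested_scores.std():.4f})")
```

Note the cost: with a 3×3 setup and 2 candidates this already trains well over a dozen models, which is why nested CV is reserved for final, trustworthy estimates.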
For most real projects, a simpler version works fine: do an upfront 80/20 train/test split, lock away the 20%, then run K-Fold CV with hyperparameter tuning only on the 80%. Report the final score on the locked-away 20% at the very end. Never go back and re-tune after seeing that score.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

cancer_data = load_breast_cancer()
features = cancer_data.data
targets = cancer_data.target

# ── STEP 1 ──────────────────────────────────────────────────
# Lock away the final test set immediately. Do NOT touch it again
# until the very end. This is your "exam paper in a sealed envelope".
X_develop, X_final_test, y_develop, y_final_test = train_test_split(
    features, targets,
    test_size=0.20,
    random_state=42,
    stratify=targets
)

# ── STEP 2 ──────────────────────────────────────────────────
# Build pipeline (scaler must be inside to prevent leakage)
model_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# ── STEP 3 ──────────────────────────────────────────────────
# Hyperparameter search using inner 5-fold CV on the development set ONLY
hyperparam_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20]
}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    model_pipeline,
    hyperparam_grid,
    cv=inner_cv,        # inner CV: tunes hyperparameters
    scoring='accuracy',
    n_jobs=-1,
    refit=True          # refit best model on full development set
)
grid_search.fit(X_develop, y_develop)

print("Best hyperparameters found:")
print(f"  {grid_search.best_params_}")
print(f"\nBest CV score (development set): {grid_search.best_score_:.4f}")

# ── STEP 4 ──────────────────────────────────────────────────
# NOW — and only now — open the sealed envelope and evaluate
best_model = grid_search.best_estimator_
final_predictions = best_model.predict(X_final_test)

print("\n── Final Held-Out Test Set Results ──")
print(classification_report(
    y_final_test, final_predictions,
    target_names=cancer_data.target_names
))
print("This score is your honest, reportable model performance.")
```
```
Best hyperparameters found:
  {'classifier__max_depth': None, 'classifier__n_estimators': 200}

Best CV score (development set): 0.9670

── Final Held-Out Test Set Results ──
              precision    recall  f1-score   support

   malignant       0.98      0.93      0.95        43
      benign       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

This score is your honest, reportable model performance.
```
| Aspect | Train/Test Split | K-Fold Cross Validation |
|---|---|---|
| How it works | Single random split into two non-overlapping sets | K rounds, each fold acts as test set once |
| Performance estimate variance | High — one unlucky split distorts results | Low — averages across K independent estimates |
| Data efficiency | Test set data never used for training | 100% of data used for evaluation across folds |
| Compute cost | Train once — fast | Train K times — K× slower |
| Best used when | Large datasets (>50k rows), final holdout | Small/medium datasets, model selection, reporting |
| Works with pipelines? | Yes, via train_test_split + manual fit | Yes — Pipeline + cross_val_score handles it cleanly |
| Handles imbalanced classes? | Yes, with stratify=targets | Yes, with StratifiedKFold |
| Suitable for time-series? | Yes, but split must be chronological | No — use TimeSeriesSplit instead of KFold |
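On the table's time-series point: a short sketch of scikit-learn's TimeSeriesSplit on twelve dummy observations shows that every test block lies strictly after its training block, so the model never "trains on the future".

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Imagine 12 chronologically ordered observations (e.g. monthly sales)
timestamps = np.arange(12)
ts_split = TimeSeriesSplit(n_splits=3)

# Each split trains on the past and tests on the block that follows;
# the test indices are always later than every training index
for train_idx, test_idx in ts_split.split(timestamps):
    print(f"train={train_idx}, test={test_idx}")
```

Shuffling or plain KFold would scatter future rows into the training set, which is exactly the leakage a chronological split exists to prevent.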
🎯 Key Takeaways
- Evaluating on training data measures memorisation, not learning — your test set must be a wall the model never crosses during training or tuning.
- Always pass stratify=targets to train_test_split and use StratifiedKFold for cross validation — without it, class imbalance silently corrupts your results.
- Preprocessing (scaling, imputation, encoding) must live inside a Pipeline so it only ever fits on the training fold — fitting it on the full dataset before splitting is data leakage, even if it looks harmless.
- The standard production workflow is: lock away a true held-out test set first, tune with inner K-Fold CV on the rest, then report the held-out score exactly once — treating it like a sealed exam envelope you open only on the final day.
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Scaling before splitting — Calling StandardScaler().fit_transform(features) on your full dataset before train_test_split means the scaler has already seen the mean and variance of your test rows. Your model has indirectly learned from the test set. Fix: always wrap your scaler in a Pipeline and let cross_val_score or fit/transform handle the split boundary.
- ✕ Mistake 2: Tuning hyperparameters then reporting CV score as your final score — Every time you check CV results and adjust a hyperparameter, you're optimising for those folds. The CV score after tuning is biased upward. Fix: lock away a true held-out test set before any tuning begins, tune using inner CV on the development set only, and report the held-out test score exactly once at the very end.
- ✕ Mistake 3: Using plain KFold on imbalanced classification data — With KFold, a fold might get very few examples of your minority class, making training unstable and metrics misleading. Fix: always use StratifiedKFold for classification so that each fold mirrors the full dataset's class distribution. Pass stratify=targets to train_test_split for the same reason.
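To make Mistake 1's fix concrete, here is a minimal sketch of the safe pattern (logistic regression is a stand-in model; any estimator works): the leaky call is left commented out, and the Pipeline guarantees the scaler is fitted on training rows only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# LEAKY pattern: scaler fitted on ALL rows, including future test rows
# scaled_all = StandardScaler().fit_transform(X)   # <- don't do this

# SAFE pattern: fit() runs the scaler on X_tr only; on X_te the
# pipeline only calls transform(), reusing training-set statistics
safe_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])
safe_pipeline.fit(X_tr, y_tr)

print(f"Test accuracy: {safe_pipeline.score(X_te, y_te):.4f}")
```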
Interview Questions on This Topic
- Q: What is data leakage in the context of train/test split, and can you give a concrete example of how it sneaks in during preprocessing?
- Q: If your model scores 98% accuracy on training data but 72% on the test set, walk me through the steps you'd take to diagnose and address this gap.
- Q: You have a dataset of 400 samples and need to tune 20 hyperparameter combinations. Why is a single train/test split a poor choice here, and what would you do instead?
Frequently Asked Questions
What is the difference between train test split and cross validation?
Train/test split divides your data once into two fixed sets — one for training, one for evaluation. Cross validation repeats this process K times, rotating which portion acts as the test set each time. Cross validation gives a more reliable performance estimate because it averages across K independent evaluations rather than relying on one lucky (or unlucky) split.
What is a good train/test split ratio?
80/20 is the most common starting point and works well for datasets above a few thousand rows. For very small datasets (under 1000 rows), consider 90/10 or switch entirely to cross validation so you're not wasting too much training data. For very large datasets (millions of rows), even 95/5 or 99/1 gives a test set large enough to be statistically meaningful.
Why do I need to use a Pipeline instead of just scaling before splitting?
When you scale before splitting, the scaler computes statistics (mean, variance) from the entire dataset — including the rows that will become your test set. This means your test set is no longer truly unseen; it has subtly influenced your preprocessing. A Pipeline ensures the scaler's fit() call only ever sees training data, keeping the test set genuinely independent.
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.