Senior 8 min · April 14, 2026
Your First Machine Learning Project – Complete Step-by-Step (2026)

First ML Project — 99.7% Accuracy Fails No Train-Test Split

99.7% accuracy but predicted same class every time — no train-test split.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.

Last updated: May 24, 2026
Quick Answer
  • Install Python 3.11 or 3.12, scikit-learn, pandas, numpy, and matplotlib — 4 core packages
  • Load the Iris dataset — 150 flower samples with 4 features and 3 species labels
  • Split data 80/20 with stratify — train on 80%, test on 20% to measure real generalization
  • Train a Decision Tree classifier — three lines of code with scikit-learn
  • Evaluate with accuracy, confusion matrix, classification report, and cross-validation
  • Compare against a second algorithm to build the habit of never shipping the first thing you try
  • Biggest mistake: skipping the train-test split — your model will memorize training data and fail on every real-world input
Definition
What is Your First Machine Learning Project?

This article is your first hands-on machine learning project, designed to teach the single most important lesson in ML: never trust accuracy without a proper train-test split. You'll build a classifier on the classic Iris dataset—a 150-sample, 3-class problem that's been the 'Hello World' of ML since Ronald Fisher introduced it in 1936.

The trap is real: train on all your data, and you can easily hit 99.7% accuracy on MNIST (handwritten digits) by memorizing pixel patterns, only to fail catastrophically on new images. That's not learning; that's overfitting. This project forces you to confront that failure by walking through the correct workflow: install Python and scikit-learn, load and visualize Iris, split your data (typically 80/20), then train a Decision Tree classifier and evaluate it on the held-out test set.

By the end, you'll understand why a model that scores 100% on training data but 70% on unseen test data is worse than useless—it's deceptive. This is the foundation every data scientist needs before touching neural networks or production systems.

Plain-English First

Building your first ML model is like following a cooking recipe for the first time. You gather ingredients (data), follow steps in a specific order (preprocessing, splitting, training, evaluation), and taste the result (metrics). The Iris dataset is the perfect first recipe — it is small, clean, balanced, and well-understood so you can focus on learning the workflow instead of fighting the data. You will load flower measurements, teach a computer to recognize species from those measurements, and check whether it actually learned something useful. The entire process takes under 30 minutes and requires zero math background.

Every ML engineer's career starts with one project that turns theory into working code. This guide walks through a complete end-to-end machine learning project using the Iris dataset and scikit-learn — from installing packages to saving a trained model to disk. No prior ML experience needed. No unexplained jargon. No math formulas. Each step has a clear purpose, runnable code, and a verifiable output so you know exactly what success looks like before moving on. You will load data, explore it, visualize it, split it, train a model, evaluate performance honestly, compare against a second algorithm, and make predictions on new data — the same workflow used in production at companies shipping real ML systems. In 2026, the tools have matured enough that the workflow itself is more important than any individual algorithm. Learn this workflow once and you can adapt it to any supervised learning problem you encounter.

Why 99.7% Accuracy on MNIST Means Nothing Without a Test Set

A common first machine learning project is the canonical MNIST digit classification task: given a 28×28 grayscale image of a handwritten digit (0–9), predict which digit it is. The core mechanic is training a model, typically a simple neural network or logistic regression, to map pixel intensities to one of ten classes. Beginners often reach 99.7% training accuracy and then mistakenly conclude the model is production-ready. This guide applies the same lesson to the smaller Iris dataset.

The critical property that matters in practice is generalization: the model must perform well on unseen data. Without a train-test split, the 99.7% figure is meaningless — it only measures how well the model memorized the training set. A proper split (e.g., 80/20) exposes overfitting: the test accuracy often drops to 97–98%, revealing the model's true performance. The gap between train and test accuracy is the single most important diagnostic for a beginner's model.

Use a train-test split on every supervised learning project, no exceptions. In real systems, this habit prevents deploying models that fail on new data — for example, a fraud detection model that memorizes historical patterns but misses novel attack vectors. The split is not optional; it is the foundation of trustworthy evaluation.
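
To make the gap concrete, here is a minimal sketch that measures training and test accuracy side by side. It assumes scikit-learn's built-in `load_digits` (8×8 digit images) as a small stand-in for MNIST so the example runs without a download, and it uses an unpruned decision tree rather than a neural network. The helper name `train_test_gap` is hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_test_gap(random_state=0):
    """Return (training accuracy, test accuracy) for an unpruned decision tree."""
    # load_digits is a small stand-in for MNIST (assumption for this sketch)
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    # An unpruned tree is free to memorize its training data
    model = DecisionTreeClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


train_acc, test_acc = train_test_gap()
print(f'Training accuracy: {train_acc:.2%}')
print(f'Test accuracy:     {test_acc:.2%}')
print(f'Gap:               {train_acc - test_acc:.2%}')
```

The training score on its own says nothing about new images. Only the second number, measured on rows the model never saw, estimates real-world behavior.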

Train-Test Split Is Not Optional
A model with 99.7% training accuracy but 97% test accuracy is overfitting, if only mildly. The gap, not the absolute number, tells you whether your model generalizes.
Production Insight
A team deployed a digit recognition model for check processing with 99.7% training accuracy and no test set. In production, it failed on slightly rotated digits, causing a 12% misread rate on real checks.
Symptom: high confidence on training-like images, catastrophic failure on any variation in lighting, angle, or handwriting style.
Rule of thumb: always hold out 20% of data before any training. If you can't, your model is not ready for production.
Key Takeaway
Training accuracy alone is a vanity metric — always compare against a held-out test set.
The gap between train and test accuracy is the first sign of overfitting.
A train-test split is the minimum viable evaluation; for production, add cross-validation and a separate validation set.
[Figure: ML Project Workflow: From Data to Model Comparison (thecodeforge.io). Steps: install Python and packages; load and explore the Iris dataset; visualize the data; split into train/test sets (typical 80/20); train the model; evaluate and compare with a second algorithm. Callout: 99.7% accuracy on MNIST without a train-test split is meaningless.]

Step 1: Install Python and Required Packages

Before writing any ML code, you need Python and four packages installed in an isolated environment. Python 3.11 or 3.12 is recommended in 2026 — both have broad library compatibility and improved performance over earlier versions. The four packages are scikit-learn (ML algorithms and evaluation), pandas (data manipulation and exploration), numpy (numerical operations that underpin everything), and matplotlib (visualization). Install them with a single pip command inside a virtual environment. The virtual environment step is not optional — installing ML packages into system Python causes conflicts that are painful to debug and can break your operating system's tools.

setup_environment.sh (Bash)
# Step 1: Create a virtual environment (mandatory, not optional)
python3.12 -m venv ml_first_project
source ml_first_project/bin/activate  # macOS/Linux
# ml_first_project\Scripts\Activate.ps1  # Windows PowerShell

# Step 2: Upgrade pip before installing anything
pip install --upgrade pip setuptools wheel

# Step 3: Install all required packages with pinned versions
pip install scikit-learn==1.5.0 pandas==2.2.2 numpy==1.26.4 matplotlib==3.9.0

# Step 4: Verify every import works
python -c "
import sklearn
import pandas
import numpy
import matplotlib
print(f'scikit-learn: {sklearn.__version__}')
print(f'pandas:       {pandas.__version__}')
print(f'numpy:        {numpy.__version__}')
print(f'matplotlib:   {matplotlib.__version__}')
print('All packages installed successfully')
"

# Step 5: Freeze versions for reproducibility
pip freeze > requirements.txt
echo "requirements.txt created with $(wc -l < requirements.txt) packages"
Output
scikit-learn: 1.5.0
pandas: 2.2.2
numpy: 1.26.4
matplotlib: 3.9.0
All packages installed successfully
requirements.txt created with 24 packages
Virtual Environments Save Hours of Debugging
  • Always create a virtual environment for each ML project — isolation prevents conflicts
  • Upgrade pip before installing packages — old pip versions misresolve dependencies
  • Pin versions in requirements.txt so the project works identically next month
  • Never install ML packages with sudo or into system Python — it will eventually break something
Production Insight
Package version mismatches cause 30% of beginner ML errors and a significant fraction of production deployment failures.
Always pin versions with == in requirements.txt — the same code with different library versions can produce different model outputs silently.
Verify every import after installation — silent install failures surface as ImportError during training, not during pip install.
Key Takeaway
Four packages: scikit-learn, pandas, numpy, matplotlib — installed in a virtual environment.
Pin versions and create requirements.txt immediately — reproducibility starts here.
Verify imports before writing any model code — catch problems in 5 seconds instead of 5 hours.
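
The same checks can be run from Python at the top of a script. The sketch below uses only the standard library; the helper names `in_virtualenv` and `find_version_mismatches` are hypothetical, and the pins are copied from the pip install command in this step.

```python
import sys
from importlib import metadata


def in_virtualenv():
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def find_version_mismatches(pins):
    """Compare installed package versions against a {name: version} mapping."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Pins copied from the pip install command in this step
PINS = {
    'scikit-learn': '1.5.0',
    'pandas': '2.2.2',
    'numpy': '1.26.4',
    'matplotlib': '3.9.0',
}

print(f'Inside a virtual environment: {in_virtualenv()}')
for name, (wanted, installed) in find_version_mismatches(PINS).items():
    print(f'{name}: expected {wanted}, found {installed or "not installed"}')
```

If the loop prints nothing, every installed version matches its pin.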

Step 2: Load and Explore the Iris Dataset

The Iris dataset is the Hello World of machine learning — the first dataset every ML engineer trains on, and for good reason. It contains 150 samples of iris flowers with 4 measurements each: sepal length, sepal width, petal length, and petal width — all in centimeters. Each sample is labeled with one of three species: setosa, versicolor, or virginica. The dataset is perfectly balanced (50 per class), has no missing values, and has clear feature separation — which means you can focus on learning the workflow without fighting the data. scikit-learn includes this dataset built-in, so no download, no CSV parsing, and no network dependency is required. Exploring the data before training is not a nicety — it is the step that catches data quality issues, reveals class imbalance, and builds your intuition about what the model is going to learn.

step2_explore_data.py (Python)
# TheCodeForge — Step 2: Load and Explore the Iris Dataset
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load the dataset — built into scikit-learn, no download needed
iris = load_iris()

# Convert to DataFrame for easier exploration and display
df = pd.DataFrame(
    data=iris.data,
    columns=iris.feature_names
)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# 1. Dataset shape — how much data do we have?
print('=== Dataset Shape ===')
print(f'Samples: {df.shape[0]}, Features: {df.shape[1] - 1}')
print(f'Feature names: {iris.feature_names}')
print(f'Class names: {list(iris.target_names)}')

# 2. First 5 rows — what does the data look like?
print('\n=== First 5 Rows ===')
print(df.head().to_string())

# 3. Statistical summary — what are the value ranges?
print('\n=== Feature Statistics ===')
print(df.describe().round(2).to_string())

# 4. Class distribution — is the dataset balanced?
print('\n=== Class Distribution ===')
print(df['species'].value_counts().to_string())
balance_ratio = df['species'].value_counts().min() / df['species'].value_counts().max()
print(f'Balance ratio: {balance_ratio:.2f} (1.00 = perfectly balanced)')

# 5. Missing values — will any algorithms crash?
print('\n=== Missing Values ===')
missing = df.isnull().sum()
print(missing.to_string())
print(f'Total missing: {missing.sum()}')

# 6. Feature-target correlations: which features track the species label?
# (Pearson correlation against the integer-encoded label is a rough screening heuristic)
print('\n=== Feature Correlations with Target ===')
df_numeric = df.copy()
df_numeric['target'] = iris.target
for col in iris.feature_names:
    corr = df_numeric[col].corr(df_numeric['target'])
    print(f'  {col}: {corr:.3f}')

print('\nPetal features have higher correlation with species — they will be more useful for classification.')
Output
=== Dataset Shape ===
Samples: 150, Features: 4
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Class names: ['setosa', 'versicolor', 'virginica']
=== First 5 Rows ===
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
=== Feature Statistics ===
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.00 150.00 150.00 150.00
mean 5.84 3.06 3.76 1.20
std 0.83 0.44 1.77 0.76
min 4.30 2.00 1.00 0.10
max 7.90 4.40 6.90 2.50
=== Class Distribution ===
setosa 50
versicolor 50
virginica 50
Balance ratio: 1.00 (1.00 = perfectly balanced)
=== Missing Values ===
sepal length (cm) 0
sepal width (cm) 0
petal length (cm) 0
petal width (cm) 0
species 0
Total missing: 0
=== Feature Correlations with Target ===
sepal length (cm): 0.783
sepal width (cm): -0.426
petal length (cm): 0.949
petal width (cm): 0.956
Petal features have higher correlation with species — they will be more useful for classification.
Data Exploration Checklist
  • Shape — how many samples and features do you have? Is this enough data for the algorithm you plan to use?
  • Class balance — are all classes represented equally? Imbalance causes misleading accuracy
  • Missing values — will your algorithm crash or silently produce garbage on NaN inputs?
  • Feature ranges — wildly different scales may require normalization for distance-based algorithms
  • Correlations — which features actually relate to the target? High correlation means the feature is informative
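
The feature-range item matters mostly for distance-based models. As a hedged sketch, a scikit-learn pipeline can standardize features using statistics learned from the training rows only. The k-nearest neighbors classifier and `n_neighbors=5` are choices made for this illustration and are not part of the main walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler's mean and standard deviation are computed from X_train only,
# then reused unchanged when the pipeline scores X_test
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

knn_test_acc = scaled_knn.score(X_test, y_test)
print(f'Scaled k-NN test accuracy: {knn_test_acc:.2%}')
```

Wrapping the scaler and the model in one pipeline keeps test-set statistics out of training, which is the same leakage concern the train-test split addresses.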
Production Insight
Always explore data before training — it costs 2 minutes and prevents 2 hours of debugging bad model results.
Class imbalance is the single most common cause of misleading accuracy in real-world projects — Iris is balanced, but your next dataset will not be.
Feature-target correlation tells you which features are worth keeping — in Iris, petal measurements are far more discriminative than sepal measurements.
Key Takeaway
The Iris dataset has 150 samples, 4 features, 3 perfectly balanced classes, and zero missing values.
Explore before training — df.describe(), df.isnull().sum(), and value_counts() are your three essential first commands.
Petal features correlate more strongly with species than sepal features — this insight explains model behavior before you train anything.
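
To see why class balance belongs on the checklist, consider a made-up dataset (not Iris) with 997 negatives and 3 positives, and a "model" that always predicts the majority class. The numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
# Hypothetical imbalanced labels: 997 negatives, 3 positives
y_true = [0] * 997 + [1] * 3
# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the minority class: found positives / actual positives
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f'Accuracy:        {accuracy:.1%}')         # 99.7%
print(f'Minority recall: {minority_recall:.1%}')  # 0.0%
```

The accuracy looks excellent while the model never finds a single positive, which is why per-class metrics matter on imbalanced data.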

Step 3: Visualize the Data

Visualization reveals patterns that summary statistics hide. A scatter plot of petal length versus petal width instantly shows that setosa is clearly separated from versicolor and virginica — while those two overlap slightly. This single plot tells you the classification task is feasible and that perfect accuracy may not be achievable because of the class overlap. Without visualization, you are training blind — you would not know whether your model is struggling because of the algorithm or because the classes genuinely overlap in feature space. In 2026, matplotlib remains the standard for static plots. For a first project, static plots saved as PNG files are more useful than interactive displays that disappear when the notebook restarts.

step3_visualize.py (Python)
# TheCodeForge — Step 3: Visualize the Iris Dataset
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

colors = {'setosa': '#e74c3c', 'versicolor': '#3498db', 'virginica': '#2ecc71'}

# Plot 1: Feature distributions by species (2x2 histograms)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
fig.suptitle('Iris Dataset — Feature Distributions by Species', fontsize=14)

for idx, feature in enumerate(iris.feature_names):
    ax = axes[idx // 2, idx % 2]
    for species in iris.target_names:
        subset = df[df['species'] == species]
        ax.hist(subset[feature], alpha=0.6, label=species,
                color=colors[species], bins=15, edgecolor='white')
    ax.set_xlabel(feature)
    ax.set_ylabel('Count')
    ax.legend(fontsize=8)

plt.tight_layout()
plt.savefig('iris_distributions.png', dpi=150)
print('Saved: iris_distributions.png')

# Plot 2: Scatter plot — the most revealing single visualization
fig, ax = plt.subplots(figsize=(8, 6))
for species in iris.target_names:
    subset = df[df['species'] == species]
    ax.scatter(
        subset['petal length (cm)'],
        subset['petal width (cm)'],
        label=species,
        color=colors[species],
        alpha=0.7,
        s=60,
        edgecolors='white',
        linewidth=0.5
    )
ax.set_xlabel('Petal Length (cm)', fontsize=12)
ax.set_ylabel('Petal Width (cm)', fontsize=12)
ax.set_title('Petal Length vs Width — Clear Species Separation', fontsize=13)
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig('iris_scatter.png', dpi=150)
print('Saved: iris_scatter.png')

# Plot 3: Correlation heatmap — which features are related?
fig, ax = plt.subplots(figsize=(7, 5))
corr_matrix = df[iris.feature_names].corr()
im = ax.imshow(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(4))
ax.set_yticks(range(4))
ax.set_xticklabels([f.replace(' (cm)', '') for f in iris.feature_names], rotation=45, ha='right')
ax.set_yticklabels([f.replace(' (cm)', '') for f in iris.feature_names])
for i in range(4):
    for j in range(4):
        ax.text(j, i, f'{corr_matrix.iloc[i, j]:.2f}', ha='center', va='center', fontsize=10)
plt.colorbar(im, label='Correlation')
ax.set_title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('iris_correlation.png', dpi=150)
print('Saved: iris_correlation.png')

print('\nKey insight: setosa is clearly separated. Versicolor and virginica overlap slightly.')
print('This tells us classification is feasible but perfect accuracy may not be possible.')
Output
Saved: iris_distributions.png
Saved: iris_scatter.png
Saved: iris_correlation.png
Key insight: setosa is clearly separated. Versicolor and virginica overlap slightly.
This tells us classification is feasible but perfect accuracy may not be possible.
What Visualization Tells You Before Training
  • Scatter plots reveal whether classes are separable — overlapping classes mean even perfect algorithms will make mistakes
  • Histograms show which features differentiate classes — petal measurements separate species far better than sepal measurements
  • Correlation heatmaps reveal redundant features — highly correlated features carry similar information
  • Save plots as PNG files — notebook displays disappear when sessions restart, saved files persist for documentation and README files
Production Insight
Visualization catches data issues and feature relationships that statistical summaries miss — outliers, clusters, nonlinear separation, and multimodal distributions.
Petal length and petal width separate Iris species more cleanly than sepal measurements — this explains why models that use all 4 features perform similarly to models using only petal features.
Always save plots as files, not just inline notebook displays — you need them for documentation, README files, and explaining results to stakeholders.
Key Takeaway
Visualize before training — it reveals whether classification is feasible and which features matter.
The Iris scatter plot shows clear setosa separation and slight versicolor-virginica overlap.
Save every plot to disk — notebooks lose inline displays, but PNG files persist.
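
The overlap visible in the scatter plot can also be checked numerically. This sketch compares the petal-length range of each species; the helper `ranges_overlap` is a hypothetical name introduced here, and a single-feature range check is only a rough proxy for separability in the full feature space.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Minimum and maximum petal length for each species
petal_range = (
    df.groupby('species', observed=True)['petal length (cm)'].agg(['min', 'max'])
)
print(petal_range)


def ranges_overlap(a, b):
    """Two ranges overlap when each one starts before the other ends."""
    return bool(
        petal_range.loc[a, 'min'] <= petal_range.loc[b, 'max']
        and petal_range.loc[b, 'min'] <= petal_range.loc[a, 'max']
    )


print('setosa vs versicolor overlap:   ', ranges_overlap('setosa', 'versicolor'))
print('versicolor vs virginica overlap:', ranges_overlap('versicolor', 'virginica'))
```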

Step 4: Split Data into Training and Test Sets

The train-test split is the most critical single step in any ML project. It prevents data leakage — the model never sees test data during training, so the evaluation measures genuine generalization rather than memorization. The standard beginner split is 80% training and 20% test. scikit-learn's train_test_split function handles this with one line of code. Two parameters are non-negotiable: stratify=y ensures all classes appear in both sets in the original proportion, and random_state ensures reproducibility — the same split every time you run the code. Without this split, your model memorizes the data instead of learning patterns, and every metric you compute is a lie that will collapse the moment real-world data arrives.

step4_split_data.py (Python)
# TheCodeForge — Step 4: Split Data into Training and Test Sets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Load data
X, y = load_iris(return_X_y=True)
iris = load_iris()

# Split: 80% train, 20% test
# stratify=y: maintain class proportions in both sets
# random_state=42: same split every time for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print('=== Train-Test Split ===')
print(f'Full dataset:  {X.shape[0]} samples')
print(f'Training set:  {X_train.shape[0]} samples ({X_train.shape[0]/X.shape[0]*100:.0f}%)')
print(f'Test set:      {X_test.shape[0]} samples ({X_test.shape[0]/X.shape[0]*100:.0f}%)')

print('\n=== Class Distribution in Training Set ===')
for cls_idx, cls_name in enumerate(iris.target_names):
    count = np.sum(y_train == cls_idx)
    print(f'  {cls_name}: {count} samples')

print('\n=== Class Distribution in Test Set ===')
for cls_idx, cls_name in enumerate(iris.target_names):
    count = np.sum(y_test == cls_idx)
    print(f'  {cls_name}: {count} samples')

print('\n=== Verification ===')
print(f'Classes in train: {sorted(set(y_train))}')
print(f'Classes in test:  {sorted(set(y_test))}')
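# Split row indices with the same settings to confirm no row lands in both sets
idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.2, random_state=42, stratify=y
)
print(f'Train and test overlap: {len(set(idx_train) & set(idx_test))} (should be 0)')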
print('\nstratify=y ensures balanced classes. random_state=42 ensures reproducibility.')
Output
=== Train-Test Split ===
Full dataset: 150 samples
Training set: 120 samples (80%)
Test set: 30 samples (20%)
=== Class Distribution in Training Set ===
setosa: 40 samples
versicolor: 40 samples
virginica: 40 samples
=== Class Distribution in Test Set ===
setosa: 10 samples
versicolor: 10 samples
virginica: 10 samples
=== Verification ===
Classes in train: [0, 1, 2]
Classes in test: [0, 1, 2]
Train and test overlap: 0 (should be 0)
stratify=y ensures balanced classes. random_state=42 ensures reproducibility.
Never Skip the Train-Test Split
  • Evaluating on training data gives artificially high accuracy — the model memorized the answers, it did not learn
  • The test set simulates real-world unseen data — it is the only honest measure of generalization
  • Always use stratify=y for classification to ensure all classes appear in both sets proportionally
  • Set random_state for reproducibility — without it, results change on every run and debugging becomes impossible
  • This single step is the difference between a model that works in production and one that fails on day one
Production Insight
Skipping the train-test split is the most common beginner mistake and the most expensive to discover in production.
stratify=y is critical for imbalanced datasets — without it, small classes can be absent from the test set entirely, making accuracy misleading.
random_state makes your results reproducible — if you cannot reproduce a result, you cannot debug it, improve it, or trust it.
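
The effect of `stratify=y` is easiest to see on imbalanced labels. The sketch below uses a hypothetical 95/5 label array (not Iris) and compares one stratified split with a few unstratified ones. The unstratified counts depend on the seed, so no expected values are shown for them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 95 samples of class 0, 5 of class 1
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Stratified split: the 5% minority share is preserved in both sets
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f'stratified -> minority in train: {(y_train_s == 1).sum()}, '
      f'test: {(y_test_s == 1).sum()}')

# Unstratified splits: the minority count in the test set varies by seed
for seed in range(5):
    _, _, _, y_test_u = train_test_split(X, y, test_size=0.2, random_state=seed)
    print(f'seed {seed}, unstratified -> minority in test: {(y_test_u == 1).sum()}')
```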
Key Takeaway
The train-test split is the single most important step in any ML project — it separates memorization from learning.
80/20 with stratify=y and random_state=42 is the standard starting configuration.
Never report accuracy without confirming it was measured on held-out data.
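
When you later tune settings such as tree depth, a third held-out portion keeps the test set untouched until the very end. A minimal sketch, assuming a 60/20/20 split (the proportions are an illustrative choice), calls `train_test_split` twice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Then take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

print(f'train: {len(X_train)}, validation: {len(X_val)}, test: {len(X_test)}')
```

Tune against the validation rows, and score the test rows once, after all decisions are made.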

Step 5: Train Your First ML Model

Training a model in scikit-learn requires three lines of code: import the algorithm, create an instance, call fit(). The Decision Tree classifier is the best first algorithm because it is interpretable (you can visualize the learned rules), requires no feature scaling, handles multi-class problems natively, and produces results good enough to validate the entire pipeline. The fit() method learns patterns from the training data — it reads every row, discovers decision rules that separate the classes, and stores those rules internally. After training, the model object contains everything needed to make predictions on any new data with the same feature structure.

step5_train_model.py (Python)
# TheCodeForge — Step 5: Train Your First ML Model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Load and split data (same as Step 4)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a Decision Tree Classifier — three lines
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print('=== Model Trained ===')
print(f'Algorithm: {model.__class__.__name__}')
print(f'Training samples used: {X_train.shape[0]}')
print(f'Features per sample: {X_train.shape[1]}')
print(f'Classes learned: {list(iris.target_names[model.classes_])}')
print(f'Tree depth: {model.get_depth()}')
print(f'Number of leaves (decision endpoints): {model.get_n_leaves()}')

# Feature importance — which features did the tree use most?
print('\n=== Feature Importance ===')
for name, importance in sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda x: -x[1]
):
    bar = '█' * int(importance * 40)
    print(f'  {name:>20}: {importance:.3f} {bar}')

# Quick accuracy check on test data (detailed evaluation in Step 6)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f'\n=== Quick Accuracy Check ===')
print(f'Training accuracy: {train_acc:.2%}')
print(f'Test accuracy:     {test_acc:.2%}')
print(f'Gap:               {train_acc - test_acc:.2%}')
if train_acc - test_acc > 0.10:
    print('WARNING: Large train-test gap may indicate overfitting')
else:
    print('Gap is small — model generalizes well')

# Predict a single new sample
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])  # measurements in cm
prediction = model.predict(new_flower)
print(f'\nNew flower {new_flower[0]} -> {iris.target_names[prediction[0]]}')
Output
=== Model Trained ===
Algorithm: DecisionTreeClassifier
Training samples used: 120
Features per sample: 4
Classes learned: ['setosa', 'versicolor', 'virginica']
Tree depth: 5
Number of leaves (decision endpoints): 9
=== Feature Importance ===
petal width (cm): 0.921 █████████████████████████████████████
petal length (cm): 0.065 ██
sepal length (cm): 0.014
sepal width (cm): 0.000
=== Quick Accuracy Check ===
Training accuracy: 100.00%
Test accuracy: 96.67%
Gap: 3.33%
Gap is small — model generalizes well
New flower [5.1 3.5 1.4 0.2] -> setosa
Training Mental Model
  • fit(X_train, y_train) is the learning step — the model reads training data and discovers rules
  • predict(X_new) is the exam step — the model applies those rules to data it has never seen
  • Feature importance tells you which measurements the model relied on most — petal width dominates Iris classification
  • The train-test accuracy gap measures overfitting — a gap above 10% is a warning sign
  • random_state=42 ensures the same tree is built every time — critical for reproducibility
Production Insight
Training is three lines: import, instantiate, fit — scikit-learn handles all the algorithm internals.
Feature importance reveals what the model learned: in Iris, petal width accounts for about 92% of the tree's total impurity reduction, confirming what the scatter plot showed.
Always compute the train-test accuracy gap — 100% training accuracy with lower test accuracy is normal for trees, but a gap above 10-15% is an overfitting signal.
Key Takeaway
Training is three lines: import, instantiate, fit.
The Decision Tree learned that petal width is by far the most important feature — matching our visualization.
The train-test gap is 3.3% — small enough to confirm the model generalizes well.
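
Because a decision tree is a set of if/else rules, those rules can be printed as text. The sketch below uses scikit-learn's `export_text`; `max_depth=2` is an assumption made here to keep the printout short, so this tree is smaller than the one trained in this step and its accuracy may differ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# max_depth=2 is an illustrative choice that keeps the printed rules short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(X_train, y_train)

# Render the learned splits as indented if/else rules
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
print(f'Test accuracy at depth 2: {small_tree.score(X_test, y_test):.2%}')
```

Reading the printed thresholds against the scatter plot is a quick sanity check that the model learned something sensible.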

Step 6: Evaluate Model Performance

Evaluation measures how well your model generalizes to unseen data. It is the step that separates a toy experiment from a trustworthy model. Accuracy alone is insufficient: a confusion matrix reveals which specific classes the model confuses, and the classification report provides precision, recall, and F1-score per class. For the Iris dataset, expect 93-100% test accuracy depending on the random split. If accuracy is below 90%, suspect the preprocessing or the split before the algorithm. Cross-validation provides a more robust estimate by rotating the held-out portion across several folds, so every sample is used for testing exactly once and a single lucky or unlucky split is less likely to distort your results.

step6_evaluate.py (Python)
# TheCodeForge — Step 6: Evaluate Model Performance
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    ConfusionMatrixDisplay
)
import numpy as np
import matplotlib.pyplot as plt

# Load, split, train
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# 1. Accuracy — the simplest metric
accuracy = accuracy_score(y_test, predictions)
print(f'=== Accuracy ===')
n_correct = int((predictions == y_test).sum())  # count matches directly; int(accuracy * n) can truncate
print(f'Test accuracy: {accuracy:.2%} ({n_correct}/{len(y_test)} correct)')

# 2. Confusion Matrix — which classes get confused?
print(f'\n=== Confusion Matrix ===')
cm = confusion_matrix(y_test, predictions)
print(f'{"":>12} {"  ".join(iris.target_names)}  <- Predicted')
for i, row in enumerate(cm):
    print(f'{iris.target_names[i]:>12}: {row}  <- Actual')
print('Diagonal = correct predictions. Off-diagonal = mistakes.')

# Save confusion matrix as image
fig, ax = plt.subplots(figsize=(7, 5))
ConfusionMatrixDisplay.from_predictions(
    y_test, predictions,
    display_labels=iris.target_names,
    cmap='Blues',
    ax=ax
)
ax.set_title('Confusion Matrix — Iris Classification')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
print('Saved: confusion_matrix.png')

# 3. Classification Report — precision, recall, F1 per class
print(f'\n=== Classification Report ===')
print(classification_report(y_test, predictions, target_names=iris.target_names))

# 4. Cross-Validation — more robust than a single split
print(f'=== Cross-Validation (5-fold stratified) ===')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42),
    X, y, cv=cv, scoring='accuracy'
)
print(f'Fold scores: {cv_scores.round(3)}')
print(f'Mean accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')
print(f'Worst fold:    {cv_scores.min():.2%}')
print(f'Best fold:     {cv_scores.max():.2%}')

if cv_scores.mean() < accuracy - 0.05:
    print('\nWARNING: Test accuracy is notably higher than CV mean — the test set may be unusually easy.')
else:
    print('\nTest accuracy aligns with CV mean — results are reliable.')
Output
=== Accuracy ===
Test accuracy: 96.67% (29/30 correct)
=== Confusion Matrix ===
setosa versicolor virginica <- Predicted
setosa: [10 0 0] <- Actual
versicolor: [ 0 9 1] <- Actual
virginica: [ 0 0 10] <- Actual
Diagonal = correct predictions. Off-diagonal = mistakes.
Saved: confusion_matrix.png
=== Classification Report ===
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
=== Cross-Validation (5-fold stratified) ===
Fold scores: [0.967 0.967 0.9 0.967 1. ]
Mean accuracy: 96.00% (+/- 3.06%)
Worst fold: 90.00%
Best fold: 100.00%
Test accuracy aligns with CV mean — results are reliable.
Evaluation Mental Model
  • Accuracy = correct predictions divided by total predictions — a single number summary
  • Confusion matrix shows which specific classes get confused with each other — setosa is never wrong, but one versicolor was misclassified as virginica
  • Precision = of everything predicted as class X, what fraction actually was class X
  • Recall = of everything that actually is class X, what fraction did the model find
  • Cross-validation trains and tests on multiple splits for a more stable, reliable accuracy estimate
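To make the precision and recall definitions above concrete, here is a small sketch that recomputes them by hand from the confusion matrix printed in this step. The matrix values are copied from the Step 6 output; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the Step 6 output: rows = actual, columns = predicted
cm = np.array([
    [10, 0, 0],   # setosa
    [0, 9, 1],    # versicolor
    [0, 0, 10],   # virginica
])
class_names = ['setosa', 'versicolor', 'virginica']

correct = np.diag(cm)                 # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # column totals: everything predicted as the class
recall = correct / cm.sum(axis=1)     # row totals: everything that actually is the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f'{name:>10}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
print(f'accuracy={accuracy:.4f}')
```

The single versicolor flower predicted as virginica lowers versicolor recall to 9/10 and virginica precision to 10/11, which is where the 0.90 and 0.91 in the classification report come from.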
Production Insight
Accuracy alone is misleading on imbalanced datasets — always check the confusion matrix to understand which classes the model struggles with.
Cross-validation gives a more robust estimate than a single train-test split — the range between worst and best fold reveals how sensitive the model is to the data split.
If test accuracy is much higher than CV mean accuracy, the test set was probably unusually easy — CV mean is the more trustworthy number.
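The warning about imbalanced datasets can be illustrated with a minimal sketch. The labels below are synthetic and hypothetical (990 negatives, 10 positives), and the "model" simply predicts the majority class every time, the same failure mode as the incident in this article's headline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 990 negatives, 10 positives
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that predicts the majority class for every input
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
balanced = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print(f'Accuracy:          {accuracy:.1%}')
print(f'Minority recall:   {minority_recall:.1%}')
print(f'Balanced accuracy: {balanced:.1%}')
```

Accuracy looks excellent while the model never finds a single positive sample. Per-class recall or balanced accuracy exposes the problem immediately.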
Key Takeaway
Evaluation has 4 levels: accuracy for the big picture, confusion matrix for per-class errors, classification report for precision and recall, cross-validation for robustness.
The confusion matrix shows exactly where the model fails — here, one versicolor flower was confused with virginica.
Cross-validation mean of 96% confirms the test accuracy of 97% is not a fluke.

Step 7: Compare a Second Algorithm

Never ship the first algorithm you try. Comparing at least two algorithms on the same data split builds the habit of model selection — one of the most important practices in production ML. A Random Forest is an excellent second algorithm to compare against the Decision Tree: it builds many trees and averages their predictions, reducing overfitting. The comparison takes 5 additional lines of code and immediately tells you whether the Decision Tree result is strong or whether a better algorithm would meaningfully improve performance. This step transforms a homework exercise into the beginning of a professional workflow.
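To see what "builds many trees and averages their predictions" means in practice, the sketch below averages the class probabilities of each fitted tree by hand and compares the result with the forest's own output. It assumes the same split settings used throughout this article; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Each fitted tree is available in rf.estimators_
per_tree_proba = np.stack([tree.predict_proba(X_test) for tree in rf.estimators_])
averaged = per_tree_proba.mean(axis=0)

# Compare the hand-computed average with the forest's own probabilities
print('Trees in forest:', len(rf.estimators_))
print('Matches rf.predict_proba:', np.allclose(averaged, rf.predict_proba(X_test)))
```

Because each tree sees a different bootstrap sample, individual trees disagree on borderline flowers, and the average smooths out those disagreements.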

step7_compare_algorithms.py · PYTHON
# TheCodeForge — Step 7: Compare Multiple Algorithms
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# Load and split data (same split for fair comparison)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define algorithms to compare
# Note: Logistic Regression and KNN need feature scaling — use a Pipeline
algorithms = {
    'Decision Tree':      DecisionTreeClassifier(random_state=42),
    'Random Forest':      RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=200, random_state=42))
    ]),
    'K-Nearest Neighbors': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', KNeighborsClassifier(n_neighbors=5))
    ]),
}

print('=== Algorithm Comparison ===')
print(f'{"Algorithm":<24} {"Test Acc":>10} {"CV Mean":>10} {"CV Std":>10}')
print('-' * 58)

results = {}
for name, algo in algorithms.items():
    algo.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, algo.predict(X_test))
    cv_scores = cross_val_score(algo, X, y, cv=cv, scoring='accuracy')
    results[name] = {
        'test_acc': test_acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
    print(f'{name:<24} {test_acc:>9.2%} {cv_scores.mean():>9.2%} {cv_scores.std():>9.2%}')

best = max(results, key=lambda k: results[k]['cv_mean'])
print(f'\nBest algorithm by CV mean: {best} ({results[best]["cv_mean"]:.2%})')
print('\nKey insight: on clean, balanced data like Iris, most algorithms perform similarly.')
print('On real-world messy data, gradient boosting typically wins for tabular problems.')
Output
=== Algorithm Comparison ===
Algorithm                  Test Acc    CV Mean     CV Std
----------------------------------------------------------
Decision Tree               96.67%    96.00%     3.06%
Random Forest               96.67%    96.67%     2.11%
Logistic Regression         96.67%    97.33%     2.49%
K-Nearest Neighbors         96.67%    96.67%     2.11%
Best algorithm by CV mean: Logistic Regression (97.33%)
Key insight: on clean, balanced data like Iris, most algorithms perform similarly.
On real-world messy data, gradient boosting typically wins for tabular problems.
Why Algorithm Comparison Matters
  • Never ship the first algorithm you try — always compare at least two
  • Use the same data split for all algorithms — otherwise the comparison is unfair
  • CV mean is more reliable than test accuracy for comparison — it averages over multiple splits
  • On Iris, most algorithms perform similarly because the data is clean and separable — real-world data shows larger gaps
  • Algorithms that need scaling (KNN, Logistic Regression) must be wrapped in a Pipeline to prevent data leakage during CV
Production Insight
Comparing algorithms takes 5 minutes and occasionally reveals a 10+ percentage point improvement, which makes it one of the highest-ROI steps in an ML project.
On clean, balanced datasets like Iris, algorithm choice matters less than preprocessing and feature engineering. On real-world data, the gap between algorithms can be significant.
Using Pipelines for algorithms that require scaling ensures that the scaler is fit only on training data during cross-validation — fitting on the full dataset before splitting is data leakage.
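The leakage point above can be checked directly. The following sketch loops over the folds by hand, roughly mirroring what cross-validation does with a Pipeline, and confirms that the scaler's statistics come from the training fold only. The loop and variable names are illustrative, not the internals of cross_val_score.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier(n_neighbors=5)),
])

fold_scores = []
scaler_uses_train_only = []
for train_idx, val_idx in cv.split(X, y):
    fold_pipe = clone(pipe)                    # fresh, unfitted copy per fold
    fold_pipe.fit(X[train_idx], y[train_idx])  # scaler statistics come from the training fold
    fitted_mean = fold_pipe.named_steps['scaler'].mean_
    scaler_uses_train_only.append(
        np.allclose(fitted_mean, X[train_idx].mean(axis=0))
    )
    fold_scores.append(fold_pipe.score(X[val_idx], y[val_idx]))

print('Scaler fit on training fold only:', all(scaler_uses_train_only))
print('Fold accuracies:', np.round(fold_scores, 3))
```

Calling StandardScaler().fit_transform(X) on the full dataset before cross-validation would instead bake validation-fold statistics into every training fold.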
Key Takeaway
Always compare at least two algorithms — never ship the first thing you try.
Use the same data split and cross-validation for a fair comparison.
On Iris, most algorithms tie — on real-world data, the differences matter more.

Step 8: Make Predictions and Save the Model

The final step closes the loop: use the trained model to predict new unseen samples, and save the model to disk so you do not have to retrain it every time you need a prediction. The predict() method accepts a 2D array of feature values and returns the predicted class. predict_proba() returns per-class probability estimates, which production systems can use to filter low-confidence predictions. For a fully grown Decision Tree these estimates are usually exactly 0% or 100%, because most leaves contain a single class. joblib saves the trained model as a file that can be loaded and used anywhere: in a script, a notebook, a FastAPI endpoint, or a scheduled batch prediction job. This step represents the complete ML workflow: from raw data to a reusable prediction artifact.

step8_predict_and_save.py · PYTHON
# TheCodeForge — Step 8: Make Predictions and Save the Model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import joblib
import os

# Load, split, train (same as previous steps)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict new flowers — data the model has never seen
new_flowers = np.array([
    [5.1, 3.5, 1.4, 0.2],  # typical setosa measurements
    [6.2, 2.9, 4.3, 1.3],  # typical versicolor measurements
    [7.7, 3.0, 6.1, 2.3],  # typical virginica measurements
    [5.0, 3.4, 1.5, 0.2],  # another setosa candidate
    [5.9, 3.0, 4.2, 1.5],  # versicolor or virginica?
])

predictions = model.predict(new_flowers)
probabilities = model.predict_proba(new_flowers)

print('=== Predictions on New Data ===')
for i, (flower, pred, prob) in enumerate(zip(new_flowers, predictions, probabilities)):
    species = iris.target_names[pred]
    confidence = prob[pred]
    all_probs = ', '.join([f'{iris.target_names[j]}={p:.1%}' for j, p in enumerate(prob) if p > 0.01])
    print(f'Flower {i+1}: {flower} -> {species} (confidence: {confidence:.1%})')
    print(f'          Probabilities: {all_probs}')

# Save the model for deployment or later use
model_path = 'iris_model_v1.pkl'
joblib.dump(model, model_path)
model_size = os.path.getsize(model_path)
print(f'\nModel saved to {model_path} ({model_size:,} bytes)')

# Load and verify the saved model produces identical predictions
loaded_model = joblib.load(model_path)
loaded_predictions = loaded_model.predict(new_flowers)
assert np.array_equal(predictions, loaded_predictions), 'Loaded model produces different predictions!'
print(f'Loaded model verification: predictions match original ✓')

# Print the expected feature order for documentation
print(f'\nExpected input format: {iris.feature_names}')
print('Each prediction requires exactly 4 numeric values in this order.')
Output
=== Predictions on New Data ===
Flower 1: [5.1 3.5 1.4 0.2] -> setosa (confidence: 100.0%)
Probabilities: setosa=100.0%
Flower 2: [6.2 2.9 4.3 1.3] -> versicolor (confidence: 100.0%)
Probabilities: versicolor=100.0%
Flower 3: [7.7 3. 6.1 2.3] -> virginica (confidence: 100.0%)
Probabilities: virginica=100.0%
Flower 4: [5. 3.4 1.5 0.2] -> setosa (confidence: 100.0%)
Probabilities: setosa=100.0%
Flower 5: [5.9 3. 4.2 1.5] -> versicolor (confidence: 100.0%)
Probabilities: versicolor=100.0%
Model saved to iris_model_v1.pkl (2,847 bytes)
Loaded model verification: predictions match original ✓
Expected input format: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Each prediction requires exactly 4 numeric values in this order.
Production Model Saving Practices
  • Use joblib.dump to save models — it handles numpy arrays and scikit-learn objects efficiently
  • Version your model files with a suffix like _v1 — you will train improved models later
  • Always verify the loaded model produces identical predictions to the original before trusting it
  • Document the expected input format: a model trained on NumPy arrays, as here, carries no metadata about feature names or order
  • In production, the saved model file is loaded by your API server — you train once and serve many times
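One way to follow the documentation advice above is to save the model together with its expected input format. The sketch below is an assumption about how you might structure that: the bundle layout and key names are hypothetical, not a scikit-learn convention, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Trained on all 150 rows only to keep the sketch short
model = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

# Hypothetical bundle layout: these key names are illustrative
bundle = {
    'model': model,
    'feature_names': list(iris.feature_names),
    'target_names': list(iris.target_names),
    'sklearn_version': sklearn.__version__,
}

bundle_path = os.path.join(tempfile.mkdtemp(), 'iris_bundle_v1.pkl')
joblib.dump(bundle, bundle_path)

loaded = joblib.load(bundle_path)
print('Expected feature order:', loaded['feature_names'])
print('Trained with scikit-learn', loaded['sklearn_version'])
```

Whoever loads the file can then check the feature order and library version before calling predict(), instead of relying on a separate document staying in sync.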
Production Insight
predict() requires a 2D array — even for a single sample, wrap it in double brackets: [[5.1, 3.5, 1.4, 0.2]].
predict_proba() returns confidence scores that are useful for production filtering — reject predictions below a confidence threshold.
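joblib serialization preserves the fitted model state, so the loaded model reproduces the original predictions. Load the file with the same scikit-learn version that saved it, and never load a model file from an untrusted source, because unpickling can execute arbitrary code.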
Document the expected feature order alongside the saved model — a mismatch between input column order and training column order is a silent, devastating production bug.
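A minimal sketch of the confidence filtering described above. It uses a Random Forest rather than the Decision Tree because a fully grown tree reports only 0% or 100%; the 0.80 threshold is a hypothetical value chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

CONFIDENCE_THRESHOLD = 0.80  # hypothetical cutoff; tune it for your own use case

# A single sample still needs a 2D shape: (1, n_features)
single_flower = np.array([5.9, 3.0, 4.2, 1.5]).reshape(1, -1)
print('Single prediction:', iris.target_names[model.predict(single_flower)][0])

probabilities = model.predict_proba(X_test)  # shape: (n_samples, n_classes)
confidence = probabilities.max(axis=1)       # highest class probability per sample
predicted = probabilities.argmax(axis=1)     # column index; equals the label because classes are 0, 1, 2
accepted = confidence >= CONFIDENCE_THRESHOLD

print(f'Accepted: {accepted.sum()} of {len(X_test)} predictions')
for idx in np.where(~accepted)[0]:
    print(f'  Needs review: sample {idx}, top class '
          f'{iris.target_names[predicted[idx]]} at {confidence[idx]:.0%}')
```

In a real service, the rejected samples would typically be routed to a fallback rule or a human reviewer rather than silently dropped.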
Key Takeaway
predict() returns class labels, predict_proba() returns confidence scores for production filtering.
Save models with joblib — train once, version the file, load and predict many times.
Always verify the loaded model matches the original before deploying.

Step 9: Complete End-to-End Pipeline

This section combines all steps into a single, reproducible pipeline function. A complete ML pipeline loads data, explores it, splits it, trains a model, evaluates performance, compares algorithms, makes predictions, and saves the artifact — all in one script that produces consistent results every time. This is the template you will adapt for every future supervised classification project. The only things that change between projects are the dataset you load, the algorithms you compare, and the evaluation metrics appropriate for your problem. The workflow itself is identical whether you are classifying flowers, detecting fraud, or predicting customer churn.

complete_pipeline.py · PYTHON
# TheCodeForge — Complete First ML Project Pipeline
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib
import os


def run_iris_pipeline():
    """Complete ML pipeline for the Iris dataset — from data to saved model."""

    # Step 1: Load data
    iris = load_iris()
    X, y = iris.data, iris.target
    print(f'[1/8] Data loaded: {X.shape[0]} samples, {X.shape[1]} features, '
          f'{len(iris.target_names)} classes')

    # Step 2: Explore
    df = pd.DataFrame(X, columns=iris.feature_names)
    df['target'] = y
    class_dist = dict(zip(*np.unique(y, return_counts=True)))
    missing = df.isnull().sum().sum()
    print(f'[2/8] Class distribution: {class_dist} | Missing values: {missing}')

    # Step 3: Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f'[3/8] Split: {X_train.shape[0]} train, {X_test.shape[0]} test')

    # Step 4: Train primary model
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)
    print(f'[4/8] Trained: {model.__class__.__name__} '
          f'(depth={model.get_depth()}, leaves={model.get_n_leaves()})')

    # Step 5: Evaluate
    predictions = model.predict(X_test)
    test_acc = accuracy_score(y_test, predictions)
    print(f'[5/8] Test accuracy: {test_acc:.2%}')
    print(classification_report(y_test, predictions,
                                target_names=iris.target_names, zero_division=0))

    # Step 6: Cross-validate
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(model, X, y, cv=cv)
    print(f'[6/8] Cross-validation: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')

    # Step 7: Compare with Random Forest
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    rf_acc = accuracy_score(y_test, rf.predict(X_test))
    rf_cv = cross_val_score(rf, X, y, cv=cv).mean()
    print(f'[7/8] Comparison — Decision Tree: {test_acc:.2%} | '
          f'Random Forest: {rf_acc:.2%} (CV: {rf_cv:.2%})')

    # Step 8: Save model
    model_path = 'iris_model_v1.pkl'
    joblib.dump(model, model_path)
    size = os.path.getsize(model_path)
    print(f'[8/8] Model saved: {model_path} ({size:,} bytes)')

    return model, test_acc


if __name__ == '__main__':
    model, accuracy = run_iris_pipeline()
    print(f'\n{"=" * 50}')
    print(f'Pipeline complete. Final test accuracy: {accuracy:.2%}')
    print(f'Model ready for deployment: iris_model_v1.pkl')
    print(f'{"=" * 50}')
Output
[1/8] Data loaded: 150 samples, 4 features, 3 classes
[2/8] Class distribution: {0: 50, 1: 50, 2: 50} | Missing values: 0
[3/8] Split: 120 train, 30 test
[4/8] Trained: DecisionTreeClassifier (depth=5, leaves=9)
[5/8] Test accuracy: 96.67%
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
[6/8] Cross-validation: 96.00% (+/- 3.06%)
[7/8] Comparison — Decision Tree: 96.67% | Random Forest: 96.67% (CV: 96.67%)
[8/8] Model saved: iris_model_v1.pkl (2,847 bytes)
==================================================
Pipeline complete. Final test accuracy: 96.67%
Model ready for deployment: iris_model_v1.pkl
==================================================
This Pipeline Template Adapts to Any Classification Problem
  • The 8-step workflow is identical for every supervised classification project — only the data, algorithms, and metrics change
  • Wrap the pipeline in a function — it becomes testable, reusable, and callable from other scripts
  • Numbered progress output makes debugging easy — you know exactly which step failed
  • Always include an algorithm comparison — shipping the first thing you try is a professional anti-pattern
  • The model file is the deployment artifact — everything before it is development, everything after is production
Production Insight
Wrapping the pipeline in a function makes it testable — you can assert return values in unit tests.
Numbered progress output tells you exactly which step failed without reading stack traces.
This 8-step template adapts to any classification problem: change the dataset source, the algorithm list, and the evaluation metrics. The structure never changes.
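As an illustration of the testability point, here is a sketch of two unit tests around a pipeline function. train_and_evaluate() is a hypothetical, trimmed-down stand-in for run_iris_pipeline(), and the 0.85 accuracy floor is an assumed sanity threshold rather than a figure from this article.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_and_evaluate(random_state=42):
    """Hypothetical, trimmed-down stand-in for run_iris_pipeline()."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    model = DecisionTreeClassifier(random_state=random_state).fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


def check_returns_fitted_model_and_sane_accuracy():
    model, accuracy = train_and_evaluate()
    assert hasattr(model, 'tree_')   # attribute that exists only after fit()
    assert 0.85 <= accuracy <= 1.0   # loose floor; adjust for your own dataset


def check_is_reproducible():
    _, first = train_and_evaluate()
    _, second = train_and_evaluate()
    assert first == second           # fixed random_state gives identical results


if __name__ == '__main__':
    check_returns_fitted_model_and_sane_accuracy()
    check_is_reproducible()
    print('All pipeline checks passed')
```

Renaming the check functions with a test_ prefix would let a runner such as pytest collect them automatically.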
Key Takeaway
The complete pipeline: load, explore, split, train, evaluate, cross-validate, compare, save.
This template adapts to any classification problem by changing the dataset and algorithm.
The model file is the output artifact — train once, deploy many times, version always.

Why Your First Model Will Fail in Production (And How to Fix It)

You just trained a classifier on Iris. Accuracy: 97%. You're feeling good. Now deploy it. Two weeks later, the production pipeline is ingesting garbage: null values, outliers, categorical values your training data never saw. Your model chokes silently. Here's the cold truth: no dataset arrives clean. The real work isn't fitting a model; it's building a data validation layer that catches bad inputs and production drift before they corrupt predictions. In every ML project I've shipped, I've spent 40% of my time on data integrity checks. That's not overhead, that's insurance. Start now: write assertions that validate column types, value ranges, and missing-rate thresholds, and run them on every batch that reaches the model. Your model is only as good as the data it receives at inference time. If you don't guard that pipeline, you're shipping a time bomb.

production_data_validation.py · PYTHON
# io.thecodeforge
import pandas as pd
import numpy as np

def validate_inference_data(df, schema):
    """
    schema: dict of {column: {'type': type, 'min': val, 'max': val, 'max_null_rate': float}}
    Raises AssertionError if data fails checks.
    """
    for col, rules in schema.items():
        assert col in df.columns, f"Missing column: {col}"
        assert df[col].dtype == rules['type'], f"Type mismatch on {col}"
        max_null_rate = rules.get('max_null_rate', 0.0)
        null_rate = df[col].isnull().mean()
        assert null_rate <= max_null_rate, \
            f"Null rate {null_rate:.2f} exceeds {max_null_rate} on {col}"
        # Range-check only the non-null values; NaN would always fail between()
        value_range = (rules['min'], rules['max'])
        assert df[col].dropna().between(*value_range).all(), \
            f"Out-of-range values in {col}"
    return True

# Usage at inference hook
schema = {
    'sepal_length': {'type': float, 'min': 4.0, 'max': 8.0, 'max_null_rate': 0.01},
    'sepal_width':  {'type': float, 'min': 2.0, 'max': 5.0, 'max_null_rate': 0.01},
}
# incoming_df stands in for the batch arriving at inference time (illustrative values)
incoming_df = pd.DataFrame({'sepal_length': [5.1, 6.2], 'sepal_width': [3.5, 2.9]})
assert validate_inference_data(incoming_df, schema)
Output
No output if validation passes. Raises AssertionError on failure.
Production Trap:
Don't hardcode validation rules. Store them in a config file or feature store. When the business changes min/max thresholds, you want one source of truth, not a code scavenger hunt.
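A sketch of what keeping the rules outside the code might look like. The JSON layout, the validation_rules.json file name, and the find_violations() helper are all assumptions made for illustration; the rules are embedded as a string here only so the example runs on its own.

```python
import json

import pandas as pd

# Hypothetical contents of a validation_rules.json file kept outside the code
RULES_JSON = """
{
  "sepal_length": {"min": 4.0, "max": 8.0, "max_null_rate": 0.01},
  "sepal_width":  {"min": 2.0, "max": 5.0, "max_null_rate": 0.01}
}
"""


def find_violations(df, rules):
    """Return a list of human-readable rule violations (empty list = valid)."""
    violations = []
    for col, rule in rules.items():
        if col not in df.columns:
            violations.append(f'missing column: {col}')
            continue
        null_rate = df[col].isnull().mean()
        if null_rate > rule.get('max_null_rate', 0.0):
            violations.append(f'{col}: null rate {null_rate:.2f} too high')
        in_range = df[col].dropna().between(rule['min'], rule['max'])
        if not in_range.all():
            violations.append(f'{col}: {(~in_range).sum()} value(s) out of range')
    return violations


# In practice, read the rules from the config file with json.load()
rules = json.loads(RULES_JSON)
good_batch = pd.DataFrame({'sepal_length': [5.1, 6.2], 'sepal_width': [3.5, 2.9]})
bad_batch = pd.DataFrame({'sepal_length': [5.1, 42.0], 'sepal_width': [3.5, 2.9]})

print('Good batch:', find_violations(good_batch, rules))
print('Bad batch: ', find_violations(bad_batch, rules))
```

Returning a list of violations instead of raising on the first failure is a design choice: it lets the caller log every problem in a batch at once and decide whether to reject or quarantine it.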
Key Takeaway
Validate data at inference time — your model will thank you by not silently predicting nonsense.

How to Kill Overfitting Before It Kills Your Model

You trained a KNN classifier. 100% accuracy on the test set. Let me guess: you used the test set to pick hyperparameters? That's data leakage. You just optimized for the test set, not for generalization. Real ML projects separate data into three roles: training, validation, and holdout test. Validation data is for tuning hyperparameters; the test set gets touched exactly once, at the end. Here's the pattern: after splitting train/test (Step 4), either set aside 20% of your training data as a fixed validation split, or run cross-validation on the training data so every training sample takes a turn as validation data. The cross_val_tune_knn.py listing uses cross-validation to find the best K in KNN. Only then evaluate on the test set. If your test accuracy lands 10 points below your validation accuracy, you just caught overfitting. If it stays high, you've earned that number. This isn't theory: I've seen teams ship models with 99% accuracy that failed in production because they optimized on the test set. Don't be that team.
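The paragraph above mentions setting aside 20% of the training data as a validation split. A minimal sketch of that three-way split, assuming the same 80/20 proportions and random_state used elsewhere in this article (the variable names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# First cut: hold out the final test set (touched once, at the very end)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Second cut: carve a validation set out of the remaining training data
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=42, stratify=y_trainval
)
print(f'train={len(X_train)}, validation={len(X_val)}, test={len(X_test)}')

# Tune K against the validation set only
val_scores = {}
for k in [3, 5, 7, 9, 11]:
    candidate = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = candidate.score(X_val, y_val)
best_k = max(val_scores, key=val_scores.get)

# Refit on train + validation, then score the test set exactly once
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
test_accuracy = final_model.score(X_test, y_test)
print(f'Best K on validation: {best_k}')
print(f'Test accuracy: {test_accuracy:.3f}')
```

With only 24 validation samples, a single fixed split is noisy, which is why cross-validation on the training data is usually the better choice for a dataset this small.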

cross_val_tune_knn.py · PYTHON
# io.thecodeforge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# 1. Split into train+val vs holdout test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Use cross-validation on training split only
params = {'n_neighbors': [3, 5, 7, 9, 11]}
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid=params,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy'
)
grid.fit(X_train, y_train)

# 3. Best K found, now evaluate on untouched test set
best_k = grid.best_params_['n_neighbors']
final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"Best K: {best_k}, Test Accuracy: {test_accuracy:.3f}")
Output
Best K: 5, Test Accuracy: 0.967
Hard-Earned Wisdom:
If your cross-validation accuracy is 98% but test accuracy is 85%, suspect data leakage first. Common sources: scaling before splitting, target encoding on the full dataset, or using test data in feature selection.
Key Takeaway
Cross-validation on training data, then one final test set evaluation — that's the only way to trust your accuracy number.
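Because KNN is distance-based, the tuning step can also include feature scaling. The sketch below is one possible variant, not the article's listing: it wraps the scaler and classifier in a Pipeline so that GridSearchCV refits the scaler inside every fold, and it adds stratify=y to the split. The step names 'scaler' and 'clf' are arbitrary labels.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier()),
])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'clf__n_neighbors': [3, 5, 7, 9, 11]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)  # the scaler is refit inside every CV fold

# refit=True (the default) retrains the best pipeline on all of X_train
best_k = grid.best_params_['clf__n_neighbors']
test_accuracy = grid.score(X_test, y_test)
print('Best K:', best_k)
print(f'CV accuracy:   {grid.best_score_:.3f}')
print(f'Test accuracy: {test_accuracy:.3f}')
```

Since the fitted grid object already holds the refit best pipeline, there is no need to construct and train a separate final model before scoring the test set.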
● Production incident · POST-MORTEM · severity: high

First ML Project in Production — Model Reports 99% Accuracy but Fails Completely

Symptom
Model accuracy was 99.7% during development. After deployment, the model predicted the same class for every input regardless of feature values. Stakeholders lost trust in the ML team. The engineer could not reproduce the high accuracy outside the original notebook because the notebook's variable state had been reloaded from a cached run.
Assumption
The engineer assumed that measuring accuracy on the training data was valid evaluation. They did not know about the train-test split concept. They believed high accuracy on any data meant the model would generalize to production. No one on the team reviewed the evaluation methodology before the results were reported.
Root cause
The model was trained and evaluated on the exact same 150 samples — no held-out test set existed. The Decision Tree memorized every sample perfectly, achieving near-perfect accuracy by overfitting completely. When deployed with new, unseen input data, the model had learned nothing generalizable — it had memorized a lookup table, not a pattern. This is the single most common first-project mistake and it is entirely preventable with one function call.
Fix
1. Added train_test_split with test_size=0.2, random_state=42, and stratify=y
2. Trained on 120 samples, evaluated on 30 held-out samples the model had never seen
3. Accuracy dropped to 96.7% on test data, a realistic, honest, and still excellent number
4. Added 5-fold cross-validation for more robust evaluation before reporting any result
5. Added a code review checkpoint requiring that evaluation metrics come from held-out data before any result is shared with stakeholders
Key lesson
  • Never evaluate a model on the same data it was trained on — this is the most common form of data leakage
  • The train-test split is the single most important step in any ML project — skip it and every metric you report is a lie
  • A model that memorizes training data is a lookup table, not a machine learning model — it cannot generalize
  • Always have a second person verify the evaluation methodology before reporting results to stakeholders
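The gap between memorization and generalization described in this post-mortem is easy to reproduce. A minimal sketch, assuming the same split settings as the fix above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # data the tree has already seen
test_acc = model.score(X_test, y_test)     # held-out data
print(f'Training accuracy: {train_acc:.2%}  (memorization)')
print(f'Test accuracy:     {test_acc:.2%}  (generalization)')
print(f'Gap:               {train_acc - test_acc:.2%}')
```

An unconstrained tree keeps splitting until its leaves are pure, so the training score says almost nothing about new data. Only the held-out score is worth reporting.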
Production debug guide · Symptom to action mapping for common beginner issues · 6 entries
Symptom · 01
ModuleNotFoundError: No module named 'sklearn'
Fix
The package name on PyPI is scikit-learn, not sklearn. Install it with: pip install scikit-learn. Verify the active virtual environment is correct with 'which python' before installing. Verify installation with: python -c "import sklearn; print(sklearn.__version__)"
Symptom · 02
Model accuracy is 100% on training data
Fix
You are evaluating on the same data the model was trained on — this measures memorization, not learning. Use train_test_split to create a held-out test set and evaluate with model.score(X_test, y_test). Real accuracy will be lower, and that lower number is the honest one.
Symptom · 03
Model accuracy is very low — below 50% on a 3-class problem
Fix
Check three things in order: (1) verify the data is shuffled before splitting (train_test_split shuffles by default unless shuffle=False) and add stratify=y so every class appears in both sets; (2) verify you are passing features as X and labels as y in the correct order to fit(); (3) check whether feature scaling is required for your algorithm: Decision Trees do not need it, but SVM and KNN do.
Symptom · 04
ImportError or version conflicts between packages
Fix
Create a fresh virtual environment: python3.12 -m venv ml_env && source ml_env/bin/activate && pip install --upgrade pip && pip install scikit-learn pandas numpy matplotlib. Never install ML packages into system Python.
Symptom · 05
Predictions return integer labels (0, 1, 2) instead of species names
Fix
The model predicts numeric class indices, not string labels. Map them back: iris.target_names[prediction] converts 0 to 'setosa', 1 to 'versicolor', 2 to 'virginica'. This mapping is stored in the dataset object, not the model.
Symptom · 06
Results change every time the script runs
Fix
Set random_state=42 in both train_test_split and DecisionTreeClassifier. Without a fixed random seed, the data split and the tree construction are different on every run, making debugging impossible and results non-reproducible.
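The effect of a fixed seed, as described in the last entry above, can be verified in a few lines. This sketch compares the test sets produced by repeated splits; the seed value 7 is an arbitrary second seed chosen for contrast.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same seed twice: identical test sets
_, test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
_, test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Different seed: a different test set
_, test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)

same_seed_matches = np.array_equal(test_a, test_b)
different_seed_matches = np.array_equal(test_a, test_c)
print('Same seed, same split:     ', same_seed_matches)
print('Different seed, same split:', different_seed_matches)
```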
★ First ML Project Quick Diagnostics · Immediate checks to verify your ML project is set up correctly at each step
Need to verify Python and packages are installed correctly
Immediate action
Check Python version and all required package versions in one pass
Commands
python --version && pip list | grep -E 'scikit-learn|pandas|numpy|matplotlib'
python -c "import sklearn, pandas, numpy, matplotlib; print(f'sklearn: {sklearn.__version__}'); print(f'pandas: {pandas.__version__}'); print(f'numpy: {numpy.__version__}')"
Fix now
If any package is missing: pip install scikit-learn pandas numpy matplotlib
Need to verify data loaded correctly before training
Immediate action
Check dataset shape, feature names, class names, and class balance
Commands
python -c "from sklearn.datasets import load_iris; iris = load_iris(); print('Shape:', iris.data.shape); print('Features:', iris.feature_names); print('Classes:', list(iris.target_names))"
python -c "from sklearn.datasets import load_iris; import numpy as np; iris = load_iris(); unique, counts = np.unique(iris.target, return_counts=True); print('Class distribution:', dict(zip(iris.target_names, counts)))"
Fix now
Expected: shape (150, 4), 4 feature names, 3 classes with 50 samples each
Need to verify train-test split preserved class balance
Immediate action
Print shapes and class distributions for both training and test sets
Commands
python -c "from sklearn.datasets import load_iris; from sklearn.model_selection import train_test_split; import numpy as np; X, y = load_iris(return_X_y=True); X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y); print(f'Train: {X_tr.shape}, Test: {X_te.shape}')"
python -c "from sklearn.datasets import load_iris; from sklearn.model_selection import train_test_split; import numpy as np; X, y = load_iris(return_X_y=True); X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y); print('Train classes:', dict(zip(*np.unique(y_tr, return_counts=True)))); print('Test classes:', dict(zip(*np.unique(y_te, return_counts=True))))"
Fix now
Expected: Train (120, 4) with 40 per class, Test (30, 4) with 10 per class — if unbalanced, add stratify=y
Beginner-Friendly Classifier Comparison on Iris
Algorithm | Code | Typical Iris Accuracy | Interpretable | Scaling Required | Best For
Decision Tree | DecisionTreeClassifier(random_state=42) | 93-100% | Yes (visual tree structure) | No | First project, interpretability, debugging intuition
Random Forest | RandomForestClassifier(n_estimators=100) | 95-100% | Partial (feature importance only) | No | Better generalization, production baseline
Logistic Regression | Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())]) | 93-97% | Yes (feature coefficients) | Yes (requires Pipeline) | Linear boundaries, probability calibration
K-Nearest Neighbors | Pipeline([('scaler', StandardScaler()), ('clf', KNeighborsClassifier())]) | 93-97% | Somewhat (inspect neighbors) | Yes (requires Pipeline) | Small datasets, instance-based reasoning
Gradient Boosting | GradientBoostingClassifier(n_estimators=100) | 95-100% | Partial (feature importance) | No | Production tabular data, the 2026 default

Key takeaways

1. The 8-step ML workflow (load, explore, visualize, split, train, evaluate, compare, save) is identical for every supervised classification project
2. The train-test split is the most critical step: it separates memorization from generalization and makes every metric honest
3. Decision Tree is the best first algorithm: interpretable, no scaling required, and its feature importance confirms what visualization showed
4. Never ship the first algorithm you try: always compare at least two and use cross-validation for a fair comparison
5. Always evaluate on held-out test data, never on training data: training accuracy measures memorization, test accuracy measures learning
6. Save your trained model with joblib and version the file: train once, deploy many times, retrain when data changes

Common mistakes to avoid

6 patterns

Evaluating the model on training data instead of held-out test data

Symptom
Model shows 99-100% accuracy during development but performs far worse on real-world inputs after deployment. It memorized the training data like a lookup table instead of learning generalizable patterns.
Fix
Always use train_test_split with test_size=0.2 before training. Evaluate exclusively with model.score(X_test, y_test), never with model.score(X_train, y_train). Training accuracy is a measure of memorization, not learning — only test accuracy matters.
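A small sketch of the difference, assuming the Iris data and an unpruned Decision Tree: the training score is printed only to show why it should not be trusted, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows before any training happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Score on rows the model has already seen (memorization)...
train_acc = model.score(X_train, y_train)
# ...versus rows it has never seen (generalization).
test_acc = model.score(X_test, y_test)

print(f"Training accuracy: {train_acc:.3f}  (do not report this)")
print(f"Test accuracy:     {test_acc:.3f}  (report this)")
```

An unpruned tree can keep splitting until it fits its training rows, so the first number says little about how the model behaves on new flowers.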

Forgetting to set random_state for reproducibility

Symptom
Model accuracy changes every time you run the script. You cannot reproduce a result that worked yesterday. Debugging is impossible because the train-test split and the model internals both change randomly between runs.
Fix
Set random_state=42 (or any fixed integer) in train_test_split, DecisionTreeClassifier, RandomForestClassifier, and any other stochastic component. This ensures the same split, the same model, and the same results every time — which makes debugging deterministic.
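One way to check reproducibility, sketched under the assumption that a Random Forest stands in for "any stochastic component": the helper `run_once` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)


def run_once(seed):
    """Split, train, and predict with every random component pinned to `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test), y_test


first_pred, first_y = run_once(42)
second_pred, second_y = run_once(42)

# Same seed: same test rows and same predictions on every run.
identical = np.array_equal(first_pred, second_pred) and np.array_equal(first_y, second_y)
print("Runs are identical:", identical)
```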

Not using stratify=y in train_test_split for classification

Symptom
One class has zero or very few samples in the test set. Accuracy appears high but the model is never tested on that class. The confusion matrix has an empty row. Deployment reveals the model cannot predict the missing class.
Fix
Always use stratify=y in train_test_split for classification problems. This ensures all classes appear in both training and test sets in proportion to their original distribution. On a small dataset like Iris, omitting stratify skews the class counts in the test set; on smaller or imbalanced datasets a rare class can be missing from the test set entirely.
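The effect can be inspected directly by counting test-set labels with and without stratification. This is a sketch: the seed `0` and the helper name `class_counts_in_test_set` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)


def class_counts_in_test_set(stratify):
    """Return how many test samples fall in each of the 3 Iris classes."""
    _, _, _, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=stratify
    )
    return np.bincount(y_test, minlength=3)


plain_counts = class_counts_in_test_set(None)    # no stratification
stratified_counts = class_counts_in_test_set(y)  # proportions preserved

print("Without stratify:", plain_counts)
print("With stratify=y: ", stratified_counts)
```

With stratification, each of the three balanced classes contributes exactly 10 of the 30 test rows; without it, the counts depend on the random seed.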

Not exploring the data before training

Symptom
Model trains but produces nonsensical results. Missing values cause NaN predictions that cascade silently. Extreme outliers distort the learned decision boundaries. Class imbalance causes the model to predict only the majority class.
Fix
Always run three commands before training: df.describe() for statistical summary, df.isnull().sum() for missing values, and df['target'].value_counts() for class balance. Fix data issues before they become model issues — 2 minutes of exploration prevents 2 hours of debugging.
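A minimal sketch of those three checks on Iris, assuming the data is loaded as a pandas DataFrame through `load_iris(as_frame=True)`, which provides a `target` column alongside the four features:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame holds the features plus 'target'.
df = load_iris(as_frame=True).frame

summary = df.describe()                  # ranges, means, and spread per column
missing = df.isnull().sum()              # count of missing values per column
balance = df["target"].value_counts()    # samples per class

print(summary)
print("Missing values per column:")
print(missing)
print("Samples per class:")
print(balance)
```

On your own dataset, replace the loader with your file read and `target` with the name of your label column.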

Starting with a complex algorithm before understanding the workflow

Symptom
Spending hours configuring a neural network or tuning XGBoost hyperparameters before understanding train-test splits, evaluation metrics, or cross-validation. Debugging is impossible because you do not understand what the algorithm is doing or what the metrics mean.
Fix
Start with DecisionTreeClassifier — it is interpretable, requires no feature scaling, handles multi-class problems natively, and works with zero configuration. Understand the full 8-step workflow with a simple algorithm before experimenting with complex ones. The workflow matters more than the algorithm.
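To illustrate what "interpretable" means in practice, here is a sketch that prints the learned rules and feature importances. It uses scikit-learn's `export_text`, which is one of several ways to inspect a tree and may differ from the visualization used elsewhere in this article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# Default settings, no feature scaling, three classes handled natively.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Human-readable if/else rules the tree learned from the training rows.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)

# Relative contribution of each feature to the splits (sums to 1).
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Reading the printed rules next to your scatter plots is a quick sanity check that the model split on the features that looked separable.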

Shipping the first algorithm without comparing alternatives

Symptom
You trained a Decision Tree, got 95% accuracy, and assumed that was good enough. A Random Forest or Gradient Boosting model on the same data could have scored higher for about 5 extra lines of code, but you never checked.
Fix
Always compare at least two algorithms on the same data split using cross-validation. The comparison takes under a minute and occasionally reveals meaningful improvements. On Iris the difference is small — on real-world data, it can be significant.
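The comparison can be sketched as follows. Two choices here are assumptions rather than the article's exact setup: the folds are drawn from the training portion only, so the test set stays untouched for a final check, and the `candidates` dictionary is a hypothetical shortlist.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One shared fold definition so every model sees exactly the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # The scaler lives inside the Pipeline so it is refit on each training fold.
    "Logistic Regression": Pipeline(
        [("scaler", StandardScaler()), ("clf", LogisticRegression())]
    ),
}

results = {
    name: cross_val_score(model, X_train, y_train, cv=cv).mean()
    for name, model in candidates.items()
}

for name, mean_acc in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: {mean_acc:.3f} mean accuracy over 5 folds")
```

After picking a winner from the cross-validation scores, fit it on the full training set and score it once on `X_test`.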
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Walk me through the steps of your first ML project using the Iris datase...
Q02 · JUNIOR
Why did you choose a Decision Tree as your first algorithm?
Q03 · SENIOR
Explain the difference between accuracy, precision, recall, and F1-score...
Q04 · SENIOR
How would you adapt this pipeline for a real-world classification proble...
Q01 of 04 · JUNIOR

Walk me through the steps of your first ML project using the Iris dataset.

ANSWER
I followed an 8-step workflow: install packages in a virtual environment; load and explore the Iris dataset — 150 samples, 4 features, 3 balanced classes, no missing values; visualize features with scatter plots and histograms to confirm species are separable; split 80/20 with stratify=y and random_state=42; train a DecisionTreeClassifier; evaluate with accuracy (96.67%), confusion matrix (one versicolor misclassified as virginica), classification report (perfect setosa, 95% F1 for the other two), and 5-fold stratified cross-validation (96% mean); compare against a Random Forest; save the model with joblib. The most important lesson was that the train-test split is non-negotiable — without it, training accuracy is meaningless because the model memorizes instead of learning.
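The evaluation step in that answer can be sketched as below, assuming the same split and model settings. Exact numbers can vary with the scikit-learn version, so the sketch prints them rather than stating them.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Overall fraction of correct predictions on the held-out rows.
accuracy = accuracy_score(y_test, y_pred)

# Rows are true species, columns are predicted species.
matrix = confusion_matrix(y_test, y_pred)

# Precision, recall, and F1 per species.
report = classification_report(y_test, y_pred, target_names=list(iris.target_names))

print(f"Accuracy: {accuracy:.4f}")
print(matrix)
print(report)
```

Off-diagonal entries in the matrix show which species are confused with each other, which a single accuracy number hides.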
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. Do I need to know math to build this project?
02. How long does this project take to complete?
03. Can I use a different dataset instead of Iris?
04. What is the difference between model.score() and accuracy_score()?
05. How do I know if my model is good enough?
06. What should I learn after completing this project?