Your First Machine Learning Project – Complete Step-by-Step (2026)
- The 8-step ML workflow — load, explore, visualize, split, train, evaluate, compare, save — is identical for every supervised classification project
- The train-test split is the most critical step — it separates memorization from generalization and makes every metric honest
- Decision Tree is the best first algorithm — interpretable, no scaling required, and its feature importance confirms what visualization showed
- Install Python 3.11 or 3.12, scikit-learn, pandas, numpy, and matplotlib — 4 core packages
- Load the Iris dataset — 150 flower samples with 4 features and 3 species labels
- Split data 80/20 with stratify — train on 80%, test on 20% to measure real generalization
- Train a Decision Tree classifier — three lines of code with scikit-learn
- Evaluate with accuracy, confusion matrix, classification report, and cross-validation
- Compare against a second algorithm to build the habit of never shipping the first thing you try
- Biggest mistake: skipping the train-test split — your model will memorize training data and fail on every real-world input
Need to verify Python and packages are installed correctly
python --version && pip list | grep -E 'scikit-learn|pandas|numpy|matplotlib'
python -c "import sklearn, pandas, numpy, matplotlib; print(f'sklearn: {sklearn.__version__}'); print(f'pandas: {pandas.__version__}'); print(f'numpy: {numpy.__version__}')"
Need to verify data loaded correctly before training
python -c "from sklearn.datasets import load_iris; iris = load_iris(); print('Shape:', iris.data.shape); print('Features:', iris.feature_names); print('Classes:', list(iris.target_names))"
python -c "from sklearn.datasets import load_iris; import numpy as np; iris = load_iris(); unique, counts = np.unique(iris.target, return_counts=True); print('Class distribution:', dict(zip(iris.target_names, counts)))"
Need to verify train-test split preserved class balance
python -c "from sklearn.datasets import load_iris; from sklearn.model_selection import train_test_split; import numpy as np; X, y = load_iris(return_X_y=True); X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y); print(f'Train: {X_tr.shape}, Test: {X_te.shape}')"
python -c "from sklearn.datasets import load_iris; from sklearn.model_selection import train_test_split; import numpy as np; X, y = load_iris(return_X_y=True); X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y); print('Train classes:', dict(zip(*np.unique(y_tr, return_counts=True)))); print('Test classes:', dict(zip(*np.unique(y_te, return_counts=True))))"
Production Debug Guide — symptom-to-action mapping for common beginner issues. If results look wrong, check whether feature scaling is required for your algorithm — Decision Trees do not need it, but SVM and KNN do.
Every ML engineer's career starts with one project that turns theory into working code. This guide walks through a complete end-to-end machine learning project using the Iris dataset and scikit-learn — from installing packages to saving a trained model to disk. No prior ML experience needed. No unexplained jargon. No math formulas. Each step has a clear purpose, runnable code, and a verifiable output so you know exactly what success looks like before moving on. You will load data, explore it, visualize it, split it, train a model, evaluate performance honestly, compare against a second algorithm, and make predictions on new data — the same workflow used in production at companies shipping real ML systems. In 2026, the tools have matured enough that the workflow itself is more important than any individual algorithm. Learn this workflow once and you can adapt it to any supervised learning problem you encounter.
Step 1: Install Python and Required Packages
Before writing any ML code, you need Python and four packages installed in an isolated environment. Python 3.11 or 3.12 is recommended in 2026 — both have broad library compatibility and improved performance over earlier versions. The four packages are scikit-learn (ML algorithms and evaluation), pandas (data manipulation and exploration), numpy (numerical operations that underpin everything), and matplotlib (visualization). Install them with a single pip command inside a virtual environment. The virtual environment step is not optional — installing ML packages into system Python causes conflicts that are painful to debug and can break your operating system's tools.
# Step 1: Create a virtual environment (mandatory, not optional)
python3.12 -m venv ml_first_project
source ml_first_project/bin/activate   # macOS/Linux
# ml_first_project\Scripts\activate    # Windows PowerShell

# Step 2: Upgrade pip before installing anything
pip install --upgrade pip setuptools wheel

# Step 3: Install all required packages with pinned versions
pip install scikit-learn==1.5.0 pandas==2.2.2 numpy==1.26.4 matplotlib==3.9.0

# Step 4: Verify every import works
python -c "
import sklearn
import pandas
import numpy
import matplotlib
print(f'scikit-learn: {sklearn.__version__}')
print(f'pandas: {pandas.__version__}')
print(f'numpy: {numpy.__version__}')
print(f'matplotlib: {matplotlib.__version__}')
print('All packages installed successfully')
"

# Step 5: Freeze versions for reproducibility
pip freeze > requirements.txt
echo "requirements.txt created with $(wc -l < requirements.txt) packages"
scikit-learn: 1.5.0
pandas: 2.2.2
numpy: 1.26.4
matplotlib: 3.9.0
All packages installed successfully
requirements.txt created with 24 packages
- Always create a virtual environment for each ML project — isolation prevents conflicts
- Upgrade pip before installing packages — old pip versions misresolve dependencies
- Pin versions in requirements.txt so the project works identically next month
- Never install ML packages with sudo or into system Python — it will eventually break something
Step 2: Load and Explore the Iris Dataset
The Iris dataset is the Hello World of machine learning — the first dataset every ML engineer trains on, and for good reason. It contains 150 samples of iris flowers with 4 measurements each: sepal length, sepal width, petal length, and petal width — all in centimeters. Each sample is labeled with one of three species: setosa, versicolor, or virginica. The dataset is perfectly balanced (50 per class), has no missing values, and has clear feature separation — which means you can focus on learning the workflow without fighting the data. scikit-learn includes this dataset built-in, so no download, no CSV parsing, and no network dependency is required. Exploring the data before training is not a nicety — it is the step that catches data quality issues, reveals class imbalance, and builds your intuition about what the model is going to learn.
# TheCodeForge — Step 2: Load and Explore the Iris Dataset
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load the dataset — built into scikit-learn, no download needed
iris = load_iris()

# Convert to DataFrame for easier exploration and display
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# 1. Dataset shape — how much data do we have?
print('=== Dataset Shape ===')
print(f'Samples: {df.shape[0]}, Features: {df.shape[1] - 1}')
print(f'Feature names: {iris.feature_names}')
print(f'Class names: {list(iris.target_names)}')

# 2. First 5 rows — what does the data look like?
print('\n=== First 5 Rows ===')
print(df.head().to_string())

# 3. Statistical summary — what are the value ranges?
print('\n=== Feature Statistics ===')
print(df.describe().round(2).to_string())

# 4. Class distribution — is the dataset balanced?
print('\n=== Class Distribution ===')
print(df['species'].value_counts().to_string())
balance_ratio = df['species'].value_counts().min() / df['species'].value_counts().max()
print(f'Balance ratio: {balance_ratio:.2f} (1.00 = perfectly balanced)')

# 5. Missing values — will any algorithms crash?
print('\n=== Missing Values ===')
missing = df.isnull().sum()
print(missing.to_string())
print(f'Total missing: {missing.sum()}')

# 6. Feature correlations — which features carry similar information?
print('\n=== Feature Correlations with Target ===')
df_numeric = df.copy()
df_numeric['target'] = iris.target
for col in iris.feature_names:
    corr = df_numeric[col].corr(df_numeric['target'])
    print(f'  {col}: {corr:.3f}')
print('\nPetal features have higher correlation with species — they will be more useful for classification.')
=== Dataset Shape ===
Samples: 150, Features: 4
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Class names: ['setosa', 'versicolor', 'virginica']
=== First 5 Rows ===
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
=== Feature Statistics ===
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.00 150.00 150.00 150.00
mean 5.84 3.06 3.76 1.20
std 0.83 0.44 1.77 0.76
min 4.30 2.00 1.00 0.10
max 7.90 4.40 6.90 2.50
=== Class Distribution ===
setosa 50
versicolor 50
virginica 50
Balance ratio: 1.00 (1.00 = perfectly balanced)
=== Missing Values ===
sepal length (cm) 0
sepal width (cm) 0
petal length (cm) 0
petal width (cm) 0
species 0
Total missing: 0
=== Feature Correlations with Target ===
sepal length (cm): 0.783
sepal width (cm): -0.426
petal length (cm): 0.949
petal width (cm): 0.956
Petal features have higher correlation with species — they will be more useful for classification.
- Shape — how many samples and features do you have? Is this enough data for the algorithm you plan to use?
- Class balance — are all classes represented equally? Imbalance causes misleading accuracy
- Missing values — will your algorithm crash or silently produce garbage on NaN inputs?
- Feature ranges — wildly different scales may require normalization for distance-based algorithms
- Correlations — which features actually relate to the target? High correlation means the feature is informative
df.describe(), df.isnull().sum(), and value_counts() are your three essential first commands.
Step 3: Visualize the Data
Visualization reveals patterns that summary statistics hide. A scatter plot of petal length versus petal width instantly shows that setosa is clearly separated from versicolor and virginica — while those two overlap slightly. This single plot tells you the classification task is feasible and that perfect accuracy may not be achievable because of the class overlap. Without visualization, you are training blind — you would not know whether your model is struggling because of the algorithm or because the classes genuinely overlap in feature space. In 2026, matplotlib remains the standard for static plots. For a first project, static plots saved as PNG files are more useful than interactive displays that disappear when the notebook restarts.
# TheCodeForge — Step 3: Visualize the Iris Dataset
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
colors = {'setosa': '#e74c3c', 'versicolor': '#3498db', 'virginica': '#2ecc71'}

# Plot 1: Feature distributions by species (2x2 histograms)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
fig.suptitle('Iris Dataset — Feature Distributions by Species', fontsize=14)
for idx, feature in enumerate(iris.feature_names):
    ax = axes[idx // 2, idx % 2]
    for species in iris.target_names:
        subset = df[df['species'] == species]
        ax.hist(subset[feature], alpha=0.6, label=species, color=colors[species],
                bins=15, edgecolor='white')
    ax.set_xlabel(feature)
    ax.set_ylabel('Count')
    ax.legend(fontsize=8)
plt.tight_layout()
plt.savefig('iris_distributions.png', dpi=150)
print('Saved: iris_distributions.png')

# Plot 2: Scatter plot — the most revealing single visualization
fig, ax = plt.subplots(figsize=(8, 6))
for species in iris.target_names:
    subset = df[df['species'] == species]
    ax.scatter(
        subset['petal length (cm)'], subset['petal width (cm)'],
        label=species, color=colors[species], alpha=0.7, s=60,
        edgecolors='white', linewidth=0.5
    )
ax.set_xlabel('Petal Length (cm)', fontsize=12)
ax.set_ylabel('Petal Width (cm)', fontsize=12)
ax.set_title('Petal Length vs Width — Clear Species Separation', fontsize=13)
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig('iris_scatter.png', dpi=150)
print('Saved: iris_scatter.png')

# Plot 3: Correlation heatmap — which features are related?
fig, ax = plt.subplots(figsize=(7, 5))
corr_matrix = df[iris.feature_names].corr()
im = ax.imshow(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(4))
ax.set_yticks(range(4))
ax.set_xticklabels([f.replace(' (cm)', '') for f in iris.feature_names], rotation=45, ha='right')
ax.set_yticklabels([f.replace(' (cm)', '') for f in iris.feature_names])
for i in range(4):
    for j in range(4):
        ax.text(j, i, f'{corr_matrix.iloc[i, j]:.2f}', ha='center', va='center', fontsize=10)
plt.colorbar(im, label='Correlation')
ax.set_title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('iris_correlation.png', dpi=150)
print('Saved: iris_correlation.png')

print('\nKey insight: setosa is clearly separated. Versicolor and virginica overlap slightly.')
print('This tells us classification is feasible but perfect accuracy may not be possible.')
Saved: iris_distributions.png
Saved: iris_scatter.png
Saved: iris_correlation.png
Key insight: setosa is clearly separated. Versicolor and virginica overlap slightly.
This tells us classification is feasible but perfect accuracy may not be possible.
- Scatter plots reveal whether classes are separable — overlapping classes mean even perfect algorithms will make mistakes
- Histograms show which features differentiate classes — petal measurements separate species far better than sepal measurements
- Correlation heatmaps reveal redundant features — highly correlated features carry similar information
- Save plots as PNG files — notebook displays disappear when sessions restart, saved files persist for documentation and README files
Step 4: Split Data into Training and Test Sets
The train-test split is the most critical single step in any ML project. It prevents data leakage — the model never sees test data during training, so the evaluation measures genuine generalization rather than memorization. The standard beginner split is 80% training and 20% test. scikit-learn's train_test_split function handles this with one line of code. Two parameters are non-negotiable: stratify=y ensures all classes appear in both sets in the original proportion, and random_state ensures reproducibility — the same split every time you run the code. Without this split, your model memorizes the data instead of learning patterns, and every metric you compute is a lie that will collapse the moment real-world data arrives.
# TheCodeForge — Step 4: Split Data into Training and Test Sets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split: 80% train, 20% test
# stratify=y: maintain class proportions in both sets
# random_state=42: same split every time for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('=== Train-Test Split ===')
print(f'Full dataset: {X.shape[0]} samples')
print(f'Training set: {X_train.shape[0]} samples ({X_train.shape[0]/X.shape[0]*100:.0f}%)')
print(f'Test set: {X_test.shape[0]} samples ({X_test.shape[0]/X.shape[0]*100:.0f}%)')

print('\n=== Class Distribution in Training Set ===')
for cls_idx, cls_name in enumerate(iris.target_names):
    print(f'  {cls_name}: {np.sum(y_train == cls_idx)} samples')

print('\n=== Class Distribution in Test Set ===')
for cls_idx, cls_name in enumerate(iris.target_names):
    print(f'  {cls_name}: {np.sum(y_test == cls_idx)} samples')

print('\n=== Verification ===')
print(f'Classes in train: {sorted(set(y_train))}')
print(f'Classes in test: {sorted(set(y_test))}')
# Split the sample indices with identical parameters to prove the two sets are disjoint
idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.2, random_state=42, stratify=y
)
print(f'Train and test overlap: {len(set(idx_train) & set(idx_test))} (should be 0)')
print('\nstratify=y ensures balanced classes. random_state=42 ensures reproducibility.')
=== Train-Test Split ===
Full dataset: 150 samples
Training set: 120 samples (80%)
Test set: 30 samples (20%)
=== Class Distribution in Training Set ===
setosa: 40 samples
versicolor: 40 samples
virginica: 40 samples
=== Class Distribution in Test Set ===
setosa: 10 samples
versicolor: 10 samples
virginica: 10 samples
=== Verification ===
Classes in train: [0, 1, 2]
Classes in test: [0, 1, 2]
Train and test overlap: 0 (should be 0)
stratify=y ensures balanced classes. random_state=42 ensures reproducibility.
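The stratify=y argument deserves a quick demonstration. The sketch below is my own illustration, not part of the tutorial: it builds an artificially imbalanced subset of Iris with only 5 virginica samples, then compares test-set class counts with and without stratification.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Build an imbalanced subset: all 50 setosa, all 50 versicolor, only 5 virginica
keep = np.concatenate([np.where(y == 0)[0], np.where(y == 1)[0], np.where(y == 2)[0][:5]])
X_small, y_small = X[keep], y[keep]

# Without stratify, a random 20% test set can misrepresent (or entirely miss) the rare class
_, _, _, y_te_plain = train_test_split(X_small, y_small, test_size=0.2, random_state=0)

# With stratify, each class keeps its original proportion in both sets
_, _, _, y_te_strat = train_test_split(
    X_small, y_small, test_size=0.2, random_state=0, stratify=y_small
)

print('No stratify:', dict(zip(*np.unique(y_te_plain, return_counts=True))))
print('Stratified :', dict(zip(*np.unique(y_te_strat, return_counts=True))))

# Stratification guarantees the rare class its proportional slot: 5 * 0.2 = 1 sample
assert np.sum(y_te_strat == 2) == 1
```

With stratification the rare class is guaranteed one slot in the 20% test set; without it, an unlucky random split can leave the test set with no virginica at all, making recall for that class undefined.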
Step 5: Train Your First ML Model
Training a model in scikit-learn requires three lines of code: import the algorithm, create an instance, call fit(). The Decision Tree classifier is the best first algorithm because it is interpretable (you can visualize the learned rules), requires no feature scaling, handles multi-class problems natively, and produces results good enough to validate the entire pipeline. The fit() method learns patterns from the training data — it reads every row, discovers decision rules that separate the classes, and stores those rules internally. After training, the model object contains everything needed to make predictions on any new data with the same feature structure.
# TheCodeForge — Step 5: Train Your First ML Model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Load and split data (same as Step 4)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a Decision Tree Classifier — three lines
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print('=== Model Trained ===')
print(f'Algorithm: {model.__class__.__name__}')
print(f'Training samples used: {X_train.shape[0]}')
print(f'Features per sample: {X_train.shape[1]}')
print(f'Classes learned: {list(iris.target_names[model.classes_])}')
print(f'Tree depth: {model.get_depth()}')
print(f'Number of leaves (decision endpoints): {model.get_n_leaves()}')

# Feature importance — which features did the tree use most?
print('\n=== Feature Importance ===')
for name, importance in sorted(
    zip(iris.feature_names, model.feature_importances_), key=lambda x: -x[1]
):
    bar = '█' * int(importance * 40)
    print(f'  {name:>20}: {importance:.3f} {bar}')

# Quick accuracy check on test data (detailed evaluation in Step 6)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print('\n=== Quick Accuracy Check ===')
print(f'Training accuracy: {train_acc:.2%}')
print(f'Test accuracy: {test_acc:.2%}')
print(f'Gap: {train_acc - test_acc:.2%}')
if train_acc - test_acc > 0.10:
    print('WARNING: Large train-test gap may indicate overfitting')
else:
    print('Gap is small — model generalizes well')

# Predict a single new sample
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])  # measurements in cm
prediction = model.predict(new_flower)
print(f'\nNew flower {new_flower[0]} -> {iris.target_names[prediction[0]]}')
=== Model Trained ===
Algorithm: DecisionTreeClassifier
Training samples used: 120
Features per sample: 4
Classes learned: ['setosa', 'versicolor', 'virginica']
Tree depth: 5
Number of leaves (decision endpoints): 9
=== Feature Importance ===
petal width (cm): 0.921 █████████████████████████████████████
petal length (cm): 0.065 ██
sepal length (cm): 0.014
sepal width (cm): 0.000
=== Quick Accuracy Check ===
Training accuracy: 100.00%
Test accuracy: 96.67%
Gap: 3.33%
Gap is small — model generalizes well
New flower [5.1 3.5 1.4 0.2] -> setosa
- fit(X_train, y_train) is the learning step — the model reads training data and discovers rules
- predict(X_new) is the exam step — the model applies those rules to data it has never seen
- Feature importance tells you which measurements the model relied on most — petal width dominates Iris classification
- The train-test accuracy gap measures overfitting — a gap above 10% is a warning sign
- random_state=42 ensures the same tree is built every time — critical for reproducibility
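Since interpretability is the main argument for starting with a Decision Tree, it is worth actually printing the learned rules. A minimal sketch using scikit-learn's export_text; the variable names are mine, and the exact thresholds printed depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, _, y_train, _ = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# export_text renders the tree as plain-text if/else rules —
# this is the interpretability the Decision Tree is praised for
rules = export_text(model, feature_names=iris.feature_names)
print(rules)
```

Reading the rules confirms what the feature-importance bars showed: petal measurements dominate the top splits.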
Step 6: Evaluate Model Performance
Evaluation measures how well your model generalizes to unseen data — it is the step that separates a toy experiment from a trustworthy model. Accuracy alone is insufficient — a confusion matrix reveals which specific classes the model confuses, and the classification report provides precision, recall, and F1-score per class. For the Iris dataset, expect 93-100% test accuracy depending on the random split. If accuracy is below 90%, something is wrong with the preprocessing or the split — not the algorithm. Cross-validation provides a more robust estimate by training and evaluating on multiple non-overlapping splits, reducing the chance that a single lucky or unlucky split distorts your results.
# TheCodeForge — Step 6: Evaluate Model Performance
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
)
import numpy as np
import matplotlib.pyplot as plt

# Load, split, train
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# 1. Accuracy — the simplest metric
accuracy = accuracy_score(y_test, predictions)
print('=== Accuracy ===')
print(f'Test accuracy: {accuracy:.2%} ({int(accuracy * len(y_test))}/{len(y_test)} correct)')

# 2. Confusion Matrix — which classes get confused?
print('\n=== Confusion Matrix ===')
cm = confusion_matrix(y_test, predictions)
print(f'{"":>12} {" ".join(iris.target_names)} <- Predicted')
for i, row in enumerate(cm):
    print(f'{iris.target_names[i]:>12}: {row} <- Actual')
print('Diagonal = correct predictions. Off-diagonal = mistakes.')

# Save confusion matrix as image
fig, ax = plt.subplots(figsize=(7, 5))
ConfusionMatrixDisplay.from_predictions(
    y_test, predictions, display_labels=iris.target_names, cmap='Blues', ax=ax
)
ax.set_title('Confusion Matrix — Iris Classification')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
print('Saved: confusion_matrix.png')

# 3. Classification Report — precision, recall, F1 per class
print('\n=== Classification Report ===')
print(classification_report(y_test, predictions, target_names=iris.target_names))

# 4. Cross-Validation — more robust than a single split
print('=== Cross-Validation (5-fold stratified) ===')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X, y, cv=cv, scoring='accuracy'
)
print(f'Fold scores: {cv_scores.round(3)}')
print(f'Mean accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')
print(f'Worst fold: {cv_scores.min():.2%}')
print(f'Best fold: {cv_scores.max():.2%}')

if cv_scores.mean() < accuracy - 0.05:
    print('\nWARNING: Test accuracy is notably higher than CV mean — the test set may be unusually easy.')
else:
    print('\nTest accuracy aligns with CV mean — results are reliable.')
=== Accuracy ===
Test accuracy: 96.67% (29/30 correct)
=== Confusion Matrix ===
setosa versicolor virginica <- Predicted
setosa: [10 0 0] <- Actual
versicolor: [ 0 9 1] <- Actual
virginica: [ 0 0 10] <- Actual
Diagonal = correct predictions. Off-diagonal = mistakes.
Saved: confusion_matrix.png
=== Classification Report ===
precision recall f1-score support
setosa 1.00 1.00 1.00 10
versicolor 1.00 0.90 0.95 10
virginica 0.91 1.00 0.95 10
accuracy 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30
=== Cross-Validation (5-fold stratified) ===
Fold scores: [0.967 0.967 0.9 0.967 1. ]
Mean accuracy: 96.00% (+/- 3.06%)
Worst fold: 90.00%
Best fold: 100.00%
Test accuracy aligns with CV mean — results are reliable.
- Accuracy = correct predictions divided by total predictions — a single number summary
- Confusion matrix shows which specific classes get confused with each other — setosa is never wrong, but one versicolor was misclassified as virginica
- Precision = of everything predicted as class X, what fraction actually was class X
- Recall = of everything that actually is class X, what fraction did the model find
- Cross-validation trains and tests on multiple splits for a more stable, reliable accuracy estimate
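The precision and recall definitions above can be verified by hand against a confusion matrix. A small sketch with made-up labels (the arrays below are illustrative, not taken from the Iris run):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy 3-class labels (0, 1, 2), chosen to produce a few mistakes
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 2, 2, 2, 1, 2])

cm = confusion_matrix(y_true, y_pred)

# Precision for class 2 = correct class-2 predictions / everything predicted as class 2 (column sum)
prec_2 = cm[2, 2] / cm[:, 2].sum()
# Recall for class 2 = correct class-2 predictions / everything actually class 2 (row sum)
rec_2 = cm[2, 2] / cm[2, :].sum()

print(f'precision(class 2) by hand: {prec_2:.3f}')
print(f'recall(class 2) by hand:    {rec_2:.3f}')

# Both hand-computed values match scikit-learn's per-class metrics
assert np.isclose(prec_2, precision_score(y_true, y_pred, average=None)[2])
assert np.isclose(rec_2, recall_score(y_true, y_pred, average=None)[2])
```

This is exactly the arithmetic classification_report performs for every class: columns of the confusion matrix give precision denominators, rows give recall denominators.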
Step 7: Compare a Second Algorithm
Never ship the first algorithm you try. Comparing at least two algorithms on the same data split builds the habit of model selection — one of the most important practices in production ML. A Random Forest is an excellent second algorithm to compare against the Decision Tree: it builds many trees and averages their predictions, reducing overfitting. The comparison takes 5 additional lines of code and immediately tells you whether the Decision Tree result is strong or whether a better algorithm would meaningfully improve performance. This step transforms a homework exercise into the beginning of a professional workflow.
# TheCodeForge — Step 7: Compare Multiple Algorithms
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# Load and split data (same split for fair comparison)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define algorithms to compare
# Note: Logistic Regression and KNN need feature scaling — use a Pipeline
algorithms = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=200, random_state=42))
    ]),
    'K-Nearest Neighbors': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', KNeighborsClassifier(n_neighbors=5))
    ]),
}

print('=== Algorithm Comparison ===')
print(f'{"Algorithm":<24} {"Test Acc":>10} {"CV Mean":>10} {"CV Std":>10}')
print('-' * 58)

results = {}
for name, algo in algorithms.items():
    algo.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, algo.predict(X_test))
    cv_scores = cross_val_score(algo, X, y, cv=cv, scoring='accuracy')
    results[name] = {
        'test_acc': test_acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
    print(f'{name:<24} {test_acc:>9.2%} {cv_scores.mean():>9.2%} {cv_scores.std():>9.2%}')

best = max(results, key=lambda k: results[k]['cv_mean'])
print(f'\nBest algorithm by CV mean: {best} ({results[best]["cv_mean"]:.2%})')
print('\nKey insight: on clean, balanced data like Iris, most algorithms perform similarly.')
print('On real-world messy data, gradient boosting typically wins for tabular problems.')
=== Algorithm Comparison ===
Algorithm Test Acc CV Mean CV Std
----------------------------------------------------------
Decision Tree 96.67% 96.00% 3.06%
Random Forest 96.67% 96.67% 2.11%
Logistic Regression 96.67% 97.33% 2.49%
K-Nearest Neighbors 96.67% 96.67% 2.11%
Best algorithm by CV mean: Logistic Regression (97.33%)
Key insight: on clean, balanced data like Iris, most algorithms perform similarly.
On real-world messy data, gradient boosting typically wins for tabular problems.
- Never ship the first algorithm you try — always compare at least two
- Use the same data split for all algorithms — otherwise the comparison is unfair
- CV mean is more reliable than test accuracy for comparison — it averages over multiple splits
- On Iris, most algorithms perform similarly because the data is clean and separable — real-world data shows larger gaps
- Algorithms that need scaling (KNN, Logistic Regression) must be wrapped in a Pipeline to prevent data leakage during CV
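The leakage point in the last bullet can be made concrete. The sketch below (my own illustration) compares a scaler fitted on the full dataset before cross-validation, which lets each fold's training data "see" test-fold statistics, against a scaler refitted inside each fold via a Pipeline. On a dataset as clean as Iris the two scores barely differ, but the second pattern is the only safe one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# WRONG: scaling the whole dataset first means each fold's mean/std
# were computed using rows that later act as that fold's test set
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_leaky, y, cv=cv)

# RIGHT: the Pipeline refits the scaler inside every fold,
# on that fold's training rows only
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier(n_neighbors=5)),
])
clean = cross_val_score(pipe, X, y, cv=cv)

print(f'Leaky CV mean: {leaky.mean():.2%}')
print(f'Clean CV mean: {clean.mean():.2%}')
```

On small or skewed real-world datasets the leaky version can overstate performance, which is why wrapping the scaler in the Pipeline is the habit to build now.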
Step 8: Make Predictions and Save the Model
The final step closes the loop: use the trained model to predict new unseen samples, and save the model to disk so you never have to retrain it. The predict() method accepts a 2D array of feature values and returns the predicted class. predict_proba() returns confidence scores — useful for production systems that need to filter low-confidence predictions. joblib saves the trained model as a file that can be loaded and used anywhere — in a script, a notebook, a FastAPI endpoint, or a scheduled batch prediction job. This step represents the complete ML workflow: from raw data to a reusable prediction artifact.
# TheCodeForge — Step 8: Make Predictions and Save the Model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import joblib
import os

# Load, split, train (same as previous steps)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict new flowers — data the model has never seen
new_flowers = np.array([
    [5.1, 3.5, 1.4, 0.2],  # typical setosa measurements
    [6.2, 2.9, 4.3, 1.3],  # typical versicolor measurements
    [7.7, 3.0, 6.1, 2.3],  # typical virginica measurements
    [5.0, 3.4, 1.5, 0.2],  # another setosa candidate
    [5.9, 3.0, 4.2, 1.5],  # versicolor or virginica?
])
predictions = model.predict(new_flowers)
probabilities = model.predict_proba(new_flowers)

print('=== Predictions on New Data ===')
for i, (flower, pred, prob) in enumerate(zip(new_flowers, predictions, probabilities)):
    species = iris.target_names[pred]
    confidence = prob[pred]
    all_probs = ', '.join(
        f'{iris.target_names[j]}={p:.1%}' for j, p in enumerate(prob) if p > 0.01
    )
    print(f'Flower {i+1}: {flower} -> {species} (confidence: {confidence:.1%})')
    print(f'  Probabilities: {all_probs}')

# Save the model for deployment or later use
model_path = 'iris_model_v1.pkl'
joblib.dump(model, model_path)
model_size = os.path.getsize(model_path)
print(f'\nModel saved to {model_path} ({model_size:,} bytes)')

# Load and verify the saved model produces identical predictions
loaded_model = joblib.load(model_path)
loaded_predictions = loaded_model.predict(new_flowers)
assert np.array_equal(predictions, loaded_predictions), 'Loaded model produces different predictions!'
print('Loaded model verification: predictions match original ✓')

# Document the expected input format
print(f'\nExpected input format: {iris.feature_names}')
print('Each prediction requires exactly 4 numeric values in this order.')
=== Predictions on New Data ===
Flower 1: [5.1 3.5 1.4 0.2] -> setosa (confidence: 100.0%)
Probabilities: setosa=100.0%
Flower 2: [6.2 2.9 4.3 1.3] -> versicolor (confidence: 100.0%)
Probabilities: versicolor=100.0%
Flower 3: [7.7 3. 6.1 2.3] -> virginica (confidence: 100.0%)
Probabilities: virginica=100.0%
Flower 4: [5. 3.4 1.5 0.2] -> setosa (confidence: 100.0%)
Probabilities: setosa=100.0%
Flower 5: [5.9 3. 4.2 1.5] -> versicolor (confidence: 100.0%)
Probabilities: versicolor=100.0%
Model saved to iris_model_v1.pkl (2,847 bytes)
Loaded model verification: predictions match original ✓
Expected input format: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Each prediction requires exactly 4 numeric values in this order.
- Use joblib.dump to save models — it handles numpy arrays and scikit-learn objects efficiently
- Version your model files with a suffix like _v1 — you will train improved models later
- Always verify the loaded model produces identical predictions to the original before trusting it
- Document the expected input format — saved models carry no metadata about feature names or order
- In production, the saved model file is loaded by your API server — you train once and serve many times
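The predict_proba scores from Step 8 can drive a simple confidence gate in production. A minimal sketch — the 0.8 threshold and the "route to human review" policy are assumptions to tune for your application:

```python
# Confidence gate sketch: act on high-confidence predictions only.
# The THRESHOLD value is an assumption — tune it per application.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

samples = np.array([[5.1, 3.5, 1.4, 0.2],
                    [5.9, 3.0, 4.2, 1.5]])
proba = model.predict_proba(samples)   # shape (n_samples, n_classes)
confidence = proba.max(axis=1)         # top-class probability per sample
THRESHOLD = 0.8
accepted = confidence >= THRESHOLD
for flower, conf, ok in zip(samples, confidence, accepted):
    status = 'accept' if ok else 'route to human review'
    print(f'{flower} -> confidence {conf:.1%}: {status}')
```

Unfiltered Decision Tree leaves are often pure, so its probabilities cluster at 100% — the same gate becomes more informative with ensembles like Random Forest, whose probabilities are averaged over many trees.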
predict_proba() returns confidence scores for production filtering.

Step 9: Complete End-to-End Pipeline
This section combines all steps into a single, reproducible pipeline function. A complete ML pipeline loads data, explores it, splits it, trains a model, evaluates performance, compares algorithms, makes predictions, and saves the artifact — all in one script that produces consistent results every time. This is the template you will adapt for every future supervised classification project. The only things that change between projects are the dataset you load, the algorithms you compare, and the evaluation metrics appropriate for your problem. The workflow itself is identical whether you are classifying flowers, detecting fraud, or predicting customer churn.
```python
# TheCodeForge — Complete First ML Project Pipeline
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib
import os


def run_iris_pipeline():
    """Complete ML pipeline for the Iris dataset — from data to saved model."""
    # Step 1: Load data
    iris = load_iris()
    X, y = iris.data, iris.target
    print(f'[1/8] Data loaded: {X.shape[0]} samples, {X.shape[1]} features, '
          f'{len(iris.target_names)} classes')

    # Step 2: Explore
    df = pd.DataFrame(X, columns=iris.feature_names)
    df['target'] = y
    class_dist = dict(zip(*np.unique(y, return_counts=True)))
    missing = df.isnull().sum().sum()
    print(f'[2/8] Class distribution: {class_dist} | Missing values: {missing}')

    # Step 3: Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f'[3/8] Split: {X_train.shape[0]} train, {X_test.shape[0]} test')

    # Step 4: Train primary model
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)
    print(f'[4/8] Trained: {model.__class__.__name__} '
          f'(depth={model.get_depth()}, leaves={model.get_n_leaves()})')

    # Step 5: Evaluate
    predictions = model.predict(X_test)
    test_acc = accuracy_score(y_test, predictions)
    print(f'[5/8] Test accuracy: {test_acc:.2%}')
    print(classification_report(y_test, predictions,
                                target_names=iris.target_names, zero_division=0))

    # Step 6: Cross-validate
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(model, X, y, cv=cv)
    print(f'[6/8] Cross-validation: {cv_scores.mean():.2%} (+/- {cv_scores.std():.2%})')

    # Step 7: Compare with Random Forest
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    rf_acc = accuracy_score(y_test, rf.predict(X_test))
    rf_cv = cross_val_score(rf, X, y, cv=cv).mean()
    print(f'[7/8] Comparison — Decision Tree: {test_acc:.2%} | '
          f'Random Forest: {rf_acc:.2%} (CV: {rf_cv:.2%})')

    # Step 8: Save model
    model_path = 'iris_model_v1.pkl'
    joblib.dump(model, model_path)
    size = os.path.getsize(model_path)
    print(f'[8/8] Model saved: {model_path} ({size:,} bytes)')

    return model, test_acc


if __name__ == '__main__':
    model, accuracy = run_iris_pipeline()
    print(f'\n{"=" * 50}')
    print(f'Pipeline complete. Final test accuracy: {accuracy:.2%}')
    print('Model ready for deployment: iris_model_v1.pkl')
    print(f'{"=" * 50}')
```
[1/8] Data loaded: 150 samples, 4 features, 3 classes
[2/8] Class distribution: {0: 50, 1: 50, 2: 50} | Missing values: 0
[3/8] Split: 120 train, 30 test
[4/8] Trained: DecisionTreeClassifier (depth=5, leaves=9)
[5/8] Test accuracy: 96.67%
precision recall f1-score support
setosa 1.00 1.00 1.00 10
versicolor 1.00 0.90 0.95 10
virginica 0.91 1.00 0.95 10
accuracy 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30
[6/8] Cross-validation: 96.00% (+/- 3.06%)
[7/8] Comparison — Decision Tree: 96.67% | Random Forest: 96.67% (CV: 96.67%)
[8/8] Model saved: iris_model_v1.pkl (2,847 bytes)
==================================================
Pipeline complete. Final test accuracy: 96.67%
Model ready for deployment: iris_model_v1.pkl
==================================================
- The 8-step workflow is identical for every supervised classification project — only the data, algorithms, and metrics change
- Wrap the pipeline in a function — it becomes testable, reusable, and callable from other scripts
- Numbered progress output makes debugging easy — you know exactly which step failed
- Always include an algorithm comparison — shipping the first thing you try is a professional anti-pattern
- The model file is the deployment artifact — everything before it is development, everything after is production
| Algorithm | Code | Typical Iris Accuracy | Interpretable | Scaling Required | Best For |
|---|---|---|---|---|---|
| Decision Tree | DecisionTreeClassifier(random_state=42) | 93-100% | Yes — visual tree structure | No | First project, interpretability, debugging intuition |
| Random Forest | RandomForestClassifier(n_estimators=100) | 95-100% | Partial — feature importance only | No | Better generalization, production baseline |
| Logistic Regression | Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())]) | 93-97% | Yes — feature coefficients | Yes — requires Pipeline | Linear boundaries, probability calibration |
| K-Nearest Neighbors | Pipeline([('scaler', StandardScaler()), ('clf', KNeighborsClassifier())]) | 93-97% | Somewhat — inspect neighbors | Yes — requires Pipeline | Small datasets, instance-based reasoning |
| Gradient Boosting | GradientBoostingClassifier(n_estimators=100) | 95-100% | Partial — feature importance | No | Production tabular data — the 2026 default |
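For the two algorithms in the table that require scaling, wrapping StandardScaler and the classifier in a Pipeline ensures the scaler is fit only on the training folds, never on test data. A minimal sketch (max_iter=1000 is an added safety margin, not part of the table's one-liner):

```python
# Pipeline bundles preprocessing with the model so scaling statistics
# are learned from training data only — no test-set leakage.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = Pipeline([
    ('scaler', StandardScaler()),                  # fit on X_train only
    ('clf', LogisticRegression(max_iter=1000)),    # sees scaled features
])
clf.fit(X_train, y_train)   # one call fits scaler and classifier together
print(f'Logistic Regression (scaled) accuracy: {clf.score(X_test, y_test):.2%}')
```

The same Pipeline object drops straight into cross_val_score, which re-fits the scaler inside every fold — doing the scaling manually outside the loop would leak information.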
🎯 Key Takeaways
- The 8-step ML workflow — load, explore, visualize, split, train, evaluate, compare, save — is identical for every supervised classification project
- The train-test split is the most critical step — it separates memorization from generalization and makes every metric honest
- Decision Tree is the best first algorithm — interpretable, no scaling required, and its feature importance confirms what visualization showed
- Never ship the first algorithm you try — always compare at least two and use cross-validation for a fair comparison
- Always evaluate on held-out test data, never on training data — training accuracy measures memorization, test accuracy measures learning
- Save your trained model with joblib and version the file — train once, deploy many times, retrain when data changes
⚠ Common Mistakes to Avoid
- Skipping the train-test split — your model will memorize training data and fail on every real-world input
Interview Questions on This Topic
- Q: Walk me through the steps of your first ML project using the Iris dataset. (Junior)
- Q: Why did you choose a Decision Tree as your first algorithm? (Junior)
- Q: Explain the difference between accuracy, precision, recall, and F1-score using the Iris classification results. (Mid-level)
- Q: How would you adapt this pipeline for a real-world classification problem? (Senior)
Frequently Asked Questions
Do I need to know math to build this project?
No. scikit-learn handles all the mathematics internally. You need to understand what each step does and what the metrics mean — not the formulas behind the algorithms. The Decision Tree algorithm learns if-then rules from your data automatically. You can build, evaluate, compare, and deploy this entire project without writing a single formula. Math becomes useful later when you need to tune hyperparameters with intention or diagnose why a model is underfitting — but for your first project, focus on mastering the 8-step workflow, not the algebra.
How long does this project take to complete?
30 to 45 minutes for a complete beginner who reads every explanation. Installing packages takes 5 minutes. Loading and exploring data takes 5 minutes. Visualization takes 5 minutes. Splitting, training, and evaluating take 10 minutes. Comparing algorithms takes 5 minutes. Making predictions and saving the model takes 5 minutes. The complete pipeline script runs in under 2 seconds. Most of the time is spent understanding what each step does and why it matters, not waiting for code to execute.
Can I use a different dataset instead of Iris?
Absolutely — and you should, as your second project. scikit-learn includes several built-in datasets: load_wine (wine classification, 178 samples, 13 features), load_digits (handwritten digit recognition, 1797 samples, 64 features), and load_breast_cancer (tumor classification, 569 samples, 30 features). The pipeline workflow is identical — only the dataset source and the number of classes change. For a more challenging next step, download a CSV from Kaggle and replace load_iris() with pd.read_csv() to practice with real-world data that has missing values and imbalanced classes.
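Swapping datasets really is a one-line change. A sketch using load_wine — everything after the loader is identical to the Iris pipeline:

```python
# Same workflow, different dataset: only the loader changes.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()                      # 178 samples, 13 features, 3 classes
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f'Wine test accuracy: {model.score(X_test, y_test):.2%}')
```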
What is the difference between model.score() and accuracy_score()?
They produce the same numeric result for classification. model.score(X_test, y_test) is a convenience method that calls predict internally and computes accuracy in one step. accuracy_score(y_test, predictions) is a standalone function from sklearn.metrics that takes pre-computed predictions. Use model.score() for quick checks. Use accuracy_score() when you have already called predict() and need the predictions array for other metrics like the confusion matrix or classification report. In production code, calling predict() once and reusing the predictions array for all metrics is more efficient than calling score() and predict() separately.
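A short sketch demonstrating the equivalence, and the predict-once pattern that reuses one predictions array for every metric:

```python
# model.score() and accuracy_score() agree; predicting once lets the
# same array feed accuracy, the confusion matrix, and any other metric.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

predictions = model.predict(X_test)          # predict once...
acc_a = model.score(X_test, y_test)          # convenience: predict + accuracy
acc_b = accuracy_score(y_test, predictions)  # reuses existing predictions
assert acc_a == acc_b                        # same number either way
cm = confusion_matrix(y_test, predictions)   # ...reuse the array for more metrics
print(f'Accuracy: {acc_b:.2%}\n{cm}')
```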
How do I know if my model is good enough?
For Iris, 90-100% test accuracy is expected because the data is clean, balanced, and well-separated. The real question is whether your model beats a meaningful baseline. The simplest baseline for classification is a model that always predicts the most common class — on balanced Iris, that is 33% accuracy. Your model should dramatically exceed this. For real-world problems, the baseline depends on the domain — existing heuristic rules, human expert accuracy, or the current production model. A model that does not beat the relevant baseline has learned nothing useful regardless of its absolute accuracy number. Cross-validation mean is more trustworthy than a single test split — if CV accuracy is much lower than test accuracy, the test split was unusually easy.
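scikit-learn ships the most-common-class baseline as DummyClassifier, so the comparison takes three extra lines. A sketch:

```python
# A most-frequent-class baseline: the floor any real model must beat.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# On balanced Iris the baseline lands at roughly one third.
print(f'Baseline (most frequent class): {baseline.score(X_test, y_test):.2%}')
print(f'Decision Tree:                  {model.score(X_test, y_test):.2%}')
```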
What should I learn after completing this project?
Three directions that each build directly on the workflow you learned here: (1) Try a harder dataset — load_breast_cancer or a Kaggle CSV with missing values and class imbalance, which forces you to add preprocessing and use F1 instead of accuracy. (2) Try gradient boosting — install xgboost or lightgbm and add them to your algorithm comparison; these are the production default for tabular data in 2026. (3) Deploy the model — wrap iris_model_v1.pkl in a FastAPI endpoint that accepts measurements as JSON and returns the predicted species; this turns a notebook exercise into a shipped product. Each of these extends one step of the 8-step pipeline without changing the overall structure.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.