
Supervised vs Unsupervised Learning Explained — With Real Examples and Code

📍 Part of: ML Basics → Topic 2 of 25
Supervised vs unsupervised learning explained from scratch with analogies, Python code, and real examples.
🧑‍💻 Beginner-friendly — no prior ML / AI experience needed
In this tutorial, you'll learn
  • Supervised learning requires labelled data — inputs paired with known, validated correct outputs. Label quality sets the ceiling of model performance.
  • Unsupervised learning discovers patterns in unlabelled data — no answer key exists. The algorithm finds structure; humans must interpret whether it is meaningful.
  • Classification and regression are supervised tasks. Clustering, dimensionality reduction, and anomaly detection are unsupervised tasks.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Supervised learning trains on labelled data — each input has a known correct output
  • Unsupervised learning finds patterns in unlabelled data — no answers provided
  • Use supervised when you have labelled examples and need predictions (classification, regression)
  • Use unsupervised when you need to discover structure (clustering, dimensionality reduction)
  • Labelling is expensive — real-world ML projects typically spend more time collecting and labelling data than building models
  • Biggest mistake: using unsupervised methods when labelled data exists, or forcing labels where patterns should be discovered
Production Incident
Customer Segmentation Project Failed After Team Used Supervised Learning on Unlabelled Data
A marketing team spent 3 months labelling 50,000 customer records manually before realising they did not know what the correct segments were — and the model they shipped was predicting arbitrary categories with no business meaning.
Symptom
The team produced a classification model with 85% accuracy, but the predicted segments did not match any business-meaningful customer groups. Marketing campaigns based on these segments showed no improvement over random targeting. The model was technically working — it was confidently predicting the wrong thing.
Assumption
The team assumed that supervised learning was the correct approach because they wanted to 'predict customer segments.' They did not realise that segment discovery is an unsupervised problem — you do not know the segments in advance. They invented labels (High/Medium/Low value) based on gut feel and then trained a model to reproduce those gut-feel labels.
Root cause
The team forced arbitrary labels onto customers without validating that these categories reflected natural groupings in the data. The supervised model learned to reproduce the arbitrary labels faithfully — which is exactly what it is supposed to do. The problem was not the model; it was the label design. The actual customer segments (frequent small buyers, infrequent bulk buyers, lapsed high-value customers) were hidden in the data and required unsupervised clustering to reveal. No amount of supervised tuning could have fixed this because the labels themselves were the mistake.
Fix
Switched to K-Means clustering (unsupervised) on the same dataset. Discovered 5 natural customer segments with distinct purchasing behaviours — segments the business had not anticipated. Validated segments with domain experts over two working sessions. Built a follow-on supervised classifier trained on the validated cluster labels so new customers could be assigned to segments in real time. Marketing campaigns targeted to the discovered segments showed 3x improvement in conversion over the previous supervised approach.
Key Lesson
If you do not know the correct labels in advance, unsupervised learning is the right starting point — not label invention.
  • Supervised learning requires validated labels. Arbitrary labels produce arbitrary models that are confident about the wrong things.
  • Always validate whether your problem is prediction (supervised) or discovery (unsupervised) before choosing an approach.
  • The two paradigms often work in sequence — unsupervised to discover structure, supervised to operationalise it.
Production Debug Guide
Common signals that you chose the wrong paradigm.
Model accuracy is high but predictions are not actionable
Your labels may be arbitrary. Verify that labelled categories map to business-meaningful outcomes — not just internally consistent classifications. If the labels were invented rather than observed, the model learned to reproduce invented categories with high fidelity.
Spending more time labelling data than building models
Consider whether unsupervised methods can discover the structure you are trying to label. Run K-Means or DBSCAN on the unlabelled data first. If natural clusters emerge, label the cluster centroids rather than individual records — this can reduce labelling effort by 10-100x.
Clustering results change dramatically with small data additions
The data may not have stable natural clusters. Check silhouette scores across multiple runs with different random seeds. If scores are consistently low (below 0.3), apply dimensionality reduction with PCA before clustering, or consider whether the problem requires supervised prediction rather than discovery.
Classification model performs no better than random guessing on validation data
The features may not contain predictive signal for the chosen target. Do not immediately reach for a more complex model. Run unsupervised exploration first — PCA, t-SNE, and clustering can reveal whether any structure exists in the data at all, and what that structure correlates with.
The business cannot explain what the model's output categories mean
This is the unsupervised-in-disguise problem. The model is predicting categories that the business cannot interpret or act on. Go back to the problem definition. If the goal is discovery rather than prediction, restart with clustering and involve domain experts in interpreting the results.

Every recommendation you get on Netflix, every spam email that lands in your junk folder, and every fraud alert your bank sends you — all of these are powered by machine learning models. But not all machine learning works the same way. The single biggest fork in the road when building any ML system is deciding: do we have labelled data to learn from, or are we on our own? Getting this decision wrong does not just slow your project down — it can make your model completely useless, no matter how much compute you throw at it.

The core problem both approaches solve is teaching a computer to find patterns without explicitly programming every rule. Instead of writing 'if the email contains the word free AND the sender is unknown THEN mark as spam', you feed the machine examples and let it work out the rules itself. Supervised learning works when you already have examples with correct answers attached. Unsupervised learning works when you have mountains of raw data but nobody has sat down to label any of it — which, in the real world, is most of the time.

By the end of this article you will be able to explain the difference clearly in plain English, know exactly which approach to reach for given a problem, write working Python code for both paradigms from scratch, and avoid the three most common mistakes beginners make when choosing between them. No ML experience needed — we will build everything up piece by piece.

What is Supervised Learning?

Supervised learning trains a model on labelled data — every input example has a known correct output attached to it. The model learns the mapping from inputs to outputs, then applies that mapping to new, unseen data. The word 'supervised' refers to the fact that a human has already done the work of labelling — providing the answer key the model learns from.

The two main supervised tasks are classification (predicting categories) and regression (predicting numbers). Classification asks 'which category does this belong to?' — spam or not spam, will this customer churn or stay, is this tumour malignant or benign. Regression asks 'what number will this produce?' — what is this house worth, how many units will we sell next quarter, what temperature will it be tomorrow.

The quality of supervised learning is bounded by the quality of the labels. A perfectly tuned model trained on noisy or inconsistent labels will faithfully reproduce those noisy labels. This is why experienced ML engineers treat label auditing as a first-class engineering task, not an afterthought.

io/thecodeforge/ml/supervised_example.py · PYTHON
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Supervised learning: classification with labelled data
# Dataset: predict whether a customer will churn (1) or stay (0)
# Features: usage_minutes, support_tickets, months_active

X = np.array([
    [120, 3, 24],   # moderate usage, few tickets, long tenure
    [45,  8,  6],   # low usage, many tickets, new customer
    [200, 1, 36],   # high usage, few tickets, long tenure
    [30, 12,  3],   # very low usage, many tickets, very new
    [180, 2, 18],   # high usage, few tickets, mid tenure
    [60,  7,  8],   # low usage, several tickets, new
    [250, 0, 48],   # very high usage, zero tickets, veteran
    [40, 10,  4],   # low usage, many tickets, new
])

# Labels: 0 = stayed, 1 = churned (the answer key)
# These labels were sourced from historical CRM records — not invented
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Split into training and test sets
# stratify=y preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train the model — it learns the mapping X -> y during fit()
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,          # prevent overfitting on small dataset
    random_state=42
)
model.fit(X_train, y_train)

# Predict on new data the model has never seen
predictions = model.predict(X_test)
print(f'Predictions: {predictions}')
print(f'Actual:      {y_test}')

# classification_report shows per-class precision, recall, F1
# Never rely only on accuracy — it hides class imbalance problems
print(classification_report(y_test, predictions,
      target_names=['stayed', 'churned']))

# Feature importance — which inputs drove predictions most?
for feature, importance in zip(
    ['usage_minutes', 'support_tickets', 'months_active'],
    model.feature_importances_
):
    print(f'  {feature}: {importance:.3f}')
Mental Model
Supervised Learning as Function Approximation
Supervised learning finds a function f such that f(inputs) approximately equals the known outputs. Everything else is implementation detail.
  • Training data = pairs of (input, correct_output) — the answer key the model learns from.
  • The model adjusts internal parameters to minimise the difference between its predictions and the correct outputs.
  • Once trained, the model predicts outputs for new inputs it has never seen.
  • Classification: output is a category — spam/not spam, churn/stay, fraud/legitimate.
  • Regression: output is a number — house price, revenue forecast, sensor reading.
  • The ceiling of model quality is set by label quality — a well-tuned model on bad labels produces bad predictions confidently.
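The 'adjusts internal parameters to minimise the difference' step can be made concrete with a toy sketch: a hand-rolled gradient-descent fit of f(x) = w·x + b on made-up noisy data. The learning rate and iteration count are arbitrary choices for this illustration, not recommended defaults.

```python
import numpy as np

# Toy supervised problem: learn f(x) = w*x + b from labelled pairs
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)   # true w=3, b=2, plus noise

w, b = 0.0, 0.0   # the model's internal parameters, starting from nothing
lr = 0.01         # learning rate

for _ in range(2000):
    pred = w * x + b
    error = pred - y                    # difference from the answer key
    w -= lr * (2 * error * x).mean()    # gradient of mean squared error w.r.t. w
    b -= lr * (2 * error).mean()        # gradient w.r.t. b

print(f'learned w={w:.2f}, b={b:.2f} (true values: 3, 2)')
```

Every supervised algorithm, from linear regression to deep networks, is some variation of this loop: predict, compare against the labels, adjust parameters to shrink the gap.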
📊 Production Insight
Supervised models are only as good as their labels.
Noisy labels from multiple annotators with no adjudication process silently degrade model accuracy — often by more than model architecture choices.
Rule: audit label quality and measure inter-annotator agreement before investing engineering time in model complexity. A cleaner dataset with a simpler model almost always beats a complex model on dirty labels.
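Measuring inter-annotator agreement is a one-liner with scikit-learn's cohen_kappa_score. A minimal sketch, with two hypothetical annotators and invented labels:

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labelled the same 12 records (0 = stayed, 1 = churned)
annotator_a = [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

# Kappa corrects raw agreement for agreement expected by chance
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")   # 0.50 for this pair
if kappa < 0.6:
    print('Agreement too low — adjudicate disagreements before training')
```

These annotators agree on 9 of 12 records, yet kappa is only 0.50 — below the 0.6 threshold commonly used as a minimum for reliable supervised training.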
🎯 Key Takeaway
Supervised learning requires labelled data — inputs paired with known, validated outputs.
It learns a mapping function and applies it to new, unseen data.
Label quality sets the ceiling of model performance — audit labels before tuning models.

What is Unsupervised Learning?

Unsupervised learning finds hidden patterns in data without any labels. The model has no answer key — it discovers structure on its own by finding data points that are similar to each other, or features that vary together, or records that behave differently from everything else.

The three main unsupervised tasks are clustering (grouping similar data points), dimensionality reduction (compressing many features into fewer while preserving structure), and anomaly detection (finding data points that deviate significantly from the norm).
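Clustering gets a full worked example below, so here is a quick sketch of the other two tasks with scikit-learn. The data is synthetic and purely illustrative, and the injected outlier value is an arbitrary extreme point chosen for the demonstration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 records with 10 correlated features (synthetic: rank-3 structure + noise)
base = rng.normal(0, 1, (200, 3))
X = base @ rng.normal(0, 1, (3, 10)) + rng.normal(0, 0.1, (200, 10))

# Dimensionality reduction: compress 10 features down to 2
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f'Variance explained by 2 components: '
      f'{pca.explained_variance_ratio_.sum():.1%}')

# Anomaly detection: flag records that deviate from the norm
X_with_outlier = np.vstack([X, np.full((1, 10), 15.0)])  # inject one outlier
iso = IsolationForest(random_state=42)
flags = iso.fit_predict(X_with_outlier)   # -1 = anomaly, 1 = normal
print(f'Records flagged as anomalous: {(flags == -1).sum()}')
```

Note that neither output comes with labels attached: the PCA axes and the anomaly flags still need a human to decide what they mean.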

The fundamental challenge of unsupervised learning is validation. With supervised learning, you compare predictions to known labels and compute accuracy. With unsupervised learning, there are no labels to compare against. You must use internal metrics like silhouette score, involve domain experts to validate whether discovered groups make business sense, or apply extrinsic evaluation by checking whether the discovered structure correlates with outcomes you care about.

This is why unsupervised results should never be shipped directly to production without human review. The algorithm finds groups — it cannot tell you whether those groups are meaningful.

io/thecodeforge/ml/unsupervised_example.py · PYTHON
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Unsupervised learning: clustering without labels
# Dataset: customer behaviour data — no predefined segments exist
# Features: annual_spend ($), visit_frequency (visits/year), avg_cart_value ($)

X = np.array([
    [5000,  50,  100],   # moderate spend, frequent, small carts
    [4800,  48,  100],
    [5200,  52,  100],
    [200,   12,   17],   # low spend, infrequent, very small carts
    [180,   10,   18],
    [220,   14,   16],
    [12000,  8, 1500],   # high spend, rare visits, very large carts
    [11500,  7, 1643],
    [12500,  9, 1389],
    [300,   45,    7],   # low spend, frequent, tiny carts (browse-heavy)
    [250,   40,    6],
    [280,   42,    7],
])

# IMPORTANT: scale features before clustering
# KMeans uses Euclidean distance — unscaled spend (0-12000) will
# completely dominate cart value (6-1643) and crush the signal
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal K using silhouette score
print('Searching for optimal K:')
print('K | Silhouette Score | Inertia')
print('--|-------------------|--------')
best_k, best_score = 2, -1
for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f'{k} | {score:.4f}             | {km.inertia_:.1f}')
    if score > best_score:
        best_score, best_k = score, k

print(f'\nBest K = {best_k} (silhouette = {best_score:.4f})')

# Fit with best K and inspect discovered segments
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

print('\nDiscovered cluster assignments:')
for i, cluster in enumerate(clusters):
    print(f'  Customer {i+1}: spend=${X[i][0]}, '
          f'visits={X[i][1]}, cart=${X[i][2]} -> Cluster {cluster}')

print('\nCluster profiles (domain expert interpretation needed):')
for i in range(best_k):
    members = X[clusters == i]
    print(f'  Cluster {i}: {len(members)} customers | '
          f'avg spend=${members[:, 0].mean():.0f} | '
          f'avg visits={members[:, 1].mean():.0f} | '
          f'avg cart=${members[:, 2].mean():.0f}')
Mental Model
Unsupervised Learning as Pattern Discovery
Unsupervised learning finds structure that humans did not explicitly define or label. The algorithm discovers — you interpret.
  • No labels exist — the algorithm discovers groups, patterns, or anomalies entirely on its own.
  • Clustering groups similar data points together — customer segments, document topics, gene expression profiles.
  • Dimensionality reduction compresses many features into fewer while preserving the relationships between data points.
  • Anomaly detection identifies records that deviate significantly from the established norm — useful for fraud, equipment failure, and data quality issues.
  • The discovered patterns must be interpreted by humans — the algorithm outputs Cluster 0, 1, 2 — not 'Frequent Browsers', 'Bulk Buyers', 'High-Value Loyalists'.
  • Validation without labels requires internal metrics (silhouette score) and external validation (domain expert review).
📊 Production Insight
Unsupervised results require human interpretation and business validation before any production use.
A cluster labelled 'Cluster 2' has zero business meaning without domain expert analysis.
Rule: always involve domain experts when interpreting unsupervised results. Build in two to three review sessions before using cluster assignments to drive any decision.
🎯 Key Takeaway
Unsupervised learning discovers patterns without labels — no answer key exists.
Clustering, dimensionality reduction, and anomaly detection are the main tasks.
The algorithm finds groups — humans must interpret what those groups mean and whether they are worth acting on.

Side-by-Side Comparison

The choice between supervised and unsupervised learning depends on your data, your goal, and your resources. These two paradigms are not competitors — they are tools for different jobs. Choosing the wrong one wastes months of engineering time on a fundamentally unsolvable problem.

The most important question is not 'which is more accurate?' It is 'what do I actually have and what do I actually need?' If you have validated labels and need to predict a known outcome, supervised learning is the answer. If you have raw data and want to discover structure you did not anticipate, unsupervised learning is the answer. If you have both needs, you likely need both paradigms working together.

io/thecodeforge/ml/paradigm_comparison.py · PYTHON
# Decision framework: supervised vs unsupervised
def recommend_approach(has_labels, goal, label_quality_validated,
                       labeling_budget_days):
    """
    Return the recommended ML paradigm based on your actual situation.

    Parameters
    ----------
    has_labels              : bool  — do labelled examples exist?
    goal                    : str   — what are you trying to accomplish?
    label_quality_validated : bool  — have labels been audited for quality?
    labeling_budget_days    : int   — days available for labelling effort
    """

    if has_labels and label_quality_validated and labeling_budget_days > 0:
        if goal in ['classify', 'predict_category', 'detect']:
            return {
                'approach': 'Supervised — Classification',
                'algorithms': [
                    'Logistic Regression (interpretable baseline)',
                    'Random Forest (robust, handles nonlinearity)',
                    'XGBoost (high performance, competition favourite)'
                ],
                'evaluation': 'Accuracy, Precision, Recall, F1, AUC-ROC',
                'watch_out': 'Class imbalance — always use stratified splits',
                'data_requirement': '500+ validated labelled examples per class'
            }
        elif goal in ['predict_number', 'forecast', 'estimate']:
            return {
                'approach': 'Supervised — Regression',
                'algorithms': [
                    'Linear Regression (interpretable baseline)',
                    'Gradient Boosting Regressor (high performance)',
                    'XGBoost / LightGBM (production default)'
                ],
                'evaluation': 'MAE, RMSE, R-squared — report all three',
                'watch_out': 'Outliers inflate RMSE — check both MAE and RMSE',
                'data_requirement': '1000+ labelled examples'
            }

    elif not has_labels or labeling_budget_days == 0:
        if goal in ['group', 'segment', 'discover_structure', 'explore']:
            return {
                'approach': 'Unsupervised — Clustering',
                'algorithms': [
                    'K-Means (fast, interpretable, assumes spherical clusters)',
                    'DBSCAN (finds arbitrarily shaped clusters, handles noise)',
                    'Hierarchical (no K needed, good for small datasets)'
                ],
                'evaluation': 'Silhouette Score, Inertia, Domain Expert Validation',
                'watch_out': 'Scale features first — distance metrics break on raw data',
                'data_requirement': 'Any volume — more data = more stable clusters'
            }
        elif goal in ['reduce_dimensions', 'visualize', 'compress',
                      'feature_engineering']:
            return {
                'approach': 'Unsupervised — Dimensionality Reduction',
                'algorithms': [
                    'PCA (linear, fast, variance explained is interpretable)',
                    't-SNE (nonlinear, good for visualization, slow on large data)',
                    'UMAP (nonlinear, faster than t-SNE, preserves global structure)'
                ],
                'evaluation': 'Variance Explained (PCA), Visual Cluster Separation',
                'watch_out': 't-SNE is for visualization only — do not use as features',
                'data_requirement': 'Any volume'
            }

    elif has_labels and not label_quality_validated:
        return {
            'approach': 'Audit labels first before choosing paradigm',
            'reason': 'Unvalidated labels may be arbitrary — training on them '
                      'produces a model that confidently predicts the wrong thing.',
            'next_step': 'Measure inter-annotator agreement. If kappa < 0.6, '
                         'your labels are not reliable enough for supervised training.'
        }

    # Hybrid: no labels and complex goal
    return {
        'approach': 'Hybrid — start unsupervised, then label discovered groups',
        'steps': [
            '1. Cluster the unlabelled data to discover natural groups.',
            '2. Validate clusters with domain experts.',
            '3. Label cluster centroids instead of individual records.',
            '4. Train a supervised classifier on the validated cluster labels.',
            '5. Use the classifier to assign new records to discovered segments.'
        ]
    }


# Examples
print(recommend_approach(
    has_labels=True, goal='classify',
    label_quality_validated=True, labeling_budget_days=10
))
print()
print(recommend_approach(
    has_labels=False, goal='segment',
    label_quality_validated=False, labeling_budget_days=0
))
print()
print(recommend_approach(
    has_labels=True, goal='classify',
    label_quality_validated=False, labeling_budget_days=5
))

Supervised Learning: Classification Deep Dive

Classification is the most common supervised task in production. The model learns to assign inputs to predefined categories, and that assignment drives real decisions — flag this email as spam, decline this transaction, call this customer before they leave. The critical decisions are: choosing the right algorithm, handling class imbalance, selecting the correct evaluation metric, and ensuring your labels are actually meaningful.

The most common mistake in classification is reporting only accuracy. On a dataset where 90% of records are class 0, a model that always predicts class 0 achieves 90% accuracy while being completely useless — it never catches a single class 1 instance. This is not a rare edge case. Fraud, disease, and churn are all rare events. Class imbalance is the norm in production, not the exception.

io/thecodeforge/ml/classification_deep_dive.py · PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Generate a realistic imbalanced classification dataset
# 90% class 0, 10% class 1 — typical of fraud or churn scenarios
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    weights=[0.9, 0.1],   # 900 negative, 100 positive
    random_state=42
)

# stratify=y is mandatory on imbalanced data
# Without it, the test set might have no positive examples at all
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f'Train class distribution: '
      f'{dict(zip(*np.unique(y_train, return_counts=True)))}')
print(f'Test class distribution:  '
      f'{dict(zip(*np.unique(y_test, return_counts=True)))}')
print()

# Train two models: a simple baseline and a stronger model
models = [
    ('Logistic Regression (baseline)',
     LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)),
    ('Random Forest',
     RandomForestClassifier(n_estimators=100, class_weight='balanced',
                            random_state=42))
]

for name, model in models:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    probabilities = model.predict_proba(X_test)[:, 1]

    auc = roc_auc_score(y_test, probabilities)

    print(f'=== {name} ===')
    print(f'AUC-ROC: {auc:.4f}  '
          f'(0.5 = random, 1.0 = perfect)')
    print(classification_report(y_test, predictions,
                                target_names=['stayed', 'churned']))
    print(f'Confusion Matrix:')
    cm = confusion_matrix(y_test, predictions)
    print(f'  True Negatives:  {cm[0][0]:4d} | False Positives: {cm[0][1]:4d}')
    print(f'  False Negatives: {cm[1][0]:4d} | True Positives:  {cm[1][1]:4d}')
    print()

# KEY LESSON: a model that always predicts class 0 achieves 90% accuracy
# but AUC-ROC of 0.5 and recall of 0 for the positive class
class_zero_baseline = np.zeros(len(y_test), dtype=int)
print('=== Always-Predict-Zero Baseline ===')
print(f'Accuracy: {(class_zero_baseline == y_test).mean():.2%}  '
      f'(looks great — but catches zero positives)')
print(classification_report(y_test, class_zero_baseline,
                             target_names=['stayed', 'churned'],
                             zero_division=0))
⚠ Accuracy Is Deceptive on Imbalanced Data
On a dataset with 90% negative examples, a model that always predicts 'negative' achieves 90% accuracy and catches exactly zero positive cases. This model would pass a naive accuracy check and fail completely in production.
  • Always check precision and recall for the minority class.
  • Use AUC-ROC to evaluate the model's ability to rank positives above negatives across all thresholds.
  • Use class_weight='balanced' or oversampling (SMOTE) to compensate for imbalance during training.
  • Use stratified train/test splits to preserve class ratios in both sets.
📊 Production Insight
Class imbalance is the norm in production, not the exception.
Fraud detection, disease diagnosis, equipment failure, and churn prediction all have rare positive classes — typically 1-10% of total records.
Rule: never report only accuracy. Always show per-class precision, recall, F1, and AUC-ROC. If your stakeholder only looks at accuracy, educate them before the model ships.
🎯 Key Takeaway
Classification assigns inputs to predefined categories with known labels.
Always use stratified splits and per-class metrics on imbalanced data — accuracy alone is dangerously misleading.
AUC-ROC gives you a threshold-independent view of classification quality — use it alongside F1 for the minority class.

Supervised Learning: Regression Deep Dive

Regression predicts continuous numbers. The model learns a function that maps input features to a numeric output — not a category, a specific value. The output could be a house price, a delivery time estimate, a sales forecast, or a sensor reading. The model's quality is judged by how close its numeric predictions are to the true values.

The key decisions in regression are: choosing the loss function (MSE vs MAE vs Huber), handling outliers that distort gradient updates, preventing overfitting when features are many and data is sparse, and scaling features so that different-range inputs do not dominate each other. A regression model trained on unscaled features where income ranges from 0 to 500,000 and age ranges from 0 to 100 will behave as if income matters 5,000x more than age — not because income is more important, but because its raw numbers are larger.

io/thecodeforge/ml/regression_deep_dive.py · PYTHON
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Generate regression data with noise and some informative features
X, y = make_regression(
    n_samples=500,
    n_features=8,
    n_informative=4,   # only 4 of 8 features actually predict y
    noise=25,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def evaluate_regression(name, model, X_train, X_test, y_train, y_test):
    """Fit a model and print all three regression metrics."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    mae  = mean_absolute_error(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    r2   = r2_score(y_test, predictions)

    print(f'=== {name} ===')
    print(f'  MAE:  {mae:.2f}  (avg absolute error, same units as target)')
    print(f'  RMSE: {rmse:.2f}  (penalises large errors more than MAE)')
    print(f'  R²:   {r2:.4f}  (1.0 = perfect, 0.0 = predicts mean, <0 = bad)')
    print()

# Model 1: Linear Regression without scaling — naive baseline
evaluate_regression(
    'Linear Regression (no scaling)',
    LinearRegression(), X_train, X_test, y_train, y_test
)

# Model 2: Ridge Regression with scaling inside a Pipeline
# Pipeline ensures the scaler is fit on training data only,
# preventing data leakage into the validation set
ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge',  Ridge(alpha=1.0))
])
evaluate_regression(
    'Ridge Regression (scaled, L2 regularisation)',
    ridge_pipeline, X_train, X_test, y_train, y_test
)

# Model 3: Gradient Boosting — handles nonlinearity and feature interactions
evaluate_regression(
    'Gradient Boosting Regressor',
    GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                              max_depth=4, random_state=42),
    X_train, X_test, y_train, y_test
)

# Cross-validation: more reliable than a single train/test split
gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                               max_depth=4, random_state=42)
cv_scores = cross_val_score(gb, X, y, cv=5, scoring='neg_mean_absolute_error')
print(f'5-Fold CV MAE: {-cv_scores.mean():.2f} +/- {cv_scores.std():.2f}')
print('(More reliable than a single test split)')
🔥 MAE vs RMSE: Which to Report
MAE (Mean Absolute Error) is in the same units as your target — if you are predicting house prices in dollars, MAE tells you the average dollar error directly. RMSE (Root Mean Squared Error) penalises large errors more heavily because it squares them before averaging. If your data has outliers, RMSE will look worse than MAE because the outlier errors dominate. Report both — MAE tells you the typical error, RMSE tells you how heavily large errors weigh. R² (coefficient of determination) tells you what fraction of the target variance your model explains — 1.0 is perfect, 0.0 means your model is no better than predicting the mean every time.
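To see the outlier sensitivity concretely, here is a minimal sketch with made-up numbers: five predictions where every error is exactly 1 unit, then the same predictions with one 9-unit miss.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 102.0, 98.0, 101.0, 99.0])

# Every prediction is off by exactly 1 unit
preds_clean = np.array([101.0, 101.0, 99.0, 100.0, 100.0])
# Same predictions, except the last one misses by 9 units
preds_outlier = np.array([101.0, 101.0, 99.0, 100.0, 90.0])

for name, preds in [('clean', preds_clean), ('one outlier', preds_outlier)]:
    mae = mean_absolute_error(y_true, preds)
    rmse = np.sqrt(mean_squared_error(y_true, preds))
    print(f'{name:12s} MAE={mae:.2f}  RMSE={rmse:.2f}')

# clean        MAE=1.00  RMSE=1.00
# one outlier  MAE=2.60  RMSE=4.12   <- the squared 9-unit miss dominates RMSE
```

One bad prediction moves MAE from 1.00 to 2.60 but RMSE from 1.00 to 4.12 — exactly the asymmetry described above.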
📊 Production Insight
Regression models are sensitive to feature scale.
Income (0-500,000) will numerically dominate age (0-100) without normalisation, even if both are equally informative.
Rule: always use a Pipeline that wraps the scaler and model together. This prevents data leakage — the scaler sees only training data during cross-validation, not the validation fold.
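A minimal sketch of that rule on synthetic data: the leaky version fits the scaler on all rows before cross-validation, while the Pipeline version refits the scaler inside each training fold. On this toy data the scores will be close — the point is the pattern, not a dramatic gap.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=20, random_state=0)

# Leaky: the scaler sees ALL rows, including future validation folds
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(Ridge(), X_leaky, y, cv=5,
                        scoring='neg_mean_absolute_error')

# Safe: the Pipeline refits the scaler on each training fold only
pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
safe = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_absolute_error')

print(f'Leaky scaling  MAE: {-leaky.mean():.2f}')
print(f'Pipeline       MAE: {-safe.mean():.2f}')
```

With more aggressive preprocessing (feature selection, target encoding, imputation) the leaky version can overstate performance badly, which is why the Pipeline habit matters even when the numbers look similar.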
🎯 Key Takeaway
Regression predicts continuous numbers, not categories.
Report MAE, RMSE, and R² together — each reveals a different aspect of model quality.
Always use a Pipeline for scaling to prevent data leakage from validation folds into the scaler fit.

Unsupervised Learning: Clustering Deep Dive

Clustering groups data points that are similar to each other without any labels guiding the process. The challenge is threefold: choosing the right number of clusters, validating that the discovered groups are stable and meaningful, and then interpreting what those groups represent in business terms.

K-Means is the most common starting point because it is fast, interpretable, and scales to large datasets. But K-Means makes assumptions that often do not hold in real data — it assumes clusters are spherical, roughly equal in size, and have similar density. When those assumptions break down, DBSCAN or hierarchical clustering produce better results.
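A quick illustration of those assumptions breaking, using scikit-learn's two-moons generator (two interleaved crescents — non-spherical by construction). The `eps` value is a hand-tuned assumption for this toy data, not a general default.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: equal-size, non-spherical clusters
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)

# K-Means cuts the moons with a straight boundary; DBSCAN follows density
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Adjusted Rand Index vs the known generating labels (1.0 = perfect recovery)
print(f'K-Means ARI: {adjusted_rand_score(y_true, km_labels):.2f}')
print(f'DBSCAN  ARI: {adjusted_rand_score(y_true, db_labels):.2f}')
```

DBSCAN recovers the crescents essentially perfectly here, while K-Means scores far lower — a concrete case of the spherical-cluster assumption failing.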

The most common mistake in clustering is choosing K arbitrarily — often defaulting to a round number like 5 or 10 — instead of measuring. Use the elbow method and silhouette score together. If they disagree, use domain knowledge as the tiebreaker — the number of clusters that makes the most business sense is the right answer.

io/thecodeforge/ml/clustering_deep_dive.py · PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Customer behaviour data — 3 natural segments exist in this dataset
# Features: annual_spend ($), visit_frequency (visits/year), avg_cart_value ($)
np.random.seed(42)
X = np.array([
    # Segment A: moderate spend, frequent, small carts
    [5000, 50, 100], [4800, 48, 100], [5200, 52, 100],
    [4900, 49,  98], [5100, 51, 102], [4700, 47,  99],
    # Segment B: low spend, infrequent, tiny carts
    [200, 12, 17], [180, 10, 18], [220, 14, 16],
    [190, 11, 17], [210, 13, 16], [230, 15, 15],
    # Segment C: high spend, rare visits, large carts
    [12000, 8, 1500], [11500, 7, 1643], [12500, 9, 1389],
    [11800, 8, 1475], [12200, 9, 1356], [11200, 7, 1600],
    # Segment D (browse-heavy): low spend, very frequent, tiny carts
    [300, 45, 7], [250, 40, 6], [280, 42, 7],
    [260, 41, 6], [310, 46, 7], [270, 43, 6],
])

# Scale features BEFORE clustering
# Distance-based algorithms are dominated by the largest-scale feature
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Method 1: Elbow plot (inertia vs K) ---
inertias = []
silhouettes = []
K_range = range(2, 8)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

print('K  | Inertia   | Silhouette')
print('---|-----------|----------')
for k, inertia, sil in zip(K_range, inertias, silhouettes):
    print(f'{k}  | {inertia:9.1f} | {sil:.4f}')

best_k = K_range[np.argmax(silhouettes)]
print(f'\nBest K by silhouette: {best_k}')

# --- Fit final model with best K ---
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

print(f'\nCluster profiles (needs domain expert interpretation):')
for i in range(best_k):
    members = X[clusters == i]
    print(f'  Cluster {i} — {len(members)} customers:')
    print(f'    Avg spend:     ${members[:, 0].mean():,.0f}')
    print(f'    Avg visits:    {members[:, 1].mean():.0f}/year')
    print(f'    Avg cart:      ${members[:, 2].mean():,.0f}')

# --- Visualise clusters using PCA (2D projection) ---
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f'\nPCA variance explained: '
      f'{pca.explained_variance_ratio_.sum():.1%} in 2 components')

fig, ax = plt.subplots(figsize=(8, 6))
colors = ['#e74c3c', '#2ecc71', '#3498db', '#f39c12']
for i in range(best_k):
    mask = clusters == i
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1],
               c=colors[i], label=f'Cluster {i}',
               s=100, edgecolors='black', linewidth=0.5)
ax.set_title(f'Customer Segments (K={best_k}) — PCA Projection',
             fontweight='bold')
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
ax.legend()
fig.tight_layout()
fig.savefig('clusters_pca.png', dpi=300, bbox_inches='tight')
plt.close(fig)
print('Saved clusters_pca.png')
💡 Choosing K: Elbow Method vs Silhouette Score
  • Elbow Method: plot inertia (within-cluster sum of squares) vs K. The point where improvement slows sharply — the 'elbow' — suggests the optimal K. The elbow is often ambiguous on real data.
  • Silhouette Score: measures how similar each point is to its own cluster versus the nearest other cluster. Ranges from -1 to 1. Above 0.5 is good. Above 0.7 is strong.
  • Always try both methods — if they agree, you have good evidence. If they disagree, use domain knowledge as the tiebreaker.
  • If no clear elbow exists and silhouette scores are uniformly low (below 0.3), the data may not have natural clusters. Dimensionality reduction before clustering often helps.
  • Visualise your final clusters with PCA or t-SNE — if clusters overlap heavily in 2D, they are probably not meaningfully separate in the original space.
📊 Production Insight
K-Means assumes spherical clusters of similar size and density — a set of assumptions that rarely holds in real customer or transactional data.
Irregularly shaped clusters, clusters of very different sizes, or data with significant noise require DBSCAN or hierarchical clustering.
Rule: always visualise clusters after fitting using PCA or t-SNE. If the clusters do not look visually separated in a 2D projection, do not ship them to the business.
🎯 Key Takeaway
Clustering discovers groups without labels — the algorithm finds structure, humans interpret it.
Use silhouette score and the elbow method together to choose K — never guess.
Always scale features before clustering and visualise results with PCA — trust metrics and plots, not the algorithm's confidence.

When to Use Which: A Decision Framework

The supervised vs unsupervised choice is not always binary. Many production systems combine both paradigms in sequence. The canonical pattern is: use unsupervised learning to discover structure you did not anticipate, validate those discoveries with domain experts, then build a supervised model on top of the validated structure to operationalise it at scale.

The framework below walks through the decision based on your actual situation — not what you wish your data looked like.

Supervised vs Unsupervised Decision Flowchart
If: You have labelled data with validated labels and need to predict a known target for new inputs.
Use: Supervised learning. Classification for categories, regression for numbers. Audit label quality first — arbitrary labels produce arbitrary models.

If: You have unlabelled data and want to discover groups, patterns, or structure you did not define in advance.
Use: Unsupervised learning. Start with clustering. Apply dimensionality reduction first if you have more than 20 features.

If: You have labelled data, but the labels were invented rather than observed from historical outcomes.
Use: Stop and audit the labels before doing anything else. Invented labels may not reflect real patterns. Consider running unsupervised clustering to see what structure actually exists in the data.

If: You have unlabelled data but need a production system that assigns new records to groups in real time.
Use: A hybrid approach — unsupervised learning to discover groups, domain experts to label the discovered groups, then a supervised classifier to assign new records efficiently.

If: You have a small amount of labelled data and a large amount of unlabelled data.
Use: Semi-supervised or active learning. Cluster the unlabelled data, label representative samples from each cluster, then train a supervised model. Use the model's uncertainty to guide which unlabelled records to label next.

If: Your supervised model's performance is unexpectedly poor and you cannot explain why.
Use: Unsupervised exploration before debugging the model. PCA and t-SNE visualisations often reveal whether any structure exists in the data at all. If no structure is visible, the problem may be fundamentally underdetermined.
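The hybrid pattern above — discover groups, have experts label them, then operationalise with a classifier — can be sketched end-to-end on synthetic data. The segment names are hypothetical stand-ins for expert-validated labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Step 1: discover groups in unlabelled data
X, _ = make_blobs(n_samples=600, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
cluster_ids = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

# Step 2 (offline): domain experts review each cluster and give it a business name.
# These names are hypothetical placeholders for expert-validated labels.
cluster_names = {0: 'bargain_hunters', 1: 'regulars', 2: 'big_spenders'}
y = np.array([cluster_names[c] for c in cluster_ids])

# Step 3: train a supervised classifier so new records get a segment in real time
clf = RandomForestClassifier(random_state=42).fit(X_scaled, y)

new_record = X_scaled[:1]  # pretend this is a freshly arrived customer
print(clf.predict(new_record))
```

The classifier now assigns segments in milliseconds per record, without rerunning clustering over the whole dataset each time.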

Common Pitfalls: What Beginners Get Wrong

Beginners make predictable mistakes when choosing between supervised and unsupervised learning. These mistakes waste months of engineering effort and produce models that cannot be deployed or that actively mislead decision-makers. The three most costly pitfalls are using supervised learning without validated labels, ignoring unsupervised methods when labelling is expensive, and evaluating unsupervised results with supervised metrics.

⚠ The Three Costliest Mistakes
1. Using supervised learning when labels are invented rather than observed — the model learns to confidently predict arbitrary categories.
2. Ignoring unsupervised methods when labelling would be cheaper applied to clusters than to individual records.
3. Evaluating clustering with accuracy or F1 — these metrics require ground-truth labels that do not exist in unsupervised settings.
Each of these mistakes can waste two to six months of engineering time on a fundamentally broken approach.

🎯 Key Takeaways

  • Supervised learning requires labelled data — inputs paired with known, validated correct outputs. Label quality sets the ceiling of model performance.
  • Unsupervised learning discovers patterns in unlabelled data — no answer key exists. The algorithm finds structure; humans must interpret whether it is meaningful.
  • Classification and regression are supervised tasks. Clustering, dimensionality reduction, and anomaly detection are unsupervised tasks.
  • The choice depends on whether you have validated labels and whether you need prediction or discovery. Invented labels produce arbitrary supervised models.
  • Accuracy is deceptive on imbalanced data — always check per-class precision, recall, and AUC-ROC for the minority class.
  • Unsupervised results require human interpretation and domain expert validation — never ship cluster assignments without review.
  • Many production systems combine both paradigms: unsupervised for exploration and feature engineering, supervised for prediction and operationalisation at scale.

⚠ Common Mistakes to Avoid

    Memorising syntax before understanding the concept
    Symptom

    You can copy-paste code that runs without errors but cannot explain why each line exists, what it does, or how to adapt it when the data or problem changes. The code works until it does not, and then you are stuck.

    Fix

    Read the concept explanation first and make sure you can explain it without looking at code. Then write the implementation from memory. Explain each line out loud as you write it. If you cannot articulate what a line does, you have not understood it — look it up, understand it, then write it again.

    Skipping practice and only reading theory
    Symptom

    The concepts feel familiar after reading. You nod along. But when you sit down with a real dataset and a blank notebook, nothing comes. Theory without implementation produces false confidence that evaporates on contact with real problems.

    Fix

    After reading each section, close the article and implement the code on a different dataset. Change the number of features, the number of classes, the algorithm. Observe what changes and what does not. The only way to build genuine understanding is to build the thing, break it, and fix it.

    Using supervised learning on invented labels without realising it
    Symptom

    The model trains without errors, achieves high accuracy, passes automated tests, and gets deployed. Then the business reports that the predictions are not useful. On investigation, the labels were based on someone's intuition rather than observed historical outcomes.

    Fix

    Before training, ask: 'Where did these labels come from? Are they observed historical outcomes, or did someone assign them based on judgment?' If the labels were invented, the model learned to reproduce the invention. Run unsupervised clustering first to discover what structure actually exists in the data, then decide whether labelling makes sense.

    Ignoring unsupervised methods when labelled data is expensive
    Symptom

    The project stalls for weeks or months because labelling 100,000 records one at a time costs more time or money than the project budget allows.

    Fix

    Run unsupervised clustering first to discover natural groups. Each cluster contains hundreds of similar records — label the cluster once rather than every individual record. This reduces the labelling effort by 10-100x. Then train a supervised classifier on the cluster-level labels to assign new records automatically.

    Evaluating unsupervised models with supervised metrics
    Symptom

    Attempting to compute accuracy or F1 on clustering results produces errors or nonsensical numbers. Or worse: someone computes accuracy by comparing cluster IDs to an arbitrary label and gets a misleading number that looks like evidence of model quality.

    Fix

    Use clustering-appropriate metrics. Silhouette score measures cohesion and separation without ground-truth labels (higher is better, maximum 1.0). Inertia measures within-cluster compactness (lower is better, use for elbow method). Domain expert validation confirms whether the groups make business sense. These three together give you a complete picture of clustering quality.

Interview Questions on This Topic

  • Q: Explain the difference between supervised and unsupervised learning with a real-world example of each. (Junior)
    Supervised learning trains on labelled data where each input has a known correct output attached. The model learns the mapping from inputs to labels and applies it to new data. A concrete example: email spam detection. The model trains on thousands of emails that humans have already labelled 'spam' or 'not spam'. It learns which feature combinations — sender patterns, word frequencies, link counts — predict each label, then classifies new emails automatically. Unsupervised learning finds patterns in unlabelled data without any known correct output. A concrete example: customer segmentation. You have purchase history for 500,000 customers but no predefined segments. The model groups customers by similarity in purchasing behaviour — frequency, spend, product categories — without being told what the groups should be. It might discover that high-frequency small-cart customers behave very differently from low-frequency large-cart customers, revealing a segment the business had not explicitly defined. The key difference is the label. Supervised requires a human to have already answered the question for training examples. Unsupervised discovers answers to questions the human had not thought to ask.
  • Q: A stakeholder asks you to build a model to 'predict customer segments.' How do you determine whether this is a supervised or unsupervised problem? (Mid-level)
    The first question I ask is: do we already know what the segments are and do we have historical examples of customers assigned to each segment? If the business has predefined segments with validated historical assignments — 'Premium', 'Standard', 'Budget' based on spend thresholds that have been used operationally for years — then I have labels. This is a supervised classification problem. I train on the historical assignments and predict which segment each new customer falls into. If the business wants to find segments they do not currently know exist — 'discover which natural groups are in our customer base' — then there are no labels, and segment discovery is an unsupervised problem. I use clustering to find natural groupings, validate the discovered segments with domain experts, and then present those segments as findings rather than predictions. In most real cases I have encountered, the request is the second kind phrased as the first kind. The stakeholder says 'predict segments' but means 'find segments'. My job is to clarify which problem we are actually solving before writing a line of code, because the two paradigms require fundamentally different data, methods, and validation approaches.
  • Q: You have 1 million unlabelled records and a budget to label 5,000 of them. How would you maximise model performance? (Senior)
    I would use a combination of unsupervised clustering and active learning to spend the labelling budget as efficiently as possible. First, I would run unsupervised clustering on all 1 million records to discover natural groupings. This gives me a map of the data's structure before I spend a single label. Second, instead of randomly selecting 5,000 records to label, I would sample strategically — proportionally from each cluster, ensuring every discovered group has labelled representatives. Random sampling on 1 million records with rare positive classes might give me zero examples of important minority groups. Cluster-proportional sampling prevents this. Third, I would train an initial supervised model on the first 2,500 strategically sampled labels. Then I would run inference on all unlabelled records and identify the records where the model is most uncertain — where the predicted probability is closest to 0.5 for binary classification. Those uncertain records contain the most information per label. I would use the remaining 2,500 labels on those uncertain records (active learning). This approach typically achieves 80-90% of the performance of a fully labelled 1 million-record dataset at 0.5% of the labelling cost. The key insight is that not all labels carry equal information — strategic selection dramatically outperforms random selection at small label budgets.
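The uncertainty-sampling step described in that answer can be sketched in a few lines, assuming a binary problem and a logistic-regression seed model trained on the first strategically labelled batch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

# Pretend only the first 100 records are labelled; the rest form the unlabelled pool
labelled_idx = np.arange(100)
pool_idx = np.arange(100, 2000)

model = LogisticRegression(max_iter=1000).fit(X[labelled_idx], y[labelled_idx])

# Uncertainty sampling: distance of P(class=1) from 0.5 — smaller = more uncertain
proba = model.predict_proba(X[pool_idx])[:, 1]
uncertainty = np.abs(proba - 0.5)

# Spend the next 50 labels on the records the model is least sure about
next_batch = pool_idx[np.argsort(uncertainty)[:50]]
print(f'Most uncertain record: index {next_batch[0]}, '
      f'P(class=1) = {proba[np.argsort(uncertainty)[0]]:.3f}')
```

In a real loop you would label `next_batch`, add it to the training set, retrain, and repeat until the budget runs out.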
  • Q: Can unsupervised learning results be used to improve a supervised model? (Senior)
    Yes, and in production this combination consistently outperforms either paradigm alone. There are four practical ways to combine them. First, cluster assignments as features. Run K-Means or DBSCAN on the training data and add the cluster membership as a categorical feature for the supervised model. The discovered group membership often captures nonlinear feature interactions that the supervised model cannot easily learn from raw features. Second, dimensionality reduction before supervised training. Run PCA on the raw features and use the principal components as inputs to the supervised model. This reduces multicollinearity, speeds up training, and can improve generalisation when the original feature space is high-dimensional relative to the training set size. Third, anomaly detection for label auditing. Run an unsupervised anomaly detector (Isolation Forest, Local Outlier Factor) on the training data. Records that are anomalous within their labelled class are strong candidates for mislabelling — they look nothing like the other members of their class. Reviewing these records often uncovers systematic labelling errors that are degrading model quality. Fourth, pseudo-labelling for semi-supervised learning. Train a supervised model on the small labelled set, use it to predict labels on the large unlabelled set with high confidence, add those high-confidence predictions to the training set, and retrain. Iterate. This leverages both paradigms to expand the effective training set without manual labelling.
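The first pattern in that answer — cluster assignments as extra features — can be sketched like this. Strictly, the K-Means should also be refit inside each CV fold; it is fit once here to keep the sketch short, and the improvement is not guaranteed on every dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=42)

# Baseline: raw features only
base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Augmented: append one-hot cluster membership as extra features
clusters = KMeans(n_clusters=8, random_state=42, n_init=10).fit_predict(X)
X_aug = np.hstack([X, np.eye(8)[clusters]])
aug = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()

print(f'Raw features:       accuracy {base:.3f}')
print(f'+ cluster features: accuracy {aug:.3f}')
```

The cluster-membership columns let a linear model exploit nonlinear groupings in feature space that it could not carve out on its own.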

Frequently Asked Questions

What is Supervised vs Unsupervised Learning in simple terms?

Supervised learning is studying with an answer key — you have examples with correct answers attached and the model learns to predict new ones. Unsupervised learning is discovering patterns on your own — you have data but no answers, and the algorithm groups things by similarity without being told what the groups should be. Supervised: teacher with labels. Unsupervised: no teacher, find structure yourself.

Which is better: supervised or unsupervised learning?

Neither is universally better — they solve different problems. Supervised learning is better when you have validated labels and need to predict a known outcome for new inputs (will this customer churn? is this transaction fraudulent?). Unsupervised learning is better when you have raw data and want to discover structure you did not anticipate (what natural customer groups exist? which transactions look anomalous?). In practice, most production ML systems use supervised learning for prediction and unsupervised learning for exploration, feature engineering, and anomaly detection — often together in the same pipeline.

Can I use both supervised and unsupervised learning together?

Yes, and you frequently should. Common production patterns include: running unsupervised clustering to discover natural groups, then labelling those groups and training a supervised classifier to assign new records; applying PCA (unsupervised dimensionality reduction) to compress features before training a supervised model; and using unsupervised anomaly detection to identify and review potential mislabels in the supervised training set. The two paradigms complement each other — unsupervised for discovery and exploration, supervised for prediction and operationalisation.

How much labelled data do I need for supervised learning?

It depends on the complexity of the problem, the number of classes, and the signal-to-noise ratio in your features. For simple binary classification with 10-20 informative features, 500-1,000 validated labelled examples per class is a reasonable starting point. For complex problems with many classes or subtle signal, you may need thousands per class. Transfer learning (starting from a pre-trained model) can reduce this requirement significantly — a fine-tuned language model may generalise well from 100-200 examples. If labelling is expensive, start with unsupervised exploration to understand your data's structure before committing to a labelling effort.

What is the silhouette score and why does it matter for clustering?

The silhouette score measures how well each data point fits its assigned cluster relative to the nearest other cluster. For each point it computes two distances: the average distance to other points in the same cluster (cohesion) and the average distance to points in the nearest different cluster (separation). The silhouette score is (separation - cohesion) / max(separation, cohesion). It ranges from -1 to +1. A score above 0.5 suggests reasonable clusters. Above 0.7 suggests well-separated clusters. Below 0.25 suggests the clusters are not meaningfully distinct. It matters because it is one of the few ways to evaluate clustering quality without ground-truth labels — it tells you whether the algorithm found real structure or just divided the data arbitrarily.
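That formula can be checked by hand on a tiny example — two tight one-dimensional clusters — against scikit-learn's per-point values.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Tiny 1-D example: two tight clusters, centred near 0 and near 10
X = np.array([[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Manual silhouette for the first point: (separation - cohesion) / max(...)
D = pairwise_distances(X)
cohesion = D[0, labels == 0][1:].mean()   # avg distance to own cluster (excl. self)
separation = D[0, labels == 1].mean()     # avg distance to the other cluster
manual = (separation - cohesion) / max(separation, cohesion)

print(f'Manual silhouette (point 0): {manual:.4f}')
print(f'sklearn per-point value:     {silhouette_samples(X, labels)[0]:.4f}')
print(f'Mean silhouette score:       {silhouette_score(X, labels):.4f}')
```

With cohesion 0.3 and separation 10.2, the first point scores about 0.97 — near-perfect, as expected for clusters this well separated.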

Naren — Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Introduction to Machine Learning · Next: ML Workflow — Data to Deployment
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged