
Supervised vs Unsupervised Learning Explained — With Real Examples and Code

📍 Part of: ML Basics → Topic 2 of 25
Supervised vs unsupervised learning explained from scratch with analogies, Python code, and real examples.
🧑‍💻 Beginner-friendly — no prior ML / AI experience needed
In this tutorial, you'll learn
  • Supervised learning requires labelled data — inputs paired with known, validated correct outputs. Label quality sets the ceiling of model performance.
  • Unsupervised learning discovers patterns in unlabelled data — no answer key exists. The algorithm finds structure; humans must interpret whether it is meaningful.
  • Classification and regression are supervised tasks. Clustering, dimensionality reduction, and anomaly detection are unsupervised tasks.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Supervised learning trains on labelled data — each input has a known correct output
  • Unsupervised learning finds patterns in unlabelled data — no answers provided
  • Use supervised when you have labelled examples and need predictions (classification, regression)
  • Use unsupervised when you need to discover structure (clustering, dimensionality reduction)
  • Labelling is expensive — real-world ML projects typically spend more time collecting and labelling data than building models
  • Biggest mistake: using unsupervised methods when labelled data exists, or forcing labels where patterns should be discovered
Production Incident
Customer Segmentation Project Failed After Team Used Supervised Learning on Unlabelled Data
A marketing team spent 3 months labelling 50,000 customer records manually before realising they did not know what the correct segments were — and the model they shipped was predicting arbitrary categories with no business meaning.
Symptom
The team produced a classification model with 85% accuracy, but the predicted segments did not match any business-meaningful customer groups. Marketing campaigns based on these segments showed no improvement over random targeting. The model was technically working — it was confidently predicting the wrong thing.
Assumption
The team assumed that supervised learning was the correct approach because they wanted to 'predict customer segments.' They did not realise that segment discovery is an unsupervised problem — you do not know the segments in advance. They invented labels (High/Medium/Low value) based on gut feel and then trained a model to reproduce those gut-feel labels.
Root cause
The team forced arbitrary labels onto customers without validating that these categories reflected natural groupings in the data. The supervised model learned to reproduce the arbitrary labels faithfully — which is exactly what it is supposed to do. The problem was not the model; it was the label design. The actual customer segments (frequent small buyers, infrequent bulk buyers, lapsed high-value customers) were hidden in the data and required unsupervised clustering to reveal. No amount of supervised tuning could have fixed this because the labels themselves were the mistake.
Fix
Switched to K-Means clustering (unsupervised) on the same dataset. Discovered 5 natural customer segments with distinct purchasing behaviours — segments the business had not anticipated. Validated segments with domain experts over two working sessions. Built a follow-on supervised classifier trained on the validated cluster labels so new customers could be assigned to segments in real time. Marketing campaigns targeted to the discovered segments showed 3x improvement in conversion over the previous supervised approach.
Key Lesson
If you do not know the correct labels in advance, unsupervised learning is the right starting point — not label invention.
  • Supervised learning requires validated labels. Arbitrary labels produce arbitrary models that are confident about the wrong things.
  • Always validate whether your problem is prediction (supervised) or discovery (unsupervised) before choosing an approach.
  • The two paradigms often work in sequence — unsupervised to discover structure, supervised to operationalise it.
Production Debug Guide
Common signals that you chose the wrong paradigm.
Model accuracy is high but predictions are not actionable
Your labels may be arbitrary. Verify that labelled categories map to business-meaningful outcomes — not just internally consistent classifications. If the labels were invented rather than observed, the model learned to reproduce invented categories with high fidelity.
Spending more time labelling data than building models
Consider whether unsupervised methods can discover the structure you are trying to label. Run K-Means or DBSCAN on the unlabelled data first. If natural clusters emerge, label the cluster centroids rather than individual records — this can reduce labelling effort by 10-100x.
Clustering results change dramatically with small data additions
The data may not have stable natural clusters. Check silhouette scores across multiple runs with different random seeds. If scores are consistently low (below 0.3), apply dimensionality reduction with PCA before clustering, or consider whether the problem requires supervised prediction rather than discovery.
Classification model performs no better than random guessing on validation data
The features may not contain predictive signal for the chosen target. Do not immediately reach for a more complex model. Run unsupervised exploration first — PCA, t-SNE, and clustering can reveal whether any structure exists in the data at all, and what that structure correlates with.
The business cannot explain what the model's output categories mean
This is the unsupervised-in-disguise problem. The model is predicting categories that the business cannot interpret or act on. Go back to the problem definition. If the goal is discovery rather than prediction, restart with clustering and involve domain experts in interpreting the results.

Every recommendation you get on Netflix, every spam email that lands in your junk folder, and every fraud alert your bank sends you — all of these are powered by machine learning models. But not all machine learning works the same way. The single biggest fork in the road when building any ML system is deciding: do we have labelled data to learn from, or are we on our own? Getting this decision wrong does not just slow your project down — it can make your model completely useless, no matter how much compute you throw at it.

The core problem both approaches solve is teaching a computer to find patterns without explicitly programming every rule. Instead of writing 'if the email contains the word free AND the sender is unknown THEN mark as spam', you feed the machine examples and let it work out the rules itself. Supervised learning works when you already have examples with correct answers attached. Unsupervised learning works when you have mountains of raw data but nobody has sat down to label any of it — which, in the real world, is most of the time.

By the end of this article you will be able to explain the difference clearly in plain English, know exactly which approach to reach for given a problem, write working Python code for both paradigms from scratch, and avoid the three most common mistakes beginners make when choosing between them. No ML experience needed — we will build everything up piece by piece.

What is Supervised Learning?

Supervised learning trains a model on labelled data — every input example has a known correct output attached to it. The model learns the mapping from inputs to outputs, then applies that mapping to new, unseen data. The word 'supervised' refers to the fact that a human has already done the work of labelling — providing the answer key the model learns from.

The two main supervised tasks are classification (predicting categories) and regression (predicting numbers). Classification asks 'which category does this belong to?' — spam or not spam, will this customer churn or stay, is this tumour malignant or benign. Regression asks 'what number will this produce?' — what is this house worth, how many units will we sell next quarter, what temperature will it be tomorrow.

The quality of supervised learning is bounded by the quality of the labels. A perfectly tuned model trained on noisy or inconsistent labels will faithfully reproduce those noisy labels. This is why experienced ML engineers treat label auditing as a first-class engineering task, not an afterthought.

io/thecodeforge/ml/supervised_example.py · PYTHON
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Supervised learning: classification with labelled data
# Dataset: predict whether a customer will churn (1) or stay (0)
# Features: usage_minutes, support_tickets, months_active

X = np.array([
    [120, 3, 24],   # moderate usage, few tickets, long tenure
    [45,  8,  6],   # low usage, many tickets, new customer
    [200, 1, 36],   # high usage, few tickets, long tenure
    [30, 12,  3],   # very low usage, many tickets, very new
    [180, 2, 18],   # high usage, few tickets, mid tenure
    [60,  7,  8],   # low usage, several tickets, new
    [250, 0, 48],   # very high usage, zero tickets, veteran
    [40, 10,  4],   # low usage, many tickets, new
])

# Labels: 0 = stayed, 1 = churned (the answer key)
# These labels were sourced from historical CRM records — not invented
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Split into training and test sets
# stratify=y preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train the model — it learns the mapping X -> y during fit()
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,          # prevent overfitting on small dataset
    random_state=42
)
model.fit(X_train, y_train)

# Predict on new data the model has never seen
predictions = model.predict(X_test)
print(f'Predictions: {predictions}')
print(f'Actual:      {y_test}')

# classification_report shows per-class precision, recall, F1
# Never rely only on accuracy — it hides class imbalance problems
print(classification_report(y_test, predictions,
      target_names=['stayed', 'churned']))

# Feature importance — which inputs drove predictions most?
for feature, importance in zip(
    ['usage_minutes', 'support_tickets', 'months_active'],
    model.feature_importances_
):
    print(f'  {feature}: {importance:.3f}')
Mental Model
Supervised Learning as Function Approximation
Supervised learning finds a function f such that f(inputs) approximately equals the known outputs. Everything else is implementation detail.
  • Training data = pairs of (input, correct_output) — the answer key the model learns from.
  • The model adjusts internal parameters to minimise the difference between its predictions and the correct outputs.
  • Once trained, the model predicts outputs for new inputs it has never seen.
  • Classification: output is a category — spam/not spam, churn/stay, fraud/legitimate.
  • Regression: output is a number — house price, revenue forecast, sensor reading.
  • The ceiling of model quality is set by label quality — a well-tuned model on bad labels produces bad predictions confidently.
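The 'adjusts internal parameters to minimise the difference' step can be made concrete with a toy sketch: a hand-rolled gradient-descent fit of f(x) = w·x + b on made-up noisy data. The learning rate and iteration count are arbitrary choices for this illustration, not recommended defaults.

```python
import numpy as np

# Toy supervised problem: learn f(x) = w*x + b from labelled pairs
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)   # true w=3, b=2, plus noise

w, b = 0.0, 0.0   # the model's internal parameters, starting from nothing
lr = 0.01         # learning rate

for _ in range(2000):
    pred = w * x + b
    error = pred - y                    # difference from the answer key
    w -= lr * (2 * error * x).mean()    # gradient of mean squared error w.r.t. w
    b -= lr * (2 * error).mean()        # gradient w.r.t. b

print(f'learned w={w:.2f}, b={b:.2f} (true values: 3, 2)')
```

Every supervised algorithm, from linear regression to deep networks, is some variation of this loop: predict, compare against the labels, adjust parameters to shrink the gap.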
📊 Production Insight
Supervised models are only as good as their labels.
Noisy labels from multiple annotators with no adjudication process silently degrade model accuracy — often by more than model architecture choices.
Rule: audit label quality and measure inter-annotator agreement before investing engineering time in model complexity. A cleaner dataset with a simpler model almost always beats a complex model on dirty labels.
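Measuring inter-annotator agreement is a one-liner with scikit-learn's cohen_kappa_score. A minimal sketch, with two hypothetical annotators and invented labels:

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labelled the same 12 records (0 = stayed, 1 = churned)
annotator_a = [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

# Kappa corrects raw agreement for agreement expected by chance
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")   # 0.50 for this pair
if kappa < 0.6:
    print('Agreement too low — adjudicate disagreements before training')
```

These annotators agree on 9 of 12 records, yet kappa is only 0.50 — below the 0.6 threshold commonly used as a minimum for reliable supervised training.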
🎯 Key Takeaway
Supervised learning requires labelled data — inputs paired with known, validated outputs.
It learns a mapping function and applies it to new, unseen data.
Label quality sets the ceiling of model performance — audit labels before tuning models.

What is Unsupervised Learning?

Unsupervised learning finds hidden patterns in data without any labels. The model has no answer key — it discovers structure on its own by finding data points that are similar to each other, or features that vary together, or records that behave differently from everything else.

The three main unsupervised tasks are clustering (grouping similar data points), dimensionality reduction (compressing many features into fewer while preserving structure), and anomaly detection (finding data points that deviate significantly from the norm).
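Clustering gets a full worked example below, so here is a quick sketch of the other two tasks with scikit-learn. The data is synthetic and purely illustrative, and the injected outlier value is an arbitrary extreme point chosen for the demonstration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 records with 10 correlated features (synthetic: rank-3 structure + noise)
base = rng.normal(0, 1, (200, 3))
X = base @ rng.normal(0, 1, (3, 10)) + rng.normal(0, 0.1, (200, 10))

# Dimensionality reduction: compress 10 features down to 2
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f'Variance explained by 2 components: '
      f'{pca.explained_variance_ratio_.sum():.1%}')

# Anomaly detection: flag records that deviate from the norm
X_with_outlier = np.vstack([X, np.full((1, 10), 15.0)])  # inject one outlier
iso = IsolationForest(random_state=42)
flags = iso.fit_predict(X_with_outlier)   # -1 = anomaly, 1 = normal
print(f'Records flagged as anomalous: {(flags == -1).sum()}')
```

Note that neither output comes with labels attached: the PCA axes and the anomaly flags still need a human to decide what they mean.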

The fundamental challenge of unsupervised learning is validation. With supervised learning, you compare predictions to known labels and compute accuracy. With unsupervised learning, there are no labels to compare against. You must use internal metrics like silhouette score, involve domain experts to validate whether discovered groups make business sense, or apply extrinsic evaluation by checking whether the discovered structure correlates with outcomes you care about.

This is why unsupervised results should never be shipped directly to production without human review. The algorithm finds groups — it cannot tell you whether those groups are meaningful.

io/thecodeforge/ml/unsupervised_example.py · PYTHON
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Unsupervised learning: clustering without labels
# Dataset: customer behaviour data — no predefined segments exist
# Features: annual_spend ($), visit_frequency (visits/year), avg_cart_value ($)

X = np.array([
    [5000,  50,  100],   # moderate spend, frequent, small carts
    [4800,  48,  100],
    [5200,  52,  100],
    [200,   12,   17],   # low spend, infrequent, very small carts
    [180,   10,   18],
    [220,   14,   16],
    [12000,  8, 1500],   # high spend, rare visits, very large carts
    [11500,  7, 1643],
    [12500,  9, 1389],
    [300,   45,    7],   # low spend, frequent, tiny carts (browse-heavy)
    [250,   40,    6],
    [280,   42,    7],
])

# IMPORTANT: scale features before clustering
# KMeans uses Euclidean distance — unscaled spend (0-12000) will
# completely dominate cart value (6-1643) and crush the signal
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal K using silhouette score
print('Searching for optimal K:')
print('K | Silhouette Score | Inertia')
print('--|-------------------|--------')
best_k, best_score = 2, -1
for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f'{k} | {score:.4f}             | {km.inertia_:.1f}')
    if score > best_score:
        best_score, best_k = score, k

print(f'\nBest K = {best_k} (silhouette = {best_score:.4f})')

# Fit with best K and inspect discovered segments
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

print('\nDiscovered cluster assignments:')
for i, cluster in enumerate(clusters):
    print(f'  Customer {i+1}: spend=${X[i][0]}, '
          f'visits={X[i][1]}, cart=${X[i][2]} -> Cluster {cluster}')

print('\nCluster profiles (domain expert interpretation needed):')
for i in range(best_k):
    members = X[clusters == i]
    print(f'  Cluster {i}: {len(members)} customers | '
          f'avg spend=${members[:, 0].mean():.0f} | '
          f'avg visits={members[:, 1].mean():.0f} | '
          f'avg cart=${members[:, 2].mean():.0f}')
Mental Model
Unsupervised Learning as Pattern Discovery
Unsupervised learning finds structure that humans did not explicitly define or label. The algorithm discovers — you interpret.
  • No labels exist — the algorithm discovers groups, patterns, or anomalies entirely on its own.
  • Clustering groups similar data points together — customer segments, document topics, gene expression profiles.
  • Dimensionality reduction compresses many features into fewer while preserving the relationships between data points.
  • Anomaly detection identifies records that deviate significantly from the established norm — useful for fraud, equipment failure, and data quality issues.
  • The discovered patterns must be interpreted by humans — the algorithm outputs Cluster 0, 1, 2 — not 'Frequent Browsers', 'Bulk Buyers', 'High-Value Loyalists'.
  • Validation without labels requires internal metrics (silhouette score) and external validation (domain expert review).
📊 Production Insight
Unsupervised results require human interpretation and business validation before any production use.
A cluster labelled 'Cluster 2' has zero business meaning without domain expert analysis.
Rule: always involve domain experts when interpreting unsupervised results. Build in two to three review sessions before using cluster assignments to drive any decision.
🎯 Key Takeaway
Unsupervised learning discovers patterns without labels — no answer key exists.
Clustering, dimensionality reduction, and anomaly detection are the main tasks.
The algorithm finds groups — humans must interpret what those groups mean and whether they are worth acting on.

Side-by-Side Comparison

The choice between supervised and unsupervised learning depends on your data, your goal, and your resources. These two paradigms are not competitors — they are tools for different jobs. Choosing the wrong one wastes months of engineering time on a fundamentally unsolvable problem.

The most important question is not 'which is more accurate?' It is 'what do I actually have and what do I actually need?' If you have validated labels and need to predict a known outcome, supervised learning is the answer. If you have raw data and want to discover structure you did not anticipate, unsupervised learning is the answer. If you have both needs, you likely need both paradigms working together.

io/thecodeforge/ml/paradigm_comparison.py · PYTHON
# Decision framework: supervised vs unsupervised
def recommend_approach(has_labels, goal, label_quality_validated,
                       labeling_budget_days):
    """
    Return the recommended ML paradigm based on your actual situation.

    Parameters
    ----------
    has_labels              : bool  — do labelled examples exist?
    goal                    : str   — what are you trying to accomplish?
    label_quality_validated : bool  — have labels been audited for quality?
    labeling_budget_days    : int   — days available for labelling effort
    """

    if has_labels and label_quality_validated and labeling_budget_days > 0:
        if goal in ['classify', 'predict_category', 'detect']:
            return {
                'approach': 'Supervised — Classification',
                'algorithms': [
                    'Logistic Regression (interpretable baseline)',
                    'Random Forest (robust, handles nonlinearity)',
                    'XGBoost (high performance, competition favourite)'
                ],
                'evaluation': 'Accuracy, Precision, Recall, F1, AUC-ROC',
                'watch_out': 'Class imbalance — always use stratified splits',
                'data_requirement': '500+ validated labelled examples per class'
            }
        elif goal in ['predict_number', 'forecast', 'estimate']:
            return {
                'approach': 'Supervised — Regression',
                'algorithms': [
                    'Linear Regression (interpretable baseline)',
                    'Gradient Boosting Regressor (high performance)',
                    'XGBoost / LightGBM (production default)'
                ],
                'evaluation': 'MAE, RMSE, R-squared — report all three',
                'watch_out': 'Outliers inflate RMSE — check both MAE and RMSE',
                'data_requirement': '1000+ labelled examples'
            }

    elif not has_labels or labeling_budget_days == 0:
        if goal in ['group', 'segment', 'discover_structure', 'explore']:
            return {
                'approach': 'Unsupervised — Clustering',
                'algorithms': [
                    'K-Means (fast, interpretable, assumes spherical clusters)',
                    'DBSCAN (finds arbitrarily shaped clusters, handles noise)',
                    'Hierarchical (no K needed, good for small datasets)'
                ],
                'evaluation': 'Silhouette Score, Inertia, Domain Expert Validation',
                'watch_out': 'Scale features first — distance metrics break on raw data',
                'data_requirement': 'Any volume — more data = more stable clusters'
            }
        elif goal in ['reduce_dimensions', 'visualize', 'compress',
                      'feature_engineering']:
            return {
                'approach': 'Unsupervised — Dimensionality Reduction',
                'algorithms': [
                    'PCA (linear, fast, variance explained is interpretable)',
                    't-SNE (nonlinear, good for visualization, slow on large data)',
                    'UMAP (nonlinear, faster than t-SNE, preserves global structure)'
                ],
                'evaluation': 'Variance Explained (PCA), Visual Cluster Separation',
                'watch_out': 't-SNE is for visualization only — do not use as features',
                'data_requirement': 'Any volume'
            }

    elif has_labels and not label_quality_validated:
        return {
            'approach': 'Audit labels first before choosing paradigm',
            'reason': 'Unvalidated labels may be arbitrary — training on them '
                      'produces a model that confidently predicts the wrong thing.',
            'next_step': 'Measure inter-annotator agreement. If kappa < 0.6, '
                         'your labels are not reliable enough for supervised training.'
        }

    # Hybrid: no labels and complex goal
    return {
        'approach': 'Hybrid — start unsupervised, then label discovered groups',
        'steps': [
            '1. Cluster the unlabelled data to discover natural groups.',
            '2. Validate clusters with domain experts.',
            '3. Label cluster centroids instead of individual records.',
            '4. Train a supervised classifier on the validated cluster labels.',
            '5. Use the classifier to assign new records to discovered segments.'
        ]
    }


# Examples
print(recommend_approach(
    has_labels=True, goal='classify',
    label_quality_validated=True, labeling_budget_days=10
))
print()
print(recommend_approach(
    has_labels=False, goal='segment',
    label_quality_validated=False, labeling_budget_days=0
))
print()
print(recommend_approach(
    has_labels=True, goal='classify',
    label_quality_validated=False, labeling_budget_days=5
))

Supervised Learning: Classification Deep Dive

Classification is the most common supervised task in production. The model learns to assign inputs to predefined categories, and that assignment drives real decisions — flag this email as spam, decline this transaction, call this customer before they leave. The critical decisions are: choosing the right algorithm, handling class imbalance, selecting the correct evaluation metric, and ensuring your labels are actually meaningful.

The most common mistake in classification is reporting only accuracy. On a dataset where 90% of records are class 0, a model that always predicts class 0 achieves 90% accuracy while being completely useless — it never catches a single class 1 instance. This is not a rare edge case. Fraud, disease, and churn are all rare events. Class imbalance is the norm in production, not the exception.

io/thecodeforge/ml/classification_deep_dive.py · PYTHON
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Generate a realistic imbalanced classification dataset
# 90% class 0, 10% class 1 — typical of fraud or churn scenarios
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    weights=[0.9, 0.1],   # 900 negative, 100 positive
    random_state=42
)

# stratify=y is mandatory on imbalanced data
# Without it, the test set might have no positive examples at all
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f'Train class distribution: '
      f'{dict(zip(*np.unique(y_train, return_counts=True)))}')
print(f'Test class distribution:  '
      f'{dict(zip(*np.unique(y_test, return_counts=True)))}')
print()

# Train two models: a simple baseline and a stronger model
models = [
    ('Logistic Regression (baseline)',
     LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)),
    ('Random Forest',
     RandomForestClassifier(n_estimators=100, class_weight='balanced',
                            random_state=42))
]

for name, model in models:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    probabilities = model.predict_proba(X_test)[:, 1]

    auc = roc_auc_score(y_test, probabilities)

    print(f'=== {name} ===')
    print(f'AUC-ROC: {auc:.4f}  '
          f'(0.5 = random, 1.0 = perfect)')
    print(classification_report(y_test, predictions,
                                target_names=['stayed', 'churned']))
    print(f'Confusion Matrix:')
    cm = confusion_matrix(y_test, predictions)
    print(f'  True Negatives:  {cm[0][0]:4d} | False Positives: {cm[0][1]:4d}')
    print(f'  False Negatives: {cm[1][0]:4d} | True Positives:  {cm[1][1]:4d}')
    print()

# KEY LESSON: a model that always predicts class 0 achieves 90% accuracy
# but AUC-ROC of 0.5 and recall of 0 for the positive class
class_zero_baseline = np.zeros(len(y_test), dtype=int)
print('=== Always-Predict-Zero Baseline ===')
print(f'Accuracy: {(class_zero_baseline == y_test).mean():.2%}  '
      f'(looks great — but catches zero positives)')
print(classification_report(y_test, class_zero_baseline,
                             target_names=['stayed', 'churned'],
                             zero_division=0))
⚠ Accuracy Is Deceptive on Imbalanced Data
On a dataset with 90% negative examples, a model that always predicts 'negative' achieves 90% accuracy and catches exactly zero positive cases. This model would pass a naive accuracy check and fail completely in production.
  • Always check precision and recall for the minority class.
  • Use AUC-ROC to evaluate the model's ability to rank positives above negatives across all thresholds.
  • Use class_weight='balanced' or oversampling (SMOTE) to compensate for imbalance during training.
  • Use stratified train/test splits to preserve class ratios in both sets.
📊 Production Insight
Class imbalance is the norm in production, not the exception.
Fraud detection, disease diagnosis, equipment failure, and churn prediction all have rare positive classes — typically 1-10% of total records.
Rule: never report only accuracy. Always show per-class precision, recall, F1, and AUC-ROC. If your stakeholder only looks at accuracy, educate them before the model ships.
🎯 Key Takeaway
Classification assigns inputs to predefined categories with known labels.
Always use stratified splits and per-class metrics on imbalanced data — accuracy alone is dangerously misleading.
AUC-ROC gives you a threshold-independent view of classification quality — use it alongside F1 for the minority class.

Supervised Learning: Regression Deep Dive

Regression predicts continuous numbers. The model learns a function that maps input features to a numeric output — not a category, a specific value. The output could be a house price, a delivery time estimate, a sales forecast, or a sensor reading. The model's quality is judged by how close its numeric predictions are to the true values.

The key decisions in regression are: choosing the loss function (MSE vs MAE vs Huber), handling outliers that distort gradient updates, preventing overfitting when features are many and data is sparse, and scaling features so that different-range inputs do not dominate each other. A regression model trained on unscaled features where income ranges from 0 to 500,000 and age ranges from 0 to 100 will behave as if income matters 5,000x more than age — not because income is more important, but because its raw numbers are larger.

io/thecodeforge/ml/regression_deep_dive.py · PYTHON
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Generate regression data with noise and some informative features
X, y = make_regression(
    n_samples=500,
    n_features=8,
    n_informative=4,   # only 4 of 8 features actually predict y
    noise=25,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def evaluate_regression(name, model, X_train, X_test, y_train, y_test):
    """Fit a model and print all three regression metrics."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    mae  = mean_absolute_error(y_test, predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    r2   = r2_score(y_test, predictions)

    print(f'=== {name} ===')
    print(f'  MAE:  {mae:.2f}  (avg absolute error, same units as target)')
    print(f'  RMSE: {rmse:.2f}  (penalises large errors more than MAE)')
    print(f'  R²:   {r2:.4f}  (1.0 = perfect, 0.0 = predicts mean, <0 = bad)')
    print()

# Model 1: Linear Regression without scaling — naive baseline
evaluate_regression(
    'Linear Regression (no scaling)',
    LinearRegression(), X_train, X_test, y_train, y_test
)

# Model 2: Ridge Regression with scaling inside a Pipeline
# Pipeline ensures the scaler is fit on training data only,
# preventing data leakage into the validation set
ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge',  Ridge(alpha=1.0))
])
evaluate_regression(
    'Ridge Regression (scaled, L2 regularisation)',
    ridge_pipeline, X_train, X_test, y_train, y_test
)

# Model 3: Gradient Boosting — handles nonlinearity and feature interactions
evaluate_regression(
    'Gradient Boosting Regressor',
    GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                              max_depth=4, random_state=42),
    X_train, X_test, y_train, y_test
)

# Cross-validation: more reliable than a single train/test split
gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                               max_depth=4, random_state=42)
cv_scores = cross_val_score(gb, X, y, cv=5, scoring='neg_mean_absolute_error')
print(f'5-Fold CV MAE: {-cv_scores.mean():.2f} +/- {cv_scores.std():.2f}')
print('(More reliable than a single test split)')
🔥 MAE vs RMSE: Which to Report
MAE (Mean Absolute Error) is in the same units as your target — if you are predicting house prices in dollars, MAE tells you the average dollar error directly. RMSE (Root Mean Squared Error) penalises large errors more heavily because it squares them before averaging. If your data has outliers, RMSE will look worse than MAE because the outlier errors dominate. Report both — MAE tells you the typical error, RMSE tells you how heavily large errors weigh. R² (coefficient of determination) tells you what fraction of the target variance your model explains — 1.0 is perfect, 0.0 means your model is no better than predicting the mean every time.
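To see the outlier sensitivity concretely, here is a minimal sketch with made-up numbers: five predictions where every error is exactly 1 unit, then the same predictions with one 9-unit miss.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 102.0, 98.0, 101.0, 99.0])

# Every prediction is off by exactly 1 unit
preds_clean = np.array([101.0, 101.0, 99.0, 100.0, 100.0])
# Same predictions, except the last one misses by 9 units
preds_outlier = np.array([101.0, 101.0, 99.0, 100.0, 90.0])

for name, preds in [('clean', preds_clean), ('one outlier', preds_outlier)]:
    mae = mean_absolute_error(y_true, preds)
    rmse = np.sqrt(mean_squared_error(y_true, preds))
    print(f'{name:12s} MAE={mae:.2f}  RMSE={rmse:.2f}')

# clean        MAE=1.00  RMSE=1.00
# one outlier  MAE=2.60  RMSE=4.12   <- the squared 9-unit miss dominates RMSE
```

One bad prediction moves MAE from 1.00 to 2.60 but RMSE from 1.00 to 4.12 — exactly the asymmetry described above.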
📊 Production Insight
Regression models are sensitive to feature scale.
Income (0-500,000) will numerically dominate age (0-100) without normalisation, even if both are equally informative.
Rule: always use a Pipeline that wraps the scaler and model together. This prevents data leakage — the scaler sees only training data during cross-validation, not the validation fold.
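A minimal sketch of that rule on synthetic data: the leaky version fits the scaler on all rows before cross-validation, while the Pipeline version refits the scaler inside each training fold. On this toy data the scores will be close — the point is the pattern, not a dramatic gap.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=20, random_state=0)

# Leaky: the scaler sees ALL rows, including future validation folds
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(Ridge(), X_leaky, y, cv=5,
                        scoring='neg_mean_absolute_error')

# Safe: the Pipeline refits the scaler on each training fold only
pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
safe = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_absolute_error')

print(f'Leaky scaling  MAE: {-leaky.mean():.2f}')
print(f'Pipeline       MAE: {-safe.mean():.2f}')
```

With more aggressive preprocessing (feature selection, target encoding, imputation) the leaky version can overstate performance badly, which is why the Pipeline habit matters even when the numbers look similar.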
🎯 Key Takeaway
Regression predicts continuous numbers, not categories.
Report MAE, RMSE, and R² together — each reveals a different aspect of model quality.
Always use a Pipeline for scaling to prevent data leakage from validation folds into the scaler fit.

Unsupervised Learning: Clustering Deep Dive

Clustering groups data points that are similar to each other without any labels guiding the process. The challenge is threefold: choosing the right number of clusters, validating that the discovered groups are stable and meaningful, and then interpreting what those groups represent in business terms.

K-Means is the most common starting point because it is fast, interpretable, and scales to large datasets. But K-Means makes assumptions that often do not hold in real data — it assumes clusters are spherical, roughly equal in size, and have similar density. When those assumptions break down, DBSCAN or hierarchical clustering produce better results.
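A quick illustration of those assumptions breaking, using scikit-learn's two-moons generator (two interleaved crescents — non-spherical by construction). The `eps` value is a hand-tuned assumption for this toy data, not a general default.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: equal-size, non-spherical clusters
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)

# K-Means cuts the moons with a straight boundary; DBSCAN follows density
km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Adjusted Rand Index vs the known generating labels (1.0 = perfect recovery)
print(f'K-Means ARI: {adjusted_rand_score(y_true, km_labels):.2f}')
print(f'DBSCAN  ARI: {adjusted_rand_score(y_true, db_labels):.2f}')
```

DBSCAN recovers the crescents essentially perfectly here, while K-Means scores far lower — a concrete case of the spherical-cluster assumption failing.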

The most common mistake in clustering is choosing K arbitrarily — often defaulting to a round number like 5 or 10 — instead of measuring. Use the elbow method and silhouette score together. If they disagree, use domain knowledge as the tiebreaker — the number of clusters that makes the most business sense is the right answer.

io/thecodeforge/ml/clustering_deep_dive.py · PYTHON
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Customer behaviour data — 3 natural segments exist in this dataset
# Features: annual_spend ($), visit_frequency (visits/year), avg_cart_value ($)
np.random.seed(42)
X = np.array([
    # Segment A: moderate spend, frequent, small carts
    [5000, 50, 100], [4800, 48, 100], [5200, 52, 100],
    [4900, 49,  98], [5100, 51, 102], [4700, 47,  99],
    # Segment B: low spend, infrequent, tiny carts
    [200, 12, 17], [180, 10, 18], [220, 14, 16],
    [190, 11, 17], [210, 13, 16], [230, 15, 15],
    # Segment C: high spend, rare visits, large carts
    [12000, 8, 1500], [11500, 7, 1643], [12500, 9, 1389],
    [11800, 8, 1475], [12200, 9, 1356], [11200, 7, 1600],
    # Segment D (browse-heavy): low spend, very frequent, tiny carts
    [300, 45, 7], [250, 40, 6], [280, 42, 7],
    [260, 41, 6], [310, 46, 7], [270, 43, 6],
])

# Scale features BEFORE clustering
# Distance-based algorithms are dominated by the largest-scale feature
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Method 1: Elbow plot (inertia vs K) ---
inertias = []
silhouettes = []
K_range = range(2, 8)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

print('K  | Inertia   | Silhouette')
print('---|-----------|----------')
for k, inertia, sil in zip(K_range, inertias, silhouettes):
    print(f'{k}  | {inertia:9.1f} | {sil:.4f}')

best_k = K_range[np.argmax(silhouettes)]
print(f'\nBest K by silhouette: {best_k}')

# --- Fit final model with best K ---
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

print(f'\nCluster profiles (needs domain expert interpretation):')
for i in range(best_k):
    members = X[clusters == i]
    print(f'  Cluster {i} — {len(members)} customers:')
    print(f'    Avg spend:     ${members[:, 0].mean():,.0f}')
    print(f'    Avg visits:    {members[:, 1].mean():.0f}/year')
    print(f'    Avg cart:      ${members[:, 2].mean():,.0f}')

# --- Visualise clusters using PCA (2D projection) ---
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f'\nPCA variance explained: '
      f'{pca.explained_variance_ratio_.sum():.1%} in 2 components')

fig, ax = plt.subplots(figsize=(8, 6))
colors = ['#e74c3c', '#2ecc71', '#3498db', '#f39c12']
for i in range(best_k):
    mask = clusters == i
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1],
               c=colors[i], label=f'Cluster {i}',
               s=100, edgecolors='black', linewidth=0.5)
ax.set_title(f'Customer Segments (K={best_k}) — PCA Projection',
             fontweight='bold')
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
ax.legend()
fig.tight_layout()
fig.savefig('clusters_pca.png', dpi=300, bbox_inches='tight')
plt.close(fig)
print('Saved clusters_pca.png')
💡 Choosing K: Elbow Method vs Silhouette Score
  • Elbow Method: plot inertia (within-cluster sum of squares) vs K. The point where improvement slows sharply — the 'elbow' — suggests the optimal K. The elbow is often ambiguous on real data.
  • Silhouette Score: measures how similar each point is to its own cluster versus the nearest other cluster. Ranges from -1 to 1. Above 0.5 is good. Above 0.7 is strong.
  • Always try both methods — if they agree, you have good evidence. If they disagree, use domain knowledge as the tiebreaker.
  • If no clear elbow exists and silhouette scores are uniformly low (below 0.3), the data may not have natural clusters. Dimensionality reduction before clustering often helps.
  • Visualise your final clusters with PCA or t-SNE — if clusters overlap heavily in 2D, they are probably not meaningfully separate in the original space.
📊 Production Insight
K-Means assumes spherical clusters of similar size and density — a set of assumptions that rarely holds in real customer or transactional data.
Irregularly shaped clusters, clusters of very different sizes, or data with significant noise require DBSCAN or hierarchical clustering.
Rule: always visualise clusters after fitting using PCA or t-SNE. If the clusters do not look visually separated in a 2D projection, do not ship them to the business.
🎯 Key Takeaway
Clustering discovers groups without labels — the algorithm finds structure, humans interpret it.
Use silhouette score and the elbow method together to choose K — never guess.
Always scale features before clustering and visualise results with PCA — trust metrics and plots, not the algorithm's confidence.

When to Use Which: A Decision Framework

The supervised vs unsupervised choice is not always binary. Many production systems combine both paradigms in sequence. The canonical pattern is: use unsupervised learning to discover structure you did not anticipate, validate those discoveries with domain experts, then build a supervised model on top of the validated structure to operationalise it at scale.

The framework below walks through the decision based on your actual situation — not what you wish your data looked like.

Supervised vs Unsupervised Decision Flowchart
If: You have labelled data with validated labels and need to predict a known target for new inputs.
Use: Supervised learning. Classification for categories, regression for numbers. Audit label quality first — arbitrary labels produce arbitrary models.

If: You have unlabelled data and want to discover groups, patterns, or structure you did not define in advance.
Use: Unsupervised learning. Start with clustering. Apply dimensionality reduction first if you have more than 20 features.

If: You have labelled data, but the labels were invented rather than observed from historical outcomes.
Use: Stop and audit the labels before doing anything else. Invented labels may not reflect real patterns. Consider running unsupervised clustering to see what structure actually exists in the data.

If: You have unlabelled data but need a production system that assigns new records to groups in real time.
Use: A hybrid approach — unsupervised learning to discover groups, domain experts to label the discovered groups, then a supervised classifier to assign new records efficiently.

If: You have a small amount of labelled data and a large amount of unlabelled data.
Use: Semi-supervised or active learning. Cluster the unlabelled data, label representative samples from each cluster, then train a supervised model. Use the model's uncertainty to guide which unlabelled records to label next.

If: Your supervised model's performance is unexpectedly poor and you cannot explain why.
Use: Unsupervised exploration before debugging the model. PCA and t-SNE visualisations often reveal whether any structure exists in the data at all. If no structure is visible, the problem may be fundamentally underdetermined.
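The hybrid pattern above — discover groups, have experts label them, then operationalise with a classifier — can be sketched end-to-end on synthetic data. The segment names are hypothetical stand-ins for expert-validated labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Step 1: discover groups in unlabelled data
X, _ = make_blobs(n_samples=600, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
cluster_ids = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

# Step 2 (offline): domain experts review each cluster and give it a business name.
# These names are hypothetical placeholders for expert-validated labels.
cluster_names = {0: 'bargain_hunters', 1: 'regulars', 2: 'big_spenders'}
y = np.array([cluster_names[c] for c in cluster_ids])

# Step 3: train a supervised classifier so new records get a segment in real time
clf = RandomForestClassifier(random_state=42).fit(X_scaled, y)

new_record = X_scaled[:1]  # pretend this is a freshly arrived customer
print(clf.predict(new_record))
```

The classifier now assigns segments in milliseconds per record, without rerunning clustering over the whole dataset each time.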

Common Pitfalls: What Beginners Get Wrong

Beginners make predictable mistakes when choosing between supervised and unsupervised learning. These mistakes waste months of engineering effort and produce models that cannot be deployed or that actively mislead decision-makers. The three most costly pitfalls are using supervised learning without validated labels, ignoring unsupervised methods when labelling is expensive, and evaluating unsupervised results with supervised metrics.

⚠ The Three Costliest Mistakes
1. Using supervised learning when labels are invented rather than observed — the model learns to confidently predict arbitrary categories.
2. Ignoring unsupervised methods when labelling would be cheaper applied to clusters than to individual records.
3. Evaluating clustering with accuracy or F1 — these metrics require ground-truth labels that do not exist in unsupervised settings.
Each of these mistakes can waste two to six months of engineering time on a fundamentally broken approach.

🎯 Key Takeaways

  • Supervised learning requires labelled data — inputs paired with known, validated correct outputs. Label quality sets the ceiling of model performance.
  • Unsupervised learning discovers patterns in unlabelled data — no answer key exists. The algorithm finds structure; humans must interpret whether it is meaningful.
  • Classification and regression are supervised tasks. Clustering, dimensionality reduction, and anomaly detection are unsupervised tasks.
  • The choice depends on whether you have validated labels and whether you need prediction or discovery. Invented labels produce arbitrary supervised models.
  • Accuracy is deceptive on imbalanced data — always check per-class precision, recall, and AUC-ROC for the minority class.
  • Unsupervised results require human interpretation and domain expert validation — never ship cluster assignments without review.
  • Many production systems combine both paradigms: unsupervised for exploration and feature engineering, supervised for prediction and operationalisation at scale.

⚠ Common Mistakes to Avoid

    Memorising syntax before understanding the concept
    Symptom

    You can copy-paste code that runs without errors but cannot explain why each line exists, what it does, or how to adapt it when the data or problem changes. The code works until it does not, and then you are stuck.

    Fix

    Read the concept explanation first and make sure you can explain it without looking at code. Then write the implementation from memory. Explain each line out loud as you write it. If you cannot articulate what a line does, you have not understood it — look it up, understand it, then write it again.

    Skipping practice and only reading theory
    Symptom

    The concepts feel familiar after reading. You nod along. But when you sit down with a real dataset and a blank notebook, nothing comes. Theory without implementation produces false confidence that evaporates on contact with real problems.

    Fix

    After reading each section, close the article and implement the code on a different dataset. Change the number of features, the number of classes, the algorithm. Observe what changes and what does not. The only way to build genuine understanding is to build the thing, break it, and fix it.

    Using supervised learning on invented labels without realising it
    Symptom

    The model trains without errors, achieves high accuracy, passes automated tests, and gets deployed. Then the business reports that the predictions are not useful. On investigation, the labels were based on someone's intuition rather than observed historical outcomes.

    Fix

    Before training, ask: 'Where did these labels come from? Are they observed historical outcomes, or did someone assign them based on judgment?' If the labels were invented, the model learned to reproduce the invention. Run unsupervised clustering first to discover what structure actually exists in the data, then decide whether labelling makes sense.

    Ignoring unsupervised methods when labelled data is expensive
    Symptom

    The project stalls for weeks or months because labelling 100,000 records one at a time costs more time or money than the project budget allows.

    Fix

    Run unsupervised clustering first to discover natural groups. Each cluster contains hundreds of similar records — label the cluster once rather than every individual record. This reduces the labelling effort by 10-100x. Then train a supervised classifier on the cluster-level labels to assign new records automatically.

    Evaluating unsupervised models with supervised metrics
    Symptom

    Attempting to compute accuracy or F1 on clustering results produces errors or nonsensical numbers. Or worse: someone computes accuracy by comparing cluster IDs to an arbitrary label and gets a misleading number that looks like evidence of model quality.

    Fix

    Use clustering-appropriate metrics. Silhouette score measures cohesion and separation without ground-truth labels (higher is better, maximum 1.0). Inertia measures within-cluster compactness (lower is better, use for elbow method). Domain expert validation confirms whether the groups make business sense. These three together give you a complete picture of clustering quality.

Interview Questions on This Topic

  • Q: Explain the difference between supervised and unsupervised learning with a real-world example of each. (Junior)
    Supervised learning trains on labelled data where each input has a known correct output attached. The model learns the mapping from inputs to labels and applies it to new data. A concrete example: email spam detection. The model trains on thousands of emails that humans have already labelled 'spam' or 'not spam'. It learns which feature combinations — sender patterns, word frequencies, link counts — predict each label, then classifies new emails automatically. Unsupervised learning finds patterns in unlabelled data without any known correct output. A concrete example: customer segmentation. You have purchase history for 500,000 customers but no predefined segments. The model groups customers by similarity in purchasing behaviour — frequency, spend, product categories — without being told what the groups should be. It might discover that high-frequency small-cart customers behave very differently from low-frequency large-cart customers, revealing a segment the business had not explicitly defined. The key difference is the label. Supervised requires a human to have already answered the question for training examples. Unsupervised discovers answers to questions the human had not thought to ask.
  • Q: A stakeholder asks you to build a model to 'predict customer segments.' How do you determine whether this is a supervised or unsupervised problem? (Mid-level)
    The first question I ask is: do we already know what the segments are and do we have historical examples of customers assigned to each segment? If the business has predefined segments with validated historical assignments — 'Premium', 'Standard', 'Budget' based on spend thresholds that have been used operationally for years — then I have labels. This is a supervised classification problem. I train on the historical assignments and predict which segment each new customer falls into. If the business wants to find segments they do not currently know exist — 'discover which natural groups are in our customer base' — then there are no labels, and segment discovery is an unsupervised problem. I use clustering to find natural groupings, validate the discovered segments with domain experts, and then present those segments as findings rather than predictions. In most real cases I have encountered, the request is the second kind phrased as the first kind. The stakeholder says 'predict segments' but means 'find segments'. My job is to clarify which problem we are actually solving before writing a line of code, because the two paradigms require fundamentally different data, methods, and validation approaches.
  • Q: You have 1 million unlabelled records and a budget to label 5,000 of them. How would you maximise model performance? (Senior)
    I would use a combination of unsupervised clustering and active learning to spend the labelling budget as efficiently as possible. First, I would run unsupervised clustering on all 1 million records to discover natural groupings. This gives me a map of the data's structure before I spend a single label. Second, instead of randomly selecting 5,000 records to label, I would sample strategically — proportionally from each cluster, ensuring every discovered group has labelled representatives. Random sampling on 1 million records with rare positive classes might give me zero examples of important minority groups. Cluster-proportional sampling prevents this. Third, I would train an initial supervised model on the first 2,500 strategically sampled labels. Then I would run inference on all unlabelled records and identify the records where the model is most uncertain — where the predicted probability is closest to 0.5 for binary classification. Those uncertain records contain the most information per label. I would use the remaining 2,500 labels on those uncertain records (active learning). This approach typically achieves 80-90% of the performance of a fully labelled 1 million-record dataset at 0.5% of the labelling cost. The key insight is that not all labels carry equal information — strategic selection dramatically outperforms random selection at small label budgets.
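The uncertainty-sampling step described in that answer can be sketched in a few lines, assuming a binary problem and a logistic-regression seed model trained on the first strategically labelled batch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

# Pretend only the first 100 records are labelled; the rest form the unlabelled pool
labelled_idx = np.arange(100)
pool_idx = np.arange(100, 2000)

model = LogisticRegression(max_iter=1000).fit(X[labelled_idx], y[labelled_idx])

# Uncertainty sampling: distance of P(class=1) from 0.5 — smaller = more uncertain
proba = model.predict_proba(X[pool_idx])[:, 1]
uncertainty = np.abs(proba - 0.5)

# Spend the next 50 labels on the records the model is least sure about
next_batch = pool_idx[np.argsort(uncertainty)[:50]]
print(f'Most uncertain record: index {next_batch[0]}, '
      f'P(class=1) = {proba[np.argsort(uncertainty)[0]]:.3f}')
```

In a real loop you would label `next_batch`, add it to the training set, retrain, and repeat until the budget runs out.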
  • Q: Can unsupervised learning results be used to improve a supervised model? (Senior)
    Yes, and in production this combination consistently outperforms either paradigm alone. There are four practical ways to combine them. First, cluster assignments as features. Run K-Means or DBSCAN on the training data and add the cluster membership as a categorical feature for the supervised model. The discovered group membership often captures nonlinear feature interactions that the supervised model cannot easily learn from raw features. Second, dimensionality reduction before supervised training. Run PCA on the raw features and use the principal components as inputs to the supervised model. This reduces multicollinearity, speeds up training, and can improve generalisation when the original feature space is high-dimensional relative to the training set size. Third, anomaly detection for label auditing. Run an unsupervised anomaly detector (Isolation Forest, Local Outlier Factor) on the training data. Records that are anomalous within their labelled class are strong candidates for mislabelling — they look nothing like the other members of their class. Reviewing these records often uncovers systematic labelling errors that are degrading model quality. Fourth, pseudo-labelling for semi-supervised learning. Train a supervised model on the small labelled set, use it to predict labels on the large unlabelled set with high confidence, add those high-confidence predictions to the training set, and retrain. Iterate. This leverages both paradigms to expand the effective training set without manual labelling.
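The first pattern in that answer — cluster assignments as extra features — can be sketched like this. Strictly, the K-Means should also be refit inside each CV fold; it is fit once here to keep the sketch short, and the improvement is not guaranteed on every dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=42)

# Baseline: raw features only
base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Augmented: append one-hot cluster membership as extra features
clusters = KMeans(n_clusters=8, random_state=42, n_init=10).fit_predict(X)
X_aug = np.hstack([X, np.eye(8)[clusters]])
aug = cross_val_score(LogisticRegression(max_iter=1000), X_aug, y, cv=5).mean()

print(f'Raw features:       accuracy {base:.3f}')
print(f'+ cluster features: accuracy {aug:.3f}')
```

The cluster-membership columns let a linear model exploit nonlinear groupings in feature space that it could not carve out on its own.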

Frequently Asked Questions

What is Supervised vs Unsupervised Learning in simple terms?

Supervised learning is studying with an answer key — you have examples with correct answers attached and the model learns to predict new ones. Unsupervised learning is discovering patterns on your own — you have data but no answers, and the algorithm groups things by similarity without being told what the groups should be. Supervised: teacher with labels. Unsupervised: no teacher, find structure yourself.

Which is better: supervised or unsupervised learning?

Neither is universally better — they solve different problems. Supervised learning is better when you have validated labels and need to predict a known outcome for new inputs (will this customer churn? is this transaction fraudulent?). Unsupervised learning is better when you have raw data and want to discover structure you did not anticipate (what natural customer groups exist? which transactions look anomalous?). In practice, most production ML systems use supervised learning for prediction and unsupervised learning for exploration, feature engineering, and anomaly detection — often together in the same pipeline.

Can I use both supervised and unsupervised learning together?

Yes, and you frequently should. Common production patterns include: running unsupervised clustering to discover natural groups, then labelling those groups and training a supervised classifier to assign new records; applying PCA (unsupervised dimensionality reduction) to compress features before training a supervised model; and using unsupervised anomaly detection to identify and review potential mislabels in the supervised training set. The two paradigms complement each other — unsupervised for discovery and exploration, supervised for prediction and operationalisation.

How much labelled data do I need for supervised learning?

It depends on the complexity of the problem, the number of classes, and the signal-to-noise ratio in your features. For simple binary classification with 10-20 informative features, 500-1,000 validated labelled examples per class is a reasonable starting point. For complex problems with many classes or subtle signal, you may need thousands per class. Transfer learning (starting from a pre-trained model) can reduce this requirement significantly — a fine-tuned language model may generalise well from 100-200 examples. If labelling is expensive, start with unsupervised exploration to understand your data's structure before committing to a labelling effort.

What is the silhouette score and why does it matter for clustering?

The silhouette score measures how well each data point fits its assigned cluster relative to the nearest other cluster. For each point it computes two distances: the average distance to other points in the same cluster (cohesion) and the average distance to points in the nearest different cluster (separation). The silhouette score is (separation - cohesion) / max(separation, cohesion). It ranges from -1 to +1. A score above 0.5 suggests reasonable clusters. Above 0.7 suggests well-separated clusters. Below 0.25 suggests the clusters are not meaningfully distinct. It matters because it is one of the few ways to evaluate clustering quality without ground-truth labels — it tells you whether the algorithm found real structure or just divided the data arbitrarily.
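That formula can be checked by hand on a tiny example — two tight one-dimensional clusters — against scikit-learn's per-point values.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Tiny 1-D example: two tight clusters, centred near 0 and near 10
X = np.array([[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Manual silhouette for the first point: (separation - cohesion) / max(...)
D = pairwise_distances(X)
cohesion = D[0, labels == 0][1:].mean()   # avg distance to own cluster (excl. self)
separation = D[0, labels == 1].mean()     # avg distance to the other cluster
manual = (separation - cohesion) / max(separation, cohesion)

print(f'Manual silhouette (point 0): {manual:.4f}')
print(f'sklearn per-point value:     {silhouette_samples(X, labels)[0]:.4f}')
print(f'Mean silhouette score:       {silhouette_score(X, labels):.4f}')
```

With cohesion 0.3 and separation 10.2, the first point scores about 0.97 — near-perfect, as expected for clusters this well separated.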

Naren — Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Introduction to Machine Learning · Next: ML Workflow — Data to Deployment
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged