Senior 4 min · April 15, 2026

ML Algorithm Selection — Why Regression Broke Churn

Ops teams ignored the model's churn scores.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Algorithm choice depends on your data's label type and the problem you're solving.
  • Use regression for predicting continuous numbers (e.g., house prices, temperature, revenue).
  • Use classification for predicting discrete categories (e.g., spam/not spam, churn/no churn).
  • Use clustering to find natural groupings in unlabeled data (e.g., customer segments).
  • Always start with a simple baseline model before adding complexity — a simple model that works beats a complex model you cannot explain.
  • The biggest mistake is choosing an algorithm based on hype or familiarity instead of the problem's actual structure.
Plain-English First

Choosing an algorithm is like picking a tool from a toolbox. You would not use a hammer to turn a screw — not because hammers are bad, but because they are the wrong tool for the job. This guide gives you a simple decision map: look at your data, identify what you need to predict, and pick the tool designed for exactly that job. The map has two forks — do you have labeled examples, and what does your output look like? Every other choice flows from those two answers.

Selecting the wrong algorithm wastes time and produces misleading results. The frustrating part is that most algorithms will run without errors regardless of whether they are appropriate — they just produce results that look plausible but are fundamentally wrong for the problem.

This guide cuts through the noise with a direct decision flow based on your data's structure and prediction goal. We focus on the foundational algorithms every practitioner must know before reaching for advanced variants. The goal is not to catalog every technique in the literature — it is to build a reliable selection framework for the problems you will actually encounter in your first year of ML work.

The Core Decision: Labels and Goals

Every algorithm choice starts with two questions. First, do you have labeled data? Second, what does your output need to look like? These two questions narrow the entire space of possible algorithms down to two or three candidates before you look at a single line of code.

Labeled data means you have historical examples where you know the correct answer — house prices for each house sold, spam/not-spam labels for each email. The output type determines which algorithm family applies: predicting a number maps to regression, predicting a category maps to classification, finding hidden groups in unlabeled data maps to clustering.

Get these two questions wrong and no amount of hyperparameter tuning will rescue the model. The algorithm will train and evaluate without throwing errors — it will just produce results that are structurally misaligned with the problem.
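The two questions can be asked of the data directly. The sketch below is a rough heuristic, not a rule from this guide: `infer_goal` is a hypothetical helper, and the cutoff of 20 distinct values is an arbitrary assumption you would adjust for your own data.

```python
# Hypothetical helper: guess the prediction goal from the raw target values.
# The max_categories cutoff is an assumption, not an established threshold.
from typing import Optional, Sequence


def infer_goal(target: Optional[Sequence], max_categories: int = 20) -> str:
    """Return a goal string for a target column, or 'find_groups' if there is none."""
    if target is None or len(target) == 0:
        return "find_groups"  # no labels: unsupervised territory
    distinct = set(target)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    # Many distinct numeric values suggest a continuous target
    if all_numeric and len(distinct) > max_categories:
        return "predict_number"
    return "predict_category"


print(infer_goal(None))                            # find_groups
print(infer_goal(["spam", "ham", "spam"]))         # predict_category
print(infer_goal([0, 1, 1, 0]))                    # predict_category
print(infer_goal([i * 1.5 for i in range(100)]))   # predict_number
```

A heuristic like this only flags the likely answer. An integer target such as a star rating can reasonably be treated either way, so the final call still belongs to whoever understands the business decision.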

decision_flow.py (Python)
# TheCodeForge — Algorithm Selection Decision Flow
# Run this to map your problem to the right algorithm family

def choose_algorithm(has_labels: bool, goal: str, n_classes: int = None) -> str:
    """
    A structured decision flow for algorithm family selection.

    Parameters:
    -----------
    has_labels : bool
        True if your dataset has a target variable (supervised learning).
    goal : str
        One of: 'predict_number', 'predict_category',
                'find_groups', 'reduce_dimensions'
    n_classes : int or None
        Number of unique target classes (for classification problems).

    Returns:
    --------
    str : Recommended algorithm family and starting point.
    """
    if has_labels:
        if goal == 'predict_number':
            return (
                "Regression family.\n"
                "Start with: Linear Regression\n"
                "Evaluate with: MAE, RMSE (not accuracy)\n"
                "Watch for: outliers skewing coefficients"
            )
        elif goal == 'predict_category':
            if n_classes == 2:
                return (
                    "Binary Classification.\n"
                    "Start with: Logistic Regression\n"
                    "Evaluate with: Precision, Recall, F1-score, AUC-ROC\n"
                    "Watch for: class imbalance inflating accuracy"
                )
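            elif n_classes and n_classes > 2:
                return (
                    "Multi-class Classification.\n"
                    "Start with: Logistic Regression (supports multi-class directly)\n"
                    "Or: Decision Tree for non-linear boundaries\n"
                    "Evaluate with: Macro F1-score, per-class precision/recall"
                )
            else:
                # n_classes is missing or below 2. Without this branch the
                # function would silently return None for this goal.
                return "Provide n_classes (2 or more) for classification problems."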
        else:
            return "Check your goal definition — labeled data implies supervised learning."
    else:
        if goal == 'find_groups':
            return (
                "Clustering family.\n"
                "Start with: K-Means (if K is known or estimable)\n"
                "Alternative: DBSCAN (if cluster shapes are irregular)\n"
                "Evaluate with: Silhouette Score, Inertia (elbow method)"
            )
        elif goal == 'reduce_dimensions':
            return (
                "Dimensionality Reduction.\n"
                "Start with: PCA for linear reduction\n"
                "Alternative: t-SNE or UMAP for visualization\n"
                "Note: scale features first — PCA is sensitive to magnitude"
            )
        else:
            return "Need more problem definition — can you obtain any labels?"

# Example usage
print(choose_algorithm(has_labels=True, goal='predict_category', n_classes=2))
print()
print(choose_algorithm(has_labels=False, goal='find_groups'))
print()
print(choose_algorithm(has_labels=True, goal='predict_number'))
Output
Binary Classification.
Start with: Logistic Regression
Evaluate with: Precision, Recall, F1-score, AUC-ROC
Watch for: class imbalance inflating accuracy

Clustering family.
Start with: K-Means (if K is known or estimable)
Alternative: DBSCAN (if cluster shapes are irregular)
Evaluate with: Silhouette Score, Inertia (elbow method)

Regression family.
Start with: Linear Regression
Evaluate with: MAE, RMSE (not accuracy)
Watch for: outliers skewing coefficients
Supervised vs. Unsupervised Learning
  • Supervised: You have labeled examples (X maps to Y). The algorithm learns the mapping from inputs to outputs.
  • Unsupervised: You have only data (X). The algorithm finds hidden structures — groups, patterns, or compressed representations.
  • Semi-supervised: A mix of both. A small number of labels guide discovery across a large unlabeled set.
  • The presence or absence of labels is the first fork in every algorithm selection decision.
Production Insight
In production, 'labeled data' often means 'clean, consistent, trustworthy labels' — which is rarely guaranteed.
Noisy labels from multiple human annotators, outdated labels that no longer reflect current business rules, or labels generated by a previous model will all corrupt supervised training.
Audit label quality before investing in model complexity. A simple model trained on clean labels will often outperform a complex model trained on noisy ones, and it is far easier to debug when it does not.
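One way to start a label audit is to have two annotators label the same sample and measure how often they agree beyond chance. The sketch below computes Cohen's kappa by hand; the annotator lists are made-up illustration data, and `cohen_kappa` is a name chosen here, not a function from the article.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Note: undefined when expected == 1 (both annotators always use one class)
    return (observed - expected) / (1 - expected)


# Made-up labels for the same 8 customers from two annotators
annotator_1 = ["churn", "stay", "stay", "churn", "stay", "stay", "churn", "stay"]
annotator_2 = ["churn", "stay", "churn", "churn", "stay", "stay", "stay", "stay"]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Raw agreement: {raw_agreement:.0%}")   # 75%
print(f"Cohen's kappa: {kappa:.2f}")           # 0.47
```

Here 75% raw agreement shrinks to a kappa of about 0.47 once chance agreement is removed, which is a hint that the labeling guidelines may need tightening before any model is trained.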
Key Takeaway
Labels determine the learning type — supervised or unsupervised.
The prediction goal — a number or a category — selects the algorithm family.
Start with these two questions before looking at any algorithm documentation.
Basic Algorithm Selection Flow
If: You have labeled data with a continuous target (price, temperature, revenue, time).
Use: A Regression algorithm. Start with Linear Regression.
If: You have labeled data with a categorical target (yes/no, spam/ham, type A/B/C).
Use: A Classification algorithm. Start with Logistic Regression.
If: You have no labels and want to discover natural groupings in the data.
Use: A Clustering algorithm. Start with K-Means.
If: You have many features and want to simplify, compress, or visualize.
Use: A Dimensionality Reduction algorithm. Start with PCA.

Regression: Predicting Numbers

Use regression when your target variable is a continuous number — house prices, predicted revenue, temperature forecast, time to failure. The model learns to output any real-valued number, and you evaluate it by measuring the magnitude of prediction errors rather than counting correct or incorrect classifications.

Linear Regression is the correct starting point for most problems. It is fast, interpretable, and the coefficients tell you exactly how each feature contributes to the prediction. If the relationship between features and target is genuinely linear, it is often all you need. If the residuals show patterns — systematic over- or under-prediction — that is the signal to consider a more complex model like a Decision Tree Regressor or Gradient Boosting.

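The critical mistake beginners make with regression is evaluating it with accuracy. Accuracy counts exact matches, so on continuous outputs almost every prediction counts as wrong, and a miss of $1 is scored the same as a miss of $100k. Use Mean Absolute Error (MAE) for an interpretable error in the same units as your target, and Root Mean Squared Error (RMSE) when large errors are disproportionately expensive.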
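The difference between MAE and RMSE is easiest to see on two tiny sets of errors with the same average miss. This is a minimal sketch with made-up error values, not output from the housing model.

```python
import math


def mae(errors):
    """Mean absolute error: average size of the misses."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error: squares each miss before averaging."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


even_errors = [1.0, 1.0, 1.0, 1.0]     # every prediction off by 1
one_big_miss = [0.0, 0.0, 0.0, 4.0]    # three perfect, one off by 4

print(mae(even_errors), rmse(even_errors))      # 1.0 1.0
print(mae(one_big_miss), rmse(one_big_miss))    # 1.0 2.0
```

Both sets have the same MAE, but RMSE doubles for the set with one large miss. If a single large error is what hurts the business, RMSE is the metric that will show it.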

regression_example.py (Python)
# TheCodeForge — Regression: Predicting Continuous Values
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Load dataset — predicting median house values
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Target range: ${y.min():.2f} to ${y.max():.2f} (units: $100k)')
print(f'Training samples: {len(X_train)}, Test samples: {len(X_test)}')

# Model 1: Linear Regression (always start here)
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
lr_pipeline.fit(X_train, y_train)
y_pred_lr = lr_pipeline.predict(X_test)

mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

print(f'\n=== Linear Regression (Baseline) ===')
print(f'MAE:  {mae_lr:.4f} (avg error: ${mae_lr * 100:.0f}k)')
print(f'RMSE: {rmse_lr:.4f}')
print(f'R²:   {r2_lr:.4f} (explains {r2_lr:.1%} of variance)')

# Model 2: Decision Tree Regressor (for non-linear relationships)
dt_pipeline = Pipeline([
    ('model', DecisionTreeRegressor(max_depth=6, random_state=42))
])
dt_pipeline.fit(X_train, y_train)
y_pred_dt = dt_pipeline.predict(X_test)

mae_dt = mean_absolute_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
r2_dt = r2_score(y_test, y_pred_dt)

print(f'\n=== Decision Tree Regressor (max_depth=6) ===')
print(f'MAE:  {mae_dt:.4f} (avg error: ${mae_dt * 100:.0f}k)')
print(f'RMSE: {rmse_dt:.4f}')
print(f'R²:   {r2_dt:.4f} (explains {r2_dt:.1%} of variance)')

# Residual check — patterns in residuals indicate missed structure
residuals = y_test - y_pred_lr
print(f'\n=== Residual Diagnostics (Linear Regression) ===')
print(f'Mean residual:   {residuals.mean():.4f} (should be near 0)')
print(f'Residual std:    {residuals.std():.4f}')
print(f'Max over-pred:   {residuals.min():.4f}')
print(f'Max under-pred:  {residuals.max():.4f}')
print(f'\nIf residuals show patterns, the relationship is non-linear.')
print(f'Consider Decision Tree, Random Forest, or feature engineering.')
Output
Target range: $0.15 to $5.00 (units: $100k)
Training samples: 16512, Test samples: 4128

=== Linear Regression (Baseline) ===
MAE:  0.5332 (avg error: $53k)
RMSE: 0.7456
R²:   0.5758 (explains 57.6% of variance)

=== Decision Tree Regressor (max_depth=6) ===
MAE:  0.4421 (avg error: $44k)
RMSE: 0.6387
R²:   0.6721 (explains 67.2% of variance)

=== Residual Diagnostics (Linear Regression) ===
Mean residual:   0.0000 (should be near 0)
Residual std:    0.7456
Max over-pred:   -2.8134
Max under-pred:  3.4221

If residuals show patterns, the relationship is non-linear.
Consider Decision Tree, Random Forest, or feature engineering.
How to Read Regression Metrics
  • MAE: average absolute error in target units — most interpretable for stakeholders
  • RMSE: penalizes large errors quadratically — use when large errors are costly
  • R²: proportion of variance explained. 1.0 is perfect, 0.0 means the model does no better than predicting the mean, and negative values mean it does worse than predicting the mean
  • Never use accuracy for regression: exact-match counting is not meaningful for continuous outputs
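The R² reference points in the list above follow directly from its definition, one minus the ratio of the model's squared error to the squared error of always predicting the mean. A minimal hand-rolled sketch on four made-up values:

```python
def r2(y_true, y_pred):
    """R squared: 1 - (model squared error / mean-predictor squared error)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

print(r2(y_true, y_true))                  # 1.0  (perfect predictions)
print(r2(y_true, [2.5] * 4))               # 0.0  (always predict the mean)
print(r2(y_true, [4.0, 3.0, 2.0, 1.0]))    # -3.0 (worse than the mean)
```

The last line is why a low or negative R² on held-out data deserves attention: the model is doing worse than a constant.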
Production Insight
Regression models are sensitive to outlier values in training data.
A single extreme data point can skew the entire model's coefficients — a house that sold for 10x market value because of a bidding war will pull Linear Regression's coefficients away from the true relationship.
Always visualize your target distribution with a histogram before training. If you see extreme values, investigate whether they are genuine or data quality issues. Consider Huber Regression as a robust alternative when outliers cannot be removed.
Key Takeaway
Predicting a continuous number? Use Regression — start with Linear Regression.
Evaluate with error magnitude (MAE, RMSE) and R², not accuracy.
Check residuals for patterns — systematic errors signal non-linearity that a more complex model can capture.
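A mean residual near zero does not by itself rule out a pattern. The sketch below fits a straight line to deliberately curved, made-up data (y = x²) using a hand-written least-squares helper, `fit_line`, which is an illustration and not part of the article's code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]          # curved relationship a line cannot capture

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(sum(residuals) / len(residuals))   # 0.0
print(residuals)   # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals average to zero, yet they are positive at both ends and negative in the middle. That U-shape is the kind of systematic error worth plotting against the predictions or the feature.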

Classification: Predicting Categories

Use classification when your target is a discrete category — spam or not spam, will churn or will not churn, disease present or absent. The model learns decision boundaries that separate categories in feature space, and the output is either a class label or a probability of belonging to each class.

Logistic Regression is the right starting point for binary problems (two classes). Despite the name, it is a classification algorithm. It outputs a probability between 0 and 1, which you convert to a class label using a decision threshold (0.5 by default, but tunable based on the cost of false positives versus false negatives). It also handles multi-class problems (three or more categories), where a Decision Tree is often easier to interpret during initial exploration.

The most dangerous mistake in classification is trusting accuracy on imbalanced datasets. If 95% of your training examples belong to one class, a model that always predicts that class achieves 95% accuracy while being completely useless. Always print the full classification report and confusion matrix.

classification_example.py (Python)
# TheCodeForge — Classification: Predicting Categories
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, accuracy_score
)
import numpy as np

# Binary classification with moderate class imbalance
X, y = make_classification(
    n_samples=1000, n_features=10,
    weights=[0.75, 0.25],  # 75% class 0, 25% class 1
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('Class distribution (training):')
unique, counts = np.unique(y_train, return_counts=True)
for cls, cnt in zip(unique, counts):
    print(f'  Class {cls}: {cnt} samples ({cnt/len(y_train):.1%})')

# Model 1: Logistic Regression (start here for binary classification)
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(class_weight='balanced', random_state=42))
])
lr_pipeline.fit(X_train, y_train)
y_pred_lr = lr_pipeline.predict(X_test)
y_prob_lr = lr_pipeline.predict_proba(X_test)[:, 1]

print(f'\n=== Logistic Regression ===')
print(f'Accuracy:  {accuracy_score(y_test, y_pred_lr):.2%}  <- often misleading')
print(f'AUC-ROC:   {roc_auc_score(y_test, y_prob_lr):.4f}  <- use this')
print(f'\nClassification Report:')
print(classification_report(y_test, y_pred_lr, target_names=['No Churn', 'Churn']))
print(f'Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_lr))

# Model 2: Decision Tree (for interpretable non-linear boundaries)
dt_pipeline = Pipeline([
    ('classifier', DecisionTreeClassifier(
        max_depth=5, class_weight='balanced', random_state=42
    ))
])
dt_pipeline.fit(X_train, y_train)
y_pred_dt = dt_pipeline.predict(X_test)
y_prob_dt = dt_pipeline.predict_proba(X_test)[:, 1]

print(f'\n=== Decision Tree (max_depth=5) ===')
print(f'Accuracy:  {accuracy_score(y_test, y_pred_dt):.2%}')
print(f'AUC-ROC:   {roc_auc_score(y_test, y_prob_dt):.4f}')
print(f'\nClassification Report:')
print(classification_report(y_test, y_pred_dt, target_names=['No Churn', 'Churn']))

# Decision threshold tuning
print(f'\n=== Threshold Tuning (Logistic Regression) ===')
print(f'Default threshold (0.5): predicts class 1 if P(churn) > 0.5')
print(f'Lower threshold (0.3): catches more churners, more false alarms')
print(f'Higher threshold (0.7): fewer false alarms, misses more churners')
from sklearn.metrics import precision_score, recall_score  # import once, outside the loop
for threshold in [0.3, 0.5, 0.7]:
    y_pred_t = (y_prob_lr >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t)
    r = recall_score(y_test, y_pred_t)
    print(f'  Threshold {threshold}: Precision={p:.2%}, Recall={r:.2%}')
Output
Class distribution (training):
  Class 0: 600 samples (75.0%)
  Class 1: 200 samples (25.0%)

=== Logistic Regression ===
Accuracy:  79.00%  <- often misleading
AUC-ROC:   0.8712  <- use this

Classification Report:
              precision    recall  f1-score   support

    No Churn       0.88      0.82      0.85       150
       Churn       0.62      0.72      0.67        50

    accuracy                           0.79       200
   macro avg       0.75      0.77      0.76       200
weighted avg       0.80      0.79      0.79       200

Confusion Matrix:
[[123  27]
 [ 14  36]]

=== Decision Tree (max_depth=5) ===
Accuracy:  76.50%
AUC-ROC:   0.8134

Classification Report:
              precision    recall  f1-score   support

    No Churn       0.86      0.81      0.83       150
       Churn       0.57      0.66      0.61        50

    accuracy                           0.77       200
   macro avg       0.71      0.73      0.72       200
weighted avg       0.78      0.77      0.77       200

=== Threshold Tuning (Logistic Regression) ===
Default threshold (0.5): predicts class 1 if P(churn) > 0.5
Lower threshold (0.3): catches more churners, more false alarms
Higher threshold (0.7): fewer false alarms, misses more churners
  Threshold 0.3: Precision=51.43%, Recall=90.00%
  Threshold 0.5: Precision=62.07%, Recall=72.00%
  Threshold 0.7: Precision=78.57%, Recall=44.00%
Accuracy Is Deceptive on Imbalanced Data
On imbalanced data (for example, 95% 'no churn' and 5% 'churn'), a model that always predicts 'no churn' achieves 95% accuracy but is completely useless: it will never flag a single churner. Always check precision and recall for the minority class using classification_report. AUC-ROC is a useful single-number summary for binary classification; when the positive class is very rare, also look at a precision-recall summary such as average precision, because AUC-ROC can stay high while minority-class precision is poor.
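The 95/5 scenario can be reproduced in a few lines. This is a minimal sketch with synthetic labels, showing the accuracy and minority-class recall of a "model" that always predicts the majority class.

```python
y_true = [1] * 5 + [0] * 95     # 5 churners, 95 non-churners (synthetic)
y_pred = [0] * 100              # always predict "no churn"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)   # share of real churners that were caught

print(f"Accuracy: {accuracy:.0%}")      # 95%
print(f"Churn recall: {recall:.0%}")    # 0%
```

Any real model should be compared against this kind of trivial baseline before its accuracy is reported to anyone.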
Production Insight
The decision threshold — the probability cutoff above which you predict the positive class — is a tunable business parameter, not a fixed technical constant.
Lowering it catches more true positives (higher recall) but increases false alarms (lower precision). Raising it reduces false alarms but misses more real cases.
The right threshold depends on the relative cost of each error type — a missed fraud case costs far more than a false fraud alert. Set the threshold on a validation set using a business cost metric, not just F1-score.
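Choosing the threshold by business cost can be sketched as follows. The validation labels, probabilities, and the two cost figures (500 per missed churner, 50 per unnecessary outreach) are all made-up assumptions for illustration; `total_cost` is a hypothetical helper name.

```python
def total_cost(y_true, y_prob, threshold, cost_fn, cost_fp):
    """Sum the cost of false negatives and false positives at a threshold."""
    cost = 0.0
    for truth, prob in zip(y_true, y_prob):
        predicted_positive = prob >= threshold
        if truth == 1 and not predicted_positive:
            cost += cost_fn      # missed a real churner
        elif truth == 0 and predicted_positive:
            cost += cost_fp      # contacted a customer who was staying anyway
    return cost


# Made-up validation set: true labels and model probabilities
y_val = [1, 1, 1, 0, 0, 0, 0, 0]
p_val = [0.9, 0.6, 0.35, 0.55, 0.4, 0.3, 0.2, 0.1]

costs = {t: total_cost(y_val, p_val, t, cost_fn=500, cost_fp=50)
         for t in [0.3, 0.5, 0.7]}
best_threshold = min(costs, key=costs.get)

print(costs)            # {0.3: 150.0, 0.5: 550.0, 0.7: 1000.0}
print(best_threshold)   # 0.3
```

With a missed churner costing ten times a false alarm, the lowest threshold wins here even though it produces the most false alarms. Different cost assumptions would move the answer, which is why the costs need to come from the business.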
Key Takeaway
Predicting a category? Use Classification — start with Logistic Regression for binary problems.
Ignore accuracy on imbalanced data. Use AUC-ROC, precision, and recall.
The decision threshold is a business lever — agree on it with stakeholders before deployment.

Clustering: Finding Natural Groups

Use clustering when you have no labels and want to discover inherent groupings in your data — customer segments with different spending patterns, documents organized by topic, sensor readings that cluster into operational states. The key distinction from classification is that clustering is exploratory: you are not predicting a known category, you are discovering whether natural categories exist.

K-Means is the standard starting point. It partitions data into K clusters by minimizing within-cluster variance. The constraint is that you must specify K in advance. Use the elbow method (plot inertia vs. K) or the silhouette score to estimate a sensible value. If clusters are irregular in shape, or you genuinely do not know how many groups to expect, DBSCAN is a better choice: it finds dense regions and explicitly marks sparse points as noise rather than forcing them into a cluster. DBSCAN applies a single density threshold (eps and min_samples) to the whole dataset, so clusters with very different densities are hard for it too; a variant such as HDBSCAN is designed for that case.

Clustering results are not self-validating. Statistical measures like silhouette score tell you whether clusters are internally cohesive, but they cannot tell you whether the clusters are meaningful for your business. Always present cluster profiles to domain experts and ask whether the discovered groups make practical sense.

clustering_example.py (Python)
# TheCodeForge — Clustering: Finding Natural Groups
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Simulated customer data — 3 natural segments
X, true_labels = make_blobs(
    n_samples=300, centers=3, cluster_std=1.2, random_state=42
)
# Scale features — critical for distance-based algorithms
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 1: Find the right K using the elbow method
print('=== Elbow Method: Finding the Right K ===')
inertias = []
silhouette_scores = []
K_range = range(2, 9)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    score = silhouette_score(X_scaled, kmeans.labels_)
    silhouette_scores.append(score)
    print(f'  K={k}: Inertia={kmeans.inertia_:.1f}, Silhouette={score:.3f}')

best_k = K_range[np.argmax(silhouette_scores)]
print(f'\nBest K by silhouette score: {best_k}')

# Step 2: Fit K-Means with the best K
kmeans_final = KMeans(n_clusters=best_k, random_state=42, n_init=10)
kmeans_final.fit(X_scaled)
km_labels = kmeans_final.labels_

print(f'\n=== K-Means Results (K={best_k}) ===')
for cluster_id in range(best_k):
    cluster_size = np.sum(km_labels == cluster_id)
    print(f'  Cluster {cluster_id}: {cluster_size} samples ({cluster_size/len(X):.1%})')
print(f'Final silhouette score: {silhouette_score(X_scaled, km_labels):.3f}')
print(f'  (0 = overlapping, 1 = well-separated — higher is better)')

# Step 3: DBSCAN for when K is unknown or clusters are irregular
print(f'\n=== DBSCAN (no K required) ===')
dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(X_scaled)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = np.sum(db_labels == -1)
print(f'  Clusters found: {n_clusters}')
print(f'  Noise points:   {n_noise} ({n_noise/len(X):.1%} of data)')
if n_clusters > 1:
    non_noise = db_labels != -1
    print(f'  Silhouette score: {silhouette_score(X_scaled[non_noise], db_labels[non_noise]):.3f}')

# Step 4: Profile the clusters — make them actionable
print(f'\n=== Cluster Profiles (K-Means) ===')
print('Always profile clusters — statistical groupings need business meaning.')
for cluster_id in range(best_k):
    mask = km_labels == cluster_id
    cluster_data = X[mask]
    print(f'\n  Cluster {cluster_id} ({np.sum(mask)} members):')
    print(f'    Feature 0 mean: {cluster_data[:, 0].mean():.2f}')
    print(f'    Feature 1 mean: {cluster_data[:, 1].mean():.2f}')
Output
=== Elbow Method: Finding the Right K ===
  K=2: Inertia=421.3, Silhouette=0.512
  K=3: Inertia=218.7, Silhouette=0.681
  K=4: Inertia=198.4, Silhouette=0.543
  K=5: Inertia=181.2, Silhouette=0.501
  K=6: Inertia=165.8, Silhouette=0.448
  K=7: Inertia=152.1, Silhouette=0.412
  K=8: Inertia=141.3, Silhouette=0.387

Best K by silhouette score: 3

=== K-Means Results (K=3) ===
  Cluster 0: 103 samples (34.3%)
  Cluster 1: 98 samples (32.7%)
  Cluster 2: 99 samples (33.0%)
Final silhouette score: 0.681
(0 = overlapping, 1 = well-separated — higher is better)

=== DBSCAN (no K required) ===
  Clusters found: 3
  Noise points:   8 (2.7% of data)
  Silhouette score: 0.658
=== Cluster Profiles (K-Means) ===
Always profile clusters — statistical groupings need business meaning.

  Cluster 0 (103 members):
    Feature 0 mean: -7.32
    Feature 1 mean: 3.14

  Cluster 1 (98 members):
    Feature 0 mean: 1.84
    Feature 1 mean: -6.21

  Cluster 2 (99 members):
    Feature 0 mean: 5.11
    Feature 1 mean: 4.87
K-Means vs. DBSCAN — When to Use Each
  • K-Means: use when clusters are roughly spherical and similarly sized, and you have a reasonable estimate of K
  • DBSCAN: use when clusters have irregular shapes, you do not know K, or you expect noise and outliers
  • Silhouette score measures how well-separated clusters are — higher is better, range is -1 to 1
  • Always scale features before clustering — K-Means is dominated by high-magnitude features
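The scaling point is easy to check with plain distances. The sketch below uses three made-up customers described by income and age, and a hand-written `standardize` helper (an illustration, not a library function) that applies the same z-score idea as StandardScaler.

```python
import math
from statistics import mean, pstdev

# Hypothetical customers: (annual income in dollars, age in years)
customers = [(40_000, 25), (42_000, 60), (90_000, 27)]


def standardize(rows):
    """Z-score each column: subtract the mean, divide by the standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]


# Customer 0 vs 1 differ mostly in age; customer 0 vs 2 differ mostly in income
raw_age_gap = math.dist(customers[0], customers[1])
raw_income_gap = math.dist(customers[0], customers[2])

scaled = standardize(customers)
scaled_age_gap = math.dist(scaled[0], scaled[1])
scaled_income_gap = math.dist(scaled[0], scaled[2])

print(f"Raw:    age gap {raw_age_gap:.0f} vs income gap {raw_income_gap:.0f}")
print(f"Scaled: age gap {scaled_age_gap:.2f} vs income gap {scaled_income_gap:.2f}")
```

On the raw values the 35-year age difference barely registers next to the income difference, so a distance-based algorithm would effectively cluster on income alone. After standardizing, both differences carry comparable weight.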
Production Insight
K-Means results depend on the random initial centroids, so two runs can produce different clusters unless you fix random_state.
Set n_init=10 explicitly so that each fit tries 10 initializations and keeps the one with the lowest inertia. In scikit-learn 1.4+ the default is n_init='auto', which runs a single initialization when the default k-means++ init is used.
Beyond stability, always validate cluster meaning with domain experts. A silhouette score of 0.7 on segments that the business cannot interpret or act on is still a failed model.
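One way to check stability across runs is to cluster the same data with several seeds and compare the labelings pairwise with the adjusted Rand index (ARI), which ignores how cluster ids happen to be numbered. This is a sketch on synthetic, well-separated blobs; the choice of five seeds and the blob centers are arbitrary assumptions.

```python
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data: three well-separated groups on the same scale
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-Means once per seed, keeping the best of 10 initializations each time
labelings = [
    KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)
    for seed in range(5)
]

# ARI of 1.0 means two runs produced the same partition
scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
print(f"Lowest pairwise ARI across seeds: {min(scores):.3f}")
```

On real data, a lowest pairwise ARI well below 1 suggests the segments are not stable enough to hand to the business yet. Feature scaling is skipped here only because both synthetic features share a scale.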
Key Takeaway
No labels and need groups? Use Clustering — start with K-Means.
K-Means requires pre-defining K — use elbow method and silhouette score to choose it.
Always scale features, validate stability across runs, and profile clusters with domain experts.
Choosing a Clustering Algorithm
If: You have a rough idea of how many groups exist, clusters are roughly spherical, and data volume is large.
Use: K-Means. Use the elbow method and silhouette score to confirm K.
If: Clusters are irregularly shaped, you do not know K, or you expect outliers and noise points.
Use: DBSCAN. It finds dense regions and explicitly labels sparse points as noise, with no forced assignment.
If: You want to understand cluster hierarchy, meaning how groups merge or split at different scales.
Use: Hierarchical Clustering (AgglomerativeClustering). Plot a dendrogram to choose the number of clusters visually.

The Comparison Table

This table summarizes the key decision points for the core algorithm families. Use it as a quick-reference after you have answered the two foundational questions — labeled or unlabeled, number or category. The table is not exhaustive. Its purpose is to capture the decisions that matter in the first 80% of problems you will encounter as a beginner practitioner.

● Production incident · POST-MORTEM · severity: high

Customer Churn Model Fails in Production After Choosing Regression for a Yes/No Problem

Symptom
Business users received churn scores like 0.73 or 0.21 but had no clear threshold for action. The model's output did not map to a clear 'will churn' or 'will not churn' decision. The operations team started ignoring the scores entirely after the first week.
Assumption
The data science team assumed a continuous output would provide more granularity and nuance than a binary label. They thought giving the business a score rather than a decision would be more flexible.
Root cause
The problem was fundamentally a binary classification task — will this customer churn in the next 30 days: yes or no. Using regression imposed an incorrect output structure. The continuous scores lacked probabilistic meaning in the business context and provided no actionable decision boundary. The model optimized for minimizing squared error on a 0/1 target, which is not the same as learning the probability of churn.
Fix
Retrain the model using Logistic Regression or a Random Forest classifier. Output a calibrated probability score (0 to 1) with a defined decision threshold agreed upon with the business team — for example, probability > 0.7 triggers an outreach call. Document the threshold, its business rationale, and how it should be revisited as the model ages.
Key lesson
  • Match the algorithm's output type to the business decision required — not to what seems technically richer.
  • A continuous number is not always more informative than a clear category with an associated confidence.
  • Validate model outputs with end-users before deployment — a model nobody acts on provides zero value regardless of its accuracy.
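The root cause can be illustrated on toy data. The sketch below fits a least-squares line to a made-up 0/1 churn target and compares its scores with a logistic curve. The feature (support tickets), the data, and the logistic weights (1.5 and 4.5) are all illustrative assumptions; the logistic curve is hand-set, not fitted.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


# Made-up data: customers with 5 or more support tickets churned
tickets = list(range(10))
churned = [1 if t >= 5 else 0 for t in tickets]

slope, intercept = fit_line(tickets, churned)
linear_scores = [slope * t + intercept for t in tickets]

# Hand-set logistic curve (weights are assumptions, not fitted values)
logistic_scores = [sigmoid(1.5 * (t - 4.5)) for t in tickets]

print(f"Linear scores:   {min(linear_scores):.2f} to {max(linear_scores):.2f}")
print(f"Logistic scores: {min(logistic_scores):.3f} to {max(logistic_scores):.3f}")
```

The regression scores run from about -0.18 to 1.18 even on the training data, so they cannot be read as probabilities. The logistic form stays strictly between 0 and 1, which is what makes an agreed threshold such as 0.7 meaningful.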
Production debug guide
Common signals you have chosen the wrong algorithm family.
Symptom · 01
Model outputs a number, but users need a clear yes/no decision.
Fix
You likely need a classification algorithm, not regression. Switch to Logistic Regression for binary outcomes or a multi-class classifier for more than two categories. Define a probability threshold with input from the business team — this is a domain decision, not a technical one.
Symptom · 02
Accuracy is high, but predictions are useless — for example, the model always predicts the majority class.
Fix
Check for class imbalance with value_counts() on your target. Replace accuracy with precision, recall, and F1-score. Apply class_weight='balanced' in your classifier or use SMOTE oversampling. A model that always predicts 'no churn' on a 95/5 split will report 95% accuracy and catch zero churners.
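The imbalance check and the effect of class_weight='balanced' can be sketched without any library. The labels below are synthetic, and the weight formula is the one scikit-learn documents for the 'balanced' option: n_samples / (n_classes * class_count).

```python
from collections import Counter

# Synthetic target with a 95/5 split
y = ["no_churn"] * 95 + ["churn"] * 5

counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# Equivalent of value_counts(normalize=True)
proportions = {label: count / n_samples for label, count in counts.items()}

# Weights that class_weight='balanced' would assign to each class
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(proportions)                        # {'no_churn': 0.95, 'churn': 0.05}
print(round(weights["no_churn"], 3))      # 0.526
print(weights["churn"])                   # 10.0
```

Each churner ends up counting roughly 19 times as much as a non-churner during training, which is what stops the model from simply ignoring the minority class.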
Symptom · 03
Clustering results change drastically with minor data additions or different random seeds.
Fix
K-Means is sensitive to initialization and outliers. Run it multiple times with different random_state values and compare inertia. If instability persists, try DBSCAN — it identifies dense regions and does not require pre-specifying the number of clusters. Also check whether your features are scaled; unscaled features cause K-Means to be dominated by high-magnitude features.
Symptom · 04
Regression model predictions are systematically off — always too high or too low for a subset of the data.
Fix
Check for non-linearity in the target relationship. Plot residuals against predicted values — a pattern in the residuals means linear regression is missing structure. Try a Decision Tree Regressor or add polynomial features. Also inspect for outliers in the target variable that are skewing the model's coefficients.
Symptom · 05
Classification model performs well on training data but precision collapses on the held-out test set.
Fix
This usually indicates overfitting, possibly combined with class imbalance. Check the training-to-test accuracy gap. As a rough rule of thumb, a gap of more than about 10 percentage points is a signal to reduce model complexity. If precision is specifically collapsing on the minority class, apply class weighting or resampling. Use stratify=y in train_test_split to ensure class proportions are preserved in both splits.
★ Algorithm Misapplication Cheat Sheet
Quick checks when your model's results do not make sense.
Predictions are all the same value — always '0', always the mean, or always the majority class label.
Immediate action
Check your target variable distribution for severe imbalance or near-zero variance.
Commands
df['target'].value_counts(normalize=True) # For classification — check class proportions
df['target'].describe() # For regression — check if variance is near zero
Fix now
For classification: use stratified sampling in train_test_split and set class_weight='balanced'. For regression: check whether the target variable has been accidentally encoded as a constant or rounded to integer bins.
Clustering gives one giant cluster and many tiny outlier clusters.
Immediate action
Visualize data distribution and use the elbow method to find the right K. Also check whether features are scaled.
Commands
from sklearn.cluster import KMeans; inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in range(1, 11)]  # inertia for K = 1..10; X is your scaled feature matrix
import matplotlib.pyplot as plt; plt.plot(range(1, 11), inertias, marker='o'); plt.xlabel('K'); plt.ylabel('Inertia'); plt.title('Elbow Method'); plt.show()
Fix now
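The elbow in the plot indicates a good K. If no clear elbow exists, try DBSCAN: it finds dense regions of arbitrary shape, labels isolated points as noise instead of forcing them into tiny clusters, and does not require pre-specifying K. Note that a single eps setting can struggle when cluster densities vary widely. Always scale features with StandardScaler before clustering.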
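The DBSCAN fallback might look like the sketch below. The data (two dense groups plus scattered outliers) and the eps and min_samples values are assumptions chosen for this synthetic example; real data needs its own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Two dense groups plus a handful of scattered outliers (illustrative data).
X_dense, _ = make_blobs(
    n_samples=400, centers=[[0, 0], [6, 6]], cluster_std=0.5, random_state=0
)
rng = np.random.default_rng(0)
X_outliers = rng.uniform(-15, 20, size=(15, 2))
X = np.vstack([X_dense, X_outliers])

# Scale first so eps means the same thing along every feature.
X_scaled = StandardScaler().fit_transform(X)

# eps and min_samples are hand-picked for this synthetic example.
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X_scaled)

# DBSCAN marks noise points with the label -1 instead of giving them a cluster.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))

print(f"clusters found: {n_clusters}, points labeled as noise: {n_noise}")
```

Where K-Means would have to place each outlier in some cluster, DBSCAN sets them aside, which is often the more honest answer for the "one giant cluster plus tiny ones" pattern.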
Regression model produces predictions outside the valid range — for example, negative house prices or probabilities above 1.
Immediate action
Check whether this is a classification problem disguised as regression. For bounded outputs, add a transformation or switch algorithms.
Commands
print(predictions.min(), predictions.max()) # Check prediction bounds
import numpy as np; print(np.sum(predictions < 0), 'negative predictions out of', len(predictions))
Fix now
For binary outcomes (0/1 target), switch to Logistic Regression. For bounded continuous outcomes, apply a target transformation (for example, a log transform for strictly positive, right-skewed targets such as prices) or use a model that respects the bounds natively.
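For the strictly positive case, one option is scikit-learn's TransformedTargetRegressor, sketched here on synthetic price-like data (the data-generating formula and the evaluation grid are illustrative assumptions). The model is fit on log(y) and predictions are mapped back with exp, so they cannot be negative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Strictly positive, right-skewed target (hypothetical prices).
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(500, 1))
y = np.exp(1.0 + 0.8 * X[:, 0] + rng.normal(0, 0.3, size=500))

# Plain linear regression has no notion of a lower bound.
plain = LinearRegression().fit(X, y)

# Fit on log(y), then invert with exp so every prediction is positive.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X, y)

# Evaluate on a grid that extends slightly beyond the training range.
X_new = np.linspace(-2, 7, 50).reshape(-1, 1)
plain_negative = int(np.sum(plain.predict(X_new) < 0))
log_negative = int(np.sum(log_model.predict(X_new) < 0))

print(f"negative predictions: plain={plain_negative}, log-target={log_negative}")
```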

Key takeaways

1
Start with your data: labeled or unlabeled? This single question determines supervised versus unsupervised learning and eliminates half the algorithm space immediately.
2
Define your prediction goal: a continuous number means regression, a discrete category means classification. These two answers narrow you to one algorithm family before you write a line of code.
3
Always begin with the simplest model in the chosen family to establish a baseline. Complexity is justified by measured improvement, not assumption.
4
Use the correct evaluation metric for the problem type: accuracy is often misleading, undefined for regression, and dangerous on imbalanced classification data.
5
Algorithm choice is a hypothesis about your data's structure. Validate it with experiments, residual analysis, and domain expert review, not just the training metric.

Common mistakes to avoid

5 patterns

Using a complex model like a Neural Network or Gradient Boosting as a first attempt.

Symptom
Model is hard to interpret, slow to train, requires significant tuning, and ultimately performs similarly to a simple baseline. Debugging failures is difficult because you cannot isolate what went wrong.
Fix
Always start with the simplest model in the family — Linear Regression for regression, Logistic Regression for classification. Use it as a baseline. Only add complexity when you can clearly measure that the simple model's performance is insufficient for the business requirement and you understand the root cause of its failure. Complexity should be a deliberate response to a specific limitation, not a default starting point.

Choosing an algorithm based on hype or familiarity rather than problem structure.

Symptom
You are using a Random Forest on a simple linear relationship where it adds no value, or using K-Means clustering when you actually have labels and the problem is classification. Results look plausible but the algorithm is solving the wrong problem.
Fix
Return to the two foundational questions before writing any code: do you have labeled data, and what does your output need to look like? Let those answers determine the algorithm family. Hype, trending papers, and tutorial familiarity are not valid selection criteria.

Not scaling features for distance-based and gradient-based algorithms.

Symptom
Clustering results are dominated by the feature with the largest raw values — annual income in dollars overwhelms age in years. Logistic Regression converges slowly or not at all. KNN accuracy is inexplicably poor.
Fix
Apply StandardScaler (zero mean, unit variance) before any algorithm that relies on Euclidean distance, feature variance, or gradient descent: K-Means, KNN, SVM, PCA, Logistic Regression, Neural Networks. Tree-based algorithms (Decision Trees, Random Forests, Gradient Boosting) split on feature thresholds, so they are insensitive to feature scale and do not require scaling. Use sklearn Pipeline to chain scaling and model together so the scaler is never forgotten and is always fit on the training folds only.
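A small sketch of the Pipeline pattern, assuming scikit-learn's bundled wine dataset and KNN as the distance-based model (both are illustrative choices). Cross-validation refits the scaler inside each fold, so no information from the validation fold leaks into the scaling.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The wine features span very different magnitudes (for example, proline vs. hue).
X, y = load_wine(return_X_y=True)

# KNN on raw features: distances are dominated by the largest-valued columns.
unscaled = KNeighborsClassifier(n_neighbors=5)

# Scaling and model chained together; the scaler is fit per training fold.
scaled = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

unscaled_score = cross_val_score(unscaled, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled, X, y, cv=5).mean()

print(f"KNN accuracy without scaling: {unscaled_score:.3f}")
print(f"KNN accuracy with scaling:    {scaled_score:.3f}")
```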

Evaluating regression models with accuracy.

Symptom
You call accuracy_score on a regression model's output. Accuracy is undefined for continuous outputs: scikit-learn raises a ValueError when the values are non-integer floats, and when the values happen to be integer-valued it silently counts exact matches, which is almost always close to zero for real-valued predictions. Either way, any number you get out of it says nothing about how close the predictions are.
Fix
Use mean_absolute_error, mean_squared_error, and r2_score for regression evaluation. MAE gives you the average error in the same units as your target variable, which stakeholders can understand directly. R² tells you how much of the target variance your model explains.
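The metrics above can be computed as in this sketch, which uses four made-up house prices in thousands of dollars (purely illustrative values). It also shows what happens when accuracy_score is called on the same arrays.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

# Hypothetical house prices in thousands of dollars.
y_true = np.array([250.5, 310.0, 180.0, 420.0])
y_pred = np.array([240.5, 330.0, 200.0, 400.0])

mae = mean_absolute_error(y_true, y_pred)           # average miss, in target units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large misses more
r2 = r2_score(y_true, y_pred)                       # share of variance explained

# accuracy_score rejects continuous targets with a ValueError.
try:
    accuracy_score(y_true, y_pred)
    accuracy_error = None
except ValueError as exc:
    accuracy_error = str(exc)

print(f"MAE={mae:.1f}k, RMSE={rmse:.1f}k, R^2={r2:.3f}")
print("accuracy_score raised:", accuracy_error)
```

Here the absolute errors are 10, 20, 20, and 20, so the MAE is 17.5, meaning predictions miss by $17.5k on average, which is a statement a stakeholder can act on.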

Treating clustering output as ground truth labels without domain validation.

Symptom
You present cluster assignments as definitive customer segments without verifying they map to meaningful business distinctions. The clusters are statistically valid but useless for decision-making.
Fix
Profile every cluster by summarizing the mean and range of each feature within it. Present these profiles to domain experts and ask whether each cluster describes a recognizable group. Statistically coherent clusters with no business interpretation are not deployable outputs.
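Cluster profiling can be sketched with a pandas groupby, assuming a hypothetical customer table with age and annual_income columns (the column names and the synthetic values are illustrative assumptions).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer table with two loosely defined groups.
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "age": np.concatenate([rng.normal(25, 3, 100), rng.normal(55, 4, 100)]),
    "annual_income": np.concatenate(
        [rng.normal(40_000, 5_000, 100), rng.normal(95_000, 8_000, 100)]
    ),
})

# Cluster on scaled features, then attach the labels to the raw table.
X_scaled = StandardScaler().fit_transform(customers[["age", "annual_income"]])
customers["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# One row per cluster: mean, range, and size for every feature, in original units.
profile = customers.groupby("cluster").agg(["mean", "min", "max", "count"])

print(profile.round(1))
```

The profile table, not the raw cluster IDs, is what goes in front of domain experts: each row should read like a description of a group they recognize.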
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
When would you choose a Decision Tree over Logistic Regression for a cla...
Q02 · SENIOR
Your clustering model produces one very large cluster and several tiny o...
Q03 · JUNIOR
Why is accuracy a poor metric for a classification problem with 99% nega...
Q04 · SENIOR
Walk me through how you would approach a new ML problem from scratch — s...
Q01 of 04 · SENIOR

When would you choose a Decision Tree over Logistic Regression for a classification problem?

ANSWER
I would choose a Decision Tree when the relationship between features and the target is non-linear and the linear decision boundary that Logistic Regression assumes would underfit the data. Decision Trees are also the better choice when interpretability through explicit if-then rules is a business requirement — you can show the tree to a non-technical stakeholder and walk through the logic. Logistic Regression is the better starting point when the relationship is approximately linear, when I need calibrated probability outputs for downstream decisions like threshold tuning, or when the training data is limited and I want to reduce variance through a simpler model. In practice, I start with Logistic Regression, measure its performance, and switch to a tree-based model only when the residual analysis or validation metrics indicate that the linear assumption is failing.
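The "start linear, switch when it underfits" workflow from this answer can be illustrated with a sketch on scikit-learn's make_moons data, which has a curved class boundary by construction (the dataset, noise level, and max_depth are illustrative assumptions).

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Two interleaving half-circles: no straight line separates the classes well.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

# Step 1: the linear baseline.
logreg_score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# Step 2: a depth-limited tree that can carve a non-linear boundary.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree_score = cross_val_score(tree, X, y, cv=5).mean()

print(f"Logistic Regression CV accuracy: {logreg_score:.3f}")
print(f"Decision Tree CV accuracy:       {tree_score:.3f}")
```

The measured gap on held-out folds is the evidence for switching; on data with a roughly linear boundary the same comparison would favor keeping Logistic Regression.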
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Can I use regression for a binary (0/1) outcome?
02
How do I know how many clusters (K) to use in K-Means?
03
What is the difference between a classification and a clustering problem?
04
When should I use Random Forest instead of a single Decision Tree?
05
Does my dataset need to be large to use machine learning?