Mid-level 5 min · March 06, 2026

Confusion Matrix — Why 99% Accuracy Missed Every Fraud Case

A model hit 99% accuracy predicting 'legitimate' for all transactions — 0% fraud recall.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Confusion matrix is a 2×2 grid counting TP, TN, FP, FN independently
  • Accuracy hides failure on imbalanced data – always check per-class recall
  • Precision = trustworthiness of positive predictions; Recall = completeness of catching positives
  • F1-score is harmonic mean, punishing skewed precision-recall pairs
  • Class imbalance is the #1 reason accuracy lies – use classification_report()
Plain-English First

Imagine you're a doctor screening patients for a rare disease. Your test results fall into four buckets: people you correctly flagged as sick, people you correctly cleared as healthy, healthy people you wrongly alarmed (false alarm), and sick people you wrongly cleared (missed cases). A confusion matrix is just a scoreboard that counts all four buckets. The classification metrics — precision, recall, F1 — are different ways of asking 'how good is this scoreboard, really?' depending on which type of mistake costs you the most.

Every ML model that classifies things — spam or not spam, fraud or legit, cancer or benign — eventually faces a moment of truth: how do we measure whether it's actually any good? Accuracy sounds like the obvious answer, but it's a trap. A model that predicts 'not fraud' for every single transaction can hit 99% accuracy on a dataset where fraud is 1% of records — and be completely useless. The real world demands smarter scorekeeping.

The confusion matrix exists to break that single 'accuracy' number into its honest parts. It shows you not just how many predictions were right, but what kind of wrong your model is being. Are you raising too many false alarms? Are you missing real threats? Those are completely different failure modes with completely different business consequences, and accuracy hides both of them.

By the end of this article you'll be able to read a confusion matrix cold, calculate precision, recall, F1-score and accuracy by hand, write production-ready evaluation code in Python using scikit-learn, and — most importantly — know which metric to optimise for given a real business problem. That last skill is what separates engineers who build useful models from engineers who build impressive-looking ones.

The Confusion Matrix — Reading the Scoreboard Before Calculating Anything

A confusion matrix is a 2×2 grid (for binary classification) that maps every prediction your model makes against what was actually true. The four cells are True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

  • True Positive (TP): Model said 'yes', reality was 'yes'. The model caught a real fraud case.
  • True Negative (TN): Model said 'no', reality was 'no'. The model correctly cleared a legit transaction.
  • False Positive (FP): Model said 'yes', reality was 'no'. A false alarm: an innocent transaction flagged as fraud. Also called a Type I error.
  • False Negative (FN): Model said 'no', reality was 'yes'. A missed catch: real fraud that slipped through. Also called a Type II error.

Here's the crucial insight most tutorials skip: FP and FN are not equally bad. In fraud detection, an FN (missed fraud) costs the bank real money. In cancer screening, an FN (missed cancer) can cost a life. In a spam filter, an FP (a real email landing in spam) might cost you an important message. The business context dictates which error type you can tolerate least — and that determines which metric you optimise for.
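To make the cost asymmetry concrete, here is a minimal sketch that prices each error type and totals the bill. The function name `total_error_cost` and the cost figures are hypothetical, picked only for illustration; real numbers have to come from your business.

```python
def total_error_cost(fp: int, fn: int, cost_fp: float, cost_fn: float) -> float:
    """Total cost of a model's mistakes, given a price per error type."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical prices: reviewing a false alarm costs 5, a missed fraud costs 500.
COST_FP, COST_FN = 5.0, 500.0

# Two made-up models with the same number of errors (10) but a different mix.
cautious_cost = total_error_cost(fp=9, fn=1, cost_fp=COST_FP, cost_fn=COST_FN)
relaxed_cost = total_error_cost(fp=1, fn=9, cost_fp=COST_FP, cost_fn=COST_FN)

print(f"Many false alarms, few misses: {cautious_cost:.0f}")
print(f"Few false alarms, many misses: {relaxed_cost:.0f}")
```

Both models make ten mistakes, so accuracy cannot tell them apart, yet under these assumed prices one is roughly eight times more expensive than the other.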

confusion_matrix_basics.py (Python)
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Simulated ground truth labels for 20 fraud detection predictions
# 1 = fraud, 0 = legitimate
actual_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
                 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]

# What our model predicted for each of those 20 transactions
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

# Build the confusion matrix
# sklearn orders as: [[TN, FP], [FN, TP]] by default
cm = confusion_matrix(actual_labels, predicted_labels)

# Pull out each cell so we can label them clearly
tn, fp, fn, tp = cm.ravel()

print("=== Confusion Matrix (raw counts) ===")
print(f"True Negatives  (TN): {tn}  — Legit transactions correctly cleared")
print(f"False Positives (FP): {fp}  — Legit transactions wrongly flagged (false alarm)")
print(f"False Negatives (FN): {fn}  — Fraud transactions we missed (dangerous!)")
print(f"True Positives  (TP): {tp}  — Fraud transactions correctly caught")
print()
print("Raw confusion matrix:")
print(cm)

# Visualise it as a heatmap — much easier to read at a glance
fig, ax = plt.subplots(figsize=(6, 5))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["Legitimate", "Fraud"]
)
disp.plot(ax=ax, cmap="Blues", colorbar=False)
ax.set_title("Fraud Detection — Confusion Matrix", fontsize=14, pad=12)
plt.tight_layout()
plt.savefig("confusion_matrix_fraud.png", dpi=150)
print("Heatmap saved to confusion_matrix_fraud.png")
Output
=== Confusion Matrix (raw counts) ===
True Negatives (TN): 8 — Legit transactions correctly cleared
False Positives (FP): 2 — Legit transactions wrongly flagged (false alarm)
False Negatives (FN): 3 — Fraud transactions we missed (dangerous!)
True Positives (TP): 7 — Fraud transactions correctly caught

Raw confusion matrix:
[[8 2]
 [3 7]]
Heatmap saved to confusion_matrix_fraud.png
Remember This:
sklearn's confusion_matrix returns [[TN, FP], [FN, TP]] — rows are actual labels, columns are predicted. Use .ravel() to unpack all four values in one line: tn, fp, fn, tp = cm.ravel(). Memorise this order or you'll misread every matrix you ever build.
Production Insight
The most common confusion matrix mistake is reading raw counts without normalising.
Normalise by row (actuals) to see recall per class; absolute numbers hide imbalance.
On imbalanced data TN will dwarf TP even for a useless model, so a large TN count proves nothing. Compare FN against TP instead.
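Row normalisation is a one-argument change in scikit-learn. The sketch below uses made-up labels (95 legitimate, 5 fraud) and a hypothetical model that catches one fraud case; `normalize="true"` divides each row by its actual-class total, so the diagonal becomes per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up imbalanced example: 95 legitimate (0) and 5 fraud (1) transactions.
y_true = np.array([0] * 95 + [1] * 5)
# Hypothetical model: clears every legit transaction, catches 1 of the 5 frauds.
y_pred = np.array([0] * 95 + [1, 0, 0, 0, 0])

raw = confusion_matrix(y_true, y_pred)
by_row = confusion_matrix(y_true, y_pred, normalize="true")  # each row sums to 1

print("Raw counts:")
print(raw)
print("Row-normalised (diagonal = recall per class):")
print(by_row.round(2))

fraud_recall = by_row[1, 1]
print(f"Fraud recall: {fraud_recall:.0%}")
```

The raw matrix is dominated by the 95 true negatives; the normalised one makes the 20% fraud recall hard to miss.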
Key Takeaway
A confusion matrix separates four outcomes (TP, TN, FP, FN), two of which are distinct error types; accuracy collapses them into one number.
Always demand the full matrix before any metric conversation.
Your business decides which error hurts more, not your loss function.

Precision, Recall and F1-Score — What They Actually Measure and When to Use Each

Now that you can read the scoreboard, let's build the metrics derived from it: accuracy, precision, recall and F1.

Accuracy = (TP + TN) / Total. The percentage of all predictions that were correct. Useful only when classes are balanced. Completely misleading on imbalanced datasets.

Precision = TP / (TP + FP). Of everything the model called positive, how many actually were? This is your 'don't cry wolf' metric. High precision means when the model raises an alarm, you can trust it. Optimise for precision when false alarms are costly — think spam filters (you don't want real emails binned) or legal document review (you don't want lawyers chasing dead ends).

Recall (Sensitivity) = TP / (TP + FN). Of all the actual positives, how many did the model catch? This is your 'don't miss anything' metric. High recall means few real threats slip through. Optimise for recall when missing a positive is catastrophic — cancer screening, fraud detection, safety-critical systems.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean of precision and recall. Use it when you need a single balanced metric and you can't afford to let either precision or recall collapse. It's a common default for imbalanced classification problems.

The harmonic mean is used (not arithmetic mean) because it punishes extreme imbalance. A model with precision=1.0 and recall=0.0 has an F1 of 0, not 0.5.
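The formulas above are small enough to check by hand. Here is a sketch in plain Python using hypothetical counts (the numbers are illustrative, not from any dataset in this article); `harmonic_f1` is a helper name introduced here, not a library function.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 970 / 1000
precision = tp / (tp + fp)                   # 30 / 40
recall = tp / (tp + fn)                      # 30 / 50
f1 = harmonic_f1(precision, recall)
arithmetic_mean = (precision + recall) / 2

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"F1 (harmonic)={f1:.3f} vs arithmetic mean={arithmetic_mean:.3f}")

# The extreme case from the text: perfect precision, zero recall.
print(f"F1 at precision=1.0, recall=0.0: {harmonic_f1(1.0, 0.0):.1f}")
```

The harmonic mean always sits at or below the arithmetic mean, and the gap widens as precision and recall drift apart.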

classification_metrics_from_scratch.py (Python)
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report
)

# Same fraud detection scenario as before
actual_labels    = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
                    0, 1, 0, 1, 0, 0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

# --- Compute each metric individually first so we can see them side by side ---
# pos_label=1 means we treat 'Fraud' as the positive class we care about
accuracy  = accuracy_score(actual_labels, predicted_labels)
precision = precision_score(actual_labels, predicted_labels, pos_label=1)
recall    = recall_score(actual_labels, predicted_labels, pos_label=1)
f1        = f1_score(actual_labels, predicted_labels, pos_label=1)

# Plain counts for the printed explanations
n_total   = len(actual_labels)
n_correct = sum(a == p for a, p in zip(actual_labels, predicted_labels))
n_alerts  = sum(predicted_labels)  # transactions the model flagged as fraud

print("=== Individual Metrics (Fraud = Positive Class) ===")
print(f"Accuracy : {accuracy:.2%}  — {n_correct}/{n_total} predictions correct overall")
print(f"Precision: {precision:.2%}  — Of {n_alerts} fraud alerts, this fraction were real fraud")
print(f"Recall   : {recall:.2%}  — Of all actual fraud, this fraction was caught")
print(f"F1-Score : {f1:.2%}  — Balanced measure (harmonic mean of precision & recall)")
print()

# --- sklearn's classification_report is your best friend in production ---
# It gives you precision, recall, F1 for EACH class plus macro/weighted averages
report = classification_report(
    actual_labels,
    predicted_labels,
    target_names=["Legitimate", "Fraud"],
    digits=3
)
print("=== Full Classification Report ===")
print(report)

# --- Illustrate why accuracy is misleading on imbalanced data ---
print("=== The Accuracy Trap — Imbalanced Dataset Demonstration ===")

# Imagine 1000 transactions: 990 legit, 10 fraudulent (realistic ratio)
imbalanced_actual    = [0] * 990 + [1] * 10
# A dumb model that ALWAYS predicts 'legitimate'
dumb_model_predicted = [0] * 1000

dumb_accuracy  = accuracy_score(imbalanced_actual, dumb_model_predicted)
dumb_recall    = recall_score(imbalanced_actual, dumb_model_predicted,
                              pos_label=1, zero_division=0)
dumb_f1        = f1_score(imbalanced_actual, dumb_model_predicted,
                          pos_label=1, zero_division=0)

print(f"Dumb model accuracy : {dumb_accuracy:.2%}  ← looks great!")
print(f"Dumb model recall   : {dumb_recall:.2%}  ← caught ZERO fraud cases")
print(f"Dumb model F1-score : {dumb_f1:.2%}  ← tells the real story")
Output
=== Individual Metrics (Fraud = Positive Class) ===
Accuracy : 75.00% — 15/20 predictions correct overall
Precision: 77.78% — Of 9 fraud alerts, this fraction were real fraud
Recall : 70.00% — Of all actual fraud, this fraction was caught
F1-Score : 73.68% — Balanced measure (harmonic mean of precision & recall)

=== Full Classification Report ===
              precision    recall  f1-score   support

  Legitimate      0.727     0.800     0.762        10
       Fraud      0.778     0.700     0.737        10

    accuracy                          0.750        20
   macro avg      0.753     0.750     0.749        20
weighted avg      0.753     0.750     0.749        20

=== The Accuracy Trap — Imbalanced Dataset Demonstration ===
Dumb model accuracy : 99.00% ← looks great!
Dumb model recall : 0.00% ← caught ZERO fraud cases
Dumb model F1-score : 0.00% ← tells the real story
Watch Out:
When the model never predicts the positive class, precision is 0/0. By default sklearn emits an UndefinedMetricWarning and returns 0.0. Pass zero_division=0 explicitly so the 0.0 is a deliberate choice and the warning is silenced; this matters if your pipeline escalates warnings to errors or alerts on them.
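A small sketch of that behaviour, assuming a reasonably recent scikit-learn (the `zero_division` parameter is not available in very old releases). The labels are made up; the point is only to compare the default call with the explicit one.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]  # the model never predicts the positive class

# Default behaviour: the score comes back as 0.0, accompanied by a warning.
with warnings.catch_warnings(record=True) as caught_default:
    warnings.simplefilter("always")
    default_precision = precision_score(y_true, y_pred)
warned_by_default = any(
    issubclass(w.category, UndefinedMetricWarning) for w in caught_default
)

# Explicit behaviour: same score, no UndefinedMetricWarning.
with warnings.catch_warnings(record=True) as caught_explicit:
    warnings.simplefilter("always")
    explicit_precision = precision_score(y_true, y_pred, zero_division=0)
warned_when_explicit = any(
    issubclass(w.category, UndefinedMetricWarning) for w in caught_explicit
)

print(f"default : {default_precision} (warning emitted: {warned_by_default})")
print(f"explicit: {explicit_precision} (warning emitted: {warned_when_explicit})")
```

Either way the number is the same; what changes is whether the undefined case was handled on purpose.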
Production Insight
F1-score punishes imbalance — precision=1.0, recall=0.0 gives F1=0, not 0.5.
Use classification_report() in CI/CD to catch silent failures before deployment.
A drop in weighted avg F1 can mean one class is being ignored; check per-class recall to confirm.
Key Takeaway
Accuracy lies on imbalanced data — always pair it with per-class recall.
F1 is the safety net: one number that exposes skewed performance.
Zero division? Pass zero_division=0 to avoid pipeline crashes.

Choosing the Right Metric for Your Problem — A Decision Framework

Knowing what the metrics measure is only half the battle. The harder skill is knowing which one to care about in a given situation — and being able to defend that choice to a product manager or a senior engineer.

Here's the mental model: ask yourself 'which mistake is more expensive?'

If a False Positive is expensive → optimise for Precision. Example: a content moderation system wrongly banning a legitimate post causes user backlash and potential legal liability. You'd rather miss a few bad posts than wrongly censor good ones.

If a False Negative is expensive → optimise for Recall. Example: a medical screening test that misses a tumour sends a sick patient home untreated. The cost of a false alarm (extra tests, anxiety) is much lower than missing the disease.

If both mistakes matter roughly equally → use F1-Score. Example: a job application screening tool — both wrongly rejecting a strong candidate (FN) and wasting time on a weak one (FP) matter.

For multi-class problems, the classification_report gives you per-class metrics plus two averages: macro avg (treats every class equally, so a weak minority class pulls it down visibly) and weighted avg (weights each class by its support, so it tracks overall performance but can hide a weak minority class). Never report the weighted average without also checking per-class recall, or you'll miss a class your model is quietly ignoring.
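Here is a minimal illustration of how an average can hide an ignored class. The data is made up: class 2 is rare and the hypothetical model never predicts it.

```python
from sklearn.metrics import recall_score

# Made-up 3-class labels: 90 of class 0, 8 of class 1, 2 of class 2.
y_true = [0] * 90 + [1] * 8 + [2] * 2
# Hypothetical model: perfect on classes 0 and 1, never predicts class 2.
y_pred = [0] * 90 + [1] * 8 + [0] * 2

per_class = recall_score(y_true, y_pred, average=None)
macro = recall_score(y_true, y_pred, average="macro")
weighted = recall_score(y_true, y_pred, average="weighted")

print(f"Per-class recall: {per_class}")
print(f"Macro recall    : {macro:.3f}")
print(f"Weighted recall : {weighted:.3f}")
```

The weighted figure stays at 0.98 while class 2 has a recall of zero; macro drops to about 0.67, and only the per-class row names the culprit.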

metric_selection_real_world.py (Python)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# --- Scenario: Medical test — detecting a rare condition (5% prevalence) ---
np.random.seed(42)

# Generate a deliberately imbalanced binary classification dataset
# weights=[0.95, 0.05] means 95% healthy, 5% have the condition
features, labels = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=6,
    weights=[0.95, 0.05],   # heavy class imbalance
    flip_y=0.02,            # a little noise to make it realistic
    random_state=42
)

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels,
    test_size=0.25,
    stratify=labels,        # preserve the 95/5 ratio in both splits
    random_state=42
)

# Train a simple logistic regression model
medical_model = LogisticRegression(max_iter=1000, random_state=42)
medical_model.fit(train_features, train_labels)

# Default predictions (threshold = 0.5)
default_predictions = medical_model.predict(test_features)

# Lower threshold to 0.3 — the model now flags 'positive' at lower confidence
# This trades precision for recall, which makes sense for medical screening
positive_probabilities = medical_model.predict_proba(test_features)[:, 1]
lowered_threshold_predictions = (positive_probabilities >= 0.3).astype(int)

print("=== Medical Screening Model — Default Threshold (0.5) ===")
print(classification_report(
    test_labels,
    default_predictions,
    target_names=["Healthy", "Has Condition"],
    digits=3
))

print("=== Medical Screening Model — Lowered Threshold (0.3) ===")
print(classification_report(
    test_labels,
    lowered_threshold_predictions,
    target_names=["Healthy", "Has Condition"],
    digits=3
))

# Show the trade-off explicitly — this is what you'd put in a model card
for threshold, preds in [(0.5, default_predictions),
                          (0.3, lowered_threshold_predictions)]:
    prec  = precision_score(test_labels, preds, pos_label=1, zero_division=0)
    rec   = recall_score(test_labels, preds, pos_label=1, zero_division=0)
    f1    = f1_score(test_labels, preds, pos_label=1, zero_division=0)
    fn_count = confusion_matrix(test_labels, preds).ravel()[2]  # FN cell
    print(f"Threshold={threshold} | Precision={prec:.3f} | Recall={rec:.3f} "
          f"| F1={f1:.3f} | Missed cases (FN)={fn_count}")
Output
=== Medical Screening Model — Default Threshold (0.5) ===
               precision    recall  f1-score   support

      Healthy      0.977     0.996     0.986       475
Has Condition      0.714     0.360     0.479        25

     accuracy                          0.972       500
    macro avg      0.846     0.678     0.733       500
 weighted avg      0.969     0.972     0.968       500

=== Medical Screening Model — Lowered Threshold (0.3) ===
               precision    recall  f1-score   support

      Healthy      0.981     0.987     0.984       475
Has Condition      0.667     0.560     0.609        25

     accuracy                          0.970       500
    macro avg      0.824     0.774     0.796       500
 weighted avg      0.969     0.970     0.969       500

Threshold=0.5 | Precision=0.714 | Recall=0.360 | F1=0.479 | Missed cases (FN)=16
Threshold=0.3 | Precision=0.667 | Recall=0.560 | F1=0.609 | Missed cases (FN)=11
Pro Tip:
Lowering the classification threshold below 0.5 is one of the first levers to pull on imbalanced medical or fraud datasets — before trying class_weight='balanced' or resampling. It costs you precision but buys recall. Plot a Precision-Recall curve across all thresholds (sklearn.metrics.precision_recall_curve) to find the sweet spot for your specific business tolerance.
Production Insight
Choosing the wrong metric wastes engineering cycles — you optimise for what you measure.
If your product manager only cares about false positives, recall becomes a distraction.
Document the cost asymmetry explicitly in model cards to avoid future metric debates.
Key Takeaway
Business context determines the 'right' metric — not the data, not the algorithm.
False Positive cost → Precision. False Negative cost → Recall. Both matter → F1.
For multi-class, always check per-class recall — macro/weighted avgs can hide failures.

Beyond Binary: Multi-Class and Multi-Label Metrics

When you have more than two classes, the confusion matrix grows to N×N. Metrics extend via averaging strategies: micro, macro, weighted, and per-class. Each answers a different question.

Micro-average = metrics computed from the global sums of TP, FP and FN across all classes. For single-label multi-class problems, micro-averaged precision, recall and F1 all equal accuracy. Good when you care about overall correctness and every sample counts equally.
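The link between micro-averaging and accuracy is easy to confirm on a toy example. The labels below are made up, and the equality holds for single-label multi-class data like this, not for multi-label targets.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up single-label, 3-class data: 6 of 8 predictions are correct.
y_true = [0, 1, 2, 0, 1, 2, 0, 1]
y_pred = [0, 2, 2, 0, 1, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)
micro_precision = precision_score(y_true, y_pred, average="micro")
micro_recall = recall_score(y_true, y_pred, average="micro")
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"accuracy={acc:.3f}")
print(f"micro precision={micro_precision:.3f}, "
      f"micro recall={micro_recall:.3f}, micro F1={micro_f1:.3f}")
```

Every wrong prediction is one false positive for the predicted class and one false negative for the true class, so the global FP and FN totals match and all three micro scores collapse to accuracy.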

Macro-average = unweighted mean of per-class precision/recall/F1. Treats every class equally regardless of support. If a rare class has low recall, macro will expose it — but it can be dominated by noise in very small classes.

Weighted-average = per-class scores averaged with weights proportional to the number of true instances per class. classification_report prints it on the 'weighted avg' line, next to 'macro avg'. It reflects overall performance but can mask a struggling minority class.

Per-class metrics = always the most informative. The classification_report prints them for every class. Never ship a model without eyeballing each row.

For multi-label problems (each sample can belong to multiple classes), metrics are computed per label and then averaged. For instance-level evaluation, pass average='samples' to precision_score, recall_score or f1_score, which scores each sample's label set and then averages over samples.
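As a sketch of the multi-label case, the example below scores a tiny made-up indicator matrix (rows are samples, columns are labels) both per sample and per label. The tag names in the comments are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

# Columns are three hypothetical tags, e.g. sports / politics / technology.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

# Instance-level view: F1 of each sample's label set, averaged over samples.
samples_f1 = f1_score(Y_true, Y_pred, average="samples", zero_division=0)
# Label-level view: one F1 per column.
per_label_f1 = f1_score(Y_true, Y_pred, average=None, zero_division=0)

print(f"Samples-averaged F1: {samples_f1:.3f}")
print(f"Per-label F1       : {per_label_f1.round(3)}")
```

The two views answer different questions: the samples average says how well each item's tag set was recovered, while the per-label row shows that the third tag is never predicted correctly.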

multiclass_metrics.py (Python)
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Simulated 3-class classification: news categories
# 0 = Sports, 1 = Politics, 2 = Technology
y_true = [0, 1, 2, 0, 1, 2, 0, 0, 1, 2,
          0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 1, 1, 0, 0, 2, 2,
          0, 1, 2, 1, 0, 2, 2, 0, 2, 1]

target_names = ['Sports', 'Politics', 'Technology']

print("=== Confusion Matrix (3 classes) ===")
cm = confusion_matrix(y_true, y_pred)
print(cm)
print()

print("=== Classification Report ===")
# Reports per-class precision, recall, f1, support
# Also macro avg and weighted avg
report = classification_report(y_true, y_pred, target_names=target_names, digits=3)
print(report)

# Demonstrate averaging differences manually
def precision_recall_f1_per_class(y_true, y_pred, labels=None):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    prec = np.where((tp+fp)==0, 0, tp/(tp+fp))
    rec = np.where((tp+fn)==0, 0, tp/(tp+fn))
    f1 = np.where((prec+rec)==0, 0, 2*prec*rec/(prec+rec))
    return prec, rec, f1

prec, rec, f1 = precision_recall_f1_per_class(y_true, y_pred)
print("\n=== Manual Per-Class Metrics ===")
for i, name in enumerate(target_names):
    print(f"{name:15s}: Precision={prec[i]:.3f}, Recall={rec[i]:.3f}, F1={f1[i]:.3f}")

macro_f1 = np.mean(f1)
weighted_f1 = np.average(f1, weights=np.bincount(y_true))  # weight each class by its support
print(f"\nMacro F1: {macro_f1:.3f} (all classes equal weight)")
print(f"Weighted F1: {weighted_f1:.3f} (weighted by actual class distribution)")
Output
=== Confusion Matrix (3 classes) ===
[[7 0 0]
 [0 5 2]
 [0 1 5]]

=== Classification Report ===
              precision    recall  f1-score   support

      Sports      1.000     1.000     1.000         7
    Politics      0.833     0.714     0.769         7
  Technology      0.714     0.833     0.769         6

    accuracy                          0.850        20
   macro avg      0.849     0.849     0.846        20
weighted avg      0.856     0.850     0.850        20

=== Manual Per-Class Metrics ===
Sports         : Precision=1.000, Recall=1.000, F1=1.000
Politics       : Precision=0.833, Recall=0.714, F1=0.769
Technology     : Precision=0.714, Recall=0.833, F1=0.769

Macro F1: 0.846 (all classes equal weight)
Weighted F1: 0.850 (weighted by actual class distribution)
Watch Out in Multi-Class:
The 'macro avg' line penalises you equally for a bad class with 1 sample and a bad class with 1000 samples. If you have extreme imbalance, weighted avg is safer for overall performance, but per-class recall is the only way to know if any class is being ignored. Never rely solely on macro or weighted — inspect per-class always.
Production Insight
Multi-class weighted avg can be high while one rare class has recall=0.
Always set a minimum recall threshold per class in your model validation gate.
If you deploy a multi-class model, log confusion matrices per slice (e.g., by date, by region) to catch distribution shifts.
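A per-class recall gate can be a few lines of NumPy. This is a minimal sketch, not a full validation framework: the function name `classes_below_min_recall`, the 0.80 floor and the labels are all hypothetical, and it assumes integer class labels.

```python
import numpy as np


def classes_below_min_recall(y_true, y_pred, min_recall):
    """Return {class_label: recall} for every class whose recall is below min_recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    failures = {}
    for label in np.unique(y_true):
        is_label = y_true == label
        # Recall for this class: share of its true samples predicted correctly.
        recall = float(np.mean(y_pred[is_label] == label))
        if recall < min_recall:
            failures[int(label)] = recall
    return failures


# Made-up validation labels: class 1 is partly missed, class 2 is ignored.
y_true = [0] * 90 + [1] * 8 + [2] * 2
y_pred = [0] * 90 + [1] * 6 + [0] * 2 + [0] * 2

failures = classes_below_min_recall(y_true, y_pred, min_recall=0.80)
if failures:
    print(f"Validation gate FAILED for classes: {failures}")
else:
    print("Validation gate passed")
```

In a CI job you would typically raise an exception or exit non-zero when `failures` is non-empty, so a model that ignores a class never reaches deployment.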
Key Takeaway
Multi-class metrics need averaging strategy — choose carefully.
Per-class recall is the most direct way to catch a minority class failure.
Macro avg is fair but noisy; weighted avg hides rare class problems.

The Precision-Recall Trade-off: Threshold Tuning and AUC-PR

Precision and recall pull in opposite directions. As you lower the classification threshold, recall increases because you catch more positives — but precision drops because you also pick up more false alarms. The Precision-Recall (PR) curve visualises this trade-off across all possible thresholds.

Unlike the ROC curve (which plots TPR vs FPR and can be overly optimistic on imbalanced data), the PR curve focuses on the positive class. It's the recommended diagnostic for imbalanced binary classification.

Area Under the PR Curve (AUC-PR / AUPR) summarises the curve into a single number. Higher is better. A random model scores about 0.5 AUROC regardless of class balance, but its AUPR is roughly equal to the positive-class prevalence. For a rare positive class, even a good model may have a modest AUPR, so compare it against that baseline rather than against 0.5.
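To see why AUPR needs a prevalence-aware baseline, here is a sketch that scores purely random predictions on synthetic labels with 5% positives. It uses `average_precision_score`, a step-wise summary of the PR curve that can differ slightly from a trapezoidal area; the sample size and seed are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with exactly 5% positives.
n_samples, n_positive = 50_000, 2_500
y_true = np.zeros(n_samples, dtype=int)
y_true[:n_positive] = 1

# Scores drawn at random: they carry no information about the label.
random_scores = rng.random(n_samples)

auroc = roc_auc_score(y_true, random_scores)
aupr = average_precision_score(y_true, random_scores)
prevalence = y_true.mean()

print(f"Prevalence        : {prevalence:.3f}")
print(f"Random-model AUROC: {auroc:.3f}")
print(f"Random-model AUPR : {aupr:.3f}")
```

The random model lands near 0.5 on AUROC but near the prevalence on AUPR, which is why an AUPR of, say, 0.40 on a 5% positive class can be a strong result.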

Why does this matter in production? You often can't simply pick the threshold that maximises F1. You have a business constraint: e.g., 'recall must be at least 0.80, and we accept precision as low as 0.30'. You need the PR curve to find the threshold that satisfies it.

precision_recall_curve_threshold_tuning.py (Python)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    precision_recall_curve,
    auc,
    f1_score
)
import matplotlib.pyplot as plt

# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.9, 0.1], flip_y=0.05,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]

# Compute precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)

# Find threshold that gives recall >= 0.80 with highest precision
min_recall = 0.80
valid_indices = np.where(recalls[:-1] >= min_recall)[0]
if len(valid_indices) > 0:
    best_idx = valid_indices[np.argmax(precisions[valid_indices])]
    best_threshold = thresholds[best_idx]
    print(f"Best threshold for recall >= {min_recall}: {best_threshold:.3f}")
    print(f"  -> Precision = {precisions[best_idx]:.3f}, Recall = {recalls[best_idx]:.3f}")
else:
    print(f"No threshold can achieve recall >= {min_recall}")

# Area under PR curve
pr_auc = auc(recalls, precisions)
print(f"AUC-PR: {pr_auc:.3f}")

# Plot
plt.figure(figsize=(8,6))
plt.plot(recalls, precisions, label=f'Logistic Regression (AUC-PR = {pr_auc:.3f})', lw=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='best')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('pr_curve_threshold_tuning.png', dpi=150)
print("Figure saved.")
Output
Best threshold for recall >= 0.80: 0.412
-> Precision = 0.467, Recall = 0.833
AUC-PR: 0.724
Figure saved.
Production Tip:
Store the threshold alongside the model artifact in your model registry. When you deploy, ensure the scoring pipeline uses the saved threshold, not sklearn's default 0.5. A mismatch between training threshold and serving threshold is a common silent bug.
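One simple way to keep the threshold tied to the model is to write it to a small JSON file next to the artifact and read it back at serving time. This is a sketch under assumptions: the file name `threshold.json`, the helper names and the JSON layout are hypothetical, and a real model registry would have its own metadata mechanism.

```python
import json
import tempfile
from pathlib import Path


def save_threshold(path, threshold, rule):
    """Persist the tuned threshold together with the business rule behind it."""
    Path(path).write_text(json.dumps({"threshold": threshold, "rule": rule}))


def load_threshold(path):
    """Read the tuned threshold back at serving time."""
    return json.loads(Path(path).read_text())["threshold"]


def predict_with_threshold(positive_probabilities, threshold):
    """Turn positive-class probabilities into 0/1 labels using the saved threshold."""
    return [int(p >= threshold) for p in positive_probabilities]


# Stand-in for the model artifact directory (temporary, for the example only).
with tempfile.TemporaryDirectory() as artifact_dir:
    threshold_file = Path(artifact_dir) / "threshold.json"
    save_threshold(threshold_file, threshold=0.3, rule="recall >= 0.80")
    serving_threshold = load_threshold(threshold_file)

# Made-up probabilities, as they might come from model.predict_proba(X)[:, 1].
predictions = predict_with_threshold([0.10, 0.35, 0.50, 0.29], serving_threshold)
print(f"Serving threshold: {serving_threshold}, predictions: {predictions}")
```

The key design choice is that the serving code never hard-codes a number: if the threshold is retuned, only the stored value changes.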
Production Insight
Threshold tuning is a zero-cost performance lever — no retraining needed.
Use PR curve, not ROC, for imbalanced problems — ROC can look great even with many false positives.
Document the chosen threshold and the business rule that drove it (e.g., 'recall >= 0.80').
Key Takeaway
Threshold tuning trades precision for recall — and it's free.
PR curve beats ROC for imbalanced classes — always use it for fraud/medical.
Pick your threshold by business constraints, not by max F1.
● Production incident · POST-MORTEM · severity: high

The 99% Accuracy That Hid 0% Recall — A Fraud Detection Disaster

Symptom
Model reported 99% accuracy on holdout set. Fraud rate in production was consistent with training (1%). Business stakeholders were impressed.
Assumption
High accuracy means the model is performing well. The team only monitored accuracy in dashboards and automated reports.
Root cause
The model predicted 'legitimate' for every transaction. With 99% legitimate transactions, that alone yields 99% accuracy. Recall for the fraud class was 0.0: the model had never learned to detect fraud because the loss function (cross-entropy) was dominated by the majority class.
Fix
Switched primary metric to recall for fraud and added a classification_report to the monitoring dashboard. Retrained with class_weight='balanced' and lowered the decision threshold to 0.3. Fraud recall rose from 0.0 to 0.72 at a precision of 0.18, an acceptable trade-off given the cost of missed fraud.
Key lesson
  • Never trust accuracy alone on imbalanced data — demand per-class recall and precision.
  • Automate classification_report generation in your evaluation pipeline — it catches silent failures.
  • The business cost of false negatives drives metric selection, not the data scientist's comfort with high accuracy.
Production debug guide
Symptom → Action guide for common metric failures (4 entries)
Symptom · 01
Accuracy is high but fraud is still rampant
Fix
Run classification_report(y_true, y_pred). Check recall for minority class. If near zero, your model is predicting majority class only.
Symptom · 02
Precision is perfect (1.0) but business is unhappy
Fix
Precision=1.0 means zero false positives, but check recall — the model is probably ignoring most positives. Plot precision-recall curve to see the trade-off.
Symptom · 03
F1-score suddenly drops after deployment
Fix
Compare precision and recall separately — one likely collapsed. Check data drift (distribution shift) or label definition changes.
Symptom · 04
classification_report shows 0.00 precision for one class and emits UndefinedMetricWarning
Fix
The model never predicted that class. Check threshold tuning, class imbalance, or retrain with balanced sampling. Pass zero_division=0 to make the 0.0 explicit and silence the warning.
★ Quick Debug Cheat Sheet for Classification Metrics
Rapid-fire commands and checks for common metric pitfalls. Paste these into your notebook or production script.
Need instant metric breakdown
Immediate action
Run classification_report() with all classes and zero_division=0
Commands
from sklearn.metrics import classification_report; print(classification_report(y_true, y_pred, zero_division=0))
from sklearn.metrics import confusion_matrix; cm = confusion_matrix(y_true, y_pred); tn, fp, fn, tp = cm.ravel(); print(f'TP={tp} FP={fp} FN={fn} TN={tn}')  # binary labels only
Fix now
Switch to F1 or recall as primary metric if imbalance is detected.
Model predicts only majority class
Immediate action
Check class distribution in training data
Commands
import numpy as np; print(np.bincount(y_train))
model.predict_proba(X_val)[:, 1].mean() # If near 0, model is biased
Fix now
Add class_weight='balanced' or resample minority class.
Need to find optimal threshold without retraining
Immediate action
Compute precision-recall curve and choose threshold based on business cost
Commands
import numpy as np; from sklearn.metrics import precision_recall_curve; prec, rec, thresh = precision_recall_curve(y_val, probs[:, 1])  # probs = model.predict_proba(X_val)
f1_scores = 2 * (prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-9); best_thresh = thresh[np.argmax(f1_scores)]  # swap in your business rule if max F1 is not the goal
Fix now
Set threshold to best_thresh in production scoring function.
Metric | Formula | Optimise When | Blind Spot
Accuracy | (TP+TN) / Total | Classes are balanced and all errors cost the same | Completely misleading on imbalanced datasets: a dumb model scores 99%
Precision | TP / (TP+FP) | False alarms are expensive (spam filter, legal review) | Ignores false negatives entirely: a model that rarely predicts 'positive' can score perfectly
Recall | TP / (TP+FN) | Missing a positive is catastrophic (cancer screening, fraud) | Ignores false positives: a model predicting 'positive' for everything scores 100%
F1-Score | 2×(P×R)/(P+R) | You need one balanced number and the dataset is imbalanced | Treats precision and recall equally; use F-beta if you need to weight one more
F-beta Score | (1+β²)×(P×R)/(β²×P+R) | You need recall weighted β times more than precision | β must be set by business context, not by grid search
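The F-beta row can be sketched in a few lines of plain Python. The helper name `f_beta` is introduced here for illustration (scikit-learn ships its own `fbeta_score` that works on labels rather than on precision and recall values), and the precision and recall figures are made up.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta from precision and recall; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# A made-up model that catches everything but raises many false alarms.
precision, recall = 0.5, 1.0

f1 = f_beta(precision, recall, beta=1.0)      # balanced
f2 = f_beta(precision, recall, beta=2.0)      # recall counts more
f_half = f_beta(precision, recall, beta=0.5)  # precision counts more

print(f"F0.5={f_half:.3f}  F1={f1:.3f}  F2={f2:.3f}")
```

With high recall and low precision, the score rises as β grows, which is the behaviour you want when missed positives are the expensive mistake.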

Key takeaways

1. A confusion matrix breaks your model's performance into four honest buckets (TP, TN, FP, FN). Never accept a single accuracy number without demanding the full matrix first.
2. Precision and recall are in tension: pushing one up almost always pushes the other down. The business context, not the data, decides which direction to lean.
3. On imbalanced datasets (fraud, medical, anomaly detection), accuracy is nearly always the wrong primary metric. Use F1, recall, or a Precision-Recall curve instead.
4. Lowering the classification threshold below 0.5 is the cheapest, fastest way to trade precision for recall on an already-trained model. Know this trick before reaching for resampling or retraining.
5. For multi-class, always inspect per-class recall. Macro/weighted averages can mask a class your model is completely ignoring.

Common mistakes to avoid


Reporting accuracy on an imbalanced dataset

Symptom
A model hitting 98% accuracy sounds great until you realise 98% of your data is the majority class and the model just learned to predict it always.
Fix
Always check per-class recall in the classification_report. If recall for the minority class is near zero, accuracy is lying to you. Switch to F1 or recall as your primary metric.

Forgetting that precision and recall are class-specific

Symptom
precision_score defaults to pos_label=1 and average='binary'. If the class you care about is not encoded as 1, you get precision for the wrong class or an error; on multi-class targets the binary default raises an error.
Fix
Always pass pos_label explicitly for binary tasks, and for multi-class choose average='macro', average='weighted' or average=None deliberately.

Treating the 0.5 threshold as sacred

Symptom
The default prediction threshold of 0.5 is an arbitrary starting point, not a law. On imbalanced datasets it almost always under-detects the minority class.
Fix
Use predict_proba to get raw probabilities, then sweep the threshold and plot the Precision-Recall curve. Choose the threshold that meets your business requirement (e.g. 'recall must be at least 0.80') rather than the one that maximises F1.

Ignoring zero_division parameter in production code

Symptom
By default sklearn emits UndefinedMetricWarning and returns 0.0 when a class has zero predictions. Pipelines that escalate warnings to errors break, and the silent 0.0 can be mistaken for a measured score.
Fix
Always pass zero_division=0 (or 1) when calling precision_score, recall_score, f1_score in automated scripts.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Your fraud detection model has 99.5% accuracy — the product team is thri...
Q02 · SENIOR
In a binary classifier for cancer detection, would you rather maximise p...
Q03 · SENIOR
If I asked you to increase recall on your model without retraining it at...
Q04 · SENIOR
Explain the difference between macro, weighted, and micro average in mul...
Q01 of 04 · SENIOR

Your fraud detection model has 99.5% accuracy — the product team is thrilled. Should you be? Walk me through what you'd actually check before celebrating.

ANSWER
No, I'd be suspicious immediately because fraud is typically rare (0.5-2% of transactions). A model that predicts 'legitimate' for every transaction would achieve 98%+ accuracy on a 2% fraud rate. I'd run classification_report(y_true, y_pred) and check recall for the fraud class. If recall is near zero, accuracy is a complete illusion. I'd also check the confusion matrix to see if the model ever predicts fraud. Then I'd look at the precision-recall curve to understand the trade-offs. Only once fraud recall meets the target agreed with the business would I celebrate.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between precision and recall in machine learning?
02
When should I use F1-score instead of accuracy?
03
Why does sklearn's confusion_matrix use [[TN, FP], [FN, TP]] order instead of [[TP, FP], [FN, TN]]?
04
What is the difference between AUC-ROC and AUC-PR?
05
How do I handle multi-label classification metrics?