Intermediate 6 min · March 06, 2026

Confusion Matrix and Classification Metrics

Confusion Matrix — Why 99% Accuracy Missed Every Fraud Case

Q: What is the difference between precision and recall in machine learning?

Precision asks 'of everything the model labelled positive, how many actually were?' A low value means many false alarms. Recall asks 'of all the real positives, how many did the model catch?' A low value means many missed detections. They pull in opposite directions: increasing one usually decreases the other. Your business context determines which to prioritise.

Q: When should I use F1-score instead of accuracy?

Use F1-score whenever your dataset is imbalanced — meaning one class appears significantly more often than the other. Accuracy on imbalanced data rewards a model for doing nothing (just predicting the majority class). F1-score is the harmonic mean of precision and recall and exposes that failure immediately by returning a near-zero score.

Q: Why does sklearn's confusion_matrix use [[TN, FP], [FN, TP]] order instead of [[TP, FP], [FN, TN]]?

sklearn follows the convention where rows represent actual labels and columns represent predicted labels, ordered from class 0 to class 1. So row 0 = actual negatives, row 1 = actual positives, and within each row column 0 = predicted negative, column 1 = predicted positive. This gives [[TN, FP], [FN, TP]]. Always use cm.ravel() to unpack as tn, fp, fn, tp to avoid reading the grid wrong.

Q: What is the difference between AUC-ROC and AUC-PR?

AUC-ROC plots True Positive Rate vs False Positive Rate and is overly optimistic for imbalanced datasets because the false positive rate stays low naturally when the negative class is large. AUC-PR (Area Under Precision-Recall curve) focuses on the positive class and is much more informative for imbalanced problems like fraud detection or medical diagnosis. For rare positives, always prefer AUC-PR over AUC-ROC.

Q: How do I handle multi-label classification metrics?

For multi-label problems (each sample can have multiple labels), treat each label as a binary classification and compute metrics per label. Then average using average='samples' (for instance-level), 'micro', 'macro', or 'weighted' as appropriate. sklearn.metrics provides precision_score, recall_score, f1_score with average='samples' parameter directly.

A model hit 99% accuracy predicting 'legitimate' for all transactions — 0% fraud recall.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

Last updated: July 18, 2026

Before you start · ⏱ 25 min

✓ Solid grasp of fundamentals
✓ Comfortable reading code examples
✓ Basic production concepts

⚡Quick Answer

Confusion matrix is a 2×2 grid counting TP, TN, FP, FN independently
Accuracy hides failure on imbalanced data – always check per-class recall
Precision = trustworthiness of positive predictions; Recall = completeness of catching positives
F1-score is harmonic mean, punishing skewed precision-recall pairs
Class imbalance is the #1 reason accuracy lies – use classification_report()

✦ Definition~90s read

What is Confusion Matrix and Classification Metrics?

A confusion matrix is a table that lays out the full prediction breakdown for a classification model, showing not just how many predictions were right but exactly where they went wrong. It’s a 2x2 grid for binary classification with four cells: true positives (correctly predicted fraud), true negatives (correctly predicted non-fraud), false positives (flagged fraud that wasn’t), and false negatives (missed fraud).

Without it, accuracy alone can deceive you — a model that predicts "not fraud" for every transaction hits 99% accuracy in a dataset with 1% fraud, but catches zero actual fraud cases. The confusion matrix forces you to see that blind spot.

This tool exists because accuracy is a terrible metric for imbalanced problems — fraud detection, rare disease diagnosis, or spam filtering where the minority class is what you actually care about. The matrix gives you the raw counts to calculate precision (how many flagged frauds were real), recall (how many real frauds you caught), and F1-score (their harmonic mean).

In production systems handling millions of transactions daily, like Stripe’s fraud detection or credit card issuer models, teams tune thresholds on the confusion matrix to balance false positives (annoying legitimate customers) against false negatives (letting fraud through).

Beyond binary, confusion matrices extend to multi-class problems (e.g., classifying animal species) as NxN grids where rows are actual classes and columns are predictions. Diagonal cells are correct, off-diagonals are specific misclassifications — like confusing a wolf for a husky.

For multi-label problems (an image can have both "dog" and "cat"), you need per-class matrices or micro/macro averaging. The matrix is the scoreboard before any metric; without it, you’re guessing which metric matters.

Plain-English First

Imagine you're a doctor screening patients for a rare disease. Your test results fall into four buckets: people you correctly flagged as sick, people you correctly cleared as healthy, healthy people you wrongly alarmed (false alarm), and sick people you wrongly cleared (missed cases). A confusion matrix is just a scoreboard that counts all four buckets. The classification metrics — precision, recall, F1 — are different ways of asking 'how good is this scoreboard, really?' depending on which type of mistake costs you the most.

Every ML model that classifies things — spam or not spam, fraud or legit, cancer or benign — eventually faces a moment of truth: how do we measure whether it's actually any good? Accuracy sounds like the obvious answer, but it's a trap. A model that predicts 'not fraud' for every single transaction can hit 99% accuracy on a dataset where fraud is 1% of records — and be completely useless. The real world demands smarter scorekeeping.

The confusion matrix exists to break that single 'accuracy' number into its honest parts. It shows you not just how many predictions were right, but what kind of wrong your model is being. Are you raising too many false alarms? Are you missing real threats? Those are completely different failure modes with completely different business consequences, and accuracy hides both of them.

By the end of this article you'll be able to read a confusion matrix cold, calculate precision, recall, F1-score and accuracy by hand, write production-ready evaluation code in Python using scikit-learn, and — most importantly — know which metric to optimise for given a real business problem. That last skill is what separates engineers who build useful models from engineers who build impressive-looking ones.

Why 99% Accuracy Missed Every Fraud Case

A confusion matrix is a 2x2 table that compares predicted classifications against actual outcomes: True Positives, True Negatives, False Positives, and False Negatives. It's the foundation for all classification metrics because it separates correct from incorrect predictions by type, not just count. Accuracy alone hides which errors you're making: in fraud detection, 99% accuracy can mean the model correctly cleared every legitimate transaction (99% of the data) while missing every single fraud case.

From the four counts, you derive precision (TP / (TP+FP)), recall (TP / (TP+FN)), F1-score, and specificity. Each metric answers a different question: precision asks 'how many flagged cases were real?', recall asks 'how many real cases did we catch?'. The trade-off is explicit — increasing recall often increases false positives, and vice versa. In practice, you choose the metric that matches the cost of each error type.
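The formulas above can be sketched as a small helper that works directly from the four cells. The function name `metrics_from_counts` and the counts passed to it are hypothetical, chosen only to illustrate the arithmetic.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive the headline metrics from the four confusion-matrix cells."""

    def safe_div(numerator: float, denominator: float) -> float:
        # Fall back to 0.0 when a denominator is zero (e.g. no positive predictions)
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)       # how many flagged cases were real
    recall = safe_div(tp, tp + fn)          # how many real cases were caught
    specificity = safe_div(tn, tn + fp)     # how many negatives were correctly cleared
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for 1,000 transactions
m = metrics_from_counts(tp=80, fp=20, fn=40, tn=860)
for name, value in m.items():
    print(f"{name:12s}: {value:.3f}")
# precision 0.800, recall 0.667, specificity 0.977, f1 0.727, accuracy 0.940
```

Accuracy here is 94% even though a third of the real positives are missed, which is the gap the other metrics are meant to expose.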

Use a confusion matrix whenever your classes are imbalanced (fraud: 0.1%, disease: 2%, churn: 5%). It forces you to look beyond accuracy and measure what matters: false negatives in medical diagnosis, false positives in spam filters. Production systems must track all four cells over time — a drift in false positive rate can silently degrade user trust before accuracy drops.
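One way to track the four cells over time is to recompute them per time slice. The sketch below assumes predictions are logged as (week, actual, predicted) records; the record layout and the week labels are illustrative assumptions, not a prescribed schema.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical logged records: (week, actual, predicted); 1 = fraud, 0 = legitimate
records = [
    ("2026-W01", 0, 0), ("2026-W01", 0, 0), ("2026-W01", 0, 0), ("2026-W01", 0, 1),
    ("2026-W01", 1, 1), ("2026-W01", 1, 0),
    ("2026-W02", 0, 0), ("2026-W02", 0, 1), ("2026-W02", 0, 1), ("2026-W02", 0, 0),
    ("2026-W02", 1, 1), ("2026-W02", 1, 1),
]

weekly_cells = {}
for week in sorted({w for w, _, _ in records}):
    actual = [a for w, a, _ in records if w == week]
    predicted = [p for w, _, p in records if w == week]
    # labels=[0, 1] keeps the matrix 2x2 even if a week contains only one class
    tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
    weekly_cells[week] = {
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / len(actual),
    }
    print(week, weekly_cells[week])
```

In this toy log, accuracy is identical in both weeks (4 of 6 correct) while the false positive rate doubles from 0.25 to 0.50, the kind of shift a single accuracy number would not show.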

⚠ Accuracy Paradox

With 99.9% legitimate transactions, a model that always predicts 'legitimate' achieves 99.9% accuracy but catches zero fraud — the confusion matrix reveals this instantly.

📊 Production Insight

A fraud detection pipeline flags 0.1% of transactions, but the manual review team can only handle 0.05%. When precision drops below 50%, the review queue fills with false positives and real fraud goes unreviewed.

Symptom: review team burnout and rising chargeback rates while accuracy stays above 99%.

Rule: never deploy a classifier without a minimum precision floor at the expected fraud rate, and measure the confusion matrix on production data weekly.

🎯 Key Takeaway

Accuracy is meaningless on imbalanced data — always inspect the full confusion matrix.

Choose your metric by the cost of false positives vs. false negatives, not by convention.

Track all four cells over time — a shift in false positive rate is often the first sign of model decay.

thecodeforge.io

Confusion Matrix Classification Metrics

The Confusion Matrix — Reading the Scoreboard Before Calculating Anything

A confusion matrix is a 2×2 grid (for binary classification) that maps every prediction your model makes against what was actually true. The four cells are True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

True Positive (TP): Model said 'yes', reality was 'yes'. The model caught a real fraud case.
True Negative (TN): Model said 'no', reality was 'no'. The model correctly cleared a legit transaction.
False Positive (FP): Model said 'yes', reality was 'no'. A false alarm: an innocent transaction flagged as fraud. Also called a Type I error.
False Negative (FN): Model said 'no', reality was 'yes'. A missed catch: real fraud that slipped through. Also called a Type II error.

Here's the crucial insight most tutorials skip: FP and FN are not equally bad. In fraud detection, an FN (missed fraud) costs the bank real money. In cancer screening, an FN (missed cancer) can cost a life. In a spam filter, an FP (a real email landing in spam) might cost you an important message. The business context dictates which error type you can tolerate least — and that determines which metric you optimise for.
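To make the asymmetry concrete, the error cells can be weighted by what each mistake costs. This is a minimal sketch: the two cost constants and both sets of predictions are invented for illustration, and real costs would come from the business.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical per-error costs (assumptions, not real figures)
COST_FALSE_POSITIVE = 5      # e.g. a manual review of a legitimate transaction
COST_FALSE_NEGATIVE = 500    # e.g. a chargeback on missed fraud


def total_error_cost(actual, predicted) -> int:
    """Sum the cost of every false positive and false negative."""
    tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
    return int(fp) * COST_FALSE_POSITIVE + int(fn) * COST_FALSE_NEGATIVE


actual = [0] * 90 + [1] * 10
# Cautious model: no false alarms, but misses 6 of 10 fraud cases
cautious = [0] * 90 + [1] * 4 + [0] * 6
# Aggressive model: 15 false alarms, misses only 1 fraud case
aggressive = [0] * 75 + [1] * 15 + [1] * 9 + [0] * 1

for name, preds in [("cautious", cautious), ("aggressive", aggressive)]:
    print(f"{name:10s} accuracy={accuracy_score(actual, preds):.2f} "
          f"cost={total_error_cost(actual, preds)}")
# cautious   accuracy=0.94 cost=3000
# aggressive accuracy=0.84 cost=575
```

Under these assumed costs, the model with the lower accuracy is the cheaper one to run, because its mistakes are the less expensive kind.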

confusion_matrix_basics.py (Python)

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Simulated ground truth labels for 20 fraud detection predictions
# 1 = fraud, 0 = legitimate
actual_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
                 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]

# What our model predicted for each of those 20 transactions
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

# Build the confusion matrix
# sklearn orders as: [[TN, FP], [FN, TP]] by default
cm = confusion_matrix(actual_labels, predicted_labels)

# Pull out each cell so we can label them clearly
tn, fp, fn, tp = cm.ravel()

print("=== Confusion Matrix (raw counts) ===")
print(f"True Negatives  (TN): {tn}  — Legit transactions correctly cleared")
print(f"False Positives (FP): {fp}  — Legit transactions wrongly flagged (false alarm)")
print(f"False Negatives (FN): {fn}  — Fraud transactions we missed (dangerous!)")
print(f"True Positives  (TP): {tp}  — Fraud transactions correctly caught")
print()
print("Raw confusion matrix:")
print(cm)

# Visualise it as a heatmap — much easier to read at a glance
fig, ax = plt.subplots(figsize=(6, 5))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["Legitimate", "Fraud"]
)
disp.plot(ax=ax, cmap="Blues", colorbar=False)
ax.set_title("Fraud Detection — Confusion Matrix", fontsize=14, pad=12)
plt.tight_layout()
plt.savefig("confusion_matrix_fraud.png", dpi=150)
print("Heatmap saved to confusion_matrix_fraud.png")

Output

=== Confusion Matrix (raw counts) ===
True Negatives  (TN): 8  — Legit transactions correctly cleared
False Positives (FP): 2  — Legit transactions wrongly flagged (false alarm)
False Negatives (FN): 3  — Fraud transactions we missed (dangerous!)
True Positives  (TP): 7  — Fraud transactions correctly caught

Raw confusion matrix:
[[8 2]
 [3 7]]
Heatmap saved to confusion_matrix_fraud.png

🔥Remember This:

sklearn's confusion_matrix returns [[TN, FP], [FN, TP]] — rows are actual labels, columns are predicted. Use .ravel() to unpack all four values in one line: tn, fp, fn, tp = cm.ravel(). Memorise this order or you'll misread every matrix you ever build.
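A quick check of the cell order on toy data, plus the `labels` argument of `confusion_matrix`, which can put the positive class first if that layout reads more naturally to you. The label lists below are made up for the demonstration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 4 actual negatives followed by 6 actual positives
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

# Default order follows sorted labels (0, then 1): [[TN, FP], [FN, TP]]
default_cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = default_cm.ravel()
print(default_cm)                      # [[3 1], [2 4]]
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Passing labels=[1, 0] reorders rows and columns: [[TP, FN], [FP, TN]]
positive_first = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(positive_first)                  # [[4 2], [1 3]]
```

Note that `.ravel()` only unpacks to tn, fp, fn, tp for the default binary layout; after reordering with `labels`, the same call yields tp, fn, fp, tn.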

📊 Production Insight

The biggest confusion matrix mistake is trusting the raw counts without normalizing.

Always normalise by row (actuals) to see recall per class — absolute numbers hide imbalance.

If TN >> TP, your matrix is probably correct, but your model is useless.
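Row normalisation is available directly through the `normalize="true"` argument of `confusion_matrix`. The sketch below uses invented labels with a 90/10 class split to show how raw counts and row-normalised values tell different stories.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical data: 90 legitimate, 10 fraud
actual = [0] * 90 + [1] * 10
# 88 legit cleared, 2 legit flagged, 4 fraud caught, 6 fraud missed
predicted = [0] * 88 + [1] * 2 + [1] * 4 + [0] * 6

raw = confusion_matrix(actual, predicted)
by_row = confusion_matrix(actual, predicted, normalize="true")

print("Raw counts:")
print(raw)                      # [[88  2], [ 6  4]]
print("Normalised by actual class (each row sums to 1):")
print(np.round(by_row, 3))      # [[0.978 0.022], [0.6   0.4  ]]
```

The diagonal of the normalised matrix is the per-class recall: 0.978 for legitimate transactions, but only 0.4 for fraud, which the raw counts (92 of 100 correct) make easy to overlook.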

🎯 Key Takeaway

Confusion matrix reveals four distinct error types — accuracy collapses them into one.

Always demand the full matrix before any metric conversation.

Your business decides which error hurts more, not your loss function.

Precision, Recall and F1-Score — What They Actually Measure and When to Use Each

Now that you can read the scoreboard, let's build the three metrics that actually matter.

Accuracy = (TP + TN) / Total. The percentage of all predictions that were correct. Useful only when classes are balanced. Completely misleading on imbalanced datasets.

Precision = TP / (TP + FP). Of everything the model called positive, how many actually were? This is your 'don't cry wolf' metric. High precision means when the model raises an alarm, you can trust it. Optimise for precision when false alarms are costly — think spam filters (you don't want real emails binned) or legal document review (you don't want lawyers chasing dead ends).

Recall (Sensitivity) = TP / (TP + FN). Of all the actual positives, how many did the model catch? This is your 'don't miss anything' metric. High recall means few real threats slip through. Optimise for recall when missing a positive is catastrophic — cancer screening, fraud detection, safety-critical systems.

F1-Score = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean of precision and recall. Use it when you need a single balanced metric and you can't afford to let either precision or recall collapse. It's the default choice for imbalanced classification competitions.

The harmonic mean is used (not arithmetic mean) because it punishes extreme imbalance. A model with precision=1.0 and recall=0.0 has an F1 of 0, not 0.5.
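The difference between the two means is easy to see side by side. This small sketch uses plain Python and a few hand-picked precision/recall pairs; the helper names are illustrative.

```python
def arithmetic_mean(precision: float, recall: float) -> float:
    return (precision + recall) / 2


def f1(precision: float, recall: float) -> float:
    # Harmonic mean; defined as 0.0 when both inputs are zero
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


pairs = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.6), (0.5, 0.5)]
print("precision  recall  arithmetic  F1")
for p, r in pairs:
    print(f"{p:9.2f}  {r:6.2f}  {arithmetic_mean(p, r):10.3f}  {f1(p, r):.3f}")
# 1.00 / 0.00 -> arithmetic 0.500, F1 0.000
# 0.90 / 0.10 -> arithmetic 0.500, F1 0.180
# 0.80 / 0.60 -> arithmetic 0.700, F1 0.686
# 0.50 / 0.50 -> arithmetic 0.500, F1 0.500
```

The two means agree only when precision and recall are equal; the further apart they are, the harder the harmonic mean pulls towards the smaller value.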

classification_metrics_from_scratch.py (Python)

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report
)

# Same fraud detection scenario as before
actual_labels    = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
                    0, 1, 0, 1, 0, 0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

# --- Calculate each metric individually first so we can see what's happening ---
# pos_label=1 means we treat 'Fraud' as the positive class we care about
accuracy  = accuracy_score(actual_labels, predicted_labels)
precision = precision_score(actual_labels, predicted_labels, pos_label=1)
recall    = recall_score(actual_labels, predicted_labels, pos_label=1)
f1        = f1_score(actual_labels, predicted_labels, pos_label=1)

print("=== Individual Metrics (Fraud = Positive Class) ===")
print(f"Accuracy : {accuracy:.2%}  — {round(accuracy * 20)}/20 predictions correct overall")
print(f"Precision: {precision:.2%}  — Of fraud alerts, this fraction were real fraud")
print(f"Recall   : {recall:.2%}  — Of all actual fraud, this fraction was caught")
print(f"F1-Score : {f1:.2%}  — Balanced measure (harmonic mean of precision & recall)")
print()

# --- sklearn's classification_report is your best friend in production ---
# It gives you precision, recall, F1 for EACH class plus macro/weighted averages
report = classification_report(
    actual_labels,
    predicted_labels,
    target_names=["Legitimate", "Fraud"],
    digits=3
)
print("=== Full Classification Report ===")
print(report)

# --- Illustrate why accuracy is misleading on imbalanced data ---
print("=== The Accuracy Trap — Imbalanced Dataset Demonstration ===")

# Imagine 1000 transactions: 990 legit, 10 fraudulent (realistic ratio)
imbalanced_actual    = [0] * 990 + [1] * 10
# A dumb model that ALWAYS predicts 'legitimate'
dumb_model_predicted = [0] * 1000

dumb_accuracy  = accuracy_score(imbalanced_actual, dumb_model_predicted)
dumb_recall    = recall_score(imbalanced_actual, dumb_model_predicted,
                              pos_label=1, zero_division=0)
dumb_f1        = f1_score(imbalanced_actual, dumb_model_predicted,
                          pos_label=1, zero_division=0)

print(f"Dumb model accuracy : {dumb_accuracy:.2%}  ← looks great!")
print(f"Dumb model recall   : {dumb_recall:.2%}  ← caught ZERO fraud cases")
print(f"Dumb model F1-score : {dumb_f1:.2%}  ← tells the real story")

Output

=== Individual Metrics (Fraud = Positive Class) ===
Accuracy : 75.00%  — 15/20 predictions correct overall
Precision: 77.78%  — Of fraud alerts, this fraction were real fraud
Recall   : 70.00%  — Of all actual fraud, this fraction was caught
F1-Score : 73.68%  — Balanced measure (harmonic mean of precision & recall)

=== Full Classification Report ===
              precision    recall  f1-score   support

  Legitimate      0.727     0.800     0.762        10
       Fraud      0.778     0.700     0.737        10

    accuracy                          0.750        20
   macro avg      0.753     0.750     0.749        20
weighted avg      0.753     0.750     0.749        20

=== The Accuracy Trap — Imbalanced Dataset Demonstration ===

Dumb model accuracy : 99.00% ← looks great!

Dumb model recall : 0.00% ← caught ZERO fraud cases

Dumb model F1-score : 0.00% ← tells the real story

⚠ Watch Out:

When the model never predicts the positive class, precision is 0/0 and therefore undefined: sklearn emits an UndefinedMetricWarning and returns 0. Recall is still well defined in that case (it is simply 0). Pass zero_division=0 explicitly to silence the warning and make the fallback value a deliberate choice; otherwise a pipeline that treats warnings as errors can fail in production.
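The behaviour can be checked by recording warnings around the call. This is a small sketch on invented labels; it relies on `UndefinedMetricWarning` from `sklearn.exceptions` and the `zero_division` parameter of `precision_score`.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score, recall_score

actual = [0, 0, 0, 1, 1]
never_positive = [0, 0, 0, 0, 0]   # model never predicts the positive class

# Default behaviour: precision is 0/0, so sklearn warns and returns 0.0
with warnings.catch_warnings(record=True) as caught_default:
    warnings.simplefilter("always")
    default_precision = precision_score(actual, never_positive)
warned_by_default = any(
    issubclass(w.category, UndefinedMetricWarning) for w in caught_default
)

# Explicit zero_division=0: same value, no UndefinedMetricWarning
with warnings.catch_warnings(record=True) as caught_explicit:
    warnings.simplefilter("always")
    explicit_precision = precision_score(actual, never_positive, zero_division=0)
    explicit_recall = recall_score(actual, never_positive, zero_division=0)
warned_when_explicit = any(
    issubclass(w.category, UndefinedMetricWarning) for w in caught_explicit
)

print(f"default precision={default_precision}, warned={warned_by_default}")
print(f"explicit precision={explicit_precision}, recall={explicit_recall}, "
      f"warned={warned_when_explicit}")
```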

📊 Production Insight

F1-score punishes imbalance — precision=1.0, recall=0.0 gives F1=0, not 0.5.

Use classification_report() in CI/CD to catch silent failures before deployment.

If weighted avg F1 drops, one class is getting ignored — dig into per-class recall.

🎯 Key Takeaway

Accuracy lies on imbalanced data — always pair it with per-class recall.

F1 is the safety net: one number that exposes skewed performance.

Zero division? Pass zero_division=0 to avoid pipeline crashes.

Choosing the Right Metric for Your Problem — A Decision Framework

Knowing what the metrics measure is only half the battle. The harder skill is knowing which one to care about in a given situation — and being able to defend that choice to a product manager or a senior engineer.

Here's the mental model: ask yourself 'which mistake is more expensive?'

If a False Positive is expensive → optimise for Precision. Example: a content moderation system wrongly banning a legitimate post causes user backlash and potential legal liability. You'd rather miss a few bad posts than wrongly censor good ones.

If a False Negative is expensive → optimise for Recall. Example: a medical screening test that misses a tumour sends a sick patient home untreated. The cost of a false alarm (extra tests, anxiety) is much lower than missing the disease.

If both mistakes matter roughly equally → use F1-Score. Example: a job application screening tool — both wrongly rejecting a strong candidate (FN) and wasting time on a weak one (FP) matter.

For multi-class problems, the classification_report gives you per-class metrics plus two averages: macro avg (the unweighted mean, so every class counts equally and a weak minority class drags the number down) and weighted avg (weights each class by its support, so it tracks overall performance but is dominated by the majority class). Never just report the weighted average without also checking per-class recall or you'll miss a class your model is quietly ignoring.

metric_selection_real_world.py (Python)

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# --- Scenario: Medical test — detecting a rare condition (5% prevalence) ---
np.random.seed(42)

# Generate a deliberately imbalanced binary classification dataset
# weights=[0.95, 0.05] means 95% healthy, 5% have the condition
features, labels = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=6,
    weights=[0.95, 0.05],   # heavy class imbalance
    flip_y=0.02,            # a little noise to make it realistic
    random_state=42
)

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels,
    test_size=0.25,
    stratify=labels,        # preserve the 95/5 ratio in both splits
    random_state=42
)

# Train a simple logistic regression model
medical_model = LogisticRegression(max_iter=1000, random_state=42)
medical_model.fit(train_features, train_labels)

# Default predictions (threshold = 0.5)
default_predictions = medical_model.predict(test_features)

# Lower threshold to 0.3 — the model now flags 'positive' at lower confidence
# This trades precision for recall, which makes sense for medical screening
positive_probabilities = medical_model.predict_proba(test_features)[:, 1]
lowered_threshold_predictions = (positive_probabilities >= 0.3).astype(int)

print("=== Medical Screening Model — Default Threshold (0.5) ===")
print(classification_report(
    test_labels,
    default_predictions,
    target_names=["Healthy", "Has Condition"],
    digits=3
))

print("=== Medical Screening Model — Lowered Threshold (0.3) ===")
print(classification_report(
    test_labels,
    lowered_threshold_predictions,
    target_names=["Healthy", "Has Condition"],
    digits=3
))

# Show the trade-off explicitly — this is what you'd put in a model card
for threshold, preds in [(0.5, default_predictions),
                          (0.3, lowered_threshold_predictions)]:
    prec  = precision_score(test_labels, preds, pos_label=1, zero_division=0)
    rec   = recall_score(test_labels, preds, pos_label=1, zero_division=0)
    f1    = f1_score(test_labels, preds, pos_label=1, zero_division=0)
    fn_count = confusion_matrix(test_labels, preds).ravel()[2]  # FN cell
    print(f"Threshold={threshold} | Precision={prec:.3f} | Recall={rec:.3f} "
          f"| F1={f1:.3f} | Missed cases (FN)={fn_count}")

Output

=== Medical Screening Model — Default Threshold (0.5) ===

precision recall f1-score support

Healthy 0.977 0.996 0.986 475

Has Condition 0.714 0.360 0.479 25

accuracy 0.972 500

macro avg 0.846 0.678 0.733 500

weighted avg 0.969 0.972 0.968 500

=== Medical Screening Model — Lowered Threshold (0.3) ===

precision recall f1-score support

Healthy 0.981 0.987 0.984 475

Has Condition 0.667 0.560 0.609 25

accuracy 0.970 500

macro avg 0.824 0.774 0.796 500

weighted avg 0.969 0.970 0.969 500

Threshold=0.5 | Precision=0.714 | Recall=0.360 | F1=0.479 | Missed cases (FN)=16

Threshold=0.3 | Precision=0.667 | Recall=0.560 | F1=0.609 | Missed cases (FN)=11

💡Pro Tip:

Lowering the classification threshold below 0.5 is one of the first levers to pull on imbalanced medical or fraud datasets — before trying class_weight='balanced' or resampling. It costs you precision but buys recall. Plot a Precision-Recall curve across all thresholds (sklearn.metrics.precision_recall_curve) to find the sweet spot for your specific business tolerance.

📊 Production Insight

Choosing the wrong metric wastes engineering cycles — you optimise for what you measure.

If your product manager only cares about false positives, recall becomes a distraction.

Document the cost asymmetry explicitly in model cards to avoid future metric debates.

🎯 Key Takeaway

Business context determines the 'right' metric — not the data, not the algorithm.

False Positive cost → Precision. False Negative cost → Recall. Both matter → F1.

For multi-class, always check per-class recall — macro/weighted avgs can hide failures.

Beyond Binary: Multi-Class and Multi-Label Metrics

When you have more than two classes, the confusion matrix grows to N×N. Metrics extend via averaging strategies: micro, macro, weighted, and per-class. Each answers a different question.

Micro-average = global sum of TP, FP, FN across all classes. For single-label multi-class problems, micro precision, recall and F1 all equal accuracy. Good when every sample counts equally and you care about overall correctness.

Macro-average = unweighted mean of per-class precision/recall/F1. Treats every class equally regardless of support. If a rare class has low recall, macro will expose it — but it can be dominated by noise in very small classes.

Weighted-average = per-class metrics averaged with weights equal to the number of true instances per class. sklearn's classification_report prints it as the 'weighted avg' row. It reflects overall performance but can mask a struggling minority class.

Per-class metrics = always the most informative. The classification_report prints them for every class. Never ship a model without eyeballing each row.
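The averaging strategies can be compared on a small imbalanced example. The labels below are invented so that the rarest class is never predicted; `average=None` returns the per-class values.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy 3-class data: class 0 has 6 samples, class 1 has 3, class 2 has 1
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]   # class 2 is never predicted

accuracy = accuracy_score(y_true, y_pred)
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"Accuracy   : {accuracy:.3f}")    # 0.700
print(f"Per-class  : {per_class}")       # approx [0.769, 0.667, 0.0]
print(f"Micro F1   : {micro:.3f}")       # 0.700, same as accuracy
print(f"Macro F1   : {macro:.3f}")       # 0.479
print(f"Weighted F1: {weighted:.3f}")    # 0.662
```

The class with F1 of 0 pulls the macro average down to 0.479, while the weighted average stays at 0.662 because that class contributes only one sample.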

For multi-label problems (each sample can belong to multiple classes), metrics are computed per label and then averaged. Use sklearn.metrics with average='samples' for instance-level evaluation.
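A minimal multi-label sketch, assuming labels are encoded as a binary indicator matrix (one column per label). The three-label data is invented; `multilabel_confusion_matrix` returns one 2×2 matrix per label in the usual [[TN, FP], [FN, TP]] layout.

```python
import numpy as np
from sklearn.metrics import f1_score, multilabel_confusion_matrix

# Rows are samples, columns are labels (hypothetical: dog, cat, bird)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

# One binary confusion matrix per label
per_label_cm = multilabel_confusion_matrix(y_true, y_pred)
for label, cm in zip(["dog", "cat", "bird"], per_label_cm):
    tn, fp, fn, tp = cm.ravel()
    print(f"{label:4s}: TN={tn} FP={fp} FN={fn} TP={tp}")

samples_f1 = f1_score(y_true, y_pred, average="samples", zero_division=0)
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"samples F1: {samples_f1:.3f}")   # 0.722 (F1 per sample, then averaged)
print(f"micro F1  : {micro_f1:.3f}")     # 0.667 (pooled TP/FP/FN over all labels)
print(f"macro F1  : {macro_f1:.3f}")     # 0.556 (F1 per label, then averaged)
```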

multiclass_metrics.py (Python)

from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Simulated 3-class classification: news categories
# 0 = Sports, 1 = Politics, 2 = Technology
y_true = [0, 1, 2, 0, 1, 2, 0, 0, 1, 2,
          0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 1, 1, 0, 0, 2, 2,
          0, 1, 2, 1, 0, 2, 2, 0, 2, 1]

target_names = ['Sports', 'Politics', 'Technology']

print("=== Confusion Matrix (3 classes) ===")
cm = confusion_matrix(y_true, y_pred)
print(cm)
print()

print("=== Classification Report ===")
# Reports per-class precision, recall, f1, support
# Also macro avg and weighted avg
report = classification_report(y_true, y_pred, target_names=target_names, digits=3)
print(report)

# Demonstrate averaging differences manually
def precision_recall_f1_per_class(y_true, y_pred, labels=None):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
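    # np.divide with `where` skips zero denominators, so no divide-by-zero warning is raised
    prec = np.divide(tp, tp + fp, out=np.zeros(len(tp)), where=(tp + fp) > 0)
    rec = np.divide(tp, tp + fn, out=np.zeros(len(tp)), where=(tp + fn) > 0)
    f1 = np.divide(2 * prec * rec, prec + rec, out=np.zeros(len(tp)), where=(prec + rec) > 0)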
    return prec, rec, f1

prec, rec, f1 = precision_recall_f1_per_class(y_true, y_pred)
print("\n=== Manual Per-Class Metrics ===")
for i, name in enumerate(target_names):
    print(f"{name:15s}: Precision={prec[i]:.3f}, Recall={rec[i]:.3f}, F1={f1[i]:.3f}")

macro_f1 = np.mean(f1)
# Weighted average: each class's F1 weighted by its support (number of true samples)
weighted_f1 = np.average(f1, weights=np.bincount(y_true))
print(f"\nMacro F1: {macro_f1:.3f} (all classes equal weight)")
print(f"Weighted F1: {weighted_f1:.3f} (weighted by actual class distribution)")

Output

=== Confusion Matrix (3 classes) ===
[[7 0 0]
 [0 5 2]
 [0 1 5]]

=== Classification Report ===
              precision    recall  f1-score   support

      Sports      1.000     1.000     1.000         7
    Politics      0.833     0.714     0.769         7
  Technology      0.714     0.833     0.769         6

    accuracy                          0.850        20
   macro avg      0.849     0.849     0.846        20
weighted avg      0.856     0.850     0.850        20


=== Manual Per-Class Metrics ===
Sports         : Precision=1.000, Recall=1.000, F1=1.000
Politics       : Precision=0.833, Recall=0.714, F1=0.769
Technology     : Precision=0.714, Recall=0.833, F1=0.769

Macro F1: 0.846 (all classes equal weight)
Weighted F1: 0.850 (weighted by actual class distribution)

⚠ Watch Out in Multi-Class:

The 'macro avg' line penalises you equally for a bad class with 1 sample and a bad class with 1000 samples. If you have extreme imbalance, weighted avg is safer for overall performance, but per-class recall is the only way to know if any class is being ignored. Never rely solely on macro or weighted — inspect per-class always.

📊 Production Insight

Multi-class weighted avg can be high while one rare class has recall=0.

Always set a minimum recall threshold per class in your model validation gate.

If you deploy a multi-class model, log confusion matrices per slice (e.g., by date, by region) to catch distribution shifts.
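A per-class recall floor can be enforced with a small check in the validation step. This is a sketch under assumptions: the function name `classes_below_recall_floor`, the 0.60 floor, and the labels are all hypothetical.

```python
from sklearn.metrics import recall_score


def classes_below_recall_floor(y_true, y_pred, class_names, min_recall):
    """Return {class_name: recall} for every class whose recall is under the floor."""
    per_class_recall = recall_score(
        y_true,
        y_pred,
        labels=list(range(len(class_names))),  # assumes classes are encoded 0..N-1
        average=None,                          # one recall value per class
        zero_division=0,
    )
    return {
        name: float(recall)
        for name, recall in zip(class_names, per_class_recall)
        if recall < min_recall
    }


# Hypothetical validation data: 8 Sports, 8 Politics, 4 Technology
y_true = [0] * 8 + [1] * 8 + [2] * 4
y_pred = [0] * 8 + [1] * 7 + [0] + [2] * 1 + [1] * 3

failing = classes_below_recall_floor(
    y_true, y_pred, ["Sports", "Politics", "Technology"], min_recall=0.60
)
print(failing)   # {'Technology': 0.25}
if failing:
    print("Validation gate failed for:", ", ".join(failing))
```

In a CI job, a non-empty result would typically fail the build; how that is wired up depends on your pipeline.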

🎯 Key Takeaway

Multi-class metrics need averaging strategy — choose carefully.

Per-class recall is the only metric that catches minority class failure.

Macro avg is fair but noisy; weighted avg hides rare class problems.

The Precision-Recall Trade-off: Threshold Tuning and AUC-PR

Precision and recall pull in opposite directions. As you lower the classification threshold, recall increases because you catch more positives — but precision drops because you also pick up more false alarms. The Precision-Recall (PR) curve visualises this trade-off across all possible thresholds.

Unlike the ROC curve (which plots TPR vs FPR and can be overly optimistic on imbalanced data), the PR curve focuses on the positive class. It's the recommended diagnostic for imbalanced binary classification.

Area Under the PR Curve (AUC-PR / AUPR) summarises the curve into a single number. Higher is better. A random model scores 0.5 AUROC regardless of class balance, but its AUPR is roughly the positive-class prevalence, so the baseline shifts with the data. For a rare positive class, even a good model may have modest AUPR.
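The two summary numbers can be compared on toy data. The sketch uses `roc_auc_score` and `average_precision_score` (sklearn's step-wise estimate of the area under the PR curve); both datasets are hand-built so the values can be verified by counting.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Case 1: an uninformative scorer on data with 1% positives
y_rare = np.array([1] * 10 + [0] * 990)
constant_scores = np.full(1000, 0.5)
baseline_roc = roc_auc_score(y_rare, constant_scores)
baseline_ap = average_precision_score(y_rare, constant_scores)
print(f"Uninformative scorer: AUROC={baseline_roc:.3f}, AP={baseline_ap:.3f}")
# AUROC=0.500, AP=0.010 (equal to the positive prevalence)

# Case 2: 2 positives among 20 samples; one positive is ranked below 3 negatives
y_small = np.array([1, 1] + [0] * 18)
scores = np.array([0.9, 0.4, 0.8, 0.6, 0.5] + [0.1] * 15)
small_roc = roc_auc_score(y_small, scores)
small_ap = average_precision_score(y_small, scores)
print(f"Imperfect ranker    : AUROC={small_roc:.3f}, AP={small_ap:.3f}")
# AUROC=0.917, AP=0.700
```

In the second case the 15 easy negatives lift AUROC to 0.917, while average precision drops to 0.70 because the three false alarms ranked above the second positive count directly against precision.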

Why does this matter in production? You rarely have the freedom to pick the threshold that maximises F1. You usually have a business constraint, e.g. 'recall must be at least 0.80, and we accept precision as low as 0.30', and the PR curve is how you find the threshold that satisfies it.

precision_recall_curve_threshold_tuning.py (Python)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    precision_recall_curve,
    auc,
    f1_score
)
import matplotlib.pyplot as plt

# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.9, 0.1], flip_y=0.05,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]

# Compute precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)

# Find threshold that gives recall >= 0.80 with highest precision
min_recall = 0.80
valid_indices = np.where(recalls[:-1] >= min_recall)[0]
if len(valid_indices) > 0:
    best_idx = valid_indices[np.argmax(precisions[valid_indices])]
    best_threshold = thresholds[best_idx]
    print(f"Best threshold for recall >= {min_recall}: {best_threshold:.3f}")
    print(f"  -> Precision = {precisions[best_idx]:.3f}, Recall = {recalls[best_idx]:.3f}")
else:
    print(f"No threshold can achieve recall >= {min_recall}")

# Area under PR curve
pr_auc = auc(recalls, precisions)
print(f"AUC-PR: {pr_auc:.3f}")

# Plot
plt.figure(figsize=(8,6))
plt.plot(recalls, precisions, label=f'Logistic Regression (AUC-PR = {pr_auc:.3f})', lw=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='best')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('pr_curve_threshold_tuning.png', dpi=150)
print("Figure saved.")

Output

Best threshold for recall >= 0.80: 0.412

-> Precision = 0.467, Recall = 0.833

AUC-PR: 0.724

Figure saved.

💡Production Tip:

Store the threshold alongside the model artifact in your model registry. When you deploy, ensure the scoring pipeline uses the saved threshold, not sklearn's default 0.5. A mismatch between training threshold and serving threshold is a common silent bug.
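One way to keep the serving threshold in step with the tuned one is to write it into a small metadata file next to the model artifact and read it back at scoring time. The sketch below assumes a plain JSON sidecar file; the file name `threshold.json`, the helper names, and the metadata fields are hypothetical and not part of any model registry's API.

```python
import json
import tempfile
from pathlib import Path


def save_threshold(artifact_dir, threshold, rule):
    """Write the tuned threshold and the business rule behind it to a JSON sidecar."""
    path = Path(artifact_dir) / "threshold.json"  # hypothetical file name
    path.write_text(json.dumps({"threshold": threshold, "rule": rule}))
    return path


def load_threshold(artifact_dir):
    """Read the threshold back at serving time instead of assuming 0.5."""
    meta = json.loads((Path(artifact_dir) / "threshold.json").read_text())
    return meta["threshold"]


def predict_labels(scores, threshold):
    """Turn positive-class probabilities into hard labels using the given threshold."""
    return [1 if s >= threshold else 0 for s in scores]


with tempfile.TemporaryDirectory() as artifact_dir:
    save_threshold(artifact_dir, threshold=0.412, rule="recall >= 0.80")
    serving_threshold = load_threshold(artifact_dir)
    scores = [0.10, 0.45, 0.50, 0.90]
    labels_tuned = predict_labels(scores, serving_threshold)
    labels_default = predict_labels(scores, 0.5)

print(labels_tuned)    # [0, 1, 1, 1]
print(labels_default)  # [0, 0, 1, 1]
```

The second sample (score 0.45) is the kind of case a threshold mismatch silently changes: flagged under the stored threshold, passed under the default.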

📊 Production Insight

Threshold tuning is a zero-cost performance lever — no retraining needed.

Use PR curve, not ROC, for imbalanced problems — ROC can look great even with many false positives.

Document the chosen threshold and the business rule that drove it (e.g., 'recall >= 0.80').
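A small worked calculation shows why ROC can flatter a model on imbalanced data. The counts below are made up for illustration: 10,000 negatives, 100 positives, and a model that catches 80 positives while raising 500 false alarms.

```python
# Hypothetical confusion-matrix counts for an imbalanced problem (illustrative only)
tp, fn = 80, 20        # 100 actual positives
fp, tn = 500, 9500     # 10,000 actual negatives

tpr = tp / (tp + fn)          # recall, the y-axis of the ROC curve
fpr = fp / (fp + tn)          # the x-axis of the ROC curve
precision = tp / (tp + fp)    # the y-axis of the PR curve

print(f"TPR (recall): {tpr:.3f}")        # 0.800
print(f"FPR:          {fpr:.3f}")        # 0.050
print(f"Precision:    {precision:.3f}")  # 0.138
```

An ROC point at (0.05, 0.80) looks strong, yet about 86% of the alerts are false alarms. Only the precision axis of the PR curve shows that.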

🎯 Key Takeaway

Threshold tuning trades precision for recall — and it's free.

PR curve beats ROC for imbalanced classes — always use it for fraud/medical.

Pick your threshold by business constraints, not by max F1.

Reading the Classification Report Like a Postmortem Log

Scikit-learn's classification_report prints a table of precision, recall, f1-score, and support per class. It looks clean. Don't trust it blindly. The summary rows hide class imbalance. When you have 10,000 legitimate transactions and 10 fraud cases, a 99% recall on the majority class won't save you. Always check the support column first. It tells you how many actual samples exist for each class. If a critical class has only a handful of samples, its precision and recall are unstable: with 10 fraud cases, a single flipped prediction moves recall by 10 points. Then look at how the averages are built. The weighted avg row weights each class by its support, so the majority class dominates it and a failing rare class barely registers. The macro avg row gives every class equal weight, so it drops visibly when a rare class fails. Use weighted avg when you want a number that reflects the overall class mix. Use macro avg when every class matters equally regardless of frequency. For a rare critical class such as fraud, trust neither average and read that class's own row.

classification_report.py (Python)

# io.thecodeforge
from sklearn.metrics import classification_report
import numpy as np

# 1000 legit (0), 10 fraud (1)
y_true = np.array([0]*1000 + [1]*10)
y_pred = np.array([0]*995 + [1]*5 + [0]*8 + [1]*2)

report = classification_report(y_true, y_pred, target_names=['legit', 'fraud'])
print(report)

# Note: fraud recall is 0.20 (2/10), but the averages blur it.
# Macro avg recall is 0.60; weighted avg is about 0.99, dominated by legit, and useless for fraud detection.

Output

precision recall f1-score support

legit 0.99 0.99 0.99 1000

fraud 0.29 0.20 0.24 10

accuracy 0.99 1010

macro avg 0.64 0.60 0.61 1010

weighted avg 0.99 0.99 0.99 1010
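The two average rows can be reproduced by hand, which makes it clear what each one hides. This sketch takes the per-class recall and support from the report above and recomputes both averages; the variable names are illustrative.

```python
# Per-class recall and support, taken from the fraud example in this section
recall = {"legit": 995 / 1000, "fraud": 2 / 10}
support = {"legit": 1000, "fraud": 10}

# Macro average: every class counts the same, regardless of size
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.2f}")     # 0.60
print(f"weighted recall: {weighted_recall:.2f}")  # 0.99
```

The weighted figure is almost entirely the legit class. The macro figure is pulled down by the fraud class, but neither replaces reading the fraud row itself.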

⚠ Production Trap:

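classification_report prints 2 decimal places by default (the digits parameter). When support is low, that second decimal is false comfort: with 10 samples, recall can only move in steps of 0.10, so round to 1 decimal. A 0.00 precision on 'fraud' with 10 samples might be noise, not model failure. Inspect the confusion matrix alongside.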
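To make the instability concrete, here is a minimal sketch of how far a single flipped prediction moves recall at different support sizes. The counts are illustrative and the helper name is hypothetical.

```python
def recall_shift(true_positives, support):
    """Recall before and after one extra positive sample is caught."""
    before = true_positives / support
    after = (true_positives + 1) / support
    return before, after


# Rare class: 10 actual fraud cases, 2 caught
fraud_before, fraud_after = recall_shift(2, 10)
# Common class: 1000 actual legit cases, 995 correctly labelled
legit_before, legit_after = recall_shift(995, 1000)

print(f"fraud recall: {fraud_before:.3f} -> {fraud_after:.3f}")  # 0.200 -> 0.300
print(f"legit recall: {legit_before:.3f} -> {legit_after:.3f}")  # 0.995 -> 0.996
```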

🎯 Key Takeaway

Never trust a classification report before checking the support column. Low support means unreliable metrics.

Threshold Tuning: Why the Default 0.5 Is a Trap

Most classifiers output a probability, not a hard label. For most binary classifiers with predict_proba, scikit-learn's predict is equivalent to thresholding the positive-class probability at 0.5. That's arbitrary. In fraud detection, you want high recall: catch as much fraud as possible even if you get more false alarms. Lower the threshold, for example to 0.3. In spam filtering, you want high precision: avoid flagging legitimate email. Raise the threshold, for example to 0.8. Tuning thresholds is how you steer the precision-recall trade-off. The scikit-learn precision_recall_curve function returns precision and recall at each distinct score in your data, treating each one as a candidate threshold. Plot it. Pick the threshold where the metrics match your business cost. For example, if a false negative costs $1000 and a false positive costs $10, you want the threshold that minimizes total cost. Never ship a model without checking the threshold. The default 0.5 is a starting point, not a destination.

threshold_tuning.py (Python)

# io.thecodeforge
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
import numpy as np

# probabilities from model
y_probs = np.array([0.9, 0.4, 0.2, 0.7, 0.1])
y_true = np.array([1, 0, 0, 1, 0])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_probs)

# Highest threshold that still keeps recall >= 0.8.
# thresholds is ascending and recall never increases with the threshold,
# so the last qualifying index is the one we want.
valid = np.where(recalls[:-1] >= 0.8)[0]
if len(valid) > 0:
    i = valid[-1]
    print(f'Threshold: {thresholds[i]:.2f}, Precision: {precisions[i]:.2f}, Recall: {recalls[i]:.2f}')

plt.plot(thresholds, precisions[:-1], 'b-', label='Precision')
plt.plot(thresholds, recalls[:-1], 'r-', label='Recall')
plt.xlabel('Threshold')
plt.legend()
plt.show()

Output

Threshold: 0.70, Precision: 1.00, Recall: 1.00

🔥Why This Matters:

Threshold tuning is free. It costs zero compute. It directly changes how your model behaves in production. Ignoring it is like driving a car with the emergency brake on.

🎯 Key Takeaway

Default threshold (0.5) is rarely optimal. Always tune based on the real cost of false positives vs. false negatives.
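The cost example from this section (a false negative at $1000, a false positive at $10) can be turned into a direct search over candidate thresholds. This is a minimal sketch under those assumed costs, using a tiny hand-made set of scores; the function name, the data, and the candidate thresholds are illustrative.

```python
def total_cost(y_true, y_scores, threshold, fn_cost=1000, fp_cost=10):
    """Total misclassification cost when predicting positive for score >= threshold."""
    cost = 0
    for label, score in zip(y_true, y_scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 0:
            cost += fn_cost   # missed positive
        elif label == 0 and predicted == 1:
            cost += fp_cost   # false alarm
    return cost


# Hand-made scores for illustration: 3 positives, 7 negatives
y_true   = [1,    1,    1,    0,    0,    0,    0,    0,    0,    0]
y_scores = [0.90, 0.55, 0.35, 0.60, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05]

candidates = [0.3, 0.5, 0.8]
costs = {t: total_cost(y_true, y_scores, t) for t in candidates}
best_threshold = min(costs, key=costs.get)

for t in candidates:
    print(f"threshold {t:.1f}: cost ${costs[t]}")
# threshold 0.3: cost $30
# threshold 0.5: cost $1010
# threshold 0.8: cost $2000
print(f"cheapest threshold: {best_threshold}")  # 0.3
```

With a 100:1 cost ratio, three extra false alarms are far cheaper than one missed positive, so the lowest candidate wins here. A different cost ratio would move the answer.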

● Production incident · POST-MORTEM · severity: high

The 99% Accuracy That Hid 0% Recall — A Fraud Detection Disaster

Symptom

Model reported 99% accuracy on holdout set. Fraud rate in production was consistent with training (2%). Business stakeholders were impressed.

Assumption

High accuracy means the model is performing well. The team only monitored accuracy in dashboards and automated reports.

Root cause

The model predicted 'legitimate' for every transaction. An all-'legitimate' model scores exactly the share of legitimate transactions, so at a 2% fraud rate it reaches 98% accuracy without learning anything. The reported 99% is only possible if the holdout sample contained closer to 1% fraud. Recall for the fraud class was 0.0: the model had never learned to detect fraud because the loss function (cross-entropy) was dominated by the majority class.

Fix

Switched primary metric to recall for fraud and added a classification_report to the monitoring dashboard. Retrained with class_weight='balanced' and lowered decision threshold to 0.3. Recall jumped to 0.72, precision dropped to 0.18 — acceptable trade-off given fraud cost.

Key lesson

Never trust accuracy alone on imbalanced data — demand per-class recall and precision.
Automate classification_report generation in your evaluation pipeline — it catches silent failures.
The business cost of false negatives drives metric selection, not the data scientist's comfort with high accuracy.
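The failure mode in this incident is easy to reproduce. The sketch below builds a hypothetical holdout set with a 2% fraud rate and scores a model that always predicts 'legitimate'; the sizes are made up for illustration.

```python
# Hypothetical holdout set: 1,000 transactions, 2% fraud (label 1)
y_true = [1] * 20 + [0] * 980
# A model that has collapsed to the majority class
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(y_true)

accuracy = correct / len(y_true)
fraud_recall = true_pos / actual_pos

print(f"accuracy:     {accuracy:.2f}")      # 0.98
print(f"fraud recall: {fraud_recall:.2f}")  # 0.00
```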

Production debug guide: Symptom → Action guide for common metric failures (4 entries)

Symptom · 01

Accuracy is high but fraud is still rampant

→

Fix

Run classification_report(y_true, y_pred). Check recall for minority class. If near zero, your model is predicting majority class only.

Symptom · 02

Precision is perfect (1.0) but business is unhappy

→

Fix

Precision=1.0 means zero false positives, but check recall — the model is probably ignoring most positives. Plot precision-recall curve to see the trade-off.

Symptom · 03

F1-score suddenly drops after deployment

→

Fix

Compare precision and recall separately — one likely collapsed. Check data drift (distribution shift) or label definition changes.

Symptom · 04

classification_report shows 0.00 with an UndefinedMetricWarning for one class

→

Fix

The model never predicted that class, so its precision is undefined (0/0). Check threshold tuning, class imbalance, or retrain with balanced sampling. Pass zero_division=0 to report 0.0 without the warning.
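For the last entry, a minimal sketch of what scikit-learn does when a class is never predicted. It assumes scikit-learn 0.22 or later, where the `zero_division` parameter is available; the labels are made up.

```python
import warnings
from sklearn.metrics import precision_score, recall_score

# Made-up labels: the model never predicts the positive class
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]

# Default behaviour: precision is undefined (0 / 0), so sklearn warns and returns 0.0
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_precision = precision_score(y_true, y_pred)

# Explicit zero_division=0: same value, reported without the warning
quiet_precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)  # well defined: 0 of 2 positives caught

print(f"warnings raised by default: {len(caught)}")
print(f"precision: {quiet_precision}, recall: {recall}")
```

Silencing the warning does not fix anything. A recall of 0.0 on that class is still the real signal to act on.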

★ Quick Debug Cheat Sheet for Classification Metrics: rapid-fire commands and checks for common metric pitfalls. Paste these into your notebook or production script.

Need instant metric breakdown

Immediate action

Run classification_report() with all classes and zero_division=0

Commands

from sklearn.metrics import classification_report; print(classification_report(y_true, y_pred, zero_division=0))

from sklearn.metrics import confusion_matrix; cm = confusion_matrix(y_true, y_pred); tn, fp, fn, tp = cm.ravel(); print(f'TP={tp} FP={fp} FN={fn} TN={tn}')

Fix now

Switch to F1 or recall as primary metric if imbalance is detected.

Model predicts only majority class

Need to find optimal threshold without retraining

Metric	Formula	Optimise When	Blind Spot
Accuracy	(TP+TN) / Total	Classes are balanced and all errors cost the same	Completely misleading on imbalanced datasets — a dumb model scores 99%
Precision	TP / (TP+FP)	False alarms are expensive (spam filter, legal review)	Ignores false negatives entirely — a model that rarely predicts 'positive' scores perfectly
Recall	TP / (TP+FN)	Missing a positive is catastrophic (cancer screening, fraud)	Ignores false positives — a model predicting 'positive' for everything scores 100%
F1-Score	2×(P×R)/(P+R)	You need one balanced number and the dataset is imbalanced	Treats precision and recall equally — use F-beta if you need to weight one more
F-beta Score	(1+β²)×(P×R)/(β²×P+R)	You need recall weighted β times more than precision	β must be tuned by business context, not by grid search
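The formulas in the table translate directly into a few lines of arithmetic. This sketch computes each metric from the four confusion-matrix counts, using the counts implied by the fraud report example earlier in this guide (TP=2, FP=5, FN=8, TN=995); the function name is illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


tp, fp, fn, tn = 2, 5, 8, 995

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)

print(f"accuracy:  {accuracy:.3f}")   # 0.987
print(f"precision: {precision:.3f}")  # 0.286
print(f"recall:    {recall:.3f}")     # 0.200
print(f"F1:        {f1:.3f}")         # 0.235
print(f"F2:        {f2:.3f}")         # 0.213
```

Accuracy sits near 0.99 while every metric that looks at the positive class stays below 0.30, which is the blind spot the table describes.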

⚙ Quick Reference

7 commands from this guide

File	Command / Code	Purpose
confusion_matrix_basics.py	from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay	The Confusion Matrix
classification_metrics_from_scratch.py	from sklearn.metrics import (	Precision, Recall and F1-Score
metric_selection_real_world.py	from sklearn.datasets import make_classification	Choosing the Right Metric for Your Problem
multiclass_metrics.py	from sklearn.metrics import classification_report, confusion_matrix	Beyond Binary
precision_recall_curve_threshold_tuning.py	from sklearn.datasets import make_classification	The Precision-Recall Trade-off
classification_report.py	from sklearn.metrics import classification_report	Reading the Classification Report Like a Postmortem Log
threshold_tuning.py	from sklearn.metrics import precision_recall_curve	Threshold Tuning

Key takeaways

A confusion matrix breaks your model's performance into four honest buckets (TP, TN, FP, FN): never accept a single accuracy number without demanding the full matrix first.

Precision and recall are in tension: pushing one up almost always pushes the other down. The business context, not the data, decides which direction to lean.

On imbalanced datasets (fraud, medical, anomaly detection), accuracy is nearly always the wrong primary metric. Use F1, recall, or a Precision-Recall curve instead.

Lowering the classification threshold below 0.5 is the cheapest, fastest way to trade precision for recall on an already-trained model: know this trick before reaching for resampling or retraining.

For multi-class, always inspect per-class recall. Macro/weighted averages can mask a class your model is completely ignoring.
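For the multi-class point, per-class recall can be read straight off the rows of a confusion matrix. The 3-class matrix below is invented for illustration (rows are actual classes, columns are predictions) and shows a model that never predicts the third class.

```python
# Invented 3-class confusion matrix: rows = actual, columns = predicted
labels = ["cat", "dog", "bird"]
matrix = [
    [90, 10, 0],   # actual cat
    [5,  95, 0],   # actual dog
    [6,  4,  0],   # actual bird: never predicted
]

support = [sum(row) for row in matrix]
recall = [matrix[i][i] / support[i] for i in range(len(labels))]

macro = sum(recall) / len(recall)
weighted = sum(r * s for r, s in zip(recall, support)) / sum(support)

for name, r, s in zip(labels, recall, support):
    print(f"{name:5s} recall={r:.2f} support={s}")
print(f"macro recall={macro:.2f}, weighted recall={weighted:.2f}")
# macro recall=0.62, weighted recall=0.88
```

A weighted recall of 0.88 reads as healthy, yet the bird row shows a recall of 0.00.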

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Your fraud detection model has 99.5% accuracy — the product team is thri...

Q02 · SENIOR

In a binary classifier for cancer detection, would you rather maximise p...

Q03 · SENIOR

If I asked you to increase recall on your model without retraining it at...

Q04 · SENIOR

Explain the difference between macro, weighted, and micro average in mul...

Q01 of 04 · SENIOR

Your fraud detection model has 99.5% accuracy — the product team is thrilled. Should you be? Walk me through what you'd actually check before celebrating.

ANSWER

No, I'd be suspicious immediately because fraud is typically rare (0.5-2% of transactions). A model that predicts 'legitimate' for every transaction would achieve 98%+ accuracy on a 2% fraud rate. I'd run classification_report(y_true, y_pred) and check recall for the fraud class. If recall is near zero, accuracy is a complete illusion. I'd also check the confusion matrix to see if the model ever predicts fraud. Then I'd look at precision-recall curve to understand trade-offs. Only after verifying per-class recall > 0.5 would I celebrate.


Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.


July 18, 2026

last updated
