Confusion matrix is a 2×2 grid counting TP, TN, FP, FN independently
Accuracy hides failure on imbalanced data – always check per-class recall
Precision = trustworthiness of positive predictions; Recall = completeness of catching positives
F1-score is harmonic mean, punishing skewed precision-recall pairs
Class imbalance is the #1 reason accuracy lies – use classification_report()
Plain-English First
Imagine you're a doctor screening patients for a rare disease. Your test results fall into four buckets: people you correctly flagged as sick, people you correctly cleared as healthy, healthy people you wrongly alarmed (false alarm), and sick people you wrongly cleared (missed cases). A confusion matrix is just a scoreboard that counts all four buckets. The classification metrics — precision, recall, F1 — are different ways of asking 'how good is this scoreboard, really?' depending on which type of mistake costs you the most.
Every ML model that classifies things — spam or not spam, fraud or legit, cancer or benign — eventually faces a moment of truth: how do we measure whether it's actually any good? Accuracy sounds like the obvious answer, but it's a trap. A model that predicts 'not fraud' for every single transaction can hit 99% accuracy on a dataset where fraud is 1% of records — and be completely useless. The real world demands smarter scorekeeping.
The confusion matrix exists to break that single 'accuracy' number into its honest parts. It shows you not just how many predictions were right, but what kind of wrong your model is being. Are you raising too many false alarms? Are you missing real threats? Those are completely different failure modes with completely different business consequences, and accuracy hides both of them.
By the end of this article you'll be able to read a confusion matrix cold, calculate precision, recall, F1-score and accuracy by hand, write production-ready evaluation code in Python using scikit-learn, and — most importantly — know which metric to optimise for given a real business problem. That last skill is what separates engineers who build useful models from engineers who build impressive-looking ones.
The Confusion Matrix — Reading the Scoreboard Before Calculating Anything
A confusion matrix is a 2×2 grid (for binary classification) that maps every prediction your model makes against what was actually true. The four cells are True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
True Positive (TP): Model said 'yes', reality was 'yes'. The model caught a real fraud case.
True Negative (TN): Model said 'no', reality was 'no'. The model correctly cleared a legit transaction.
False Positive (FP): Model said 'yes', reality was 'no'. A false alarm: an innocent transaction flagged as fraud. Also called a Type I error.
False Negative (FN): Model said 'no', reality was 'yes'. A missed catch: real fraud that slipped through. Also called a Type II error.
Here's the crucial insight most tutorials skip: FP and FN are not equally bad. In fraud detection, an FN (missed fraud) costs the bank real money. In cancer screening, an FN (missed cancer) can cost a life. In a spam filter, an FP (a real email landing in spam) might cost you an important message. The business context dictates which error type you can tolerate least — and that determines which metric you optimise for.
confusion_matrix_basics.py (Python)
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Simulated ground truth labels for 20 fraud detection predictions
# 1 = fraud, 0 = legitimate
actual_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
                 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]

# What our model predicted for each of those 20 transactions
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

# Build the confusion matrix
# sklearn orders as: [[TN, FP], [FN, TP]] by default
cm = confusion_matrix(actual_labels, predicted_labels)

# Pull out each cell so we can label them clearly
tn, fp, fn, tp = cm.ravel()

print("=== Confusion Matrix (raw counts) ===")
print(f"True Negatives  (TN): {tn} - Legit transactions correctly cleared")
print(f"False Positives (FP): {fp} - Legit transactions wrongly flagged (false alarm)")
print(f"False Negatives (FN): {fn} - Fraud transactions we missed (dangerous!)")
print(f"True Positives  (TP): {tp} - Fraud transactions correctly caught")
print()
print("Raw confusion matrix:")
print(cm)

# Visualise it as a heatmap, which is much easier to read at a glance
fig, ax = plt.subplots(figsize=(6, 5))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=["Legitimate", "Fraud"]
)
disp.plot(ax=ax, cmap="Blues", colorbar=False)
ax.set_title("Fraud Detection: Confusion Matrix", fontsize=14, pad=12)
plt.tight_layout()
plt.savefig("confusion_matrix_fraud.png", dpi=150)
print("Heatmap saved to confusion_matrix_fraud.png")
sklearn's confusion_matrix returns [[TN, FP], [FN, TP]] — rows are actual labels, columns are predicted. Use .ravel() to unpack all four values in one line: tn, fp, fn, tp = cm.ravel(). Memorise this order or you'll misread every matrix you ever build.
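As a quick check on that ordering, the sketch below reuses the 20 labels from the script above and unpacks the default layout. It then passes `labels=[1, 0]` to put the positive class first, which is one way to get the textbook layout if you prefer it. The variable names are only illustrative.

```python
from sklearn.metrics import confusion_matrix

actual_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
                 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

# Default: classes are sorted ascending, so the layout is [[TN, FP], [FN, TP]]
default_cm = confusion_matrix(actual_labels, predicted_labels)
tn, fp, fn, tp = default_cm.ravel()

# labels=[1, 0] puts the positive class first: [[TP, FN], [FP, TN]]
positive_first_cm = confusion_matrix(actual_labels, predicted_labels, labels=[1, 0])
tp2, fn2, fp2, tn2 = positive_first_cm.ravel()

print(default_cm)
print(positive_first_cm)
```

Both calls count the same predictions; only the position of each cell changes, which is why unpacking by name is safer than reading cells by eye.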
Production Insight
The biggest confusion matrix mistake is trusting the raw counts without normalising.
Always normalise by row (actuals) to see recall per class; absolute numbers hide imbalance.
If TN dwarfs TP, the matrix may be perfectly correct and still say little about the positive class, so check minority-class recall before trusting the model.
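A minimal sketch of row normalisation, assuming a made-up imbalanced dataset of 990 legitimate and 10 fraudulent transactions. Passing `normalize="true"` to `confusion_matrix` divides each row by its actual-class total, so the diagonal becomes recall per class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical data: 990 legitimate (0) and 10 fraud (1) transactions
y_true = np.array([0] * 990 + [1] * 10)
# Hypothetical predictions: 5 false alarms, 4 of the 10 fraud cases caught
y_pred = np.array([0] * 985 + [1] * 5 + [1] * 4 + [0] * 6)

raw_cm = confusion_matrix(y_true, y_pred)
row_norm_cm = confusion_matrix(y_true, y_pred, normalize="true")

print(raw_cm)                    # raw counts look dominated by the TN cell
print(row_norm_cm.round(3))      # diagonal = recall for each class
```

In the raw counts the 985 true negatives dominate; the normalised view shows that only 40% of fraud is caught.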
Key Takeaway
Confusion matrix reveals four distinct error types — accuracy collapses them into one.
Always demand the full matrix before any metric conversation.
Your business decides which error hurts more, not your loss function.
Precision, Recall and F1-Score — What They Actually Measure and When to Use Each
Now that you can read the scoreboard, let's build the three metrics that actually matter.
Accuracy = (TP + TN) / Total. The percentage of all predictions that were correct. Useful only when classes are balanced. Completely misleading on imbalanced datasets.
Precision = TP / (TP + FP). Of everything the model called positive, how many actually were? This is your 'don't cry wolf' metric. High precision means when the model raises an alarm, you can trust it. Optimise for precision when false alarms are costly — think spam filters (you don't want real emails binned) or legal document review (you don't want lawyers chasing dead ends).
Recall (Sensitivity) = TP / (TP + FN). Of all the actual positives, how many did the model catch? This is your 'don't miss anything' metric. High recall means few real threats slip through. Optimise for recall when missing a positive is catastrophic — cancer screening, fraud detection, safety-critical systems.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean of precision and recall. Use it when you need a single balanced metric and you can't afford to let either precision or recall collapse. It's the default choice for imbalanced classification competitions.
The harmonic mean is used (not arithmetic mean) because it punishes extreme imbalance. A model with precision=1.0 and recall=0.0 has an F1 of 0, not 0.5.
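To make the formulas concrete, here is a small sketch that computes them by hand from the four counts of the 20-transaction fraud example (TP=7, FP=2, FN=3, TN=8). The helper names are made up for illustration and are not part of scikit-learn.

```python
def precision_from_counts(tp: int, fp: int) -> float:
    # Share of positive predictions that were really positive
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall_from_counts(tp: int, fn: int) -> float:
    # Share of actual positives that the model caught
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_from(precision: float, recall: float) -> float:
    # Harmonic mean; defined as 0 when both inputs are 0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

tp, fp, fn, tn = 7, 2, 3, 8
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = precision_from_counts(tp, fp)
recall = recall_from_counts(tp, fn)
f1 = f1_from(precision, recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Harmonic vs arithmetic mean for the degenerate pair precision=1.0, recall=0.0
print(f1_from(1.0, 0.0), (1.0 + 0.0) / 2)
```

The last line shows the point made above: the arithmetic mean would report 0.5 for a model that catches nothing, while the harmonic mean reports 0.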
classification_metrics_from_scratch.py (Python)
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)

# Same fraud detection scenario as before
actual_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
                 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

# --- Calculate each metric individually first so we understand what's happening ---
# pos_label=1 means we treat 'Fraud' as the positive class we care about
accuracy = accuracy_score(actual_labels, predicted_labels)
precision = precision_score(actual_labels, predicted_labels, pos_label=1)
recall = recall_score(actual_labels, predicted_labels, pos_label=1)
f1 = f1_score(actual_labels, predicted_labels, pos_label=1)

# Raw counts, used only to make the printed explanations concrete
tn, fp, fn, tp = confusion_matrix(actual_labels, predicted_labels).ravel()

print("=== Individual Metrics (Fraud = Positive Class) ===")
print(f"Accuracy : {accuracy:.2%} - {tp + tn}/{len(actual_labels)} predictions correct overall")
print(f"Precision: {precision:.2%} - Of {tp + fp} fraud alerts, this fraction were real fraud")
print(f"Recall   : {recall:.2%} - Of all actual fraud, this fraction was caught")
print(f"F1-Score : {f1:.2%} - Balanced measure (harmonic mean of precision & recall)")
print()

# --- sklearn's classification_report is your best friend in production ---
# It gives you precision, recall, F1 for EACH class plus macro/weighted averages
report = classification_report(
    actual_labels,
    predicted_labels,
    target_names=["Legitimate", "Fraud"],
    digits=3
)
print("=== Full Classification Report ===")
print(report)

# --- Illustrate why accuracy is misleading on imbalanced data ---
print("=== The Accuracy Trap: Imbalanced Dataset Demonstration ===")
# Imagine 1000 transactions: 990 legit, 10 fraudulent (realistic ratio)
imbalanced_actual = [0] * 990 + [1] * 10
# A dumb model that ALWAYS predicts 'legitimate'
dumb_model_predicted = [0] * 1000

dumb_accuracy = accuracy_score(imbalanced_actual, dumb_model_predicted)
dumb_recall = recall_score(imbalanced_actual, dumb_model_predicted,
                           pos_label=1, zero_division=0)
dumb_f1 = f1_score(imbalanced_actual, dumb_model_predicted,
                   pos_label=1, zero_division=0)
print(f"Dumb model accuracy : {dumb_accuracy:.2%} ← looks great!")
print(f"Dumb model recall   : {dumb_recall:.2%} ← caught ZERO fraud cases")
print(f"Dumb model F1-score : {dumb_f1:.2%} ← tells the real story")
Output
=== Individual Metrics (Fraud = Positive Class) ===
Accuracy : 75.00% - 15/20 predictions correct overall
Precision: 77.78% - Of 9 fraud alerts, this fraction were real fraud
Recall   : 70.00% - Of all actual fraud, this fraction was caught
F1-Score : 73.68% - Balanced measure (harmonic mean of precision & recall)

=== Full Classification Report ===
              precision    recall  f1-score   support

  Legitimate      0.727     0.800     0.762        10
       Fraud      0.778     0.700     0.737        10

    accuracy                          0.750        20
   macro avg      0.753     0.750     0.749        20
weighted avg      0.753     0.750     0.749        20

=== The Accuracy Trap: Imbalanced Dataset Demonstration ===
Dumb model accuracy : 99.00% ← looks great!
Dumb model recall   : 0.00% ← caught ZERO fraud cases
Dumb model F1-score : 0.00% ← tells the real story
Watch Out:
When the model never predicts the positive class, precision is 0/0 and therefore undefined: sklearn emits an UndefinedMetricWarning and returns 0.0 by default. Recall is still well defined in that case (it is simply 0.0), but it becomes undefined when a class has no true samples. Pass zero_division=0 explicitly to silence the warning and make the fallback value deliberate; otherwise pipelines that escalate warnings to errors may fail in production.
Production Insight
F1-score punishes imbalance — precision=1.0, recall=0.0 gives F1=0, not 0.5.
Use classification_report() in CI/CD to catch silent failures before deployment.
If weighted avg F1 drops, a class may be getting ignored, so dig into per-class recall.
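One way to wire this into CI is sketched below: `classification_report(..., output_dict=True)` returns a nested dict, so a script can compare each class's recall against a floor. The function name `failing_classes` and the 0.60 floor are assumptions for illustration, not recommended values.

```python
from sklearn.metrics import classification_report

MIN_RECALL_PER_CLASS = 0.60  # hypothetical floor; set it from your business rule

def failing_classes(y_true, y_pred, target_names, min_recall=MIN_RECALL_PER_CLASS):
    """Return {class_name: recall} for every class whose recall is below the floor."""
    report = classification_report(
        y_true, y_pred,
        target_names=target_names,
        output_dict=True,
        zero_division=0,
    )
    return {
        name: report[name]["recall"]
        for name in target_names
        if report[name]["recall"] < min_recall
    }

# The always-predict-legitimate model from the accuracy trap demonstration
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

failures = failing_classes(y_true, y_pred, ["Legitimate", "Fraud"])
print(failures)
```

In a real pipeline the script would exit with a non-zero status when `failures` is non-empty, so the 99%-accuracy model above never reaches deployment.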
Key Takeaway
Accuracy lies on imbalanced data — always pair it with per-class recall.
F1 is the safety net: one number that exposes skewed performance.
Zero division? Pass zero_division=0 to avoid pipeline crashes.
Choosing the Right Metric for Your Problem — A Decision Framework
Knowing what the metrics measure is only half the battle. The harder skill is knowing which one to care about in a given situation — and being able to defend that choice to a product manager or a senior engineer.
Here's the mental model: ask yourself 'which mistake is more expensive?'
If a False Positive is expensive → optimise for Precision. Example: a content moderation system wrongly banning a legitimate post causes user backlash and potential legal liability. You'd rather miss a few bad posts than wrongly censor good ones.
If a False Negative is expensive → optimise for Recall. Example: a medical screening test that misses a tumour sends a sick patient home untreated. The cost of a false alarm (extra tests, anxiety) is much lower than missing the disease.
If both mistakes matter roughly equally → use F1-Score. Example: a job application screening tool — both wrongly rejecting a strong candidate (FN) and wasting time on a weak one (FP) matter.
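The 'which mistake is more expensive?' question can be turned into a number. The sketch below assigns made-up costs to each error type (5 per false alarm, 500 per missed fraud) and compares two hypothetical models on the same 1,000 transactions; `total_error_cost` is an illustrative helper, not a library function.

```python
from sklearn.metrics import confusion_matrix

def total_error_cost(y_true, y_pred, cost_fp, cost_fn):
    """Sum the cost of all false positives and false negatives."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fp) * cost_fp + int(fn) * cost_fn

y_true = [0] * 990 + [1] * 10

# Hypothetical model A: cautious, 1 false alarm but 6 missed fraud cases
pred_a = [0] * 989 + [1] * 1 + [1] * 4 + [0] * 6
# Hypothetical model B: aggressive, 30 false alarms but only 1 missed fraud case
pred_b = [0] * 960 + [1] * 30 + [1] * 9 + [0] * 1

cost_a = total_error_cost(y_true, pred_a, cost_fp=5, cost_fn=500)
cost_b = total_error_cost(y_true, pred_b, cost_fp=5, cost_fn=500)
print(cost_a, cost_b)
```

Under these assumed costs the high-recall model B is far cheaper despite its lower precision; flip the costs and the conclusion flips too.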
For multi-class problems, the classification_report gives you per-class metrics plus two averages: macro avg (treats all classes equally, good for balanced datasets) and weighted avg (weights by class support — better for imbalanced ones). Never just report the weighted average without also checking per-class recall or you'll miss a class your model is quietly ignoring.
metric_selection_real_world.py (Python)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

# --- Scenario: Medical test, detecting a rare condition (5% prevalence) ---
np.random.seed(42)

# Generate a deliberately imbalanced binary classification dataset
# weights=[0.95, 0.05] means 95% healthy, 5% have the condition
features, labels = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=6,
    weights=[0.95, 0.05],  # heavy class imbalance
    flip_y=0.02,           # a little noise to make it realistic
    random_state=42
)

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels,
    test_size=0.25,
    stratify=labels,       # preserve the class ratio in both splits
    random_state=42
)

# Train a simple logistic regression model
medical_model = LogisticRegression(max_iter=1000, random_state=42)
medical_model.fit(train_features, train_labels)

# Default predictions (threshold = 0.5)
default_predictions = medical_model.predict(test_features)

# Lower threshold to 0.3: the model now flags 'positive' at lower confidence
# This trades precision for recall, which makes sense for medical screening
positive_probabilities = medical_model.predict_proba(test_features)[:, 1]
lowered_threshold_predictions = (positive_probabilities >= 0.3).astype(int)

print("=== Medical Screening Model - Default Threshold (0.5) ===")
print(classification_report(
    test_labels,
    default_predictions,
    target_names=["Healthy", "Has Condition"],
    digits=3
))

print("=== Medical Screening Model - Lowered Threshold (0.3) ===")
print(classification_report(
    test_labels,
    lowered_threshold_predictions,
    target_names=["Healthy", "Has Condition"],
    digits=3
))

# Show the trade-off explicitly: this is what you'd put in a model card
for threshold, preds in [(0.5, default_predictions),
                         (0.3, lowered_threshold_predictions)]:
    prec = precision_score(test_labels, preds, pos_label=1, zero_division=0)
    rec = recall_score(test_labels, preds, pos_label=1, zero_division=0)
    f1 = f1_score(test_labels, preds, pos_label=1, zero_division=0)
    fn_count = confusion_matrix(test_labels, preds).ravel()[2]  # FN cell
    print(f"Threshold={threshold} | Precision={prec:.3f} | Recall={rec:.3f} "
          f"| F1={f1:.3f} | Missed cases (FN)={fn_count}")
Output
=== Medical Screening Model - Default Threshold (0.5) ===
               precision    recall  f1-score   support

      Healthy      0.977     0.996     0.986       475
Has Condition      0.714     0.360     0.479        25

     accuracy                          0.972       500
    macro avg      0.846     0.678     0.733       500
 weighted avg      0.969     0.972     0.968       500

=== Medical Screening Model - Lowered Threshold (0.3) ===
Lowering the classification threshold below 0.5 is one of the first levers to pull on imbalanced medical or fraud datasets — before trying class_weight='balanced' or resampling. It costs you precision but buys recall. Plot a Precision-Recall curve across all thresholds (sklearn.metrics.precision_recall_curve) to find the sweet spot for your specific business tolerance.
Production Insight
Choosing the wrong metric wastes engineering cycles — you optimise for what you measure.
If your product manager only cares about false positives, recall becomes a distraction.
Document the cost asymmetry explicitly in model cards to avoid future metric debates.
Key Takeaway
Business context determines the 'right' metric — not the data, not the algorithm.
For multi-class, always check per-class recall — macro/weighted avgs can hide failures.
Beyond Binary: Multi-Class and Multi-Label Metrics
When you have more than two classes, the confusion matrix grows to N×N. Metrics extend via averaging strategies: micro, macro, weighted, and per-class. Each answers a different question.
Micro-average = global sum of TP, FP, FN across all classes. For single-label multi-class problems, micro-averaged precision, recall and F1 all equal accuracy. Good when classes are balanced and you care about overall correctness.
Macro-average = unweighted mean of per-class precision/recall/F1. Treats every class equally regardless of support. If a rare class has low recall, macro will expose it — but it can be dominated by noise in very small classes.
Weighted-average = average weighted by the number of true instances per class. sklearn's classification_report prints it as the 'weighted avg' line, next to 'macro avg'. It reflects overall performance but can mask a struggling minority class.
Per-class metrics = always the most informative. The classification_report prints them for every class. Never ship a model without eyeballing each row.
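The averaging strategies above can be compared side by side. This is a minimal sketch on invented 3-class labels in which the rare class 2 is never predicted; the data is purely illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 50 of class 0, 40 of class 1, 10 of rare class 2
y_true = [0] * 50 + [1] * 40 + [2] * 10
# Hypothetical predictions: class 2 is never predicted
y_pred = [0] * 45 + [1] * 5 + [1] * 38 + [0] * 2 + [0] * 6 + [1] * 4

accuracy = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
per_class_recall = recall_score(y_true, y_pred, average=None, zero_division=0)

print(f"accuracy={accuracy:.3f} micro={micro_f1:.3f} "
      f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
print("per-class recall:", per_class_recall)
```

Micro F1 matches accuracy, weighted F1 stays fairly high, macro F1 drops sharply, and only the per-class recall array states outright that class 2 has recall 0.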
For multi-label problems (each sample can belong to multiple classes), metrics are computed per label and then averaged. For instance-level evaluation, pass average='samples' to precision_score, recall_score or f1_score in sklearn.metrics.
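A short multi-label sketch, assuming labels are encoded as a binary indicator matrix (rows are samples, columns are labels). The matrices are invented for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical indicator matrices: 4 samples, 3 possible labels each
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1]])

# One F1 per label (column), treating each label as its own binary problem
per_label_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
# One F1 per sample (row), then averaged over samples
samples_f1 = f1_score(y_true, y_pred, average="samples", zero_division=0)

print("per-label F1:", per_label_f1)
print("samples-averaged F1:", samples_f1)
```

Per-label scores tell you which label is weak; the samples average tells you how complete a typical prediction is.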
Watch Out in Multi-Class:
The 'macro avg' line penalises you equally for a bad class with 1 sample and a bad class with 1000 samples. If you have extreme imbalance, weighted avg is safer for overall performance, but per-class recall is the only way to know if any class is being ignored. Never rely solely on macro or weighted — inspect per-class always.
Production Insight
Multi-class weighted avg can be high while one rare class has recall=0.
Always set a minimum recall threshold per class in your model validation gate.
If you deploy a multi-class model, log confusion matrices per slice (e.g., by date, by region) to catch distribution shifts.
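Per-slice logging can be sketched with plain NumPy masks. The helper `confusion_by_slice` and the region values are hypothetical; passing `labels` explicitly keeps every matrix the same shape even when a slice contains no examples of some class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def confusion_by_slice(y_true, y_pred, slice_keys, labels):
    """Return {slice_value: confusion matrix} for each distinct slice value."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    slice_keys = np.asarray(slice_keys)
    matrices = {}
    for key in np.unique(slice_keys):
        mask = slice_keys == key
        matrices[str(key)] = confusion_matrix(y_true[mask], y_pred[mask], labels=labels)
    return matrices

# Hypothetical labels, predictions and a region tag per prediction
y_true = [0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 0, 0]
regions = ["eu", "eu", "us", "us", "us", "eu"]

by_region = confusion_by_slice(y_true, y_pred, regions, labels=[0, 1])
for region, matrix in by_region.items():
    print(region, matrix.tolist())
```

Here the overall numbers would hide that every missed positive comes from the "us" slice.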
Key Takeaway
Multi-class metrics need averaging strategy — choose carefully.
Per-class recall is the most direct way to catch minority class failure.
Macro avg is fair but noisy; weighted avg hides rare class problems.
The Precision-Recall Trade-off: Threshold Tuning and AUC-PR
Precision and recall pull in opposite directions. As you lower the classification threshold, recall increases because you catch more positives — but precision drops because you also pick up more false alarms. The Precision-Recall (PR) curve visualises this trade-off across all possible thresholds.
Unlike the ROC curve (which plots TPR vs FPR and can be overly optimistic on imbalanced data), the PR curve focuses on the positive class. It's the recommended diagnostic for imbalanced binary classification.
Area Under the PR Curve (AUC-PR / AUPR) summarises the curve into a single number. Higher is better. A random model scores about 0.5 AUROC regardless of class balance, whereas its AUPR is roughly equal to the positive-class prevalence. For a rare positive class, even a good model may have modest AUPR, so compare it against that prevalence baseline rather than against 0.5.
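The baseline difference is easy to see with random scores. The sketch below assumes 1% prevalence and uses `average_precision_score`, scikit-learn's step-wise summary of the PR curve, as the AUPR estimate; the data is synthetic.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with 1% positives and scores that carry no information
y_true = np.array([0] * 9900 + [1] * 100)
random_scores = rng.random(y_true.size)

auroc = roc_auc_score(y_true, random_scores)
aupr = average_precision_score(y_true, random_scores)
prevalence = y_true.mean()

print(f"AUROC={auroc:.3f}  AUPR={aupr:.3f}  prevalence={prevalence:.3f}")
```

An uninformative model lands near 0.5 on AUROC but near the prevalence on AUPR, so an AUPR of 0.30 on this data would represent a large lift even though it looks small in isolation.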
Why does this matter in production? You don't have the freedom to pick the threshold that maximises F1. You have a business constraint: e.g., 'recall must be at least 0.80, and we accept precision as low as 0.30'. You need the PR curve to find that exact threshold.
Store the threshold alongside the model artifact in your model registry. When you deploy, ensure the scoring pipeline uses the saved threshold, not sklearn's default 0.5. A mismatch between training threshold and serving threshold is a common silent bug.
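A minimal sketch of constraint-driven threshold selection with `precision_recall_curve`: among all thresholds that meet the recall floor, pick the one with the best precision, then serialise it with the rule that produced it. The helper `pick_threshold`, the toy scores and the metadata field names are assumptions for illustration.

```python
import json
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_recall):
    """Best-precision threshold subject to recall >= min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final (precision=1, recall=0) point has no threshold, so drop it
    precision, recall = precision[:-1], recall[:-1]
    meets_rule = recall >= min_recall
    if not meets_rule.any():
        raise ValueError("no threshold satisfies the recall constraint")
    best = int(np.argmax(np.where(meets_rule, precision, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])

# Hypothetical validation labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.90]

threshold, precision_at_t, recall_at_t = pick_threshold(y_true, scores, min_recall=0.75)

# Metadata to store next to the model artifact (field names are illustrative)
threshold_record = json.dumps({"threshold": threshold, "rule": "recall >= 0.75"})
print(threshold, precision_at_t, recall_at_t)
print(threshold_record)
```

At serving time the scoring code would read this record and apply `score >= threshold` instead of calling `predict`, which always uses 0.5 for binary classifiers.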
Production Insight
Threshold tuning is a zero-cost performance lever — no retraining needed.
Use PR curve, not ROC, for imbalanced problems — ROC can look great even with many false positives.
Document the chosen threshold and the business rule that drove it (e.g., 'recall >= 0.80').
Key Takeaway
Threshold tuning trades precision for recall — and it's free.
PR curve beats ROC for imbalanced classes — always use it for fraud/medical.
Pick your threshold by business constraints, not by max F1.
Production incident · Post-mortem · Severity: high
The 99% Accuracy That Hid 0% Recall — A Fraud Detection Disaster
Symptom
Model reported 99% accuracy on holdout set. Fraud rate in production was consistent with training (1%). Business stakeholders were impressed.
Assumption
High accuracy means the model is performing well. The team only monitored accuracy in dashboards and automated reports.
Root cause
The model predicted 'legitimate' for every transaction. With 99% legitimate transactions, that alone yields 99% accuracy. Recall for fraud class was 0.0: the model had never learned to detect fraud because the loss function (cross-entropy) was dominated by the majority class.
Fix
Switched primary metric to recall for fraud and added a classification_report to the monitoring dashboard. Retrained with class_weight='balanced' and lowered decision threshold to 0.3. Recall jumped to 0.72, precision dropped to 0.18 — acceptable trade-off given fraud cost.
Key lesson
Never trust accuracy alone on imbalanced data — demand per-class recall and precision.
Automate classification_report generation in your evaluation pipeline — it catches silent failures.
The business cost of false negatives drives metric selection, not the data scientist's comfort with high accuracy.
Production debug guide: symptom → action guide for common metric failures
Symptom · 01
Accuracy is high but fraud is still rampant
Fix
Run classification_report(y_true, y_pred). Check recall for minority class. If near zero, your model is predicting majority class only.
Symptom · 02
Precision is perfect (1.0) but business is unhappy
Fix
Precision=1.0 means zero false positives, but check recall: the model is probably ignoring most positives. Plot precision-recall curve to see the trade-off.
Symptom · 03
F1-score suddenly drops after deployment
Fix
Compare precision and recall separately, since one likely collapsed. Check data drift (distribution shift) or label definition changes.
Symptom · 04
classification_report shows 0.0 for one class and emits UndefinedMetricWarning
Fix
The model never predicted that class, so its precision is undefined. Check threshold tuning and class imbalance, or retrain with balanced sampling. Pass zero_division=0 to make the fallback explicit and keep warning-sensitive pipelines from failing.
Quick Debug Cheat Sheet for Classification Metrics
Rapid-fire commands and checks for common metric pitfalls. Paste these into your notebook or production script.
Need instant metric breakdown
Immediate action
Run classification_report() with all classes and zero_division=0
Commands
from sklearn.metrics import classification_report; print(classification_report(y_true, y_pred, zero_division=0))
Set threshold to best_thresh in production scoring function.
Metric | Formula | Optimise When | Blind Spot
Accuracy | (TP+TN) / Total | Classes are balanced and all errors cost the same | Completely misleading on imbalanced datasets: a dumb model scores 99%
Precision | TP / (TP+FP) | False alarms are expensive (spam filter, legal review) | Ignores false negatives entirely: a model that rarely predicts 'positive' scores perfectly
Recall | TP / (TP+FN) | Missing a positive is catastrophic (cancer screening, fraud) | Ignores false positives: a model predicting 'positive' for everything scores 100%
F1-Score | 2×(P×R)/(P+R) | You need one balanced number and the dataset is imbalanced | Treats precision and recall equally; use F-beta if you need to weight one more
F-beta Score | (1+β²)×(P×R)/(β²×P+R) | You need recall weighted β times more than precision | β must be tuned by business context, not by grid search
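The F-beta row can be checked against scikit-learn's `fbeta_score`. The sketch reuses the 20-transaction fraud labels from earlier (precision 7/9, recall 0.7) and compares β=2, which leans towards recall, with β=0.5, which leans towards precision.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

actual_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
                 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]
predicted_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
                    0, 0, 0, 1, 0, 1, 1, 1, 0, 0]

p = precision_score(actual_labels, predicted_labels)
r = recall_score(actual_labels, predicted_labels)

f2 = fbeta_score(actual_labels, predicted_labels, beta=2)
f05 = fbeta_score(actual_labels, predicted_labels, beta=0.5)

# Same formula as the table row, written out for beta = 2
beta = 2
manual_f2 = (1 + beta**2) * p * r / (beta**2 * p + r)

print(f"F2={f2:.4f}  manual F2={manual_f2:.4f}  F0.5={f05:.4f}")
```

Because recall is the weaker of the two numbers here, the recall-leaning F2 comes out lower than F0.5.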
Key takeaways
1. A confusion matrix breaks your model's performance into four honest buckets (TP, TN, FP, FN): never accept a single accuracy number without demanding the full matrix first.
2. Precision and recall are in tension: pushing one up almost always pushes the other down. The business context, not the data, decides which direction to lean.
3. On imbalanced datasets (fraud, medical, anomaly detection), accuracy is nearly always the wrong primary metric. Use F1, recall, or a Precision-Recall curve instead.
4. Lowering the classification threshold below 0.5 is the cheapest, fastest way to trade precision for recall on an already-trained model: know this trick before reaching for resampling or retraining.
5. For multi-class, always inspect per-class recall. Macro/weighted averages can mask a class your model is completely ignoring.
Common mistakes to avoid
Reporting accuracy on an imbalanced dataset
Symptom
A model hitting 98% accuracy sounds great until you realise 98% of your data is the majority class and the model just learned to predict it always.
Fix
Always check per-class recall in the classification_report. If recall for the minority class is near zero, accuracy is lying to you. Switch to F1 or recall as your primary metric.
Forgetting that precision and recall are class-specific
Symptom
Beginners call precision_score without specifying pos_label and get the precision for the wrong class, or average it incorrectly.
Fix
Always pass pos_label=1 explicitly for binary tasks, and for multi-class use average='macro' or average='weighted' deliberately; never let it default silently.
Treating the 0.5 threshold as sacred
Symptom
The default prediction threshold of 0.5 is an arbitrary starting point, not a law. On imbalanced datasets it almost always under-detects the minority class.
Fix
Use predict_proba to get raw probabilities, then sweep the threshold and plot the Precision-Recall curve. Choose the threshold that meets your business requirement (e.g. 'recall must be at least 0.80') rather than the one that maximises F1.
Ignoring zero_division parameter in production code
Symptom
sklearn emits UndefinedMetricWarning and falls back to 0.0 when a class has zero predictions, which can break pipelines that treat warnings as errors and hides the fact that the metric was undefined.
Fix
Always pass zero_division=0 (or 1) when calling precision_score, recall_score, f1_score in automated scripts.
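The class-specific mistake is easiest to see with string labels, where there is no class called 1 to fall back on. This is a small sketch with invented labels; the point is that the same predictions give different precision depending on which class you name as positive.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical string-labelled data
y_true = ["legit", "fraud", "fraud", "legit", "fraud", "legit", "legit", "legit"]
y_pred = ["legit", "fraud", "legit", "legit", "fraud", "fraud", "legit", "legit"]

# Name the positive class explicitly instead of relying on a default
fraud_precision = precision_score(y_true, y_pred, pos_label="fraud", zero_division=0)
fraud_recall = recall_score(y_true, y_pred, pos_label="fraud", zero_division=0)
legit_precision = precision_score(y_true, y_pred, pos_label="legit", zero_division=0)

print(f"fraud precision={fraud_precision:.3f} recall={fraud_recall:.3f}")
print(f"legit precision={legit_precision:.3f}")
```

Reporting 0.800 as "the precision" here would describe the legitimate class, not the fraud class the business cares about.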
Interview Questions on This Topic
Q01 of 04 · SENIOR
Your fraud detection model has 99.5% accuracy — the product team is thrilled. Should you be? Walk me through what you'd actually check before celebrating.
ANSWER
No, I'd be suspicious immediately because fraud is typically rare (0.5-2% of transactions). A model that predicts 'legitimate' for every transaction would achieve 98%+ accuracy on a 2% fraud rate. I'd run classification_report(y_true, y_pred) and check recall for the fraud class. If recall is near zero, accuracy is a complete illusion. I'd also check the confusion matrix to see if the model ever predicts fraud. Then I'd look at precision-recall curve to understand trade-offs. Only after verifying per-class recall > 0.5 would I celebrate.
Q02 of 04 · SENIOR
In a binary classifier for cancer detection, would you rather maximise precision or recall, and why? What's the concrete trade-off you're making either way?
ANSWER
I'd optimise for recall because the cost of a false negative (missed cancer) can be a life, while a false positive only causes extra tests and anxiety. The trade-off: maximising recall will lower precision — meaning more healthy people will be wrongly flagged and need follow-up. That's an operational cost (more biopsies, longer wait times). But it's a far lower cost than sending a cancer patient home. In practice, I'd set a minimum recall threshold (e.g., 0.95) and then try to maximise precision within that constraint using the PR curve.
Q03 of 04 · SENIOR
If I asked you to increase recall on your model without retraining it at all, what would you do, and what would you expect to happen to precision?
ANSWER
I'd lower the decision threshold below 0.5. That makes the model flag more samples as positive, so recall goes up (catches more actual positives). But precision will drop because some of those extra positives are false alarms. I'd use the predict_proba output and sweep thresholds, plotting the precision-recall curve to find the threshold that gives the required recall with the minimum acceptable precision. This costs no retraining, just a configuration change in the scoring pipeline, but it is a trade-off rather than a free lunch: the extra recall is paid for in precision.
Q04 of 04 · SENIOR
Explain the difference between macro, weighted, and micro average in multi-class metrics. When would you use each?
ANSWER
Micro average aggregates TP/FP/FN globally across all classes. In single-label multi-class problems it equals accuracy, so frequent classes dominate it. Use it if overall correctness is your goal. Macro average computes per-class metrics then takes an unweighted mean, so every class matters equally regardless of support. Use it if you care about all classes equally and want to detect rare class failures. Weighted average weights each class's metric by its support (number of true instances), so it reflects the model's performance on the data as it naturally exists. Use it when you're reporting to business stakeholders who care about overall performance, but always supplement with per-class recall.
Frequently Asked Questions
01. What is the difference between precision and recall in machine learning?
Precision asks 'of everything the model labelled positive, how many actually were?' and tells you how trustworthy a positive prediction is. Recall asks 'of all the real positives, how many did the model catch?' and tells you how few real positives were missed. They pull in opposite directions: increasing one usually decreases the other. Your business context determines which to prioritise.
02. When should I use F1-score instead of accuracy?
Use F1-score whenever your dataset is imbalanced, meaning one class appears significantly more often than the other. Accuracy on imbalanced data rewards a model for doing nothing (just predicting the majority class). F1-score is the harmonic mean of precision and recall and exposes that failure immediately by returning a near-zero score.
03. Why does sklearn's confusion_matrix use [[TN, FP], [FN, TP]] order instead of [[TP, FP], [FN, TN]]?
sklearn follows the convention where rows represent actual labels and columns represent predicted labels, ordered from class 0 to class 1. So row 0 = actual negatives, row 1 = actual positives, and within each row column 0 = predicted negative, column 1 = predicted positive. This gives [[TN, FP], [FN, TP]]. Always use cm.ravel() to unpack as tn, fp, fn, tp to avoid reading the grid wrong.
04. What is the difference between AUC-ROC and AUC-PR?
AUC-ROC is the area under the curve of True Positive Rate vs False Positive Rate. It can be overly optimistic for imbalanced datasets because the false positive rate stays low naturally when the negative class is large. AUC-PR (Area Under Precision-Recall curve) focuses on the positive class and is much more informative for imbalanced problems like fraud detection or medical diagnosis. For rare positives, always prefer AUC-PR over AUC-ROC.
05. How do I handle multi-label classification metrics?
For multi-label problems (each sample can have multiple labels), treat each label as a binary classification and compute metrics per label. Then average using average='samples' (for instance-level), 'micro', 'macro', or 'weighted' as appropriate. sklearn.metrics provides precision_score, recall_score, f1_score with average='samples' parameter directly.