Logistic Regression predicts a probability between 0 and 1 using the sigmoid function
The linear part computes the log-odds; the sigmoid converts them into a probability between 0 and 1
Coefficients are log-odds ratios — interpretable for regulated industries
Feature scaling is mandatory — unscaled data makes gradient descent crawl
Decision threshold is a business decision, not a fixed 0.5
Accuracy is a trap — always check precision, recall, and confusion matrix
Plain-English First
Imagine a doctor looking at your test results and saying 'there's a 92% chance this is benign.' They're not predicting a number like your height — they're predicting a probability that tips into a yes-or-no answer. Logistic Regression is exactly that: it takes a bunch of measurements, runs them through a special S-shaped curve, and squeezes the result into a probability between 0 and 1. Once that probability crosses a threshold (usually 0.5), the model commits to an answer. It's less like a ruler and more like a confident doctor making a call.
Every day, your email provider quietly decides whether to drop a message into your inbox or your spam folder. Your bank flags a transaction as fraud or lets it through. A hospital algorithm predicts whether a tumour is malignant or benign. All of these are binary decisions — yes or no, 0 or 1 — and Logistic Regression is one of the most reliable, interpretable, and battle-tested tools for making them. It's been doing this job since the 1950s and it's still the first model data scientists reach for when the stakes are high and the explanation matters.
The core problem Logistic Regression solves is one that Linear Regression cannot: predicting a bounded probability. If you used ordinary linear regression to classify emails, nothing stops it from predicting a 'spam probability' of 2.7 or -0.4 — which is meaningless. Logistic Regression wraps its output in a sigmoid function that mathematically constrains every prediction to live between 0 and 1, giving you an actual probability you can act on.
By the end of this article you'll understand not just how to call LogisticRegression().fit() in scikit-learn, but why the sigmoid function exists, what the coefficients are actually telling you about the real world, how to tune the decision threshold for different business goals, and exactly what questions an interviewer will ask you to separate the practitioners from the people who just skimmed a tutorial.
The Sigmoid Function — Why Logistic Regression Uses This Specific Curve
Linear Regression gives you a straight line. That's great for predicting house prices, but terrible for predicting probabilities — because a straight line extends to infinity in both directions and probability must stay between 0 and 1.
The sigmoid function (also called the logistic function, which is where the algorithm gets its name) is the mathematical fix. Its formula is σ(z) = 1 / (1 + e^(-z)). Feed it any real number, whether -1000 or +1000, and it returns a value in the range (0, 1). Large positive inputs push the output close to 1. Large negative inputs push it close to 0. Right at zero, you get exactly 0.5.
The input z is itself a linear combination of your features: z = w₀ + w₁x₁ + w₂x₂ + ... — exactly like Linear Regression. So Logistic Regression is really just Linear Regression with its output passed through the sigmoid. That single design decision makes the output interpretable as a probability, which is the foundation everything else builds on.
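To make the chain from features to probability concrete, here is a minimal sketch with NumPy. The intercept, weights and sample values are hypothetical numbers picked for illustration; they are not learned from any dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen by hand for illustration (not fitted)
intercept = -1.0                    # w0
weights = np.array([0.8, -0.5])     # w1, w2

# One hypothetical sample with two features
sample = np.array([2.0, 1.0])       # x1, x2

# Linear part: z = w0 + w1*x1 + w2*x2
z = intercept + weights @ sample

# Sigmoid turns the score into P(y=1)
probability = sigmoid(z)

print(f"z = {z:.2f}, P(y=1) = {probability:.4f}")
```

With these made-up numbers, z = -1.0 + 0.8·2.0 - 0.5·1.0 = 0.1, so the predicted probability lands just above 0.5 (about 0.525).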
The model learns the weights (w values) by maximising the likelihood that the predicted probabilities match the actual labels in your training data. This process is called Maximum Likelihood Estimation, and it is solved iteratively with gradient descent or a related optimiser (scikit-learn defaults to the L-BFGS solver).
sigmoid_intuition.py (Python)
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
    """The core of logistic regression: maps any real number to (0, 1)."""
    return 1 / (1 + np.exp(-z))
# Create a range of z values to visualise the S-curve
z_values = np.linspace(-10, 10, 300)
probabilities = sigmoid(z_values)
# Annotate key points so the behaviour is obvious
key_points = {
    -5: sigmoid(-5),  # Very likely class 0
    0: sigmoid(0),    # Exactly on the decision boundary
    5: sigmoid(5),    # Very likely class 1
}
print("=== Sigmoid Output at Key Z-Values ===")
for z, prob in key_points.items():
    label = "→ class 1" if prob >= 0.5 else "→ class 0"
    print(f" z = {z:+d} | P(y=1) = {prob:.4f} {label}")
# Plot the S-curve
plt.figure(figsize=(8, 4))
plt.plot(z_values, probabilities, color='steelblue', linewidth=2.5, label='σ(z)')
plt.axhline(y=0.5, color='tomato', linestyle='--', linewidth=1.5, label='Decision boundary (0.5)')
plt.axvline(x=0, color='gray', linestyle=':', linewidth=1.2)
plt.fill_between(z_values, probabilities, 0.5,
                 where=(probabilities >= 0.5), alpha=0.12, color='steelblue', label='Predict class 1')
plt.fill_between(z_values, probabilities, 0.5,
                 where=(probabilities < 0.5), alpha=0.12, color='tomato', label='Predict class 0')
plt.xlabel('z (linear combination of features)')
plt.ylabel('Predicted Probability')
plt.title('The Sigmoid Function — How Logistic Regression Converts Scores to Probabilities')
plt.legend()
plt.tight_layout()
plt.savefig('sigmoid_curve.png', dpi=150)
print("\nPlot saved to sigmoid_curve.png")
Output
=== Sigmoid Output at Key Z-Values ===
z = -5 | P(y=1) = 0.0067 → class 0
z = +0 | P(y=1) = 0.5000 → class 1
z = +5 | P(y=1) = 0.9933 → class 1
Plot saved to sigmoid_curve.png
Why Not Just Round Linear Regression?
Rounding a linear regression output to 0 or 1 destroys the probability information entirely and makes gradient descent behave badly — the loss landscape becomes a step function with no meaningful gradient. The sigmoid preserves a smooth, differentiable transition so the optimizer knows which direction to push the weights.
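A small sketch of the problem, assuming a hypothetical one-dimensional toy dataset: a least-squares line fitted to 0/1 labels drifts outside the [0, 1] range as soon as you move away from the training points, while a sigmoid of a linear score cannot. The logistic weights below are picked by hand purely for comparison; they are not fitted.

```python
import numpy as np

# Hypothetical toy data: small x belongs to class 0, large x to class 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares fit of a straight line: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Evaluate both curves well outside the training range
x_new = np.array([-5.0, 15.0])
linear_pred = slope * x_new + intercept

# Illustrative logistic curve centred on the class boundary at x = 4.5
# (hand-picked weights, an assumption for this sketch)
logistic_pred = sigmoid(2.0 * (x_new - 4.5))

print("Linear 'probabilities'  :", linear_pred.round(3))
print("Logistic probabilities  :", logistic_pred.round(6))
```

The straight line produces one value below 0 and one above 1, which cannot be read as probabilities; the sigmoid output stays strictly inside (0, 1).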
Production Insight
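The sigmoid flattens quickly: its slope σ(z)(1 − σ(z)) approaches zero once |z| exceeds about 5. Log-loss cancels that slope for confidently wrong samples (the gradient with respect to z is σ(z) − y), but confidently correct samples contribute almost nothing, and unscaled features that produce extreme z make the optimisation poorly conditioned and slow.
Standardise features to keep z in the active range (|z| < 4) for most samples.
If your model converges slowly, check the distribution of z. If most values are extreme, increase regularisation or scale better.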
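To put numbers on how fast the sigmoid flattens, the following sketch evaluates its derivative σ(z)(1 − σ(z)) at a few scores. It describes the slope of the sigmoid itself, not the gradient of the full loss, and the chosen z values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Slope of the sigmoid at z: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The slope peaks at z = 0 and shrinks rapidly as |z| grows
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:>4.1f} | slope = {sigmoid_derivative(z):.6f}")
```

The slope is 0.25 at z = 0 and falls below 0.01 by z = 5, which is why keeping most scores in a moderate range helps the optimiser.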
Key Takeaway
The sigmoid maps unbounded z to (0,1), giving us a probability.
The sigmoid flattens at extreme z, so feature scaling keeps scores in the range where learning is fastest.
Z = linear combination → sigmoid = probability — understand this chain.
Training on Real Data — Breast Cancer Classification End-to-End
Theory only sticks when you see it on real data. We'll use scikit-learn's built-in Breast Cancer dataset — 569 tumour samples, each described by 30 numeric features (mean radius, texture, smoothness, etc.), labelled as malignant (0) or benign (1). The goal is to predict the label from the measurements.
There are a few things to get right here that tutorials often skip. First, feature scaling matters enormously for Logistic Regression because gradient descent converges far faster when all features live on a similar scale. If 'mean area' is in the thousands and 'mean fractal dimension' is near 0.05, the loss surface is elongated and training is sluggish. StandardScaler fixes this.
Second, you should always look at your model's coefficients after training. Each coefficient tells you how much the log-odds of the positive class change for a one-unit increase in that feature; after standardisation, one unit means one standard deviation. A large positive coefficient means that feature is a strong predictor of benign; a large negative one means it predicts malignant. That interpretability is exactly why doctors, banks and regulators often prefer Logistic Regression over a black-box neural network: you can explain every decision.
Third, accuracy alone is a dangerous metric for medical data. A model that predicts 'benign' for every sample gets ~63% accuracy on this dataset without learning anything. Always check precision, recall and the confusion matrix.
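The coefficient reading described above (a change in log-odds per unit of a feature) becomes easier to communicate once you exponentiate it into an odds ratio. The sketch below uses hypothetical feature names and coefficient values; with a fitted model you would take them from logistic_model.coef_[0] instead.

```python
import numpy as np

# Hypothetical coefficients for illustration only; a real model would
# supply these via logistic_model.coef_[0] after fitting
coefficients = {"feature_a": 0.8, "feature_b": -1.2, "feature_c": 0.0}

# exp(coefficient) = factor by which the odds of the positive class
# are multiplied for a one-unit increase in that feature
odds_ratios = {name: float(np.exp(coef)) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: coef = {coefficients[name]:+.1f} -> odds x {ratio:.2f}")
```

A ratio above 1 raises the odds of the positive class, a ratio below 1 lowers them, and a coefficient of zero leaves the odds unchanged.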
breast_cancer_logistic.py (Python)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)
# ── 1. Load Data ──────────────────────────────────────────────────────────────
cancer_data = load_breast_cancer()
feature_matrix = cancer_data.data # Shape: (569, 30)
target_labels = cancer_data.target # 0 = malignant, 1 = benign
feature_names = cancer_data.feature_names
print(f"Dataset shape : {feature_matrix.shape}")
print(f"Class balance : {np.bincount(target_labels)} (malignant, benign)")
# ── 2. Train / Test Split ─────────────────────────────────────────────────────
# stratify= ensures both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    feature_matrix, target_labels,
    test_size=0.20,
    random_state=42,
    stratify=target_labels
)
# ── 3. Feature Scaling — critical for gradient-descent-based models ───────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit only on training data!
X_test_scaled = scaler.transform(X_test)  # apply same scale to test
# ── 4. Train the Model ───────────────────────────────────────────────────────
# max_iter=1000 because the default 100 often hits a ConvergenceWarning
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)
# ── 5. Predict & Evaluate ────────────────────────────────────────────────────
y_pred_labels = logistic_model.predict(X_test_scaled)
y_pred_proba = logistic_model.predict_proba(X_test_scaled)[:, 1]  # P(benign)
print("\n=== Confusion Matrix ===")
cm = confusion_matrix(y_test, y_pred_labels)
print(f" True Negatives (Malignant correctly caught) : {cm[0,0]}")
print(f" False Positives (Malignant missed as Benign) : {cm[0,1]}")
print(f" False Negatives (Benign wrongly flagged) : {cm[1,0]}")
print(f" True Positives (Benign correctly caught) : {cm[1,1]}")
print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred_labels,
target_names=['Malignant', 'Benign']))
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score : {roc_auc:.4f}")
# ── 6. Inspect Coefficients: this is where Logistic Regression shines ────────
print("\n=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===")
coef_pairs = sorted(
    zip(feature_names, logistic_model.coef_[0]),
    key=lambda pair: pair[1]
)
for feature_name, coefficient in coef_pairs[:5]:
    print(f" {feature_name:<35} coef = {coefficient:+.4f}")
print("\n=== Top 5 Features Pushing Towards Benign (positive coefficients) ===")
for feature_name, coefficient in coef_pairs[-5:][::-1]:
    print(f" {feature_name:<35} coef = {coefficient:+.4f}")
Output
Dataset shape : (569, 30)
Class balance : [212 357] (malignant, benign)
=== Confusion Matrix ===
True Negatives (Malignant correctly caught) : 40
False Positives (Malignant missed as Benign) : 2
False Negatives (Benign wrongly flagged) : 1
True Positives (Benign correctly caught) : 71
=== Classification Report ===
              precision    recall  f1-score   support
   Malignant       0.98      0.95      0.96        42
      Benign       0.97      0.99      0.98        72
    accuracy                          0.974       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
ROC-AUC Score : 0.9960
=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===
worst concave points coef = -1.7683
mean concave points coef = -1.2418
worst perimeter coef = -1.1892
worst radius coef = -1.0754
mean perimeter coef = -0.8921
=== Top 5 Features Pushing Towards Benign (positive coefficients) ===
worst texture coef = +0.7143
mean texture coef = +0.4821
worst smoothness coef = +0.3902
fractal dimension error coef = +0.2814
smoothness error coef = +0.2301
Watch Out: Fit the Scaler on Training Data Only
Calling scaler.fit_transform() on your entire dataset before splitting leaks test-set statistics into training — a subtle form of data leakage that inflates your reported accuracy. Always fit the scaler on X_train, then use .transform() (not .fit_transform()) on X_test.
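One way to make this mistake hard to commit is to bundle the scaler and the model into a single scikit-learn pipeline, so the scaler is refitted on the training portion of every cross-validation fold. This is a minimal sketch of that pattern, not the only way to avoid leakage; the 5-fold setting is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline fits StandardScaler on the training rows of each fold only,
# then applies that same scaling to the held-out rows
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy  : {scores.mean():.3f}")
```

Because scaling lives inside the pipeline, the same object can later be fitted once on the full training set and used for prediction without any separate transform step.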
Production Insight
Data leakage from scaler inflates accuracy by 5–10% in real deployments — always split first, then scale.
Coefficient interpretation depends on scale — standardised coefficients let you compare feature importance directly.
Monitor feature distribution drift — if a feature's mean shifts significantly, the model's log-odds change even if the coefficient remains constant.
Key Takeaway
Split data first, then fit scaler on training only — never the other way around.
Accuracy lies — confusion matrix and per-class recall tell the real story.
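The majority-class trap mentioned earlier is easy to reproduce. The sketch below fits scikit-learn's DummyClassifier, which always predicts the most frequent class, and scores it on the full dataset (an assumption made for simplicity, since the baseline learns nothing from the features anyway).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
# Recall for the malignant class: share of malignant tumours that were caught
malignant_recall = recall_score(y, y_pred, pos_label=0)

print(f"Baseline accuracy        : {accuracy:.3f}")
print(f"Baseline malignant recall: {malignant_recall:.3f}")
```

The baseline scores roughly 63% accuracy while catching no malignant tumours at all, which is the gap that per-class recall exposes and accuracy hides.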
Tuning the Decision Threshold — When 0.5 Is the Wrong Cut-Off
Most tutorials treat the 0.5 threshold as sacred. It isn't. The threshold is a business decision, not a mathematical constant, and understanding when to move it separates good practitioners from great ones.
Consider the breast cancer case, with malignant treated as the positive class (the opposite of the dataset's 0/1 encoding, where benign = 1). A False Negative (predicting benign when the tumour is actually malignant) sends a patient home without treatment. A False Positive (flagging benign as malignant) means an unnecessary biopsy: uncomfortable, but survivable. These mistakes are not equal. You should tolerate more False Positives to drive False Negatives toward zero, which means lowering the threshold on P(malignant) below 0.5 so the model cries 'malignant' sooner.
Conversely, in a spam filter, a False Positive (blocking a legitimate email) is worse than a False Negative (letting spam through). Here you'd raise the threshold.
The ROC curve plots True Positive Rate against False Positive Rate across every possible threshold. The area under it (AUC-ROC) tells you how well the model separates classes regardless of threshold — it's the metric to optimise during model selection. The Precision-Recall curve is more informative when your classes are heavily imbalanced.
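The code below finds the highest-precision threshold that still achieves at least 99% recall for malignant detection, which is exactly the kind of analysis you'd run before deploying a medical model. In a real project, choose the threshold on a separate validation split; tuning it on the test set, as this demo does, makes the reported recall optimistic.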
threshold_tuning.py (Python)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
import matplotlib.pyplot as plt
# ── Reuse the trained model setup from the previous example ──────────────────
cancer_data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer_data.data, cancer_data.target,
test_size=0.20, random_state=42, stratify=cancer_data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)
# Predicted probabilities for the positive class (benign = 1)
y_proba_benign = logistic_model.predict_proba(X_test_scaled)[:, 1]
# ── Find threshold that maximises recall for MALIGNANT class ─────────────────
# Note: precision_recall_curve works with respect to the positive label.
# We flip the probabilities so 'malignant' becomes the positive class.
y_proba_malignant = 1 - y_proba_benign
y_test_malignant = 1 - y_test # 1 = malignant, 0 = benign (flipped)
precisions, recalls, thresholds = precision_recall_curve(
y_test_malignant, y_proba_malignant
)
# We want recall >= 0.99 with the highest possible precision
high_recall_mask = recalls[:-1] >= 0.99  # exclude last point (no threshold)
candidates = list(zip(
    thresholds[high_recall_mask],
    precisions[:-1][high_recall_mask],
    recalls[:-1][high_recall_mask]
))
print("=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===")
print(f" {'Threshold':>12} {'Precision':>10} {'Recall':>8}")
for thresh, prec, rec in candidates:
    print(f" {thresh:>12.4f} {prec:>10.4f} {rec:>8.4f}")
# Pick the threshold with highest precision among our high-recall candidates
best_threshold, best_precision, best_recall = max(candidates, key=lambda t: t[1])
print(f"\n✔ Best threshold = {best_threshold:.4f}")
print(f" At this threshold — Precision: {best_precision:.4f}, Recall: {best_recall:.4f}")
# ── Apply the chosen threshold and see its real-world impact ─────────────────
# We predict 'malignant' whenever P(malignant) >= best_threshold
y_pred_tuned = (y_proba_malignant >= best_threshold).astype(int)
malignant_actual = np.sum(y_test_malignant == 1)
malignant_caught = np.sum((y_pred_tuned == 1) & (y_test_malignant == 1))
malignant_missed = malignant_actual - malignant_caught
print(f"\n=== Clinical Impact at Tuned Threshold ===")
print(f" Total malignant tumours in test set : {malignant_actual}")
print(f" Correctly flagged (True Positives) : {malignant_caught}")
print(f" Missed (False Negatives) : {malignant_missed} ← the dangerous ones")
# ── ROC Curve ─────────────────────────────────────────────────────────────────
fpr, tpr, roc_thresholds = roc_curve(y_test_malignant, y_proba_malignant)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='steelblue', lw=2, label='ROC Curve')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve — Malignant Detection')
plt.legend()
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
print("\nROC curve saved to roc_curve.png")
Output
=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===
Threshold Precision Recall
0.1823 0.9130 1.0000
0.2041 0.9130 1.0000
0.2289 0.9130 1.0000
✔ Best threshold = 0.1823
At this threshold — Precision: 0.9130, Recall: 1.0000
=== Clinical Impact at Tuned Threshold ===
Total malignant tumours in test set : 42
Correctly flagged (True Positives) : 42
Missed (False Negatives) : 0 ← the dangerous ones
ROC curve saved to roc_curve.png
Pro Tip: AUC-ROC Is Model Quality; Threshold Is Business Policy
Optimise AUC-ROC during model training and cross-validation: it tells you how well the model ranks positive cases above negative ones, independent of any threshold. Then, separately, pick your threshold based on the real-world cost of each type of error. These are two distinct decisions, and conflating them leads to silently sub-optimal deployments.
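The separation between ranking quality and threshold policy can be shown with a few made-up numbers. In this sketch the labels and scores are hypothetical: cubing the scores changes every probability but not their order, so AUC-ROC stays the same while the labels produced by a fixed 0.5 cut-off change.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90])

# A monotonic distortion: every value changes, the ordering does not
distorted = scores ** 3

auc_original = roc_auc_score(y_true, scores)
auc_distorted = roc_auc_score(y_true, distorted)

# Labels at a fixed 0.5 cut-off depend on the actual values, not the order
labels_original = (scores >= 0.5).astype(int)
labels_distorted = (distorted >= 0.5).astype(int)

print(f"AUC original  : {auc_original:.4f}")
print(f"AUC distorted : {auc_distorted:.4f}")
print(f"Flagged at 0.5: {labels_original.sum()} vs {labels_distorted.sum()}")
```

Both score vectors give the same AUC, yet the fixed cut-off flags three samples in one case and only one in the other, so a strong AUC says nothing about whether 0.5 is a sensible operating point.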
Production Insight
AUC-ROC of 0.99 doesn't mean the model is safe for deployment — it measures ranking, not absolute risk at a specific threshold.
The optimal threshold changes with business conditions — re-evaluate it quarterly or whenever the cost matrix shifts.
If you lower the threshold too much, you'll drown your team in false positives — always measure operational cost per false alarm.
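One way to turn the cost of each error type into a threshold is to compute the total misclassification cost at several candidate cut-offs and keep the cheapest. Everything in this sketch is hypothetical: the 50:1 cost ratio, the ten scores and the candidate grid are stand-ins for numbers that would come from your own validation data and stakeholders.

```python
import numpy as np

# Hypothetical costs: a missed malignancy is treated as 50x as costly
# as an unnecessary biopsy (an assumption for this sketch)
COST_FALSE_NEGATIVE = 50.0
COST_FALSE_POSITIVE = 1.0

# Hypothetical validation labels (1 = malignant) and P(malignant) scores
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
p_malignant = np.array([0.97, 0.82, 0.32, 0.41, 0.12, 0.03, 0.63, 0.22, 0.57, 0.14])

def total_cost(threshold):
    """Total cost when flagging every sample with P(malignant) >= threshold."""
    flagged = p_malignant >= threshold
    false_negatives = np.sum((y_true == 1) & ~flagged)
    false_positives = np.sum((y_true == 0) & flagged)
    return COST_FALSE_NEGATIVE * false_negatives + COST_FALSE_POSITIVE * false_positives

thresholds = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]

for t, c in zip(thresholds, costs):
    print(f"threshold = {t:.1f} | total cost = {c:6.1f}")
print(f"Cheapest threshold: {best_threshold:.1f}")
```

On this toy data the cheapest cut-off is 0.3, well below 0.5, because a single missed malignancy outweighs several false alarms. Changing the cost ratio moves the optimum, which is why the threshold should be revisited whenever the costs change.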
Key Takeaway
Threshold is a business decision, not a model parameter — never use 0.5 by default.
Optimise AUC-ROC for model selection, then tune threshold for cost minimisation.
False negatives and false positives have asymmetric costs — your threshold must reflect the real-world stakes.
Maximum Likelihood Estimation and Log-Loss — How Logistic Regression Learns
You've seen the sigmoid and the coefficients. But how does the model actually find those coefficients? The answer is Maximum Likelihood Estimation (MLE). Logistic Regression doesn't minimise squared error (like Linear Regression does) — it maximises the probability of seeing the observed data given the parameters.
Mathematically, MLE finds the weights w that maximise the product of predicted probabilities for each training sample. For a binary classification task, this product is:
L(w) = ∏_i P(y=1 | x_i)^{y_i} · (1 - P(y=1 | x_i))^{(1 - y_i)}, where the product runs over the training samples (x_i, y_i)
Taking the logarithm turns the product into a sum, which is easier to optimise. The negative of that sum is called log-loss (binary cross-entropy). The model uses gradient descent to minimise log-loss. This is why Logistic Regression uses log-loss instead of MSE: log-loss is convex with respect to the weights, which guarantees that gradient descent will find the global optimum.
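As a quick check on the definition, the sketch below computes binary cross-entropy by hand and compares it with sklearn.metrics.log_loss. The four labels and probabilities are hypothetical; the last one is deliberately a confident wrong prediction so its contribution stands out.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted P(y=1); the last prediction is
# confidently wrong (0.99 for a sample whose true label is 0)
y_true = np.array([1, 0, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.99])

# Per-sample loss: -[y * log(p) + (1 - y) * log(1 - p)]
per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

manual = per_sample.mean()
sklearn_value = log_loss(y_true, p_pred)

print("Per-sample losses:", per_sample.round(4))
print(f"Manual mean : {manual:.6f}")
print(f"sklearn     : {sklearn_value:.6f}")
```

The two averages agree, and the single confident mistake contributes about 4.6 to the sum while the three reasonable predictions together contribute less than 1.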
Convexity matters because it means you're never stuck in a local minimum. With MSE and sigmoid, the loss surface has hills and valleys — gradient descent can get trapped. Log-loss is a smooth bowl shape. That's the mathematical guarantee you need for a stable training process.
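The practical consequence of convexity can be sketched with a from-scratch gradient descent: two very different starting points end up at the same weights. The data here is synthetic and hypothetical (generated with a fixed seed and overlapping classes so the optimum is finite), and the learning rate and step count are arbitrary choices for this toy problem, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data with overlapping classes
n = 200
X = rng.normal(size=(n, 2))
true_w = np.array([1.5, -1.0])
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ true_w)))).astype(float)

# Prepend a column of ones so w[0] acts as the intercept
X_b = np.column_stack([np.ones(n), X])

def log_loss_and_grad(w):
    """Mean log-loss and its gradient with respect to the weights."""
    p = 1 / (1 + np.exp(-(X_b @ w)))
    p_safe = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    loss = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
    grad = X_b.T @ (p - y) / n
    return loss, grad

def gradient_descent(w_start, lr=0.5, steps=5000):
    w = w_start.astype(float).copy()
    for _ in range(steps):
        _, grad = log_loss_and_grad(w)
        w -= lr * grad
    return w

# Two very different starting points
w_a = gradient_descent(np.zeros(3))
w_b = gradient_descent(np.array([5.0, -5.0, 5.0]))

loss_a, _ = log_loss_and_grad(w_a)
loss_b, _ = log_loss_and_grad(w_b)
print("Weights from start A:", w_a.round(4), f"loss = {loss_a:.6f}")
print("Weights from start B:", w_b.round(4), f"loss = {loss_b:.6f}")
```

Both runs settle on the same weights and the same loss, which is what a single global minimum looks like in practice.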
In scikit-learn, you don't see this; it's wrapped inside the fit() method. But understanding the loss function is crucial for debugging. LogisticRegression exposes no learning rate: its solvers (lbfgs by default) choose step sizes internally, and tol and max_iter only control when optimisation stops. If the solver fails to converge, check feature scaling first, then raise max_iter or consider a different solver.
log_loss_demo.py (Python)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
# Use the same breast cancer data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train model with different regularisation strengths
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred_train_proba = model.predict_proba(X_train)[:, 1]
    y_pred_test_proba = model.predict_proba(X_test)[:, 1]
    train_loss = log_loss(y_train, y_pred_train_proba)
    test_loss = log_loss(y_test, y_pred_test_proba)
    acc = model.score(X_test, y_test)
    print(f"C={C:>6.2f} | Train log-loss: {train_loss:.4f} | Test log-loss: {test_loss:.4f} | Test Acc: {acc:.4f}")
# Observe: as C increases (less regularisation), train loss decreases, test loss may start increasing (overfitting).
Output
C= 0.01 | Train log-loss: 0.1234 | Test log-loss: 0.1478 | Test Acc: 0.9737
C= 0.10 | Train log-loss: 0.0987 | Test log-loss: 0.1123 | Test Acc: 0.9825
C= 1.00 | Train log-loss: 0.0854 | Test log-loss: 0.0986 | Test Acc: 0.9737
C= 10.00 | Train log-loss: 0.0801 | Test log-loss: 0.0962 | Test Acc: 0.9737
C=100.00 | Train log-loss: 0.0789 | Test log-loss: 0.0960 | Test Acc: 0.9737
The Bowl Analogy for Convex Loss
Convex functions have one global minimum — no local minima to trap you.
MSE applied to a sigmoid produces a non-convex landscape — that's why linear regression + rounding fails.
Scikit-learn's default solver (lbfgs) is a quasi-Newton method that exploits this smooth, convex loss and usually needs far fewer iterations than plain gradient descent.
If your loss curve is jagged or increasing, you might have a bug in feature scaling or a too-high learning rate (not exposed in sklearn's default API).
Production Insight
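Log-loss penalises confident wrong predictions heavily: a 0.99 probability on a wrong label costs −ln(0.01) ≈ 4.6, against roughly 0.01 for the same confidence on a correct label, and the penalty grows without bound as the wrong probability approaches 1. This pushes the model away from overconfident predictions.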
Monitor log-loss on validation set during training — if it plateaus then rises, you're overfitting; if it never drops, check feature scaling or label noise.
For production, log-loss is also a useful monitoring metric — a sudden increase indicates data drift.
Key Takeaway
Log-loss is convex — gradient descent is guaranteed to find the global minimum.
Log-loss penalises confident wrong predictions more than MSE — it enforces calibrated probabilities.
MLE is the reason logistic regression produces well-calibrated probabilities — don't use it for feature selection without regularisation.
Regularisation — L1 (Lasso) and L2 (Ridge) in Logistic Regression
Logistic Regression without regularisation can overfit, especially when you have many features or highly correlated predictors. Regularisation adds a penalty term to the loss function that discourages large coefficients. Scikit-learn's LogisticRegression uses L2 regularisation by default (controlled by the C parameter).
L2 (Ridge) adds the sum of squared coefficients to the loss. It shrinks all coefficients toward zero but rarely makes them exactly zero. Use L2 when you expect all features to contribute some signal, or when features are correlated (it handles multicollinearity gracefully).
L1 (Lasso) adds the sum of the coefficients' absolute values. It can drive some coefficients to exactly zero, performing automatic feature selection. Use L1 when you have many irrelevant features and want a sparse model. The trade-off: L1 can be unstable with highly correlated features, because it might pick one and drop the other arbitrarily.
ElasticNet combines L1 and L2 penalties. In scikit-learn, you can use LogisticRegression with penalty='elasticnet' and solver='saga' (the only solver that supports this penalty), and set the l1_ratio parameter. This gives you the best of both worlds: sparsity from L1 and stability from L2.
The C parameter controls the inverse of regularisation strength. Lower C = more regularisation (simpler model). Tune C via cross-validation — too high C leads to overfitting, too low C underfits. This is the most important hyperparameter to tune for Logistic Regression.
regularisation_compare.py (Python)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Compare L1 and L2 with same C
models = {
    'L2 (default)': LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000),
    'L1 (lasso)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000),
    'ElasticNet (l1_ratio=0.5)': LogisticRegression(penalty='elasticnet', C=1.0, solver='saga', l1_ratio=0.5, max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    nonzero_coefs = np.sum(np.abs(model.coef_) > 1e-10)
    test_acc = model.score(X_test, y_test)
    print(f"{name:20} | Non-zero coefficients: {nonzero_coefs:2d} | Test accuracy: {test_acc:.4f}")
# Output shows L1 produces sparser models (fewer non-zero coefficients).
Output
L2 (default) | Non-zero coefficients: 30 | Test accuracy: 0.9737
L1 (lasso) | Non-zero coefficients: 18 | Test accuracy: 0.9737
ElasticNet (l1_ratio=0.5) | Non-zero coefficients: 22 | Test accuracy: 0.9737
C is Inverse Regularisation — Lower C = Stronger Penalty
Many beginners mistakenly increase C hoping for more regularisation. Remember: C = 1/λ. To reduce overfitting, decrease C. To allow more complex models, increase C. Use GridSearchCV over log-spaced C values (e.g., 0.001 to 1000).
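The log-spaced search suggested above could look like the following sketch. The seven-point grid, the 5-fold split and the roc_auc scoring are illustrative choices rather than settings taken from this article, and the scaler sits inside the pipeline so each fold is scaled on its own training rows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Seven log-spaced candidates from 0.001 to 1000
param_grid = {"clf__C": np.logspace(-3, 3, 7)}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

best_c = search.best_params_["clf__C"]
print(f"Best C            : {best_c}")
print(f"Best mean ROC-AUC : {search.best_score_:.4f}")
```

If the best value sits at either end of the grid, widen the range and search again; a winner in the interior suggests the grid covered the useful region.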
Production Insight
In regulated industries, L1 is often preferred because it produces interpretable models with fewer features — auditors like that.
L2 is safer when you don't know which features are relevant — it keeps all features but limits their impact.
ElasticNet gives you both sparsity and stability, but requires tuning l1_ratio — another hyperparameter to manage.
Key Takeaway
L2 shrinks all coefficients — good for correlated features, no feature selection.
L1 zeroes out coefficients — automatic feature selection, but unstable with collinear data.
C is the dial: lower C = simpler model; tune it via cross-validation.
● Production incident · POST-MORTEM · severity: high
The Cancer Model That Missed a Malignant Tumour Because of a Bad Threshold
Symptom
During a retrospective audit, the oncology team found that 3 out of 100 malignant patients had been incorrectly classified as benign and sent home without biopsy. The model's accuracy was 97% — but the clinical outcome was unacceptable.
Assumption
The team assumed the model was 'good enough' because accuracy was high and AUC-ROC was 0.99. They never questioned the default threshold or the actual cost of each error type.
Root cause
The default probability threshold of 0.5 assumes false positives and false negatives are equally costly. In cancer detection, the cost of a false negative is a patient's life; the cost of a false positive is an unnecessary biopsy. The threshold needed to be lowered to catch more malignancies, sacrificing some precision for recall.
Fix
The team used predict_proba() to get raw probabilities, then tuned the threshold so that recall for malignant cases was above 99.5%. The new threshold of 0.18 meant the model flagged more borderline cases — but the false negative rate dropped to near zero. Precision fell from 98% to 91%, but no malignant tumour was missed.
Key lesson
Never deploy a binary classifier without explicitly setting the decision threshold based on the business cost matrix.
Accuracy is dangerous when classes are imbalanced or costs are asymmetric — always compute confusion matrix and per-class recall.
AUC-ROC tells you the model's ranking quality, not the optimal threshold — that's a separate business decision.
Production debug guide
Run these checks when your logistic regression model behaves unexpectedly in production or during training.
5 entries
Symptom · 01
Scikit-learn ConvergenceWarning appears even at max_iter=1000
→
Fix
Feature scaling is missing or inadequate. Apply StandardScaler; if still failing, try solver='lbfgs' or 'saga'. For very large datasets, increase max_iter or loosen tol (a larger tolerance lets the solver stop earlier).
Symptom · 02
Model achieves high accuracy but low F1 for minority class
→
Fix
Check class balance with np.bincount(y). If imbalanced, use class_weight='balanced' or resample. Also evaluate using precision-recall curve instead of ROC.
Symptom · 03
Coefficients are unreasonably large (e.g., >100)
→
Fix
This indicates perfect separation or extreme multicollinearity. Strengthen L2 regularisation (decrease C) or check for near-constant features. Remove perfectly correlated features.
Symptom · 04
Predicted probabilities are all near 0.5, never close to 0 or 1
→
Fix
Features may not be predictive enough. Check whether the linear combination z has low variance. Add feature interactions or non-linear transformations. Consider model capacity.
Symptom · 05
Training log-loss decreases but test log-loss increases after some iterations
→
Fix
Overfitting — regularisation too weak. Reduce C (increase regularisation strength) or add L1 penalty to perform feature selection. Use cross-validation to tune C.
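The link between perfect separation, C and runaway coefficients from the guide above can be reproduced on a tiny example. The six data points below are hypothetical and perfectly separable by construction; the C values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, perfectly separable 1-D data: negative x is class 0,
# positive x is class 1
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

coef_by_c = {}
for c_value in [0.01, 1.0, 100.0, 10000.0]:
    model = LogisticRegression(C=c_value, max_iter=10000)
    model.fit(X, y)
    coef_by_c[c_value] = abs(model.coef_[0, 0])

for c_value, magnitude in coef_by_c.items():
    print(f"C = {c_value:>8} | |coefficient| = {magnitude:.3f}")
```

The coefficient keeps growing as C increases because nothing in the data limits it; only the penalty does. Decreasing C is what brings it back under control.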
★ Quick Debug Cheat Sheet: Logistic Regression
The three most common logistic regression failures and how to fix them: no theory, just commands.
ConvergenceWarning at default max_iter
Immediate action
Scale features with StandardScaler and retry.
Commands
from sklearn.preprocessing import StandardScaler; scaler = StandardScaler(); X_scaled = scaler.fit_transform(X)
model = LogisticRegression(max_iter=1000, solver='lbfgs'); model.fit(X_scaled, y)
Fix now
If still failing, switch to solver='saga' or increase max_iter to 5000.
Model predicts all samples as the majority class
Immediate action
Check class distribution and set class_weight='balanced'.
Commands
np.bincount(y); # check class counts
model = LogisticRegression(class_weight='balanced'); model.fit(X_scaled, y)
Fix now
If still imbalanced, try SMOTE oversampling or collect more minority data.
Decision boundary is nonlinear but you used logistic regression expecting poor performance
Difficult to audit without post-hoc explainability tools
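The class_weight='balanced' fix from the cheat sheet can be tried on a synthetic imbalanced problem. The dataset below is hypothetical, generated with make_classification at roughly 5% positives; the sample size, feature counts and class separation are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced problem: roughly 5% of samples are positive
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], class_sep=0.8, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Same model twice: once unweighted, once with errors on the rare class
# weighted up in inverse proportion to its frequency
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_balanced = recall_score(y_test, balanced.predict(X_test))

print(f"Minority recall, unweighted: {recall_plain:.3f}")
print(f"Minority recall, balanced  : {recall_balanced:.3f}")
```

Balanced weights typically raise minority-class recall at the price of more false positives, so check precision alongside recall before adopting the weighted model.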
Key takeaways
1
Logistic Regression does not predict a class directly
it predicts a calibrated probability via the sigmoid function, and a threshold converts that probability to a label. The threshold is a business decision, not a model parameter.
2
The coefficients are log-odds ratios
a coefficient of +0.8 on a feature means a one-unit increase in that feature multiplies the odds of the positive class by e^0.8 ≈ 2.23. This interpretability is the primary reason regulated industries still choose Logistic Regression over more powerful models.
3
Always scale your features before training
The solvers that fit Logistic Regression are iterative and gradient-based, so they converge slowly when features have vastly different magnitudes, and the regularisation penalty treats unscaled coefficients unevenly. Fitting the StandardScaler on training data only is non-negotiable; leaking test statistics inflates your metrics and is a common interview red flag.
4
AUC-ROC measures how well the model's probabilities rank positives above negatives across all thresholds
optimise this during model selection. Your chosen decision threshold is then a separate, downstream business decision based on the relative costs of false positives versus false negatives in your specific application.
5
Regularisation is essential when you have many features or worry about overfitting. Tune C via cross-validation. Use L1 for feature selection, L2 for stability, ElasticNet for both.
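Takeaways 3 and 4 fit together in a few lines. This is a sketch under assumptions: it uses scikit-learn's bundled breast cancer dataset as a stand-in, and the split size and random_state are arbitrary. The pipeline guarantees the scaler only ever sees training data, and AUC-ROC is computed from probabilities rather than thresholded labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled dataset chosen for illustration only.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# fit() learns the scaling statistics from X_train only;
# predict_proba() reuses those statistics on X_test, so nothing leaks.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# AUC-ROC takes the positive-class probabilities, not predict() labels.
test_proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, test_proba)
print(f"test AUC-ROC: {auc:.3f}")
```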
Common mistakes to avoid
5 patterns
×
Forgetting to scale features
Symptom
ConvergenceWarning appears even at max_iter=1000, and model accuracy is significantly lower than expected. The loss surface is elongated, causing gradient descent to take many steps.
Fix
Always apply StandardScaler (or MinMaxScaler) to your features before fitting. Remember to fit the scaler on training data only, then transform both train and test sets separately.
×
Using accuracy as the only metric on imbalanced data
Symptom
Model reports 95% accuracy on a fraud dataset where 95% of transactions are legitimate — it learned to predict 'not fraud' for everything. Precision and recall for the minority class are near zero.
Fix
Always compute the confusion matrix plus precision, recall, and F1-score per class. For severe imbalance, use class_weight='balanced' in LogisticRegression() or oversample the minority class using SMOTE.
×
Treating the 0.5 threshold as immovable
Symptom
Deployed model has acceptable accuracy but unacceptable real-world outcomes — e.g., too many missed cancer diagnoses or too many blocked legitimate emails. The cost of errors is asymmetric.
Fix
Use predict_proba() to get raw probabilities, then sweep thresholds using precision_recall_curve() and select the cut-off that minimises your most costly error type for the specific business context.
×
Ignoring multicollinearity among features
Symptom
Coefficients swing wildly between large positive and large negative values even though the model seems to work. Small changes in training data drastically change coefficient estimates.
Fix
Check pairwise correlations and Variance Inflation Factor (VIF). Remove or combine highly correlated features. Use L2 regularisation to stabilise coefficient estimates.
×
Not considering regularisation when number of features is large
Symptom
Model fits training data perfectly (loss near zero) but performs poorly on validation or test data. Coefficients are large in magnitude.
Fix
Use cross-validation to tune the C parameter. Start with a grid of values from 0.001 to 1000. Combine with L1 penalty if feature selection is needed.
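The threshold sweep recommended for the third mistake can be sketched as follows. The data is synthetic and imbalanced (roughly 10% positives), and the business rule, "recall must be at least 0.90, then maximise precision", is a hypothetical example; substitute whatever cost trade-off applies to your application.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = make_pipeline(
    StandardScaler(), LogisticRegression(class_weight="balanced", max_iter=1000)
)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds;
# the final (precision=1, recall=0) point has no threshold attached.
precision, recall, thresholds = precision_recall_curve(y_val, proba)

# Hypothetical rule: keep thresholds with recall >= 0.90, pick the most precise one.
target_recall = 0.90
meets_target = recall[:-1] >= target_recall
best = int(np.argmax(np.where(meets_target, precision[:-1], -1.0)))
chosen_threshold = float(thresholds[best])

y_pred = (proba >= chosen_threshold).astype(int)
achieved_recall = float(y_pred[y_val == 1].mean())
print(f"threshold={chosen_threshold:.3f} recall={achieved_recall:.3f} "
      f"precision={precision[best]:.3f}")
```

Choose the threshold on a validation split, as here, and report final metrics on a separate test set so the chosen cut-off is not tuned to the data it is evaluated on.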
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 05SENIOR
Why does Logistic Regression use log-loss (binary cross-entropy) instead of mean squared error as its loss function?
ANSWER
Interviewers love this one. With a sigmoid output, MSE produces a loss surface that is non-convex in the weights, so gradient descent can stall in flat regions or poor local minima. Log-loss is convex with respect to the weights, so gradient descent with a suitable step size converges to the global minimum. A good answer also mentions that log-loss heavily penalises confident wrong predictions, which is exactly the behaviour you want.
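The "penalises confident wrong predictions" point is easy to see numerically. The sketch below uses two small helper functions (hypothetical names, written for this illustration) to compare the per-sample penalty under each loss when the true label is 1.

```python
import numpy as np

def log_loss_single(y_true, p):
    """Binary cross-entropy for one prediction, where p = P(y = 1)."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def squared_error_single(y_true, p):
    """Squared error between the label and the predicted probability."""
    return (y_true - p) ** 2

y_true = 1
for p in (0.9, 0.5, 0.1, 0.01):
    print(
        f"p={p:<5} log-loss={log_loss_single(y_true, p):.3f}  "
        f"squared error={squared_error_single(y_true, p):.3f}"
    )
```

Squared error can never exceed 1 for a probability, while log-loss grows without bound as the predicted probability of the true class approaches 0.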
Q02 of 05SENIOR
What is the difference between L1 and L2 regularisation in Logistic Regression, and when would you choose each?
ANSWER
L2 (Ridge, the default in scikit-learn) shrinks all coefficients toward zero but rarely to exactly zero — good for multicollinearity. L1 (Lasso) can drive some coefficients to exactly zero, performing automatic feature selection — ideal when you suspect many features are irrelevant. In scikit-learn, control this with the penalty parameter ('l1' or 'l2') and the C parameter (inverse of regularisation strength — lower C = more regularisation).
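A quick way to see the difference is to count exact zeros in the fitted coefficients. This sketch assumes synthetic data with 4 informative features out of 30, a deliberately small C of 0.02, and the 'liblinear' solver, which supports both penalties; all of these are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 30 features, only 4 of which carry signal (illustrative synthetic data).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=4, n_redundant=0, random_state=1
)
X = StandardScaler().fit_transform(X)

# Same regularisation strength, different penalty.
l2_model = LogisticRegression(penalty="l2", C=0.02, solver="liblinear").fit(X, y)
l1_model = LogisticRegression(penalty="l1", C=0.02, solver="liblinear").fit(X, y)

n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
print(f"exact-zero coefficients: L2={n_zero_l2}, L1={n_zero_l1}")
```

Note that not every solver supports L1: in scikit-learn, 'liblinear' and 'saga' do, while the default 'lbfgs' does not.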
Q03 of 05SENIOR
If Logistic Regression outputs a probability of 0.73 for a sample, what does that actually mean mathematically — and what are the underlying log-odds?
ANSWER
This trips people up. The probability 0.73 means the model believes there is a 73% chance of the positive class. The log-odds (logit) is log(0.73 / 0.27) = log(2.70) ≈ 0.995, using the natural logarithm. The log-odds is what the linear part of the model (w₀ + w₁x₁ + ...) is directly computing; the sigmoid then maps it back to a probability. Understanding this chain (linear combination → log-odds → sigmoid → probability) shows you truly understand the model, not just its API.
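The round trip between probability and log-odds takes only the standard library. The function names logit and sigmoid below are written for this sketch.

```python
import math

def logit(p):
    """Log-odds of a probability p (natural logarithm)."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.73
z = logit(p)            # the value the linear part w0 + w1*x1 + ... produces
p_back = sigmoid(z)     # applying the sigmoid recovers the probability

print(f"odds     = {p / (1 - p):.3f}")
print(f"log-odds = {z:.3f}")
print(f"sigmoid(log-odds) = {p_back:.2f}")
```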
Q04 of 05SENIOR
Explain Maximum Likelihood Estimation in the context of Logistic Regression. How does it differ from minimising least squares?
ANSWER
MLE finds the parameters that maximise the likelihood of observing the training data. For logistic regression, that likelihood is the product, over all samples, of the probability the model assigns to each sample's true label. Maximising it is equivalent to minimising log-loss. Least squares is the loss used in Linear Regression, and it coincides with MLE when the errors are normally distributed. For classification, MLE under a Bernoulli model is more appropriate because it directly models the probability distribution of the binary outcome and produces a convex loss function.
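The equivalence between maximising the likelihood and minimising log-loss can be written out in two lines. The notation is an assumption made for this sketch: y_i is the 0/1 label of sample i, and p_i is the predicted probability of the positive class for that sample.

```latex
% Assumed notation: p_i = \sigma(z_i), with z_i = w_0 + w_1 x_{i1} + \dots
\begin{align}
L(w) &= \prod_{i=1}^{n} p_i^{\,y_i}\,(1 - p_i)^{1 - y_i} \\
-\log L(w) &= -\sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]
\end{align}
```

The second line, divided by n, is the mean log-loss. Because the logarithm is monotonic, the weights that maximise L(w) are the same weights that minimise it.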
Q05 of 05SENIOR
How do you handle non-linear decision boundaries with Logistic Regression? What are the trade-offs compared to using a non-linear model like Random Forest?
ANSWER
You can add polynomial features or interaction terms to the input — e.g., x₁², x₁*x₂. This allows the decision boundary to be non-linear in the original feature space. The trade-off: feature engineering is manual and can lead to a combinatorial explosion. Regularisation (L1) helps control overfitting when adding many features. Random Forest handles non-linearity automatically and doesn't require scaling, but is less interpretable and harder to audit in regulated settings.
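A small experiment makes the trade-off concrete. This sketch assumes the make_circles toy dataset, where one class sits inside a ring formed by the other, so no straight line can separate them; degree 2 is chosen because a circle is a quadratic boundary.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Concentric circles: a classic non-linearly-separable toy problem.
X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Plain logistic regression: linear boundary in (x1, x2).
linear_clf = make_pipeline(StandardScaler(), LogisticRegression())

# Adds x1^2, x1*x2, x2^2, so the boundary can curve in the original space.
poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(),
)

linear_acc = linear_clf.fit(X_train, y_train).score(X_test, y_test)
poly_acc = poly_clf.fit(X_train, y_train).score(X_test, y_test)
print(f"linear features: {linear_acc:.2f}   degree-2 features: {poly_acc:.2f}")
```

The model is still linear in its (expanded) inputs, so each coefficient keeps its log-odds interpretation, now attached to terms such as x₁².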
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
Can logistic regression handle multi-class classification problems?
Yes. scikit-learn's LogisticRegression supports multi-class targets out of the box. The two strategies are One-vs-Rest (OvR), which trains one binary classifier per class, and Multinomial (softmax), which optimises a single joint loss across all classes. Older releases select between them with the multi_class parameter (for example multi_class='multinomial' with solver='lbfgs'); recent releases deprecate that parameter and use the multinomial loss by default with solver='lbfgs', so check the documentation for your installed version.
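Both strategies can be tried on the bundled iris dataset (three classes). This sketch deliberately avoids passing multi_class so that it runs across scikit-learn versions; the explicit One-vs-Rest model uses the OneVsRestClassifier wrapper instead. Which strategy the plain estimator uses by default depends on the installed version and solver.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Default multi-class handling of LogisticRegression with the lbfgs solver.
default_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
default_clf.fit(X, y)

# Explicit One-vs-Rest: one binary logistic regression per class.
ovr_clf = make_pipeline(
    StandardScaler(), OneVsRestClassifier(LogisticRegression(max_iter=1000))
)
ovr_clf.fit(X, y)

proba = default_clf.predict_proba(X[:5])
n_binary_models = len(ovr_clf[-1].estimators_)

print("probability matrix shape:", proba.shape)
print("row sums:", np.round(proba.sum(axis=1), 6))
print("binary models inside OvR:", n_binary_models)
```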
02
Why does scikit-learn show a ConvergenceWarning for logistic regression?
It means the solver did not converge within the allowed number of iterations (max_iter, 100 by default). The two most common fixes are: (1) scale your features with StandardScaler, since unscaled data creates an elongated loss surface that takes far more steps to traverse, and (2) increase max_iter to 1000 or higher. If it still doesn't converge, try a different solver such as 'saga' (the default is 'lbfgs').
03
Is logistic regression still useful in the age of deep learning and gradient boosting?
Absolutely — and not just as a baseline. Anywhere a decision needs to be explained to a non-technical stakeholder, audited by a regulator, or deployed in a low-latency environment, Logistic Regression is the right tool. Credit scoring, clinical risk scoring, and legal-domain AI are all areas where its transparency is a hard requirement, not a nice-to-have.
04
What is the difference between predict() and predict_proba() in scikit-learn's LogisticRegression?
predict() returns the class label (0 or 1) using a fixed threshold of 0.5 on the positive-class probability. predict_proba() returns the probabilities for both classes, shaped (n_samples, 2), with columns ordered as in the classes_ attribute; for 0/1 labels the second column is the probability of the positive class. Always use predict_proba() when you need to tune the decision threshold.
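The relationship between the two methods can be verified directly. This is a sketch on the bundled breast cancer dataset; the 0.8 cut-off is a hypothetical stricter threshold chosen only to show the mechanics.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Columns of predict_proba follow the order of classes_.
proba = clf.predict_proba(X)
positive_col = list(clf[-1].classes_).index(1)
p_pos = proba[:, positive_col]

# predict() matches thresholding the positive-class probability at 0.5.
default_labels = clf.predict(X)
manual_labels = (p_pos > 0.5).astype(int)

# A custom threshold needs the probabilities; predict() cannot do this.
strict_labels = (p_pos >= 0.8).astype(int)

print("predict() == (p > 0.5):", np.array_equal(default_labels, manual_labels))
print("positives at 0.5:", int(manual_labels.sum()),
      "| positives at 0.8:", int(strict_labels.sum()))
```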
05
How do I interpret the coefficients of a logistic regression model?
Each coefficient represents the change in log-odds of the positive outcome for a one-unit increase in that feature, holding all other features constant. Exponentiate it to get the odds ratio: e^coef. A coefficient of 0 (odds ratio of 1) means the feature has no effect. The sign indicates direction: positive increases the odds, negative decreases them. In scikit-learn, coefficients are stored in the coef_ attribute.
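Turning coef_ into odds ratios is a one-liner. The sketch below uses the bundled breast cancer dataset as an assumed example; because the features are standardised first, "one unit" here means one standard deviation of the feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
model = LogisticRegression(max_iter=1000).fit(X, data.target)

# coef_ has shape (1, n_features) for a binary problem.
coefs = model.coef_[0]
odds_ratios = np.exp(coefs)

# Show the five features with the largest absolute coefficients.
top = np.argsort(np.abs(coefs))[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} coef={coefs[i]:+.2f}  "
          f"odds ratio={odds_ratios[i]:.2f}")

# Standalone check of the rule: a coefficient of +0.8 multiplies the odds by e^0.8.
example_ratio = float(np.exp(0.8))
print(f"e^0.8 = {example_ratio:.2f}")
```

An odds ratio above 1 raises the odds of the positive class and one below 1 lowers them, matching the sign rule for the raw coefficient.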