Logistic Regression predicts a probability between 0 and 1 using the sigmoid function
The linear part computes the log-odds; the sigmoid converts them into a probability
Coefficients are log-odds ratios — interpretable for regulated industries
Feature scaling is mandatory — unscaled data makes gradient descent crawl
Decision threshold is a business decision, not a fixed 0.5
Accuracy is a trap — always check precision, recall, and confusion matrix
What is Logistic Regression?
Logistic regression is a linear classification algorithm that estimates the probability of a binary outcome by passing a weighted sum of input features through the sigmoid function. Despite its name, it is not a regression algorithm — it solves classification problems by learning a linear decision boundary in feature space, then mapping the raw score to a value between 0 and 1 via the sigmoid curve.
This curve is chosen specifically because it is differentiable, S-shaped, and outputs values interpretable as probabilities, making it ideal for tasks like spam detection, credit default prediction, and medical diagnosis. In practice, logistic regression is often the first model you reach for when you need a fast, interpretable, and well-calibrated classifier on structured data, and it remains a baseline that deep learning models must beat on tabular datasets.
Where logistic regression truly shines is in its transparency and mathematical rigor. It learns by maximizing the likelihood of the observed data under a Bernoulli distribution, which is equivalent to minimizing log-loss (cross-entropy). This optimization is convex, meaning gradient descent will always find the global optimum — no local minima traps.
You can add L1 (Lasso) or L2 (Ridge) regularization to prevent overfitting, with L1 driving irrelevant feature weights to exactly zero, effectively performing feature selection. However, logistic regression fails when the decision boundary is inherently non-linear — for those cases, you need kernel SVMs, random forests, or neural networks.
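It is also sensitive to outliers and to strongly correlated features (multicollinearity), which make coefficients unstable and hard to interpret, so scale your inputs and check for collinearity. It does not require features to be independent of one another; that assumption belongs to Naive Bayes.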
The critical operational insight is that the default 0.5 decision threshold is rarely optimal. In a breast cancer screening model, using 0.5 might miss 3% of malignant tumors that have probabilities of 0.47–0.49 — a catastrophic failure. By tuning the threshold via ROC curves or precision-recall trade-offs, you can prioritize recall (catching all cancers) over precision, accepting more false positives to save lives.
This threshold tuning is a production skill that separates junior from senior practitioners: you don't just train a model, you align its decision rule with the real-world cost of errors. Logistic regression is deployed at scale in systems like credit scoring at FICO, ad click prediction at Google, and clinical risk calculators at major hospitals — precisely because it is fast to train, easy to debug, and its probability estimates can be recalibrated for different operational thresholds.
Plain-English First
Imagine a doctor looking at your test results and saying 'there's a 92% chance this is benign.' They're not predicting a number like your height — they're predicting a probability that tips into a yes-or-no answer. Logistic Regression is exactly that: it takes a bunch of measurements, runs them through a special S-shaped curve, and squeezes the result into a probability between 0 and 1. Once that probability crosses a threshold (usually 0.5), the model commits to an answer. It's less like a ruler and more like a confident doctor making a call.
Every day, your email provider quietly decides whether to drop a message into your inbox or your spam folder. Your bank flags a transaction as fraud or lets it through. A hospital algorithm predicts whether a tumour is malignant or benign. All of these are binary decisions — yes or no, 0 or 1 — and Logistic Regression is one of the most reliable, interpretable, and battle-tested tools for making them. It's been doing this job since the 1950s and it's still the first model data scientists reach for when the stakes are high and the explanation matters.
The core problem Logistic Regression solves is one that Linear Regression cannot: predicting a bounded probability. If you used ordinary linear regression to classify emails, nothing stops it from predicting a 'spam probability' of 2.7 or -0.4 — which is meaningless. Logistic Regression wraps its output in a sigmoid function that mathematically constrains every prediction to live between 0 and 1, giving you an actual probability you can act on.
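The out-of-range problem is easy to reproduce. The sketch below fits a least-squares line to a tiny made-up dataset (the `x` and `y` values are assumptions for illustration) and evaluates it far from the training range. The sigmoid applied at the end only shows the squashing effect; it is not a fitted logistic model.

```python
import numpy as np

# Toy data (assumed): one feature, label flips from 0 to 1 around x = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least squares: fit a straight line y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)


def linear_score(value):
    """Raw output of the fitted line; nothing keeps it inside [0, 1]."""
    return slope * value + intercept


def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


for value in (-5.0, 5.0, 20.0):
    raw = linear_score(value)
    print(f"x = {value:+6.1f} | linear output = {raw:+.3f} | sigmoid(output) = {sigmoid(raw):.3f}")
```

The linear output for x = 20 is well above 1 and the output for x = -5 is below 0, while the sigmoid of either value stays strictly between 0 and 1.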
By the end of this article you'll understand not just how to call LogisticRegression().fit() in scikit-learn, but why the sigmoid function exists, what the coefficients are actually telling you about the real world, how to tune the decision threshold for different business goals, and exactly what questions an interviewer will ask you to separate the practitioners from the people who just skimmed a tutorial.
Why Logistic Regression Is a Linear Classifier, Not a Regression
Logistic regression predicts the probability that an input belongs to a binary class by passing a linear combination of features through the logistic (sigmoid) function. The core mechanic: compute z = w·x + b, then output σ(z) = 1 / (1 + e⁻ᶻ), which squashes any real number into a (0,1) probability. Despite the name, it's a classification algorithm: the 'regression' refers to fitting a linear model for the log-odds, which are continuous, rather than predicting a continuous target.
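As a minimal sketch of that mechanic, here is the forward pass for one sample. The weights, bias and feature values are hypothetical numbers chosen for illustration, not taken from any trained model.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters and one hypothetical input sample
weights = np.array([0.8, -1.2, 0.5])
bias = -0.3
features = np.array([1.5, 0.4, 2.0])

z = np.dot(weights, features) + bias      # linear score (the log-odds)
probability = sigmoid(z)                  # P(y = 1 | features)
predicted_class = int(probability >= 0.5) # hard label at the 0.5 threshold

print(f"z = {z:.2f} | P(y=1) = {probability:.4f} | class = {predicted_class}")
```

Here z works out to 1.42, which the sigmoid maps to a probability of roughly 0.81, so the sample is labelled class 1 at the 0.5 threshold.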
Training maximizes log-likelihood via gradient descent, not least squares. The loss function is cross-entropy: -[y log(ŷ) + (1-y) log(1-ŷ)]. This penalizes confident wrong predictions heavily — a 0.97 probability on a false positive costs far more than 0.51. The decision threshold is a separate hyperparameter; default 0.5 is rarely optimal. In production, you tune this threshold using precision-recall or ROC curves, not accuracy alone.
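To make the penalty concrete, the sketch below evaluates the cross-entropy formula for two false positives, one confident (0.97) and one hesitant (0.51). The helper name `binary_cross_entropy` is my own; scikit-learn's `log_loss` computes the same quantity averaged over samples.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob):
    """Per-sample log-loss: -[y log(p) + (1 - y) log(1 - p)]."""
    y_prob = np.clip(y_prob, 1e-15, 1 - 1e-15)  # guard against log(0)
    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))


# The true label is 0 in both cases, so both are false positives at threshold 0.5
loss_confident = binary_cross_entropy(0, 0.97)  # -ln(0.03)
loss_hesitant = binary_cross_entropy(0, 0.51)   # -ln(0.49)

print(f"Loss at p = 0.97: {loss_confident:.4f}")
print(f"Loss at p = 0.51: {loss_hesitant:.4f}")
print(f"Ratio           : {loss_confident / loss_hesitant:.1f}x")
```

The confident mistake costs about 3.51 against about 0.71 for the hesitant one, close to five times as much for the same wrong label.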
Use logistic regression when you need interpretable probabilities, fast inference, or a strong baseline. It's the go-to for binary classification when the classes are roughly separable by a linear boundary: spam detection, churn prediction, medical diagnosis. It scales to millions of features with L1/L2 regularization and trains in minutes on a single machine. For non-linear boundaries, add feature crosses or kernel approximations, but expect tree ensembles or deep nets to pull ahead on large datasets with complex feature interactions.
Threshold Is Not 0.5 by Default
A 0.5 threshold assumes equal cost of false positives and false negatives. In cancer screening, lowering threshold to 0.3 catches 3% more cancers at the cost of 5% more false alarms — a trade-off you must set per business need.
Production Insight
Teams deploying logistic regression for fraud detection often leave the default 0.5 threshold and miss a large share of fraud cases because the fraud class is rare (around 1% of transactions).
Symptom: high accuracy (99%) but recall below 50%. In the extreme case the model predicts 'not fraud' for everything.
Rule: always calibrate the threshold on validation data using the actual cost ratio of false negatives to false positives.
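The symptom is easy to reproduce without training anything. The sketch below uses synthetic labels (an assumed 1% fraud rate across 10,000 transactions) and a degenerate "model" that never predicts fraud.

```python
import numpy as np

# Synthetic labels (assumed): 10,000 transactions, 100 of them fraud (1%)
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1

# A degenerate "model" that predicts 'not fraud' for every transaction
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"Accuracy : {accuracy:.2%}")  # looks excellent
print(f"Recall   : {recall:.2%}")    # catches no fraud at all
```

Accuracy comes out at 99% while recall on the fraud class is zero, which is why accuracy alone cannot be the acceptance metric on imbalanced data.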
Key Takeaway
Logistic regression outputs probabilities, not hard classes — the threshold is your business decision.
It assumes linear decision boundaries; feature engineering (crosses, polynomials) is mandatory for non-linear problems.
Regularization (L1/L2) is not optional — without it, high-dimensional sparse features overfit instantly.
Figure: Logistic Regression: Threshold & Cancer Detection
The Sigmoid Function — Why Logistic Regression Uses This Specific Curve
Linear Regression gives you a straight line. That's great for predicting house prices, but terrible for predicting probabilities — because a straight line extends to infinity in both directions and probability must stay between 0 and 1.
The sigmoid function (also called the logistic function, which is where the algorithm gets its name) is the mathematical fix. Its formula is σ(z) = 1 / (1 + e^(-z)). Feed it any real number — whether it's -1000 or +1000 — and it maps the output to the range (0, 1). Large positive inputs push the output close to 1. Large negative inputs push it close to 0. Right at zero, you get exactly 0.5.
The input z is itself a linear combination of your features: z = w₀ + w₁x₁ + w₂x₂ + ... — exactly like Linear Regression. So Logistic Regression is really just Linear Regression with its output passed through the sigmoid. That single design decision makes the output interpretable as a probability, which is the foundation everything else builds on.
The model learns the weights (w values) by maximising the likelihood that the predicted probabilities match the actual labels in your training data — a process called Maximum Likelihood Estimation, optimised via gradient descent.
sigmoid_intuition.py (Python)
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z):
    """The core of logistic regression: maps any real number to (0, 1)."""
    return 1 / (1 + np.exp(-z))


# Create a range of z values to visualise the S-curve
z_values = np.linspace(-10, 10, 300)
probabilities = sigmoid(z_values)

# Annotate key points so the behaviour is obvious
key_points = {
    -5: sigmoid(-5),  # Very likely class 0
    0: sigmoid(0),    # Exactly on the decision boundary
    5: sigmoid(5),    # Very likely class 1
}

print("=== Sigmoid Output at Key Z-Values ===")
for z, prob in key_points.items():
    label = "→ class 1" if prob >= 0.5 else "→ class 0"
    print(f" z = {z:+d} | P(y=1) = {prob:.4f} {label}")

# Plot the S-curve
plt.figure(figsize=(8, 4))
plt.plot(z_values, probabilities, color='steelblue', linewidth=2.5, label='σ(z)')
plt.axhline(y=0.5, color='tomato', linestyle='--', linewidth=1.5, label='Decision boundary (0.5)')
plt.axvline(x=0, color='gray', linestyle=':', linewidth=1.2)
plt.fill_between(z_values, probabilities, 0.5,
                 where=(probabilities >= 0.5), alpha=0.12, color='steelblue', label='Predict class 1')
plt.fill_between(z_values, probabilities, 0.5,
                 where=(probabilities < 0.5), alpha=0.12, color='tomato', label='Predict class 0')
plt.xlabel('z (linear combination of features)')
plt.ylabel('Predicted Probability')
plt.title('The Sigmoid Function — How Logistic Regression Converts Scores to Probabilities')
plt.legend()
plt.tight_layout()
plt.savefig('sigmoid_curve.png', dpi=150)
print("\nPlot saved to sigmoid_curve.png")
Output
=== Sigmoid Output at Key Z-Values ===
z = -5 | P(y=1) = 0.0067 → class 0
z = +0 | P(y=1) = 0.5000 → class 1
z = +5 | P(y=1) = 0.9933 → class 1
Plot saved to sigmoid_curve.png
Why Not Just Round Linear Regression?
Rounding a linear regression output to 0 or 1 destroys the probability information entirely and makes gradient descent behave badly — the loss landscape becomes a step function with no meaningful gradient. The sigmoid preserves a smooth, differentiable transition so the optimizer knows which direction to push the weights.
Production Insight
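The sigmoid's slope, σ(z)(1 − σ(z)), approaches zero once |z| exceeds about 5. With log-loss, the gradient with respect to z is σ(z) − y, so confidently wrong samples still produce a strong update; saturation mainly stalls training when a squared-error loss is used or when most samples are already confidently correct.
Standardise features to keep z in the active range (|z| < 4) for most samples; this also conditions the loss surface so the solver converges in fewer iterations.
If your model converges slowly, check the distribution of z: if most values are extreme, increase regularisation or scale the features better.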
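Saturation can be checked numerically. The sketch below prints the sigmoid's slope at a few z values next to the gradient of log-loss with respect to z, which is σ(z) − y (a standard identity). The function names are my own.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_slope(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def log_loss_grad_wrt_z(z, y):
    """Gradient of binary cross-entropy with respect to z: sigma(z) - y."""
    return sigmoid(z) - y


print(f"{'z':>5} {'slope':>10} {'grad (y=1)':>12} {'grad (y=0)':>12}")
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"{z:>5.1f} {sigmoid_slope(z):>10.5f} "
          f"{log_loss_grad_wrt_z(z, 1):>12.5f} {log_loss_grad_wrt_z(z, 0):>12.5f}")
```

At z = 5 the slope has dropped below 0.01. The log-loss gradient is also near zero when the label agrees with the prediction (y = 1), but stays close to 1 when the prediction is confidently wrong (y = 0).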
Key Takeaway
The sigmoid maps unbounded z to (0,1), giving us a probability.
The sigmoid's slope flattens at extreme z, and unscaled features slow convergence: feature scaling keeps training efficient.
Z = linear combination → sigmoid = probability — understand this chain.
Training on Real Data — Breast Cancer Classification End-to-End
Theory only sticks when you see it on real data. We'll use scikit-learn's built-in Breast Cancer dataset — 569 tumour samples, each described by 30 numeric features (mean radius, texture, smoothness, etc.), labelled as malignant (0) or benign (1). The goal is to predict the label from the measurements.
There are a few things to get right here that tutorials often skip. First, feature scaling matters enormously for Logistic Regression because gradient descent converges far faster when all features live on a similar scale. If 'mean area' is in the thousands and 'mean fractal dimension' is near 0.05, the loss surface is elongated and training is sluggish. StandardScaler fixes this.
Second, you should always look at your model's coefficients after training. Each coefficient tells you how much the log-odds of the positive class change for a one-unit increase in that feature. A large positive coefficient means that feature is a strong predictor of benign; a large negative one means it predicts malignant. That interpretability is exactly why doctors, banks and regulators often prefer Logistic Regression over a black-box neural network — you can explain every decision.
Third, accuracy alone is a dangerous metric for medical data. A model that predicts 'benign' for every sample gets ~63% accuracy on this dataset without learning anything. Always check precision, recall and the confusion matrix.
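The ~63% figure follows directly from the class counts. The sketch below rebuilds a label vector from the dataset's 212 malignant and 357 benign samples and scores an "always benign" predictor. No model is trained.

```python
import numpy as np

# Class counts of the Breast Cancer dataset: 212 malignant (0), 357 benign (1)
labels = np.array([0] * 212 + [1] * 357)

# A "model" that answers benign (1) for every tumour
always_benign = np.ones_like(labels)

accuracy = np.mean(always_benign == labels)  # 357 / 569
malignant_caught = np.sum((always_benign == 0) & (labels == 0))
malignant_recall = malignant_caught / np.sum(labels == 0)

print(f"Accuracy         : {accuracy:.4f}")
print(f"Malignant recall : {malignant_recall:.4f}")
```

Accuracy lands at about 0.627 while recall on the malignant class is zero: every cancer is missed, and accuracy alone would never tell you.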
breast_cancer_logistic.py (Python)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

# ── 1. Load Data ──────────────────────────────────────────────────────────────
cancer_data = load_breast_cancer()
feature_matrix = cancer_data.data    # Shape: (569, 30)
target_labels = cancer_data.target   # 0 = malignant, 1 = benign
feature_names = cancer_data.feature_names
print(f"Dataset shape : {feature_matrix.shape}")
print(f"Class balance : {np.bincount(target_labels)} (malignant, benign)")

# ── 2. Train / Test Split ─────────────────────────────────────────────────────
# stratify= ensures both splits keep the same class ratio
(X_train, X_test,
 y_train, y_test) = train_test_split(
    feature_matrix, target_labels,
    test_size=0.20,
    random_state=42,
    stratify=target_labels
)

# ── 3. Feature Scaling: critical for gradient-descent-based models ────────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit only on training data!
X_test_scaled = scaler.transform(X_test)        # apply same scale to test

# ── 4. Train the Model ───────────────────────────────────────────────────────
# max_iter=1000 because the default 100 often hits a ConvergenceWarning
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

# ── 5. Predict & Evaluate ────────────────────────────────────────────────────
y_pred_labels = logistic_model.predict(X_test_scaled)
y_pred_proba = logistic_model.predict_proba(X_test_scaled)[:, 1]  # P(benign)

print("\n=== Confusion Matrix ===")
cm = confusion_matrix(y_test, y_pred_labels)
print(f" True Negatives (Malignant correctly caught) : {cm[0,0]}")
print(f" False Positives (Malignant missed as Benign) : {cm[0,1]}")
print(f" False Negatives (Benign wrongly flagged) : {cm[1,0]}")
print(f" True Positives (Benign correctly caught) : {cm[1,1]}")
print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred_labels,
                            target_names=['Malignant', 'Benign']))
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score : {roc_auc:.4f}")

# ── 6. Inspect Coefficients: this is where Logistic Regression shines ────────
print("\n=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===")
coef_pairs = sorted(
    zip(feature_names, logistic_model.coef_[0]),
    key=lambda pair: pair[1]
)
for feature_name, coefficient in coef_pairs[:5]:
    print(f" {feature_name:<35} coef = {coefficient:+.4f}")
print("\n=== Top 5 Features Pushing Towards Benign (positive coefficients) ===")
for feature_name, coefficient in coef_pairs[-5:][::-1]:
    print(f" {feature_name:<35} coef = {coefficient:+.4f}")
Output
Dataset shape : (569, 30)
Class balance : [212 357] (malignant, benign)
=== Confusion Matrix ===
True Negatives (Malignant correctly caught) : 40
False Positives (Malignant missed as Benign) : 2
False Negatives (Benign wrongly flagged) : 1
True Positives (Benign correctly caught) : 71
=== Classification Report ===
precision recall f1-score support
Malignant 0.98 0.95 0.96 42
Benign 0.97 0.99 0.98 72
accuracy 0.974 114
macro avg 0.97 0.97 0.97 114
weighted avg 0.97 0.97 0.97 114
ROC-AUC Score : 0.9960
=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===
worst concave points coef = -1.7683
mean concave points coef = -1.2418
worst perimeter coef = -1.1892
worst radius coef = -1.0754
mean perimeter coef = -0.8921
=== Top 5 Features Pushing Towards Benign (positive coefficients) ===
worst texture coef = +0.7143
mean texture coef = +0.4821
worst smoothness coef = +0.3902
fractal dimension error coef = +0.2814
smoothness error coef = +0.2301
Watch Out: Fit the Scaler on Training Data Only
Calling scaler.fit_transform() on your entire dataset before splitting leaks test-set statistics into training — a subtle form of data leakage that inflates your reported accuracy. Always fit the scaler on X_train, then use .transform() (not .fit_transform()) on X_test.
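One way to make this mistake hard to commit is to wrap the scaler and the model in a scikit-learn pipeline. In the sketch below, `cross_val_score` clones and refits the whole pipeline on each training fold, so the scaler never sees the held-out fold. The 5-fold setup and `max_iter=1000` are illustrative choices, not requirements.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and classifier travel together: fit() fits both, predict() only transforms
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each fold refits the scaler on that fold's training portion only
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print(f"Fold accuracies : {scores.round(4)}")
print(f"Mean accuracy   : {scores.mean():.4f}")
```

The same pipeline object can later be fitted on the full training set and shipped as a single artefact, so the serving path applies exactly the scaling that training used.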
Production Insight
Fitting the scaler before splitting leaks test-set statistics and can inflate reported accuracy; the size of the effect depends on the dataset, so always split first, then scale.
Coefficient interpretation depends on scale — standardised coefficients let you compare feature importance directly.
Monitor feature distribution drift — if a feature's mean shifts significantly, the model's log-odds change even if the coefficient remains constant.
Key Takeaway
Split data first, then fit scaler on training only — never the other way around.
Accuracy lies — confusion matrix and per-class recall tell the real story.
Tuning the Decision Threshold — When 0.5 Is the Wrong Cut-Off
Most tutorials treat the 0.5 threshold as sacred. It isn't. The threshold is a business decision, not a mathematical constant, and understanding when to move it separates good practitioners from great ones.
Consider the breast cancer case: a False Negative (predicting benign when the tumour is actually malignant) sends a patient home without treatment. A False Positive (flagging benign as malignant) means an unnecessary biopsy — uncomfortable, but survivable. These mistakes are not equal. You should tolerate more False Positives to drive False Negatives toward zero, which means lowering your threshold below 0.5 so the model cries 'malignant' sooner.
Conversely, in a spam filter, a False Positive (blocking a legitimate email) is worse than a False Negative (letting spam through). Here you'd raise the threshold.
The ROC curve plots True Positive Rate against False Positive Rate across every possible threshold. The area under it (AUC-ROC) tells you how well the model separates classes regardless of threshold — it's the metric to optimise during model selection. The Precision-Recall curve is more informative when your classes are heavily imbalanced.
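The code below shows how to find the highest-precision threshold that still achieves at least 99% recall for malignant detection, exactly the kind of analysis you'd run before deploying a medical model. For brevity it searches thresholds on the test split; in practice, choose the threshold on a separate validation split and report final metrics on untouched test data.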
threshold_tuning.py (Python)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
import matplotlib.pyplot as plt

# ── Reuse the trained model setup from the previous example ──────────────────
cancer_data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_data.data, cancer_data.target,
    test_size=0.20, random_state=42, stratify=cancer_data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

# Predicted probabilities for the positive class (benign = 1)
y_proba_benign = logistic_model.predict_proba(X_test_scaled)[:, 1]

# ── Find threshold that maximises recall for MALIGNANT class ─────────────────
# Note: precision_recall_curve works with respect to the positive label.
# We flip the probabilities so 'malignant' becomes the positive class.
y_proba_malignant = 1 - y_proba_benign
y_test_malignant = 1 - y_test  # 1 = malignant, 0 = benign (flipped)
precisions, recalls, thresholds = precision_recall_curve(
    y_test_malignant, y_proba_malignant
)

# We want recall >= 0.99 with the highest possible precision
high_recall_mask = recalls[:-1] >= 0.99  # exclude last point (no threshold)
candidates = list(zip(
    thresholds[high_recall_mask],
    precisions[:-1][high_recall_mask],
    recalls[:-1][high_recall_mask]
))
print("=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===")
print(f" {'Threshold':>12} {'Precision':>10} {'Recall':>8}")
for thresh, prec, rec in candidates:
    print(f" {thresh:>12.4f} {prec:>10.4f} {rec:>8.4f}")

# Pick the threshold with highest precision among our high-recall candidates
best_threshold, best_precision, best_recall = max(candidates, key=lambda t: t[1])
print(f"\n✔ Best threshold = {best_threshold:.4f}")
print(f" At this threshold — Precision: {best_precision:.4f}, Recall: {best_recall:.4f}")

# ── Apply the chosen threshold and see its real-world impact ─────────────────
# We predict 'malignant' whenever P(malignant) >= best_threshold
y_pred_tuned = (y_proba_malignant >= best_threshold).astype(int)
malignant_actual = np.sum(y_test_malignant == 1)
malignant_caught = np.sum((y_pred_tuned == 1) & (y_test_malignant == 1))
malignant_missed = malignant_actual - malignant_caught
print(f"\n=== Clinical Impact at Tuned Threshold ===")
print(f" Total malignant tumours in test set : {malignant_actual}")
print(f" Correctly flagged (True Positives) : {malignant_caught}")
print(f" Missed (False Negatives) : {malignant_missed} ← the dangerous ones")

# ── ROC Curve ─────────────────────────────────────────────────────────────────
fpr, tpr, roc_thresholds = roc_curve(y_test_malignant, y_proba_malignant)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='steelblue', lw=2, label='ROC Curve')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve — Malignant Detection')
plt.legend()
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
print("\nROC curve saved to roc_curve.png")
Output
=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===
Threshold Precision Recall
0.1823 0.9130 1.0000
0.2041 0.9130 1.0000
0.2289 0.9130 1.0000
✔ Best threshold = 0.1823
At this threshold — Precision: 0.9130, Recall: 1.0000
=== Clinical Impact at Tuned Threshold ===
Total malignant tumours in test set : 42
Correctly flagged (True Positives) : 42
Missed (False Negatives) : 0 ← the dangerous ones
ROC curve saved to roc_curve.png
Pro Tip: AUC-ROC Is Model Quality; Threshold Is Business Policy
Optimise AUC-ROC during model selection and cross-validation: it tells you how well the model ranks positive cases above negative ones, independent of any threshold. Then, separately, pick your threshold based on the real-world cost of each type of error. These are two distinct decisions, and conflating them leads to silently sub-optimal deployments.
Production Insight
AUC-ROC of 0.99 doesn't mean the model is safe for deployment — it measures ranking, not absolute risk at a specific threshold.
The optimal threshold changes with business conditions — re-evaluate it quarterly or whenever the cost matrix shifts.
If you lower the threshold too much, you'll drown your team in false positives — always measure operational cost per false alarm.
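Measuring operational cost can be as simple as pricing each error type and sweeping the threshold. The sketch below does that on a handful of made-up validation labels and probabilities. The 10:1 cost ratio, the data and the helper name `expected_cost` are all assumptions for illustration; plug in your own validation set and cost figures.

```python
import numpy as np


def expected_cost(y_true, y_prob, threshold, cost_fn, cost_fp):
    """Total cost of the errors made when predicting positive at y_prob >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    false_negatives = np.sum((y_pred == 0) & (y_true == 1))
    false_positives = np.sum((y_pred == 1) & (y_true == 0))
    return cost_fn * false_negatives + cost_fp * false_positives


# Hypothetical validation labels and predicted probabilities of the positive class
y_val = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
p_val = np.array([0.92, 0.71, 0.44, 0.28, 0.62, 0.33, 0.21, 0.14, 0.08, 0.03])

COST_FN, COST_FP = 10.0, 1.0  # assumed: a miss costs ten times a false alarm

thresholds = np.round(np.arange(0.05, 1.00, 0.05), 2)
costs = [expected_cost(y_val, p_val, t, COST_FN, COST_FP) for t in thresholds]

best_threshold = thresholds[int(np.argmin(costs))]
cost_at_half = expected_cost(y_val, p_val, 0.5, COST_FN, COST_FP)

print(f"Cost at default 0.50 threshold : {cost_at_half:.0f}")
print(f"Cheapest threshold             : {best_threshold:.2f} (cost {min(costs):.0f})")
```

On this toy data the cheapest threshold is 0.25 with a total cost of 2, against a cost of 21 at the default 0.5. When the cost ratio changes, rerunning the sweep is all it takes to move the operating point.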
Key Takeaway
Threshold is a business decision, not a model parameter — never use 0.5 by default.
Optimise AUC-ROC for model selection, then tune threshold for cost minimisation.
False negatives and false positives have asymmetric costs — your threshold must reflect the real-world stakes.
Maximum Likelihood Estimation and Log-Loss — How Logistic Regression Learns
You've seen the sigmoid and the coefficients. But how does the model actually find those coefficients? The answer is Maximum Likelihood Estimation (MLE). Logistic Regression doesn't minimise squared error (like Linear Regression does) — it maximises the probability of seeing the observed data given the parameters.
Mathematically, MLE finds the weights w that maximise the product of predicted probabilities for each training sample. For a binary classification task, this product is:
L(w) = ∏_{i=1}^{n} P(y=1 | x_i)^{y_i} · (1 − P(y=1 | x_i))^{1 − y_i}
Taking the logarithm turns the product into a sum, which is easier to optimise. The negative of that sum is called log-loss (binary cross-entropy). The model uses gradient descent to minimise log-loss. This is why Logistic Regression uses log-loss instead of MSE: log-loss is convex with respect to the weights, which guarantees that gradient descent will find the global optimum.
Convexity matters because it means you're never stuck in a local minimum. With MSE and sigmoid, the loss surface has hills and valleys — gradient descent can get trapped. Log-loss is a smooth bowl shape. That's the mathematical guarantee you need for a stable training process.
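To see the optimisation itself, here is a minimal from-scratch sketch of batch gradient descent on mean log-loss. It is not how scikit-learn's solvers work internally; the synthetic data, the learning rate of 0.1 and the 500 steps are assumptions chosen so the loop is easy to follow.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mean_log_loss(X, y, w, b):
    """Average binary cross-entropy of the current weights."""
    p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# Synthetic, overlapping two-class data (assumed for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(2)
b = 0.0
learning_rate = 0.1
losses = []

for step in range(500):
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)  # gradient of mean log-loss w.r.t. weights
    grad_b = np.mean(p - y)          # gradient w.r.t. bias
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    losses.append(mean_log_loss(X, y, w, b))

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"Loss after step 1   : {losses[0]:.4f}")
print(f"Loss after step 500 : {losses[-1]:.4f}")
print(f"Training accuracy   : {accuracy:.3f}")
```

The gradient has the tidy form Xᵀ(p − y)/n because the sigmoid's derivative cancels against the logarithm in the loss. That cancellation is one reason the log-loss and sigmoid pairing is so convenient to optimise.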
In scikit-learn, you don't see this: it's wrapped inside the fit() method. But understanding the loss function is crucial for debugging. LogisticRegression does not expose a learning rate; its solvers (lbfgs by default) choose step sizes internally, while tol and max_iter only control when optimisation stops. If the solver fails to converge, scale your features, raise max_iter, or try a different solver.
log_loss_demo.py (Python)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss

# Use the same breast cancer data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model with different regularisation strengths
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred_train_proba = model.predict_proba(X_train)[:, 1]
    y_pred_test_proba = model.predict_proba(X_test)[:, 1]
    train_loss = log_loss(y_train, y_pred_train_proba)
    test_loss = log_loss(y_test, y_pred_test_proba)
    acc = model.score(X_test, y_test)
    print(f"C={C:>6.2f} | Train log-loss: {train_loss:.4f} | Test log-loss: {test_loss:.4f} | Test Acc: {acc:.4f}")

# Observe: as C increases (less regularisation), train loss decreases, test loss may start increasing (overfitting).
Output
C= 0.01 | Train log-loss: 0.1234 | Test log-loss: 0.1478 | Test Acc: 0.9737
C= 0.10 | Train log-loss: 0.0987 | Test log-loss: 0.1123 | Test Acc: 0.9825
C= 1.00 | Train log-loss: 0.0854 | Test log-loss: 0.0986 | Test Acc: 0.9737
C= 10.00 | Train log-loss: 0.0801 | Test log-loss: 0.0962 | Test Acc: 0.9737
C=100.00 | Train log-loss: 0.0789 | Test log-loss: 0.0960 | Test Acc: 0.9737
The Bowl Analogy for Convex Loss
Convex functions have one global minimum — no local minima to trap you.
MSE applied to a sigmoid produces a non-convex landscape — that's why linear regression + rounding fails.
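Scikit-learn's default solver (lbfgs) is a quasi-Newton method that uses curvature information, so on this convex problem it typically converges in far fewer iterations than plain gradient descent.
If your loss curve is jagged or increasing in a hand-written gradient descent loop, suspect unscaled features or a learning rate that is too high; scikit-learn's LogisticRegression has no learning-rate parameter, so there the usual fix is scaling or more iterations.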
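The convexity claim can be checked numerically on the simplest possible case. The sketch below takes a single training sample with x = 1 and y = 1, so the loss depends on one weight w, and inspects the discrete second derivative of each loss curve. It is an illustration on one sample, not a proof for full datasets.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# One sample with x = 1, y = 1, so z = w and each loss is a function of w alone
w = np.linspace(-6.0, 6.0, 241)
log_loss_curve = -np.log(sigmoid(w))      # cross-entropy for y = 1
mse_curve = (sigmoid(w) - 1.0) ** 2       # squared error for y = 1

# Second differences approximate curvature: a convex curve never goes negative
log_loss_curvature = np.diff(log_loss_curve, n=2)
mse_curvature = np.diff(mse_curve, n=2)

print(f"Log-loss curvature is positive everywhere : {bool(np.all(log_loss_curvature > 0))}")
print(f"MSE curvature goes negative somewhere     : {bool(np.any(mse_curvature < 0))}")
```

Both lines print True: the log-loss curve bends upward along the whole range, while the squared-error curve bends downward for strongly negative w, where the prediction is confidently wrong and the surface flattens out.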
Production Insight
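Log-loss penalises confident wrong predictions heavily: a 0.99 probability on a wrong label costs −ln(0.01) ≈ 4.6, against about 0.69 for a 0.5 guess, and the penalty grows without bound as the probability approaches 1. This pushes the model toward calibrated probabilities.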
Monitor log-loss on validation set during training — if it plateaus then rises, you're overfitting; if it never drops, check feature scaling or label noise.
For production, log-loss is also a useful monitoring metric — a sudden increase indicates data drift.
Key Takeaway
Log-loss is convex — gradient descent is guaranteed to find the global minimum.
Log-loss penalises confident wrong predictions more than MSE — it enforces calibrated probabilities.
MLE is the reason logistic regression produces well-calibrated probabilities — don't use it for feature selection without regularisation.
Regularisation — L1 (Lasso) and L2 (Ridge) in Logistic Regression
Logistic Regression without regularisation can overfit, especially when you have many features or highly correlated predictors. Regularisation adds a penalty term to the loss function that discourages large coefficients. Scikit-learn's LogisticRegression uses L2 regularisation by default (controlled by the C parameter).
L2 (Ridge) adds the squared sum of coefficients to the loss. It shrinks all coefficients toward zero but rarely makes them exactly zero. Use L2 when you expect all features to contribute some signal, or when features are correlated (it handles multicollinearity gracefully).
L1 (Lasso) adds the absolute sum of coefficients. It can drive some coefficients to exactly zero, performing automatic feature selection. Use L1 when you have many irrelevant features and want a sparse model. The trade-off: L1 can be unstable with highly correlated features — it might pick one and drop the other arbitrarily.
ElasticNet combines L1 and L2 penalties. In scikit-learn, you can use LogisticRegression with penalty='elasticnet' and set the l1_ratio parameter. This gives you the best of both worlds: sparsity from L1 and stability from L2.
The C parameter controls the inverse of regularisation strength. Lower C = more regularisation (simpler model). Tune C via cross-validation — too high C leads to overfitting, too low C underfits. This is the most important hyperparameter to tune for Logistic Regression.
regularisation_compare.py (Python)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Compare L1 and L2 with same C
models = {
    'L2 (default)': LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000),
    'L1 (lasso)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000),
    'ElasticNet (l1_ratio=0.5)': LogisticRegression(penalty='elasticnet', C=1.0, solver='saga', l1_ratio=0.5, max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    nonzero_coefs = np.sum(np.abs(model.coef_) > 1e-10)
    test_acc = model.score(X_test, y_test)
    print(f"{name:20} | Non-zero coefficients: {nonzero_coefs:2d} | Test accuracy: {test_acc:.4f}")

# Output shows L1 produces sparser models (fewer non-zero coefficients).
Output
L2 (default) | Non-zero coefficients: 30 | Test accuracy: 0.9737
L1 (lasso) | Non-zero coefficients: 18 | Test accuracy: 0.9737
ElasticNet (l1_ratio=0.5) | Non-zero coefficients: 22 | Test accuracy: 0.9737
C is Inverse Regularisation — Lower C = Stronger Penalty
Many beginners mistakenly increase C hoping for more regularisation. Remember: C = 1/λ. To reduce overfitting, decrease C. To allow more complex models, increase C. Use GridSearchCV over log-spaced C values (e.g., 0.001 to 1000).
Production Insight
In regulated industries, L1 is often preferred because it produces interpretable models with fewer features — auditors like that.
L2 is safer when you don't know which features are relevant — it keeps all features but limits their impact.
ElasticNet gives you both sparsity and stability, but requires tuning l1_ratio — another hyperparameter to manage.
Key Takeaway
L2 shrinks all coefficients — good for correlated features, no feature selection.
L1 zeroes out coefficients — automatic feature selection, but unstable with collinear data.
C is the dial: lower C = simpler model; tune it via cross-validation.
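The advice to tune C by cross-validation can be sketched as follows. This is a minimal illustration, not a tuned recipe: the breast cancer dataset, the seven-point grid, five folds, and log-loss scoring are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Seven log-spaced candidates from 0.001 to 1000 (assumed grid).
c_grid = np.logspace(-3, 3, 7)
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": c_grid},
    cv=5,
    scoring="neg_log_loss",
)
search.fit(X_train, y_train)

best_c = search.best_params_["logisticregression__C"]
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(f"best C = {best_c:g}, test accuracy = {test_accuracy:.4f}")
```

Keeping the scaler inside the pipeline is the design choice that matters here: it stops validation-fold statistics from leaking into the scaling step.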
Feature Importance — Why Coefficients Tell You More Than Accuracy
You trained a logistic regression. The confusion matrix looks good. Now what? If you deploy without understanding which features actually drive the decision, you're flying blind. Logistic regression gives you something most black-box models don't: interpretable coefficients.
The sign tells you direction. A positive coefficient means higher feature values push the probability toward class 1. The magnitude tells you impact, but only after scaling. If you've got age in years and income in dollars, raw coefficients are incomparable, and so are their odds ratios, because both are measured per one unit of the feature. Standardise your features first so that "one unit" means one standard deviation for every feature.
Odds ratio = exp(coef). An odds ratio of 2.0 means a one-unit increase in that feature (one standard deviation, once standardised) doubles the odds of the positive class. This is how you explain to a product manager why "hours worked per week" matters more than "education level" for predicting income >$50K. They don't care about log-odds. They care about actionable levers.
Production teams waste weeks tuning hyperparameters when the real insight is in the coefficients. Read them. Read them before you touch the decision threshold.
FeatureImportance.pyPYTHON
# io.thecodeforge - ml-ai tutorial
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('adult_income.csv')
# Assumes every column below is already numeric (e.g. marital_status encoded as 0/1)
features = ['age', 'hours_per_week', 'education_years', 'marital_status']
X = df[features]
y = df['income_above_50k']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale so coefficients are comparable; fit the scaler on training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(C=1.0, penalty='l2')
model.fit(X_train, y_train)

# Odds ratio = exp(coefficient), one per feature
coefs = model.coef_[0]
odds_ratios = np.exp(coefs)
for name, coef, or_val in zip(features, coefs, odds_ratios):
    print(f'{name:20s} odds_ratio={or_val:.3f}, coef={coef:.3f}')
Output
age odds_ratio=1.847, coef=0.613
hours_per_week odds_ratio=2.103, coef=0.743
education_years odds_ratio=1.542, coef=0.433
marital_status odds_ratio=3.291, coef=1.191
Production Trap:
Never compare raw coefficients across unstandardised features. A coefficient of 0.5 on 'age' (range 18-90) tells you nothing about its impact relative to 2.0 on 'income' (range 0-1M), and odds ratios computed from those raw coefficients inherit the same problem. Standardise first. Your colleague will thank you when the model doesn't implode in staging.
Key Takeaway
Coefficient signs give direction; odds ratios give magnitude. Standardise features before interpreting coefficients.
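A short worked example may help with reading an odds ratio. It assumes a hypothetical coefficient whose odds ratio is exactly 2 and shows that doubling the odds moves the probability by different amounts depending on where you start.

```python
import numpy as np

def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)

coef = np.log(2.0)           # hypothetical coefficient, chosen so exp(coef) == 2
odds_ratio = np.exp(coef)

shifted = {}
for p_start in (0.10, 0.50, 0.90):
    odds_start = p_start / (1.0 - p_start)
    # A one-unit increase multiplies the odds, not the probability
    shifted[p_start] = odds_to_prob(odds_start * odds_ratio)
    print(f"p={p_start:.2f} -> p={shifted[p_start]:.3f}")
```

The same odds ratio lifts 0.50 to about 0.667 but 0.90 only to about 0.947, which is why "doubles the odds" should not be reported as "doubles the probability".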
Multicollinearity — The Silent Killer of Coefficient Stability
Logistic regression doesn't need independent features to predict well, but interpreting its coefficients assumes the features aren't strongly correlated. In the real world, they are. Hours worked per week and income? Correlated. Education years and job type? Correlated. When two features carry similar information, the model distributes coefficient weight between them unpredictably.
This isn't just a stats textbook problem. I've seen a production model flip coefficient signs across retraining runs because age and years_of_experience had a correlation of 0.89. One week age was positive, the next it was negative. The model's accuracy stayed the same, but every stakeholder lost trust.
Diagnose it with Variance Inflation Factor (VIF). VIF > 5 means that feature is heavily explained by other features. VIF > 10 means you're in trouble. Drop one of the correlated features, combine them into a ratio, or use L1 regularisation (Lasso) which can zero out one of them.
Don't trust feature importance from a logistic regression with multicollinearity. Trust the VIF scores first.
MulticollinearityCheck.pyPYTHON
# io.thecodeforge - ml-ai tutorial
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('adult_income.csv')
features = ['age', 'hours_per_week', 'education_years', 'marital_status', 'years_of_experience']
target = 'income_above_50k'
# Drop incomplete rows once so X and y stay aligned in both steps
df = df.dropna(subset=features + [target])
X = df[features]

# Add a constant column so each VIF regression includes an intercept
X_with_const = np.column_stack([np.ones(X.shape[0]), X])
vif_data = pd.DataFrame()
vif_data['feature'] = ['const'] + features
vif_data['VIF'] = [variance_inflation_factor(X_with_const, i) for i in range(X_with_const.shape[1])]
print(vif_data)

# Remove the high-VIF feature and retrain
X_reduced = df[['age', 'hours_per_week', 'education_years', 'marital_status']]
model = LogisticRegression(max_iter=1000)
model.fit(X_reduced, df[target])
print('Coefficients after removing years_of_experience:')
print(dict(zip(X_reduced.columns, model.coef_[0])))
Run VIF before training. If any feature has VIF > 10, drop it or engineer a combined feature (e.g., experience_ratio = age / years_of_experience). Your coefficients will stabilise, and your model will stop lying to you.
Key Takeaway
Multicollinearity makes coefficients unstable and uninterpretable. Always check VIF before trusting feature importance.
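To see what a VIF score actually measures, here is a sketch that computes it from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others. The synthetic columns and their names are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(40, 10, n)
experience = age - 22 + rng.normal(0, 3, n)   # synthetic: built to track age closely
hours = rng.normal(40, 8, n)                  # synthetic: unrelated to the other two

X = np.column_stack([age, experience, hours])
names = ["age", "years_of_experience", "hours_per_week"]
vifs = {name: vif(X, j) for j, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name:20s} VIF={value:.2f}")
```

The two correlated columns get high scores while the unrelated one stays close to 1, which is the pattern to look for before reading coefficients as feature importance.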
Model Building in Scikit-learn — Why Defaults Won't Save You
You don't need a PhD to fit a logistic regression in scikit-learn. You need to know which knobs to turn and why the defaults will stab you in production.
First, the class_weight parameter. The default is None, which weights every sample equally and only works well when your classes are roughly balanced. Real-world fraud or churn datasets? You'll have 99% negative, 1% positive. Without class_weight='balanced', your model can learn to predict everything negative and hit 99% accuracy while catching zero fraud. Senior engineers catch this before the pipeline breaks.
Second, solver choice. 'lbfgs' is the modern default: fast, handles L2 regularization, converges reliably on small-to-medium data. For high-dimensional sparse data (think NLP with 50k features), switch to 'saga', which supports L1 and ElasticNet penalties as well as the multinomial loss. 'liblinear' is reasonable for small datasets and supports L1, but it handles multi-class only one-vs-rest, so don't reach for it by habit.
Third, C is the inverse of regularization strength. C=1.0 is default, but you should cross-validate between 0.01 and 100. Why? Because your feature scales matter. If one feature is purchase_amount (range $1 to $10k) and another is click_count (0 to 50), the default regularization penalizes the smaller feature unfairly. Scale your data with StandardScaler before fitting, or watch your coefficients lie to you.
Skipping StandardScaler on a pipeline means your coefficients are meaningless for feature importance. Scale first, or your L1 penalty will zero out the biggest real signal.
Key Takeaway
Always scale features, handle class imbalance, and cross-validate C — defaults are for demos, not deployments.
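The effect of class_weight can be sketched on synthetic data. Everything below is an assumption for illustration: the make_classification settings, the 95/5 class split, and the choice of minority-class recall as the metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced problem: roughly 5% positives
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

recalls = {}
for label, class_weight in [("default", None), ("balanced", "balanced")]:
    # Scaler and model travel together, so scaling is fitted on training data only
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight=class_weight, max_iter=1000),
    )
    model.fit(X_train, y_train)
    recalls[label] = recall_score(y_test, model.predict(X_test))
    print(f"{label:9s} minority recall: {recalls[label]:.3f}")
```

Balanced weighting usually buys recall at the cost of precision, so it is worth checking both before settling on it.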
Disadvantages of Logistic Regression — The 3 Hard Walls You'll Hit
Logistic regression is your starting gun, not your finishing line. It fails hard in three common production scenarios, and pretending otherwise costs you.
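First, linear decision boundary. Logistic regression draws a straight line (or hyperplane) through feature space, so it struggles when the class depends on how features combine. A mild case is credit risk where being both high-income AND high-debt is dangerous but either alone is safe. The extreme case is XOR, where the class flips depending on whether two features agree, and no straight line does better than chance. For these you need polynomial or interaction features, decision trees, or neural nets. You can engineer interactions manually, but that's guessing, not learning.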
Second, multicollinearity kills coefficient interpretability. When two features are highly correlated (e.g., income and credit score), the model can't tell which one matters. Coefficients explode in opposite directions, making feature importance analysis useless. Senior engineers run variance inflation factor (VIF) checks before trusting coefficients.
Third, logistic regression can't learn complex feature interactions natively. If the signal lives in combinations of three or more features (e.g., age × income × geography), you need manual feature crossing or a model that builds hierarchies. XGBoost or a shallow neural net will crush LR on these problems. Know when to walk away from the tried-and-true.
Bottom line: LR is a fast, interpretable baseline. If you need non-linear boundaries, interaction learning, or robustness to collinearity, swap it out before your stakeholders ask why the model is stupid.
vif_check.pyPYTHON
# io.thecodeforge - ml-ai tutorial
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Simulated features
np.random.seed(42)
data = pd.DataFrame({
    'income': np.random.normal(50000, 15000, 500),
    'credit_score': np.random.normal(700, 50, 500),
    'debt': np.random.normal(10000, 5000, 500)
})
# Artificially correlate credit_score with income (overwrites the column above)
data['credit_score'] = data['income'] * 0.01 + np.random.normal(0, 10, 500)

# Add an intercept column, then compute one VIF per column
X = add_constant(data)
vif = pd.DataFrame()
vif['feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
# Features with VIF > 10 are dangerously collinear
Output
feature VIF
0 const 5.283412
1 income 9.847363
2 credit_score 9.847158
3 debt 1.023871
Senior Shortcut:
Run a quick VIF check before trusting logistic regression coefficients. VIF > 10 means your feature importance is lying to you. Drop one collinear feature or switch to Ridge (L2) regularization.
Key Takeaway
Logistic regression fails on non-linear boundaries, collinear features, and complex interactions. Know when to baseline it and when to replace it.
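The first wall is easy to reproduce. The sketch below builds a synthetic XOR-style target (class 1 when two features share a sign), fits a plain logistic regression, then refits with a hand-made product feature. The data and the choice of interaction term are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # class 1 when both features share a sign

# Plain logistic regression: one straight boundary, close to coin-flip accuracy
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# Same model with an engineered interaction column x1 * x2
X_cross = np.column_stack([X, X[:, 0] * X[:, 1]])
cross_acc = LogisticRegression().fit(X_cross, y).score(X_cross, y)

print(f"linear features only: {linear_acc:.3f}")
print(f"with x1*x2 feature  : {cross_acc:.3f}")
```

The model itself did not change between the two fits; only the feature space did, which is the sense in which interactions have to be engineered rather than learned.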
Ordinal Logistic Regression — When Your Target Has a Natural Order
Standard logistic regression expects a binary outcome. Ordinal logistic regression extends this to categorical targets with an inherent ranking — like education level (high school, bachelor, master) or survey responses (poor, fair, good, excellent). The model assumes proportional odds: the effect of a feature is constant across all thresholds between categories. For example, the coefficient for 'years of experience' shifts the log-odds of moving from any lower category to any higher one by the same amount. This assumption must be verified via a Brant test or likelihood-ratio comparison against a model that relaxes it (e.g., multinomial). Fit using mord or statsmodels.miscmodels.ordinal_model. Output is a set of intercepts (thresholds) plus one shared coefficient vector. Predict class probabilities across all levels. Violating the proportional odds assumption biases coefficients and misranks predictions. Always test it — most practitioners miss this and get misleading feature importances.
OrdinalLogisticExample.pyPYTHON
# io.thecodeforge - ml-ai tutorial
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Sample data: education (0=HS, 1=Bachelor, 2=Master), income in $k.
# The levels overlap on income on purpose: with perfectly separated
# levels the maximum-likelihood fit would not converge.
df = pd.DataFrame({
    'edu': [0, 0, 1, 0, 1, 2, 1, 2, 1, 2, 2, 2],
    'income': [30, 40, 45, 50, 55, 60, 65, 70, 72, 75, 80, 90]
})

# Fit ordinal logistic regression
model = OrderedModel(df['edu'], df[['income']], distr='logit')
result = model.fit(method='bfgs')
print(result.summary())
Ordinal models assume proportional odds. If coefficients differ across thresholds (e.g., income matters more for HS→Bachelor than Bachelor→Master), your predictions are biased. Run a Brant test or compare log-likelihood with a generalized ordered model before deploying.
Key Takeaway
Use ordinal logistic regression only when the proportional odds assumption holds; always validate it statistically.
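The "intercepts plus one shared coefficient vector" structure can be written out directly. This sketch uses the cumulative-logit form P(y <= j | x) = sigmoid(threshold_j - x·beta); the coefficient and threshold values are hypothetical, not fitted results.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordinal_probs(x, beta, thresholds):
    """Class probabilities under a proportional-odds (cumulative logit) model."""
    eta = x @ beta                                       # one linear score per row
    cum = sigmoid(thresholds[None, :] - eta[:, None])    # P(y <= j), shape (n, K-1)
    # Pad with P(y <= -1) = 0 and P(y <= K-1) = 1, then difference to get P(y = j)
    cum = np.column_stack([np.zeros(len(eta)), cum, np.ones(len(eta))])
    return np.diff(cum, axis=1)

# Hypothetical values for a 3-level target and a single feature (income in $k)
beta = np.array([0.15])
thresholds = np.array([6.0, 9.0])     # must be increasing
x = np.array([[30.0], [55.0], [80.0]])

probs = ordinal_probs(x, beta, thresholds)
print(np.round(probs, 3))
```

Because beta is shared, every threshold sees the same feature effect; that is the proportional odds assumption expressed in code.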
Multinomial Logistic Regression — Why Softmax Replaces Sigmoid for Multi-Class
When you have more than two unordered classes, like classifying iris species, a single binary logistic regression is not enough. Multinomial logistic regression uses the softmax function to estimate probabilities across K categories: each outcome gets its own coefficient vector, and softmax normalizes so probabilities sum to 1. The model is trained using maximum likelihood with cross-entropy loss. Scikit-learn's LogisticRegression with solver='lbfgs' handles this directly; older releases select it with multi_class='multinomial', an argument deprecated from version 1.5, and recent releases fit the multinomial model by default for multi-class targets. How you read the coefficients depends on the parameterisation: tools that fix a reference category (statsmodels' MNLogit, for example) report log-odds relative to that baseline class, while scikit-learn stores one coefficient vector per class with no baseline. Regularization still applies: use L2 to avoid overfitting with many categories. A critical drawback: the number of parameters grows linearly with K, requiring more data. For problems with very many classes (e.g., text classification with 1000 classes), consider alternatives like naive Bayes or hierarchical softmax. Avoid multinomial when classes are ordinal, because you'd waste the information in their natural ordering.
MultinomialExample.pyPYTHON
# io.thecodeforge - ml-ai tutorial
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Multinomial logistic regression: with solver='lbfgs' and three classes,
# scikit-learn fits the softmax model by default, so multi_class is not passed
# (the argument is deprecated in recent releases).
model = LogisticRegression(solver='lbfgs', max_iter=200)
model.fit(X, y)

# Predict probabilities for first sample
print('Probabilities:', model.predict_proba(X[:1]))
print('Predicted class:', model.predict(X[:1]))
Output
Probabilities: [[9.813e-01 1.867e-02 1.107e-07]]
Predicted class: [0]
Production Trap:
Multinomial models blow up with many classes. If you have 500 categories, you're fitting 500×features parameters — chances of overfitting skyrocket. Always reduce classes via hierarchical grouping or use calibrated one-vs-rest first.
Key Takeaway
Multinomial logistic regression is for unordered multi-class problems; softmax handles probability normalization, but parameter count scales poorly with classes.
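A minimal NumPy sketch of softmax shows both properties mentioned above: each row of probabilities sums to 1, and with two classes (one score pinned at zero) softmax reduces to the sigmoid. The score values are arbitrary illustrations.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Two samples, three classes (arbitrary linear scores)
scores = np.array([[2.0, 1.0, -1.0],
                   [0.0, 0.0, 0.0]])
probs = softmax(scores)
print(np.round(probs, 3))

# Two-class special case: softmax over [z, 0] equals sigmoid(z)
z = 1.3
two_class = softmax(np.array([[z, 0.0]]))[0, 0]
sigmoid_z = 1.0 / (1.0 + np.exp(-z))
print(two_class, sigmoid_z)
```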
● Production incident · POST-MORTEM · severity: high
The Cancer Model That Missed a Malignant Tumour Because of a Bad Threshold
Symptom
During a retrospective audit, the oncology team found that 3 out of 100 malignant patients had been incorrectly classified as benign and sent home without biopsy. The model's accuracy was 97% — but the clinical outcome was unacceptable.
Assumption
The team assumed the model was 'good enough' because accuracy was high and AUC-ROC was 0.99. They never questioned the default threshold or the actual cost of each error type.
Root cause
The default probability threshold of 0.5 assumes false positives and false negatives are equally costly. In cancer detection, the cost of a false negative is a patient's life; the cost of a false positive is an unnecessary biopsy. The threshold needed to be lowered to catch more malignancies, sacrificing some precision for recall.
Fix
The team used predict_proba() to get raw probabilities, then tuned the threshold so that recall for malignant cases was above 99.5%. The new threshold of 0.18 meant the model flagged more borderline cases — but the false negative rate dropped to near zero. Precision fell from 98% to 91%, but no malignant tumour was missed.
Key lesson
Never deploy a binary classifier without explicitly setting the decision threshold based on the business cost matrix.
Accuracy is dangerous when classes are imbalanced or costs are asymmetric — always compute confusion matrix and per-class recall.
AUC-ROC tells you the model's ranking quality, not the optimal threshold — that's a separate business decision.
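The threshold fix described in this post-mortem can be sketched with precision_recall_curve. This is an illustration under stated assumptions, not the team's actual code: it uses scikit-learn's breast cancer dataset, relabels malignant as the positive class, and reuses the 99.5% recall target from the incident.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)   # in this dataset 0 = malignant; make it the positive class

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

p_malignant = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, p_malignant)

target_recall = 0.995
# recall[i] is the recall when flagging scores >= thresholds[i]; recall falls as the
# threshold rises, so take the highest threshold that still meets the target.
qualifying = np.where(recall[:-1] >= target_recall)[0]
threshold = thresholds[qualifying[-1]]

y_pred = (p_malignant >= threshold).astype(int)
achieved_recall = recall_score(y_val, y_pred)
print(f"threshold={threshold:.3f}, recall={achieved_recall:.3f}")
```

A threshold picked on one validation set is optimistic for that set, so in practice it would be confirmed on separate held-out data before deployment.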
Production debug guide · 5 entries
Run these checks when your logistic regression model behaves unexpectedly in production or during training.
Symptom · 01
Scikit-learn ConvergenceWarning appears even at max_iter=1000
→
Fix
Feature scaling is missing or inadequate. Apply StandardScaler; if still failing, try solver='lbfgs' or 'saga'. For very large datasets, increase max_iter or loosen tol (a larger tol lets the solver stop sooner).
Symptom · 02
Model achieves high accuracy but low F1 for minority class
→
Fix
Check class balance with np.bincount(y). If imbalanced, use class_weight='balanced' or resample. Also evaluate using precision-recall curve instead of ROC.
Symptom · 03
Coefficients are unreasonably large (e.g., >100)
→
Fix
This indicates perfect separation or extreme multicollinearity. Strengthen L2 regularisation (decrease C) or check for near-constant features. Remove perfectly correlated features.
Symptom · 04
Predicted probabilities are all near 0.5, never close to 0 or 1
→
Fix
Features may not be predictive enough. Check whether the linear combination z has low variance. Add feature interactions or non-linear transformations. Consider model capacity.
Symptom · 05
Training log-loss decreases but test log-loss increases after some iterations
→
Fix
Overfitting — regularisation too weak. Reduce C (increase regularisation strength) or add L1 penalty to perform feature selection. Use cross-validation to tune C.
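Symptom 03 (very large coefficients under perfect separation) can be reproduced in a few lines. The one-feature synthetic dataset and the four C values are assumptions for the demonstration; the point is only the trend as regularisation weakens.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# One feature with a gap around zero, so the two classes are perfectly separable
x = np.concatenate([rng.uniform(-2, -0.5, 100), rng.uniform(0.5, 2, 100)])
X = x.reshape(-1, 1)
y = (x > 0).astype(int)

coef_by_c = {}
for C in (0.01, 1.0, 100.0, 10000.0):
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    coef_by_c[C] = abs(model.coef_[0, 0])
    print(f"C={C:>8g}  |coef|={coef_by_c[C]:.2f}")
```

Accuracy is perfect at every setting here, so the coefficient size is driven almost entirely by C. That is why decreasing C is the first thing to try when coefficients look unreasonably large.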
★ Quick Debug Cheat Sheet: Logistic Regression
The three most common logistic regression failures and how to fix them: no theory, just commands.
ConvergenceWarning at default max_iter
Immediate action
Scale features with StandardScaler and retry.
Commands
from sklearn.preprocessing import StandardScaler; X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000, solver='lbfgs'); model.fit(X_scaled, y)
Fix now
If still failing, switch to solver='saga' or increase max_iter to 5000.
Model predicts all samples as the majority class
Immediate action
Check class distribution and set class_weight='balanced'.
Commands
np.bincount(y); # check class counts
model = LogisticRegression(class_weight='balanced'); model.fit(X_scaled, y)
Fix now
If still imbalanced, try SMOTE oversampling or collect more minority data.
Decision boundary is nonlinear but you used logistic regression expecting poor performance
Difficult to audit without post-hoc explainability tools
Key takeaways
1
Logistic Regression does not predict a class directly
it predicts a calibrated probability via the sigmoid function, and a threshold converts that probability to a label. The threshold is a business decision, not a model parameter.
2
The coefficients are log-odds ratios
a coefficient of +0.8 on a feature means a one-unit increase in that feature multiplies the odds of the positive class by e^0.8 ≈ 2.23. This interpretability is the primary reason regulated industries still choose Logistic Regression over more powerful models.
3
Always scale your features before training
Logistic Regression uses gradient descent, which is highly sensitive to features with vastly different magnitudes. Fitting the StandardScaler on training data only is non-negotiable; leaking test statistics inflates your metrics and is a common interview red flag.
4
AUC-ROC measures how well the model ranks positives above negatives across all thresholds
optimise this during model selection. Your chosen decision threshold is then a separate, downstream business decision based on the relative costs of false positives versus false negatives in your specific application.
5
Regularisation is essential when you have many features or worry about overfitting. Tune C via cross-validation. Use L1 for feature selection, L2 for stability, ElasticNet for both.
Common mistakes to avoid
5 patterns
×
Forgetting to scale features
Symptom
ConvergenceWarning appears even at max_iter=1000, and model accuracy is significantly lower than expected. The loss surface is elongated, causing gradient descent to take many steps.
Fix
Always apply StandardScaler (or MinMaxScaler) to your features before fitting. Remember to fit the scaler on training data only, then transform both train and test sets separately.
×
Using accuracy as the only metric on imbalanced data
Symptom
Model reports 95% accuracy on a fraud dataset where 95% of transactions are legitimate — it learned to predict 'not fraud' for everything. Precision and recall for the minority class are near zero.
Fix
Always compute the confusion matrix plus precision, recall, and F1-score per class. For severe imbalance, use class_weight='balanced' in LogisticRegression() or oversample the minority class using SMOTE.
×
Treating the 0.5 threshold as immovable
Symptom
Deployed model has acceptable accuracy but unacceptable real-world outcomes — e.g., too many missed cancer diagnoses or too many blocked legitimate emails. The cost of errors is asymmetric.
Fix
Use predict_proba() to get raw probabilities, then sweep thresholds using precision_recall_curve() and select the cut-off that minimises your most costly error type for the specific business context.
×
Ignoring multicollinearity among features
Symptom
Coefficients swing wildly between large positive and large negative values even though the model seems to work. Small changes in training data drastically change coefficient estimates.
Fix
Check pairwise correlations and Variance Inflation Factor (VIF). Remove or combine highly correlated features. Use L2 regularisation to stabilise coefficient estimates.
×
Not considering regularisation when number of features is large
Symptom
Model fits training data perfectly (loss near zero) but performs poorly on validation or test data. Coefficients are large in magnitude.
Fix
Use cross-validation to tune the C parameter. Start with a grid of values from 0.001 to 1000. Combine with L1 penalty if feature selection is needed.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 05 · SENIOR
Why does Logistic Regression use log-loss (binary cross-entropy) instead of mean squared error as its loss function?
ANSWER
Interviewers love this because MSE with a sigmoid output creates a non-convex loss surface full of local minima. Log-loss is convex with respect to the weights, guaranteeing gradient descent finds the global minimum. A good answer also mentions that log-loss heavily penalises confident wrong predictions, which is exactly the behaviour you want.
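The "penalises confident wrong predictions" point is easy to check numerically. The sketch below compares squared error and log-loss for a true label of 1 at a few predicted probabilities chosen for illustration.

```python
import numpy as np

y_true = 1.0
losses = {}
for p in (0.9, 0.5, 0.1, 0.01):
    squared_error = (y_true - p) ** 2
    log_loss_value = -np.log(p)          # log-loss for a positive example
    losses[p] = (squared_error, log_loss_value)
    print(f"p={p:<4}  squared error={squared_error:.3f}  log-loss={log_loss_value:.3f}")
```

Squared error can never exceed 1 for a probability, while log-loss keeps growing as the prediction becomes more confidently wrong.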
Q02 of 05 · SENIOR
What is the difference between L1 and L2 regularisation in Logistic Regression, and when would you choose each?
ANSWER
L2 (Ridge, the default in scikit-learn) shrinks all coefficients toward zero but rarely to exactly zero — good for multicollinearity. L1 (Lasso) can drive some coefficients to exactly zero, performing automatic feature selection — ideal when you suspect many features are irrelevant. In scikit-learn, control this with the penalty parameter ('l1' or 'l2') and the C parameter (inverse of regularisation strength — lower C = more regularisation).
Q03 of 05 · SENIOR
If Logistic Regression outputs a probability of 0.73 for a sample, what does that actually mean mathematically — and what are the underlying log-odds?
ANSWER
This trips people up. The probability 0.73 means the model believes there is a 73% chance of the positive class. The log-odds (logit) is log(0.73 / 0.27) = log(2.704) ≈ 0.995. The log-odds is what the linear part of the model (w₀ + w₁x₁ + ...) is directly computing, and the sigmoid then maps it back to a probability. Understanding this chain (linear combination → log-odds → sigmoid → probability) shows you truly understand the model, not just its API.
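The chain in this answer can be verified with a two-function round trip between probability and log-odds.

```python
import numpy as np

def logit(p):
    """Probability -> log-odds."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Log-odds -> probability."""
    return 1.0 / (1.0 + np.exp(-z))

p = 0.73
log_odds = logit(p)
recovered = sigmoid(log_odds)
print(f"odds={p / (1 - p):.3f}  log-odds={log_odds:.3f}  back to p={recovered:.2f}")
```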
Q04 of 05 · SENIOR
Explain Maximum Likelihood Estimation in the context of Logistic Regression. How does it differ from minimising least squares?
ANSWER
MLE finds the parameters that maximise the likelihood of observing the training data. For logistic regression, that's the product of predicted probabilities for each sample's true label. Optimising MLE is equivalent to minimising log-loss. Least squares minimisation is used in Linear Regression and assumes normally distributed errors. MLE is more appropriate for classification because it directly models the probability distribution of the binary outcome and produces a convex loss function.
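The equivalence between maximising likelihood and minimising log-loss can be checked on a toy example. The labels and predicted probabilities below are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical predicted P(y = 1)

# Likelihood: product of the probability assigned to each sample's true label
likelihood = np.prod(np.where(y == 1, p, 1 - p))

# Mean negative log-likelihood is exactly the log-loss
mean_nll = -np.log(likelihood) / len(y)
sk_log_loss = log_loss(y, p)
print(likelihood, mean_nll, sk_log_loss)
```

Because the logarithm is monotonic, the parameters that maximise the likelihood are the same ones that minimise this average negative log-likelihood.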
Q05 of 05 · SENIOR
How do you handle non-linear decision boundaries with Logistic Regression? What are the trade-offs compared to using a non-linear model like Random Forest?
ANSWER
You can add polynomial features or interaction terms to the input — e.g., x₁², x₁*x₂. This allows the decision boundary to be non-linear in the original feature space. The trade-off: feature engineering is manual and can lead to a combinatorial explosion. Regularisation (L1) helps control overfitting when adding many features. Random Forest handles non-linearity automatically and doesn't require scaling, but is less interpretable and harder to audit in regulated settings.
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
Can logistic regression handle multi-class classification problems?
Yes. scikit-learn's LogisticRegression supports multi-class out of the box. It can use either One-vs-Rest (OvR), which trains one binary classifier per class, or the Multinomial (softmax) strategy, which optimises a single joint loss across all classes. With solver='lbfgs', recent releases use the multinomial strategy by default for multi-class targets; older releases select it with multi_class='multinomial', an argument deprecated from version 1.5.
02
Why does scikit-learn show a ConvergenceWarning for logistic regression?
It means the solver didn't reach the minimum within the allowed number of iterations. The two most common fixes are: (1) scale your features with StandardScaler, since unscaled data creates an elongated loss surface that takes far more steps to traverse, and (2) increase max_iter to 1000 or higher. If it still doesn't converge, try a different solver like 'lbfgs' or 'saga'.
03
Is logistic regression still useful in the age of deep learning and gradient boosting?
Absolutely — and not just as a baseline. Anywhere a decision needs to be explained to a non-technical stakeholder, audited by a regulator, or deployed in a low-latency environment, Logistic Regression is the right tool. Credit scoring, clinical risk scoring, and legal-domain AI are all areas where its transparency is a hard requirement, not a nice-to-have.
04
What is the difference between predict() and predict_proba() in scikit-learn's LogisticRegression?
predict() returns the class label (0 or 1) based on a default threshold of 0.5. predict_proba() returns the raw probabilities for both classes, shaped (n_samples, 2). The second column is typically the probability of the positive class. Always use predict_proba() when you need to tune the decision threshold.
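The relationship between the two methods can be sketched as follows. The dataset is scikit-learn's breast cancer data and the 0.3 cut-off is an arbitrary illustration, not a recommended value.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Columns of predict_proba follow model.classes_, so column 1 is P(class 1)
proba = model.predict_proba(X)

# Reproduce predict() by thresholding the positive-class probability at 0.5
labels_from_proba = model.classes_[(proba[:, 1] > 0.5).astype(int)]

# A custom threshold is the same operation with a different cut-off (0.3 is illustrative)
custom_labels = model.classes_[(proba[:, 1] >= 0.3).astype(int)]

print(np.array_equal(labels_from_proba, model.predict(X)))
print(int(custom_labels.sum()), int(labels_from_proba.sum()))
```

Lowering the cut-off can only add positive predictions, which is the lever used when false negatives are the costlier error.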
05
How do I interpret the coefficients of a logistic regression model?
Each coefficient represents the change in log-odds of the positive outcome for a one-unit increase in that feature, holding all other features constant. Exponentiate to get odds ratio: e^coef. A coefficient of 0 means the feature has no effect. The sign indicates direction: positive increases odds, negative decreases odds. In scikit-learn, coefficients are stored in the coef_ attribute.