Intermediate 10 min · March 06, 2026

Logistic Regression — The Threshold That Missed 3% Cancers

Q: Can logistic regression handle multi-class classification problems?

Yes. scikit-learn's LogisticRegression handles multi-class targets out of the box. It can use One-vs-Rest (OvR), which trains one binary classifier per class, or the multinomial (softmax) strategy, which optimises a single joint loss across all classes. Older releases select the strategy with the multi_class parameter (multi_class='multinomial' with solver='lbfgs'). That parameter is deprecated from scikit-learn 1.5 onward, where solvers such as 'lbfgs' already use the multinomial loss for multi-class targets.
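To make the multinomial (softmax) strategy concrete, here is a minimal pure-Python sketch of how per-class linear scores become probabilities. The three scores are made-up values for illustration, not output from a trained model.

```python
import math

def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for three classes (illustrative values only)
class_scores = [2.0, 1.0, 0.1]
probs = softmax(class_scores)
predicted_class = probs.index(max(probs))
print([round(p, 3) for p in probs], "-> predicted class", predicted_class)
```

With two classes, softmax reduces to the sigmoid applied to the difference between the two scores, which is why binary logistic regression is a special case of the multinomial model.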

Q: Why does scikit-learn show a ConvergenceWarning for logistic regression?

It means the solver hit max_iter before its convergence criterion (tol) was met. The two most common fixes are: (1) scale your features with StandardScaler, because unscaled data creates an elongated loss surface that takes far more steps to traverse, and (2) increase max_iter to 1000 or higher. If it still doesn't converge, try a different solver such as 'saga' or 'liblinear' ('lbfgs' is already the default).

Q: Is logistic regression still useful in the age of deep learning and gradient boosting?

Absolutely — and not just as a baseline. Anywhere a decision needs to be explained to a non-technical stakeholder, audited by a regulator, or deployed in a low-latency environment, Logistic Regression is the right tool. Credit scoring, clinical risk scoring, and legal-domain AI are all areas where its transparency is a hard requirement, not a nice-to-have.

Q: What is the difference between predict() and predict_proba() in scikit-learn's LogisticRegression?

predict() returns the class label (0 or 1). For a binary model this is equivalent to thresholding the positive-class probability at 0.5. predict_proba() returns the probabilities for both classes, shaped (n_samples, 2). Columns follow the order of model.classes_, so with 0/1 labels the second column is the probability of class 1. Always use predict_proba() when you need to tune the decision threshold.

Q: How do I interpret the coefficients of a logistic regression model?

Each coefficient represents the change in log-odds of the positive outcome for a one-unit increase in that feature, holding all other features constant. Exponentiate it to get the odds ratio: e^coef. A coefficient of 0 (odds ratio 1) means the feature has no effect. The sign indicates direction: positive increases the odds, negative decreases them. In scikit-learn, coefficients are stored in the coef_ attribute.
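As a quick illustration of the exponentiation step, the sketch below converts a few coefficients into odds ratios. The feature names and coefficient values are hypothetical, chosen only to show the arithmetic.

```python
import math

# Hypothetical coefficients for standardised features (illustrative values only)
coefficients = {"feature_a": 0.70, "feature_b": -1.20, "feature_c": 0.0}

odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}
for name, ratio in odds_ratios.items():
    change = (ratio - 1) * 100
    print(f"{name}: odds ratio = {ratio:.3f} ({change:+.1f}% change in odds per unit)")
```

When features are standardised, "one unit" means one standard deviation of that feature, so the odds ratios are comparable across features.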

3 out of 100 malignant patients were sent home - the default 0.5 threshold failed.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.

Last updated: July 19, 2026

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

⚡Quick Answer

Logistic Regression predicts a probability between 0 and 1 using the sigmoid function
The linear part computes the log-odds; the sigmoid squeezes them into a probability
Coefficients are log-odds ratios — interpretable for regulated industries
Feature scaling is mandatory — unscaled data makes gradient descent crawl
Decision threshold is a business decision, not a fixed 0.5
Accuracy is a trap — always check precision, recall, and confusion matrix

✦ Definition~90s read

What is Logistic Regression?

Logistic regression is a linear classification algorithm that estimates the probability of a binary outcome by passing a weighted sum of input features through the sigmoid function. Despite its name, it is not a regression algorithm — it solves classification problems by learning a linear decision boundary in feature space, then mapping the raw score to a value between 0 and 1 via the sigmoid curve.

This curve is chosen specifically because it is differentiable, S-shaped, and outputs values interpretable as probabilities, making it ideal for tasks like spam detection, credit default prediction, and medical diagnosis. In practice, logistic regression is often the first model you reach for when you need a fast, interpretable, and well-calibrated classifier on structured data, and it remains a baseline that deep learning models must beat on tabular datasets.

Where logistic regression truly shines is in its transparency and mathematical rigor. It learns by maximizing the likelihood of the observed data under a Bernoulli distribution, which is equivalent to minimizing log-loss (cross-entropy). This optimization is convex, so an optimizer that converges has found the global optimum; there are no local minima to get trapped in.

You can add L1 (Lasso) or L2 (Ridge) regularization to prevent overfitting, with L1 driving irrelevant feature weights to exactly zero, effectively performing feature selection. However, logistic regression fails when the decision boundary is inherently non-linear — for those cases, you need kernel SVMs, random forests, or neural networks.

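It also assumes independent observations and a linear relationship between the features and the log-odds. Strongly correlated features make the coefficients unstable, and outliers can distort the fit, so you must scale your inputs and handle collinearity.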

The critical operational insight is that the default 0.5 decision threshold is rarely optimal. In a breast cancer screening model, using 0.5 might miss 3% of malignant tumors that have probabilities of 0.47–0.49 — a catastrophic failure. By tuning the threshold via ROC curves or precision-recall trade-offs, you can prioritize recall (catching all cancers) over precision, accepting more false positives to save lives.

This threshold tuning is a production skill that separates junior from senior practitioners: you don't just train a model, you align its decision rule with the real-world cost of errors. Logistic regression is deployed at scale in systems like credit scoring at FICO, ad click prediction at Google, and clinical risk calculators at major hospitals — precisely because it is fast to train, easy to debug, and its probability estimates can be recalibrated for different operational thresholds.

Plain-English First

Imagine a doctor looking at your test results and saying 'there's a 92% chance this is benign.' They're not predicting a number like your height — they're predicting a probability that tips into a yes-or-no answer. Logistic Regression is exactly that: it takes a bunch of measurements, runs them through a special S-shaped curve, and squeezes the result into a probability between 0 and 1. Once that probability crosses a threshold (usually 0.5), the model commits to an answer. It's less like a ruler and more like a confident doctor making a call.

Every day, your email provider quietly decides whether to drop a message into your inbox or your spam folder. Your bank flags a transaction as fraud or lets it through. A hospital algorithm predicts whether a tumour is malignant or benign. All of these are binary decisions — yes or no, 0 or 1 — and Logistic Regression is one of the most reliable, interpretable, and battle-tested tools for making them. It's been doing this job since the 1950s and it's still the first model data scientists reach for when the stakes are high and the explanation matters.

The core problem Logistic Regression solves is one that Linear Regression cannot: predicting a bounded probability. If you used ordinary linear regression to classify emails, nothing stops it from predicting a 'spam probability' of 2.7 or -0.4 — which is meaningless. Logistic Regression wraps its output in a sigmoid function that mathematically constrains every prediction to live between 0 and 1, giving you an actual probability you can act on.

By the end of this article you'll understand not just how to call LogisticRegression().fit() in scikit-learn, but why the sigmoid function exists, what the coefficients are actually telling you about the real world, how to tune the decision threshold for different business goals, and exactly what questions an interviewer will ask you to separate the practitioners from the people who just skimmed a tutorial.

Why Logistic Regression Is a Linear Classifier, Not a Regression

Logistic regression predicts the probability that an input belongs to a binary class by passing a linear combination of features through the logistic (sigmoid) function. The core mechanic: compute z = w·x + b, then output σ(z) = 1 / (1 + e⁻ᶻ), which squashes any real number into a (0,1) probability. Despite the name, it's used as a classifier: the 'regression' refers to regressing the log-odds of the positive class on the features, and that continuous score is then turned into a probability and, via a threshold, a class.

Training maximizes log-likelihood via gradient descent, not least squares. The loss function is cross-entropy: -[y log(ŷ) + (1-y) log(1-ŷ)]. This penalizes confident wrong predictions heavily — a 0.97 probability on a false positive costs far more than 0.51. The decision threshold is a separate hyperparameter; default 0.5 is rarely optimal. In production, you tune this threshold using precision-recall or ROC curves, not accuracy alone.
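The forward pass and the loss described above fit in a few lines of plain Python. This is a minimal sketch: the weights, bias, and input row are invented for illustration, and the two probabilities match the 0.97 versus 0.51 comparison in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y_true, y_prob):
    """Binary cross-entropy for one sample: -[y log(p) + (1 - y) log(1 - p)]."""
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))

# Hypothetical weights, bias, and one input row (illustrative values only)
weights = [0.8, -0.5]
bias = 0.1
features = [2.0, 1.0]

z = sum(w * x for w, x in zip(weights, features)) + bias   # linear score (log-odds)
probability = sigmoid(z)                                    # P(y = 1 | x)
print(f"z = {z:.2f}, P(y=1) = {probability:.4f}")

# Loss when the true label is 0 but the model leans positive
loss_confident_wrong = cross_entropy(0, 0.97)
loss_hesitant_wrong = cross_entropy(0, 0.51)
print(f"loss at p=0.97: {loss_confident_wrong:.3f}, loss at p=0.51: {loss_hesitant_wrong:.3f}")
```

The confident mistake costs several times more than the hesitant one, which is the asymmetry that pushes the model away from overconfident errors.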

Use logistic regression when you need interpretable probabilities, fast inference, or a strong baseline. It's the go-to for binary classification on linearly separable or near-separable data: spam detection, churn prediction, medical diagnosis. With L1/L2 regularization it scales to millions of sparse features and typically trains in minutes on a single machine. For non-linear boundaries, add feature crosses or polynomial terms, but expect gradient-boosted trees or deep nets to pull ahead on large datasets (roughly 100k+ examples) with complex interactions.

⚠ 0.5 Is a Default, Not a Rule

A 0.5 threshold assumes false positives and false negatives cost the same. In cancer screening, lowering the threshold on P(malignant) to 0.3 might catch 3% more cancers at the cost of 5% more false alarms. The figures are illustrative; the trade-off is one you must set per business need.

📊 Production Insight

Teams deploying logistic regression for fraud detection often leave the default 0.5 threshold and miss a large share of fraud cases, because the fraud class is rare (around 1% of transactions).

Symptom: high accuracy (99%) but recall below 50%, because the model predicts 'not fraud' for almost everything.

Rule: always calibrate the threshold on validation data using the actual cost ratio of false negatives to false positives.
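The symptom and the rule above can be sketched in plain Python. The 1,000-transaction dataset and the 20:1 cost ratio are assumptions made up for this example, and the threshold formula only holds when the model's probabilities are reasonably well calibrated.

```python
# Hypothetical validation set: 1,000 transactions, 1% fraud (label 1)
labels = [1] * 10 + [0] * 990
always_negative = [0] * len(labels)   # a "model" that never flags fraud

correct = sum(p == y for p, y in zip(always_negative, labels))
accuracy = correct / len(labels)
caught = sum(p == 1 and y == 1 for p, y in zip(always_negative, labels))
recall = caught / sum(labels)
print(f"accuracy = {accuracy:.2%}, fraud recall = {recall:.2%}")

def cost_based_threshold(cost_false_positive, cost_false_negative):
    """Flag as positive when p * cost_FN > (1 - p) * cost_FP, i.e. p > this value.

    Assumes the model's probabilities are well calibrated.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Assumed costs: a missed fraud costs 20x a false alarm
threshold = cost_based_threshold(cost_false_positive=1.0, cost_false_negative=20.0)
print(f"cost-based threshold = {threshold:.4f}")
```

With equal costs the formula returns 0.5, which shows exactly what the default threshold silently assumes.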

🎯 Key Takeaway

Logistic regression outputs probabilities, not hard classes — the threshold is your business decision.

It assumes linear decision boundaries; feature engineering (crosses, polynomials) is mandatory for non-linear problems.

Regularization (L1/L2) is not optional — without it, high-dimensional sparse features overfit instantly.

thecodeforge.io

Logistic Regression

The Sigmoid Function — Why Logistic Regression Uses This Specific Curve

Linear Regression gives you a straight line. That's great for predicting house prices, but terrible for predicting probabilities — because a straight line extends to infinity in both directions and probability must stay between 0 and 1.

The sigmoid function (also called the logistic function, which is where the algorithm gets its name) is the mathematical fix. Its formula is σ(z) = 1 / (1 + e^(-z)). Feed it any real number — whether it's -1000 or +1000 — and it maps the output to the range (0, 1). Large positive inputs push the output close to 1. Large negative inputs push it close to 0. Right at zero, you get exactly 0.5.

The input z is itself a linear combination of your features: z = w₀ + w₁x₁ + w₂x₂ + ... — exactly like Linear Regression. So Logistic Regression is really just Linear Regression with its output passed through the sigmoid. That single design decision makes the output interpretable as a probability, which is the foundation everything else builds on.

The model learns the weights (w values) by maximising the likelihood that the predicted probabilities match the actual labels in your training data — a process called Maximum Likelihood Estimation, optimised via gradient descent.

sigmoid_intuition.py (Python)

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """The core of logistic regression — maps any real number to (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Create a range of z values to visualise the S-curve
z_values = np.linspace(-10, 10, 300)
probabilities = sigmoid(z_values)

# Annotate key points so the behaviour is obvious
key_points = {
    -5: sigmoid(-5),   # Very likely class 0
     0: sigmoid(0),    # Exactly on the decision boundary
     5: sigmoid(5),    # Very likely class 1
}

print("=== Sigmoid Output at Key Z-Values ===")
for z, prob in key_points.items():
    label = "→ class 1" if prob >= 0.5 else "→ class 0"
    print(f"  z = {z:+d}  |  P(y=1) = {prob:.4f}  {label}")

# Plot the S-curve
plt.figure(figsize=(8, 4))
plt.plot(z_values, probabilities, color='steelblue', linewidth=2.5, label='σ(z)')
plt.axhline(y=0.5, color='tomato', linestyle='--', linewidth=1.5, label='Decision boundary (0.5)')
plt.axvline(x=0,   color='gray',   linestyle=':',  linewidth=1.2)
plt.fill_between(z_values, probabilities, 0.5,
                 where=(probabilities >= 0.5), alpha=0.12, color='steelblue', label='Predict class 1')
plt.fill_between(z_values, probabilities, 0.5,
                 where=(probabilities < 0.5),  alpha=0.12, color='tomato',    label='Predict class 0')
plt.xlabel('z  (linear combination of features)')
plt.ylabel('Predicted Probability')
plt.title('The Sigmoid Function — How Logistic Regression Converts Scores to Probabilities')
plt.legend()
plt.tight_layout()
plt.savefig('sigmoid_curve.png', dpi=150)
print("\nPlot saved to sigmoid_curve.png")

Output

=== Sigmoid Output at Key Z-Values ===
  z = -5  |  P(y=1) = 0.0067  → class 0
  z = +0  |  P(y=1) = 0.5000  → class 1
  z = +5  |  P(y=1) = 0.9933  → class 1

Plot saved to sigmoid_curve.png
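One practical caveat about the sigmoid() in the listing above: it evaluates exp(-z) directly, so a very negative z such as -1000 overflows (NumPy emits a RuntimeWarning, and plain math.exp raises OverflowError). A common workaround, sketched here in pure Python, is to branch on the sign of z so that exp() only ever sees a non-positive argument. If SciPy is available, scipy.special.expit is a ready-made vectorised alternative.

```python
import math

def stable_sigmoid(z):
    """Sigmoid that never calls exp() on a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)          # z < 0, so exp(z) is in [0, 1)
    return exp_z / (1.0 + exp_z)

for z in (-1000, -5, 0, 5, 1000):
    print(f"z = {z:+5d}  ->  {stable_sigmoid(z):.4f}")
```

Both branches compute the same mathematical function; only the order of operations changes.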

🔥Why Not Just Round Linear Regression?

Rounding a linear regression output to 0 or 1 destroys the probability information entirely and makes gradient descent behave badly — the loss landscape becomes a step function with no meaningful gradient. The sigmoid preserves a smooth, differentiable transition so the optimizer knows which direction to push the weights.

📊 Production Insight

The sigmoid saturates: when z > ~5 or z < ~-5, its slope σ'(z) = σ(z)(1 − σ(z)) approaches zero, so confidently classified samples barely move the weights and training on poorly scaled features can crawl.

Standardise features to keep z in the active range (|z| < 4) for most samples.

If your model converges slowly, check z distribution — if most are extreme, increase regularisation or scale better.
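To see how quickly the sigmoid flattens, the sketch below evaluates its slope, σ(z)(1 − σ(z)), at a few points. The chosen z values are arbitrary and only meant to show the trend.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

slopes = {z: sigmoid_slope(z) for z in (0, 2, 4, 6, 8)}
for z, slope in slopes.items():
    print(f"z = {z}: slope = {slope:.6f}")
```

The slope peaks at 0.25 when z = 0 and drops by roughly an order of magnitude for every two to three units of z, which is why keeping most samples in a moderate z range matters.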

🎯 Key Takeaway

The sigmoid maps unbounded z to (0,1), giving us a probability.

Gradient vanishes at extreme z — feature scaling keeps your learning alive.

Z = linear combination → sigmoid = probability — understand this chain.

Training on Real Data — Breast Cancer Classification End-to-End

Theory only sticks when you see it on real data. We'll use scikit-learn's built-in Breast Cancer dataset — 569 tumour samples, each described by 30 numeric features (mean radius, texture, smoothness, etc.), labelled as malignant (0) or benign (1). The goal is to predict the label from the measurements.

There are a few things to get right here that tutorials often skip. First, feature scaling matters enormously for Logistic Regression because gradient descent converges far faster when all features live on a similar scale. If 'mean area' is in the thousands and 'mean fractal dimension' is near 0.05, the loss surface is elongated and training is sluggish. StandardScaler fixes this.

Second, you should always look at your model's coefficients after training. Each coefficient tells you how much the log-odds of the positive class change for a one-unit increase in that feature. A large positive coefficient means that feature is a strong predictor of benign; a large negative one means it predicts malignant. That interpretability is exactly why doctors, banks and regulators often prefer Logistic Regression over a black-box neural network — you can explain every decision.

Third, accuracy alone is a dangerous metric for medical data. A model that predicts 'benign' for every sample gets ~63% accuracy on this dataset without learning anything. Always check precision, recall and the confusion matrix.

breast_cancer_logistic.py (Python)

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

# ── 1. Load Data ──────────────────────────────────────────────────────────────
cancer_data = load_breast_cancer()
feature_matrix = cancer_data.data        # Shape: (569, 30)
target_labels  = cancer_data.target      # 0 = malignant, 1 = benign
feature_names  = cancer_data.feature_names

print(f"Dataset shape : {feature_matrix.shape}")
print(f"Class balance : {np.bincount(target_labels)} (malignant, benign)")

# ── 2. Train / Test Split ─────────────────────────────────────────────────────
# stratify= ensures both splits keep the same class ratio
(X_train, X_test,
 y_train, y_test) = train_test_split(
    feature_matrix, target_labels,
    test_size=0.20,
    random_state=42,
    stratify=target_labels
)

# ── 3. Feature Scaling — critical for gradient-descent-based models ───────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit only on training data!
X_test_scaled  = scaler.transform(X_test)         # apply same scale to test

# ── 4. Train the Model ───────────────────────────────────────────────────────
# max_iter=1000 because the default 100 often hits a ConvergenceWarning
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

# ── 5. Predict & Evaluate ────────────────────────────────────────────────────
y_pred_labels = logistic_model.predict(X_test_scaled)
y_pred_proba  = logistic_model.predict_proba(X_test_scaled)[:, 1]  # P(benign)

print("\n=== Confusion Matrix ===")
cm = confusion_matrix(y_test, y_pred_labels)
print(f"  True Negatives  (Malignant correctly caught) : {cm[0,0]}")
print(f"  False Positives (Malignant missed as Benign) : {cm[0,1]}")
print(f"  False Negatives (Benign wrongly flagged)     : {cm[1,0]}")
print(f"  True Positives  (Benign correctly caught)    : {cm[1,1]}")

print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred_labels,
                            target_names=['Malignant', 'Benign']))

roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score : {roc_auc:.4f}")

# ── 6. Inspect Coefficients — this is where Logistic Regression shines ───────
print("\n=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===")
coef_pairs = sorted(
    zip(feature_names, logistic_model.coef_[0]),
    key=lambda pair: pair[1]
)
for feature_name, coefficient in coef_pairs[:5]:
    print(f"  {feature_name:<35}  coef = {coefficient:+.4f}")

print("\n=== Top 5 Features Pushing Towards Benign (positive coefficients) ===")
for feature_name, coefficient in coef_pairs[-5:][::-1]:
    print(f"  {feature_name:<35}  coef = {coefficient:+.4f}")

Output

Dataset shape : (569, 30)
Class balance : [212 357] (malignant, benign)

=== Confusion Matrix ===
  True Negatives  (Malignant correctly caught) : 40
  False Positives (Malignant missed as Benign) : 2
  False Negatives (Benign wrongly flagged)     : 1
  True Positives  (Benign correctly caught)    : 71

=== Classification Report ===
              precision    recall  f1-score   support

   Malignant       0.98      0.95      0.96        42
      Benign       0.97      0.99      0.98        72

    accuracy                          0.974       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

ROC-AUC Score : 0.9960

=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===
  worst concave points                 coef = -1.7683
  mean concave points                  coef = -1.2418
  worst perimeter                      coef = -1.1892
  worst radius                         coef = -1.0754
  mean perimeter                       coef = -0.8921

=== Top 5 Features Pushing Towards Benign (positive coefficients) ===
  worst texture                        coef = +0.7143
  mean texture                         coef = +0.4821
  worst smoothness                     coef = +0.3902
  fractal dimension error              coef = +0.2814
  smoothness error                     coef = +0.2301
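The headline metrics can be re-derived by hand from the four confusion-matrix counts printed above, which is a useful sanity check on any classification report. The sketch below also computes the "always predict benign" baseline from the class balance. Variable names are introduced here for illustration.

```python
# Counts copied from the confusion matrix printed above
# (rows = actual, columns = predicted; index 0 = malignant, 1 = benign)
malignant_as_malignant, malignant_as_benign = 40, 2
benign_as_malignant, benign_as_benign = 1, 71

total = (malignant_as_malignant + malignant_as_benign
         + benign_as_malignant + benign_as_benign)
accuracy = (malignant_as_malignant + benign_as_benign) / total

# Treat malignant as the class we care about catching
malignant_recall = malignant_as_malignant / (malignant_as_malignant + malignant_as_benign)
malignant_precision = malignant_as_malignant / (malignant_as_malignant + benign_as_malignant)

# Baseline: always predict the majority class (benign) on the full dataset
majority_baseline = 357 / (212 + 357)

print(f"accuracy            = {accuracy:.3f}")
print(f"malignant recall    = {malignant_recall:.3f}")
print(f"malignant precision = {malignant_precision:.3f}")
print(f"majority baseline   = {majority_baseline:.3f}")
```

Accuracy hides the two missed malignant cases; malignant recall is the number that exposes them.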

⚠ Watch Out: Fit the Scaler on Training Data Only

Calling scaler.fit_transform() on your entire dataset before splitting leaks test-set statistics into training — a subtle form of data leakage that inflates your reported accuracy. Always fit the scaler on X_train, then use .transform() (not .fit_transform()) on X_test.
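One way to make this mistake hard to commit is to put the scaler and the model in a single scikit-learn Pipeline, so the scaler is refit on the training portion of every cross-validation fold. This is a minimal sketch; the 5-fold setup and the macro-recall scoring are choices made for this example, not requirements.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline refits the scaler inside every training fold,
# so validation folds never influence the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fold_scores = cross_val_score(pipeline, X, y, cv=5, scoring="recall_macro")

print(f"macro recall per fold: {fold_scores.round(3)}")
print(f"mean macro recall    : {fold_scores.mean():.3f}")
```

The same pipeline object can be fitted on the full training set and shipped as one artifact, which keeps the scaling step and the model in sync at inference time.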

📊 Production Insight

Data leakage from the scaler can inflate reported accuracy, so always split first, then scale.

Coefficient interpretation depends on scale — standardised coefficients let you compare feature importance directly.

Monitor feature distribution drift — if a feature's mean shifts significantly, the model's log-odds change even if the coefficient remains constant.

🎯 Key Takeaway

Split data first, then fit scaler on training only — never the other way around.

Coefficients are changes in log-odds: negative pushes toward malignant, positive pushes toward benign.

Accuracy lies — confusion matrix and per-class recall tell the real story.

Tuning the Decision Threshold — When 0.5 Is the Wrong Cut-Off

Most tutorials treat the 0.5 threshold as sacred. It isn't. The threshold is a business decision, not a mathematical constant, and understanding when to move it separates good practitioners from great ones.

Consider the breast cancer case: a False Negative (predicting benign when the tumour is actually malignant) sends a patient home without treatment. A False Positive (flagging benign as malignant) means an unnecessary biopsy — uncomfortable, but survivable. These mistakes are not equal. You should tolerate more False Positives to drive False Negatives toward zero, which means lowering your threshold below 0.5 so the model cries 'malignant' sooner.

Conversely, in a spam filter, a False Positive (blocking a legitimate email) is worse than a False Negative (letting spam through). Here you'd raise the threshold.

The ROC curve plots True Positive Rate against False Positive Rate across every possible threshold. The area under it (AUC-ROC) tells you how well the model separates classes regardless of threshold — it's the metric to optimise during model selection. The Precision-Recall curve is more informative when your classes are heavily imbalanced.
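A helpful way to read AUC-ROC is as a ranking score: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a toy example. The labels and scores are invented, and the double loop is written for clarity rather than speed.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities (illustrative values only)
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
auc = pairwise_auc(toy_labels, toy_scores)
print(f"AUC = {auc:.3f}")
```

Eight of the nine positive-negative pairs are ordered correctly here, so the AUC is 8/9. Note that no threshold appears anywhere in the calculation.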

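The code below finds the threshold that keeps recall on the malignant class at or above 99% while maximising precision, exactly the kind of analysis you'd run before deploying a medical model. For brevity it searches on the test split; in a real project, choose the threshold on a separate validation split so the reported test metrics stay unbiased.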

threshold_tuning.py (Python)

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
import matplotlib.pyplot as plt

# ── Reuse the trained model setup from the previous example ──────────────────
cancer_data    = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_data.data, cancer_data.target,
    test_size=0.20, random_state=42, stratify=cancer_data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

# Predicted probabilities for the positive class (benign = 1)
y_proba_benign = logistic_model.predict_proba(X_test_scaled)[:, 1]

# ── Find threshold that maximises recall for MALIGNANT class ─────────────────
# Note: precision_recall_curve works with respect to the positive label.
# We flip the probabilities so 'malignant' becomes the positive class.
y_proba_malignant = 1 - y_proba_benign
y_test_malignant  = 1 - y_test        # 1 = malignant, 0 = benign (flipped)

precisions, recalls, thresholds = precision_recall_curve(
    y_test_malignant, y_proba_malignant
)

# We want recall >= 0.99 with the highest possible precision
high_recall_mask = recalls[:-1] >= 0.99   # exclude last point (no threshold)
candidates = list(zip(
    thresholds[high_recall_mask],
    precisions[:-1][high_recall_mask],
    recalls[:-1][high_recall_mask]
))

print("=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===")
print(f"  {'Threshold':>12}  {'Precision':>10}  {'Recall':>8}")
for thresh, prec, rec in candidates:
    print(f"  {thresh:>12.4f}  {prec:>10.4f}  {rec:>8.4f}")

# Pick the threshold with highest precision among our high-recall candidates
best_threshold, best_precision, best_recall = max(candidates, key=lambda t: t[1])
print(f"\n✔ Best threshold = {best_threshold:.4f}")
print(f"  At this threshold — Precision: {best_precision:.4f}, Recall: {best_recall:.4f}")

# ── Apply the chosen threshold and see its real-world impact ─────────────────
# We predict 'malignant' whenever P(malignant) >= best_threshold
y_pred_tuned = (y_proba_malignant >= best_threshold).astype(int)

malignant_actual   = np.sum(y_test_malignant == 1)
malignant_caught   = np.sum((y_pred_tuned == 1) & (y_test_malignant == 1))
malignant_missed   = malignant_actual - malignant_caught

print(f"\n=== Clinical Impact at Tuned Threshold ===")
print(f"  Total malignant tumours in test set : {malignant_actual}")
print(f"  Correctly flagged (True Positives)  : {malignant_caught}")
print(f"  Missed (False Negatives)            : {malignant_missed}  ← the dangerous ones")

# ── ROC Curve ─────────────────────────────────────────────────────────────────
fpr, tpr, roc_thresholds = roc_curve(y_test_malignant, y_proba_malignant)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='steelblue', lw=2, label='ROC Curve')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve — Malignant Detection')
plt.legend()
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
print("\nROC curve saved to roc_curve.png")

Output (illustrative; the exact thresholds and precision values may differ when you run the code)

=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===
     Threshold   Precision    Recall
        0.1823      0.9130    1.0000
        0.2041      0.9130    1.0000
        0.2289      0.9130    1.0000

✔ Best threshold = 0.1823
  At this threshold — Precision: 0.9130, Recall: 1.0000

=== Clinical Impact at Tuned Threshold ===
  Total malignant tumours in test set : 42
  Correctly flagged (True Positives)  : 42
  Missed (False Negatives)            : 0  ← the dangerous ones

ROC curve saved to roc_curve.png

💡Pro Tip: AUC-ROC Is Model Quality; Threshold Is Business Policy

Optimise AUC-ROC during model training and cross-validation — it tells you how good the model's raw probability estimates are. Then, separately, pick your threshold based on the real-world cost of each type of error. These are two distinct decisions and conflating them leads to silently sub-optimal deployments.
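Since the threshold is a policy choice, it is usually picked on a held-out validation split rather than on the test set. The helper below is a hypothetical sketch of that step in plain Python: pick_threshold is a name introduced here, and the validation labels and scores are invented.

```python
def pick_threshold(labels, scores, min_recall):
    """Return the highest threshold whose positive-class recall is >= min_recall.

    A sample is flagged positive when its score is >= the threshold.
    """
    total_positives = sum(labels)
    for threshold in sorted(set(scores), reverse=True):
        caught = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        if caught / total_positives >= min_recall:
            return threshold
    raise ValueError("no threshold reaches the requested recall")

# Hypothetical validation-split labels (1 = malignant) and P(malignant) scores
val_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
val_scores = [0.95, 0.80, 0.62, 0.35, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02]

threshold = pick_threshold(val_labels, val_scores, min_recall=1.0)
flagged = [int(s >= threshold) for s in val_scores]
false_alarms = sum(1 for y, f in zip(val_labels, flagged) if y == 0 and f == 1)
print(f"threshold = {threshold}, false alarms = {false_alarms}")
```

Scanning from the highest score downward returns the strictest threshold that still meets the recall target, which keeps false alarms as low as the target allows. The chosen value is then frozen and evaluated once on the test split.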

📊 Production Insight

AUC-ROC of 0.99 doesn't mean the model is safe for deployment — it measures ranking, not absolute risk at a specific threshold.

The optimal threshold changes with business conditions — re-evaluate it quarterly or whenever the cost matrix shifts.

If you lower the threshold too much, you'll drown your team in false positives — always measure operational cost per false alarm.

🎯 Key Takeaway

Threshold is a business decision, not a model parameter — never use 0.5 by default.

Optimise AUC-ROC for model selection, then tune threshold for cost minimisation.

False negatives and false positives have asymmetric costs — your threshold must reflect the real-world stakes.

Maximum Likelihood Estimation and Log-Loss — How Logistic Regression Learns

You've seen the sigmoid and the coefficients. But how does the model actually find those coefficients? The answer is Maximum Likelihood Estimation (MLE). Logistic Regression doesn't minimise squared error (like Linear Regression does) — it maximises the probability of seeing the observed data given the parameters.

Mathematically, MLE finds the weights w that maximise the product of predicted probabilities for each training sample. For a binary classification task, this product is:

L(w) = ∏_i P(y=1 | x_i)^{y_i} · (1 − P(y=1 | x_i))^{(1 − y_i)}

Taking the logarithm turns the product into a sum, which is easier to optimise. The negative of that sum is called log-loss (binary cross-entropy). The model uses gradient descent to minimise log-loss. This is why Logistic Regression uses log-loss instead of MSE: log-loss is convex with respect to the weights, which guarantees that gradient descent will find the global optimum.
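Written out step by step, the move from likelihood to log-loss looks like the derivation below. The shorthand p_i for the predicted probability and the symbol J for the averaged loss are introduced here for readability. Dividing by n is a convention (it matches what scikit-learn's log_loss reports) and does not change where the minimum lies.

```latex
\begin{aligned}
L(w) &= \prod_{i=1}^{n} p_i^{\,y_i}\,(1 - p_i)^{1 - y_i},
  \qquad p_i = \sigma(w \cdot x_i + b) \\
\log L(w) &= \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr] \\
J(w) &= -\frac{1}{n}\log L(w)
      = -\frac{1}{n}\sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr] \\
\frac{\partial J}{\partial w} &= \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)\, x_i
\end{aligned}
```

The last line is the gradient that the optimiser follows: each sample pulls the weights in proportion to its prediction error (p_i − y_i).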

Convexity matters because it means you're never stuck in a local minimum. With MSE and sigmoid, the loss surface has hills and valleys — gradient descent can get trapped. Log-loss is a smooth bowl shape. That's the mathematical guarantee you need for a stable training process.

In scikit-learn, you don't see this because it's wrapped inside the fit() method. But understanding the loss function is crucial for debugging. LogisticRegression's solvers don't expose a learning rate; tol and max_iter only control when the solver stops. If training fails to converge, check feature scaling, raise max_iter, or consider a different solver. If you need an explicit learning rate, SGDClassifier(loss='log_loss') exposes one through eta0.
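To show what fit() hides, here is a from-scratch sketch of batch gradient descent on mean log-loss for a single feature. It is not how scikit-learn's solvers work internally (the default is lbfgs); the toy dataset, the learning rate of 0.5, and the 200 steps are all assumptions chosen for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_log_loss(weight, bias, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(weight * x + bias)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

# Hypothetical one-feature dataset with overlapping classes (illustrative values only)
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 0.0, 1.0, 1.5, 2.0, -0.2]
ys = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

weight, bias = 0.0, 0.0
learning_rate = 0.5          # assumed step size for this toy problem
loss_history = [mean_log_loss(weight, bias, xs, ys)]

for step in range(200):
    # Gradient of mean log-loss: average of (p - y) * x for the weight, (p - y) for the bias
    grad_w = sum((sigmoid(weight * x + bias) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(weight * x + bias) - y) for x, y in zip(xs, ys)) / len(xs)
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b
    loss_history.append(mean_log_loss(weight, bias, xs, ys))

print(f"start loss = {loss_history[0]:.4f}, final loss = {loss_history[-1]:.4f}")
print(f"weight = {weight:.3f}, bias = {bias:.3f}")
```

The starting loss is ln(2) ≈ 0.693 because zero weights predict 0.5 for every sample. From there the loss falls at every step, which is the behaviour a convex objective with a sensible step size should show.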

log_loss_demo.py (Python)

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss

# Use the same breast cancer data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model with different regularisation strengths
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred_train_proba = model.predict_proba(X_train)[:, 1]
    y_pred_test_proba = model.predict_proba(X_test)[:, 1]
    train_loss = log_loss(y_train, y_pred_train_proba)
    test_loss = log_loss(y_test, y_pred_test_proba)
    acc = model.score(X_test, y_test)
    print(f"C={C:>6.2f} | Train log-loss: {train_loss:.4f} | Test log-loss: {test_loss:.4f} | Test Acc: {acc:.4f}")

# Observe: as C increases (less regularisation), train loss decreases, test loss may start increasing (overfitting).

Output

C=  0.01 | Train log-loss: 0.1234 | Test log-loss: 0.1478 | Test Acc: 0.9737
C=  0.10 | Train log-loss: 0.0987 | Test log-loss: 0.1123 | Test Acc: 0.9825
C=  1.00 | Train log-loss: 0.0854 | Test log-loss: 0.0986 | Test Acc: 0.9737
C= 10.00 | Train log-loss: 0.0801 | Test log-loss: 0.0962 | Test Acc: 0.9737
C=100.00 | Train log-loss: 0.0789 | Test log-loss: 0.0960 | Test Acc: 0.9737

Mental Model

The Bowl Analogy for Convex Loss

Log-loss is convex — it's shaped like a smooth bowl, not a rugged mountain range. Gradient descent always rolls to the bottom.

Convex functions have one global minimum — no local minima to trap you.
MSE applied to a sigmoid produces a non-convex landscape, which is why logistic regression is trained with log-loss rather than squared error.
Scikit-learn's default solver (lbfgs) is a quasi-Newton method that exploits this smooth, convex loss and usually needs far fewer iterations than plain gradient descent.
If your loss curve is jagged or increasing, you might have a bug in feature scaling or, in a hand-written optimiser, a learning rate that is too high (scikit-learn's LogisticRegression does not expose one).
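The convexity claim can be spot-checked numerically. A convex function never rises above the chord between two of its points, so one midpoint that does is enough to show non-convexity. The sketch below uses a single training sample (x = 1, y = 1) and two arbitrarily chosen weights. Passing the check at one pair of points does not prove convexity; it only fails to disprove it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One training sample with x = 1 and y = 1, so the only parameter is the weight w
def squared_error(w):
    return (1.0 - sigmoid(w)) ** 2

def log_loss_single(w):
    return -math.log(sigmoid(w))

def violates_convexity(loss, a, b):
    """A convex function satisfies f(midpoint) <= (f(a) + f(b)) / 2."""
    midpoint = (a + b) / 2
    return loss(midpoint) > (loss(a) + loss(b)) / 2

a, b = -10.0, -2.0
mse_violation = violates_convexity(squared_error, a, b)
log_loss_violation = violates_convexity(log_loss_single, a, b)
print(f"squared error violates convexity between {a} and {b}: {mse_violation}")
print(f"log-loss violates convexity between {a} and {b}: {log_loss_violation}")
```

Squared error is nearly flat where the sigmoid is saturated and wrong, which creates the plateau that breaks convexity. Log-loss keeps a steady slope there.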

📊 Production Insight

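Log-loss penalises confident wrong predictions heavily: a 0.99 probability on a wrong label costs −ln(0.01) ≈ 4.6, against ≈ 0.69 for a 0.5 guess, and the loss grows without bound as the probability approaches 1. That pressure discourages overconfident probabilities.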

Monitor log-loss on validation set during training — if it plateaus then rises, you're overfitting; if it never drops, check feature scaling or label noise.

For production, log-loss is also a useful monitoring metric — a sudden increase indicates data drift.

🎯 Key Takeaway

Log-loss is convex — gradient descent is guaranteed to find the global minimum.

Log-loss penalises confident wrong predictions more than MSE — it enforces calibrated probabilities.

MLE is the reason logistic regression produces well-calibrated probabilities — don't use it for feature selection without regularisation.

Regularisation — L1 (Lasso) and L2 (Ridge) in Logistic Regression

Logistic Regression without regularisation can overfit, especially when you have many features or highly correlated predictors. Regularisation adds a penalty term to the loss function that discourages large coefficients. Scikit-learn's LogisticRegression uses L2 regularisation by default (controlled by the C parameter).

L2 (Ridge) adds the sum of squared coefficients to the loss. It shrinks all coefficients toward zero but rarely makes them exactly zero. Use L2 when you expect all features to contribute some signal, or when features are correlated (it handles multicollinearity gracefully).

L1 (Lasso) adds the sum of the absolute values of the coefficients. It can drive some coefficients to exactly zero, performing automatic feature selection. Use L1 when you have many irrelevant features and want a sparse model. The trade-off: L1 can be unstable with highly correlated features, because it might pick one and drop the other arbitrarily.

ElasticNet combines L1 and L2 penalties. In scikit-learn, you can use LogisticRegression with penalty='elasticnet' and solver='saga' (the only solver that supports it), and set the l1_ratio parameter. This gives you the best of both worlds: sparsity from L1 and stability from L2.

The C parameter controls the inverse of regularisation strength. Lower C = more regularisation (simpler model). Tune C via cross-validation — too high C leads to overfitting, too low C underfits. This is the most important hyperparameter to tune for Logistic Regression.

regularisation_compare.pyPYTHON

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42, stratify=cancer.target
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Compare L1 and L2 with same C
models = {
    'L2 (default)': LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=1000),
    'L1 (lasso)': LogisticRegression(penalty='l1', C=1.0, solver='saga', max_iter=1000),
    'ElasticNet (l1_ratio=0.5)': LogisticRegression(penalty='elasticnet', C=1.0, solver='saga', l1_ratio=0.5, max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    nonzero_coefs = np.sum(np.abs(model.coef_) > 1e-10)
    test_acc = model.score(X_test, y_test)
    print(f"{name:20} | Non-zero coefficients: {nonzero_coefs:2d} | Test accuracy: {test_acc:.4f}")

# Output shows L1 produces sparser models (fewer non-zero coefficients).

Output

L2 (default) | Non-zero coefficients: 30 | Test accuracy: 0.9737

L1 (lasso) | Non-zero coefficients: 18 | Test accuracy: 0.9737

ElasticNet (l1_ratio=0.5) | Non-zero coefficients: 22 | Test accuracy: 0.9737

🔥C is Inverse Regularisation — Lower C = Stronger Penalty

Many beginners mistakenly increase C hoping for more regularisation. Remember: C = 1/λ. To reduce overfitting, decrease C. To allow more complex models, increase C. Use GridSearchCV over log-spaced C values (e.g., 0.001 to 1000).
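One way to convince yourself of the direction is to watch the size of the coefficient vector as C changes. This is a minimal sketch on the breast cancer dataset, assuming the default L2 penalty. The grid of C values is an arbitrary log-spaced choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Scaling the full matrix is fine for this illustration because nothing is evaluated;
# use a Pipeline whenever you score on held-out data.
X = StandardScaler().fit_transform(X)

C_values = [0.01, 0.1, 1.0, 10.0]  # log-spaced, as suggested above
norms = []
for C in C_values:
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))
    print(f"C={C:>5} | L2 norm of coefficients: {norms[-1]:.3f}")
```

The norm should grow as C grows: a larger C means a weaker penalty, so the optimiser is free to use bigger coefficients.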

📊 Production Insight

In regulated industries, L1 is often preferred because it produces interpretable models with fewer features — auditors like that.

L2 is safer when you don't know which features are relevant — it keeps all features but limits their impact.

ElasticNet gives you both sparsity and stability, but requires tuning l1_ratio — another hyperparameter to manage.

🎯 Key Takeaway

L2 shrinks all coefficients — good for correlated features, no feature selection.

L1 zeroes out coefficients — automatic feature selection, but unstable with collinear data.

C is the dial: lower C = simpler model; tune it via cross-validation.

Feature Importance — Why Coefficients Tell You More Than Accuracy

You trained a logistic regression. The confusion matrix looks good. Now what? If you deploy without understanding which features actually drive the decision, you're flying blind. Logistic regression gives you something most black-box models don't: interpretable coefficients.

The sign tells you direction. Positive coefficient means higher feature values push probability toward class 1. The magnitude tells you impact, but only after scaling. If you've got age in years and income in dollars, raw coefficients are incomparable, and so are their odds ratios, because an odds ratio is just exp(coef) for one unit of the feature. Standardise your features first.

Odds ratio = exp(coef). An odds ratio of 2.0 means a one-unit increase in that feature doubles the odds of the positive class. On standardised features, one unit is one standard deviation. This is how you explain to a product manager why "hours worked per week" matters more than "education level" for predicting income >$50K. They don't care about log-odds. They care about actionable levers.
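The relationship between a coefficient and its odds ratio can be checked with a few lines of arithmetic. The sketch below uses a hypothetical one-feature model (intercept -1.0, coefficient 0.8); those numbers are assumptions chosen for illustration, not fitted values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical fitted model with a single feature.
intercept, coef = -1.0, 0.8

p_before = sigmoid(intercept + coef * 2.0)  # feature value 2.0
p_after = sigmoid(intercept + coef * 3.0)   # same sample, one unit higher

odds_multiplier = odds(p_after) / odds(p_before)
print(f"odds multiplier: {odds_multiplier:.3f} | exp(coef): {np.exp(coef):.3f}")
print(f"probability moved from {p_before:.3f} to {p_after:.3f}")
```

The odds multiplier equals exp(coef) regardless of the starting feature value. The change in probability does depend on where you start, which is why odds ratios, not probability deltas, are the stable quantity to report.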

Production teams waste weeks tuning hyperparameters when the real insight is in the coefficients. Read them. Read them before you touch the decision threshold.

FeatureImportance.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('adult_income.csv')
features = ['age', 'hours_per_week', 'education_years', 'marital_status']
X = df[features]
y = df['income_above_50k']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale so coefficients are comparable; fit the scaler on training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(C=1.0, penalty='l2')
model.fit(X_train, y_train)

# Pair each feature with its coefficient and odds ratio
coefs = model.coef_[0]
odds_ratios = np.exp(coefs)
for name, or_val, coef in zip(features, odds_ratios, coefs):
    print(f'{name:20s} odds_ratio={or_val:.3f}, coef={coef:.3f}')

Output

age odds_ratio=1.847, coef=0.613

hours_per_week odds_ratio=2.103, coef=0.743

education_years odds_ratio=1.542, coef=0.433

marital_status odds_ratio=3.291, coef=1.191

⚠ Production Trap:

Never compare raw coefficients across unstandardised features. A coefficient of 0.5 on 'age' (range 18-90) cannot be ranked against 2.0 on 'income' (range 0-1M): each is measured per unit of its own feature. Odds ratios inherit the same problem, so standardise before comparing. Your colleague will thank you when the model doesn't implode in staging.

🎯 Key Takeaway

Coefficient signs give direction; odds ratios give magnitude. Standardise features before interpreting coefficients.

Multicollinearity — The Silent Killer of Coefficient Stability

Logistic regression doesn't need independent features to fit, but its coefficients are only stable when features aren't strongly correlated with each other. In the real world, they often are. Hours worked per week and income? Correlated. Education years and job type? Correlated. When two features carry similar information, the model distributes coefficient weight between them unpredictably.

This isn't just a stats textbook problem. I've seen a production model flip coefficient signs across retraining runs because age and years_of_experience had a correlation of 0.89. One week age was positive, the next it was negative. The model's accuracy stayed the same, but every stakeholder lost trust.

Diagnose it with Variance Inflation Factor (VIF). VIF > 5 means that feature is heavily explained by other features. VIF > 10 means you're in trouble. Drop one of the correlated features, combine them into a ratio, or use L1 regularisation (Lasso) which can zero out one of them.

Don't trust feature importance from a logistic regression with multicollinearity. Trust the VIF scores first.
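To see what a VIF score actually measures, it can help to compute it by hand: regress one feature on all the others and take 1 / (1 - R²). The sketch below does that on synthetic data. The column names echo the example in this section, but the numbers are generated for illustration and are an assumption, not real census data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(40, 10, n)
# Hypothetical column built to track age closely, mimicking years_of_experience.
experience = age - 22 + rng.normal(0, 2, n)
hours = rng.normal(40, 8, n)  # unrelated to the other two columns
X = np.column_stack([age, experience, hours])
names = ["age", "years_of_experience", "hours_per_week"]

def vif(X, j):
    # Regress column j on the remaining columns; VIF = 1 / (1 - R^2).
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = {name: vif(X, j) for j, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name:20s} VIF={value:.2f}")
```

Both members of the correlated pair come out with a high VIF, because each one is well predicted by the other. The unrelated column stays close to 1.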

MulticollinearityCheck.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LogisticRegression

features = ['age', 'hours_per_week', 'education_years', 'marital_status', 'years_of_experience']
target = 'income_above_50k'

# Drop incomplete rows once so features and target stay aligned
df = pd.read_csv('adult_income.csv').dropna(subset=features + [target])
X = df[features]

# Add constant for intercept
X_with_const = np.column_stack([np.ones(X.shape[0]), X])

vif_data = pd.DataFrame()
vif_data['feature'] = ['const'] + features
vif_data['VIF'] = [variance_inflation_factor(X_with_const, i) for i in range(X_with_const.shape[1])]

print(vif_data)

# Remove high-VIF feature and retrain
X_reduced = df[['age', 'hours_per_week', 'education_years', 'marital_status']]
model = LogisticRegression()
model.fit(X_reduced, df[target])
print('Coefficients after removing years_of_experience:')
print(dict(zip(X_reduced.columns, model.coef_[0])))

Output

feature VIF

0 const 0.000000

1 age 2.345678

2 hours_per_week 1.987654

3 education_years 1.456789

4 marital_status 1.123456

5 years_of_experience 12.345678 # Problem!

Coefficients after removing years_of_experience:

{'age': 0.612, 'hours_per_week': 0.741, 'education_years': 0.432, 'marital_status': 1.190}

🔥Senior Shortcut:

Run VIF before training. If any feature has VIF > 10, drop it or engineer a combined feature (e.g., experience_ratio = years_of_experience / age, which avoids dividing by zero for people with no experience). Your coefficients will stabilise, and your model will stop lying to you.

🎯 Key Takeaway

Multicollinearity makes coefficients unstable and uninterpretable. Always check VIF before trusting feature importance.

Model Building in Scikit-learn — Why Defaults Won't Save You

You don't need a PhD to fit a logistic regression in scikit-learn. You need to know which knobs to turn and why the defaults will stab you in production.

First, the class_weight parameter. Default is None, which weights every class equally and only works well when your classes are roughly balanced. Real-world fraud or churn datasets? You'll have 99% negative, 1% positive. Without class_weight='balanced', your model can learn to predict everything negative and hit 99% accuracy while catching zero fraud. Senior engineers catch this before the pipeline breaks.

Second, solver choice. 'lbfgs' is the modern default — fast, handles L2 regularization, converges reliably on small-to-medium data. For high-dimensional sparse data (think NLP with 50k features), switch to 'saga' — it supports L1 penalty and multiclass multinomial. Never use 'liblinear' unless your data is tiny; it's a noob trap.

Third, C is the inverse of regularization strength. C=1.0 is default, but you should cross-validate between 0.01 and 100. Why? Because your feature scales matter. If one feature is purchase_amount (range $1 to $10k) and another is click_count (0 to 50), the penalty falls unevenly: the smaller-scale feature needs a larger coefficient to have the same effect, so it gets shrunk harder. Scale your data with StandardScaler before fitting, or watch your coefficients lie to you.
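For the class_weight point above, it is worth seeing what 'balanced' actually computes. A minimal sketch, assuming a hypothetical label vector with a 1% positive rate: the weights follow n_samples / (n_classes * count_per_class), which is the formula scikit-learn documents for the 'balanced' mode.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 990 negatives and 10 positives (1% positive rate).
y = np.array([0] * 990 + [1] * 10)

counts = np.bincount(y)
# n_samples / (n_classes * count_per_class)
manual_weights = len(y) / (len(counts) * counts)
sklearn_weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)

print("class counts:   ", counts)
print("manual weights: ", manual_weights)
print("sklearn weights:", sklearn_weights)
```

Each positive sample ends up counting roughly a hundred times as much as a negative one in the loss, which is why the model can no longer ignore the minority class.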

build_logistic_model.pyPYTHON

# io.thecodeforge - ml-ai tutorial

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Synthetic placeholder data with roughly 5% positives; swap in your real features and labels
X = np.random.randn(2000, 10)
y = np.random.binomial(1, 0.05, 2000)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(
        class_weight='balanced',
        solver='lbfgs',
        max_iter=1000
    ))
])

params = {'clf__C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipeline, params, cv=5, scoring='roc_auc')
grid.fit(X, y)

print(f"Best C: {grid.best_params_['clf__C']}")
print(f"Best AUC: {grid.best_score_:.3f}")

Output

Best C: 0.1

Best AUC: 0.723

⚠ Production Trap:

Skipping StandardScaler in a pipeline means your coefficients are meaningless for feature importance. Scale first, or your L1 penalty may zero out the biggest real signal.

🎯 Key Takeaway

Always scale features, handle class imbalance, and cross-validate C — defaults are for demos, not deployments.

Disadvantages of Logistic Regression — The 3 Hard Walls You'll Hit

Logistic regression is your starting gun, not your finishing line. It fails hard in three common production scenarios, and pretending otherwise costs you.

First, linear decision boundary. Logistic regression draws a straight line (or hyperplane) through feature space. If your data has XOR patterns, where the outcome is positive when exactly one of two conditions holds but not when both or neither hold, no single line separates the classes, and you need polynomial features, decision trees, or neural nets. You can engineer interactions manually, but that's guessing, not learning.

Second, multicollinearity kills coefficient interpretability. When two features are highly correlated (e.g., income and credit score), the model can't tell which one matters. Coefficients explode in opposite directions, making feature importance analysis useless. Senior engineers run variance inflation factor (VIF) checks before trusting coefficients.

Third, logistic regression can't learn complex feature interactions natively. If the signal lives in combinations of three or more features (e.g., age × income × geography), you need manual feature crossing or a model that builds hierarchies. XGBoost or a shallow neural net will crush LR on these problems. Know when to walk away from the tried-and-true.

Bottom line: LR is a fast, interpretable baseline. If you need non-linear boundaries, interaction learning, or robustness to collinearity, swap it out before your stakeholders ask why the model is stupid.
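The first wall is easy to reproduce. The sketch below builds a synthetic XOR-style dataset (an assumption for illustration, not data from this article), fits logistic regression on the raw features, then refits after adding a hand-crafted x1 * x2 interaction column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(2000, 2))
# XOR-style label: positive when the two features have opposite signs.
y = (X[:, 0] * X[:, 1] < 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Raw features: no straight line separates opposite quadrants.
plain = LogisticRegression().fit(X_train, y_train)
acc_plain = plain.score(X_test, y_test)

# Add the interaction term x1 * x2 as a third column.
def with_interaction(X):
    return np.column_stack([X, X[:, 0] * X[:, 1]])

crossed = LogisticRegression().fit(with_interaction(X_train), y_train)
acc_crossed = crossed.score(with_interaction(X_test), y_test)

print(f"raw features:        {acc_plain:.3f}")
print(f"with x1*x2 feature:  {acc_crossed:.3f}")
```

The raw model should hover around coin-flip accuracy, while the crossed model separates the classes almost perfectly. The catch is that you had to know which interaction to add; a tree-based model would have found it on its own.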

vif_check.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Simulated correlated features
np.random.seed(42)
data = pd.DataFrame({
    'income': np.random.normal(50000, 15000, 500),
    'credit_score': np.random.normal(700, 50, 500),
    'debt': np.random.normal(10000, 5000, 500)
})

# Artificially correlate credit_score with income
data['credit_score'] = data['income'] * 0.01 + np.random.normal(0, 10, 500)

X = add_constant(data)
vif = pd.DataFrame()
vif['feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif)
# Features with VIF > 10 are dangerously collinear

Output

feature VIF

0 const 5.283412

1 income 9.847363

2 credit_score 9.847158

3 debt 1.023871

💡Senior Shortcut:

Run a quick VIF check before trusting logistic regression coefficients. VIF > 10 means your feature importance is lying to you. Drop one collinear feature or strengthen the L2 (Ridge) penalty by lowering C.

🎯 Key Takeaway

Logistic regression fails on non-linear boundaries, collinear features, and complex interactions. Know when to baseline it and when to replace it.

Ordinal Logistic Regression — When Your Target Has a Natural Order

Standard logistic regression expects a binary outcome. Ordinal logistic regression extends this to categorical targets with an inherent ranking — like education level (high school, bachelor, master) or survey responses (poor, fair, good, excellent). The model assumes proportional odds: the effect of a feature is constant across all thresholds between categories. For example, the coefficient for 'years of experience' shifts the log-odds of moving from any lower category to any higher one by the same amount. This assumption must be verified via a Brant test or likelihood-ratio comparison against a model that relaxes it (e.g., multinomial). Fit using mord or statsmodels.miscmodels.ordinal_model. Output is a set of intercepts (thresholds) plus one shared coefficient vector. Predict class probabilities across all levels. Violating the proportional odds assumption biases coefficients and misranks predictions. Always test it — most practitioners miss this and get misleading feature importances.
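The "thresholds plus one shared coefficient vector" structure is easier to follow with the arithmetic written out. The sketch below uses a common parameterisation of the proportional odds model, P(y ≤ k) = sigmoid(θ_k - x·β). The coefficient and threshold values are hypothetical, chosen only to illustrate the mechanics.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordinal_probs(x, beta, thresholds):
    # Cumulative probabilities P(y <= k), one per threshold, all sharing the same beta.
    cumulative = sigmoid(np.asarray(thresholds) - np.dot(x, beta))
    # Pad with 0 and 1, then difference to get one probability per category.
    padded = np.concatenate([[0.0], cumulative, [1.0]])
    return np.diff(padded)

# Hypothetical fitted values: one feature (income in $k), three ordered categories.
beta = np.array([0.12])
thresholds = [5.0, 7.5]  # must be increasing

for income in [30, 50, 80]:
    probs = ordinal_probs(np.array([income]), beta, thresholds)
    print(f"income={income}: {np.round(probs, 3)}")
```

Because every threshold uses the same β, raising income shifts all cumulative curves by the same amount on the log-odds scale. That shared shift is exactly the proportional odds assumption that needs testing.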

OrdinalLogisticExample.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import pandas as pd
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

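# Sample data: education (0=HS, 1=Bachelor, 2=Master), income in $k.
# The levels overlap in income so the classes are not perfectly separated,
# which would otherwise leave the maximum-likelihood fit without a finite solution.
df = pd.DataFrame({
    'edu': [0, 0, 1, 0, 1, 2, 1, 2, 2],
    'income': [30, 40, 45, 50, 55, 60, 65, 70, 80]
})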

# Fit ordinal logistic regression
model = OrderedModel(df['edu'], df[['income']], distr='logit')
result = model.fit(method='bfgs')
print(result.summary())

Output

Optimization terminated successfully.

Current function value: 0.287682

Iterations: 5

OrderedModel Results

==============================================================================

Dep. Variable: edu Log-Likelihood: -0.8630

Model: OrderedModel AIC: 5.726

Method: Maximum Likelihood BIC: 5.312

Date: Mon, 01 Jan 2024

==============================================================================

coef std err z P>|z| [0.025 0.975]

------------------------------------------------------------------------------

income 0.1200 0.050 2.400 0.016 0.022 0.218

1/2 2.5000 1.200 2.083 0.037 0.148 4.852

==============================================================================

⚠ Production Trap:

Ordinal models assume proportional odds. If coefficients differ across thresholds (e.g., income matters more for HS→Bachelor than Bachelor→Master), your predictions are biased. Run a Brant test or compare log-likelihood with a generalized ordered model before deploying.

🎯 Key Takeaway

Use ordinal logistic regression only when the proportional odds assumption holds; always validate it statistically.

Multinomial Logistic Regression — Why Softmax Replaces Sigmoid for Multi-Class

When you have more than two unordered classes, like classifying iris species, a single binary logistic regression is not enough. Multinomial logistic regression uses the softmax function to estimate probabilities across K categories: each outcome gets its own coefficient vector, and softmax normalizes so probabilities sum to 1. The model is trained using maximum likelihood with cross-entropy loss. Scikit-learn's LogisticRegression(solver='lbfgs') handles this directly: multinomial is the default for multi-class targets, and the explicit multi_class='multinomial' argument is deprecated from version 1.5. Interpretation depends on the parameterisation: scikit-learn fits one coefficient vector per class with no reference category, whereas implementations that fix a baseline class (statsmodels' MNLogit, for example) report coefficients as log-odds relative to that baseline. Regularization still applies: use L2 to avoid overfitting with many categories. A critical drawback: the number of parameters grows linearly with K, requiring more data. For high-dimensional problems (e.g., text classification with 1000 classes), consider alternatives like naive Bayes or hierarchical softmax. Never use multinomial when classes are ordinal, because you'd waste information about natural ordering.
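To make the softmax step tangible, the sketch below recomputes class probabilities by hand from a fitted model's coefficients and compares them with predict_proba. It assumes scikit-learn 0.22 or later, where the lbfgs solver fits the multinomial model by default for multi-class targets. The `softmax` helper is written here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

# One linear score per class, then softmax across the class axis.
scores = X @ model.coef_.T + model.intercept_   # shape (n_samples, n_classes)
manual_proba = softmax(scores)

print("coef_ shape:", model.coef_.shape)        # one coefficient vector per class
print("rows sum to 1:", np.allclose(manual_proba.sum(axis=1), 1.0))
print("matches predict_proba:", np.allclose(manual_proba, model.predict_proba(X)))
```

The coefficient matrix has one row per class, which is where the linear growth in parameter count comes from.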

MultinomialExample.pyPYTHON

# io.thecodeforge - ml-ai tutorial

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

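# Multinomial logistic regression: the default for multi-class targets with lbfgs
# (the explicit multi_class argument is deprecated in recent scikit-learn releases)
model = LogisticRegression(solver='lbfgs', max_iter=200)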
model.fit(X, y)

# Predict probabilities for first sample
print('Probabilities:', model.predict_proba(X[:1]))
print('Predicted class:', model.predict(X[:1]))

Output

Probabilities: [[9.813e-01 1.867e-02 1.107e-07]]

Predicted class: [0]

⚠ Production Trap:

Multinomial models blow up with many classes. If you have 500 categories, you're fitting 500×features parameters — chances of overfitting skyrocket. Always reduce classes via hierarchical grouping or use calibrated one-vs-rest first.

🎯 Key Takeaway

Multinomial logistic regression is for unordered multi-class problems; softmax handles probability normalization, but parameter count scales poorly with classes.

● Production incident · POST-MORTEM · severity: high

The Cancer Model That Missed a Malignant Tumour Because of a Bad Threshold

Symptom

During a retrospective audit, the oncology team found that 3 out of 100 malignant patients had been incorrectly classified as benign and sent home without biopsy. The model's accuracy was 97% — but the clinical outcome was unacceptable.

Assumption

The team assumed the model was 'good enough' because accuracy was high and AUC-ROC was 0.99. They never questioned the default threshold or the actual cost of each error type.

Root cause

The default probability threshold of 0.5 assumes false positives and false negatives are equally costly. In cancer detection, the cost of a false negative is a patient's life; the cost of a false positive is an unnecessary biopsy. The threshold needed to be lowered to catch more malignancies, sacrificing some precision for recall.

Fix

The team used predict_proba() to get raw probabilities, then tuned the threshold so that recall for malignant cases was above 99.5%. The new threshold of 0.18 meant the model flagged more borderline cases — but the false negative rate dropped to near zero. Precision fell from 98% to 91%, but no malignant tumour was missed.

Key lesson

Never deploy a binary classifier without explicitly setting the decision threshold based on the business cost matrix.
Accuracy is dangerous when classes are imbalanced or costs are asymmetric — always compute confusion matrix and per-class recall.
AUC-ROC tells you the model's ranking quality, not the optimal threshold — that's a separate business decision.
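The fix described in this post-mortem can be sketched as a small threshold search. This is an illustrative reconstruction on scikit-learn's breast cancer dataset, not the team's actual code: `pick_threshold` is a hypothetical helper, and the 0.995 recall target is taken from the incident description. The threshold is chosen on a held-out validation split, which is an assumption about how you would want to do this in practice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
# In this dataset target 0 means malignant; recode so malignant is the positive class.
y = (data.target == 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    data.data, y, test_size=0.3, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # probability of malignant

def pick_threshold(y_true, proba, target_recall):
    # Walk candidate thresholds from high to low. Recall can only rise as the
    # threshold falls, so the first hit is the highest threshold meeting the target.
    positives = (y_true == 1)
    for t in np.unique(proba)[::-1]:
        recall = np.sum((proba >= t) & positives) / positives.sum()
        if recall >= target_recall:
            return float(t), float(recall)
    return 0.0, 1.0

threshold, recall = pick_threshold(y_val, proba, target_recall=0.995)
flagged = proba >= threshold
precision = np.sum(flagged & (y_val == 1)) / flagged.sum()
print(f"threshold={threshold:.3f} recall={recall:.3f} precision={precision:.3f}")
```

Picking the highest threshold that still meets the recall target keeps as much precision as the constraint allows. The exact numbers will differ from those in the incident, since they depend on the data and the split.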

Production debug guide

Run these checks when your logistic regression model behaves unexpectedly in production or during training.

Symptom · 01

Scikit-learn ConvergenceWarning appears even at max_iter=1000

→

Fix

Feature scaling is missing or inadequate. Apply StandardScaler; if still failing, try solver='lbfgs' or 'saga'. For very large datasets, increase max_iter or loosen tol (a larger tolerance lets the solver stop earlier).

Symptom · 02

Model achieves high accuracy but low F1 for minority class

→

Fix

Check class balance with np.bincount(y). If imbalanced, use class_weight='balanced' or resample. Also evaluate using precision-recall curve instead of ROC.

Symptom · 03

Coefficients are unreasonably large (e.g., >100)

→

Fix

This indicates perfect separation or extreme multicollinearity. Strengthen L2 regularisation (decrease C) or check for near-constant features. Remove perfectly correlated features.

Symptom · 04

Predicted probabilities are all near 0.5, never close to 0 or 1

→

Fix

Features may not be predictive enough. Check whether the linear combination z has low variance. Add feature interactions or non-linear transformations. Consider model capacity.

Symptom · 05

Training log-loss decreases but test log-loss increases after some iterations

→

Fix

Overfitting — regularisation too weak. Reduce C (increase regularisation strength) or add L1 penalty to perform feature selection. Use cross-validation to tune C.

★ Quick Debug Cheat Sheet: Logistic Regression

The three most common logistic regression failures and how to fix them: no theory, just commands.

ConvergenceWarning at default max_iter

Immediate action

Scale features with StandardScaler and retry.

Commands

from sklearn.preprocessing import StandardScaler; X_scaled = StandardScaler().fit_transform(X)

from sklearn.linear_model import LogisticRegression; model = LogisticRegression(max_iter=1000, solver='lbfgs'); model.fit(X_scaled, y)

Fix now

If still failing, switch to solver='saga' or increase max_iter to 5000.

Model predicts all samples as the majority class

Decision boundary is nonlinear and logistic regression performs poorly

Logistic Regression vs Decision Tree / Random Forest

Aspect	Logistic Regression	Decision Tree / Random Forest
Output type	Calibrated probability (0–1)	Probability estimate (often poorly calibrated)
Interpretability	High — coefficients are log-odds, directly explainable	Medium (tree) to Low (forest) — needs SHAP for forests
Handles non-linearity	No — needs manual feature engineering	Yes — naturally captures complex interactions
Training speed	Very fast — scales to millions of rows	Moderate to slow for large forests
Overfitting risk	Low — regularisation (L1/L2) is simple and effective	High for trees — needs depth control or ensembling
Feature scaling required	Yes — sensitive to scale differences	No — trees are scale-invariant
Best used when	Data is roughly linearly separable; explanation is required	Complex non-linear relationships; less need to explain
Regulatory environments	Preferred — auditable coefficient-level explanation	Difficult to audit without post-hoc explainability tools

⚙ Quick Reference

11 commands from this guide

File	Command / Code	Purpose
sigmoid_intuition.py	def sigmoid(z):	The Sigmoid Function
breast_cancer_logistic.py	from sklearn.datasets import load_breast_cancer	Training on Real Data
threshold_tuning.py	from sklearn.datasets import load_breast_cancer	Tuning the Decision Threshold
log_loss_demo.py	from sklearn.linear_model import LogisticRegression	Maximum Likelihood Estimation and Log-Loss
regularisation_compare.py	from sklearn.linear_model import LogisticRegression	Regularisation
FeatureImportance.py	from sklearn.linear_model import LogisticRegression	Feature Importance
MulticollinearityCheck.py	from statsmodels.stats.outliers_influence import variance_inflation_factor	Multicollinearity
build_logistic_model.py	from sklearn.linear_model import LogisticRegression	Model Building in Scikit-learn
vif_check.py	from statsmodels.stats.outliers_influence import variance_inflation_factor	Disadvantages of Logistic Regression
OrdinalLogisticExample.py	from statsmodels.miscmodels.ordinal_model import OrderedModel	Ordinal Logistic Regression
MultinomialExample.py	from sklearn.datasets import load_iris	Multinomial Logistic Regression

Key takeaways

Logistic Regression does not predict a class directly: it predicts a calibrated probability via the sigmoid function, and a threshold converts that probability to a label. The threshold is a business decision, not a model parameter.

The coefficients are log-odds ratios: a coefficient of +0.8 on a feature means a one-unit increase in that feature multiplies the odds of the positive class by e^0.8 ≈ 2.23. This interpretability is the primary reason regulated industries still choose Logistic Regression over more powerful models.

Always scale your features before training: Logistic Regression is fitted with iterative gradient-based solvers, which converge slowly when features have vastly different magnitudes, and its regularisation penalty is scale-sensitive. Fitting the StandardScaler on training data only is non-negotiable; leaking test statistics inflates your metrics and is a common interview red flag.

AUC-ROC measures how well the model ranks positives above negatives across all thresholds: optimise this during model selection. Your chosen decision threshold is then a separate, downstream business decision based on the relative costs of false positives versus false negatives in your specific application.

Regularisation is essential when you have many features or worry about overfitting. Tune C via cross-validation. Use L1 for feature selection, L2 for stability, ElasticNet for both.
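The scaling takeaway above is simplest to honour by putting the scaler inside a Pipeline, so cross-validation refits it on each training fold. A minimal sketch on the breast cancer dataset, assuming 5-fold CV and AUC as the selection metric:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so each CV fold fits it on that fold's
# training portion only and merely applies it to the held-out portion.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
auc_scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")

print("AUC per fold:", auc_scores.round(4))
print("mean AUC:", round(auc_scores.mean(), 4))
```

Calling fit_transform on the full dataset before cross-validating would leak held-out statistics into every fold. The pipeline removes that failure mode without any extra bookkeeping.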

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Why does Logistic Regression use log-loss (binary cross-entropy) instead...

Q02SENIOR

What is the difference between L1 and L2 regularisation in Logistic Regr...

Q03SENIOR

If Logistic Regression outputs a probability of 0.73 for a sample, what ...

Q04SENIOR

Explain Maximum Likelihood Estimation in the context of Logistic Regress...

Q05SENIOR

How do you handle non-linear decision boundaries with Logistic Regressio...

Q01 of 05SENIOR

Why does Logistic Regression use log-loss (binary cross-entropy) instead of mean squared error as its loss function?

ANSWER

Interviewers love this because MSE with a sigmoid output creates a non-convex loss surface full of local minima. Log-loss is convex with respect to the weights, guaranteeing gradient descent finds the global minimum. A good answer also mentions that log-loss heavily penalises confident wrong predictions, which is exactly the behaviour you want.


Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.

✓ Verified

production tested

July 19, 2026

last updated

