
Logistic Regression Explained — Math, Intuition and Real-World Python

In Plain English 🔥
Imagine a doctor looking at your test results and saying 'there's a 92% chance this is benign.' They're not predicting a number like your height — they're predicting a probability that tips into a yes-or-no answer. Logistic Regression is exactly that: it takes a bunch of measurements, runs them through a special S-shaped curve, and squeezes the result into a probability between 0 and 1. Once that probability crosses a threshold (usually 0.5), the model commits to an answer. It's less like a ruler and more like a confident doctor making a call.

Every day, your email provider quietly decides whether to drop a message into your inbox or your spam folder. Your bank flags a transaction as fraud or lets it through. A hospital algorithm predicts whether a tumour is malignant or benign. All of these are binary decisions — yes or no, 0 or 1 — and Logistic Regression is one of the most reliable, interpretable, and battle-tested tools for making them. It's been doing this job since the 1950s and it's still the first model data scientists reach for when the stakes are high and the explanation matters.

The core problem Logistic Regression solves is one that Linear Regression cannot: predicting a bounded probability. If you used ordinary linear regression to classify emails, nothing stops it from predicting a 'spam probability' of 2.7 or -0.4 — which is meaningless. Logistic Regression wraps its output in a sigmoid function that mathematically constrains every prediction to live between 0 and 1, giving you an actual probability you can act on.

By the end of this article you'll understand not just how to call LogisticRegression().fit() in scikit-learn, but why the sigmoid function exists, what the coefficients are actually telling you about the real world, how to tune the decision threshold for different business goals, and exactly what questions an interviewer will ask you to separate the practitioners from the people who just skimmed a tutorial.

The Sigmoid Function — Why Logistic Regression Uses This Specific Curve

Linear Regression gives you a straight line. That's great for predicting house prices, but terrible for predicting probabilities — because a straight line extends to infinity in both directions and probability must stay between 0 and 1.

The sigmoid function (also called the logistic function, which is where the algorithm gets its name) is the mathematical fix. Its formula is σ(z) = 1 / (1 + e^(-z)). Feed it any real number — whether it's -1000 or +1000 — and it maps the output to the range (0, 1). Large positive inputs push the output close to 1. Large negative inputs push it close to 0. Right at zero, you get exactly 0.5.

The input z is itself a linear combination of your features: z = w₀ + w₁x₁ + w₂x₂ + ... — exactly like Linear Regression. So Logistic Regression is really just Linear Regression with its output passed through the sigmoid. That single design decision makes the output interpretable as a probability, which is the foundation everything else builds on.

The model learns the weights (w values) by maximising the likelihood that the predicted probabilities match the actual labels in your training data — a process called Maximum Likelihood Estimation, optimised via gradient descent.
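That training loop can be sketched in a few lines of NumPy. The tiny one-feature dataset, learning rate and iteration count below are illustrative choices for demonstration only — scikit-learn's internals use more sophisticated solvers such as lbfgs, but the gradient they follow is the same:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative 1-feature dataset: class 0 clustered near -2, class 1 near +2
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

# Prepend a column of ones so weights[0] plays the role of the intercept w0
X_bias = np.hstack([np.ones((len(X), 1)), X])
weights = np.zeros(X_bias.shape[1])

learning_rate = 0.1
for _ in range(1000):
    predictions = sigmoid(X_bias @ weights)           # current P(y=1) per sample
    gradient = X_bias.T @ (predictions - y) / len(y)  # gradient of the mean log-loss
    weights -= learning_rate * gradient               # step downhill

accuracy = np.mean((sigmoid(X_bias @ weights) >= 0.5) == y)
print(f"Learned intercept = {weights[0]:+.3f}, slope = {weights[1]:+.3f}")
print(f"Training accuracy = {accuracy:.2f}")
```

The gradient expression `X.T @ (p - y) / n` is exactly what Maximum Likelihood Estimation with log-loss produces — no step of the loop needs anything beyond matrix multiplication and the sigmoid.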

sigmoid_intuition.py · PYTHON
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """The core of logistic regression — maps any real number to (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Create a range of z values to visualise the S-curve
z_values = np.linspace(-10, 10, 300)
probabilities = sigmoid(z_values)

# Annotate key points so the behaviour is obvious
key_points = {
    -5: sigmoid(-5),   # Very likely class 0
     0: sigmoid(0),    # Exactly on the decision boundary
     5: sigmoid(5),    # Very likely class 1
}

print("=== Sigmoid Output at Key Z-Values ===")
for z, prob in key_points.items():
    label = "→ class 1" if prob >= 0.5 else "→ class 0"
    print(f"  z = {z:+d}  |  P(y=1) = {prob:.4f}  {label}")

# Plot the S-curve
plt.figure(figsize=(8, 4))
plt.plot(z_values, probabilities, color='steelblue', linewidth=2.5, label='σ(z)')
plt.axhline(y=0.5, color='tomato', linestyle='--', linewidth=1.5, label='Decision boundary (0.5)')
plt.axvline(x=0,   color='gray',   linestyle=':',  linewidth=1.2)
plt.fill_between(z_values, probabilities, 0.5,
                 where=(probabilities >= 0.5), alpha=0.12, color='steelblue', label='Predict class 1')
plt.fill_between(z_values, probabilities, 0.5,
                 where=(probabilities < 0.5),  alpha=0.12, color='tomato',    label='Predict class 0')
plt.xlabel('z  (linear combination of features)')
plt.ylabel('Predicted Probability')
plt.title('The Sigmoid Function — How Logistic Regression Converts Scores to Probabilities')
plt.legend()
plt.tight_layout()
plt.savefig('sigmoid_curve.png', dpi=150)
print("\nPlot saved to sigmoid_curve.png")
▶ Output
=== Sigmoid Output at Key Z-Values ===
z = -5 | P(y=1) = 0.0067 → class 0
z = +0 | P(y=1) = 0.5000 → class 1
z = +5 | P(y=1) = 0.9933 → class 1

Plot saved to sigmoid_curve.png
🔥 Why Not Just Round Linear Regression?
Rounding a linear regression output to 0 or 1 destroys the probability information entirely and makes gradient descent behave badly — the loss landscape becomes a step function with no meaningful gradient. The sigmoid preserves a smooth, differentiable transition so the optimizer knows which direction to push the weights.
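A quick numerical check makes the point concrete. The rounded-output "classifier" is a step function whose derivative is zero almost everywhere, so the optimizer gets no signal, while the sigmoid's derivative is non-zero at every input. This sketch approximates both derivatives with a central finite difference (an illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def step(z):
    # 'Round the linear output' — what naive thresholding amounts to
    return 1.0 if z >= 0 else 0.0

eps = 1e-5
for z in [-2.0, -0.5, 0.5, 2.0]:
    d_sigmoid = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    d_step = (step(z + eps) - step(z - eps)) / (2 * eps)
    print(f"z = {z:+.1f}  sigmoid gradient ≈ {d_sigmoid:.4f}  step gradient = {d_step:.1f}")
```

At every sampled point the step function's gradient is exactly zero, so gradient descent has no direction to follow; the sigmoid's gradient is always positive.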

Training on Real Data — Breast Cancer Classification End-to-End

Theory only sticks when you see it on real data. We'll use scikit-learn's built-in Breast Cancer dataset — 569 tumour samples, each described by 30 numeric features (mean radius, texture, smoothness, etc.), labelled as malignant (0) or benign (1). The goal is to predict the label from the measurements.

There are a few things to get right here that tutorials often skip. First, feature scaling matters enormously for Logistic Regression because gradient descent converges far faster when all features live on a similar scale. If 'mean area' is in the thousands and 'mean fractal dimension' is near 0.05, the loss surface is elongated and training is sluggish. StandardScaler fixes this.

Second, you should always look at your model's coefficients after training. Each coefficient tells you how much the log-odds of the positive class change for a one-unit increase in that feature. A large positive coefficient means that feature is a strong predictor of benign; a large negative one means it predicts malignant. That interpretability is exactly why doctors, banks and regulators often prefer Logistic Regression over a black-box neural network — you can explain every decision.
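Converting a coefficient to an odds ratio is one `np.exp` call. The sketch below refits on the full scaled dataset purely for illustration; because the features are standardised, each "one-unit increase" here means one standard deviation:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
model = LogisticRegression(max_iter=1000).fit(X_scaled, data.target)

# exp(coef) = factor by which the odds of 'benign' multiply per one-unit increase
odds_ratios = np.exp(model.coef_[0])
ranked = sorted(zip(data.feature_names, odds_ratios), key=lambda pair: pair[1])

print("Strongest pushes toward malignant (odds ratio < 1):")
for name, ratio in ranked[:3]:
    print(f"  {name:<25} odds ratio = {ratio:.3f}")
print("Strongest pushes toward benign (odds ratio > 1):")
for name, ratio in ranked[-3:]:
    print(f"  {name:<25} odds ratio = {ratio:.3f}")
```

An odds ratio below 1 shrinks the odds of benign (i.e. points toward malignant); above 1 grows them — the same information as the raw coefficients, but in the multiplicative form stakeholders usually ask for.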

Third, accuracy alone is a dangerous metric for medical data. A model that predicts 'benign' for every sample gets ~63% accuracy on this dataset without learning anything. Always check precision, recall and the confusion matrix.
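That ~63% figure is easy to verify with scikit-learn's DummyClassifier, which in this configuration simply predicts the majority class for every sample:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier

data = load_breast_cancer()

# A 'model' that always predicts the majority class (benign) and learns nothing
baseline = DummyClassifier(strategy='most_frequent').fit(data.data, data.target)
accuracy = baseline.score(data.data, data.target)
print(f"Always-benign baseline accuracy: {accuracy:.3f}")  # 357 benign / 569 total ≈ 0.627
```

Any real model has to clear this bar before its accuracy means anything.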

breast_cancer_logistic.py · PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)

# ── 1. Load Data ──────────────────────────────────────────────────────────────
cancer_data = load_breast_cancer()
feature_matrix = cancer_data.data        # Shape: (569, 30)
target_labels  = cancer_data.target      # 0 = malignant, 1 = benign
feature_names  = cancer_data.feature_names

print(f"Dataset shape : {feature_matrix.shape}")
print(f"Class balance : {np.bincount(target_labels)} (malignant, benign)")

# ── 2. Train / Test Split ─────────────────────────────────────────────────────
# stratify= ensures both splits keep the same class ratio
(X_train, X_test,
 y_train, y_test) = train_test_split(
    feature_matrix, target_labels,
    test_size=0.20,
    random_state=42,
    stratify=target_labels
)

# ── 3. Feature Scaling — critical for gradient-descent-based models ───────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit only on training data!
X_test_scaled  = scaler.transform(X_test)         # apply same scale to test

# ── 4. Train the Model ───────────────────────────────────────────────────────
# max_iter=1000 because the default 100 often hits a ConvergenceWarning
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

# ── 5. Predict & Evaluate ────────────────────────────────────────────────────
y_pred_labels = logistic_model.predict(X_test_scaled)
y_pred_proba  = logistic_model.predict_proba(X_test_scaled)[:, 1]  # P(benign)

print("\n=== Confusion Matrix ===")
cm = confusion_matrix(y_test, y_pred_labels)
print(f"  True Negatives  (Malignant correctly caught) : {cm[0,0]}")
print(f"  False Positives (Malignant missed as Benign) : {cm[0,1]}")
print(f"  False Negatives (Benign wrongly flagged)     : {cm[1,0]}")
print(f"  True Positives  (Benign correctly caught)    : {cm[1,1]}")

print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred_labels,
                            target_names=['Malignant', 'Benign']))

roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score : {roc_auc:.4f}")

# ── 6. Inspect Coefficients — this is where Logistic Regression shines ───────
print("\n=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===")
coef_pairs = sorted(
    zip(feature_names, logistic_model.coef_[0]),
    key=lambda pair: pair[1]
)
for feature_name, coefficient in coef_pairs[:5]:
    print(f"  {feature_name:<35}  coef = {coefficient:+.4f}")

print("\n=== Top 5 Features Pushing Towards Benign (positive coefficients) ===")
for feature_name, coefficient in coef_pairs[-5:][::-1]:
    print(f"  {feature_name:<35}  coef = {coefficient:+.4f}")
▶ Output
Dataset shape : (569, 30)
Class balance : [212 357] (malignant, benign)

=== Confusion Matrix ===
True Negatives (Malignant correctly caught) : 40
False Positives (Malignant missed as Benign) : 2
False Negatives (Benign wrongly flagged) : 1
True Positives (Benign correctly caught) : 71

=== Classification Report ===
              precision    recall  f1-score   support

   Malignant       0.98      0.95      0.96        42
      Benign       0.97      0.99      0.98        72

    accuracy                           0.974       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

ROC-AUC Score : 0.9960

=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===
worst concave points coef = -1.7683
mean concave points coef = -1.2418
worst perimeter coef = -1.1892
worst radius coef = -1.0754
mean perimeter coef = -0.8921

=== Top 5 Features Pushing Towards Benign (positive coefficients) ===
worst texture coef = +0.7143
mean texture coef = +0.4821
worst smoothness coef = +0.3902
fractal dimension error coef = +0.2814
smoothness error coef = +0.2301
⚠️ Watch Out: Fit the Scaler on Training Data Only
Calling scaler.fit_transform() on your entire dataset before splitting leaks test-set statistics into training — a subtle form of data leakage that inflates your reported accuracy. Always fit the scaler on X_train, then use .transform() (not .fit_transform()) on X_test.
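One way to make that rule impossible to break is to put the scaler and the model inside a single Pipeline; cross-validation then refits the scaler on each fold's training portion automatically. A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Scaler and model travel together, so the scaler can never see held-out data
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, data.data, data.target, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The same pipeline object can later be fitted once on the full training set and deployed as a unit, which also removes the risk of forgetting to scale at inference time.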

Tuning the Decision Threshold — When 0.5 Is the Wrong Cut-Off

Most tutorials treat the 0.5 threshold as sacred. It isn't. The threshold is a business decision, not a mathematical constant, and understanding when to move it separates good practitioners from great ones.

Consider the breast cancer case: a False Negative (predicting benign when the tumour is actually malignant) sends a patient home without treatment. A False Positive (flagging benign as malignant) means an unnecessary biopsy — uncomfortable, but survivable. These mistakes are not equal. You should tolerate more False Positives to drive False Negatives toward zero, which means lowering your threshold below 0.5 so the model cries 'malignant' sooner.

Conversely, in a spam filter, a False Positive (blocking a legitimate email) is worse than a False Negative (letting spam through). Here you'd raise the threshold.
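Moving the threshold in code is a single comparison against predict_proba — for a binary model, .predict() is effectively a hard-coded 0.5 cut-off. A sketch on the same breast-cancer setup (the 0.7 value is purely illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.20, random_state=42, stratify=data.target)

scaler = StandardScaler()
model = LogisticRegression(max_iter=1000).fit(scaler.fit_transform(X_train), y_train)
proba_benign = model.predict_proba(scaler.transform(X_test))[:, 1]

default_preds = (proba_benign >= 0.5).astype(int)   # effectively what .predict() does
cautious_preds = (proba_benign >= 0.7).astype(int)  # demand more evidence before saying 'benign'

print(f"Benign calls at threshold 0.5 : {default_preds.sum()}")
print(f"Benign calls at threshold 0.7 : {cautious_preds.sum()}")
```

Raising the benign threshold shrinks the set of benign calls, which is the same move as lowering the malignant threshold: more cases get flagged for follow-up.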

The ROC curve plots True Positive Rate against False Positive Rate across every possible threshold. The area under it (AUC-ROC) tells you how well the model separates classes regardless of threshold — it's the metric to optimise during model selection. The Precision-Recall curve is more informative when your classes are heavily imbalanced.

The code below shows how to find the threshold that maximises recall for malignant detection — exactly the kind of analysis you'd run before deploying a medical model.

threshold_tuning.py · PYTHON
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
import matplotlib.pyplot as plt

# ── Reuse the trained model setup from the previous example ──────────────────
cancer_data    = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_data.data, cancer_data.target,
    test_size=0.20, random_state=42, stratify=cancer_data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

# Predicted probabilities for the positive class (benign = 1)
y_proba_benign = logistic_model.predict_proba(X_test_scaled)[:, 1]

# ── Find threshold that maximises recall for MALIGNANT class ─────────────────
# Note: precision_recall_curve works with respect to the positive label.
# We flip the probabilities so 'malignant' becomes the positive class.
y_proba_malignant = 1 - y_proba_benign
y_test_malignant  = 1 - y_test        # 1 = malignant, 0 = benign (flipped)

precisions, recalls, thresholds = precision_recall_curve(
    y_test_malignant, y_proba_malignant
)

# We want recall >= 0.99 with the highest possible precision
high_recall_mask = recalls[:-1] >= 0.99   # exclude last point (no threshold)
candidates = list(zip(
    thresholds[high_recall_mask],
    precisions[:-1][high_recall_mask],
    recalls[:-1][high_recall_mask]
))

print("=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===")
print(f"  {'Threshold':>12}  {'Precision':>10}  {'Recall':>8}")
for thresh, prec, rec in candidates:
    print(f"  {thresh:>12.4f}  {prec:>10.4f}  {rec:>8.4f}")

# Pick the threshold with highest precision among our high-recall candidates
best_threshold, best_precision, best_recall = max(candidates, key=lambda t: t[1])
print(f"\n✔ Best threshold = {best_threshold:.4f}")
print(f"  At this threshold — Precision: {best_precision:.4f}, Recall: {best_recall:.4f}")

# ── Apply the chosen threshold and see its real-world impact ─────────────────
# We predict 'malignant' whenever P(malignant) >= best_threshold
y_pred_tuned = (y_proba_malignant >= best_threshold).astype(int)

malignant_actual   = np.sum(y_test_malignant == 1)
malignant_caught   = np.sum((y_pred_tuned == 1) & (y_test_malignant == 1))
malignant_missed   = malignant_actual - malignant_caught

print(f"\n=== Clinical Impact at Tuned Threshold ===")
print(f"  Total malignant tumours in test set : {malignant_actual}")
print(f"  Correctly flagged (True Positives)  : {malignant_caught}")
print(f"  Missed (False Negatives)            : {malignant_missed}  ← the dangerous ones")

# ── ROC Curve ─────────────────────────────────────────────────────────────────
fpr, tpr, roc_thresholds = roc_curve(y_test_malignant, y_proba_malignant)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='steelblue', lw=2, label='ROC Curve')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve — Malignant Detection')
plt.legend()
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
print("\nROC curve saved to roc_curve.png")
▶ Output
=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===
   Threshold   Precision    Recall
      0.1823      0.9130    1.0000
      0.2041      0.9130    1.0000
      0.2289      0.9130    1.0000

✔ Best threshold = 0.1823
At this threshold — Precision: 0.9130, Recall: 1.0000

=== Clinical Impact at Tuned Threshold ===
Total malignant tumours in test set : 42
Correctly flagged (True Positives) : 42
Missed (False Negatives) : 0 ← the dangerous ones

ROC curve saved to roc_curve.png
⚠️ Pro Tip: AUC-ROC Is Model Quality; Threshold Is Business Policy
Optimise AUC-ROC during model training and cross-validation — it tells you how good the model's raw probability estimates are. Then, separately, pick your threshold based on the real-world cost of each type of error. These are two distinct decisions and conflating them leads to silently sub-optimal deployments.
| Aspect | Logistic Regression | Decision Tree / Random Forest |
| --- | --- | --- |
| Output type | Calibrated probability (0–1) | Probability estimate (often poorly calibrated) |
| Interpretability | High — coefficients are log-odds, directly explainable | Medium (tree) to Low (forest) — needs SHAP for forests |
| Handles non-linearity | No — needs manual feature engineering | Yes — naturally captures complex interactions |
| Training speed | Very fast — scales to millions of rows | Moderate to slow for large forests |
| Overfitting risk | Low — regularisation (L1/L2) is simple and effective | High for trees — needs depth control or ensembling |
| Feature scaling required | Yes — sensitive to scale differences | No — trees are scale-invariant |
| Best used when | Data is roughly linearly separable; explanation is required | Complex non-linear relationships; less need to explain |
| Regulatory environments | Preferred — auditable coefficient-level explanation | Difficult to audit without post-hoc explainability tools |

🎯 Key Takeaways

  • Logistic Regression does not predict a class directly — it predicts a calibrated probability via the sigmoid function, and a threshold converts that probability to a label. The threshold is a business decision, not a model parameter.
  • The coefficients are log-odds ratios: a coefficient of +0.8 on a feature means a one-unit increase in that feature multiplies the odds of the positive class by e^0.8 ≈ 2.23. This interpretability is the primary reason regulated industries still choose Logistic Regression over more powerful models.
  • Always scale your features before training — Logistic Regression uses gradient descent, which is highly sensitive to features with vastly different magnitudes. Fitting the StandardScaler on training data only is non-negotiable; leaking test statistics inflates your metrics and is a common interview red flag.
  • AUC-ROC measures the quality of the model's probability estimates across all thresholds — optimise this during model selection. Your chosen decision threshold is then a separate, downstream business decision based on the relative costs of false positives versus false negatives in your specific application.

⚠ Common Mistakes to Avoid

  • Mistake 1: Forgetting to scale features — Symptom: ConvergenceWarning appears even at max_iter=1000, and model accuracy is significantly lower than expected — Fix: Always apply StandardScaler (or MinMaxScaler) to your features before fitting. Remember to fit the scaler on training data only, then transform both train and test sets separately.
  • Mistake 2: Using accuracy as the only metric on imbalanced data — Symptom: Model reports 95% accuracy on a fraud dataset where 95% of transactions are legitimate — meaning it learned to predict 'not fraud' for everything — Fix: Always compute the confusion matrix plus precision, recall, and F1-score per class. For severe imbalance, use class_weight='balanced' in LogisticRegression() or over/undersample your minority class.
  • Mistake 3: Treating the 0.5 threshold as immovable — Symptom: Deployed model has acceptable accuracy but unacceptable real-world outcomes (e.g., too many missed cancer diagnoses or too many blocked legitimate emails) — Fix: Use predict_proba() to get raw probabilities, then sweep thresholds using precision_recall_curve() and select the cut-off that minimises your most costly error type for the specific business context.
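The class_weight='balanced' fix from Mistake 2 is a one-argument change. The synthetic 95/5 imbalanced dataset below is illustrative, not from this article, but it shows the typical effect on minority-class recall:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Illustrative imbalanced data: 950 'legit' samples vs 50 'fraud' samples
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (950, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X, y)

# Recall on the rare class is where the difference shows up
print(f"Minority-class recall, plain    : {recall_score(y, plain.predict(X)):.2f}")
print(f"Minority-class recall, balanced : {recall_score(y, balanced.predict(X)):.2f}")
```

With class_weight='balanced', each minority sample is up-weighted in the loss in proportion to its rarity, so the model stops defaulting to "not fraud" for everything.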

Interview Questions on This Topic

  • Q: Why does Logistic Regression use log-loss (binary cross-entropy) instead of mean squared error as its loss function? — Interviewers love this because MSE with a sigmoid output creates a non-convex loss surface full of local minima. Log-loss is convex with respect to the weights, guaranteeing gradient descent finds the global minimum. A good answer also mentions that log-loss heavily penalises confident wrong predictions, which is exactly the behaviour you want.
  • Q: What is the difference between L1 and L2 regularisation in Logistic Regression, and when would you choose each? — L2 (Ridge, the default in scikit-learn) shrinks all coefficients toward zero but rarely to exactly zero — good for multicollinearity. L1 (Lasso) can drive some coefficients to exactly zero, performing automatic feature selection — ideal when you suspect many features are irrelevant. In scikit-learn, control this with the penalty parameter ('l1' or 'l2') and the C parameter (inverse of regularisation strength — lower C = more regularisation).
  • Q: If Logistic Regression outputs a probability of 0.73 for a sample, what does that actually mean mathematically — and what are the underlying log-odds? — This trips people up. The probability 0.73 means the model believes there is a 73% chance of the positive class. The log-odds (logit) is log(0.73 / 0.27) = log(2.70) ≈ 0.994. The log-odds is what the linear part of the model (w₀ + w₁x₁ + ...) is directly computing — the sigmoid then maps it back to a probability. Understanding this chain — linear combination → log-odds → sigmoid → probability — shows you truly understand the model, not just its API.
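The probability-to-log-odds chain in that last question takes two lines to verify (a quick sanity check, not from the article):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

p = 0.73
log_odds = np.log(p / (1 - p))       # what the linear part w0 + w1*x1 + ... computes
print(f"log-odds for p = 0.73 : {log_odds:.3f}")
print(f"sigmoid of that value : {sigmoid(log_odds):.2f}")  # recovers 0.73
```

The two functions are exact inverses, which is why the model can work in log-odds space internally yet always report a valid probability.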

Frequently Asked Questions

Can logistic regression handle multi-class classification problems?

Yes — scikit-learn's LogisticRegression supports multi-class classification out of the box. It uses either One-vs-Rest (OvR), which trains one binary classifier per class, or the multinomial (softmax) strategy, which optimises a single joint loss across all classes. Recent scikit-learn versions default to the multinomial formulation for solvers that support it, such as 'lbfgs'; the multi_class parameter that previously selected the strategy explicitly is deprecated as of scikit-learn 1.5.
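A minimal multi-class sketch on the classic Iris dataset (three species). Nothing extra is passed: with the default 'lbfgs' solver, recent scikit-learn versions apply the multinomial (softmax) formulation automatically when the target has more than two classes:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

clf = LogisticRegression(max_iter=1000).fit(X_scaled, iris.target)

# One probability per class from a single joint softmax — they sum to 1
sample_probs = clf.predict_proba(X_scaled[:1])[0]
for name, prob in zip(iris.target_names, sample_probs):
    print(f"P({name:<10}) = {prob:.3f}")
print(f"Sum of class probabilities = {sample_probs.sum():.1f}")
```

Everything from the binary case — scaling, coefficients, predict_proba — carries over; the only difference is that coef_ now has one row per class.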

Why does scikit-learn show a ConvergenceWarning for logistic regression?

It means the solver stopped before reaching the optimum within the allowed number of iterations. The two most common fixes are: (1) scale your features with StandardScaler — unscaled data creates an elongated loss surface that takes far more steps to traverse, and (2) increase max_iter to 1000 or higher. If it still doesn't converge, try a different solver — 'lbfgs' is already the default, so 'saga' or 'newton-cg' are the usual alternatives.
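You can watch both effects by checking the fitted model's n_iter_ attribute on this article's dataset: the raw features typically exhaust the iteration budget while the scaled ones converge well within it (the warning is suppressed here only to keep the demo output clean):

```python
import warnings
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the ConvergenceWarning for this demo
    raw = LogisticRegression(max_iter=100).fit(data.data, data.target)

scaled = LogisticRegression(max_iter=100).fit(
    StandardScaler().fit_transform(data.data), data.target)

print(f"Solver iterations on raw features    : {raw.n_iter_[0]}")
print(f"Solver iterations on scaled features : {scaled.n_iter_[0]}")
```

Scaling doesn't just silence the warning — it genuinely rounds out the loss surface so the solver needs far fewer steps.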

Is logistic regression still useful in the age of deep learning and gradient boosting?

Absolutely — and not just as a baseline. Anywhere a decision needs to be explained to a non-technical stakeholder, audited by a regulator, or deployed in a low-latency environment, Logistic Regression is the right tool. Credit scoring, clinical risk scoring, and legal-domain AI are all areas where its transparency is a hard requirement, not a nice-to-have.

TheCodeForge Editorial Team Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
