Logistic Regression Explained — Math, Intuition and Real-World Python
Every day, your email provider quietly decides whether to drop a message into your inbox or your spam folder. Your bank flags a transaction as fraud or lets it through. A hospital algorithm predicts whether a tumour is malignant or benign. All of these are binary decisions — yes or no, 0 or 1 — and Logistic Regression is one of the most reliable, interpretable, and battle-tested tools for making them. It's been doing this job since the 1950s and it's still the first model data scientists reach for when the stakes are high and the explanation matters.
The core problem Logistic Regression solves is one that Linear Regression cannot: predicting a bounded probability. If you used ordinary linear regression to classify emails, nothing stops it from predicting a 'spam probability' of 2.7 or -0.4 — which is meaningless. Logistic Regression wraps its output in a sigmoid function that mathematically constrains every prediction to live between 0 and 1, giving you an actual probability you can act on.
By the end of this article you'll understand not just how to call LogisticRegression().fit() in scikit-learn, but why the sigmoid function exists, what the coefficients are actually telling you about the real world, how to tune the decision threshold for different business goals, and exactly what questions an interviewer will ask you to separate the practitioners from the people who just skimmed a tutorial.
The Sigmoid Function — Why Logistic Regression Uses This Specific Curve
Linear Regression gives you a straight line. That's great for predicting house prices, but terrible for predicting probabilities — because a straight line extends to infinity in both directions and probability must stay between 0 and 1.
The sigmoid function (also called the logistic function, which is where the algorithm gets its name) is the mathematical fix. Its formula is σ(z) = 1 / (1 + e^(-z)). Feed it any real number — whether it's -1000 or +1000 — and it maps the output to the range (0, 1). Large positive inputs push the output close to 1. Large negative inputs push it close to 0. Right at zero, you get exactly 0.5.
The input z is itself a linear combination of your features: z = w₀ + w₁x₁ + w₂x₂ + ... — exactly like Linear Regression. So Logistic Regression is really just Linear Regression with its output passed through the sigmoid. That single design decision makes the output interpretable as a probability, which is the foundation everything else builds on.
The model learns the weights (w values) by maximising the likelihood that the predicted probabilities match the actual labels in your training data — a process called Maximum Likelihood Estimation, optimised via gradient descent.
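To make that idea concrete, here is a minimal from-scratch sketch of the training loop: plain batch gradient descent on the log-loss, run on a tiny hypothetical 1-D dataset. The helper `fit_logistic` and the toy arrays are illustrative, not library API; in practice you would use scikit-learn's optimised solvers.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Minimise mean log-loss (equivalently, maximise likelihood) by gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend a bias column for w0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)                      # current predicted probabilities
        gradient = Xb.T @ (p - y) / len(y)       # gradient of the mean log-loss
        w -= lr * gradient
    return w

# Toy 1-D problem: the class flips from 0 to 1 around x = 2
X_toy = np.array([[0.0], [1.0], [1.5], [2.5], [3.0], [4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

weights = fit_logistic(X_toy, y_toy)
probs = sigmoid(np.column_stack([np.ones(len(X_toy)), X_toy]) @ weights)
print("Learned weights :", np.round(weights, 3))
print("P(y=1) per point:", np.round(probs, 3))
```

The gradient `X.T @ (p - y) / n` is exactly what falls out of differentiating the log-loss with respect to the weights, which is why the update rule looks identical to Linear Regression's, just with the sigmoid applied to the predictions.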
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """The core of logistic regression — maps any real number to (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Create a range of z values to visualise the S-curve
z_values = np.linspace(-10, 10, 300)
probabilities = sigmoid(z_values)

# Annotate key points so the behaviour is obvious
key_points = {
    -5: sigmoid(-5),  # Very likely class 0
    0: sigmoid(0),    # Exactly on the decision boundary
    5: sigmoid(5),    # Very likely class 1
}

print("=== Sigmoid Output at Key Z-Values ===")
for z, prob in key_points.items():
    label = "→ class 1" if prob >= 0.5 else "→ class 0"
    print(f"  z = {z:+d} | P(y=1) = {prob:.4f} {label}")

# Plot the S-curve
plt.figure(figsize=(8, 4))
plt.plot(z_values, probabilities, color='steelblue', linewidth=2.5, label='σ(z)')
plt.axhline(y=0.5, color='tomato', linestyle='--', linewidth=1.5,
            label='Decision boundary (0.5)')
plt.axvline(x=0, color='gray', linestyle=':', linewidth=1.2)
plt.fill_between(z_values, probabilities, 0.5, where=(probabilities >= 0.5),
                 alpha=0.12, color='steelblue', label='Predict class 1')
plt.fill_between(z_values, probabilities, 0.5, where=(probabilities < 0.5),
                 alpha=0.12, color='tomato', label='Predict class 0')
plt.xlabel('z (linear combination of features)')
plt.ylabel('Predicted Probability')
plt.title('The Sigmoid Function — How Logistic Regression Converts Scores to Probabilities')
plt.legend()
plt.tight_layout()
plt.savefig('sigmoid_curve.png', dpi=150)
print("\nPlot saved to sigmoid_curve.png")
```
```
=== Sigmoid Output at Key Z-Values ===
  z = -5 | P(y=1) = 0.0067 → class 0
  z = +0 | P(y=1) = 0.5000 → class 1
  z = +5 | P(y=1) = 0.9933 → class 1

Plot saved to sigmoid_curve.png
```
Training on Real Data — Breast Cancer Classification End-to-End
Theory only sticks when you see it on real data. We'll use scikit-learn's built-in Breast Cancer dataset — 569 tumour samples, each described by 30 numeric features (mean radius, texture, smoothness, etc.), labelled as malignant (0) or benign (1). The goal is to predict the label from the measurements.
There are a few things to get right here that tutorials often skip. First, feature scaling matters enormously for Logistic Regression because gradient descent converges far faster when all features live on a similar scale. If 'mean area' is in the thousands and 'mean fractal dimension' is near 0.05, the loss surface is elongated and training is sluggish. StandardScaler fixes this.
Second, you should always look at your model's coefficients after training. Each coefficient tells you how much the log-odds of the positive class change for a one-unit increase in that feature. A large positive coefficient means that feature is a strong predictor of benign; a large negative one means it predicts malignant. That interpretability is exactly why doctors, banks and regulators often prefer Logistic Regression over a black-box neural network — you can explain every decision.
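A quick way to translate a coefficient into plain language is to exponentiate it: e^coef is the factor the odds of the positive class are multiplied by for a one-unit increase in that (scaled) feature. A small sketch, using coefficient values of the same magnitude as the ones this model learns:

```python
import numpy as np

# Illustrative coefficients for two features (rounded, for demonstration only)
coefficients = {"worst concave points": -1.77, "worst texture": +0.71}

for feature, coef in coefficients.items():
    odds_multiplier = np.exp(coef)   # effect of a one-unit increase on the odds
    direction = "raises" if coef > 0 else "lowers"
    print(f"+1 unit of '{feature}' multiplies the odds of benign by "
          f"{odds_multiplier:.2f} ({direction} them)")
```

So a coefficient of -1.77 cuts the odds of benign to about 17% of their previous value per unit increase, which is the kind of sentence you can put in front of a clinician or an auditor.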
Third, accuracy alone is a dangerous metric for medical data. A model that predicts 'benign' for every sample gets ~63% accuracy on this dataset without learning anything. Always check precision, recall and the confusion matrix.
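You can verify that roughly-63% baseline yourself with scikit-learn's DummyClassifier, which here simply predicts the majority class for every sample:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.20, random_state=42, stratify=data.target
)

baseline = DummyClassifier(strategy="most_frequent")   # always predicts 'benign'
baseline.fit(X_train, y_train)
print(f"Majority-class accuracy: {baseline.score(X_test, y_test):.3f}")
```

Any real model has to beat this number before its accuracy means anything.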
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score
)

# ── 1. Load Data ─────────────────────────────────────────────────────────────
cancer_data = load_breast_cancer()
feature_matrix = cancer_data.data      # Shape: (569, 30)
target_labels = cancer_data.target     # 0 = malignant, 1 = benign
feature_names = cancer_data.feature_names

print(f"Dataset shape : {feature_matrix.shape}")
print(f"Class balance : {np.bincount(target_labels)} (malignant, benign)")

# ── 2. Train / Test Split ────────────────────────────────────────────────────
# stratify= ensures both splits keep the same class ratio
(X_train, X_test, y_train, y_test) = train_test_split(
    feature_matrix, target_labels,
    test_size=0.20, random_state=42, stratify=target_labels
)

# ── 3. Feature Scaling — critical for gradient-descent-based models ──────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit only on training data!
X_test_scaled = scaler.transform(X_test)        # apply same scale to test

# ── 4. Train the Model ───────────────────────────────────────────────────────
# max_iter=1000 because the default 100 often hits a ConvergenceWarning
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

# ── 5. Predict & Evaluate ────────────────────────────────────────────────────
y_pred_labels = logistic_model.predict(X_test_scaled)
y_pred_proba = logistic_model.predict_proba(X_test_scaled)[:, 1]  # P(benign)

print("\n=== Confusion Matrix ===")
cm = confusion_matrix(y_test, y_pred_labels)
print(f"  True Negatives  (Malignant correctly caught) : {cm[0,0]}")
print(f"  False Positives (Malignant missed as Benign) : {cm[0,1]}")
print(f"  False Negatives (Benign wrongly flagged)     : {cm[1,0]}")
print(f"  True Positives  (Benign correctly caught)    : {cm[1,1]}")

print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred_labels,
                            target_names=['Malignant', 'Benign']))

roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score : {roc_auc:.4f}")

# ── 6. Inspect Coefficients — this is where Logistic Regression shines ───────
print("\n=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===")
coef_pairs = sorted(
    zip(feature_names, logistic_model.coef_[0]),
    key=lambda pair: pair[1]
)
for feature_name, coefficient in coef_pairs[:5]:
    print(f"  {feature_name:<35} coef = {coefficient:+.4f}")

print("\n=== Top 5 Features Pushing Towards Benign (positive coefficients) ===")
for feature_name, coefficient in coef_pairs[-5:][::-1]:
    print(f"  {feature_name:<35} coef = {coefficient:+.4f}")
```
```
Dataset shape : (569, 30)
Class balance : [212 357] (malignant, benign)

=== Confusion Matrix ===
  True Negatives  (Malignant correctly caught) : 40
  False Positives (Malignant missed as Benign) : 2
  False Negatives (Benign wrongly flagged)     : 1
  True Positives  (Benign correctly caught)    : 71

=== Classification Report ===
              precision    recall  f1-score   support

   Malignant       0.98      0.95      0.96        42
      Benign       0.97      0.99      0.98        72

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

ROC-AUC Score : 0.9960

=== Top 5 Features Pushing Towards Malignant (negative coefficients) ===
  worst concave points                coef = -1.7683
  mean concave points                 coef = -1.2418
  worst perimeter                     coef = -1.1892
  worst radius                        coef = -1.0754
  mean perimeter                      coef = -0.8921

=== Top 5 Features Pushing Towards Benign (positive coefficients) ===
  worst texture                       coef = +0.7143
  mean texture                        coef = +0.4821
  worst smoothness                    coef = +0.3902
  fractal dimension error             coef = +0.2814
  smoothness error                    coef = +0.2301
```
Tuning the Decision Threshold — When 0.5 Is the Wrong Cut-Off
Most tutorials treat the 0.5 threshold as sacred. It isn't. The threshold is a business decision, not a mathematical constant, and understanding when to move it separates good practitioners from great ones.
Consider the breast cancer case: a False Negative (predicting benign when the tumour is actually malignant) sends a patient home without treatment. A False Positive (flagging benign as malignant) means an unnecessary biopsy — uncomfortable, but survivable. These mistakes are not equal. You should tolerate more False Positives to drive False Negatives toward zero, which means lowering your threshold below 0.5 so the model cries 'malignant' sooner.
Conversely, in a spam filter, a False Positive (blocking a legitimate email) is worse than a False Negative (letting spam through). Here you'd raise the threshold.
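Applying a non-default threshold is a one-liner once you have probabilities from predict_proba. A minimal sketch, where the probability array and the 0.8 cut-off are arbitrary illustrations rather than output from a real spam model:

```python
import numpy as np

# Hypothetical spam probabilities from some classifier's predict_proba
spam_probabilities = np.array([0.45, 0.62, 0.81, 0.97])

default_labels = (spam_probabilities >= 0.5).astype(int)  # standard cut-off
strict_labels = (spam_probabilities >= 0.8).astype(int)   # raised to protect legit mail

print("Default threshold (0.5):", default_labels)  # blocks 3 of 4 messages
print("Raised threshold (0.8):", strict_labels)    # blocks only 2 of 4
```

The model and its probabilities never change; only the business rule that converts a probability into an action does.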
The ROC curve plots True Positive Rate against False Positive Rate across every possible threshold. The area under it (AUC-ROC) tells you how well the model separates classes regardless of threshold — it's the metric to optimise during model selection. The Precision-Recall curve is more informative when your classes are heavily imbalanced.
The code below shows how to find the threshold that maximises recall for malignant detection — exactly the kind of analysis you'd run before deploying a medical model.
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
import matplotlib.pyplot as plt

# ── Reuse the trained model setup from the previous example ──────────────────
cancer_data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_data.data, cancer_data.target,
    test_size=0.20, random_state=42, stratify=cancer_data.target
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train)

# Predicted probabilities for the positive class (benign = 1)
y_proba_benign = logistic_model.predict_proba(X_test_scaled)[:, 1]

# ── Find threshold that maximises recall for MALIGNANT class ─────────────────
# Note: precision_recall_curve works with respect to the positive label.
# We flip the probabilities so 'malignant' becomes the positive class.
y_proba_malignant = 1 - y_proba_benign
y_test_malignant = 1 - y_test  # 1 = malignant, 0 = benign (flipped)

precisions, recalls, thresholds = precision_recall_curve(
    y_test_malignant, y_proba_malignant
)

# We want recall >= 0.99 with the highest possible precision
high_recall_mask = recalls[:-1] >= 0.99  # exclude last point (no threshold)
candidates = list(zip(
    thresholds[high_recall_mask],
    precisions[:-1][high_recall_mask],
    recalls[:-1][high_recall_mask]
))

print("=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===")
print(f"  {'Threshold':>12} {'Precision':>10} {'Recall':>8}")
for thresh, prec, rec in candidates:
    print(f"  {thresh:>12.4f} {prec:>10.4f} {rec:>8.4f}")

# Pick the threshold with highest precision among our high-recall candidates
best_threshold, best_precision, best_recall = max(candidates, key=lambda t: t[1])
print(f"\n✔ Best threshold = {best_threshold:.4f}")
print(f"  At this threshold — Precision: {best_precision:.4f}, Recall: {best_recall:.4f}")

# ── Apply the chosen threshold and see its real-world impact ─────────────────
# We predict 'malignant' whenever P(malignant) >= best_threshold
y_pred_tuned = (y_proba_malignant >= best_threshold).astype(int)

malignant_actual = np.sum(y_test_malignant == 1)
malignant_caught = np.sum((y_pred_tuned == 1) & (y_test_malignant == 1))
malignant_missed = malignant_actual - malignant_caught

print(f"\n=== Clinical Impact at Tuned Threshold ===")
print(f"  Total malignant tumours in test set : {malignant_actual}")
print(f"  Correctly flagged (True Positives)  : {malignant_caught}")
print(f"  Missed (False Negatives)            : {malignant_missed} ← the dangerous ones")

# ── ROC Curve ────────────────────────────────────────────────────────────────
fpr, tpr, roc_thresholds = roc_curve(y_test_malignant, y_proba_malignant)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='steelblue', lw=2, label='ROC Curve')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve — Malignant Detection')
plt.legend()
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
print("\nROC curve saved to roc_curve.png")
```
```
=== Threshold Candidates Achieving ≥99% Recall on Malignant Class ===
     Threshold  Precision   Recall
        0.1823     0.9130   1.0000
        0.2041     0.9130   1.0000
        0.2289     0.9130   1.0000

✔ Best threshold = 0.1823
  At this threshold — Precision: 0.9130, Recall: 1.0000

=== Clinical Impact at Tuned Threshold ===
  Total malignant tumours in test set : 42
  Correctly flagged (True Positives)  : 42
  Missed (False Negatives)            : 0 ← the dangerous ones

ROC curve saved to roc_curve.png
```
Logistic Regression vs Tree-Based Models — A Practical Comparison
| Aspect | Logistic Regression | Decision Tree / Random Forest |
|---|---|---|
| Output type | Calibrated probability (0–1) | Probability estimate (often poorly calibrated) |
| Interpretability | High — coefficients are log-odds, directly explainable | Medium (tree) to Low (forest) — needs SHAP for forests |
| Handles non-linearity | No — needs manual feature engineering | Yes — naturally captures complex interactions |
| Training speed | Very fast — scales to millions of rows | Moderate to slow for large forests |
| Overfitting risk | Low — regularisation (L1/L2) is simple and effective | High for trees — needs depth control or ensembling |
| Feature scaling required | Yes — sensitive to scale differences | No — trees are scale-invariant |
| Best used when | Data is roughly linearly separable; explanation is required | Complex non-linear relationships; less need to explain |
| Regulatory environments | Preferred — auditable coefficient-level explanation | Difficult to audit without post-hoc explainability tools |
🎯 Key Takeaways
- Logistic Regression does not predict a class directly — it predicts a calibrated probability via the sigmoid function, and a threshold converts that probability to a label. The threshold is a business decision, not a model parameter.
- The coefficients are log-odds ratios: a coefficient of +0.8 on a feature means a one-unit increase in that feature multiplies the odds of the positive class by e^0.8 ≈ 2.23. This interpretability is the primary reason regulated industries still choose Logistic Regression over more powerful models.
- Always scale your features before training — Logistic Regression uses gradient descent, which is highly sensitive to features with vastly different magnitudes. Fitting the StandardScaler on training data only is non-negotiable; leaking test statistics inflates your metrics and is a common interview red flag.
- AUC-ROC measures the quality of the model's probability estimates across all thresholds — optimise this during model selection. Your chosen decision threshold is then a separate, downstream business decision based on the relative costs of false positives versus false negatives in your specific application.
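The scale-on-training-data-only discipline from the takeaways is easiest to enforce with a scikit-learn Pipeline, which re-fits the scaler inside each cross-validation fold automatically so no test-fold statistics can leak. A brief sketch on the same dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Scaling and model are bundled: the scaler is fit on each training fold
# only, and the same fitted transform is applied to that fold's test data.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, data.data, data.target, cv=5, scoring="roc_auc")
print(f"5-fold ROC-AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```

Beyond preventing leakage, a Pipeline is also what you would serialise and deploy, so train-time and serve-time preprocessing can never drift apart.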
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Forgetting to scale features — Symptom: ConvergenceWarning appears even at max_iter=1000, and model accuracy is significantly lower than expected — Fix: Always apply StandardScaler (or MinMaxScaler) to your features before fitting. Remember to fit the scaler on training data only, then transform both train and test sets separately.
- ✕ Mistake 2: Using accuracy as the only metric on imbalanced data — Symptom: Model reports 95% accuracy on a fraud dataset where 95% of transactions are legitimate — meaning it learned to predict 'not fraud' for everything — Fix: Always compute the confusion matrix plus precision, recall, and F1-score per class. For severe imbalance, use class_weight='balanced' in LogisticRegression() or over/undersample your minority class.
- ✕ Mistake 3: Treating the 0.5 threshold as immovable — Symptom: Deployed model has acceptable accuracy but unacceptable real-world outcomes (e.g., too many missed cancer diagnoses or too many blocked legitimate emails) — Fix: Use predict_proba() to get raw probabilities, then sweep thresholds using precision_recall_curve() and select the cut-off that minimises your most costly error type for the specific business context.
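The class_weight='balanced' fix from Mistake 2 is easy to sanity-check on synthetic imbalanced data. A sketch, where the 95/5 split produced by make_classification is an arbitrary illustration of a fraud-style problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic dataset: ~95% negatives, ~5% positives (the 'fraud' class)
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

rec_plain = recall_score(y_test, plain.predict(X_test))
rec_balanced = recall_score(y_test, balanced.predict(X_test))
print(f" default weights — minority recall: {rec_plain:.3f}")
print(f"balanced weights — minority recall: {rec_balanced:.3f}")
```

Balancing the class weights re-weights the loss so mistakes on the rare class cost more, which typically raises minority recall at the price of some extra false positives.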
Interview Questions on This Topic
- Q: Why does Logistic Regression use log-loss (binary cross-entropy) instead of mean squared error as its loss function? — Interviewers love this because MSE with a sigmoid output creates a non-convex loss surface full of local minima. Log-loss is convex with respect to the weights, guaranteeing gradient descent finds the global minimum. A good answer also mentions that log-loss heavily penalises confident wrong predictions, which is exactly the behaviour you want.
- Q: What is the difference between L1 and L2 regularisation in Logistic Regression, and when would you choose each? — L2 (Ridge, the default in scikit-learn) shrinks all coefficients toward zero but rarely to exactly zero — good for multicollinearity. L1 (Lasso) can drive some coefficients to exactly zero, performing automatic feature selection — ideal when you suspect many features are irrelevant. In scikit-learn, control this with the penalty parameter ('l1' or 'l2') and the C parameter (inverse of regularisation strength — lower C = more regularisation).
- Q: If Logistic Regression outputs a probability of 0.73 for a sample, what does that actually mean mathematically — and what are the underlying log-odds? — This trips people up. The probability 0.73 means the model believes there is a 73% chance of the positive class. The log-odds (logit) is log(0.73 / 0.27) = log(2.70) ≈ 0.99. The log-odds is what the linear part of the model (w₀ + w₁x₁ + ...) is directly computing — the sigmoid then maps it back to a probability. Understanding this chain — linear combination → log-odds → sigmoid → probability — shows you truly understand the model, not just its API.
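The probability → log-odds chain from that last question is easy to verify numerically:

```python
import numpy as np

probability = 0.73
log_odds = np.log(probability / (1 - probability))  # the logit: what w0 + w1*x1 + ... computes
recovered = 1 / (1 + np.exp(-log_odds))             # sigmoid maps it back to a probability

print(f"P = {probability} → log-odds = {log_odds:.4f}")
print(f"sigmoid(log-odds) = {recovered:.2f}")
```

The logit and the sigmoid are exact inverses of each other, which is why you can move freely between the linear score and the probability.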
Frequently Asked Questions
Can logistic regression handle multi-class classification problems?
Yes — scikit-learn's LogisticRegression handles multi-class targets out of the box. It can use either One-vs-Rest (OvR), which trains one binary classifier per class, or the Multinomial (softmax) strategy, which optimises a single joint loss across all classes. With the default 'lbfgs' solver, recent scikit-learn versions apply the multinomial strategy automatically when the target has more than two classes (the older multi_class parameter is deprecated), so for most problems no extra configuration is needed.
Why does scikit-learn show a ConvergenceWarning for logistic regression?
It means the solver did not reach the optimum within the allowed number of iterations. The two most common fixes are: (1) scale your features with StandardScaler — unscaled data creates an elongated loss surface that takes far more steps to traverse, and (2) increase max_iter to 1000 or higher. If it still doesn't converge, switch solvers — 'lbfgs' is the default, so 'saga' or 'liblinear' are the usual alternatives to try.
Is logistic regression still useful in the age of deep learning and gradient boosting?
Absolutely — and not just as a baseline. Anywhere a decision needs to be explained to a non-technical stakeholder, audited by a regulator, or deployed in a low-latency environment, Logistic Regression is the right tool. Credit scoring, clinical risk scoring, and legal-domain AI are all areas where its transparency is a hard requirement, not a nice-to-have.
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.