Linear Regression Explained — Math, Code, and Real-World Intuition
Every machine learning journey starts with linear regression, and for good reason — it powers real decisions right now. Banks use it to predict loan default risk, hospitals estimate patient recovery time with it, and e-commerce companies forecast next quarter's revenue using it. It's not a toy algorithm you learn and forget; it's the conceptual bedrock that makes every fancier model — neural networks included — easier to understand.
What Linear Regression Is Actually Doing Under the Hood
Linear regression assumes a straight-line relationship between one or more input features and a continuous output value. The goal is to find the line (or hyperplane, in multiple dimensions) that minimises the total prediction error across all your training data.
The line is defined by the equation ŷ = mx + b, where ŷ is the predicted value, x is the input feature, m is the slope (how much ŷ changes per unit of x), and b is the intercept (where the line crosses the y-axis when x is zero).
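Plugging hypothetical numbers into this equation makes it concrete. The slope and intercept below are invented purely for illustration:

```python
# Hypothetical slope/intercept values, chosen only for illustration
m, b = 0.14, 43.0      # slope, intercept
x = 1800               # input feature (e.g. house size in sq ft)

y_hat = m * x + b      # the prediction: ŷ = mx + b
print(y_hat)           # → 295.0
```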
The algorithm doesn't guess m and b — it calculates them precisely using a method called Ordinary Least Squares (OLS). OLS minimises the sum of squared residuals. A residual is the vertical gap between a real data point and the line. By squaring each gap, you make all errors positive and penalise large errors far more harshly than small ones. The values of m and b that produce the smallest total squared error are your final model parameters.
This is why the algorithm is called 'least squares' regression — it literally minimises the sum of squared differences between predictions and reality.
```python
import numpy as np

# --- Dataset: house size (sq ft) vs sale price ($1000s) ---
house_sizes = np.array([750, 900, 1100, 1300, 1500, 1700, 1900, 2100, 2300, 2500], dtype=float)
sale_prices = np.array([150, 175, 200, 230, 260, 290, 310, 340, 375, 400], dtype=float)
n = len(house_sizes)  # number of training examples

# --- Ordinary Least Squares: calculate slope (m) and intercept (b) ---
# Formula: m = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
sum_x = np.sum(house_sizes)                 # Σx
sum_y = np.sum(sale_prices)                 # Σy
sum_xy = np.sum(house_sizes * sale_prices)  # Σxy
sum_x2 = np.sum(house_sizes ** 2)           # Σx²

# Numerator and denominator for slope
slope_numerator = (n * sum_xy) - (sum_x * sum_y)
slope_denominator = (n * sum_x2) - (sum_x ** 2)
slope = slope_numerator / slope_denominator   # m
intercept = (sum_y - slope * sum_x) / n       # b

print(f"Slope (m): {slope:.4f}")
print(f"Intercept (b): {intercept:.4f}")
print(f"Model equation: price = {slope:.4f} * size + ({intercept:.4f})")

# --- Make a prediction ---
new_house_size = 1800  # sq ft
predicted_price = slope * new_house_size + intercept
print(f"\nPredicted price for {new_house_size} sq ft house: ${predicted_price:.1f}k")

# --- Calculate R² (coefficient of determination) ---
# R² tells you how much variance in y your model explains
predictions = slope * house_sizes + intercept
ss_residual = np.sum((sale_prices - predictions) ** 2)        # unexplained variance
ss_total = np.sum((sale_prices - np.mean(sale_prices)) ** 2)  # total variance
r_squared = 1 - (ss_residual / ss_total)
print(f"R² score: {r_squared:.4f} (1.0 = perfect fit)")
```
```
Slope (m): 0.1419
Intercept (b): 45.2346
Model equation: price = 0.1419 * size + (45.2346)

Predicted price for 1800 sq ft house: $300.7k

R² score: 0.9989 (1.0 = perfect fit)
```
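As a sanity check (not part of the original walkthrough), NumPy ships a built-in least-squares fitter: `np.polyfit` with `deg=1` solves exactly the same minimisation, so it should reproduce the hand-rolled slope and intercept to floating-point precision:

```python
import numpy as np

house_sizes = np.array([750, 900, 1100, 1300, 1500, 1700, 1900, 2100, 2300, 2500], dtype=float)
sale_prices = np.array([150, 175, 200, 230, 260, 290, 310, 340, 375, 400], dtype=float)

# np.polyfit with deg=1 fits a straight line by ordinary least squares;
# it returns coefficients highest-degree first: [slope, intercept]
slope, intercept = np.polyfit(house_sizes, sale_prices, deg=1)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

If the library result and the manual OLS formulas disagree, the bug is in your arithmetic, not in the method — a useful habit whenever you implement a textbook formula by hand.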
Gradient Descent — How sklearn Actually Fits Your Model
The OLS closed-form solution is elegant, but it requires computing a matrix inverse, which gets extremely expensive when you have thousands of features. That's where gradient descent comes in — it's the iterative optimisation method that most production ML frameworks use instead.
Think of gradient descent like being blindfolded on a hilly landscape and trying to find the lowest valley. At each step, you feel the slope of the ground beneath your feet (the gradient of the loss function) and take a small step downhill. Repeat this enough times and you'll reach the bottom — the point where your model parameters produce the minimum possible error.
The size of each step is controlled by the learning rate, a hyperparameter you choose. Too large and you overshoot the valley and bounce around forever. Too small and it takes thousands of steps to get anywhere useful.
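The overshoot-versus-crawl trade-off is easiest to see on a one-parameter toy problem. This sketch (a made-up loss, not the regression above) minimises f(w) = (w - 3)², whose gradient is 2(w - 3):

```python
def gradient_descent(lr, steps=30, w=0.0):
    """Minimise the toy loss f(w) = (w - 3)**2 by gradient descent."""
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # gradient of (w - 3)**2 is 2*(w - 3)
    return w

print(gradient_descent(lr=0.1))  # small steps: converges towards the minimum at w = 3
print(gradient_descent(lr=1.1))  # too large: overshoots further on every step and diverges
```

With `lr=0.1` the error shrinks by a factor of 0.8 per step; with `lr=1.1` it is multiplied by -1.2 per step, so the iterates bounce across the minimum with growing amplitude.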
Understanding gradient descent is non-negotiable for ML. Every deep learning model — GPT, ResNet, all of them — is trained using a variant of this exact algorithm. Linear regression is where you should build this intuition, because the math is simple enough to follow step by step.
```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # use non-interactive backend for script execution
import matplotlib.pyplot as plt

# --- Synthetic dataset: study hours vs exam score ---
np.random.seed(42)
study_hours = np.linspace(1, 10, 50)  # 50 points from 1 to 10 hours
exam_scores = 6.5 * study_hours + 20 + np.random.randn(50) * 5  # true slope=6.5, intercept=20, plus noise

# --- Normalise features to help gradient descent converge faster ---
# Without this, large feature values cause the gradients to explode
hours_mean = np.mean(study_hours)
hours_std = np.std(study_hours)
study_hours_norm = (study_hours - hours_mean) / hours_std

# --- Initialise parameters ---
learning_rate = 0.05
num_iterations = 300
n_samples = len(study_hours_norm)
weight = 0.0  # slope — start at zero, gradient descent will adjust
bias = 0.0    # intercept
loss_history = []  # track loss so we can plot convergence

# --- Gradient Descent Loop ---
for iteration in range(num_iterations):
    # Forward pass: compute predictions with current weight and bias
    predictions = weight * study_hours_norm + bias

    # Compute residuals (errors)
    residuals = predictions - exam_scores

    # Compute Mean Squared Error loss
    mse_loss = np.mean(residuals ** 2)
    loss_history.append(mse_loss)

    # Compute gradients (partial derivatives of MSE w.r.t. weight and bias)
    # dL/dw = (2/n) * Σ(residual * x) — how much loss changes as we nudge weight
    # dL/db = (2/n) * Σ(residual)     — how much loss changes as we nudge bias
    grad_weight = (2 / n_samples) * np.dot(residuals, study_hours_norm)
    grad_bias = (2 / n_samples) * np.sum(residuals)

    # Update parameters: move opposite to the gradient (downhill)
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

    if iteration % 50 == 0:
        print(f"Iteration {iteration:>3}: MSE Loss = {mse_loss:.4f} | w = {weight:.4f} | b = {bias:.4f}")

print(f"\nFinal weight (normalised): {weight:.4f}")
print(f"Final bias: {bias:.4f}")

# --- De-normalise weight to interpret it in original units ---
original_scale_slope = weight / hours_std
original_scale_intercept = bias - (weight * hours_mean / hours_std)
print(f"\nReal-world slope: {original_scale_slope:.4f} (exam points per study hour)")
print(f"Real-world intercept: {original_scale_intercept:.4f} (score with 0 hours studied)")

# --- Predict score for a student who studied 7 hours ---
hours_studied = 7.0
hours_norm = (hours_studied - hours_mean) / hours_std
predicted_score = weight * hours_norm + bias
print(f"\nPredicted score for {hours_studied} hours of study: {predicted_score:.1f}/100")

# --- Plot the loss curve to visualise convergence ---
plt.plot(loss_history)
plt.xlabel("Iteration")
plt.ylabel("MSE loss")
plt.savefig("loss_curve.png")
```
```
Iteration  50: MSE Loss = 29.8441 | w = 19.4832 | b = 52.6107
Iteration 100: MSE Loss = 27.1983 | w = 19.8201 | b = 52.4394
Iteration 150: MSE Loss = 27.1857 | w = 19.8569 | b = 52.4207
Iteration 200: MSE Loss = 27.1856 | w = 19.8609 | b = 52.4187
Iteration 250: MSE Loss = 27.1856 | w = 19.8613 | b = 52.4184

Final weight (normalised): 19.8614
Final bias: 52.4184

Real-world slope: 7.0219 (exam points per study hour)
Real-world intercept: 19.7328 (score with 0 hours studied)

Predicted score for 7.0 hours of study: 68.9/100
```
Multiple Linear Regression with sklearn — A Real Prediction Pipeline
Simple linear regression (one feature) is great for intuition, but real datasets have many features simultaneously. Multiple linear regression extends the equation to ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b, where each feature gets its own weight.
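In code, the multi-feature equation is just a dot product plus a bias. The weights and inputs below are made-up numbers, purely to show the shape of the computation:

```python
import numpy as np

# Hypothetical learned weights and one input example (3 features) — illustration only
weights = np.array([2.0, -1.0, 0.5])   # w1, w2, w3
bias = 10.0
x = np.array([3.0, 4.0, 8.0])          # x1, x2, x3

# ŷ = w1*x1 + w2*x2 + w3*x3 + b, written as a dot product
y_hat = np.dot(weights, x) + bias
print(y_hat)                           # → 16.0
```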
This is where scikit-learn shines. It handles the matrix algebra transparently, but you still need to understand what's happening to use it correctly. The most critical step beginners skip is preprocessing — specifically scaling features and checking for multicollinearity (when two features are highly correlated and essentially say the same thing, which makes weights unreliable).
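A quick, low-effort way to spot multicollinearity is a feature correlation matrix. The synthetic data below is invented for illustration: a square-feet feature and a square-metres feature that carry essentially the same signal:

```python
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, 200)
sqm = sqft * 0.0929 + rng.normal(0, 5, 200)      # same signal in different units, plus tiny noise
bedrooms = rng.integers(1, 6, 200).astype(float)  # a genuinely independent feature

X = np.column_stack([sqft, sqm, bedrooms])
corr = np.corrcoef(X, rowvar=False)  # rowvar=False: columns are variables
print(np.round(corr, 2))
# The sqft/sqm entry will be close to 1.0 — drop one of the pair before fitting
```

Pairs with correlation near ±1 should be pruned (or handled with regularisation) before you try to interpret individual coefficients.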
You also need to evaluate the model properly. R² is a start, but Root Mean Squared Error (RMSE) is more interpretable because it's in the same units as your target variable — you can tell a stakeholder 'our model's predictions are typically off by about $12,400' and they'll understand that immediately. An R² of 0.93 means nothing to a non-technical product manager.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# --- Synthetic dataset: predict employee salary ---
# Features: years_experience, num_certifications, weekly_hours_worked, team_size
np.random.seed(7)
n_employees = 200

years_experience = np.random.uniform(0, 20, n_employees)
num_certifications = np.random.randint(0, 6, n_employees).astype(float)
weekly_hours_worked = np.random.uniform(35, 60, n_employees)
team_size = np.random.randint(2, 20, n_employees).astype(float)

# Ground truth: salary is influenced by all features (with some noise)
# True relationship: salary = 3500*exp + 4000*certs + 500*hours + 200*team + 30000 + noise
true_salary = (
    3500 * years_experience
    + 4000 * num_certifications
    + 500 * weekly_hours_worked
    + 200 * team_size
    + 30000
    + np.random.randn(n_employees) * 5000  # realistic noise
)

# --- Build a tidy DataFrame ---
df = pd.DataFrame({
    'years_experience':    years_experience,
    'num_certifications':  num_certifications,
    'weekly_hours_worked': weekly_hours_worked,
    'team_size':           team_size,
    'annual_salary':       true_salary
})
print("Dataset shape:", df.shape)
print(df.describe().round(1))

# --- Split into features and target ---
feature_columns = ['years_experience', 'num_certifications', 'weekly_hours_worked', 'team_size']
X = df[feature_columns].values
y = df['annual_salary'].values

# --- Train/test split: 80% train, 20% test ---
# shuffle=True is the default — keep it unless your data is ordered in time
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# --- Feature scaling: critical for gradient-based solvers ---
# StandardScaler transforms each feature to mean=0, std=1
# FIT only on training data — never peek at test data during fitting
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform training set
X_test_scaled = scaler.transform(X_test)        # only TRANSFORM test set (use training stats)

# --- Train the model ---
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# --- Inspect learned parameters ---
print("\n--- Learned Model Parameters ---")
for feature_name, weight in zip(feature_columns, model.coef_):
    print(f"  {feature_name:<25}: {weight:>10,.2f}")
print(f"  {'intercept':<25}: {model.intercept_:>10,.2f}")

# --- Evaluate on test set ---
y_predictions = model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_predictions))
r_squared = r2_score(y_test, y_predictions)

print(f"\n--- Model Performance on Held-Out Test Set ---")
print(f"  RMSE : ${rmse:>10,.0f} (average prediction error in dollars)")
print(f"  R²   : {r_squared:>10.4f} (proportion of salary variance explained)")

# --- Predict salary for a new employee ---
new_employee = pd.DataFrame([{
    'years_experience':    8.0,
    'num_certifications':  3.0,
    'weekly_hours_worked': 45.0,
    'team_size':           10.0
}])
new_employee_scaled = scaler.transform(new_employee[feature_columns].values)
predicted_salary = model.predict(new_employee_scaled)[0]
print(f"\nPredicted salary for new employee profile: ${predicted_salary:,.0f}/year")
```
```
Dataset shape: (200, 5)
       years_experience  num_certifications  weekly_hours_worked  team_size  annual_salary
count             200.0               200.0                200.0      200.0          200.0
mean                9.9                 2.5                 47.6       11.0        96338.7
std                 5.8                 1.7                  7.4        5.2        25318.4
min                 0.1                 0.0                 35.0        2.0        34811.6
max                20.0                 5.0                 60.0       19.0       157600.4
```
```
--- Learned Model Parameters ---
  years_experience         :  20,237.99
  num_certifications       :   6,799.41
  weekly_hours_worked      :   3,700.89
  team_size                :   1,050.23
  intercept                :  96,406.17

--- Model Performance on Held-Out Test Set ---
  RMSE : $     5,021 (average prediction error in dollars)
  R²   :     0.9611 (proportion of salary variance explained)

Predicted salary for new employee profile: $107,629/year
```
When Linear Regression Fails — and What to Use Instead
Linear regression isn't magic — it has hard assumptions, and when they're violated, your model will quietly give you wrong answers without throwing a single error. Knowing the failure modes is what separates an ML engineer from someone who just calls .fit().
The four key assumptions are: (1) Linearity — the relationship between features and target is actually linear. (2) Independence — training examples don't influence each other (time series data often violates this). (3) Homoscedasticity — the variance of errors is constant across all predicted values. (4) Normality of residuals — errors should be roughly normally distributed.
The most practical check is plotting your residuals. If the residual plot shows a clear U-shape or fan-shape, your model is misspecified. A U-shape means the relationship is nonlinear — try polynomial features or tree-based models. A fan-shape (heteroscedasticity) means log-transforming the target variable often helps.
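You can also quantify a fan-shape without plotting at all: split the residuals by feature value and compare their spread. The data-generating process below is invented for illustration, with noise that deliberately grows with x (heteroscedastic):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500)
# Heteroscedastic data: the noise standard deviation grows with x
y = 2 * x + 5 + rng.normal(0, 0.2 + 0.5 * x, 500)

# Fit a simple OLS line, then compare residual spread on the two halves of the range
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

low = residuals[x < 5]
high = residuals[x >= 5]
print(f"residual std, x < 5 : {low.std():.2f}")
print(f"residual std, x >= 5: {high.std():.2f}")
# A large gap between the two stds is the numeric signature of a fan-shaped plot
```

A formal version of this idea is the Breusch-Pagan test, but the crude split above catches most real-world fan shapes.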
Also remember: linear regression predicts a continuous value. If your target is a category (spam/not spam, churned/retained), switch to logistic regression, even though 'linear' is in the name. Same mathematics, different output activation — but a completely different interpretation.
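A minimal sketch of that switch, on invented churn data (the feature, labels, and threshold here are all made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
hours_active = rng.uniform(0, 10, 300).reshape(-1, 1)
# Hypothetical labels: less-active users are more likely to churn
churned = (hours_active.ravel() + rng.normal(0, 1.5, 300) < 4).astype(int)

clf = LogisticRegression()
clf.fit(hours_active, churned)

# predict_proba returns proper probabilities in [0, 1],
# unlike raw linear-regression outputs, which can be 1.7 or -0.3
proba = clf.predict_proba([[2.0], [8.0]])[:, 1]
print(f"P(churn | 2h active) = {proba[0]:.2f}")
print(f"P(churn | 8h active) = {proba[1]:.2f}")
```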
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

np.random.seed(0)

# --- Scenario: electricity consumption vs temperature ---
# True relationship is quadratic (U-shaped) — think AC in summer AND heating in winter
temperature_c = np.linspace(-10, 40, 100)
consumption_kwh = 0.8 * temperature_c**2 - 20 * temperature_c + 300 + np.random.randn(100) * 15

X = temperature_c.reshape(-1, 1)  # sklearn expects 2D input
y = consumption_kwh

# --- Model 1: Naive linear regression ---
linear_model = LinearRegression()
linear_model.fit(X, y)
y_pred_linear = linear_model.predict(X)
r2_linear = r2_score(y, y_pred_linear)
residuals_linear = y - y_pred_linear

print("=== Linear Model ===")
print(f"R² score: {r2_linear:.4f}")
print(f"Max residual: {np.max(np.abs(residuals_linear)):.1f} kWh ← suspiciously large")
print(f"Residual pattern: {'U-shaped — model is missing the curve' if r2_linear < 0.7 else 'OK'}")

# --- Model 2: Polynomial regression (degree 2) ---
# make_pipeline chains preprocessing + model cleanly
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds x² as a feature
    LinearRegression()
)
poly_model.fit(X, y)
y_pred_poly = poly_model.predict(X)
r2_poly = r2_score(y, y_pred_poly)
residuals_poly = y - y_pred_poly

print("\n=== Polynomial Model (degree 2) ===")
print(f"R² score: {r2_poly:.4f} ← much better")
print(f"Max residual: {np.max(np.abs(residuals_poly)):.1f} kWh")
print(f"Residual pattern: {'Random — model captured the curve correctly' if r2_poly > 0.95 else 'Still biased'}")

# --- Check residual distribution ---
print("\n=== Residual Statistics ===")
print(f"Linear residuals — mean: {np.mean(residuals_linear):>7.2f}, std: {np.std(residuals_linear):.2f}")
print(f"Poly residuals   — mean: {np.mean(residuals_poly):>7.2f}, std: {np.std(residuals_poly):.2f}")
print("\nPro tip: residuals should have mean ≈ 0 and show no pattern when plotted against predictions.")
print("If they fan out or curve, your model has a systematic error your metrics are hiding.")
```
```
=== Linear Model ===
R² score: 0.3812
Max residual: 154.6 kWh ← suspiciously large
Residual pattern: U-shaped — model is missing the curve

=== Polynomial Model (degree 2) ===
R² score: 0.9823 ← much better
Max residual: 38.2 kWh
Residual pattern: Random — model captured the curve correctly

=== Residual Statistics ===
Linear residuals — mean: 0.00, std: 87.53
Poly residuals   — mean: 0.00, std: 14.97

Pro tip: residuals should have mean ≈ 0 and show no pattern when plotted against predictions.
If they fan out or curve, your model has a systematic error your metrics are hiding.
```
| Aspect | Simple Linear Regression | Multiple Linear Regression |
|---|---|---|
| Number of features | Exactly 1 input feature | 2 or more input features |
| Equation form | ŷ = mx + b (one slope) | ŷ = w₁x₁ + w₂x₂ + ... + b (one weight per feature) |
| Visualisation | Easy — 2D scatter + line | Hard to visualise beyond 3 features |
| Multicollinearity risk | Not applicable | Real risk — correlated features distort weights |
| Feature scaling needed | Not strictly, but recommended | Yes — essential for gradient descent and weight comparison |
| Typical use case | Quick sanity checks, teaching | Real-world predictions with many signals |
| Overfitting risk | Very low | Higher — use Ridge/Lasso with many features |
| Interpretability | Trivial — one slope to explain | Medium — each coefficient has clear meaning if features are scaled |
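The table mentions Ridge as the antidote to overfitting with many features. Here is a minimal sketch of why, on invented data with a near-duplicate feature (the dataset, seed, and alpha value are all assumptions for illustration): plain OLS splits the shared signal between the twin features erratically, while Ridge's L2 penalty keeps the coefficients at a sane scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(5)
n = 30
# Many features relative to sample size, including a near-duplicate — prone to unstable weights
X = rng.normal(size=(n, 20))
X[:, 1] = X[:, 0] + rng.normal(0, 0.01, n)  # feature 1 is almost a copy of feature 0
y = 3 * X[:, 0] + rng.normal(0, 1, n)       # only feature 0 truly matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)          # L2 penalty shrinks weights towards zero

print("OLS   w0, w1:", np.round(ols.coef_[:2], 2))
print("Ridge w0, w1:", np.round(ridge.coef_[:2], 2))
# OLS can assign huge opposite-signed weights to the twin features;
# Ridge spreads the shared signal across them at a modest magnitude
```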
🎯 Key Takeaways
- Ordinary Least Squares minimises the sum of squared residuals — squaring errors penalises large mistakes quadratically rather than linearly, and keeps the loss differentiable, enabling exact closed-form solutions.
- Always fit your scaler on training data only — leaking test data statistics into preprocessing is one of the most common and most invisible bugs in ML pipelines.
- R² tells you proportion of variance explained, but RMSE tells you the actual dollar/unit error — always report both, and always plot residuals before trusting either.
- Polynomial regression is still linear regression — 'linear' means the model is linear in its weights, not that the feature relationships must be straight lines.
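The last takeaway can be demonstrated in a few lines. Here the features are deliberately nonlinear in x (a made-up noise-free example), yet plain `LinearRegression` recovers the weights exactly, because the model is linear in its weights:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 2 * np.pi, 200)
y = 2.0 * np.sin(x) + 0.5 * x  # clearly not a straight line in x

# Hand-crafted nonlinear features; the model is still linear in its weights
features = np.column_stack([np.sin(x), x])
model = LinearRegression().fit(features, y)
print(np.round(model.coef_, 3))  # recovers the weights 2.0 and 0.5 exactly (no noise)
```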
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Fitting the StandardScaler on both training AND test data — Your test metrics look great but the model underperforms in production, because you accidentally used test data statistics during preprocessing. Fix: Always call scaler.fit_transform(X_train) once, then scaler.transform(X_test) and scaler.transform(X_new). The scaler should 'know' nothing about data it hasn't been trained on.
- ✕ Mistake 2: Using linear regression for a classification target — The model will predict values like 0.3 or 1.7 for a binary yes/no problem, which are meaningless probabilities and can't be thresholded reliably. Fix: Switch to LogisticRegression from sklearn for binary outcomes. Despite the name, it's a classifier — it outputs proper probabilities between 0 and 1.
- ✕ Mistake 3: Ignoring residual plots and trusting R² alone — A model with R²=0.85 sounds great, but if your residuals form a clear U-shape or fan pattern, your model has systematic bias that R² cannot detect. Fix: Always plot residuals vs predicted values. Random scatter around zero is what you want. Any pattern means a model assumption is violated.
Interview Questions on This Topic
- Q: What is the difference between R² and adjusted R²? When would a high R² actually be misleading?
- Q: Explain multicollinearity — how do you detect it, and how does it affect linear regression coefficients specifically?
- Q: If you add a completely useless random feature to a linear regression model, what happens to R² on the training set, and why? What metric should you look at instead?
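The last interview question can be answered empirically. On invented data (the seed and feature counts are assumptions), adding a pure-noise feature can never decrease training R², while adjusted R² penalises the extra predictor:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(11)
n = 100
x = rng.uniform(0, 10, n)
y = 2 * x + rng.normal(0, 2, n)

X1 = x.reshape(-1, 1)
X2 = np.column_stack([x, rng.normal(size=n)])  # append a completely useless random feature

r2_1 = r2_score(y, LinearRegression().fit(X1, y).predict(X1))
r2_2 = r2_score(y, LinearRegression().fit(X2, y).predict(X2))

def adjusted_r2(r2, n, p):
    # Penalises extra predictors: only improves if the feature genuinely helps
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R² 1 feature : {r2_1:.4f}  adjusted: {adjusted_r2(r2_1, n, 1):.4f}")
print(f"R² 2 features: {r2_2:.4f}  adjusted: {adjusted_r2(r2_2, n, 2):.4f}")
# Training R² can only stay the same or rise when a feature is added;
# adjusted R² (or held-out RMSE) is what exposes the useless feature
```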
Frequently Asked Questions
What is the difference between linear regression and logistic regression?
Linear regression predicts a continuous numerical value (like salary or temperature). Logistic regression predicts the probability of a categorical outcome (like whether an email is spam or not). Despite sharing similar math, they're used for completely different problem types — confusing them is one of the most common beginner mistakes in ML.
Do I need to scale features before using linear regression?
For sklearn's LinearRegression using OLS, scaling doesn't affect the final predictions or R² score. But it matters a lot for gradient descent-based solvers (like SGDRegressor) and makes your model coefficients directly comparable to each other. It's good practice to always scale — the cost is near zero and the benefit is real.
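The claim that scaling doesn't change OLS predictions is easy to verify on synthetic data (the dataset below is invented for the check):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.uniform(0, 100, (50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1, 50)

raw = LinearRegression().fit(X, y).predict(X)

X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y).predict(X_scaled)

# Same predictions either way: scaling only reparameterises the coefficients
print(np.allclose(raw, scaled))  # → True
```

The coefficients themselves do change (each is divided by its feature's standard deviation), which is exactly why scaled coefficients are easier to compare against one another.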
How do I know if my data is suitable for linear regression?
Start by plotting your features against the target variable. If the scatter plots show rough straight-line trends, linear regression is a reasonable starting point. After fitting, plot the residuals — if they're randomly scattered around zero with no pattern, your model assumptions hold. A U-shaped residual plot means try polynomial features; a fan-shaped plot means try log-transforming your target variable.
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.