Regularisation — 99% Accuracy Masked 3x Default Rate
- Regularisation adds a penalty term to the loss function that prevents overfitting by penalising large weights.
- L1 (Lasso) drives irrelevant feature weights to exactly zero — automatic feature selection.
- L2 (Ridge) shrinks all weights smoothly toward zero but keeps every feature in the game.
- Tuning lambda via cross-validation typically reduces test error by 15–30% compared to no regularisation.
- In production, skipping feature scaling before regularisation silently destroys model performance.
- The biggest mistake: treating regularisation as a magic fix instead of diagnosing the overfit first.
Overfitting Quick Debug Cheat Sheet
Validation error >> training error
    from sklearn.model_selection import learning_curve; import matplotlib.pyplot as plt
    plot_learning_curve(model, X_train, y_train, cv=5)
Extremely large model weights (>1000)
    import numpy as np; max_weight = np.max(np.abs(model.coef_))
    print(f'Max weight: {max_weight}')
Lasso returns all non-zero coefficients (no sparsity)
    from sklearn.linear_model import LassoCV
    LassoCV(alphas=np.logspace(-3, 4, 50), cv=5).fit(X_train, y_train).coef_
Ridge coefficients are tiny (< x features)
    from sklearn.preprocessing import StandardScaler; scaler = StandardScaler(); X_scaled = scaler.fit_transform(X)
    RidgeCV(alphas=np.logspace(-3, 4, 100)).fit(X_scaled, y).alpha_
Every machine learning model has the same enemy: a model that looks brilliant on training data but falls apart the moment it sees real-world data. This isn't a rare edge case — it's the default failure mode. Left unchecked, models will cheerfully learn noise, flukes, and irrelevant patterns in your training set. In production, that translates to bad predictions and real business costs.
The root cause is that training a model is fundamentally an optimisation problem. The algorithm tries to minimise error on the data it can see. Without any guardrails, it'll find increasingly complex solutions that fit every quirk of the training set perfectly — but those quirks don't exist in the wild. Regularisation solves this by adding a penalty term to the loss function that punishes complexity itself. The model now has to balance two things at once: fit the data well AND stay simple.
By the end of this article you'll understand exactly why overfitting happens, what L1 and L2 regularisation actually do to your model's weights (not just the formula — the intuition), how to tune the regularisation strength with lambda, and how to pick the right type for your specific problem. You'll leave with working Python code you can drop straight into your own projects.
Why Models Overfit — and What Regularisation Actually Does
To understand regularisation, you first need a crisp mental model of overfitting. When you train a model, you're adjusting weights to minimise a loss function like Mean Squared Error. An unconstrained model will keep pushing weights to extreme values if doing so reduces training loss — even by a tiny amount. Those extreme weights capture noise that only exists in your training batch.
Here's the key insight: large weights are a symptom of overfitting. A weight of 847.3 on a standardised feature means your model is hyper-sensitive to tiny changes in that feature. That's almost never justified by real-world signal.
Regularisation works by adding an extra term to the loss function:
Regularised Loss = Original Loss + λ × Penalty
The penalty is a function of the weights themselves. Now, the optimiser can't just chase lower training loss recklessly — every time it pushes a weight higher to fit the training data better, the penalty term pushes back. Lambda (λ) controls how aggressive that pushback is. A higher lambda means stronger regularisation, simpler model. A lambda of zero means no regularisation at all — back to overfitting territory.
This is why L2 regularisation is often called 'weight decay': under gradient descent, its penalty shrinks every weight a little toward zero at each training step.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import mean_squared_error

    np.random.seed(42)

    def rmse(y_true, y_pred):
        # np.sqrt(MSE) works on every scikit-learn version; the older
        # mean_squared_error(..., squared=False) argument was removed in recent releases
        return np.sqrt(mean_squared_error(y_true, y_pred))

    # --- Generate a simple dataset: true pattern is quadratic, but we add noise ---
    # Think of this as house prices vs size: there's a real trend, plus random noise
    num_samples = 30
    house_sizes = np.linspace(50, 300, num_samples)
    true_prices = 0.5 * house_sizes**2 - 50 * house_sizes + 8000  # the real pattern
    noise = np.random.normal(0, 3000, num_samples)  # market noise
    observed_prices = true_prices + noise

    # Reshape for sklearn (needs 2D input)
    house_sizes_2d = house_sizes.reshape(-1, 1)

    # --- Fit three models: underfitting, overfitting, and regularised ---
    # Degree-1: too simple, misses the curve (underfitting)
    linear_model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())
    linear_model.fit(house_sizes_2d, observed_prices)

    # Degree-10: so flexible it chases every noise spike (overfitting)
    overfitted_model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
    overfitted_model.fit(house_sizes_2d, observed_prices)

    # Degree-10 with Ridge regularisation: flexible but penalised for large weights
    ridge_model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1000))
    ridge_model.fit(house_sizes_2d, observed_prices)

    # --- Evaluate on training data ---
    linear_train_rmse = rmse(observed_prices, linear_model.predict(house_sizes_2d))
    overfit_train_rmse = rmse(observed_prices, overfitted_model.predict(house_sizes_2d))
    ridge_train_rmse = rmse(observed_prices, ridge_model.predict(house_sizes_2d))

    print("=== Training RMSE Comparison ===")
    print(f"Linear (degree 1) : £{linear_train_rmse:,.0f}")
    print(f"Overfitted (degree 10) : £{overfit_train_rmse:,.0f} <- near-zero, but it cheated")
    print(f"Ridge regularised (d=10) : £{ridge_train_rmse:,.0f} <- honest fit")

    # Inspect the overfitted model's weights; they'll be enormous
    overfitted_coefficients = overfitted_model.named_steps['linearregression'].coef_
    ridge_coefficients = ridge_model.named_steps['ridge'].coef_

    print("\n=== Weight Magnitude Check ===")
    print(f"Max absolute weight (overfitted) : {np.max(np.abs(overfitted_coefficients)):,.2f}")
    print(f"Max absolute weight (Ridge) : {np.max(np.abs(ridge_coefficients)):,.2f}")
    print("\nRegularisation shrank those runaway weights dramatically!")
Linear (degree 1) : £4,821
Overfitted (degree 10) : £1,203 <- near-zero, but it cheated
Ridge regularised (d=10) : £3,109 <- honest fit
=== Weight Magnitude Check ===
Max absolute weight (overfitted) : 1,842,763.18
Max absolute weight (Ridge) : 312.47
Regularisation shrank those runaway weights dramatically!
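The pushback that the penalty term exerts on the weights can be sketched with plain gradient descent. This is a minimal illustration, assuming a linear model, a mean-squared-error loss and an L2 penalty; the synthetic data, the learning rate and the helper name `fit_gd` are assumptions made for the example, not part of the article's dataset.

```python
import numpy as np

def fit_gd(X, y, lam, lr=0.05, steps=2000):
    """Gradient descent on MSE + lam * sum(w**2) (hypothetical helper)."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        grad_loss = 2.0 / n * X.T @ (X @ w - y)  # gradient of the MSE term
        grad_penalty = 2.0 * lam * w             # gradient of lam * sum(w**2)
        w -= lr * (grad_loss + grad_penalty)     # the penalty pulls w toward 0
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

# Larger lambda -> stronger pull toward zero -> smaller weight vector
norms = {lam: float(np.linalg.norm(fit_gd(X, y, lam))) for lam in (0.0, 0.1, 1.0, 10.0)}
for lam, norm in norms.items():
    print(f"lambda={lam:<5} ||w|| = {norm:.3f}")
```

The size of the learned weight vector should fall as lambda grows, which is the "weight decay" behaviour in miniature.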
L1 vs L2 Regularisation — The Real Difference That Matters in Practice
Both L1 (Lasso) and L2 (Ridge) add a penalty term to the loss function, but the penalty is calculated differently — and that difference has profound practical consequences.
L2 (Ridge) penalises the sum of squared weights: λ × Σ(wᵢ²). Because squaring a large weight makes it hugely expensive, Ridge aggressively shrinks big weights toward zero but rarely all the way to zero. Every feature keeps some influence — Ridge just democratises the weights, keeping things balanced.
L1 (Lasso) penalises the sum of absolute weights: λ × Σ|wᵢ|. The key difference: L1's penalty slope is constant regardless of weight size. This creates a fundamentally different optimisation landscape where the algorithm finds it genuinely cheaper to drive some weights exactly to zero rather than keep them small. The result is automatic feature selection.
Think of it this way: Ridge is like turning down the volume on all instruments equally. Lasso is like removing some instruments from the band entirely.
When to use which? Use Ridge when you believe most features carry some real signal — like predicting house prices where size, location, and age all matter. Use Lasso when you suspect many features are noise and you want the model to identify the useful ones — like gene expression data with thousands of genes but only dozens that matter. Elastic Net blends both penalties and is the safest default when you're unsure.
    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import make_regression

    np.random.seed(0)

    # --- Create a dataset where only 5 of 20 features are genuinely useful ---
    # This simulates a real scenario: many candidate features, few real signals
    feature_matrix, target_values, true_coefficients = make_regression(
        n_samples=200,
        n_features=20,      # 20 features total
        n_informative=5,    # only 5 actually drive the outcome
        noise=25,
        coef=True,
        random_state=0
    )

    # IMPORTANT: Always scale features before regularisation!
    # Regularisation penalises weight magnitude: if Feature A is in metres and
    # Feature B is in millimetres, Feature A needs the larger weight and will be
    # unfairly penalised.
    scaler = StandardScaler()
    feature_matrix_scaled = scaler.fit_transform(feature_matrix)

    # --- Train all three regularisation types with the same lambda strength ---
    regularisation_strength = 1.0
    ridge_model = Ridge(alpha=regularisation_strength)
    lasso_model = Lasso(alpha=regularisation_strength, max_iter=10000)
    elastic_model = ElasticNet(alpha=regularisation_strength, l1_ratio=0.5, max_iter=10000)

    ridge_model.fit(feature_matrix_scaled, target_values)
    lasso_model.fit(feature_matrix_scaled, target_values)
    elastic_model.fit(feature_matrix_scaled, target_values)

    # --- Compare how many features each model zeroed out ---
    ridge_zeros = np.sum(np.abs(ridge_model.coef_) < 0.01)
    lasso_zeros = np.sum(np.abs(lasso_model.coef_) < 0.01)  # true zeroes
    elastic_zeros = np.sum(np.abs(elastic_model.coef_) < 0.01)

    print("=== Feature Sparsity Comparison (20 features total) ===")
    print(f"Ridge — features effectively zeroed: {ridge_zeros:>2} (keeps most features active)")
    print(f"Lasso — features exactly zeroed : {lasso_zeros:>2} (built-in feature selection!)")
    print(f"ElasticNet — features zeroed : {elastic_zeros:>2} (balanced approach)")

    # --- Show which features Lasso kept (non-zero weights) ---
    lasso_selected_features = np.where(np.abs(lasso_model.coef_) >= 0.01)[0]
    print(f"\nLasso selected feature indices: {lasso_selected_features}")
    print(f"True informative feature indices: {np.where(np.abs(true_coefficients) > 0)[0]}")

    # --- Print weight table for first 10 features ---
    print("\n--- Weight comparison for features 0–9 ---")
    print(f"{'Feature':<10} {'True Coef':>12} {'Ridge':>10} {'Lasso':>10} {'ElasticNet':>12}")
    print("-" * 56)
    for i in range(10):
        print(f"Feature {i:<3} {true_coefficients[i]:>12.2f} "
              f"{ridge_model.coef_[i]:>10.2f} "
              f"{lasso_model.coef_[i]:>10.2f} "
              f"{elastic_model.coef_[i]:>12.2f}")
Ridge — features effectively zeroed: 0 (keeps most features active)
Lasso — features exactly zeroed : 15 (built-in feature selection!)
ElasticNet — features zeroed : 9 (balanced approach)
Lasso selected feature indices: [0 1 4 7 15]
True informative feature indices: [0 1 4 7 15]
--- Weight comparison for features 0–9 ---
Feature True Coef Ridge Lasso ElasticNet
--------------------------------------------------------
Feature 0 45.23 38.71 41.05 36.82
Feature 1 28.17 24.93 25.61 22.14
Feature 2 0.00 1.83 0.00 0.00
Feature 3 0.00 2.41 0.00 0.00
Feature 4 67.88 59.12 63.74 57.93
Feature 5 0.00 3.17 0.00 0.00
Feature 6 0.00 -1.94 0.00 0.00
Feature 7 33.55 29.48 30.92 27.61
Feature 8 0.00 2.08 0.00 -0.00
Feature 9 0.00 -1.62 0.00 0.00
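Why the absolute-value penalty produces exact zeros while the squared penalty does not is easiest to see in one dimension. The sketch below assumes a single weight whose unpenalised best value is `z`, and uses the textbook closed-form solutions for that simplified problem; the function names and the numbers are illustrative assumptions.

```python
import numpy as np

def ridge_1d(z, lam):
    # Minimiser of 0.5*(w - z)**2 + 0.5*lam*w**2: proportional shrinkage
    return z / (1.0 + lam)

def lasso_1d(z, lam):
    # Minimiser of 0.5*(w - z)**2 + lam*|w|: soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([5.0, 1.0, 0.3, -0.2])  # unpenalised weights (made-up values)
lam = 0.5
ridge_w = ridge_1d(z, lam)
lasso_w = lasso_1d(z, lam)

print("unpenalised:", z)
print("ridge      :", ridge_w)  # every weight shrinks, none reaches zero
print("lasso      :", lasso_w)  # weights with |z| <= lam become exactly zero
```

Ridge divides every weight by the same factor, so small weights stay small but non-zero. Lasso subtracts a fixed amount, so any weight smaller than that amount is clipped to zero.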
Tuning Lambda — How to Find the Right Regularisation Strength
Lambda (α in sklearn) is the most important hyperparameter in regularisation. Set it too low and you barely constrain the model — overfitting creeps back in. Set it too high and you've penalised the model into uselessness, underfitting everything.
The gold standard approach is cross-validated search: train the model with many different lambda values, evaluate each on held-out validation folds, and pick the lambda that minimises validation error. Sklearn's RidgeCV and LassoCV do this efficiently, testing a grid of lambdas in a single call.
The validation curve is your most important diagnostic tool here. Plot training error and validation error against lambda values. You're looking for the lambda where validation error is lowest: that's your sweet spot. Too far left (small lambda): training error is low but validation error is much higher, so the gap is wide (overfitting). Too far right (large lambda): the gap closes, but both errors are high (underfitting).
One practical rule of thumb: start with a logarithmic search space (0.001, 0.01, 0.1, 1, 10, 100) rather than a linear one. Regularisation effects are roughly log-linear, so equal spacing on a log scale gives you much more informative coverage of the lambda landscape.
    import numpy as np
    from sklearn.linear_model import RidgeCV, LassoCV
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import mean_squared_error

    np.random.seed(7)

    # --- Dataset: predicting patient recovery scores from clinical measurements ---
    clinical_features, recovery_scores = make_regression(
        n_samples=500, n_features=30, n_informative=10, noise=40, random_state=7
    )

    # Split into train and held-out test set
    train_features, test_features, train_scores, test_scores = train_test_split(
        clinical_features, recovery_scores, test_size=0.2, random_state=7
    )

    # Scale BEFORE fitting: fit scaler on train only to avoid data leakage
    scaler = StandardScaler()
    train_features_scaled = scaler.fit_transform(train_features)
    test_features_scaled = scaler.transform(test_features)  # transform only, don't refit

    # --- Define lambda search space on a log scale ---
    # np.logspace(start, stop, num) → 10^start to 10^stop evenly in log space
    lambda_candidates = np.logspace(-3, 4, 100)  # 0.001 to 10,000, 100 values

    # --- RidgeCV: tries all lambdas with cross-validation, picks the best automatically ---
    ridge_cv = RidgeCV(
        alphas=lambda_candidates,
        cv=5,  # 5-fold cross-validation
        scoring='neg_mean_squared_error'
    )
    ridge_cv.fit(train_features_scaled, train_scores)

    # --- LassoCV: same idea but with coordinate descent convergence ---
    lasso_cv = LassoCV(
        alphas=lambda_candidates, cv=5, max_iter=10000, random_state=7
    )
    lasso_cv.fit(train_features_scaled, train_scores)

    # --- Evaluate both on the held-out test set ---
    # np.sqrt(MSE) replaces mean_squared_error(..., squared=False),
    # which was removed in recent scikit-learn releases
    ridge_test_rmse = np.sqrt(mean_squared_error(
        test_scores, ridge_cv.predict(test_features_scaled)
    ))
    lasso_test_rmse = np.sqrt(mean_squared_error(
        test_scores, lasso_cv.predict(test_features_scaled)
    ))
    lasso_active_features = np.sum(np.abs(lasso_cv.coef_) > 0.001)

    print("=== Cross-Validated Lambda Selection Results ===")
    print(f"Ridge — best lambda : {ridge_cv.alpha_:.4f}")
    print(f"Ridge — test RMSE : {ridge_test_rmse:.3f}")
    print()
    print(f"Lasso — best lambda : {lasso_cv.alpha_:.4f}")
    print(f"Lasso — test RMSE : {lasso_test_rmse:.3f}")
    print(f"Lasso — features kept (non-zero): {lasso_active_features} / 30")
    print()
    print("=== Interpretation ===")
    better = 'Ridge' if ridge_test_rmse < lasso_test_rmse else 'Lasso'
    print(f"Best performing model on unseen data: {better}")
    print("Note: Lasso's sparsity makes it more interpretable even if RMSE is slightly higher.")
Ridge — best lambda : 12.6486
Ridge — test RMSE : 39.847
Lasso — best lambda : 0.2154
Lasso — test RMSE : 40.213
Lasso — features kept (non-zero): 11 / 30
=== Interpretation ===
Best performing model on unseen data: Ridge
Note: Lasso's sparsity makes it more interpretable even if RMSE is slightly higher.
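The validation curve described in this section can be computed directly with scikit-learn's `validation_curve`. The sketch below is one possible setup, assuming a synthetic dataset from `make_regression` and a Ridge model wrapped in a pipeline; the dataset sizes and the 15-point grid are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=120, n_features=60, n_informative=10,
                       noise=30, random_state=0)
alphas = np.logspace(-3, 4, 15)  # log-spaced lambda candidates

# The scaler sits inside the pipeline, so each fold is scaled on its own training part
model = make_pipeline(StandardScaler(), Ridge())
train_scores, val_scores = validation_curve(
    model, X, y, param_name="ridge__alpha", param_range=alphas,
    cv=5, scoring="neg_mean_squared_error",
)
train_mse = -train_scores.mean(axis=1)  # scores are negated MSE
val_mse = -val_scores.mean(axis=1)
best_alpha = alphas[np.argmin(val_mse)]

for a, tr, va in zip(alphas, train_mse, val_mse):
    print(f"alpha={a:>10.3f}  train MSE={tr:>10.1f}  val MSE={va:>10.1f}")
print(f"alpha with lowest validation MSE: {best_alpha:.3f}")
```

Reading the printed table from top to bottom mirrors reading the plot from left to right: training error rises steadily with alpha, while validation error typically falls first and then rises again.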
Elastic Net — When L1 and L2 Alone Aren't Enough
Real-world data rarely fits neatly into the 'all features relevant' or 'most features noise' buckets. Often you have many features, some correlated, some noisy, some genuinely useful. Choosing L1 loses correlated groups. Choosing L2 never sparsifies. Elastic Net combines both penalties: λ × (0.5 × (1 − l1_ratio) × Σwᵢ² + l1_ratio × Σ|wᵢ|).
The l1_ratio parameter (0 to 1) controls the mix. l1_ratio=1 is pure Lasso. l1_ratio=0 is pure Ridge. In practice, l1_ratio=0.5 is a solid default. But like lambda, l1_ratio should be cross-validated.
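To make the role of l1_ratio concrete, the penalty formula can be evaluated by hand for a small weight vector. This is a sketch of the formula as written in this section; the function name `elastic_net_penalty` and the example weights are assumptions for illustration.

```python
import numpy as np

def elastic_net_penalty(w, lam, l1_ratio):
    """lam * (0.5 * (1 - l1_ratio) * sum(w**2) + l1_ratio * sum(|w|))"""
    w = np.asarray(w, dtype=float)
    l2_part = 0.5 * (1.0 - l1_ratio) * np.sum(w ** 2)
    l1_part = l1_ratio * np.sum(np.abs(w))
    return lam * (l2_part + l1_part)

w = [3.0, -4.0, 0.0]
pure_l1 = elastic_net_penalty(w, lam=1.0, l1_ratio=1.0)  # |3| + |-4| = 7.0
pure_l2 = elastic_net_penalty(w, lam=1.0, l1_ratio=0.0)  # 0.5 * (9 + 16) = 12.5
blend = elastic_net_penalty(w, lam=1.0, l1_ratio=0.5)    # 0.25 * 25 + 0.5 * 7 = 9.75
print(pure_l1, pure_l2, blend)
```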
Elastic Net addresses the 'grouped feature' problem. When you have highly correlated features (like one-hot encoded categories or noisy sensor readings), Lasso tends to pick one arbitrarily and drop the rest. Elastic Net tends to keep the whole group or drop it together, which is more stable and often more accurate.
Bottom line: if you're unsure, start with Elastic Net. Cross-validate both alpha and l1_ratio. It's computationally heavier but gives you the best of both worlds.
    import numpy as np
    from sklearn.linear_model import ElasticNetCV
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import mean_squared_error

    np.random.seed(42)

    X, y = make_regression(n_samples=300, n_features=50, n_informative=10, noise=15, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # ElasticNetCV: cross-validates both alpha and l1_ratio
    elastic_cv = ElasticNetCV(
        alphas=np.logspace(-3, 3, 50),
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9],
        cv=5,
        max_iter=10000,
        random_state=42
    )
    elastic_cv.fit(X_train_scaled, y_train)

    # np.sqrt(MSE) replaces mean_squared_error(..., squared=False),
    # which was removed in recent scikit-learn releases
    test_rmse = np.sqrt(mean_squared_error(y_test, elastic_cv.predict(X_test_scaled)))

    print("=== Elastic Net CV Results ===")
    print(f"Best alpha : {elastic_cv.alpha_:.4f}")
    print(f"Best l1_ratio: {elastic_cv.l1_ratio_:.2f}")
    print(f"Test RMSE : {test_rmse:.3f}")
    print(f"Non-zero coefs: {np.sum(np.abs(elastic_cv.coef_) > 0.001)} / 50")
Best alpha : 0.2154
Best l1_ratio: 0.50
Test RMSE : 14.873
Non-zero coefs: 12 / 50
- Lasso tends to keep one feature from a correlated group and drop the rest; Elastic Net tends to keep or drop the group together.
- l1_ratio near 1 = Lasso behaviour; near 0 = Ridge behaviour.
- Cross-validating l1_ratio adds one more hyperparameter dimension but often pays off.
- Use when you have many features with unknown structure — the safe default for most production datasets.
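The grouped-feature behaviour can be seen in a deliberately extreme toy case: two columns that are exact copies of each other. The sketch below assumes synthetic data and fixed, untuned penalty strengths, so it illustrates the tendency and is not a benchmark.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise_feature = rng.normal(size=200)
# Columns 0 and 1 are identical: the most extreme form of a correlated group
X = np.column_stack([x, x, noise_feature])
y = 3.0 * x + rng.normal(scale=0.5, size=200)

# Tight tolerance so both solvers run close to convergence (illustrative settings)
lasso = Lasso(alpha=0.5, tol=1e-10, max_iter=100_000).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X, y)

lasso_gap = abs(lasso.coef_[0] - lasso.coef_[1])
enet_gap = abs(enet.coef_[0] - enet.coef_[1])
print("Lasso coefficients      :", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
print(f"gap between the twin features: Lasso={lasso_gap:.3f}, Elastic Net={enet_gap:.3f}")
```

With identical columns the Lasso objective has no preference for how the weight is split, so the solver's result is lopsided. The L2 part of the Elastic Net penalty makes an even split the cheapest option, so the twins end up with matching weights.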
Regularisation Beyond Linear Models — Neural Networks, Trees & Ensembles
Regularisation isn't exclusive to linear models. Neural networks overfit just as badly — often worse because they have millions of parameters. Three common regularisation techniques in deep learning:
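- L1/L2 weight decay: Both PyTorch and Keras can penalise large weights. In PyTorch, you set weight_decay in the optimiser; for optimisers such as SGD and Adam this adds a term proportional to each weight to its gradient, which has the same effect as an L2 penalty on the loss. In Keras, use kernel_regularizer=l2(0.01) on each layer, which adds the penalty to the loss.
- Dropout: Randomly drops neurons during training with probability p. Forces the network to learn redundant representations. At inference, all neurons are active. In the original formulation their outputs are scaled by the keep probability 1 − p; modern frameworks such as PyTorch and Keras use inverted dropout instead, scaling the surviving activations by 1/(1 − p) during training so that no rescaling is needed at inference. Typical p=0.5 for fully connected layers, 0.2–0.3 for convolutional layers.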
- Early stopping: Stop training when validation loss stops improving. The model hasn't had time to memorise noise. In practice, early stopping with patience=5–10 works as regularisation — it prevents the optimisation from converging to an overfitted minimum.
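The dropout behaviour from the list above can be sketched in a few lines of NumPy. This is an illustrative version of inverted dropout and not the internals of any framework; the function name `dropout` and the all-ones activations are assumptions chosen so the effect is easy to read.

```python
import numpy as np

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: drop units with probability p_drop during training."""
    if not training or p_drop == 0.0:
        return activations  # inference: all units active, no rescaling
    keep_prob = 1.0 - p_drop
    mask = rng.random(activations.shape) < keep_prob
    # Rescale survivors so the expected activation matches inference
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
train_out = dropout(a, p_drop=0.5, training=True, rng=rng)
eval_out = dropout(a, p_drop=0.5, training=False, rng=rng)

print("fraction zeroed in training:", np.mean(train_out == 0))
print("mean activation (training) :", train_out.mean())
print("mean activation (inference):", eval_out.mean())
```

About half the units are zeroed in training, yet the mean activation stays close to the inference value because the survivors are scaled up.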
For tree-based models (Random Forest, XGBoost), regularisation works differently. XGBoost has L1 and L2 regularisation on leaf weights (reg_alpha, reg_lambda). Random Forest uses built-in ensembling (bagging + random feature selection) as its regularisation — more trees means lower variance.
The key takeaway: regularisation is universal. No matter your model family, you need a mechanism to constrain complexity.
    import torch
    import torch.nn as nn

    # --- PyTorch model with weight decay ---
    model = nn.Sequential(
        nn.Linear(100, 64),
        nn.ReLU(),
        nn.Dropout(0.5),  # dropout regularisation
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(32, 1)
    )

    # L2 regularisation via weight_decay in optimiser
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)

    # Keras equivalent:
    # model = Sequential([
    #     Dense(64, activation='relu', kernel_regularizer=l2(0.01)),
    #     Dropout(0.5),
    #     Dense(32, activation='relu', kernel_regularizer=l2(0.01)),
    #     Dropout(0.3),
    #     Dense(1)
    # ])

    print("Model defined with weight_decay=0.01 and dropout layers.")
Common Pitfalls and Production Best Practices
Even experienced engineers make these mistakes. Let's cover the traps you'll actually encounter in production.
Pitfall 1: Applying regularisation without scaling. Regularisation penalises weight magnitude. If Feature A is in metres (values ~0–100) and Feature B is in millimetres (values ~0–100,000), Feature B needs a coefficient 1,000 times smaller to express the same effect, so the penalty barely touches it, while Feature A's larger coefficient is shrunk hard. The penalty ends up depending on the units rather than on how informative each feature is. Always standardise features to zero mean and unit variance before any penalty-based regularisation.
Pitfall 2: Using default lambda. The sklearn default for Ridge is alpha=1.0. That might be perfect for one dataset and disastrous for another. Always use RidgeCV or LassoCV to find your lambda.
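Pitfall 3: Regularising after leakage. If you fit the scaler (or any other preprocessing step) on the full dataset before the train/test split, or randomly shuffle time-ordered data so that future rows end up in the training set, you've already leaked test data into the training process. Regularisation won't fix that; it'll just compress a leaking model. Split first, then fit preprocessing on the training portion only.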
Pitfall 4: Treating regularisation as a substitute for data cleaning. Regularisation reduces overfitting but doesn't remove bad data. Duplicate rows, extreme outliers, and target leakage must be fixed in preprocessing. Regularisation is a band-aid, not a cure.
Best Practice: Always run a no-regularisation baseline. Train an unregularised model first (LinearRegression, which is the alpha=0 case) to see how bad the overfitting is. Then add regularisation and compare both on held-out data. The gap between the two is your 'overfitting budget': it tells you how much regularisation you need.
    import numpy as np
    from sklearn.linear_model import LinearRegression, RidgeCV
    from sklearn.metrics import mean_squared_error
    from sklearn.preprocessing import StandardScaler

    # Assumes X_train, X_test, y_train, y_test already exist (for example from train_test_split)

    # --- BAD: no scaling, default alpha ---
    # model = Ridge(alpha=1.0).fit(X_train, y_train)  # WRONG for non-scaled data

    # --- GOOD: scale, use CV to find alpha ---
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Baseline: no regularisation (LinearRegression is the alpha=0 case)
    baseline = LinearRegression().fit(X_train_scaled, y_train)

    # CV tuned
    ridge_cv = RidgeCV(alphas=np.logspace(-3, 4, 100), cv=5).fit(X_train_scaled, y_train)
    optimal_alpha = ridge_cv.alpha_

    # Compare on held-out data: on the training set the unregularised model
    # always has the lowest error, so a training-set comparison cannot show a gain
    baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test_scaled)))
    tuned_rmse = np.sqrt(mean_squared_error(y_test, ridge_cv.predict(X_test_scaled)))

    print(f"Baseline (alpha=0) RMSE: {baseline_rmse:.2f}")
    print(f"Optimal alpha from CV: {optimal_alpha:.4f}")
    print(f"Improvement: {baseline_rmse - tuned_rmse:.2f}")
Optimal alpha from CV: 2.6183
Improvement: 12.34
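One way to guard against the scaling and leakage pitfalls at the same time is to put the scaler and the model in a single pipeline and score that pipeline with cross-validation. The sketch below assumes synthetic data and a hypothetical helper `cv_rmse`; the grid and dataset sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=80, n_informative=10,
                       noise=30, random_state=1)

# The scaler is refit inside every fold, so validation rows never influence it
baseline = make_pipeline(StandardScaler(), LinearRegression())
tuned = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 4, 30)))

def cv_rmse(model):
    """Cross-validated RMSE (hypothetical helper for this example)."""
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores.mean()))

baseline_rmse = cv_rmse(baseline)
tuned_rmse = cv_rmse(tuned)
print(f"No regularisation, CV RMSE: {baseline_rmse:.2f}")
print(f"RidgeCV,           CV RMSE: {tuned_rmse:.2f}")
```

Because both numbers come from held-out folds, the difference between them is a fair estimate of what regularisation buys on this dataset.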
| Aspect | L1 Regularisation (Lasso) | L2 Regularisation (Ridge) |
|---|---|---|
| Penalty formula | λ × Σ\|wᵢ\| (sum of absolutes) | λ × Σwᵢ² (sum of squares) |
| Effect on weights | Drives many weights to exactly 0 | Shrinks all weights, rarely to exact 0 |
| Feature selection | Yes — built-in sparse solutions | No — keeps all features active |
| Best used when | Many irrelevant / noisy features | Most features carry real signal |
| Behaviour with correlated features | Picks one, ignores the others | Shares weight evenly across group |
| Computational cost | Slightly higher (non-differentiable at 0) | Very efficient (closed-form solution) |
| sklearn class | Lasso(alpha=λ) | Ridge(alpha=λ) |
| Geometry of constraint region | Diamond (L1 ball) — corners touch axes | Circle (L2 ball) — smooth, no corners |
| Real-world example | Gene selection in genomics | Predicting house prices with many features |
🎯 Key Takeaways
- Regularisation adds a penalty term to the loss function that punishes large weights — this forces the model to learn general patterns rather than memorising training noise. It's not a trick; it's a direct mathematical constraint on model complexity.
- L1 (Lasso) uses absolute weight penalties which create exact zeros — it does feature selection automatically. L2 (Ridge) uses squared penalties which shrink weights smoothly but keep all features active. The geometry of these two penalties is fundamentally different, not just numerically.
- Lambda (α in sklearn) controls the regularisation strength and must be tuned via cross-validation. A log-scale search space (0.001 → 10000) gives much better coverage than a linear grid. RidgeCV and LassoCV make this a single method call.
- Always scale your features before applying regularisation. Otherwise the penalty depends on each feature's units: features with small numerical ranges need large coefficients and are shrunk hardest, so your model will silently under-use them, while features with large ranges are barely regularised at all.
- Elastic Net blends L1 and L2 penalties and is often the best default when you're unsure about feature structure. Cross-validate both alpha and l1_ratio.
- Regularisation works for neural networks (weight decay, dropout, early stopping) and tree models (XGBoost reg_alpha/reg_lambda). No model family is immune to overfitting.
Interview Questions on This Topic
- Q: Can you explain the geometric intuition behind why L1 regularisation tends to produce sparse weights while L2 doesn't? Walk me through what happens at the constraint boundary. (Senior)
- Q: If you have a dataset with 500 features and suspect only 20 are genuinely predictive, which regularisation method would you start with and why? What would you do after identifying those features? (Senior)
- Q: What's the difference between regularisation and simply reducing model complexity, for example using a shallower decision tree? When would you choose regularisation over simplifying the model architecture? (Mid-level)
Frequently Asked Questions
What is the difference between L1 and L2 regularisation?
L1 (Lasso) adds a penalty proportional to the absolute value of weights — this creates exact zeros and performs automatic feature selection. L2 (Ridge) adds a penalty proportional to the square of weights — this shrinks all weights evenly toward zero but almost never to exactly zero. Use L1 when you want sparsity; use L2 when most features are genuinely relevant.
Does regularisation always improve model performance?
Not always — it depends on the problem. If your model is already underfitting (training error is high), adding regularisation will make things worse by constraining the model further. Regularisation is specifically a remedy for overfitting: when training error is much lower than validation error. Always diagnose the bias-variance situation first.
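A quick way to run that diagnosis is to compare training and validation error from the same cross-validation run. The sketch below assumes synthetic data and an unregularised linear model; the 1.5 ratio used to flag a gap is an arbitrary illustrative threshold, not a standard.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=100, n_features=60, n_informative=10,
                       noise=30, random_state=3)

result = cross_validate(
    LinearRegression(), X, y, cv=5,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)
train_rmse = -result["train_score"].mean()
val_rmse = -result["test_score"].mean()
gap_ratio = val_rmse / train_rmse

print(f"train RMSE: {train_rmse:.1f}   validation RMSE: {val_rmse:.1f}")
if gap_ratio > 1.5:  # illustrative threshold (assumption)
    print("Validation error is far above training error: overfitting, regularisation may help")
else:
    print("Errors are close: if both are high, the model is underfitting instead")
```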
Why do we need to scale features before applying regularisation?
Regularisation penalises the magnitude of weights directly. If Feature A is measured in millions (e.g. salary) its learned weight will naturally be small, while Feature B in single digits (e.g. years of experience) will have a large weight. The penalty unfairly targets Feature B even if both are equally informative. Scaling to zero mean and unit variance puts all features on equal footing before the penalty is applied.
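The effect of units on the penalty can be checked with two features that matter equally but are recorded on different scales. This sketch assumes made-up data (one column in metres, one in millimetres) and a deliberately large alpha for the unscaled model so the distortion is visible; none of these values come from a real dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
length_m = rng.uniform(0, 100, n)   # recorded in metres
width_m = rng.uniform(0, 100, n)
width_mm = width_m * 1000.0         # same kind of quantity, recorded in millimetres
y = 2.0 * length_m + 2.0 * width_m + rng.normal(scale=5.0, size=n)

X = np.column_stack([length_m, width_mm])

raw = Ridge(alpha=1e5).fit(X, y)    # no scaling
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# Effect on y of a one-standard-deviation change in each feature
raw_effect = raw.coef_ * X.std(axis=0)
scaled_coef = scaled.named_steps["ridge"].coef_

print("unscaled Ridge, effect per std :", np.round(raw_effect, 1))
print("scaled Ridge, coefficient      :", np.round(scaled_coef, 1))
```

Without scaling, the metres column needs the larger coefficient and is shrunk far more than the millimetres column, even though both drive the target equally. After standardisation the two coefficients come out roughly equal.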
What is Elastic Net and when should I use it?
Elastic Net combines L1 and L2 penalties in a single loss function. The mix is controlled by the l1_ratio parameter (0 = pure Ridge, 1 = pure Lasso). Use Elastic Net when you have many features with unknown correlation structure — it handles correlated feature groups better than Lasso alone and provides sparsity unlike Ridge. It's a safe default when you're unsure which type to use.
Can regularisation be used with non-linear models like decision trees?
Yes, but the mechanism differs. XGBoost and LightGBM offer L1 and L2 regularisation on leaf weights (reg_alpha, reg_lambda). Random Forest doesn't have direct weight penalties but regularises via bagging and random feature selection — more trees reduce variance without explicit penalty. For deep learning, weight decay (L2), dropout, and early stopping are the standard regularisation techniques.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.