Regularisation penalises large weights or randomly drops neurons to prevent overfitting
L2 (weight decay) shrinks all weights; L1 zeros irrelevant weights
Inverted dropout scales surviving neurons during training so inference needs no changes
AdamW applies decoupled weight decay; plain Adam + weight_decay folds the L2 term into the adaptively scaled gradient, so the two are not equivalent
Dropout and BatchNorm conflict due to variance shift at inference time
Rule: always call model.eval() at inference — forgetting it is a silent production bug
Plain-English First
Imagine a school where students always work in the same fixed groups. They get so used to each other that if one student is absent, the whole group falls apart — they've stopped thinking independently. A good teacher mixes up the groups randomly every class, so each student learns to contribute on their own. Dropout does exactly this to a neural network: it randomly 'turns off' neurons during training so the network stops relying on any single neuron and learns more robust, general patterns instead.
Every neural network you train is secretly fighting two wars at once: the war against underfitting (not learning enough) and the war against overfitting (memorising the training data so well it fails on anything new). In production, overfitting is the silent killer — your model hits 98% accuracy on the training set and 67% on real-world data, and your team spends a week debugging what looks like a data pipeline bug before realising the model itself is the culprit. Regularisation is the entire family of techniques that keeps a model honest.
The core problem regularisation solves is that neural networks are universal function approximators — given enough parameters, they will happily memorise noise. A 10-million-parameter model trained on 5,000 examples doesn't generalise; it cheats. Regularisation introduces controlled friction into the learning process — either by constraining the weight magnitudes directly (L1/L2), by randomly disabling neurons during training (Dropout), or by corrupting the learning signal in structured ways (DropConnect, Batch Normalisation as an implicit regulariser). Each technique attacks the memorisation problem from a different angle.
By the end of this article you'll understand exactly why L2 regularisation shrinks weights but rarely zeros them while L1 creates sparsity, how inverted dropout works at the implementation level (and why naive dropout breaks inference), when Dropout actively hurts you (CNNs on small datasets, transformers), how to diagnose overfitting programmatically, and how to configure all of this correctly in PyTorch for a production training loop. You'll also have clear answers to the three interview questions that trip up even experienced ML engineers.
L1 and L2 Regularisation: Weight Penalties From First Principles
L1 and L2 regularisation both work by adding a penalty term to the loss function that punishes large weights. The difference in their math creates dramatically different behaviour in practice — and understanding why matters when you're choosing between them.
L2 regularisation adds λ * Σ(wᵢ²) to the loss. Because the penalty scales with the square of each weight, the gradient contribution from regularisation is 2λwᵢ, always proportional to the weight itself. This means large weights get pushed down hard, small weights get pushed down gently, and weights almost never reach exactly zero. You end up with many small, distributed weights. This is why L2 is also called weight decay: under plain SGD, the penalty's share of each update is equivalent to multiplying every weight by (1 - 2λ·lr). The equivalence does not hold for adaptive optimisers such as Adam.
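A quick numeric check of that shrink factor, as a minimal sketch in plain Python. It assumes plain SGD and a zero data gradient so that only the penalty acts; the function name sgd_step_with_l2 and the values of lam and lr are illustrative choices, not part of the original demo.

```python
def sgd_step_with_l2(w: float, data_grad: float, lam: float, lr: float) -> float:
    """One SGD step on loss = data_loss + lam * w**2 for a single weight."""
    penalty_grad = 2.0 * lam * w  # derivative of lam * w^2
    return w - lr * (data_grad + penalty_grad)


lam, lr = 0.01, 0.1
shrink = 1.0 - 2.0 * lam * lr  # the (1 - 2*lam*lr) factor from the text

for w in (10.0, 0.5, -3.0):
    stepped = sgd_step_with_l2(w, data_grad=0.0, lam=lam, lr=lr)
    # With no data gradient the step is a pure multiplicative decay
    assert abs(stepped - shrink * w) < 1e-12
    print(f"w = {w:+.3f} -> {stepped:+.5f}")
```

Every weight is scaled by the same factor, which is why large weights lose more in absolute terms and none of them lands exactly on zero.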
L1 regularisation adds λ Σ|wᵢ|. The subgradient is λ sign(wᵢ) — a constant nudge toward zero regardless of the weight's current magnitude. A weight of 0.0001 gets pushed just as hard as a weight of 10.0. This is exactly why L1 promotes sparsity: small weights that aren't contributing much get pushed all the way to zero, giving you a natural feature selection effect. Use L1 when you suspect only a subset of your input features are genuinely useful. Use L2 (almost always the default) when all features likely matter and you just want to prevent any single weight from dominating.
A common production mistake is using L1 on non-sparse data — you can lose useful signal by zeroing out relevant features that happen to have small weights. Cross-validation over λ is essential for both techniques.
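The contrast between the two penalties can be sketched by running penalty-only gradient steps on one small and one large weight. This is a plain-Python illustration under stated assumptions: there is no data gradient, and the L1 step is clipped at zero so it cannot overshoot (a proximal-style safeguard added for the sketch). The helper names are hypothetical.

```python
def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def l1_step(w: float, lam: float, lr: float) -> float:
    """Subgradient step on lam * |w|, clipped so it never crosses zero."""
    step = lr * lam * sign(w)
    return 0.0 if abs(w) <= abs(step) else w - step


def l2_step(w: float, lam: float, lr: float) -> float:
    """Gradient step on lam * w**2: a multiplicative shrink."""
    return w * (1.0 - 2.0 * lam * lr)


lam, lr, steps = 0.01, 0.1, 100
l1_weights = [0.05, 2.0]
l2_weights = [0.05, 2.0]
for _ in range(steps):
    l1_weights = [l1_step(w, lam, lr) for w in l1_weights]
    l2_weights = [l2_step(w, lam, lr) for w in l2_weights]

print("L1:", l1_weights)  # the small weight is driven to exactly zero
print("L2:", l2_weights)  # both weights shrink by the same ratio, neither is zero
```

L1 subtracts the same fixed amount from both weights each step, so the small one runs out first; L2 removes the same fraction from both.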
l1_l2_regularisation_demo.py (Python)
import torch
import torch.nn as nn
import torch.optim as optim

# ── Tiny dataset: noisy sine wave with 80 training points ──────────────────
torch.manual_seed(42)
noise_std = 0.3
num_train_samples = 80

# Input: values between 0 and 2π
train_inputs = torch.linspace(0, 2 * torch.pi, num_train_samples).unsqueeze(1)
train_targets = torch.sin(train_inputs) + torch.randn_like(train_inputs) * noise_std


# ── A deliberately over-parameterised model (easy to overfit) ──────────────
class OverparameterisedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, x):
        return self.layers(x)


def compute_l2_penalty(model: nn.Module, lambda_l2: float) -> torch.Tensor:
    """Manually compute L2 weight penalty (sum of squared weights × lambda).
    Note: weight_decay in PyTorch's Adam/SGD adds the equivalent gradient term
    (weight_decay * w, i.e. lambda = weight_decay / 2); this makes the
    mechanism explicit."""
    l2_penalty = torch.tensor(0.0)
    for param_name, param in model.named_parameters():
        if 'weight' in param_name:  # skip bias terms: regularising bias rarely helps
            l2_penalty = l2_penalty + torch.sum(param ** 2)
    return lambda_l2 * l2_penalty


def compute_l1_penalty(model: nn.Module, lambda_l1: float) -> torch.Tensor:
    """L1 penalty: sum of absolute weights × lambda. Promotes sparsity."""
    l1_penalty = torch.tensor(0.0)
    for param_name, param in model.named_parameters():
        if 'weight' in param_name:
            l1_penalty = l1_penalty + torch.sum(torch.abs(param))
    return lambda_l1 * l1_penalty


def train_with_regularisation(
    reg_type: str,
    lambda_strength: float,
    num_epochs: int = 500
) -> tuple[list[float], nn.Module]:
    """Train the overparameterised net with a chosen regularisation strategy.
    Returns training losses and the final trained model."""
    model = OverparameterisedNet()
    # NOTE: We set weight_decay=0 here intentionally: we're computing the
    # penalty manually so you can see exactly what's happening inside.
    optimiser = optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
    mse_loss_fn = nn.MSELoss()
    epoch_losses = []

    for epoch in range(num_epochs):
        model.train()
        optimiser.zero_grad()
        predictions = model(train_inputs)
        base_mse_loss = mse_loss_fn(predictions, train_targets)

        # Add regularisation penalty to the base loss
        if reg_type == 'l2':
            reg_penalty = compute_l2_penalty(model, lambda_strength)
        elif reg_type == 'l1':
            reg_penalty = compute_l1_penalty(model, lambda_strength)
        else:
            reg_penalty = torch.tensor(0.0)  # no regularisation baseline

        total_loss = base_mse_loss + reg_penalty
        total_loss.backward()  # gradients flow through both MSE and penalty
        optimiser.step()
        epoch_losses.append(base_mse_loss.item())  # track pure MSE, not penalised loss

    return epoch_losses, model


# ── Run all three variants and compare ────────────────────────────────────
no_reg_losses, no_reg_model = train_with_regularisation('none', 0.0)
l2_losses, l2_model = train_with_regularisation('l2', 1e-3)
l1_losses, l1_model = train_with_regularisation('l1', 1e-4)


# ── Check weight sparsity: how many weights are near zero? ─────────────────
def count_near_zero_weights(model: nn.Module, threshold: float = 1e-3) -> dict:
    total_weights, near_zero = 0, 0
    for param_name, param in model.named_parameters():
        if 'weight' in param_name:
            total_weights += param.numel()
            near_zero += (torch.abs(param) < threshold).sum().item()
    return {'total': total_weights, 'near_zero': near_zero,
            'sparsity_pct': round(100 * near_zero / total_weights, 1)}


print("=== Weight Sparsity Report ===")
print(f"No regularisation : {count_near_zero_weights(no_reg_model)}")
print(f"L2 regularisation : {count_near_zero_weights(l2_model)}")
print(f"L1 regularisation : {count_near_zero_weights(l1_model)}")
print()
print(f"Final MSE — No reg : {no_reg_losses[-1]:.4f}")
print(f"Final MSE — L2 : {l2_losses[-1]:.4f}")
print(f"Final MSE — L1 : {l1_losses[-1]:.4f}")
Output
=== Weight Sparsity Report ===
No regularisation : {'total': 16640, 'near_zero': 312, 'sparsity_pct': 1.9}
Watch Out: weight_decay in Adam ≠ Decoupled Weight Decay (AdamW)
In PyTorch, Adam's weight_decay adds weight_decay * w to the gradient before the adaptive rescaling. That is exactly an L2 penalty in the loss, and the manual penalty shown above behaves the same way under Adam. The catch is that Adam then divides this term by the running gradient magnitude, so parameters with large gradients are decayed less than parameters with small ones. AdamW (torch.optim.AdamW) fixes this by decoupling the decay: it shrinks each weight directly by lr * weight_decay * w, outside the adaptive update. With plain SGD the two formulations coincide (up to a rescaling of λ), because there is no adaptive scaling to distort the penalty.
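The difference between coupled and decoupled decay can be sketched with a hand-rolled, single-weight Adam in plain Python. This is an illustration under stated assumptions, not PyTorch's implementation: the data gradient simply alternates in sign so it averages to zero, and the function name final_weight and all hyperparameter values are hypothetical.

```python
import math


def final_weight(grad_scale: float, decoupled: bool, steps: int = 1000,
                 lr: float = 1e-3, wd: float = 0.1,
                 beta1: float = 0.9, beta2: float = 0.999,
                 eps: float = 1e-8) -> float:
    """Run Adam on one weight whose data gradient alternates +/- grad_scale."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = grad_scale if t % 2 else -grad_scale  # zero-mean data gradient
        if decoupled:
            w *= 1.0 - lr * wd   # AdamW-style: shrink the weight directly
        else:
            grad += wd * w       # Adam + weight_decay: L2 term joins the gradient
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad * grad
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


for scale in (0.01, 10.0):
    coupled = final_weight(scale, decoupled=False)
    decoupled = final_weight(scale, decoupled=True)
    print(f"grad scale {scale:>5}: coupled -> {coupled:.3f}, decoupled -> {decoupled:.3f}")
```

Under these assumptions the decoupled runs shrink both weights by about the same amount, while the coupled runs decay the small-gradient weight far more than the large-gradient one.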
Production Insight
L2 with Adam looks weaker than expected — high-gradient params get less regularisation.
Switch to AdamW for decoupled weight decay.
On large models like BERT, AdamW is the default for a reason.
Key Takeaway
L1 creates sparsity via constant-magnitude gradient push.
L2 shrinks all weights proportionally but rarely zeros them.
Choose L2 for dense signals, L1 for sparse feature selection.
Dropout: Internals, Inverted Scaling, and the Train/Eval Trap
Dropout's core idea sounds almost reckless: during each forward pass of training, randomly zero out each neuron's output with probability p. What this actually does is force the network to learn redundant representations — no single neuron can become a crutch because it might not be there on the next step. The ensemble interpretation is elegant: with n neurons each having dropout rate p, you're implicitly training 2ⁿ different sub-networks and averaging them at inference time.
Here's the subtle part that trips people up: inverted dropout. If you zero out 50% of neurons during training but use all neurons at inference time, the expected output magnitude doubles. Naive dropout would require you to multiply all weights by (1-p) at inference to compensate. Inverted dropout flips this — it scales up the surviving neurons by 1/(1-p) during training, so inference requires zero changes. Every modern framework (PyTorch, TensorFlow, JAX) uses inverted dropout. The implication: model.eval() is not optional — it disables this training-time scaling.
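The two formulations can be compared side by side in a few lines of plain Python. This is a minimal sketch, not framework code: the function names are hypothetical, and the naive variant scales activations by (1 - p) at inference, which has the same effect on the next linear layer as scaling its weights.

```python
import random


def naive_dropout(xs, p, training):
    """Original formulation: mask in training, scale by (1 - p) at inference."""
    if training:
        return [x if random.random() >= p else 0.0 for x in xs]
    return [x * (1.0 - p) for x in xs]


def inverted_dropout(xs, p, training):
    """Inverted formulation: mask and rescale in training, identity at inference."""
    if training:
        keep = 1.0 - p
        return [x / keep if random.random() >= p else 0.0 for x in xs]
    return list(xs)


def mean(xs):
    return sum(xs) / len(xs)


random.seed(0)
p = 0.5
activations = [1.0] * 100_000

print("naive    train mean:", round(mean(naive_dropout(activations, p, True)), 3))
print("naive    eval  mean:", mean(naive_dropout(activations, p, False)))
print("inverted train mean:", round(mean(inverted_dropout(activations, p, True)), 3))
print("inverted eval  mean:", mean(inverted_dropout(activations, p, False)))
```

In both variants the training and inference means agree; the inverted form gets there without touching the inference path.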
The critical question is where to place dropout layers. The usual convention is after the activation function. With ReLU the order makes no mathematical difference (ReLU of zero is zero, and positive scaling commutes with it), but with sigmoid-like activations a zeroed input does not give a zeroed output (sigmoid(0) = 0.5), so dropping after the activation is the predictable choice. In Transformer architectures, dropout is applied to attention weights and feed-forward sublayer outputs. In convolutional networks, spatial dropout (dropping entire feature maps, not individual pixels) works significantly better because adjacent pixels are highly correlated, so zeroing individual pixels removes very little information.
One underappreciated production detail concerns dropout randomness in distributed training. In ordinary data-parallel training each worker sees different samples, so each worker drawing its own dropout mask is expected and harmless. Masks must agree only where the same activations are computed more than once: on tensor- or model-parallel ranks that replicate an activation, or when a forward pass is recomputed during activation checkpointing. torch.utils.checkpoint restores the RNG state for the recomputation by default; if you write custom dropout logic for those settings, you have to manage the random state yourself.
dropout_internals_pytorch.py (Python)
import torch
import torch.nn as nn

torch.manual_seed(7)

# ── 1. Prove inverted dropout scaling manually ─────────────────────────────
print("=== Inverted Dropout Scaling Proof ===")
dropout_rate = 0.5
dropout_layer = nn.Dropout(p=dropout_rate)

# A tensor of all ones makes the mean trivial to reason about
test_activations = torch.ones(10_000)  # large sample for stable mean

dropout_layer.train()  # training mode: dropout is ACTIVE
train_output = dropout_layer(test_activations)
print(f"Training mode — mean (should be ~1.0): {train_output.mean().item():.4f}")
print(f"Training mode — non-zero fraction : {(train_output != 0).float().mean().item():.4f}")
# Even though 50% are zeroed, the survivors are scaled by 1/(1-0.5)=2.0
# so the mean stays at 1.0: inverted dropout in action

dropout_layer.eval()  # inference mode: NO dropout, NO scaling
eval_output = dropout_layer(test_activations)
print(f"Eval mode — mean (should be 1.0) : {eval_output.mean().item():.4f}")
print(f"Eval mode — non-zero fraction : {(eval_output != 0).float().mean().item():.4f}")
print()


# ── 2. A real training loop with correct dropout placement ─────────────────
class RegularisedClassifier(nn.Module):
    """
    A fully connected classifier with dropout placed AFTER activations.
    dropout_rate: probability of zeroing a neuron (0 = no dropout, 0.5 = common default)
    """
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int,
                 dropout_rate: float = 0.5):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),  # ← after activation, not before
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),  # ← dropout in every hidden block
            nn.Linear(hidden_dim // 2, num_classes)
            # NO dropout after the final output layer: you'd corrupt predictions
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)


# ── 3. Demonstrate the train/eval mode difference on the same input ─────────
classifier = RegularisedClassifier(
    input_dim=20, hidden_dim=64, num_classes=5, dropout_rate=0.5
)
sample_input = torch.randn(4, 20)  # batch of 4 samples, 20 features each

classifier.train()
train_logits_run1 = classifier(sample_input)
train_logits_run2 = classifier(sample_input)
print("=== Same input, two forward passes in TRAIN mode ===")
print("Run 1 logits:", train_logits_run1[0].detach().numpy().round(3))
print("Run 2 logits:", train_logits_run2[0].detach().numpy().round(3))
print("Are they equal?", torch.allclose(train_logits_run1, train_logits_run2))
print()  # They WILL differ: different random neurons dropped each time

classifier.eval()
with torch.no_grad():  # ALWAYS pair .eval() with no_grad() at inference
    eval_logits_run1 = classifier(sample_input)
    eval_logits_run2 = classifier(sample_input)
print("=== Same input, two forward passes in EVAL mode ===")
print("Run 1 logits:", eval_logits_run1[0].numpy().round(3))
print("Run 2 logits:", eval_logits_run2[0].numpy().round(3))
print("Are they equal?", torch.allclose(eval_logits_run1, eval_logits_run2))

# ── 4. Spatial Dropout for CNNs ────────────────────────────────────────────
print("\n=== Spatial Dropout (for CNNs) ===")
# nn.Dropout2d drops entire channels (feature maps), not individual pixels
spatial_dropout = nn.Dropout2d(p=0.3)
feature_map_batch = torch.ones(2, 8, 4, 4)  # (batch=2, channels=8, H=4, W=4)
spatial_dropout.train()
dropped_maps = spatial_dropout(feature_map_batch)
surviving_channels = (dropped_maps[0].sum(dim=(1, 2)) != 0).sum().item()
print(f"Channels surviving spatial dropout (of 8): {surviving_channels}")
print("(entire channels are zeroed, not individual pixels)")
Output
=== Inverted Dropout Scaling Proof ===
Training mode — mean (should be ~1.0): 1.0021
Training mode — non-zero fraction : 0.4998
Eval mode — mean (should be 1.0) : 1.0000
Eval mode — non-zero fraction : 1.0000
=== Same input, two forward passes in TRAIN mode ===
Run 1 logits: [ 0.183 -0.412 0.671 -0.089 0.224]
Run 2 logits: [-0.301 0.118 0.429 0.552 -0.177]
Are they equal? False
=== Same input, two forward passes in EVAL mode ===
Run 1 logits: [ 0.094 -0.152 0.318 0.201 -0.043]
Run 2 logits: [ 0.094 -0.152 0.318 0.201 -0.043]
Are they equal? True
=== Spatial Dropout (for CNNs) ===
Channels surviving spatial dropout (of 8): 6
(entire channels are zeroed, not individual pixels)
Pro Tip: Use MC Dropout for Free Uncertainty Estimates
Keep dropout active at inference time and run N forward passes on the same input. In PyTorch, switch only the dropout modules back to training mode rather than calling model.train() on the whole model, which would also put BatchNorm into training mode; in Keras, call the model with training=True. The spread across predictions is an approximate measure of the model's uncertainty; this is called Monte Carlo Dropout (Gal & Ghahramani, 2016). It needs no retraining, but inference cost grows linearly with N. Useful for medical imaging, autonomous driving, or any domain where 'I don't know' is a valid and important answer.
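A minimal sketch of MC Dropout in PyTorch that re-activates only the dropout layers, so any BatchNorm layers keep using their running statistics. The helper names enable_dropout_only and mc_dropout_predict, the toy model, and num_passes=50 are illustrative assumptions.

```python
import torch
import torch.nn as nn

DROPOUT_TYPES = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)


def enable_dropout_only(model: nn.Module) -> None:
    """Put the model in eval mode, then switch just the dropout layers to train."""
    model.eval()
    for module in model.modules():
        if isinstance(module, DROPOUT_TYPES):
            module.train()


@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, num_passes: int = 50):
    """Return the mean and standard deviation of class probabilities over passes."""
    enable_dropout_only(model)
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(num_passes)]
    )
    model.eval()  # restore deterministic inference mode
    return probs.mean(dim=0), probs.std(dim=0)


torch.manual_seed(0)
toy_model = nn.Sequential(
    nn.Linear(8, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Dropout(p=0.5), nn.Linear(32, 3),
)
mean_probs, std_probs = mc_dropout_predict(toy_model, torch.randn(4, 8))
print(mean_probs)
print(std_probs)  # larger values indicate less stable predictions
```

The standard deviation here is a relative signal; whether it is well calibrated has to be checked against held-out data.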
Production Insight
Forget model.eval() and your inference becomes non-deterministic.
In model-parallel or checkpointed training, custom dropout must reproduce the same mask wherever an activation is recomputed.
Use MC Dropout for uncertainty — but N needs to be at least 50 for stable variance.
Key Takeaway
Inverted dropout scales survivors during training so inference is clean.
model.eval() is mandatory at inference — not optional.
Place dropout after activation, never before.
When Dropout Hurts, and What to Use Instead
Dropout is not a universal fix. Knowing when to skip it is just as important as knowing how to apply it.
Small datasets + CNNs: On tiny datasets (fewer than ~10k images), dropout in convolutional layers can destabilise training. CNNs already have strong inductive biases and weight sharing as implicit regularisers. Adding high dropout often just slows convergence without improving generalisation. Use data augmentation and L2 weight decay instead. Spatial dropout (nn.Dropout2d in PyTorch) with low rates (0.1–0.2) is safer than standard Dropout.
Transformers and attention mechanisms: BERT, GPT, and ViT all use dropout, but the rates are much lower (0.1 typically) and the placement is surgical. Because transformers use residual connections and LayerNorm extensively, they have their own built-in stabilisation. Heavy dropout fights against these mechanisms. The dominant regulariser in modern transformers is a combination of weight decay, data augmentation, and stochastic depth (randomly dropping entire residual blocks).
Batch Normalisation as an implicit regulariser: When you're using BatchNorm, it introduces noise during training (because batch statistics are noisy approximations of the true distribution statistics), which acts like a weak regulariser. Combining heavy dropout with BatchNorm is problematic: Li et al. (2018) showed that dropout changes the variance of the activations that BatchNorm then normalises, so the running statistics collected in training no longer match what the layer sees at inference. The common production rule: if you're using BatchNorm in a block, use little to no dropout in that same block.
Recurrent networks (LSTMs, GRUs): Standard dropout applied to recurrent connections with a fresh mask at every time step disrupts the signal carried through time. Variational dropout (Gal & Ghahramani, 2016) samples one mask per sequence and reuses it at every time step. Note that nn.LSTM(dropout=rate) in PyTorch is not variational dropout: it applies ordinary dropout to the outputs of each LSTM layer except the last, so it only takes effect with num_layers > 1 and never touches the recurrent connections.
regularisation_strategy_comparison.py (Python)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# ── Synthetic classification dataset (mimics a small real-world dataset) ────
num_samples, input_features, num_classes = 2000, 50, 4
all_inputs = torch.randn(num_samples, input_features)
# Ground truth: only first 10 features actually matter
true_weights = torch.zeros(input_features, num_classes)
true_weights[:10] = torch.randn(10, num_classes)  # sparse ground truth
all_labels = (all_inputs @ true_weights).argmax(dim=1)

# 70/30 train/val split
split_idx = int(0.7 * num_samples)
train_dataset = TensorDataset(all_inputs[:split_idx], all_labels[:split_idx])
val_dataset = TensorDataset(all_inputs[split_idx:], all_labels[split_idx:])
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128)


def build_model(use_dropout: bool, use_batchnorm: bool,
                dropout_rate: float = 0.4) -> nn.Module:
    """Build a configurable MLP to test different regularisation combos."""
    layers = [nn.Linear(input_features, 128)]
    if use_batchnorm:
        layers.append(nn.BatchNorm1d(128))
    layers.append(nn.ReLU())
    if use_dropout and not use_batchnorm:  # avoid dropout+BN conflict
        layers.append(nn.Dropout(p=dropout_rate))
    layers.append(nn.Linear(128, 64))
    if use_batchnorm:
        layers.append(nn.BatchNorm1d(64))
    layers.append(nn.ReLU())
    if use_dropout and not use_batchnorm:
        layers.append(nn.Dropout(p=dropout_rate))
    layers.append(nn.Linear(64, num_classes))
    return nn.Sequential(*layers)


def evaluate_accuracy(model: nn.Module, loader: DataLoader) -> float:
    """Evaluate accuracy. MUST call model.eval(): dropout changes outputs."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_inputs, batch_labels in loader:
            predictions = model(batch_inputs).argmax(dim=1)
            correct += (predictions == batch_labels).sum().item()
            total += len(batch_labels)
    return correct / total


def run_experiment(experiment_name: str, model: nn.Module,
                   weight_decay: float = 0.0, num_epochs: int = 60) -> dict:
    """Train a model configuration and return final train/val accuracy."""
    # AdamW applies weight_decay as decoupled weight decay
    optimiser = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        model.train()  # activates dropout and batchnorm training behaviour
        for batch_inputs, batch_labels in train_loader:
            optimiser.zero_grad()
            logits = model(batch_inputs)
            loss = loss_fn(logits, batch_labels)
            loss.backward()
            optimiser.step()
    train_acc = evaluate_accuracy(model, train_loader)
    val_acc = evaluate_accuracy(model, val_loader)
    gap = train_acc - val_acc  # high gap = overfitting
    print(f"{experiment_name:35s} | Train: {train_acc:.3f} | Val: {val_acc:.3f} | Gap: {gap:.3f}")
    return {'train_acc': train_acc, 'val_acc': val_acc, 'gap': gap}


# ── Run all experiments ────────────────────────────────────────────────────
print(f"{'Experiment':35s} | {'Train':7} | {'Val':5} | Gap")
print("-" * 65)
run_experiment("Baseline (no regularisation)",
               build_model(use_dropout=False, use_batchnorm=False))
run_experiment("L2 only (weight_decay=1e-3)",
               build_model(use_dropout=False, use_batchnorm=False),
               weight_decay=1e-3)
run_experiment("Dropout only (rate=0.4)",
               build_model(use_dropout=True, use_batchnorm=False))
run_experiment("Dropout + L2 combined",
               build_model(use_dropout=True, use_batchnorm=False),
               weight_decay=1e-3)
run_experiment("BatchNorm only",
               build_model(use_dropout=False, use_batchnorm=True))
run_experiment("BatchNorm + light L2 (no dropout)",
               build_model(use_dropout=False, use_batchnorm=True),
               weight_decay=5e-4)
Senior ML interview question: 'Why is mixing Dropout and BatchNorm problematic?' The answer is variance shift. During training, Dropout randomly zeros neurons, changing the variance of the distribution that BatchNorm then tries to normalise. At inference, no neurons are dropped but BatchNorm uses running statistics computed under the dropout-corrupted variance — these don't match, causing a systematic shift in BatchNorm's output. Fix: put Dropout after BatchNorm + activation (not before), keep dropout rate low (≤0.1), or replace Dropout with weight decay when using BatchNorm-heavy architectures.
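The variance shift is easy to reproduce numerically. The sketch below uses plain Python and assumes zero-mean, unit-variance activations feeding an inverted-dropout layer with p = 0.5; the variable names are illustrative. For such inputs the training-time variance is 1 / (1 - p) times the inference-time variance.

```python
import random


def variance(xs):
    """Population variance of a list of floats."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


random.seed(0)
p = 0.5
pre_dropout = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Training: inverted dropout masks and rescales by 1 / (1 - p)
train_acts = [x / (1.0 - p) if random.random() >= p else 0.0 for x in pre_dropout]
# Inference: dropout is the identity
eval_acts = pre_dropout

train_var = variance(train_acts)
eval_var = variance(eval_acts)
print(f"variance seen by BatchNorm in training : {train_var:.3f}")
print(f"variance seen by BatchNorm at inference: {eval_var:.3f}")
```

A BatchNorm layer placed after this dropout would store the larger training variance in its running statistics and then divide by it at inference, when the real variance is smaller.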
Production Insight
Heavy dropout in CNNs slows convergence without improving generalisation.
Transformers keep neuron dropout low (around 0.1); deeper variants add stochastic depth, which drops entire residual branches.
Combine L2 with data augmentation; use dropout sparingly when BatchNorm is present.
Key Takeaway
Dropout hurts CNNs on small data; use augmentation + L2 instead.
In transformers, use low dropout (0.1) and stochastic depth.
Dropout + BatchNorm conflict: avoid mixing in the same block.
Tuning Regularisation in Practice: λ, Dropout Rate & Early Stopping
Choosing the right regularisation strength is more art than science, but there are systematic heuristics that save you from random search.
L2 weight decay (λ): Start with 1e-4 for small models (under 1M params) and for CNNs trained with SGD (the conventional ResNet-50 setting), and around 1e-2 for large transformers such as BERT trained with AdamW. Monitor the gap between train and validation loss. If the gap exceeds 10% after 20 epochs, double λ. If both losses are high, halve λ. AdamW is your friend: it decouples the decay from the adaptive gradient scaling.
Dropout rate: For large fully-connected layers (width > 512), p=0.5 works. For medium layers (128–512), p=0.3. For small layers (<128), skip dropout — the layer lacks capacity to waste. In CNNs, spatial dropout at p=0.1–0.2 is the ceiling. In transformers, p=0.1 is standard; going above 0.2 hurts attention fidelity.
Early stopping: Always use it. Patience is the number of consecutive epochs without validation-loss improvement that you tolerate before stopping. For small datasets, patience=5; for large datasets, patience=10–20. Combine with a learning rate scheduler that reduces lr on plateau. Early stopping is your last line of defence against overfitting.
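The patience rule can be captured in a small framework-free helper. This is a sketch: the class name EarlyStopping, the min_delta tolerance, and the validation losses are made up for illustration.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical tolerance for "improvement"
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation losses: the best value (0.60) is reached at epoch 2
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64, 0.65]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best val loss {stopper.best_loss}")
```

In a real loop you would also save a checkpoint whenever best_loss improves, so the stopped run can be rolled back to its best epoch.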
Systematic tuning workflow:
1. Train a baseline without regularisation. Note the train-val gap.
2. Add L2 (weight_decay=1e-3). If the gap shrinks, keep it.
3. Add dropout (p=0.5) to the largest FC layers. If the gap shrinks further, keep that too.
4. If training loss rises noticeably (the model starts to underfit), reduce regularisation.
5. Use early stopping to prevent wasted epochs.
regularisation_tuning_workflow.py (Python)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Assume train_loader, val_loader, model defined


def systematic_tuning(model_class, train_loader, val_loader, configs):
    """Run multiple regularisation configs and pick best val loss."""
    results = []
    for cfg in configs:
        model = model_class(**cfg['model_args'])
        optimiser = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=cfg['wd'])
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimiser, patience=3, factor=0.5)
        best_val_loss = float('inf')
        patience_counter = 0
        for epoch in range(100):
            model.train()
            for x, y in train_loader:
                optimiser.zero_grad()
                loss = nn.CrossEntropyLoss()(model(x), y)
                loss.backward()
                optimiser.step()
            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for x, y in val_loader:
                    val_loss += nn.CrossEntropyLoss(reduction='sum')(model(x), y).item()
            val_loss /= len(val_loader.dataset)
            scheduler.step(val_loss)
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= 5:
                    break
        results.append({'config': cfg, 'best_val_loss': best_val_loss})
    return min(results, key=lambda r: r['best_val_loss'])


# Example configs:
configs = [
    {'model_args': {'dropout_rate': 0.0}, 'wd': 0.0},
    {'model_args': {'dropout_rate': 0.0}, 'wd': 1e-3},
    {'model_args': {'dropout_rate': 0.3}, 'wd': 1e-3},
    {'model_args': {'dropout_rate': 0.5}, 'wd': 1e-2},
]
# best_config = systematic_tuning(...)
Mental Model: Regularisation as a Thermostat
Too little regularisation: model memorises noise — overfits (low train loss, high val loss).
Too much regularisation: model can't learn patterns — underfits (both losses high).
Just right: model learns general patterns — both losses low and close.
Adjust λ or p in small multiplicative steps (2x or 0.5x) — don't jump orders of magnitude.
Early stopping is the safety valve: turn up reg strength and let early stopping compensate.
Production Insight
Prefer AdamW over Adam when you want weight decay that is independent of the adaptive gradient scaling.
Set patience for early stopping based on dataset size — small datasets need shorter patience.
Reducing the learning rate on plateau helps the model settle instead of oscillating around a minimum.
Key Takeaway
Start L2 at 1e-4 for small models, 1e-2 for large models.
Dropout rate scales with layer width: wide → 0.5, medium → 0.3, narrow → 0.0.
Early stopping + LR scheduler = practical overfitting defence.
Beyond Dropout and L2: Advanced Regularisation Techniques
While dropout and L2 are the workhorses, production systems often layer on additional regularisation techniques that are less known but highly effective.
DropConnect: Instead of zeroing neuron outputs, zero the weights themselves with probability p for each forward pass. This is a stronger form of regularisation because it prevents co-adaptation at the connection level, not just the neuron level. Rarely used in practice because it's expensive (masking all weights), but it can be effective for very wide layers.
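A minimal plain-Python sketch of a DropConnect-style linear layer. It assumes inverted scaling of the surviving weights during training so that inference can use the unmasked weights; that is a simplification chosen for this sketch, not the inference procedure of the original DropConnect paper. The function name and the toy weights are hypothetical.

```python
import random


def dropconnect_linear(x, weights, bias, p, training):
    """Compute y = (W * mask) x + b, dropping each weight independently with prob p."""
    keep = 1.0 - p
    outputs = []
    for row, b in zip(weights, bias):
        total = b
        for w, xi in zip(row, x):
            if training:
                # Inverted scaling keeps the expected pre-activation unchanged
                w = w / keep if random.random() >= p else 0.0
            total += w * xi
        outputs.append(total)
    return outputs


random.seed(0)
x = [1.0, 2.0]
weights = [[0.5, -1.0], [2.0, 0.0]]
bias = [0.1, 0.0]

print("train pass 1:", dropconnect_linear(x, weights, bias, p=0.5, training=True))
print("train pass 2:", dropconnect_linear(x, weights, bias, p=0.5, training=True))
print("inference   :", dropconnect_linear(x, weights, bias, p=0.5, training=False))
```

Note that the mask has one entry per weight rather than one per neuron, which is where the extra cost mentioned above comes from.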
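Stochastic Depth: Used primarily in ResNets and Vision Transformers. During training, randomly drop the residual branch of entire blocks (set its output to zero), so the input passes through the skip connection unchanged. This shortens the effective depth during training and discourages blocks from depending on each other. In the original formulation all blocks are active at inference and each branch is scaled by its survival probability; inverted implementations (often called drop path) instead rescale the surviving branches during training so inference needs no change. The reported benefits are faster training and better generalisation. A survival probability around 0.9 is a typical setting.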
Label Smoothing: Replace hard labels (1 for correct class, 0 for others) with soft targets: e.g., correct class = 0.9, others = 0.1/(num_classes-1). This penalises overconfidence and prevents the model from chasing infinitely high logits. It is common in modern image classification recipes (EfficientNet, Vision Transformers). Cross-entropy loss with label smoothing is a few lines of PyTorch, and nn.CrossEntropyLoss also accepts a label_smoothing argument (which spreads the smoothing mass over all classes, including the correct one).
Cutout / Mixup / Augmentation: Data augmentation is a form of regularisation. Cutout randomly masks square regions of input images. Mixup creates linear combinations of two input images and their labels. These techniques force the model to rely on the full input, not just a few discriminative patches. They're especially effective for CNNs.
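Mixup in particular fits in a few lines. The sketch below is plain Python on list-valued inputs and one-hot labels; the mixing coefficient is drawn from a Beta(alpha, alpha) distribution as in the Mixup paper, while the function name, the alpha value of 0.2, and the toy data are assumptions for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with the same coefficient."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed_x, mixed_y, lam


rng = random.Random(0)
image_a, label_a = [1.0, 1.0, 0.0, 0.0], [1.0, 0.0]  # toy "image" of class 0
image_b, label_b = [0.0, 0.0, 1.0, 1.0], [0.0, 1.0]  # toy "image" of class 1

mixed_image, mixed_label, lam = mixup(image_a, label_a, image_b, label_b, rng=rng)
print("lambda     :", lam)
print("mixed image:", mixed_image)
print("mixed label:", mixed_label)  # still sums to 1, so it works as a soft target
```

Small alpha values concentrate lambda near 0 or 1, so most mixed samples stay close to one of the two originals.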
When to use each
L2: always, as baseline.
Dropout: large FC layers, low-rate in attention.
Stochastic depth: deep networks (ResNet-50+, Transformers).
Label smoothing: classification tasks, especially when dataset is noisy.
Augmentation: computer vision and speech.
advanced_regularisation_snippets.py (Python)
import torch
import torch.nn as nn
import torch.nn.functional as F


# Label Smoothing Loss
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, smoothing: float = 0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        n_classes = logits.size(-1)
        # Create smoothed targets
        with torch.no_grad():
            smooth_targets = torch.full_like(logits, self.smoothing / (n_classes - 1))
            smooth_targets.scatter_(1, targets.unsqueeze(1), 1.0 - self.smoothing)
        log_probs = F.log_softmax(logits, dim=-1)
        loss = -(smooth_targets * log_probs).sum(dim=-1).mean()
        return loss


# Stochastic Depth (Drop Path)
class DropPath(nn.Module):
    """Drops the residual branch per sample during training. survival_prob ~0.9."""

    def __init__(self, survival_prob: float = 0.9):
        super().__init__()
        self.survival_prob = survival_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # One Bernoulli draw per sample, broadcast over the remaining dims
            mask_shape = (x.size(0),) + (1,) * (x.dim() - 1)
            mask = x.new_empty(mask_shape).bernoulli_(self.survival_prob)
            x = x / self.survival_prob * mask
        return x


# Usage in a ResNet block:
class ResNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.drop_path = DropPath(survival_prob=0.9)
        # 1x1 projection so the skip connection matches when channel counts differ
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1))

    def forward(self, x):
        identity = self.shortcut(x)
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.drop_path(out)  # residual branch dropped with prob 0.1 during training
        out = out + identity
        return F.relu(out)
Production Tip: Label Smoothing for Noisy Labels
If your training data has label noise (common in web-scraped datasets), label smoothing prevents the model from memorising incorrect labels. Smoothing=0.1 is a robust default. Too much smoothing (0.3+) slows convergence. Always evaluate on a clean validation set to detect smoothing-induced underfitting.
Production Insight
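Stochastic depth can speed up training when a dropped block is skipped outright, which saves its forward and backward computation. A mask-based DropPath like the one above still computes every block, so it regularises without saving compute.
Label smoothing prevents overconfidence but can hide calibration issues; check expected calibration error (ECE) and a reliability diagram.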
Data augmentation (cutout, mixup) is often more effective than dropout for vision models.
Key Takeaway
L2 + dropout covers most cases; add stochastic depth for very deep networks.
Label smoothing is a cheap regulariser for classification with noisy labels.
Always pair advanced regularisation with early stopping — it's your safety net.
Production incident · Post-mortem · Severity: high
Forgetting model.eval() Costs a Production Classification API
Symptom
Same input text produces different predictions across successive API calls. Validation accuracy fluctuates wildly (between 76% and 88%) even when using torch.no_grad(). The model was trained with dropout and deployed without anyone toggling eval() mode.
Assumption
The team assumed that wrapping inference in torch.no_grad() would disable dropout. They thought dropout was just a training-time behaviour that automatically switched off at inference.
Root cause
Dropout layers in PyTorch deactivate only when the model is put in eval mode with model.eval(). (Keras has the same split, controlled by the training=False call argument.) torch.no_grad() disables gradient computation but does NOT affect dropout behaviour. Without eval(), dropout remains active, randomly zeroing neurons on every forward pass and producing stochastic outputs.
Fix
Two changes: (1) Call model.eval() before the inference loop, and (2) wrap inference in torch.no_grad() for memory efficiency. This combination ensures deterministic outputs and no gradient storage.
Key lesson
model.eval() and torch.no_grad() are separate concerns — eval() controls dropout/batchnorm behaviour, no_grad() disables gradient computation. You need both at inference.
Always include a unit test that runs two forward passes on the same input and asserts they produce identical outputs (within floating-point tolerance) when the model is in eval mode.
If your API logs show non-deterministic predictions, check the deployment code first: it's almost always a missing eval() call, not a data race.
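The unit test recommended above could look roughly like this. It is a sketch under assumptions: the helper name `assert_deterministic_in_eval`, the toy model and the tolerance are illustrative, and a real test would load your deployed model instead.

```python
import torch
import torch.nn as nn


def assert_deterministic_in_eval(model, example_input, atol=1e-6):
    """Run two forward passes in eval mode and fail if they differ."""
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            first = model(example_input)
            second = model(example_input)
    finally:
        model.train(was_training)  # restore whatever mode the caller had
    assert torch.allclose(first, second, atol=atol), (
        "Outputs differ in eval mode: a stochastic layer is still active"
    )


# Hypothetical stand-in for the production classifier
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 4))
batch = torch.randn(8, 16)

# The incident in miniature: no_grad() alone leaves dropout switched on
toy_model.train()
with torch.no_grad():
    train_a = toy_model(batch)
    train_b = toy_model(batch)

assert_deterministic_in_eval(toy_model, batch)
```

Running the helper in CI against the exact model object your serving code builds catches a missing eval() call before it reaches production.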
Production debug guide
How to isolate whether your model is memorising noise, and which regularisation toggle to flip first.
Symptom · 01
Training accuracy is high (>95%) but validation accuracy is much lower (>15% gap).
→
Fix
First, increase L2 weight decay (try 1e-3 → 5e-3). If that doesn't close the gap, add dropout (0.5 for FC layers) or increase existing dropout rate. Monitor the gap — it should shrink.
Symptom · 02
Training accuracy is also low (under 80%) — underfitting, not overfitting.
→
Fix
Reduce regularisation strength: lower weight_decay (try 1e-4), decrease dropout rate (0.2), or remove dropout entirely from early layers. The model needs capacity to learn.
Symptom · 03
Validation loss plateaus or climbs after a certain epoch, even though training loss continues to drop.
→
Fix
Early stopping is the first fix — stop training at the epoch with lowest validation loss. Also consider reducing the learning rate (use a scheduler) and increasing regularisation strength modestly.
Symptom · 04
Model predictions are non-deterministic at inference time (same input → different output).
→
Fix
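Check for a missing model.eval() call. Also look for functional dropout calls such as F.dropout(x, p) that omit training=self.training: F.dropout defaults to training=True, so it keeps dropping activations even after model.eval(). Once both are fixed, repeat inference; outputs must be identical.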
Symptom · 05
After adding BatchNorm, training becomes unstable or validation loss spikes.
→
Fix
Remove any dropout inside BatchNorm blocks. If you must keep dropout, place it after BatchNorm + activation, not before, and keep rate ≤0.1. Consider using weight decay instead of dropout when BatchNorm is present.
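One way Symptom 04 survives a correct model.eval() call is functional dropout. The sketch below contrasts two hypothetical modules, `LeakyDropoutBlock` and `FixedDropoutBlock` (both names invented for illustration). `torch.nn.functional.dropout` takes a `training` argument that defaults to `True`, so it ignores the module's mode unless you pass it explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LeakyDropoutBlock(nn.Module):
    """Bug: F.dropout defaults to training=True, so eval() has no effect."""

    def forward(self, x):
        return F.dropout(x, p=0.5)


class FixedDropoutBlock(nn.Module):
    """Fix: tie the functional call to the module's own training flag."""

    def forward(self, x):
        return F.dropout(x, p=0.5, training=self.training)


x = torch.ones(1000)

leaky = LeakyDropoutBlock().eval()
fixed = FixedDropoutBlock().eval()

leaky_out = leaky(x)  # still randomly zeroed despite eval()
fixed_out = fixed(x)  # passes through unchanged in eval mode
```

Using `nn.Dropout` as a submodule avoids the problem entirely, since the module tracks the training flag for you.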
Quick Debug: Overfitting & Regularisation
Run these commands to diagnose if your model is overfitting and to verify your regularisation setup.
Suspected overfitting (large train-val gap)
Immediate action
Check train vs validation loss curves. If train loss continues to drop while val loss rises, you're overfitting.
Commands
python -c "import pandas as pd; d=pd.read_csv('logs.csv'); print(d[d['val_loss'].diff()>0].head())"
In a Python session where model and val_set are already loaded:
import torch
from torch.utils.data import DataLoader
loader = DataLoader(val_set, batch_size=64)
correct = total = 0
model.eval()
with torch.no_grad():
    for x, y in loader:
        correct += (model(x).argmax(1) == y).sum().item()
        total += len(y)
print(f'Val acc: {100 * correct / total:.1f}%')
Fix now
Increase weight_decay to 5e-3 or add Dropout(0.5) for FC layers. Re-train with early stopping.
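MC Dropout uncertainty not working
Immediate action
Verify the model has dropout layers and that they are active. If the model is in eval() mode at inference, dropout is disabled.
Commands
python -c "import torch; model = torch.load('model.pth'); print([(n, m.p) for n, m in model.named_modules() if isinstance(m, torch.nn.Dropout)])"
(On recent PyTorch releases, loading a fully pickled model may require weights_only=False; use it only for files you trust.)
In a Python session where model and an input batch x are already loaded:
model.train()
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(10)])
print(preds.var(0).mean())  # 0 means dropout is not active
Fix now
Keep the dropout layers in train mode during MC Dropout; if the model also contains BatchNorm, leave those layers in eval mode so their running statistics are not updated. Run N=50 forward passes and compute the variance per output class.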
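A minimal sketch of MC Dropout that follows the advice above. The helper names `enable_dropout_only` and `mc_dropout_predict` are assumptions for this example, and the softmax assumes a classifier that returns logits. Only the dropout modules are switched to train mode, so any BatchNorm layers keep using their running statistics.

```python
import torch
import torch.nn as nn

# Dropout variants that this sketch re-enables; extend as needed
DROPOUT_TYPES = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)


def enable_dropout_only(model):
    """Put the whole model in eval mode, then re-activate dropout layers."""
    model.eval()
    for module in model.modules():
        if isinstance(module, DROPOUT_TYPES):
            module.train()


@torch.no_grad()
def mc_dropout_predict(model, x, n_passes=50):
    """Return the mean and variance of class probabilities over n_passes."""
    enable_dropout_only(model)
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_passes)])
    model.eval()  # leave the model in a safe state for normal inference
    return probs.mean(dim=0), probs.var(dim=0)


# Hypothetical classifier used only to exercise the helpers
torch.manual_seed(0)
classifier = nn.Sequential(
    nn.Linear(16, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 3)
)
inputs = torch.randn(8, 16)
mean_probs, var_probs = mc_dropout_predict(classifier, inputs)
```

High variance on an input suggests the sub-networks disagree, which is the uncertainty signal MC Dropout is meant to expose.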
AdamW weight_decay not working as expected
Immediate action
Check whether you're using Adam rather than AdamW. Adam folds weight_decay into the gradient as an L2 penalty, so the decay is rescaled by the adaptive step size instead of shrinking each weight in proportion to its value.
Commands
In a Python session where model is already loaded:
print([p.norm().item() for p in model.parameters() if p.requires_grad])
Fix now
Replace torch.optim.Adam with torch.optim.AdamW. Set weight_decay to 1e-2 for large models. Monitor weight norms across epochs; they should decrease noticeably.
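The difference between the two optimisers is easiest to see with the data gradient set to zero, so that only the decay term acts. The sketch below takes a single step on two weights of very different size; the helper `one_step` and the values `lr=0.1`, `weight_decay=0.1` are assumptions chosen to make the arithmetic readable.

```python
import torch


def one_step(optimizer_cls, start, lr=0.1, weight_decay=0.1):
    """Take one optimiser step with a zero data gradient."""
    w = torch.nn.Parameter(torch.tensor(start))
    opt = optimizer_cls([w], lr=lr, weight_decay=weight_decay)
    loss = (w * 0.0).sum()  # gradient w.r.t. w is exactly zero
    loss.backward()
    opt.step()
    return w.detach()


start = [1.0, 100.0]

# Adam: the decay term becomes the gradient and is normalised away,
# so both weights move by roughly lr regardless of their size.
adam_weights = one_step(torch.optim.Adam, start)    # approx [0.9, 99.9]

# AdamW: each weight is multiplied by (1 - lr * weight_decay),
# so the shrinkage is proportional to the weight.
adamw_weights = one_step(torch.optim.AdamW, start)  # approx [0.99, 99.0]
```

The large weight loses 1.0 under AdamW but only about 0.1 under Adam, which is why coupled decay can feel weak on exactly the weights you most want to shrink.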
Regularisation Techniques Compared

| Aspect | L1 Regularisation | L2 Regularisation | Dropout | DropConnect | Stochastic Depth | Label Smoothing |
|---|---|---|---|---|---|---|
| Loss penalty term | λ · Σ\|wᵢ\| | λ · Σwᵢ² | None (structural noise) | None (weight masking) | None (block dropping) | Soft target cross-entropy |
| Effect on weights | Drives many weights to exactly 0 | Shrinks all weights in proportion to their size | Forces redundant representations | Randomly masks individual connections | N/A (blocks dropped) | Limits logit magnitudes |
| Resulting model | Sparse: natural feature selector | Dense with small weights | Ensemble of sub-networks | Ensemble of sparse sub-networks | Ensemble of depth sub-networks | Calibrated, less overconfident |
| Best use case | High-dim data, sparse true signal | Most default scenarios | Large FC layers, NLP | Very wide layers (>4096) | ResNet, Transformers | Classification with noisy labels |
| Works with BatchNorm? | Yes, no conflict | Yes, preferred (AdamW) | Problematic: variance shift | Similar to dropout conflict | Yes (applied to block output) | Yes |
| Inference cost | Zero extra cost | Zero extra cost | Zero extra cost (eval mode) | Zero extra cost (all weights used) | Zero extra cost (all blocks active) | Zero extra cost |
| Hyperparameter sensitivity | High: λ must be tuned carefully | Medium: robust over wide range | Medium: p=0.5 FC, p=0.1 attention | Medium: p=0.5 typical | Low: survival_prob 0.8–0.9 | Low: smoothing 0.1–0.2 |
| Gradient behaviour | Constant-magnitude subgradient | Proportional to weight value | Stochastic zeroing | Stochastic weight zeroing | Block gradient gating | Gradient from soft targets |
| CNNs | Rarely used | Standard via weight_decay | Use Dropout2d (spatial) only | Not common | Standard in ResNets | Standard in modern CNNs |
| Transformers | Not commonly used | Standard via AdamW | p=0.1, applied surgically | Not common | Common in ViT-style models (e.g. DeiT, Swin) | Standard |
Key takeaways
1. Inverted dropout scales surviving neurons by 1/(1-p) during training so inference requires zero modification, but only if you call model.eval(). Forgetting this is one of the most common silent bugs in production ML.
2. L1 creates sparsity because its gradient is a constant-magnitude push toward zero (independent of weight size). L2 shrinks weights proportionally but almost never zeros them; choose based on whether you believe your true signal is sparse.
3. Dropout and BatchNorm conflict because Dropout alters activation variance during training, but BatchNorm's running statistics (used at inference) were computed under that corrupted variance, causing a distribution shift the moment you hit eval mode.
4. AdamW (decoupled weight decay) is almost always what you want with Adam, not Adam + weight_decay. The distinction matters most in large models: with Adam, weight_decay effectively does less for high-gradient parameters, meaning your over-parameterised layers get under-regularised exactly where you need it most.
5. Advanced regularisation (stochastic depth, label smoothing, augmentation) often provides more bang-for-buck than increasing dropout or L2 beyond moderate levels.
6. Always pair regularisation with early stopping and a learning rate scheduler; they form the complete production safety net.
Common mistakes to avoid
5 patterns
×
Forgetting model.eval() at inference
Symptom
Non-deterministic predictions on the same input; validation accuracy varies run-to-run even with torch.no_grad().
Fix
Always call model.eval() before any evaluation loop or inference call. Pair it with 'with torch.no_grad():' to also disable gradient computation. These are separate concerns: eval() controls dropout/batchnorm behaviour, while no_grad() turns off gradient tracking and so saves memory. You need both.
×
Applying the same dropout rate everywhere
Symptom
The model either underfits badly (too much dropout) or the regularisation has no effect (too little, everywhere).
Fix
Use progressive dropout — higher rates in earlier, wider layers (where memorisation is cheapest) and lower or no dropout near the output layer. A common pattern: 0.5 for large hidden layers, 0.3 for smaller layers, 0.0 for the final classification head. Dropout before a softmax output directly corrupts the class probability distribution.
×
Using Adam with weight_decay and expecting SGD-style weight decay
Symptom
Regularisation seems weaker than expected; the model still overfits even with high weight_decay values.
Fix
Use torch.optim.AdamW instead of Adam. Adam adds the decay term to the gradient, where the adaptive scaling distorts it. AdamW applies the decay directly to the weights, decoupled from the gradient update, which restores the proportional shrinkage that L2 gives under plain SGD.
×
Mixing Dropout and BatchNorm in the same block
Symptom
Training becomes unstable; validation loss spikes. The model performs worse than using either regulariser alone.
Fix
If you must use both, put Dropout after BatchNorm + activation (not before the BatchNorm), keep dropout rate low (≤0.1), or use weight decay instead of dropout when BatchNorm is present.
×
Not using early stopping when regularisation is present
Symptom
Model continues training past the point of optimal validation performance; validation loss climbs while train loss keeps dropping.
Fix
Always implement early stopping with patience (5–20 epochs depending on dataset size) and a learning rate scheduler that reduces lr on plateau. Early stopping is the final safety net — it prevents overfitting regardless of regularisation settings.
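Early stopping with patience needs very little code. The sketch below is one possible shape for it; the class name `EarlyStopping`, the `min_delta` parameter and the example loss values are assumptions for illustration, not part of any library.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after epoch 2
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

In a real loop you would save a checkpoint whenever `best_epoch` changes and restore it after stopping. For the scheduler half of the advice, PyTorch ships `torch.optim.lr_scheduler.ReduceLROnPlateau`, which lowers the learning rate when a monitored metric stops improving.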
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 04 · Senior
Explain the ensemble interpretation of Dropout. How does it connect to bagging, and why does this interpretation break down when you stack Dropout with very high rates across multiple layers?
ANSWER
Dropout can be interpreted as training an ensemble of up to 2ⁿ sub-networks (where n is the number of droppable units), each receiving a gradient update only on the steps where its mask is sampled. This is analogous to bagging: each sub-network is trained on different mini-batches, much as each bagged model sees a different bootstrap sample, and at inference the full network with scaled activations stands in for averaging all of them, which would be prohibitive to do explicitly. However, the interpretation breaks down because the sub-networks share weights, whereas bagging trains independent models. With very high dropout rates (e.g., 0.9) stacked across layers, sub-networks become extremely sparse and the single scaled forward pass no longer approximates the true ensemble average well. The model underfits because each sub-network has too few active units to learn effectively.
Q02 of 04 · Senior
What is the difference between weight_decay in PyTorch's Adam optimiser and true L2 regularisation? When would choosing the wrong one meaningfully hurt your model?
ANSWER
L2 regularisation adds a squared-weight penalty to the loss, so its gradient (proportional to the weight) is added to the loss gradient. PyTorch's Adam implements weight_decay exactly this way: the decay term is folded into the gradient before the moment estimates are computed, and is then divided by the adaptive denominator along with everything else. Parameters with a history of large gradients therefore receive less effective regularisation. Under plain SGD, an L2 penalty and weight decay coincide (up to a rescaling by the learning rate); under Adam they do not. AdamW (Loshchilov & Hutter, 2019) fixes this by decoupling weight decay: the decay is applied to the raw weights, independent of the adaptive gradient scaling. This matters most in large models like Transformers, where different parameter groups naturally have different gradient magnitudes. Using Adam with weight_decay under-regularises the parameters with the largest gradients, which are often the ones most prone to memorisation.
Q03 of 04 · Senior
You're training a ResNet with BatchNorm throughout. Your validation loss is climbing after epoch 20 (classic overfitting). A junior engineer adds Dropout(p=0.5) after every BatchNorm layer. Training gets worse, not better. Walk me through why, and what would you do instead?
ANSWER
Adding high dropout to a BatchNorm-heavy architecture causes two problems. First, variance shift: dropout changes the variance of activations during training, and the downstream BatchNorm layers accumulate their running statistics under that altered variance. At inference dropout switches off, the activation variance changes, and the stored statistics no longer match, which degrades predictions. Second, BatchNorm already provides a regularisation effect (noise from batch statistics), and stacking high dropout on top over-regularises the network until it underfits. Instead of dropout, we should: (1) increase L2 weight decay (use AdamW), (2) use stochastic depth (randomly drop entire residual blocks, which is compatible with BatchNorm), (3) add label smoothing, (4) use data augmentation. If we must add dropout, keep it after BatchNorm + activation and use p≤0.1, not 0.5.
Q04 of 04 · Senior
How do you diagnose overfitting programmatically in a training pipeline, beyond looking at the train/val loss curves?
ANSWER
Monitor three metrics: (1) The gap between train and val accuracy — if gap > 10% after 20% of total epochs, overfitting is likely. (2) The ratio of train loss to val loss — if it stays below 0.5, the model is memorising. (3) Per-class validation accuracy — if a class with few samples has disproportionately low acc, the model is memorising that class's noise. You can also compute the training loss at the early stopping best epoch — if it's near zero, you're probably overfitting. Automate these checks as hooks in the training loop and alert when thresholds are crossed.
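The first two checks in the answer above are simple enough to automate as a per-epoch hook. This is a sketch with assumed names (`overfitting_flags`) and it reuses the thresholds quoted in the answer, which are heuristics rather than universal constants.

```python
def overfitting_flags(train_acc, val_acc, train_loss, val_loss,
                      gap_threshold=0.10, ratio_threshold=0.5):
    """Return the names of the overfitting heuristics that fire this epoch.

    Accuracies are fractions in [0, 1]; thresholds are rules of thumb.
    """
    flags = []
    if train_acc - val_acc > gap_threshold:
        flags.append("accuracy_gap")
    if val_loss > 0 and train_loss / val_loss < ratio_threshold:
        flags.append("loss_ratio")
    return flags


# Made-up epoch summaries for illustration
memorising = overfitting_flags(train_acc=0.98, val_acc=0.67, train_loss=0.05, val_loss=0.90)
healthy = overfitting_flags(train_acc=0.85, val_acc=0.83, train_loss=0.40, val_loss=0.45)
```

A training loop could call this at the end of each epoch and raise an alert, or trigger early stopping, when the returned list is non-empty for several epochs in a row.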
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
What dropout rate should I use for my neural network?
For large fully-connected layers, 0.5 is the Srivastava et al. original recommendation and still a good starting point. For convolutional layers, use Dropout2d at 0.1–0.2 maximum. For Transformer attention layers, 0.1 is the norm. Always tune via validation performance — if training accuracy is also low, your dropout rate is too high.
02
Should I use dropout or L2 regularisation — or both?
They're complementary and often used together. L2 (via weight_decay in AdamW) is a near-zero-cost default for almost every network. Dropout is an additional tool for large FC layers. Don't stack them heavily with BatchNorm — pick weight decay + BatchNorm, or Dropout (lightly) + no BatchNorm, for the cleanest training dynamics.
03
Does dropout slow down training?
Dropout typically requires more epochs to converge because each step updates a randomly-masked sub-network, not the full model. The per-step cost is roughly the same (zeroing neurons is cheap), but you may need 1.5–2x more epochs to reach the same training accuracy. The trade-off is almost always worth it: you exchange faster convergence for meaningfully better generalisation on held-out data.
04
What is the difference between vanilla dropout and spatial dropout for CNNs?
Vanilla dropout (nn.Dropout) zeroes individual pixel activations independently. In a CNN, adjacent pixels are highly correlated, so dropping random pixels doesn't break the correlation pattern effectively. Spatial dropout (nn.Dropout2d) zeros entire feature maps (channels) instead. This forces the network to learn redundant features across channels, which is a more effective regulariser for convolutional layers.
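The channel-wise behaviour is easy to confirm on a tensor of ones. In this small sketch (the tensor sizes are arbitrary), every channel that passes through `nn.Dropout2d` in training mode ends up either entirely zero or entirely scaled by 1/(1-p), whereas `nn.Dropout` masks elements independently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
feature_maps = torch.ones(1, 8, 4, 4)  # batch of 1, 8 channels, 4x4 spatial

spatial = nn.Dropout2d(p=0.5).train()
elementwise = nn.Dropout(p=0.5).train()

spatial_out = spatial(feature_maps)
elementwise_out = elementwise(feature_maps)

# Flatten spatial dims so each row is one channel: shape (1, 8, 16)
per_channel = spatial_out.flatten(2)
channel_is_uniform = per_channel.min(dim=-1).values == per_channel.max(dim=-1).values

# Surviving activations are scaled by 1 / (1 - p) = 2.0 in both variants
surviving_values = spatial_out[spatial_out != 0]
```

Printing `spatial_out[0, :, 0, 0]` shows one value per channel, which makes the all-or-nothing pattern visible at a glance.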
05
How does label smoothing work as a regulariser?
Label smoothing replaces hard one-hot targets (e.g., [1,0,0]) with soft targets (e.g., [0.9, 0.05, 0.05]). This prevents the model from chasing infinitely large logits for the correct class, which would normally be encouraged by cross-entropy loss. The result is a less overconfident model that generalises better, especially on noisy datasets. Typical smoothing value is 0.1.
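The soft targets from the example above can be built directly. The helper `soft_targets` is a hypothetical name for this sketch. It also shows a second convention: PyTorch's built-in `label_smoothing` argument to `F.cross_entropy` spreads the smoothing mass over all classes, including the true one, so the numbers differ slightly from the [0.9, 0.05, 0.05] example.

```python
import torch
import torch.nn.functional as F


def soft_targets(targets, n_classes, smoothing=0.1, include_true_class=False):
    """Turn class indices into smoothed target distributions."""
    if include_true_class:
        # Convention used by F.cross_entropy(label_smoothing=...)
        off_value = smoothing / n_classes
        on_value = 1.0 - smoothing + off_value
    else:
        # Convention used in the [0.9, 0.05, 0.05] example
        off_value = smoothing / (n_classes - 1)
        on_value = 1.0 - smoothing
    out = torch.full((targets.size(0), n_classes), off_value)
    out.scatter_(1, targets.unsqueeze(1), on_value)
    return out


labels = torch.tensor([0, 2])
article_style = soft_targets(labels, n_classes=3)  # rows like [0.9, 0.05, 0.05]
builtin_style = soft_targets(labels, n_classes=3, include_true_class=True)

# The built-in loss matches a manual cross-entropy on builtin_style targets
torch.manual_seed(0)
logits = torch.randn(2, 3)
manual_loss = -(builtin_style * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
builtin_loss = F.cross_entropy(logits, labels, label_smoothing=0.1)
```

Either convention yields valid distributions that sum to 1; the practical difference is small unless the number of classes is tiny.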