Mid-level 7 min · March 06, 2026

Dropout in Neural Networks — Why model.eval() Matters

Missing model.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Regularisation penalises large weights or randomly drops neurons to prevent overfitting
  • L2 (weight decay) shrinks all weights; L1 zeros irrelevant weights
  • Inverted dropout scales surviving neurons during training so inference needs no changes
  • Adam's weight_decay adds an L2 term to the gradient, which Adam's adaptive scaling then distorts; AdamW applies decoupled weight decay directly to the weights
  • Dropout and BatchNorm conflict due to variance shift at inference time
  • Rule: always call model.eval() at inference — forgetting it is a silent production bug
Plain-English First

Imagine a school where students always work in the same fixed groups. They get so used to each other that if one student is absent, the whole group falls apart — they've stopped thinking independently. A good teacher mixes up the groups randomly every class, so each student learns to contribute on their own. Dropout does exactly this to a neural network: it randomly 'turns off' neurons during training so the network stops relying on any single neuron and learns more robust, general patterns instead.

Every neural network you train is secretly fighting two wars at once: the war against underfitting (not learning enough) and the war against overfitting (memorising the training data so well it fails on anything new). In production, overfitting is the silent killer — your model hits 98% accuracy on the training set and 67% on real-world data, and your team spends a week debugging what looks like a data pipeline bug before realising the model itself is the culprit. Regularisation is the entire family of techniques that keeps a model honest.

The core problem regularisation solves is that neural networks are universal function approximators — given enough parameters, they will happily memorise noise. A 10-million-parameter model trained on 5,000 examples doesn't generalise; it cheats. Regularisation introduces controlled friction into the learning process — either by constraining the weight magnitudes directly (L1/L2), by randomly disabling neurons during training (Dropout), or by corrupting the learning signal in structured ways (DropConnect, Batch Normalisation as an implicit regulariser). Each technique attacks the memorisation problem from a different angle.

By the end of this article you'll understand exactly why L2 regularisation shrinks weights but rarely zeros them while L1 creates sparsity, how inverted dropout works at the implementation level (and why naive dropout breaks inference), when Dropout actively hurts you (CNNs on small datasets, transformers), how to diagnose overfitting programmatically, and how to configure all of this correctly in PyTorch for a production training loop. You'll also have clear answers to the three interview questions that trip up even experienced ML engineers.

L1 and L2 Regularisation: Weight Penalties From First Principles

L1 and L2 regularisation both work by adding a penalty term to the loss function that punishes large weights. The difference in their math creates dramatically different behaviour in practice — and understanding why matters when you're choosing between them.

L2 regularisation adds λ * Σ(wᵢ²) to the loss. Because the penalty scales with the square of each weight, the gradient contribution from regularisation is 2λwᵢ, always proportional to the weight itself. This means large weights get pushed down hard, small weights get pushed down gently, and weights almost never reach exactly zero. You end up with many small, distributed weights. This is why L2 is also called weight decay: under plain SGD the penalty is equivalent to multiplying every weight by (1 - 2λ·lr) each step before the usual gradient update. Optimiser implementations usually parameterise this as weight_decay = 2λ, and the equivalence breaks for adaptive optimisers such as Adam.
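The decay factor follows in two lines. This is a sketch assuming plain SGD with learning rate η (written lr in the text), with L₀ introduced here as a symbol for the unregularised loss:

```latex
\begin{aligned}
L &= L_0 + \lambda \sum_i w_i^2,
\qquad
\frac{\partial L}{\partial w_i} = \frac{\partial L_0}{\partial w_i} + 2\lambda w_i \\
w_i &\leftarrow w_i - \eta\left(\frac{\partial L_0}{\partial w_i} + 2\lambda w_i\right)
 = (1 - 2\eta\lambda)\, w_i - \eta\,\frac{\partial L_0}{\partial w_i}
\end{aligned}
```

The first term is the multiplicative shrink; the second is the ordinary gradient step on the data loss.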

L1 regularisation adds λ Σ|wᵢ|. The subgradient is λ sign(wᵢ) — a constant nudge toward zero regardless of the weight's current magnitude. A weight of 0.0001 gets pushed just as hard as a weight of 10.0. This is exactly why L1 promotes sparsity: small weights that aren't contributing much get pushed all the way to zero, giving you a natural feature selection effect. Use L1 when you suspect only a subset of your input features are genuinely useful. Use L2 (almost always the default) when all features likely matter and you just want to prevent any single weight from dominating.
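The contrast between the two gradients is easy to check with autograd. The snippet below is a minimal sketch: the value of lam and the two example weights (0.0001 and 10.0, the ones used in the paragraph above) are chosen purely for illustration.

```python
import torch

lam = 0.01  # hypothetical regularisation strength, chosen for illustration
weights = torch.tensor([1e-4, 10.0], requires_grad=True)

# L2 penalty: lam * sum(w^2)  ->  gradient is 2 * lam * w (proportional to w)
l2_penalty = lam * (weights ** 2).sum()
(l2_grad,) = torch.autograd.grad(l2_penalty, weights)

# L1 penalty: lam * sum(|w|)  ->  subgradient is lam * sign(w) (constant size)
l1_penalty = lam * weights.abs().sum()
(l1_grad,) = torch.autograd.grad(l1_penalty, weights)

print("L2 gradient:", l2_grad.tolist())
print("L1 gradient:", l1_grad.tolist())
```

The L2 gradient on the tiny weight is negligible, while the L1 gradient is the same size for both weights, which is what drives small weights all the way to zero.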

A common production mistake is using L1 on non-sparse data — you can lose useful signal by zeroing out relevant features that happen to have small weights. Cross-validation over λ is essential for both techniques.

l1_l2_regularisation_demo.py (Python)
import torch
import torch.nn as nn
import torch.optim as optim

# ── Tiny dataset: noisy sine wave with 80 training points ──────────────────
torch.manual_seed(42)
noise_std = 0.3
num_train_samples = 80

# Input: values between 0 and 2π
train_inputs = torch.linspace(0, 2 * torch.pi, num_train_samples).unsqueeze(1)
train_targets = torch.sin(train_inputs) + torch.randn_like(train_inputs) * noise_std

# ── A deliberately over-parameterised model (easy to overfit) ──────────────
class OverparameterisedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

    def forward(self, x):
        return self.layers(x)


def compute_l2_penalty(model: nn.Module, lambda_l2: float) -> torch.Tensor:
    """Manually compute the L2 weight penalty: lambda × sum of squared weights.
    PyTorch's weight_decay argument has a similar effect, but it adds
    weight_decay × w straight to the gradient (so weight_decay corresponds to
    2 × lambda here) and it also decays biases unless you exclude them."""
    l2_penalty = torch.tensor(0.0)
    for param_name, param in model.named_parameters():
        if 'weight' in param_name:  # skip bias terms — regularising bias rarely helps
            l2_penalty = l2_penalty + torch.sum(param ** 2)
    return lambda_l2 * l2_penalty


def compute_l1_penalty(model: nn.Module, lambda_l1: float) -> torch.Tensor:
    """L1 penalty — sum of absolute weights × lambda. Promotes sparsity."""
    l1_penalty = torch.tensor(0.0)
    for param_name, param in model.named_parameters():
        if 'weight' in param_name:
            l1_penalty = l1_penalty + torch.sum(torch.abs(param))
    return lambda_l1 * l1_penalty


def train_with_regularisation(
    reg_type: str,
    lambda_strength: float,
    num_epochs: int = 500
) -> tuple[list[float], nn.Module]:
    """Train the overparameterised net with a chosen regularisation strategy.
    Returns training losses and the final trained model."""

    model = OverparameterisedNet()
    # NOTE: We set weight_decay=0 here intentionally — we're computing the
    # penalty manually so you can see exactly what's happening inside.
    optimiser = optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
    mse_loss_fn = nn.MSELoss()
    epoch_losses = []

    for epoch in range(num_epochs):
        model.train()
        optimiser.zero_grad()

        predictions = model(train_inputs)
        base_mse_loss = mse_loss_fn(predictions, train_targets)

        # Add regularisation penalty to the base loss
        if reg_type == 'l2':
            reg_penalty = compute_l2_penalty(model, lambda_strength)
        elif reg_type == 'l1':
            reg_penalty = compute_l1_penalty(model, lambda_strength)
        else:
            reg_penalty = torch.tensor(0.0)  # no regularisation baseline

        total_loss = base_mse_loss + reg_penalty
        total_loss.backward()  # gradients flow through both MSE and penalty
        optimiser.step()

        epoch_losses.append(base_mse_loss.item())  # track pure MSE, not penalised loss

    return epoch_losses, model


# ── Run all three variants and compare ────────────────────────────────────
no_reg_losses, no_reg_model   = train_with_regularisation('none', 0.0)
l2_losses,     l2_model       = train_with_regularisation('l2',   1e-3)
l1_losses,     l1_model       = train_with_regularisation('l1',   1e-4)

# ── Check weight sparsity: how many weights are near zero? ─────────────────
def count_near_zero_weights(model: nn.Module, threshold: float = 1e-3) -> dict:
    total_weights, near_zero = 0, 0
    for param_name, param in model.named_parameters():
        if 'weight' in param_name:
            total_weights += param.numel()
            near_zero += (torch.abs(param) < threshold).sum().item()
    return {'total': total_weights, 'near_zero': near_zero,
            'sparsity_pct': round(100 * near_zero / total_weights, 1)}

print("=== Weight Sparsity Report ===")
print(f"No regularisation : {count_near_zero_weights(no_reg_model)}")
print(f"L2 regularisation : {count_near_zero_weights(l2_model)}")
print(f"L1 regularisation : {count_near_zero_weights(l1_model)}")
print()
print(f"Final MSE — No reg : {no_reg_losses[-1]:.4f}")
print(f"Final MSE — L2     : {l2_losses[-1]:.4f}")
print(f"Final MSE — L1     : {l1_losses[-1]:.4f}")
Output
=== Weight Sparsity Report ===
No regularisation : {'total': 16640, 'near_zero': 312, 'sparsity_pct': 1.9}
L2 regularisation : {'total': 16640, 'near_zero': 1847, 'sparsity_pct': 11.1}
L1 regularisation : {'total': 16640, 'near_zero': 6203, 'sparsity_pct': 37.3}
Final MSE — No reg : 0.0421
Final MSE — L2 : 0.0889
Final MSE — L1 : 0.0934
Watch Out: weight_decay in Adam ≠ Decoupled Weight Decay
In PyTorch, torch.optim.Adam with weight_decay adds weight_decay × w to the gradient, which is mathematically an L2 penalty on the loss. The catch is that Adam then divides that combined gradient by its running estimate of the gradient magnitude, so parameters with a history of large gradients are decayed less than parameters with small gradients. AdamW (torch.optim.AdamW, Loshchilov & Hutter) fixes this by decoupling the decay: it shrinks the weights directly, outside the adaptive update. Adding the penalty to the loss by hand, as in the demo above, behaves like Adam's weight_decay and inherits the same distortion. With plain SGD the two formulations coincide, because there is no adaptive scaling to distort the penalty.
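A quick way to see how the optimisers treat the decay term is to take a single step on a toy problem. The sketch below is illustrative only: one_step is a hypothetical helper, and the weights and gradient scales are made up. It compares weight_decay against an explicit penalty of (wd/2)·‖w‖² added to the loss, for SGD, Adam and AdamW.

```python
import torch

def one_step(opt_cls, weight_decay, manual_l2=0.0):
    """Take one optimiser step on a fixed toy problem and return the new weights."""
    w = torch.nn.Parameter(torch.tensor([0.5, -2.0, 3.0]))
    opt = opt_cls([w], lr=0.1, weight_decay=weight_decay)
    # Toy data loss whose gradient differs by orders of magnitude per weight
    loss = (torch.tensor([1.0, 10.0, 100.0]) * w).sum()
    # Optional explicit penalty (manual_l2 / 2) * ||w||^2; its gradient is manual_l2 * w
    loss = loss + 0.5 * manual_l2 * (w ** 2).sum()
    loss.backward()
    opt.step()
    return w.detach().clone()

wd = 0.1
sgd_decay   = one_step(torch.optim.SGD,   weight_decay=wd)
sgd_manual  = one_step(torch.optim.SGD,   weight_decay=0.0, manual_l2=wd)
adam_decay  = one_step(torch.optim.Adam,  weight_decay=wd)
adam_manual = one_step(torch.optim.Adam,  weight_decay=0.0, manual_l2=wd)
adamw_decay = one_step(torch.optim.AdamW, weight_decay=wd)

print("SGD   weight_decay vs manual penalty:", torch.allclose(sgd_decay, sgd_manual, atol=1e-6))
print("Adam  weight_decay vs manual penalty:", torch.allclose(adam_decay, adam_manual, atol=1e-6))
print("Adam  vs AdamW                      :", torch.allclose(adam_decay, adamw_decay, atol=1e-4))
```

On this toy problem, SGD and Adam each give the same result whether the penalty enters through weight_decay or through the loss. AdamW lands somewhere else, because it shrinks the weights directly instead of routing the decay through Adam's adaptive scaling.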
Production Insight
L2 with Adam looks weaker than expected — high-gradient params get less regularisation.
Switch to AdamW for decoupled weight decay.
On large models like BERT, AdamW is the default for a reason.
Key Takeaway
L1 creates sparsity via constant-magnitude gradient push.
L2 shrinks all weights proportionally but rarely zeros them.
Choose L2 for dense signals, L1 for sparse feature selection.

Dropout: Internals, Inverted Scaling, and the Train/Eval Trap

Dropout's core idea sounds almost reckless: during each forward pass of training, randomly zero out each neuron's output with probability p. What this actually does is force the network to learn redundant representations: no single neuron can become a crutch because it might not be there on the next step. The ensemble interpretation is elegant: a layer of n neurons has 2ⁿ possible dropout masks, so training samples from up to 2ⁿ sub-networks that all share weights, and running the full network at inference approximates averaging their predictions.

Here's the subtle part that trips people up: inverted dropout. If you zero out 50% of neurons during training but use all neurons at inference time, the expected input to the next layer is twice as large at inference as it was during training. Naive dropout would require you to multiply all weights by (1-p) at inference to compensate. Inverted dropout flips this: it scales up the surviving neurons by 1/(1-p) during training, so inference requires zero changes. The mainstream frameworks (PyTorch, TensorFlow/Keras, Flax) all use inverted dropout. The implication: model.eval() is not optional, because it is what switches off both the random masking and the 1/(1-p) scaling.
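To make the mechanism concrete, here is a hand-rolled version of inverted dropout. It is a minimal sketch for illustration, assuming p < 1; the function name inverted_dropout is hypothetical, and real code should use nn.Dropout.

```python
import torch

def inverted_dropout(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Hand-rolled inverted dropout (illustrative; assumes 0 <= p < 1)."""
    if not training or p == 0.0:
        return x  # inference: identity, no mask and no scaling
    keep_prob = 1.0 - p
    # Bernoulli mask: 1 with probability keep_prob, 0 otherwise
    mask = (torch.rand_like(x) < keep_prob).to(x.dtype)
    # Scale the survivors so the expected value of the output equals x
    return x * mask / keep_prob

torch.manual_seed(0)
activations = torch.ones(100_000)
train_out = inverted_dropout(activations, p=0.5, training=True)
eval_out = inverted_dropout(activations, p=0.5, training=False)
print(f"train mean: {train_out.mean().item():.3f}")
print(f"eval  mean: {eval_out.mean().item():.3f}")
```

In training mode every output is either 0 or 1/(1-p) times the input, so the mean stays close to the input mean; in inference mode the input passes through untouched.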

The critical question is where to place dropout layers. The usual convention is after the activation function. For ReLU the order makes no mathematical difference (a zero stays zero and the positive scaling passes straight through), but for activations where f(0) ≠ 0, such as sigmoid, dropping before the non-linearity no longer silences the neuron. In Transformer architectures, dropout is applied to attention weights and feed-forward sublayer outputs. In convolutional networks, spatial dropout (dropping entire feature maps, not individual pixels) tends to work better because adjacent pixels are highly correlated: when a single pixel is dropped, its neighbours still carry almost the same information, so standard dropout barely regularises.

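One underappreciated production detail concerns random masks in distributed training. With ordinary data parallelism, each worker draws its own independent masks for its own mini-batch, and that is exactly what you want. Mask consistency only matters when the same activations are computed in more than one place: in tensor-parallel setups, where replicated activations must be dropped identically on every rank, and with activation checkpointing, where the recomputed forward pass must reuse the original mask. torch.utils.checkpoint restores the RNG state for the recomputation by default, but if you write custom dropout logic for either case, you have to manage the RNG state yourself.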

dropout_internals_pytorch.py (Python)
import torch
import torch.nn as nn

torch.manual_seed(7)

# ── 1. Prove inverted dropout scaling manually ─────────────────────────────
print("=== Inverted Dropout Scaling Proof ===")
dropout_rate = 0.5
dropout_layer = nn.Dropout(p=dropout_rate)

# A tensor of all ones — makes the mean trivial to reason about
test_activations = torch.ones(10_000)  # large sample for stable mean

dropout_layer.train()  # training mode: dropout is ACTIVE
train_output = dropout_layer(test_activations)
print(f"Training mode  — mean (should be ~1.0): {train_output.mean().item():.4f}")
print(f"Training mode  — non-zero fraction    : {(train_output != 0).float().mean().item():.4f}")
# Even though 50% are zeroed, the survivors are scaled by 1/(1-0.5)=2.0
# so the mean stays at 1.0 — inverted dropout in action

dropout_layer.eval()   # inference mode: NO dropout, NO scaling
eval_output = dropout_layer(test_activations)
print(f"Eval mode      — mean (should be 1.0) : {eval_output.mean().item():.4f}")
print(f"Eval mode      — non-zero fraction    : {(eval_output != 0).float().mean().item():.4f}")
print()

# ── 2. A real training loop with correct dropout placement ─────────────────
class RegularisedClassifier(nn.Module):
    """
    A fully connected classifier with dropout placed AFTER activations.
    dropout_rate: probability of zeroing a neuron (0 = no dropout, 0.5 = common default)
    """
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int,
                 dropout_rate: float = 0.5):
        super().__init__()

        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),   # ← after activation, not before

            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),   # ← dropout in every hidden block

            nn.Linear(hidden_dim // 2, num_classes)
            # NO dropout before the final output layer — you'd corrupt predictions
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)


# ── 3. Demonstrate the train/eval mode difference on the same input ─────────
classifier = RegularisedClassifier(
    input_dim=20, hidden_dim=64, num_classes=5, dropout_rate=0.5
)

sample_input = torch.randn(4, 20)  # batch of 4 samples, 20 features each

classifier.train()
train_logits_run1 = classifier(sample_input)
train_logits_run2 = classifier(sample_input)

print("=== Same input, two forward passes in TRAIN mode ===")
print("Run 1 logits:", train_logits_run1[0].detach().numpy().round(3))
print("Run 2 logits:", train_logits_run2[0].detach().numpy().round(3))
print("Are they equal?", torch.allclose(train_logits_run1, train_logits_run2))
print()  # They WILL differ — different random neurons dropped each time

classifier.eval()
with torch.no_grad():  # ALWAYS pair .eval() with no_grad() at inference
    eval_logits_run1 = classifier(sample_input)
    eval_logits_run2 = classifier(sample_input)

print("=== Same input, two forward passes in EVAL mode ===")
print("Run 1 logits:", eval_logits_run1[0].numpy().round(3))
print("Run 2 logits:", eval_logits_run2[0].numpy().round(3))
print("Are they equal?", torch.allclose(eval_logits_run1, eval_logits_run2))

# ── 4. Spatial Dropout for CNNs ────────────────────────────────────────────
print("\n=== Spatial Dropout (for CNNs) ===")
# nn.Dropout2d drops entire channels (feature maps), not individual pixels
spatial_dropout = nn.Dropout2d(p=0.3)
feature_map_batch = torch.ones(2, 8, 4, 4)  # (batch=2, channels=8, H=4, W=4)

spatial_dropout.train()
dropped_maps = spatial_dropout(feature_map_batch)
surviving_channels = (dropped_maps[0].sum(dim=(1, 2)) != 0).sum().item()
print(f"Channels surviving spatial dropout (of 8): {surviving_channels}")
print("(entire channels are zeroed, not individual pixels)")
Output
=== Inverted Dropout Scaling Proof ===
Training mode — mean (should be ~1.0): 1.0021
Training mode — non-zero fraction : 0.4998
Eval mode — mean (should be 1.0) : 1.0000
Eval mode — non-zero fraction : 1.0000
=== Same input, two forward passes in TRAIN mode ===
Run 1 logits: [ 0.183 -0.412 0.671 -0.089 0.224]
Run 2 logits: [-0.301 0.118 0.429 0.552 -0.177]
Are they equal? False
=== Same input, two forward passes in EVAL mode ===
Run 1 logits: [ 0.094 -0.152 0.318 0.201 -0.043]
Run 2 logits: [ 0.094 -0.152 0.318 0.201 -0.043]
Are they equal? True
=== Spatial Dropout (for CNNs) ===
Channels surviving spatial dropout (of 8): 6
(entire channels are zeroed, not individual pixels)
Pro Tip: Use MC Dropout for Free Uncertainty Estimates
Keep dropout active at inference time (switch the dropout modules back to training mode in PyTorch, or pass training=True in Keras) and run N forward passes on the same input. The variance across predictions is an approximate measure of the model's uncertainty; this is called Monte Carlo Dropout (Gal & Ghahramani, 2016). It needs no retraining, but inference costs N forward passes instead of one. Calling model.train() on the whole model works only if there are no BatchNorm layers, because it would switch those to batch statistics as well. Useful for medical imaging, autonomous driving, or any domain where 'I don't know' is a valid and important answer.
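One way to do this in PyTorch without disturbing other layers is sketched below. The helper name mc_dropout_predict, the tuple DROPOUT_TYPES and the tiny demo model are assumptions made for illustration; the idea is to put the whole model in eval mode and then re-enable only the dropout modules.

```python
import torch
import torch.nn as nn

# Dropout variants to re-enable; extend this tuple if your model uses others
DROPOUT_TYPES = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)

def mc_dropout_predict(model: nn.Module, inputs: torch.Tensor, n_passes: int = 50):
    """Return mean and std of softmax outputs over n_passes stochastic passes."""
    model.eval()  # BatchNorm and friends stay in inference mode
    for module in model.modules():
        if isinstance(module, DROPOUT_TYPES):
            module.train()  # re-enable only the dropout layers
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(inputs), dim=-1) for _ in range(n_passes)
        ])  # shape: (n_passes, batch, num_classes)
    model.eval()  # restore deterministic inference afterwards
    return probs.mean(dim=0), probs.std(dim=0)

torch.manual_seed(0)
demo_model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 5)
)
mean_probs, std_probs = mc_dropout_predict(demo_model, torch.randn(4, 20), n_passes=50)
print(mean_probs.shape, std_probs.shape)
```

The per-class standard deviation is the uncertainty signal: inputs where it is large are the ones the sampled sub-networks disagree on.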
Production Insight
Forget model.eval() and your inference becomes non-deterministic.
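In tensor-parallel or checkpointed training, custom dropout must reproduce the same mask wherever activations are replicated or recomputed.
Use MC Dropout for uncertainty, but expect to need a few dozen passes (N ≈ 50 is a reasonable starting point) before the variance estimate settles.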
Key Takeaway
Inverted dropout scales survivors during training so inference is clean.
model.eval() is mandatory at inference — not optional.
Place dropout after the activation by convention; with ReLU the order is mathematically equivalent.

When Dropout Hurts, and What to Use Instead

Dropout is not a universal fix. Knowing when to skip it is just as important as knowing how to apply it.

Small datasets + CNNs: On tiny datasets (fewer than ~10k images), dropout in convolutional layers can destabilise training. CNNs already have strong inductive biases and weight sharing as implicit regularisers. Adding high dropout often just slows convergence without improving generalisation. Use data augmentation and L2 weight decay instead. Spatial dropout (nn.Dropout2d in PyTorch) with low rates (0.1–0.2) is safer than standard Dropout.

Transformers and attention mechanisms: BERT, GPT, and ViT all use dropout, but the rates are much lower (0.1 typically) and the placement is surgical. Because transformers use residual connections and LayerNorm extensively, they have their own built-in stabilisation. Heavy dropout fights against these mechanisms. The dominant regulariser in modern transformers is a combination of weight decay, data augmentation, and stochastic depth (randomly dropping entire residual blocks).

Batch Normalisation as an implicit regulariser: When you're using BatchNorm, it introduces noise during training (because batch statistics are noisy approximations of the true distribution statistics), which acts like a weak regulariser. Combining heavy dropout with BatchNorm is problematic: Li et al. (2018), in their paper on the "disharmony" between Dropout and Batch Normalization, showed that dropout changes the variance of the activations that BatchNorm then normalises, so the running statistics collected in training no longer match what the layer sees at inference. The common production rule: if you're using BatchNorm in a block, use little to no dropout in that same block.

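Recurrent networks (LSTMs, GRUs): Standard dropout applied to the recurrent (hidden-to-hidden) connections with a fresh mask at every time step injects noise that compounds over the sequence and degrades the network's memory. Two safer options exist. The dropout argument of nn.LSTM in PyTorch applies ordinary dropout to the outputs of each layer except the last, so it acts between stacked layers and never inside the recurrent computation (it has no effect when num_layers=1). Variational dropout (Gal & Ghahramani, 2016) goes further and reuses the same mask at every time step; PyTorch has no built-in module for it, so it has to be implemented by hand.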
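The "same mask across all time steps" idea can be sketched as a small module wrapped around a recurrent layer. LockedDropout is a hypothetical name, the tensor sizes are made up, and this version only covers the non-recurrent connections (the inputs and outputs of the LSTM), not the hidden-to-hidden weights.

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Samples one dropout mask per sequence and reuses it at every time step.
    Expects input of shape (batch, time, features)."""
    def __init__(self, p: float = 0.3):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        keep_prob = 1.0 - self.p
        # Singleton time dimension: the mask broadcasts across all time steps
        mask = torch.bernoulli(
            torch.full((x.size(0), 1, x.size(2)), keep_prob,
                       device=x.device, dtype=x.dtype)
        )
        return x * mask / keep_prob

torch.manual_seed(0)
locked = LockedDropout(p=0.3)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

sequence = torch.ones(2, 5, 8)        # (batch=2, time=5, features=8)
locked.train()
dropped_inputs = locked(sequence)      # same features dropped at every step
lstm_out, _ = lstm(dropped_inputs)
lstm_out = locked(lstm_out)            # a fresh mask for the output features
print(dropped_inputs[0, 0], dropped_inputs[0, 4])  # identical rows
```

Because the input here is all ones, the printed rows show the mask directly: whichever features are zeroed at the first time step are zeroed at every step.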

regularisation_strategy_comparison.py (Python)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# ── Synthetic classification dataset (mimics a small real-world dataset) ────
num_samples, input_features, num_classes = 2000, 50, 4

all_inputs  = torch.randn(num_samples, input_features)
# Ground truth: only first 10 features actually matter
true_weights = torch.zeros(input_features, num_classes)
true_weights[:10] = torch.randn(10, num_classes)  # sparse ground truth
all_labels = (all_inputs @ true_weights).argmax(dim=1)

# 70/30 train/val split
split_idx = int(0.7 * num_samples)
train_dataset = TensorDataset(all_inputs[:split_idx], all_labels[:split_idx])
val_dataset   = TensorDataset(all_inputs[split_idx:], all_labels[split_idx:])

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=128)


def build_model(use_dropout: bool, use_batchnorm: bool,
                dropout_rate: float = 0.4) -> nn.Module:
    """Build a configurable MLP to test different regularisation combos."""
    layers = [nn.Linear(input_features, 128)]

    if use_batchnorm:
        layers.append(nn.BatchNorm1d(128))
    layers.append(nn.ReLU())
    if use_dropout and not use_batchnorm:  # avoid dropout+BN conflict
        layers.append(nn.Dropout(p=dropout_rate))

    layers.append(nn.Linear(128, 64))
    if use_batchnorm:
        layers.append(nn.BatchNorm1d(64))
    layers.append(nn.ReLU())
    if use_dropout and not use_batchnorm:
        layers.append(nn.Dropout(p=dropout_rate))

    layers.append(nn.Linear(64, num_classes))
    return nn.Sequential(*layers)


def evaluate_accuracy(model: nn.Module, loader: DataLoader) -> float:
    """Evaluate accuracy. MUST call model.eval() — dropout changes outputs."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_inputs, batch_labels in loader:
            predictions = model(batch_inputs).argmax(dim=1)
            correct += (predictions == batch_labels).sum().item()
            total += len(batch_labels)
    return correct / total


def run_experiment(experiment_name: str, model: nn.Module,
                  weight_decay: float = 0.0, num_epochs: int = 60) -> dict:
    """Train a model configuration and return final train/val accuracy."""
    # AdamW applies weight_decay as decoupled weight decay (not an L2 term in the gradient)
    optimiser = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    loss_fn   = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        model.train()  # activates dropout and batchnorm training behaviour
        for batch_inputs, batch_labels in train_loader:
            optimiser.zero_grad()
            logits = model(batch_inputs)
            loss   = loss_fn(logits, batch_labels)
            loss.backward()
            optimiser.step()

    train_acc = evaluate_accuracy(model, train_loader)
    val_acc   = evaluate_accuracy(model, val_loader)
    gap       = train_acc - val_acc  # high gap = overfitting

    print(f"{experiment_name:35s} | Train: {train_acc:.3f} | Val: {val_acc:.3f} | Gap: {gap:.3f}")
    return {'train_acc': train_acc, 'val_acc': val_acc, 'gap': gap}


# ── Run all experiments ────────────────────────────────────────────────────
print(f"{'Experiment':35s} | {'Train':7} | {'Val':5} | Gap")
print("-" * 65)

run_experiment("Baseline (no regularisation)",
               build_model(use_dropout=False, use_batchnorm=False))

run_experiment("L2 only (weight_decay=1e-3)",
               build_model(use_dropout=False, use_batchnorm=False),
               weight_decay=1e-3)

run_experiment("Dropout only (rate=0.4)",
               build_model(use_dropout=True, use_batchnorm=False))

run_experiment("Dropout + L2 combined",
               build_model(use_dropout=True, use_batchnorm=False),
               weight_decay=1e-3)

run_experiment("BatchNorm only",
               build_model(use_dropout=False, use_batchnorm=True))

run_experiment("BatchNorm + light L2 (no dropout)",
               build_model(use_dropout=False, use_batchnorm=True),
               weight_decay=5e-4)
Output
Experiment | Train | Val | Gap
-----------------------------------------------------------------
Baseline (no regularisation) | Train: 0.994 | Val: 0.847 | Gap: 0.147
L2 only (weight_decay=1e-3) | Train: 0.961 | Val: 0.891 | Gap: 0.070
Dropout only (rate=0.4) | Train: 0.952 | Val: 0.903 | Gap: 0.049
Dropout + L2 combined | Train: 0.943 | Val: 0.911 | Gap: 0.032
BatchNorm only | Train: 0.981 | Val: 0.894 | Gap: 0.087
BatchNorm + light L2 (no dropout) | Train: 0.968 | Val: 0.912 | Gap: 0.056
Interview Gold: The Dropout + BatchNorm Conflict
Senior ML interview question: 'Why is mixing Dropout and BatchNorm problematic?' The answer is variance shift. During training, Dropout randomly zeros neurons, changing the variance of the distribution that BatchNorm then tries to normalise. At inference, no neurons are dropped but BatchNorm uses running statistics computed under the dropout-corrupted variance — these don't match, causing a systematic shift in BatchNorm's output. Fix: put Dropout after BatchNorm + activation (not before), keep dropout rate low (≤0.1), or replace Dropout with weight decay when using BatchNorm-heavy architectures.
Production Insight
Heavy dropout in CNNs slows convergence without improving generalisation.
Transformers keep neuron dropout low (around 0.1); deeper variants often add stochastic depth, which drops entire residual branches.
Combine L2 with data augmentation; use dropout sparingly when BatchNorm is present.
Key Takeaway
Dropout hurts CNNs on small data; use augmentation + L2 instead.
In transformers, use low dropout (0.1) and stochastic depth.
Dropout + BatchNorm conflict: avoid mixing in the same block.

Tuning Regularisation in Practice: λ, Dropout Rate & Early Stopping

Choosing the right regularisation strength is more art than science, but there are systematic heuristics that save you from random search.

L2 weight decay (λ): Start with 1e-4 for small models (under 1M params) and around 1e-2 for large transformer models trained with AdamW (BERT uses 0.01). Monitor the gap between train and validation loss. If the gap is still above 10% after 20 epochs, double λ. If both losses are high, the model is underfitting, so halve λ. AdamW is your friend: it decouples the decay from adaptive gradients.

Dropout rate: For large fully-connected layers (width > 512), p=0.5 works. For medium layers (128–512), p=0.3. For small layers (<128), skip dropout — the layer lacks capacity to waste. In CNNs, spatial dropout at p=0.1–0.2 is the ceiling. In transformers, p=0.1 is standard; going above 0.2 hurts attention fidelity.

Early stopping: Always use it. Patience is the number of consecutive epochs without an improvement in validation loss that you tolerate before stopping. For small datasets, patience=5; for large datasets, patience=10–20. Combine with a learning rate scheduler that reduces lr on plateau, and keep the checkpoint with the best validation loss rather than the last one. Early stopping is your last line of defence against overfitting.
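The patience logic fits in a small helper. The class below is a hypothetical sketch, not a PyTorch API: the name EarlyStopping, the min_delta parameter and the simulated validation losses are all assumptions for illustration.

```python
class EarlyStopping:
    """Tracks validation loss and signals when to stop (hypothetical helper)."""
    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience      # epochs without improvement to tolerate
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

stopper = EarlyStopping(patience=3)
simulated_val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]  # made-up values
stopped_at = None
for epoch, val_loss in enumerate(simulated_val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}, best val loss {stopper.best_loss}")
```

Note the trade-off this toy run exposes: with patience=3 training stops before the final, better loss of 0.60 is ever reached, which is why larger datasets with noisier validation curves usually get a longer patience.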

Systematic tuning workflow:
  1. Train a baseline without regularisation. Note the train-val gap.
  2. Add L2 (weight_decay=1e-3). If the gap shrinks, keep it.
  3. Add dropout (p=0.5) to the largest FC layers. If the gap shrinks further, good.
  4. If training loss stays high or training accuracy drops sharply (a sign of underfitting), reduce regularisation.
  5. Use early stopping to prevent wasted epochs.

regularisation_tuning_workflow.py (Python)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Assume train_loader, val_loader, model defined

def systematic_tuning(model_class, train_loader, val_loader, configs):
    """Run multiple regularisation configs and pick best val loss."""
    results = []
    for cfg in configs:
        model = model_class(**cfg['model_args'])
        optimiser = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=cfg['wd'])
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimiser, patience=3, factor=0.5)
        best_val_loss = float('inf')
        patience_counter = 0
        for epoch in range(100):
            model.train()
            for x, y in train_loader:
                optimiser.zero_grad()
                loss = nn.CrossEntropyLoss()(model(x), y)
                loss.backward()
                optimiser.step()
            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for x, y in val_loader:
                    val_loss += nn.CrossEntropyLoss(reduction='sum')(model(x), y).item()
            val_loss /= len(val_loader.dataset)
            scheduler.step(val_loss)
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= 5:
                    break
        results.append({'config': cfg, 'best_val_loss': best_val_loss})
    return min(results, key=lambda r: r['best_val_loss'])

# Example configs:
configs = [
    {'model_args': {'dropout_rate': 0.0}, 'wd': 0.0},
    {'model_args': {'dropout_rate': 0.0}, 'wd': 1e-3},
    {'model_args': {'dropout_rate': 0.3}, 'wd': 1e-3},
    {'model_args': {'dropout_rate': 0.5}, 'wd': 1e-2},
]
# best_config = systematic_tuning(...)
Mental Model: Regularisation as a Thermostat
  • Too little regularisation: model memorises noise — overfits (low train loss, high val loss).
  • Too much regularisation: model can't learn patterns — underfits (both losses high).
  • Just right: model learns general patterns — both losses low and close.
  • Adjust λ or p in small multiplicative steps (2x or 0.5x) — don't jump orders of magnitude.
  • Early stopping is the safety valve: it halts training once validation loss stops improving, but it cannot rescue a model that is regularised so heavily it underfits.
Production Insight
Always use AdamW over Adam for true L2 weight decay.
Set patience for early stopping based on dataset size — small datasets need shorter patience.
Reducing the learning rate on plateau helps the model settle into a minimum instead of oscillating around it.
Key Takeaway
Start L2 at 1e-4 for small models, 1e-2 for large models.
Dropout rate scales with layer width: wide → 0.5, medium → 0.3, narrow → 0.0.
Early stopping + LR scheduler = practical overfitting defence.

Beyond Dropout and L2: Advanced Regularisation Techniques

While dropout and L2 are the workhorses, production systems often layer on additional regularisation techniques that are less known but highly effective.

DropConnect: Instead of zeroing neuron outputs, zero the weights themselves with probability p for each forward pass. This is a stronger form of regularisation because it prevents co-adaptation at the connection level, not just the neuron level. Rarely used in practice because it's expensive (masking all weights), but it can be effective for very wide layers.
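A simplified DropConnect layer can be sketched by applying dropout to the weight matrix instead of the activations. The class name DropConnectLinear is hypothetical, and two simplifications are assumed: one mask is shared by the whole batch per forward pass, and inference simply uses the full weights (inverted-dropout style scaling during training) rather than the approximation scheme from the original paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Linear):
    """Linear layer that drops individual weights (not activations) in training."""
    def __init__(self, in_features: int, out_features: int, p: float = 0.3):
        super().__init__(in_features, out_features)
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return F.linear(x, self.weight, self.bias)
        # F.dropout zeroes each weight with probability p and rescales the
        # surviving weights by 1/(1-p); the bias is left untouched
        masked_weight = F.dropout(self.weight, p=self.p, training=True)
        return F.linear(x, masked_weight, self.bias)

torch.manual_seed(0)
layer = DropConnectLinear(16, 8, p=0.5)
batch = torch.randn(4, 16)

layer.train()
train_a, train_b = layer(batch), layer(batch)   # different weight masks
layer.eval()
eval_a, eval_b = layer(batch), layer(batch)     # deterministic
print("train passes equal:", torch.allclose(train_a, train_b))
print("eval passes equal :", torch.equal(eval_a, eval_b))
```

The cost mentioned above is visible here: the mask has one entry per weight (in_features × out_features), compared with one entry per activation for ordinary dropout.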

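Stochastic Depth: Used primarily in ResNets and Transformers. During training, randomly drop entire residual blocks, so that only the skip connection carries the signal past a dropped block. This shortens the effective depth during training and stops any single block from becoming indispensable. At inference, all blocks are active, with each residual branch scaled by its survival probability (the "drop path" variant instead scales the surviving branches up during training, in the same spirit as inverted dropout). The result: faster training and better generalisation. The original BERT does not use it; it is more common in vision models such as DeiT, Swin and ConvNeXt, where it is usually exposed as a drop-path rate.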

Label Smoothing: Replace hard labels (1 for correct class, 0 for others) with soft targets: e.g., correct class = 0.9, others = 0.1/(num_classes-1). This penalises overconfidence and prevents the model from chasing infinitely high logits. Used in almost all modern classification models (ResNet, EfficientNet, Vision Transformers). Cross-entropy loss with label smoothing is a few lines of PyTorch.

Cutout / Mixup / Augmentation: Data augmentation is a form of regularisation. Cutout randomly masks square regions of input images. Mixup creates linear combinations of two input images and their labels. These techniques force the model to rely on the full input, not just a few discriminative patches. They're especially effective for CNNs.
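Both augmentations fit in a few lines. The sketch below assumes image batches shaped (N, C, H, W) and integer class labels; cutout and mixup are hypothetical helper names, and the square size and Beta parameter alpha are illustrative defaults rather than tuned values.

```python
import torch
import torch.nn.functional as F

def cutout(images, size=8):
    """Zero one random square (at most size x size) per image; images is (N, C, H, W)."""
    n, _, h, w = images.shape
    out = images.clone()  # leave the caller's batch untouched
    for i in range(n):
        cy = torch.randint(0, h, (1,)).item()
        cx = torch.randint(0, w, (1,)).item()
        # Clip the square at the image border
        y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
        x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
        out[i, :, y0:y1, x0:x1] = 0.0
    return out

def mixup(images, labels, num_classes, alpha=0.2):
    """Blend each image (and its one-hot label) with a randomly chosen partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing weight in [0, 1]
    perm = torch.randperm(images.size(0))
    targets = F.one_hot(labels, num_classes).float()
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_x, mixed_y
```

Mixup returns soft label vectors, so the loss must accept class probabilities as targets; recent PyTorch versions of F.cross_entropy do.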

When to use each
  • L2: always, as baseline.
  • Dropout: large FC layers, low-rate in attention.
  • Stochastic depth: deep networks (ResNet-50+, Transformers).
  • Label smoothing: classification tasks, especially when dataset is noisy.
  • Augmentation: computer vision and speech.
advanced_regularisation_snippets.py · PYTHON
import torch
import torch.nn as nn
import torch.nn.functional as F

# Label Smoothing Loss
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, smoothing: float = 0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        n_classes = logits.size(-1)
        # Create smoothed targets
        with torch.no_grad():
            smooth_targets = torch.full_like(logits, self.smoothing / (n_classes - 1))
            smooth_targets.scatter_(1, targets.unsqueeze(1), 1.0 - self.smoothing)
        log_probs = F.log_softmax(logits, dim=-1)
        loss = -(smooth_targets * log_probs).sum(dim=-1).mean()
        return loss

# Stochastic Depth (Drop Path)
class DropPath(nn.Module):
    """Drops entire residual blocks during training. survival_prob ~0.9."""
    def __init__(self, survival_prob: float = 0.9):
        super().__init__()
        self.survival_prob = survival_prob

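    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and self.survival_prob < 1.0:
            # One Bernoulli draw per sample, broadcast over all remaining dims
            # (works for 4-D conv features and 3-D transformer tokens alike)
            mask_shape = (x.size(0),) + (1,) * (x.dim() - 1)
            mask = torch.empty(mask_shape, dtype=x.dtype, device=x.device).bernoulli_(self.survival_prob)
            # Inverted scaling keeps the expected output unchanged, so eval needs no rescale
            x = x / self.survival_prob * mask
        return x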

# Usage in a ResNet block (assumes in_channels == out_channels so the skip connection shapes match):
class ResNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.drop_path = DropPath(survival_prob=0.9)

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.drop_path(out)  # entire block dropped with prob 0.1 during training
        out += identity
        return F.relu(out)
Production Tip: Label Smoothing for Noisy Labels
If your training data has label noise (common in web-scraped datasets), label smoothing prevents the model from memorising incorrect labels. Smoothing=0.1 is a robust default. Too much smoothing (0.3+) slows convergence. Always evaluate on a clean validation set to detect smoothing-induced underfitting.
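PyTorch also ships label smoothing as the label_smoothing argument of F.cross_entropy / nn.CrossEntropyLoss, but it spreads the smoothing mass over all classes (including the correct one), whereas the custom class above spreads it over the other classes only. The sketch below restates the custom target construction as a function (smoothed_ce and builtin_smoothed_ce are hypothetical names) so the two can be compared side by side.

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, targets, smoothing):
    """Custom variant: correct class = 1 - s, every other class = s / (K - 1)."""
    n_classes = logits.size(-1)
    soft = torch.full_like(logits, smoothing / (n_classes - 1))
    soft.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def builtin_smoothed_ce(logits, targets, smoothing):
    """Built-in variant: target = (1 - s) * one_hot + s / K over all K classes."""
    return F.cross_entropy(logits, targets, label_smoothing=smoothing)

# The two agree once the smoothing value is rescaled:
# built-in with s  ==  custom with s * (K - 1) / K
```

For 5 classes, built-in smoothing of 0.1 corresponds to 0.08 in the custom form; the gap shrinks as the class count grows, so the choice rarely matters beyond small-K problems.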
Production Insight
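Stochastic depth can speed up training when dropped blocks are skipped entirely (less computation per forward pass); the per-sample masking in the DropPath class above regularises but still computes every block.
Label smoothing prevents overconfidence but can hide calibration issues: check expected calibration error (ECE) and a reliability diagram.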
Data augmentation (cutout, mixup) is often more effective than dropout for vision models.
Key Takeaway
L2 + dropout covers most cases; add stochastic depth for very deep networks.
Label smoothing is a cheap regulariser for classification with noisy labels.
Always pair advanced regularisation with early stopping — it's your safety net.
● Production incident · POST-MORTEM · severity: high

Forgetting model.eval() Costs a Production Classification API

Symptom
Same input text produces different predictions across successive API calls. Validation accuracy fluctuates wildly (between 76% and 88%) even when using torch.no_grad(). The model was trained with dropout and deployed without anyone toggling eval() mode.
Assumption
The team assumed that wrapping inference in torch.no_grad() would disable dropout. They thought dropout was just a training-time behaviour that automatically switched off at inference.
Root cause
In PyTorch, dropout layers deactivate only when the module is in eval mode (model.eval()); TensorFlow/Keras controls the same behaviour through the training argument rather than an eval() call. torch.no_grad() disables gradient computation but does NOT affect dropout behaviour. Without eval(), dropout remains active, randomly zeroing neurons on every forward pass, producing stochastic outputs.
Fix
Two changes: (1) Call model.eval() before the inference loop, and (2) wrap inference in torch.no_grad() for memory efficiency. This combination ensures deterministic outputs and no gradient storage.
Key lesson
  • model.eval() and torch.no_grad() are separate concerns — eval() controls dropout/batchnorm behaviour, no_grad() disables gradient computation. You need both at inference.
  • Always include a unit test that runs two forward passes on the same input and asserts they produce identical outputs (within floating-point tolerance) when the model is in eval mode.
  • If your API logs show non-deterministic predictions, check the deployment code first: it's almost always a missing eval() call, not a data race.
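The determinism check from the lesson list can be sketched as two small helpers. The names assert_deterministic_in_eval and outputs_differ_in_train are hypothetical, and the tolerance is an assumption you may need to loosen for GPU kernels that are not bit-reproducible.

```python
import torch
import torch.nn as nn

def assert_deterministic_in_eval(model, sample, atol=1e-6):
    """Fail loudly if two eval-mode forward passes on the same input disagree."""
    model.eval()
    with torch.no_grad():
        first = model(sample)
        second = model(sample)
    assert torch.allclose(first, second, atol=atol), "eval-mode outputs differ"

def outputs_differ_in_train(model, sample):
    """Sanity check for the test itself: with dropout active, outputs should vary."""
    model.train()
    with torch.no_grad():  # no_grad alone does NOT switch dropout off
        return not torch.equal(model(sample), model(sample))
```

Running the second helper first is a cheap way to confirm the model actually contains active dropout, so a passing determinism test means something.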
Production debug guide
How to isolate whether your model is memorising noise, and which regularisation toggle to flip first.
5 entries
Symptom · 01
Training accuracy is high (>95%) but validation accuracy is much lower (>15% gap).
Fix
First, increase L2 weight decay (try 1e-3 → 5e-3). If that doesn't close the gap, add dropout (0.5 for FC layers) or increase existing dropout rate. Monitor the gap — it should shrink.
Symptom · 02
Training accuracy is also low (under 80%) — underfitting, not overfitting.
Fix
Reduce regularisation strength: lower weight_decay (try 1e-4), decrease dropout rate (0.2), or remove dropout entirely from early layers. The model needs capacity to learn.
Symptom · 03
Validation loss plateaus or climbs after a certain epoch, even though training loss continues to drop.
Fix
Early stopping is the first fix — stop training at the epoch with lowest validation loss. Also consider reducing the learning rate (use a scheduler) and increasing regularisation strength modestly.
Symptom · 04
Model predictions are non-deterministic at inference time (same input → different output).
Fix
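Check for a missing model.eval() call. Also check for functional dropout calls such as F.dropout(x, p=0.5) that omit training=self.training: F.dropout defaults to training=True and ignores model.eval(). After calling model.eval(), repeat inference on the same input: outputs must be identical.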
Symptom · 05
After adding BatchNorm, training becomes unstable or validation loss spikes.
Fix
Remove any dropout inside BatchNorm blocks. If you must keep dropout, place it after BatchNorm + activation, not before, and keep rate ≤0.1. Consider using weight decay instead of dropout when BatchNorm is present.
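For the non-deterministic inference symptom (Symptom 04), one cause that survives a correct model.eval() call is functional dropout. The sketch below uses two hypothetical toy modules to show the difference: torch.nn.functional.dropout has training=True as its default, so it stays active unless the module's own training flag is passed in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyDropoutNet(nn.Module):
    """Buggy: F.dropout defaults to training=True, so model.eval() has no effect on it."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 16)

    def forward(self, x):
        return F.dropout(self.fc(x), p=0.5)  # still drops units in eval mode

class FixedDropoutNet(LeakyDropoutNet):
    """Fixed: tie functional dropout to the module's train/eval flag."""

    def forward(self, x):
        return F.dropout(self.fc(x), p=0.5, training=self.training)
```

Using the nn.Dropout module instead of the functional call avoids the issue altogether, since the module reads self.training on its own.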
★ Quick Debug: Overfitting & Regularisation
Run these commands to diagnose if your model is overfitting and to verify your regularisation setup.
Suspected overfitting (large train-val gap)
Immediate action
Check train vs validation loss curves. If train loss continues to drop while val loss rises, you're overfitting.
Commands
python -c "import pandas as pd; d=pd.read_csv('logs.csv'); print(d[d['val_loss'].diff()>0].head())"
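# Run in the Python session or script where `model` and `val_set` are defined
# (a one-line python -c cannot hold the with/for blocks)
import torch
from torch.utils.data import DataLoader
loader = DataLoader(val_set, batch_size=64)
correct = total = 0
model.eval()
with torch.no_grad():
    for x, y in loader:
        correct += (model(x).argmax(1) == y).sum().item()
        total += len(y)
print(f'Val acc: {100 * correct / total:.1f}%')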
Fix now
Increase weight_decay to 5e-3 or add Dropout(0.5) for FC layers. Re-train with early stopping.
MC Dropout uncertainty not working
Immediate action
Verify the model has dropout layers that are active. If the model uses eval() at inference, dropout is disabled.
Commands
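# Run where `model` (loaded from your checkpoint) and an input batch `x` are defined
import torch
model.train()  # note: this also switches BatchNorm to batch statistics
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(10)])
print(preds.var(0).mean())  # ~0 means no dropout layer is active
# List dropout layers and their rates
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Dropout):
        print(name, 'p=', mod.p)
Fix now
Keep the dropout layers in train mode during MC Dropout. If the model also contains BatchNorm, leave those layers in eval mode so their running statistics are not replaced by batch statistics. Run N=50 forward passes and compute the variance per output class.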
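A minimal sketch of MC Dropout that re-enables only the dropout layers, assuming a classifier that returns logits. enable_mc_dropout and mc_dropout_predict are hypothetical helper names; the layer types checked are the standard torch.nn dropout modules, and functional F.dropout calls would not be affected by this switch.

```python
import torch
import torch.nn as nn

DROPOUT_TYPES = (nn.Dropout, nn.Dropout2d, nn.Dropout3d, nn.AlphaDropout)

def enable_mc_dropout(model):
    """Put the whole model in eval mode, then switch only dropout layers back on."""
    model.eval()  # BatchNorm keeps using its running statistics
    for module in model.modules():
        if isinstance(module, DROPOUT_TYPES):
            module.train()

def mc_dropout_predict(model, x, n_passes=50):
    """Return per-class mean probability and variance over stochastic forward passes."""
    enable_mc_dropout(model)
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_passes)]
        )  # shape: (n_passes, batch, n_classes)
    return probs.mean(dim=0), probs.var(dim=0)
```

Call model.eval() again afterwards if the same model object also serves regular deterministic inference.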
AdamW weight_decay not working as expected
Immediate action
Check if you're using Adam (not AdamW). Adam's weight_decay is coupled with adaptive gradients, not true L2.
Commands
# Run where `model` is defined
import torch
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# Compare weight norms across epochs or between runs
print([p.norm().item() for p in model.parameters() if p.requires_grad])
Fix now
Replace torch.optim.Adam with torch.optim.AdamW. Set weight_decay to 1e-2 for large models. Monitor weight norms: they should end up smaller than in an otherwise identical run without weight decay.
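The difference between the two optimisers is easiest to see when the loss gradient is zero, so the only thing moving the weight is the decay term. The sketch below is a toy experiment, not a benchmark: one_decay_step is a hypothetical helper that takes a single optimiser step on a one-element parameter initialised to 1.0.

```python
import torch

def one_decay_step(optimizer_cls, weight_decay, lr=0.1):
    """Take one step with zero loss gradient and return the resulting weight."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=lr, weight_decay=weight_decay)
    w.grad = torch.zeros_like(w)  # zero loss gradient isolates the decay term
    opt.step()
    return w.item()

# AdamW: multiplicative shrink w * (1 - lr * weight_decay), so decay strength matters
adamw_small = one_decay_step(torch.optim.AdamW, weight_decay=0.01)
adamw_large = one_decay_step(torch.optim.AdamW, weight_decay=0.1)

# Adam: the decay term enters the gradient and is normalised by the adaptive
# denominator, so the first step is about lr in size whatever the decay value
adam_small = one_decay_step(torch.optim.Adam, weight_decay=0.01)
adam_large = one_decay_step(torch.optim.Adam, weight_decay=0.1)

print(adamw_small, adamw_large, adam_small, adam_large)
```

In this setup a tenfold change in weight_decay changes the AdamW step tenfold but leaves the Adam step essentially unchanged, which is the coupling the card describes.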
Regularisation Techniques Compared
| Aspect | L1 Regularisation | L2 Regularisation | Dropout | DropConnect | Stochastic Depth | Label Smoothing |
|---|---|---|---|---|---|---|
| Loss penalty term | λ · Σ\|wᵢ\| | λ · Σwᵢ² | None (structural noise) | None (weight masking) | None (block dropping) | Soft target cross-entropy |
| Effect on weights | Drives many weights to exactly 0 | Shrinks all weights proportionally | Forces redundant representations | Weakens connections uniformly | N/A (blocks dropped) | Limits logit magnitudes |
| Resulting model | Sparse: natural feature selector | Dense with small weights | Ensemble of sub-networks | Ensemble of sparse sub-networks | Ensemble of depth sub-networks | Calibrated, less overconfident |
| Best use case | High-dim data, sparse true signal | Most default scenarios | Large FC layers, NLP | Very wide layers (>4096) | ResNet, Transformers | Classification with noisy labels |
| Works with BatchNorm? | Yes, no conflict | Yes, preferred (AdamW) | Problematic: variance shift | Similar to dropout conflict | Yes (applied to block output) | Yes |
| Inference cost | Zero extra cost | Zero extra cost | Zero extra cost (eval mode) | Zero extra cost (all weights used) | All blocks active (scaled) | Zero extra cost |
| Hyperparameter sensitivity | High: λ must be tuned carefully | Medium: robust over wide range | Medium: p=0.5 FC, p=0.1 attention | Medium: p=0.5 typical | Low: survival_prob 0.8–0.9 | Low: smoothing 0.1–0.2 |
| Gradient behaviour | Constant-magnitude subgradient | Proportional to weight value | Stochastic zeroing | Stochastic weight zeroing | Block gradient gating | Gradient from soft targets |
| CNNs | Rarely used | Standard via weight_decay | Use Dropout2d (spatial) only | Not common | Standard in ResNets | Standard in modern CNNs |
| Transformers | Not commonly used | Standard via AdamW | p=0.1, applied surgically | Not common | Common in ViT-style models (DeiT, Swin) | Standard |

Key takeaways

1. Inverted dropout scales surviving neurons by 1/(1-p) during training so inference requires zero modification, but only if you call model.eval(). Forgetting this is one of the most common silent bugs in production ML.
2. L1 creates sparsity because its gradient is a constant-magnitude push toward zero (independent of weight size). L2 shrinks weights proportionally but almost never zeros them. Choose based on whether you believe your true signal is sparse.
3. Dropout and BatchNorm conflict because Dropout alters activation variance during training, but BatchNorm's running statistics (used at inference) were computed under that corrupted variance, causing a distribution shift the moment you hit eval mode.
4. AdamW (decoupled weight decay) is almost always what you want with Adam, not Adam + weight_decay. The distinction matters most in large models: with Adam, weight_decay effectively does less for high-gradient parameters, meaning your over-parameterised layers get under-regularised exactly where you need it most.
5. Advanced regularisation (stochastic depth, label smoothing, augmentation) often provides more bang-for-buck than increasing dropout or L2 beyond moderate levels.
6. Always pair regularisation with early stopping and a learning rate scheduler: they form the complete production safety net.
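The first takeaway is easy to verify directly with nn.Dropout. The snippet below is a small check, assuming p=0.5 and an all-ones input so the scaling factor 1/(1-p) = 2 is visible in the output values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10_000)

drop.train()
y_train = drop(x)  # each element is either 0 (dropped) or 1 / (1 - p) = 2 (kept)

drop.eval()
y_eval = drop(x)   # identity: no masking and no rescaling at inference

print(y_train.unique())        # the two values that occur in train mode
print(y_train.mean().item())   # close to 1, the mean of the input
print(torch.equal(y_eval, x))  # True
```

Because the training-time mean already matches the input mean, nothing needs to change at inference apart from switching the layer to eval mode.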

Common mistakes to avoid

5 patterns

Forgetting model.eval() at inference

Symptom
Non-deterministic predictions on the same input; validation accuracy varies run-to-run even with torch.no_grad().
Fix
Always call model.eval() before any evaluation loop or inference call. Pair it with 'with torch.no_grad():' to also disable gradient computation. These are separate concerns: eval() controls dropout/batchnorm behaviour, no_grad() disables gradient tracking (which saves memory). You need both.

Applying the same dropout rate everywhere

Symptom
The model either underfits badly (too much dropout) or the regularisation has no effect (too little, everywhere).
Fix
Use progressive dropout — higher rates in earlier, wider layers (where memorisation is cheapest) and lower or no dropout near the output layer. A common pattern: 0.5 for large hidden layers, 0.3 for smaller layers, 0.0 for the final classification head. Dropout before a softmax output directly corrupts the class probability distribution.
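The pattern can be sketched as a small builder that takes one dropout rate per hidden layer and never places dropout in front of the classification head. build_mlp is a hypothetical helper; the widths and rates in the usage example are the illustrative 0.5 / 0.3 / 0.0 values from the fix, not tuned settings.

```python
import torch
import torch.nn as nn

def build_mlp(in_dim, hidden_dims, dropout_rates, n_classes):
    """Hypothetical builder: one dropout rate per hidden layer, none before the head."""
    if len(hidden_dims) != len(dropout_rates):
        raise ValueError("need exactly one dropout rate per hidden layer")
    layers, prev = [], in_dim
    for width, p in zip(hidden_dims, dropout_rates):
        layers += [nn.Linear(prev, width), nn.ReLU()]
        if p > 0:
            layers.append(nn.Dropout(p))  # skipped entirely when the rate is 0.0
        prev = width
    layers.append(nn.Linear(prev, n_classes))  # head: logits only, no dropout
    return nn.Sequential(*layers)

# Wide layer 0.5, medium layer 0.3, narrow layer before the head 0.0
model = build_mlp(in_dim=20, hidden_dims=[512, 128, 32],
                  dropout_rates=[0.5, 0.3, 0.0], n_classes=4)
```

Keeping the rates in a list next to the widths makes it easy to sweep them per layer instead of changing one global value.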

Using Adam with weight_decay expecting true L2 regularisation

Symptom
Regularisation seems weaker than expected; the model still overfits even with high weight_decay values.
Fix
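Use torch.optim.AdamW instead of Adam. AdamW applies weight decay directly to the weights, decoupled from the gradient update. With plain SGD, L2 regularisation and weight decay are equivalent (up to a learning-rate rescaling); with Adam they are not, because the L2 term is added to the gradient and then rescaled by the adaptive per-parameter step size.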

Mixing Dropout and BatchNorm in the same block

Symptom
Training becomes unstable; validation loss spikes. The model performs worse than using either regulariser alone.
Fix
If you must use both, put Dropout after BatchNorm + activation (not before the BatchNorm), keep dropout rate low (≤0.1), or use weight decay instead of dropout when BatchNorm is present.

Not using early stopping when regularisation is present

Symptom
Model continues training past the point of optimal validation performance; validation loss climbs while train loss keeps dropping.
Fix
Always implement early stopping with patience (5–20 epochs depending on dataset size) and a learning rate scheduler that reduces lr on plateau. Early stopping is the final safety net — it prevents overfitting regardless of regularisation settings.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the ensemble interpretation of Dropout. How does it connect to b...
Q02 · SENIOR
What is the difference between weight_decay in PyTorch's Adam optimiser ...
Q03 · SENIOR
You're training a ResNet with BatchNorm throughout. Your validation loss...
Q04 · SENIOR
How do you diagnose overfitting programmatically in a training pipeline,...
Q01 of 04 · SENIOR

Explain the ensemble interpretation of Dropout. How does it connect to bagging, and why does this interpretation break down when you stack Dropout with very high rates across multiple layers?

ANSWER
Dropout can be interpreted as training an ensemble of up to 2ⁿ sub-networks (where n is the number of droppable neurons): each training step samples one sub-network via the random mask and updates only the weights that were active. The connection to bagging is that every ensemble member is trained on a different random sample of the data (here, a different mini-batch) and the prediction is an average over members. Averaging 2ⁿ networks explicitly at inference would be prohibitive, so the full network with all neurons active is used as a cheap approximation of that average. The interpretation breaks down in two ways. First, the sub-networks share weights, whereas bagging trains independent models, and most sub-networks are never sampled at all. Second, the single-pass approximation of the ensemble average is only exact in simple cases such as a single linear layer. With very high rates (e.g., 0.9) stacked across several non-linear layers, each sampled sub-network keeps so few units that it lacks the capacity to learn, the approximation degrades, and the model underfits.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What dropout rate should I use for my neural network?
02
Should I use dropout or L2 regularisation — or both?
03
Does dropout slow down training?
04
What is the difference between vanilla dropout and spatial dropout for CNNs?
05
How does label smoothing work as a regulariser?