Senior 3 min · March 06, 2026

Backpropagation — Why Your 10-Layer Network Stops Learning

Sigmoid derivatives vanish across deep layers, killing early gradients.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Backpropagation computes gradients via chain rule over a computational graph
  • One forward pass: compute predictions and loss
  • One backward pass: propagate error derivatives layer by layer
  • Autograd frameworks (PyTorch/TensorFlow) automate it but hide stability pitfalls
  • Vanishing gradients kill early layers; monitor gradient norms
  • Biggest mistake: assuming autograd handles precision; always verify with gradient checking
Plain-English First

Imagine you're learning to throw darts. You throw one, it lands too far left. You think 'okay, I need to rotate my wrist slightly right.' You throw again — still off, but less so. Each throw, you trace back what went wrong and adjust just that part of your technique. Backpropagation is exactly that: a neural network throws a guess, measures how wrong it was, then traces the error backwards through every single decision it made — layer by layer — nudging each connection slightly so the next throw is better. That's it. The whole idea.

Every time you unlock your phone with your face, get an eerily accurate Netflix recommendation, or watch GPT-4 complete your sentence, backpropagation is the algorithm that made those models smart. It's the engine inside every gradient-based deep learning model ever trained. Without it, neural networks would be untrained noise generators, not intelligent systems. It's not an exaggeration to say backpropagation is the most important algorithm in modern AI.

The core problem backpropagation solves is credit assignment: when a network of thousands or millions of parameters makes a wrong prediction, which parameters are responsible, and by how much? Tweaking weights randomly is computationally hopeless, and even estimating each gradient by perturbing one weight at a time needs a separate forward pass per weight. You need an efficient, mathematically principled way to propagate blame backwards through a computational graph, assigning each weight a gradient that tells you which direction to push it and how hard. Backpropagation does this in a single backward pass using the chain rule of calculus: for n weights, the cost falls from roughly O(n²) for one-weight-at-a-time perturbation to O(n).

By the end of this article you'll understand the chain rule derivation from scratch, implement forward and backward passes without any autograd framework, recognize the vanishing and exploding gradient problems and know exactly what causes them at the weight-initialization level, and understand what PyTorch's autograd is actually doing under the hood when you call loss.backward(). You'll also walk away with the kind of nuanced understanding that separates candidates who get ML engineering offers from those who don't.

What Is Backpropagation?

Backpropagation is essentially the application of the Chain Rule from calculus to a directed acyclic graph (DAG) of computations. In a neural network, we calculate the partial derivative of the loss function $L$ with respect to every weight $w$ in the network. Mathematically, for a single weight at layer $l$, the gradient is:

$$\frac{\partial L}{\partial w^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial w^{(l)}}$$

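Here $z^{(l)}$ is the pre-activation and $a^{(l)}$ the activation of layer $l$. The first factor, $\partial L / \partial a^{(l)}$, is not computed from scratch: it arrives from layer $l+1$ during the backward pass. By caching the forward activations and reusing these intermediate derivatives, we avoid redundant calculations, making deep learning computationally feasible.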

backprop_engine.py · PYTHON
import numpy as np

# io.thecodeforge implementation of a basic neuron backprop
def forge_backprop_demo():
    # 1. Forward Pass
    x = 1.0       # Input
    w = -2.0      # Weight
    b = 3.0       # Bias
    
    # Linear transformation: z = w*x + b
    z = w * x + b 
    # Activation: a = tanh(z)
    a = np.tanh(z)
    
    # 2. Backward Pass (Manual Gradient Calculation)
    # dL/da = 1.0 (assuming this is the end of the graph)
    da = 1.0
    
    # da/dz = 1 - tanh^2(z)
    dz = (1 - a**2) * da
    
    # dz/dw = x
    dw = x * dz
    # dz/db = 1
    db = 1.0 * dz
    
    print(f"Activation: {a:.4f}")
    print(f"Weight Gradient: {dw:.4f}")
    print(f"Bias Gradient: {db:.4f}")

if __name__ == "__main__":
    forge_backprop_demo()
Output
Activation: 0.7616
Weight Gradient: 0.4200
Bias Gradient: 0.4200
Forge Tip:
The secret to understanding backprop isn't the calculus—it's the bookkeeping. Notice how we use the output of the forward pass ($a$) to calculate the gradient in the backward pass ($1 - a^2$). This is why deep learning consumes so much memory; we must store activations until the backward pass is finished.
Production Insight
Manual gradient verification catches errors that autograd silently passes.
Even with autograd, always test on a single batch with synthetic data.
Rule: gradients from a finite-difference check should agree with your analytic gradients to within roughly 1e-4 (relative error).
Key Takeaway
Backprop = chain rule + caching.
Every forward activation needed for backward pass.
Understanding the scalar case makes the vector case intuitive.
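The gradient-checking rule above can be sketched for the single tanh neuron from this section. This is a minimal illustration, not a full test harness: the helper names (`neuron_output`, `analytic_grads`, `numeric_grad`) and the step size `eps=1e-6` are assumptions chosen for the example.

```python
import numpy as np

def neuron_output(w, b, x):
    """Forward pass of the single tanh neuron: a = tanh(w*x + b)."""
    return np.tanh(w * x + b)

def analytic_grads(w, b, x):
    """Chain-rule gradients of a with respect to w and b."""
    a = neuron_output(w, b, x)
    dz = 1.0 - a**2          # da/dz for tanh
    return x * dz, dz        # (da/dw, da/db)

def numeric_grad(f, value, eps=1e-6):
    """Central finite difference: (f(v + eps) - f(v - eps)) / (2 * eps)."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)

x, w, b = 1.0, -2.0, 3.0
dw, db = analytic_grads(w, b, x)
dw_num = numeric_grad(lambda v: neuron_output(v, b, x), w)
db_num = numeric_grad(lambda v: neuron_output(w, v, x), b)

# Relative error is more informative than absolute error when gradients are small
rel_err_w = abs(dw - dw_num) / max(abs(dw), abs(dw_num), 1e-12)
rel_err_b = abs(db - db_num) / max(abs(db), abs(db_num), 1e-12)
print(f"dw analytic={dw:.6f} numeric={dw_num:.6f} rel_err={rel_err_w:.2e}")
print(f"db analytic={db:.6f} numeric={db_num:.6f} rel_err={rel_err_b:.2e}")
```

The central difference is used rather than the one-sided version because its truncation error shrinks with eps squared, which leaves more headroom before floating-point round-off dominates.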

The Chain Rule in Depth: From Single Neuron to Multilayer Networks

When you stack layers, the chain rule chains together. For a two-layer network with hidden layer $h = f(W_1 x + b_1)$ and output $\hat{y} = f(W_2 h + b_2)$, the gradient of loss $L$ with respect to $W_1$ is:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$$

Each new layer multiplies in an extra derivative. The key insight: we compute these from output back to input, reusing intermediate values. In code, this means storing activations ($h$, $\hat{y}$) during the forward pass. Let's extend our earlier example to two layers.

forge_two_layer_backprop.py · PYTHON
import numpy as np

def forge_two_layer_backprop():
    # Two-layer network: input -> hidden (tanh) -> output (linear)
    x = np.array([1.0, 2.0])  # Input (2-dim)
    W1 = np.random.randn(3, 2) * 0.5  # 3 hidden neurons
    b1 = np.zeros(3)
    W2 = np.random.randn(1, 3) * 0.5
    b2 = np.zeros(1)
    
    # Forward pass
    z1 = W1.dot(x) + b1          # (3,)
    h = np.tanh(z1)              # activation
    z2 = W2.dot(h) + b2          # (1,)
    y_pred = z2                  # linear output, shape (1,)
    loss = 0.5 * np.sum((y_pred - 1.0)**2)  # squared error with target=1, reduced to a scalar

    # Backward pass (dL/dW1, dL/dW2, etc.)
    dloss = y_pred - 1.0         # dL/dy_pred
    dz2 = dloss                  # linear: dL/dz2 = dL/dy_pred
    dW2 = np.outer(dz2, h)       # (1,3), same shape as W2
    db2 = dz2                    # (1,)
    dh = W2.T.dot(dz2)           # (3,) backprop through linear
    dz1 = (1 - h**2) * dh        # tanh derivative
    dW1 = np.outer(dz1, x)       # (3,2)
    db1 = dz1
    
    print(f"Loss: {loss:.6f}")
    print(f"grad W1 shape: {dW1.shape}, norm: {np.linalg.norm(dW1):.4f}")
    print(f"grad W2 shape: {dW2.shape}, norm: {np.linalg.norm(dW2):.4f}")

forge_two_layer_backprop()
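Output (illustrative: no random seed is set, so the loss and norms differ on every run; only the shapes are fixed)
Loss: 0.123456
grad W1 shape: (3, 2), norm: 0.5432
grad W2 shape: (1, 3), norm: 0.8901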
Watch Out: Matrix Transpose Mismatch
Forgetting to transpose W2 when propagating the gradient from the output to the hidden layer is one of the most common bugs in manual backprop. Always check dimensions: W2 is (1, 3), so dh = W2.T @ dz2 multiplies (3, 1) by (1,) and yields (3,), the shape of h.
Production Insight
Production models rarely compute gradients manually, but debugging requires this intuition.
When gradients don't flow, the first place to look is dimension mismatch or transpose error.
Rule: verify each gradient shape matches the parameter shape before optimiser step.
Key Takeaway
Chain rule stacks multiplicatively with each layer.
Backprop reuses intermediate activations to avoid recomputation.
Matrix transposes are the most common source of silent errors.
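The shape rule and the gradient check can be combined into one small routine. The sketch below rebuilds the two-layer network with a fixed seed, asserts that every gradient has its parameter's shape, then compares the analytic gradient of W1 with central finite differences. The `params` dictionary layout and the `forward`/`backward` helper names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the check is repeatable
x = np.array([1.0, 2.0])
target = 1.0
params = {
    "W1": rng.standard_normal((3, 2)) * 0.5,
    "b1": np.zeros(3),
    "W2": rng.standard_normal((1, 3)) * 0.5,
    "b2": np.zeros(1),
}

def forward(p):
    """Return the scalar loss plus the values the backward pass needs."""
    h = np.tanh(p["W1"] @ x + p["b1"])
    y_pred = p["W2"] @ h + p["b2"]
    loss = 0.5 * np.sum((y_pred - target) ** 2)
    return loss, h, y_pred

def backward(p):
    """Manual backprop; returns gradients keyed like params."""
    _, h, y_pred = forward(p)
    dz2 = y_pred - target                 # (1,)
    dh = p["W2"].T @ dz2                  # (3,)
    dz1 = (1.0 - h**2) * dh               # (3,)
    return {
        "W1": np.outer(dz1, x), "b1": dz1,
        "W2": np.outer(dz2, h), "b2": dz2,
    }

grads = backward(params)

# 1) Every gradient must have exactly the shape of its parameter
for name in params:
    assert grads[name].shape == params[name].shape, name

# 2) Central finite-difference check on every entry of W1
eps = 1e-6
max_abs_err = 0.0
for idx in np.ndindex(params["W1"].shape):
    original = params["W1"][idx]
    params["W1"][idx] = original + eps
    loss_plus = forward(params)[0]
    params["W1"][idx] = original - eps
    loss_minus = forward(params)[0]
    params["W1"][idx] = original          # restore the weight
    numeric = (loss_plus - loss_minus) / (2 * eps)
    max_abs_err = max(max_abs_err, abs(numeric - grads["W1"][idx]))

print(f"max |numeric - analytic| on W1: {max_abs_err:.2e}")
```

A transpose bug in `dh` would usually show up here twice: first as a shape assertion failure or broadcasting error, and if it slipped past that, as a large finite-difference mismatch.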

Vanishing and Exploding Gradients: The Deep Network Killer

As depth increases, gradients can either shrink toward zero (vanishing) or grow exponentially (exploding). Vanishing happens when activation derivatives are small: the sigmoid derivative peaks at 0.25, and the tanh derivative peaks at 1.0 but falls toward 0 once the unit saturates. With 10 sigmoid layers, the activation derivatives alone contribute a factor of at most $0.25^{10} \approx 9.5 \times 10^{-7}$ to the gradient, before the weights are counted. Exploding occurs with poor weight initialization: large weights compound multiplicatively.
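The arithmetic behind that figure is easy to reproduce. The snippet below is a small sketch that only multiplies the maximum sigmoid derivative (0.25) across layers; it deliberately ignores the weight matrices, so it is an upper bound on the activation factor, not a prediction of real gradient norms. The function name is hypothetical.

```python
def sigmoid_derivative_bound(num_layers, max_derivative=0.25):
    """Upper bound on the product of sigmoid derivatives across num_layers layers."""
    return max_derivative ** num_layers

for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers: bound = {sigmoid_derivative_bound(depth):.3e}")
# The 10-layer line prints 9.537e-07
```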

This isn't just theory. It's a major reason very deep networks were hard to train before ReLU, batch norm, and He initialization. The 2015 paper 'Delving Deep into Rectifiers' (He et al.) showed that proper initialization alone can make a 30-layer network trainable.

Let's simulate both conditions to see the effect.

forge_gradient_debug.py · PYTHON
import numpy as np

def simulate_gradient_flow(num_layers=10, activation='sigmoid', init_std=1.0):
    x = np.random.randn(100)
    W = [np.random.randn(100, 100) * init_std for _ in range(num_layers)]
    b = [np.zeros(100) for _ in range(num_layers)]
    
    # Forward with stored activations
    a = x
    activations = [a]
    for i in range(num_layers):
        z = W[i].dot(a) + b[i]
        if activation == 'sigmoid':
            a = 1/(1+np.exp(-z))
        elif activation == 'tanh':
            a = np.tanh(z)
        else:  # relu
            a = np.maximum(0, z)
        activations.append(a)
    
    # Backward
    grad = np.ones(100)
    norm_per_layer = []
    for i in reversed(range(num_layers)):
        if activation == 'sigmoid':
            da = activations[i+1] * (1 - activations[i+1])
        elif activation == 'tanh':
            da = 1 - activations[i+1]**2
        else:
            da = (activations[i+1] > 0).astype(float)
        grad = W[i].T.dot(grad * da)
        # Recorded in the order the backward pass visits layers: last layer first
        norm_per_layer.append(np.linalg.norm(grad))
    return norm_per_layer

norms_sigmoid = simulate_gradient_flow(num_layers=10, activation='sigmoid')
norms_relu = simulate_gradient_flow(num_layers=10, activation='relu')
print("Gradient norms per layer (sigmoid):", [f"{n:.3e}" for n in norms_sigmoid])
print("Gradient norms per layer (ReLU):", [f"{n:.3e}" for n in norms_relu])
Output (illustrative: no seed is set, so values change on every run and depend strongly on init_std; each list runs from the last layer back to the first)
Gradient norms per layer (sigmoid): ['5.327e-01', '1.423e-01', '3.812e-02', '1.021e-02', '2.735e-03', '7.326e-04', '1.962e-04', '5.256e-05', '1.408e-05', '3.773e-06']
Gradient norms per layer (ReLU): ['8.765e+02', '6.289e+02', '4.512e+02', '3.236e+02', '2.321e+02', '1.665e+02', '1.194e+02', '8.567e+01', '6.146e+01', '4.409e+01']
Early-layer Gradients Disappear with Sigmoid
In the sigmoid run shown, the gradient norm drops by a factor of about 3.7 per layer, so by the time it reaches the first layer it is roughly five orders of magnitude smaller. In production, you'd see the first-layer weights barely change. ReLU doesn't saturate for positive inputs, but its gradients can explode if the initialization is too large. Always check gradient norms per layer during training.
Production Insight
He initialization (std = sqrt(2/fan_in)) reduces exploding risk for ReLU.
Batch normalization makes networks robust to initialization choices.
Rule: if training is slow, monitor gradient norms across layers – a 100x gap indicates a problem.
Key Takeaway
Vanishing gradients kill early layers; exploding gradients destabilize training.
ReLU + He initialization + batch norm is the modern stack.
Monitor gradient norm ratios to catch issues early.
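To see why He initialization matters for ReLU, the sketch below pushes one random input through a 10-layer ReLU stack and compares how the activation norm evolves with std = 1.0 against std = sqrt(2/fan_in). It tracks forward activations rather than gradients; the backward pass multiplies by the same matrices transposed, so its scale behaves in a broadly similar way. The function name, width, and seed are assumptions for this illustration.

```python
import numpy as np

def relu_forward_norms(init_std, num_layers=10, width=100, seed=0):
    """Push one random input through a ReLU stack and record activation norms."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    norms = [np.linalg.norm(a)]
    for _ in range(num_layers):
        W = rng.standard_normal((width, width)) * init_std
        a = np.maximum(0.0, W @ a)
        norms.append(np.linalg.norm(a))
    return norms

width = 100
he_std = np.sqrt(2.0 / width)          # He initialization: std = sqrt(2 / fan_in)
naive = relu_forward_norms(init_std=1.0, width=width)
he = relu_forward_norms(init_std=he_std, width=width)

print(f"std=1.0 : final/initial activation norm = {naive[-1] / naive[0]:.3e}")
print(f"He init : final/initial activation norm = {he[-1] / he[0]:.3e}")
```

With std = 1.0 each layer scales the norm by roughly sqrt(width/2), about 7x here, so ten layers compound to many orders of magnitude. With He scaling the per-layer factor is close to 1, which is the property the initialization is designed to have.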

Numerical Stability: When Your Gradients Become NaN or Inf

Floating-point arithmetic is finite precision. During backprop, you'll often compute log, exp, or division operations that can produce NaNs or infinities. Common culprits:

  • Cross-entropy loss: $L = -\log(\hat{y})$ when $\hat{y} = 0$ produces $\infty$
  • Softmax: $e^{z_i}$ where $z_i$ is large (e.g., 1000) causes overflow
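  • Sigmoid: $1 / (1 + e^{-z})$ for $z \approx -1000$ overflows in $e^{-z}$ and collapses to exactly 0, so a later $\log$ returns $-\infty$
  • Division by a near-zero value (for example a tiny norm or variance in a denominator) can produce Inf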

Production engineers learn to use numerically stable alternatives: the log-softmax function, adding epsilons to denominators, and gradient clipping. Let's see how PyTorch handles this and where it can still fail.

forge_numerical_stability.py · PYTHON
import torch
import torch.nn.functional as F

def stable_cross_entropy(logits, target):
    # Logits: (batch, classes), target: (batch,)
    # Using built-in which is numerically stable
    loss = F.cross_entropy(logits, target)
    return loss

# Demonstrate unstable vs stable
unstable_logits = torch.tensor([[1000.0, -1000.0, 0.0]], requires_grad=True)  # requires_grad so backward() below can populate .grad
target = torch.tensor([0])

try:
    # Manual unstable softmax and log
    softmax = torch.exp(unstable_logits) / torch.exp(unstable_logits).sum(dim=1, keepdim=True)
    loss_manual = -torch.log(softmax[0, 0])
    print(f"Manual loss: {loss_manual.item()}")  # Likely inf or nan
except Exception as e:
    print(f"Manual calculation failed: {e}")

# Stable version
loss_stable = stable_cross_entropy(unstable_logits, target)
print(f"Stable loss: {loss_stable.item():.4f}")

# Backward works
loss_stable.backward()
print(f"Gradient exists: {unstable_logits.grad is not None}")
Output
Manual loss: nan
Stable loss: 0.0000
Gradient exists: True
Forge Tip: Use log_softmax + nll_loss
F.cross_entropy applies log_softmax internally, which evaluates $\log(\sum e^{z_j})$ with the max-subtraction trick: $\log(\sum e^{z_j - \max(z)}) + \max(z)$. Every exponent is then at most 0, so the sum cannot overflow and the result stays finite for finite logits.
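The max-subtraction trick is short enough to write out by hand. The following NumPy sketch contrasts a direct log-softmax with the shifted version on the same extreme logits used above. Both function names are hypothetical, and this is an illustration of the idea, not PyTorch's actual implementation.

```python
import numpy as np

def naive_log_softmax(z):
    """Direct formula; exp overflows once a logit exceeds about 709 in float64."""
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        e = np.exp(z)
        return np.log(e / e.sum())

def stable_log_softmax(z):
    """Max-subtraction trick: shift logits so the largest becomes 0 before exp."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([1000.0, -1000.0, 0.0])
naive_result = naive_log_softmax(logits)
stable_result = stable_log_softmax(logits)
print("naive :", naive_result)    # first entry is nan (inf / inf)
print("stable:", stable_result)   # finite: 0, -2000, -1000
```

Subtracting the maximum does not change the result mathematically, because the shift cancels between numerator and denominator of the softmax. It only changes which intermediate values the floating-point hardware has to represent.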
Production Insight
NaN gradients often trace back to a single input sample with extreme values.
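Logging the loss value before backward catches many numerical issues early.
Rule: check inputs for NaN/Inf and clamp extreme values before they reach the loss. A bound as loose as [-1e6, 1e6] only guards against infinities; a naive softmax in float32 already overflows once a logit exceeds about 88, which is why the stable implementation matters.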
Key Takeaway
Numerical stability is not optional in deep learning.
Use stable implementations provided by frameworks – they handle edge cases.
When you see NaN, first check the loss function, then input range.

Autograd: What Happens When You Call loss.backward()

PyTorch's autograd builds a computational graph on the fly. When you call loss.backward(), it traverses the graph in reverse, computing gradients using the chain rule. The graph is dynamic – it's rebuilt every forward pass. Under the hood:

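  1. Each non-leaf tensor carries a grad_fn node that records the operation, the tensors saved for its backward computation, and links to the nodes that produced its inputs
  2. Leaf tensors created with requires_grad=True accumulate gradients in their .grad attribute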
  3. After backward, the graph is freed by default (to save memory)
  4. Setting retain_graph=True keeps it for multiple backward calls

TensorFlow 2 follows a similar model: it executes eagerly by default and records operations on a tf.GradientTape, while tf.function traces the code into a static graph. The key difference in practice: PyTorch's tape is imperative by default, so ordinary Python control flow works inside the model. TensorFlow's graph mode compiles the whole graph first, which enables optimisations but makes debugging harder.

Let's inspect a simple autograd graph to see what's happening.

forge_autograd_inspect.py · PYTHON
import torch

x = torch.tensor([3.0], requires_grad=True)
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)

z = w * x + b
a = torch.tanh(z)
loss = (a - 0.5)**2

print("Forward values:")
print(f"z = {z.item():.4f}")
print(f"a = {a.item():.4f}")
print(f"loss = {loss.item():.4f}")

# Backward
loss.backward()

print("\nGradients:")
print(f"dl/dx: {x.grad.item():.4f}")
print(f"dl/dw: {w.grad.item():.4f}")
print(f"dl/db: {b.grad.item():.4f}")

# Inspect the graph nodes
print("\nComputation graph nodes:")
def print_grad_fn(fn, depth=0):
    # fn is an autograd graph node such as loss.grad_fn.
    # Leaf tensors that require grad show up as AccumulateGrad nodes;
    # constants such as 0.5 appear as None in next_functions and are skipped.
    if fn is None:
        return
    print(' ' * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        print_grad_fn(next_fn, depth + 2)

print_grad_fn(loss.grad_fn)
Output
Forward values:
z = 7.0000
a = 1.0000
loss = 0.2500

Gradients:
dl/dx: 0.0000
dl/dw: 0.0000
dl/db: 0.0000

Computation graph nodes:
PowBackward0
  SubBackward0
    TanhBackward0
      AddBackward0
        MulBackward0
          AccumulateGrad
          AccumulateGrad
        AccumulateGrad
Note: at z = 7 the tanh unit is saturated (1 - a² ≈ 3.3e-6), so all three gradients round to 0.0000 at four decimals. This is vanishing gradients in miniature.
The Tape Model
  • Every operation is a frame on the tape; it stores the operation type and references to inputs.
  • When you call backward(), it unwinds the tape from end to start, applying chain rule.
  • The tape is erased after one backward pass unless you set retain_graph=True.
  • This design makes models with dynamic control flow possible – the tape captures exactly what happened.
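The tape idea fits in a few dozen lines of plain Python. The sketch below is a toy scalar autograd, written only to illustrate the record-then-replay mechanism; the `Value` class and its methods are hypothetical and are not how PyTorch is implemented internally. It reuses the single tanh neuron from the first example so the numbers can be compared.

```python
import math

class Value:
    """A scalar that remembers how it was computed (hypothetical teaching class)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None          # pushes self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def backward_fn():
            self.grad += (1.0 - t * t) * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the recorded nodes, then replay them in reverse
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()

x, w, b = Value(1.0), Value(-2.0), Value(3.0)
a = (w * x + b).tanh()
a.backward()
print(f"a = {a.data:.4f}, da/dw = {w.grad:.4f}, da/db = {b.grad:.4f}")
# prints: a = 0.7616, da/dw = 0.4200, da/db = 0.4200

# A second forward/backward without resetting .grad accumulates on the leaves
a2 = (w * x + b).tanh()
a2.backward()
print(f"da/dw after a second pass without zeroing: {w.grad:.4f}")
```

The second pass builds a fresh graph but writes into the same leaf `.grad` fields, so the leaf gradients roughly double. That is the same accumulation behaviour that makes zeroing gradients between batches necessary.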
Production Insight
Forgetting to zero gradients between batches causes accumulation – an old mistake that still catches engineers.
Backpropagating through the same graph more than once (as some GAN training loops do) requires retain_graph=True, which keeps the saved activations alive and raises memory usage.
Rule: call optimizer.zero_grad() explicitly at the start of each training step.
Key Takeaway
Autograd is a dynamic tape that records and replays operations.
Memory is freed after backward – one graph per forward run.
Dynamic graphs enable Python control flow but make whole-graph optimisation harder.
● Production incident · POST-MORTEM · severity: high

Vanishing Gradients in a 10-Layer Sigmoid Network

Symptom
Training loss drops quickly for 3-5 epochs, then stalls completely. Weights in the first few layers show negligible change when inspected.
Assumption
The learning rate was too small. Adjusted it up and down—no improvement.
Root cause
Sigmoid activation saturates outputs near 0 or 1, producing derivatives close to zero. These tiny values multiply across 10 layers, causing gradients in early layers to approach zero.
Fix
Replace all sigmoid activations with ReLU, add batch normalization after each layer, and initialize weights using He initialization.
Key lesson
  • Always monitor gradient norms per layer during training—they shouldn't differ by more than an order of magnitude.
  • Sigmoid/tanh activations amplify vanishing problems beyond ~5 layers. ReLU variants are safer for deep networks.
  • Weight initialization matters: He for ReLU, Xavier for sigmoid/tanh. The default in your framework may not match your activation.
Production debug guide
Symptom → Action patterns for gradient-related training failures (4 entries)
Symptom · 01
Loss stops decreasing after initial drop
Fix
Check gradient norms per layer: print([p.grad.norm().item() for p in model.parameters()]). Early layers with near-zero norms indicate vanishing gradients. Add batch norm or reduce depth.
Symptom · 02
Loss becomes NaN after one iteration
Fix
Verify input normalization—extreme values can cause exponential blowup. Reduce learning rate by a factor of 10. Enable gradient clipping (max_norm=1.0).
Symptom · 03
Weights in early layers don't change
Fix
Plot gradient histograms: layer-by-layer. Use skip connections (ResNet-style) to allow gradients to flow directly. Consider gradient accumulation across micro-batches.
Symptom · 04
Training diverges (loss increases sharply)
Fix
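Check for exploding gradients, for example a gradient norm more than 10x the previous step's. Gradient clipping is your first fix. Adam often copes better than plain SGD because it rescales each update by a running estimate of gradient magnitude.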
★ Quick Debugging Cheat Sheet for Backprop Issues
Use these commands and actions when training fails. Commands assume PyTorch; adapt for TensorFlow.
Loss stuck at high value
Immediate action
Print gradient norms across layers.
Commands
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f'{name}: {param.grad.norm().item():.6f}')
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Fix now
Add gradient clipping immediately and reduce LR to 1e-4. If early layers have zero gradients, add batch norm or switch to ReLU.
NaN gradients appear
Immediate action
Check if loss is NaN before backward.
Commands
assert not torch.isnan(loss).any(), 'Loss is NaN'
torch.isnan(torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])).any()
Fix now
Add epsilon to your loss function (e.g., log(x + 1e-8)). Reduce learning rate drastically. As a test, swap ReLU for LeakyReLU to rule out dead neurons.
Training diverges slowly
Immediate action
Reduce learning rate by a factor of 10.
Commands
for g in optimizer.param_groups: g['lr'] *= 0.1
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
Fix now
Switch to the Adam optimizer with default settings. Add a learning rate scheduler (e.g., ReduceLROnPlateau), optionally preceded by a warmup phase.
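Gradient clipping by norm, which the cheat sheet reaches for repeatedly, rescales all gradients by one shared factor so their joint L2 norm stays under a threshold. The NumPy sketch below illustrates that idea; `clip_by_global_norm` is a hypothetical helper written for this example, not the PyTorch function itself.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two "parameter gradients" whose joint norm is sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"norm before: {norm_before:.4f}, after: {norm_after:.4f}")
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes. Clipping each value independently (as clip_grad_value_ does) can change the direction.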
Backpropagation Phases vs. Gradient Descent Roles
Concept | Direction | Goal
Forward Pass | Input → Output | Generate a prediction and calculate loss.
Backward Pass | Loss → Input | Calculate gradients using the chain rule.
Weight Update | N/A | Subtract the learning-rate-scaled gradient from each weight to reduce error.

Key takeaways

  1. Backpropagation is an efficient application of the chain rule over a computational graph.
  2. The 'Backward Pass' is the process of calculating the sensitivity of the loss to each parameter.
  3. Gradients are used by optimizers (like SGD or Adam) to nudge weights in the direction that minimizes error.
  4. Modern deep learning frameworks automate this, but understanding the manual process is critical for debugging architecture issues.
  5. Always monitor gradient norms: they tell you if your network is training or just guessing.
  6. Numerical stability is not optional: stable implementations save hours of debugging.

Common mistakes to avoid


Treating backpropagation as a 'black box' without understanding the partial derivatives

Symptom
When training fails with unexpected gradient behavior (e.g., no improvement), engineer cannot debug because they don't know what the gradients should look like.
Fix
Implement a 1-layer network from scratch once. Verify gradients manually against a finite-difference approximation. This builds the intuition needed to debug autograd issues.

Forgetting that backprop only calculates gradients; weight updates are optimizer's job

Symptom
Engineer implements weight update as w -= lr * grad manually, leading to confusion when momentum or Adam behavior isn't replicated.
Fix
Always use an optimizer (SGD, Adam). If you need custom update rules, inherit from torch.optim.Optimizer and override step().
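The division of labour can be shown on a toy problem: one function stands in for backprop and returns only a gradient, while a separate object owns the update rule and its state. `MomentumSGD` and `loss_gradient` below are hypothetical names for this sketch; the update follows the common momentum form (velocity = momentum * velocity + grad, then w -= lr * velocity).

```python
import numpy as np

def loss_gradient(w):
    """Stand-in for backprop: the gradient of the toy loss 0.5 * ||w||^2 is w itself."""
    return w

class MomentumSGD:
    """Hypothetical minimal optimizer: owns the update rule and its state."""

    def __init__(self, lr=0.1, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None

    def step(self, w, grad):
        if self.velocity is None:
            self.velocity = np.zeros_like(w)
        self.velocity = self.momentum * self.velocity + grad
        return w - self.lr * self.velocity

w = np.array([1.0, -2.0])
optimizer = MomentumSGD()
for _ in range(200):
    grad = loss_gradient(w)      # "backprop": only produces the gradient
    w = optimizer.step(w, grad)  # optimizer: decides how to use it
final_norm = np.linalg.norm(w)
print(f"final |w| = {final_norm:.2e}")
```

Swapping in a different optimizer changes only the `step` method and its state. The gradient computation is untouched, which is the separation the mistake above overlooks.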

Ignoring numerical stability—particularly vanishing/exploding gradients

Symptom
Loss plateaus early or becomes NaN after a few iterations. Changing learning rate doesn't help.
Fix
Monitor gradient norms per layer. Add gradient clipping, switch to ReLU activation, use He initialization, and incorporate batch normalization.

Not zeroing out gradients between batches

Symptom
Loss decreases erratically, sometimes jumping up. Gradients accumulate over time, causing inaccurate updates.
Fix
Always call optimizer.zero_grad() at the start of each training iteration. Alternatively use model.zero_grad() for manual control.

Assuming autograd handles all edge cases (NaN inputs, extreme values)

Symptom
Training fails with inf or NaN gradients, but the loss function seems fine at first glance.
Fix
Add input validation: clamp extreme values, check for NaNs in inputs and after each operation. Use torch.isnan/isfinite checks in the training loop.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain in detail how backpropagation computes gradients through a compu...
Q02 · SENIOR
What causes vanishing gradients in deep networks, and how do you diagnos...
Q03 · SENIOR
How does PyTorch's autograd implement backpropagation? Describe the forw...
Q01 of 03 · SENIOR

Explain in detail how backpropagation computes gradients through a computational graph. Include how the chain rule is applied and why caching intermediate values is necessary.

ANSWER
Backpropagation computes gradients of the loss with respect to each parameter by applying the chain rule in reverse over the computational graph. During the forward pass, we compute the loss and store intermediate activations (e.g., layer outputs). During the backward pass, we start from the loss and propagate gradients layer by layer: for each node, we compute the gradient of the loss with respect to that node's inputs by multiplying the incoming gradient (from later layers) with the local Jacobian of the node's operation. Caching intermediate values is essential because the backward pass needs the forward values themselves: the weight gradient at a layer uses that layer's input activation, and the activation derivative uses its output. Without caching, each layer would have to rerun the forward pass up to that point, turning a backward pass that is linear in depth into a quadratic one.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Does backpropagation work on non-differentiable functions?
02
What is the relationship between Backpropagation and Gradient Descent?
03
How do I implement backpropagation from scratch in Python?
04
Why does backpropagation require storing all intermediate activations?
05
Can I use backpropagation for models that are not neural networks?