Backpropagation — Why Your 10-Layer Network Stops Learning
Sigmoid derivatives vanish across deep layers, killing early gradients.
- Backpropagation computes gradients via chain rule over a computational graph
- One forward pass: compute predictions and loss
- One backward pass: propagate error derivatives layer by layer
- Autograd frameworks (PyTorch/TensorFlow) automate it but hide stability pitfalls
- Vanishing gradients kill early layers; monitor gradient norms
- Biggest mistake: assuming autograd handles precision; always verify with gradient checking
Imagine you're learning to throw darts. You throw one, it lands too far left. You think 'okay, I need to rotate my wrist slightly right.' You throw again — still off, but less so. Each throw, you trace back what went wrong and adjust just that part of your technique. Backpropagation is exactly that: a neural network throws a guess, measures how wrong it was, then traces the error backwards through every single decision it made — layer by layer — nudging each connection slightly so the next throw is better. That's it. The whole idea.
Every time you unlock your phone with your face, get an eerily accurate Netflix recommendation, or watch GPT-4 complete your sentence, backpropagation is the algorithm that made those models smart. It's the engine inside every gradient-based deep learning model ever trained. Without it, neural networks would be untrained noise generators, not intelligent systems. It's not an exaggeration to say backpropagation is the most important algorithm in modern AI.
The core problem backpropagation solves is credit assignment: when a network of thousands or millions of parameters makes a wrong prediction, which parameters are responsible, and by how much? Tweaking weights randomly is computationally hopeless. You need an efficient, mathematically principled way to propagate blame backwards through a computational graph, assigning each weight a gradient that tells you which direction to push it and how strongly. Backpropagation does this in a single backward pass using the chain rule of calculus. Estimating the same gradients by perturbing one weight at a time needs a forward pass per weight, roughly O(n²) work for n parameters; backpropagation gets all n gradients for about the cost of one extra pass, O(n).
By the end of this article you'll understand the chain rule derivation from scratch, implement forward and backward passes without any autograd framework, recognize the vanishing and exploding gradient problems and know exactly what causes them at the weight-initialization level, and understand what PyTorch's autograd is actually doing under the hood when you call loss.backward(). You'll also walk away with the kind of nuanced understanding that separates candidates who get ML engineering offers from those who don't.
What Is Backpropagation?
Backpropagation is essentially the application of the Chain Rule from calculus to a directed acyclic graph (DAG) of computations. In a neural network, we calculate the partial derivative of the loss function $L$ with respect to every weight $w$ in the network. Mathematically, for a single weight at layer $l$, the gradient is:
$$\frac{\partial L}{\partial w^{(l)}} = \frac{\partial L}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}} \cdot \frac{\partial z^{(l)}}{\partial w^{(l)}}$$
By caching activations during the forward pass and reusing each upstream derivative during the backward pass, we avoid redundant calculations, making deep learning computationally feasible.
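As a minimal sketch of the three-factor product above, assume a single sigmoid neuron with a squared-error loss. The function names and sample values are illustrative choices, not part of any framework. The analytic gradient is compared against a central finite difference, which is the gradient check recommended at the top of this article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x, y):
    """Forward pass: return the loss and the values the backward pass reuses."""
    z = w * x + b              # pre-activation z
    a = sigmoid(z)             # activation a
    loss = 0.5 * (a - y) ** 2  # squared-error loss (an assumption of this sketch)
    return loss, (x, a, y)

def backward(cache):
    """Backward pass: multiply the three chain-rule factors."""
    x, a, y = cache
    dL_da = a - y              # dL/da for the squared-error loss
    da_dz = a * (1.0 - a)      # sigmoid'(z), written in terms of the cached a
    dz_dw = x                  # dz/dw
    dL_dz = dL_da * da_dz
    return dL_dz * dz_dw, dL_dz  # (dL/dw, dL/db); dz/db is 1

def numerical_grad_w(w, b, x, y, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    loss_plus, _ = forward(w + eps, b, x, y)
    loss_minus, _ = forward(w - eps, b, x, y)
    return (loss_plus - loss_minus) / (2 * eps)

w, b, x, y = 0.5, -0.1, 2.0, 1.0
loss, cache = forward(w, b, x, y)
grad_w, grad_b = backward(cache)
check_w = numerical_grad_w(w, b, x, y)
print(f"analytic dL/dw = {grad_w:.8f}, numerical = {check_w:.8f}")
```

The two numbers should agree to several decimal places. A large gap usually points to a mistake in one of the local derivatives.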
The Chain Rule in Depth: From Single Neuron to Multilayer Networks
When you stack layers, the chain-rule product gets longer. For a two-layer network with hidden pre-activation $z_1 = W_1 x + b_1$, hidden layer $h = f(z_1)$, and output $\hat{y} = f(W_2 h + b_2)$, the gradient of loss $L$ with respect to $W_1$ is:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$$
Each new layer multiplies in an extra derivative. The key insight: we compute these from output back to input, reusing intermediate values. In code, this means storing activations ($h$, $\hat{y}$) during the forward pass. Let's extend our earlier example to two layers.
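A minimal NumPy sketch of the two-layer case follows. It assumes $f$ is the sigmoid and $L$ is a squared-error loss, neither of which the formula above fixes. The layer sizes, random seed, and names such as `delta1` and `delta2` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x, y):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = sigmoid(z1)                 # hidden activation, cached for backward
    z2 = W2 @ h + b2
    y_hat = sigmoid(z2)             # output activation, cached for backward
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    return loss, (x, h, y_hat)

def backward(params, cache, y):
    W1, b1, W2, b2 = params
    x, h, y_hat = cache
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)  # dL/dz2
    dW2 = np.outer(delta2, h)
    db2 = delta2
    dL_dh = W2.T @ delta2                         # error pushed back to h
    delta1 = dL_dh * h * (1.0 - h)                # dL/dz1
    dW1 = np.outer(delta1, x)                     # dz1/dW1 contributes x
    db1 = delta1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = np.array([1.0, 0.0])
params = [rng.normal(size=(4, 3)), np.zeros(4),
          rng.normal(size=(2, 4)), np.zeros(2)]

loss, cache = forward(params, x, y)
dW1, db1, dW2, db2 = backward(params, cache, y)

# Gradient check on a single entry of W1 with a central finite difference.
eps = 1e-6
W1_plus = params[0].copy()
W1_plus[0, 0] += eps
W1_minus = params[0].copy()
W1_minus[0, 0] -= eps
loss_plus, _ = forward([W1_plus] + params[1:], x, y)
loss_minus, _ = forward([W1_minus] + params[1:], x, y)
numeric = (loss_plus - loss_minus) / (2 * eps)
print(f"analytic = {dW1[0, 0]:.8f}, numerical = {numeric:.8f}")
```

Note that `delta2` is computed once and reused for `dW2`, `db2`, and `dL_dh`. That reuse is what keeps the backward pass about as cheap as the forward pass.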
Vanishing and Exploding Gradients: The Deep Network Killer
As depth increases, gradients can either shrink to zero (vanishing) or grow exponentially (exploding). Vanishing happens when activation derivatives are small (sigmoid max 0.25, tanh max 1.0 but saturates). With 10 layers using sigmoid, the gradient can shrink to $0.25^{10} \approx 9.5 \times 10^{-7}$. Exploding occurs with poor weight initialization – large weights compound multiplicatively.
This isn't just theory: it is a major reason very deep networks were hard to train before ReLU, batch norm, and He initialization became standard. The 2015 paper 'Delving Deep into Rectifiers' showed that proper initialization alone can make 30-layer networks trainable.
Let's simulate both conditions to see the effect.
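One way to sketch such a simulation in NumPy is to push a unit-norm error signal backwards through 10 fully connected layers and measure what is left when it reaches the first layer. The function name `backprop_signal_norm`, the width of 64, and the two weight scales are assumptions chosen for illustration, not measurements from a real model.

```python
import numpy as np

def backprop_signal_norm(activation, weight_std, depth=10, width=64, seed=0):
    """Norm of the error signal that reaches the first layer's input."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, weight_std, size=(width, width))
               for _ in range(depth)]

    # Forward pass: cache pre-activations, which the backward pass needs.
    a = rng.normal(size=width)
    pre_activations = []
    for W in weights:
        z = W @ a
        pre_activations.append(z)
        if activation == "sigmoid":
            a = 1.0 / (1.0 + np.exp(-z))
        else:  # "relu"
            a = np.maximum(z, 0.0)

    # Backward pass: start from a unit-norm stand-in for dL/d(output).
    grad = np.ones(width) / np.sqrt(width)
    for W, z in zip(reversed(weights), reversed(pre_activations)):
        if activation == "sigmoid":
            s = 1.0 / (1.0 + np.exp(-z))
            local = s * (1.0 - s)          # at most 0.25
        else:
            local = (z > 0).astype(float)  # 0 or 1
        grad = W.T @ (grad * local)
    return float(np.linalg.norm(grad))

width = 64
vanishing = backprop_signal_norm("sigmoid", weight_std=1.0 / np.sqrt(width))
exploding = backprop_signal_norm("relu", weight_std=2.0 * np.sqrt(2.0 / width))
print(f"sigmoid, modest weights:  {vanishing:.3e}")
print(f"relu, oversized weights:  {exploding:.3e}")
```

The second configuration uses weights twice as large as He initialization would pick for this width, so each layer roughly doubles the signal instead of preserving it.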
Numerical Stability: When Your Gradients Become NaN or Inf
Floating-point arithmetic has finite precision. During backprop, you'll often compute log, exp, or division operations that can produce NaNs or infinities. Common culprits:
- Cross-entropy loss: $L = -\log(\hat{y})$ when $\hat{y} = 0$ produces $\infty$
- Softmax: $e^{z_i}$ where $z_i$ is large (e.g., 1000) causes overflow
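- Sigmoid: in $1 / (1 + e^{-z})$ with $z \approx -1000$, the term $e^{-z}$ overflows and the output collapses to exactly 0, which then breaks any later $\log$
- Division by a near-zero value, such as a tiny gradient norm or variance, can produce Inf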
Production engineers learn to use numerically stable alternatives: the log-softmax function, adding epsilons to denominators, and gradient clipping. Let's see how PyTorch handles this and where it can still fail.
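The sketch below uses plain NumPy rather than PyTorch to show the idea behind two of those alternatives: a log-softmax that subtracts the maximum logit before exponentiating, and gradient clipping by global norm with an epsilon in the denominator. The function names are illustrative and are not the PyTorch API.

```python
import numpy as np

def naive_log_softmax(z):
    exp_z = np.exp(z)                  # overflows to inf for large logits
    return np.log(exp_z / exp_z.sum())

def stable_log_softmax(z):
    shifted = z - np.max(z)            # largest entry becomes 0, so exp() <= 1
    return shifted - np.log(np.sum(np.exp(shifted)))

def clip_by_global_norm(grads, max_norm, eps=1e-12):
    """Scale all gradients down together if their joint norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + eps))  # eps guards the division
    return [g * scale for g in grads]

logits = np.array([1000.0, 999.0, 0.0])

# Silence the expected overflow warnings so the failure is visible in the output.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = naive_log_softmax(logits)
stable = stable_log_softmax(logits)

print("naive: ", naive)    # contains nan
print("stable:", stable)   # all finite

clipped = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
print("clipped norm:", np.linalg.norm(clipped[0]))
```

Subtracting the maximum does not change the result mathematically, because the shift cancels between numerator and denominator of the softmax. It only keeps the intermediate values inside the representable range.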
Autograd: What Happens When You Call loss.backward()
PyTorch's autograd builds a computational graph on the fly. When you call loss.backward(), it traverses the graph in reverse, computing gradients using the chain rule. The graph is dynamic – it's rebuilt every forward pass. Under the hood:
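- The grad_fn node attached to each non-leaf tensor records the operation that produced it and references to its inputs
- Leaf tensors that require gradients have a .grad attribute that accumulates gradients across backward calls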
- After backward, the graph is freed by default (to save memory)
- Setting retain_graph=True keeps it for multiple backward calls
TensorFlow 2 follows a similar model: it runs eagerly by default and records operations on a tf.GradientTape, while tf.function traces the code into a static graph. The key difference: PyTorch's tape is imperative – you can put Python control flow inside the model. TensorFlow's graph mode compiles the whole graph first, which enables optimisations but makes debugging harder.
Let's inspect a simple autograd graph to see what's happening.
- Every operation is a frame on the tape; it stores the operation type and references to inputs.
- When you call backward(), it unwinds the tape from end to start, applying the chain rule.
- The tape is erased after one backward pass unless you set retain_graph=True.
- This design makes models with dynamic control flow possible – the tape captures exactly what happened.
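The tape model in the list above can be sketched in a few lines of plain Python. This is a toy written for illustration, not PyTorch's implementation: the `Tape` and `Value` classes and the `backward` function are hypothetical, and only scalar multiplication and addition are supported.

```python
class Tape:
    """Toy tape: one entry per operation, in the order it ran."""
    def __init__(self):
        self.entries = []   # (op_name, inputs, output, backward_fn)

class Value:
    def __init__(self, data, tape):
        self.data = data
        self.grad = 0.0     # accumulates across backward passes
        self.tape = tape

    def __mul__(self, other):
        out = Value(self.data * other.data, self.tape)
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        self.tape.entries.append(("mul", (self, other), out, backward_fn))
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, self.tape)
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        self.tape.entries.append(("add", (self, other), out, backward_fn))
        return out

def backward(loss, retain_graph=False):
    tape = loss.tape
    if not tape.entries:
        raise RuntimeError("tape is empty: erased by an earlier backward()")
    for _, _, out, _ in tape.entries:
        out.grad = 0.0      # intermediate results start each pass at zero
    loss.grad = 1.0         # dL/dL
    for _, _, _, backward_fn in reversed(tape.entries):
        backward_fn()       # unwind the tape from end to start
    if not retain_graph:
        tape.entries.clear()

tape = Tape()
w = Value(2.0, tape)
x = Value(3.0, tape)
y = w * x
if y.data > 5:              # ordinary Python branch; only the taken path is taped
    y = y * y               # y = (w * x) ** 2

backward(y, retain_graph=True)
grad_after_first = w.grad   # 2 * (w * x) * x
backward(y)                 # second pass adds to w.grad, then erases the tape
print(grad_after_first, w.grad)
```

Calling `backward(y)` a third time raises the `RuntimeError`, mirroring the erased-tape behaviour described above. The doubled value in `w.grad` after the second pass shows why accumulated gradients need to be reset between training steps.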
Call optimizer.zero_grad() explicitly at the start of each training step.
Vanishing Gradients in a 10-Layer Sigmoid Network
- Always monitor gradient norms per layer during training—they shouldn't differ by more than an order of magnitude.
- Sigmoid/tanh activations amplify vanishing problems beyond ~5 layers. ReLU variants are safer for deep networks.
- Weight initialization matters: He for ReLU, Xavier for sigmoid/tanh. The default in your framework may not match your activation.
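To make the monitoring advice concrete, here is a NumPy sketch that records the weight-gradient norm of every layer in a 10-layer network and reports the ratio between the largest and smallest. It compares sigmoid with Xavier initialization against ReLU with He initialization. The helper names, the width of 64, the squared-error loss, and the random target are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_std(kind, fan_in, fan_out):
    """Standard deviation for the normal variants of He and Xavier init."""
    if kind == "he":
        return np.sqrt(2.0 / fan_in)
    return np.sqrt(2.0 / (fan_in + fan_out))   # "xavier"

def layer_grad_norms(activation, init, depth=10, width=64, seed=0):
    rng = np.random.default_rng(seed)
    std = init_std(init, width, width)
    weights = [rng.normal(0.0, std, size=(width, width)) for _ in range(depth)]

    a = rng.normal(size=width)
    layer_inputs, pre_acts = [], []
    for W in weights:                      # forward pass with caching
        layer_inputs.append(a)
        z = W @ a
        pre_acts.append(z)
        a = sigmoid(z) if activation == "sigmoid" else np.maximum(z, 0.0)

    upstream = a - rng.normal(size=width)  # dL/da for squared error, random target
    norms = [0.0] * depth
    for l in reversed(range(depth)):       # backward pass, output to input
        z = pre_acts[l]
        if activation == "sigmoid":
            s = sigmoid(z)
            local = s * (1.0 - s)
        else:
            local = (z > 0).astype(float)
        delta = upstream * local
        norms[l] = float(np.linalg.norm(np.outer(delta, layer_inputs[l])))
        upstream = weights[l].T @ delta
    return norms

def spread(norms):
    """Ratio between the largest and smallest per-layer gradient norm."""
    return max(norms) / min(norms)

sigmoid_norms = layer_grad_norms("sigmoid", "xavier")
relu_norms = layer_grad_norms("relu", "he")
spread_sigmoid = spread(sigmoid_norms)
spread_relu = spread(relu_norms)
print(f"sigmoid + Xavier: first layer {sigmoid_norms[0]:.2e}, "
      f"last layer {sigmoid_norms[-1]:.2e}, spread {spread_sigmoid:.1e}")
print(f"relu + He:        first layer {relu_norms[0]:.2e}, "
      f"last layer {relu_norms[-1]:.2e}, spread {spread_relu:.1e}")
```

In a real training loop the same check would run on the framework's per-parameter gradients every few hundred steps. A spread of many orders of magnitude between the first and last layer is the warning sign.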
Inspect the gradient norm of every parameter in model.parameters(). Early layers with near-zero norms indicate vanishing gradients. Add batch norm or reduce depth.
Common mistakes to avoid
- Treating backpropagation as a 'black box' without understanding the partial derivatives.
- Forgetting that backprop only calculates gradients; weight updates are the optimizer's job, done in step().
- Ignoring numerical stability, particularly vanishing/exploding gradients.
- Not zeroing out gradients between batches. Call optimizer.zero_grad() at the start of each training iteration. Alternatively use model.zero_grad() for manual control.
- Assuming autograd handles all edge cases (NaN inputs, extreme values).
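The split between computing gradients and applying them, and the cost of skipping the reset, can be sketched without any framework. The `Param` class and the `backward`, `step`, and `zero_grad` functions below are hypothetical stand-ins that mimic the roles of the framework calls on a one-parameter model.

```python
class Param:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0     # accumulates until it is explicitly reset

def backward(param, x, y):
    """Backprop for loss = (param.value * x - y) ** 2: only fills in .grad."""
    param.grad += 2.0 * (param.value * x - y) * x

def step(param, lr):
    """The optimizer's job: use .grad to change the weight."""
    param.value -= lr * param.grad

def zero_grad(param):
    param.grad = 0.0

def train(zero_each_step, steps=50, lr=0.05):
    w = Param(0.0)
    x, y = 1.0, 3.0         # single training sample; the best weight is 3.0
    for _ in range(steps):
        if zero_each_step:
            zero_grad(w)    # start the step with a clean gradient
        backward(w, x, y)   # compute the gradient
        step(w, lr)         # apply the update
    return w.value

with_reset = train(zero_each_step=True)
without_reset = train(zero_each_step=False)
print(f"with zero_grad:    w = {with_reset:.4f}")
print(f"without zero_grad: w = {without_reset:.4f}")
```

In this toy setup the run that resets gradients settles close to 3.0, while the run that lets them pile up keeps overshooting and ends much farther from the target.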
Interview Questions on This Topic
Explain in detail how backpropagation computes gradients through a computational graph. Include how the chain rule is applied and why caching intermediate values is necessary.