Gradient Accumulation Bug — Silent Divergence in PyTorch
Model loss decreased but 30% of inputs produced NaN due to missing optimizer.zero_grad().
20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.
- Autograd is PyTorch's automatic differentiation engine that records operations on tensors and computes gradients via backpropagation
- Set requires_grad=True on a tensor to track all operations performed on it in a dynamic computational graph
- Call .backward() to compute gradients — PyTorch applies the Chain Rule automatically through the recorded graph
- Always call optimizer.zero_grad() before each backward pass — PyTorch accumulates gradients by default, which is useful for gradient accumulation but catastrophic if forgotten in standard training loops
- Wrap inference code in torch.inference_mode() for maximum performance — it is 10-20% faster than torch.no_grad() because it skips version counter tracking
- The dynamic graph (Define-by-Run) means Python if/else and loops inside your model just work — unlike static graph frameworks where conditional logic requires special ops
- Gradient checkpointing trades ~30% more compute for ~60% less memory — the standard tool when a model does not fit on one GPU
Think of training a neural network like finding the lowest point in a mountain range while blindfolded. You cannot see the terrain, but you can feel the slope under your feet. Backpropagation is the process of calculating exactly which direction is downhill — the gradient — so you can take a step in the right direction. Autograd is the system that measures that slope automatically.
Without Autograd, you would have to derive the gradient equations by hand for every operation in your network. For a network with dozens of layers and millions of parameters, that is months of calculus work that goes wrong the moment you change the architecture. Autograd does the entire calculation for you by recording every operation your data passed through on the way to the final error number, then playing that recording backward to figure out how much each parameter contributed to the error.
The 'dynamic' part — which distinguishes PyTorch from older frameworks — means Autograd records whatever actually happened during a specific forward pass, including any if/else branches or loops your model took. Each forward pass gets its own exact recording. The gradient computation is always correct for what just ran, not for some pre-compiled approximation of what might run.
Autograd is the engine that makes neural network training in PyTorch work. Understanding it at a mechanical level, not just as a black box, is what separates engineers who can debug training failures from engineers who restart the training run and hope for different results.
PyTorch uses a Define-by-Run approach where the computational graph is built dynamically during the forward pass. This means the graph structure changes with every iteration if your model contains conditional logic or variable-length sequences — something static graph frameworks handle awkwardly through specialized conditional operations that bear little resemblance to ordinary Python code.
The practical consequence: gradients are always correct for the exact computation that just ran, not a pre-compiled approximation. The trade-off is that the graph must be rebuilt each forward pass, which adds overhead that static frameworks avoid through upfront compilation. In practice, PyTorch's JIT compiler and torch.compile() (introduced in PyTorch 2.0) close most of that gap for production workloads.
This guide covers the full Autograd lifecycle: how the computational graph is built and destroyed, the most expensive mistakes in production training loops, custom gradient implementations for numerical stability, gradient checkpointing for large models, and the correct inference contexts for serving. By the end, you will be able to reason about why gradients are wrong before reaching for a debugger.
What Is Autograd and Backpropagation in PyTorch and Why Does It Exist?
Autograd exists because manual gradient derivation does not scale. A two-layer network with a sigmoid activation and cross-entropy loss already requires several lines of calculus to derive the gradients correctly. A 24-layer transformer with attention, layer norm, residual connections, and a mixture of activation functions would require weeks of derivation work — work that becomes incorrect the moment you change the architecture.
PyTorch's answer is automatic differentiation through a dynamic computational graph. Every operation you perform on a tensor with requires_grad=True gets recorded as a node in a directed acyclic graph (DAG). Each node stores references to its inputs and carries a gradient function — the local derivative of that specific operation. When you call .backward() on a scalar loss, PyTorch traverses this graph in reverse from the loss to every leaf tensor, multiplying local gradients together via the Chain Rule at each node. The result is the exact gradient of the loss with respect to every tracked parameter.
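As a minimal sketch of the paragraph above, the following traces one forward and backward pass by hand. The values of `w`, `b`, and `x` are arbitrary choices for illustration, picked so the Chain Rule arithmetic is easy to check.

```python
import torch

# Leaf tensors: requires_grad=True asks Autograd to track operations on them.
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0)  # plain input, not tracked

# Forward pass: each operation adds a node carrying its own gradient function.
y = w * x + b          # y = 7
loss = (y - 5.0) ** 2  # loss = 4

# Results of tracked operations carry a grad_fn; user-created leaves do not.
has_grad_fn = loss.grad_fn is not None
w_is_leaf = w.is_leaf and w.grad_fn is None

# Backward pass: traverse the graph in reverse, applying the Chain Rule.
loss.backward()

# By hand: dloss/dy = 2 * (y - 5) = 4, dy/dw = x = 2, dy/db = 1.
w_grad = w.grad.item()  # 4 * 2 = 8
b_grad = b.grad.item()  # 4 * 1 = 4
print(w_grad, b_grad)
```

The gradients land in `.grad` on the leaf tensors only; intermediate tensors such as `y` do not keep a `.grad` unless you ask for it.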
The Define-by-Run approach means the graph is built during execution, not beforehand. If your model has an if/else branch, PyTorch records whichever branch actually executed for that specific input. The gradient computation is then correct for that specific execution path. Static graph frameworks require you to express conditional logic using framework-specific operations that are compiled into a fixed graph — the gradient is correct for the compiled graph, which may differ from what you intended if the conditional logic is complex.
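To make the Define-by-Run behavior concrete, here is a small sketch with a data-dependent branch. The function `branchy` is a hypothetical toy, not a real model; the point is only that the gradient follows whichever branch actually ran.

```python
import torch

def branchy(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Ordinary Python control flow; the branch taken depends on the data.
    if x.sum() > 0:
        return (w * x).sum()       # gradient w.r.t. w is x
    return (w ** 2 * x).sum()      # gradient w.r.t. w is 2 * w * x

w = torch.tensor([1.0, 2.0], requires_grad=True)

pos = torch.tensor([1.0, 3.0])
branchy(pos, w).backward()
grad_pos = w.grad.clone()          # first branch: equals pos

w.grad = None                      # reset before the second pass
neg = torch.tensor([-1.0, -3.0])
branchy(neg, w).backward()
grad_neg = w.grad.clone()          # second branch: 2 * w * neg

print(grad_pos, grad_neg)
```

Each call builds a fresh graph, so the two backward passes differentiate two different functions without any framework-specific conditional operator.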
The practical trade-off: dynamic graph construction adds overhead on each forward pass that static frameworks avoid through upfront compilation. PyTorch 2.0's torch.compile() closes most of this gap for production workloads by compiling the dynamic graph into optimized static kernels while preserving the flexibility of Python-level control flow.
- Forward pass: execute operations and build the graph — each operation node stores its gradient function and holds references to its inputs
- Backward pass: start from the scalar loss and traverse the graph in reverse, calling each node's gradient function and multiplying by the upstream gradient via the Chain Rule
- The graph is rebuilt on every forward pass — it reflects the exact computation that just ran, including any Python branches that were taken
- requires_grad=True marks a tensor as a leaf node whose gradient we want accumulated in .grad after backward()
- A single .backward() call on the scalar loss computes gradients for every tracked parameter in the entire network. That is the efficiency that makes training practical
- Dynamic graph construction adds per-iteration overhead. torch.compile() in PyTorch 2.0+ recovers most of it for static sub-graphs within a dynamic model.
- When the default gradient is numerically unstable, write a custom backward() that computes the gradient in a more stable form. Verify with torch.autograd.gradcheck() before deploying.
- For inference, use torch.inference_mode() to disable graph construction entirely and save both memory and compute. Prefer it over no_grad() for production serving, since it also skips version counter tracking.
- When gradients look wrong, use torch.autograd.gradcheck() to numerically verify computed gradients against finite differences. If gradcheck passes, Autograd is correct and the bug is elsewhere.
The Computational Graph: Anatomy and Lifecycle
The computational graph is not an abstract concept — it is a concrete data structure that PyTorch builds in memory during your forward pass and destroys after your backward pass. Understanding its lifecycle is the prerequisite for debugging GPU memory leaks, gradient errors, and unexpected RuntimeError messages.
Construction phase: every operation on a tensor with requires_grad=True creates a Function object. This object stores references to its input tensors and implements the backward computation for that specific operation. The Function objects chain together through their next_functions pointers to form the graph. This construction happens eagerly, operation by operation, as your code executes.
Retention phase: the graph exists from the moment it is built during the forward pass until backward() completes. During this window, the graph holds references to every intermediate tensor needed for gradient computation. This is where GPU memory is held for all those intermediate activations. The longer the graph, the more memory it holds during retention.
Destruction phase: after backward() completes, PyTorch releases the graph by default. All Function objects are freed, and the intermediate tensors they held references to can be garbage collected. This is intentional — PyTorch assumes you only need to call backward() once per forward pass.
The three production mistakes that come from misunderstanding this lifecycle: calling backward() twice without retain_graph=True (the graph is gone after the first call), storing loss tensors in lists without .item() (holding a reference to the tensor holds the entire graph), and using retain_graph=True in a training loop without releasing it (the graph grows with each iteration).
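The lifecycle mistakes above can be reproduced in a few lines. This is a sketch on a toy tensor, assuming CPU execution; the variable names are illustrative only.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# The first backward succeeds and frees the tensors the graph saved.
loss = (x ** 2).sum()
loss.backward()

second_call_failed = False
try:
    loss.backward()  # the saved tensors are already gone
except RuntimeError:
    second_call_failed = True

# retain_graph=True on every call except the last keeps the graph
# alive exactly as long as it is needed.
x.grad = None
loss = (x ** 2).sum()
loss.backward(retain_graph=True)
loss.backward()                   # last call: graph is released here
grad_after_two = x.grad.clone()   # gradients accumulate: 2x + 2x

# Logging: .item() returns a Python float with no graph attached.
history = []
loss = (x ** 2).sum()
history.append(loss.item())
```

Appending `loss` itself instead of `loss.item()` would keep a reference to the tensor, and through its `grad_fn`, to the whole graph of that iteration.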
- Use retain_graph=True only when you need to call backward() multiple times on the same graph, for example when computing gradients of multiple scalar outputs from a multi-task model separately. Even then, pass it on all but the final backward() call so the graph is released after the last pass. Never set it as a default in a training loop to 'prevent errors' without understanding what it does.
- The graph is destroyed after backward() by default, and this is correct and intentional. PyTorch engineers designed it this way because almost every training loop only calls backward() once per forward pass, and destroying the graph immediately is the safest memory management strategy. retain_graph=True is a tool for specific multi-output scenarios, not a general-purpose option.
- Memory that grows across iterations usually traces back to a graph that is being kept alive: loss tensors stored without .item(), evaluation code running without no_grad(), or retain_graph=True left in from debugging.
- The graph has three phases: construction during the forward pass, retention until backward() completes, and destruction to free memory.
Common Mistakes and How to Avoid Them
Most Autograd bugs in production follow one of five patterns. They appear across teams, seniority levels, and model types — which means they are not careless mistakes but rather counterintuitive behaviors that are easy to get wrong even when you know the framework well.
1. Forgetting optimizer.zero_grad(). PyTorch accumulates gradients by default: .backward() adds to .grad rather than replacing it. In a standard training loop where you want gradients from only the current batch, failing to zero gradients causes every batch's gradients to stack. After 50 batches, the optimizer is responding to the sum of all 50 batches' gradients, which is dominated by stale gradients rather than the current batch's signal. The effective step size grows with every batch and the weight updates overshoot. This is the bug from the production incident section: it produced stable-looking training loss while silently corrupting the model.
2. In-place operations on tracked tensors. PyTorch's version counter system detects when a tensor is modified in-place after it was recorded in the graph. If the modified value is needed for a gradient computation, the version counter mismatch raises RuntimeError. Worse, some in-place operations in specific positions in the graph corrupt gradients silently rather than raising an error — the gradients are computed for the unmodified tensor value while the forward pass used the modified value.
3. Failing to detach tensors when logging. This is the most common GPU memory leak in training loops. Storing loss.item() is correct. Storing loss without .item() holds a reference to the entire computational graph — because loss is a tensor with a grad_fn that chains back through the entire forward pass. Every batch's graph accumulates in memory until OOM.
4. Accidentally wrapping training in torch.no_grad(). This is the silent learning failure. If the entire forward pass runs under no_grad(), loss.backward() raises a RuntimeError because the loss has no grad_fn, so that case is loud. The silent case is when only part of the forward pass is wrapped: the loss is computed, backward() runs, and the parameters inside the wrapped region receive no gradient and never update. Loss stays constant or varies only from batch-to-batch data differences. It looks like a learning rate that is too low, or a model that has converged immediately. The tell is that gradient norms for those parameters are zero or .grad is None. Add gradient norm logging and this becomes visible in the first epoch.
5. Calling backward() on a non-scalar without a gradient argument. backward() is designed for scalar outputs. If your loss is a vector (common when you forget to average across the batch), PyTorch raises RuntimeError immediately. The fix is either averaging the loss — loss.mean().backward() — or passing an explicit gradient tensor matching the loss shape.
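The accumulation behavior in pattern 1 is easy to measure in isolation. The sketch below holds the weights fixed (no optimizer step) and reuses one batch, which are simplifying assumptions so that the only thing changing is whether gradients are zeroed; `grad_norm_after` is a hypothetical helper.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
target = torch.randn(8, 1)

def grad_norm_after(steps: int, zero_each_step: bool) -> float:
    """Run `steps` backward passes on one batch; return the weight-gradient norm."""
    model.zero_grad()
    for _ in range(steps):
        if zero_each_step:
            model.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), target)
        loss.backward()  # adds into .grad rather than replacing it
    return model.weight.grad.norm().item()

single = grad_norm_after(1, zero_each_step=True)
with_zeroing = grad_norm_after(5, zero_each_step=True)
without_zeroing = grad_norm_after(5, zero_each_step=False)

print(single, with_zeroing, without_zeroing)
```

With fixed weights the un-zeroed norm is five times the single-step norm. In a real loop the optimizer also moves the weights between steps, so the stale sum and the growing updates feed each other.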
Silent learning failures occur when torch.no_grad() or torch.inference_mode() accidentally wraps the training loop, when in-place operations silently corrupt gradients without raising an error, or when requires_grad is False on parameters that should be updated.
The tell for all of these is gradient norms that are zero. Add gradient norm monitoring to every training loop: it is a one-line addition and it makes an entire class of silent failures immediately visible. A training run with zero gradient norms is not training. Full stop.
- Forgetting optimizer.zero_grad() causes gradient accumulation that grows with every training step. The model appears to train for dozens of epochs before diverging, which is exactly long enough to waste serious compute budget before anyone investigates.
- Wrapping training code in torch.no_grad() means weights never update. This is the most common mistake made during debugging: someone adds no_grad() to test inference behavior, forgets to remove it, and the training run produces a model that is identical to its initialization.
- Always call optimizer.zero_grad() before the forward pass. PyTorch accumulates gradients by default and the accumulated magnitude keeps growing.
- Clip gradients with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).
- If gradient norms are zero: check for no_grad() wrapping, in-place ops on leaf tensors, or .numpy() calls breaking the graph. If non-zero but collapsing: investigate the loss function for insufficient discriminative signal.
- If GPU memory grows: store scalar losses as Python floats (loss.item()). Replace tensor_list.append(tensor) with tensor_list.append(tensor.detach().cpu()). Check for retain_graph=True in any loop.
Custom Gradients with torch.autograd.Function
PyTorch's Autograd is correct for most operations, but correct and numerically stable are different properties. For operations involving logarithms near zero, exponentials of large values, or divisions by small numbers, the mathematically correct gradient can overflow float32 before it is computed. This is where torch.autograd.Function becomes essential.
The Function class exposes two static methods. forward() computes the output given the inputs and has access to ctx — a context object that can store tensors for use in backward() via ctx.save_for_backward(). backward() receives grad_output — the upstream gradient from the loss — and returns one gradient per input to forward(), or None for inputs that do not need gradients (like integer indices or dimension arguments).
The production use case that comes up most frequently: log-sum-exp and softmax operations. The standard computation exp(x) / sum(exp(x)) overflows in float32 for logits above approximately 88. The stable version subtracts the maximum before exponentiating: exp(x - max) / sum(exp(x - max)). The gradient of the stable version is the same mathematically but avoids the overflow in both forward and backward.
A second important use case: straight-through estimators for quantization. During the forward pass, you round continuous values to discrete bins. This operation has a gradient of zero everywhere (mathematically) which means no signal propagates back. The straight-through estimator uses a custom backward() that returns the upstream gradient unmodified — treating the quantization as an identity function for gradient purposes. This allows training of quantized models.
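A straight-through estimator of the kind described above can be sketched as follows. `RoundSTE` is a hypothetical name, and real quantization schemes usually add scaling and clamping that are left out here.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in forward; pass the upstream gradient through unchanged in backward."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Treat rounding as the identity function for gradient purposes.
        return grad_output

x = torch.tensor([0.2, 1.7, -0.6], requires_grad=True)

# Plain rounding: the gradient is zero everywhere, so no signal propagates.
torch.round(x).sum().backward()
plain_grad = x.grad.clone()

# Straight-through version: the upstream gradient reaches x unmodified.
x.grad = None
y = RoundSTE.apply(x)
upstream = torch.tensor([1.0, 2.0, 3.0])
(y * upstream).sum().backward()
ste_grad = x.grad.clone()

print(plain_grad, ste_grad)
```

Note that torch.autograd.gradcheck() is expected to reject this Function, because the surrogate gradient deliberately disagrees with the finite-difference gradient of rounding.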
Before deploying any custom Function, always verify it with torch.autograd.gradcheck(). This function numerically estimates gradients using finite differences and compares them against your custom backward() output. A gradcheck pass does not guarantee your gradient is optimal, but it confirms it is correct.
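As an illustration of the stable log-sum-exp case together with the gradcheck step, here is a minimal sketch. `StableLogSumExp` is a hypothetical class written for this example; PyTorch already ships torch.logsumexp, which is what you would normally use.

```python
import torch

class StableLogSumExp(torch.autograd.Function):
    """log(sum(exp(x))) over the last dimension, subtracting the max first."""

    @staticmethod
    def forward(ctx, x):
        m = x.max(dim=-1, keepdim=True).values
        shifted_exp = torch.exp(x - m)               # every value is in (0, 1]
        total = shifted_exp.sum(dim=-1, keepdim=True)
        ctx.save_for_backward(shifted_exp / total)   # this is softmax(x)
        return (m + torch.log(total)).squeeze(-1)

    @staticmethod
    def backward(ctx, grad_output):
        (softmax,) = ctx.saved_tensors
        # d/dx logsumexp(x) = softmax(x), scaled by the upstream gradient.
        return grad_output.unsqueeze(-1) * softmax

# Verify the hand-written backward against finite differences in float64.
torch.manual_seed(0)
x64 = torch.randn(3, 5, dtype=torch.float64, requires_grad=True)
gradcheck_ok = torch.autograd.gradcheck(StableLogSumExp.apply, (x64,))

# Large logits: the naive float32 formula overflows, the shifted one does not.
big = torch.tensor([[1000.0, 1000.0]])
naive = torch.log(torch.exp(big).sum(dim=-1))
stable = StableLogSumExp.apply(big)
expected = 1000.0 + torch.log(torch.tensor([2.0]))

print(gradcheck_ok, naive, stable)
```

Saving the softmax in forward() trades a little memory for not recomputing the exponentials in backward(); recomputing them instead would be the memory-saving variant.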
- Numerical stability: when the Chain Rule gradient overflows or underflows for your specific operation — log-sum-exp, softmax with large logits, division by values that can approach zero
- Non-differentiable operations: when you need to define a surrogate gradient for operations that are mathematically non-differentiable — quantization, thresholding, straight-through estimators for binary neural networks
- Performance: when Autograd's generic gradient is significantly slower than a hand-optimized kernel — fused operations that combine multiple steps can be faster than the generic backward graph
- Memory optimization: when you want to recompute intermediates in backward() rather than storing them in forward(), a form of manual gradient checkpointing for specific operations
- Always verify with torch.autograd.gradcheck() using float64 inputs, because float32 is not precise enough for finite difference verification to be reliable
- A custom backward() that is mathematically wrong is significantly harder to debug than a failing test: the model will train, loss will decrease, and incorrect gradients will be invisible until you compare against a numerically estimated baseline.
- Use ctx.save_for_backward() to pass tensors from forward() to backward(). Storing them as instance attributes will cause reference counting issues.
- Verify every custom Function with torch.autograd.gradcheck() using float64 inputs before deploying. A wrong custom gradient trains silently and surfaces only in model quality metrics.
Memory Optimization: Gradient Checkpointing and Efficient Autograd
The computational graph stores intermediate activations from every operation in the forward pass — because backward() needs them to compute gradients. For a 24-layer transformer with a 512 sequence length and batch size 32, this can easily consume 30 to 40GB of GPU memory just for the graph's intermediate activations, independent of the model weights themselves.
Gradient checkpointing is the standard solution. Rather than storing all intermediates during the forward pass, torch.utils.checkpoint discards them and recomputes them during the backward pass from the nearest checkpoint boundary. The memory saved is proportional to how much of the graph is checkpointed. The compute cost is approximately 30% additional forward pass operations because each checkpointed segment runs twice — once during forward and once during backward reconstruction.
The 30% compute overhead sounds significant, but consider the alternative: if your model does not fit on one GPU without checkpointing, you either add GPUs (expensive and complex) or reduce batch size (hurts convergence). In most cases, 30% more compute on the same hardware is the correct trade-off.
The practical checkpointing strategy: checkpoint expensive blocks and skip cheap ones. Transformer attention blocks and feed-forward networks are the expensive ones — they dominate both memory and compute. Embedding lookups, layer norms, and linear projections at the input/output are cheap to store and expensive to recompute relative to the memory they consume. Checkpoint the transformer blocks, leave the rest.
torch.utils.checkpoint takes a use_reentrant argument, and use_reentrant=False selects the newer non-reentrant implementation, which is the variant the PyTorch documentation recommends. The older reentrant implementation has edge cases with double backward (gradient of gradients) and with custom autograd functions. Pass use_reentrant=False explicitly on PyTorch 2.0+.
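A minimal sketch of the checkpointing pattern, assuming a small MLP as a stand-in for an expensive transformer block (the layer sizes and the names `block` and `head` are illustrative). It checks that the checkpointed path produces the same gradients as the plain path.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical stand-in for an expensive transformer block.
block = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 16),
)
head = torch.nn.Linear(16, 1)   # cheap layer, left un-checkpointed
x = torch.randn(4, 16)

# Baseline: all intermediate activations are stored for backward.
loss_plain = head(block(x)).sum()
loss_plain.backward()
plain_grad = block[0].weight.grad.clone()

block.zero_grad()
head.zero_grad()

# Checkpointed: activations inside `block` are dropped after forward
# and recomputed when backward reaches this segment.
hidden = checkpoint(block, x, use_reentrant=False)
loss_ckpt = head(hidden).sum()
loss_ckpt.backward()
ckpt_grad = block[0].weight.grad.clone()

print(torch.allclose(plain_grad, ckpt_grad))
```

On a GPU you would wrap each path with torch.cuda.reset_peak_memory_stats() and torch.cuda.max_memory_allocated() to see the actual saving; at this toy size the difference is negligible.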
- Checkpoint transformer attention blocks and FFN layers — they dominate both memory and compute, making the recomputation trade-off favorable
- Do not checkpoint embedding layers or final projection layers — they are cheap to store and expensive to recompute relative to their memory footprint
- Always use use_reentrant=False on PyTorch 2.0+ — the old default has documented edge cases with double backward and custom autograd functions
- Expect approximately 30% more compute time for approximately 60% less peak memory — this trade-off is almost always worth it when the alternative is adding GPU hardware
- Do not checkpoint operations with side effects — print statements, logging calls, and assertion checks inside a checkpointed block will execute twice during backward
- Measure with torch.cuda.max_memory_allocated() before and after to verify actual savings, because the theoretical reduction and the measured reduction can differ based on where you place checkpoint boundaries
torch.no_grad(), .detach(), and Inference Mode
When you do not need gradients — during inference, validation, or metric computation — explicitly disabling Autograd is not optional. Leaving gradient tracking enabled during inference wastes memory (the entire computational graph is built for every forward pass) and compute (version counters are incremented for every operation). At production serving scale, this overhead compounds into meaningful latency and cost.
PyTorch provides three mechanisms with different performance characteristics and different safety guarantees.
torch.no_grad() disables gradient computation — the backward engine will not run for operations inside the context. However, it still increments version counters on tensor operations, which is the mechanism PyTorch uses to detect in-place modifications. This makes no_grad() safe to use during training-adjacent code like gradient clipping where you might mix tracked and untracked operations.
torch.inference_mode() (introduced in PyTorch 1.9) goes further. It disables both gradient computation and version counter tracking. Tensors created inside inference_mode() are permanently marked as non-version-tracked — you cannot call .backward() on them even after exiting the context. The performance gain over no_grad() is 10 to 20% for typical model inference, which compounds significantly at scale.
.detach() operates on a single tensor rather than a context. It creates a new tensor that shares the same storage (no copy) but is not connected to the computational graph. Use it when you need to pass a specific tensor to a non-PyTorch library, log a tensor value, or break a gradient flow path in the middle of a model. Unlike no_grad(), detach() works at tensor granularity rather than context granularity.
The production recommendation is clear: use inference_mode() for serving, no_grad() for validation loops during training, and .detach() for individual tensor operations that need to escape the graph.
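The differences between the three mechanisms show up directly on the tensors they produce. The following is a small sketch on a toy Linear layer; the variable names are illustrative.

```python
import torch

model = torch.nn.Linear(3, 2)
x = torch.randn(4, 3)

tracked = model(x)                  # normal mode: the graph is recorded
with torch.no_grad():
    untracked = model(x)            # no graph is recorded
with torch.inference_mode():
    inferred = model(x)             # no graph, no version counter
detached = tracked.detach()         # same storage, cut from the graph

flags = {
    "tracked": tracked.requires_grad,
    "no_grad": untracked.requires_grad,
    "inference": inferred.requires_grad,
    "detached": detached.requires_grad,
}
is_inference_tensor = inferred.is_inference()
shares_storage = detached.data_ptr() == tracked.data_ptr()

# Inference tensors cannot take part in a later backward pass: the error
# surfaces as soon as Autograd tries to save one for gradient computation.
w = torch.ones(2, requires_grad=True)
try:
    (inferred * w).sum().backward()
    inference_backward_failed = False
except RuntimeError:
    inference_backward_failed = True

print(flags, is_inference_tensor, shares_storage, inference_backward_failed)
```

If an inference-mode output does need to re-enter training code, cloning it outside the context gives a normal tensor again.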
- torch.no_grad(): disables gradient computation, keeps version tracking — use for validation during training where you mix tracked and untracked operations in the same scope
- torch.inference_mode(): disables gradients + version tracking — use for production serving and evaluation where no backward pass will ever be needed
- .detach(): breaks graph reference for one specific tensor — use for logging, visualization, or stop-gradient operations within a model
- .item(): extracts the scalar value as a Python float — use for logging scalar losses and metrics. No tensor, no graph, no ambiguity.
- Never use inference_mode() on tensors you plan to backpropagate through: they are permanently marked as non-tracked and backward() will fail
- inference_mode() can be measurably faster than no_grad() depending on model architecture and hardware. For a model serving 1000 requests per second, that is potentially hundreds of milliseconds of latency improvement per second of wall clock time, which is meaningful at production scale.
- The version counter overhead that no_grad() does not eliminate is real and measurable. Every tensor operation inside no_grad() still increments an integer counter. For models with hundreds of operations per forward pass and thousands of requests per second, this overhead adds up.
- Production serving uses inference_mode(). Validation during training uses no_grad(). Individual tensor operations that need to escape the graph use .detach(). These are not interchangeable: pick the right tool for each context.
- Use torch.inference_mode() for production inference, not no_grad(). It disables both gradient computation and version tracking, providing roughly 10 to 20% faster execution for serving workloads.
- Use torch.no_grad() for validation during training when you need to mix tracked and untracked operations in adjacent code.
Gradient Flow Control: When to Freeze Subgraphs with requires_grad
Most engineers treat requires_grad as a binary toggle. That's naive. In production pipelines, you often need fine-grained control over which branches of the computational graph contribute gradients. Freezing subgraphs isn't just about preventing updates — it's about memory and correctness.
Consider a multi-task architecture where a shared encoder feeds two task heads. During adversarial training or fine-tuning, you may want to backprop through one head while blocking gradients from the other. Setting requires_grad=False on a tensor doesn't just stop gradient accumulation; it prunes that entire subgraph from the DAG during the backward pass. This saves compute and avoids unintended parameter updates.
The pitfall? Engineers often forget that requires_grad propagates lazily. If you set it False on a head's parameters after creating the graph, the DAG already recorded operations. You must detach() the intermediate tensors before they enter the frozen path. Otherwise, PyTorch silently computes gradients you'll never use, wasting VRAM.
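The difference between detaching an input and freezing parameters can be sketched with a shared encoder and two heads. The module names and sizes below are hypothetical.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(4, 8)   # hypothetical shared encoder
head_a = torch.nn.Linear(8, 1)
head_b = torch.nn.Linear(8, 1)
x = torch.randn(5, 4)

features = encoder(x)

# Head A trains the encoder. Head B only sees a detached copy,
# so its loss cannot send gradients back into the encoder.
loss_a = head_a(features).mean()
loss_b = head_b(features.detach()).mean()
graph_cut = features.detach().grad_fn is None

loss_b.backward()
encoder_untouched = encoder.weight.grad is None    # nothing reached the encoder
head_b_trained = head_b.weight.grad is not None

loss_a.backward()
encoder_trained = encoder.weight.grad is not None

# Freezing parameters stops them from receiving gradients...
for p in head_a.parameters():
    p.requires_grad_(False)
frozen_out = head_a(encoder(x))
# ...but the graph is still built, because the encoder output is tracked.
still_tracked = frozen_out.grad_fn is not None
```

Checking `grad_fn` on the tensor that enters the frozen path is the quickest way to confirm which of the two behaviors you actually have.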
- Freezing parameters with requires_grad=False does not by itself stop graph nodes from being built for the operations that consume tracked inputs. Always detach() the tensor feeding into the frozen subgraph. Check with tensor.grad_fn: if it is not None, the graph was not cut.
- requires_grad=False is not a substitute for detach(). One controls updates, the other controls graph construction.
Retain Graph vs. Backward Multiple Times: When and Why It Breaks
The default backward() releases the computational graph after one pass. This is memory-efficient but breaks multi-loss scenarios. Engineers often set retain_graph=True blindly, treating it like a magic switch. That leaks memory fast.
You should only retain the graph when you genuinely need multiple backward passes from the same intermediate node — for example, training GANs where discriminator and generator losses share a common encoder, or in multi-objective reinforcement learning. Each backward pass consumes gradient computation but does NOT reset intermediate buffers. If you call backward() 10 times with retain_graph=True, you hold the entire graph's activations in memory 10x longer.
The correct pattern: compute all losses on the same graph, sum them, then call backward() once. If you must split the backward passes, pass retain_graph=True on every call except the last, so the final call releases the graph. Another approach: use torch.autograd.grad() to extract gradients without accumulating into .grad, then backward() for the final parameter update.
The worst anti-pattern? Calling backward() with retain_graph=True inside a training loop iteration and forgetting to set it False after. You'll see memory creep — then an OOM crash mid-epoch.
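The three options above (one combined backward, split backward with the graph retained only until the last call, and torch.autograd.grad()) can be compared on a toy graph. The helper `losses` is hypothetical and stands in for a model with two objectives sharing an intermediate.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

def losses(x):
    shared = x ** 2                      # intermediate used by both losses
    return shared.sum(), (shared ** 2).sum()

# Option 1: sum the losses and call backward() once.
l1, l2 = losses(x)
(l1 + l2).backward()
combined = x.grad.clone()

# Option 2: split passes, retaining the graph on all but the last call.
x.grad = None
l1, l2 = losses(x)
l1.backward(retain_graph=True)
l2.backward()                            # releases the graph
split = x.grad.clone()

# Option 3: torch.autograd.grad returns gradients without touching .grad.
x.grad = None
l1, l2 = losses(x)
(g1,) = torch.autograd.grad(l1, x, retain_graph=True)
x_grad_untouched = x.grad is None
l2.backward()
g2 = x.grad.clone()

print(combined, split, g1, g2)
```

Options 1 and 2 produce the same accumulated gradient; option 3 keeps the two contributions separate, which is what you want when one of them is only needed for inspection or for a penalty term.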
Call torch.cuda.memory_summary() before and after each backward. If retain_graph=True is used, monitor the 'allocated_bytes.current' field. A persistent increase across iterations indicates you're leaking graph references.
Training Silently Diverged After Gradient Accumulation Bug
The training loop was missing optimizer.zero_grad(). PyTorch accumulates gradients by default: every backward() call adds to existing gradient values in the .grad attribute rather than replacing them. Over 50 epochs of accumulation, gradient magnitudes grew without bound. The optimizer was applying increasingly large weight updates in directions determined by the sum of 50 epochs of gradients rather than the current batch. Weights grew until intermediate activations overflowed to floating-point infinity, which propagated to NaN in subsequent operations. Training loss appeared healthy because the optimizer's massive corrections happened to reduce the accumulated loss signal, masking the divergence until inference ran on clean production data without the distortion.
The fix added optimizer.zero_grad() at the start of every training step, before the forward pass, not after. It also added gradient clipping using torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) as a permanent safety net, NaN detection in the validation loop using torch.isnan(val_loss).any() to catch divergence within the same epoch it occurs, and gradient norm logging to the monitoring dashboard. A spike in gradient norms is the earliest signal of accumulation bugs or learning rate issues, appearing several epochs before loss divergence becomes visible.
- PyTorch accumulates gradients by default. Always call optimizer.zero_grad() before the forward pass, not after the backward pass, because the position relative to the forward matters for gradient accumulation workflows
- Monitor gradient norms during training. The norm should be stable within an order of magnitude. Sudden spikes indicate accumulation bugs, learning rate issues, or outlier batches before the loss shows any sign of divergence
- Add NaN detection in validation loops — training loss can appear healthy while the model silently diverges because accumulated gradients produce weight corrections that happen to reduce the loss signal even as weights grow unbounded
- Gradient clipping is not optional for production training — treat max_norm=1.0 as a default starting point and tune from there. It prevents a single outlier batch from causing catastrophic weight updates that take multiple epochs to detect
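The four safeguards from the incident (zeroing, clipping, gradient norm logging, NaN detection) fit in one short loop. This is a sketch on synthetic data with a toy model; `train_step` and `validate` are hypothetical helpers, and lr=0.1 and max_norm=1.0 are starting points rather than recommendations.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 4), torch.randn(32, 1)   # synthetic batch

def train_step(batch_x, batch_y, max_norm=1.0):
    optimizer.zero_grad()                        # clear the previous step's gradients
    loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)
    loss.backward()
    # clip_grad_norm_ returns the total norm measured before clipping,
    # which is the value worth logging.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item(), grad_norm.item()

@torch.no_grad()
def validate(batch_x, batch_y):
    val_loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)
    if torch.isnan(val_loss).any():
        raise RuntimeError("NaN in validation loss: training has diverged")
    return val_loss.item()

history = [train_step(x, y) for _ in range(20)]  # (loss, grad_norm) per step
final_val = validate(x, y)
print(history[0], history[-1], final_val)
```

In a real run the `(loss, grad_norm)` pairs would go to the monitoring dashboard, with an alert on the norm jumping by more than an order of magnitude.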
- To inspect gradient flow, print per-parameter gradient norms: print([(name, p.grad.norm().item()) for name, p in model.named_parameters() if p.grad is not None]). If gradients are zero or near-zero, check for dead ReLU neurons caused by large negative biases, weights initialized to zero, or in-place operations that broke the graph. If gradients are non-zero but the model still collapses, suspect a loss function that does not create sufficient gradient signal to discriminate between inputs.
- If GPU memory grows during training, store scalar losses as Python floats (loss.item()). Run torch.cuda.memory_summary() to see which allocation is growing. Also check for retain_graph=True in a loop, since it keeps every graph alive.
- If parameters are not updating, check for: (1) torch.no_grad() or torch.inference_mode() wrapping the training loop accidentally, (2) in-place operations on leaf tensors that break the graph silently, (3) converting a tracked tensor to NumPy, which detaches it from the graph, (4) a disconnected computation path where the loss does not actually depend on the parameters. Use model.named_parameters() to confirm requires_grad is True for all trainable parameters.
- If the backward pass is slow, use torch.profiler.profile() to identify which backward operation is the bottleneck. Common causes: naive attention implementations that materialize the full N×N attention matrix in memory, operations that create large intermediate tensors in backward but not forward, or missing gradient checkpointing on deep models where backward must traverse a very long graph. For transformers, verify you are using memory-efficient attention kernels (Flash Attention or scaled_dot_product_attention in PyTorch 2.0+).
- To locate the operation that produced a NaN or a broken gradient, enable torch.autograd.set_detect_anomaly(True) and print the per-parameter gradient norms as above.
- As an immediate mitigation, clip gradients with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). This stops the symptom; then find and fix the root cause with anomaly detection output.
Key takeaways
- The computational graph is built during the forward pass and released after backward(). Understanding the graph lifecycle is the prerequisite for debugging memory leaks and gradient errors.
- Always call optimizer.zero_grad() before the forward pass. PyTorch accumulates gradients by default. Forgetting this produces unbounded gradient growth that silently corrupts models over dozens of epochs before the divergence becomes visible in loss metrics.
- Use torch.inference_mode() for production serving. It is faster than no_grad() because it disables both gradient computation and version counter tracking. Reserve no_grad() for validation during training where tracked and untracked operations coexist.
- Write a custom backward() when Autograd's default gradient is numerically unstable. Verify it with torch.autograd.gradcheck() using float64 inputs before deploying.
Common mistakes to avoid
1. Accumulating gradients across batches by not calling optimizer.zero_grad(). Call optimizer.zero_grad() at the start of every training step, before the forward pass. Add gradient norm logging every N steps: a norm that grows monotonically without bound is the earliest signal of accumulation. Add gradient clipping as a permanent safety net: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).
2. Modifying tracked tensors in-place during the forward pass. Work on a copy instead (x = x.clone(); x[mask] = value). Use torch.autograd.set_detect_anomaly(True) to identify which specific operation is causing silent corruption.
3. Forgetting to detach tensors when logging or storing loss history. Use loss.item() for scalar logging: it returns a Python float with no graph reference. Use tensor.detach().cpu() for tensor logging: .detach() breaks the graph reference, .cpu() releases GPU memory. Audit every list.append() call in your training loop for undetached tensors.
4. Accidentally wrapping the training loop in torch.no_grad() or torch.inference_mode(). Make sure torch.no_grad() and torch.inference_mode() exclusively wrap validation and inference code. Add gradient norm assertions after the backward pass: assert any(p.grad is not None and p.grad.norm() > 0 for p in model.parameters()), 'No gradients computed — check for accidental no_grad context'.
5. Calling backward() on a non-scalar tensor without providing a gradient argument. The backward() call fails immediately with a clear error message, so this one is at least easy to diagnose. Reduce the loss to a scalar, for example with .mean(), before calling backward(). For intentional non-scalar backward, such as computing Jacobians or vector-Jacobian products, pass an explicit gradient tensor: tensor.backward(gradient=torch.ones_like(tensor)).
Interview Questions on This Topic
What is a computational graph and how does PyTorch's dynamic graph differ from static graph frameworks?
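A computational graph is a directed acyclic graph whose nodes are the operations applied to tensors, each carrying the function needed to compute its local gradient. PyTorch builds this graph dynamically during the forward pass and releases it after backward() completes. The graph structure can change between iterations because it reflects whatever Python code actually executed. This makes Python if/else, loops, and data-dependent control flow work naturally.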
Static graph frameworks like TensorFlow 1.x compile the graph once upfront and execute it repeatedly. The graph structure is fixed after compilation. Conditional logic requires framework-specific operations (tf.cond, tf.while_loop) rather than Python control flow. The trade-off: static graphs enable aggressive whole-program optimization at compile time, but they add a representation layer between your code and the actual computation.
PyTorch 2.0's torch.compile() narrows this gap by tracing and compiling the dynamic graph into optimized static kernels while preserving Python-level flexibility for control flow that changes between iterations.
Frequently Asked Questions