Intermediate 8 min · March 06, 2026

PyTorch Gradient Accumulation — 200 Epoch Silent Failure

Missing optimizer.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • PyTorch tensors are multi-dimensional arrays that live on CPU or GPU and optionally track gradients for backpropagation
  • requires_grad=True opts a tensor into the autograd engine — only set it on learnable parameters, never on input data
  • The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model
  • model.train() and model.eval() control layer behaviour (Dropout, BatchNorm) — they do NOT control gradient computation
  • Forgetting optimizer.zero_grad() causes gradient accumulation, which silently corrupts training
  • Always use torch.inference_mode() or torch.no_grad() during validation and serving — not optional in production

PyTorch has become the dominant choice in academic research and is rapidly closing the gap in production systems. Understanding its foundations means you can read any ML paper's code, contribute to AI projects, and stop copy-pasting model architectures you don't understand.

The core problem PyTorch solves is bridging the gap between 'I have an idea for a model' and 'I have a working, trained model.' Frameworks like raw NumPy can store data, but they can't automatically track how a change in one number ripples through a thousand operations to affect a final error score. PyTorch does this invisibly with its autograd engine — and as of 2026, that engine underpins everything from two-layer regression models to the transformer architectures powering production LLMs.

The most common production failure I see is that developers understand the happy path but not the failure modes: training loops that silently accumulate gradients, validation code that forgets model.eval(), and inference that wastes GPU memory by leaving autograd enabled. This guide covers both the concepts and the production gotchas, because shipping a model that actually works in production is a different skill from getting a notebook to converge.

Tensors: The DNA of Every PyTorch Model

A tensor is PyTorch's fundamental data container. Think of it as a NumPy array that can live on a GPU and, when asked to, record the operations that produced it. A 1D tensor is a list of numbers (a vector), a 2D tensor is a table (a matrix), and a 3D tensor might be a single image whose three dimensions are colour channel, height, and width. A batch of images adds a fourth dimension at the front.

What makes tensors special isn't the shape — it's the metadata they carry. Every tensor knows its data type (dtype), its device (CPU or CUDA GPU), and optionally whether it should track gradients. That last flag is what separates a plain number-holder from a value that participates in learning.

You'll reach for torch.tensor() when you're converting existing Python data, torch.zeros() or torch.ones() when initialising buffers, and torch.randn() for random initialisation with a standard normal distribution. The device placement decision — CPU vs GPU — happens at creation time, and moving data between devices is explicit, never automatic. That explicitness is a feature, not an oversight; it forces you to reason about where computation actually happens, which is the difference between a model that fits in GPU memory and one that crashes at batch two.
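The creation functions mentioned above can be sketched as follows. This is a minimal illustration, not a recommended project layout: the variable names are made up, and the device falls back to CPU when no GPU is available.

```python
import torch

# Pick the device once and reuse it; falls back to CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convert existing Python data; the dtype is stated explicitly rather than inferred.
prices = torch.tensor([[1.0, 2.0], [3.0, 4.0]], dtype=torch.float32)

# Buffers and random initialisation, created directly on the chosen device.
buffer = torch.zeros(2, 3, device=device)
weights = torch.randn(3, 2, device=device)

# Moving between devices is explicit; nothing migrates on its own.
prices_on_device = prices.to(device)

print(prices.dtype, prices.shape, prices.device)
print(weights.dtype, weights.device, weights.requires_grad)
```

Creating tensors directly on the target device avoids an extra host-to-device copy, which matters more as tensors get larger.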

As of PyTorch 2.x, torch.compile() can fuse tensor operations into optimised kernels automatically — but only if your tensors are on the right device and dtype from the start. Sloppy tensor hygiene becomes measurably more expensive in 2026 than it was when compilation wasn't part of the picture.

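The dtype mismatch is the most common silent failure. torch.tensor() turns Python integers into int64 tensors and Python floats into float32 tensors (the default dtype), while data that arrives from NumPy is usually float64 and keeps that dtype. Elementwise arithmetic promotes mixed dtypes quietly, but matrix multiplication and nn layers raise a RuntimeError when operand dtypes differ. They do so at operation time, not at creation time, so the error surfaces somewhere unexpected. Write float literals with a decimal point (1.0, not 1) or specify dtype explicitly at creation.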
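A small sketch of how a dtype mismatch shows up in practice. It assumes PyTorch's default dtype has not been changed from float32, and it uses an explicit float64 tensor as a stand-in for data that arrived in double precision (for example from NumPy).

```python
import torch

from_int = torch.tensor([1, 2, 3])           # Python ints   -> torch.int64
from_float = torch.tensor([1.0, 2.0, 3.0])   # Python floats -> torch.float32
double_data = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)  # stand-in for float64 input

# Elementwise arithmetic promotes dtypes instead of failing.
promoted = from_int + from_float             # result is float32

# A dot product / matmul needs matching dtypes and fails when it is called.
try:
    torch.matmul(from_float, double_data)
    matmul_failed = False
except RuntimeError:
    matmul_failed = True

# Fix: cast once, at the boundary where the data enters the model.
fixed = torch.matmul(from_float, double_data.to(torch.float32))

print(from_int.dtype, from_float.dtype, promoted.dtype, matmul_failed, fixed)
```

Casting at the data boundary keeps the rest of the code free of scattered .float() calls.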

Autograd: How PyTorch Learns Without You Doing Calculus

Autograd is the reason PyTorch feels almost magical the first time it clicks. Every time you perform an operation on a tensor that has requires_grad=True, PyTorch silently builds a computation graph — a record of every step taken to produce the final result. When you call .backward() on a scalar output (almost always a loss value), PyTorch traverses that graph in reverse and computes the gradient of that output with respect to every participating tensor.

In plain English: you define the forward pass (what your model predicts), compute how wrong it was (the loss), call .backward(), and PyTorch fills in .grad on every learnable parameter — telling you 'if you nudge this value slightly, here's how much the loss would change.' You then use that information to nudge every parameter in the right direction. That nudge, applied repeatedly, is gradient descent.
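The forward, loss, backward, nudge cycle described above can be sketched with a single learnable number. The values and the learning rate are arbitrary and chosen so the arithmetic is easy to check by hand.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # learnable parameter
x = torch.tensor(2.0)                      # input data, no gradient tracking
y_true = torch.tensor(10.0)

y_pred = w * x                  # forward pass: 3 * 2 = 6
loss = (y_pred - y_true) ** 2   # squared error: (6 - 10)^2 = 16
loss.backward()                 # fills w.grad with d(loss)/dw

# By hand: d(loss)/dw = 2 * (w*x - y_true) * x = 2 * (6 - 10) * 2 = -16
print(w.grad)

# One gradient-descent step, done manually outside the graph.
learning_rate = 0.1
with torch.no_grad():
    w -= learning_rate * w.grad  # 3.0 - 0.1 * (-16) = 4.6

print(w)
```

The negative gradient says that increasing w lowers the loss, so the update moves w upward, toward the value 5.0 that would make the prediction exact.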

Three rules to memorise before shipping anything: (1) Without arguments, .backward() can only be called on a scalar tensor. If your loss has multiple elements, reduce it with .mean() or .sum() first, or pass an explicit gradient argument of the same shape. (2) Gradients accumulate by default: every call to .backward() adds to existing .grad values rather than replacing them. Call optimizer.zero_grad() once per optimizer step, before the backward pass, or gradients will pile up across batches and silently corrupt training. (3) During inference, wrap code in torch.no_grad() or torch.inference_mode() to skip graph construction entirely. It is faster, uses less memory, and removes an entire class of production bugs.
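The three rules can be observed directly on a two-element tensor. This is an illustrative sketch; setting .grad to None by hand stands in for what optimizer.zero_grad() does for every parameter it manages.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Rule 1: a multi-element result is reduced to a scalar before .backward().
(w ** 2).sum().backward()          # gradient of sum(w^2) is 2 * w
first_grad = w.grad.clone()        # [2., 4.]

# Rule 2: a second backward pass ADDS to .grad instead of replacing it.
(w ** 2).sum().backward()
accumulated_grad = w.grad.clone()  # [4., 8.]

# Resetting the gradient restores the correct value on the next pass.
w.grad = None
(w ** 2).sum().backward()
reset_grad = w.grad.clone()        # [2., 4.]

# Rule 3: inside inference_mode (or no_grad) no graph is recorded.
with torch.inference_mode():
    out = (w ** 2).sum()

print(first_grad, accumulated_grad, reset_grad, out.requires_grad)
```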

The graph is destroyed after .backward() completes by default. This is intentional memory management: the graph for one forward pass can consume hundreds of megabytes on a deep network. Without destruction, GPU memory would grow linearly with training steps. This is also why you cannot call .backward() twice on the same graph without retain_graph=True — and retain_graph=True in a training loop is almost always a bug, not a feature.
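A short sketch of what "the graph is destroyed" means in practice. The expression is arbitrary; it was picked because squaring saves an intermediate value that is freed after the first backward pass.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
loss = (w * 3.0) ** 2    # the graph saves (w * 3.0) for the backward pass

loss.backward()          # first pass succeeds, then frees the saved values

try:
    loss.backward()      # second pass over the same, already-freed graph
    second_pass_failed = False
except RuntimeError:
    second_pass_failed = True

# The usual fix is to rerun the forward pass, which builds a fresh graph.
w.grad = None
loss = (w * 3.0) ** 2
loss.backward()          # d/dw (3w)^2 = 18 * w = 36

print(second_pass_failed, w.grad)
```

Rebuilding the graph with a new forward pass is what a normal training loop already does on every step, which is why retain_graph=True is rarely needed there.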

One nuance worth knowing as of PyTorch 2.x: torch.compile() can aggressively optimise the forward and backward passes together, but it relies on the traced graph staying consistent across calls. If your forward pass has Python-level control flow that depends on tensor values (not just tensor shapes), the compiler may insert graph breaks or recompile when the branch taken changes. One option is to move that logic into a helper function decorated with torch.compiler.disable so it runs eagerly and does not trigger recompilation overhead.

Building a Real Training Loop with nn.Module

Writing raw tensor operations gets unwieldy past a handful of layers. PyTorch's nn.Module is the standard abstraction for any model — from a one-layer linear regression to a 70-billion-parameter language model. Every nn.Module subclass does two things: defines learnable parameters (or sub-modules that contain them) inside __init__, and defines the forward computation inside forward().

The beauty of nn.Module is composability. A large model is just nn.Module instances containing other nn.Module instances, arbitrarily deep. When you call model.parameters(), PyTorch recursively collects every learnable parameter in the entire tree — that flat iterator is exactly what you hand to the optimizer.

The training loop is the heartbeat of all ML work in PyTorch. It is always the same five steps: zero gradients, forward pass, compute loss, backward pass, optimizer step. That order is not arbitrary — skipping or reordering any step produces a specific and usually hard-to-diagnose failure. Internalise this sequence and you can read any paper's training code cold.
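The five steps can be sketched end to end on synthetic data. TinyRegressor, the data, and the hyperparameters are hypothetical and exist only to make the loop runnable on CPU in a few milliseconds.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TinyRegressor(nn.Module):
    """Hypothetical two-layer model used only for illustration."""

    def __init__(self, in_features: int, hidden: int) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Synthetic regression data: y = 2*x0 - x1 plus a little noise.
features = torch.randn(256, 2)
targets = 2 * features[:, :1] - features[:, 1:] + 0.01 * torch.randn(256, 1)

model = TinyRegressor(in_features=2, hidden=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

model.train()
losses = []
for step in range(200):
    optimizer.zero_grad()                  # 1. clear old gradients
    predictions = model(features)          # 2. forward pass
    loss = loss_fn(predictions, targets)   # 3. compute loss
    loss.backward()                        # 4. backward pass
    optimizer.step()                       # 5. update parameters
    losses.append(loss.item())             # store a float, not the tensor

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Real code would iterate over mini-batches from a DataLoader instead of reusing one fixed batch, but the order of the five calls stays the same.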

The validation loop is structurally almost identical but with two additions: model.eval() called before the loop, and torch.no_grad() wrapping the forward pass. These solve different problems. model.eval() changes layer behaviour — Dropout stops masking neurons, BatchNorm uses accumulated running statistics instead of batch statistics. torch.no_grad() stops graph construction entirely, saving memory and time. You need both; neither substitutes for the other.
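A sketch of a validation helper that applies both switches. The validate function and the small Dropout model are hypothetical; the helper puts the model back into training mode on exit, which is one possible convention rather than a requirement.

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
val_features, val_targets = torch.randn(32, 4), torch.randn(32, 1)

def validate(model, features, targets):
    """Hypothetical helper: eval mode plus no_grad, then back to train mode."""
    model.eval()                      # Dropout off, BatchNorm uses running stats
    with torch.no_grad():             # no computation graph is recorded
        predictions = model(features)
        loss = loss_fn(predictions, targets)
    model.train()                     # hand the model back in training mode
    return predictions, loss.item()

first_preds, first_loss = validate(model, val_features, val_targets)
second_preds, second_loss = validate(model, val_features, val_targets)

# In training mode the same input gives different outputs because Dropout is active.
train_a = model(val_features)
train_b = model(val_features)

print(torch.equal(first_preds, second_preds), torch.equal(train_a, train_b))
```

The two validation passes agree exactly, while the two training-mode passes differ, which is the fluctuation you would see in a validation metric if model.eval() were forgotten.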

The most common production bug I still see in 2026: calling model.forward(x) directly instead of model(x). It works identically in isolation, but it bypasses all registered forward hooks — hooks that profilers, debuggers, quantisation tools, and libraries like torchvision rely on. Always call the model as a callable. The __call__ method is what wires up the hook infrastructure; forward() is just the computation you define.
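The difference is easy to observe with a forward hook. The hook below is a made-up example that records output shapes; real tooling registers hooks the same way.

```python
import torch
from torch import nn

model = nn.Linear(3, 2)
calls = []

def record_output_shape(module, inputs, output):
    # Forward hooks receive the module, its positional inputs, and its output.
    calls.append(tuple(output.shape))

handle = model.register_forward_hook(record_output_shape)
x = torch.randn(5, 3)

via_forward = model.forward(x)    # bypasses __call__, so the hook does not fire
hooks_after_forward = len(calls)

via_call = model(x)               # goes through __call__, so the hook fires
hooks_after_call = len(calls)

handle.remove()                   # detach the hook when it is no longer needed

print(hooks_after_forward, hooks_after_call, torch.equal(via_forward, via_call))
```

Both calls return the same numbers, which is exactly why the bug goes unnoticed until a profiler or quantisation tool reports nothing.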

Data Loading with Dataset and DataLoader

You'll rarely keep all your training data in memory as a single tensor. Real-world datasets — images, text, logs — are large, expensive to load, and need to be shuffled, batched, and transformed on the fly. PyTorch's torch.utils.data.Dataset and DataLoader are the standard way to feed data into a training loop.

A Dataset subclass defines two things: __len__ (how many samples) and __getitem__ (how to load the i-th sample). That's it. The DataLoader then wraps the dataset and handles batching, shuffling, parallelism, and memory pinning. Writing a custom Dataset is the right approach for any data that doesn't fit in RAM — the Dataset tells PyTorch how to load each sample lazily, and the DataLoader manages the rest.
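A minimal sketch of the two-method contract. SquaresDataset is a hypothetical dataset that computes each sample on demand, standing in for code that would read a file or a database row.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SquaresDataset(Dataset):
    """Hypothetical dataset: sample i is the pair (i, i squared), built lazily."""

    def __init__(self, size: int) -> None:
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int):
        x = torch.tensor([float(index)])
        y = torch.tensor([float(index) ** 2])
        return x, y

dataset = SquaresDataset(size=10)

# num_workers=0 loads in the main process, which is the simplest setting to debug.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

batch_shapes = [tuple(x.shape) for x, y in loader]
print(len(dataset), len(loader), batch_shapes)
```

Ten samples with a batch size of four give two full batches and one partial batch; pass drop_last=True if the partial batch is unwanted.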

Three things almost always go wrong in production data loading: (1) num_workers set too high, so the process opens too many file handles and the OS starts swapping; (2) batches that come out of a custom collate function on CPU and are never moved to the GPU where the model lives; (3) a Dataset that returns tensors of inconsistent shapes for variable-length data without proper padding. The error messages for these rarely point to the actual root cause.
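For the variable-length case, one common approach is a collate function that pads each batch to its longest sequence. The sketch below assumes integer token sequences and a padding value of 0; pad_collate is a hypothetical name.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Variable-length sequences; a plain list works as a map-style dataset.
sequences = [
    torch.tensor([1, 2, 3]),
    torch.tensor([4, 5]),
    torch.tensor([6]),
    torch.tensor([7, 8, 9, 10]),
]

def pad_collate(batch):
    """Pad every sequence in the batch to the longest one and keep true lengths."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths

loader = DataLoader(sequences, batch_size=2, shuffle=False, collate_fn=pad_collate)
batches = list(loader)

for padded, lengths in batches:
    print(padded.tolist(), lengths.tolist())
```

Returning the true lengths alongside the padded tensor lets the model or the loss mask out padding later.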

For tabular data that fits in memory, wrapping the tensors in a TensorDataset is perfectly fine. For images, torchvision's ImageFolder and Compose transforms handle most common pipelines. For text, Hugging Face datasets integrate cleanly with PyTorch's DataLoader.

Shuffling is essential for stochastic gradient descent — it prevents the model from learning the order of the data rather than the underlying distribution. Always set shuffle=True in your training DataLoader. For validation, shuffle=False is correct because you want the same deterministic ordering for comparison across epochs.

Training on GPU and Mixed Precision

GPUs accelerate tensor operations by orders of magnitude compared to CPUs, but they have limited memory and come with gotchas that trip up even senior engineers. Training on GPU is not just 'call .cuda()' — it requires careful device management, understanding of CUDA memory, and leveraging mixed precision to fit larger models and batch sizes.

PyTorch makes GPU training explicit: you move the model with model.to(device) and move each batch with batch.to(device). If any tensor is left on CPU while the rest of the operation is on GPU, you get a RuntimeError. The fix is to enforce a convention: device as a variable at the start of your script, and .to(device) on every batch at the point of creation.
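The convention can be sketched as follows. The model and data are placeholders, and the script runs on CPU when no GPU is available.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# One device variable, defined once at the top of the script.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 1).to(device)

# The dataset itself stays on CPU; only individual batches are moved.
dataset = TensorDataset(torch.randn(16, 4), torch.randn(16, 1))
loader = DataLoader(dataset, batch_size=8)

for features, targets in loader:
    features, targets = features.to(device), targets.to(device)
    # Cheap guard against a device mismatch before the forward pass.
    assert next(model.parameters()).device == features.device
    predictions = model(features)

print(predictions.device, predictions.shape)
```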

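Mixed precision training with Automatic Mixed Precision (AMP) has been part of PyTorch since version 1.6 and is now standard practice. Under autocast, eligible operations such as matrix multiplications and convolutions run in float16 while the model's parameters stay in float32, which can substantially reduce activation memory and often speeds up training on GPUs with tensor cores; the exact gain depends on the model and hardware. It takes two additions to a training loop: a GradScaler, and an autocast context around the forward pass and loss computation (the backward pass runs outside autocast). The scaler multiplies the loss before backward so that small gradients do not underflow in float16. The original torch.cuda.amp entry points still work but are deprecated in recent releases in favour of torch.amp.autocast and torch.amp.GradScaler.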

GPUs have limited memory: a high-end A100 has up to 80GB, but most production setups use 16–32GB cards. If you run out of memory, reduce the batch size, use gradient accumulation to keep the effective batch size, or switch to mixed precision. The most common silent failure is moving the entire dataset to the GPU by calling .to(device) in the Dataset constructor instead of on each batch inside the training loop. That puts all the data on the GPU at once and causes an OOM before training starts.
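Deliberate gradient accumulation is the controlled version of the zero_grad bug: gradients from several small batches are summed on purpose, and the optimizer steps once per group. The sketch below checks, on made-up data, that four micro-batches of 2 produce the same gradient as one batch of 8 when each micro-batch loss is divided by the number of accumulation steps.

```python
import torch
from torch import nn

torch.manual_seed(0)
features, targets = torch.randn(8, 4), torch.randn(8, 1)
loss_fn = nn.MSELoss()

# Reference: the gradient from one full batch of 8 samples.
reference = nn.Linear(4, 1)
loss_fn(reference(features), targets).backward()

# Same starting weights, but processed as 4 micro-batches of 2 samples.
model = nn.Linear(4, 1)
model.load_state_dict(reference.state_dict())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accumulation_steps = 4

optimizer.zero_grad()
micro_batches = zip(features.chunk(accumulation_steps), targets.chunk(accumulation_steps))
for micro_x, micro_y in micro_batches:
    # Divide so the summed gradients equal the mean over the full batch.
    loss = loss_fn(model(micro_x), micro_y) / accumulation_steps
    loss.backward()               # gradients add up across micro-batches

grads_match = torch.allclose(model.weight.grad, reference.weight.grad, atol=1e-6)

optimizer.step()                  # one update for the whole effective batch
optimizer.zero_grad()             # reset before the next group of micro-batches

print(grads_match)
```

Only one micro-batch of activations is held in memory at a time, which is what makes this a memory-saving technique.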

As of PyTorch 2.x, torch.compile() with mode='reduce-overhead' or mode='max-autotune' can further optimise GPU kernel execution, but it requires a warm-up step and may increase compile time on the first batch. It's worth enabling for production serving, less for rapid experimentation.

PyTorch vs TensorFlow 1.x — Architectural Differences
| Feature / Aspect | PyTorch (Dynamic Graph) | TensorFlow 1.x (Static Graph) |
| Graph construction | Built at runtime on every forward pass; debug with standard Python tools anywhere in the loop | Pre-compiled before any data flows through; the graph was fixed at definition time, making runtime inspection nearly impossible |
| Debugging | Standard Python debugger, print(), and pdb work anywhere in the forward pass with no special configuration | Required special tf.Print ops inserted into the graph; runtime errors produced stack traces that pointed to graph compilation, not the user code that caused them |
| Research flexibility | Architecture changes take effect immediately: swap a layer, change a loss function, add a branch mid-experiment with no recompilation step | Any architectural change required rebuilding and recompiling the graph, which could take seconds to minutes for large models |
| Production deployment | TorchScript or ONNX export required for optimised serving without a Python runtime; torch.compile() in 2.x closes most of the performance gap for GPU serving | SavedModel format was natively optimised for TF Serving; the static graph made deployment straightforward but locked you into the graph you compiled |
| Community adoption | Dominant in research: over 75% of ML papers published in 2024-2025 used PyTorch as the primary framework | Remains strong in enterprise production systems built before 2020; legacy TF1 codebases are still running in many large organisations |
| GPU memory control | Explicit .to(device): you decide what moves and when; nothing migrates automatically | Automatic placement with manual overrides via tf.device() context managers; less control but fewer explicit device calls |
| Gradient control | requires_grad per tensor; torch.no_grad() and torch.inference_mode() context managers; fine-grained control at the tensor level | GradientTape context manager in TF 2.x is a similar concept but opt-in rather than opt-out; in TF 1.x gradients were computed by tf.gradients() on the pre-compiled graph |

Key Takeaways

  • Tensors carry three critical properties beyond their values: dtype, device, and requires_grad — getting any one of these wrong silently breaks training in ways that trace to the wrong location in the stack.
  • Autograd doesn't run continuously; it only records a computation graph when requires_grad=True tensors are involved, and only computes gradients when you explicitly call .backward() on a scalar loss. The graph is destroyed after each backward pass by default — retain_graph=True in a loop is almost always a memory leak.
  • The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model — the order is load-bearing. Memorise it and you can read any codebase or paper's training code cold.
  • model.train() and model.eval() control layer behaviour like Dropout and BatchNorm. torch.no_grad() controls gradient computation. These are two separate mechanisms, and neither implies the other. Confusing them is the single most common source of subtle training bugs in production PyTorch code.
  • Always call model as a callable (model(x)), never model.forward(x) — the __call__ method wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on. This one habit prevents an entire class of silent tooling failures.
  • Dataset + DataLoader form the standard data pipeline. Keep num_workers moderate, always move batches to the correct device, and test the pipeline in isolation.
  • Mixed precision (torch.cuda.amp, or torch.amp in recent releases) can roughly halve activation memory and substantially raise throughput on supported GPUs. With float16, GradScaler is not optional: it prevents gradient underflow. Prefer inference_mode over no_grad for production serving.

Common Mistakes to Avoid

  • Forgetting optimizer.zero_grad() before loss.backward()
    Symptom: Loss may decrease at first, but predictions are garbage in production. Gradient norms keep growing from step to step, because each .grad buffer holds the sum of every gradient computed so far. The training run can look like it converged while the model has effectively random weights that memorised noise.
    Fix: Call optimizer.zero_grad() as the first line of every training step — before the forward pass, before the loss computation, before anything. Add gradient norm logging to your training dashboard. Consider gradient clipping with max_norm=1.0 as a permanent safety net, not just a debugging tool.
  • Calling model.forward(x) directly instead of model(x)
    Symptom: Profilers produce no output. Libraries that register forward hooks (torchvision, quantisation tools, some logging frameworks) silently fail to fire. The model produces correct predictions but all hook-dependent tooling is invisible.
    Fix: Always call the model as a callable: predictions = model(input_tensor). The __call__ method is where forward pre-hooks and forward hooks run and where backward hooks are set up. Your forward() method is called internally; it is not the entry point.
  • Storing raw loss tensors in a list for logging
    Symptom: GPU memory grows steadily across epochs with no obvious leak in the model code. Each epoch consumes more memory than the last until an out-of-memory crash occurs, often in the middle of a long training run.
    Fix: Use loss.item() to extract a plain Python float before storing or logging. .item() detaches the scalar from the computation graph. Never append loss itself to a list — it keeps the entire graph alive for that batch in memory indefinitely.
  • Not calling model.eval() during validation
    Symptom: Validation loss fluctuates wildly between epochs even when training loss is smooth and decreasing. The model appears unstable, but the instability is in the evaluation, not the model weights.
    Fix: Call model.eval() before every validation loop and model.train() before every training loop. Treat them as a matched pair. If your codebase has multiple evaluation paths (validation, test, inference), add a utility function that ensures eval mode is set and inference_mode is active — centralise it so it cannot be forgotten.
  • Using torch.no_grad() instead of torch.inference_mode() for production serving
    Symptom: Inference is slower than expected and uses more memory than necessary. Not a crash — a silent performance regression that is easy to miss without profiling.
    Fix: Use torch.inference_mode() for production inference paths. On top of disabling gradient computation, it skips view and version-counter tracking, which trims per-operation overhead; measure the gain on your own model, since it varies by architecture. Reserve torch.no_grad() for code whose outputs may later take part in autograd, such as validation inside a training run, because tensors created under inference_mode cannot be used in operations that autograd records.
  • Setting num_workers too high in DataLoader
    Symptom: Training starts slow or crashes with memory errors. CPU usage spikes to 100% and GPU utilisation is low. System may start swapping.
    Fix: Start with num_workers=4 and increase until GPU utilisation plateaus. Monitor system memory: each worker may duplicate dataset memory on Linux. If you see memory pressure, reduce workers or set num_workers=0 to disable multiprocessing.
  • Forgetting to move data to GPU in the training loop (but moving the model)
    Symptom: RuntimeError: Expected all tensors to be on the same device. Usually occurs on the first forward pass, but sometimes later if some branches avoid the mismatch.
    Fix: Always call features, targets = features.to(device), targets.to(device) at the start of each batch. Use a consistent device variable. Add a one-line assertion before the forward pass: assert next(model.parameters()).device == features.device.
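Several of the fixes above (zero_grad first, gradient-norm logging, clipping with max_norm=1.0, and storing loss.item() rather than the loss tensor) fit into one training step. The sketch below uses made-up data with deliberately large targets so that the unclipped gradient norm is well above 1.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Large targets make the raw gradients large, so clipping has a visible effect.
features, targets = torch.randn(64, 4), 100.0 * torch.randn(64, 1)

loss_history, grad_norm_history = [], []
for step in range(5):
    optimizer.zero_grad()                       # first line of every step
    loss = loss_fn(model(features), targets)
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the norm BEFORE clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    loss_history.append(loss.item())            # plain float, holds no graph
    grad_norm_history.append(total_norm.item())

# Norm of the gradients actually applied in the last step (after clipping).
clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters())).item()

print(grad_norm_history, clipped_norm)
```

Logging the pre-clipping norm is the useful part: a norm that climbs steadily across steps is an early sign that gradients are not being reset.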

Interview Questions on This Topic

  • Q: What is the computation graph in PyTorch and how does autograd use it to compute gradients? (Senior)
    The computation graph is a directed acyclic graph (DAG) that records every operation performed on tensors that have requires_grad=True. Each operation becomes a node (with a grad_fn), and the edges represent the tensors flowing between operations. When you call .backward() on a scalar loss, autograd traverses this graph in reverse topological order, applying the chain rule at each node to compute the gradient of the loss with respect to every leaf tensor that required a gradient. The graph is dynamically built on each forward pass and is destroyed after backward() by default, which keeps memory usage proportional to a single forward pass rather than the entire training history.
  • Q: Explain the difference between model.train() and model.eval() and why you need both. (Mid-level)
    model.train() sets the model's internal training flag to True, which enables layer-specific behaviours: Dropout randomly masks neurons, and BatchNorm updates its running mean/variance using the current batch statistics. model.eval() sets the flag to False: Dropout passes all neurons (no masking) and BatchNorm uses the accumulated running statistics instead of the batch statistics. These affect the forward pass regardless of gradient computation. You need model.eval() during validation to get deterministic predictions that reflect the model's true performance. You need model.train() during training for proper regularisation (Dropout) and normalisation. torch.no_grad() is orthogonal — it disables gradient tracking but does not change layer behaviour.
  • Q: What happens if you forget to call optimizer.zero_grad() before each training step? (Senior)
    Gradients accumulate across batches because .backward() adds the computed gradients to the existing .grad buffer rather than overwriting it. After N steps without zero_grad(), .grad holds the sum of the gradients from all N batches, so each update is driven largely by stale gradients computed at earlier parameter values and can be many times larger than intended. The loss may still decrease initially, but the updates grow increasingly erratic and the final weights can be close to useless on unseen data. This bug is hard to detect from the training loss curve alone, which is why gradient norms are worth logging. The fix is calling optimizer.zero_grad() as the first line inside the training loop, before any forward pass.
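The graph described in the first answer can be inspected directly through grad_fn and is_leaf. This is a small illustrative sketch; the values are arbitrary.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # leaf tensor that requires a gradient
x = torch.tensor(5.0)                       # plain input, not tracked

product = w * x                 # node created by a multiplication
loss = (product - 1.0) ** 2     # node created by a power operation

# Leaves have no grad_fn; results of tracked operations do.
print(w.is_leaf, w.grad_fn)
node_names = [type(loss.grad_fn).__name__, type(product.grad_fn).__name__]
print(node_names)

# Chain rule by hand: d(loss)/dw = 2 * (w*x - 1) * x = 2 * 9 * 5 = 90
loss.backward()
print(w.grad, x.grad)
```

Only w receives a gradient; x was never marked with requires_grad, so autograd does not compute one for it.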

Frequently Asked Questions

What is the difference between PyTorch and NumPy?

NumPy arrays live only on the CPU and have no concept of gradients or automatic differentiation. PyTorch tensors can live on a GPU — which is what makes large matrix operations fast enough for deep learning in practice — and tensors with requires_grad=True automatically track every operation performed on them so that gradients can be computed via .backward(). For pure numerical computing with no learning involved, NumPy is lighter and more widely supported in the scientific Python ecosystem. The moment you need a model to learn from data, PyTorch is the right tool. Many teams also mix both: NumPy for data preprocessing and analysis, PyTorch for the model itself.
