PyTorch Gradient Accumulation — 200 Epoch Silent Failure
Missing optimizer.zero_grad() caused 200x gradient accumulation over 200 epochs, corrupting weights silently.
20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.
- PyTorch tensors are multi-dimensional arrays that live on CPU or GPU and optionally track gradients for backpropagation
- requires_grad=True opts a tensor into the autograd engine — only set it on learnable parameters, never on input data
- The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model
- model.train() and model.eval() control layer behaviour (Dropout, BatchNorm) — they do NOT control gradient computation
- Forgetting optimizer.zero_grad() causes gradient accumulation, which silently corrupts training
- Always use torch.inference_mode() or torch.no_grad() during validation and serving — not optional in production
Imagine you're teaching a child to recognise cats by showing them thousands of pictures and correcting them every time they're wrong. PyTorch is the notebook, pencil, and eraser that lets a computer do exactly that — store the pictures as grids of numbers (tensors), measure how wrong each guess was (loss), and automatically figure out which knobs to tweak to do better next time (autograd). It doesn't decide what to learn; it gives you the tools to build the machine that learns.
PyTorch has become the dominant choice in academic research and is rapidly closing the gap in production systems. Understanding its foundations means you can read any ML paper's code, contribute to AI projects, and stop copy-pasting model architectures you don't understand.
The core problem PyTorch solves is bridging the gap between 'I have an idea for a model' and 'I have a working, trained model.' Frameworks like raw NumPy can store data, but they can't automatically track how a change in one number ripples through a thousand operations to affect a final error score. PyTorch does this invisibly with its autograd engine — and as of 2026, that engine underpins everything from two-layer regression models to the transformer architectures powering production LLMs.
The most common production failure I see: developers understand the happy path but not the failure modes. Training loops that silently accumulate gradients, validation code that forgets model.eval(), and inference that wastes GPU memory by not disabling autograd. This guide covers both the concepts and the production gotchas — because shipping a model that actually works in production is a different skill from getting a notebook to converge.
Why Gradient Accumulation Is Not a Free Lunch
Gradient accumulation is a technique that simulates larger batch sizes by summing gradients over multiple forward-backward passes before performing a single optimizer step. Instead of updating weights after every batch, you accumulate gradients across N micro-batches, then step. This lets you train with effective batch sizes that exceed GPU memory limits — a 4GB card can simulate a 256-sample batch by accumulating 32 micro-batches of 8 samples each.
In practice, gradient accumulation changes the training dynamics in subtle ways. Each micro-batch computes gradients independently, but the optimizer sees only the accumulated sum. This means batch normalization statistics are computed per micro-batch, not per effective batch — a common source of silent degradation. Also, gradient clipping must be applied to the accumulated gradient, not per micro-batch, or you'll distort the gradient scale. The effective learning rate should remain tied to the effective batch size, not the micro-batch size, or convergence suffers.
Use gradient accumulation when GPU memory cannot hold the desired batch size, typically with large transformers or CNNs on high-resolution inputs. It is not a substitute for proper batch normalization handling: BatchNorm still sees one micro-batch at a time, so either freeze the BN statistics or switch to a batch-independent layer such as GroupNorm or LayerNorm. SyncBatchNorm does not help here, because it synchronizes statistics across devices, not across micro-batches. In production, teams hit this as a silent failure late in a long run: the model trains fine for 150 epochs of a 200-epoch schedule, then plateaus or diverges because the BN statistics drifted from the true distribution over the effective batch.
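The accumulation pattern described in this section can be sketched as follows. This is a minimal illustration, not a production loop: the toy data, the nn.Linear model, and the sizes (4 micro-batches of 8) are assumptions chosen so the result can be checked against a single full-batch gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy setup: 32 samples, 4 features, micro-batches of 8.
features = torch.randn(32, 4)
targets = torch.randn(32, 1)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

micro_batch_size = 8
accum_steps = 4  # effective batch = 8 * 4 = 32

# Reference: the gradient of the mean loss over the full effective batch.
full_loss = loss_fn(model(features), targets)
(reference_grad,) = torch.autograd.grad(full_loss, model.weight)

optimizer.zero_grad()
for step in range(accum_steps):
    start = step * micro_batch_size
    x = features[start:start + micro_batch_size]
    y = targets[start:start + micro_batch_size]
    # Divide by accum_steps so the summed gradients equal the gradient of
    # the mean loss over the whole effective batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # adds into .grad; nothing is reset here

accumulated_grad = model.weight.grad.clone()

# Clip once, on the accumulated gradient, then take a single optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

The equality with the full-batch gradient holds here because every micro-batch has the same size and the model has no BatchNorm layers; with BatchNorm in the model the two would differ, which is the caveat this section raises.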
Tensors: The DNA of Every PyTorch Model
A tensor is PyTorch's fundamental data container: think of it as a NumPy array that can live on a GPU and record the operations that produced it. A 1D tensor is a list of numbers (a vector), a 2D tensor is a table (a matrix), and a 3D tensor might be a single colour image whose dimensions are channel, height, and width. A batch of such images adds a fourth, leading batch dimension.
What makes tensors special isn't the shape — it's the metadata they carry. Every tensor knows its data type (dtype), its device (CPU or CUDA GPU), and optionally whether it should track gradients. That last flag is what separates a plain number-holder from a value that participates in learning.
You'll reach for torch.tensor() when you're converting existing Python data, torch.zeros() or torch.ones() when initialising buffers, and torch.randn() for random initialisation with a standard normal distribution. The device placement decision — CPU vs GPU — happens at creation time, and moving data between devices is explicit, never automatic. That explicitness is a feature, not an oversight; it forces you to reason about where computation actually happens, which is the difference between a model that fits in GPU memory and one that crashes at batch two.
As of PyTorch 2.x, torch.compile() can fuse tensor operations into optimised kernels automatically — but only if your tensors are on the right device and dtype from the start. Sloppy tensor hygiene becomes measurably more expensive in 2026 than it was when compilation wasn't part of the picture.
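The dtype mismatch is the most common trap. torch.tensor() turns Python integers into int64 and Python floats into float32, while NumPy arrays default to float64 and keep that dtype through torch.from_numpy(). Elementwise arithmetic promotes mixed dtypes quietly, but matrix multiplication and layers such as nn.Linear raise a RuntimeError when the input dtype differs from the weight dtype. The error appears at operation time, not at creation time, so it surfaces somewhere unexpected. Write float literals with a decimal point (1.0, not 1) or specify dtype explicitly at creation.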
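A short sketch of the dtype and device behaviour discussed above. The tensor names are illustrative, and the float64 tensor is created directly to stand in for what torch.from_numpy() returns for a default NumPy float array.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ints = torch.tensor([1, 2, 3])          # Python ints   -> torch.int64
floats = torch.tensor([1.0, 2.0, 3.0])  # Python floats -> torch.float32
# Stand-in for torch.from_numpy() on a default NumPy float array (float64).
doubles = torch.tensor([[1.0, 2.0, 3.0]], dtype=torch.float64)

# Elementwise arithmetic promotes mixed dtypes instead of raising.
promoted = ints + floats

# Matrix multiplication does not promote: float64 @ float32 raises.
weights = torch.randn(3, 2)  # float32 by default
try:
    doubles @ weights
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# Fix dtype and device explicitly, once, at the boundary.
inputs = doubles.to(device=device, dtype=torch.float32)
outputs = inputs @ weights.to(device)
```

Converting once at the boundary keeps every downstream operation on one dtype and one device, which is also the precondition for torch.compile() to fuse kernels.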
- With torch.compile(), dtype and device inconsistencies also prevent kernel fusion, silently costing you throughput on top of correctness.
- torch.tensor() copies the data and infers the dtype, defaulting to float32 for Python floats. For large NumPy arrays, torch.from_numpy() avoids the copy.
- Initialise weights with torch.randn() * init_scale or nn.init.kaiming_normal_. Never initialise all weights to zero; every neuron would compute identical gradients and the network would never differentiate.
Autograd: How PyTorch Learns Without You Doing Calculus
Autograd is the reason PyTorch feels almost magical the first time it clicks. Every time you perform an operation on a tensor that has requires_grad=True, PyTorch silently builds a computation graph — a record of every step taken to produce the final result. When you call .backward() on a scalar output (almost always a loss value), PyTorch traverses that graph in reverse and computes the gradient of that output with respect to every participating tensor.
In plain English: you define the forward pass (what your model predicts), compute how wrong it was (the loss), call .backward(), and PyTorch fills in .grad on every learnable parameter — telling you 'if you nudge this value slightly, here's how much the loss would change.' You then use that information to nudge every parameter in the right direction. That nudge, applied repeatedly, is gradient descent.
Three rules to memorise before shipping anything: (1) .backward() can only be called on a scalar tensor. If your loss is a multi-element tensor, call .mean() or .sum() first or pass a gradient argument. (2) Gradients accumulate by default — every call to .backward() adds to existing .grad values rather than replacing them. Call optimizer.zero_grad() before each backward pass or gradients will pile up across batches and corrupt training in exactly the way the production incident above describes. (3) During inference, wrap code in torch.no_grad() or torch.inference_mode() to skip graph construction entirely — it is faster, uses less memory, and removes an entire class of production bugs.
The graph is destroyed after .backward() completes by default. This is intentional memory management: the graph for one forward pass can consume hundreds of megabytes on a deep network. Without destruction, GPU memory would grow linearly with training steps. This is also why you cannot call .backward() twice on the same graph without retain_graph=True — and retain_graph=True in a training loop is almost always a bug, not a feature.
One nuance worth knowing as of PyTorch 2.x: torch.compile() can aggressively optimise the forward and backward passes together, but it relies on the graph being consistent across calls. If your forward pass has Python-level control flow that changes based on input values (not just tensor shapes), you may need to mark those branches with torch.compiler.disable() to prevent recompilation overhead on every batch.
- Forward pass: execute operations and record the graph — each operation node stores its own gradient function (grad_fn)
- Backward pass: traverse the graph in reverse from the loss node, applying the Chain Rule at each node to accumulate gradients
- The graph is rebuilt fresh on every forward pass — it captures the exact computation that just ran, including any Python-level branching
- requires_grad=True marks a tensor as a leaf node whose .grad we want filled in after backward()
- The gradient of a scalar loss with respect to all parameters is computed in a single .backward() call; you do not loop over parameters manually
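The accumulation rule is easiest to see on a two-element example. The sketch below uses made-up values for w and x; the loss is ((w · x) - 1)^2, so its gradient with respect to w is 2 * ((w · x) - 1) * x.

```python
import torch

# A leaf tensor that opts in to gradient tracking.
w = torch.tensor([2.0, -1.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])  # input data: no gradient needed

loss = ((w * x).sum() - 1.0) ** 2  # scalar, so backward() needs no argument
loss.backward()
first_grad = w.grad.clone()

# A second backward pass without zeroing adds to the existing .grad.
loss = ((w * x).sum() - 1.0) ** 2
loss.backward()
doubled_grad = w.grad.clone()

# Resetting first restores the correct per-step gradient.
w.grad.zero_()
loss = ((w * x).sum() - 1.0) ** 2
loss.backward()
fresh_grad = w.grad.clone()
```

Here w · x is 2, so the correct gradient is 2 * (2 - 1) * x; the second pass holds exactly twice that, which is the failure mode a missing optimizer.zero_grad() produces at scale.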
- With torch.compile(), the dynamic graph gets partially compiled for performance while retaining correctness for control-flow branches.
- The graph is freed after backward() by default; retain_graph=True in a training loop is almost always a memory leak waiting to happen. In production, use an optimizer rather than manual weight updates, and always wrap inference in torch.inference_mode(): it disables both gradient computation and version tracking, making it faster than torch.no_grad() for serving workloads.
- Use torch.inference_mode() for production serving. Use torch.no_grad() during validation inside training loops where you may still need tensor version tracking.
- Use torch.autograd.gradcheck() to numerically verify computed gradients against finite differences. It is invaluable when implementing custom backward passes.
Building a Real Training Loop with nn.Module
Writing raw tensor operations gets unwieldy past a handful of layers. PyTorch's nn.Module is the standard abstraction for any model — from a one-layer linear regression to a 70-billion-parameter language model. Every nn.Module subclass does two things: defines learnable parameters (or sub-modules that contain them) inside __init__, and defines the forward computation inside forward().
The beauty of nn.Module is composability. A large model is just nn.Module instances containing other nn.Module instances, arbitrarily deep. When you call model.parameters(), PyTorch recursively collects every learnable parameter in the entire tree — that flat iterator is exactly what you hand to the optimizer.
The training loop is the heartbeat of all ML work in PyTorch. It is always the same five steps: zero gradients, forward pass, compute loss, backward pass, optimizer step. That order is not arbitrary — skipping or reordering any step produces a specific and usually hard-to-diagnose failure. Internalise this sequence and you can read any paper's training code cold.
The validation loop is structurally almost identical but with two additions: model.eval() called before the loop, and torch.no_grad() wrapping the forward pass. These solve different problems. model.eval() changes layer behaviour — Dropout stops masking neurons, BatchNorm uses accumulated running statistics instead of batch statistics. torch.no_grad() stops graph construction entirely, saving memory and time. You need both; neither substitutes for the other.
The most common production bug I still see in 2026: calling model.forward(x) directly instead of model(x). It works identically in isolation, but it bypasses all registered forward hooks — hooks that profilers, debuggers, quantisation tools, and libraries like torchvision rely on. Always call the model as a callable. The __call__ method is what wires up the hook infrastructure; forward() is just the computation you define.
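The five-step loop and the validation pattern fit in a few lines. This is a minimal sketch on synthetic data: TinyRegressor, the tensor shapes, and the hyperparameters are illustrative assumptions, not a recommended configuration.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyRegressor(nn.Module):  # hypothetical model for illustration
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 16), nn.ReLU(), nn.Dropout(0.1), nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.net(x)


# Synthetic regression data with a known linear relationship.
x_train, x_val = torch.randn(64, 4), torch.randn(16, 4)
true_w = torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
y_train, y_val = x_train @ true_w, x_val @ true_w

model = TinyRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()


def evaluate():
    model.eval()               # Dropout off, BatchNorm would use running stats
    with torch.no_grad():      # no graph construction
        return loss_fn(model(x_val), y_val).item()


val_before = evaluate()
for epoch in range(200):
    model.train()                         # restore training behaviour
    optimizer.zero_grad()                 # 1. zero gradients
    predictions = model(x_train)          # 2. forward (model(x), not .forward)
    loss = loss_fn(predictions, y_train)  # 3. loss
    loss.backward()                       # 4. backward
    optimizer.step()                      # 5. optimizer step
val_after = evaluate()
```

Calling model.train() at the top of each epoch matters because evaluate() leaves the model in eval mode; pairing the two calls is what keeps Dropout active during training.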
model.train() and model.eval() flip a flag that changes layer behaviour: Dropout randomly drops neurons in train mode and passes all of them in eval mode; BatchNorm updates running statistics in train mode and uses them in eval mode. torch.no_grad() is a completely separate mechanism that tells the autograd engine to stop building the computation graph. You can call model.eval() with gradients still flowing (unusual but valid) or call model.train() inside a torch.no_grad() block (for example, to refresh BatchNorm running statistics or to sample Dropout at inference time). Forgetting model.eval() during validation is one of the most common bugs in PyTorch codebases: your validation loss will fluctuate unpredictably and you will spend time blaming your learning rate or data pipeline.
- model.train() and model.eval() control Dropout and BatchNorm behaviour, not gradient computation. torch.no_grad() controls gradient computation, not layer behaviour. You need both for a correct validation loop. The conventional order is model.eval() first, then enter the torch.no_grad() context.
- torch.compile() is compatible with both. Compile the model once up front; switching modes can still trigger a one-time recompilation per mode.
- Always call the model as a callable (model(x)), never model.forward(x). The __call__ method is what wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on.
- Validation: model.eval() -> torch.no_grad() -> forward -> loss, with no backward or step. Both calls are required; neither replaces the other.
- Serving: model.eval() -> torch.inference_mode() -> forward. This is the fastest path; it disables both graph construction and version counter tracking.
- Gradient accumulation: call backward() every step, but call optimizer.step() + zero_grad() only every N steps. Divide the loss by N before backward() to keep gradient magnitudes consistent with a single large batch.
Data Loading with Dataset and DataLoader
You'll rarely keep all your training data in memory as a single tensor. Real-world datasets — images, text, logs — are large, expensive to load, and need to be shuffled, batched, and transformed on the fly. PyTorch's torch.utils.data.Dataset and DataLoader are the standard way to feed data into a training loop.
A Dataset subclass defines two things: __len__ (how many samples) and __getitem__ (how to load the i-th sample). That's it. The DataLoader then wraps the dataset and handles batching, shuffling, parallelism, and memory pinning. Writing a custom Dataset is the right approach for any data that doesn't fit in RAM — the Dataset tells PyTorch how to load each sample lazily, and the DataLoader manages the rest.
Three things almost always go wrong in production data loading: (1) num_workers set too high, so the workers exhaust file handles and memory and the OS starts swapping; (2) custom collate functions that leave tensors on CPU when the model is on GPU, with no later .to(device) on the batch; (3) a Dataset returning tensors of inconsistent shapes for variable-length data without proper padding. The error messages for these rarely point to the actual root cause.
For tabular data that fits in memory, a TensorDataset is perfectly fine. For images, torchvision's ImageFolder and Compose transforms handle most common pipelines. For text, Hugging Face datasets integrate cleanly with PyTorch's DataLoader.
Shuffling is essential for stochastic gradient descent — it prevents the model from learning the order of the data rather than the underlying distribution. Always set shuffle=True in your training DataLoader. For validation, shuffle=False is correct because you want the same deterministic ordering for comparison across epochs.
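A custom Dataset needs only the two methods named above. The sketch below uses a hypothetical SquaresDataset that computes each sample on demand; a real one would read from disk inside __getitem__. num_workers=0 keeps loading in the main process for simplicity.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):  # hypothetical example dataset
    """Produces (x, x squared) pairs lazily, one sample per index."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        x = torch.tensor([float(index)])
        return x, x ** 2


# Shuffle for training; keep validation order deterministic.
train_loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=True, num_workers=0)
val_loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=False, num_workers=0)

val_batches = list(val_loader)
batch_shapes = [tuple(features.shape) for features, _ in val_batches]
first_val_features = val_batches[0][0].flatten().tolist()
```

The last validation batch is smaller than batch_size because 10 is not divisible by 4; pass drop_last=True to the DataLoader if a layer in the model cannot handle a short batch.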
Training on GPU and Mixed Precision
GPUs accelerate tensor operations by orders of magnitude compared to CPUs, but they have limited memory and come with gotchas that trip up even senior engineers. Training on GPU is not just 'call .cuda()' — it requires careful device management, understanding of CUDA memory, and leveraging mixed precision to fit larger models and batch sizes.
PyTorch makes GPU training explicit: you move the model with model.to(device) and move each batch with batch.to(device). If any tensor is left on CPU while the rest of the operation is on GPU, you get a RuntimeError. The fix is to enforce a convention: device as a variable at the start of your script, and .to(device) on every batch at the point of creation.
Mixed precision training with Automatic Mixed Precision (torch.cuda.amp, also exposed as torch.amp in recent releases) is standard practice in 2026. It runs most operations in float16 while the model weights stay in float32, which can cut memory usage by nearly half and often gives up to roughly 2x throughput on GPUs with tensor cores. Enabling it takes two additions: a GradScaler, and an autocast context around the forward pass and loss computation. The scaler prevents small gradients from underflowing in float16.
GPUs have limited memory: a high-end A100 has up to 80GB, but most production setups use 16–32GB cards. If you run out of memory, reduce the batch size, add gradient accumulation to recover the effective batch size, or switch to mixed precision. The most common silent failure is loading the entire dataset onto the GPU by accident: calling .to(device) in the Dataset constructor, not on each batch inside the training loop, moves all the data to the GPU at once and causes an OOM before training starts.
As of PyTorch 2.x, torch.compile() with mode='reduce-overhead' or mode='max-autotune' can further optimise GPU kernel execution, but it requires a warm-up step and may increase compile time on the first batch. It's worth enabling for production serving, less for rapid experimentation.
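One way to wire autocast and GradScaler into a training step is sketched below. It is written to run on CPU as well: when no GPU is present, both autocast and the scaler are disabled and the step falls back to plain float32. The model and data are placeholders.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # this sketch only enables AMP on a GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16

model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
# torch.cuda.amp.GradScaler is the long-standing spelling; recent releases
# also expose torch.amp.GradScaler. With enabled=False it is a no-op.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

features = torch.randn(16, 8, device=device)
targets = torch.randn(16, 1, device=device)

optimizer.zero_grad()
# Autocast wraps the forward pass and the loss only, not backward().
with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
    loss = loss_fn(model(features), targets)

scaler.scale(loss).backward()   # backward on the scaled loss
scaler.unscale_(optimizer)      # unscale so clipping sees true magnitudes
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)          # skips the step if gradients overflowed
scaler.update()                 # adjusts the scale factor for the next step
```

Calling scaler.unscale_() before clipping is the detail that is easiest to miss: without it, the clipping threshold is compared against scaled gradients.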
GradScaler multiplies the loss by a scale factor, you call backward() on the scaled loss, and the scaler then divides the resulting gradients back down before the optimizer step. This keeps gradients in the representable range. Always use the scaler when you autocast to float16; the two are designed as a pair.
Checkpointing: The Difference Between a Mild Inconvenience and a Career-Ending Mistake
Nobody cares about your training loop when a spotty AWS instance reboots 47 hours in. They care about whether you picked up from epoch 14 or started over. Checkpointing isn't a nicety. It's your job security.
Real training runs cost real money. A single A100 hour burns ~$3. If you lose 40 hours of training because you only saved the final model, you just wasted $120 and a lot of patience. Senior engineers checkpoint obsessively because they've been burned.
The trick isn't just saving weights. It's saving optimizer state, RNG seeds, and the current epoch. That lets you resume identically — same learning rate schedule, same batch order, same everything. Anything less is a half-baked restore.
Build your checkpoint logic into the training loop from day one. Not after the first crash. You will crash. The question is whether you're ready.
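A checkpoint that supports an identical resume might look like the sketch below. The function names and dictionary keys are hypothetical; only the CPU RNG state is captured here, so a GPU run would also need torch.cuda.get_rng_state_all(), and a learning rate scheduler would need its own state_dict() saved the same way.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch):
    """Save everything needed to resume, not just the weights."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "rng_state": torch.get_rng_state(),  # CPU generator only
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore state in place and return the epoch to resume from."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    torch.set_rng_state(checkpoint["rng_state"])
    return checkpoint["epoch"] + 1


model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model(torch.randn(8, 4)).sum().backward()
optimizer.step()  # gives the optimizer some internal state to save

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pt")
    save_checkpoint(path, model, optimizer, epoch=14)

    restored_model = nn.Linear(4, 1)
    restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=1e-3)
    next_epoch = load_checkpoint(path, restored_model, restored_optimizer)
```

In a real run, write to a temporary filename and rename it once the save completes, so a crash mid-write cannot destroy the previous good checkpoint.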
Distributed Data Parallel: When One GPU Isn't Enough and Neither Is Your Patience
Your model takes 12 hours on one GPU. Your boss wants it in 2. You buy two more GPUs and expect 4 hours. That's not how DDP works. Distributed Data Parallel isn't magic. It's a carefully orchestrated dance of gradient synchronization, and poor implementation turns it into a slow-motion train wreck.
DDP works by splitting batches across GPUs. Each GPU computes gradients on its shard, then all-reduces them so every card has the average gradient. The bottleneck is that all-reduce communication. If your batch size per GPU is too small, GPUs spend more time talking than computing. Rule of thumb: each GPU should process at least 32 samples per forward pass.
Watch your batch size scaling. DDP gives near-linear speedup only if you increase the global batch size proportionally. Doubling GPUs? Double the batch size and adjust the learning rate. Otherwise, you get diminishing returns and your validation loss plateaus because you're taking noisier gradient steps.
Wrap your model with nn.parallel.DistributedDataParallel, not the older nn.DataParallel, which the PyTorch documentation recommends against even on a single machine. DataParallel serializes everything through GPU 0. It's a bottleneck masquerading as parallelism.
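The batch-size arithmetic behind DDP scaling can be written down directly. The helper below is hypothetical, and the linear learning-rate scaling it applies is a common heuristic assumed here for illustration; the text above only says to adjust the learning rate, so treat the rule as a starting point to validate, usually together with warm-up.

```python
def scale_for_ddp(per_gpu_batch, base_lr, base_global_batch, world_size):
    """Return (global batch size, linearly scaled learning rate).

    Assumes the linear scaling heuristic: the learning rate grows in
    proportion to the global batch size relative to a tuned baseline.
    """
    global_batch = per_gpu_batch * world_size
    scaled_lr = base_lr * global_batch / base_global_batch
    return global_batch, scaled_lr


# Baseline tuned on 1 GPU with batch 32 and lr 0.1, now moved to 4 GPUs.
global_batch, scaled_lr = scale_for_ddp(
    per_gpu_batch=32, base_lr=0.1, base_global_batch=32, world_size=4
)
print(f"global batch {global_batch}, learning rate {scaled_lr:.3f}")
```

Keeping the per-GPU batch fixed while the global batch grows is what preserves the compute-to-communication ratio on each card.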
Installation: Get PyTorch Running Before Your Coffee Gets Cold
You need PyTorch installed. Skip the pip install torch blanket statement — that's for people who enjoy debugging CUDA errors at 2 AM. You need the right wheel for your hardware.
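Check your driver with nvidia-smi; the CUDA version it reports is the highest version the driver supports, so pick a PyTorch build at or below it from the build matrix on pytorch.org. If you're CPU-only, grab the CPU build. If you're on an M-series Mac, the standard macOS build already includes the Metal Performance Shaders (MPS) backend. Copy the install command from the selector on pytorch.org instead of typing one from memory. For a CUDA 12.1 build it is one line: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121. Older guides show conda install pytorch torchvision torchaudio cudatoolkit=11.8 -c pytorch; that syntax is out of date for PyTorch 2.x. That's it. No excuses.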
After install, run torch.cuda.is_available() in a Python shell. If it returns False on a CUDA machine, your install is wrong. Fix it before you write a single line of training code.
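A slightly fuller post-install check is sketched below. pick_device is a hypothetical helper; it prefers CUDA, then MPS, then CPU, and the final matrix multiply confirms that the chosen device can execute a kernel.

```python
import torch


def pick_device():
    """Return the best available device: CUDA, then MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # absent on older builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
# torch.version.cuda is None on CPU-only builds.
print(f"torch {torch.__version__}, CUDA build: {torch.version.cuda}, using: {device}")

# Smoke test: run one real operation on the selected device.
smoke = torch.ones(2, 2, device=device) @ torch.ones(2, 2, device=device)
```

If torch.version.cuda prints None on a machine with an NVIDIA GPU, the CPU-only wheel was installed, which is the most common cause of torch.cuda.is_available() returning False.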
GPU Acceleration: Stop Burning CPU Cycles on Matrix Math
Your GPU is a parallel compute beast. Your CPU is a glorified traffic cop. Stop making the cop do math — that's the GPU's job. PyTorch makes this trivial: call .to('cuda') on your tensors and models.
Here's why this matters: as a rough order of magnitude, a 1024x1024 matrix multiply that takes ~50ms on CPU takes ~0.5ms on a 3090 GPU. That's 100x faster, although exact numbers depend on the hardware, dtype, and BLAS library. Now scale that across a training loop with millions of iterations. The math is brutal: you leave months of training time on the table by ignoring GPU acceleration.
Production rules: keep your model and tensors on the same device. Use torch.no_grad() for inference to save memory. If you're on a multi-GPU machine, use DistributedDataParallel; nn.DataParallel still works but funnels everything through GPU 0. For single GPU, just .to('cuda'). Always check tensor.device before operations: a CPU tensor talking to a GPU tensor throws a runtime error. That's not a bug, that's you being sloppy.
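To measure the CPU versus GPU gap on your own hardware, a timing helper along these lines is enough. time_matmul is a hypothetical name, and the matrix size and repeat count are arbitrary. The torch.cuda.synchronize() calls matter because CUDA kernels are launched asynchronously.

```python
import time

import torch


def time_matmul(device, size=512, repeats=10):
    """Average seconds per (size x size) matrix multiply on `device`."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    a @ b  # warm-up: excludes one-time initialisation from the timing
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before starting
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the timed kernels to finish
    return (time.perf_counter() - start) / repeats


cpu_seconds = time_matmul(torch.device("cpu"))
print(f"CPU: {cpu_seconds * 1e3:.3f} ms per matmul")
if torch.cuda.is_available():
    gpu_seconds = time_matmul(torch.device("cuda"))
    print(f"GPU: {gpu_seconds * 1e3:.3f} ms per matmul")
```

No expected timings are given because they vary widely between machines; the comparison is only meaningful when both runs use the same size and dtype.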
Call torch.cuda.synchronize() when timing. Without it, PyTorch queues ops asynchronously and your timestamps lie to you.
Enhancing Data Diversity through Augmentation
Models memorize, they don't generalize. Without diverse training data, your model fails on real-world shifts. Data augmentation injects synthetic variance—rotations, flips, noise, color jitter—without collecting new samples. PyTorch provides torchvision.transforms to chain operations declaratively. Apply augmentations inside Dataset.__getitem__ so each epoch sees different distorted versions of the same image. This prevents overfitting and forces the model to learn invariant features. The cost: CPU overhead on the data loader. Use multiple workers and prefetching to hide latency. Never augment validation or test sets—only training. Start with random horizontal flips and color jitter; they yield the highest ROI for vision tasks. For text, synonym replacement and back-translation work similarly. Augmentation is not a silver bullet—excessive distortion destroys signal. Tune intensities per dataset.
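The train-only rule can be enforced inside the Dataset itself. The sketch below hand-rolls a horizontal flip and additive noise in plain torch as a stand-in for a torchvision.transforms pipeline, so it has no extra dependency; AugmentedImages and its parameters are hypothetical.

```python
import torch
from torch.utils.data import Dataset


class AugmentedImages(Dataset):  # hypothetical dataset for illustration
    """Applies random augmentation per access, in training mode only."""

    def __init__(self, images, labels, train=True, noise_std=0.05):
        self.images = images      # shape (N, C, H, W)
        self.labels = labels
        self.train = train
        self.noise_std = noise_std

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        if self.train:
            if torch.rand(1).item() < 0.5:
                image = torch.flip(image, dims=[-1])  # flip along width
            image = image + self.noise_std * torch.randn_like(image)
        return image, self.labels[index]


images = torch.rand(4, 3, 8, 8)
labels = torch.tensor([0, 1, 0, 1])
train_set = AugmentedImages(images, labels, train=True)
val_set = AugmentedImages(images, labels, train=False)

augmented, _ = train_set[0]
untouched, _ = val_set[0]
```

Because the randomness is drawn inside __getitem__, the same index yields a different variant each epoch, while the validation set returns the stored image unchanged.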
Recurrent Neural Networks (RNNs)
Feedforward nets assume independence between inputs, which makes them a poor fit for sequences. RNNs loop hidden state across timesteps, letting information persist. PyTorch's nn.RNN processes variable-length sequences with a single API. The hidden state h carries context; each step receives current input x_t and previous state h_{t-1}. Vanilla RNNs suffer vanishing gradients over long sequences, so use nn.LSTM or nn.GRU instead. Stack multiple layers for deeper representations, but watch overfitting. The batch_first=True flag makes the module expect input shaped (batch, seq_len, features), the most intuitive layout for typical usage. Always pack padded sequences with nn.utils.rnn.pack_padded_sequence so the recurrence ignores padding tokens; nn.utils.rnn.pad_packed_sequence is the inverse that restores a padded tensor afterwards. RNNs remain a reasonable choice for short-to-medium sequential data, especially when interpretability of hidden states matters. For very long sequences, switch to Transformers.
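Handling variable-length sequences takes three utilities from torch.nn.utils.rnn: pad_sequence builds the padded batch, pack_padded_sequence tells the recurrent layer where each sequence really ends, and pad_packed_sequence converts the packed output back to a padded tensor. The sizes below are arbitrary illustration values.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence

torch.manual_seed(0)

# Three sequences of different lengths, 3 features per timestep,
# already sorted longest first.
sequences = [torch.randn(5, 3), torch.randn(3, 3), torch.randn(2, 3)]
lengths = torch.tensor([5, 3, 2])

padded = pad_sequence(sequences, batch_first=True)  # (batch=3, seq=5, feat=3)
packed = pack_padded_sequence(padded, lengths, batch_first=True)

lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Back to a padded tensor; positions past each length are filled with zeros.
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```

With packing, h_n holds the hidden state at each sequence's last real timestep, so the second sequence's final state equals output[1, 2] and not the state after two extra padding steps. Note that h_n keeps the shape (num_layers, batch, hidden) regardless of batch_first.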
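An RNN layer with batch_first=False (the default) expects input shaped (seq_len, batch, features) and will silently treat a (batch, seq_len, features) tensor as if it had that layout. Use batch_first=True to avoid shape bugs.
Finding PyTorch Jobs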
Employers want engineers who ship models, not just train notebooks. PyTorch jobs demand production skills: writing nn.Module subclasses, building custom Dataset loaders, handling GPU memory with torch.cuda.amp, and debugging autograd graphs. Focus on end-to-end pipelines—data ingestion, training, export to TorchScript, and serving via TorchServe or ONNX. Portfolio projects should include a requirements.txt, train.py with argparsing, and a README explaining trade-offs. Contribute to PyTorch open-source (e.g., bug fixes in torchvision or documentation patches) to get noticed. Network at PyTorch Conference or local meetups. Tailor your resume: list concrete metrics (e.g., “Reduced inference latency by 40% via mixed precision”). Avoid vague terms like “deep learning enthusiast.” Recruiters scan for keywords: torch.distributed, DDP, CUDA graphs, torch.compile. Practice system design for ML—how would you serve a model at 10k QPS?
autograd and custom nn.Module hooks.
Audience
This PyTorch basics guide is crafted for senior software engineers who have already paid their dues in general-purpose programming but are now navigating the treacherous waters of machine learning. You are not a data scientist fresh out of a bootcamp; you understand memory management, concurrency, and the grim reality of production systems. If you’ve ever cursed a Python script for silently consuming 16GB of RAM, you are in the right place. The material assumes you can read PyTorch’s C++ backend stack traces without flinching and that you care more about deterministic reproducibility than notebook aesthetics. We target engineers building pipelines that must survive latency SLAs and rolling deployments. Expect rigorous code, not hand-wavy explanations. This is for the builder who knows that a model is just another binary artifact—like a Docker image, but with more matrix multiplications and fewer dependency conflicts.
Prerequisites
Before you touch a single nn.Module, ensure your environment is battle-ready. First, Python 3.9+ is mandatory—3.8 is dead, stop resurrecting it. Install PyTorch 2.x (CUDA 12.1 or later) via pip, not conda, because conda has a tendency to silently corrupt your environment graph. You must understand Python’s import system, context managers for resource lifecycle, and the GIL’s limitations. For GPU work, have NVIDIA drivers 535+ and nvidia-smi ready to confirm CUDA availability. Know what a tensor is: not a list, not a numpy array—a first-class GPU citizen with strides and gradients. You should have debugged a segfault before; this is not a place for cargo-cult programming. Bring your own test infrastructure: pytest is mandatory. Finally, accept that you will write more data-loading code than model code—prepare your file I/O pipeline with mmap and shared memory fundamentals. No previous ML experience? Go elsewhere.
torch.no_grad() and model.eval() early.
Production model silently trained on accumulated gradients for 200 epochs
The training loop never called optimizer.zero_grad(). PyTorch accumulates gradients by default: every backward() call adds to existing .grad values and does not replace them. After 200 epochs at a decently sized batch size, the accumulated gradient magnitude was effectively 200x the correct value for the first batch seen. The optimizer was applying enormous, compounding weight updates that oscillated wildly around the loss minimum without ever settling. The model ended up with effectively random weights that happened to produce low training loss by memorising noise in the first few batches, a classic overfitting-via-gradient-corruption failure that is nearly impossible to diagnose from loss curves alone.
The fix: added optimizer.zero_grad() as the first line of every training step. Added gradient norm logging to the training dashboard; a norm above 10.0 now triggers an alert. Added gradient clipping (max_norm=1.0) as a standing safety net across all training jobs. Added validation loss divergence detection; an alert fires if val loss increases for five consecutive epochs relative to the rolling minimum.
- PyTorch accumulates gradients by default; zero_grad() is not optional, it is the first line of every training step
- Monitor gradient norms during training; a sudden spike almost always indicates accumulation or an unchecked learning rate schedule
- Validation loss trending down is not sufficient signal — always check for divergence between train loss and val loss over time
- Gradient clipping prevents catastrophic divergence from outlier batches or accumulation bugs — set it once and leave it on
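The monitoring and clipping safeguards from this incident can share one call, because clip_grad_norm_ returns the total gradient norm it measured before clipping. The sketch below is a minimal illustration: the model, data, and print-based alert are placeholders for a real dashboard hook, and the 10.0 threshold is the one quoted in the incident write-up.

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

ALERT_THRESHOLD = 10.0  # alert level quoted in the incident write-up
grad_norm_history = []

for step in range(5):
    features, targets = torch.randn(16, 4), torch.randn(16, 1)
    optimizer.zero_grad()  # first line of every training step
    loss = loss_fn(model(features), targets)
    loss.backward()
    # Returns the total norm measured before clipping, so one call gives
    # both the metric to log and the safety net.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    grad_norm_history.append(float(total_norm))
    if grad_norm_history[-1] > ALERT_THRESHOLD:
        print(f"step {step}: gradient norm {grad_norm_history[-1]:.2f} exceeds threshold")
    optimizer.step()

# After clipping, the combined norm of the stored gradients is at most 1.0.
post_clip_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
```

Logging the pre-clip norm is the important design choice: the post-clip value is capped by construction and would hide exactly the spike the alert is meant to catch.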
- Loss becomes NaN: enable torch.autograd.detect_anomaly() to identify which operation produced the NaN gradient. In my experience, the most common culprit is a log() applied to a prediction that dipped to exactly zero. Add a small epsilon (1e-8) inside any log call in your loss function.
- Loss does not decrease: check for torch.no_grad() wrapping the training loop; this is surprisingly easy to do when refactoring inference code into a shared utility. Also check for dead ReLU initialisation: if all pre-activations are negative at init, the entire gradient signal is zero from step one.
- Memory grows every step: check for the loss tensor being appended (not loss.item()) to a history list. Use .item() for scalar logging and .detach() for tensor logging. Also check for retain_graph=True being called repeatedly; it is almost never necessary in standard training and will silently accumulate the entire graph in memory.
- Validation results fluctuate: check that model.eval() is called before validation. Without it, Dropout randomly drops different neurons on every forward pass, and BatchNorm uses the current batch's statistics instead of the accumulated running statistics. The result is non-deterministic validation outputs even on identical input data, which looks exactly like training instability but is actually an evaluation bug.
torch.autograd.set_detect_anomaly(True)
print([(n, p.grad.norm()) for n, p in model.named_parameters() if p.grad is not None])
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Key takeaways
model.train() and model.eval() control layer behaviour like Dropout and BatchNorm. torch.no_grad() controls gradient computation. These are separate mechanisms. Confusing them is the single most common source of subtle training bugs in production PyTorch code.
Common mistakes to avoid
7 patterns
Forgetting optimizer.zero_grad() before loss.backward()
Call optimizer.zero_grad() as the first line of every training step: before the forward pass, before the loss computation, before anything. Add gradient norm logging to your training dashboard. Consider gradient clipping with max_norm=1.0 as a permanent safety net, not just a debugging tool.
Calling model.forward(x) directly instead of model(x)
Always call model(x). The forward() method is called internally; it is not the entry point.
Storing raw loss tensors in a list for logging
Use loss.item() to extract a plain Python float before storing or logging. .item() returns a value that carries no reference to the computation graph. Never append loss itself to a list; it keeps the entire graph alive for that batch in memory indefinitely.
Not calling model.eval() during validation
Call model.eval() before every validation loop and model.train() before every training loop. Treat them as a matched pair. If your codebase has multiple evaluation paths (validation, test, inference), add a utility function that ensures eval mode is set and inference_mode is active. Centralise it so it cannot be forgotten.
Using torch.no_grad() instead of torch.inference_mode() for production serving
Use torch.inference_mode() for all production inference paths. It disables both gradient computation and version counter tracking, which can give 10-20% faster execution on typical transformer and CNN architectures. Reserve torch.no_grad() for validation loops inside training runs where you may still need version tracking for other operations.
Setting num_workers too high in DataLoader
Forgetting to move data to GPU in the training loop (but moving the model)
Assert next(model.parameters()).device == features.device before the forward pass.
Interview Questions on This Topic
What is the computation graph in PyTorch and how does autograd use it to compute gradients?
The graph is freed after backward() by default, which keeps memory usage proportional to a single forward pass, not the entire training history.
Frequently Asked Questions