Senior 15 min · March 06, 2026

PyTorch Gradient Accumulation — 200 Epoch Silent Failure

Missing optimizer.zero_grad() caused 200x gradient accumulation over 200 epochs, corrupting weights silently.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.

Last updated: May 23, 2026
Quick Answer
  • PyTorch tensors are multi-dimensional arrays that live on CPU or GPU and optionally track gradients for backpropagation
  • requires_grad=True opts a tensor into the autograd engine — only set it on learnable parameters, never on input data
  • The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model
  • model.train() and model.eval() control layer behaviour (Dropout, BatchNorm) — they do NOT control gradient computation
  • Forgetting optimizer.zero_grad() causes gradient accumulation, which silently corrupts training
  • Always use torch.inference_mode() or torch.no_grad() during validation and serving — not optional in production
What Is Gradient Accumulation?

Gradient accumulation is a technique that simulates larger batch sizes by accumulating gradients over multiple forward-backward passes before performing an optimizer step. It exists because GPU memory is finite — a batch of 1024 images might not fit in VRAM, so you process 4 batches of 256, sum their gradients, then update weights once.

The silent failure in the title refers to a common pitfall: forgetting to normalize gradients by the number of accumulation steps, or failing to call optimizer.zero_grad() at the correct intervals. Either mistake scales the update incorrectly, and the model can train for 200 epochs without converging. Real-world usage: even when training ResNet-50 on ImageNet with batch size 8192 on 8x A100s (80GB each), you may still need gradient accumulation if you want effective batch sizes beyond 32k.

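When not to use it: if your effective batch size exceeds what your learning rate schedule can handle (e.g., beyond roughly 4096 for many vision models), you'll see degraded generalization. PyTorch's torch.cuda.amp mixed precision and gradient scaling interact non-trivially with accumulation: call scaler.scale(loss).backward() for every micro-batch, and call scaler.unscale_(optimizer) at most once per optimizer step, after the last micro-batch and before any gradient clipping. Unscaling part-way through an accumulation cycle mixes scaled and unscaled gradients and corrupts the update.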
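The interaction between loss scaling and accumulation can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the toy model, the random micro-batches, and accum_steps = 4 are all hypothetical, and on a machine without CUDA the scaler and autocast are disabled so the loop runs in plain float32. Newer PyTorch releases expose the same scaler under torch.amp, so the older torch.cuda.amp spelling used here may emit a deprecation warning.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = nn.Linear(4, 1).to(device)           # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# With enabled=False the scaler is a pass-through, so the same loop runs on CPU
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

accum_steps = 4                               # assumed accumulation factor
micro_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(8)]
optimizer_steps = 0

optimizer.zero_grad(set_to_none=True)
for i, (x, y) in enumerate(micro_batches):
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_cuda):
        # Divide by accum_steps so the summed gradient matches one large batch
        loss = loss_fn(model(x), y) / accum_steps
    scaler.scale(loss).backward()             # gradients stay scaled while accumulating

    if (i + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)            # once per optimizer step, after the last micro-batch
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        optimizer_steps += 1

print("optimizer steps taken:", optimizer_steps)
```

Eight micro-batches with an accumulation factor of four give two optimizer steps, and clipping sees the fully accumulated, unscaled gradient each time.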

Plain-English First

Imagine you're teaching a child to recognise cats by showing them thousands of pictures and correcting them every time they're wrong. PyTorch is the notebook, pencil, and eraser that lets a computer do exactly that — store the pictures as grids of numbers (tensors), measure how wrong each guess was (loss), and automatically figure out which knobs to tweak to do better next time (autograd). It doesn't decide what to learn; it gives you the tools to build the machine that learns.

PyTorch has become the dominant choice in academic research and is rapidly closing the gap in production systems. Understanding its foundations means you can read any ML paper's code, contribute to AI projects, and stop copy-pasting model architectures you don't understand.

The core problem PyTorch solves is bridging the gap between 'I have an idea for a model' and 'I have a working, trained model.' Frameworks like raw NumPy can store data, but they can't automatically track how a change in one number ripples through a thousand operations to affect a final error score. PyTorch does this invisibly with its autograd engine — and as of 2026, that engine underpins everything from two-layer regression models to the transformer architectures powering production LLMs.

The most common production failure I see: developers understand the happy path but not the failure modes. Training loops that silently accumulate gradients, validation code that forgets model.eval(), and inference that wastes GPU memory by not disabling autograd. This guide covers both the concepts and the production gotchas — because shipping a model that actually works in production is a different skill from getting a notebook to converge.

Why Gradient Accumulation Is Not a Free Lunch

Gradient accumulation is a technique that simulates larger batch sizes by summing gradients over multiple forward-backward passes before performing a single optimizer step. Instead of updating weights after every batch, you accumulate gradients across N micro-batches, then step. This lets you train with effective batch sizes that exceed GPU memory limits — a 4GB card can simulate a 256-sample batch by accumulating 32 micro-batches of 8 samples each.
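A small numerical check makes the mechanism concrete. The sketch below assumes a toy linear model, random data, and equal-sized micro-batches (all hypothetical); it compares the gradient from one full batch of 32 against four accumulated micro-batches of 8, with and without dividing the loss by the number of accumulation steps.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 5)
y = torch.randn(32, 1)
loss_fn = nn.MSELoss()                 # mean reduction over the batch

model_full = nn.Linear(5, 1)
model_accum = copy.deepcopy(model_full)
model_unscaled = copy.deepcopy(model_full)

# Reference: a single backward pass over the full batch of 32
loss_fn(model_full(x), y).backward()

accum_steps = 4
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    # Correct: divide each micro-batch loss by the number of accumulation steps
    (loss_fn(model_accum(xb), yb) / accum_steps).backward()
    # Incorrect: no division, so the summed gradient is accum_steps times too large
    loss_fn(model_unscaled(xb), yb).backward()

grads_match = torch.allclose(
    model_full.weight.grad, model_accum.weight.grad, atol=1e-5
)
unscaled_is_4x = torch.allclose(
    model_unscaled.weight.grad, accum_steps * model_full.weight.grad, atol=1e-5
)
print("scaled accumulation matches full batch:", grads_match)
print("unscaled accumulation is 4x too large: ", unscaled_is_4x)
```

The equivalence holds here because the model has no batch-dependent layers; with BatchNorm in the model the two gradients would differ.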

In practice, gradient accumulation changes the training dynamics in subtle ways. Each micro-batch computes gradients independently, but the optimizer sees only the accumulated sum. This means batch normalization statistics are computed per micro-batch, not per effective batch — a common source of silent degradation. Also, gradient clipping must be applied to the accumulated gradient, not per micro-batch, or you'll distort the gradient scale. The effective learning rate should remain tied to the effective batch size, not the micro-batch size, or convergence suffers.

Use gradient accumulation when your GPU memory cannot hold the desired batch size, typically for large models (transformers, CNNs) or high-resolution inputs. It is not a substitute for proper batch normalization handling: you must either freeze BN statistics or replace BN with a batch-independent layer such as GroupNorm or LayerNorm. SyncBatchNorm does not help here, because it synchronises statistics across devices, not across sequential micro-batches. In production, teams often hit a 200-epoch silent failure: the model trains fine for 150 epochs, then plateaus or diverges because BN statistics drifted from the true distribution over the effective batch.
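One way to freeze BatchNorm statistics during accumulation is sketched below. The helper name freeze_batchnorm_stats and the tiny convolutional model are hypothetical; the idea is simply to keep BN layers in eval mode while the rest of the model trains, or to avoid batch statistics altogether with GroupNorm.

```python
import torch
import torch.nn as nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def freeze_batchnorm_stats(model: nn.Module) -> None:
    """Hypothetical helper: put BN layers in eval mode so running stats stop updating."""
    for module in model.modules():
        if isinstance(module, BN_TYPES):
            module.eval()

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8), nn.ReLU())
model.train()
freeze_batchnorm_stats(model)      # re-apply after every call to model.train()

bn = model[1]
before = bn.running_mean.clone()
model(torch.randn(4, 3, 16, 16))   # forward pass in "training" mode
stats_unchanged = torch.equal(before, bn.running_mean)
print("BN running mean unchanged:", stats_unchanged)

# Alternative: a normalisation layer that never looks at batch statistics
gn_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.GroupNorm(num_groups=4, num_channels=8),
    nn.ReLU(),
)
gn_out = gn_model(torch.randn(4, 3, 16, 16))
```

Freezing suits fine-tuning from pretrained statistics; when training from scratch, swapping the layer is usually the cleaner choice.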

Batch Normalization Breaks
Gradient accumulation with batch norm computes running stats per micro-batch, not per effective batch — this silently corrupts inference statistics after many epochs.
Production Insight
Production scenario: training a ResNet-50 on 8GB GPUs with gradient accumulation of 4 micro-batches (effective batch 256).
Symptom: validation accuracy plateaus at epoch 150, then drops 2% — BN running mean/var diverge from true distribution.
Rule of thumb: if using BN, either freeze BN during accumulation or switch to group/layer norm; never accumulate more than 2 micro-batches without verifying BN stats.
Key Takeaway
Gradient accumulation simulates batch size, not batch statistics — BN, dropout, and loss scaling behave per micro-batch.
Effective learning rate must scale with effective batch size, not micro-batch size — use linear scaling rule.
Always validate convergence on a small run before committing to 200+ epochs; silent failures appear late.
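The linear scaling rule mentioned above is plain arithmetic. The numbers below are assumptions chosen for illustration, and the rule itself is a heuristic that is usually paired with a warm-up period rather than a guarantee.

```python
base_lr = 0.1             # assumed learning rate tuned for base_batch_size
base_batch_size = 256     # assumed reference batch size
micro_batch_size = 64     # what fits in GPU memory
accum_steps = 8           # micro-batches per optimizer step

# The optimizer effectively sees micro_batch_size * accum_steps samples per update
effective_batch_size = micro_batch_size * accum_steps

# Linear scaling: learning rate grows in proportion to the effective batch size
scaled_lr = base_lr * effective_batch_size / base_batch_size

print(effective_batch_size, scaled_lr)
```

With these values the effective batch is 512, twice the reference, so the learning rate doubles from 0.1 to 0.2.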
[Figure: Gradient Accumulation: Silent Failure at 200 Epochs. How gradient accumulation interacts with autograd, data loading, and checkpointing. Panels: Gradient Accumulation (simulates a larger batch by summing gradients over micro-batches); Autograd Graph (retains the computation graph until backward() is called); Loss Scaling (divide the loss by the accumulation steps to keep gradient magnitude); Optimizer Step (update weights only after N micro-batches); Checkpointing (save the model state dict after each epoch). Warning: forgetting to zero gradients after the optimizer step; always call optimizer.zero_grad() at the start of each accumulation cycle.]

Tensors: The DNA of Every PyTorch Model

A tensor is PyTorch's fundamental data container. Think of it as a NumPy array that can live on a GPU and take part in a recorded history of operations. A 1D tensor is a list of numbers (a vector), a 2D tensor is a table (a matrix), and a 3D tensor might be a single colour image whose three dimensions are colour channel, height, and width. A batch of such images adds a fourth, leading dimension.

What makes tensors special isn't the shape — it's the metadata they carry. Every tensor knows its data type (dtype), its device (CPU or CUDA GPU), and optionally whether it should track gradients. That last flag is what separates a plain number-holder from a value that participates in learning.

You'll reach for torch.tensor() when you're converting existing Python data, torch.zeros() or torch.ones() when initialising buffers, and torch.randn() for random initialisation with a standard normal distribution. The device placement decision — CPU vs GPU — happens at creation time, and moving data between devices is explicit, never automatic. That explicitness is a feature, not an oversight; it forces you to reason about where computation actually happens, which is the difference between a model that fits in GPU memory and one that crashes at batch two.

As of PyTorch 2.x, torch.compile() can fuse tensor operations into optimised kernels automatically — but only if your tensors are on the right device and dtype from the start. Sloppy tensor hygiene becomes measurably more expensive in 2026 than it was when compilation wasn't part of the picture.

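The dtype mismatch is the most common silent failure. torch.tensor() infers int64 from Python integers and float32 (the default dtype) from Python floats, while NumPy arrays default to float64 and keep that dtype through torch.from_numpy(). Elementwise arithmetic between float32 and float64 tensors promotes silently to float64, but matrix multiplications and nn layers with float32 weights raise a RuntimeError at operation time, not at creation time, so the error surfaces somewhere unexpected. Write float literals with a decimal point (1.0, not 1) or specify dtype explicitly at creation.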
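The dtype behaviour described above can be checked directly. This is a small sketch with made-up values; the float64 tensor stands in for data that arrived from NumPy or another double-precision source.

```python
import torch
import torch.nn as nn

from_floats = torch.tensor([1.0, 2.0])                      # inferred as float32
from_ints = torch.tensor([1, 2])                            # inferred as int64
as_double = torch.tensor([1.0, 2.0], dtype=torch.float64)   # e.g. data that came from NumPy

# Elementwise arithmetic promotes silently to the wider float type
promoted = from_floats + as_double
print(from_floats.dtype, from_ints.dtype, promoted.dtype)

# A layer with float32 weights rejects float64 input at operation time
layer = nn.Linear(2, 1)
batch = as_double.unsqueeze(0)      # shape (1, 2)
try:
    layer(batch)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print("RuntimeError:", err)

# Fix at the model boundary: cast once, explicitly
out = layer(batch.to(torch.float32))
```

Casting at the boundary keeps the failure close to its cause instead of several layers deep in the forward pass.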

io.thecodeforge.ml.tensor_fundamentals.py (Python)
import torch

# --- Creating tensors from real data ---
# Simulating a tiny dataset: 4 house sizes (sq ft) and their prices ($k)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])  # 1D tensor, shape (4,)
house_prices = torch.tensor([150.0, 220.0, 310.0, 410.0])    # 1D tensor, shape (4,)

print("Sizes tensor:", house_sizes)
print("Shape:", house_sizes.shape)        # torch.Size([4])
print("Data type:", house_sizes.dtype)    # torch.float32 — default for floats

# --- 2D tensor: batch of data (rows = samples, cols = features) ---
feature_matrix = torch.tensor([
    [750.0,  3.0, 1.0],   # size, bedrooms, bathrooms
    [1200.0, 4.0, 2.0],
    [1800.0, 4.0, 3.0],
    [2400.0, 5.0, 3.0],
])
print("\nFeature matrix shape:", feature_matrix.shape)  # torch.Size([4, 3])

# --- Device awareness: check and move to GPU if available ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("\nUsing device:", device)

# Move tensor to the target device — always do this before operations
# In PyTorch 2.x, doing this at creation time avoids an extra host-to-device copy
feature_matrix = feature_matrix.to(device)
print("Feature matrix device:", feature_matrix.device)

# --- Useful tensor operations ---
# Normalise features: (x - mean) / std — critical for stable training
# dim=0 means we compute one mean per column (per feature), across all rows (samples)
means = feature_matrix.mean(dim=0)
stds  = feature_matrix.std(dim=0)
normalised = (feature_matrix - means) / stds

print("\nNormalised features (first row):", normalised[0])

# --- requires_grad: opting a tensor INTO gradient tracking ---
# We do NOT set this on input data — only on learnable parameters
# Input data is fixed; we want gradients w.r.t. parameters, not the data itself
weight = torch.tensor([0.15], requires_grad=True)  # our model's single weight
bias   = torch.tensor([10.0], requires_grad=True)  # our model's bias term

print("\nWeight requires grad:", weight.requires_grad)       # True
print("House sizes requires grad:", house_sizes.requires_grad)  # False — data, not a parameter

# --- Checking tensor metadata in one place ---
# Useful diagnostic pattern during debugging
for name, t in [("weight", weight), ("bias", bias), ("sizes", house_sizes)]:
    print(f"{name:8s} | dtype: {t.dtype} | device: {t.device} | requires_grad: {t.requires_grad}")
Output
Sizes tensor: tensor([ 750., 1200., 1800., 2400.])
Shape: torch.Size([4])
Data type: torch.float32
Feature matrix shape: torch.Size([4, 3])
Using device: cpu
Feature matrix device: cpu
Normalised features (first row): tensor([-1.0967, -1.2247, -1.3056])
Weight requires grad: True
House sizes requires grad: False
weight   | dtype: torch.float32 | device: cpu | requires_grad: True
bias     | dtype: torch.float32 | device: cpu | requires_grad: True
sizes    | dtype: torch.float32 | device: cpu | requires_grad: False
Watch Out: dtype Mismatches Crash at Operation Time, Not Creation Time
float64 usually arrives from NumPy (whose default float dtype is float64) or from an explicit dtype=torch.float64. If it meets float32 (PyTorch's default for most neural network operations), the error won't appear when you create the tensor; it surfaces later, at the operation, with a message that rarely points back to where the mismatch was introduced. Pass floats as torch.tensor([1.0, 2.0]): the decimal point makes torch.tensor() infer float32 rather than int64. Or be explicit: torch.tensor([1, 2], dtype=torch.float32). In production code, I always add a dtype assertion at the model boundary so the failure is immediate and obvious.
Production Insight
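dtype mismatches between float64 inputs and float32 weights throw RuntimeError at operation time (in matrix multiplications and nn layers; elementwise ops promote silently instead), and the stack trace points to the operation, not where the wrong dtype was introduced.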
Device mismatches (CPU tensor passed to a GPU model) crash with a clear error but are still the most common debugging session in any new PyTorch project.
In PyTorch 2.x with torch.compile(), dtype and device inconsistencies also prevent kernel fusion, silently costing you throughput on top of correctness.
Rule: set device and dtype at tensor creation, assert at model input boundaries, and never rely on implicit casting.
Key Takeaway
Tensors carry three critical properties beyond their values: dtype, device, and requires_grad. Getting any one of these wrong silently breaks training in ways that are hard to trace — dtype mismatches crash at the wrong line, device errors surface mid-forward-pass, and missing requires_grad means parameters never update. Set requires_grad only on learnable parameters, never on input data, and treat dtype and device as first-class properties you set intentionally at creation.
Tensor Creation Decision
If: Converting existing Python data (lists, NumPy arrays)
Use: torch.tensor(). It copies the data and infers dtype, defaulting to float32 for Python floats. For large arrays, torch.from_numpy() avoids the copy.
If: Initialising model weights
Use: torch.randn() * init_scale or nn.init.kaiming_normal_. Never initialise all weights to zero; every neuron would compute identical gradients and the network would never differentiate.
If: Need a tensor on GPU from the start
Use: torch.randn(..., device='cuda') at creation. This avoids the extra host-to-device copy that torch.randn(...).to('cuda') would incur.
If: Input data for a model
Use: Do NOT set requires_grad=True. Only learnable parameters need gradient tracking. Setting it on inputs wastes memory and can silently include input tensors in the backward graph.

Autograd: How PyTorch Learns Without You Doing Calculus

Autograd is the reason PyTorch feels almost magical the first time it clicks. Every time you perform an operation on a tensor that has requires_grad=True, PyTorch silently builds a computation graph — a record of every step taken to produce the final result. When you call .backward() on a scalar output (almost always a loss value), PyTorch traverses that graph in reverse and computes the gradient of that output with respect to every participating tensor.

In plain English: you define the forward pass (what your model predicts), compute how wrong it was (the loss), call .backward(), and PyTorch fills in .grad on every learnable parameter — telling you 'if you nudge this value slightly, here's how much the loss would change.' You then use that information to nudge every parameter in the right direction. That nudge, applied repeatedly, is gradient descent.

Three rules to memorise before shipping anything: (1) .backward() with no arguments works only on a scalar tensor. If your loss is a multi-element tensor, call .mean() or .sum() first, or pass an explicit gradient argument of the same shape. (2) Gradients accumulate by default: every call to .backward() adds to existing .grad values rather than replacing them. Call optimizer.zero_grad() before each backward pass or gradients will pile up across batches and corrupt training in exactly the way the production incident above describes. (3) During inference, wrap code in torch.no_grad() or torch.inference_mode() to skip graph construction entirely. It is faster, uses less memory, and removes an entire class of production bugs.

The graph is destroyed after .backward() completes by default. This is intentional memory management: the graph for one forward pass can consume hundreds of megabytes on a deep network. Without destruction, GPU memory would grow linearly with training steps. This is also why you cannot call .backward() twice on the same graph without retain_graph=True — and retain_graph=True in a training loop is almost always a bug, not a feature.
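Both behaviours, accumulation into .grad and the graph being freed after backward(), can be seen on a single parameter. The values are arbitrary and chosen so the gradients are easy to verify by hand.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# Gradients accumulate: the second backward() adds to the first
(w ** 2).sum().backward()
first = w.grad.clone()       # d(w^2)/dw = 2w = 4
(w ** 2).sum().backward()
second = w.grad.clone()      # 4 + 4 = 8, because .grad was never zeroed

w.grad.zero_()

# The graph is freed after backward(): a second pass over the same graph fails
loss = (w ** 2).sum()
loss.backward()
try:
    loss.backward()
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

print(first.item(), second.item(), second_backward_failed)
```

Passing retain_graph=True to the first backward() would avoid the error, at the cost of keeping the saved intermediate tensors alive.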

One nuance worth knowing as of PyTorch 2.x: torch.compile() can aggressively optimise the forward and backward passes together, but it relies on the traced graph being consistent across calls. If your forward pass has Python-level control flow that depends on tensor values (not just tensor shapes), the compiler typically falls back to graph breaks and may recompile. One option is to move that logic into a function decorated with torch.compiler.disable() so it runs eagerly while the rest of the model stays compiled.

io.thecodeforge.ml.autograd_linear_regression.py (Python)
import torch

# Seed for reproducibility — always set this in experiments
# Without it, two runs with identical code produce different results and debugging becomes a nightmare
torch.manual_seed(42)

# --- Toy linear regression: predict house price from size ---
# Ground truth relationship: price ≈ 0.18 * size + 5  (the model must discover this)
house_sizes  = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])
house_prices = torch.tensor([140.0, 221.0, 329.0, 437.0])

# Learnable parameters — these are the knobs autograd will compute gradients for
weight = torch.tensor([0.01], requires_grad=True)  # terrible initial guess, intentionally
bias   = torch.tensor([0.01], requires_grad=True)

# Learning rate is tiny because our raw inputs are in the hundreds-to-thousands range
# Without normalisation, you need a proportionally smaller step to avoid overshooting
learning_rate = 1e-7

for epoch in range(6):
    # FORWARD PASS: compute predictions using current weight and bias
    # Broadcasting applies weight and bias across all 4 house sizes simultaneously
    predicted_prices = weight * house_sizes + bias

    # COMPUTE LOSS: Mean Squared Error — average squared error across all predictions
    loss = ((predicted_prices - house_prices) ** 2).mean()

    # ZERO GRADIENTS: must do this before backward()
    # .grad accumulates by default — if we skip this, epoch 2 adds to epoch 1's gradients
    if weight.grad is not None:
        weight.grad.zero_()
        bias.grad.zero_()
    # (In real code you'd use optimizer.zero_grad() instead of this manual approach)

    # BACKWARD PASS: autograd traverses the graph and fills .grad on weight and bias
    # This computes d(loss)/d(weight) and d(loss)/d(bias) via the Chain Rule
    loss.backward()

    # PARAMETER UPDATE: move weight and bias in the direction that reduces loss
    # torch.no_grad() here because we don't want this update operation itself tracked
    with torch.no_grad():
        weight -= learning_rate * weight.grad
        bias   -= learning_rate * bias.grad

    print(f"Epoch {epoch+1:2d} | Loss: {loss.item():.2f} | "
          f"weight: {weight.item():.5f} | bias: {bias.item():.5f} | "
          f"grad_w: {weight.grad.item():.4f}")

print("\nFinal model: price =", round(weight.item(), 4), "* size +", round(bias.item(), 4))
print("Target model:  price = 0.18 * size + 5")
print("Note: bias is far from 5.0 — this is expected with unnormalised features and only 6 epochs")

# Inference — graph construction is wasted work here; inference_mode is faster than no_grad
with torch.inference_mode():
    test_size = torch.tensor([2000.0])
    predicted = weight * test_size + bias
    print(f"\nPredicted price for 2000 sq ft: ${predicted.item():.1f}k")
Output (illustrative; exact figures will differ when the listing is run)
Epoch 1 | Loss: 78017.80 | weight: 0.04819 | bias: 0.01003 | grad_w: -381924.2500
Epoch 2 | Loss: 65099.14 | weight: 0.08349 | bias: 0.01006 | grad_w: -353016.2500
Epoch 3 | Loss: 54310.95 | weight: 0.11626 | bias: 0.01008 | grad_w: -327650.5000
Epoch 4 | Loss: 45330.67 | weight: 0.14673 | bias: 0.01011 | grad_w: -304708.0000
Epoch 5 | Loss: 37842.37 | weight: 0.17510 | bias: 0.01013 | grad_w: -283840.0000
Epoch 6 | Loss: 31569.55 | weight: 0.20153 | bias: 0.01015 | grad_w: -264344.7500
Final model: price = 0.2015 * size + 0.0102
Target model: price = 0.18 * size + 5
Note: bias is far from 5.0 — this is expected with unnormalised features and only 6 epochs
Predicted price for 2000 sq ft: $403.1k
How Autograd Actually Thinks About Your Computation
  • Forward pass: execute operations and record the graph — each operation node stores its own gradient function (grad_fn)
  • Backward pass: traverse the graph in reverse from the loss node, applying the Chain Rule at each node to accumulate gradients
  • The graph is rebuilt fresh on every forward pass — it captures the exact computation that just ran, including any Python-level branching
  • requires_grad=True on a tensor you create directly makes it a leaf node; leaf nodes are the tensors whose .grad is filled in after backward()
  • The gradient of a scalar loss with respect to all parameters is computed in a single .backward() call — you do not loop over parameters manually
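The points above can be inspected on a three-node graph. This is a minimal sketch with arbitrary values; grad_fn and is_leaf are the attributes autograd uses to record how each tensor was produced.

```python
import torch

w = torch.tensor([3.0], requires_grad=True)   # leaf: created directly, no grad_fn
x = torch.tensor([2.0])                        # plain data, not tracked
y = w * x                                      # non-leaf: produced by a multiplication
loss = (y - 1.0).pow(2).sum()

print("w:", w.is_leaf, w.grad_fn)
print("y:", y.is_leaf, type(y.grad_fn).__name__)
print("loss:", type(loss.grad_fn).__name__)

# One backward() call fills .grad on every leaf that requires gradients
loss.backward()
# By hand: d(loss)/dw = 2 * (w*x - 1) * x = 2 * 5 * 2 = 20
print("w.grad:", w.grad.item())
```

Only w receives a gradient: x never asked for one, and y is an intermediate result whose .grad is not retained by default.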
Production Insight
The dynamic graph means gradients are always correct for the computation that actually ran — not for a pre-compiled approximation of it.
This is why PyTorch won research: you can change architecture mid-experiment without recompiling anything.
In production with torch.compile(), the dynamic graph gets partially compiled for performance while retaining correctness for control-flow branches.
Rule: if your model has conditional logic that changes which operations run based on input values, PyTorch's dynamic graph handles this correctly where static-graph frameworks historically required workarounds.
Key Takeaway
Autograd automates the Chain Rule by recording operations in a dynamic computation graph that is rebuilt on every forward pass. The graph is destroyed after backward() by default — retain_graph=True in a training loop is almost always a memory leak waiting to happen. In production, use an optimizer rather than manual weight updates, and always wrap inference in torch.inference_mode() — it disables both gradient computation and version tracking, making it measurably faster than torch.no_grad() for serving workloads.
When to Use Autograd vs Manual Updates
If: Standard neural network training
Use: An optimizer (SGD, Adam, AdamW). It handles zero_grad, the update rule, momentum, and weight decay. Manual updates are for learning concepts, not shipping code.
If: Need custom gradient logic that PyTorch can't express
Use: torch.autograd.Function to define a custom forward and backward pass. Useful for custom CUDA kernels or numerically stable loss functions.
If: Inference only, no weight updates
Use: Wrap in torch.inference_mode() for production serving. Use torch.no_grad() during validation inside training loops where you may still need tensor version tracking.
If: Debugging suspicious gradients
Use: torch.autograd.gradcheck() to numerically verify computed gradients against finite differences. Invaluable when implementing custom backward passes.
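The last two rows combine naturally: write a custom Function, then verify its backward pass with gradcheck. The Square operation below is a hypothetical stand-in for real custom logic, kept trivial so the analytic gradient is obvious.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical custom op: y = x^2 with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)       # keep x for use in backward
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x     # chain rule: dL/dx = dL/dy * 2x

# gradcheck compares the analytic backward against finite differences.
# It expects float64 inputs with requires_grad=True.
inputs = torch.randn(5, dtype=torch.float64, requires_grad=True)
check_passed = torch.autograd.gradcheck(Square.apply, (inputs,), eps=1e-6, atol=1e-4)
print("gradcheck passed:", check_passed)
```

If the backward formula were wrong (for example, returning grad_output * x), gradcheck would raise an error describing the mismatch instead of returning True.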

Building a Real Training Loop with nn.Module

Writing raw tensor operations gets unwieldy past a handful of layers. PyTorch's nn.Module is the standard abstraction for any model — from a one-layer linear regression to a 70-billion-parameter language model. Every nn.Module subclass does two things: defines learnable parameters (or sub-modules that contain them) inside __init__, and defines the forward computation inside forward().

The beauty of nn.Module is composability. A large model is just nn.Module instances containing other nn.Module instances, arbitrarily deep. When you call model.parameters(), PyTorch recursively collects every learnable parameter in the entire tree — that flat iterator is exactly what you hand to the optimizer.

The training loop is the heartbeat of all ML work in PyTorch. It is always the same five steps: zero gradients, forward pass, compute loss, backward pass, optimizer step. That order is not arbitrary — skipping or reordering any step produces a specific and usually hard-to-diagnose failure. Internalise this sequence and you can read any paper's training code cold.
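The five steps look like this in a minimal, self-contained sketch. The synthetic data, the two-layer model, and the choice of Adam with a learning rate of 1e-2 are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(80, 3)
targets = features @ torch.tensor([0.5, 0.3, -0.2]) + 0.1 * torch.randn(80)

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)   # assumed optimizer settings
loss_fn = nn.MSELoss()

model.train()
losses = []
for epoch in range(50):
    optimizer.zero_grad()                        # 1. clear gradients from the previous step
    predictions = model(features).squeeze(-1)    # 2. forward pass, shape (80,)
    loss = loss_fn(predictions, targets)         # 3. compute loss
    loss.backward()                              # 4. backward pass fills .grad
    optimizer.step()                             # 5. update parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f} -> last loss {losses[-1]:.4f}")
```

The squeeze(-1) matters: without it the (80, 1) predictions would broadcast against the (80,) targets and the loss would be computed over an 80 by 80 grid.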

The validation loop is structurally almost identical but with two additions: model.eval() called before the loop, and torch.no_grad() wrapping the forward pass. These solve different problems. model.eval() changes layer behaviour — Dropout stops masking neurons, BatchNorm uses accumulated running statistics instead of batch statistics. torch.no_grad() stops graph construction entirely, saving memory and time. You need both; neither substitutes for the other.
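A validation pass with both controls in place might look like the sketch below. The model with a Dropout layer and the random validation data are hypothetical; the two checks at the end show what each control contributes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(3, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
val_x = torch.randn(20, 3)
val_y = torch.randn(20, 1)
loss_fn = nn.MSELoss()

model.eval()                     # layer behaviour: Dropout becomes a no-op
with torch.no_grad():            # autograd: no graph is built
    first = model(val_x)
    second = model(val_x)
    val_loss = loss_fn(first, val_y).item()

deterministic = torch.equal(first, second)   # identical outputs because Dropout is off
tracks_grad = first.requires_grad            # False because of no_grad

model.train()                    # switch back before the next training epoch
print(f"val loss {val_loss:.4f} | deterministic: {deterministic} | tracks grad: {tracks_grad}")
```

Dropping model.eval() would make the two forward passes differ; dropping torch.no_grad() would leave the outputs attached to a graph that nobody uses.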

The most common production bug I still see in 2026: calling model.forward(x) directly instead of model(x). It works identically in isolation, but it bypasses all registered forward hooks — hooks that profilers, debuggers, quantisation tools, and libraries like torchvision rely on. Always call the model as a callable. The __call__ method is what wires up the hook infrastructure; forward() is just the computation you define.
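The difference is easy to demonstrate with a forward hook. The hook function name and the single linear layer are hypothetical; any registered forward hook behaves the same way.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
hook_calls = []

def record_call(module, inputs, output):
    # Runs after forward() whenever the module is invoked through __call__
    hook_calls.append(output.shape)

handle = model.register_forward_hook(record_call)
x = torch.randn(4, 3)

model(x)            # goes through __call__, so the hook fires
model.forward(x)    # bypasses __call__, so the hook is skipped

handle.remove()
print("hook fired", len(hook_calls), "time(s)")
```

Two forward computations ran, but the hook saw only one of them.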

io.thecodeforge.ml.neural_network_training_loop.py (Python)
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# --- Dataset: synthetic house price prediction ---
# 100 samples, 3 features: normalised size, bedrooms, age
num_samples  = 100
num_features = 3

# Generate synthetic features and a linear target with realistic noise
raw_features = torch.randn(num_samples, num_features)
true_weights  = torch.tensor([0.5, 0.3, -0.2])  # size helps, age hurts price
target_prices = raw_features @ true_weights + 0.1 * torch.randn(num_samples)

# Train / validation split: 80 / 20
train_size = int(0.8 * num_samples)
train_features, val_features = raw_features[:train_size], raw_features[train_size:]
train_targets,  val_targets  = target_prices[:train_size], target_prices[train_size:]


# --- Model definition ---
class HousePriceNet(nn.Module):
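    def __init__(self):
        super().__init__()
        # Layer sizes reconstructed from the model summary printed in the output;
        # the original constructor signature is truncated in the source
        self.network = nn.Sequential(
            nn.Linear(num_features, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
        )

    def forward(self, x):
        return self.network(x)


# NOTE: the rest of this listing (training and validation loop) is truncated in the source.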
Output
Model parameters: 209
HousePriceNet(
  (network): Sequential(
    (0): Linear(in_features=3, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
    (4): Linear(in_features=8, out_features=1, bias=True)
  )
)
Epoch 10 | Train Loss: 0.2431 | Val Loss: 0.3102 | Grad Norm: 0.3847
Epoch 20 | Train Loss: 0.1187 | Val Loss: 0.1834 | Grad Norm: 0.2214
Epoch 30 | Train Loss: 0.0743 | Val Loss: 0.1214 | Grad Norm: 0.1563
Epoch 40 | Train Loss: 0.0521 | Val Loss: 0.0987 | Grad Norm: 0.1102
Epoch 50 | Train Loss: 0.0389 | Val Loss: 0.0812 | Grad Norm: 0.0831
Predicted: 0.6821 | Expected (approx): 0.7200
Interview Gold: model.train() vs model.eval() vs torch.no_grad()
These are three separate controls that solve three different problems. model.train() and model.eval() flip a flag that changes layer behaviour: Dropout randomly drops neurons in train mode and passes all of them in eval mode; BatchNorm updates running statistics in train mode and uses them in eval mode. torch.no_grad() is a completely separate mechanism that tells the autograd engine to stop building the computation graph. You can call model.eval() with gradients still flowing (unusual but valid) or call model.train() inside a torch.no_grad() block (used, for example, to recalibrate BatchNorm running statistics without computing gradients). Forgetting model.eval() during validation is one of the most common bugs in PyTorch codebases: your validation loss will fluctuate unpredictably and you will spend time blaming your learning rate or data pipeline.
Production Insight
model.train() and model.eval() control Dropout and BatchNorm behaviour, not gradient computation. torch.no_grad() controls gradient computation, not layer behaviour. You need both for a correct validation loop. The order of the two calls does not change the result, but both must be in effect before the forward pass; calling model.eval() first and then entering the torch.no_grad() context is the usual convention.
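In PyTorch 2.x, torch.compile() is compatible with both. Compiled code is typically specialised on the module's training flag, so the first forward pass in each mode may trigger a compilation; budget for that warm-up when you switch between training and evaluation.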
Rule: add print(model.training) as a one-time sanity check when setting up any new evaluation loop — it has saved me from at least three subtle bugs in production codebases.
Key Takeaway
The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model — the order is load-bearing, not stylistic. model.train() and model.eval() control layer behaviour (Dropout, BatchNorm); torch.no_grad() controls graph construction. Always call the model as a callable (model(x)), never model.forward(x) — the __call__ method is what wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on.
Training Loop Step Selection
If: Standard training iteration
Use: zero_grad -> forward -> loss -> backward -> step. Never skip or reorder; each step depends on the previous one completing correctly.
If: Validation or evaluation pass
Use: model.eval() + torch.no_grad() -> forward -> loss. No backward or step. Both calls are required; neither replaces the other.
If: Production inference / serving
Use: model.eval() + torch.inference_mode() -> forward. Fastest path; disables both graph construction and version counter tracking.
If: Gradient accumulation for large effective batch sizes
Use: Call backward() every step, call optimizer.step() + zero_grad() only every N steps. Divide the loss by N before backward() to keep gradient magnitudes consistent with a single large batch.

Data Loading with Dataset and DataLoader

You'll rarely keep all your training data in memory as a single tensor. Real-world datasets — images, text, logs — are large, expensive to load, and need to be shuffled, batched, and transformed on the fly. PyTorch's torch.utils.data.Dataset and DataLoader are the standard way to feed data into a training loop.

A Dataset subclass defines two things: __len__ (how many samples) and __getitem__ (how to load the i-th sample). That's it. The DataLoader then wraps the dataset and handles batching, shuffling, parallelism, and memory pinning. Writing a custom Dataset is the right approach for any data that doesn't fit in RAM — the Dataset tells PyTorch how to load each sample lazily, and the DataLoader manages the rest.
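A minimal sketch of the two pieces working together, assuming data that already sits in memory as tensors. The class name InMemoryHouseDataset and the random data are hypothetical; a disk-backed dataset would do its file reading inside __getitem__ instead.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class InMemoryHouseDataset(Dataset):
    """Hypothetical dataset: features and targets held as tensors."""

    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def __len__(self):
        return len(self.features)          # how many samples

    def __getitem__(self, index):
        # How to load the i-th sample; returns one (features, target) pair
        return self.features[index], self.targets[index]

torch.manual_seed(0)
dataset = InMemoryHouseDataset(torch.randn(100, 3), torch.randn(100))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_shapes = []
for batch_features, batch_targets in loader:
    # Move each batch to the model's device inside the training loop
    batch_features = batch_features.to(device)
    batch_targets = batch_targets.to(device)
    batch_shapes.append(tuple(batch_features.shape))

print(batch_shapes)
```

With 100 samples and a batch size of 32 the last batch holds only 4 samples; pass drop_last=True if the model cannot tolerate a short batch.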

Three things almost always go wrong in production data loading: (1) num_workers set too high, so that each worker holds its own file handles and memory and the OS starts swapping; (2) batches that are never moved to the model's device, so CPU tensors reach a GPU model (do the transfer in the training loop, not inside the collate function); (3) a Dataset returning tensors of inconsistent shapes for variable-length data without proper padding. The error messages for these rarely point to the actual root cause.
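For the third problem, a custom collate function that pads each batch is one common remedy. The sketch below uses made-up integer sequences and a hypothetical pad_collate function; it returns CPU tensors and leaves the device transfer to the training loop.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical variable-length samples: (sequence, label)
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]

def pad_collate(batch):
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # Pad every sequence with zeros up to the longest one in this batch
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

loader = DataLoader(samples, batch_size=4, shuffle=False, collate_fn=pad_collate)
padded, lengths, labels = next(iter(loader))

print(padded)
print(lengths.tolist())
```

Returning the original lengths alongside the padded tensor lets the model mask out the padding later.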

For tabular data that fits in memory, wrapping the tensors in a TensorDataset is perfectly fine. For images, torchvision's ImageFolder and Compose transforms handle most common pipelines. For text, Hugging Face datasets integrate cleanly with PyTorch's DataLoader.

Shuffling is essential for stochastic gradient descent — it prevents the model from learning the order of the data rather than the underlying distribution. Always set shuffle=True in your training DataLoader. For validation, shuffle=False is correct because you want the same deterministic ordering for comparison across epochs.

io.thecodeforge.ml.data_loading.py · PYTHON
import torch
from torch.utils.data import Dataset, DataLoader

# --- Custom Dataset for house price data from a CSV-like pattern ---
class HousePriceDataset(Dataset):
    def __init__(self
Output
Epoch 1: Val Loss = 0.0342
Epoch 2: Val Loss = 0.0321
Epoch 3: Val Loss = 0.0310
Epoch 4: Val Loss = 0.0302
Epoch 5: Val Loss = 0.0298
num_workers Pitfall: Too Many Workers Can Slow You Down
Increasing num_workers doesn't always increase throughput. Past a certain point (usually 4-8, depending on your CPU's core count and disk I/O), adding workers causes context switching overhead and memory pressure from duplicated data. If you see your CPU usage plateauing and GPU utilisation dropping, reduce num_workers. Rule: start with 4 workers, increase until GPU utilisation stops improving, then back off one.
Production Insight
DataLoader with too many workers can exhaust system memory. On Linux, workers are forked and share the parent's memory copy-on-write, but Python reference counting writes to those pages as objects are accessed, so each worker gradually ends up with its own copy: a 2GB dataset held in Python objects with 8 workers can grow toward 16GB.
Batches that never reach the model's device are the single most common data loading bug in production: a custom collate function returns CPU tensors and the training loop forgets the .to(device) call. The error surfaces in the forward pass, not at batch creation.
Rule: test your DataLoader in isolation with a single batch and print device of returned tensors before plugging it into training.
Key Takeaway
Dataset defines how to load one sample; DataLoader wraps it with batching, shuffling, parallelism, and pin_memory. Keep num_workers moderate (4–8), always move batches to the correct device inside the training loop, and test your data pipeline in isolation before adding model complexity.
Data Loading Strategy Selection
If: Small dataset that fits in RAM (e.g., typical CSV)
Use: Use TensorDataset or a simple in-memory Dataset. No lazy loading needed.
If: Large dataset on disk (images, text files)
Use: Implement a custom Dataset with __getitem__ that loads and transforms one sample per call. Use num_workers for parallelism.
If: Data from cloud storage (S3, GCS)
Use: Consider WebDataset or a streaming Dataset that fetches samples in the background. Be careful with network latency: batch downloading often outperforms per-sample streaming.
If: Variable-length sequences (text, time series)
Use: Implement a custom collate_fn that pads sequences to the same length within each batch. Use torch.nn.utils.rnn.pad_sequence.
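For the variable-length case in the last row, a padding collate function might look like the sketch below. The function name pad_collate and the (sequence, label) sample layout are assumptions for illustration; pad_sequence is the real PyTorch utility named above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch):
    """Hypothetical collate_fn: `batch` is a list of (sequence_tensor, label) pairs."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])   # true lengths, needed later for packing
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


# Three sequences of lengths 3, 1 and 2.
batch = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4]), 1),
    (torch.tensor([5, 6]), 0),
]
padded, lengths, labels = pad_collate(batch)
print(padded)
print(lengths)
```

It would be passed to the loader as DataLoader(dataset, batch_size=..., collate_fn=pad_collate). Returning the lengths alongside the padded tensor keeps the information a recurrent layer needs to ignore padding.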

Training on GPU and Mixed Precision

GPUs accelerate tensor operations by orders of magnitude compared to CPUs, but they have limited memory and come with gotchas that trip up even senior engineers. Training on GPU is not just 'call .cuda()' — it requires careful device management, understanding of CUDA memory, and leveraging mixed precision to fit larger models and batch sizes.

PyTorch makes GPU training explicit: you move the model with model.to(device) and move each batch with batch.to(device). If any tensor is left on CPU while the rest of the operation is on GPU, you get a RuntimeError. The fix is to enforce a convention: device as a variable at the start of your script, and .to(device) on every batch at the point of creation.

Mixed precision training with torch.cuda.amp (Automatic Mixed Precision; exposed as torch.amp in recent releases) has been standard practice since it landed in PyTorch 1.6. Under autocast, matrix multiplications and convolutions run in float16 while the weights stay in float32, which cuts activation memory by nearly half and gives roughly 2x throughput on GPUs with Tensor Cores. It takes two additions: a GradScaler, and an autocast context around the forward pass and loss computation (the backward pass runs outside it). The scaler prevents small gradients from underflowing in float16.

GPUs have limited memory: a high-end A100 has 80GB, but most production setups use 16–32GB cards. If you run out of memory, reduce the batch size, add gradient accumulation, or switch to mixed precision. The most common self-inflicted OOM: calling .to(device) in the Dataset constructor instead of on each batch inside the training loop. That moves the entire dataset to the GPU at once and exhausts memory before training starts.

As of PyTorch 2.x, torch.compile() with mode='reduce-overhead' or mode='max-autotune' can further optimise GPU kernel execution, but compilation happens on the first calls, so the first batches are slow and should be treated as warm-up. It's worth enabling for production serving, less for rapid experimentation.

io.thecodeforge.ml.gpu_mixed_precision.py · PYTHON
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# --- Device setup: use GPU if available ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = HousePriceNet(input_features=3).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Mixed precision components
scaler = GradScaler()  # scales loss to avoid underflow in float16

# Training loop with mixed precision
model.train()
for epoch in range(5):
    for batch_idx, (features, targets) in enumerate(train_loader):
        # Move batch to the same device as the model
        features, targets = features.to(device), targets.to(device)

        optimizer.zero_grad()

        # autocast context: operations inside use float16 where safe, float32 where needed
        with autocast():
            predictions = model(features)
            loss = loss_fn(predictions, targets)

        # Backprop through scaler
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # optimizer.step() but unscales gradients first
        scaler.update()

    print(f"Epoch {epoch+1} completed")

# Inference — no mixed precision needed, but still need to .to(device)
model.eval()
with torch.inference_mode():
    sample = torch.randn(1, 3).to(device)
    output = model(sample)
    print(f"Sample prediction: {output.item():.4f}")
Output
Using device: cuda
Epoch 1 completed
Epoch 2 completed
Epoch 3 completed
Epoch 4 completed
Epoch 5 completed
Sample prediction: 0.5821
GradScaler: Your Friend Against Underflow
float16 has a very limited dynamic range: the smallest normal number is about 6.1e-5, and the smallest subnormal about 6e-8. Gradients below that underflow to zero, killing learning. GradScaler multiplies the loss by a scale factor (large at first), calls backward() on the scaled loss, then divides the resulting gradients back down before the optimizer step. This keeps gradients in the representable range. Always use the scaler with float16 autocast: the two are designed as a pair.
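The underflow described above is easy to reproduce on CPU. This is a minimal sketch of the idea behind loss scaling, not of GradScaler's internals: the value 1e-8 is an arbitrary example, and the scale of 2**16 is chosen to match GradScaler's default initial scale.

```python
import torch

tiny_grad = torch.tensor(1e-8)            # a small float32 gradient value
as_half = tiny_grad.half()                # too small for float16, rounds to zero
print(as_half)

scale = 2.0 ** 16                         # matches GradScaler's default init_scale
scaled = (tiny_grad * scale).half()       # the scaled value is representable in float16
recovered = scaled.float() / scale        # unscale in float32 before the optimizer step
print(scaled, recovered)
```

Scaling the loss has the same effect on every gradient because backpropagation is linear in the loss: multiply the loss by a constant and each gradient is multiplied by that constant.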
Production Insight
Mixed precision with autocast + GradScaler reduces memory usage by ~50% and speeds up training by up to 3x on Tensor Core GPUs (RTX 30xx, A100, H100).
Forgetting .to(device) on a single batch tensor causes a RuntimeError with a stack trace that points to the operation inside the forward pass — not to the missing .to() call. Debugging this in a 50-layer model is painful.
Rule: in production, always log the device of model parameters and the first batch tensor explicitly at training start. If they mismatch, fail fast with a clear message.
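The fail-fast rule above could be implemented along these lines. This is a sketch under stated assumptions: assert_same_device is a hypothetical helper, not a PyTorch API, and the tiny linear model stands in for a real network.

```python
import torch
import torch.nn as nn


def assert_same_device(model, *tensors):
    """Hypothetical fail-fast check: every tensor must live on the model's device."""
    model_device = next(model.parameters()).device
    for i, t in enumerate(tensors):
        if t.device != model_device:
            raise RuntimeError(
                f"Tensor {i} is on {t.device} but the model is on {model_device}; "
                "call .to(device) on the batch before the forward pass."
            )
    return model_device


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(3, 1).to(device)
features = torch.randn(4, 3).to(device)

# Run once on the first batch at training start and log the result.
print(f"Model and first batch are on: {assert_same_device(model, features)}")
```

Checking only the first batch keeps the overhead negligible while still catching the common case where the loop never moves data at all.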
Key Takeaway
GPU training requires explicit device placement: model.to(device) and batch.to(device) are mandatory. Mixed precision with torch.cuda.amp roughly halves activation memory and can double throughput on Tensor Core GPUs, usually without loss of accuracy. Pair float16 autocast with a GradScaler, since the scaler prevents gradient underflow. Always log device and first-batch location at startup to catch mismatches immediately.
Precision and Device Strategy
If: GPU available and batch size is a bottleneck
Use: Enable mixed precision: autocast + GradScaler. Start with batch size 32 and increase until OOM.
If: GPU available but large model doesn't fit even with mixed precision
Use: Use gradient accumulation (aggregate gradients over several micro-batches before stepping) and reduce batch size further. Or use model parallelism.
If: Inference on CPU (edge devices, CI/CD, no GPU)
Use: No mixed precision needed. Consider quantisation (torch.quantization) to reduce model size and increase inference speed on CPU.
If: Debugging a model that trains fine on CPU but crashes on GPU
Use: Check that every batch tensor is moved with .to(device). Check the model's device. Check that no tensor has requires_grad set when it shouldn't. Add device assertions in the forward pass.

Checkpointing: The Difference Between a Mild Inconvenience and a Career-Ending Mistake

Nobody cares about your training loop when a spotty AWS instance reboots 47 hours in. They care about whether you picked up from epoch 14 or started over. Checkpointing isn't a nicety. It's your job security.

Real training runs cost real money. A single A100 hour burns ~$3. If you lose 40 hours of training because you only saved the final model, you just wasted $120 and a lot of patience. Senior engineers checkpoint obsessively because they've been burned.

The trick isn't just saving weights. It's saving optimizer state, RNG seeds, and the current epoch. That lets you resume identically — same learning rate schedule, same batch order, same everything. Anything less is a half-baked restore.

Build your checkpoint logic into the training loop from day one. Not after the first crash. You will crash. The question is whether you're ready.

TrainingCheckpointer.py · PYTHON
# io.thecodeforge: ml-ai tutorial

import torch
import os

def save_checkpoint(model, optimizer, scheduler, epoch, loss, filepath):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'loss': loss,
        'rng_state': torch.get_rng_state()  # resume reproducibility
    }, filepath)
    print(f"Checkpoint saved at epoch {epoch}")

def load_checkpoint(model, optimizer, scheduler, filepath):
    if not os.path.exists(filepath):
        print("No checkpoint found. Starting from scratch.")
        return 0, float('inf')
    checkpoint = torch.load(filepath, weights_only=True)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    torch.set_rng_state(checkpoint['rng_state'])
    return checkpoint['epoch'], checkpoint['loss']

# Usage in training loop
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
start_epoch, best_loss = load_checkpoint(model, optimizer, scheduler, 'experiment_v3.pt')
print(f"Resuming from epoch {start_epoch}, best loss so far: {best_loss:.4f}")
Output
No checkpoint found. Starting from scratch.
Resuming from epoch 0, best loss so far: inf
Production Trap:
Don't just save every epoch. Implement a rolling window — keep the last 3-5 checkpoints plus the best one by validation loss. Disk is cheap; debugging a failed half-training run is not.
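One way to implement the rolling window is sketched below using only the standard library. The file naming scheme (epoch_0007.pt for periodic checkpoints, best.pt for the best one) and the helper name prune_checkpoints are assumptions for illustration; empty placeholder files stand in for real torch.save() output.

```python
import tempfile
from pathlib import Path


def prune_checkpoints(ckpt_dir, keep_last=3):
    """Delete all but the newest `keep_last` epoch checkpoints.

    Only files matching epoch_*.pt are considered, so best.pt is never touched.
    """
    epoch_files = sorted(Path(ckpt_dir).glob("epoch_*.pt"))  # zero-padded names sort by epoch
    stale = epoch_files[:-keep_last] if keep_last > 0 else epoch_files
    for path in stale:
        path.unlink()
    return sorted(p.name for p in Path(ckpt_dir).iterdir())


# Demonstration: six epoch checkpoints plus the best-by-validation-loss file.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in range(6):
        (Path(tmp) / f"epoch_{epoch:04d}.pt").touch()
    (Path(tmp) / "best.pt").touch()
    remaining = prune_checkpoints(tmp, keep_last=3)
    print(remaining)
```

Calling the pruning step right after each successful save keeps disk usage bounded without ever deleting the checkpoint you would roll back to.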
Key Takeaway
Checkpoint optimizer state and RNG seeds, not just model weights. Resume identical training or don't bother checkpointing.

Distributed Data Parallel: When One GPU Isn't Enough and Neither Is Your Patience

Your model takes 12 hours on one GPU. Your boss wants it in 2. You buy two more GPUs and expect 4 hours. That's not how DDP works. Distributed Data Parallel isn't magic. It's a carefully orchestrated dance of gradient synchronization, and poor implementation turns it into a slow-motion train wreck.

DDP works by splitting batches across GPUs. Each GPU computes gradients on its shard, then all-reduces them so every card has the average gradient. The bottleneck is that all-reduce communication. If your batch size per GPU is too small, GPUs spend more time talking than computing. Rule of thumb: each GPU should process at least 32 samples per forward pass.
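The per-GPU split is usually done with a DistributedSampler, which the paragraph above does not name; treat its use here as an assumption about a typical setup. A minimal sketch that inspects the shards on a single CPU, with no process group, by passing num_replicas and rank explicitly:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))   # 8 toy samples
world_size = 2

# Explicit num_replicas and rank mean no initialised process group is needed,
# which makes the sharding easy to inspect outside a real DDP job.
shards = [
    list(DistributedSampler(dataset, num_replicas=world_size, rank=r, shuffle=False))
    for r in range(world_size)
]
print(shards)
```

In a real job each rank would build one sampler, pass it to its DataLoader via sampler=..., and call sampler.set_epoch(epoch) at the start of every epoch when shuffling is enabled so the shuffle order changes between epochs.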

Watch your batch size scaling. DDP gives near-linear speedup only if the global batch size grows with the GPU count. Doubling GPUs? Keep the per-GPU batch size, which doubles the global batch, and adjust the learning rate to match. If you instead split a fixed global batch across more GPUs, each card gets less work per step, communication dominates, and the speedup flattens.

Wrap your model with nn.parallel.DistributedDataParallel, not the older nn.DataParallel, which the PyTorch documentation itself recommends against even on a single machine. DataParallel serializes everything through GPU 0. It's a bottleneck masquerading as parallelism.

MultiGPUTraining.py · PYTHON
# io.thecodeforge: ml-ai tutorial

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(rank, world_size):
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://127.0.0.1:29500',
        rank=rank,
        world_size=world_size
    )

def train_rank(rank, world_size):
    setup_ddp(rank, world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(512, 256).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001)

    data_batch = torch.randn(64, 512).to(rank)
    target = torch.randn(64, 256).to(rank)

    for epoch in range(5):
        optimizer.zero_grad()
        output = ddp_model(data_batch)
        loss = nn.MSELoss()(output, target)
        loss.backward()
        optimizer.step()
        print(f"Rank {rank}, Epoch {epoch}, Loss: {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    torch.multiprocessing.spawn(train_rank, args=(world_size,), nprocs=world_size)
Output
Rank 0, Epoch 0, Loss: 1.2412
Rank 1, Epoch 0, Loss: 1.2412
Rank 0, Epoch 1, Loss: 1.0873
Rank 1, Epoch 1, Loss: 1.0873
Senior Shortcut:
Use torchrun to launch DDP scripts. It spawns one process per GPU and sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for each of them. One command: torchrun --nproc_per_node=N your_script.py. Your script still calls init_process_group, but the manual spawn and rank plumbing go away.
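Inside a script launched by torchrun, rank information arrives through environment variables. A minimal sketch of reading them follows; the helper name read_torchrun_env and the simulated environment are illustrative assumptions, while RANK, LOCAL_RANK and WORLD_SIZE are the variables torchrun sets.

```python
import os


def read_torchrun_env(env=os.environ):
    """Read the rank information that torchrun exports to each worker process."""
    return {
        "rank": int(env["RANK"]),              # global rank across all nodes
        "local_rank": int(env["LOCAL_RANK"]),  # GPU index on this node
        "world_size": int(env["WORLD_SIZE"]),  # total number of processes
    }


# Simulated environment for illustration; under torchrun these are set for you.
fake_env = {"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"}
info = read_torchrun_env(fake_env)
print(info)
```

With these variables present, dist.init_process_group(backend='nccl') can be called without explicit rank or world_size arguments, and local_rank is the value to pass to torch.cuda.set_device.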
Key Takeaway
Scale batch size with GPU count. Use DistributedDataParallel, not DataParallel. DDP is linear only when communication time is negligible compared to compute time.

Installation: Get PyTorch Running Before Your Coffee Gets Cold

You need PyTorch installed. Skip the pip install torch blanket statement — that's for people who enjoy debugging CUDA errors at 2 AM. You need the right wheel for your hardware.

Check the CUDA version your driver supports with nvidia-smi (the header shows the highest version the driver can run), then pick a matching build from the matrix on pytorch.org. If you're on CPU-only, grab the CPU build. If you're on an M-series Mac, the standard macOS build already includes the Metal Performance Shaders (MPS) backend. Conda handles dependencies better than pip for GPU libraries, so use it. For the PyTorch 2.1 / CUDA 11.8 setup used in this section, the command is one line: conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia. That's it. No excuses.

After install, run torch.cuda.is_available() in a Python shell. If it returns False on a CUDA machine, your install is wrong. Fix it before you write a single line of training code.

verify_install.py · PYTHON
# io.thecodeforge: ml-ai tutorial

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda if torch.cuda.is_available() else 'N/A'}")
print(f"MPS available: {torch.backends.mps.is_available()}")

# Expected output (on a CUDA 11.8 system):
# PyTorch version: 2.1.0
# CUDA available: True
# CUDA version: 11.8
# MPS available: False
Output
PyTorch version: 2.1.0
CUDA available: True
CUDA version: 11.8
MPS available: False
Production Trap:
Don't install PyTorch via pip in a shared environment — it'll fight with system CUDA libraries. Use a conda environment with explicit CUDA toolkit pinning.
Key Takeaway
One wrong pip install costs more time than reading the conda docs. Use conda. Match your CUDA version exactly.

GPU Acceleration: Stop Burning CPU Cycles on Matrix Math

Your GPU is a parallel compute beast. Your CPU is a glorified traffic cop. Stop making the cop do math — that's the GPU's job. PyTorch makes this trivial: call .to('cuda') on your tensors and models.

Here's why this matters: a 1024x1024 matrix multiply on CPU takes ~50ms. On a 3090 GPU it takes ~0.5ms. That's 100x faster. Now scale that across a training loop with millions of iterations. The math is brutal — you leave months of training time on the table by ignoring GPU acceleration.

Production rules: keep your model and tensors on the same device. Use torch.no_grad() for inference to save memory. If you're on a multi-GPU machine, use DistributedDataParallel rather than nn.DataParallel, which funnels work through GPU 0. For single GPU, just .to('cuda'). Always check tensor.device before operations: a CPU tensor talking to a GPU tensor throws a runtime error. That's not a bug, that's you being sloppy.

gpu_acceleration.py · PYTHON
# io.thecodeforge: ml-ai tutorial

import torch
import time

# Time the matrix multiply on CPU
cpu_tensor = torch.randn(1000, 1000)
start = time.perf_counter()
cpu_result = cpu_tensor @ cpu_tensor.T
print(f"CPU time: {time.perf_counter() - start:.4f}s")

# Move to GPU and run once untimed: the first CUDA call pays one-off initialisation costs
gpu_tensor = cpu_tensor.to('cuda')
_ = gpu_tensor @ gpu_tensor.T
torch.cuda.synchronize()

# Time the GPU multiply, synchronising so the clock covers the actual kernel
start = time.perf_counter()
gpu_result = gpu_tensor @ gpu_tensor.T
torch.cuda.synchronize()
print(f"GPU time: {time.perf_counter() - start:.4f}s")
Output
CPU time: 0.0512s
GPU time: 0.0004s
Senior Shortcut:
Call torch.cuda.synchronize() before starting and before stopping the timer around GPU operations. Without it, PyTorch queues ops asynchronously and your timestamps lie to you.
Key Takeaway
Move everything to the GPU with .to('cuda'). The 100x speedup isn't a gimmick — it's production reality.

Enhancing Data Diversity through Augmentation

Models memorize, they don't generalize. Without diverse training data, your model fails on real-world shifts. Data augmentation injects synthetic variance—rotations, flips, noise, color jitter—without collecting new samples. PyTorch provides torchvision.transforms to chain operations declaratively. Apply augmentations inside Dataset.__getitem__ so each epoch sees different distorted versions of the same image. This prevents overfitting and forces the model to learn invariant features. The cost: CPU overhead on the data loader. Use multiple workers and prefetching to hide latency. Never augment validation or test sets—only training. Start with random horizontal flips and color jitter; they yield the highest ROI for vision tasks. For text, synonym replacement and back-translation work similarly. Augmentation is not a silver bullet—excessive distortion destroys signal. Tune intensities per dataset.
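The rule that only training data is augmented can be enforced in the Dataset itself. The sketch below deliberately swaps torchvision.transforms for a plain torch.flip so it runs without image files or extra dependencies; the class name FlipAugmentedDataset and its train flag are assumptions for illustration.

```python
import torch
from torch.utils.data import Dataset


class FlipAugmentedDataset(Dataset):
    """Hypothetical dataset over in-memory image tensors shaped (C, H, W)."""

    def __init__(self, images, train, flip_prob=0.5):
        self.images = images
        self.train = train
        self.flip_prob = flip_prob

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        # Augment only in training mode; validation samples are returned unchanged.
        if self.train and torch.rand(1).item() < self.flip_prob:
            img = torch.flip(img, dims=[-1])   # horizontal flip along the width axis
        return img


images = torch.arange(2 * 3 * 4 * 4, dtype=torch.float32).reshape(2, 3, 4, 4)
val_ds = FlipAugmentedDataset(images, train=False)
train_ds = FlipAugmentedDataset(images, train=True, flip_prob=1.0)  # always flip, to make the effect visible

val_unchanged = torch.equal(val_ds[0], images[0])
train_flipped = torch.equal(train_ds[0], torch.flip(images[0], dims=[-1]))
print(val_unchanged, train_flipped)
```

Because the random draw happens inside __getitem__, every epoch sees a fresh mix of flipped and unflipped samples, which is the online behaviour the paragraph recommends.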

AugmentDataset.py · PYTHON
# io.thecodeforge: ml-ai tutorial

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class AugmentedDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths
        self.train_transform = transforms.Compose([
            transforms.Resize((224, 224)),           # fixed size so samples stack into a batch
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.ColorJitter(brightness=0.2, contrast=0.2),
            transforms.ToTensor()
        ])

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')  # force 3 channels
        return self.train_transform(img)

    def __len__(self):
        return len(self.paths)

# Placeholder file names; a batch of 4 needs at least 4 images
paths = ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg']
dl = DataLoader(AugmentedDataset(paths), batch_size=4, num_workers=2)
for batch in dl:
    print(batch.shape)  # torch.Size([4, 3, 224, 224])
Output
torch.Size([4, 3, 224, 224])
Production Trap:
Applying augmentation twice—once in transforms and again in a separate preprocessing script—doubles memory and wastes compute.
Key Takeaway
Augment online per epoch, not offline once, to maximize data diversity without extra storage.

Recurrent Neural Networks (RNNs)

Feedforward nets assume independence between inputs, which makes them a poor fit for sequences. RNNs loop a hidden state across timesteps, letting information persist. PyTorch's nn.RNN handles whole sequences through a single API. The hidden state h carries context; each step receives the current input x_t and the previous state h_{t-1}. Vanilla RNNs suffer from vanishing gradients over long sequences, so use nn.LSTM or nn.GRU instead. Stack multiple layers for deeper representations, but watch for overfitting. The batch_first=True flag makes the layer expect (batch, seq_len, features), which is the most intuitive layout for typical usage. For padded batches, pack the input with nn.utils.rnn.pack_padded_sequence so the recurrence skips padding tokens, and unpack the output with pad_packed_sequence. RNNs remain a reasonable choice for short-to-medium sequential data, especially when interpretability of hidden states matters. For very long sequences, switch to Transformers.

BasicRNN.py · PYTHON
# io.thecodeforge: ml-ai tutorial

import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h0=None):
        x = self.embed(x)               # (B, T, E)
        out, h = self.rnn(x, h0)        # out: (B, T, H)
        logits = self.fc(out)           # (B, T, V)
        return logits, h

model = CharRNN(vocab_size=50)
input_seq = torch.randint(0, 50, (2, 10))  # batch=2, seq_len=10
logits, hidden = model(input_seq)
print(logits.shape)  # (2, 10, 50)
Output
torch.Size([2, 10, 50])
Production Trap:
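nn.RNN defaults to batch_first=False and then reads the input as (seq_len, batch, features). Passing a (batch, seq_len, features) tensor raises no error when the feature size matches, so batch and time are silently swapped. Set batch_first=True explicitly to avoid shape bugs.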
Key Takeaway
Use GRU or LSTM instead of vanilla RNN for any sequence longer than 10 steps.
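Handling padded batches in a recurrent layer can be sketched as follows, using a GRU as recommended above. The tensor sizes and sequence lengths are arbitrary assumptions; pack_padded_sequence and pad_packed_sequence are the real utilities in torch.nn.utils.rnn.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)

# Two sequences padded to length 5; their true lengths are 5 and 2.
padded = torch.randn(2, 5, 4)
lengths = torch.tensor([5, 2])
padded[1, 2:] = 0.0   # padding positions of the shorter sequence

# Pack so the recurrence never runs over padding, then unpack the outputs.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape)                        # (batch, max_len, hidden)
print(out[1, 2:].abs().sum().item())    # outputs at padded steps stay zero
# The final hidden state of the short sequence is its state at the last real step.
print(torch.allclose(h_n[-1, 1], out[1, 1]))
```

Without packing, the final hidden state of the short sequence would have been updated three more times on padding values, which is exactly the contamination packing prevents.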

Finding PyTorch Jobs

Employers want engineers who ship models, not just train notebooks. PyTorch jobs demand production skills: writing nn.Module subclasses, building custom Dataset loaders, handling GPU memory with torch.cuda.amp, and debugging autograd graphs. Focus on end-to-end pipelines—data ingestion, training, export to TorchScript, and serving via TorchServe or ONNX. Portfolio projects should include a requirements.txt, train.py with argparsing, and a README explaining trade-offs. Contribute to PyTorch open-source (e.g., bug fixes in torchvision or documentation patches) to get noticed. Network at PyTorch Conference or local meetups. Tailor your resume: list concrete metrics (e.g., “Reduced inference latency by 40% via mixed precision”). Avoid vague terms like “deep learning enthusiast.” Recruiters scan for keywords: torch.distributed, DDP, CUDA graphs, torch.compile. Practice system design for ML—how would you serve a model at 10k QPS?

JobSearchChecklist.py · PYTHON
# io.thecodeforge: ml-ai tutorial

# Simulate a job-fit check: each role requirement maps to the skill that covers it
skills = ['nn.Module', 'Dataset', 'DDP', 'TorchScript', 'Amp']
role = {'Mixed Precision': 'Amp', 'DataLoader': 'Dataset', 'Distributed Training': 'DDP'}
have = {s.lower() for s in skills}
score = sum(1 for needed in role.values() if needed.lower() in have)
print(f"Match score: {score}/{len(role)}")

# Example resume bullet
# "Built a PyTorch data pipeline with 4 workers, achieving 0.3ms batch loading"

# Open source tip
# Find issues: https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
Output
Match score: 3/3
Production Trap:
Don't list 'PyTorch' as a skill if you've only used Keras—recruiters will grill you on autograd and custom nn.Module hooks.
Key Takeaway
Ship a real PyTorch pipeline with distributed training profiling, and you'll outrank 90% of applicants.

Audience

This PyTorch basics guide is crafted for senior software engineers who have already paid their dues in general-purpose programming but are now navigating the treacherous waters of machine learning. You are not a data scientist fresh out of a bootcamp; you understand memory management, concurrency, and the grim reality of production systems. If you’ve ever cursed a Python script for silently consuming 16GB of RAM, you are in the right place. The material assumes you can read PyTorch’s C++ backend stack traces without flinching and that you care more about deterministic reproducibility than notebook aesthetics. We target engineers building pipelines that must survive latency SLAs and rolling deployments. Expect rigorous code, not hand-wavy explanations. This is for the builder who knows that a model is just another binary artifact—like a Docker image, but with more matrix multiplications and fewer dependency conflicts.

audience_confirm.py · PYTHON
# io.thecodeforge: ml-ai tutorial
# Confirm audience fit
import sys

def is_target_audience():
    try:
        import torch
        # Real engineers check tensor stability, not just import
        t = torch.tensor([1.0], device='cpu')
        assert t.item() == 1.0
        return True
    except Exception:  # import failure or failed check
        return False

if __name__ == '__main__':
    audience = is_target_audience()
    print(f'You belong here: {audience}')
    sys.exit(0 if audience else 1)
Output
You belong here: True
Production Trap:
Do not skip this section if you are a backend engineer. PyTorch's eager execution defaults can hide OOM errors until deploy. Confirm your tensor lifecycle before you write code.
Key Takeaway
This guide fits senior engineers who treat ML pipelines as production systems.

Prerequisites

Before you touch a single nn.Module, ensure your environment is battle-ready. First, Python 3.9+ is mandatory: 3.8 is dead, stop resurrecting it. Install PyTorch 2.x (CUDA 12.1 or later) in an isolated environment and stick to one package manager for it, because mixing pip and conda installs of torch is a reliable way to end up with mismatched CUDA libraries. You must understand Python's import system, context managers for resource lifecycle, and the GIL's limitations. For GPU work, have NVIDIA drivers 535+ and nvidia-smi ready to confirm CUDA availability. Know what a tensor is: not a list, not a numpy array, but a first-class GPU citizen with strides and gradients. You should have debugged a segfault before; this is not a place for cargo-cult programming. Bring your own test infrastructure: pytest is mandatory. Finally, accept that you will write more data-loading code than model code, and prepare your file I/O pipeline with mmap and shared memory fundamentals. No previous ML experience? Go elsewhere.

check_prereqs.py · PYTHON
# io.thecodeforge: ml-ai tutorial
# Verify prerequisites
import torch
import sys

def check_env():
    needed = {'python': lambda: sys.version_info >= (3, 9),
              'torch': lambda: hasattr(torch, '__version__'),
              # CUDA is optional: this check always passes so CPU-only machines are accepted
              'cuda': lambda: torch.cuda.is_available() or True}
    for name, check in needed.items():
        assert check(), f'Missing: {name}'
    return True

if __name__ == '__main__':
    check_env()
    print(f'Environment OK: PyTorch {torch.__version__}')
    sys.exit(0)
Output
Environment OK: PyTorch 2.1.0
Production Trap:
Assuming a local GPU is always available is a mistake. Build CPU fallback paths before anything else. Use torch.no_grad() and model.eval() early.
Key Takeaway
Prerequisites are strict: Python 3.9+, PyTorch 2.x, CUDA 12.1, and senior-level debugging instincts.
● Production incident · POST-MORTEM · severity: high

Production model silently trained on accumulated gradients for 200 epochs

Symptom
Training loss decreased steadily. Validation loss was noisy but trending down. Production A/B test showed zero lift over the baseline model — predictions appeared random.
Assumption
The model needed more training data or a different architecture. The team spent two weeks collecting more data.
Root cause
The training loop did not call optimizer.zero_grad(). PyTorch accumulates gradients by default: every backward() call adds to existing .grad values rather than replacing them. After 200 epochs at a reasonable batch size, the accumulated gradient magnitude was effectively 200x the correct value for the first batch seen. The optimizer was applying enormous, compounding weight updates that oscillated wildly around the loss minimum without ever settling. The model ended up with effectively random weights that happened to produce low training loss by memorising noise in the first few batches, a classic overfitting-via-gradient-corruption failure that is nearly impossible to diagnose from loss curves alone.
Fix
Added optimizer.zero_grad() as the first line of every training step. Added gradient norm logging to the training dashboard — a norm above 10.0 now triggers an alert. Added gradient clipping (max_norm=1.0) as a standing safety net across all training jobs. Added validation loss divergence detection — an alert fires if val loss increases for five consecutive epochs relative to the rolling minimum.
Key lesson
  • PyTorch accumulates gradients by default — zero_grad() is not optional, it is the first line of every training step
  • Monitor gradient norms during training — a sudden spike almost always indicates accumulation or an unchecked learning rate schedule
  • Validation loss trending down is not sufficient signal — always check for divergence between train loss and val loss over time
  • Gradient clipping prevents catastrophic divergence from outlier batches or accumulation bugs — set it once and leave it on
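The first, second and fourth lessons can be seen in a few lines. The following is a minimal sketch on a toy linear model (the model, data, and inflated inputs are assumptions for illustration, not the incident's code): it shows that two backward() calls without zero_grad() double the gradient, then runs a guarded step that clears, clips, and captures the gradient norm for logging.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
x, y = torch.randn(16, 3), torch.randn(16, 1)
loss_fn = nn.MSELoss()

# 1) The bug: two backward() calls without zero_grad() sum their gradients.
loss_fn(model(x), y).backward()
single = model.weight.grad.clone()
loss_fn(model(x), y).backward()
doubled = model.weight.grad.clone()
print(torch.allclose(doubled, 2 * single))

# 2) The guarded step: clear, compute, clip, log, update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer.zero_grad()
loss = loss_fn(model(x * 100), y)   # inflated inputs to provoke a large gradient
loss.backward()
# clip_grad_norm_ returns the total norm measured before clipping: the value to log and alert on
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()
print(f"grad norm before clip: {norm_before.item():.2f}, after: {norm_after.item():.4f}")
```

Logging the pre-clip norm matters: clipping alone hides the symptom, while the logged value is what would have exposed the accumulation bug early.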
Production debug guide · Common symptoms when training goes wrong · 5 entries
Symptom · 01
Loss becomes NaN after a few training steps
Fix
Check for division by zero, log of negative numbers, or gradient explosion. Enable torch.autograd.detect_anomaly() to identify which operation produced the NaN gradient. In my experience, the most common culprit is a log() applied to a prediction that dipped to exactly zero — add a small epsilon (1e-8) inside any log call in your loss function.
Symptom · 02
Model trains but produces identical outputs for all inputs
Fix
Check if gradients are zero everywhere. Verify requires_grad is True on every parameter layer. Check for accidental torch.no_grad() wrapping the training loop — this is surprisingly easy to do when refactoring inference code into a shared utility. Also check for dead ReLU initialisation: if all pre-activations are negative at init, the entire gradient signal is zero from step one.
Symptom · 03
GPU memory grows without bound each epoch
Fix
Check for tensors retained in the computational graph across loop iterations. The most common cause: appending loss (not loss.item()) to a history list. Use .item() for scalar logging and .detach() for tensor logging. Also check for retain_graph=True being called repeatedly — it is almost never necessary in standard training and will silently accumulate the entire graph in memory.
Symptom · 04
Validation loss fluctuates wildly between epochs
Fix
Check if model.eval() is called before validation. Without it, Dropout randomly drops different neurons on every forward pass, and BatchNorm uses the current batch's statistics instead of the accumulated running statistics. The result is non-deterministic validation outputs even on identical input data — which looks exactly like training instability but is actually an evaluation bug.
Symptom · 05
Model works on CPU but crashes on GPU with RuntimeError
Fix
Check device mismatch — this is almost always it. Every tensor involved in a single operation must live on the same device. Use .to(device) on both the model and every input tensor in your data loading step. If you are using a custom collate function in DataLoader, that is often where tensors quietly stay on CPU.
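Symptom 04 is straightforward to reproduce. This is a minimal sketch with a hypothetical small network containing Dropout: in train mode two forward passes on the same input disagree, in eval mode they match.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))
x = torch.randn(4, 8)

model.train()
with torch.no_grad():   # no_grad does not disable Dropout; only eval() does
    train_a, train_b = model(x), model(x)

model.eval()
with torch.inference_mode():
    eval_a, eval_b = model(x), model(x)

train_passes_match = torch.equal(train_a, train_b)   # different dropout masks on each pass
eval_passes_match = torch.equal(eval_a, eval_b)      # deterministic in eval mode
print(train_passes_match, eval_passes_match)
```

The train-mode block is wrapped in no_grad on purpose: it shows that disabling gradient tracking and switching layer behaviour are independent controls.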
★ PyTorch Training Debug Cheat Sheet · Quick commands to diagnose training and memory issues
Loss becomes NaN during training
Immediate action
Enable anomaly detection to find the operation producing NaN gradients
Commands
torch.autograd.set_detect_anomaly(True)
print([(n, p.grad.norm()) for n, p in model.named_parameters() if p.grad is not None])
Fix now
Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
GPU memory grows each epoch
Immediate action
Find tensors retained in the computational graph across loop iterations
Commands
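print(torch.cuda.memory_summary())  # memory_summary() returns a string; print it to see the report
print(loss.item())  # log the float, not the tensor: .item() returns a plain Python number with no graph attached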
Fix now
Use .item() for scalar logging and .detach() for tensor logging. Never append raw loss tensors to a list.
Gradients are all zeros — model not learning
Immediate action
Check if parameters have requires_grad and non-zero gradients after a backward pass
Commands
for n, p in model.named_parameters(): print(n, p.requires_grad, p.grad is not None)
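torch.autograd.gradcheck(model.double(), (test_input.double().requires_grad_(),))  # gradcheck needs float64 inputs with requires_grad=True; model.double() converts the model in place, so run it on a copy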
Fix now
Ensure no torch.no_grad() wraps training code and no in-place operations on leaf tensors break the graph
Model trains but validation metrics are random
Immediate action
Verify model.eval() is called before every validation pass
Commands
print(model.training) # Should be False during validation
print(any(isinstance(m, nn.Dropout) for m in model.modules()))
Fix now
Call model.eval() before every validation pass and model.train() before every training pass — treat them as a matched pair
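The gradient checks in the cheat sheet can be wrapped in a small reusable helper. This is a sketch under stated assumptions: gradient_report is a hypothetical function name, and the frozen first-layer weight only simulates a parameter that was accidentally excluded from training.

```python
import torch
import torch.nn as nn

def gradient_report(model: nn.Module) -> dict:
    """Map each parameter name to its gradient L2 norm, or None if it has no gradient."""
    report = {}
    for name, param in model.named_parameters():
        report[name] = None if param.grad is None else param.grad.norm().item()
    return report

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
model[0].weight.requires_grad_(False)  # simulate an accidentally frozen parameter

loss = model(torch.randn(16, 4)).pow(2).mean()
loss.backward()

report = gradient_report(model)
for name, norm in report.items():
    status = "NO GRAD" if norm is None else format(norm, ".4e")
    print(name, status)
```

Call it right after loss.backward() and before optimizer.step(). A None entry points at a frozen or disconnected parameter, while a norm of exactly zero points at a dead activation or a detached branch.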
PyTorch vs TensorFlow 1.x — Architectural Differences
| Feature / Aspect | PyTorch (Dynamic Graph) | TensorFlow 1.x (Static Graph) |
| --- | --- | --- |
| Graph construction | Built at runtime on every forward pass; debug with standard Python tools anywhere in the loop | Pre-compiled before any data flows through; the graph was fixed at definition time, making runtime inspection nearly impossible |
| Debugging | Standard Python debugger, print(), and pdb work anywhere in the forward pass with no special configuration | Required special tf.Print ops inserted into the graph; runtime errors produced stack traces that pointed to graph compilation, not the user code that caused them |
| Research flexibility | Architecture changes take effect immediately: swap a layer, change a loss function, add a branch mid-experiment with no recompilation step | Any architectural change required rebuilding and recompiling the graph, which could take seconds to minutes for large models |
| Production deployment | TorchScript or ONNX export required for optimised serving without a Python runtime; torch.compile() in 2.x closes most of the performance gap for GPU serving | SavedModel format was natively optimised for TF Serving; the static graph made deployment straightforward but locked you into the graph you compiled |
| Community adoption | Dominant in research: over 75% of ML papers published in 2024-2025 used PyTorch as the primary framework | Remains strong in enterprise production systems built before 2020; legacy TF1 codebases are still running in many large organisations |
| GPU memory control | Explicit .to(device): you decide what moves and when; nothing migrates automatically | Automatic placement with manual overrides via tf.device() context managers; less control but fewer explicit device calls |
| Gradient control | requires_grad per tensor; torch.no_grad() and torch.inference_mode() context managers; fine-grained control at the tensor level | GradientTape context manager in TF 2.x is a similar concept but opt-in rather than opt-out; in TF 1.x gradients were computed by tf.gradients() on the pre-compiled graph |
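To make the dynamic-graph row concrete, here is a minimal sketch of a module whose forward path depends on the values of its input. BranchingNet and its two layers are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn

class BranchingNet(nn.Module):
    """Toy module (hypothetical) whose forward path depends on the input values."""

    def __init__(self):
        super().__init__()
        self.small = nn.Linear(4, 1)
        self.large = nn.Linear(4, 1)

    def forward(self, x):
        # Ordinary Python control flow: the graph is rebuilt on every call,
        # so only the branch that actually ran is recorded.
        if x.abs().mean() > 1.0:
            return self.large(x)
        return self.small(x)

net = BranchingNet()
net(torch.full((2, 4), 0.1)).sum().backward()  # small inputs take the `small` branch

small_got_grad = net.small.weight.grad is not None
large_got_grad = net.large.weight.grad is not None
print(small_got_grad, large_got_grad)
```

Only the layer on the executed branch receives a gradient. You can also drop a breakpoint or print() inside forward() and inspect real tensor values, which is the debugging advantage the table describes.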

Key takeaways

1
Tensors carry three critical properties beyond their values: dtype, device, and requires_grad. Getting any one of these wrong can silently break training, and the resulting error often surfaces far from the line that caused it.
2
Autograd doesn't run continuously; it only records a computation graph when requires_grad=True tensors are involved, and only computes gradients when you explicitly call .backward() on a scalar loss. The graph is destroyed after each backward pass by default: retain_graph=True in a loop is almost always a memory leak.
3
The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model: the order is load-bearing. Memorise it and you can read any codebase or paper's training code cold.
4
model.train() and model.eval() control layer behaviour like Dropout and BatchNorm. torch.no_grad() controls gradient computation. These are two separate mechanisms: the train/eval flag and gradient tracking. Confusing them is the single most common source of subtle training bugs in production PyTorch code.
5
Always call model as a callable (model(x)), never model.forward(x): the __call__ method wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on. This one habit prevents an entire class of silent tooling failures.
6
Dataset + DataLoader form the standard data pipeline. Keep num_workers moderate, always move batches to the correct device, and test the pipeline in isolation.
7
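Mixed precision with torch.cuda.amp (exposed as torch.amp in recent releases) can substantially cut memory use and raise throughput on GPUs with Tensor Cores; the exact gain depends on the model and hardware. With float16, GradScaler is not optional: it prevents gradient underflow. Use inference_mode for production serving, not no_grad.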
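The takeaways above can be tied together in one short sketch: the five-step loop in order, the train/eval pair, and inference_mode for evaluation. The synthetic regression data, the learning rate, and the step count are assumptions for illustration, not settings from a production run.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic regression data (assumed): y = 3x + 1 plus a little noise.
x = torch.randn(256, 1)
y = 3 * x + 1 + 0.05 * torch.randn(256, 1)
x, y = x.to(device), y.to(device)

model = nn.Linear(1, 1).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
model.train()
for step in range(100):
    optimizer.zero_grad()             # 1. clear gradients left over from the last step
    predictions = model(x)            # 2. forward pass (call the module, not .forward)
    loss = criterion(predictions, y)  # 3. compute the scalar loss
    loss.backward()                   # 4. backpropagate
    optimizer.step()                  # 5. update the weights
    losses.append(loss.item())        # log a float, never the tensor

# Evaluation: eval mode for layer behaviour, inference_mode for gradient tracking.
model.eval()
with torch.inference_mode():
    val_loss = criterion(model(x), y).item()

print(losses[0], losses[-1], val_loss)
```

This toy evaluates on the training data only to keep the sketch short. A real run would hold out a separate validation set.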

Common mistakes to avoid

7 patterns
×

Forgetting optimizer.zero_grad() before loss.backward()

Symptom
Loss decreases during training but predictions are garbage in production. Gradient norms keep growing across epochs because each backward pass adds to the stored gradients instead of replacing them. The training run looks like it converged but the model has effectively random weights that memorised noise.
Fix
Call optimizer.zero_grad() as the first line of every training step — before the forward pass, before the loss computation, before anything. Add gradient norm logging to your training dashboard. Consider gradient clipping with max_norm=1.0 as a permanent safety net, not just a debugging tool.
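The accumulation itself is easy to reproduce in isolation. The sketch below runs backward() twice on the same batch without clearing gradients; the tiny linear model and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
criterion = nn.MSELoss()
features, targets = torch.randn(8, 3), torch.randn(8, 1)

# First backward pass: .grad holds the gradient of one batch.
criterion(model(features), targets).backward()
grad_once = model.weight.grad.clone()

# Second backward pass on the same batch WITHOUT zero_grad:
# the new gradient is added to the stored one.
criterion(model(features), targets).backward()
grad_twice = model.weight.grad.clone()

# Clearing first gives back the single-batch gradient.
model.zero_grad()
criterion(model(features), targets).backward()
grad_after_reset = model.weight.grad.clone()

print(torch.allclose(grad_twice, 2 * grad_once))
print(torch.allclose(grad_after_reset, grad_once))
```

PyTorch adds into .grad by design, because that is what makes intentional gradient accumulation possible. The bug is only in forgetting to reset at the boundary you intended.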
×

Calling model.forward(x) directly instead of model(x)

Symptom
Profilers produce no output. Libraries that register forward hooks (torchvision, quantisation tools, some logging frameworks) silently fail to fire. The model produces correct predictions but all hook-dependent tooling is invisible.
Fix
Always call the model as a callable: predictions = model(input_tensor). The __call__ method is where forward pre-hooks, forward hooks, and backward hooks are dispatched. Your forward() method is called internally; it is not the entry point.
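A minimal sketch shows the hook being skipped. The hook function record_call and the list hook_calls are hypothetical names; register_forward_hook is the standard nn.Module API.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
hook_calls = []  # hypothetical log of hook invocations

def record_call(module, inputs, output):
    # Forward hooks receive the module, its positional inputs, and its output.
    hook_calls.append(tuple(output.shape))

handle = model.register_forward_hook(record_call)
x = torch.randn(3, 4)

model.forward(x)                      # bypasses __call__, so the hook never runs
calls_after_forward = len(hook_calls)

model(x)                              # goes through __call__, so the hook fires
calls_after_call = len(hook_calls)

handle.remove()                       # detach the hook when done
print(calls_after_forward, calls_after_call)
```

Both calls return the same prediction, which is exactly why the mistake goes unnoticed until a profiler or quantisation tool reports nothing.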
×

Storing raw loss tensors in a list for logging

Symptom
GPU memory grows steadily across epochs with no obvious leak in the model code. Each epoch consumes more memory than the last until an out-of-memory crash occurs, often in the middle of a long training run.
Fix
Use loss.item() to extract a plain Python float before storing or logging. The float holds no reference to the computation graph. Never append loss itself to a list: the tensor keeps the entire graph for that batch alive in memory for as long as the list exists.
×

Not calling model.eval() during validation

Symptom
Validation loss fluctuates wildly between epochs even when training loss is smooth and decreasing. The model appears unstable, but the instability is in the evaluation, not the model weights.
Fix
Call model.eval() before every validation loop and model.train() before every training loop. Treat them as a matched pair. If your codebase has multiple evaluation paths (validation, test, inference), add a utility function that ensures eval mode is set and inference_mode is active — centralise it so it cannot be forgotten.
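One possible shape for that centralised utility is a context manager. This is a sketch, not a standard PyTorch API: the name evaluating is hypothetical, and the choice to restore the previous mode on exit is a design assumption.

```python
import contextlib

import torch
import torch.nn as nn

@contextlib.contextmanager
def evaluating(model: nn.Module):
    """Hypothetical helper: eval mode plus inference_mode, restoring the prior mode on exit."""
    was_training = model.training
    model.eval()
    try:
        with torch.inference_mode():
            yield model
    finally:
        model.train(was_training)  # train(False) is equivalent to eval()

model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.5), nn.Linear(8, 1))
model.train()

with evaluating(model) as eval_model:
    mode_inside = eval_model.training
    output = eval_model(torch.randn(2, 8))

mode_after = model.training
print(mode_inside, mode_after, output.requires_grad)
```

Routing every validation, test, and inference path through one helper like this means the eval/train pairing cannot be forgotten in a single code path.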
×

Using torch.no_grad() instead of torch.inference_mode() for production serving

Symptom
Inference is slower than expected and uses more memory than necessary. Not a crash — a silent performance regression that is easy to miss without profiling.
Fix
Use torch.inference_mode() for all production inference paths. It disables both gradient computation and the view and version-counter tracking that autograd otherwise maintains, which removes per-operation overhead; measure the gain on your own model rather than assuming a fixed percentage. Reserve torch.no_grad() for code whose outputs may later feed into autograd-recorded computation, because tensors created under inference_mode cannot be used there.
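The practical difference between the two context managers can be checked directly. In this sketch the model, the input, and the weight tensor are assumptions for illustration; is_inference() is the tensor method PyTorch provides for this check.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2).eval()
x = torch.randn(3, 4)

with torch.no_grad():
    out_no_grad = model(x)      # ordinary tensor that simply has no grad history

with torch.inference_mode():
    out_inference = model(x)    # inference tensor: autograd bookkeeping is skipped

# A no_grad output can still take part in later autograd-recorded computation.
weight = torch.randn(2, requires_grad=True)
(out_no_grad * weight).sum().backward()

# An inference tensor cannot be saved for backward, so the same pattern raises.
try:
    (out_inference * weight).sum().backward()
    inference_reuse_failed = False
except RuntimeError:
    inference_reuse_failed = True

print(out_no_grad.is_inference(), out_inference.is_inference(), inference_reuse_failed)
```

That restriction is the trade-off: inference_mode is the stricter and cheaper option, so it suits serving paths where outputs never re-enter training code.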
×

Setting num_workers too high in DataLoader

Symptom
Training starts slow or crashes with memory errors. CPU usage spikes to 100% and GPU utilisation is low. System may start swapping.
Fix
Start with num_workers=4 and increase until GPU utilisation plateaus. Monitor system memory: each worker may duplicate dataset memory on Linux. If you see memory pressure, reduce workers or set num_workers=0 to disable multiprocessing.
×

Forgetting to move data to GPU in the training loop (but moving the model)

Symptom
RuntimeError: Expected all tensors to be on the same device. Usually occurs on the first forward pass, but sometimes later if some branches avoid the mismatch.
Fix
Always call features, targets = features.to(device), targets.to(device) at the start of each batch. Use a single, consistent device variable. Before the forward pass, add a one-line assertion: assert next(model.parameters()).device == features.device.
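Put together, the per-batch device handling looks roughly like the sketch below. The synthetic TensorDataset, the batch size, and num_workers=0 are assumptions chosen so the pipeline can be tested in isolation on any machine.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic dataset (assumed): 32 samples with 4 features and 1 target each.
dataset = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)

model = nn.Linear(4, 1).to(device)

batch_shapes = []
for features, targets in loader:
    # Batches come out of the DataLoader on CPU; move them explicitly.
    features, targets = features.to(device), targets.to(device)
    # Fail fast with a clear message instead of a device-mismatch RuntimeError.
    assert next(model.parameters()).device == features.device
    batch_shapes.append(tuple(model(features).shape))

print(batch_shapes)
```

Iterating the loader once like this, before any training code runs, is a cheap way to catch shape, dtype, and device problems early.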
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What is the computation graph in PyTorch and how does autograd use it to...
Q02 · SENIOR
Explain the difference between model.train() and model.eval() and why yo...
Q03 · SENIOR
What happens if you forget to call optimizer.zero_grad() before each tra...
Q01 of 03 · SENIOR

What is the computation graph in PyTorch and how does autograd use it to compute gradients?

ANSWER
The computation graph is a directed acyclic graph (DAG) that records every operation performed on tensors that have requires_grad=True. Each operation becomes a node (with a grad_fn), and the edges represent the tensors flowing between operations. When you call .backward() on a scalar loss, autograd traverses this graph in reverse topological order, applying the chain rule at each node to compute the gradient of the loss with respect to every leaf tensor that required a gradient. The graph is dynamically built on each forward pass and is destroyed after backward() by default, which keeps memory usage proportional to a single forward pass rather than the entire training history.
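The answer can be demonstrated on a three-element tensor. This is a minimal sketch: the function y = sum(x * x) is an arbitrary choice whose gradient, 2x, is easy to verify by hand.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # leaf tensor
y = (x * x).sum()                                       # scalar produced by recorded ops

print(x.is_leaf, y.grad_fn is not None)

# backward() walks the graph in reverse and applies the chain rule: dy/dx = 2x.
y.backward()
print(x.grad)

# The graph's saved buffers are freed after backward() by default,
# so a second backward pass through the same graph raises.
try:
    y.backward()
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

print(second_backward_failed)
```

Passing retain_graph=True to the first backward() would allow the second call, at the cost of keeping those buffers in memory.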
FAQ · 1 QUESTION

Frequently Asked Questions

01
What is the difference between PyTorch and NumPy?