Senior 6 min · March 09, 2026

PyTorch Training Loop — Missing zero_grad Causes Nonsense

Q: What is the training loop in PyTorch, in simple terms?

It is the repeated process by which a model learns: make a prediction, measure the error, compute how the weights should change, update them, and do it again for the next batch. Everything else in training is built around that cycle.

Q: Why does my loss stay exactly the same every epoch?

The usual causes are mechanical before they are mathematical: optimizer.step() may be missing, parameters may be frozen, the learning rate may be effectively zero, the model might be in eval mode during training, or the loss may not be connected to the model outputs the way you think it is. Start by printing one parameter value before and after optimizer.step() and confirm it actually changes.
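The before-and-after check described above can be sketched as follows. This is a minimal illustration, assuming a toy `nn.Linear` model and random data in place of your real model and batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): replace with your real model, optimizer, and batch.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
inputs = torch.randn(8, 4)
targets = torch.randn(8, 2)

# Snapshot one parameter. clone() matters: without it the snapshot would
# alias the live tensor and always look unchanged.
before = model.weight.detach().clone()

optimizer.zero_grad(set_to_none=True)
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()

max_change = (model.weight.detach() - before).abs().max().item()
weights_changed = max_change > 0.0
print(f"max |delta weight| = {max_change:.6f} | changed = {weights_changed}")
```

If `weights_changed` is False in your own loop, look at the mechanical causes first: a missing step, frozen parameters, or a learning rate that is effectively zero.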

Q: Can I use multiple loss functions in one loop?

Yes. That is a standard pattern in multi-task learning and regularized objectives. Compute each loss, weight them as needed, sum them into one scalar total_loss, and call total_loss.backward() once. The important part is that the final object passed to backward must be a scalar unless you explicitly provide gradient arguments.
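A minimal sketch of that pattern is shown here. The two-head model, the loss weights (1.0 and 0.5), and the random data are illustrative assumptions, not values from a real project.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TwoHeadModel(nn.Module):
    """Hypothetical model with a shared backbone and two task heads."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(10, 16)
        self.class_head = nn.Linear(16, 3)
        self.reg_head = nn.Linear(16, 1)

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        return self.class_head(h), self.reg_head(h)


model = TwoHeadModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
ce = nn.CrossEntropyLoss()
mse = nn.MSELoss()

x = torch.randn(8, 10)
class_labels = torch.randint(0, 3, (8,))
reg_targets = torch.randn(8, 1)

optimizer.zero_grad(set_to_none=True)
logits, reg_out = model(x)
cls_loss = ce(logits, class_labels)
reg_loss = mse(reg_out, reg_targets)

# Weight and sum into one scalar, then call backward() exactly once.
total_loss = 1.0 * cls_loss + 0.5 * reg_loss
total_loss.backward()
optimizer.step()
```

Because both losses share the backbone, a single `backward()` on the sum sends gradients from both tasks into the shared layers.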

Q: Do I need a training loop if I use a pre-trained model?

Not for inference. If you only want predictions, you load the weights, switch to eval mode, and run the model. But if you are fine-tuning the pre-trained model on your data, then yes — you still need a training loop. Usually it is just a lighter one: smaller learning rate, fewer epochs, sometimes frozen backbone layers at the start.
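As a rough sketch of the "frozen backbone" setup, the snippet below freezes one part of a model and hands only the trainable parameters to the optimizer. The tiny `nn.Sequential` model and the learning rate are stand-ins chosen for illustration; a real pre-trained model would be loaded from a checkpoint.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained model (assumption): first layer plays the
# "backbone", last layer plays the task "head".
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)
backbone, head = model[0], model[2]

# Freeze the backbone so backward() computes no gradients for it.
for param in backbone.parameters():
    param.requires_grad = False

# Give the optimizer only what should actually be updated.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)

n_trainable = sum(p.numel() for p in trainable)
print(f"trainable parameters: {n_trainable}")
```

The rest of the loop is unchanged: zero_grad, forward, backward, step.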

Q: What is the difference between loss.item() and loss directly?

loss is still a tensor tied to the computation graph. loss.item() extracts its Python scalar value. For backward(), you need the tensor. For logging, averaging, and printing, you almost always want loss.item(). If you keep storing loss tensors in lists, you also keep their graphs alive longer than necessary.
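The difference is easy to see by inspecting both objects. A small sketch with a toy model and random data (both assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
criterion = nn.MSELoss()
loss = criterion(model(torch.randn(8, 4)), torch.randn(8, 1))

# The tensor still points at the graph that produced it.
has_graph = loss.grad_fn is not None

# .item() copies the value out as a plain Python float with no graph attached.
value = loss.item()

history = []
history.append(value)   # fine for logging and averaging
# history.append(loss)  # would keep the whole graph alive for every entry

print(type(loss).__name__, has_graph, type(value).__name__)
```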

Training loss drops but validation stays at random chance while the weights keep growing? Check for a missing zero_grad.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

Last updated: May 24, 2026


⚡Quick Answer

The training loop is a 4-step cycle: zero_grad, forward, backward, optimizer.step — the sequence matters because each step depends on state created by the previous one
optimizer.zero_grad() clears previous gradients — without it, gradients accumulate across batches and updates quickly become wrong
loss.backward() computes gradients through Autograd via the chain rule — you call it on the loss tensor, not on the model
model.train() and model.eval() switch Dropout and BatchNorm behavior — forgetting eval mode makes validation noisy and misleading
torch.no_grad() during validation avoids building a graph you will never backprop through — less memory, faster evaluation
The most common production bug is missing zero_grad; the close second is logging loss tensors instead of loss.item(), which quietly leaks memory

✦ Definition~90s read

What is the Training Loop in PyTorch?

The PyTorch training loop exists because optimization is stateful, and PyTorch chooses to make that state visible rather than hide it behind a one-line fit call. Each batch passes through the same sequence: clear old gradients, run the forward pass, compute the loss, backpropagate, and step the optimizer. That looks repetitive because it is repetitive. Training is controlled repetition.

The reason this pattern matters is that gradients in PyTorch accumulate by default. Parameters remember their previous .grad values until you clear them. Autograd also records the operations from the forward pass so backward can traverse that graph in reverse. The loop is not just procedural boilerplate — it is how you manage that state correctly.
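The accumulation behavior described above can be observed directly on a single tensor. This is a minimal sketch with a made-up two-element parameter:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# The gradient of sum(3 * w) with respect to w is 3 for every element.
(3 * w).sum().backward()
first = w.grad.clone()    # tensor([3., 3.])

# A second backward() without clearing adds to the stored gradient.
(3 * w).sum().backward()
second = w.grad.clone()   # tensor([6., 6.])

# Clearing first gives the gradient of the current pass only.
w.grad = None
(3 * w).sum().backward()
third = w.grad.clone()    # tensor([3., 3.])

print(first, second, third)
```

`optimizer.zero_grad(set_to_none=True)` does the same reset as `w.grad = None`, for every parameter the optimizer manages.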

In 2026, the canonical loop usually includes a few production-grade upgrades even for ordinary models: optimizer.zero_grad(set_to_none=True) for slightly lower memory traffic, mixed precision with torch.autocast on supported GPUs, gradient clipping when the model is deep or unstable, and explicit validation blocks with model.eval() plus torch.no_grad(). If your model is compile-friendly, torch.compile can sit on top of the same loop structure without changing the fundamentals.

What does not change is the contract. The loop still answers the same four questions every iteration: what did the model predict, how wrong was it, how should the weights change, and did they actually change.

Plain-English First

Think of the training loop as the repetition that turns a rough guess into a skill. The model looks at a batch of data and makes a prediction. You measure how wrong that prediction was. PyTorch then works backward through the model to figure out which internal weights contributed to the error, and the optimizer nudges those weights in a better direction. Then you repeat. That is the entire game. If you want a concrete picture, imagine coaching someone learning to throw darts: they throw, you measure how far off they were, you explain what to adjust, they correct, and they throw again. The training loop is just that feedback cycle written in code — disciplined, repetitive, and brutally honest.

The training loop is the core execution pattern in PyTorch — a cycle that repeats for every batch of data: clear gradients, compute predictions, compute loss, backpropagate, update weights. PyTorch keeps this explicit on purpose. You are never far from the mechanics of optimization.

That explicitness is the trade-off. You get full visibility into gradient flow, parameter updates, device placement, mixed precision, gradient clipping, and scheduling. The cost is that there is no place to hide sloppy thinking. The loop is short, but it is stateful, and the order of operations matters.

The failure pattern I see most often in real code reviews is not an exotic math bug. It is a copy-pasted tutorial loop with one small change in the wrong place. Someone forgets optimizer.zero_grad(). Someone validates without model.eval(). Someone logs the loss tensor itself instead of loss.item() and wonders why GPU memory keeps growing. The loop is simple. The discipline around it is what separates a model that trains cleanly from one that burns two days of GPU time to produce nonsense.

What Is the Training Loop in PyTorch and Why Does It Exist?

io/thecodeforge/ml/train_loop.py · PYTHON

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_one_epoch(model, loader, criterion, optimizer, device, scaler=None, max_grad_norm=1.0):
    model.train()
    running_loss = 0.0
    total_samples = 0

    use_amp = device.type == 'cuda'

    for inputs, labels in loader:
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        # 1) Clear stale gradients from the previous iteration
        optimizer.zero_grad(set_to_none=True)

        # 2) Forward pass + loss calculation
        with torch.autocast(device_type=device.type, dtype=torch.float16, enabled=use_amp):
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        # 3) Backward pass
        if scaler is not None and use_amp:
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)  # unscale before clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)

            # 4) Optimizer step
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
            optimizer.step()

        batch_size = inputs.size(0)
        running_loss += loss.item() * batch_size
        total_samples += batch_size

    return running_loss / total_samples


@torch.no_grad()
def validate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    for inputs, labels in loader:
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        running_loss += loss.item() * inputs.size(0)
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

    avg_loss = running_loss / total
    accuracy = correct / total
    return avg_loss, accuracy


# Example wiring
# model = MyClassifier().to(device)
# model = torch.compile(model)  # optional in PyTorch 2.x when the model is compile-friendly
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
# criterion = nn.CrossEntropyLoss()
# scaler = torch.amp.GradScaler('cuda', enabled=(device.type == 'cuda'))
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
#
# for epoch in range(10):
#     train_loss = train_one_epoch(model, train_loader, criterion, optimizer, device, scaler)
#     val_loss, val_acc = validate(model, val_loader, criterion, device)
#     scheduler.step()  # epoch-level scheduler: step after the epoch
#     lr = optimizer.param_groups[0]['lr']
#     print(f'Epoch {epoch + 1:02d} | train_loss={train_loss:.4f} | val_loss={val_loss:.4f} | val_acc={val_acc:.2%} | lr={lr:.2e}')

Output

Epoch 01 | train_loss=0.6124 | val_loss=0.4018 | val_acc=84.75% | lr=3.00e-04

The Training Loop Mental Model

zero_grad clears old gradient state so the next update reflects the current batch rather than stale history
The forward pass converts inputs into predictions using the current weights
The loss function turns prediction quality into a scalar signal Autograd can differentiate
backward populates parameter.grad by walking the graph in reverse
optimizer.step reads parameter.grad and updates the weights in place

Production Insight

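Use optimizer.zero_grad(set_to_none=True) in new code. It has been the default since PyTorch 2.0, so passing it explicitly mainly documents intent. For standard training the result is the same, with slightly less memory traffic; the one visible difference is that .grad is None rather than a zero tensor until the next backward().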

If you enable mixed precision, unscale gradients before clipping or your clip threshold is meaningless.

Rule: keep the training loop boring. Most production failures come from clever additions in the wrong place, not from the core pattern itself.

Key Takeaway

The loop is short because the underlying idea is simple: clear stale gradients, predict, measure error, backpropagate, update.

What makes training reliable is not complexity but respecting the state transitions inside that sequence.

If the order is wrong, the whole run is suspect even when the code still executes.

Training Loop Strategy Decision

If: Standard supervised training with one model, one loss, and ordinary validation
→ Use: A plain native PyTorch loop. It is explicit, easy to debug, and flexible enough for most production work

If: Need gradient accumulation, multiple losses, custom clipping, or unusual optimizer scheduling
→ Use: Stay in native PyTorch. These cases are exactly where explicit loops are worth having

If: Need to reduce boilerplate for callbacks, checkpointing, and distributed setup
→ Use: Consider Lightning or Accelerate, but keep the mental model of the native loop because you will still debug the same underlying states

If: Training on modern NVIDIA GPUs and the model is stable under compilation
→ Use: Add mixed precision and evaluate torch.compile. Both sit on top of the same loop and can improve throughput materially

[Figure: PyTorch Training Loop Flow]

Experiment Tracking: Logging Metrics and Checkpoints Like an Engineer

A training loop that only prints to stdout is fine for a notebook. It is not enough for a team. Once a model matters, you need a trace of what happened: which code ran, which hyperparameters were used, what the learning rate was at each epoch, what checkpoint corresponded to the best validation metric, and when the run started to go sideways if it did.

The simple pattern is still the right one. Log epoch-level metrics to a durable store. Keep checkpoints in object storage or a mounted artifact directory. Store the checkpoint path alongside the metrics row rather than pretending those two systems are unrelated. They are part of the same training story.

The practical benefit is not just reporting. It is rollback and diagnosis. When somebody says the new model is worse, you should be able to answer with evidence: which run, which epoch, which checkpoint, what the validation curve looked like, and whether the learning rate schedule or data version changed. Without that, model debugging turns into folklore.

io/thecodeforge/db/training_metrics.sql · SQL

-- io.thecodeforge: epoch-level training metrics for experiment tracking
INSERT INTO io.thecodeforge.training_history (
    run_id,
    model_id,
    epoch_number,
    train_loss,
    val_loss,
    val_accuracy,
    learning_rate,
    checkpoint_path,
    created_at
) VALUES (
    'run_2026_04_19_001',
    'ForgeResNet50-v3',
    12,
    0.2841,
    0.3198,
    0.9142,
    0.000300,
    's3://forge-models/ForgeResNet50-v3/epoch_12.pt',
    CURRENT_TIMESTAMP
);

-- Example rollback query: fetch the best validation checkpoint for a run
SELECT epoch_number, checkpoint_path, val_loss, val_accuracy
FROM io.thecodeforge.training_history
WHERE run_id = 'run_2026_04_19_001'
ORDER BY val_accuracy DESC, val_loss ASC
LIMIT 1;

Output

Metric logged successfully. Best checkpoint for run_2026_04_19_001 returned to the training dashboard.
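From the Python side, the same insert and rollback query might look like the sketch below. It uses the standard-library `sqlite3` module with an in-memory database purely for illustration. The unqualified table name `training_history`, the helper names `log_epoch` and `best_checkpoint`, and the metric values are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustration only; use a durable store in practice
conn.execute(
    """
    CREATE TABLE training_history (
        run_id TEXT, model_id TEXT, epoch_number INTEGER,
        train_loss REAL, val_loss REAL, val_accuracy REAL,
        learning_rate REAL, checkpoint_path TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def log_epoch(conn, run_id, model_id, epoch, train_loss, val_loss,
              val_accuracy, learning_rate, checkpoint_path):
    """Write one epoch's metrics and its checkpoint path as a single row."""
    conn.execute(
        "INSERT INTO training_history (run_id, model_id, epoch_number, train_loss, "
        "val_loss, val_accuracy, learning_rate, checkpoint_path) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (run_id, model_id, epoch, train_loss, val_loss,
         val_accuracy, learning_rate, checkpoint_path),
    )
    conn.commit()


def best_checkpoint(conn, run_id):
    """Return (epoch, path, val_loss, val_accuracy) for the best validation epoch."""
    return conn.execute(
        "SELECT epoch_number, checkpoint_path, val_loss, val_accuracy "
        "FROM training_history WHERE run_id = ? "
        "ORDER BY val_accuracy DESC, val_loss ASC LIMIT 1",
        (run_id,),
    ).fetchone()


# Made-up metrics for three epochs of a hypothetical run.
log_epoch(conn, "demo_run", "demo_model", 1, 0.61, 0.40, 0.85, 3e-4, "ckpt/epoch_1.pt")
log_epoch(conn, "demo_run", "demo_model", 2, 0.35, 0.31, 0.91, 3e-4, "ckpt/epoch_2.pt")
log_epoch(conn, "demo_run", "demo_model", 3, 0.28, 0.33, 0.90, 3e-4, "ckpt/epoch_3.pt")

best = best_checkpoint(conn, "demo_run")
print(best)
```

Parameterized queries (the `?` placeholders) keep run IDs and paths out of the SQL string itself.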

Forge Best Practice:

Do not hard-code learning rate, batch size, or checkpoint paths inside the loop. Load them from configuration at startup and persist that configuration with the run. The moment you are comparing experiments across weeks or teammates, inline constants become a liability.
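One lightweight way to follow that advice is a small config object that is loaded once and serialized with the run. The sketch below assumes a JSON config; the `TrainConfig` class, its field names, and its defaults are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical run configuration; frozen so the loop cannot mutate it."""
    learning_rate: float = 3e-4
    batch_size: int = 32
    epochs: int = 10
    checkpoint_dir: str = "checkpoints"


def load_config(text: str) -> TrainConfig:
    """Build a config from JSON text; unknown keys raise TypeError."""
    return TrainConfig(**json.loads(text))


# In practice the text would come from a file or the job launcher.
config = load_config('{"learning_rate": 0.001, "batch_size": 64}')

# Persist this string next to the run's metrics and checkpoints.
persisted = json.dumps(asdict(config), sort_keys=True)
print(persisted)
```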

Production Insight

Log epoch metrics to something durable — SQL, MLflow, or Weights & Biases — not just stdout.

The checkpoint path belongs in the same record as the metrics that justified keeping it.

Rule: if a run cannot be reproduced from its config, metrics, and checkpoint references, it was not tracked well enough.

Key Takeaway

Training is not complete when the epoch ends. It is complete when the metrics, config, and checkpoint are all durable and queryable.

That audit trail is what lets you explain success, debug regression, and roll back with confidence.

A good loop teaches the model. A good training system teaches the team.

Experiment Logging Decision

If: Solo developer with a small number of runs
→ Use: SQLite or even a disciplined CSV can be enough, provided checkpoint paths and configs are captured consistently

If: Team with shared training infrastructure and dashboard needs
→ Use: PostgreSQL or a managed experiment system so metrics are queryable across runs and users

If: Many experiments, multiple model families, artifact comparison, and sweeps
→ Use: MLflow or Weights & Biases. Purpose-built tools save time once experiment volume stops being trivial

Containerizing the Forge Training Environment

Training jobs fail for boring reasons far more often than people admit: wrong CUDA runtime, mismatched drivers, DataLoader workers starved by tiny shared memory, and buffered logs that make a crashed container look idle for ten minutes. Docker does not solve those problems automatically, but it gives you one place to make them explicit.

For 2026-era PyTorch stacks, three things matter immediately. First, pin the framework and CUDA versions rather than using latest. Second, use unbuffered Python output so logs appear in real time in whatever runtime you use. Third, remember that DataLoader workers share memory through /dev/shm inside the container. If you spin up multiple workers without enough shared memory, you get hangs, worker exits, or mysterious throughput collapse.

The other trap is silent CPU fallback. Teams assume the container is using the GPU because the base image has CUDA in its name. That proves nothing. The host still needs the NVIDIA Container Toolkit, the container still needs to run with --gpus all, and your startup logs should still print torch.cuda.is_available() plus the device name. If you do not verify that, you can burn hours training on CPU and only discover it when the epoch time looks absurd.
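A startup check along those lines could be as small as the sketch below. The function name `describe_device` and the log format are illustrative choices, not part of any existing script.

```python
import torch


def describe_device() -> str:
    """Report whether CUDA is usable and which device training will run on."""
    cuda_available = torch.cuda.is_available()
    if cuda_available:
        name = torch.cuda.get_device_name(0)
    else:
        name = "cpu"
    return f"cuda_available={cuda_available} | device={name}"


# Print before any data loading or model construction so the line is the
# first thing in the job log. flush=True avoids buffering delays.
startup_line = describe_device()
print(startup_line, flush=True)
```

If the job is meant to run on GPU, raising an error when `torch.cuda.is_available()` is False is a reasonable stricter variant.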

Dockerfile · DOCKERFILE

# io.thecodeforge: Reproducible PyTorch training container
# Pin versions. Never use a floating 'latest' tag in training infrastructure.
FROM pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime

WORKDIR /app

ENV PYTHONUNBUFFERED=1
ENV PIP_NO_CACHE_DIR=1

# System deps commonly needed by vision and tabular training stacks
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    curl \
    libgl1 \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# train.py is expected to print environment info (CUDA availability, device name) before training begins
CMD ["python", "-u", "train.py"]

# Example runtime command (run the pinned tag you built, not a floating one):
# docker run --gpus all --shm-size=2g -v $(pwd)/data:/app/data -v $(pwd)/checkpoints:/app/checkpoints forge-trainer:2.4.1-cuda12.4

Output

Successfully built image forge-trainer:2.4.1-cuda12.4

Startup log: cuda_available=True | device=NVIDIA A100-SXM4-40GB

Docker Tip:

If you are training with num_workers greater than 0, give the container enough shared memory. In practice, --shm-size=2g is a sensible baseline for many workloads. Without it, DataLoader worker crashes and unexplained stalls are common.

Production Insight

Use python -u or PYTHONUNBUFFERED=1 so logs stream immediately — delayed logs make dead jobs harder to diagnose.

Pin PyTorch and CUDA versions and print torch.cuda.is_available() at startup — never assume the GPU is active because the image name says CUDA.

Rule: for GPU training, the two Docker flags you check first are --gpus all and a sane --shm-size value.

Key Takeaway

Containerization is not just packaging. It is how you make runtime assumptions visible: framework version, CUDA version, logging behavior, and memory settings.

Most training container bugs are environment bugs, not model bugs.

Pin the environment, verify GPU visibility, and stop guessing.

Docker Training Environment Decision

If: Training on GPU with DataLoader num_workers = 0
→ Use: --gpus all and verify CUDA visibility at startup; shared memory is less likely to be the bottleneck

If: Training on GPU with DataLoader num_workers > 0
→ Use: --gpus all plus a larger --shm-size allocation because worker processes depend on shared memory for batch transfer

If: Training or fine-tuning on CPU-only infrastructure
→ Use: A CPU-specific PyTorch base image. It means a smaller pull size, fewer moving parts, and no unused CUDA runtime

If: Need to compile custom CUDA extensions during build
→ Use: A devel image for the build stage, then switch to a slimmer runtime image for execution if possible

Common Mistakes and How to Avoid Them

Most training-loop bugs come from state you forgot was stateful. Gradients persist until you clear them. Dropout and BatchNorm switch behavior based on mode. Loss tensors keep their graph unless you detach or convert them properly for logging. PyTorch is explicit about all of this, but explicit does not mean self-correcting.

The two mode switches people most often misuse are model.train() and model.eval(). These are not decorative. They change the behavior of real layers. Validation without model.eval() is not a small mistake; it changes the model you think you are measuring. The same goes for validation without torch.no_grad() — maybe the metrics are still numerically correct, but you pay the full memory cost of graphs you will never use.
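To see that the mode switch changes real behavior, it is enough to run one Dropout layer in both modes. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p) = 2 to keep the expected value.
layer.train()
train_out = layer(x)

# Eval mode: Dropout is the identity function.
layer.eval()
eval_out = layer(x)

print("train-mode values:", sorted(train_out.unique().tolist()))
print("eval-mode values: ", sorted(eval_out.unique().tolist()))
```

Validating in train mode means measuring a randomly thinned version of the network on every batch.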

The other class of mistake is device mismatch. PyTorch will never silently move your batch to match the model. If the model is on GPU and the inputs are on CPU, the first forward pass fails. That is good. What is less obvious is partial mismatch inside more complex training code — an auxiliary tensor created on CPU in the middle of loss calculation, or class weights left on CPU while logits are on GPU. The discipline is the same: decide the device once, move everything that participates in the computation there, and verify it early.
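The class-weights case mentioned above is sketched here. The model, weights, and batch are toy assumptions; the point is that every tensor taking part in the loss is moved to the same device and checked once, early.

```python
import torch
import torch.nn as nn

# Decide the device once.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 3).to(device)

# Auxiliary tensors used by the loss must live on the same device as the logits.
class_weights = torch.tensor([1.0, 2.0, 0.5])
criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))

inputs = torch.randn(8, 10).to(device)
labels = torch.randint(0, 3, (8,)).to(device)

# Verify early, before the first forward pass. Comparing tensor devices
# (rather than the requested device object) avoids 'cuda' vs 'cuda:0' surprises.
param_device = next(model.parameters()).device
assert inputs.device == param_device
assert labels.device == param_device
assert criterion.weight.device == param_device

loss = criterion(model(inputs), labels)
print(f"loss={loss.item():.4f} on {param_device}")
```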

io/thecodeforge/ml/validate_model.py · PYTHON

import torch


def validate_model(model, data_loader, criterion):
    device = next(model.parameters()).device
    model.eval()  # CRITICAL: switch Dropout / BatchNorm to inference behavior

    total_loss = 0.0
    total_correct = 0
    total_samples = 0

    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs = inputs.to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)

            outputs = model(inputs)
            loss = criterion(outputs, labels)

            total_loss += loss.item() * inputs.size(0)
            preds = outputs.argmax(dim=1)
            total_correct += (preds == labels).sum().item()
            total_samples += labels.size(0)

    avg_loss = total_loss / total_samples
    accuracy = total_correct / total_samples

    model.train()  # restore training mode for the next epoch
    return avg_loss, accuracy


# Common anti-patterns to avoid:
# 1) Forgetting model.eval() before validation
# 2) Forgetting torch.no_grad() in validation
# 3) Logging loss instead of loss.item()
# 4) Using the legacy .data attribute (for example torch.max(outputs.data, 1)) instead of outputs.argmax(dim=1)
# 5) Forgetting to switch back to model.train() before the next epoch

Output

Validation completed: avg_loss=0.2874 | accuracy=91.32%

Watch Out:

If your native loop is turning into a small framework — custom callbacks, distributed coordination, dozens of logging hooks, resumable checkpointing, gradient accumulation, EMA weights, early stopping, and scheduler orchestration — pause and ask whether you still want to maintain that abstraction yourself. Native PyTorch is the right default. It does not have to be the only tool you use.

Production Insight

Forgetting zero_grad causes accidental gradient accumulation; forgetting eval mode corrupts validation; logging loss tensors instead of loss.item() leaks memory.

Those are not advanced bugs. They are routine, and they are exactly why a good training loop should be predictable and slightly boring.

Rule: zero_grad before backward, model.eval() plus torch.no_grad() for validation, and use loss.item() for every metric you log.

Key Takeaway

The most common training-loop problems are bookkeeping problems: stale gradients, wrong model mode, wrong device, or graphs retained longer than necessary.

None of them are hard once you know where the state lives.

A reliable loop is explicit about that state at every phase.

Debugging Training Loop Mistakes

If: Loss stays flat or gets erratic after a few batches
→ Use: Check loop order first and verify optimizer.zero_grad() is inside the batch loop before backward

If: Validation accuracy is much lower or noisier than expected
→ Use: Ensure model.eval() is active during validation and that validation is not using random training augmentations

If: RuntimeError: Expected all tensors to be on the same device
→ Use: Move inputs, labels, and any auxiliary tensors used in the loss to the same device as the model before the forward pass

If: GPU memory climbs during validation or logging
→ Use: Wrap validation in torch.no_grad() and store loss.item() rather than the loss tensor

Why Your Loss Is Flat: Debugging the Silent Failures in Gradient Flow

You've seen it happen. The metrics dashboard shows a flat line for 10 epochs. Loss refuses to drop. Your first instinct might be learning rate, but seasoned engineers know the real culprit often lives deeper: vanishing or exploding gradients.

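PyTorch's autograd builds a dynamic computation graph on every forward pass. After backward(), gradients are added into each parameter's .grad attribute. If you are not clearing them with optimizer.zero_grad() each iteration, every step follows the running sum of all previous gradients rather than the current batch's signal. Separately, if your layers use sigmoid activations and their inputs drift to large magnitudes, the local gradient shrinks toward zero: the sigmoid derivative peaks at 0.25 and is already down to about 0.1 at |x| = 2. Multiplied across many layers, the gradient vanishes before it reaches the early ones.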
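To put numbers on sigmoid saturation, the derivative can be evaluated directly. This sketch uses only the standard library; the choice of sample points and the ten-layer chain are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x)). Its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:>4}: sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Backprop multiplies local derivatives. Ten saturated layers in a row
# (each seeing inputs near 5) shrink the signal by this factor:
chained = sigmoid_grad(5.0) ** 10
print(f"ten layers at x=5: {chained:.3e}")
```

Even in the best case of 0.25 per layer, the product decays geometrically with depth, which is why early layers are the first to stop learning.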

Always run a gradient sanity check early: log the mean and std of gradient norms for each layer during your first epoch. As rough rules of thumb, if norms sit below about 1e-6 for any layer, consider switching activation functions or adding batch normalization. If you see NaN or very large norms (say, above 100), clip gradients with torch.nn.utils.clip_grad_norm_. Clipping buys you debugging time without a model rewrite.

gradient_debug.py · PYTHON

# io.thecodeforge
import torch
import torch.nn as nn

def log_gradient_norm(model: nn.Module, epoch: int, step: int) -> float:
    """Print per-parameter and total L2 gradient norms. Call after loss.backward()."""
    total_sq = 0.0
    for name, param in model.named_parameters():
        if param.grad is not None:
            # detach() replaces the legacy .data access; .item() gives a plain float
            param_norm = param.grad.detach().norm(2).item()
            total_sq += param_norm ** 2
            print(f"Layer '{name}' grad norm: {param_norm:.6f}")
    total_norm = total_sq ** 0.5
    print(f"Epoch {epoch} step {step} - total grad norm: {total_norm:.4f}")
    return total_norm

Output

Layer 'fc1.weight' grad norm: 0.002354

Layer 'fc1.bias' grad norm: 0.000012

Epoch 1 step 10 - total grad norm: 0.0024

Production Trap:

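Silent BatchNorm corruption. If you run validation without calling model.eval(), BatchNorm layers normalize with per-batch statistics and keep updating their running mean and variance from validation data, even under torch.no_grad(). Validation metrics get noisier, and the running stats saved in your checkpoint are contaminated.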

Key Takeaway

Zero gradients every iteration. Log gradient norms in epoch 1. Clip if they explode. Prefer ReLU or LeakyReLU over sigmoid as the default hidden-layer activation.

Containerized Reproducibility: Why Your Model Works Locally But Fails in Production

The bug report lands on your desk: "Model accuracy drops from 94% to 55% after deployment." You check the code. Same training loop. Same checkpoint. The problem isn't your model, it's the environment. A different CUDA version, a Python stdlib change, or even numpy's random seed behavior across platforms can silently corrupt your training.

That's why you containerize everything. A Dockerfile with a pinned base image fixes the runtime. Data ordering is the more insidious problem: if you shuffle indices or split datasets without a fixed seed, each run sees samples in a different order or in different splits. DataLoader shuffling draws from PyTorch's global random number generator unless you pass it a dedicated one, so set torch.manual_seed(42) and also construct the loader with shuffle=True and generator=torch.Generator().manual_seed(42). Setting torch.backends.cudnn.deterministic = True restricts cuDNN to deterministic algorithms, and torch.backends.cudnn.benchmark = False stops it from auto-selecting different algorithms between runs. Both settings can cost some throughput.

Your production training pipeline should match the original run precisely: same batch order, same augmentation transforms applied in the same order, same loss scaling. Break this chain, and you're debugging phantom regressions.

reproducible_dataloader.py · PYTHON

# io.thecodeforge
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_reproducible_loader(
    data: torch.Tensor,
    labels: torch.Tensor,
    batch_size: int = 32,
    shuffle: bool = True,
    seed: int = 42
) -> DataLoader:
    torch.manual_seed(seed)
    generator = torch.Generator()
    generator.manual_seed(seed)
    dataset = TensorDataset(data, labels)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        generator=generator,
        num_workers=0  # Pin to main process for deterministic order
    )

Output

DataLoader configured. Each epoch's shuffle order is deterministic across runs with seed=42.

Engineering Reality:

A 'random_state' parameter in a library like sklearn seeds only that library's call. It does not seed PyTorch, and torch.manual_seed does not seed sklearn. If you mix them, seed each one explicitly, and control PyTorch-side randomness with one generator per DataLoader.

Key Takeaway

Containerize the whole stack. Set all random seeds explicitly. Use DataLoader(generator=torch.Generator().manual_seed(...)). Never assume reproducibility without runtime verification.
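The "runtime verification" part can be a very small check: build the loader twice with the same seed and compare the order of samples. This sketch uses a toy dataset of 20 integers; the helper name `first_epoch_order` is made up for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def first_epoch_order(seed: int) -> list:
    """Return the sample order of the first epoch for a freshly seeded loader."""
    dataset = TensorDataset(torch.arange(20))
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(
        dataset,
        batch_size=5,
        shuffle=True,
        generator=generator,
        num_workers=0,  # keep loading in the main process for this check
    )
    return [int(i) for (batch,) in loader for i in batch]


order_a = first_epoch_order(42)
order_b = first_epoch_order(42)

print(order_a)
print("same order across runs:", order_a == order_b)
```

A check like this is cheap enough to run at job startup or in CI, where it will catch an accidentally dropped `generator` argument.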

● Production incident · POST-MORTEM · severity: high

Model converges to confident nonsense because gradients kept accumulating

Symptom

Training loss decreased smoothly enough to look healthy on the dashboard. Validation accuracy stayed near random chance. Weight magnitudes kept growing from epoch to epoch. The model became more confident over time, but its predictions were confidently wrong.

Assumption

The first assumption was that the learning rate was too high or the labels were noisy. Both are reasonable guesses. Neither was correct.

Root cause

The training loop omitted optimizer.zero_grad() inside the per-batch iteration. PyTorch accumulates gradients into parameter.grad by design. That is useful when you intentionally want gradient accumulation. Here it was accidental. Because nothing ever cleared them, every update was based on gradients that included stale contributions from all previous batches. The optimizer was no longer following the current batch signal. It was dragging around the residue of the whole run.

Fix

Added optimizer.zero_grad(set_to_none=True) at the start of every training iteration, before the forward pass. Verified the fix by printing representative gradient norms before backward, after backward, and after zero_grad. Added gradient clipping with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) as a safety net, and added a unit-level training smoke test that asserts loss decreases over a few batches on a known dataset slice.
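A smoke test of that kind might look like the sketch below. It is a simplified stand-in, not the test from the incident: it repeatedly fits one small synthetic batch with a toy linear model, and the function name, step count, and learning rate are assumptions.

```python
import torch
import torch.nn as nn


def run_smoke_steps(model, inputs, targets, criterion, optimizer, steps=10):
    """Run a few correctly ordered training steps and return the loss per step."""
    model.train()
    losses = []
    for _ in range(steps):
        optimizer.zero_grad(set_to_none=True)
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


torch.manual_seed(0)
model = nn.Linear(4, 1)
inputs = torch.randn(32, 4)
# Synthetic targets from a known linear rule, so the task is learnable.
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])

losses = run_smoke_steps(
    model, inputs, targets, nn.MSELoss(),
    torch.optim.SGD(model.parameters(), lr=0.05),
)

# The actual smoke-test assertion: loss must trend downward.
assert losses[-1] < losses[0], "loss did not decrease; check loop order"
print(f"first={losses[0]:.4f} last={losses[-1]:.4f}")
```

In a real project the same function would be pointed at the production model and a fixed slice of real data.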

Key lesson

optimizer.zero_grad() belongs inside the batch loop and before loss.backward() — if it is missing, accumulation is happening whether you intended it or not
If training loss looks fine but validation accuracy is random, do not just tune the learning rate — inspect gradient norms and the loop order first
Gradient accumulation is a real technique, but when you use it intentionally you must divide the loss by accumulation_steps before backward
Add a short smoke test that runs 5 to 10 batches and checks whether loss trends downward — it catches loop-order bugs early and cheaply
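The point about dividing by accumulation_steps can be checked numerically. The sketch below compares the gradient from one full batch with the gradient accumulated over four micro-batches; the toy model and data are assumptions, and the micro-batches are equal in size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

# Reference: gradient from the full batch of 8 in a single pass.
model.zero_grad(set_to_none=True)
criterion(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Intentional accumulation: 4 micro-batches of 2, each loss divided by
# accumulation_steps so the summed gradient matches the full-batch mean.
accumulation_steps = 4
model.zero_grad(set_to_none=True)
for micro_inputs, micro_targets in zip(
    inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)
):
    loss = criterion(model(micro_inputs), micro_targets) / accumulation_steps
    loss.backward()
accum_grad = model.weight.grad.clone()
# optimizer.step() and zero_grad() would run here, once per accumulation cycle.

print("max difference:", (full_grad - accum_grad).abs().max().item())
```

Without the division, the accumulated gradient would be four times larger, which behaves like an unplanned fourfold learning-rate increase under plain SGD.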

Production debug guide · Common symptoms when the loop runs without crashing but still produces bad training behavior · 5 entries

Symptom · 01

Loss stays flat or oscillates wildly across epochs

→

Fix

Check the loop order first: zero_grad, forward, backward, step. Then print gradient norms with: for n, p in model.named_parameters(): print(n, None if p.grad is None else p.grad.norm().item()). Missing zero_grad or an over-aggressive learning rate are the usual causes.

Symptom · 02

Loss decreases but validation accuracy stays at random chance

→

Fix

Check that model.eval() is called before validation and model.train() is restored before the next epoch. Then verify label alignment, class-index mapping, and that you are not accidentally shuffling labels in the dataset or collate function. Also inspect whether gradients are accumulating unintentionally.

Symptom · 03

CUDA out of memory during training but not during inference

→

Fix

That is normal in principle because training stores activations for backprop, but large unexplained growth usually means the validation loop is missing torch.no_grad(), loss tensors are being stored instead of loss.item(), or retain_graph=True is being used unnecessarily.

Symptom · 04

Training crashes with RuntimeError: grad can be implicitly created only for scalar outputs

→

Fix

Your loss is not a scalar. Many reduction='none' losses return one value per sample. Reduce it with loss.mean() or loss.sum() before calling backward().
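The error and its fix can be reproduced in a few lines. A minimal sketch with random logits and made-up labels:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3, requires_grad=True)
labels = torch.tensor([0, 2, 1, 1])

# reduction="none" returns one loss value per sample: shape [4], not a scalar.
per_sample = nn.CrossEntropyLoss(reduction="none")(logits, labels)

try:
    per_sample.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Reduce to a scalar first, then backward() works.
loss = per_sample.mean()
loss.backward()
print("grad shape:", tuple(logits.grad.shape))
```

Keeping `reduction="none"` is still useful when you want per-sample weighting; just apply the weights and reduce before calling backward().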

Symptom · 05

Loss becomes NaN after a few iterations

→

Fix

Check input normalization, learning rate, and gradient norms. If you are using mixed precision, confirm GradScaler is enabled and the loss is finite before stepping. Gradient clipping is often enough to stop a bad run from blowing up completely.
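One way to express "confirm the loss is finite before stepping" is a small guard around the update. This sketch covers the plain full-precision path only (no GradScaler); the helper name `safe_step` and the skip-the-batch policy are assumptions, and a real job might prefer to log and abort instead.

```python
import torch
import torch.nn as nn


def safe_step(model, optimizer, loss, max_grad_norm=1.0) -> bool:
    """Backprop, clip, and step only if the loss is finite. Returns True if stepped."""
    if not torch.isfinite(loss):
        # Skip this batch entirely so NaN/Inf never reaches the weights.
        optimizer.zero_grad(set_to_none=True)
        return False
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
    optimizer.step()
    return True


torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad(set_to_none=True)
stepped_ok = safe_step(model, optimizer, nn.functional.mse_loss(model(x), y))

# Simulate a corrupted loss to exercise the guard.
optimizer.zero_grad(set_to_none=True)
bad_loss = nn.functional.mse_loss(model(x), y) * float("nan")
stepped_bad = safe_step(model, optimizer, bad_loss)

print(stepped_ok, stepped_bad)
```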

★ Training Loop Debug Cheat Sheet · Fast checks you can run before touching architecture or hyperparameters

Gradients appear to grow every batch

Immediate action

Confirm gradients are being reset inside the loop

Commands

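Search your training script for optimizer.zero_grad and confirm the call sits inside the batch loop, before the forward pass.

Then print gradient norms from inside the training process, for example at a breakpoint right after loss.backward(). A standalone python -c invocation will not work because model is not defined there:

for n, p in model.named_parameters(): print(n, None if p.grad is None else p.grad.norm().item())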

Fix now

Add optimizer.zero_grad(set_to_none=True) at the top of each batch iteration. If you are intentionally accumulating gradients, divide loss by accumulation_steps and only step every N batches.

Validation memory keeps growing

Device mismatch crash on first batch

Loss decreases but predictions are unstable across validation runs

Training Loop Steps Explained

Phase	Action	Purpose
zero_grad()	Clear parameter.grad values from the previous step	Prevents stale gradients from accumulating into the next update
Forward Pass	outputs = model(inputs)	Produces predictions using the current model weights
Loss Calculation	loss = criterion(outputs, labels)	Reduces prediction quality to a differentiable training signal
Backward Pass	loss.backward()	Computes gradients for every parameter with requires_grad=True through autograd
Optimizer Step	optimizer.step()	Applies the parameter update using the gradients just computed
Validation Phase	model.eval() + torch.no_grad()	Measures model quality without Dropout noise or graph allocation
Scheduler Step	scheduler.step()	Adjusts learning rate on a planned cadence rather than leaving it static
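Put together, the phases in the table might look like the sketch below. The synthetic dataset, the small MLP, the SGD settings, and the StepLR schedule are all assumptions chosen so the example runs on its own; they are not a recommended configuration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical synthetic data: two classes separated by the sign of the feature sum.
features = torch.randn(512, 10)
targets = (features.sum(dim=1) > 0).long()
train_loader = DataLoader(TensorDataset(features[:384], targets[:384]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(features[384:], targets[384:]), batch_size=64)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Dropout(0.1), nn.Linear(32, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

history = []
for epoch in range(4):
    model.train()                                   # Dropout active during training
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)       # 1. clear stale gradients
        outputs = model(inputs)                     # 2. forward pass
        loss = criterion(outputs, labels)           # 3. loss calculation
        loss.backward()                             # 4. backward pass
        optimizer.step()                            # 5. parameter update
        running_loss += loss.item() * inputs.size(0)

    model.eval()                                    # 6. validation: Dropout off
    correct = 0
    with torch.no_grad():                           #    and no graph construction
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            correct += (model(inputs).argmax(dim=1) == labels).sum().item()

    scheduler.step()                                # 7. once per epoch, after the optimizer steps
    history.append((running_loss / 384, correct / 128))
    print(f"epoch {epoch}: train_loss={history[-1][0]:.4f} val_acc={history[-1][1]:.3f}")
```

The scheduler is stepped once per epoch here because StepLR counts epochs; schedulers designed for per-batch stepping, such as OneCycleLR, would be stepped inside the inner loop instead.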

Key takeaways

The PyTorch training loop is explicit by design: you manage gradients, forward passes, backward passes, optimizer updates, and validation state yourself.

The canonical order matters: zero_grad, forward, loss, backward, optimizer.step. If that sequence is wrong, the run is not trustworthy.

model.train() and model.eval() are real mode switches, not decoration. Dropout and BatchNorm depend on them.

torch.no_grad() during validation saves memory and time by avoiding graphs you will never backprop through.

Use loss.item() for logging and reporting. Keeping loss tensors around is a common and unnecessary source of memory growth.

In 2026, a production-quality loop usually adds mixed precision, gradient clipping, scheduler control, and durable metric logging, but the underlying mechanics have not changed.

Common mistakes to avoid

5 patterns

Forgetting optimizer.zero_grad() before loss.backward()

Symptom

Loss may look unstable, drift strangely, or even decrease while validation quality collapses. Gradients accumulate across batches, so each update contains stale information from prior iterations. Weight norms often grow much faster than expected.

Fix

Call optimizer.zero_grad(set_to_none=True) once per batch before the forward pass. If you intentionally use gradient accumulation, divide the loss by accumulation_steps and only call optimizer.step() every N batches.

Not moving data to the same device as the model

Symptom

Training crashes on the first forward pass with a device mismatch error, or more subtly later when an auxiliary tensor inside the loss is created on CPU while logits are on GPU.

Fix

Set device once, move the model there, and move every batch with inputs.to(device, non_blocking=True) and labels.to(device, non_blocking=True). Any tensor that participates in the computation must live on that same device.
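A minimal sketch of that device discipline follows. The class_weights tensor is a hypothetical example of an auxiliary tensor created inside the loss computation, which is where late mismatches tend to come from.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 3).to(device)

inputs = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))
inputs = inputs.to(device, non_blocking=True)
labels = labels.to(device, non_blocking=True)

# Auxiliary tensors default to the CPU. Create them on the device of a tensor
# that is already in the computation instead.
logits = model(inputs)
class_weights = torch.tensor([1.0, 2.0, 0.5], device=logits.device)
loss = nn.functional.cross_entropy(logits, labels, weight=class_weights)
print(loss.device)
```

Note that non_blocking=True only overlaps the copy with compute when the source tensor lives in pinned memory (for example, a DataLoader created with pin_memory=True); otherwise it behaves like a normal copy.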

Skipping model.eval() and torch.no_grad() during validation

Symptom

Validation metrics bounce around more than they should, accuracy looks worse than the training metrics suggested, and GPU memory usage during validation is higher than necessary.

Fix

Call model.eval() before validation and wrap the entire validation loop in torch.no_grad(). After validation, call model.train() before the next epoch begins.

Calling model.train() and model.eval() in the wrong places

Symptom

The model trains with Dropout effectively disabled or validates with Dropout still active. BatchNorm running statistics are updated when they should be frozen, or frozen when they should still be learning.

Fix

Use model.train() at the start of each training epoch. Use model.eval() for every validation or inference block. Do not mix the two and do not assume the previous call is still the correct state.
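The effect of the two modes is easy to observe directly. The sketch below uses a tiny hypothetical model with one BatchNorm1d and one Dropout layer and checks what each mode changes.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Dropout(p=0.5))
x = torch.randn(16, 4)
bn = model[1]

model.train()
mean_before = bn.running_mean.clone()
train_a, train_b = model(x), model(x)        # Dropout samples a new mask per call
stats_moved_in_train = not torch.equal(mean_before, bn.running_mean)
train_outputs_differ = not torch.equal(train_a, train_b)

model.eval()
mean_before = bn.running_mean.clone()
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)      # Dropout is a no-op, BatchNorm uses running stats
stats_frozen_in_eval = torch.equal(mean_before, bn.running_mean)
eval_outputs_match = torch.equal(eval_a, eval_b)

print(stats_moved_in_train, train_outputs_differ, stats_frozen_in_eval, eval_outputs_match)
```

torch.no_grad() and model.eval() are independent switches: the first stops graph construction, the second changes layer behavior. Validation needs both.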

Logging loss tensors instead of loss.item()

Symptom

Memory usage grows slowly over the course of an epoch or run, especially when you append loss tensors to Python lists for later reporting. The graphs attached to those tensors are kept alive longer than intended.

Fix

Use loss.item() for logging, aggregation, and dashboard reporting. Keep the tensor form only for backward(). Once you are measuring or printing, you almost always want the scalar value.
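As a small illustration, the sketch below logs three losses as floats and averages them. The random inputs and the tiny linear model are placeholders, not part of any real pipeline.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()

logged = []
for _ in range(3):
    loss = criterion(model(torch.randn(8, 4)), torch.randn(8, 1))
    # loss.grad_fn is set here, so appending `loss` itself would keep its graph alive.
    logged.append(loss.item())       # plain Python float, no graph attached

mean_loss = sum(logged) / len(logged)
print(type(logged[0]).__name__, f"{mean_loss:.4f}")
```

On a GPU, loss.item() forces a host-device synchronization. If that shows up in profiling, accumulating loss.detach() on the device and calling .item() once per logging interval is a common alternative.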

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Explain the 'Gradient Accumulation' technique. Why would a developer int...

Q02SENIOR

Why is backward() called on the loss tensor and not on the model itself?...

Q03SENIOR

Compare model.train() and model.eval(). Which specific layers are actual...

Q04JUNIOR

How does the PyTorch training loop get around the Python GIL during data...

Q05SENIOR

Where should a learning rate scheduler step happen in the training loop,...

Q01 of 05SENIOR

Explain the 'Gradient Accumulation' technique. Why would a developer intentionally skip optimizer.zero_grad() for a few batches?

ANSWER

Gradient accumulation is a memory-saving technique used when the desired effective batch size does not fit on the GPU. Suppose the real target batch size is 128, but only 32 samples fit in memory. You process 32 samples four times, call loss.backward() each time, and delay optimizer.step() until after the fourth mini-batch. Because PyTorch accumulates gradients by default, those four backward passes approximate one larger batch update. The critical detail many people miss is scale: you typically divide the loss by accumulation_steps before backward so the total gradient magnitude matches what you would have gotten from the full batch. In other words, accumulation is not 'forgetting zero_grad.' It is a deliberate strategy with explicit control over when gradients are cleared and when the optimizer steps.
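The scaling argument can be checked numerically. The sketch below assumes a mean-reduced loss, equal-sized micro-batches, and a model without BatchNorm, and compares accumulated gradients against a single full-batch backward pass. The data and the linear model are illustrative only.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
accumulation_steps = 4
inputs, targets = torch.randn(128, 10), torch.randn(128, 1)
criterion = nn.MSELoss()                      # mean reduction

model = nn.Linear(10, 1)
reference = copy.deepcopy(model)

# Reference: one backward pass over the full batch of 128.
criterion(reference(inputs), targets).backward()

# Accumulation: four micro-batches of 32, one optimizer step at the end.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
optimizer.zero_grad(set_to_none=True)
for x, y in zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    loss = criterion(model(x), y) / accumulation_steps   # rescale so the sum matches the full-batch mean
    loss.backward()                                      # adds into .grad

grads_match = all(
    torch.allclose(p.grad, q.grad, atol=1e-5)
    for p, q in zip(model.parameters(), reference.parameters())
)
optimizer.step()                                         # single update for the effective batch of 128
optimizer.zero_grad(set_to_none=True)                    # clear only after stepping
print(grads_match)
```

With BatchNorm in the model the equivalence no longer holds exactly, because batch statistics are computed per micro-batch rather than over the full effective batch.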


Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.


Last updated: May 24, 2026

