
Training Loop in PyTorch Explained

📍 Part of: PyTorch → Topic 5 of 7
Master the PyTorch training loop — a practical deep dive into forward passes, loss calculation, backpropagation, optimizer steps, validation, and production-grade training patterns.
⚙️ Intermediate — basic ML / AI knowledge assumed
In this tutorial, you'll learn
  • The PyTorch training loop is explicit by design: you manage gradients, forward passes, backward passes, optimizer updates, and validation state yourself.
  • The canonical order matters: zero_grad, forward, loss, backward, optimizer.step. If that sequence is wrong, the run is not trustworthy.
  • model.train() and model.eval() are real mode switches, not decoration. Dropout and BatchNorm depend on them.
Quick Answer
  • The training loop is a 4-step cycle: zero_grad, forward, backward, optimizer.step — the sequence matters because each step depends on state created by the previous one
  • optimizer.zero_grad() clears previous gradients — without it, gradients accumulate across batches and updates quickly become wrong
  • loss.backward() computes gradients through Autograd via the chain rule — you call it on the loss tensor, not on the model
  • model.train() and model.eval() switch Dropout and BatchNorm behavior — forgetting eval mode makes validation noisy and misleading
  • torch.no_grad() during validation avoids building a graph you will never backprop through — less memory, faster evaluation
  • The most common production bug is missing zero_grad; the close second is logging loss tensors instead of loss.item(), which quietly leaks memory
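The four-step cycle from the Quick Answer can be sketched on a tiny synthetic regression task. This is a minimal illustration, not production code; all tensors and names here are invented for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 3)
y = X @ torch.tensor([[2.0], [-1.0], [0.5]]) + 0.1 * torch.randn(64, 1)

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = criterion(model(X), y).item()

for _ in range(100):
    optimizer.zero_grad(set_to_none=True)  # 1) clear stale gradients
    outputs = model(X)                     # 2) forward pass
    loss = criterion(outputs, y)           #    loss calculation
    loss.backward()                        # 3) backward pass
    optimizer.step()                       # 4) weight update

final_loss = loss.item()
print(f"initial={initial_loss:.4f} final={final_loss:.4f}")
```

If the order of those four calls is scrambled, the loop still runs, but the final loss stops decreasing reliably — which is exactly why the sequence matters more than any single line.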
🚨 START HERE
Training Loop Debug Cheat Sheet
Fast checks you can run before touching architecture or hyperparameters
🟡Gradients appear to grow every batch
Immediate Action: Confirm gradients are being reset inside the loop
Commands
# Run inside your training session (model refers to your own model object):
for n, p in model.named_parameters(): print(n, None if p.grad is None else p.grad.norm().item())
Fix Now: Add optimizer.zero_grad(set_to_none=True) at the top of each batch iteration. If you are intentionally accumulating gradients, divide loss by accumulation_steps and only step every N batches.
🟡Validation memory keeps growing
Immediate Action: Check whether validation is building graphs or storing tensors
Commands
# Inside the validation loop: accumulate Python scalars, not tensors
running_loss += loss.item()  # not running_loss += loss
Fix Now: Use model.eval(), wrap the loop in torch.no_grad(), and log loss.item() rather than the loss tensor itself.
🟡Device mismatch crash on first batch
Immediate Action: Verify model, inputs, and labels are on the same device
Commands
# Inside your training script, before the forward pass:
print(next(model.parameters()).device)
print(inputs.device, labels.device)
Fix Now: Move tensors with inputs = inputs.to(device, non_blocking=True) and labels = labels.to(device, non_blocking=True) before the forward pass.
🟡Loss decreases but predictions are unstable across validation runs
Immediate Action: Check mode switching and randomness
Commands
# Inside your validation code:
print(model.training)  # should be False during validation
Fix Now: Call model.eval() before validation so Dropout and BatchNorm stop behaving like training-time layers, and restore model.train() before the next training epoch.
Production Incident: Model converges to confident nonsense because gradients kept accumulating
A classification model trained for 50 epochs with steadily decreasing training loss, but validation accuracy stayed near random chance. The optimizer was stepping on accumulated gradients from every prior batch.
Symptom: Training loss decreased smoothly enough to look healthy on the dashboard. Validation accuracy stayed near random chance. Weight magnitudes kept growing from epoch to epoch. The model became more confident over time, but its predictions were confidently wrong.
Assumption: The first guess was that the learning rate was too high or the labels were noisy. Both are reasonable. Neither was correct.
Root cause: The training loop omitted optimizer.zero_grad() inside the per-batch iteration. PyTorch accumulates gradients into parameter.grad by design. That is useful when you intentionally want gradient accumulation. Here it was accidental. By the end of each epoch, every update was based on gradients that included stale contributions from many previous batches. The optimizer was not following the current batch signal anymore — it was dragging around the residue of the whole epoch.
Fix: Added optimizer.zero_grad(set_to_none=True) at the start of every training iteration, before the forward pass. Verified the fix by printing representative gradient norms before backward, after backward, and after zero_grad. Added gradient clipping with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) as a safety net, and added a unit-level training smoke test that asserts loss decreases over a few batches on a known dataset slice.
Key Lesson
  • optimizer.zero_grad() belongs inside the batch loop and before loss.backward() — if it is missing, accumulation is happening whether you intended it or not
  • If training loss looks fine but validation accuracy is random, do not just tune the learning rate — inspect gradient norms and the loop order first
  • Gradient accumulation is a real technique, but when you use it intentionally you must divide the loss by accumulation_steps before backward
  • Add a short smoke test that runs 5 to 10 batches and checks whether loss trends downward — it catches loop-order bugs early and cheaply
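Intentional gradient accumulation, as described in the key lesson, looks like this in a hedged sketch. The model, batches, and accumulation_steps value are illustrative assumptions; the pattern of dividing the loss and stepping every N micro-batches is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
accumulation_steps = 4  # illustrative: 4 micro-batches per optimizer step

# Synthetic micro-batches standing in for a DataLoader
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(8)]

optimizer.zero_grad(set_to_none=True)
for i, (inputs, labels) in enumerate(batches):
    # Divide by accumulation_steps so the summed gradient matches a full batch
    loss = criterion(model(inputs), labels) / accumulation_steps
    loss.backward()  # gradients accumulate across the micro-batches by design
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                       # step only every N micro-batches
        optimizer.zero_grad(set_to_none=True)  # then deliberately clear

steps_taken = len(batches) // accumulation_steps
print(f"optimizer stepped {steps_taken} times for {len(batches)} micro-batches")
```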
Production Debug Guide: Common symptoms when the loop runs without crashing but still produces bad training behavior
Loss stays flat or oscillates wildly across epochs
Check the loop order first: zero_grad, forward, backward, step. Then print gradient norms with: for n, p in model.named_parameters(): print(n, None if p.grad is None else p.grad.norm().item()). Missing zero_grad or an over-aggressive learning rate are the usual causes.
Loss decreases but validation accuracy stays at random chance
Check that model.eval() is called before validation and model.train() is restored before the next epoch. Then verify label alignment, class-index mapping, and that you are not accidentally shuffling labels in the dataset or collate function. Also inspect whether gradients are accumulating unintentionally.
CUDA out of memory during training but not during inference
That is normal in principle because training stores activations for backprop, but large unexplained growth usually means the validation loop is missing torch.no_grad(), loss tensors are being stored instead of loss.item(), or retain_graph=True is being used unnecessarily.
Training crashes with RuntimeError: grad can be implicitly created only for scalar outputs
Your loss is not a scalar. Many reduction='none' losses return one value per sample. Reduce it with loss.mean() or loss.sum() before calling backward().
Loss becomes NaN after a few iterations
Check input normalization, learning rate, and gradient norms. If you are using mixed precision, confirm GradScaler is enabled and the loss is finite before stepping. Gradient clipping is often enough to stop a bad run from blowing up completely.
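The scalar-output RuntimeError in this guide is easy to reproduce and fix in isolation. A minimal sketch, with invented logits and labels:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 3, requires_grad=True)
labels = torch.randint(0, 3, (5,))

# reduction='none' returns one loss value per sample, not a scalar
per_sample = nn.CrossEntropyLoss(reduction='none')(logits, labels)
print(per_sample.shape)  # calling per_sample.backward() here would raise

# Reduce to a scalar before backward()
loss = per_sample.mean()
loss.backward()
print(loss.dim(), logits.grad is not None)
```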

The training loop is the core execution pattern in PyTorch — a cycle that repeats for every batch of data: clear gradients, compute predictions, compute loss, backpropagate, update weights. PyTorch keeps this explicit on purpose. You are never far from the mechanics of optimization.

That explicitness is the trade-off. You get full visibility into gradient flow, parameter updates, device placement, mixed precision, gradient clipping, and scheduling. The cost is that there is no place to hide sloppy thinking. The loop is short, but it is stateful, and the order of operations matters.

The failure pattern I see most often in real code reviews is not an exotic math bug. It is a copy-pasted tutorial loop with one small change in the wrong place. Someone forgets optimizer.zero_grad(). Someone validates without model.eval(). Someone logs the loss tensor itself instead of loss.item() and wonders why GPU memory keeps growing. The loop is simple. The discipline around it is what separates a model that trains cleanly from one that burns two days of GPU time to produce nonsense.

What Is the PyTorch Training Loop and Why Does It Exist?

The PyTorch training loop exists because optimization is stateful, and PyTorch chooses to make that state visible rather than hide it behind a one-line fit call. Each batch passes through the same sequence: clear old gradients, run the forward pass, compute the loss, backpropagate, and step the optimizer. That looks repetitive because it is repetitive. Training is controlled repetition.

The reason this pattern matters is that gradients in PyTorch accumulate by default. Parameters remember their previous .grad values until you clear them. Autograd also records the operations from the forward pass so backward can traverse that graph in reverse. The loop is not just procedural boilerplate — it is how you manage that state correctly.
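The accumulation behavior described above is easy to observe directly. A tiny demonstration on a single leaf tensor:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()    # d(3w)/dw = 3

(w * 3).sum().backward()  # no zero_grad in between
second = w.grad.clone()   # gradients add: now 6, not 3

print(first.item(), second.item())
```

This is the entire mechanism behind both the missing-zero_grad bug and intentional gradient accumulation: backward always adds into .grad, and only you decide when to clear it.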

In 2026, the canonical loop usually includes a few production-grade upgrades even for ordinary models: optimizer.zero_grad(set_to_none=True) for slightly lower memory traffic, mixed precision with torch.autocast on supported GPUs, gradient clipping when the model is deep or unstable, and explicit validation blocks with model.eval() plus torch.no_grad(). If your model is compile-friendly, torch.compile can sit on top of the same loop structure without changing the fundamentals.

What does not change is the contract. The loop still answers the same four questions every iteration: what did the model predict, how wrong was it, how should the weights change, and did they actually change.

io/thecodeforge/ml/train_loop.py · PYTHON
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_one_epoch(model, loader, criterion, optimizer, device, scaler=None, max_grad_norm=1.0):
    model.train()
    running_loss = 0.0
    total_samples = 0

    use_amp = device.type == 'cuda'

    for inputs, labels in loader:
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        # 1) Clear stale gradients from the previous iteration
        optimizer.zero_grad(set_to_none=True)

        # 2) Forward pass + loss calculation
        with torch.autocast(device_type=device.type, dtype=torch.float16, enabled=use_amp):
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        # 3) Backward pass
        if scaler is not None and use_amp:
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)  # unscale before clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)

            # 4) Optimizer step
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
            optimizer.step()

        batch_size = inputs.size(0)
        running_loss += loss.item() * batch_size
        total_samples += batch_size

    return running_loss / total_samples


@torch.no_grad()
def validate(model, loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    for inputs, labels in loader:
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        running_loss += loss.item() * inputs.size(0)
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

    avg_loss = running_loss / total
    accuracy = correct / total
    return avg_loss, accuracy


# Example wiring
# model = MyClassifier().to(device)
# model = torch.compile(model)  # optional in PyTorch 2.x when the model is compile-friendly
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
# criterion = nn.CrossEntropyLoss()
# scaler = torch.amp.GradScaler('cuda', enabled=(device.type == 'cuda'))
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
#
# for epoch in range(10):
#     train_loss = train_one_epoch(model, train_loader, criterion, optimizer, device, scaler)
#     val_loss, val_acc = validate(model, val_loader, criterion, device)
#     scheduler.step()  # epoch-level scheduler: step after the epoch
#     lr = optimizer.param_groups[0]['lr']
#     print(f'Epoch {epoch + 1:02d} | train_loss={train_loss:.4f} | val_loss={val_loss:.4f} | val_acc={val_acc:.2%} | lr={lr:.2e}')
▶ Output
Epoch 01 | train_loss=0.6124 | val_loss=0.4018 | val_acc=84.75% | lr=3.00e-04
Mental Model
The Training Loop Mental Model
A good training loop is not complicated. It is disciplined. The same small sequence runs over and over, and each step is responsible for one piece of the learning process.
  • zero_grad clears old gradient state so the next update reflects the current batch rather than stale history
  • The forward pass converts inputs into predictions using the current weights
  • The loss function turns prediction quality into a scalar signal Autograd can differentiate
  • backward populates parameter.grad by walking the graph in reverse
  • optimizer.step reads parameter.grad and updates the weights in place
📊 Production Insight
Use optimizer.zero_grad(set_to_none=True) as the default in new code — same semantics for standard training, slightly less memory traffic.
If you enable mixed precision, unscale gradients before clipping or your clip threshold is meaningless.
Rule: keep the training loop boring. Most production failures come from clever additions in the wrong place, not from the core pattern itself.
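What set_to_none=True actually does is visible with one parameter: .grad becomes None instead of a zero-filled tensor, skipping a memory write. A small sketch with an invented model:

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 2)).sum()
loss.backward()
grad_before = next(model.parameters()).grad  # a real tensor after backward

optimizer.zero_grad(set_to_none=True)
grad_after = next(model.parameters()).grad   # None, not a zeroed tensor

print(grad_before is not None, grad_after)
```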
🎯 Key Takeaway
The loop is short because the underlying idea is simple: clear stale gradients, predict, measure error, backpropagate, update.
What makes training reliable is not complexity but respecting the state transitions inside that sequence.
If the order is wrong, the whole run is suspect even when the code still executes.
Training Loop Strategy Decision
If: Standard supervised training with one model, one loss, and ordinary validation
Use: A plain native PyTorch loop — explicit, easy to debug, and flexible enough for most production work
If: Need gradient accumulation, multiple losses, custom clipping, or unusual optimizer scheduling
Use: Stay in native PyTorch — these cases are exactly where explicit loops are worth having
If: Need to reduce boilerplate for callbacks, checkpointing, and distributed setup
Use: Consider Lightning or Accelerate, but keep the mental model of the native loop because you will still debug the same underlying states
If: Training on modern NVIDIA GPUs and the model is stable under compilation
Use: Add mixed precision and evaluate torch.compile — both sit on top of the same loop and can improve throughput materially

Experiment Tracking: Logging Metrics and Checkpoints Like an Engineer

A training loop that only prints to stdout is fine for a notebook. It is not enough for a team. Once a model matters, you need a trace of what happened: which code ran, which hyperparameters were used, what the learning rate was at each epoch, what checkpoint corresponded to the best validation metric, and when the run started to go sideways if it did.

The simple pattern is still the right one. Log epoch-level metrics to a durable store. Keep checkpoints in object storage or a mounted artifact directory. Store the checkpoint path alongside the metrics row rather than pretending those two systems are unrelated. They are part of the same training story.

The practical benefit is not just reporting. It is rollback and diagnosis. When somebody says the new model is worse, you should be able to answer with evidence: which run, which epoch, which checkpoint, what the validation curve looked like, and whether the learning rate schedule or data version changed. Without that, model debugging turns into folklore.
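On the PyTorch side, the checkpoint that the metrics row points at is usually a dict saved with torch.save. A hedged sketch — the keys, epoch number, and metric values here are illustrative, matching no fixed schema, and an in-memory buffer stands in for the real object-store path:

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Bundle weights, optimizer state, and the metrics that justified keeping it
checkpoint = {
    'epoch': 12,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'val_loss': 0.3198,
    'lr': optimizer.param_groups[0]['lr'],
}

# In production this would be a file or object-store path; a buffer keeps
# the sketch self-contained and runnable
buffer = io.BytesIO()
torch.save(checkpoint, buffer)
buffer.seek(0)

restored = torch.load(buffer)
print(restored['epoch'], restored['val_loss'])
```

Saving optimizer state alongside weights is what makes resuming a run meaningful rather than a silent restart of momentum and schedule state.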

io/thecodeforge/db/training_metrics.sql · SQL
-- io.thecodeforge: epoch-level training metrics for experiment tracking
INSERT INTO io.thecodeforge.training_history (
    run_id,
    model_id,
    epoch_number,
    train_loss,
    val_loss,
    val_accuracy,
    learning_rate,
    checkpoint_path,
    created_at
) VALUES (
    'run_2026_04_19_001',
    'ForgeResNet50-v3',
    12,
    0.2841,
    0.3198,
    0.9142,
    0.000300,
    's3://forge-models/ForgeResNet50-v3/epoch_12.pt',
    CURRENT_TIMESTAMP
);

-- Example rollback query: fetch the best validation checkpoint for a run
SELECT epoch_number, checkpoint_path, val_loss, val_accuracy
FROM io.thecodeforge.training_history
WHERE run_id = 'run_2026_04_19_001'
ORDER BY val_accuracy DESC, val_loss ASC
LIMIT 1;
▶ Output
Metric logged successfully. Best checkpoint for run_2026_04_19_001 returned to the training dashboard.
💡Forge Best Practice:
Do not hard-code learning rate, batch size, or checkpoint paths inside the loop. Load them from configuration at startup and persist that configuration with the run. The moment you are comparing experiments across weeks or teammates, inline constants become a liability.
📊 Production Insight
Log epoch metrics to something durable — SQL, MLflow, or Weights & Biases — not just stdout.
The checkpoint path belongs in the same record as the metrics that justified keeping it.
Rule: if a run cannot be reproduced from its config, metrics, and checkpoint references, it was not tracked well enough.
🎯 Key Takeaway
Training is not complete when the epoch ends. It is complete when the metrics, config, and checkpoint are all durable and queryable.
That audit trail is what lets you explain success, debug regression, and roll back with confidence.
A good loop teaches the model. A good training system teaches the team.
Experiment Logging Decision
If: Solo developer with a small number of runs
Use: SQLite or even a disciplined CSV can be enough, provided checkpoint paths and configs are captured consistently
If: Team with shared training infrastructure and dashboard needs
Use: PostgreSQL or a managed experiment system so metrics are queryable across runs and users
If: Many experiments, multiple model families, artifact comparison, and sweeps
Use: MLflow or Weights & Biases — purpose-built tools save time once experiment volume stops being trivial

Containerizing the Forge Training Environment

Training jobs fail for boring reasons far more often than people admit: wrong CUDA runtime, mismatched drivers, DataLoader workers starved by tiny shared memory, and buffered logs that make a crashed container look idle for ten minutes. Docker does not solve those problems automatically, but it gives you one place to make them explicit.

For 2026-era PyTorch stacks, three things matter immediately. First, pin the framework and CUDA versions rather than using latest. Second, use unbuffered Python output so logs appear in real time in whatever runtime you use. Third, remember that DataLoader workers share memory through /dev/shm inside the container. If you spin up multiple workers without enough shared memory, you get hangs, worker exits, or mysterious throughput collapse.

The other trap is silent CPU fallback. Teams assume the container is using the GPU because the base image has CUDA in its name. That proves nothing. The host still needs the NVIDIA Container Toolkit, the container still needs to run with --gpus all, and your startup logs should still print torch.cuda.is_available() plus the device name. If you do not verify that, you can burn hours training on CPU and only discover it when the epoch time looks absurd.
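The startup verification described above is a few lines of Python that belong at the very top of train.py. A minimal sketch (the log format is an assumption, not a convention):

```python
import torch

# Print device visibility before anything else, so a silent CPU fallback
# is caught in the first log line rather than after an epoch of slow training
cuda_available = torch.cuda.is_available()
device = torch.device('cuda' if cuda_available else 'cpu')
device_name = torch.cuda.get_device_name(0) if cuda_available else 'cpu'

print(f"cuda_available={cuda_available} | device={device_name}")
```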

Dockerfile · DOCKERFILE
# io.thecodeforge: Reproducible PyTorch training container
# Pin versions. Never use a floating 'latest' tag in training infrastructure.
FROM pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime

WORKDIR /app

ENV PYTHONUNBUFFERED=1
ENV PIP_NO_CACHE_DIR=1

# System deps commonly needed by vision and tabular training stacks
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    curl \
    libgl1 \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# Run unbuffered (-u) so startup logs stream immediately; have train.py print device info first
CMD ["python", "-u", "train.py"]

# Example runtime command:
# docker run --gpus all --shm-size=2g -v $(pwd)/data:/app/data -v $(pwd)/checkpoints:/app/checkpoints forge-trainer:latest
▶ Output
Successfully built image forge-trainer:2.4.1-cuda12.4
Startup log: cuda_available=True | device=NVIDIA A100-SXM4-40GB
💡Docker Tip:
If you are training with num_workers greater than 0, give the container enough shared memory. In practice, --shm-size=2g is a sensible baseline for many workloads. Without it, DataLoader worker crashes and unexplained stalls are common.
📊 Production Insight
Use python -u or PYTHONUNBUFFERED=1 so logs stream immediately — delayed logs make dead jobs harder to diagnose.
Pin PyTorch and CUDA versions and print torch.cuda.is_available() at startup — never assume the GPU is active because the image name says CUDA.
Rule: for GPU training, the two Docker flags you check first are --gpus all and a sane --shm-size value.
🎯 Key Takeaway
Containerization is not just packaging. It is how you make runtime assumptions visible: framework version, CUDA version, logging behavior, and memory settings.
Most training container bugs are environment bugs, not model bugs.
Pin the environment, verify GPU visibility, and stop guessing.
Docker Training Environment Decision
If: Training on GPU with DataLoader num_workers = 0
Use: --gpus all and verify CUDA visibility at startup; shared memory is less likely to be the bottleneck
If: Training on GPU with DataLoader num_workers > 0
Use: --gpus all plus a larger --shm-size allocation because worker processes depend on shared memory for batch transfer
If: Training or fine-tuning on CPU-only infrastructure
Use: A CPU-specific PyTorch base image — smaller pull size, fewer moving parts, and no unused CUDA runtime
If: Need to compile custom CUDA extensions during build
Use: A devel image for the build stage, then switch to a slimmer runtime image for execution if possible

Common Mistakes and How to Avoid Them

Most training-loop bugs come from state you forgot was stateful. Gradients persist until you clear them. Dropout and BatchNorm switch behavior based on mode. Loss tensors keep their graph unless you detach or convert them properly for logging. PyTorch is explicit about all of this, but explicit does not mean self-correcting.

The two mode switches people most often misuse are model.train() and model.eval(). These are not decorative. They change the behavior of real layers. Validation without model.eval() is not a small mistake; it changes the model you think you are measuring. The same goes for validation without torch.no_grad() — maybe the metrics are still numerically correct, but you pay the full memory cost of graphs you will never use.
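That mode switch is directly observable on a single Dropout layer: in train mode it zeroes activations, in eval mode it is an exact identity. A tiny demonstration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # roughly half the values zeroed, survivors scaled by 1/(1-p)

drop.eval()
eval_out = drop(x)   # identity: output equals input exactly

print((train_out == 0).float().mean().item(), torch.equal(eval_out, x))
```

Validating without model.eval() means every forward pass runs the top branch, which is why validation metrics jump around from run to run.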

The other class of mistake is device mismatch. PyTorch will never silently move your batch to match the model. If the model is on GPU and the inputs are on CPU, the first forward pass fails. That is good. What is less obvious is partial mismatch inside more complex training code — an auxiliary tensor created on CPU in the middle of loss calculation, or class weights left on CPU while logits are on GPU. The discipline is the same: decide the device once, move everything that participates in the computation there, and verify it early.
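The "decide the device once" discipline can be sketched end to end, including the easy-to-miss case of class weights inside the loss. Names here are illustrative:

```python
import torch
import torch.nn as nn

# Decide the device once, at the top
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(3, 2).to(device)
inputs = torch.randn(4, 3).to(device, non_blocking=True)
labels = torch.randint(0, 2, (4,)).to(device, non_blocking=True)

# Auxiliary tensors participate in the loss, so they must live there too
class_weights = torch.tensor([1.0, 2.0], device=device)
criterion = nn.CrossEntropyLoss(weight=class_weights)

loss = criterion(model(inputs), labels)
same_device = next(model.parameters()).device == inputs.device == class_weights.device
print(same_device, loss.item())
```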

io/thecodeforge/ml/validate_model.py · PYTHON
import torch


def validate_model(model, data_loader, criterion):
    device = next(model.parameters()).device
    model.eval()  # CRITICAL: switch Dropout / BatchNorm to inference behavior

    total_loss = 0.0
    total_correct = 0
    total_samples = 0

    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs = inputs.to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)

            outputs = model(inputs)
            loss = criterion(outputs, labels)

            total_loss += loss.item() * inputs.size(0)
            preds = outputs.argmax(dim=1)
            total_correct += (preds == labels).sum().item()
            total_samples += labels.size(0)

    avg_loss = total_loss / total_samples
    accuracy = total_correct / total_samples

    model.train()  # restore training mode for the next epoch
    return avg_loss, accuracy


# Common anti-patterns to avoid:
# 1) Forgetting model.eval() before validation
# 2) Forgetting torch.no_grad() in validation
# 3) Logging loss instead of loss.item()
# 4) Using the deprecated outputs.data attribute instead of detach() or torch.no_grad()
# 5) Forgetting to switch back to model.train() before the next epoch
▶ Output
Validation completed: avg_loss=0.2874 | accuracy=91.32%
⚠ Watch Out:
If your native loop is turning into a small framework — custom callbacks, distributed coordination, dozens of logging hooks, resumable checkpointing, gradient accumulation, EMA weights, early stopping, and scheduler orchestration — pause and ask whether you still want to maintain that abstraction yourself. Native PyTorch is the right default. It does not have to be the only tool you use.
📊 Production Insight
Forgetting zero_grad causes accidental gradient accumulation; forgetting eval mode corrupts validation; logging loss tensors instead of loss.item() leaks memory.
Those are not advanced bugs. They are routine, and they are exactly why a good training loop should be predictable and slightly boring.
Rule: zero_grad before backward, model.eval() plus torch.no_grad() for validation, and use loss.item() for every metric you log.
🎯 Key Takeaway
The most common training-loop problems are bookkeeping problems: stale gradients, wrong model mode, wrong device, or graphs retained longer than necessary.
None of them are hard once you know where the state lives.
A reliable loop is explicit about that state at every phase.
Debugging Training Loop Mistakes
If: Loss stays flat or gets erratic after a few batches
Use: Check loop order first and verify optimizer.zero_grad() is inside the batch loop before backward
If: Validation accuracy is much lower or noisier than expected
Use: Ensure model.eval() is active during validation and that validation is not using random training augmentations
If: RuntimeError: Expected all tensors to be on the same device
Use: Move inputs, labels, and any auxiliary tensors used in the loss to the same device as the model before the forward pass
If: GPU memory climbs during validation or logging
Use: Wrap validation in torch.no_grad() and store loss.item() rather than the loss tensor
🗂 Training Loop Steps Explained
What each phase does and why the order matters
Phase | Action | Purpose
zero_grad() | Clear parameter.grad values from the previous step | Prevents stale gradients from accumulating into the next update
Forward Pass | outputs = model(inputs) | Produces predictions using the current model weights
Loss Calculation | loss = criterion(outputs, labels) | Reduces prediction quality to a differentiable training signal
Backward Pass | loss.backward() | Computes gradients for every leaf parameter through Autograd
Optimizer Step | optimizer.step() | Applies the parameter update using the gradients just computed
Validation Phase | model.eval() + torch.no_grad() | Measures model quality without Dropout noise or graph allocation
Scheduler Step | scheduler.step() | Adjusts learning rate on a planned cadence rather than leaving it static
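The scheduler row deserves a concrete sketch, because the ordering rule (optimizer first, scheduler once per epoch) is a common source of warnings. The StepLR choice and the learning rates here are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate after every epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

lrs = []
for epoch in range(3):
    # ... train_one_epoch(...) would run here, calling optimizer.step() per batch ...
    optimizer.step()   # PyTorch warns if the scheduler steps before any optimizer step
    scheduler.step()   # epoch-level scheduler: step once, after the epoch's training
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)  # learning rate halves each epoch
```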

🎯 Key Takeaways

  • The PyTorch training loop is explicit by design: you manage gradients, forward passes, backward passes, optimizer updates, and validation state yourself.
  • The canonical order matters: zero_grad, forward, loss, backward, optimizer.step. If that sequence is wrong, the run is not trustworthy.
  • model.train() and model.eval() are real mode switches, not decoration. Dropout and BatchNorm depend on them.
  • torch.no_grad() during validation saves memory and time by avoiding graphs you will never backprop through.
  • Use loss.item() for logging and reporting. Keeping loss tensors around is a common and unnecessary source of memory growth.
  • In 2026, a production-quality loop usually adds mixed precision, gradient clipping, scheduler control, and durable metric logging — but the underlying mechanics have not changed.

⚠ Common Mistakes to Avoid

    Forgetting optimizer.zero_grad() before loss.backward()
    Symptom

    Loss may look unstable, drift strangely, or even decrease while validation quality collapses. Gradients accumulate across batches, so each update contains stale information from prior iterations. Weight norms often grow much faster than expected.

    Fix

    Call optimizer.zero_grad(set_to_none=True) once per batch before the forward pass. If you intentionally use gradient accumulation, divide the loss by accumulation_steps and only call optimizer.step() every N batches.

    Not moving data to the same device as the model
    Symptom

    Training crashes on the first forward pass with a device mismatch error, or more subtly later when an auxiliary tensor inside the loss is created on CPU while logits are on GPU.

    Fix

    Set device once, move the model there, and move every batch with inputs.to(device, non_blocking=True) and labels.to(device, non_blocking=True). Any tensor that participates in the computation must live on that same device.

    Skipping model.eval() and torch.no_grad() during validation
    Symptom

    Validation metrics bounce around more than they should, inference-like accuracy looks worse than training suggested, and GPU memory usage during validation is higher than necessary.

    Fix

    Call model.eval() before validation and wrap the entire validation loop in torch.no_grad(). After validation, call model.train() before the next epoch begins.

    Calling model.train() and model.eval() in the wrong places
    Symptom

    The model trains with Dropout effectively disabled or validates with Dropout still active. BatchNorm running statistics are updated when they should be frozen, or frozen when they should still be learning.

    Fix

    Use model.train() at the start of each training epoch. Use model.eval() for every validation or inference block. Do not mix the two and do not assume the previous call is still the correct state.

    Logging loss tensors instead of loss.item()
    Symptom

    Memory usage grows slowly over the course of an epoch or run, especially when you append loss tensors to Python lists for later reporting. The graphs attached to those tensors are kept alive longer than intended.

    Fix

    Use loss.item() for logging, aggregation, and dashboard reporting. Keep the tensor form only for backward(). Once you are measuring or printing, you almost always want the scalar value.

Interview Questions on This Topic

  • Q: Explain the 'Gradient Accumulation' technique. Why would a developer intentionally skip optimizer.zero_grad() for a few batches? (Mid-level)
    Gradient accumulation is a memory-saving technique used when the desired effective batch size does not fit on the GPU. Suppose the real target batch size is 128, but only 32 samples fit in memory. You process 32 samples four times, call loss.backward() each time, and delay optimizer.step() until after the fourth mini-batch. Because PyTorch accumulates gradients by default, those four backward passes approximate one larger batch update. The critical detail many people miss is scale: you typically divide the loss by accumulation_steps before backward so the total gradient magnitude matches what you would have gotten from the full batch. In other words, accumulation is not 'forgetting zero_grad.' It is a deliberate strategy with explicit control over when gradients are cleared and when the optimizer steps.
  • Q (Mid-level): Why is backward() called on the loss tensor and not on the model itself? How does this relate to Autograd?
    Autograd works from tensors, not from high-level containers like nn.Module. The loss tensor is the scalar root of the computation graph built during the forward pass. It depends, directly or indirectly, on every parameter that influenced the prediction. Calling loss.backward() tells Autograd to traverse that graph in reverse, compute local derivatives at each node, and accumulate the final gradients into the .grad fields of the leaf tensors — typically the model parameters. The model itself is just a structured collection of operations and parameters. It does not represent a single differentiable scalar, so it is not the right object to backpropagate from.
  • Q (Mid-level): Compare model.train() and model.eval(). Which specific layers are actually affected by these mode switches?
    model.train() sets the module hierarchy to training mode, while model.eval() sets it to evaluation mode. The two most important affected families are Dropout and BatchNorm. In training mode, Dropout randomly zeroes activations and BatchNorm uses the current batch statistics while updating its running mean and variance. In eval mode, Dropout becomes a no-op and BatchNorm stops updating its running statistics, using the stored running values instead. Ordinary layers like Linear, Conv2d, and ReLU do not change behavior across these modes. LayerNorm also does not depend on train versus eval in the way BatchNorm does. The reason this matters is practical, not theoretical: if you validate in train mode, you are not measuring the model you plan to deploy.
  • Q (Junior): How does the PyTorch training loop get around the Python GIL during data loading?
    The training loop itself still runs in Python on the main process, so it is subject to the GIL. Data loading gets parallelism by using multiple worker processes in DataLoader when num_workers is greater than 0. Each worker is a separate Python process with its own interpreter and its own GIL, which means file I/O, decoding, and preprocessing can happen in parallel. Batches are prepared ahead of time and transferred back to the main process through shared memory and inter-process communication. GPU execution is separate again — once kernels are launched, the heavy compute happens outside the Python interpreter entirely. So the short version is: PyTorch does not beat the GIL with threads here; it sidesteps it with multiprocessing and GPU kernels.
  • Q (Senior): Where should a learning rate scheduler step happen in the training loop, and why is that not a trivial detail?
    It depends on the scheduler. Epoch-based schedulers such as StepLR or CosineAnnealingLR usually step once per epoch, commonly after validation so the logged learning rate corresponds to the next epoch. Batch-based schedulers such as OneCycleLR or some warmup schedules step once per optimizer update, which means inside the batch loop after optimizer.step(). Treating all schedulers the same is a common bug: stepping an epoch scheduler every batch collapses the learning rate too quickly, while stepping a batch scheduler only once per epoch makes it nearly useless. The rule is simple: match scheduler cadence to the schedule design, not to habit.
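The gradient accumulation recipe from the first answer can be sketched as follows; the model, batch sizes, and accumulation count are made up for illustration:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
accumulation_steps = 4

optimizer.zero_grad()                       # clear once, before accumulating
for step in range(accumulation_steps):
    x, y = torch.randn(32, 4), torch.randn(32, 1)
    # Rescale so the summed gradients match one full-size batch.
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()                         # gradients add up in .grad

optimizer.step()                            # one update for four mini-batches
optimizer.zero_grad()                       # only now is it safe to clear
```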
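The "backward from the loss, not the model" point is directly observable: gradients land in the parameters' .grad fields, reached by traversing the graph from the scalar loss. A tiny demonstration with a throwaway model:

```python
import torch
from torch import nn

model = nn.Linear(3, 1)
loss = model(torch.randn(5, 3)).pow(2).mean()   # scalar root of the graph

assert model.weight.grad is None   # nothing computed yet
loss.backward()                    # Autograd walks the graph in reverse
assert model.weight.grad is not None
assert model.weight.grad.shape == model.weight.shape
```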
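And the two scheduler cadences from the last answer, sketched with placeholder models and arbitrary hyperparameters (the bare optimizer.step() calls stand in for full train-batch bodies):

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Epoch-based: step once per epoch, outside the batch loop.
epoch_sched = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
for epoch in range(3):
    optimizer.step()            # stand-in for a full epoch of batch updates
    epoch_sched.step()          # lr: 0.1 -> 0.05 -> 0.025 -> 0.0125

# Batch-based: step once per optimizer update, inside the batch loop.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batch_sched = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=10
)
for batch in range(10):
    optimizer.step()            # real loop: zero_grad/forward/backward first
    batch_sched.step()          # lr follows the one-cycle curve per batch
```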

Frequently Asked Questions

What is the training loop in PyTorch, in simple terms?

It is the repeated process by which a model learns: make a prediction, measure the error, compute how the weights should change, update them, and do it again for the next batch. Everything else in training is built around that cycle.

Why does my loss stay exactly the same every epoch?

The usual causes are mechanical before they are mathematical: optimizer.step() may be missing, parameters may be frozen, the learning rate may be effectively zero, the model might be in eval mode during training, or the loss may not be connected to the model outputs the way you think it is. Start by printing one parameter value before and after optimizer.step() and confirm it actually changes.
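That probe takes about five lines; the model and data below are illustrative stand-ins for your own:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

before = model.weight.detach().clone()   # snapshot one parameter
loss = model(torch.randn(8, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
after = model.weight.detach().clone()

print("weights changed:", not torch.equal(before, after))
```

If that prints False, the problem is mechanical, and you can stop hunting for mathematical explanations.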

Can I use multiple loss functions in one loop?

Yes. That is a standard pattern in multi-task learning and regularized objectives. Compute each loss, weight them as needed, sum them into one scalar total_loss, and call total_loss.backward() once. The important part is that the final object passed to backward must be a scalar unless you explicitly provide gradient arguments.
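A minimal sketch of that pattern; the model, targets, and the 0.7/0.3 weights are arbitrary placeholders:

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
out = model(torch.randn(8, 4))

loss_a = nn.functional.mse_loss(out[:, 0], torch.randn(8))
loss_b = nn.functional.l1_loss(out[:, 1], torch.randn(8))

total_loss = 0.7 * loss_a + 0.3 * loss_b   # single weighted scalar objective
assert total_loss.dim() == 0               # scalar, so plain backward() is legal
total_loss.backward()                      # one backward pass for both losses
```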

Do I need a training loop if I use a pre-trained model?

Not for inference. If you only want predictions, you load the weights, switch to eval mode, and run the model. But if you are fine-tuning the pre-trained model on your data, then yes — you still need a training loop. Usually it is just a lighter one: smaller learning rate, fewer epochs, sometimes frozen backbone layers at the start.
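A sketch of the frozen-backbone variant, with two small Linear layers standing in for a real pre-trained network and a task head:

```python
import torch
from torch import nn

backbone = nn.Linear(16, 8)     # pretend these weights were pre-trained
head = nn.Linear(8, 2)          # freshly initialized task head
model = nn.Sequential(backbone, head)

for p in backbone.parameters():
    p.requires_grad = False     # freeze: no gradients, no updates

# Hand the optimizer only the trainable parameters, with a small lr.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()                 # gradients flow to the head only
print(backbone.weight.grad is None, head.weight.grad is not None)
```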

What is the difference between loss.item() and loss directly?

loss is still a tensor tied to the computation graph. loss.item() extracts its Python scalar value. For backward(), you need the tensor. For logging, averaging, and printing, you almost always want loss.item(). If you keep storing loss tensors in lists, you also keep their graphs alive longer than necessary.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Autograd and Backpropagation in PyTorch | Next: PyTorch DataLoader and Datasets →
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged