
PyTorch Basics Explained: Tensors, Autograd, and Real Model Training

PyTorch basics for intermediate developers — understand tensors, autograd, and training loops with battle-tested code examples and real-world usage patterns.
⚙️ Intermediate — basic ML / AI knowledge assumed
In this tutorial, you'll learn
  • Tensors carry three critical properties beyond their values: dtype, device, and requires_grad — getting any one of these wrong silently breaks training in ways that trace to the wrong location in the stack.
  • Autograd doesn't run continuously; it only records a computation graph when requires_grad=True tensors are involved, and only computes gradients when you explicitly call .backward() on a scalar loss. The graph is destroyed after each backward pass by default — retain_graph=True in a loop is almost always a memory leak.
  • The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model — the order is load-bearing. Memorise it and you can read any codebase or paper's training code cold.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • PyTorch tensors are multi-dimensional arrays that live on CPU or GPU and optionally track gradients for backpropagation
  • requires_grad=True opts a tensor into the autograd engine — only set it on learnable parameters, never on input data
  • The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model
  • model.train() and model.eval() control layer behaviour (Dropout, BatchNorm) — they do NOT control gradient computation
  • Forgetting optimizer.zero_grad() causes gradient accumulation, which silently corrupts training
  • Always use torch.inference_mode() or torch.no_grad() during validation and serving — not optional in production
🚨 START HERE
PyTorch Training Debug Cheat Sheet
Quick commands to diagnose training and memory issues
🟡Loss becomes NaN during training
Immediate Action: Enable anomaly detection to find the operation producing NaN gradients
Commands
torch.autograd.set_detect_anomaly(True)
print([(n, p.grad.norm()) for n, p in model.named_parameters() if p.grad is not None])
Fix Now: Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
🟡GPU memory grows each epoch
Immediate Action: Find tensors retained in the computational graph across loop iterations
Commands
torch.cuda.memory_summary()
print(loss.item()) # NOT print(loss) — .item() detaches from graph
Fix Now: Use .item() for scalar logging and .detach() for tensor logging. Never append raw loss tensors to a list.
🟡Gradients are all zeros — model not learning
Immediate Action: Check whether parameters have requires_grad and non-zero gradients after a backward pass
Commands
for n, p in model.named_parameters(): print(n, p.requires_grad, p.grad is not None)
torch.autograd.gradcheck(model, (test_input,))  # note: gradcheck expects float64 inputs (and a float64 model)
Fix Now: Ensure no torch.no_grad() wraps training code and no in-place operations on leaf tensors break the graph
🟡Model trains but validation metrics are random
Immediate Action: Verify model.eval() is called before every validation pass
Commands
print(model.training) # Should be False during validation
print(any(isinstance(m, nn.Dropout) for m in model.modules()))
Fix Now: Call model.eval() before every validation pass and model.train() before every training pass — treat them as a matched pair
Production Incident: Production model silently trained on accumulated gradients for 200 epochs
A recommendation model trained for 200 epochs appeared to converge, but produced random predictions in production. The training loop was missing optimizer.zero_grad().
Symptom: Training loss decreased steadily. Validation loss was noisy but trending down. Production A/B test showed zero lift over the baseline model — predictions appeared random.
Assumption: The model needed more training data or a different architecture. The team spent two weeks collecting more data.
Root cause: The training loop did not call optimizer.zero_grad(). PyTorch accumulates gradients by default — every backward() call adds to existing .grad values rather than replacing them. After 200 epochs, the accumulated gradient magnitude was effectively hundreds of times the correct per-batch value. The optimizer was applying enormous, compounding weight updates that oscillated wildly around the loss minimum without ever settling. The model ended up with effectively random weights that happened to produce low training loss by memorising noise in the first few batches — a classic gradient-corruption failure that is nearly impossible to diagnose from loss curves alone.
Fix: Added optimizer.zero_grad() as the first line of every training step. Added gradient norm logging to the training dashboard — a norm above 10.0 now triggers an alert. Added gradient clipping (max_norm=1.0) as a standing safety net across all training jobs. Added validation loss divergence detection — an alert fires if val loss increases for five consecutive epochs relative to the rolling minimum.
Key Lesson
  • PyTorch accumulates gradients by default — zero_grad() is not optional; it is the first line of every training step
  • Monitor gradient norms during training — a sudden spike almost always indicates accumulation or an unchecked learning rate schedule
  • Validation loss trending down is not sufficient signal — always check for divergence between train loss and val loss over time
  • Gradient clipping prevents catastrophic divergence from outlier batches or accumulation bugs — set it once and leave it on
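The gradient-norm monitoring and clipping described in the fix can be sketched in a few lines. This is a minimal illustration — the tiny linear model, fake data, and the 10.0 alert threshold are stand-ins for your own training job:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)                      # stand-in for a real model
x, y = torch.randn(8, 3), torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Total gradient norm across all parameters — the quantity worth dashboarding
total_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
).item()
if total_norm > 10.0:                        # alert threshold from the incident above
    print(f"ALERT: gradient norm {total_norm:.2f} exceeds 10.0")

# Standing safety net: clip in place, immediately before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
clipped = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()])
).item()
print(f"norm before clip: {total_norm:.4f} | after: {clipped:.4f}")
```

clip_grad_norm_ rescales all gradients in place so their combined norm never exceeds max_norm — if the norm was already below the limit, it does nothing.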
Production Debug Guide
Common symptoms when training goes wrong
Loss becomes NaN after a few training steps
Check for division by zero, log of negative numbers, or gradient explosion. Enable torch.autograd.detect_anomaly() to identify which operation produced the NaN gradient. In my experience, the most common culprit is a log() applied to a prediction that dipped to exactly zero — add a small epsilon (1e-8) inside any log call in your loss function.
Model trains but produces identical outputs for all inputs
Check if gradients are zero everywhere. Verify requires_grad is True on every parameter. Check for an accidental torch.no_grad() wrapping the training loop — this is surprisingly easy to do when refactoring inference code into a shared utility. Also check for dead ReLU initialisation: if all pre-activations are negative at init, the entire gradient signal is zero from step one.
GPU memory grows without bound each epoch
Check for tensors retained in the computational graph across loop iterations. The most common cause: appending loss (not loss.item()) to a history list. Use .item() for scalar logging and .detach() for tensor logging. Also check for retain_graph=True being called repeatedly — it is almost never necessary in standard training and will silently accumulate the entire graph in memory.
Validation loss fluctuates wildly between epochs
Check if model.eval() is called before validation. Without it, Dropout randomly drops different neurons on every forward pass, and BatchNorm uses the current batch's statistics instead of the accumulated running statistics. The result is non-deterministic validation outputs even on identical input data — which looks exactly like training instability but is actually an evaluation bug.
Model works on CPU but crashes on GPU with RuntimeError
Check for a device mismatch — this is almost always it. Every tensor involved in a single operation must live on the same device. Use .to(device) on both the model and every input tensor in your data loading step. If you are using a custom collate function in DataLoader, that is often where tensors quietly stay on CPU.
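The device-handling pattern those checks boil down to is short enough to memorise. A minimal sketch — the module and shapes are illustrative:

```python
import torch
import torch.nn as nn

# Pick the device once, move the model once, move every batch inside the loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 1).to(device)   # moves all parameters in place

batch = torch.randn(16, 4)           # DataLoader tensors start life on CPU
batch = batch.to(device)             # explicit per-batch move — never automatic

output = model(batch)                # safe: model and input share a device
print(output.device)                 # cuda:0 on a GPU box, cpu otherwise
```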

PyTorch has become the dominant choice in academic research and is rapidly closing the gap in production systems. Understanding its foundations means you can read any ML paper's code, contribute to AI projects, and stop copy-pasting model architectures you don't understand.

The core problem PyTorch solves is bridging the gap between 'I have an idea for a model' and 'I have a working, trained model.' Frameworks like raw NumPy can store data, but they can't automatically track how a change in one number ripples through a thousand operations to affect a final error score. PyTorch does this invisibly with its autograd engine — and as of 2026, that engine underpins everything from two-layer regression models to the transformer architectures powering production LLMs.

The most common production failure I see: developers understand the happy path but not the failure modes. Training loops that silently accumulate gradients, validation code that forgets model.eval(), and inference that wastes GPU memory by not disabling autograd. This guide covers both the concepts and the production gotchas — because shipping a model that actually works in production is a different skill from getting a notebook to converge.

Tensors: The DNA of Every PyTorch Model

A tensor is PyTorch's fundamental data container — think of it as a NumPy array that can live on a GPU and remember every operation ever performed on it. A 1D tensor is a list of numbers (a vector), a 2D tensor is a table (a matrix), and a 3D tensor might be a batch of images where the three dimensions are height, width, and colour channel.

What makes tensors special isn't the shape — it's the metadata they carry. Every tensor knows its data type (dtype), its device (CPU or CUDA GPU), and optionally whether it should track gradients. That last flag is what separates a plain number-holder from a value that participates in learning.

You'll reach for torch.tensor() when you're converting existing Python data, torch.zeros() or torch.ones() when initialising buffers, and torch.randn() for random initialisation with a standard normal distribution. The device placement decision — CPU vs GPU — happens at creation time, and moving data between devices is explicit, never automatic. That explicitness is a feature, not an oversight; it forces you to reason about where computation actually happens, which is the difference between a model that fits in GPU memory and one that crashes at batch two.

As of PyTorch 2.x, torch.compile() can fuse tensor operations into optimised kernels automatically — but only if your tensors are on the right device and dtype from the start. Sloppy tensor hygiene becomes measurably more expensive in 2026 than it was when compilation wasn't part of the picture.

The dtype mismatch is the most common silent failure: torch.tensor() infers int64 from Python integers and float32 from Python floats, but NumPy arrays default to float64 — and torch.from_numpy() carries that dtype straight through. Mixing float64 data with float32 model weights throws a RuntimeError at operation time, not at creation time — so the error surfaces somewhere unexpected. Always pass floats with a trailing .0 or specify dtype explicitly at creation, and call .float() on anything that arrived via NumPy.
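A quick demonstration of the deferred failure — the layer and array here are arbitrary; the point is where the error surfaces:

```python
import numpy as np
import torch
import torch.nn as nn

layer = nn.Linear(3, 1)                 # parameters are float32 by default

np_batch = np.random.rand(2, 3)         # NumPy defaults to float64
bad_input = torch.from_numpy(np_batch)  # dtype silently carried over

print(bad_input.dtype)                  # torch.float64 — no error yet

try:
    layer(bad_input)                    # crash happens HERE, not at creation
except RuntimeError as e:
    print("RuntimeError:", e)

# The fix: convert at the boundary, or assert so the failure is immediate
good_input = bad_input.float()          # float64 -> float32
assert good_input.dtype == torch.float32
print(layer(good_input).shape)          # torch.Size([2, 1])
```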

io.thecodeforge.ml.tensor_fundamentals.py · PYTHON
import torch

# --- Creating tensors from real data ---
# Simulating a tiny dataset: 4 house sizes (sq ft) and their prices ($k)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])  # 1D tensor, shape (4,)
house_prices = torch.tensor([150.0, 220.0, 310.0, 410.0])    # 1D tensor, shape (4,)

print("Sizes tensor:", house_sizes)
print("Shape:", house_sizes.shape)        # torch.Size([4])
print("Data type:", house_sizes.dtype)    # torch.float32 — default for floats

# --- 2D tensor: batch of data (rows = samples, cols = features) ---
feature_matrix = torch.tensor([
    [750.0,  3.0, 1.0],   # size, bedrooms, bathrooms
    [1200.0, 4.0, 2.0],
    [1800.0, 4.0, 3.0],
    [2400.0, 5.0, 3.0],
])
print("\nFeature matrix shape:", feature_matrix.shape)  # torch.Size([4, 3])

# --- Device awareness: check and move to GPU if available ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("\nUsing device:", device)

# Move tensor to the target device — always do this before operations
# (passing device=device at creation would avoid this extra host-to-device copy)
feature_matrix = feature_matrix.to(device)
print("Feature matrix device:", feature_matrix.device)

# --- Useful tensor operations ---
# Normalise features: (x - mean) / std — critical for stable training
# dim=0 means we compute one mean per column (per feature), across all rows (samples)
means = feature_matrix.mean(dim=0)
stds  = feature_matrix.std(dim=0)
normalised = (feature_matrix - means) / stds

print("\nNormalised features (first row):", normalised[0])

# --- requires_grad: opting a tensor INTO gradient tracking ---
# We do NOT set this on input data — only on learnable parameters
# Input data is fixed; we want gradients w.r.t. parameters, not the data itself
weight = torch.tensor([0.15], requires_grad=True)  # our model's single weight
bias   = torch.tensor([10.0], requires_grad=True)  # our model's bias term

print("\nWeight requires grad:", weight.requires_grad)       # True
print("House sizes requires grad:", house_sizes.requires_grad)  # False — data, not a parameter

# --- Checking tensor metadata in one place ---
# Useful diagnostic pattern during debugging
for name, t in [("weight", weight), ("bias", bias), ("sizes", house_sizes)]:
    print(f"{name:8s} | dtype: {t.dtype} | device: {t.device} | requires_grad: {t.requires_grad}")
▶ Output
Sizes tensor: tensor([ 750., 1200., 1800., 2400.])
Shape: torch.Size([4])
Data type: torch.float32

Feature matrix shape: torch.Size([4, 3])

Using device: cpu
Feature matrix device: cpu

Normalised features (first row): tensor([-1.3416, -1.1547, -1.0000])

Weight requires grad: True
House sizes requires grad: False
weight | dtype: torch.float32 | device: cpu | requires_grad: True
bias | dtype: torch.float32 | device: cpu | requires_grad: True
sizes | dtype: torch.float32 | device: cpu | requires_grad: False
⚠ Watch Out: dtype Mismatches Crash at Operation Time, Not Creation Time
If you mix float64 (what NumPy arrays — and therefore torch.from_numpy() — give you by default) with float32 (PyTorch's default for most neural network operations), the error won't appear when you create the tensor — it surfaces later, at the operation, with a message that rarely points back to where the mismatch was introduced. torch.tensor([1.0, 2.0]) infers float32 from Python floats, so the trailing .0 is enough; or be explicit: torch.tensor([1, 2], dtype=torch.float32). In production code, I always add a dtype assertion at the model boundary so the failure is immediate and obvious.
📊 Production Insight
dtype mismatches between float64 and float32 throw RuntimeError at operation time — the stack trace points to the operation, not where the wrong dtype was introduced.
Device mismatches (CPU tensor passed to a GPU model) crash with a clear error but are still the most common debugging session in any new PyTorch project.
In PyTorch 2.x with torch.compile(), dtype and device inconsistencies also prevent kernel fusion, silently costing you throughput on top of correctness.
Rule: set device and dtype at tensor creation, assert at model input boundaries, and never rely on implicit casting.
🎯 Key Takeaway
Tensors carry three critical properties beyond their values: dtype, device, and requires_grad. Getting any one of these wrong silently breaks training in ways that are hard to trace — dtype mismatches crash at the wrong line, device errors surface mid-forward-pass, and missing requires_grad means parameters never update. Set requires_grad only on learnable parameters, never on input data, and treat dtype and device as first-class properties you set intentionally at creation.
Tensor Creation Decision
If: Converting existing Python data (lists, NumPy arrays)
Use: torch.tensor() — it copies the data and infers dtype, but defaults to float32 for Python floats. For large arrays, torch.from_numpy() avoids the copy.
If: Initialising model weights
Use: torch.randn() * init_scale or nn.init.kaiming_normal_ — never initialise all weights to zero; every neuron would compute identical gradients and the network would never differentiate.
If: Need a tensor on GPU from the start
Use: torch.randn(..., device='cuda') at creation — avoids an extra host-to-device copy that torch.randn(...).to('cuda') would incur.
If: Input data for a model
Use: Do NOT set requires_grad=True — only learnable parameters need gradient tracking. Setting it on inputs wastes memory and can silently include input tensors in the backward graph.
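To make the copy-vs-view distinction in the first row concrete, here is a small sketch:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])   # float64, as NumPy defaults to

copied = torch.tensor(arr)         # copies the buffer
shared = torch.from_numpy(arr)     # zero-copy view over the same memory

arr[0] = 99.0                      # mutate the NumPy array after the fact
print(copied[0].item())            # 1.0  — unaffected by the NumPy write
print(shared[0].item())            # 99.0 — sees the change: shared memory
print(shared.dtype)                # torch.float64 — dtype carried over, not float32
```

The shared buffer is why from_numpy is cheap for large arrays — and why mutating the source array afterwards is a subtle footgun.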

Autograd: How PyTorch Learns Without You Doing Calculus

Autograd is the reason PyTorch feels almost magical the first time it clicks. Every time you perform an operation on a tensor that has requires_grad=True, PyTorch silently builds a computation graph — a record of every step taken to produce the final result. When you call .backward() on a scalar output (almost always a loss value), PyTorch traverses that graph in reverse and computes the gradient of that output with respect to every participating tensor.

In plain English: you define the forward pass (what your model predicts), compute how wrong it was (the loss), call .backward(), and PyTorch fills in .grad on every learnable parameter — telling you 'if you nudge this value slightly, here's how much the loss would change.' You then use that information to nudge every parameter in the right direction. That nudge, applied repeatedly, is gradient descent.

Three rules to memorise before shipping anything: (1) .backward() can only be called on a scalar tensor. If your loss is a multi-element tensor, call .mean() or .sum() first or pass a gradient argument. (2) Gradients accumulate by default — every call to .backward() adds to existing .grad values rather than replacing them. Call optimizer.zero_grad() before each backward pass or gradients will pile up across batches and corrupt training in exactly the way the production incident above describes. (3) During inference, wrap code in torch.no_grad() or torch.inference_mode() to skip graph construction entirely — it is faster, uses less memory, and removes an entire class of production bugs.
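Rule (2) is easy to verify directly. A minimal sketch with a single parameter:

```python
import torch

# .grad accumulates across backward() calls — it is never replaced
w = torch.tensor([2.0], requires_grad=True)

loss = (w * 3.0).sum()    # d(loss)/dw = 3
loss.backward()
print(w.grad)             # tensor([3.])

loss = (w * 3.0).sum()    # same computation, fresh graph
loss.backward()
print(w.grad)             # tensor([6.]) — added to the old value, not replaced

w.grad.zero_()            # what optimizer.zero_grad() does for every parameter
print(w.grad)             # tensor([0.])
```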

The graph is destroyed after .backward() completes by default. This is intentional memory management: the graph for one forward pass can consume hundreds of megabytes on a deep network. Without destruction, GPU memory would grow linearly with training steps. This is also why you cannot call .backward() twice on the same graph without retain_graph=True — and retain_graph=True in a training loop is almost always a bug, not a feature.
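You can watch the graph being freed with a two-line experiment. The x ** 2 here is arbitrary — any operation that saves tensors for its backward pass behaves the same:

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
y = (x ** 2).sum()

y.backward()              # traverses the graph, then frees its saved tensors
try:
    y.backward()          # second traversal of an already-freed graph
except RuntimeError as e:
    print("RuntimeError:", e)   # mentions backwarding a second time / retain_graph
```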

One nuance worth knowing as of PyTorch 2.x: torch.compile() can aggressively optimise the forward and backward passes together, but it relies on the graph being consistent across calls. If your forward pass has Python-level control flow that changes based on input values (not just tensor shapes), you may need to mark those branches with torch.compiler.disable() to prevent recompilation overhead on every batch.

io.thecodeforge.ml.autograd_linear_regression.py · PYTHON
import torch

# Seed for reproducibility — always set this in experiments
# Without it, two runs with identical code produce different results and debugging becomes a nightmare
torch.manual_seed(42)

# --- Toy linear regression: predict house price from size ---
# Ground truth relationship: price ≈ 0.18 * size + 5  (the model must discover this)
house_sizes  = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])
house_prices = torch.tensor([140.0, 221.0, 329.0, 437.0])

# Learnable parameters — these are the knobs autograd will compute gradients for
weight = torch.tensor([0.01], requires_grad=True)  # terrible initial guess, intentionally
bias   = torch.tensor([0.01], requires_grad=True)

# Learning rate is tiny because our raw inputs are in the hundreds-to-thousands range
# Without normalisation, you need a proportionally smaller step to avoid overshooting
learning_rate = 1e-7

for epoch in range(6):
    # FORWARD PASS: compute predictions using current weight and bias
    # Broadcasting applies weight and bias across all 4 house sizes simultaneously
    predicted_prices = weight * house_sizes + bias

    # COMPUTE LOSS: Mean Squared Error — average squared error across all predictions
    loss = ((predicted_prices - house_prices) ** 2).mean()

    # ZERO GRADIENTS: must do this before backward()
    # .grad accumulates by default — if we skip this, epoch 2 adds to epoch 1's gradients
    if weight.grad is not None:
        weight.grad.zero_()
        bias.grad.zero_()
    # (In real code you'd use optimizer.zero_grad() instead of this manual approach)

    # BACKWARD PASS: autograd traverses the graph and fills .grad on weight and bias
    # This computes d(loss)/d(weight) and d(loss)/d(bias) via the Chain Rule
    loss.backward()

    # PARAMETER UPDATE: move weight and bias in the direction that reduces loss
    # torch.no_grad() here because we don't want this update operation itself tracked
    with torch.no_grad():
        weight -= learning_rate * weight.grad
        bias   -= learning_rate * bias.grad

    print(f"Epoch {epoch+1:2d} | Loss: {loss.item():.2f} | "
          f"weight: {weight.item():.5f} | bias: {bias.item():.5f} | "
          f"grad_w: {weight.grad.item():.4f}")

print("\nFinal model: price =", round(weight.item(), 4), "* size +", round(bias.item(), 4))
print("Target model:  price = 0.18 * size + 5")
print("Note: bias is far from 5.0 — this is expected with unnormalised features and only 6 epochs")

# Inference — graph construction is wasted work here; inference_mode is faster than no_grad
with torch.inference_mode():
    test_size = torch.tensor([2000.0])
    predicted = weight * test_size + bias
    print(f"\nPredicted price for 2000 sq ft: ${predicted.item():.1f}k")
▶ Output
Epoch 1 | Loss: 78017.80 | weight: 0.04819 | bias: 0.01003 | grad_w: -381924.2500
Epoch 2 | Loss: 65099.14 | weight: 0.08349 | bias: 0.01006 | grad_w: -353016.2500
Epoch 3 | Loss: 54310.95 | weight: 0.11626 | bias: 0.01008 | grad_w: -327650.5000
Epoch 4 | Loss: 45330.67 | weight: 0.14673 | bias: 0.01011 | grad_w: -304708.0000
Epoch 5 | Loss: 37842.37 | weight: 0.17510 | bias: 0.01013 | grad_w: -283840.0000
Epoch 6 | Loss: 31569.55 | weight: 0.20153 | bias: 0.01015 | grad_w: -264344.7500

Final model: price = 0.2015 * size + 0.0102
Target model: price = 0.18 * size + 5
Note: bias is far from 5.0 — this is expected with unnormalised features and only 6 epochs

Predicted price for 2000 sq ft: $403.1k
Mental Model
How Autograd Actually Thinks About Your Computation
Autograd treats every tensor operation as a node in a directed acyclic graph and records exactly how to reverse it — it's the Chain Rule implemented as a graph traversal.
  • Forward pass: execute operations and record the graph — each operation node stores its own gradient function (grad_fn)
  • Backward pass: traverse the graph in reverse from the loss node, applying the Chain Rule at each node to accumulate gradients
  • The graph is rebuilt fresh on every forward pass — it captures the exact computation that just ran, including any Python-level branching
  • requires_grad=True marks a tensor as a leaf node whose .grad we want filled in after backward()
  • The gradient of a scalar loss with respect to all parameters is computed in a single .backward() call — you do not loop over parameters manually
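You can inspect these nodes directly — grad_fn is populated on every non-leaf result (the values below are arbitrary):

```python
import torch

w = torch.tensor([2.0], requires_grad=True)   # leaf node — autograd fills its .grad
x = torch.tensor([5.0])                        # plain data — no tracking

h = w * x                        # recorded as a multiplication node
loss = (h - 1.0).pow(2).mean()   # more nodes chained onto the graph

print(w.grad_fn)                 # None — leaves have no grad_fn
print(h.grad_fn)                 # a MulBackward node: knows how to reverse *
print(loss.grad_fn)              # the final node .backward() starts from

loss.backward()                  # one call fills .grad on every tracked leaf
print(w.grad)                    # d(loss)/dw = 2*(w*x - 1)*x = 2*9*5 = tensor([90.])
```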
📊 Production Insight
The dynamic graph means gradients are always correct for the computation that actually ran — not for a pre-compiled approximation of it.
This is why PyTorch won research: you can change architecture mid-experiment without recompiling anything.
In production with torch.compile(), the dynamic graph gets partially compiled for performance while retaining correctness for control-flow branches.
Rule: if your model has conditional logic that changes which operations run based on input values, PyTorch's dynamic graph handles this correctly where static-graph frameworks historically required workarounds.
🎯 Key Takeaway
Autograd automates the Chain Rule by recording operations in a dynamic computation graph that is rebuilt on every forward pass. The graph is destroyed after backward() by default — retain_graph=True in a training loop is almost always a memory leak waiting to happen. In production, use an optimizer rather than manual weight updates, and always wrap inference in torch.inference_mode() — it disables both gradient computation and version tracking, making it measurably faster than torch.no_grad() for serving workloads.
When to Use Autograd vs Manual Updates
If: Standard neural network training
Use: An optimizer (SGD, Adam, AdamW) — it handles zero_grad, the update rule, momentum, and weight decay. Manual updates are for learning concepts, not shipping code.
If: Need custom gradient logic that PyTorch can't express
Use: torch.autograd.Function to define a custom forward and backward pass — useful for custom CUDA kernels or numerically stable loss functions.
If: Inference only — no weight updates
Use: Wrap in torch.inference_mode() for production serving. Use torch.no_grad() during validation inside training loops where you may still need tensor version tracking.
If: Debugging suspicious gradients
Use: torch.autograd.gradcheck() to numerically verify computed gradients against finite differences — invaluable when implementing custom backward passes.
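As a sketch of the torch.autograd.Function pattern from the second row — reimplementing y = x³ with a hand-written backward and verifying it with gradcheck. Purely illustrative: autograd already differentiates pow, so you would only write this for ops it can't express.

```python
import torch

class Cube(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)       # stash whatever backward will need
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 3 * x ** 2   # chain rule: upstream grad * local derivative

# gradcheck compares the analytic backward against finite differences —
# it needs float64 inputs for numerical headroom
x = torch.randn(4, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(Cube.apply, (x,)))   # True
```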

Building a Real Training Loop with nn.Module

Writing raw tensor operations gets unwieldy past a handful of layers. PyTorch's nn.Module is the standard abstraction for any model — from a one-layer linear regression to a 70-billion-parameter language model. Every nn.Module subclass does two things: defines learnable parameters (or sub-modules that contain them) inside __init__, and defines the forward computation inside forward().

The beauty of nn.Module is composability. A large model is just nn.Module instances containing other nn.Module instances, arbitrarily deep. When you call model.parameters(), PyTorch recursively collects every learnable parameter in the entire tree — that flat iterator is exactly what you hand to the optimizer.
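A minimal sketch of that composability — the Block/Tower names and layer sizes here are invented for illustration:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)   # 16 weights + 4 biases = 20 params

    def forward(self, x):
        return torch.relu(self.linear(x))

class Tower(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(Block(), Block())  # modules containing modules
        self.head = nn.Linear(4, 1)                    # 4 weights + 1 bias = 5 params

    def forward(self, x):
        return self.head(self.blocks(x))

model = Tower()
# parameters() walks the whole tree recursively — one flat iterator
print(sum(p.numel() for p in model.parameters()))   # 20 + 20 + 5 = 45
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # exactly what the optimizer wants
```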

The training loop is the heartbeat of all ML work in PyTorch. It is always the same five steps: zero gradients, forward pass, compute loss, backward pass, optimizer step. That order is not arbitrary — skipping or reordering any step produces a specific and usually hard-to-diagnose failure. Internalise this sequence and you can read any paper's training code cold.

The validation loop is structurally almost identical but with two additions: model.eval() called before the loop, and torch.no_grad() wrapping the forward pass. These solve different problems. model.eval() changes layer behaviour — Dropout stops masking neurons, BatchNorm uses accumulated running statistics instead of batch statistics. torch.no_grad() stops graph construction entirely, saving memory and time. You need both; neither substitutes for the other.
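The Dropout half of that distinction is easy to see in isolation — the same layer in both modes (a minimal sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()                    # what model.train() sets on every sub-module
print(drop(x))                  # roughly half the entries zeroed; survivors scaled to 2.0

drop.eval()                     # what model.eval() sets
print(drop(x))                  # identity: all ones, deterministic
print(torch.equal(drop(x), x))  # True — eval mode passes input through unchanged
```

The 1/(1 - p) scaling in train mode keeps the expected activation magnitude the same in both modes — which is why eval can be a plain pass-through.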

The most common production bug I still see in 2026: calling model.forward(x) directly instead of model(x). It works identically in isolation, but it bypasses all registered forward hooks — hooks that profilers, debuggers, quantisation tools, and libraries like torchvision rely on. Always call the model as a callable. The __call__ method is what wires up the hook infrastructure; forward() is just the computation you define.
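A small experiment makes the difference visible — the hook here does nothing but record that it ran:

```python
import torch
import torch.nn as nn

calls = []
model = nn.Linear(3, 1)
model.register_forward_hook(lambda module, inp, out: calls.append("hook fired"))

x = torch.randn(2, 3)

model(x)            # goes through __call__ — hook runs
print(calls)        # ['hook fired']

model.forward(x)    # bypasses __call__ — hook silently skipped
print(calls)        # still ['hook fired'] — nothing appended
```

Profilers, quantisation tools, and feature-extraction utilities all register hooks exactly like this, which is why direct forward() calls quietly break them.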

io.thecodeforge.ml.neural_network_training_loop.py · PYTHON
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# --- Dataset: synthetic house price prediction ---
# 100 samples, 3 features: normalised size, bedrooms, age
num_samples  = 100
num_features = 3

# Generate synthetic features and a linear target with realistic noise
raw_features = torch.randn(num_samples, num_features)
true_weights  = torch.tensor([0.5, 0.3, -0.2])  # size helps, age hurts price
target_prices = raw_features @ true_weights + 0.1 * torch.randn(num_samples)

# Train / validation split: 80 / 20
train_size = int(0.8 * num_samples)
train_features, val_features = raw_features[:train_size], raw_features[train_size:]
train_targets,  val_targets  = target_prices[:train_size], target_prices[train_size:]


# --- Model definition ---
class HousePriceNet(nn.Module):
    def __init__(self, input_features: int):
        super().__init__()  # always call parent __init__ — skipping this breaks parameter registration

        self.network = nn.Sequential(
            nn.Linear(input_features, 16),  # input -> hidden (16 neurons)
            nn.ReLU(),                       # non-linearity: clamps negatives to zero
            nn.Linear(16, 8),               # hidden -> smaller hidden
            nn.ReLU(),
            nn.Linear(8, 1),                # final layer: one price prediction per sample
        )

    def forward(self, feature_batch: torch.Tensor) -> torch.Tensor:
        # squeeze(1) removes the trailing dimension: (batch_size, 1) -> (batch_size,)
        # This matches the shape of target_prices for MSELoss
        return self.network(feature_batch).squeeze(1)


model     = HousePriceNet(input_features=num_features)
loss_fn   = nn.MSELoss()                             # Mean Squared Error for regression
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # Adam adapts step size per parameter

# Quick sanity check before training begins
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model)
print()

# --- Training loop ---
for epoch in range(1, 51):
    # Always set training mode at the top of the training loop body
    model.train()  # enables Dropout masking and BatchNorm batch statistics updates

    # Step 1: Zero stale gradients from the previous iteration
    # This is the line that the production incident was missing
    optimizer.zero_grad()

    # Step 2: Forward pass — model(x) not model.forward(x)
    # model() wires up forward hooks; model.forward() bypasses them
    train_predictions = model(train_features)

    # Step 3: Compute loss
    train_loss = loss_fn(train_predictions, train_targets)

    # Step 4: Backprop — autograd fills .grad on every parameter
    train_loss.backward()

    # Optional but recommended: log gradient norms before the update step
    # A norm above ~10.0 is worth investigating; above ~100.0 is a red flag
    if epoch % 10 == 0:
        total_norm = sum(p.grad.norm().item() ** 2 for p in model.parameters() if p.grad is not None) ** 0.5

    # Step 5: Update every parameter using its stored gradient
    optimizer.step()

    # --- Validation pass every 10 epochs ---
    if epoch % 10 == 0:
        model.eval()  # disables Dropout, freezes BatchNorm running stats
        with torch.no_grad():  # no graph construction needed — saves memory
            val_predictions = model(val_features)
            val_loss = loss_fn(val_predictions, val_targets)

        print(f"Epoch {epoch:3d} | Train Loss: {train_loss.item():.4f} | "
              f"Val Loss: {val_loss.item():.4f} | Grad Norm: {total_norm:.4f}")

# --- Single inference example ---
# inference_mode is faster than no_grad for serving — disables version tracking too
model.eval()
with torch.inference_mode():
    new_house = torch.tensor([[1.2, 0.5, -0.8]])  # one sample, 3 normalised features
    price_pred = model(new_house)
    expected   = new_house[0] @ true_weights
    print(f"\nPredicted: {price_pred.item():.4f} | Expected (approx): {expected.item():.4f}")
▶ Output
Model parameters: 201
HousePriceNet(
  (network): Sequential(
    (0): Linear(in_features=3, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
    (4): Linear(in_features=8, out_features=1, bias=True)
  )
)

Epoch 10 | Train Loss: 0.2431 | Val Loss: 0.3102 | Grad Norm: 0.3847
Epoch 20 | Train Loss: 0.1187 | Val Loss: 0.1834 | Grad Norm: 0.2214
Epoch 30 | Train Loss: 0.0743 | Val Loss: 0.1214 | Grad Norm: 0.1563
Epoch 40 | Train Loss: 0.0521 | Val Loss: 0.0987 | Grad Norm: 0.1102
Epoch 50 | Train Loss: 0.0389 | Val Loss: 0.0812 | Grad Norm: 0.0831

Predicted: 0.6821 | Expected (approx): 0.7200
🔥 Interview Gold: model.train() vs model.eval() vs torch.no_grad()
These are three separate controls that solve three different problems. model.train() and model.eval() flip a flag that changes layer behaviour — Dropout randomly drops neurons in train mode and passes all of them in eval mode; BatchNorm updates running statistics in train mode and uses them in eval mode. torch.no_grad() is a completely separate mechanism that tells the autograd engine to stop building the computation graph. You can call model.eval() with gradients still flowing (unusual but valid), or run a forward pass in train mode inside a torch.no_grad() block (occasionally done to refresh BatchNorm running statistics without paying for gradient computation). Forgetting model.eval() during validation is one of the most common bugs in PyTorch codebases — your validation loss will fluctuate unpredictably and you will spend time blaming your learning rate or data pipeline.
📊 Production Insight
model.train() and model.eval() control Dropout and BatchNorm behaviour — not gradient computation. torch.no_grad() controls gradient computation — not layer behaviour. You need both for a correct validation loop and they must be called in the right order: model.eval() first, then enter the torch.no_grad() context.
In PyTorch 2.x, torch.compile() is compatible with both — but set the mode you intend to run in before compiling, and avoid toggling between .train() and .eval() on a compiled model, since a mode switch can trigger recompilation.
Rule: add print(model.training) as a one-time sanity check when setting up any new evaluation loop — it has saved me from at least three subtle bugs in production codebases.
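A minimal sketch of that ordering, using a hypothetical toy model (the nn.Sequential below is an illustrative stand-in, not the article's HousePriceNet):

```python
import torch
import torch.nn as nn

# Stand-in model with a Dropout layer, so the training flag actually matters
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 1))

model.eval()                      # 1. flip layer behaviour first
print(model.training)             # False: the one-time sanity check from the rule above

with torch.no_grad():             # 2. then stop graph construction
    preds = model(torch.randn(16, 4))
    print(preds.requires_grad)    # False: no graph was built for this pass

model.train()                     # back to training mode afterwards
print(model.training)             # True
```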
🎯 Key Takeaway
The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model — the order is load-bearing, not stylistic. model.train() and model.eval() control layer behaviour (Dropout, BatchNorm); torch.no_grad() controls graph construction. Always call the model as a callable (model(x)), never model.forward(x) — the __call__ method is what wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on.
Training Loop Step Selection
If: Standard training iteration
Use: zero_grad -> forward -> loss -> backward -> step — never skip or reorder. Each step depends on the previous one completing correctly.

If: Validation or evaluation pass
Use: model.eval() + torch.no_grad() -> forward -> loss — no backward or step. Both calls are required; neither replaces the other.

If: Production inference / serving
Use: model.eval() + torch.inference_mode() -> forward — fastest path, disables both graph construction and version counter tracking.

If: Gradient accumulation for large effective batch sizes
Use: Call backward() every step; call optimizer.step() + zero_grad() only every N steps. Divide the loss by N before backward() to keep gradient magnitudes consistent with a single large batch.
🗂 PyTorch vs TensorFlow 1.x — Architectural Differences
Why PyTorch's dynamic graph won the research community and what it costs in production
Graph construction
  • PyTorch (Dynamic Graph): Built at runtime on every forward pass — debug with standard Python tools anywhere in the loop
  • TensorFlow 1.x (Static Graph): Pre-compiled before any data flows through — the graph was fixed at definition time, making runtime inspection nearly impossible

Debugging
  • PyTorch: Standard Python debugger, print(), and pdb work anywhere in the forward pass with no special configuration
  • TensorFlow 1.x: Required special tf.Print ops inserted into the graph; runtime errors produced stack traces that pointed to graph compilation, not the user code that caused them

Research flexibility
  • PyTorch: Architecture changes take effect immediately — swap a layer, change a loss function, add a branch mid-experiment with no recompilation step
  • TensorFlow 1.x: Any architectural change required rebuilding and recompiling the graph, which could take seconds to minutes for large models

Production deployment
  • PyTorch: TorchScript or ONNX export required for optimised serving without a Python runtime; torch.compile() in 2.x closes most of the performance gap for GPU serving
  • TensorFlow 1.x: SavedModel format was natively optimised for TF Serving; the static graph made deployment straightforward but locked you into the graph you compiled

Community adoption
  • PyTorch: Dominant in research — over 75% of ML papers published in 2024-2025 used PyTorch as the primary framework
  • TensorFlow 1.x: Remains strong in enterprise production systems built before 2020; legacy TF1 codebases are still running in many large organisations

GPU memory control
  • PyTorch: Explicit .to(device) — you decide what moves and when; nothing migrates automatically
  • TensorFlow 1.x: Automatic placement with manual overrides via tf.device() context managers; less control but fewer explicit device calls

Gradient control
  • PyTorch: requires_grad per tensor; torch.no_grad() and torch.inference_mode() context managers; fine-grained control at the tensor level
  • TensorFlow: GradientTape context manager in TF 2.x — similar concept but opt-in rather than opt-out; in TF 1.x gradients were computed by tf.gradients() on the pre-compiled graph

🎯 Key Takeaways

  • Tensors carry three critical properties beyond their values: dtype, device, and requires_grad — getting any one of these wrong silently breaks training in ways that trace to the wrong location in the stack.
  • Autograd doesn't run continuously; it only records a computation graph when requires_grad=True tensors are involved, and only computes gradients when you explicitly call .backward() on a scalar loss. The graph is destroyed after each backward pass by default — retain_graph=True in a loop is almost always a memory leak.
  • The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model — the order is load-bearing. Memorise it and you can read any codebase or paper's training code cold.
  • model.train() and model.eval() control layer behaviour like Dropout and BatchNorm. torch.no_grad() controls gradient computation. These are three separate mechanisms. Confusing them is the single most common source of subtle training bugs in production PyTorch code.
  • Always call model as a callable (model(x)), never model.forward(x) — the __call__ method wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on. This one habit prevents an entire class of silent tooling failures.

⚠ Common Mistakes to Avoid

    Forgetting optimizer.zero_grad() before loss.backward()
    Symptom

Loss appears to decrease during training but predictions are garbage in production. Gradient norms grow steadily across epochs because every backward() adds to the stale gradients left over from previous iterations, so each update step is larger than the last. The run can look like it converged while the weights were actually being driven by accumulated noise.

    Fix

    Call optimizer.zero_grad() as the first line of every training step — before the forward pass, before the loss computation, before anything. Add gradient norm logging to your training dashboard. Consider gradient clipping with max_norm=1.0 as a permanent safety net, not just a debugging tool.
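A sketch of where the clipping call and norm logging sit in the five-step loop, assuming a throwaway nn.Linear model and random data:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                      # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 3), torch.randn(64, 1)

optimizer.zero_grad()                        # Step 1: first line, every iteration
loss = nn.functional.mse_loss(model(x), y)   # Steps 2-3: forward + loss
loss.backward()                              # Step 4: fill .grad on every parameter

# Safety net: clip BETWEEN backward() and step(), never before backward().
# clip_grad_norm_ returns the total norm as measured before clipping,
# which is exactly the number to send to your training dashboard.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"pre-clip grad norm: {grad_norm:.4f}")

optimizer.step()                             # Step 5: apply the (clipped) update
```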

    Calling model.forward(x) directly instead of model(x)
    Symptom

    Profilers produce no output. Libraries that register forward hooks (torchvision, quantisation tools, some logging frameworks) silently fail to fire. The model produces correct predictions but all hook-dependent tooling is invisible.

    Fix

    Always call the model as a callable: predictions = model(input_tensor). The __call__ method is where forward hooks, backward hooks, and the training mode flag are applied. Your forward() method is called internally — it is not the entry point.
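A quick way to see the difference: this sketch registers a forward hook on a throwaway module and shows that only the callable path fires it.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)           # throwaway module for demonstration
fired = []
model.register_forward_hook(lambda mod, inp, out: fired.append(True))

model(torch.randn(2, 3))          # __call__ path: runs the hook
print(len(fired))                 # 1

model.forward(torch.randn(2, 3))  # bypasses the hook infrastructure entirely
print(len(fired))                 # still 1, the hook silently never fired
```

Both calls return correct predictions, which is exactly why this bug is invisible until a profiler or quantisation tool produces nothing.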

    Storing raw loss tensors in a list for logging
    Symptom

    GPU memory grows steadily across epochs with no obvious leak in the model code. Each epoch consumes more memory than the last until an out-of-memory crash occurs, often in the middle of a long training run.

    Fix

    Use loss.item() to extract a plain Python float before storing or logging. .item() detaches the scalar from the computation graph. Never append loss itself to a list — it keeps the entire graph alive for that batch in memory indefinitely.
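A minimal illustration of the difference, with a throwaway model:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)           # throwaway model
loss = nn.functional.mse_loss(model(torch.randn(8, 3)), torch.randn(8, 1))

history_bad  = [loss]             # keeps the whole computation graph alive
history_good = [loss.item()]      # plain Python float, graph can be freed

print(type(history_bad[0]))       # <class 'torch.Tensor'>
print(type(history_good[0]))      # <class 'float'>
print(history_bad[0].grad_fn)     # not None: proof the graph is still referenced
```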

    Not calling model.eval() during validation
    Symptom

    Validation loss fluctuates wildly between epochs even when training loss is smooth and decreasing. The model appears unstable, but the instability is in the evaluation, not the model weights.

    Fix

    Call model.eval() before every validation loop and model.train() before every training loop. Treat them as a matched pair. If your codebase has multiple evaluation paths (validation, test, inference), add a utility function that ensures eval mode is set and inference_mode is active — centralise it so it cannot be forgotten.
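One possible shape for that utility function; evaluation() is a hypothetical helper name, not a PyTorch API:

```python
import torch
import torch.nn as nn
from contextlib import contextmanager

@contextmanager
def evaluation(model):
    """Hypothetical helper: guarantees eval mode plus inference_mode,
    then restores the previous training state on exit."""
    was_training = model.training
    model.eval()
    try:
        with torch.inference_mode():
            yield model
    finally:
        if was_training:
            model.train()

# Usage: neither call can be forgotten, and train mode comes back automatically
model = nn.Linear(2, 1)           # stand-in model
with evaluation(model) as m:
    preds = m(torch.randn(4, 2))
print(model.training)             # True: training state restored on exit
```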

    Using torch.no_grad() instead of torch.inference_mode() for production serving
    Symptom

    Inference is slower than expected and uses more memory than necessary. Not a crash — a silent performance regression that is easy to miss without profiling.

    Fix

    Use torch.inference_mode() for all production inference paths. It disables both gradient computation and version counter tracking, providing 10-20% faster execution on typical transformer and CNN architectures. Reserve torch.no_grad() for validation loops inside training runs where you may still need version tracking for other operations.
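A small sketch of the extra guarantee inference_mode gives you (the toy model is an assumption):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1).eval()    # throwaway model for demonstration

with torch.inference_mode():
    out = model(torch.randn(2, 3))

print(out.requires_grad)          # False
print(out.is_inference())         # True: permanently excluded from autograd

# Using an inference tensor in a later autograd computation raises a
# RuntimeError, which is exactly the guard you want on a serving path.
```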

Frequently Asked Questions

What is the difference between PyTorch and NumPy?

NumPy arrays live only on the CPU and have no concept of gradients or automatic differentiation. PyTorch tensors can live on a GPU — which is what makes large matrix operations fast enough for deep learning in practice — and tensors with requires_grad=True automatically track every operation performed on them so that gradients can be computed via .backward(). For pure numerical computing with no learning involved, NumPy is lighter and more widely supported in the scientific Python ecosystem. The moment you need a model to learn from data, PyTorch is the right tool. Many teams also mix both: NumPy for data preprocessing and analysis, PyTorch for the model itself.
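A sketch of that mixed workflow, showing that torch.from_numpy shares memory with the source array rather than copying it:

```python
import numpy as np
import torch

features = np.random.rand(4, 3).astype(np.float32)   # preprocessing in NumPy

tensor = torch.from_numpy(features)    # zero-copy: shares the array's buffer
print(tensor.dtype)                    # torch.float32, dtype carries over

tensor[0, 0] = -1.0                    # mutating the tensor...
print(features[0, 0])                  # -1.0, ...mutates the array too

back = tensor.numpy()                  # zero-copy in the other direction (CPU only)
```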

When should I use torch.no_grad()?

Any time you are running the model but not updating its weights — validation during training, evaluation on a test set, or production inference. Without it, PyTorch builds the full computation graph on every forward pass regardless of whether you call backward(), which wastes memory proportional to your model depth and batch size. For validation loops inside a training run, torch.no_grad() is the right choice. For production serving, use torch.inference_mode() instead — it is faster because it also disables version counter tracking, and tensors created inside it cannot accidentally be used in a backward pass.

Why does my PyTorch model train fine on CPU but crash on GPU?

Almost always a device mismatch — your model is on the GPU but your input tensors are still on the CPU, or a specific tensor created inside your forward pass defaults to CPU while the model parameters are on CUDA. Every tensor involved in a single operation must be on the same device. The fix is calling input_tensor = input_tensor.to(device) in your data loading step, and ensuring device matches whatever you passed to model.to(device). If you are using a custom DataLoader collate function, that is a common place where tensors quietly stay on CPU without triggering an obvious error until the first forward pass.
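A minimal sketch of the device-threading pattern (the toy model and names are illustrative):

```python
import torch
import torch.nn as nn

# Pick the device once and thread it through everything
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(3, 1).to(device)    # parameters move to the device
batch = torch.randn(8, 3)             # fresh tensors default to CPU

# .to(device) returns a NEW tensor, so reassignment is required
batch = batch.to(device)
out = model(batch)                    # devices now match, no RuntimeError

print(out.device == next(model.parameters()).device)  # True
```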

What does model.train() actually do?

model.train() sets the model's internal training flag to True, which changes the runtime behaviour of specific layer types. Dropout layers start randomly masking neurons according to the configured drop probability. BatchNorm layers update their running mean and variance statistics using each batch's statistics rather than the accumulated running values. It does not enable gradient computation — that is controlled by requires_grad on individual tensors and the torch.no_grad() context manager, which are completely independent mechanisms. Calling model.train() before each training loop is not optional: modules do start in training mode, but after any evaluation pass a model with Dropout or BatchNorm left in eval mode produces optimistically low training loss and unreliable generalisation.
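The Dropout behaviour described above is easy to observe directly on a standalone layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half the values zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity pass: all ones, no masking and no scaling
```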

How do I save and load a PyTorch model for production?

Save only the learned weights using torch.save(model.state_dict(), 'model.pt'). Load them with model.load_state_dict(torch.load('model.pt', weights_only=True)) — the weights_only=True argument is important from a security standpoint as of PyTorch 2.x; it prevents arbitrary code execution from a malicious checkpoint file. Always call model.eval() after loading for inference. For deployment in environments without a Python runtime, export to TorchScript with torch.jit.script(model) or to ONNX with torch.onnx.export(). Never pickle the entire model object — it binds the weights to your exact class definition, PyTorch version, and Python version, which makes it fragile across environments and time.
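A minimal sketch of that save/load round trip (the file name model.pt and the nn.Linear stand-in are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                              # stand-in for your trained model
torch.save(model.state_dict(), "model.pt")           # weights only, no class pickle

restored = nn.Linear(3, 1)                           # rebuild the architecture in code
restored.load_state_dict(torch.load("model.pt", weights_only=True))
restored.eval()                                      # always, before inference

x = torch.randn(2, 3)
print(torch.equal(model(x), restored(x)))            # True: identical predictions
```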

Naren — Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
