PyTorch Basics Explained: Tensors, Autograd, and Real Model Training
- PyTorch tensors are multi-dimensional arrays that live on CPU or GPU and optionally track gradients for backpropagation
- requires_grad=True opts a tensor into the autograd engine — only set it on learnable parameters, never on input data
- The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model
- model.train() and model.eval() control layer behaviour (Dropout, BatchNorm) — they do NOT control gradient computation
- Forgetting optimizer.zero_grad() causes gradient accumulation, which silently corrupts training
- Always use torch.inference_mode() or torch.no_grad() during validation and serving — not optional in production
Production Debug Guide

Common symptoms when training goes wrong, with the first diagnostics to run for each:

Loss becomes NaN during training

```python
torch.autograd.set_detect_anomaly(True)
print([(n, p.grad.norm()) for n, p in model.named_parameters() if p.grad is not None])
```

Use torch.autograd.set_detect_anomaly(True) to identify which operation produced the NaN gradient. In my experience, the most common culprit is a log() applied to a prediction that dipped to exactly zero — add a small epsilon (1e-8) inside any log call in your loss function.

GPU memory grows each epoch

```python
torch.cuda.memory_summary()
print(loss.item())  # NOT print(loss) — .item() detaches from graph
```

The usual cause is appending the full loss tensor (rather than loss.item()) to a history list. Use .item() for scalar logging and .detach() for tensor logging. Also check for retain_graph=True being called repeatedly — it is almost never necessary in standard training and will silently accumulate the entire graph in memory.

Gradients are all zeros — model not learning

```python
for n, p in model.named_parameters():
    print(n, p.requires_grad, p.grad is not None)
torch.autograd.gradcheck(model, (test_input,))
```

Check for a stray torch.no_grad() wrapping the training loop — this is surprisingly easy to do when refactoring inference code into a shared utility. Also check for dead ReLU initialisation: if all pre-activations are negative at init, the entire gradient signal is zero from step one.

Model trains but validation metrics are random

```python
print(model.training)  # Should be False during validation
print(any(isinstance(m, nn.Dropout) for m in model.modules()))
```

Verify that model.eval() is called before validation. Without it, Dropout randomly drops different neurons on every forward pass, and BatchNorm uses the current batch's statistics instead of the accumulated running statistics. The result is non-deterministic validation outputs even on identical input data — which looks exactly like training instability but is actually an evaluation bug.

Production Incident

The root cause was a missing optimizer.zero_grad(). PyTorch accumulates gradients by default — every backward() call adds to existing .grad values rather than replacing them. After 200 epochs of a decently-sized batch size, the accumulated gradient magnitude was effectively 200x the correct value for the first batch seen. The optimizer was applying enormous, compounding weight updates that oscillated wildly around the loss minimum without ever settling. The model ended up with effectively random weights that happened to produce low training loss by memorising noise in the first few batches — a classic overfitting-via-gradient-corruption failure that is nearly impossible to diagnose from loss curves alone.

The fix was optimizer.zero_grad() as the first line of every training step. Added gradient norm logging to the training dashboard — a norm above 10.0 now triggers an alert. Added gradient clipping (max_norm=1.0) as a standing safety net across all training jobs. Added validation loss divergence detection — an alert fires if val loss increases for five consecutive epochs relative to the rolling minimum.

Lessons learned:

- zero_grad() is not optional; it is the first line of every training step
- Monitor gradient norms during training — a sudden spike almost always indicates accumulation or an unchecked learning rate schedule
- Validation loss trending down is not sufficient signal — always check for divergence between train loss and val loss over time
- Gradient clipping prevents catastrophic divergence from outlier batches or accumulation bugs — set it once and leave it on

PyTorch has become the dominant choice in academic research and is rapidly closing the gap in production systems. Understanding its foundations means you can read any ML paper's code, contribute to AI projects, and stop copy-pasting model architectures you don't understand.
The core problem PyTorch solves is bridging the gap between 'I have an idea for a model' and 'I have a working, trained model.' A library like NumPy can store data, but it can't automatically track how a change in one number ripples through a thousand operations to affect a final error score. PyTorch does this invisibly with its autograd engine — and as of 2026, that engine underpins everything from two-layer regression models to the transformer architectures powering production LLMs.
The most common production failure I see: developers understand the happy path but not the failure modes. Training loops that silently accumulate gradients, validation code that forgets model.eval(), and inference that wastes GPU memory by not disabling autograd. This guide covers both the concepts and the production gotchas — because shipping a model that actually works in production is a different skill from getting a notebook to converge.
Tensors: The DNA of Every PyTorch Model
A tensor is PyTorch's fundamental data container — think of it as a NumPy array that can live on a GPU and remember every operation ever performed on it. A 1D tensor is a list of numbers (a vector), a 2D tensor is a table (a matrix), and a 3D tensor might be a batch of images where the three dimensions are height, width, and colour channel.
What makes tensors special isn't the shape — it's the metadata they carry. Every tensor knows its data type (dtype), its device (CPU or CUDA GPU), and optionally whether it should track gradients. That last flag is what separates a plain number-holder from a value that participates in learning.
You'll reach for torch.tensor() when you're converting existing Python data, torch.zeros() or torch.ones() when initialising buffers, and torch.randn() for random initialisation with a standard normal distribution. The device placement decision — CPU vs GPU — happens at creation time, and moving data between devices is explicit, never automatic. That explicitness is a feature, not an oversight; it forces you to reason about where computation actually happens, which is the difference between a model that fits in GPU memory and one that crashes at batch two.
As of PyTorch 2.x, torch.compile() can fuse tensor operations into optimised kernels automatically — but only if your tensors are on the right device and dtype from the start. Sloppy tensor hygiene becomes measurably more expensive in 2026 than it was when compilation wasn't part of the picture.
The dtype mismatch is the most common silent failure: Python integer literals become int64 tensors, NumPy double-precision arrays arrive as float64, and PyTorch defaults to float32 for Python floats and for model parameters. Feeding a mismatched tensor into a layer throws a RuntimeError at operation time, not at creation time — so the error surfaces somewhere unexpected. Always pass floats with a trailing .0 or specify dtype explicitly at creation.
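A minimal sketch of this failure mode — the layer and tensors here are illustrative, not part of the house-price example below:

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 1)               # parameters are float32 by default

good = torch.tensor([1.0, 2.0, 3.0])  # trailing .0 -> float32
bad = torch.tensor([1, 2, 3])         # Python ints -> int64

print(good.dtype)  # torch.float32
print(bad.dtype)   # torch.int64

layer(good)        # fine: dtypes match
try:
    layer(bad)     # fails HERE, at operation time, not where bad was created
except RuntimeError as err:
    print("RuntimeError:", err)
```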
```python
import torch

# --- Creating tensors from real data ---
# Simulating a tiny dataset: 4 house sizes (sq ft) and their prices ($k)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])   # 1D tensor, shape (4,)
house_prices = torch.tensor([150.0, 220.0, 310.0, 410.0])     # 1D tensor, shape (4,)

print("Sizes tensor:", house_sizes)
print("Shape:", house_sizes.shape)       # torch.Size([4])
print("Data type:", house_sizes.dtype)   # torch.float32 — default for floats

# --- 2D tensor: batch of data (rows = samples, cols = features) ---
feature_matrix = torch.tensor([
    [750.0, 3.0, 1.0],    # size, bedrooms, bathrooms
    [1200.0, 4.0, 2.0],
    [1800.0, 4.0, 3.0],
    [2400.0, 5.0, 3.0],
])
print("\nFeature matrix shape:", feature_matrix.shape)  # torch.Size([4, 3])

# --- Device awareness: check and move to GPU if available ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("\nUsing device:", device)

# Move tensor to the target device — always do this before operations
# In PyTorch 2.x, doing this at creation time avoids an extra host-to-device copy
feature_matrix = feature_matrix.to(device)
print("Feature matrix device:", feature_matrix.device)

# --- Useful tensor operations ---
# Normalise features: (x - mean) / std — critical for stable training
# dim=0 means we compute one mean per column (per feature), across all rows (samples)
means = feature_matrix.mean(dim=0)
stds = feature_matrix.std(dim=0)
normalised = (feature_matrix - means) / stds
print("\nNormalised features (first row):", normalised[0])

# --- requires_grad: opting a tensor INTO gradient tracking ---
# We do NOT set this on input data — only on learnable parameters
# Input data is fixed; we want gradients w.r.t. parameters, not the data itself
weight = torch.tensor([0.15], requires_grad=True)  # our model's single weight
bias = torch.tensor([10.0], requires_grad=True)    # our model's bias term

print("\nWeight requires grad:", weight.requires_grad)           # True
print("House sizes requires grad:", house_sizes.requires_grad)   # False — data, not a parameter

# --- Checking tensor metadata in one place ---
# Useful diagnostic pattern during debugging
for name, t in [("weight", weight), ("bias", bias), ("sizes", house_sizes)]:
    print(f"{name:8s} | dtype: {t.dtype} | device: {t.device} | requires_grad: {t.requires_grad}")
```
```
Sizes tensor: tensor([ 750., 1200., 1800., 2400.])
Shape: torch.Size([4])
Data type: torch.float32

Feature matrix shape: torch.Size([4, 3])

Using device: cpu
Feature matrix device: cpu

Normalised features (first row): tensor([-1.3416, -1.1547, -1.0000])

Weight requires grad: True
House sizes requires grad: False

weight   | dtype: torch.float32 | device: cpu | requires_grad: True
bias     | dtype: torch.float32 | device: cpu | requires_grad: True
sizes    | dtype: torch.float32 | device: cpu | requires_grad: False
```
- With torch.compile(), dtype and device inconsistencies also prevent kernel fusion, silently costing you throughput on top of correctness.
- Reach for torch.tensor() when converting existing Python data — it copies the data and infers dtype, defaulting to float32 for Python floats. For large arrays, torch.from_numpy() avoids the copy.
- Initialise weights with torch.randn() * init_scale or nn.init.kaiming_normal_ — never initialise all weights to zero; every neuron would compute identical gradients and the network would never differentiate.

Autograd: How PyTorch Learns Without You Doing Calculus
Autograd is the reason PyTorch feels almost magical the first time it clicks. Every time you perform an operation on a tensor that has requires_grad=True, PyTorch silently builds a computation graph — a record of every step taken to produce the final result. When you call .backward() on a scalar output (almost always a loss value), PyTorch traverses that graph in reverse and computes the gradient of that output with respect to every participating tensor.
In plain English: you define the forward pass (what your model predicts), compute how wrong it was (the loss), call .backward(), and PyTorch fills in .grad on every learnable parameter — telling you 'if you nudge this value slightly, here's how much the loss would change.' You then use that information to nudge every parameter in the right direction. That nudge, applied repeatedly, is gradient descent.
Three rules to memorise before shipping anything: (1) .backward() can only be called on a scalar tensor. If your loss is a multi-element tensor, call .mean() or .sum() first or pass a gradient argument. (2) Gradients accumulate by default — every call to .backward() adds to existing .grad values rather than replacing them. Call optimizer.zero_grad() before each backward pass or gradients will pile up across batches and corrupt training in exactly the way the production incident above describes. (3) During inference, wrap code in torch.no_grad() or torch.inference_mode() to skip graph construction entirely — it is faster, uses less memory, and removes an entire class of production bugs.
The graph is destroyed after .backward() completes by default. This is intentional memory management: the graph for one forward pass can consume hundreds of megabytes on a deep network. Without destruction, GPU memory would grow linearly with training steps. This is also why you cannot call .backward() twice on the same graph without retain_graph=True — and retain_graph=True in a training loop is almost always a bug, not a feature.
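Both behaviours are easy to demonstrate in a few lines; the tensor below is a toy, not part of the house-price example:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Gradients accumulate: each backward() ADDS to x.grad
for step in range(3):
    loss = (3.0 * x).sum()   # fresh graph each iteration; d(loss)/dx = 3
    loss.backward()
    print(x.grad.item())     # 3.0, then 6.0, then 9.0

x.grad.zero_()               # what optimizer.zero_grad() does, per parameter

# The graph is freed after backward(): a second call on the same graph fails
loss = (3.0 * x).sum()
loss.backward()
try:
    loss.backward()          # would need retain_graph=True, and usually shouldn't
except RuntimeError as err:
    print("As expected:", type(err).__name__)
```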
One nuance worth knowing as of PyTorch 2.x: torch.compile() can aggressively optimise the forward and backward passes together, but it relies on the graph being consistent across calls. If your forward pass has Python-level control flow that changes based on input values (not just tensor shapes), you may need to mark those branches with torch.compiler.disable() to prevent recompilation overhead on every batch.
```python
import torch

# Seed for reproducibility — always set this in experiments
# Without it, two runs with identical code produce different results
# and debugging becomes a nightmare
torch.manual_seed(42)

# --- Toy linear regression: predict house price from size ---
# Ground truth relationship: price ≈ 0.18 * size + 5 (the model must discover this)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])
house_prices = torch.tensor([140.0, 221.0, 329.0, 437.0])

# Learnable parameters — these are the knobs autograd will compute gradients for
weight = torch.tensor([0.01], requires_grad=True)  # terrible initial guess, intentionally
bias = torch.tensor([0.01], requires_grad=True)

# Learning rate is tiny because our raw inputs are in the hundreds-to-thousands range
# Without normalisation, you need a proportionally smaller step to avoid overshooting
learning_rate = 1e-7

for epoch in range(6):
    # FORWARD PASS: compute predictions using current weight and bias
    # Broadcasting applies weight and bias across all 4 house sizes simultaneously
    predicted_prices = weight * house_sizes + bias

    # COMPUTE LOSS: Mean Squared Error — average squared error across all predictions
    loss = ((predicted_prices - house_prices) ** 2).mean()

    # ZERO GRADIENTS: must do this before backward()
    # .grad accumulates by default — if we skip this, epoch 2 adds to epoch 1's gradients
    if weight.grad is not None:
        weight.grad.zero_()
        bias.grad.zero_()
    # (In real code you'd use optimizer.zero_grad() instead of this manual approach)

    # BACKWARD PASS: autograd traverses the graph and fills .grad on weight and bias
    # This computes d(loss)/d(weight) and d(loss)/d(bias) via the Chain Rule
    loss.backward()

    # PARAMETER UPDATE: move weight and bias in the direction that reduces loss
    # torch.no_grad() here because we don't want this update operation itself tracked
    with torch.no_grad():
        weight -= learning_rate * weight.grad
        bias -= learning_rate * bias.grad

    print(f"Epoch {epoch+1:2d} | Loss: {loss.item():.2f} | "
          f"weight: {weight.item():.5f} | bias: {bias.item():.5f} | "
          f"grad_w: {weight.grad.item():.4f}")

print("\nFinal model: price =", round(weight.item(), 4), "* size +", round(bias.item(), 4))
print("Target model: price = 0.18 * size + 5")
print("Note: bias is far from 5.0 — this is expected with unnormalised features and only 6 epochs")

# Inference — graph construction is wasted work here; inference_mode is faster than no_grad
with torch.inference_mode():
    test_size = torch.tensor([2000.0])
    predicted = weight * test_size + bias
    print(f"\nPredicted price for 2000 sq ft: ${predicted.item():.1f}k")
```
```
Epoch 2 | Loss: 65099.14 | weight: 0.08349 | bias: 0.01006 | grad_w: -353016.2500
Epoch 3 | Loss: 54310.95 | weight: 0.11626 | bias: 0.01008 | grad_w: -327650.5000
Epoch 4 | Loss: 45330.67 | weight: 0.14673 | bias: 0.01011 | grad_w: -304708.0000
Epoch 5 | Loss: 37842.37 | weight: 0.17510 | bias: 0.01013 | grad_w: -283840.0000
Epoch 6 | Loss: 31569.55 | weight: 0.20153 | bias: 0.01015 | grad_w: -264344.7500

Final model: price = 0.2015 * size + 0.0102
Target model: price = 0.18 * size + 5
Note: bias is far from 5.0 — this is expected with unnormalised features and only 6 epochs

Predicted price for 2000 sq ft: $403.1k
```
- Forward pass: execute operations and record the graph — each operation node stores its own gradient function (grad_fn)
- Backward pass: traverse the graph in reverse from the loss node, applying the Chain Rule at each node to accumulate gradients
- The graph is rebuilt fresh on every forward pass — it captures the exact computation that just ran, including any Python-level branching
- requires_grad=True marks a tensor as a leaf node whose .grad we want filled in after backward()
- The gradient of a scalar loss with respect to all parameters is computed in a single .backward() call — you do not loop over parameters manually
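You can inspect the recorded graph directly through grad_fn. This tiny walk-through is illustrative, not from the article's examples:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)   # leaf node
y = (x * 3 + 1) ** 2                          # graph: Mul -> Add -> Pow

print(type(y.grad_fn).__name__)               # PowBackward0, the last op recorded
print(y.grad_fn.next_functions)               # links back toward the leaf tensor

y.backward()                                  # chain rule: dy/dx = 2*(3x+1)*3
print(x.grad.item())                          # 42.0 at x = 2
```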
- With torch.compile(), the dynamic graph gets partially compiled for performance while retaining correctness for control-flow branches.
- The graph is freed after backward() by default — retain_graph=True in a training loop is almost always a memory leak waiting to happen.
- In production, use an optimizer rather than manual weight updates, and wrap serving code in torch.inference_mode() — it disables both gradient computation and version tracking, making it measurably faster than torch.no_grad(). Use torch.no_grad() during validation inside training loops where you may still need tensor version tracking.
- Use torch.autograd.gradcheck() to numerically verify computed gradients against finite differences — invaluable when implementing custom backward passes.

Building a Real Training Loop with nn.Module
Writing raw tensor operations gets unwieldy past a handful of layers. PyTorch's nn.Module is the standard abstraction for any model — from a one-layer linear regression to a 70-billion-parameter language model. Every nn.Module subclass does two things: defines learnable parameters (or sub-modules that contain them) inside __init__, and defines the forward computation inside forward().
The beauty of nn.Module is composability. A large model is just nn.Module instances containing other nn.Module instances, arbitrarily deep. When you call model.parameters(), PyTorch recursively collects every learnable parameter in the entire tree — that flat iterator is exactly what you hand to the optimizer.
The training loop is the heartbeat of all ML work in PyTorch. It is always the same five steps: zero gradients, forward pass, compute loss, backward pass, optimizer step. That order is not arbitrary — skipping or reordering any step produces a specific and usually hard-to-diagnose failure. Internalise this sequence and you can read any paper's training code cold.
The validation loop is structurally almost identical but with two additions: model.eval() called before the loop, and torch.no_grad() wrapping the forward pass. These solve different problems. model.eval() changes layer behaviour — Dropout stops masking neurons, BatchNorm uses accumulated running statistics instead of batch statistics. torch.no_grad() stops graph construction entirely, saving memory and time. You need both; neither substitutes for the other.
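The Dropout half of that distinction is easy to see in isolation; the standalone layer below is illustrative, not the article's model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2.0

drop.eval()
print(drop(x))   # identity: all ones, deterministic on every call
```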
The most common production bug I still see in 2026: calling model.forward(x) directly instead of model(x). It works identically in isolation, but it bypasses all registered forward hooks — hooks that profilers, debuggers, quantisation tools, and libraries like torchvision rely on. Always call the model as a callable. The __call__ method is what wires up the hook infrastructure; forward() is just the computation you define.
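A forward hook makes the difference visible. The counter hook here is a hypothetical example, not part of any real tooling:

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
calls = []
model.register_forward_hook(lambda mod, inputs, output: calls.append(1))

x = torch.randn(1, 2)
model(x)            # __call__ runs the hook
model.forward(x)    # same numbers come out, but the hook never fires
print(len(calls))   # 1, not 2
```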
```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# --- Dataset: synthetic house price prediction ---
# 100 samples, 3 features: normalised size, bedrooms, age
num_samples = 100
num_features = 3

# Generate synthetic features and a linear target with realistic noise
raw_features = torch.randn(num_samples, num_features)
true_weights = torch.tensor([0.5, 0.3, -0.2])  # size helps, age hurts price
target_prices = raw_features @ true_weights + 0.1 * torch.randn(num_samples)

# Train / validation split: 80 / 20
train_size = int(0.8 * num_samples)
train_features, val_features = raw_features[:train_size], raw_features[train_size:]
train_targets, val_targets = target_prices[:train_size], target_prices[train_size:]

# --- Model definition ---
class HousePriceNet(nn.Module):
    def __init__(self, input_features: int):
        super().__init__()  # always call parent __init__ — skipping this breaks parameter registration
        self.network = nn.Sequential(
            nn.Linear(input_features, 16),  # input -> hidden (16 neurons)
            nn.ReLU(),                      # non-linearity: clamps negatives to zero
            nn.Linear(16, 8),               # hidden -> smaller hidden
            nn.ReLU(),
            nn.Linear(8, 1),                # final layer: one price prediction per sample
        )

    def forward(self, feature_batch: torch.Tensor) -> torch.Tensor:
        # squeeze(1) removes the trailing dimension: (batch_size, 1) -> (batch_size,)
        # This matches the shape of target_prices for MSELoss
        return self.network(feature_batch).squeeze(1)

model = HousePriceNet(input_features=num_features)
loss_fn = nn.MSELoss()                               # Mean Squared Error for regression
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # Adam adapts step size per parameter

# Quick sanity check before training begins
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model)
print()

# --- Training loop ---
for epoch in range(1, 51):
    # Always set training mode at the top of the training loop body
    model.train()  # enables Dropout masking and BatchNorm batch statistics updates

    # Step 1: Zero stale gradients from the previous iteration
    # This is the line that the production incident was missing
    optimizer.zero_grad()

    # Step 2: Forward pass — model(x) not model.forward(x)
    # model() wires up forward hooks; model.forward() bypasses them
    train_predictions = model(train_features)

    # Step 3: Compute loss
    train_loss = loss_fn(train_predictions, train_targets)

    # Step 4: Backprop — autograd fills .grad on every parameter
    train_loss.backward()

    # Optional but recommended: log gradient norms before the update step
    # A norm above ~10.0 is worth investigating; above ~100.0 is a red flag
    if epoch % 10 == 0:
        total_norm = sum(
            p.grad.norm().item() ** 2 for p in model.parameters() if p.grad is not None
        ) ** 0.5

    # Step 5: Update every parameter using its stored gradient
    optimizer.step()

    # --- Validation pass every 10 epochs ---
    if epoch % 10 == 0:
        model.eval()  # disables Dropout, freezes BatchNorm running stats
        with torch.no_grad():  # no graph construction needed — saves memory
            val_predictions = model(val_features)
            val_loss = loss_fn(val_predictions, val_targets)
        print(f"Epoch {epoch:3d} | Train Loss: {train_loss.item():.4f} | "
              f"Val Loss: {val_loss.item():.4f} | Grad Norm: {total_norm:.4f}")

# --- Single inference example ---
# inference_mode is faster than no_grad for serving — disables version tracking too
model.eval()
with torch.inference_mode():
    new_house = torch.tensor([[1.2, 0.5, -0.8]])  # one sample, 3 normalised features
    price_pred = model(new_house)
    expected = new_house[0] @ true_weights
    print(f"\nPredicted: {price_pred.item():.4f} | Expected (approx): {expected.item():.4f}")
```
```
Model parameters: 209
HousePriceNet(
  (network): Sequential(
    (0): Linear(in_features=3, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
    (4): Linear(in_features=8, out_features=1, bias=True)
  )
)

Epoch  10 | Train Loss: 0.2431 | Val Loss: 0.3102 | Grad Norm: 0.3847
Epoch  20 | Train Loss: 0.1187 | Val Loss: 0.1834 | Grad Norm: 0.2214
Epoch  30 | Train Loss: 0.0743 | Val Loss: 0.1214 | Grad Norm: 0.1563
Epoch  40 | Train Loss: 0.0521 | Val Loss: 0.0987 | Grad Norm: 0.1102
Epoch  50 | Train Loss: 0.0389 | Val Loss: 0.0812 | Grad Norm: 0.0831

Predicted: 0.6821 | Expected (approx): 0.7200
```
model.train() and model.eval() flip a flag that changes layer behaviour — Dropout randomly drops neurons in train mode and passes all of them in eval mode; BatchNorm updates running statistics in train mode and uses them in eval mode. torch.no_grad() is a completely separate mechanism that tells the autograd engine to stop building the computation graph. You can call model.eval() with gradients still flowing (unusual but valid) or call model.train() inside a torch.no_grad() block (common in gradient accumulation setups). Forgetting model.eval() during validation is one of the most common bugs in PyTorch codebases — your validation loss will fluctuate unpredictably and you will spend time blaming your learning rate or data pipeline.

In short: model.train() and model.eval() control Dropout and BatchNorm behaviour — not gradient computation. torch.no_grad() controls gradient computation — not layer behaviour. You need both for a correct validation loop, and they must be called in the right order: model.eval() first, then enter the torch.no_grad() context. torch.compile() is compatible with both — but compile the model before calling .eval() or .train() to avoid recompilation on mode switches. And always call the model as a callable (model(x)), never model.forward(x) — the __call__ method is what wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on.

The patterns to memorise:

- Validation loop: model.eval() -> torch.no_grad() -> forward -> loss — no backward or step. Both calls are required; neither replaces the other.
- Serving: model.eval() -> torch.inference_mode() -> forward — the fastest path; disables both graph construction and version counter tracking.
- Gradient accumulation: call backward() every step, but call optimizer.step() + zero_grad() only every N steps. Divide the loss by N before backward() to keep gradient magnitudes consistent with a single large batch.

PyTorch vs TensorFlow 1.x

| Feature / Aspect | PyTorch (Dynamic Graph) | TensorFlow 1.x (Static Graph) |
|---|---|---|
| Graph construction | Built at runtime on every forward pass — debug with standard Python tools anywhere in the loop | Pre-compiled before any data flows through — the graph was fixed at definition time, making runtime inspection nearly impossible |
| Debugging | Standard Python debugger, print(), and pdb work anywhere in the forward pass with no special configuration | Required special tf.Print ops inserted into the graph; runtime errors produced stack traces that pointed to graph compilation, not the user code that caused them |
| Research flexibility | Architecture changes take effect immediately — swap a layer, change a loss function, add a branch mid-experiment with no recompilation step | Any architectural change required rebuilding and recompiling the graph, which could take seconds to minutes for large models |
| Production deployment | TorchScript or ONNX export required for optimised serving without a Python runtime; torch.compile() in 2.x closes most of the performance gap for GPU serving | SavedModel format was natively optimised for TF Serving; the static graph made deployment straightforward but locked you into the graph you compiled |
| Community adoption | Dominant in research — over 75% of ML papers published in 2024-2025 used PyTorch as the primary framework | Remains strong in enterprise production systems built before 2020; legacy TF1 codebases are still running in many large organisations |
| GPU memory control | Explicit .to(device) — you decide what moves and when; nothing migrates automatically | Automatic placement with manual overrides via tf.device() context managers; less control but fewer explicit device calls |
| Gradient control | requires_grad per tensor; torch.no_grad() and torch.inference_mode() context managers; fine-grained control at the tensor level | GradientTape context manager in TF 2.x — similar concept but opt-in rather than opt-out; in TF 1.x gradients were computed by tf.gradients() on the pre-compiled graph |
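The gradient-accumulation pattern mentioned above (step the optimizer only every N micro-batches, dividing the loss by N) can be sketched like this — the model, data, and N here are hypothetical:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)

accum_steps = 4                      # simulate a batch 4x larger than memory allows
data = torch.randn(16, 4)
targets = torch.randn(16, 1)

optimizer.zero_grad()
for micro_step, start in enumerate(range(0, 16, 4), start=1):
    pred = model(data[start:start + 4])               # micro-batch of 4 samples
    loss = nn.functional.mse_loss(pred, targets[start:start + 4])
    (loss / accum_steps).backward()                   # scale so grads match one big batch
    if micro_step % accum_steps == 0:
        optimizer.step()                              # one update per 4 micro-batches
        optimizer.zero_grad()
```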
🎯 Key Takeaways
- Tensors carry three critical properties beyond their values: dtype, device, and requires_grad — getting any one of these wrong silently breaks training in ways that trace to the wrong location in the stack.
- Autograd doesn't run continuously; it only records a computation graph when requires_grad=True tensors are involved, and only computes gradients when you explicitly call .backward() on a scalar loss. The graph is destroyed after each backward pass by default — retain_graph=True in a loop is almost always a memory leak.
- The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model — the order is load-bearing. Memorise it and you can read any codebase or paper's training code cold.
- model.train() and model.eval() control layer behaviour like Dropout and BatchNorm. torch.no_grad() controls gradient computation. These are separate mechanisms, and confusing them is the single most common source of subtle training bugs in production PyTorch code.
- Always call the model as a callable (model(x)), never model.forward(x) — the __call__ method wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on. This one habit prevents an entire class of silent tooling failures.
Frequently Asked Questions
What is the difference between PyTorch and NumPy?
NumPy arrays live only on the CPU and have no concept of gradients or automatic differentiation. PyTorch tensors can live on a GPU — which is what makes large matrix operations fast enough for deep learning in practice — and tensors with requires_grad=True automatically track every operation performed on them so that gradients can be computed via .backward(). For pure numerical computing with no learning involved, NumPy is lighter and more widely supported in the scientific Python ecosystem. The moment you need a model to learn from data, PyTorch is the right tool. Many teams also mix both: NumPy for data preprocessing and analysis, PyTorch for the model itself.
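The zero-copy bridge between the two is worth knowing when mixing them; a small illustration:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

shared = torch.from_numpy(arr)   # zero-copy: tensor and array share memory
shared[0] = 99.0
print(arr[0])                    # 99.0, the NumPy array changed too

copied = torch.tensor(arr)       # torch.tensor() always copies
copied[1] = -1.0
print(arr[1])                    # 2.0, the original is untouched
```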
When should I use torch.no_grad()?
Any time you are running the model but not updating its weights — validation during training, evaluation on a test set, or production inference. Without it, PyTorch builds the full computation graph on every forward pass regardless of whether you call backward(), which wastes memory proportional to your model depth and batch size. For validation loops inside a training run, torch.no_grad() is the right choice. For production serving, use torch.inference_mode() instead — it is faster because it also disables version counter tracking, and tensors created inside it cannot accidentally be used in a backward pass.
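The effect on graph construction is visible on any tensor:

```python
import torch

x = torch.randn(4, requires_grad=True)

y = (x * 2).sum()
print(y.requires_grad)   # True: a graph was recorded and backward() would work

with torch.no_grad():
    z = (x * 2).sum()
print(z.requires_grad)   # False: no graph was built, no memory spent tracking it
```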
Why does my PyTorch model train fine on CPU but crash on GPU?
Almost always a device mismatch — your model is on the GPU but your input tensors are still on the CPU, or a specific tensor created inside your forward pass defaults to CPU while the model parameters are on CUDA. Every tensor involved in a single operation must be on the same device. The fix is calling input_tensor = input_tensor.to(device) in your data loading step, and ensuring device matches whatever you passed to model.to(device). If you are using a custom DataLoader collate function, that is a common place where tensors quietly stay on CPU without triggering an obvious error until the first forward pass.
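The fix in miniature, written so it falls back to CPU when no GPU is present and therefore runs anywhere:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(3, 1).to(device)

batch = torch.randn(2, 3)    # new tensors default to CPU
batch = batch.to(device)     # without this line, a CUDA run crashes mid-forward
out = model(batch)
print(out.device)            # matches the model's device
```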
What does model.train() actually do?
model.train() sets the model's internal training flag to True, which changes the runtime behaviour of specific layer types. Dropout layers start randomly masking neurons according to the configured drop probability. BatchNorm layers update their running mean and variance statistics using each batch's statistics rather than the accumulated running values. It does not enable gradient computation — that is controlled by requires_grad on individual tensors and the torch.no_grad() context manager, which are completely independent mechanisms. Calling model.train() is not optional before training loops; skipping it on a model with Dropout or BatchNorm produces optimistically low training loss and unreliable generalisation.
How do I save and load a PyTorch model for production?
Save only the learned weights using torch.save(model.state_dict(), 'model.pt'). Load them with model.load_state_dict(torch.load('model.pt', weights_only=True)) — the weights_only=True argument is important from a security standpoint as of PyTorch 2.x; it prevents arbitrary code execution from a malicious checkpoint file. Always call model.eval() after loading for inference. For deployment in environments without a Python runtime, export to TorchScript with torch.jit.script(model) or to ONNX with torch.onnx.export(). Never pickle the entire model object — it binds the weights to your exact class definition, PyTorch version, and Python version, which makes it fragile across environments and time.
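Put together, the save/load round trip looks like this — the filename is illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
torch.save(model.state_dict(), "model.pt")          # weights only, no class pickle

restored = nn.Linear(3, 1)                          # same architecture, fresh instance
restored.load_state_dict(
    torch.load("model.pt", weights_only=True)       # blocks arbitrary code execution
)
restored.eval()                                     # switch to inference behaviour
print(torch.equal(model.weight, restored.weight))   # True
```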
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.