Intermediate 12 min · March 06, 2026

PyTorch Basics

PyTorch Gradient Accumulation — 200 Epoch Silent Failure

Q: What is the difference between `torch.Tensor` and `torch.nn.Parameter`?

`torch.Tensor` is a multi-dimensional array with automatic differentiation support. `torch.nn.Parameter` is a subclass of Tensor that is automatically registered as a module parameter when assigned as an attribute of `nn.Module`. Use Parameter for learnable weights; use plain Tensor for constants or non-trainable buffers.
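A minimal sketch of the distinction, assuming a made-up module called `ScaledLinear` (the class and its attribute names are illustrative, not part of PyTorch). It shows which attributes end up in `parameters()`, which in `buffers()`, and which are ignored by both.

```python
import torch
import torch.nn as nn


class ScaledLinear(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # nn.Parameter: registered automatically and returned by parameters()
        self.weight = nn.Parameter(torch.randn(3))
        # Buffer: stored in state_dict and moved by .to(device), but never trained
        self.register_buffer("scale", torch.tensor(0.5))
        # Plain tensor attribute: not registered anywhere
        self.offset = torch.zeros(3)

    def forward(self, x):
        return (x * self.weight).sum(dim=-1) * self.scale


module = ScaledLinear()
param_names = [name for name, _ in module.named_parameters()]
buffer_names = [name for name, _ in module.named_buffers()]
state_keys = sorted(module.state_dict().keys())

print(param_names)   # ['weight']
print(buffer_names)  # ['scale']
print(state_keys)    # ['scale', 'weight']
```

Only `weight` would be handed to an optimizer; `offset` would not even follow the module to another device.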

Q: Why does my model's loss not decrease after switching to GPU?

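Common causes: (1) Forgetting to call `.to(device)` on both the model and the input tensors. (2) Round-tripping values through NumPy (for example `.cpu().numpy()`) inside the training loop, which detaches them from the autograd graph so no gradient reaches the parameters. Setting `torch.backends.cudnn.deterministic = True` is worth doing when you need reproducible runs to compare CPU and GPU behaviour, but it does not by itself change whether the loss decreases. Always check that all tensors are on the same device with `print(tensor.device)`.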

Q: How do I handle variable-length sequences in a batch?

Use `torch.nn.utils.rnn.pad_sequence` to pad sequences to the same length, then `pack_padded_sequence` before feeding into an RNN. After the RNN, use `pad_packed_sequence` to revert. This avoids computation on padding tokens. For transformers, use attention masks instead.
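The pad, pack, unpack round trip can be sketched as follows. The sequence lengths, feature size, and the choice of a GRU are arbitrary assumptions made for the example.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three sequences of different lengths; each time step has 4 features
sequences = [torch.randn(5, 4), torch.randn(3, 4), torch.randn(2, 4)]
lengths = torch.tensor([len(seq) for seq in sequences])

# Pad to a common length: shape (batch=3, max_len=5, features=4)
padded = pad_sequence(sequences, batch_first=True)

# Pack so the RNN skips the padded time steps
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

rnn = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
packed_out, hidden = rnn(packed)

# Unpack back to a padded tensor; padded positions are filled with zeros
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(tuple(output.shape))   # (3, 5, 8)
print(out_lengths.tolist())  # [5, 3, 2]
```

With `enforce_sorted=False` the batch does not need to be sorted by length beforehand, and the outputs come back in the original order.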

Q: What is the purpose of `torch.no_grad()` and when should I use it?

`torch.no_grad()` disables gradient computation, reducing memory usage and speeding up operations. Use it during inference, validation, and when computing metrics. Never wrap training steps with it—you'll break gradient flow. Also use it when modifying tensors in-place that require gradients.

Q: How do I debug a vanishing gradient in PyTorch?

Monitor gradient norms by registering hooks on layer parameters: `param.register_hook(lambda grad: print(grad.norm()))`. Common fixes: use ReLU instead of sigmoid/tanh, apply batch normalization, use residual connections, or switch to optimizers like Adam that adapt learning rates per parameter.

Q: What's the best way to save and load a model for inference?

Save only the state_dict: `torch.save(model.state_dict(), 'model.pth')`. Load with `model.load_state_dict(torch.load('model.pth'))`. Avoid saving the entire model object—it's fragile across PyTorch versions. For deployment, trace the model with `torch.jit.trace` or export to ONNX.
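A small round trip, sketched with a throwaway `nn.Linear` and a temporary directory. The `map_location` and `weights_only=True` arguments are additions not mentioned in the answer above; they are assumed to be available, which holds for recent PyTorch releases.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pth")

    # Save only the parameters and buffers, not the Python object
    torch.save(model.state_dict(), path)

    # Rebuild the architecture in code, then load the weights into it
    restored = nn.Linear(3, 1)
    state = torch.load(path, map_location="cpu", weights_only=True)
    restored.load_state_dict(state)

restored.eval()
x = torch.randn(2, 3)
with torch.inference_mode():
    same_output = torch.equal(model(x), restored(x))

print(same_output)  # True
```

Because only tensors are stored, the file stays loadable as long as the architecture code still produces matching parameter names and shapes.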

Missing optimizer.zero_grad() caused 200x gradient accumulation over 200 epochs, corrupting weights silently.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.


July 19, 2026

last updated


Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts


⚡Quick Answer

PyTorch tensors are multi-dimensional arrays that live on CPU or GPU and optionally track gradients for backpropagation
requires_grad=True opts a tensor into the autograd engine — only set it on learnable parameters, never on input data
The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model
model.train() and model.eval() control layer behaviour (Dropout, BatchNorm) — they do NOT control gradient computation
Forgetting optimizer.zero_grad() causes gradient accumulation, which silently corrupts training
Always use torch.inference_mode() or torch.no_grad() during validation and serving — not optional in production

Definition

What is PyTorch Basics?

Gradient accumulation is a technique that simulates larger batch sizes by accumulating gradients over multiple forward-backward passes before performing an optimizer step. It exists because GPU memory is finite — a batch of 1024 images might not fit in VRAM, so you process 4 batches of 256, sum their gradients, then update weights once.


The silent failure in the title refers to a common pitfall: forgetting to normalize gradients by the number of accumulation steps, or failing to call optimizer.zero_grad() at the correct intervals, which leads to incorrect loss scaling and models that train for 200 epochs without converging. Real-world usage: training ResNet-50 on ImageNet with batch size 8192 on 8x A100s (80GB each) still requires gradient accumulation if you want effective batch sizes beyond 32k.

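When not to use it: if your effective batch size grows beyond what your learning rate schedule was tuned for (for many vision models, roughly above 4096), generalization tends to degrade. Mixed precision (torch.amp, formerly torch.cuda.amp) and gradient scaling also interact non-trivially with accumulation: call scaler.scale(loss).backward() on every micro-batch, and call scaler.unscale_(optimizer) at most once per optimizer step, after the last micro-batch and before any gradient clipping. Unscaling part-way through accumulation mixes scaled and unscaled gradients and corrupts the update.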

Plain-English First

Imagine you're teaching a child to recognise cats by showing them thousands of pictures and correcting them every time they're wrong. PyTorch is the notebook, pencil, and eraser that lets a computer do exactly that — store the pictures as grids of numbers (tensors), measure how wrong each guess was (loss), and automatically figure out which knobs to tweak to do better next time (autograd). It doesn't decide what to learn; it gives you the tools to build the machine that learns.

PyTorch has become the dominant choice in academic research and is rapidly closing the gap in production systems. Understanding its foundations means you can read any ML paper's code, contribute to AI projects, and stop copy-pasting model architectures you don't understand.

The core problem PyTorch solves is bridging the gap between 'I have an idea for a model' and 'I have a working, trained model.' Frameworks like raw NumPy can store data, but they can't automatically track how a change in one number ripples through a thousand operations to affect a final error score. PyTorch does this invisibly with its autograd engine — and as of 2026, that engine underpins everything from two-layer regression models to the transformer architectures powering production LLMs.

The most common production failure I see is that developers understand the happy path but not the failure modes: training loops that silently accumulate gradients, validation code that forgets model.eval(), and inference that wastes GPU memory by not disabling autograd. This guide covers both the concepts and the production gotchas, because shipping a model that actually works in production is a different skill from getting a notebook to converge.

Why Gradient Accumulation Is Not a Free Lunch

Gradient accumulation is a technique that simulates larger batch sizes by summing gradients over multiple forward-backward passes before performing a single optimizer step. Instead of updating weights after every batch, you accumulate gradients across N micro-batches, then step. This lets you train with effective batch sizes that exceed GPU memory limits — a 4GB card can simulate a 256-sample batch by accumulating 32 micro-batches of 8 samples each.

In practice, gradient accumulation changes the training dynamics in subtle ways. Each micro-batch computes gradients independently, but the optimizer sees only the accumulated sum. This means batch normalization statistics are computed per micro-batch, not per effective batch — a common source of silent degradation. Also, gradient clipping must be applied to the accumulated gradient, not per micro-batch, or you'll distort the gradient scale. The effective learning rate should remain tied to the effective batch size, not the micro-batch size, or convergence suffers.
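The scaling point can be checked numerically. The sketch below assumes a tiny `nn.Linear` model, a mean-reduced MSE loss, and equally sized micro-batches; it compares the gradient from one full batch with the gradient accumulated over four micro-batches, with and without dividing the loss by the number of steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()  # mean reduction
inputs, targets = torch.randn(32, 4), torch.randn(32, 1)
accum_steps = 4

# Reference: a single backward pass over the full batch of 32 samples
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation done correctly: each micro-batch loss is divided by accum_steps
model.zero_grad()
for micro_x, micro_y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    loss = loss_fn(model(micro_x), micro_y) / accum_steps
    loss.backward()  # adds into .grad
scaled_grad = model.weight.grad.clone()

# Accumulation without the division: gradients are accum_steps times too large
model.zero_grad()
for micro_x, micro_y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    loss_fn(model(micro_x), micro_y).backward()
unscaled_grad = model.weight.grad.clone()

matches_full_batch = torch.allclose(scaled_grad, full_batch_grad, atol=1e-6)
is_four_times_larger = torch.allclose(unscaled_grad, accum_steps * full_batch_grad, atol=1e-5)
print(matches_full_batch, is_four_times_larger)  # True True
```

The unscaled variant behaves like a learning rate that is silently multiplied by the number of accumulation steps, which is easy to miss because nothing crashes.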

Use gradient accumulation when GPU memory cannot hold the desired batch size, typically with large models (transformers) or high-resolution inputs. It is not a substitute for proper batch normalization handling. SyncBatchNorm synchronizes statistics across GPUs, not across sequential micro-batches, so it does not help here; either freeze the BN statistics or switch to a normalization layer that does not depend on the batch, such as GroupNorm or LayerNorm. In production, this can surface as a late silent failure: the model trains fine for 150 epochs, then plateaus or diverges because the BN statistics were estimated from small micro-batches rather than from the effective batch.
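To make the batch-statistics issue concrete, here is a hedged sketch that feeds the same 16 samples through a normalization layer either as one batch or as four micro-batches. The tensor shapes, the helper name `full_vs_micro`, and the GroupNorm configuration are assumptions chosen for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = torch.randn(16, 8, 4, 4)   # (N, C, H, W)
micro_batches = batch.chunk(4)      # four micro-batches of 4 samples


def full_vs_micro(norm_layer):
    """Largest output difference between full-batch and micro-batch processing."""
    norm_layer.train()
    with torch.no_grad():
        full = norm_layer(batch)
        micro = torch.cat([norm_layer(mb) for mb in micro_batches])
    return (full - micro).abs().max().item()


# BatchNorm normalises with statistics of whatever batch it sees
bn_gap = full_vs_micro(nn.BatchNorm2d(8))
# GroupNorm normalises each sample on its own, so batch size is irrelevant
gn_gap = full_vs_micro(nn.GroupNorm(num_groups=2, num_channels=8))

print(f"BatchNorm gap: {bn_gap:.4f}")
print(f"GroupNorm gap: {gn_gap:.8f}")
```

The BatchNorm gap is clearly non-zero while the GroupNorm gap is zero up to floating-point rounding, which is why batch-independent normalization is the simpler choice when micro-batches are small.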

⚠ Batch Normalization Breaks

Gradient accumulation with batch norm computes running stats per micro-batch, not per effective batch — this silently corrupts inference statistics after many epochs.

📊 Production Insight

Production scenario: training a ResNet-50 on 8GB GPUs with gradient accumulation of 4 micro-batches (effective batch 256).

Symptom: validation accuracy plateaus at epoch 150, then drops 2% — BN running mean/var diverge from true distribution.

Rule of thumb: if using BN, either freeze BN during accumulation or switch to group/layer norm; never accumulate more than 2 micro-batches without verifying BN stats.

🎯 Key Takeaway

Gradient accumulation simulates batch size, not batch statistics — BN, dropout, and loss scaling behave per micro-batch.

Effective learning rate must scale with effective batch size, not micro-batch size — use linear scaling rule.

Always validate convergence on a small run before committing to 200+ epochs; silent failures appear late.

thecodeforge.io

Pytorch Basics

Tensors: The DNA of Every PyTorch Model

A tensor is PyTorch's fundamental data container. Think of it as a NumPy array that can live on a GPU and, when asked, record the operations that produced it. A 1D tensor is a list of numbers (a vector), a 2D tensor is a table (a matrix), and a 3D tensor might be a single colour image whose dimensions are channel, height, and width; a batch of such images adds a fourth, leading dimension.

What makes tensors special isn't the shape — it's the metadata they carry. Every tensor knows its data type (dtype), its device (CPU or CUDA GPU), and optionally whether it should track gradients. That last flag is what separates a plain number-holder from a value that participates in learning.

You'll reach for torch.tensor() when you're converting existing Python data, torch.zeros() or torch.ones() when initialising buffers, and torch.randn() for random initialisation with a standard normal distribution. The device placement decision — CPU vs GPU — happens at creation time, and moving data between devices is explicit, never automatic. That explicitness is a feature, not an oversight; it forces you to reason about where computation actually happens, which is the difference between a model that fits in GPU memory and one that crashes at batch two.

As of PyTorch 2.x, torch.compile() can fuse tensor operations into optimised kernels automatically — but only if your tensors are on the right device and dtype from the start. Sloppy tensor hygiene becomes measurably more expensive in 2026 than it was when compilation wasn't part of the picture.

The dtype mismatch is the most common silent failure. torch.tensor() turns Python integers into int64 and Python floats into float32 (the default dtype), while NumPy arrays arrive as float64 unless you convert them. Element-wise arithmetic quietly promotes mixed dtypes, but matrix multiplications and nn layers require matching dtypes and throw a RuntimeError at operation time, not at creation time, so the error surfaces somewhere unexpected. Write numeric literals as floats (1.0, not 1) or specify dtype explicitly at creation.
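A short sketch of where dtypes come from and where the mismatch actually fails. The float64 tensor is created directly here as a stand-in for data that arrived as float64 (for example from a NumPy array); the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

from_floats = torch.tensor([1.0, 2.0])   # Python floats -> float32
from_ints = torch.tensor([1, 2])         # Python ints   -> int64

# Stand-in for float64 data, e.g. an array loaded through NumPy
double_input = torch.tensor([[1.0, 2.0]], dtype=torch.float64)

# Element-wise arithmetic promotes silently instead of failing
promoted = from_floats + double_input[0]

# A layer with float32 parameters rejects float64 input at call time
layer = nn.Linear(2, 1)
try:
    layer(double_input)
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = err

# Casting at the model boundary resolves it
fixed_output = layer(double_input.to(torch.float32))

print(from_floats.dtype, from_ints.dtype, promoted.dtype)
print("Mismatch raised:", mismatch_error is not None)
```

An explicit `assert x.dtype == torch.float32` at the top of `forward()` would move the failure to the point where the bad tensor enters the model.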

io.thecodeforge.ml.tensor_fundamentals.py (Python)

import torch

# --- Creating tensors from real data ---
# Simulating a tiny dataset: 4 house sizes (sq ft) and their prices ($k)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])  # 1D tensor, shape (4,)
house_prices = torch.tensor([150.0, 220.0, 310.0, 410.0])    # 1D tensor, shape (4,)

print("Sizes tensor:", house_sizes)
print("Shape:", house_sizes.shape)        # torch.Size([4])
print("Data type:", house_sizes.dtype)    # torch.float32 — default for floats

# --- 2D tensor: batch of data (rows = samples, cols = features) ---
feature_matrix = torch.tensor([
    [750.0,  3.0, 1.0],   # size, bedrooms, bathrooms
    [1200.0, 4.0, 2.0],
    [1800.0, 4.0, 3.0],
    [2400.0, 5.0, 3.0],
])
print("\nFeature matrix shape:", feature_matrix.shape)  # torch.Size([4, 3])

# --- Device awareness: check and move to GPU if available ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("\nUsing device:", device)

# Move tensor to the target device — always do this before operations
# In PyTorch 2.x, doing this at creation time avoids an extra host-to-device copy
feature_matrix = feature_matrix.to(device)
print("Feature matrix device:", feature_matrix.device)

# --- Useful tensor operations ---
# Normalise features: (x - mean) / std — critical for stable training
# dim=0 means we compute one mean per column (per feature), across all rows (samples)
means = feature_matrix.mean(dim=0)
stds  = feature_matrix.std(dim=0)
normalised = (feature_matrix - means) / stds

print("\nNormalised features (first row):", normalised[0])

# --- requires_grad: opting a tensor INTO gradient tracking ---
# We do NOT set this on input data — only on learnable parameters
# Input data is fixed; we want gradients w.r.t. parameters, not the data itself
weight = torch.tensor([0.15], requires_grad=True)  # our model's single weight
bias   = torch.tensor([10.0], requires_grad=True)  # our model's bias term

print("\nWeight requires grad:", weight.requires_grad)       # True
print("House sizes requires grad:", house_sizes.requires_grad)  # False — data, not a parameter

# --- Checking tensor metadata in one place ---
# Useful diagnostic pattern during debugging
for name, t in [("weight", weight), ("bias", bias), ("sizes", house_sizes)]:
    print(f"{name:8s} | dtype: {t.dtype} | device: {t.device} | requires_grad: {t.requires_grad}")

Output

Sizes tensor: tensor([ 750., 1200., 1800., 2400.])

Shape: torch.Size([4])

Data type: torch.float32

Feature matrix shape: torch.Size([4, 3])

Using device: cpu

Feature matrix device: cpu

Normalised features (first row): tensor([-1.0967, -1.2247, -1.3056])

Weight requires grad: True

House sizes requires grad: False

weight | dtype: torch.float32 | device: cpu | requires_grad: True

bias | dtype: torch.float32 | device: cpu | requires_grad: True

sizes | dtype: torch.float32 | device: cpu | requires_grad: False

⚠ Watch Out: dtype Mismatches Crash at Operation Time, Not Creation Time

If you mix float64 (NumPy's default for floating-point arrays) with float32 (PyTorch's default for neural network parameters), the error won't appear when you create the tensor. It surfaces later, at the operation, with a message that rarely points back to where the mismatch was introduced. Write literals as torch.tensor([1.0, 2.0]): the decimal point makes the tensor float32 instead of int64. Or be explicit: torch.tensor([1, 2], dtype=torch.float32). In production code, I always add a dtype assertion at the model boundary so the failure is immediate and obvious.

📊 Production Insight

dtype mismatches between float64 and float32 throw RuntimeError at operation time, typically inside a matrix multiplication or nn layer; the stack trace points to the operation, not to where the wrong dtype was introduced.

Device mismatches (CPU tensor passed to a GPU model) crash with a clear error but are still the most common debugging session in any new PyTorch project.

In PyTorch 2.x with torch.compile(), dtype and device inconsistencies also prevent kernel fusion, silently costing you throughput on top of correctness.

Rule: set device and dtype at tensor creation, assert at model input boundaries, and never rely on implicit casting.

🎯 Key Takeaway

Tensors carry three critical properties beyond their values: dtype, device, and requires_grad. Getting any one of these wrong silently breaks training in ways that are hard to trace — dtype mismatches crash at the wrong line, device errors surface mid-forward-pass, and missing requires_grad means parameters never update. Set requires_grad only on learnable parameters, never on input data, and treat dtype and device as first-class properties you set intentionally at creation.

Tensor Creation Decision

If: Converting existing Python data (lists, NumPy arrays)
Use: torch.tensor(). It copies the data and infers the dtype (float32 for Python floats; NumPy arrays keep their own dtype, usually float64). For large arrays, torch.from_numpy() avoids the copy.

If: Initialising model weights
Use: torch.randn() * init_scale or nn.init.kaiming_normal_. Never initialise all weights to zero; every neuron in a layer would compute identical gradients, so the neurons would never learn different features.

If: Need a tensor on GPU from the start
Use: torch.randn(..., device='cuda') at creation. This avoids the extra host-to-device copy that torch.randn(...).to('cuda') would incur.

If: Input data for a model
Use: Do NOT set requires_grad=True. Only learnable parameters need gradient tracking; setting it on inputs wastes memory and can silently include input tensors in the backward graph.

Autograd: How PyTorch Learns Without You Doing Calculus

Autograd is the reason PyTorch feels almost magical the first time it clicks. Every time you perform an operation on a tensor that has requires_grad=True, PyTorch silently builds a computation graph — a record of every step taken to produce the final result. When you call .backward() on a scalar output (almost always a loss value), PyTorch traverses that graph in reverse and computes the gradient of that output with respect to every participating tensor.

In plain English: you define the forward pass (what your model predicts), compute how wrong it was (the loss), call .backward(), and PyTorch fills in .grad on every learnable parameter — telling you 'if you nudge this value slightly, here's how much the loss would change.' You then use that information to nudge every parameter in the right direction. That nudge, applied repeatedly, is gradient descent.

Three rules to memorise before shipping anything: (1) .backward() can only be called on a scalar tensor. If your loss is a multi-element tensor, call .mean() or .sum() first or pass a gradient argument. (2) Gradients accumulate by default — every call to .backward() adds to existing .grad values rather than replacing them. Call optimizer.zero_grad() before each backward pass or gradients will pile up across batches and corrupt training in exactly the way the production incident above describes. (3) During inference, wrap code in torch.no_grad() or torch.inference_mode() to skip graph construction entirely — it is faster, uses less memory, and removes an entire class of production bugs.
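Rule (2) is easy to see on a single parameter. In this sketch the true gradient is always 2.0; the only difference between the two loops is whether `.grad` is cleared before each backward pass. Setting `.grad = None` is used here as a stand-in for what `optimizer.zero_grad()` does by default in recent PyTorch releases.

```python
import torch

weight = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])

# Without clearing: every backward() adds to the previous .grad
grads_without_reset = []
for _ in range(3):
    loss = (weight * x).sum()   # d(loss)/d(weight) is always 2.0
    loss.backward()
    grads_without_reset.append(weight.grad.item())

# With clearing before each backward pass
grads_with_reset = []
for _ in range(3):
    weight.grad = None          # stand-in for optimizer.zero_grad()
    loss = (weight * x).sum()
    loss.backward()
    grads_with_reset.append(weight.grad.item())

print(grads_without_reset)  # [2.0, 4.0, 6.0]
print(grads_with_reset)     # [2.0, 2.0, 2.0]
```

After 200 uncleared steps the stored gradient would be 200 times the true value, with no error or warning along the way.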

The graph is destroyed after .backward() completes by default. This is intentional memory management: the graph for one forward pass can consume hundreds of megabytes on a deep network. Without destruction, GPU memory would grow linearly with training steps. This is also why you cannot call .backward() twice on the same graph without retain_graph=True — and retain_graph=True in a training loop is almost always a bug, not a feature.
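The graph lifetime can be observed directly. This sketch uses a squared parameter because that operation saves a tensor for its backward pass; the variable names are illustrative.

```python
import torch

w = torch.tensor([3.0], requires_grad=True)
loss = (w ** 2).sum()

loss.backward()                 # first call works and frees the saved tensors
first_grad = w.grad.item()      # d(w^2)/dw at w=3 is 6.0

try:
    loss.backward()             # second call on the same graph
    second_error = None
except RuntimeError as err:
    second_error = err

# The usual fix is a fresh forward pass, not retain_graph=True
w.grad = None
fresh_loss = (w ** 2).sum()
fresh_loss.backward()
fresh_grad = w.grad.item()

print(first_grad, fresh_grad)                              # 6.0 6.0
print("Second backward raised:", second_error is not None)  # True
```

If a training loop only works with `retain_graph=True`, it is usually reusing a tensor from a previous iteration's graph, which is worth tracking down rather than suppressing.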

One nuance worth knowing as of PyTorch 2.x: torch.compile() can aggressively optimise the forward and backward passes together, but it relies on the traced graph being consistent across calls. If your forward pass has Python-level control flow that depends on tensor values (not just tensor shapes), expect graph breaks or recompilations; one option is to exclude the affected function from compilation with torch.compiler.disable().

io.thecodeforge.ml.autograd_linear_regression.py (Python)

import torch

# Seed for reproducibility — always set this in experiments
# Without it, two runs with identical code produce different results and debugging becomes a nightmare
torch.manual_seed(42)

# --- Toy linear regression: predict house price from size ---
# Ground truth relationship: price ≈ 0.18 * size + 5  (the model must discover this)
house_sizes  = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])
house_prices = torch.tensor([140.0, 221.0, 329.0, 437.0])

# Learnable parameters — these are the knobs autograd will compute gradients for
weight = torch.tensor([0.01], requires_grad=True)  # terrible initial guess, intentionally
bias   = torch.tensor([0.01], requires_grad=True)

# Learning rate is tiny because our raw inputs are in the hundreds-to-thousands range
# Without normalisation, you need a proportionally smaller step to avoid overshooting
learning_rate = 1e-7

for epoch in range(6):
    # FORWARD PASS: compute predictions using current weight and bias
    # Broadcasting applies weight and bias across all 4 house sizes simultaneously
    predicted_prices = weight * house_sizes + bias

    # COMPUTE LOSS: Mean Squared Error — average squared error across all predictions
    loss = ((predicted_prices - house_prices) ** 2).mean()

    # ZERO GRADIENTS: must do this before backward()
    # .grad accumulates by default — if we skip this, epoch 2 adds to epoch 1's gradients
    if weight.grad is not None:
        weight.grad.zero_()
        bias.grad.zero_()
    # (In real code you'd use optimizer.zero_grad() instead of this manual approach)

    # BACKWARD PASS: autograd traverses the graph and fills .grad on weight and bias
    # This computes d(loss)/d(weight) and d(loss)/d(bias) via the Chain Rule
    loss.backward()

    # PARAMETER UPDATE: move weight and bias in the direction that reduces loss
    # torch.no_grad() here because we don't want this update operation itself tracked
    with torch.no_grad():
        weight -= learning_rate * weight.grad
        bias   -= learning_rate * bias.grad

    print(f"Epoch {epoch+1:2d} | Loss: {loss.item():.2f} | "
          f"weight: {weight.item():.5f} | bias: {bias.item():.5f} | "
          f"grad_w: {weight.grad.item():.4f}")

print("\nFinal model: price =", round(weight.item(), 4), "* size +", round(bias.item(), 4))
print("Target model:  price = 0.18 * size + 5")
print("Note: bias is far from 5.0 — this is expected with unnormalised features and only 6 epochs")

# Inference — graph construction is wasted work here; inference_mode is faster than no_grad
with torch.inference_mode():
    test_size = torch.tensor([2000.0])
    predicted = weight * test_size + bias
    print(f"\nPredicted price for 2000 sq ft: ${predicted.item():.1f}k")

Output

Epoch 1 | Loss: 78017.80 | weight: 0.04819 | bias: 0.01003 | grad_w: -381924.2500

Epoch 2 | Loss: 65099.14 | weight: 0.08349 | bias: 0.01006 | grad_w: -353016.2500

Epoch 3 | Loss: 54310.95 | weight: 0.11626 | bias: 0.01008 | grad_w: -327650.5000

Epoch 4 | Loss: 45330.67 | weight: 0.14673 | bias: 0.01011 | grad_w: -304708.0000

Epoch 5 | Loss: 37842.37 | weight: 0.17510 | bias: 0.01013 | grad_w: -283840.0000

Epoch 6 | Loss: 31569.55 | weight: 0.20153 | bias: 0.01015 | grad_w: -264344.7500

Final model: price = 0.2015 * size + 0.0102

Target model: price = 0.18 * size + 5

Note: bias is far from 5.0 — this is expected with unnormalised features and only 6 epochs

Predicted price for 2000 sq ft: $403.1k

Mental Model

How Autograd Actually Thinks About Your Computation

Autograd treats every tensor operation as a node in a directed acyclic graph and records exactly how to reverse it — it's the Chain Rule implemented as a graph traversal.

Forward pass: execute operations and record the graph — each operation node stores its own gradient function (grad_fn)
Backward pass: traverse the graph in reverse from the loss node, applying the Chain Rule at each node to accumulate gradients
The graph is rebuilt fresh on every forward pass — it captures the exact computation that just ran, including any Python-level branching
requires_grad=True marks a tensor as a leaf node whose .grad we want filled in after backward()
The gradient of a scalar loss with respect to all parameters is computed in a single .backward() call — you do not loop over parameters manually

📊 Production Insight

The dynamic graph means gradients are always correct for the computation that actually ran — not for a pre-compiled approximation of it.

This is why PyTorch won research: you can change architecture mid-experiment without recompiling anything.

In production with torch.compile(), the dynamic graph gets partially compiled for performance while retaining correctness for control-flow branches.

Rule: if your model has conditional logic that changes which operations run based on input values, PyTorch's dynamic graph handles this correctly where static-graph frameworks historically required workarounds.

🎯 Key Takeaway

Autograd automates the Chain Rule by recording operations in a dynamic computation graph that is rebuilt on every forward pass. The graph is destroyed after backward() by default — retain_graph=True in a training loop is almost always a memory leak waiting to happen. In production, use an optimizer rather than manual weight updates, and always wrap inference in torch.inference_mode() — it disables both gradient computation and version tracking, making it measurably faster than torch.no_grad() for serving workloads.

When to Use Autograd vs Manual Updates

If: Standard neural network training
Use: An optimizer (SGD, Adam, AdamW). It provides zero_grad() and handles the update rule, momentum, and weight decay. Manual updates are for learning concepts, not shipping code.

If: Need custom gradient logic that PyTorch can't express
Use: torch.autograd.Function to define a custom forward and backward pass. This is useful for custom CUDA kernels or numerically stable loss functions.

If: Inference only, with no weight updates
Use: torch.inference_mode() for production serving. Use torch.no_grad() during validation inside training loops where you may still need tensor version tracking.

If: Debugging suspicious gradients
Use: torch.autograd.gradcheck() to numerically verify computed gradients against finite differences. It is invaluable when implementing custom backward passes.
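The last two rows can be combined in one small sketch: a custom `torch.autograd.Function` whose hand-written backward pass is verified with `gradcheck`. The class name `Exp` and the tolerance values are assumptions for illustration; `gradcheck` expects double-precision inputs.

```python
import torch


class Exp(torch.autograd.Function):  # illustrative custom function
    @staticmethod
    def forward(ctx, x):
        out = torch.exp(x)
        # Save the output: the derivative of exp(x) is exp(x) itself
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        (out,) = ctx.saved_tensors
        # Chain rule: upstream gradient times the local derivative
        return grad_output * out


torch.manual_seed(0)
x = torch.randn(5, dtype=torch.float64, requires_grad=True)

# Compares the analytical backward() against finite differences
gradcheck_passed = torch.autograd.gradcheck(Exp.apply, (x,), eps=1e-6, atol=1e-4)
print("gradcheck passed:", gradcheck_passed)  # True
```

If the backward pass were wrong, for example returning `grad_output` alone, `gradcheck` would raise an error describing the mismatch between numerical and analytical gradients.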


Building a Real Training Loop with nn.Module

Writing raw tensor operations gets unwieldy past a handful of layers. PyTorch's nn.Module is the standard abstraction for any model — from a one-layer linear regression to a 70-billion-parameter language model. Every nn.Module subclass does two things: defines learnable parameters (or sub-modules that contain them) inside __init__, and defines the forward computation inside forward().

The beauty of nn.Module is composability. A large model is just nn.Module instances containing other nn.Module instances, arbitrarily deep. When you call model.parameters(), PyTorch recursively collects every learnable parameter in the entire tree — that flat iterator is exactly what you hand to the optimizer.

The training loop is the heartbeat of all ML work in PyTorch. It is always the same five steps: zero gradients, forward pass, compute loss, backward pass, optimizer step. That order is not arbitrary — skipping or reordering any step produces a specific and usually hard-to-diagnose failure. Internalise this sequence and you can read any paper's training code cold.

The validation loop is structurally almost identical but with two additions: model.eval() called before the loop, and torch.no_grad() wrapping the forward pass. These solve different problems. model.eval() changes layer behaviour — Dropout stops masking neurons, BatchNorm uses accumulated running statistics instead of batch statistics. torch.no_grad() stops graph construction entirely, saving memory and time. You need both; neither substitutes for the other.

The most common production bug I still see in 2026: calling model.forward(x) directly instead of model(x). It works identically in isolation, but it bypasses all registered forward hooks — hooks that profilers, debuggers, quantisation tools, and libraries like torchvision rely on. Always call the model as a callable. The __call__ method is what wires up the hook infrastructure; forward() is just the computation you define.
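The hook behaviour is quick to confirm. This sketch registers a forward hook on a throwaway `nn.Linear` and records every time it fires; the hook body and variable names are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
hook_calls = []

# The hook records the output shape each time it is triggered
handle = model.register_forward_hook(
    lambda module, inputs, output: hook_calls.append(tuple(output.shape))
)

x = torch.randn(2, 3)
model(x)           # goes through __call__, so the hook fires
model.forward(x)   # calls the computation directly, so the hook is skipped

handle.remove()
print(hook_calls)  # [(2, 1)]
```

Two forward computations ran, but only one was visible to the hook, which is exactly how a profiler or quantisation observer ends up with missing data.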

io.thecodeforge.ml.neural_network_training_loop.py (Python)

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# --- Dataset: synthetic house price prediction ---
# 100 samples, 3 features: normalised size, bedrooms, age
num_samples  = 100
num_features = 3

# Generate synthetic features and a linear target with realistic noise
raw_features = torch.randn(num_samples, num_features)
true_weights  = torch.tensor([0.5, 0.3, -0.2])  # size helps, age hurts price
target_prices = raw_features @ true_weights + 0.1 * torch.randn(num_samples)

# Train / validation split: 80 / 20
train_size = int(0.8 * num_samples)
train_features, val_features = raw_features[:train_size], raw_features[train_size:]
train_targets,  val_targets  = target_prices[:train_size], target_prices[train_size:]


# --- Model definition ---
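class HousePriceNet(nn.Module):
    def __init__(self):
        super().__init__()
        # NOTE: the original listing is cut off at this point. The layers below are
        # reconstructed from the printed model summary in the Output section;
        # the forward() body is an assumption, and the training loop is not shown.
        self.network = nn.Sequential(
            nn.Linear(num_features, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
        )

    def forward(self, x):
        # Drop the trailing dimension so predictions match the shape of the targets
        return self.network(x).squeeze(-1)


model = HousePriceNet()
print("Model parameters:", sum(p.numel() for p in model.parameters()))
print(model)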

Output

Model parameters: 209

HousePriceNet(

(network): Sequential(

(0): Linear(in_features=3, out_features=16, bias=True)

(1): ReLU()

(2): Linear(in_features=16, out_features=8, bias=True)

(3): ReLU()

(4): Linear(in_features=8, out_features=1, bias=True)

)

Epoch 10 | Train Loss: 0.2431 | Val Loss: 0.3102 | Grad Norm: 0.3847

Epoch 20 | Train Loss: 0.1187 | Val Loss: 0.1834 | Grad Norm: 0.2214

Epoch 30 | Train Loss: 0.0743 | Val Loss: 0.1214 | Grad Norm: 0.1563

Epoch 40 | Train Loss: 0.0521 | Val Loss: 0.0987 | Grad Norm: 0.1102

Epoch 50 | Train Loss: 0.0389 | Val Loss: 0.0812 | Grad Norm: 0.0831

Predicted: 0.6821 | Expected (approx): 0.7200
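The five training steps and the validation pass described in this section can be sketched end to end as follows. This is a minimal full-batch version under stated assumptions: the Adam optimizer, the learning rate of 1e-2, the 100 epochs, and the use of `nn.Sequential` instead of a custom class are all choices made for the example, and its printed losses will not match the output shown above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# Synthetic data: 100 samples, 3 features, linear target with noise
features = torch.randn(100, 3)
targets = features @ torch.tensor([0.5, 0.3, -0.2]) + 0.1 * torch.randn(100)
train_x, val_x = features[:80], features[80:]
train_y, val_y = targets[:80], targets[80:]

model = nn.Sequential(
    nn.Linear(3, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, 1),
)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)  # assumed optimizer and learning rate

val_history = []
for epoch in range(100):
    # --- Training: the five steps, in order ---
    model.train()
    optimizer.zero_grad()                         # 1. clear old gradients
    predictions = model(train_x).squeeze(-1)      # 2. forward pass
    train_loss = loss_fn(predictions, train_y)    # 3. compute loss
    train_loss.backward()                         # 4. backward pass
    optimizer.step()                              # 5. update parameters

    # --- Validation: eval mode plus no graph construction ---
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val_x).squeeze(-1), val_y).item()
    val_history.append(val_loss)

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch + 1:3d} | train {train_loss.item():.4f} | val {val_loss:.4f}")

num_params = sum(p.numel() for p in model.parameters())
print("Parameters:", num_params)  # 209
```

The parameter count follows from the layer sizes: (3×16 + 16) + (16×8 + 8) + (8×1 + 1) = 209.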

🔥Interview Gold: model.train() vs model.eval() vs torch.no_grad()

These are three separate controls that solve different problems. model.train() and model.eval() flip a flag that changes layer behaviour: Dropout randomly drops neurons in train mode and passes all of them in eval mode; BatchNorm normalises with batch statistics and updates its running statistics in train mode, and normalises with the stored running statistics in eval mode. torch.no_grad() is a completely separate mechanism that tells the autograd engine to stop building the computation graph. You can call model.eval() with gradients still flowing (unusual but valid) or run a model in train mode inside a torch.no_grad() block (for example, to re-estimate BatchNorm running statistics without building a graph). Forgetting model.eval() during validation is one of the most common bugs in PyTorch codebases: your validation loss will fluctuate unpredictably and you will spend time blaming your learning rate or data pipeline.
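The independence of the two mechanisms shows up in a few lines. The sketch below assumes a toy model with one Dropout layer and checks, for each combination, whether the output still tracks gradients and whether the module is in training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

# eval mode WITHOUT no_grad: dropout is off, but the graph is still built
model.eval()
eval_out = model(x)
eval_tracks_grad = eval_out.requires_grad           # True
eval_is_repeatable = torch.equal(model(x), model(x))  # True, no random masking

# train mode INSIDE no_grad: dropout is on, but no graph is built
model.train()
with torch.no_grad():
    train_out = model(x)
train_tracks_grad = train_out.requires_grad          # False
dropout_is_active = model[1].training                # True

print(eval_tracks_grad, eval_is_repeatable)    # True True
print(train_tracks_grad, dropout_is_active)    # False True
```

Neither call implies the other, which is why a correct validation loop needs both `model.eval()` and a `torch.no_grad()` (or `torch.inference_mode()`) context.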

📊 Production Insight

model.train() and model.eval() control Dropout and BatchNorm behaviour, not gradient computation. torch.no_grad() controls gradient computation, not layer behaviour. You need both for a correct validation loop. The order does not affect correctness, but the conventional layout is model.eval() first, then entering the torch.no_grad() context.

In PyTorch 2.x, torch.compile() is compatible with both — but compile the model before calling .eval() or .train() to avoid recompilation on mode switches.

Rule: add print(model.training) as a one-time sanity check when setting up any new evaluation loop — it has saved me from at least three subtle bugs in production codebases.

🎯 Key Takeaway

The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of every PyTorch model — the order is load-bearing, not stylistic. model.train() and model.eval() control layer behaviour (Dropout, BatchNorm); torch.no_grad() controls graph construction. Always call the model as a callable (model(x)), never model.forward(x) — the __call__ method is what wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on.

Training Loop Step Selection

If: Standard training iteration
Use: zero_grad -> forward -> loss -> backward -> step. Never skip or reorder; each step depends on the previous one completing correctly.

If: Validation or evaluation pass
Use: model.eval() + torch.no_grad() -> forward -> loss, with no backward or step. Both calls are required; neither replaces the other.

If: Production inference / serving
Use: model.eval() + torch.inference_mode() -> forward. This is the fastest path; it disables both graph construction and version counter tracking.

If: Gradient accumulation for large effective batch sizes
Use: Call backward() every step, and call optimizer.step() + zero_grad() only every N steps. Divide the loss by N before backward() to keep gradient magnitudes consistent with a single large batch.

Data Loading with Dataset and DataLoader

You'll rarely keep all your training data in memory as a single tensor. Real-world datasets — images, text, logs — are large, expensive to load, and need to be shuffled, batched, and transformed on the fly. PyTorch's torch.utils.data.Dataset and DataLoader are the standard way to feed data into a training loop.

A Dataset subclass defines two things: __len__ (how many samples) and __getitem__ (how to load the i-th sample). That's it. The DataLoader then wraps the dataset and handles batching, shuffling, parallelism, and memory pinning. Writing a custom Dataset is the right approach for any data that doesn't fit in RAM — the Dataset tells PyTorch how to load each sample lazily, and the DataLoader manages the rest.
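As a minimal sketch of that contract, the hypothetical ToyRegressionDataset below keeps random tensors in memory; a real Dataset would read from disk inside __getitem__. The class name and sizes are assumptions for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyRegressionDataset(Dataset):
    """Hypothetical in-memory dataset used only to show the two required methods."""

    def __init__(self, num_samples=10, num_features=3):
        generator = torch.Generator().manual_seed(0)
        self.features = torch.randn(num_samples, num_features, generator=generator)
        self.targets = self.features.sum(dim=1, keepdim=True)

    def __len__(self):
        # How many samples exist.
        return len(self.features)

    def __getitem__(self, idx):
        # How to load the idx-th sample.
        return self.features[idx], self.targets[idx]

loader = DataLoader(ToyRegressionDataset(), batch_size=4, shuffle=True)
features, targets = next(iter(loader))
print(features.shape, targets.shape, features.device)
```

Pulling a single batch like this and printing its shape and device is also a cheap way to test the pipeline before wiring it into a training loop.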

Three things almost always go wrong in production data loading: (1) num_workers set too high, so you exhaust file handles and memory and the OS starts swapping; (2) batches that stay on CPU after a custom collate function while the model is on GPU; (3) a Dataset returning tensors of inconsistent shapes for variable-length data without proper padding. The error messages for these rarely point to the actual root cause.

For tabular data that fits in memory, a TensorDataset is perfectly fine. For images, torchvision's ImageFolder and Compose transforms handle most common pipelines. For text, Hugging Face datasets integrate cleanly with PyTorch's DataLoader.

Shuffling is essential for stochastic gradient descent — it prevents the model from learning the order of the data rather than the underlying distribution. Always set shuffle=True in your training DataLoader. For validation, shuffle=False is correct because you want the same deterministic ordering for comparison across epochs.

io.thecodeforge.ml.data_loading.pyPYTHON

import torch
from torch.utils.data import Dataset, DataLoader

# --- Custom Dataset for house price data from a CSV-like pattern ---
class HousePriceDataset(Dataset):
    def __init__(self

Output

Epoch 1: Val Loss = 0.0342

Epoch 2: Val Loss = 0.0321

Epoch 3: Val Loss = 0.0310

Epoch 4: Val Loss = 0.0302

Epoch 5: Val Loss = 0.0298

⚠ num_workers Pitfall: Too Many Workers Can Slow You Down

Increasing num_workers doesn't always increase throughput. Past a certain point (usually 4-8, depending on your CPU's core count and disk I/O), adding workers causes context switching overhead and memory pressure from duplicated data. If you see your CPU usage plateauing and GPU utilisation dropping, reduce num_workers. Rule: start with 4 workers, increase until GPU utilisation stops improving, then back off one.

📊 Production Insight

DataLoader with too many workers can exhaust system memory. On Linux, workers are forked and share the parent's memory copy-on-write, but Python reference counting writes to those pages, so each worker can gradually end up with its own copy of the dataset. In the worst case a 2GB dataset with 8 workers uses around 16GB.

Custom collate functions that forget to move tensors to the device are the single most common data loading bug in production — the error surfaces in the forward pass, not at batch creation.

Rule: test your DataLoader in isolation with a single batch and print device of returned tensors before plugging it into training.

🎯 Key Takeaway

Dataset defines how to load one sample; DataLoader wraps it with batching, shuffling, parallelism, and pin_memory. Keep num_workers moderate (4–8), always move batches to the correct device inside the training loop, and test your data pipeline in isolation before adding model complexity.

Data Loading Strategy Selection

If: Small dataset that fits in RAM (e.g., typical CSV)
→ Use: TensorDataset or a simple in-memory Dataset. No lazy loading needed.

If: Large dataset on disk (images, text files)
→ Use: A custom Dataset with __getitem__ that loads and transforms one sample per call. Use num_workers for parallelism.

If: Data from cloud storage (S3, GCS)
→ Use: Consider WebDataset or a streaming Dataset that fetches samples in the background. Be careful with network latency: batch downloading often outperforms per-sample streaming.

If: Variable-length sequences (text, time series)
→ Use: A custom collate_fn that pads sequences to the same length within each batch, built on torch.nn.utils.rnn.pad_sequence.
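For the variable-length case, a padding collate function might look like the sketch below. The name pad_collate and the three toy samples are assumptions for illustration; pad_sequence is the real PyTorch utility.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    """Pad a list of (sequence, label) pairs to the longest sequence in the batch."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

# A plain list works as a map-style dataset because it has __len__ and __getitem__.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
]
loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
padded, lengths, labels = next(iter(loader))
print(padded)
print(lengths)
```

Returning the true lengths alongside the padded tensor lets downstream code build attention masks or pack the sequences for an RNN.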

Training on GPU and Mixed Precision

GPUs accelerate tensor operations by orders of magnitude compared to CPUs, but they have limited memory and come with gotchas that trip up even senior engineers. Training on GPU is not just 'call .cuda()' — it requires careful device management, understanding of CUDA memory, and leveraging mixed precision to fit larger models and batch sizes.

PyTorch makes GPU training explicit: you move the model with model.to(device) and move each batch with batch.to(device). If any tensor is left on CPU while the rest of the operation is on GPU, you get a RuntimeError. The fix is to enforce a convention: device as a variable at the start of your script, and .to(device) on every batch at the point of creation.

Mixed precision training with torch.cuda.amp (Automatic Mixed Precision) has shipped with PyTorch since version 1.6 and is now standard practice. autocast runs most operations in float16 while the model's weights stay in float32, which can cut activation memory by nearly half and give roughly 2x throughput on modern GPUs. It's enabled by just two additions: a GradScaler and an autocast context around the forward pass and loss computation. The scaler prevents small gradients from underflowing in float16. Recent PyTorch releases expose the same tools as torch.amp.autocast and torch.amp.GradScaler and emit deprecation warnings for the torch.cuda.amp spellings.

GPUs have limited memory: a high-end A100 has up to 80GB, but most production setups use 16–32GB cards. If you run out of memory, reduce the batch size, use gradient accumulation, or switch to mixed precision. The most common silent failure is loading the entire dataset onto the GPU by accident: calling .to(device) in the Dataset constructor instead of on each batch inside the training loop moves all the data to the GPU at once and causes an OOM before training starts.

As of PyTorch 2.x, torch.compile() with mode='reduce-overhead' or mode='max-autotune' can further optimise GPU kernel execution, but it requires a warm-up step and may increase compile time on the first batch. It's worth enabling for production serving, less for rapid experimentation.

io.thecodeforge.ml.gpu_mixed_precision.pyPYTHON

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# --- Device setup: use GPU if available ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

model = HousePriceNet(input_features=3).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Mixed precision components
scaler = GradScaler()  # scales loss to avoid underflow in float16

# Training loop with mixed precision
model.train()
for epoch in range(5):
    for batch_idx, (features, targets) in enumerate(train_loader):
        # Move batch to the same device as the model
        features, targets = features.to(device), targets.to(device)

        optimizer.zero_grad()

        # autocast context: operations inside use float16 where safe, float32 where needed
        with autocast():
            predictions = model(features)
            loss = loss_fn(predictions, targets)

        # Backprop through scaler
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # optimizer.step() but unscales gradients first
        scaler.update()

    print(f"Epoch {epoch+1} completed")

# Inference — no mixed precision needed, but still need to .to(device)
model.eval()
with torch.inference_mode():
    sample = torch.randn(1, 3).to(device)
    output = model(sample)
    print(f"Sample prediction: {output.item():.4f}")

Output

Using device: cuda

Epoch 1 completed

Epoch 2 completed

Epoch 3 completed

Epoch 4 completed

Epoch 5 completed

Sample prediction: 0.5821

💡GradScaler: Your Friend Against Underflow

float16 has a very limited dynamic range: the smallest positive normal number is about 6.1e-5, and subnormals only extend down to about 6e-8. Gradients smaller than that underflow to zero, killing learning. GradScaler multiplies the loss by a scale factor (65536 by default, adjusted dynamically during training); you call backward() on the scaled loss, and the scaler divides the resulting gradients back down before the optimizer step. This keeps gradients in the representable range. Always use a GradScaler when autocast runs in float16; the two are designed as a pair.
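The underflow itself is easy to reproduce on CPU. In this sketch the gradient magnitude 1e-8 is a made-up value, and the scale 65536 is GradScaler's default initial scale.

```python
import torch

tiny_grad = 1e-8                      # hypothetical gradient magnitude
scale = 65536.0                       # GradScaler's default initial scale (2**16)

as_fp16 = torch.tensor(tiny_grad, dtype=torch.float16)
scaled_fp16 = torch.tensor(tiny_grad * scale, dtype=torch.float16)
recovered = scaled_fp16.float() / scale   # unscale in float32, as the scaler does

print(torch.finfo(torch.float16).tiny)    # smallest positive normal float16
print(as_fp16.item())                     # the unscaled value underflows to zero
print(scaled_fp16.item() != 0.0)          # the scaled value is representable
print(recovered.item())
```

The unscaled value is lost entirely, while the scaled value survives the round trip with only float16 rounding error.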

📊 Production Insight

Mixed precision with autocast + GradScaler reduces memory usage by ~50% and speeds up training by up to 3x on Tensor Core GPUs (RTX 30xx, A100, H100).

Forgetting .to(device) on a single batch tensor causes a RuntimeError with a stack trace that points to the operation inside the forward pass — not to the missing .to() call. Debugging this in a 50-layer model is painful.

Rule: in production, always log the device of model parameters and the first batch tensor explicitly at training start. If they mismatch, fail fast with a clear message.
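One way to implement that fail-fast rule is a small helper run once before the first training step. The function name assert_same_device is hypothetical, and the sketch assumes the model has at least one parameter.

```python
import torch
import torch.nn as nn

def assert_same_device(model: nn.Module, batch: torch.Tensor) -> torch.device:
    """Raise a clear error if the model and a batch live on different devices."""
    model_device = next(model.parameters()).device
    if batch.device != model_device:
        raise RuntimeError(
            f"Device mismatch: model on {model_device}, batch on {batch.device}"
        )
    return model_device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(3, 1).to(device)
first_batch = torch.randn(4, 3).to(device)

checked_device = assert_same_device(model, first_batch)
print(f"Model and first batch are on: {checked_device}")
```

The error message names both devices, which is far easier to act on than a stack trace from deep inside the forward pass.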

🎯 Key Takeaway

GPU training requires explicit device placement: model.to(device) and batch.to(device) are mandatory. Mixed precision with torch.cuda.amp can roughly halve activation memory and typically gives a 2-3x throughput gain on Tensor Core GPUs, usually without measurable loss of accuracy. The GradScaler is not optional when using float16 autocast; it prevents gradient underflow. Always log device and first-batch location at startup to catch mismatches immediately.

Precision and Device Strategy

If: GPU available and batch size is a bottleneck
→ Use: Enable mixed precision: autocast + GradScaler. Start with batch size 32 and increase until OOM.

If: GPU available but large model doesn't fit even with mixed precision
→ Use: Gradient accumulation (aggregate gradients over several micro-batches before stepping) with a smaller per-step batch size, or model parallelism.

If: Inference on CPU (edge devices, CI/CD, no GPU)
→ Use: No mixed precision needed. Consider quantisation (torch.quantization) to reduce model size and increase inference speed on CPU.

If: Debugging a model that trains fine on CPU but crashes on GPU
→ Use: Check that every batch tensor has been moved with .to(device). Check the model's device. Check that no tensor has requires_grad set when it shouldn't. Add device assertions in the forward pass.

Checkpointing: The Difference Between a Mild Inconvenience and a Career-Ending Mistake

Nobody cares about your training loop when a spotty AWS instance reboots 47 hours in. They care about whether you picked up from epoch 14 or started over. Checkpointing isn't a nicety. It's your job security.

Real training runs cost real money. A single A100 hour burns ~$3. If you lose 40 hours of training because you only saved the final model, you just wasted $120 and a lot of patience. Senior engineers checkpoint obsessively because they've been burned.

The trick isn't just saving weights. It's saving optimizer state, RNG seeds, and the current epoch. That lets you resume identically — same learning rate schedule, same batch order, same everything. Anything less is a half-baked restore.

Build your checkpoint logic into the training loop from day one. Not after the first crash. You will crash. The question is whether you're ready.

TrainingCheckpointer.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
import os

def save_checkpoint(model, optimizer, scheduler, epoch, loss, filepath):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'loss': loss,
        'rng_state': torch.get_rng_state()  # resume reproducibility
    }, filepath)
    print(f"Checkpoint saved at epoch {epoch}")

def load_checkpoint(model, optimizer, scheduler, filepath):
    if not os.path.exists(filepath):
        print("No checkpoint found. Starting from scratch.")
        return 0, float('inf')
    checkpoint = torch.load(filepath, weights_only=True)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    torch.set_rng_state(checkpoint['rng_state'])
    return checkpoint['epoch'], checkpoint['loss']

# Usage in training loop
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
start_epoch, best_loss = load_checkpoint(model, optimizer, scheduler, 'experiment_v3.pt')
print(f"Resuming from epoch {start_epoch}, best loss so far: {best_loss:.4f}")

Output

No checkpoint found. Starting from scratch.

Resuming from epoch 0, best loss so far: inf

⚠ Production Trap:

Don't just save every epoch. Implement a rolling window — keep the last 3-5 checkpoints plus the best one by validation loss. Disk is cheap; debugging a failed half-training run is not.
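A rolling window can be sketched in a few lines. The helper name save_rolling, the file naming scheme, and the validation losses below are illustrative assumptions, not part of the checkpointer above.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

def save_rolling(state, ckpt_dir, epoch, val_loss, best_loss, keep_last=3):
    """Save this epoch, refresh best.pt on improvement, prune older epoch files."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    torch.save(state, ckpt_dir / f"epoch_{epoch:04d}.pt")
    if val_loss < best_loss:
        torch.save(state, ckpt_dir / "best.pt")
        best_loss = val_loss
    # Zero-padded names sort in epoch order, so the oldest files come first.
    epoch_files = sorted(ckpt_dir.glob("epoch_*.pt"))
    for stale in epoch_files[:-keep_last]:
        stale.unlink()
    return best_loss

model = nn.Linear(10, 2)
ckpt_dir = Path(tempfile.mkdtemp())
best = float("inf")
for epoch, val_loss in enumerate([0.9, 0.5, 0.7, 0.6, 0.8]):  # made-up losses
    state = {"epoch": epoch, "model_state_dict": model.state_dict()}
    best = save_rolling(state, ckpt_dir, epoch, val_loss, best)

remaining = sorted(p.name for p in ckpt_dir.iterdir())
print(remaining, best)
```

After five epochs only the last three epoch files and best.pt remain on disk. A production version would also store optimizer, scheduler, and RNG state in the saved dictionary.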

🎯 Key Takeaway

Checkpoint optimizer state and RNG seeds, not just model weights. Resume identical training or don't bother checkpointing.

Distributed Data Parallel: When One GPU Isn't Enough and Neither Is Your Patience

Your model takes 12 hours on one GPU. Your boss wants it in 2. You buy two more GPUs and expect 4 hours. That's not how DDP works. Distributed Data Parallel isn't magic. It's a carefully orchestrated dance of gradient synchronization, and poor implementation turns it into a slow-motion train wreck.

DDP works by splitting batches across GPUs. Each GPU computes gradients on its shard, then all-reduces them so every card has the average gradient. The bottleneck is that all-reduce communication. If your batch size per GPU is too small, GPUs spend more time talking than computing. Rule of thumb: each GPU should process at least 32 samples per forward pass.

Watch your batch size scaling. DDP gives near-linear speedup only if you increase the global batch size proportionally. Doubling GPUs? Double the batch size and adjust the learning rate. Otherwise, you get diminishing returns and your validation loss plateaus because you're taking noisier gradient steps.

Wrap your model with nn.parallel.DistributedDataParallel, not the older nn.DataParallel. DataParallel runs in a single process, scatters inputs from GPU 0 and gathers outputs back onto it, and is held back by Python's GIL. It's a bottleneck masquerading as parallelism.

MultiGPUTraining.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(rank, world_size):
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://127.0.0.1:29500',
        rank=rank,
        world_size=world_size
    )

def train_rank(rank, world_size):
    setup_ddp(rank, world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(512, 256).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001)

    data_batch = torch.randn(64, 512).to(rank)
    target = torch.randn(64, 256).to(rank)

    for epoch in range(5):
        optimizer.zero_grad()
        output = ddp_model(data_batch)
        loss = nn.MSELoss()(output, target)
        loss.backward()
        optimizer.step()
        print(f"Rank {rank}, Epoch {epoch}, Loss: {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    torch.multiprocessing.spawn(train_rank, args=(world_size,), nprocs=world_size)

Output

Rank 0, Epoch 0, Loss: 1.2412

Rank 1, Epoch 0, Loss: 1.2412

Rank 0, Epoch 1, Loss: 1.0873

Rank 1, Epoch 1, Loss: 1.0873

💡Senior Shortcut:

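Use torchrun to launch DDP scripts. It spawns one process per GPU and sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for you. One command: torchrun --nproc_per_node=N your_script.py. Your script still calls dist.init_process_group(), but without hard-coded ranks, addresses, or a manual spawn call.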

🎯 Key Takeaway

Scale batch size with GPU count. Use DistributedDataParallel, not DataParallel. DDP is linear only when communication time is negligible compared to compute time.
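Splitting batches across GPUs in a real pipeline is usually done with DistributedSampler, which hands each rank a disjoint slice of the dataset. The sketch below passes num_replicas and rank explicitly so it runs in a single process without a process group; the toy dataset and world_size = 2 are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))
world_size = 2  # assumed number of processes

shards = []
for rank in range(world_size):
    # Inside a real DDP process, num_replicas and rank default to the process group's values.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    shards.append(list(sampler))
    print(f"rank {rank}: indices {shards[-1]}, batches per epoch {len(loader)}")
```

When shuffle=True, call sampler.set_epoch(epoch) at the start of every epoch so each epoch uses a different shuffle that is still consistent across ranks.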

Installation: Get PyTorch Running Before Your Coffee Gets Cold

You need PyTorch installed. Skip the pip install torch blanket statement — that's for people who enjoy debugging CUDA errors at 2 AM. You need the right wheel for your hardware.

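Check your driver with nvidia-smi; the CUDA version it reports is the newest runtime the driver supports, not a toolkit you have to install. Pick a PyTorch build from the selector on pytorch.org whose CUDA version is no newer than that. If you're on CPU-only, grab the CPU build. If you're on an M-series Mac, the standard macOS build includes the Metal Performance Shaders (MPS) backend. Conda handles dependencies better than pip for GPU libraries, so use it where a conda build is offered for your release. For the CUDA 11.8 builds of PyTorch 2.x the command is one line: conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia. That's it. No excuses.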

After install, run torch.cuda.is_available() in a Python shell. If it returns False on a CUDA machine, your install is wrong. Fix it before you write a single line of training code.

verify_install.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda if torch.cuda.is_available() else 'N/A'}")
print(f"MPS available: {torch.backends.mps.is_available()}")

# Expected output (on a CUDA 11.8 system):
# PyTorch version: 2.1.0
# CUDA available: True
# CUDA version: 11.8
# MPS available: False

Output

PyTorch version: 2.1.0

CUDA available: True

CUDA version: 11.8

MPS available: False

⚠ Production Trap:

Don't install PyTorch via pip in a shared environment — it'll fight with system CUDA libraries. Use a conda environment with explicit CUDA toolkit pinning.

🎯 Key Takeaway

One wrong pip install costs more time than reading the conda docs. Use conda. Match your CUDA version exactly.

GPU Acceleration: Stop Burning CPU Cycles on Matrix Math

Your GPU is a parallel compute beast. Your CPU is a glorified traffic cop. Stop making the cop do math — that's the GPU's job. PyTorch makes this trivial: call .to('cuda') on your tensors and models.

Here's why this matters: a 1000x1000 matrix multiply takes on the order of 50ms on a CPU and about 0.5ms on an RTX 3090, depending on the hardware and BLAS build. That's roughly 100x faster. Now scale that across a training loop with millions of iterations. The math is brutal: you leave months of training time on the table by ignoring GPU acceleration.

Production rules: keep your model and tensors on the same device. Use torch.no_grad() for inference to save memory. If you're on a multi-GPU machine, use DistributedDataParallel rather than nn.DataParallel. For single GPU, just .to('cuda'). Always check tensor.device before operations: a CPU tensor talking to a GPU tensor throws a runtime error. That's not a bug, that's you being sloppy.

gpu_acceleration.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
import time

# Time the matrix multiply on CPU
cpu_tensor = torch.randn(1000, 1000)
start = time.perf_counter()
cpu_result = cpu_tensor @ cpu_tensor.T
print(f"CPU time: {time.perf_counter() - start:.4f}s")

# Move to GPU and run one warm-up multiply so one-time CUDA initialisation is not timed
gpu_tensor = cpu_tensor.to('cuda')
_ = gpu_tensor @ gpu_tensor.T
torch.cuda.synchronize()
start = time.perf_counter()
gpu_result = gpu_tensor @ gpu_tensor.T
torch.cuda.synchronize()
print(f"GPU time: {time.perf_counter() - start:.4f}s")

Output

CPU time: 0.0512s

GPU time: 0.0004s

💡Senior Shortcut:

Call torch.cuda.synchronize() immediately before starting the timer and again before stopping it. Without it, PyTorch queues ops asynchronously and your timestamps lie to you.

🎯 Key Takeaway

Move everything to the GPU with .to('cuda'). The 100x speedup isn't a gimmick — it's production reality.

Enhancing Data Diversity through Augmentation

Models memorize, they don't generalize. Without diverse training data, your model fails on real-world shifts. Data augmentation injects synthetic variance—rotations, flips, noise, color jitter—without collecting new samples. PyTorch provides torchvision.transforms to chain operations declaratively. Apply augmentations inside Dataset.__getitem__ so each epoch sees different distorted versions of the same image. This prevents overfitting and forces the model to learn invariant features. The cost: CPU overhead on the data loader. Use multiple workers and prefetching to hide latency. Never augment validation or test sets—only training. Start with random horizontal flips and color jitter; they yield the highest ROI for vision tasks. For text, synonym replacement and back-translation work similarly. Augmentation is not a silver bullet—excessive distortion destroys signal. Tune intensities per dataset.

AugmentDataset.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image

class AugmentedDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths
        self.train_transform = transforms.Compose([
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.ColorJitter(brightness=0.2, contrast=0.2),
            transforms.ToTensor()
        ])

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')  # force 3 channels so samples stack into a batch
        return self.train_transform(img)

    def __len__(self):
        return len(self.paths)

dl = DataLoader(AugmentedDataset(['img1.jpg']), batch_size=4, num_workers=2)
for batch in dl:
    print(batch.shape)  # the dataset holds one image, so torch.Size([1, 3, H, W])

Output

torch.Size([1, 3, 224, 224])

⚠ Production Trap:

Applying augmentation twice—once in transforms and again in a separate preprocessing script—doubles memory and wastes compute.

🎯 Key Takeaway

Augment online per epoch, not offline once, to maximize data diversity without extra storage.

Recurrent Neural Networks (RNNs)

Feedforward nets treat each input independently, which makes them a poor fit for sequences. RNNs loop hidden state across timesteps, letting information persist. PyTorch's nn.RNN processes whole sequences with a single API. The hidden state h carries context; each step receives current input x_t and previous state h_{t-1}. Vanilla RNNs suffer vanishing gradients over long sequences, so use nn.LSTM or nn.GRU instead. Stack multiple layers for deeper representations, but watch overfitting. The batch_first=True flag makes the input layout (batch, seq_len, features), which is the most intuitive for typical usage. For padded batches, pack the input with nn.utils.rnn.pack_padded_sequence so the recurrence skips padding tokens, then unpack the output with pad_packed_sequence. RNNs remain a reasonable choice for short-to-medium sequential data, especially when interpretability of hidden states matters. For very long sequences, switch to Transformers.

BasicRNN.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h0=None):
        x = self.embed(x)               # (B, T, E)
        out, h = self.rnn(x, h0)        # out: (B, T, H)
        logits = self.fc(out)           # (B, T, V)
        return logits, h

model = CharRNN(vocab_size=50)
input_seq = torch.randint(0, 50, (2, 10))  # batch=2, seq_len=10
logits, hidden = model(input_seq)
print(logits.shape)  # (2, 10, 50)

Output

torch.Size([2, 10, 50])

⚠ Production Trap:

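Leaving batch_first=False (the default) does not transpose anything: nn.RNN simply treats dimension 0 as time and dimension 1 as batch, so a (batch, seq_len, features) tensor is silently misread. Set batch_first=True to avoid shape bugs.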

🎯 Key Takeaway

Use GRU or LSTM instead of vanilla RNN for any sequence longer than 10 steps.
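Packing a padded batch before a GRU and unpacking afterwards can be sketched as below. The tensor sizes and lengths are made up for illustration; the three rnn utilities are the standard PyTorch ones.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)

padded = torch.randn(3, 5, 4)        # (batch, max_seq_len, features)
lengths = torch.tensor([5, 3, 2])    # true lengths on CPU, sorted longest first

# Packing tells the GRU where each sequence really ends.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
packed_out, h_n = gru(packed)

# Unpacking restores a padded tensor; positions past each length are zeros.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, h_n.shape, out_lengths.tolist())
```

If the batch is not sorted by length, pass enforce_sorted=False and let PyTorch handle the reordering. h_n holds the state at each sequence's last real timestep, not at the padded end.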

Finding PyTorch Jobs

Employers want engineers who ship models, not just train notebooks. PyTorch jobs demand production skills: writing nn.Module subclasses, building custom Dataset loaders, handling GPU memory with torch.cuda.amp, and debugging autograd graphs. Focus on end-to-end pipelines—data ingestion, training, export to TorchScript, and serving via TorchServe or ONNX. Portfolio projects should include a requirements.txt, train.py with argparsing, and a README explaining trade-offs. Contribute to PyTorch open-source (e.g., bug fixes in torchvision or documentation patches) to get noticed. Network at PyTorch Conference or local meetups. Tailor your resume: list concrete metrics (e.g., “Reduced inference latency by 40% via mixed precision”). Avoid vague terms like “deep learning enthusiast.” Recruiters scan for keywords: torch.distributed, DDP, CUDA graphs, torch.compile. Practice system design for ML—how would you serve a model at 10k QPS?

JobSearchChecklist.pyPYTHON

# io.thecodeforge - ml-ai tutorial

# Simulate a job-fit check: map each role requirement to the skill that covers it
skills = {'nn.Module', 'Dataset', 'DDP', 'TorchScript', 'AMP'}
role = {'Mixed Precision': 'AMP', 'DataLoader': 'Dataset', 'Distributed Training': 'DDP'}
score = sum(1 for needed_skill in role.values() if needed_skill in skills)
print(f"Match score: {score}/{len(role)}")

# Example resume bullet
# "Built a PyTorch data pipeline with 4 workers, achieving 0.3ms batch loading"

# Open source tip
# Find issues: https://github.com/pytorch/pytorch/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22

Output

Match score: 3/3

⚠ Production Trap:

Don't list 'PyTorch' as a skill if you've only used Keras—recruiters will grill you on autograd and custom nn.Module hooks.

🎯 Key Takeaway

Ship a real PyTorch pipeline with distributed training profiling, and you'll outrank 90% of applicants.

Audience

This PyTorch basics guide is crafted for senior software engineers who have already paid their dues in general-purpose programming but are now navigating the treacherous waters of machine learning. You are not a data scientist fresh out of a bootcamp; you understand memory management, concurrency, and the grim reality of production systems. If you’ve ever cursed a Python script for silently consuming 16GB of RAM, you are in the right place. The material assumes you can read PyTorch’s C++ backend stack traces without flinching and that you care more about deterministic reproducibility than notebook aesthetics. We target engineers building pipelines that must survive latency SLAs and rolling deployments. Expect rigorous code, not hand-wavy explanations. This is for the builder who knows that a model is just another binary artifact—like a Docker image, but with more matrix multiplications and fewer dependency conflicts.

audience_confirm.pyPYTHON

# io.thecodeforge - ml-ai tutorial
# Confirm audience fit
import sys

def is_target_audience():
    try:
        import torch
        # Real engineers check tensor stability, not just import
        t = torch.tensor([1.0], device='cpu')
        assert t.item() == 1.0
        return True
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return False

if __name__ == '__main__':
    audience = is_target_audience()
    print(f'You belong here: {audience}')
    sys.exit(0 if audience else 1)

Output

You belong here: True

⚠ Production Trap:

Do not skip this section if you are a backend engineer. PyTorch's eager execution defaults can hide OOM errors until deploy. Confirm your tensor lifecycle before you write code.

🎯 Key Takeaway

This guide is for senior engineers who treat ML pipelines as production systems.

Prerequisites

Before you touch a single nn.Module, ensure your environment is battle-ready. First, Python 3.9+ is mandatory; 3.8 is end-of-life, so stop resurrecting it. Install PyTorch 2.x (built for CUDA 12.1 or later) in an isolated environment, following the Installation section above, so it cannot clash with other projects. You must understand Python's import system, context managers for resource lifecycle, and the GIL's limitations. For GPU work, have NVIDIA drivers 535+ and nvidia-smi ready to confirm CUDA availability. Know what a tensor is: not a list, not a numpy array, but a first-class GPU citizen with strides and gradients. You should have debugged a segfault before; this is not a place for cargo-cult programming. Bring your own test infrastructure: pytest is mandatory. Finally, accept that you will write more data-loading code than model code, and prepare your file I/O pipeline with mmap and shared memory fundamentals. No previous ML experience? Go elsewhere.

check_prereqs.pyPYTHON

# io.thecodeforge - ml-ai tutorial
# Verify prerequisites
import torch
import sys

def check_env():
    needed = {'python': lambda: sys.version_info >= (3, 9),
              'torch': lambda: hasattr(torch, '__version__'),
              # CUDA is optional: this check also passes on CPU-only machines
              'cuda': lambda: torch.cuda.is_available() or True}
    for name, check in needed.items():
        assert check(), f'Missing: {name}'
    return True

if __name__ == '__main__':
    check_env()
    print(f'Environment OK: PyTorch {torch.__version__}')
    sys.exit(0)

Output

Environment OK: PyTorch 2.1.0

⚠ Production Trap:

Never assume a GPU is available: CI runners, laptops, and many deployment targets are CPU-only. Build CPU fallback paths before anything else, and use torch.no_grad() and model.eval() on every inference path from the start.
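A CPU fallback path can start with a single device-selection helper. The function name pick_device is hypothetical; the availability checks are the standard PyTorch ones.

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple's MPS backend, and fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # absent on very old builds
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
x = torch.ones(2, 2, device=device)
print(device, x.sum().item())
```

Calling this once at startup and passing the result everywhere keeps the rest of the code free of hard-coded 'cuda' strings.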

🎯 Key Takeaway

Prerequisites are strict: Python 3.9+, PyTorch 2.x, CUDA 12.1, and senior-level debugging instincts.

● Production incident · POST-MORTEM · severity: high

Production model silently trained on accumulated gradients for 200 epochs

Symptom

Training loss decreased steadily. Validation loss was noisy but trending down. Production A/B test showed zero lift over the baseline model — predictions appeared random.

Assumption

The model needed more training data or a different architecture. The team spent two weeks collecting more data.

Root cause

The training loop did not call optimizer.zero_grad(). PyTorch accumulates gradients by default: every backward() call adds to existing .grad values rather than replacing them. Without a reset, each optimizer step used the running sum of every gradient computed since the start of training instead of the current batch's gradient, so over 200 epochs the update direction was dominated by stale gradients and its magnitude grew far beyond the correct value. The optimizer applied enormous, compounding weight updates that oscillated around the loss minimum without ever settling. The model ended up with weights that produced low training loss by fitting noise rather than signal, a gradient-corruption failure that is nearly impossible to diagnose from loss curves alone.

Fix

Added optimizer.zero_grad() as the first line of every training step. Added gradient norm logging to the training dashboard — a norm above 10.0 now triggers an alert. Added gradient clipping (max_norm=1.0) as a standing safety net across all training jobs. Added validation loss divergence detection — an alert fires if val loss increases for five consecutive epochs relative to the rolling minimum.

Key lesson

PyTorch accumulates gradients by default — zero_grad() is not optional, it is the first line of every training step
Monitor gradient norms during training — a sudden spike almost always indicates accumulation or an unchecked learning rate schedule
Validation loss trending down is not sufficient signal — always check for divergence between train loss and val loss over time
Gradient clipping prevents catastrophic divergence from outlier batches or accumulation bugs — set it once and leave it on
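The accumulation behaviour behind this incident can be reproduced in a few lines. The toy model and data are assumptions; the point is that a second backward() without zero_grad() doubles the gradient norm, which is exactly what gradient norm logging would reveal.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
features, targets = torch.randn(8, 3), torch.randn(8, 1)
loss_fn = nn.MSELoss()

def grad_norm(module):
    """Global L2 norm over all parameter gradients."""
    return torch.norm(torch.stack([p.grad.norm() for p in module.parameters()])).item()

loss_fn(model(features), targets).backward()
norm_once = grad_norm(model)

loss_fn(model(features), targets).backward()  # no zero_grad(): gradients add up
norm_twice = grad_norm(model)

model.zero_grad()
loss_fn(model(features), targets).backward()
norm_after_reset = grad_norm(model)

# clip_grad_norm_ returns the norm measured before clipping.
norm_before_clip = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(norm_once, norm_twice, norm_after_reset, grad_norm(model))
```

Logging the value returned by clip_grad_norm_ on every step gives you the gradient norm metric for free.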

Production debug guide: common symptoms when training goes wrong (5 entries)

Symptom · 01

Loss becomes NaN after a few training steps

→

Fix

Check for division by zero, log of negative numbers, or gradient explosion. Enable torch.autograd.detect_anomaly() to identify which operation produced the NaN gradient. In my experience, the most common culprit is a log() applied to a prediction that dipped to exactly zero — add a small epsilon (1e-8) inside any log call in your loss function.
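The log-of-zero case and the epsilon fix can be seen on a three-element tensor. The probabilities here are made up; clamp_min is shown as an alternative that leaves valid values untouched.

```python
import torch

probs = torch.tensor([0.0, 0.5, 1.0])          # hypothetical predicted probabilities

unsafe = torch.log(probs)                       # log(0) is -inf and poisons the loss
safe_eps = torch.log(probs + 1e-8)              # epsilon inside the log
safe_clamp = torch.log(probs.clamp_min(1e-8))   # alternative: clamp before the log

print(unsafe)
print(safe_eps)
print(safe_clamp)
```

Built-in losses such as nn.CrossEntropyLoss and nn.BCEWithLogitsLoss work on logits and avoid the problem altogether.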

Symptom · 02

Model trains but produces identical outputs for all inputs

→

Fix

Check if gradients are zero everywhere. Verify requires_grad is True on every parameter you expect to train. Check for accidental torch.no_grad() wrapping the training loop; this is surprisingly easy to do when refactoring inference code into a shared utility. Also check for dead ReLU initialisation: if all pre-activations are negative at init, the entire gradient signal is zero from step one.

Symptom · 03

GPU memory grows without bound each epoch

→

Fix

Check for tensors retained in the computational graph across loop iterations. The most common cause: appending loss (not loss.item()) to a history list. Use .item() for scalar logging and .detach() for tensor logging. Also check for retain_graph=True being passed to backward() on every iteration; it is almost never necessary in standard training and keeps the whole graph alive in memory.
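A leak-free logging pattern might look like the sketch below. The toy model and random batches are assumptions; the relevant line is the append of loss.item().

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
history = []

for _ in range(3):
    loss = loss_fn(model(torch.randn(8, 3)), torch.randn(8, 1))
    loss.backward()
    # .item() copies the scalar into a Python float, so no graph is kept alive.
    history.append(loss.item())
    model.zero_grad()

print(all(isinstance(value, float) for value in history), loss.requires_grad)
```

Appending loss itself would store a tensor that still references its computation graph, which is how memory grows from one iteration to the next.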

Symptom · 04

Validation loss fluctuates wildly between epochs

Fix

Check if model.eval() is called before validation. Without it, Dropout randomly drops different neurons on every forward pass, and BatchNorm uses the current batch's statistics instead of the accumulated running statistics. The result is non-deterministic validation outputs even on identical input data — which looks exactly like training instability but is actually an evaluation bug.
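A quick way to confirm this diagnosis is to run the same input through the model twice in each mode. The two-layer model below is an assumption for illustration; any model containing Dropout behaves the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=0.5))
x = torch.randn(4, 16)

# Training mode: Dropout samples a fresh mask on every forward pass.
model.train()
with torch.no_grad():
    train_a, train_b = model(x), model(x)

# Eval mode: Dropout is a no-op, so identical input gives identical output.
model.eval()
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

outputs_differ_in_train = not torch.equal(train_a, train_b)
outputs_match_in_eval = torch.equal(eval_a, eval_b)
print(outputs_differ_in_train, outputs_match_in_eval)
```

If validation outputs differ between two passes over the same batch, the model is almost certainly still in training mode.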

Symptom · 05

Model works on CPU but crashes on GPU with RuntimeError

Fix

Check device mismatch — this is almost always it. Every tensor involved in a single operation must live on the same device. Use .to(device) on both the model and every input tensor in your data loading step. If you are using a custom collate function in DataLoader, that is often where tensors quietly stay on CPU.
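Batches from a custom collate function are often nested (dicts, tuples, lists), so a single .to(device) call on the top-level object is not enough. A minimal sketch of a recursive mover follows; move_to_device is a hypothetical helper, not a PyTorch API.

```python
import torch

def move_to_device(batch, device):
    """Recursively move every tensor in a nested batch to `device`."""
    if isinstance(batch, torch.Tensor):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: move_to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_to_device(item, device) for item in batch)
    return batch  # non-tensor values (ints, strings) pass through unchanged

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {
    "inputs": torch.randn(2, 3),
    "targets": (torch.zeros(2), torch.ones(2)),
    "ids": ["a", "b"],
}
moved = move_to_device(batch, device)
print({k: getattr(v, "device", None) for k, v in moved.items()})
```

Calling this once per batch at the top of the training step keeps the device logic in one place instead of scattered through the loop.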

★ PyTorch Training Debug Cheat Sheet: quick commands to diagnose training and memory issues

Loss becomes NaN during training

Immediate action

Enable anomaly detection to find the operation producing NaN gradients

Commands

torch.autograd.set_detect_anomaly(True)

print([(n, p.grad.norm()) for n, p in model.named_parameters() if p.grad is not None])

Fix now

Add gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

PyTorch vs TensorFlow 1.x — Architectural Differences

Feature / Aspect	PyTorch (Dynamic Graph)	TensorFlow 1.x (Static Graph)
Graph construction	Built at runtime on every forward pass — debug with standard Python tools anywhere in the loop	Pre-compiled before any data flows through — the graph was fixed at definition time, making runtime inspection nearly impossible
Debugging	Standard Python debugger, `print()`, and pdb work anywhere in the forward pass with no special configuration	Required special tf.Print ops inserted into the graph; runtime errors produced stack traces that pointed to graph compilation, not the user code that caused them
Research flexibility	Architecture changes take effect immediately — swap a layer, change a loss function, add a branch mid-experiment with no recompilation step	Any architectural change required rebuilding and recompiling the graph, which could take seconds to minutes for large models
Production deployment	TorchScript or ONNX export required for optimised serving without a Python runtime; `torch.compile()` in 2.x closes most of the performance gap for GPU serving	SavedModel format was natively optimised for TF Serving; the static graph made deployment straightforward but locked you into the graph you compiled
Community adoption	Dominant in research — over 75% of ML papers published in 2024-2025 used PyTorch as the primary framework	Remains strong in enterprise production systems built before 2020; legacy TF1 codebases are still running in many large organisations
GPU memory control	Explicit .to(device) — you decide what moves and when; nothing migrates automatically	Automatic placement with manual overrides via `tf.device()` context managers; less control but fewer explicit device calls
Gradient control	requires_grad per tensor; `torch.no_grad()` and `torch.inference_mode()` context managers; fine-grained control at the tensor level	GradientTape context manager in TF 2.x — similar concept but opt-in rather than opt-out; in TF 1.x gradients were computed by `tf.gradients()` on the pre-compiled graph

⚙ Quick Reference

14 commands from this guide

File	Command / Code	Purpose
io.thecodeforge.ml.tensor_fundamentals.py	house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0]) # 1D tensor, shape ...	Tensors
io.thecodeforge.ml.autograd_linear_regression.py	torch.manual_seed(42)	Autograd
io.thecodeforge.ml.neural_network_training_loop.py	torch.manual_seed(42)	Building a Real Training Loop with nn.Module
io.thecodeforge.ml.data_loading.py	from torch.utils.data import Dataset, DataLoader	Data Loading with Dataset and DataLoader
io.thecodeforge.ml.gpu_mixed_precision.py	from torch.cuda.amp import autocast, GradScaler	Training on GPU and Mixed Precision
TrainingCheckpointer.py	def save_checkpoint(model, optimizer, scheduler, epoch, loss, filepath):	Checkpointing
MultiGPUTraining.py	from torch.nn.parallel import DistributedDataParallel as DDP	Distributed Data Parallel
verify_install.py	print(f"PyTorch version: {torch.__version__}")	Installation
gpu_acceleration.py	cpu_tensor = torch.randn(1000, 1000)	GPU Acceleration
AugmentDataset.py	from torch.utils.data import Dataset, DataLoader	Enhancing Data Diversity through Augmentation
BasicRNN.py	class CharRNN(nn.Module):	Recurrent Neural Networks (RNNs)
JobSearchChecklist.py	skills = ['nn.Module', 'Dataset', 'DDP', 'TorchScript', 'Amp']	Finding PyTorch Jobs
audience_confirm.py	def is_target_audience():	Audience
check_prereqs.py	def check_env():	Prerequisites

Key takeaways

Tensors carry three critical properties beyond their values: dtype, device, and requires_grad. Getting any one of these wrong silently breaks training in ways that trace to the wrong location in the stack.

Autograd doesn't run continuously; it only records a computation graph when requires_grad=True tensors are involved, and only computes gradients when you explicitly call .backward() on a scalar loss. The graph is destroyed after each backward pass by default, so retain_graph=True in a loop is almost always a memory leak.

The five-step training loop (zero_grad, forward, loss, backward, step) is the universal skeleton of PyTorch training code, and the order is load-bearing. Memorise it and you can read any codebase or paper's training code cold.
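As a reference, the five steps look like this on a toy regression problem. The data, learning rate, and step count are assumptions picked so the example converges quickly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Synthetic data from y = 3x + 0.5 (illustrative only).
x = torch.linspace(-1, 1, 32).unsqueeze(1)
y = 3.0 * x + 0.5

losses = []
for step in range(200):
    optimizer.zero_grad()        # 1. clear gradients left from the previous step
    pred = model(x)              # 2. forward pass
    loss = loss_fn(pred, y)      # 3. compute the loss
    loss.backward()              # 4. backpropagate
    optimizer.step()             # 5. update parameters
    losses.append(loss.item())   # log a float, not the tensor

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```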

model.train() and model.eval() switch the behaviour of layers like Dropout and BatchNorm. torch.no_grad() controls gradient computation. These mechanisms are independent: eval() does not disable gradient tracking, and no_grad() does not change layer behaviour. Confusing them is the single most common source of subtle training bugs in production PyTorch code.
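The independence of layer mode and gradient tracking can be checked directly. This is a minimal sketch on an assumed toy model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

# eval() changes layer behaviour, but autograd still records the forward pass.
model.eval()
out_eval = model(x)
eval_tracks_grad = out_eval.requires_grad        # True

# no_grad() stops recording, but the model stays in training mode.
model.train()
with torch.no_grad():
    out_no_grad = model(x)
no_grad_tracks_grad = out_no_grad.requires_grad  # False
still_training = model.training                  # True
```

A correct validation loop therefore needs both: model.eval() for layer behaviour and a no-gradient context for memory and speed.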

Always call model as a callable (model(x)), never model.forward(x): the __call__ method wires up the hook infrastructure that profilers, quantisation tools, and debugging libraries depend on. This one habit prevents an entire class of silent tooling failures.
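A forward hook makes the difference observable. In this sketch (the model and the calls list are illustrative), the hook fires for model(x) but not for model.forward(x):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
calls = []

# Record the output shape every time the hook fires.
handle = model.register_forward_hook(
    lambda module, inputs, output: calls.append(tuple(output.shape))
)

x = torch.randn(3, 4)
model(x)            # goes through __call__, so the hook runs
model.forward(x)    # bypasses __call__, so the hook is skipped

handle.remove()     # detach the hook when it is no longer needed
print(calls)
```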

Dataset + DataLoader form the standard data pipeline. Keep num_workers moderate, always move batches to the correct device, and test the pipeline in isolation.

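Mixed precision with torch.cuda.amp can substantially cut memory use and raise throughput on GPUs with tensor cores; the exact gain depends on the model and hardware. When training in float16, GradScaler is not optional: it prevents gradient underflow. Use inference_mode for production serving, not no_grad.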
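Mixed precision needs a GPU to demonstrate properly, so this sketch covers only the last point: how inference_mode differs from no_grad. The toy model is an assumption. Both contexts skip graph recording, but tensors created under inference_mode are additionally marked as inference tensors, which lets PyTorch skip some autograd bookkeeping.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2).eval()
x = torch.randn(3, 4)

with torch.no_grad():
    out_no_grad = model(x)

with torch.inference_mode():
    out_inference = model(x)

# Neither output tracks gradients, but only one is an inference tensor.
print(out_no_grad.requires_grad, out_no_grad.is_inference())
print(out_inference.requires_grad, out_inference.is_inference())
```

The trade-off is that inference tensors cannot later be used in operations that autograd needs to record, so inference_mode fits serving paths rather than code shared with training.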

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

What is the computation graph in PyTorch and how does autograd use it to...

Q02 · SENIOR

Explain the difference between model.train() and model.eval() and why yo...

Q03 · SENIOR

What happens if you forget to call optimizer.zero_grad() before each tra...

Q01 of 03 · SENIOR

What is the computation graph in PyTorch and how does autograd use it to compute gradients?

ANSWER

The computation graph is a directed acyclic graph (DAG) that records every operation performed on tensors that have requires_grad=True. Each operation becomes a node (with a grad_fn), and the edges represent the tensors flowing between operations. When you call .backward() on a scalar loss, autograd traverses this graph in reverse topological order, applying the chain rule at each node to compute the gradient of the loss with respect to every leaf tensor that required a gradient. The graph is dynamically built on each forward pass and is destroyed after backward() by default, which keeps memory usage proportional to a single forward pass rather than the entire training history.
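The answer can be traced by hand on a two-operation graph. The values below are assumptions picked so the chain rule is easy to verify.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)   # leaf tensor
w = torch.tensor(3.0, requires_grad=True)   # leaf tensor

y = w * x                # node recorded with a multiplication grad_fn
loss = (y - 1.0) ** 2    # node recorded with a power grad_fn

print(x.is_leaf, x.grad_fn)            # leaves have no grad_fn
print(type(y.grad_fn).__name__)        # MulBackward0
print(type(loss.grad_fn).__name__)     # PowBackward0

# Chain rule: dloss/dy = 2 * (y - 1) = 10, so dloss/dx = 10 * w = 30
# and dloss/dw = 10 * x = 20.
loss.backward()
print(x.grad.item(), w.grad.item())

# The graph's saved tensors are freed after backward(), so a second call fails.
graph_freed = False
try:
    loss.backward()
except RuntimeError:
    graph_freed = True
```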

FAQ · 6 QUESTIONS

Frequently Asked Questions

What is the difference between `torch.Tensor` and `torch.nn.Parameter`?

Why does my model's loss not decrease after switching to GPU?

How do I handle variable-length sequences in a batch?

What is the purpose of `torch.no_grad()` and when should I use it?

How do I debug a vanishing gradient in PyTorch?

What's the best way to save and load a model for inference?

Naren, Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Written from production experience, not tutorials.

✓ Verified: production tested · Last updated: July 19, 2026 · 2,466 articles, all by Naren
