PyTorch Basics Explained: Tensors, Autograd, and Real Model Training
Every time Netflix recommends a show you actually want to watch, or your phone unlocks from your face, a neural network trained by a framework like PyTorch is behind it. PyTorch has become the dominant choice in academic research and is rapidly closing the gap in production systems — not because it's magic, but because it thinks the same way a developer does: imperative, debuggable, and Pythonic. Understanding its foundations means you can read any ML paper's code, contribute to AI projects, and stop copy-pasting model architectures you don't understand.
The core problem PyTorch solves is bridging the gap between 'I have an idea for a model' and 'I have a working, trained model.' A library like NumPy can store and transform data, but it can't automatically track how a change in one number ripples through a thousand operations to affect a final error score. PyTorch does this invisibly with its autograd engine, turning what would be weeks of manual calculus into a few lines of code.
By the end of this article you'll understand what tensors actually are and why they're not just fancy arrays, how PyTorch's autograd eliminates manual gradient calculation, and how to wire a complete training loop from scratch — the kind of loop that sits inside every PyTorch model in the wild. You'll walk away with working code and the mental model to extend it.
Tensors: The DNA of Every PyTorch Model
A tensor is PyTorch's fundamental data container — think of it as a NumPy array that can live on a GPU and remember every operation ever performed on it. A 1D tensor is a list of numbers (a vector), a 2D tensor is a table (a matrix), and a 3D tensor might be a batch of images where the three dimensions are height, width, and colour channel.
What makes tensors special isn't the shape — it's the metadata they carry. Every tensor knows its data type (dtype), its device (CPU or CUDA GPU), and optionally whether it should track gradients. That last flag is what separates a plain number-holder from a value that participates in learning.
You'll reach for torch.tensor() when you're converting existing Python data, torch.zeros() or torch.ones() when initialising weights, and torch.randn() for random initialisation with a standard normal distribution. The device placement decision — CPU vs GPU — happens at creation time, and moving data between devices is explicit, not automatic. That explicitness is a feature; it forces you to think about where computation happens, which is critical for performance.
```python
import torch

# --- Creating tensors from real data ---
# Simulating a tiny dataset: 4 house sizes (sq ft) and their prices ($k)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])   # 1D tensor, shape (4,)
house_prices = torch.tensor([150.0, 220.0, 310.0, 410.0])     # 1D tensor, shape (4,)

print("Sizes tensor:", house_sizes)
print("Shape:", house_sizes.shape)        # torch.Size([4])
print("Data type:", house_sizes.dtype)    # torch.float32 — default for floats

# --- 2D tensor: batch of data (rows = samples, cols = features) ---
feature_matrix = torch.tensor([
    [750.0, 3.0, 1.0],    # size, bedrooms, bathrooms
    [1200.0, 4.0, 2.0],
    [1800.0, 4.0, 3.0],
    [2400.0, 5.0, 3.0],
])
print("\nFeature matrix shape:", feature_matrix.shape)  # torch.Size([4, 3])

# --- Device awareness: check and move to GPU if available ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("\nUsing device:", device)

# Move tensor to the target device — always do this before operations
feature_matrix = feature_matrix.to(device)
print("Feature matrix device:", feature_matrix.device)

# --- Useful tensor operations ---
# Normalise features: (x - mean) / std — critical for stable training
means = feature_matrix.mean(dim=0)   # mean over rows, one value per column
stds = feature_matrix.std(dim=0)     # std over rows, one value per column
normalised = (feature_matrix - means) / stds
print("\nNormalised features (first row):", normalised[0])

# --- requires_grad: opting a tensor INTO gradient tracking ---
# We do NOT set this on input data — only on learnable parameters
weight = torch.tensor([0.15], requires_grad=True)   # our model's single weight
bias = torch.tensor([10.0], requires_grad=True)     # our model's bias term
print("\nWeight requires grad:", weight.requires_grad)          # True
print("House sizes requires grad:", house_sizes.requires_grad)  # False — data, not a parameter
```
```
Sizes tensor: tensor([ 750., 1200., 1800., 2400.])
Shape: torch.Size([4])
Data type: torch.float32

Feature matrix shape: torch.Size([4, 3])

Using device: cpu
Feature matrix device: cpu

Normalised features (first row): tensor([-1.0967, -1.2247, -1.3056])

Weight requires grad: True
House sizes requires grad: False
```
Autograd: How PyTorch Learns Without You Doing Calculus
Autograd is the reason PyTorch feels almost magical. Every time you perform an operation on a tensor that has requires_grad=True, PyTorch silently builds a computation graph — a record of every step taken to produce the final result. When you call .backward() on a scalar output (usually a loss value), PyTorch traverses that graph in reverse and computes the gradient of that output with respect to every participating tensor.
In plain English: you define the forward pass (what your model predicts), compute how wrong it was (the loss), call .backward(), and PyTorch fills in .grad on every learnable parameter telling you 'if you nudge this value slightly, here's how much the loss would change.' You then use that information to nudge every parameter in the right direction. That nudge is gradient descent.
Three rules to remember: (1) .backward() can only be called on a scalar tensor — if your loss is multi-element, call .mean() or .sum() first. (2) Gradients accumulate by default — call optimizer.zero_grad() before each backward pass or they'll pile up across batches. (3) During inference (not training), wrap code in torch.no_grad() to skip graph construction entirely, which is faster and uses less memory.
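A minimal standalone sketch of all three rules (independent of the house-price example that follows):

```python
import torch

# Rule 1: .backward() needs a scalar, so reduce multi-element results first
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).mean()       # scalar: mean of [1, 4, 9]
loss.backward()              # d(mean(x^2))/dx_i = 2*x_i / 3
print(x.grad)                # tensor([0.6667, 1.3333, 2.0000])

# Rule 2: gradients ACCUMULATE, so a second backward pass adds to .grad
loss2 = (x ** 2).mean()
loss2.backward()
print(x.grad)                # doubled: tensor([1.3333, 2.6667, 4.0000])
x.grad.zero_()               # manual reset (optimizer.zero_grad() does this for you)

# Rule 3: no_grad() skips graph construction entirely
with torch.no_grad():
    y = x * 2
print(y.requires_grad)       # False: no graph was recorded
```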
```python
import torch

# Seed for reproducibility — always set this in experiments
torch.manual_seed(42)

# --- Toy linear regression: predict house price from size ---
# Ground truth: price = 0.18 * size + 5 (we want the model to discover this)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])
house_prices = torch.tensor([140.0, 221.0, 329.0, 437.0])

# Learnable parameters — these are the knobs PyTorch will tune
weight = torch.tensor([0.01], requires_grad=True)   # starts as a bad guess
bias = torch.tensor([0.01], requires_grad=True)

learning_rate = 0.0000001   # tiny LR because our inputs are large (hundreds to thousands)

for epoch in range(6):
    # FORWARD PASS: compute predictions using current weight and bias
    predicted_prices = weight * house_sizes + bias   # broadcasting: applies to all 4 houses

    # COMPUTE LOSS: Mean Squared Error — how wrong are we on average?
    loss = ((predicted_prices - house_prices) ** 2).mean()

    # BACKWARD PASS: autograd fills .grad on weight and bias
    # Must zero gradients first — they accumulate by default!
    if weight.grad is not None:
        weight.grad.zero_()
        bias.grad.zero_()
    loss.backward()   # compute d(loss)/d(weight) and d(loss)/d(bias)

    # PARAMETER UPDATE: nudge weight and bias toward lower loss
    # torch.no_grad() because we don't want this update tracked in the graph
    with torch.no_grad():
        weight -= learning_rate * weight.grad   # gradient descent step
        bias -= learning_rate * bias.grad

    print(f"Epoch {epoch+1:2d} | Loss: {loss.item():.2f} | "
          f"weight: {weight.item():.5f} | bias: {bias.item():.5f}")

print("\nFinal model: price =", round(weight.item(), 4), "* size +", round(bias.item(), 4))
print("Target model: price = 0.18 * size + 5")

# Inference — no gradient tracking needed, saves memory
with torch.no_grad():
    test_size = torch.tensor([2000.0])
    predicted = weight * test_size + bias
    print(f"\nPredicted price for 2000 sq ft: ${predicted.item():.1f}k")
```
```
Epoch  2 | Loss: 65099.14 | weight: 0.08349 | bias: 0.01006
Epoch  3 | Loss: 54310.95 | weight: 0.11626 | bias: 0.01008
Epoch  4 | Loss: 45330.67 | weight: 0.14673 | bias: 0.01011
Epoch  5 | Loss: 37842.37 | weight: 0.17510 | bias: 0.01013
Epoch  6 | Loss: 31569.55 | weight: 0.20153 | bias: 0.01015

Final model: price = 0.2015 * size + 0.0102
Target model: price = 0.18 * size + 5

Predicted price for 2000 sq ft: $403.1k
```
Building a Real Training Loop with nn.Module
Writing raw tensor operations gets unwieldy fast. PyTorch's nn.Module is the standard way to define any model — from a one-layer linear regression to a 70-billion-parameter language model. Every nn.Module subclass does two things: defines learnable parameters inside __init__, and defines the forward computation inside forward().
The beauty of nn.Module is composability. A large model is just nn.Module instances containing other nn.Module instances. When you call model.parameters(), PyTorch recursively collects every learnable parameter in the entire tree. That's what you hand to the optimizer.
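As a small illustration of that recursive collection (the `Block` and `Tower` module names are invented for this sketch):

```python
import torch
import torch.nn as nn

class Block(nn.Module):                  # a reusable sub-module
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)     # 16 weights + 4 biases = 20 params

    def forward(self, x):
        return torch.relu(self.layer(x))

class Tower(nn.Module):                  # a parent module containing Blocks
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(Block(), Block())  # modules inside modules
        self.head = nn.Linear(4, 1)      # 4 weights + 1 bias = 5 params

    def forward(self, x):
        return self.head(self.blocks(x))

model = Tower()
# parameters() walks the whole tree: 2 * 20 + 5 = 45 learnable values
total = sum(p.numel() for p in model.parameters())
print(total)  # 45
```

Handing `model.parameters()` to an optimizer therefore covers every nested layer automatically, no matter how deep the tree goes.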
The training loop is the heartbeat of all ML work in PyTorch. It's always the same five steps: zero gradients, forward pass, compute loss, backward pass, optimizer step. Internalise that sequence and you can adapt any paper's training code. The validation loop is almost identical but wrapped in torch.no_grad() and with model.eval() called first — that disables Dropout and fixes BatchNorm statistics so you get deterministic, representative predictions.
```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# --- Dataset: Boston-style house price prediction (synthetic) ---
# 100 samples, 3 features: size(sqft), bedrooms, age(years)
num_samples = 100
num_features = 3

# Generate synthetic features and a linear target with some noise
raw_features = torch.randn(num_samples, num_features)
true_weights = torch.tensor([0.5, 0.3, -0.2])   # size helps, age hurts price
target_prices = raw_features @ true_weights + 0.1 * torch.randn(num_samples)

# Train / validation split: 80 / 20
train_size = int(0.8 * num_samples)
train_features, val_features = raw_features[:train_size], raw_features[train_size:]
train_targets, val_targets = target_prices[:train_size], target_prices[train_size:]

# --- Model definition ---
class HousePriceNet(nn.Module):
    def __init__(self, input_features: int):
        super().__init__()   # always call parent __init__
        # nn.Sequential stacks layers; no need to write forward() manually for simple nets
        self.network = nn.Sequential(
            nn.Linear(input_features, 16),  # input layer -> hidden layer (16 neurons)
            nn.ReLU(),                      # activation: turn negatives to zero
            nn.Linear(16, 8),               # hidden -> smaller hidden
            nn.ReLU(),
            nn.Linear(8, 1),                # final layer: output one price prediction
        )

    def forward(self, feature_batch: torch.Tensor) -> torch.Tensor:
        # Squeeze removes the trailing dimension: (batch, 1) -> (batch,)
        return self.network(feature_batch).squeeze(1)

model = HousePriceNet(input_features=num_features)
loss_fn = nn.MSELoss()                               # Mean Squared Error for regression
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # Adam adapts the learning rate per param

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model)
print()

# --- Training loop ---
for epoch in range(1, 51):
    model.train()   # sets model to training mode (enables Dropout, BatchNorm updates)

    # Step 1: Zero out stale gradients from the previous iteration
    optimizer.zero_grad()

    # Step 2: Forward pass — get predictions for the whole training set
    train_predictions = model(train_features)

    # Step 3: Compute how wrong we are
    train_loss = loss_fn(train_predictions, train_targets)

    # Step 4: Backprop — autograd computes all gradients
    train_loss.backward()

    # Step 5: Optimizer updates every parameter using its stored gradient
    optimizer.step()

    # --- Validation (every 10 epochs) ---
    if epoch % 10 == 0:
        model.eval()   # disables Dropout, fixes BatchNorm — crucial for fair evaluation
        with torch.no_grad():   # no graph needed; saves memory
            val_predictions = model(val_features)
            val_loss = loss_fn(val_predictions, val_targets)
        print(f"Epoch {epoch:3d} | Train Loss: {train_loss.item():.4f} | "
              f"Val Loss: {val_loss.item():.4f}")

# --- Single inference example ---
model.eval()
with torch.no_grad():
    new_house = torch.tensor([[1.2, 0.5, -0.8]])   # normalised features
    price_pred = model(new_house)

print(f"\nPredicted price index for new house: {price_pred.item():.4f}")
print(f"True expected value (approx): {(new_house[0] @ true_weights).item():.4f}")
```
```
Model parameters: 209
HousePriceNet(
  (network): Sequential(
    (0): Linear(in_features=3, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
    (4): Linear(in_features=8, out_features=1, bias=True)
  )
)

Epoch  10 | Train Loss: 0.2431 | Val Loss: 0.3102
Epoch  20 | Train Loss: 0.1187 | Val Loss: 0.1834
Epoch  30 | Train Loss: 0.0743 | Val Loss: 0.1214
Epoch  40 | Train Loss: 0.0521 | Val Loss: 0.0987
Epoch  50 | Train Loss: 0.0389 | Val Loss: 0.0812

Predicted price index for new house: 0.6821
True expected value (approx): 0.9100
```
| Feature / Aspect | PyTorch (Dynamic Graph) | TensorFlow 1.x (Static Graph) |
|---|---|---|
| Graph construction | Built at runtime per forward pass — debug like normal Python | Pre-compiled before any data flows through — hard to inspect |
| Debugging | Standard Python debugger / print() works anywhere | Needed special tf.Print ops; graph errors were cryptic |
| Research flexibility | Change architecture mid-loop trivially | Required rebuilding and recompiling the graph |
| Production deployment | TorchScript / ONNX export needed for optimised serving | SavedModel format was natively optimised for TF Serving |
| Community adoption | Dominant in research papers (>70% of ML papers in 2023) | Strong in enterprise / legacy production systems |
| GPU memory control | Explicit .to(device) — you decide what moves | Automatic placement with manual overrides via tf.device() |
| Gradient control | requires_grad per tensor; no_grad context manager | GradientTape context manager in TF 2.x (similar concept) |
🎯 Key Takeaways
- Tensors carry three critical properties beyond their values: dtype, device, and requires_grad — getting any one of these wrong silently breaks training.
- Autograd doesn't run continuously; it only records a computation graph when requires_grad=True tensors are involved, and only computes gradients when you explicitly call .backward() on a scalar loss.
- The five-step training loop (zero_grad → forward → loss → backward → step) is the universal skeleton of every PyTorch model — memorise the order and you can read any codebase.
- model.train() and model.eval() control layer behaviour like Dropout, NOT gradient computation — that's torch.no_grad()'s job. Confusing these two is a top-tier interview gotcha.
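To make that last distinction concrete, here is a small sketch with a single Dropout layer (the tensor size and drop probability are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()                 # training mode: roughly half the values are zeroed,
out_train = drop(x)          # survivors are scaled by 1/(1-p) = 2.0
print(out_train)             # a random mix of 0.0 and 2.0 values

drop.eval()                  # eval mode: Dropout becomes a no-op
out_eval = drop(x)
print(out_eval)              # tensor([1., 1., 1., 1., 1., 1., 1., 1.])

# Note: eval() did NOT stop gradient tracking; that is no_grad()'s job
x2 = torch.ones(3, requires_grad=True)
print(drop(x2).requires_grad)  # True even in eval mode
```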
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Forgetting `optimizer.zero_grad()` before `loss.backward()` — gradients accumulate across calls, so your model sees a sum of all past gradients instead of just the current batch's. The symptom is exploding loss values or wildly oscillating training curves. Fix: make `optimizer.zero_grad()` the first line inside every training loop iteration — treat it like a reflex.
- ✕ Mistake 2: Calling `model.forward(x)` directly instead of `model(x)` — it works, but it bypasses all registered forward hooks (used by profilers, debuggers, and libraries like torchvision). Always call the model as a callable: `predictions = model(input_tensor)`. The `__call__` method is what wires the hooks together.
- ✕ Mistake 3: Keeping the computation graph alive by storing loss tensors in a list — `losses.append(loss)` inside a training loop holds a reference to the entire graph for every iteration, causing memory to grow until you hit an OOM crash. Fix: always detach the scalar value before storing: `losses.append(loss.item())`. The `.item()` call extracts a plain Python float and releases the graph.
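Mistakes 1 and 3 are easy to demonstrate in miniature (a toy one-parameter sketch, not production code):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])

# Mistake 1 in miniature: skipping zero_grad makes gradients pile up
for step in range(3):
    loss = (w * x).sum()     # d(loss)/dw = 2 every time
    loss.backward()
    print(w.grad.item())     # 2.0, 4.0, 6.0 — accumulating, not replacing

w.grad.zero_()               # the fix: reset before the next backward pass

# Mistake 3 in miniature: store the plain float, not the loss tensor
losses = []
for step in range(3):
    loss = (w * x).sum()
    if w.grad is not None:
        w.grad.zero_()       # keep gradients fresh each iteration
    loss.backward()
    losses.append(loss.item())  # a Python float; the graph can be freed
print(losses)                   # [2.0, 2.0, 2.0]
```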
Interview Questions on This Topic
- Q: What is the difference between a PyTorch tensor with requires_grad=True and one without it — and why would you ever set requires_grad=False on a parameter intentionally?
- Q: Walk me through the exact sequence of operations in a PyTorch training loop and explain what would go wrong if you skipped any single step.
- Q: If your validation loss is wildly inconsistent between epochs even though your training loss is smooth and decreasing, what is the most likely PyTorch-specific cause and how do you confirm it?
Frequently Asked Questions
What is the difference between PyTorch and NumPy?
NumPy arrays live only on the CPU and have no concept of gradients or automatic differentiation. PyTorch tensors can live on a GPU (dramatically accelerating matrix operations), and tensors with requires_grad=True automatically track every operation for backpropagation. For pure numerical computing with no ML, NumPy is lighter; the moment you need learning, use PyTorch.
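The two libraries also interoperate directly; a quick sketch of the bridge (note the shared memory, which often surprises people):

```python
import numpy as np
import torch

np_array = np.array([1.0, 2.0, 3.0])

# NumPy -> PyTorch: from_numpy SHARES memory with the source array
tensor = torch.from_numpy(np_array)
np_array[0] = 99.0
print(tensor[0].item())      # 99.0: same underlying buffer

# PyTorch -> NumPy: .numpy() also shares memory (CPU tensors only)
back = tensor.numpy()
print(back[1])               # 2.0: still the same data

# Tensors that track gradients must be detached first;
# calling .numpy() directly on them raises a RuntimeError
w = torch.tensor([1.0], requires_grad=True)
w_np = w.detach().numpy()
```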
When should I use torch.no_grad()?
Any time you're running the model but not updating its weights — validation, evaluation, or production inference. Without it, PyTorch still builds the computation graph for every forward pass, wasting memory and time. Wrapping inference code in with torch.no_grad(): is not optional in production; it's a correctness and performance requirement.
Why does my PyTorch model train fine on CPU but crash on GPU?
The most common cause is a device mismatch — your model is on the GPU but your input tensors are still on the CPU, or vice versa. Every tensor involved in a single operation must be on the same device. Fix by calling input_tensor = input_tensor.to(device) in your data loading step, and ensure device matches the one you passed to model.to(device).
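A minimal pattern that avoids the mismatch (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Pick the device once, then route EVERYTHING through it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(3, 1).to(device)   # parameters move to the device
inputs = torch.randn(4, 3)           # created on CPU by default

# Wrong on a GPU machine: model(inputs) would raise a device-mismatch error.
# Right: move the batch to the SAME device before the forward pass.
inputs = inputs.to(device)
outputs = model(inputs)

print(outputs.shape)                                      # torch.Size([4, 1])
print(outputs.device == next(model.parameters()).device)  # True
```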