PyTorch Basics Explained: Tensors, Autograd, and Real Model Training
Every time Netflix recommends a show you actually want to watch, or your phone unlocks from your face, a neural network trained by a framework like PyTorch is behind it. PyTorch has become the dominant choice in academic research and is rapidly closing the gap in production systems — not because it's magic, but because it thinks the same way a developer does: imperative, debuggable, and Pythonic. Understanding its foundations means you can read any ML paper's code, contribute to AI projects, and stop copy-pasting model architectures you don't understand.
The core problem PyTorch solves is bridging the gap between 'I have an idea for a model' and 'I have a working, trained model.' A library like NumPy can store and transform data, but it can't automatically track how a change in one number ripples through a thousand operations to affect a final error score. PyTorch does this invisibly with its autograd engine, turning what would be weeks of manual calculus into a few lines of code.
By the end of this article you'll understand what tensors actually are and why they're not just fancy arrays, how PyTorch's autograd eliminates manual gradient calculation, and how to wire a complete training loop from scratch — the kind of loop that sits inside every PyTorch model in the wild. You'll walk away with working code and the mental model to extend it.
Tensors: The DNA of Every PyTorch Model
A tensor is PyTorch's fundamental data container — think of it as a NumPy array that can live on a GPU and remember every operation ever performed on it. A 1D tensor is a list of numbers (a vector), a 2D tensor is a table (a matrix), and a 3D tensor might be a batch of images where the three dimensions are height, width, and colour channel.
What makes tensors special isn't the shape — it's the metadata they carry. Every tensor knows its data type (dtype), its device (CPU or CUDA GPU), and optionally whether it should track gradients. That last flag is what separates a plain number-holder from a value that participates in learning.
You'll reach for torch.tensor() when you're converting existing Python data, torch.zeros() or torch.ones() when initialising weights, and torch.randn() for random initialisation with a standard normal distribution. The device placement decision — CPU vs GPU — happens at creation time, and moving data between devices is explicit, not automatic. That explicitness is a feature; it forces you to think about where computation happens, which is critical for performance.
```python
import torch

# --- Creating tensors from real data ---
# Simulating a tiny dataset: 4 house sizes (sq ft) and their prices ($k)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])   # 1D tensor, shape (4,)
house_prices = torch.tensor([150.0, 220.0, 310.0, 410.0])     # 1D tensor, shape (4,)

print("Sizes tensor:", house_sizes)
print("Shape:", house_sizes.shape)        # torch.Size([4])
print("Data type:", house_sizes.dtype)    # torch.float32 — default for floats

# --- 2D tensor: batch of data (rows = samples, cols = features) ---
feature_matrix = torch.tensor([
    [750.0, 3.0, 1.0],    # size, bedrooms, bathrooms
    [1200.0, 4.0, 2.0],
    [1800.0, 4.0, 3.0],
    [2400.0, 5.0, 3.0],
])
print("\nFeature matrix shape:", feature_matrix.shape)  # torch.Size([4, 3])

# --- Device awareness: check and move to GPU if available ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("\nUsing device:", device)

# Move tensor to the target device — always do this before operations
feature_matrix = feature_matrix.to(device)
print("Feature matrix device:", feature_matrix.device)

# --- Useful tensor operations ---
# Normalise features: (x - mean) / std — critical for stable training
means = feature_matrix.mean(dim=0)   # mean over rows, one value per column
stds = feature_matrix.std(dim=0)     # std over rows, one value per column
normalised = (feature_matrix - means) / stds
print("\nNormalised features (first row):", normalised[0])

# --- requires_grad: opting a tensor INTO gradient tracking ---
# We do NOT set this on input data — only on learnable parameters
weight = torch.tensor([0.15], requires_grad=True)   # our model's single weight
bias = torch.tensor([10.0], requires_grad=True)     # our model's bias term
print("\nWeight requires grad:", weight.requires_grad)          # True
print("House sizes requires grad:", house_sizes.requires_grad)  # False — data, not a parameter
```
```
Sizes tensor: tensor([ 750., 1200., 1800., 2400.])
Shape: torch.Size([4])
Data type: torch.float32

Feature matrix shape: torch.Size([4, 3])

Using device: cpu
Feature matrix device: cpu

Normalised features (first row): tensor([-1.0967, -1.2247, -1.3056])

Weight requires grad: True
House sizes requires grad: False
```
Autograd: How PyTorch Learns Without You Doing Calculus
Autograd is the reason PyTorch feels almost magical. Every time you perform an operation on a tensor that has requires_grad=True, PyTorch silently builds a computation graph — a record of every step taken to produce the final result. When you call .backward() on a scalar output (usually a loss value), PyTorch traverses that graph in reverse and computes the gradient of that output with respect to every participating tensor.
In plain English: you define the forward pass (what your model predicts), compute how wrong it was (the loss), call .backward(), and PyTorch fills in .grad on every learnable parameter telling you 'if you nudge this value slightly, here's how much the loss would change.' You then use that information to nudge every parameter in the right direction. That nudge is gradient descent.
Three rules to remember: (1) .backward() can only be called on a scalar tensor — if your loss is multi-element, call .mean() or .sum() first. (2) Gradients accumulate by default — call optimizer.zero_grad() before each backward pass or they'll pile up across batches. (3) During inference (not training), wrap code in torch.no_grad() to skip graph construction entirely, which is faster and uses less memory.
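A minimal standalone sketch of all three rules (independent of the house-price example that follows):

```python
import torch

# Rule 1: .backward() needs a scalar, so reduce multi-element results first
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).mean()       # scalar: mean of [1, 4, 9]
loss.backward()              # d(mean(x^2))/dx_i = 2*x_i / 3
print(x.grad)                # tensor([0.6667, 1.3333, 2.0000])

# Rule 2: gradients ACCUMULATE, so a second backward pass adds to .grad
loss2 = (x ** 2).mean()
loss2.backward()
print(x.grad)                # doubled: tensor([1.3333, 2.6667, 4.0000])
x.grad.zero_()               # manual reset (optimizer.zero_grad() does this for you)

# Rule 3: no_grad() skips graph construction entirely
with torch.no_grad():
    y = x * 2
print(y.requires_grad)       # False: no graph was recorded
```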
```python
import torch

# Seed for reproducibility — always set this in experiments
torch.manual_seed(42)

# --- Toy linear regression: predict house price from size ---
# Ground truth: price = 0.18 * size + 5 (we want the model to discover this)
house_sizes = torch.tensor([750.0, 1200.0, 1800.0, 2400.0])
house_prices = torch.tensor([140.0, 221.0, 329.0, 437.0])

# Learnable parameters — these are the knobs PyTorch will tune
weight = torch.tensor([0.01], requires_grad=True)   # starts as a bad guess
bias = torch.tensor([0.01], requires_grad=True)

learning_rate = 0.0000001   # tiny LR because our inputs are large (hundreds to thousands)

for epoch in range(6):
    # FORWARD PASS: compute predictions using current weight and bias
    predicted_prices = weight * house_sizes + bias   # broadcasting: applies to all 4 houses

    # COMPUTE LOSS: Mean Squared Error — how wrong are we on average?
    loss = ((predicted_prices - house_prices) ** 2).mean()

    # BACKWARD PASS: autograd fills .grad on weight and bias
    # Must zero gradients first — they accumulate by default!
    if weight.grad is not None:
        weight.grad.zero_()
        bias.grad.zero_()
    loss.backward()   # compute d(loss)/d(weight) and d(loss)/d(bias)

    # PARAMETER UPDATE: nudge weight and bias toward lower loss
    # torch.no_grad() because we don't want this update tracked in the graph
    with torch.no_grad():
        weight -= learning_rate * weight.grad   # gradient descent step
        bias -= learning_rate * bias.grad

    print(f"Epoch {epoch+1:2d} | Loss: {loss.item():.2f} | "
          f"weight: {weight.item():.5f} | bias: {bias.item():.5f}")

print("\nFinal model: price =", round(weight.item(), 4), "* size +", round(bias.item(), 4))
print("Target model: price = 0.18 * size + 5")

# Inference — no gradient tracking needed, saves memory
with torch.no_grad():
    test_size = torch.tensor([2000.0])
    predicted = weight * test_size + bias
    print(f"\nPredicted price for 2000 sq ft: ${predicted.item():.1f}k")
```
```
Epoch  2 | Loss: 65099.14 | weight: 0.08349 | bias: 0.01006
Epoch  3 | Loss: 54310.95 | weight: 0.11626 | bias: 0.01008
Epoch  4 | Loss: 45330.67 | weight: 0.14673 | bias: 0.01011
Epoch  5 | Loss: 37842.37 | weight: 0.17510 | bias: 0.01013
Epoch  6 | Loss: 31569.55 | weight: 0.20153 | bias: 0.01015

Final model: price = 0.2015 * size + 0.0102
Target model: price = 0.18 * size + 5

Predicted price for 2000 sq ft: $403.1k
```
Building a Real Training Loop with nn.Module
Writing raw tensor operations gets unwieldy fast. PyTorch's nn.Module is the standard way to define any model — from a one-layer linear regression to a 70-billion-parameter language model. Every nn.Module subclass does two things: defines learnable parameters inside __init__, and defines the forward computation inside forward().
The beauty of nn.Module is composability. A large model is just nn.Module instances containing other nn.Module instances. When you call model.parameters(), PyTorch recursively collects every learnable parameter in the entire tree. That's what you hand to the optimizer.
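As a small illustration of that recursive collection (the `Block` and `Tower` module names are invented for this sketch):

```python
import torch
import torch.nn as nn

class Block(nn.Module):                  # a reusable sub-module
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)     # 16 weights + 4 biases = 20 params

    def forward(self, x):
        return torch.relu(self.layer(x))

class Tower(nn.Module):                  # a parent module containing Blocks
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(Block(), Block())  # modules inside modules
        self.head = nn.Linear(4, 1)      # 4 weights + 1 bias = 5 params

    def forward(self, x):
        return self.head(self.blocks(x))

model = Tower()
# parameters() walks the whole tree: 2 * 20 + 5 = 45 learnable values
total = sum(p.numel() for p in model.parameters())
print(total)  # 45
```

Handing `model.parameters()` to an optimizer therefore covers every nested layer automatically, no matter how deep the tree goes.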
The training loop is the heartbeat of all ML work in PyTorch. It's always the same five steps: zero gradients, forward pass, compute loss, backward pass, optimizer step. Internalise that sequence and you can adapt any paper's training code. The validation loop is almost identical but wrapped in torch.no_grad() and with model.eval() called first — that disables Dropout and fixes BatchNorm statistics so you get deterministic, representative predictions.
```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)

# --- Dataset: Boston-style house price prediction (synthetic) ---
# 100 samples, 3 features: size(sqft), bedrooms, age(years)
num_samples = 100
num_features = 3

# Generate synthetic features and a linear target with some noise
raw_features = torch.randn(num_samples, num_features)
true_weights = torch.tensor([0.5, 0.3, -0.2])   # size helps, age hurts price
target_prices = raw_features @ true_weights + 0.1 * torch.randn(num_samples)

# Train / validation split: 80 / 20
train_size = int(0.8 * num_samples)
train_features, val_features = raw_features[:train_size], raw_features[train_size:]
train_targets, val_targets = target_prices[:train_size], target_prices[train_size:]

# --- Model definition ---
class HousePriceNet(nn.Module):
    def __init__(self, input_features: int):
        super().__init__()   # always call parent __init__
        # nn.Sequential stacks layers; no need to write forward() manually for simple nets
        self.network = nn.Sequential(
            nn.Linear(input_features, 16),  # input layer -> hidden layer (16 neurons)
            nn.ReLU(),                      # activation: turn negatives to zero
            nn.Linear(16, 8),               # hidden -> smaller hidden
            nn.ReLU(),
            nn.Linear(8, 1),                # final layer: output one price prediction
        )

    def forward(self, feature_batch: torch.Tensor) -> torch.Tensor:
        # Squeeze removes the trailing dimension: (batch, 1) -> (batch,)
        return self.network(feature_batch).squeeze(1)

model = HousePriceNet(input_features=num_features)
loss_fn = nn.MSELoss()                               # Mean Squared Error for regression
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # Adam adapts the learning rate per param

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model)
print()

# --- Training loop ---
for epoch in range(1, 51):
    model.train()   # sets model to training mode (enables Dropout, BatchNorm updates)

    # Step 1: Zero out stale gradients from the previous iteration
    optimizer.zero_grad()

    # Step 2: Forward pass — get predictions for the whole training set
    train_predictions = model(train_features)

    # Step 3: Compute how wrong we are
    train_loss = loss_fn(train_predictions, train_targets)

    # Step 4: Backprop — autograd computes all gradients
    train_loss.backward()

    # Step 5: Optimizer updates every parameter using its stored gradient
    optimizer.step()

    # --- Validation (every 10 epochs) ---
    if epoch % 10 == 0:
        model.eval()   # disables Dropout, fixes BatchNorm — crucial for fair evaluation
        with torch.no_grad():   # no graph needed; saves memory
            val_predictions = model(val_features)
            val_loss = loss_fn(val_predictions, val_targets)
        print(f"Epoch {epoch:3d} | Train Loss: {train_loss.item():.4f} | "
              f"Val Loss: {val_loss.item():.4f}")

# --- Single inference example ---
model.eval()
with torch.no_grad():
    new_house = torch.tensor([[1.2, 0.5, -0.8]])   # normalised features
    price_pred = model(new_house)

print(f"\nPredicted price index for new house: {price_pred.item():.4f}")
print(f"True expected value (approx): {(new_house[0] @ true_weights).item():.4f}")
```
```
Model parameters: 209
HousePriceNet(
  (network): Sequential(
    (0): Linear(in_features=3, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=8, bias=True)
    (3): ReLU()
    (4): Linear(in_features=8, out_features=1, bias=True)
  )
)

Epoch  10 | Train Loss: 0.2431 | Val Loss: 0.3102
Epoch  20 | Train Loss: 0.1187 | Val Loss: 0.1834
Epoch  30 | Train Loss: 0.0743 | Val Loss: 0.1214
Epoch  40 | Train Loss: 0.0521 | Val Loss: 0.0987
Epoch  50 | Train Loss: 0.0389 | Val Loss: 0.0812

Predicted price index for new house: 0.6821
True expected value (approx): 0.9100
```
| Feature / Aspect | PyTorch (Dynamic Graph) | TensorFlow 1.x (Static Graph) |
|---|---|---|
| Graph construction | Built at runtime per forward pass — debug like normal Python | Pre-compiled before any data flows through — hard to inspect |
| Debugging | Standard Python debugger / print() works anywhere | Needed special tf.Print ops; graph errors were cryptic |
| Research flexibility | Change architecture mid-loop trivially | Required rebuilding and recompiling the graph |
| Production deployment | TorchScript / ONNX export needed for optimised serving | SavedModel format was natively optimised for TF Serving |
| Community adoption | Dominant in research papers (>70% of ML papers in 2023) | Strong in enterprise / legacy production systems |
| GPU memory control | Explicit .to(device) — you decide what moves | Automatic placement with manual overrides via tf.device() |
| Gradient control | requires_grad per tensor; no_grad context manager | GradientTape context manager in TF 2.x (similar concept) |
🎯 Key Takeaways
- Tensors carry three critical properties beyond their values: dtype, device, and requires_grad — getting any one of these wrong silently breaks training.
- Autograd doesn't run continuously; it only records a computation graph when requires_grad=True tensors are involved, and only computes gradients when you explicitly call .backward() on a scalar loss.
- The five-step training loop (zero_grad → forward → loss → backward → step) is the universal skeleton of every PyTorch model — memorise the order and you can read any codebase.
- model.train() and model.eval() control layer behaviour like Dropout, NOT gradient computation — that's torch.no_grad()'s job. Confusing these two is a top-tier interview gotcha.
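To make that last distinction concrete, here is a small sketch with a single Dropout layer (the tensor size and drop probability are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()                 # training mode: roughly half the values are zeroed,
out_train = drop(x)          # survivors are scaled by 1/(1-p) = 2.0
print(out_train)             # a random mix of 0.0 and 2.0 values

drop.eval()                  # eval mode: Dropout becomes a no-op
out_eval = drop(x)
print(out_eval)              # tensor([1., 1., 1., 1., 1., 1., 1., 1.])

# Note: eval() did NOT stop gradient tracking; that is no_grad()'s job
x2 = torch.ones(3, requires_grad=True)
print(drop(x2).requires_grad)  # True even in eval mode
```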
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Forgetting `optimizer.zero_grad()` before `loss.backward()` — gradients accumulate across calls, so your model sees a sum of all past gradients instead of just the current batch's. The symptom is exploding loss values or wildly oscillating training curves. Fix: make `optimizer.zero_grad()` the first line inside every training loop iteration — treat it like a reflex.
- ✕ Mistake 2: Calling `model.forward(x)` directly instead of `model(x)` — it works, but it bypasses all registered forward hooks (used by profilers, debuggers, and libraries like torchvision). Always call the model as a callable: `predictions = model(input_tensor)`. The `__call__` method is what wires the hooks together.
- ✕ Mistake 3: Keeping the computation graph alive by storing loss tensors in a list — `losses.append(loss)` inside a training loop holds a reference to the entire graph for every iteration, causing memory to grow until you hit an OOM crash. Fix: always detach the scalar value before storing: `losses.append(loss.item())`. The `.item()` call extracts a plain Python float and releases the graph.
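Mistakes 1 and 3 are easy to demonstrate in miniature (a toy one-parameter sketch, not production code):

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])

# Mistake 1 in miniature: skipping zero_grad makes gradients pile up
for step in range(3):
    loss = (w * x).sum()     # d(loss)/dw = 2 every time
    loss.backward()
    print(w.grad.item())     # 2.0, 4.0, 6.0 — accumulating, not replacing

w.grad.zero_()               # the fix: reset before the next backward pass

# Mistake 3 in miniature: store the plain float, not the loss tensor
losses = []
for step in range(3):
    loss = (w * x).sum()
    if w.grad is not None:
        w.grad.zero_()       # keep gradients fresh each iteration
    loss.backward()
    losses.append(loss.item())  # a Python float; the graph can be freed
print(losses)                   # [2.0, 2.0, 2.0]
```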
Interview Questions on This Topic
- Q: What is the difference between a PyTorch tensor with requires_grad=True and one without it — and why would you ever set requires_grad=False on a parameter intentionally?
- Q: Walk me through the exact sequence of operations in a PyTorch training loop and explain what would go wrong if you skipped any single step.
- Q: If your validation loss is wildly inconsistent between epochs even though your training loss is smooth and decreasing, what is the most likely PyTorch-specific cause and how do you confirm it?
Frequently Asked Questions
What is the difference between PyTorch and NumPy?
NumPy arrays live only on the CPU and have no concept of gradients or automatic differentiation. PyTorch tensors can live on a GPU (dramatically accelerating matrix operations), and tensors with requires_grad=True automatically track every operation for backpropagation. For pure numerical computing with no ML, NumPy is lighter; the moment you need learning, use PyTorch.
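The two libraries also interoperate directly; a quick sketch of the bridge (note the shared memory, which often surprises people):

```python
import numpy as np
import torch

np_array = np.array([1.0, 2.0, 3.0])

# NumPy -> PyTorch: from_numpy SHARES memory with the source array
tensor = torch.from_numpy(np_array)
np_array[0] = 99.0
print(tensor[0].item())      # 99.0: same underlying buffer

# PyTorch -> NumPy: .numpy() also shares memory (CPU tensors only)
back = tensor.numpy()
print(back[1])               # 2.0: still the same data

# Tensors that track gradients must be detached first;
# calling .numpy() directly on them raises a RuntimeError
w = torch.tensor([1.0], requires_grad=True)
w_np = w.detach().numpy()
```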
When should I use torch.no_grad()?
Any time you're running the model but not updating its weights — validation, evaluation, or production inference. Without it, PyTorch still builds the computation graph for every forward pass, wasting memory and time. Wrapping inference code in with torch.no_grad(): is not optional in production; it's a correctness and performance requirement.
Why does my PyTorch model train fine on CPU but crash on GPU?
The most common cause is a device mismatch — your model is on the GPU but your input tensors are still on the CPU, or vice versa. Every tensor involved in a single operation must be on the same device. Fix by calling input_tensor = input_tensor.to(device) in your data loading step, and ensure device matches the one you passed to model.to(device).
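A minimal pattern that avoids the mismatch (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Pick the device once, then route EVERYTHING through it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(3, 1).to(device)   # parameters move to the device
inputs = torch.randn(4, 3)           # created on CPU by default

# Wrong on a GPU machine: model(inputs) would raise a device-mismatch error.
# Right: move the batch to the SAME device before the forward pass.
inputs = inputs.to(device)
outputs = model(inputs)

print(outputs.shape)                                      # torch.Size([4, 1])
print(outputs.device == next(model.parameters()).device)  # True
```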