Senior 11 min · March 09, 2026

PyTorch Tensors — Silent CPU Fallback Kills GPU Utilization

Training slows 8x when a broad except block catches a device mismatch error and silently falls back to CPU.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

Production tested · Last updated May 23, 2026 · 1,554 articles, all by Naren
Quick Answer
  • Tensors are GPU-accelerated multidimensional arrays — the universal data structure for all PyTorch operations
  • They mirror NumPy's API but add CUDA support and automatic differentiation via Autograd
  • requires_grad=True enables gradient tracking — the tensor records every operation for backpropagation
  • Moving tensors to GPU with .to('cuda') can provide a 10-100x speedup for large matrix operations, depending on hardware and tensor size
  • Device mismatch (CPU tensor + GPU tensor in the same operation) is the #1 production RuntimeError — always check .device
  • Small tensors incur more transfer overhead than they save — only move to GPU when the computation justifies it
Plain-English First

Imagine a standard spreadsheet. A single number is a scalar, a single row is a vector, and the full grid of rows and columns is a matrix. A tensor is that same idea extended to any number of dimensions: a cube of numbers, a four-dimensional hypercube, whatever the problem requires. What makes PyTorch tensors special is not the shape. It is two things layered on top. First, they can live on a GPU and run thousands of operations in parallel instead of one at a time on a CPU. Second, when gradient tracking is switched on, they remember every mathematical operation applied to them. When you eventually ask 'how should I change these numbers to reduce the error?', the tensor can trace every step backwards and give you the exact answer automatically, without you writing a single line of calculus.

PyTorch Tensors are the fundamental data structure in PyTorch — every input, weight, gradient, and output is a tensor. They are multidimensional arrays that mirror NumPy's API but add two capabilities that NumPy does not have: GPU acceleration via CUDA and automatic differentiation via Autograd.

The key design decision: tensors are not just data containers. When requires_grad=True, they become nodes in a dynamic computation graph. Every operation on them is recorded as the forward pass executes, enabling automatic gradient computation when you call .backward(). This is what makes neural network training tractable — without it, you would manually compute partial derivatives for every parameter on every update, which is not realistic at any modern model size.

The production failure pattern: device mismatch. A tensor on CPU cannot participate in the same operation as a tensor on GPU. PyTorch raises RuntimeError: Expected all tensors to be on the same device immediately and clearly. What is not clear is which tensor is on the wrong device. The fix is always the same: establish device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') once, pass it to every tensor creation call, and add an assertion at training start that verifies model parameters and input data are on the same device.
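The startup assertion described above can be sketched as a small helper that compares the device of every model parameter with the device of an input batch. The helper names `pick_device` and `assert_same_device` and the toy `nn.Linear` model are illustrative assumptions, not part of the PyTorch API.

```python
import torch
from torch import nn


def pick_device() -> torch.device:
    """Pick CUDA first, then MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def assert_same_device(model: nn.Module, batch: torch.Tensor) -> None:
    """Hypothetical helper: fail fast if any parameter and the batch disagree on device."""
    for name, param in model.named_parameters():
        if param.device != batch.device:
            raise AssertionError(
                f"parameter {name!r} is on {param.device}, batch is on {batch.device}"
            )


device = pick_device()
model = nn.Linear(4, 2).to(device)        # toy model standing in for a real network
batch = torch.randn(8, 4, device=device)  # created directly on the target device

assert_same_device(model, batch)          # passes silently when everything matches
print(f"model and batch are both on {batch.device}")
```

Calling the helper once before the training loop names the offending parameter and both devices, which is the information the RuntimeError itself leaves out.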

As of 2026, there is a third device option worth knowing: Apple Silicon. PyTorch supports MPS (Metal Performance Shaders) on M-series Macs via device='mps', which provides GPU acceleration on MacBooks without CUDA. The same .to(device) pattern applies: the device string changes, the code does not.

What Is a PyTorch Tensor and Why Does It Exist?

A PyTorch tensor is a multidimensional array that was designed to solve a problem NumPy cannot: running massive parallel computations on a GPU while simultaneously tracking every operation for automatic gradient computation.

NumPy arrays are excellent for scientific computing — fast, well-documented, universally supported. But they have two hard limits. First, they run only on CPU. Second, they have no concept of a computation graph. This means that if you want to train a neural network with NumPy, you implement backpropagation manually — computing partial derivatives by hand for every layer, every parameter, every batch. That is tractable for a two-layer toy network and completely unworkable for anything beyond it.

PyTorch tensors solve both problems. The storage layer underneath a tensor can live on CPU, on an NVIDIA GPU via CUDA, or on Apple Silicon via MPS. When the storage is on an NVIDIA GPU, matrix operations dispatch to CUDA kernels that execute in parallel across thousands of GPU cores, which is why large matrix multiplications can be 10–100x faster than on CPU at the sizes neural networks operate on. When requires_grad=True, the tensor records every operation as part of a dynamic computation graph. When .backward() is called on the loss, that graph is traversed in reverse and .grad is filled in on every participating leaf tensor via the chain rule: one backward pass, all gradients computed together.
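A minimal sketch of the record-then-backward behaviour described above, using made-up values for `w`, `b`, and `x`. It runs on CPU so it does not depend on any particular hardware.

```python
import torch

# Two leaf tensors that opt in to gradient tracking
w = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])  # input data: no gradient tracking

# Forward pass: each operation is recorded in the graph as it executes
pred = (w * x).sum() + b
loss = (pred - 1.0) ** 2

# Backward pass: one call fills .grad on every participating leaf tensor
loss.backward()

print(w.grad)  # d(loss)/dw = 2 * (pred - 1) * x
print(b.grad)  # d(loss)/db = 2 * (pred - 1)
print(x.grad)  # None: x never asked for gradients
```

The comments give the analytic derivatives from the chain rule, and the values autograd stores in `.grad` match them.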

The architectural detail worth understanding: a tensor is a view into a storage object. The tensor knows its shape, stride, dtype, and device. The storage holds the raw bytes. Operations like .transpose() and .permute() create new tensor views with reordered strides without moving any data in memory, so the storage stays identical. This is efficient but has a consequence: the resulting tensor is non-contiguous, and .view() will refuse to work on it when the requested shape cannot be expressed with the existing strides. The fix is .contiguous(), which copies the data into a new storage with the expected memory order, or .reshape(), which copies only when it has to.
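The stride behaviour above can be observed directly. This is a small sketch on a 3x4 tensor; the variable names are arbitrary.

```python
import torch

t = torch.arange(12, dtype=torch.float32).reshape(3, 4)
tt = t.transpose(0, 1)                  # shape (4, 3), same storage

print(t.stride(), t.is_contiguous())    # (4, 1) True
print(tt.stride(), tt.is_contiguous())  # (1, 4) False
print(t.data_ptr() == tt.data_ptr())    # True: no bytes were moved

try:
    tt.view(12)                         # these strides cannot describe a flat layout
except RuntimeError as err:
    print(f"view failed: {type(err).__name__}")

flat = tt.contiguous().view(12)         # copy into row-major order, then view
same = tt.reshape(12)                   # reshape copies only when it has to
print(torch.equal(flat, same))          # True
```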

PyTorch 2.x adds one more layer: compilation. torch.compile traces your forward pass and compiles it into optimised kernels using TorchInductor. The tensor API is identical: you wrap the model or decorate the function once, and the same tensor operations run in a fused, optimised form. For production inference workloads in 2026, torch.compile is often one of the highest-leverage single changes you can make to a trained model, but benchmark it on your own architecture first.

io/thecodeforge/ml/forge_tensor_basics.py (Python)
import torch

# Device selection — this pattern belongs at the top of every script
# Supports CUDA (NVIDIA), MPS (Apple Silicon), and CPU fallback
if torch.cuda.is_available():
    device = torch.device("cuda")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon GPU — supported from PyTorch 1.12+
else:
    device = torch.device("cpu")

print(f"Using device: {device}")


def initialize_forge_tensors():
    # Creating tensors from Python data — torch.tensor (lowercase) infers dtype
    data = [[1, 2], [3, 4]]
    x_data = torch.tensor(data, dtype=torch.float32, device=device)
    print(f"From list — shape: {x_data.shape}, device: {x_data.device}, dtype: {x_data.dtype}")

    # Creating tensors with specific values — pass device at creation, not after
    x_ones  = torch.ones(2, 3, dtype=torch.float32, device=device)
    x_zeros = torch.zeros(2, 3, dtype=torch.float32, device=device)
    x_rand  = torch.randn(2, 3, device=device)  # standard normal distribution
    print(f"\nones:\n{x_ones}")
    print(f"zeros:\n{x_zeros}")
    print(f"randn:\n{x_rand}")

    # requires_grad=True — opts this tensor into the computation graph
    # Only set this on learnable parameters, never on input data
    x_grad = torch.randn(3, 3, requires_grad=True, device=device)
    print(f"\nGradient tracking enabled: {x_grad.requires_grad}")
    print(f"grad_fn before operation: {x_grad.grad_fn}")  # None — leaf tensor

    # Any operation on a requires_grad tensor produces a non-leaf tensor with grad_fn
    y = x_grad ** 2  # element-wise square
    print(f"grad_fn after operation:  {y.grad_fn}")  # PowBackward0

    return x_data, x_grad


x_data, x_grad = initialize_forge_tensors()

# Tensor metadata inspection — useful diagnostic at the start of debugging
for name, t in [("x_data", x_data), ("x_grad", x_grad)]:
    print(f"{name}: shape={t.shape}, dtype={t.dtype}, device={t.device}, requires_grad={t.requires_grad}")
Output
Using device: cuda
From list — shape: torch.Size([2, 2]), device: cuda:0, dtype: torch.float32

ones:
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
zeros:
tensor([[0., 0., 0.],
        [0., 0., 0.]], device='cuda:0')
randn:
tensor([[ 0.3152, -1.2089,  0.7741],
        [-0.4156,  0.9823, -0.1205]], device='cuda:0')

Gradient tracking enabled: True
grad_fn before operation: None
grad_fn after operation: <PowBackward0 object at 0x7f3a2c1d4b50>
x_data: shape=torch.Size([2, 2]), dtype=torch.float32, device=cuda:0, requires_grad=False
x_grad: shape=torch.Size([3, 3]), dtype=torch.float32, device=cuda:0, requires_grad=True
The Tensor Mental Model
  • Tensor = multidimensional array with device awareness — the same API whether the storage is on CPU, CUDA, or MPS
  • NumPy-like API but with GPU dispatch and Autograd built in — torch.from_numpy() converts with zero copy
  • requires_grad=True builds a computation graph as operations execute — .backward() traverses it in reverse to compute all gradients
  • GPU tensors use CUDA kernels for massively parallel execution — the speedup is real only for large tensors; small ones have more transfer overhead than benefit
  • Tensors are views with shape and stride — .transpose() and .permute() reorder strides without moving data; call .contiguous() before .view() if the tensor is non-contiguous
Production Insight
Tensors default to CPU at creation — always pass device=device explicitly to every creation call rather than creating on CPU and moving afterward.
requires_grad=True enables Autograd — without it the tensor is a static array with no gradient path and no learning.
As of PyTorch 2.x, wrapping your model with torch.compile compiles tensor operations into fused CUDA kernels — benchmark it on your architecture before shipping to production.
Rule: set device once at the top of the script, pass it everywhere, log it at startup, and assert it before the training loop.
Key Takeaway
Tensors are GPU-aware multidimensional arrays — the universal data structure in PyTorch. requires_grad=True opts the tensor into Autograd and every subsequent operation is recorded for the backward pass. Always pass device=device to tensor creation — tensors default to CPU and PyTorch will never move them automatically.
Tensor Creation Decision
If: Creating input data for model training
Use: torch.tensor(data, dtype=torch.float32, device=device, requires_grad=False), with explicit dtype and device and no gradient tracking on inputs
If: Creating trainable model parameters
Use: nn.Parameter(torch.randn(..., device=device)), which automatically sets requires_grad=True and registers the parameter with the module
If: Converting a NumPy array to a tensor
Use: torch.from_numpy(arr) for zero-copy on CPU, then .to(device) to move to GPU, or torch.tensor(arr, device=device) if you need a copy
If: Creating a tensor for inference only
Use: Create without requires_grad and wrap the forward pass in torch.inference_mode(), which is faster than torch.no_grad() and prevents the tensor from being used in a backward pass
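The decision table can be exercised in a few lines. `TinyScale` is a hypothetical one-parameter module invented for this sketch; the PyTorch calls themselves (`nn.Parameter`, `torch.inference_mode`, `Tensor.is_inference`) are standard.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class TinyScale(nn.Module):
    """Hypothetical one-parameter module used only for illustration."""

    def __init__(self) -> None:
        super().__init__()
        # Assigning an nn.Parameter as an attribute registers it with the module
        self.scale = nn.Parameter(torch.ones(3, device=device))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale


model = TinyScale()
inputs = torch.tensor([[1.0, 2.0, 3.0]], dtype=torch.float32, device=device)

print(inputs.requires_grad)                      # False: input data
print([n for n, _ in model.named_parameters()])  # ['scale']
print(model.scale.requires_grad)                 # True: set by nn.Parameter

with torch.inference_mode():
    out = model(inputs)

print(out.requires_grad)   # False: nothing was recorded
print(out.is_inference())  # True: this tensor cannot join a backward pass
```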
[Figure: PyTorch Tensor GPU Utilization Flow (thecodeforge.io). From SQL data to GPU tensors: pitfalls and fixes. Stages: SQL to Tensor Pipeline (enterprise data conversion to PyTorch tensors); Docker Environment (standardised setup for reproducibility); Pin Memory (allocate pinned memory for faster GPU transfer); Profile Tensor Memory (measure memory usage before deployment); GPU Tensor Operations (ensure tensors stay on GPU, avoid CPU fallback). Warning: silent CPU fallback kills GPU utilization; always check tensor device and use .to('cuda') explicitly.]

Enterprise Data Pipelines: SQL to Tensor Conversion

In production ML systems, training data rarely arrives as a Python list. It lives in a relational database — normalised, versioned, and filtered by business logic before it ever reaches a tensor. The conversion path from SQL to tensor has two meaningful implementation choices that trade memory efficiency against safety, and getting this wrong at scale causes OOM crashes or silent data corruption.

The recommended pipeline: query SQL into a Pandas DataFrame, or fetch rows with the database cursor's fetchall method and build a NumPy array from them. Then convert to a tensor using torch.from_numpy(arr) for zero-copy conversion: the tensor and the NumPy array share the same underlying memory, so no data is duplicated. Move to GPU with .to(device). This entire path keeps memory usage as low as possible for datasets that fit in RAM.

The danger with torch.from_numpy(): because the tensor and the NumPy array share memory, modifying the NumPy array after conversion will silently change the tensor's data. In a pipeline where the DataFrame is reused or mutated for other purposes, this can corrupt training data without any error. If the NumPy array may change after conversion, use torch.tensor(arr, device=device) instead — it creates an independent copy at the cost of a second allocation.
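The difference between the two conversion paths is easy to demonstrate on CPU. The array values below are arbitrary.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

shared = torch.from_numpy(arr)  # zero-copy: same memory as arr
copied = torch.tensor(arr)      # independent copy

arr[0, 0] = -99.0               # mutate the source array after conversion

print(shared[0, 0].item())      # -99.0: the tensor changed with the array
print(copied[0, 0].item())      # 1.0: the copy is unaffected
```

The sharing only applies while the tensor stays on CPU. Calling .to(device) with a GPU device copies the data, so mutations made to the array after that point no longer reach the GPU tensor.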

For datasets that do not fit in RAM, loading everything at once causes OOM before training starts. The correct pattern is a custom Dataset that queries or reads one sample (or one small chunk) at a time in __getitem__, combined with a DataLoader that batches the samples and parallelises the fetching. This keeps feature memory proportional to batch size regardless of dataset size.
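A minimal sketch of the lazy-loading pattern, using SQLite from the standard library so it runs anywhere. The class name `SqlRowDataset`, the throwaway database file, and the choice of primary-key lookups instead of OFFSET paging are assumptions made for illustration. The column names follow the query used in this article.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import torch
from torch.utils.data import DataLoader, Dataset


class SqlRowDataset(Dataset):
    """Hypothetical lazy dataset: features are fetched one row per __getitem__ call."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path
        self.conn = None  # opened on first use so each worker process gets its own
        with closing(sqlite3.connect(db_path)) as conn:
            rows = conn.execute(
                "SELECT sample_id FROM analytics_table ORDER BY sample_id"
            ).fetchall()
        self.ids = [row[0] for row in rows]  # keys only; features stay in the database

    def __len__(self) -> int:
        return len(self.ids)

    def __getitem__(self, index: int):
        if self.conn is None:
            self.conn = sqlite3.connect(self.db_path)
        a, b, label = self.conn.execute(
            "SELECT feature_a, feature_b, target_label "
            "FROM analytics_table WHERE sample_id = ?",
            (self.ids[index],),
        ).fetchone()
        features = torch.tensor([a, b], dtype=torch.float32)
        return features, torch.tensor(label, dtype=torch.long)


# Demo data: a throwaway SQLite file with 10 illustrative rows
db_path = os.path.join(tempfile.mkdtemp(), "features.db")
with closing(sqlite3.connect(db_path)) as conn:
    conn.execute(
        "CREATE TABLE analytics_table ("
        "sample_id INTEGER PRIMARY KEY, feature_a REAL, "
        "feature_b REAL, target_label INTEGER)"
    )
    conn.executemany(
        "INSERT INTO analytics_table VALUES (?, ?, ?, ?)",
        [(i, i * 0.1, i * 0.2, i % 2) for i in range(10)],
    )
    conn.commit()

dataset = SqlRowDataset(db_path)
loader = DataLoader(dataset, batch_size=4, shuffle=False)  # single process for the demo

for features, labels in loader:
    print(tuple(features.shape), tuple(labels.shape))
```

Only the list of sample IDs is held in memory. The features are read per sample and stacked into batches by the DataLoader. A production version would likely fetch several rows per query to reduce round trips.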

io/thecodeforge/queries/fetch_features.sql (SQL)
-- Extracting normalised features for tensor conversion
-- We pull only the columns needed for training — not SELECT *
-- The WHERE clause filters to verified samples only, matching the Dataset class expectation
-- LIMIT controls batch size for incremental loading patterns

SELECT
    feature_a,
    feature_b,
    target_label
FROM io.thecodeforge.analytics_table
WHERE status = 'processed'
  AND is_verified = TRUE
  AND split_tag = 'train'
ORDER BY sample_id ASC
LIMIT 10000;

-- For incremental loading in a custom Dataset.__getitem__:
-- Use OFFSET :offset LIMIT :batch_size with parameterised queries
-- Never load the full table in __init__ — keep memory proportional to batch size
Output
Returns a tabular result set ready for Pandas read or cursor fetchall, then torch.from_numpy() conversion.
SQL to Tensor: The Zero-Copy Path
When moving data from SQL, pull into a NumPy array via Pandas (df.values) or SQLAlchemy, then convert with torch.from_numpy(). This creates a tensor that shares the same underlying memory as the NumPy array — no data is duplicated. Move to GPU with .to(device) afterward. The caveat: because the tensor and the array share memory, any mutation of the NumPy array after conversion will silently change the tensor's data. If the array may be modified, use torch.tensor(arr, device=device) instead to get an independent copy.
Production Insight
torch.from_numpy() shares memory with the NumPy array — zero-copy and fast, but any subsequent mutation of the array silently corrupts the tensor.
torch.tensor(arr) creates an independent copy — safer for pipelines where the source array is reused or modified.
For datasets larger than available RAM, load in chunks inside a custom Dataset.__getitem__ rather than upfront in __init__ — this keeps memory proportional to batch size regardless of dataset size.
Rule: from_numpy() for read-only pipelines where you control the source array's lifetime; tensor() when in doubt.
Key Takeaway
torch.from_numpy() shares memory with the NumPy array — zero-copy but modifications to the array propagate silently to the tensor. torch.tensor() creates a copy — safer for production pipelines where the source data has a longer lifetime than the tensor. For large datasets, fetch in chunks — loading everything at once is an OOM crash waiting to happen.
SQL to Tensor Conversion Decision
If: Dataset fits in RAM and the NumPy array will not be modified after conversion
Use: torch.from_numpy(df.values).to(device) for zero-copy conversion and minimum memory usage
If: NumPy array may be modified or reused for other purposes after tensor creation
Use: torch.tensor(df.values, dtype=torch.float32, device=device), which creates an independent copy and prevents silent data corruption
If: Dataset is too large for RAM (more than ~1M rows or high-dimensional features)
Use: Implement a custom Dataset with lazy loading: fetch rows by offset in __getitem__ and let the DataLoader handle batching and parallelism

Standardising Environments with Docker

PyTorch's GPU support depends on a precise version compatibility chain: host NVIDIA driver → CUDA runtime → cuDNN → PyTorch. A mismatch at any point produces a silent failure — torch.cuda.is_available() returns False, tensors silently stay on CPU, and training runs at a fraction of expected speed with no error message to guide you. Docker solves this by fixing the entire stack in one image tag.

The compatibility rule: the CUDA version in the Docker base image must be less than or equal to the CUDA version supported by the host machine's NVIDIA driver. The driver's maximum supported CUDA version is shown in the top-right corner of nvidia-smi output. If the image requests a higher CUDA version than the driver supports, PyTorch loads but CUDA initialisation fails without raising an exception (at most a warning is logged), and torch.cuda.is_available() returns False. The fix is always to pick a base image whose CUDA version is at or below what nvidia-smi reports.

Two environment variables matter for GPU-enabled containers. NVIDIA_VISIBLE_DEVICES controls which physical GPUs the container can see — set it to all for training containers or to a specific index when you need to isolate workloads on a multi-GPU host. NVIDIA_DRIVER_CAPABILITIES tells the NVIDIA Container Toolkit which driver features to expose — compute gives you CUDA compute, utility gives you nvidia-smi inside the container. Both should be set in the Dockerfile rather than passed at runtime so they are reproducible.

The verification step that should run before every training job: add a startup script that calls torch.cuda.is_available(), prints the GPU name and total memory, and exits with a non-zero code if CUDA is not available when it is expected. This turns silent CPU fallback into an immediate and obvious failure that stops the job before it wastes hours of compute.
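A startup check along these lines could be sketched as follows. The function name `verify_cuda` and the `require_gpu` switch are assumptions for illustration. The function returns an exit code rather than exiting so it can be reused and tested.

```python
import sys

import torch


def verify_cuda(require_gpu: bool = True) -> int:
    """Return a process exit code: 0 if the environment is usable, 1 otherwise."""
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA build: {torch.version.cuda}")  # None on CPU-only builds
    available = torch.cuda.is_available()
    print(f"CUDA available: {available}")

    if available:
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}")
        print(f"Total memory: {props.total_memory / 1024**3:.1f} GiB")
        return 0

    if require_gpu:
        print("CUDA expected but not available; refusing to start.", file=sys.stderr)
        return 1
    return 0


# In a container entrypoint this would be: sys.exit(verify_cuda(require_gpu=True))
exit_code = verify_cuda(require_gpu=False)
print(f"exit code: {exit_code}")
```

Running this check when the container starts, before the training script, means a missing GPU stops the job in seconds instead of letting it run on CPU.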

Dockerfile (DOCKERFILE)
# Pin specific PyTorch and CUDA versions — never use 'latest' in production
# 'latest' changes silently and makes training runs non-reproducible
# Check pytorch.org for the full compatibility matrix before changing these
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime

WORKDIR /app

# Expose all GPUs to the container runtime
# NVIDIA_VISIBLE_DEVICES=all makes every physical GPU available
# NVIDIA_DRIVER_CAPABILITIES=compute,utility enables CUDA compute and nvidia-smi
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Prevent CUDA memory fragmentation on long training runs
# Without this, OOM errors can occur even when total free memory is sufficient
ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

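# Build-time report of the installed PyTorch and CUDA build.
# RUN executes during `docker build`, where GPUs are usually not visible,
# so this step only prints and never asserts CUDA availability.
RUN python -c "import torch; \
    print(f'PyTorch: {torch.__version__}'); \
    print(f'CUDA available: {torch.cuda.is_available()}'); \
    print(f'GPU: {torch.cuda.get_device_name(0)}' if torch.cuda.is_available() else 'GPU: none visible at build time'); \
    print(f'CUDA version: {torch.version.cuda}')"

# Run with: docker run --gpus all --shm-size=2g -v /data:/data thecodeforge/torch-runtime:2.3.1
# Startup verification: exits non-zero before training starts if CUDA is missing.
# Remove the check for CPU-only deployment targets.
CMD ["sh", "-c", "python -c 'import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)' && python ForgeTensorBasics.py"]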
Output
PyTorch: 2.3.1
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
CUDA version: 12.1
Successfully built image thecodeforge/torch-runtime:2.3.1-cuda12.1
CUDA Version Mismatch Causes Silent CPU Fallback
The CUDA version in the Docker base image must be less than or equal to the CUDA version the host NVIDIA driver supports. Run nvidia-smi on the host to see the maximum supported CUDA version in the top-right corner. A CUDA 12.1 image on a host with a driver that only supports CUDA 11.8 will start without error, load PyTorch, and silently return False from torch.cuda.is_available() — all tensors stay on CPU and training runs at 1/10th expected throughput. Add a startup verification step to your container that asserts CUDA availability and exits with a non-zero code if the assertion fails.
Production Insight
CUDA version in the image must be <= the host driver's maximum supported CUDA version — the right number is in the top-right corner of nvidia-smi output.
Mismatch causes torch.cuda.is_available() to return False silently — all tensors stay on CPU with no warning.
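Add an assert torch.cuda.is_available() check to the container's startup command rather than to a RUN build step: GPUs are normally not visible during docker build, so a build-time assertion would fail even on a correctly configured host. A startup check makes the job exit immediately on a misconfigured host rather than failing silently at training time.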
Rule: pin PyTorch and CUDA versions, set NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES, verify at container startup.
Key Takeaway
Docker fixes the CUDA compatibility chain in one image tag — use it. CUDA version in the image must be at or below what the host driver supports. Always pin both PyTorch and CUDA versions explicitly. Add a startup verification that asserts CUDA availability and fails loudly if it is missing — silent CPU fallback in a GPU training container is the most expensive misconfiguration in production ML.
Docker CUDA Version Selection
If: Host driver supports CUDA 12.x (nvidia-smi shows CUDA Version: 12.x)
Use: pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime, the image pinned in the Dockerfile above; check Docker Hub for newer tags before adopting it
If: Host driver supports CUDA 11.x only (nvidia-smi shows CUDA Version: 11.x)
Use: pytorch/pytorch:2.2.0-cuda11.8-cudnn8-runtime; do NOT use a CUDA 12 image on an 11.x driver
If: No GPU on the host (CPU-only deployment or CI environment)
Use: a CPU-only image such as pytorch/pytorch:2.3.1 (verify the exact tag on Docker Hub), which is several GB smaller and faster to pull and start

Common Mistakes and How to Avoid Them

Most tensor bugs in production trace back to a handful of patterns. Knowing them in advance is the difference between a 30-second fix and a four-hour debugging session.

Device mismatch is the most common runtime crash. The error message is unambiguous — RuntimeError: Expected all tensors to be on the same device — but identifying which tensor is on the wrong device requires checking .device on each one. The usual culprit is a target tensor or loss buffer created inside the training loop without .to(device), while the model output is correctly on GPU. The fix: establish device once and pass it to every tensor creation call in the loop, not just to the model.

torch.Tensor (capital T) versus torch.tensor (lowercase t) trips up developers coming from other frameworks. torch.Tensor is the tensor class. Calling its constructor, torch.Tensor([1, 2, 3]), always produces the default dtype (float32 unless changed with torch.set_default_dtype), performs no dtype inference, and offers no dtype argument. torch.tensor (lowercase) is the factory function: it infers dtype from the input, always creates a copy, accepts dtype, device, and requires_grad arguments, and is the correct way to create a tensor from data. In production code, always use torch.tensor().

In-place operations on gradient-tracked tensors can invalidate values that autograd saved for the backward pass. Calling a.add_(b) on a leaf tensor with requires_grad=True fails immediately with RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. On an intermediate tensor the call succeeds, but if autograd saved that tensor for the backward pass, .backward() later raises a RuntimeError reporting that a variable needed for gradient computation was modified by an inplace operation. Gradients go silently wrong mainly when the modification bypasses this tracking, for example by writing through .data. The rule: avoid trailing-underscore methods (add_, mul_, fill_, zero_) on any tensor with requires_grad=True.
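The two ways PyTorch reports an unsafe in-place operation can be reproduced in a short sketch. The flag names `leaf_rejected` and `backward_rejected` are illustrative only.

```python
import torch

# Case 1: in-place op on a leaf that requires grad is rejected immediately
leaf = torch.randn(3, requires_grad=True)
leaf_rejected = False
try:
    leaf.add_(1.0)
except RuntimeError as err:
    leaf_rejected = True
    print(f"leaf: {err}")

# Case 2: in-place op on an intermediate result that backward still needs
x = torch.randn(3, requires_grad=True)
y = x.exp()   # exp() saves its output to compute the gradient later
y.add_(1.0)   # allowed here, but it bumps y's internal version counter
backward_rejected = False
try:
    y.sum().backward()
except RuntimeError:
    backward_rejected = True
    print("backward: a tensor saved for backward was modified in place")

# Out-of-place version: a new tensor is created and the graph stays valid
x2 = torch.randn(3, requires_grad=True)
z = x2.exp() + 1.0
z.sum().backward()
print(x2.grad is not None)  # True
```

In both failing cases PyTorch raises rather than returning wrong gradients, which is why the out-of-place form is the safe default.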

Views versus copies is the final major source of confusion. .view(), slicing, and .transpose() all return views that share storage with the original tensor. Modifying a view modifies the original. .clone() creates an independent copy. If you need to modify a slice without affecting the source tensor — common when building augmented versions of a batch — always call .clone() first.

io/thecodeforge/ml/common_tensor_mistakes.py (Python)
import torch

# Device selection — establish once, pass everywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

# --- Mistake 1: Device mismatch ---
# WRONG: x is on GPU, y is on CPU — crashes on the addition
# x = torch.ones(5, device='cuda')
# y = torch.zeros(5)  # CPU by default
# z = x + y  # RuntimeError: Expected all tensors to be on the same device

# CORRECT: both created on the same device
x = torch.ones(5, device=device)
y = torch.zeros(5, device=device)
z = x + y
print(f"Device mismatch fixed: z = {z}")

# --- Mistake 2: torch.Tensor vs torch.tensor ---
# WRONG: torch.Tensor (class) — confusing, no dtype inference, no device argument
# confused = torch.Tensor([1, 2, 3])  # always float32, always CPU

# CORRECT: torch.tensor (factory function) — explicit dtype, device, requires_grad
clear = torch.tensor([1, 2, 3], dtype=torch.float32, device=device)
print(f"torch.tensor result: {clear}, device: {clear.device}")

# --- Mistake 3: In-place operation on a gradient-tracked tensor ---
a = torch.randn(2, 2, requires_grad=True)

# WRONG: in-place add destroys the value autograd needs for gradient computation
# a.add_(torch.ones(2, 2))  # RuntimeError: in-place modification of leaf variable

# CORRECT: create a new tensor — preserves the computation graph
b = a + torch.ones(2, 2)  # new tensor, a unchanged, graph intact
loss = b.sum()
loss.backward()  # gradients computed correctly
print(f"Gradient after correct operation: {a.grad}")

# --- Mistake 4: View vs clone confusion ---
original = torch.tensor([1.0, 2.0, 3.0, 4.0])
view     = original[:2]   # view — shares storage
copied   = original[:2].clone()  # independent copy

view[0] = 99.0   # modifies original too
print(f"After modifying view — original: {original}")  # [99., 2., 3., 4.]

copied[0] = 77.0  # does NOT modify original
print(f"After modifying clone — original: {original}")  # unchanged

# --- Mistake 5: Contiguity and .view() ---
t = torch.randn(3, 4)
t_transposed = t.T   # transpose — non-contiguous, strides reordered
# t_transposed.view(12)  # RuntimeError: view size not compatible with non-contiguous tensor
t_contiguous = t_transposed.contiguous()  # copies data into contiguous layout
reshaped = t_contiguous.view(12)          # works correctly
print(f"Contiguous reshape: {reshaped.shape}")
Output
Device: cuda
Device mismatch fixed: z = tensor([1., 1., 1., 1., 1.], device='cuda:0')
torch.tensor result: tensor([1., 2., 3.], device='cuda:0')
Gradient after correct operation: tensor([[1., 1.],
        [1., 1.]])
After modifying view — original: tensor([99., 2., 3., 4.])
After modifying clone — original: tensor([99., 2., 3., 4.])
Contiguous reshape: torch.Size([12])
When Tensors Are the Wrong Tool
The most expensive tensor mistake is not a bug — it is unnecessary complexity. If your computation does not need GPU acceleration or automatic differentiation, NumPy is simpler, has less overhead, and integrates more broadly with the scientific Python ecosystem. Only create a PyTorch tensor when you are feeding data into a model, running gradient-based optimisation, or need CUDA parallelism for a large matrix operation. For data preprocessing, statistics, and exploratory analysis, NumPy is usually the right choice and you can convert with torch.from_numpy() when the data reaches the model.
Production Insight
Device mismatch is the most common runtime crash — print .device on every tensor in the failing line before changing anything else.
In-place operations with trailing underscores (add_, mul_) on requires_grad tensors invalidate values autograd saved for the backward pass. PyTorch usually raises a RuntimeError, but writes made through .data can go undetected. Avoid them entirely on gradient-tracked tensors.
torch.tensor (lowercase) is the correct factory function for creating tensors from data. torch.Tensor (uppercase) is the class and should not appear in data creation code.
Views share storage — use .clone() before modifying any slice you do not want to affect the source tensor.
Key Takeaway
Device mismatch is the most common runtime error — always pass device=device at tensor creation and never mix CPU and GPU tensors in the same operation. In-place operations break autograd on gradient-tracked tensors. Use torch.tensor() (lowercase) for all data creation. Views share storage with the original — use .clone() when you need an independent copy.
Debugging Tensor Mistakes
If: RuntimeError: Expected all tensors to be on the same device
Use: Print .device on all tensors in the failing operation, then add .to(device) at the creation point of the mismatched tensor, not after the fact
If: RuntimeError: a leaf Variable that requires grad is being used in an in-place operation
Use: Replace a.add_(b) with a = a + b. The out-of-place version creates a new tensor and preserves the computation graph
If: .view() fails with RuntimeError about size not compatible with stride
Use: Call .contiguous() before .view(): the tensor is non-contiguous after .transpose() or .permute(). Or use .reshape(), which handles non-contiguous tensors automatically.
If: Modifying a slice changes the original tensor unexpectedly
Use: Use .clone() to create an independent copy before modification. Slices are views that share storage with the original
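Two of the fixes in this list can be reproduced on a tiny CPU tensor. The sketch below is illustrative only (the names `base`, `t`, and `row` are made up): it triggers the stride error on a transposed view, shows both workarounds, and confirms that a cloned slice leaves the source untouched.

```python
import torch

base = torch.arange(12.0).reshape(3, 4)

# transpose returns a non-contiguous view of the same storage.
t = base.t()
print(t.is_contiguous())            # False

try:
    t.view(12)                      # view needs compatible strides
except RuntimeError as err:
    print("view failed:", type(err).__name__)

flat_a = t.contiguous().view(12)    # copy into contiguous memory, then view
flat_b = t.reshape(12)              # reshape copies only when it has to
print(torch.equal(flat_a, flat_b))  # True

# Slices are views: clone before editing if the source must stay intact.
row = base[0].clone()
row[0] = 99.0
print(base[0, 0].item())            # 0.0
```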

Control Memory, Control Your Model: The Real Cost of Tensor Shapes

Your model doesn't crash because of a bug. It crashes because you ran out of VRAM at batch 47. Every senior engineer has been there. The fix isn't buying more GPUs; it's understanding how tensor shapes wreck your memory budget. A single (1024, 1024) float32 tensor costs 4 MB. Blow that up to (1024, 2048) and you're at 8 MB. That's fine. But chain a few of these in a transformer and suddenly you're holding 4 GB of intermediate activations, because autograd keeps each one alive until the backward pass. PyTorch's torch.cuda.max_memory_allocated() is your best friend. Call it after every forward pass during development. Watch for the silent killer: broadcasting. A (64, 1) tensor multiplied elementwise with a (1, 512) tensor produces a full (64, 512) output, even though neither input is that large. The inputs stay small; the result is materialised at the broadcast shape. The expensive version is accidental: subtract an (N, 1) target from an (N,) prediction and you get an (N, N) tensor instead of an error. Profile before you optimize. Guesswork is for people who enjoy swapping GPUs out of racks at 3 AM.

MemoryProfiler.py · PYTHON
# io.thecodeforge - ml-ai tutorial

import torch
import gc

def profile_tensor_memory():
    # Reset the peak memory tracker—don't let previous allocations lie to you
    torch.cuda.reset_peak_memory_stats()
    gc.collect()  # Force Python garbage collector before measurement

    # A typical hidden state in a transformer layer
    batch_size, seq_len, hidden_dim = 64, 128, 1024
    input_tensor = torch.randn(batch_size, seq_len, hidden_dim, device='cuda')

    # Broadcasted addition: bias stays (1, 1, hidden_dim) in memory;
    # only the result is materialised at the full (64, 128, 1024) shape
    bias = torch.randn(1, 1, hidden_dim, device='cuda')
    result = input_tensor + bias

    # Peak memory in MB after the operation
    peak_mb = torch.cuda.max_memory_allocated() / (1024 * 1024)
    print(f"Peak GPU memory after broadcasted add: {peak_mb:.2f} MB")

    # Same operation with an explicit expand. Free the first result so both
    # measurements start from the same baseline.
    del result
    torch.cuda.reset_peak_memory_stats()
    gc.collect()
    result_direct = input_tensor + bias.expand_as(input_tensor)  # expand_as returns a view, no copy
    peak_mb_explicit = torch.cuda.max_memory_allocated() / (1024 * 1024)
    print(f"Peak GPU memory with explicit expand: {peak_mb_explicit:.2f} MB")

profile_tensor_memory()
Output (approximate)
Peak GPU memory after broadcasted add: 64.00 MB
Peak GPU memory with explicit expand: 64.00 MB
# Expected: 32 MB input + 32 MB result in both cases; exact figures can vary slightly with the allocator. The explicit version costs nothing extra and documents the intended shape.
Production Trap: Silent Broadcasting Floods Memory
Don't rely on PyTorch's implicit broadcasting in production loops. It turns shape mismatches into larger tensors instead of errors, and the cost grows with batch size. Use expand_as() or unsqueeze() explicitly and assert the shapes you expect. Your future self will thank you when the memory profiler doesn't scream.
Key Takeaway
Use torch.cuda.max_memory_allocated() after every training step. Monitor tensor shapes like you monitor latency. Explicit expansions are your debugging armor.
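The shape arithmetic in this section can be checked without a GPU. The sketch below is one way to do it, with two made-up helpers (`tensor_mib` and `checked_sub` are hypothetical names, not PyTorch APIs); `torch.broadcast_shapes` is a real function that reports the result shape without allocating it.

```python
import math
import torch

def tensor_mib(shape, dtype=torch.float32):
    """Size in MiB of a dense tensor with this shape and dtype."""
    bytes_per_element = torch.empty((), dtype=dtype).element_size()
    return math.prod(shape) * bytes_per_element / (1024 ** 2)

print(tensor_mib((1024, 1024)))        # 4.0
print(tensor_mib((64, 128, 1024)))     # 32.0

# The shape broadcasting will produce, computed without allocating anything.
print(torch.broadcast_shapes((64, 1), (1, 512)))   # torch.Size([64, 512])

# Classic accident: (N,) against (N, 1) silently becomes (N, N).
preds = torch.zeros(1000)
targets = torch.zeros(1000, 1)
print((preds - targets).shape)         # torch.Size([1000, 1000])

def checked_sub(a, b):
    """Subtract only when shapes match exactly (hypothetical guard)."""
    assert a.shape == b.shape, f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}"
    return a - b
```

A strict guard like `checked_sub` trades convenience for safety: it rejects legitimate broadcasts too, so it fits best around loss computations where shapes should always match.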

Pin Memory or Pay the Price: The Hidden Cost of CPU-to-GPU Transfer

You think you wrote a fast DataLoader. It uses 8 workers, prefetches 2 batches. But your GPU idle time is 20%. That's because a copy from ordinary (pageable) CPU RAM to GPU VRAM is synchronous with respect to the host: every to(device='cuda') call blocks the Python thread until the copy finishes, so the next kernels are not queued and the GPU sits idle. The fix is pinning memory. When you set pin_memory=True in your DataLoader, PyTorch places each batch in page-locked host memory, which the GPU's DMA engine can read directly with no staging copy. Combined with non_blocking=True on the to() call, the transfer becomes asynchronous, and the GPU can keep computing while the next batch is copied. Benchmark this on your own hardware: a ResNet-50 training loop with pin_memory=False vs True. On input-bound workloads the difference can reach 15-25% throughput; compute-bound loops see much less. Don't let your DataLoader steal cycles from your backprop.

PinMemoryDemo.py · PYTHON
# io.thecodeforge - ml-ai tutorial

import torch
from torch.utils.data import DataLoader, TensorDataset
import time

# Simulate a dataset: 10,000 samples, each 3x224x224 (standard ImageNet size).
# This needs about 6 GB of host RAM; lower the sample count on smaller machines.
dummy_data = torch.randn(10000, 3, 224, 224)
dummy_labels = torch.randint(0, 1000, (10000,))
dataset = TensorDataset(dummy_data, dummy_labels)

def time_dataloader(pin_memory, num_workers=4, max_batches=50):
    loader = DataLoader(
        dataset,
        batch_size=64,
        num_workers=num_workers,
        pin_memory=pin_memory,
        shuffle=True
    )

    torch.cuda.synchronize()
    start = time.perf_counter()
    for batch_idx, (data, targets) in enumerate(loader):
        if batch_idx >= max_batches:  # Time exactly max_batches transfers
            break
        # Simulate the transfer to GPU that happens in every training loop.
        # non_blocking=True only overlaps when the source memory is pinned.
        data = data.to('cuda', non_blocking=pin_memory)
        targets = targets.to('cuda', non_blocking=pin_memory)
    # Wait for queued asynchronous copies before stopping the clock
    torch.cuda.synchronize()
    end = time.perf_counter()
    return end - start

if __name__ == "__main__":  # Needed for DataLoader workers on spawn-based platforms
    print("Pinned memory disabled:")
    time_no_pin = time_dataloader(pin_memory=False)
    print(f"Time for 50 batches: {time_no_pin:.2f}s")

    print("\nPinned memory enabled:")
    time_pin = time_dataloader(pin_memory=True)
    print(f"Time for 50 batches: {time_pin:.2f}s")

    print(f"\nSpeedup: {(time_no_pin / time_pin):.2f}x")
Output (illustrative; timings depend on hardware)
Pinned memory disabled:
Time for 50 batches: 3.45s
Pinned memory enabled:
Time for 50 batches: 2.12s
Speedup: 1.63x
Senior Shortcut: Always Pair pin_memory with non_blocking=True
pin_memory=True only lets transfers overlap with compute if you also pass non_blocking=True in your to(device) calls. Otherwise each copy still blocks the host thread. Add that parameter to every batch transfer in your training loop. It's a one-line change, and on input-bound loops it can lift throughput by roughly 20%.
Key Takeaway
Set pin_memory=True in any DataLoader that feeds a GPU, and pair it with non_blocking=True in your to(device) calls. On input-bound training loops it is one of the cheapest throughput gains available.
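In a real training loop the pairing looks roughly like the sketch below. It assumes a tiny synthetic dataset and a single `Linear` layer (both placeholders), and it falls back to CPU so it runs anywhere; on a CPU-only machine pinning is skipped and `non_blocking=True` has no effect.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

# Small synthetic dataset so the sketch runs anywhere.
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0,          # raise this in real training
    pin_memory=use_cuda,    # page-locked host buffers only help with a GPU
)

model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

steps = 0
for data, targets in loader:
    # non_blocking=True lets the copy overlap with GPU work when the
    # source is pinned; on CPU it changes nothing.
    data = data.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

    optimizer.zero_grad()
    loss = loss_fn(model(data), targets)
    loss.backward()
    optimizer.step()
    steps += 1

print(f"ran {steps} steps on {device}")
```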

Stop Guessing: Profile Tensor Memory Before You Deploy

Memory leaks in production ML don't crash your training job — they crash your inference API at 3 AM when traffic spikes. Most engineers waste days debugging OOM errors that could be caught with one line of instrumentation.

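PyTorch's built-in memory report (torch.cuda.memory_summary()) shows how much memory the caching allocator has allocated, reserved, and freed. Run it after your model's forward pass and compare against what you expect to be resident. Look for memory that persists when it shouldn't; those are your memory anchors. The usual suspects: inference run without torch.no_grad(), so autograd keeps every intermediate activation for a backward pass that never comes (model.eval() alone does not turn autograd off), or outputs and loss tensors appended to a list that is never cleared.

Don't rely on nvidia-smi alone. It reports the memory the process has reserved, including PyTorch's cache and the CUDA context, not what live tensors are using. Use torch.profiler with profile_memory=True to see allocation by operation. Then kill the hidden tensors. Your production budget will thank you.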

profile_memory.py · PYTHON
# io.thecodeforge - ml-ai tutorial

import torch
import gc

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Simulate leak: hold reference to intermediate tensor
model = torch.nn.Linear(1024, 1024).to(device)
inputs = torch.randn(256, 1024, device=device)

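gc.collect()
if device.type == 'cuda':
    torch.cuda.empty_cache()

# One forward pass with memory tracking
output = model(inputs)
hidden = output.relu()  # Extra activation: stays allocated while referenced

del hidden  # Drop the reference so the allocator can reuse the block
if device.type == 'cuda':
    print(torch.cuda.memory_summary(device=device, abbreviated=True))
Output
(abridged) memory_summary() prints a table of allocator statistics with rows such as "Allocated memory", "Active memory" and "GPU reserved memory", and columns "Cur Usage", "Peak Usage", "Tot Alloc" and "Tot Freed". Exact figures depend on the GPU and PyTorch version.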
Production Trap:
Calling torch.cuda.empty_cache() mid-request is a sign you're leaking. Fix the leak, don't sweep it under the GPU.
Key Takeaway
Profile tensor memory before you ship. One memory_summary() call saves you a production incident.
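For a single number rather than a full table, the peak-memory counters can be wrapped in a small helper. This is a sketch under assumptions: `peak_forward_mib` is a hypothetical name, the three-layer model is a placeholder, and the helper returns None on machines without CUDA.

```python
import torch

def peak_forward_mib(model, inputs):
    """Peak CUDA memory (MiB) allocated during one inference forward pass.

    Returns None when the inputs are not on a CUDA device.
    """
    if inputs.device.type != "cuda":
        return None
    torch.cuda.synchronize(inputs.device)
    torch.cuda.reset_peak_memory_stats(inputs.device)
    baseline = torch.cuda.memory_allocated(inputs.device)
    model.eval()
    with torch.no_grad():          # no graph, so activations are freed early
        model(inputs)
    torch.cuda.synchronize(inputs.device)
    peak = torch.cuda.max_memory_allocated(inputs.device)
    return (peak - baseline) / (1024 ** 2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).to(device)
inputs = torch.randn(256, 1024, device=device)

print("peak forward MiB:", peak_forward_mib(model, inputs))
```

Running the same measurement with and without `torch.no_grad()` is a quick way to see how much memory autograd bookkeeping adds to an inference path.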

The One Transform That Breaks Your Batch Norm (And How to Fix It)

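BatchNorm2d tracks a running mean and variance per channel and expects channels on dimension 1 of an (N, C, H, W) tensor. When you reshape that tensor straight to (N*H, W, C) for a sequence model, nothing is moved in memory: view() and reshape() only reinterpret the existing layout. The last axis is labelled C but holds values from neighbouring spatial positions. The batch norm statistics themselves are untouched, but every layer after the reshape sees scrambled features. Shapes line up, so nothing errors; the model just trains and serves on garbage.

The fix: never reshape across the channel dimension. If you must flatten spatial dimensions, permute first to move channels to the last axis. Then reshape, so each output row is one position's channel vector. Or switch to LayerNorm, which normalizes over the trailing feature dimensions of each sample and keeps no running batch statistics. It still needs the features on the last axis.

Check the values before and after a reshape, not just the shape. Compare the result against permute-then-reshape on a small tensor; if they differ, you've got a silent bug. Trust a test, not your intuition.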

batchnorm_reshape_fix.py · PYTHON
# io.thecodeforge - ml-ai tutorial

import torch
import torch.nn as nn

class BrokenBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # BAD: view() reinterprets NCHW memory as (N*H, W, C) without moving data,
        # so the last axis no longer holds channels
        b, c, h, w = x.shape
        return self.bn(x).view(b * h, w, c)  # right shape, scrambled contents

class FixedBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        # Permute channels to last, then reshape spatial without touching C
        x = self.bn(x)
        x = x.permute(0, 2, 3, 1)  # (N, H, W, C)
        return x.reshape(b * h, w, c)

x = torch.randn(2, 4, 8, 16)
broken = BrokenBlock(4)
fixed = FixedBlock(4)
print('Broken output shape:', broken(x).shape)  # (16, 16, 4)
print('Fixed output shape:', fixed(x).shape)    # (16, 16, 4)
# Same shape, different contents: only the fixed block keeps channels on the last axis
print('Same values:', torch.equal(broken(x), fixed(x)))
Output
Broken output shape: torch.Size([16, 16, 4])
Fixed output shape: torch.Size([16, 16, 4])
Same values: False
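Senior Shortcut:
When in doubt, use torch.nn.LayerNorm in mixed spatial-sequence architectures. It normalizes each sample over its trailing feature dimensions and keeps no running statistics, so train and eval behave the same. You still have to permute channels to the last axis first.
Key Takeaway
Never reshape across the channel dimension. Permute first, then flatten spatial dimensions, and keep channels on dimension 1 for anything that feeds BatchNorm2d.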
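The scrambling is easiest to see on a tensor whose values encode their own channel index. The sketch below uses made-up sizes (1 sample, 2 channels, 2x3 spatial) and involves no normalization layer at all; it only compares a raw view against permute-then-reshape.

```python
import torch

n, c, h, w = 1, 2, 2, 3
# Encode the channel index in the value: channel 0 holds 0s, channel 1 holds 1s.
x = torch.arange(c).view(1, c, 1, 1).expand(n, c, h, w).contiguous()

wrong = x.view(n * h, w, c)                          # reinterprets memory only
right = x.permute(0, 2, 3, 1).reshape(n * h, w, c)   # moves C last, then flattens

# With channels last, every position should read [0, 1].
print(right[0, 0].tolist())        # [0, 1]
print(wrong[0, 0].tolist())        # [0, 0]  two values from channel 0
print(torch.equal(wrong, right))   # False
```

A check like this is cheap enough to keep as a unit test next to any layer that changes tensor layout.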

Installation: Why the Wrong Build Costs You 10x Latency

Installing PyTorch seems trivial (pip install torch), but a build that doesn't match your machine can leave the GPU unused. The critical decision is selecting a CUDA build that your driver supports. Use nvidia-smi to check the highest CUDA version the driver supports, then install a PyTorch build for that version or an older one, using the selector on pytorch.org. On CPU-only systems, avoid the CUDA build entirely; it pulls in large GPU libraries you will never use. For edge devices, compiling from source with USE_CUDA=0 produces a much smaller binary. Always verify with torch.cuda.is_available() and torch.backends.cudnn.version(). A mismatch does not raise an error at import time: torch.cuda.is_available() just returns False, and any script that selects its device with 'cuda' if torch.cuda.is_available() else 'cpu' runs on CPU, multiplying training time. The why: PyTorch is a C++ engine, and the Python package is a wrapper around binaries compiled against a specific CUDA runtime. If the driver is too old for that runtime, the CUDA backend cannot initialise. Test on a small tensor: torch.randn(3,3).cuda() should succeed. The first call takes a few seconds while the CUDA context initialises; later calls return almost instantly. If it hangs or errors, your installation is broken.

VerifyCudaInstall.py · PYTHON
# io.thecodeforge - ml-ai tutorial

import torch

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")

# Get device count
print(f"GPU count: {torch.cuda.device_count()}")

# Quick sanity: create tensor on GPU
try:
    t = torch.randn(4, 4).cuda()
    print("GPU tensor created successfully")
    print(f"Device: {t.device}")
except RuntimeError as e:
    print(f"GPU failure: {e}")

# Display CUDA version used by PyTorch
print(f"PyTorch CUDA version: {torch.version.cuda}")
Output
CUDA available: True
GPU count: 1
GPU tensor created successfully
Device: cuda:0
PyTorch CUDA version: 12.1
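Production Trap:
Installing a wheel built for a newer CUDA version than your driver supports (for example, a CUDA 12 wheel on a driver that only supports CUDA 11.8) makes torch.cuda.is_available() return False, and device-selection code falls back to CPU without an error. The reverse is generally fine: drivers are backward compatible. Match PyTorch's CUDA version to your driver's maximum supported version, not to your system's installed toolkit.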
Key Takeaway
Check torch.cuda.is_available() immediately after install to catch driver mismatch before training.
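One way to make that check hard to skip is to collect the relevant facts in one place and raise when a GPU was expected but is missing. The helpers below are a sketch (`cuda_report` and `require_cuda` are hypothetical names); every `torch` call in them is a standard API.

```python
import torch

def cuda_report():
    """Collect the facts needed to diagnose a build/driver mismatch."""
    available = torch.cuda.is_available()
    return {
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,   # None for CPU-only builds
        "cuda_available": available,
        "cudnn_version": torch.backends.cudnn.version() if available else None,
        "device_name": torch.cuda.get_device_name(0) if available else None,
    }

def require_cuda():
    """Raise instead of quietly continuing on CPU (hypothetical helper)."""
    report = cuda_report()
    if not report["cuda_available"]:
        raise RuntimeError(f"CUDA requested but unavailable: {report}")
    return torch.device("cuda")

for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

Calling `require_cuda()` at the top of a GPU training script turns a slow CPU run into an immediate, explained failure.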

Enhancing Data Diversity through Augmentation: Why Random Noise Beats Fixed Pipelines

Data augmentation is not about random flips. It's about forcing your model to learn invariances that generalize. Static augmentation pipelines (e.g., always rotate 30°) create spurious correlations. Instead, use stochastic augmentation, where each sample draws its own random parameters from PyTorch's random number generator. The why: a deterministic transform shows the model the same altered image every epoch, so it can memorize the augmentation as a feature. Fresh random parameters per sample break that pattern. For images, combine geometric (random affine, perspective) with photometric (color jitter, Gaussian noise) transforms. Run augmentations on the CPU inside DataLoader workers (num_workers>0) so they overlap with GPU compute instead of blocking it. Critical: never augment validation or test sets, only training. torchvision.transforms.RandAugment is a solid default: for each image it picks num_ops operations at random from a set of 14 and applies them at one fixed magnitude that you choose, so there is nothing to learn or search. The hidden cost: augmentation is CPU work. Heavy pipelines can make data loading the bottleneck when num_workers is too low, so profile the loader before blaming the model.

StochasticAugmentation.py · PYTHON
# io.thecodeforge - ml-ai tutorial

import torch
from torchvision import transforms

# torchvision's transforms draw from PyTorch's global RNG; seed it for reproducible runs
torch.manual_seed(42)

# Stochastic augmentation pipeline (training only)
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # takes no generator argument
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Apply per image (PIL inputs), then stack into a batch of shape (B, C, H, W)
# batch_tensors = torch.stack([train_transform(img) for img in batch])
Output
(none: the snippet only builds the pipeline; each call to train_transform draws fresh random parameters)
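Production Trap:
Setting torch.manual_seed once is not enough. Each DataLoader worker is a separate process: PyTorch seeds its torch RNG with base_seed + worker_id, but seeds for other libraries such as NumPy may be duplicated across workers, depending on the PyTorch version, so NumPy-based transforms can repeat the same "random" values in every worker. Pass a seeded torch.Generator to the DataLoader and use worker_init_fn to seed NumPy and random in each worker.
Key Takeaway
Use stochastic augmentation so the network cannot memorize fixed transforms as features, and seed the DataLoader's generator and workers when you need reproducible runs.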
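A seeded loader can be sketched as follows, following the pattern in PyTorch's reproducibility notes. `make_loader` is a hypothetical helper and the 16-element dataset is a placeholder; `num_workers` defaults to 0 so the sketch runs anywhere, in which case `worker_init_fn` is simply never called.

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Inside a worker, torch.initial_seed() is already base_seed + worker_id;
    # reuse it so NumPy- and random-based transforms are seeded per worker too.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def make_loader(dataset, seed, num_workers=0):
    g = torch.Generator()
    g.manual_seed(seed)       # controls shuffling and the workers' base seed
    return DataLoader(dataset, batch_size=4, shuffle=True,
                      num_workers=num_workers,
                      worker_init_fn=seed_worker, generator=g)

dataset = TensorDataset(torch.arange(16))

order_a = [batch[0].tolist() for batch in make_loader(dataset, seed=42)]
order_b = [batch[0].tolist() for batch in make_loader(dataset, seed=42)]
print(order_a == order_b)   # True: same seed, same sample order
```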
● Production incident · POST-MORTEM · severity: high

Training silently runs on CPU — GPU utilisation stays at 0%

Symptom
Training throughput is 8x slower than expected benchmarks for the model class. nvidia-smi shows 0% GPU utilisation throughout the run. No error is raised — training completes normally, loss decreases, and checkpoints are saved. Nothing in the output indicates anything is wrong.
Assumption
The GPU is faulty or the CUDA driver is incompatible with the installed PyTorch version. The team spent several hours running nvidia-smi diagnostics, checking driver logs, and reinstalling CUDA before looking at the training code itself.
Root cause
Input tensors were created with torch.tensor(data), which defaults to CPU. The model was correctly moved to GPU with model.to('cuda'), but the input data remained on CPU. The first forward pass raised a device mismatch RuntimeError, which was caught by a broad except Exception block that was added months earlier to 'handle data loading issues.' The except block logged a generic warning and continued, falling back to CPU computation silently. Training ran on CPU for the entirety of the job.
Fix
Established a single device variable at the top of the training script and passed it to every tensor creation call: torch.tensor(data, device=device). Removed the broad except block that was suppressing the RuntimeError — device mismatch errors should crash immediately, not be swallowed. Added a pre-training assertion: assert next(model.parameters()).is_cuda, 'Model is not on GPU'. Added a startup log that prints the device of the first model parameter and the first input batch so device placement is visible from the very first line of training output.
Key lesson
  • PyTorch tensors default to CPU — you must explicitly pass device=device to every creation call, or call .to(device) before any operation involving the model
  • Broad except blocks that catch Exception are one of the most dangerous patterns in training code — they suppress device mismatch errors and allow silent CPU fallback
  • Always verify GPU placement with assert next(model.parameters()).is_cuda before the training loop starts — this one line catches the most expensive silent failure in production ML
  • Log the device of model parameters and input tensors at startup — make device placement visible from the first line of output, not something you discover after 8 hours of slow training
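The fix described above can be sketched in a few lines. The helper name `assert_on_device`, the placeholder `data`, and the `Linear` model are assumptions for illustration; the point is one shared `device`, a visible startup log, and an exception handler that adds context and re-raises instead of swallowing.

```python
import torch

def assert_on_device(model, batch, device):
    """Fail fast if the model or the batch is not on the expected device."""
    param_device = next(model.parameters()).device
    assert param_device.type == device.type, f"model on {param_device}, expected {device}"
    assert batch.device.type == device.type, f"batch on {batch.device}, expected {device}"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(8, 2).to(device)

data = [[0.5] * 8 for _ in range(4)]           # placeholder input data
batch = torch.tensor(data, device=device)      # created on the target device

print(f"model: {next(model.parameters()).device} | batch: {batch.device}")
assert_on_device(model, batch, device)

try:
    output = model(batch)
except RuntimeError as err:
    # Narrow handler: add context, then re-raise so the job stops here.
    raise RuntimeError(f"forward pass failed on {device}") from err
```

On a machine where a GPU is mandatory, replace the fallback in the `device` line with a hard failure so a missing GPU cannot go unnoticed.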
Production debug guide: common symptoms when tensor operations fail (5 entries)
Symptom · 01
RuntimeError: Expected all tensors to be on the same device
Fix
Print .device on every tensor in the failing operation before trying anything else: print(x.device, y.device). The mismatch is usually between model output (GPU) and a target tensor created inside the loss function (CPU). Fix by passing device=device to every tensor creation call in the training loop, including target tensors and any intermediate buffers.
Symptom · 02
CUDA out of memory error mid-training
Fix
Check tensor sizes with .element_size() * .nelement() to find which allocation is largest. Delete unused tensors with del so the allocator can reuse their memory. Check that loss is logged with loss.item(): accumulating the raw loss tensor keeps its computation graph alive across iterations. Run torch.cuda.memory_summary() to see a breakdown of current allocations. torch.cuda.empty_cache() only returns cached blocks to the driver; it does not free memory held by live tensors.
Symptom · 03
RuntimeError: grad can be implicitly created only for scalar outputs
Fix
Loss must be a scalar before calling .backward(). If your loss computation returns a tensor with more than one element, reduce it with .mean() or .sum() first. This typically happens when the loss function is applied without proper aggregation — for example, calling a per-element loss without reducing across the batch dimension.
Symptom · 04
Tensor shape mismatch on matrix multiplication
Fix
Print .shape on both tensors before the operation: print(x.shape, y.shape). Matrix multiplication requires the inner dimensions to match — (A, B) @ (B, C) produces (A, C). Use .unsqueeze() to add missing dimensions or .permute() to reorder. If the mismatch is a batch dimension issue, check whether you need .bmm() instead of .mm() for batched matrix multiplication.
Symptom · 05
Gradients are None for some model parameters
Fix
Check whether those parameters are actually used in the computation that produced the loss. Parameters not in the computation graph have no gradient path and .grad remains None. Also check for accidental torch.no_grad() wrapping the forward pass, and verify that requires_grad=True is set on the parameters — freezing layers (param.requires_grad = False) is a common source of this.
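Symptoms 03 and 05 are easy to reproduce on CPU. In the sketch below the small `Sequential` model and the never-used `unused` layer are made up for illustration: the first backward call fails because the loss is not a scalar, and the final loop lists which parameters received a gradient.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)
unused = torch.nn.Linear(4, 4)          # never touched by the forward pass

x = torch.randn(16, 4)
y = torch.randn(16, 1)

per_sample = torch.nn.functional.mse_loss(model(x), y, reduction="none")
try:
    per_sample.backward()               # shape (16, 1): not a scalar
except RuntimeError as err:
    print("backward failed:", err)

loss = per_sample.mean()                # reduce first, then backpropagate
loss.backward()

params = list(model.named_parameters()) + list(unused.named_parameters(prefix="unused"))
for name, param in params:
    status = "ok" if param.grad is not None else "None (not in graph)"
    print(f"{name}: grad {status}")
```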
★ Tensor Debug Cheat Sheet: quick commands to diagnose tensor issues
Device mismatch crash
Immediate action
Check the device of every tensor involved in the failing operation
Commands
python -c "import torch; x = torch.tensor([1.0]); print('Default device:', x.device)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device count:', torch.cuda.device_count()); print('Current device:', torch.cuda.current_device() if torch.cuda.is_available() else 'N/A')"
Fix now
Set device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') at the top of the script and pass it to every tensor creation call — this eliminates the entire class of device mismatch errors
CUDA out of memory
Immediate action
Check current GPU memory allocation before reducing batch size
Commands
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
python -c "import torch; print(torch.cuda.memory_summary())"
Fix now
First: check that loss is logged with loss.item(), not the raw loss tensor, which keeps the full graph alive. Then reduce batch size. torch.cuda.empty_cache() between epochs hands cached but unused memory back to the driver, which helps other processes but does not give your own job more room.
Tensor shape mismatch
Immediate action
Print shapes of all tensors before the failing operation
Commands
python -c "import torch; x = torch.randn(3,4); y = torch.randn(4,5); print('x:', x.shape, 'y:', y.shape, 'result:', (x@y).shape)"
python -c "import torch; x = torch.randn(3); print('Original:', x.shape, 'Unsqueezed row:', x.unsqueeze(0).shape, 'Unsqueezed col:', x.unsqueeze(1).shape)"
Fix now
Use .unsqueeze() to add missing dimensions, .reshape() to regroup dimensions without changing element order, or .permute() to reorder axes. Always print shapes before and after the fix to verify
NumPy Arrays vs PyTorch Tensors
Aspect | NumPy Arrays | PyTorch Tensors
Hardware support | CPU only, no GPU path | CPU, NVIDIA GPU via CUDA, Apple Silicon via MPS; same API regardless of device
Automatic differentiation | Manual: you implement the derivative by hand | Autograd: .backward() computes all gradients via the chain rule in one pass
Deep learning ecosystem | Requires wrappers or conversion to use with PyTorch, JAX, or TensorFlow | Native: every PyTorch layer, loss function, and optimizer operates on tensors directly
Memory model | Contiguous C-order arrays in CPU memory; straightforward layout | Views with shape and stride; efficient for transpose and reshape, but non-contiguous tensors require .contiguous() before .view()
Interoperability | Universal: the lingua franca of the scientific Python ecosystem | torch.from_numpy() converts with zero copy on CPU; full round-trip compatibility
When to use it | Data preprocessing, statistics, visualisation, and any computation that does not need GPU or gradients | Any workload that feeds a neural network, requires gradient-based optimisation, or benefits from GPU parallelism on large matrices

Key takeaways

1. Tensors are the universal data structure in PyTorch: every input, weight, gradient, and model output is a tensor. Understanding how they work is not optional; it is the foundation every other PyTorch concept builds on.
2. requires_grad=True opts a tensor into the computation graph: every subsequent operation is recorded for backpropagation. Set it only on learnable parameters, never on input data, and never on tensors used only for inference.
3. Always pass device=device to tensor creation calls: tensors default to CPU and PyTorch never moves them automatically. The cost of forgetting this is training silently on CPU at 1/10th expected speed.
4. In-place operations (add_, mul_, fill_) on gradient-tracked tensors break autograd: they raise an error immediately on leaf tensors, or later at .backward() when they overwrite a value the backward pass needs. Avoid them on any tensor with requires_grad=True.
5. Use torch.tensor() (lowercase) for creating tensors from data. torch.Tensor() (uppercase) is the class constructor and does not infer dtype or accept a device argument; it should not appear in data creation code.
6. Views share storage with the original tensor: .transpose(), .permute(), and slicing all return views. Use .clone() when you need a modification-safe independent copy, and .contiguous() before .view() when the tensor is non-contiguous.

Common mistakes to avoid

3 patterns

Using tensors for non-ML tasks where NumPy is more efficient

Symptom
Unnecessary overhead from PyTorch's Autograd engine, CUDA initialisation, and device management for operations that have no benefit from GPU acceleration. Slower than NumPy for small arrays and non-parallelisable operations.
Fix
Use NumPy for data preprocessing, exploratory analysis, and statistical computations that stay on CPU. Convert to tensors with torch.from_numpy() only when the data enters the model or training loop. PyTorch is the right tool for gradient-based optimisation and GPU parallelism — not for replacing NumPy in every computation.

Keeping tensors on GPU after they are no longer needed

Symptom
CUDA Out of Memory errors during training, typically appearing mid-epoch rather than on the first batch. GPU memory fills up gradually because unused tensors accumulate — intermediate activations, logged loss tensors, or debug variables that were never deleted.
Fix
Delete unused tensors with del tensor_name when they are no longer needed in a training loop, so the allocator can reuse their memory. Use loss.item() for scalar logging: accumulating the raw loss tensor keeps its computation graph alive in GPU memory. torch.cuda.empty_cache() returns cached but unused blocks to the CUDA driver, which helps other processes on the same GPU; it does not free memory held by live tensors.

Forgetting to call .detach() before converting a gradient-tracked tensor to NumPy

Symptom
RuntimeError: Can't call numpy() on Tensor that requires grad. This surfaces when trying to log, visualise, or post-process a tensor that was created with requires_grad=True or is the output of a computation involving gradient-tracked parameters.
Fix
Call tensor.detach().cpu().numpy() to safely convert — .detach() removes the tensor from the computation graph, .cpu() moves it off GPU if necessary, and .numpy() converts to a NumPy array. The order matters: detach before numpy, cpu before numpy on GPU tensors.
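The second and third patterns fit in one short sketch. The weight vector `w`, the inputs `x`, and the squared-output loss are placeholders; what matters is that the history list stores Python floats and that the NumPy conversion goes through detach, then cpu, then numpy.

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(10, 3)

history = []
for _ in range(5):
    loss = ((x @ w) ** 2).mean()
    loss.backward()
    w.grad = None
    # .item() copies the value out as a Python float; appending `loss`
    # itself would keep a reference to each step's graph.
    history.append(loss.item())

out = x @ w                                   # carries grad history
as_numpy = out.detach().cpu().numpy()         # detach, move to CPU, convert

print(type(history[0]).__name__, as_numpy.shape)   # float (10,)
```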
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What is the difference between torch.Tensor (class constructor) and torc...
Q02 · SENIOR
Explain the contiguous tensor concept. Why do we often need to call .con...
Q03 · SENIOR
Describe the broadcasting rules in PyTorch. How does the framework handl...
Q04 · SENIOR
How does the Autograd engine use the grad_fn attribute to perform backpr...
Q05 · SENIOR
What is the difference between torch.no_grad() and torch.inference_mode(...
Q01 of 05 · SENIOR

What is the difference between torch.Tensor (class constructor) and torch.tensor (factory function)?

ANSWER
torch.Tensor (capital T) is the tensor class constructor. Calling torch.Tensor([1, 2, 3]) creates a float32 tensor — it is equivalent to torch.FloatTensor([1, 2, 3]). It does not perform dtype inference from the input, does not accept a device argument directly, and is considered a low-level interface. torch.tensor (lowercase t) is the recommended factory function for creating tensors from data. It infers dtype from the Python or NumPy type of the input (int64 for Python integers, float32 for Python floats with a trailing decimal, and so on), always creates a copy of the data, and accepts dtype, device, and requires_grad as arguments. In production code, torch.tensor() should appear in all data creation — torch.Tensor() should not. The practical danger of torch.Tensor(): calling torch.Tensor(3, 4) creates a 3x4 tensor of uninitialised memory rather than a tensor from the data [3, 4], which is a silent correctness bug.
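The difference is quick to confirm in an interpreter. This sketch assumes PyTorch's default dtype has not been changed from float32.

```python
import torch

a = torch.tensor([1, 2, 3])        # factory: dtype inferred from the data
b = torch.tensor([1.0, 2.0, 3.0])
c = torch.Tensor([1, 2, 3])        # constructor: default float dtype
d = torch.Tensor(3, 4)             # ints are read as a SHAPE, values uninitialised
e = torch.tensor([3, 4])           # the data [3, 4]

print(a.dtype)    # torch.int64
print(b.dtype)    # torch.float32
print(c.dtype)    # torch.float32
print(d.shape)    # torch.Size([3, 4])
print(e.shape)    # torch.Size([2])
```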
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is a PyTorch Tensor in simple terms?
02
What is the difference between .view() and .reshape()?
03
How do I move a tensor from GPU back to CPU?
04
Why does torch.cuda.is_available() return False even though I have a GPU?
05
When should I use torch.no_grad() vs torch.inference_mode()?