
PyTorch Tensors Explained

A comprehensive guide to PyTorch Tensors — the specialized multidimensional arrays that power deep learning with GPU acceleration and automatic differentiation.
🧑‍💻 Beginner-friendly — no prior ML / AI experience needed
In this tutorial, you'll learn
  • Tensors are the universal data structure in PyTorch — every input, weight, gradient, and model output is a tensor. Understanding how they work is not optional; it is the foundation every other PyTorch concept builds on.
  • requires_grad=True opts a tensor into the computation graph — every subsequent operation is recorded for backpropagation. Set it only on learnable parameters, never on input data, and never on tensors used only for inference.
  • Always pass device=device to tensor creation calls — tensors default to CPU and PyTorch never moves them automatically. The cost of forgetting this is training silently on CPU at 1/10th expected speed.
Quick Answer
  • Tensors are GPU-accelerated multidimensional arrays — the universal data structure for all PyTorch operations
  • They mirror NumPy's API but add CUDA support and automatic differentiation via Autograd
  • requires_grad=True enables gradient tracking — the tensor records every operation for backpropagation
  • Moving tensors to GPU with .to('cuda') provides 10-100x speedup for large matrix operations
  • Device mismatch (CPU tensor + GPU tensor in the same operation) is the #1 production RuntimeError — always check .device
  • Small tensors incur more transfer overhead than they save — only move to GPU when the computation justifies it
🚨 START HERE
Tensor Debug Cheat Sheet
Quick commands to diagnose tensor issues
🟡Device mismatch crash
Immediate Action: Check the device of every tensor involved in the failing operation
Commands
python -c "import torch; x = torch.tensor([1.0]); print('Default device:', x.device)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device count:', torch.cuda.device_count()); print('Current device:', torch.cuda.current_device() if torch.cuda.is_available() else 'N/A')"
Fix Now: Set device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') at the top of the script and pass it to every tensor creation call — this eliminates the entire class of device mismatch errors
🟡CUDA out of memory
Immediate Action: Check current GPU memory allocation before reducing batch size
Commands
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
python -c "import torch; print(torch.cuda.memory_summary())"
Fix Now: First, check that loss is logged with loss.item() not loss — raw tensor logging holds the full graph. Then reduce batch size. Call torch.cuda.empty_cache() between epochs to release cached but unused memory.
🟡Tensor shape mismatch
Immediate Action: Print the shapes of all tensors before the failing operation
Commands
python -c "import torch; x = torch.randn(3,4); y = torch.randn(4,5); print('x:', x.shape, 'y:', y.shape, 'result:', (x@y).shape)"
python -c "import torch; x = torch.randn(3); print('Original:', x.shape, 'Unsqueezed row:', x.unsqueeze(0).shape, 'Unsqueezed col:', x.unsqueeze(1).shape)"
Fix Now: Use .unsqueeze() to add missing dimensions, .reshape() to reorder, or .permute() to swap axes — always print shapes before and after the fix to verify
Production Incident: Training silently runs on CPU — GPU utilisation stays at 0%
A model training job ran 8x slower than benchmarks. GPU utilisation was 0%. The input tensors were created on CPU and never moved to GPU.
Symptom: Training throughput is 8x slower than expected benchmarks for the model class. nvidia-smi shows 0% GPU utilisation throughout the run. No error is raised — training completes normally, loss decreases, and checkpoints are saved. Nothing in the output indicates anything is wrong.
Assumption: The GPU is faulty or the CUDA driver is incompatible with the installed PyTorch version. The team spent several hours running nvidia-smi diagnostics, checking driver logs, and reinstalling CUDA before looking at the training code itself.
Root cause: Input tensors were created with torch.tensor(data), which defaults to CPU. The model was correctly moved to GPU with model.to('cuda'), but the input data remained on CPU. The first forward pass raised a device mismatch RuntimeError, which was caught by a broad except Exception block that was added months earlier to 'handle data loading issues.' The except block logged a generic warning and continued, falling back to CPU computation silently. Training ran on CPU for the entirety of the job.
Fix: Established a single device variable at the top of the training script and passed it to every tensor creation call: torch.tensor(data, device=device). Removed the broad except block that was suppressing the RuntimeError — device mismatch errors should crash immediately, not be swallowed. Added a pre-training assertion: assert next(model.parameters()).is_cuda, 'Model is not on GPU'. Added a startup log that prints the device of the first model parameter and the first input batch so device placement is visible from the very first line of training output.
Key Lesson
  • PyTorch tensors default to CPU — you must explicitly pass device=device to every creation call, or call .to(device) before any operation involving the model
  • Broad except blocks that catch Exception are one of the most dangerous patterns in training code — they suppress device mismatch errors and allow silent CPU fallback
  • Always verify GPU placement with assert next(model.parameters()).is_cuda before the training loop starts — this one line catches the most expensive silent failure in production ML
  • Log the device of model parameters and input tensors at startup — make device placement visible from the first line of output, not something you discover after 8 hours of slow training
Production Debug Guide
Common symptoms when tensor operations fail
RuntimeError: Expected all tensors to be on the same device
Print .device on every tensor in the failing operation before trying anything else: print(x.device, y.device). The mismatch is usually between model output (GPU) and a target tensor created inside the loss function (CPU). Fix by passing device=device to every tensor creation call in the training loop, including target tensors and any intermediate buffers.
CUDA out of memory error mid-training
Check tensor sizes with .element_size() * .nelement() to find which allocation is largest. Delete unused tensors with del and call torch.cuda.empty_cache() between epochs. Check whether loss is being logged with loss.item() — logging loss directly holds the entire computation graph in memory. Run torch.cuda.memory_summary() to see a breakdown of current allocations.
RuntimeError: grad can be implicitly created only for scalar outputs
Loss must be a scalar before calling .backward(). If your loss computation returns a tensor with more than one element, reduce it with .mean() or .sum() first. This typically happens when the loss function is applied without proper aggregation — for example, calling a per-element loss without reducing across the batch dimension.
Tensor shape mismatch on matrix multiplication
Print .shape on both tensors before the operation: print(x.shape, y.shape). Matrix multiplication requires the inner dimensions to match — (A, B) @ (B, C) produces (A, C). Use .unsqueeze() to add missing dimensions or .permute() to reorder. If the mismatch is a batch dimension issue, check whether you need .bmm() instead of .mm() for batched matrix multiplication.
Gradients are None for some model parameters
Check whether those parameters are actually used in the computation that produced the loss. Parameters not in the computation graph have no gradient path and .grad remains None. Also check for accidental torch.no_grad() wrapping the forward pass, and verify that requires_grad=True is set on the parameters — freezing layers (param.requires_grad = False) is a common source of this.
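The scalar-loss rule from the guide above can be verified in isolation. A minimal sketch — the tensor names and shapes are illustrative:

```python
import torch

# Per-element loss: one value per sample and feature — calling .backward()
# on this raises "grad can be implicitly created only for scalar outputs"
pred = torch.randn(4, 3, requires_grad=True)
target = torch.randn(4, 3)
per_element = (pred - target) ** 2   # shape (4, 3) — not a scalar

# Reduce to a scalar before backpropagating
loss = per_element.mean()            # shape () — a 0-dim scalar
loss.backward()

print(loss.shape)                    # torch.Size([])
print(pred.grad.shape)               # torch.Size([4, 3])
```

Calling per_element.backward() directly would raise the error from the guide; reducing with .mean() (or .sum()) gives Autograd the single scalar it needs to seed the backward pass.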

PyTorch Tensors are the fundamental data structure in PyTorch — every input, weight, gradient, and output is a tensor. They are multidimensional arrays that mirror NumPy's API but add two capabilities that NumPy does not have: GPU acceleration via CUDA and automatic differentiation via Autograd.

The key design decision: tensors are not just data containers. When requires_grad=True, they become nodes in a dynamic computation graph. Every operation on them is recorded as the forward pass executes, enabling automatic gradient computation when you call .backward(). This is what makes neural network training tractable — without it, you would manually compute partial derivatives for every parameter on every update, which is not realistic at any modern model size.
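The graph-recording behaviour described above fits in a few lines — a minimal sketch with an illustrative scalar parameter:

```python
import torch

# One learnable scalar, one recorded operation, one backward pass
w = torch.tensor(3.0, requires_grad=True)  # leaf node in the graph
y = w ** 2                                 # recorded as PowBackward0
y.backward()                               # traverse the graph in reverse

# dy/dw = 2w = 6.0, computed automatically via the chain rule
print(w.grad)  # tensor(6.)
```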

The production failure pattern: device mismatch. A tensor on CPU cannot participate in the same operation as a tensor on GPU. PyTorch raises RuntimeError: Expected all tensors to be on the same device immediately and clearly. What is not clear is which tensor is on the wrong device. The fix is always the same: establish device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') once, pass it to every tensor creation call, and add an assertion at training start that verifies model parameters and input data are on the same device.
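That fix pattern can be sketched as follows. This uses a small nn.Linear as a stand-in for the model, and compares device.type rather than using .is_cuda so the same check also passes on CPU-only machines:

```python
import torch
import torch.nn as nn

# Establish the device once, pass it everywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)
batch = torch.randn(8, 4, device=device)   # created on the right device

# Fail fast if placement is wrong — before any training time is spent
assert next(model.parameters()).device.type == device.type, "Model not on expected device"
assert batch.device.type == device.type, "Input batch not on expected device"

out = model(batch)   # no device mismatch possible
print(out.device)
```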

As of 2026, there is a third backend worth knowing: Apple Silicon. PyTorch supports MPS (Metal Performance Shaders) on M-series Macs via device='mps', which provides meaningful GPU acceleration on MacBooks without CUDA. The same .to(device) pattern applies — the device string changes, the code does not.

What Is a PyTorch Tensor and Why Does It Exist?

A PyTorch tensor is a multidimensional array that was designed to solve a problem NumPy cannot: running massive parallel computations on a GPU while simultaneously tracking every operation for automatic gradient computation.

NumPy arrays are excellent for scientific computing — fast, well-documented, universally supported. But they have two hard limits. First, they run only on CPU. Second, they have no concept of a computation graph. This means that if you want to train a neural network with NumPy, you implement backpropagation manually — computing partial derivatives by hand for every layer, every parameter, every batch. That is tractable for a two-layer toy network and completely unworkable for anything beyond it.

PyTorch tensors solve both problems. The storage layer underneath a tensor can live on CPU, on an NVIDIA GPU via CUDA, or on Apple Silicon via MPS. When the storage is on GPU, every matrix operation dispatches to CUDA kernels that execute in parallel across thousands of GPU cores — this is why large matrix multiplications are 10–100x faster on GPU for the sizes neural networks operate on. When requires_grad=True, the tensor records every operation as part of a dynamic computation graph. When .backward() is called on the loss, that graph is traversed in reverse and .grad is filled in on every participating tensor via the chain rule — one backward pass, all gradients computed simultaneously.

The architectural detail worth understanding: a tensor is a view into a storage object. The tensor knows its shape, stride, dtype, and device. The storage holds the raw bytes. Operations like .transpose() and .permute() create new tensor views with reordered strides without moving any data in memory — the storage stays identical. This is efficient but has a consequence: the resulting tensor is non-contiguous, and .view() will refuse to work on it because .view() requires elements to be laid out in memory in the same order they are addressed logically. The fix is .contiguous(), which copies the data into a new storage with the expected memory order.
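The stride mechanics described above can be inspected directly — a short sketch:

```python
import torch

t = torch.randn(3, 4)
print(t.stride(), t.is_contiguous())       # (4, 1) True — row-major layout

# Transpose reorders strides; the storage buffer is untouched
tt = t.t()
print(tt.stride(), tt.is_contiguous())     # (1, 4) False
print(t.data_ptr() == tt.data_ptr())       # True — same underlying storage

# .view() requires contiguous memory and refuses the transposed layout
try:
    tt.view(12)
except RuntimeError:
    print("view() refused the non-contiguous tensor")

# .contiguous() copies into a fresh storage with the expected order
flat = tt.contiguous().view(12)
print(flat.shape)                          # torch.Size([12])
```

Note that .reshape() handles both cases automatically — it returns a view when possible and copies when it must — at the cost of making the copy implicit.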

As of PyTorch 2.x, there is a further layer: compiled execution. torch.compile traces your forward pass and compiles it into optimised kernels using TorchInductor. The tensor API is identical — you wrap the model in one call and the same tensor operations run in a fused, optimised form. For production inference workloads in 2026, torch.compile is the highest-leverage single change you can make to a trained model.

io/thecodeforge/ml/forge_tensor_basics.py · PYTHON
import torch

# Device selection — this pattern belongs at the top of every script
# Supports CUDA (NVIDIA), MPS (Apple Silicon), and CPU fallback
if torch.cuda.is_available():
    device = torch.device("cuda")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon GPU — supported from PyTorch 1.12+
else:
    device = torch.device("cpu")

print(f"Using device: {device}")


def initialize_forge_tensors():
    # Creating tensors from Python data — torch.tensor (lowercase) infers dtype
    data = [[1, 2], [3, 4]]
    x_data = torch.tensor(data, dtype=torch.float32, device=device)
    print(f"From list — shape: {x_data.shape}, device: {x_data.device}, dtype: {x_data.dtype}")

    # Creating tensors with specific values — pass device at creation, not after
    x_ones  = torch.ones(2, 3, dtype=torch.float32, device=device)
    x_zeros = torch.zeros(2, 3, dtype=torch.float32, device=device)
    x_rand  = torch.randn(2, 3, device=device)  # standard normal distribution
    print(f"\nones:\n{x_ones}")
    print(f"zeros:\n{x_zeros}")
    print(f"randn:\n{x_rand}")

    # requires_grad=True — opts this tensor into the computation graph
    # Only set this on learnable parameters, never on input data
    x_grad = torch.randn(3, 3, requires_grad=True, device=device)
    print(f"\nGradient tracking enabled: {x_grad.requires_grad}")
    print(f"grad_fn before operation: {x_grad.grad_fn}")  # None — leaf tensor

    # Any operation on a requires_grad tensor produces a non-leaf tensor with grad_fn
    y = x_grad ** 2  # element-wise square
    print(f"grad_fn after operation:  {y.grad_fn}")  # PowBackward0

    return x_data, x_grad


x_data, x_grad = initialize_forge_tensors()

# Tensor metadata inspection — useful diagnostic at the start of debugging
for name, t in [("x_data", x_data), ("x_grad", x_grad)]:
    print(f"{name}: shape={t.shape}, dtype={t.dtype}, device={t.device}, requires_grad={t.requires_grad}")
▶ Output
Using device: cuda
From list — shape: torch.Size([2, 2]), device: cuda:0, dtype: torch.float32

ones:
tensor([[1., 1., 1.],
[1., 1., 1.]], device='cuda:0')
zeros:
tensor([[0., 0., 0.],
[0., 0., 0.]], device='cuda:0')
randn:
tensor([[ 0.3152, -1.2089, 0.7741],
[-0.4156, 0.9823, -0.1205]], device='cuda:0')

Gradient tracking enabled: True
grad_fn before operation: None
grad_fn after operation: <PowBackward0 object at 0x7f3a2c1d4b50>

x_data: shape=torch.Size([2, 2]), dtype=torch.float32, device=cuda:0, requires_grad=False
x_grad: shape=torch.Size([3, 3]), dtype=torch.float32, device=cuda:0, requires_grad=True
Mental Model
The Tensor Mental Model
A tensor is a view into a storage buffer that lives on CPU or GPU — operations dispatch to hardware-specific kernels, and requires_grad threads a computation graph through every operation for automatic differentiation.
  • Tensor = multidimensional array with device awareness — the same API whether the storage is on CPU, CUDA, or MPS
  • NumPy-like API but with GPU dispatch and Autograd built in — torch.from_numpy() converts with zero copy
  • requires_grad=True builds a computation graph as operations execute — .backward() traverses it in reverse to compute all gradients
  • GPU tensors use CUDA kernels for massively parallel execution — the speedup is real only for large tensors; small ones have more transfer overhead than benefit
  • Tensors are views with shape and stride — .transpose() and .permute() reorder strides without moving data; call .contiguous() before .view() if the tensor is non-contiguous
📊 Production Insight
Tensors default to CPU at creation — always pass device=device explicitly to every creation call rather than creating on CPU and moving afterward.
requires_grad=True enables Autograd — without it the tensor is a static array with no gradient path and no learning.
As of PyTorch 2.x, wrapping your model with torch.compile compiles tensor operations into fused CUDA kernels — benchmark it on your architecture before shipping to production.
Rule: set device once at the top of the script, pass it everywhere, log it at startup, and assert it before the training loop.
🎯 Key Takeaway
Tensors are GPU-aware multidimensional arrays — the universal data structure in PyTorch. requires_grad=True opts the tensor into Autograd and every subsequent operation is recorded for the backward pass. Always pass device=device to tensor creation — tensors default to CPU and PyTorch will never move them automatically.
Tensor Creation Decision
If: Creating input data for model training
Use: torch.tensor(data, dtype=torch.float32, device=device, requires_grad=False) — explicit dtype and device, no gradient tracking on inputs
If: Creating trainable model parameters
Use: nn.Parameter(torch.randn(..., device=device)) — automatically sets requires_grad=True and registers the parameter with the module
If: Converting a NumPy array to a tensor
Use: torch.from_numpy(arr) for zero-copy on CPU, then .to(device) to move to GPU — or torch.tensor(arr, device=device) if you need a copy
If: Creating a tensor for inference only
Use: Create without requires_grad and wrap the forward pass in torch.inference_mode() — faster than torch.no_grad() and prevents the tensor from being used in a backward pass

Enterprise Data Pipelines: SQL to Tensor Conversion

In production ML systems, training data rarely arrives as a Python list. It lives in a relational database — normalised, versioned, and filtered by business logic before it ever reaches a tensor. The conversion path from SQL to tensor has two meaningful implementation choices that trade memory efficiency against safety, and getting this wrong at scale causes OOM crashes or silent data corruption.

The recommended pipeline: query SQL into a Pandas DataFrame or directly into a NumPy array via the database cursor's fetchall method. Then convert to a tensor using torch.from_numpy(arr) for zero-copy conversion — the tensor and the NumPy array share the same underlying memory, so no data is duplicated. Move to GPU with .to(device). This entire path keeps memory usage as low as possible for datasets that fit in RAM.

The danger with torch.from_numpy(): because the tensor and the NumPy array share memory, modifying the NumPy array after conversion will silently change the tensor's data. In a pipeline where the DataFrame is reused or mutated for other purposes, this can corrupt training data without any error. If the NumPy array may change after conversion, use torch.tensor(arr, device=device) instead — it creates an independent copy at the cost of a second allocation.
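The sharing behaviour can be demonstrated in a few lines:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

shared = torch.from_numpy(arr)   # zero-copy: tensor and array share memory
copied = torch.tensor(arr)       # independent allocation

arr[0] = 99.0                    # mutate the source array after conversion

print(shared)   # tensor([99., 2., 3.], dtype=torch.float64) — silently changed
print(copied)   # tensor([1., 2., 3.],  dtype=torch.float64) — unaffected
```

Note the dtype: NumPy defaults to float64, which from_numpy() preserves — another reason production pipelines usually convert explicitly with dtype=torch.float32.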

For datasets that do not fit in RAM — anything beyond a few hundred thousand rows with high-dimensional features — loading everything at once causes OOM before training starts. The correct pattern is a custom Dataset that queries or reads one batch at a time in __getitem__, combined with a DataLoader that parallelises the fetching. This keeps memory usage proportional to batch size regardless of dataset size.
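The lazy-loading pattern just described might be sketched like this. SQLFeatureDataset is an illustrative name, sqlite3 stands in for the production database driver, and the column names follow the query below:

```python
import sqlite3
import torch
from torch.utils.data import Dataset

class SQLFeatureDataset(Dataset):
    """Fetches one row per __getitem__, so memory stays proportional
    to batch size regardless of table size."""

    def __init__(self, db_path: str, table: str):
        # table must come from trusted code, never user input (f-string SQL)
        self.conn = sqlite3.connect(db_path)
        self.table = table
        # Only the row count is materialised up front — never the data itself
        self.n = self.conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, idx: int):
        row = self.conn.execute(
            f"SELECT feature_a, feature_b, target_label FROM {self.table} "
            "ORDER BY rowid LIMIT 1 OFFSET ?",
            (idx,),
        ).fetchone()
        features = torch.tensor(row[:2], dtype=torch.float32)
        label = torch.tensor(row[2], dtype=torch.float32)
        return features, label
```

Wrap it in a DataLoader for batching. One caveat: a single sqlite3 connection is not safe to share across DataLoader worker processes — in a real pipeline, open the connection lazily per worker or keep num_workers=0.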

io/thecodeforge/queries/fetch_features.sql · SQL
-- Extracting normalised features for tensor conversion
-- We pull only the columns needed for training — not SELECT *
-- The WHERE clause filters to verified samples only, matching the Dataset class expectation
-- LIMIT controls batch size for incremental loading patterns

SELECT
    feature_a,
    feature_b,
    target_label
FROM io.thecodeforge.analytics_table
WHERE status = 'processed'
  AND is_verified = TRUE
  AND split_tag = 'train'
ORDER BY sample_id ASC
LIMIT 10000;

-- For incremental loading in a custom Dataset.__getitem__:
-- Use OFFSET :offset LIMIT :batch_size with parameterised queries
-- Never load the full table in __init__ — keep memory proportional to batch size
▶ Output
Returns a tabular result set ready for Pandas read or cursor fetchall, then torch.from_numpy() conversion.
🔥SQL to Tensor: The Zero-Copy Path
When moving data from SQL, pull into a NumPy array via Pandas (df.values) or SQLAlchemy, then convert with torch.from_numpy(). This creates a tensor that shares the same underlying memory as the NumPy array — no data is duplicated. Move to GPU with .to(device) afterward. The caveat: because the tensor and the array share memory, any mutation of the NumPy array after conversion will silently change the tensor's data. If the array may be modified, use torch.tensor(arr, device=device) instead to get an independent copy.
📊 Production Insight
torch.from_numpy() shares memory with the NumPy array — zero-copy and fast, but any subsequent mutation of the array silently corrupts the tensor.
torch.tensor(arr) creates an independent copy — safer for pipelines where the source array is reused or modified.
For datasets larger than available RAM, load in chunks inside a custom Dataset.__getitem__ rather than upfront in __init__ — this keeps memory proportional to batch size regardless of dataset size.
Rule: from_numpy() for read-only pipelines where you control the source array's lifetime; tensor() when in doubt.
🎯 Key Takeaway
torch.from_numpy() shares memory with the NumPy array — zero-copy but modifications to the array propagate silently to the tensor. torch.tensor() creates a copy — safer for production pipelines where the source data has a longer lifetime than the tensor. For large datasets, fetch in chunks — loading everything at once is an OOM crash waiting to happen.
SQL to Tensor Conversion Decision
If: Dataset fits in RAM and the NumPy array will not be modified after conversion
Use: torch.from_numpy(df.values).to(device) — zero-copy conversion, minimum memory usage
If: NumPy array may be modified or reused for other purposes after tensor creation
Use: torch.tensor(df.values, dtype=torch.float32, device=device) — creates an independent copy, prevents silent data corruption
If: Dataset is too large for RAM (more than ~1M rows or high-dimensional features)
Use: Implement a custom Dataset with lazy loading — fetch rows by offset in __getitem__ and let the DataLoader handle batching and parallelism

Standardising Environments with Docker

PyTorch's GPU support depends on a precise version compatibility chain: host NVIDIA driver → CUDA runtime → cuDNN → PyTorch. A mismatch at any point produces a silent failure — torch.cuda.is_available() returns False, tensors silently stay on CPU, and training runs at a fraction of expected speed with no error message to guide you. Docker solves this by fixing the entire stack in one image tag.

The compatibility rule: the CUDA version in the Docker base image must be less than or equal to the CUDA version supported by the host machine's NVIDIA driver. The driver's maximum supported CUDA version is shown in the top-right corner of nvidia-smi output. If the image requests a higher CUDA version than the driver supports, PyTorch loads but CUDA initialisation fails silently. The fix is always to pick a base image whose CUDA version is at or below what nvidia-smi reports.

Two environment variables matter for GPU-enabled containers. NVIDIA_VISIBLE_DEVICES controls which physical GPUs the container can see — set it to all for training containers or to a specific index when you need to isolate workloads on a multi-GPU host. NVIDIA_DRIVER_CAPABILITIES tells the NVIDIA Container Toolkit which driver features to expose — compute gives you CUDA compute, utility gives you nvidia-smi inside the container. Both should be set in the Dockerfile rather than passed at runtime so they are reproducible.

The verification step that should run before every training job: add a startup script that calls torch.cuda.is_available(), prints the GPU name and total memory, and exits with a non-zero code if CUDA is not available when it is expected. This turns silent CPU fallback into an immediate and obvious failure that stops the job before it wastes hours of compute.
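A verification script along those lines might look like this. verify_gpu and the --require-cuda flag are illustrative names, not a PyTorch API:

```python
import sys
import torch

def verify_gpu(require_cuda: bool) -> int:
    """Print device facts at startup; return non-zero if CUDA is required but missing."""
    print(f"PyTorch: {torch.__version__}")
    available = torch.cuda.is_available()
    print(f"CUDA available: {available}")
    if available:
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"Total memory: {total_gib:.1f} GiB")
    if require_cuda and not available:
        print("FATAL: CUDA expected but not available — refusing to start", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    # Exit non-zero so the container (and the job scheduler) fails loudly
    sys.exit(verify_gpu(require_cuda="--require-cuda" in sys.argv))
```

Run it as the first step of the container's startup command so a misconfigured host stops the job before any compute is spent.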

Dockerfile · DOCKERFILE
# Pin specific PyTorch and CUDA versions — never use 'latest' in production
# 'latest' changes silently and makes training runs non-reproducible
# Check pytorch.org for the full compatibility matrix before changing these
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime

WORKDIR /app

# Expose all GPUs to the container runtime
# NVIDIA_VISIBLE_DEVICES=all makes every physical GPU available
# NVIDIA_DRIVER_CAPABILITIES=compute,utility enables CUDA compute and nvidia-smi
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Prevent CUDA memory fragmentation on long training runs
# Without this, OOM errors can occur even when total free memory is sufficient
ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Startup verification — catches CUDA misconfiguration before wasting compute.
# GPUs are not visible during `docker build`, so this check must run at container
# start rather than in a RUN step. Remove the CUDA assertion for CPU-only targets.
# Run with: docker run --gpus all --shm-size=2g -v /data:/data thecodeforge/torch-runtime:2.3.1
CMD python -c "import torch; \
    print('PyTorch:', torch.__version__); \
    print('CUDA available:', torch.cuda.is_available()); \
    assert torch.cuda.is_available(), 'CUDA expected but not available'; \
    print('GPU:', torch.cuda.get_device_name(0)); \
    print('CUDA version:', torch.version.cuda)" \
    && python ForgeTensorBasics.py
▶ Output
Successfully built image thecodeforge/torch-runtime:2.3.1-cuda12.1

Container startup:
PyTorch: 2.3.1
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
CUDA version: 12.1
⚠ CUDA Version Mismatch Causes Silent CPU Fallback
The CUDA version in the Docker base image must be less than or equal to the CUDA version the host NVIDIA driver supports. Run nvidia-smi on the host to see the maximum supported CUDA version in the top-right corner. A CUDA 12.1 image on a host with a driver that only supports CUDA 11.8 will start without error, load PyTorch, and silently return False from torch.cuda.is_available() — all tensors stay on CPU and training runs at 1/10th expected throughput. Add a startup verification step to your container that asserts CUDA availability and exits with a non-zero code if the assertion fails.
📊 Production Insight
CUDA version in the image must be <= the host driver's maximum supported CUDA version — the right number is in the top-right corner of nvidia-smi output.
Mismatch causes torch.cuda.is_available() to return False silently — all tensors stay on CPU with no warning.
Add a python -c 'assert torch.cuda.is_available()' step to the container's startup command — docker build steps do not see the GPU, so a build-time RUN cannot catch this. Running the assertion at container start makes the job fail loudly on a misconfigured host rather than silently training on CPU.
Rule: pin PyTorch and CUDA versions, set NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES, verify at container startup.
🎯 Key Takeaway
Docker fixes the CUDA compatibility chain in one image tag — use it. CUDA version in the image must be at or below what the host driver supports. Always pin both PyTorch and CUDA versions explicitly. Add a startup verification that asserts CUDA availability and fails loudly if it is missing — silent CPU fallback in a GPU training container is the most expensive misconfiguration in production ML.
Docker CUDA Version Selection
If: Host driver supports CUDA 12.x (nvidia-smi shows CUDA Version: 12.x)
Use: pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime — current recommended base image for GPU training as of mid-2026
If: Host driver supports CUDA 11.x only (nvidia-smi shows CUDA Version: 11.x)
Use: pytorch/pytorch:2.2.0-cuda11.8-cudnn8-runtime — do NOT use a CUDA 12 image on an 11.x driver
If: No GPU on the host — CPU-only deployment or CI environment
Use: pytorch/pytorch:2.3.1 — CPU-only image, several GB smaller, faster to pull and start

Common Mistakes and How to Avoid Them

Most tensor bugs in production trace back to a handful of patterns. Knowing them in advance is the difference between a 30-second fix and a four-hour debugging session.

Device mismatch is the most common runtime crash. The error message is unambiguous — RuntimeError: Expected all tensors to be on the same device — but identifying which tensor is on the wrong device requires checking .device on each one. The usual culprit is a target tensor or loss buffer created inside the training loop without .to(device), while the model output is correctly on GPU. The fix: establish device once and pass it to every tensor creation call in the loop, not just to the model.

torch.Tensor (capital T) versus torch.tensor (lowercase t) is a confusion that trips up developers coming from other frameworks. torch.Tensor is the tensor class constructor — calling torch.Tensor([1, 2, 3]) creates a float32 tensor from data, effectively an alias for torch.FloatTensor: it never infers dtype and does not accept device or requires_grad arguments. torch.tensor (lowercase) is the factory function that infers dtype from the input, always creates a copy, accepts device and requires_grad arguments, and is the correct way to create a tensor from data. In production code, always use torch.tensor().

In-place operations on gradient-tracked tensors corrupt the computation graph. When you call a.add_(b), the original value of a — which autograd needs to compute a's gradient during backpropagation — is destroyed. PyTorch raises RuntimeError: a leaf Variable that requires grad is being used in an in-place operation if it catches this immediately, but in some cases the graph is silently corrupted and gradients are wrong without any error. The rule: avoid trailing underscores (add_, mul_, fill_, zero_) on any tensor with requires_grad=True.

Views versus copies is the final major source of confusion. .view(), slicing, and .transpose() all return views that share storage with the original tensor. Modifying a view modifies the original. .clone() creates an independent copy. If you need to modify a slice without affecting the source tensor — common when building augmented versions of a batch — always call .clone() first.

io/thecodeforge/ml/common_tensor_mistakes.py · PYTHON
import torch

# Device selection — establish once, pass everywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

# --- Mistake 1: Device mismatch ---
# WRONG: x is on GPU, y is on CPU — crashes on the addition
# x = torch.ones(5, device='cuda')
# y = torch.zeros(5)  # CPU by default
# z = x + y  # RuntimeError: Expected all tensors to be on the same device

# CORRECT: both created on the same device
x = torch.ones(5, device=device)
y = torch.zeros(5, device=device)
z = x + y
print(f"Device mismatch fixed: z = {z}")

# --- Mistake 2: torch.Tensor vs torch.tensor ---
# WRONG: torch.Tensor (class) — confusing, no dtype inference, no device argument
# confused = torch.Tensor([1, 2, 3])  # always float32, always CPU

# CORRECT: torch.tensor (factory function) — explicit dtype, device, requires_grad
clear = torch.tensor([1, 2, 3], dtype=torch.float32, device=device)
print(f"torch.tensor result: {clear}, device: {clear.device}")

# --- Mistake 3: In-place operation on a gradient-tracked tensor ---
a = torch.randn(2, 2, requires_grad=True)

# WRONG: in-place add destroys the value autograd needs for gradient computation
# a.add_(torch.ones(2, 2))  # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation

# CORRECT: create a new tensor — preserves the computation graph
b = a + torch.ones(2, 2)  # new tensor, a unchanged, graph intact
loss = b.sum()
loss.backward()  # gradients computed correctly
print(f"Gradient after correct operation: {a.grad}")

# --- Mistake 4: View vs clone confusion ---
original = torch.tensor([1.0, 2.0, 3.0, 4.0])
view     = original[:2]   # view — shares storage
copied   = original[:2].clone()  # independent copy

view[0] = 99.0   # modifies original too
print(f"After modifying view — original: {original}")  # [99., 2., 3., 4.]

copied[0] = 77.0  # does NOT modify original
print(f"After modifying clone — original: {original}")  # still [99., 2., 3., 4.]

# --- Mistake 5: Contiguity and .view() ---
t = torch.randn(3, 4)
t_transposed = t.T   # transpose — non-contiguous, strides reordered
# t_transposed.view(12)  # RuntimeError: view size not compatible with non-contiguous tensor
t_contiguous = t_transposed.contiguous()  # copies data into contiguous layout
reshaped = t_contiguous.view(12)          # works correctly
print(f"Contiguous reshape: {reshaped.shape}")
▶ Output
Device: cuda
Device mismatch fixed: z = tensor([1., 1., 1., 1., 1.], device='cuda:0')
torch.tensor result: tensor([1., 2., 3.], device='cuda:0')
Gradient after correct operation: tensor([[1., 1.],
[1., 1.]])
After modifying view — original: tensor([99., 2., 3., 4.])
After modifying clone — original: tensor([99., 2., 3., 4.])
Contiguous reshape: torch.Size([12])
⚠ When Tensors Are the Wrong Tool
The most expensive tensor mistake is not a bug — it is unnecessary complexity. If your computation does not need GPU acceleration or automatic differentiation, NumPy is simpler, has less overhead, and integrates more broadly with the scientific Python ecosystem. Only create a PyTorch tensor when you are feeding data into a model, running gradient-based optimisation, or need CUDA parallelism for a large matrix operation. For data preprocessing, statistics, and exploratory analysis, NumPy is usually the right choice and you can convert with torch.from_numpy() when the data reaches the model.
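The boundary described above can be sketched in a few lines: preprocess in NumPy, then hand the result to PyTorch only when it reaches the model. The array contents here are illustrative; on CPU, torch.from_numpy() shares memory with the source array rather than copying it.

```python
import numpy as np
import torch

# Preprocess in NumPy: cheap, no autograd or device machinery involved
data = np.arange(6, dtype=np.float32).reshape(2, 3)
data = (data - data.mean()) / (data.std() + 1e-8)  # standardise

# Zero-copy handoff once the data reaches the model (CPU tensors only)
t = torch.from_numpy(data)
print(t.dtype)  # torch.float32, dtype carried over from the array

# The tensor shares memory with the array: writes are visible in both
data[0, 0] = 42.0
print(t[0, 0].item())  # 42.0
```

Because the memory is shared, mutating either object mutates the other; call .clone() after conversion if the model must own an independent copy.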
📊 Production Insight
Device mismatch is the most common runtime crash — print .device on every tensor in the failing line before changing anything else.
In-place operations with trailing underscores (add_, mul_) on requires_grad tensors corrupt the computation graph — sometimes silently. Avoid them entirely on gradient-tracked tensors.
torch.tensor (lowercase) is the correct factory function for creating tensors from data. torch.Tensor (uppercase) is the class and should not appear in data creation code.
Views share storage — use .clone() before modifying any slice you do not want to affect the source tensor.
🎯 Key Takeaway
Device mismatch is the most common runtime error — always pass device=device at tensor creation and never mix CPU and GPU tensors in the same operation. In-place operations break autograd on gradient-tracked tensors. Use torch.tensor() (lowercase) for all data creation. Views share storage with the original — use .clone() when you need an independent copy.
Debugging Tensor Mistakes
If: RuntimeError: Expected all tensors to be on the same device
Use: Print .device on all tensors in the failing operation — add .to(device) at the creation point of the mismatched tensor, not after the fact
If: RuntimeError: a leaf Variable that requires grad is being used in an in-place operation
Use: Replace a.add_(b) with a = a + b — the non-in-place version creates a new tensor and preserves the computation graph
If: .view() fails with RuntimeError about size not compatible with stride
Use: Call .contiguous() before .view() — the tensor is non-contiguous after .transpose() or .permute(). Or use .reshape(), which handles non-contiguous tensors automatically.
If: Modifying a slice changes the original tensor unexpectedly
Use: Call .clone() to create an independent copy before modification — slices are views that share storage with the original
🗂 NumPy Arrays vs PyTorch Tensors
Understanding when to use PyTorch Tensors over NumPy arrays — and when NumPy is actually the better choice
| Aspect | NumPy Arrays | PyTorch Tensors |
| --- | --- | --- |
| Hardware support | CPU only — no GPU path | CPU, NVIDIA GPU via CUDA, Apple Silicon via MPS — same API regardless of device |
| Automatic differentiation | Manual — you implement the derivative by hand | Autograd — .backward() computes all gradients via the chain rule in one pass |
| Deep learning ecosystem | Requires wrappers or conversion to use with PyTorch, JAX, or TensorFlow | Native — every PyTorch layer, loss function, and optimizer operates on tensors directly |
| Memory model | Contiguous C-order arrays in CPU memory — straightforward layout | Views with shape and stride — efficient for transpose and reshape, but non-contiguous tensors require .contiguous() before .view() |
| Interoperability | Universal — the lingua franca of the scientific Python ecosystem | torch.from_numpy() converts with zero copy on CPU — full round-trip compatibility |
| When to use it | Data preprocessing, statistics, visualisation, and any computation that does not need GPU or gradients | Any workload that feeds a neural network, requires gradient-based optimisation, or benefits from GPU parallelism on large matrices |

🎯 Key Takeaways

  • Tensors are the universal data structure in PyTorch — every input, weight, gradient, and model output is a tensor. Understanding how they work is not optional; it is the foundation every other PyTorch concept builds on.
  • requires_grad=True opts a tensor into the computation graph — every subsequent operation is recorded for backpropagation. Set it only on learnable parameters, never on input data, and never on tensors used only for inference.
  • Always pass device=device to tensor creation calls — tensors default to CPU and PyTorch never moves them automatically. The cost of forgetting this is training silently on CPU at 1/10th expected speed.
  • In-place operations (add_, mul_, fill_) on gradient-tracked tensors corrupt the computation graph — sometimes raising an error immediately, sometimes silently producing wrong gradients. Avoid them on any tensor with requires_grad=True.
  • Use torch.tensor() (lowercase) for creating tensors from data. torch.Tensor() (uppercase) is the class constructor and does not infer dtype or accept a device argument — it should not appear in data creation code.
  • Views share storage with the original tensor — .transpose(), .permute(), and slicing all return views. Use .clone() when you need a modification-safe independent copy, and .contiguous() before .view() when the tensor is non-contiguous.

⚠ Common Mistakes to Avoid

    Using tensors for non-ML tasks where NumPy is more efficient
    Symptom

    Unnecessary overhead from PyTorch's Autograd engine, CUDA initialisation, and device management for operations that have no benefit from GPU acceleration. Slower than NumPy for small arrays and non-parallelisable operations.

    Fix

    Use NumPy for data preprocessing, exploratory analysis, and statistical computations that stay on CPU. Convert to tensors with torch.from_numpy() only when the data enters the model or training loop. PyTorch is the right tool for gradient-based optimisation and GPU parallelism — not for replacing NumPy in every computation.

    Keeping tensors on GPU after they are no longer needed
    Symptom

    CUDA Out of Memory errors during training, typically appearing mid-epoch rather than on the first batch. GPU memory fills up gradually because unused tensors accumulate — intermediate activations, logged loss tensors, or debug variables that were never deleted.

    Fix

    Delete unused tensors with del tensor_name when they leave scope in a training loop. Call torch.cuda.empty_cache() between epochs to release cached but unused memory back to the CUDA allocator. Use loss.item() for scalar logging — logging the raw loss tensor holds the entire computation graph in GPU memory.
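A minimal CPU sketch of the logging pattern described above; the model and tensor shapes are invented for illustration. The key line is losses.append(loss.item()): appending the loss tensor itself would keep every step's computation graph alive in the list.

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
losses = []

for step in range(3):
    x = torch.randn(4, 10)
    y = torch.randn(4, 1)
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # .item() extracts a plain Python float, detached from the graph
    losses.append(loss.item())
    del loss  # drop the graph reference once logging is done

print(all(isinstance(v, float) for v in losses))  # True
```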

    Forgetting to call .detach() before converting a gradient-tracked tensor to NumPy
    Symptom

    RuntimeError: Can't call numpy() on Tensor that requires grad. This surfaces when trying to log, visualise, or post-process a tensor that was created with requires_grad=True or is the output of a computation involving gradient-tracked parameters.

    Fix

    Call tensor.detach().cpu().numpy() to safely convert — .detach() removes the tensor from the computation graph, .cpu() moves it off GPU if necessary, and .numpy() converts to a NumPy array. The order matters: detach before numpy, cpu before numpy on GPU tensors.
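The conversion chain in miniature (a CPU tensor here, so .cpu() is a no-op but harmless):

```python
import torch

w = torch.randn(3, requires_grad=True)
y = w * 2  # y has a grad_fn, so y.numpy() would raise

# y.numpy()  # RuntimeError: Can't call numpy() on Tensor that requires grad

arr = y.detach().cpu().numpy()  # detach first, move to CPU, then convert
print(type(arr).__name__)       # ndarray
```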

Interview Questions on This Topic

  • Q (Mid-level): What is the difference between torch.Tensor (class constructor) and torch.tensor (factory function)?
    torch.Tensor (capital T) is the tensor class constructor. Calling torch.Tensor([1, 2, 3]) creates a float32 tensor — it is equivalent to torch.FloatTensor([1, 2, 3]). It does not perform dtype inference from the input, does not accept a device argument directly, and is considered a low-level interface. torch.tensor (lowercase t) is the recommended factory function for creating tensors from data. It infers dtype from the Python or NumPy type of the input (int64 for Python integers, float32 for Python floats with a trailing decimal, and so on), always creates a copy of the data, and accepts dtype, device, and requires_grad as arguments. In production code, torch.tensor() should appear in all data creation — torch.Tensor() should not. The practical danger of torch.Tensor(): calling torch.Tensor(3, 4) creates a 3x4 tensor of uninitialised memory rather than a tensor from the data [3, 4], which is a silent correctness bug.
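The difference, including the silent shape-vs-data trap from the answer, is easy to demonstrate:

```python
import torch

# torch.tensor infers dtype from the data
print(torch.tensor([1, 2, 3]).dtype)    # torch.int64
print(torch.tensor([1.0, 2.0]).dtype)   # torch.float32

# torch.Tensor forces float32 regardless of the input
print(torch.Tensor([1, 2, 3]).dtype)    # torch.float32

# The silent trap: integer arguments are treated as a SHAPE, not data
print(torch.Tensor(3, 4).shape)         # torch.Size([3, 4]), uninitialised values
print(torch.tensor([3, 4]))             # tensor([3, 4]), actual data
```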
  • Q (Senior): Explain the contiguous tensor concept. Why do we often need to call .contiguous() before a .view() operation?
    A tensor is contiguous when its elements are stored in memory in the same order as they are addressed logically when iterating from the first to the last dimension. Operations like .transpose() and .permute() change the logical order of elements by reordering the stride metadata — the stride tells PyTorch how many memory positions to step in each dimension. After transposing, the strides no longer correspond to sequential memory access, so the tensor is non-contiguous. .view() reinterprets the same memory with a different shape, which only works if the elements are laid out contiguously — it cannot handle the non-sequential memory access pattern of a transposed tensor. Calling .contiguous() creates a new tensor with the data copied into the correct sequential memory order, after which .view() works. The practical alternative: use .reshape() instead of .view() — it handles non-contiguous tensors by making a copy when necessary and returning a view when possible. The symptom that tells you .contiguous() is needed: RuntimeError: view size is not compatible with input tensor's size and stride.
  • Q (Mid-level): Describe the broadcasting rules in PyTorch. How does the framework handle an operation between a (3, 1) tensor and a (1, 3) tensor?
    PyTorch broadcasting follows NumPy rules, applied from the rightmost dimension leftward. The three rules: (1) Align shapes from the right, padding with 1s on the left for any missing dimensions. (2) For each dimension pair, dimensions must either be equal, or one of them must be 1 — if one is 1, it is virtually expanded to match the other. (3) If neither condition holds for any dimension, the operation raises an error. For a (3, 1) tensor and a (1, 3) tensor: rightmost dimensions are 1 and 3 — the 1 expands to 3. Left dimensions are 3 and 1 — the 1 expands to 3. Result shape: (3, 3). Each element of the output combines the corresponding element from the first tensor's column (broadcast across columns) with the corresponding element from the second tensor's row (broadcast across rows). Broadcasting never allocates new memory for the expanded dimensions — it uses stride tricks to iterate over the same data. This is how bias addition works in linear layers: a (batch, features) output plus a (features,) bias broadcasts the bias across all batch items without copying.
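The (3, 1) + (1, 3) case from the answer, run directly, along with the bias pattern it mentions:

```python
import torch

col = torch.tensor([[0.0], [1.0], [2.0]])  # shape (3, 1)
row = torch.tensor([[10.0, 20.0, 30.0]])   # shape (1, 3)

# Each size-1 dimension is virtually expanded; no data is copied
grid = col + row
print(grid.shape)  # torch.Size([3, 3])
print(grid)
# tensor([[10., 20., 30.],
#         [11., 21., 31.],
#         [12., 22., 32.]])

# The bias pattern from linear layers: (batch, features) + (features,)
out = torch.zeros(4, 3) + torch.tensor([1.0, 2.0, 3.0])
print(out.shape)   # torch.Size([4, 3])
```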
  • Q (Senior): How does the Autograd engine use the grad_fn attribute to perform backpropagation?
    When an operation is performed on a tensor with requires_grad=True, the resulting tensor stores a reference to the Function object that created it in its grad_fn attribute. This Function knows: (1) which operation was performed (e.g., PowBackward0 for exponentiation); (2) references to the input tensors it needs to compute local gradients during backpropagation; and (3) how to compute those local gradients — each Function implements a backward() method that applies the local derivative of its operation. Leaf tensors — those created directly by the user, not as the output of an operation — have grad_fn=None. When .backward() is called on a scalar loss, Autograd collects the loss tensor's grad_fn and traverses the DAG of Function nodes in reverse topological order using a queue. At each node it calls the Function's backward() method, passing in the upstream gradient, and the result is accumulated into the .grad attribute of any leaf tensor encountered. This is reverse-mode automatic differentiation — one backward pass computes gradients with respect to all parameters simultaneously, making its complexity O(n) in the number of parameters.
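A minimal sketch of the graph the answer describes, using a single scalar parameter:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)  # leaf: created by the user
y = x ** 2
z = y * 3

print(x.grad_fn)  # None, leaf tensors have no creating Function
print(type(y.grad_fn).__name__)  # PowBackward0
print(type(z.grad_fn).__name__)  # MulBackward0

# next_functions are the DAG edges back to each input's Function
print(z.grad_fn.next_functions)

z.backward()      # reverse traversal, local derivatives chained together
print(x.grad)     # dz/dx = 6x = tensor(12.)
```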
  • Q (Senior): What is the difference between torch.no_grad() and torch.inference_mode(), and when should you use each?
    Both context managers disable gradient computation, but they do so with different scope and performance characteristics. torch.no_grad() disables the Autograd engine — operations inside the context do not build a computation graph and do not record grad_fn. However, it still maintains version counters on tensors, which track in-place modifications to detect if a tensor was changed after being used in a computation. torch.inference_mode() is more aggressive — it disables both gradient computation and version counter tracking. Tensors created inside inference_mode() are permanently marked as inference tensors and cannot be used in a backward pass even after leaving the context. This makes inference_mode() 10–20% faster than no_grad() on typical inference workloads. Use torch.no_grad() during validation loops inside a training run — you may still need version tracking for other operations and the performance difference is small relative to the full training step. Use torch.inference_mode() for production serving and any code path that is purely inference — it is the faster and more semantically correct choice when you are certain no backward pass will follow.

Frequently Asked Questions

What is a PyTorch Tensor in simple terms?

A PyTorch tensor is a multidimensional array — like a NumPy array — that can live on a GPU and automatically track every mathematical operation performed on it. The GPU part makes large matrix computations 10–100x faster. The operation tracking is what enables automatic differentiation: when you tell PyTorch 'compute gradients', it traces every step backwards and tells you exactly how to adjust each number to reduce your error. These two capabilities together are what make neural network training practical.

What is the difference between .view() and .reshape()?

.view() returns a new tensor with a different shape that shares the same underlying storage as the original. It requires the tensor to be contiguous in memory — if it is not (for example, after a .transpose()), .view() raises RuntimeError: view size is not compatible with input tensor's size and stride. .reshape() is more flexible: it returns a view if the tensor is already contiguous, and silently makes a copy if it is not. The practical rule: use .reshape() unless you explicitly need the guarantee that no data was copied. If .reshape() returns a view, modifying it modifies the original — so be aware of the view semantics either way.
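A short sketch of both behaviours side by side:

```python
import torch

t = torch.arange(12).reshape(3, 4)
tt = t.T                       # transpose: a non-contiguous view

print(tt.is_contiguous())      # False
# tt.view(12)                  # RuntimeError: view size is not compatible ...

flat = tt.reshape(12)          # reshape copies when it has to
print(flat.is_contiguous())    # True

# On a contiguous tensor, reshape returns a view sharing storage
v = t.reshape(12)
v[0] = 99
print(t[0, 0].item())          # 99, view semantics still apply
```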

How do I move a tensor from GPU back to CPU?

Call tensor.cpu() to return a new tensor on CPU, or tensor.to('cpu'). If the tensor has requires_grad=True or is part of a computation graph, call tensor.detach().cpu() first — .detach() removes it from the graph so NumPy conversion and other CPU operations work correctly. The full pattern for converting a GPU tensor to a NumPy array: tensor.detach().cpu().numpy(). This order is required: detach before numpy (to remove autograd tracking), cpu before numpy (to move off GPU).

Why does torch.cuda.is_available() return False even though I have a GPU?

The most common causes in order: (1) The NVIDIA driver is not installed or is too old — check with nvidia-smi. (2) You installed the CPU-only version of PyTorch — the package name differs; install the CUDA-enabled version from pytorch.org. (3) Inside a Docker container, the CUDA version in the base image is higher than what the host driver supports — the container starts but CUDA initialisation fails. (4) The GPU is visible to the OS but not to the current user — check with nvidia-smi -L and verify permissions. In all cases, torch.cuda.is_available() returning False means every tensor stays on CPU and training runs at 1/10th expected speed with no error.

When should I use torch.no_grad() vs torch.inference_mode()?

Use torch.no_grad() during validation loops inside a training run — it disables gradient computation without disabling version counter tracking, which provides a small safety net if other code in the same scope depends on version information. Use torch.inference_mode() for production serving and any pure inference path — it disables both gradient computation and version tracking, runs 10–20% faster, and is the semantically correct choice when you are certain no backward pass will follow. Tensors created inside inference_mode() are marked permanently and cannot be used in a backward pass even after leaving the context, which prevents an entire class of accidental training-in-inference bugs.
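The observable difference can be checked at runtime; is_inference() reports whether a tensor was created under inference_mode():

```python
import torch

w = torch.randn(3, requires_grad=True)

with torch.no_grad():
    a = w * 2
print(a.requires_grad)   # False, no graph was recorded

with torch.inference_mode():
    b = w * 2
print(b.requires_grad)   # False
print(b.is_inference())  # True, permanently marked as an inference tensor

# b can never join a backward pass, even outside the context:
# (b * w).sum().backward()  # RuntimeError: Inference tensors cannot be saved ...
```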

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
