Beginner 9 min · March 09, 2026

PyTorch Tensors Explained

PyTorch Tensors — Silent CPU Fallback Kills GPU Utilization

Q: What is a PyTorch Tensor in simple terms?

A PyTorch tensor is a multidimensional array — like a NumPy array — that can live on a GPU and automatically track every mathematical operation performed on it. The GPU part makes large matrix computations 10–100x faster. The operation tracking is what enables automatic differentiation: when you tell PyTorch 'compute gradients', it traces every step backwards and tells you exactly how to adjust each number to reduce your error. These two capabilities together are what make neural network training practical.

Q: What is the difference between .view() and .reshape()?

.view() returns a new tensor with a different shape that shares the same underlying storage as the original. It requires the tensor's memory layout to be compatible with the new shape, which in practice means contiguous. If it is not (for example, after a .transpose()), .view() raises RuntimeError: view size is not compatible with input tensor's size and stride. .reshape() is more flexible: it returns a view when the layout allows it and silently makes a copy when it does not. The practical rule: use .reshape() unless you explicitly need the guarantee that no data was copied. If .reshape() returns a view, modifying it modifies the original, so be aware of the view semantics either way.

Q: How do I move a tensor from GPU back to CPU?

Call tensor.cpu() or tensor.to('cpu'); both return a copy on CPU (or the same tensor if it is already there). If the tensor has requires_grad=True or is part of a computation graph, use tensor.detach().cpu() instead: .detach() removes it from the graph so NumPy conversion and other CPU operations work correctly. The full pattern for converting a GPU tensor to a NumPy array: tensor.detach().cpu().numpy(). Both calls must come before .numpy(): detach() removes autograd tracking and cpu() moves the data off the GPU. The relative order of detach() and cpu() does not matter.
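A minimal sketch of the conversion pattern described above. The helper name to_numpy is hypothetical, and the example falls back to CPU when no GPU is present so that it runs anywhere.

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def to_numpy(t: torch.Tensor) -> np.ndarray:
    """Hypothetical helper: detach from autograd, move to CPU, then convert."""
    return t.detach().cpu().numpy()


weights = torch.randn(2, 3, device=device, requires_grad=True)
scaled = weights * 2  # part of a computation graph, so .numpy() alone would raise

arr = to_numpy(scaled)
print(type(arr), arr.shape, arr.dtype)
```

Wrapping the chain in one helper keeps the call sites short and makes it harder to forget one of the steps.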

Q: Why does torch.cuda.is_available() return False even though I have a GPU?

The most common causes, in order: (1) The NVIDIA driver is not installed or is too old; check with nvidia-smi. (2) You installed a CPU-only build of PyTorch. The package is still called torch, but the build differs, so print torch.version.cuda (it is None on CPU-only builds) and reinstall using the CUDA install command from pytorch.org. (3) Inside a Docker container, the CUDA version in the base image is higher than what the host driver supports; the container starts but CUDA initialisation fails. (4) The GPU is visible to the OS but not to the current user; check with nvidia-smi -L and verify permissions. In all cases, torch.cuda.is_available() returning False means that any code with a CPU fallback keeps every tensor on CPU, and training runs many times slower than expected with no error.
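The checks above can be gathered in one place. This is a small sketch, and the function name cuda_diagnostics is an assumption made for illustration; it only uses attributes that exist on both CPU-only and CUDA builds.

```python
import torch


def cuda_diagnostics() -> dict:
    """Collect the facts that help tell the common causes apart."""
    info = {
        "torch_version": torch.__version__,
        # None here means this is a CPU-only build of PyTorch
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if info["cuda_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
    return info


for key, value in cuda_diagnostics().items():
    print(f"{key}: {value}")
```

If built_with_cuda is None, the build is the problem; if it is set but cuda_available is False, look at the driver, the container, or permissions.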

Q: When should I use torch.no_grad() vs torch.inference_mode()?

Use torch.no_grad() during validation loops inside a training run. It disables gradient computation without disabling version counter tracking, which provides a small safety net if other code in the same scope depends on version information. Use torch.inference_mode() for production serving and any pure inference path. It disables both gradient computation and version tracking, is typically somewhat faster (measure the gain on your own workload), and is the semantically correct choice when you are certain no backward pass will follow. Tensors created inside inference_mode() stay marked as inference tensors and cannot be saved for a backward pass even after leaving the context, which prevents an entire class of accidental training-in-inference bugs.
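A short sketch of the difference, using a toy nn.Linear model as a stand-in for a real network. It only shows how the outputs are marked, not the speed difference.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)
x = torch.randn(8, 4)

# Validation inside a training run: gradients off, outputs are ordinary tensors
with torch.no_grad():
    val_out = model(x)

# Pure inference path: gradients off, outputs are marked as inference tensors
with torch.inference_mode():
    inf_out = model(x)

print(val_out.requires_grad, val_out.is_inference())  # False False
print(inf_out.requires_grad, inf_out.is_inference())  # False True
```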

Training slows 8x when a broad except block catches device mismatch error, silently running on CPU.

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

Last updated: July 19, 2026

Before you start (⏱ 20 min)

✓ Basic programming fundamentals
✓ A computer with internet access
✓ Willingness to follow along with examples

⚡Quick Answer

Tensors are GPU-accelerated multidimensional arrays — the universal data structure for all PyTorch operations
They mirror NumPy's API but add CUDA support and automatic differentiation via Autograd
requires_grad=True enables gradient tracking — the tensor records every operation for backpropagation
Moving tensors to GPU with .to('cuda') provides 10-100x speedup for large matrix operations
Device mismatch (CPU tensor + GPU tensor in the same operation) is the #1 production RuntimeError — always check .device
Small tensors incur more transfer overhead than they save — only move to GPU when the computation justifies it

Definition

What is a PyTorch tensor?

A PyTorch tensor is a multidimensional array that was designed to solve a problem NumPy cannot: running massive parallel computations on a GPU while simultaneously tracking every operation for automatic gradient computation.

NumPy arrays are excellent for scientific computing — fast, well-documented, universally supported. But they have two hard limits. First, they run only on CPU. Second, they have no concept of a computation graph. This means that if you want to train a neural network with NumPy, you implement backpropagation manually — computing partial derivatives by hand for every layer, every parameter, every batch.

That is tractable for a two-layer toy network and completely unworkable for anything beyond it.

Plain-English First

Imagine a standard spreadsheet. A single number is a scalar, a single row is a vector, and the full grid of rows and columns is a matrix. A tensor is that same idea extended to any number of dimensions — a cube of numbers, a four-dimensional hypercube, whatever the problem requires. What makes PyTorch tensors special is not the shape. It is two things layered on top: first, they can live on a GPU and run thousands of operations in parallel instead of one at a time on a CPU. Second, they remember every mathematical operation ever applied to them. When you eventually ask 'how should I change these numbers to reduce the error?', the tensor can trace every step backwards and give you the exact answer — automatically, without you writing a single line of calculus.

PyTorch Tensors are the fundamental data structure in PyTorch — every input, weight, gradient, and output is a tensor. They are multidimensional arrays that mirror NumPy's API but add two capabilities that NumPy does not have: GPU acceleration via CUDA and automatic differentiation via Autograd.

The key design decision: tensors are not just data containers. When requires_grad=True, they become nodes in a dynamic computation graph. Every operation on them is recorded as the forward pass executes, enabling automatic gradient computation when you call .backward(). This is what makes neural network training tractable — without it, you would manually compute partial derivatives for every parameter on every update, which is not realistic at any modern model size.

The production failure pattern: device mismatch. A tensor on CPU cannot participate in the same operation as a tensor on GPU. PyTorch raises RuntimeError: Expected all tensors to be on the same device immediately and clearly. What is not clear is which tensor is on the wrong device. The fix is always the same: establish device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') once, pass it to every tensor creation call, and add an assertion at training start that verifies model parameters and input data are on the same device.
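One way to write the assertion mentioned above is sketched here. The helper assert_same_device is hypothetical, and the nn.Linear model and random tensors are placeholders for a real model and batch.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def assert_same_device(model: nn.Module, *tensors: torch.Tensor) -> None:
    """Hypothetical helper: fail fast and name the tensor on the wrong device."""
    model_device = next(model.parameters()).device
    for i, t in enumerate(tensors):
        if t.device != model_device:
            raise RuntimeError(
                f"tensor #{i} is on {t.device}, but the model is on {model_device}"
            )


model = nn.Linear(10, 1).to(device)
inputs = torch.randn(32, 10, device=device)
targets = torch.randn(32, 1, device=device)

# Passes silently when everything matches; run it once before the training loop
assert_same_device(model, inputs, targets)
```

Unlike the built-in error, this check reports which argument is misplaced, which is the part the standard message leaves out.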

As of 2026, there is a third device option worth knowing: Apple Silicon. PyTorch supports MPS (Metal Performance Shaders) on M-series Macs via device='mps', which provides meaningful GPU acceleration on MacBooks without CUDA. The same .to(device) pattern applies: the device string changes, the code does not.

What Is a PyTorch Tensor and Why Does It Exist?

PyTorch tensors solve both of NumPy's limits for deep learning: CPU-only execution and the lack of a computation graph. The storage layer underneath a tensor can live on CPU, on an NVIDIA GPU via CUDA, or on Apple Silicon via MPS. When the storage is on GPU, every matrix operation dispatches to CUDA kernels that execute in parallel across thousands of GPU cores. This is why large matrix multiplications can be 10–100x faster on GPU for the sizes neural networks operate on. When requires_grad=True, the tensor records every operation as part of a dynamic computation graph. When .backward() is called on the loss, that graph is traversed in reverse and .grad is filled in on every participating leaf tensor via the chain rule: one backward pass, all gradients computed together.
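The backward pass described above can be seen on a three-element example. This is a minimal sketch where the "loss" is simply the sum of squares, so the expected gradient is 2 * w.

```python
import torch

# Leaf tensor that opts into gradient tracking
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: each operation is recorded in the graph
loss = (w ** 2).sum()

# Backward pass: d(loss)/dw = 2 * w
loss.backward()

print(w.grad)  # tensor([2., 4., 6.])
```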

The architectural detail worth understanding: a tensor is a view into a storage object. The tensor knows its shape, stride, dtype, and device. The storage holds the raw bytes. Operations like .transpose() and .permute() create new tensor views with reordered strides without moving any data in memory — the storage stays identical. This is efficient but has a consequence: the resulting tensor is non-contiguous, and .view() will refuse to work on it because .view() requires elements to be laid out in memory in the same order they are addressed logically. The fix is .contiguous(), which copies the data into a new storage with the expected memory order.
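The storage and stride behaviour described above can be checked directly. The sketch below uses a small 3×4 tensor; the variable names are illustrative only.

```python
import torch

t = torch.arange(12, dtype=torch.float32).reshape(3, 4)
t_t = t.transpose(0, 1)  # new view: shape (4, 3), same storage

print(t.stride(), t.is_contiguous())      # (4, 1) True
print(t_t.stride(), t_t.is_contiguous())  # (1, 4) False
print(t.data_ptr() == t_t.data_ptr())     # True: no bytes were moved

try:
    t_t.view(12)
except RuntimeError as err:
    print(f"view failed: {type(err).__name__}")

flat = t_t.contiguous().view(12)  # copy into row-major order, then view
also_flat = t_t.reshape(12)       # reshape copies only when it has to
print(torch.equal(flat, also_flat))  # True
```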

As of PyTorch 2.x, there is one more layer: compilation. torch.compile traces your forward pass and compiles it into optimised kernels using TorchInductor. The tensor API is identical: you wrap the model in torch.compile (or decorate a function with it) and the same tensor operations run in a fused, optimised form. For production inference workloads in 2026, torch.compile is often the highest-leverage single change you can make to a trained model, so benchmark it on your architecture.

io/thecodeforge/ml/forge_tensor_basics.py (Python)

import torch

# Device selection — this pattern belongs at the top of every script
# Supports CUDA (NVIDIA), MPS (Apple Silicon), and CPU fallback
if torch.cuda.is_available():
    device = torch.device("cuda")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon GPU — supported from PyTorch 1.12+
else:
    device = torch.device("cpu")

print(f"Using device: {device}")


def initialize_forge_tensors():
    # Creating tensors from Python data — torch.tensor (lowercase) infers dtype
    data = [[1, 2], [3, 4]]
    x_data = torch.tensor(data, dtype=torch.float32, device=device)
    print(f"From list — shape: {x_data.shape}, device: {x_data.device}, dtype: {x_data.dtype}")

    # Creating tensors with specific values — pass device at creation, not after
    x_ones  = torch.ones(2, 3, dtype=torch.float32, device=device)
    x_zeros = torch.zeros(2, 3, dtype=torch.float32, device=device)
    x_rand  = torch.randn(2, 3, device=device)  # standard normal distribution
    print(f"\nones:\n{x_ones}")
    print(f"zeros:\n{x_zeros}")
    print(f"randn:\n{x_rand}")

    # requires_grad=True — opts this tensor into the computation graph
    # Only set this on learnable parameters, never on input data
    x_grad = torch.randn(3, 3, requires_grad=True, device=device)
    print(f"\nGradient tracking enabled: {x_grad.requires_grad}")
    print(f"grad_fn before operation: {x_grad.grad_fn}")  # None — leaf tensor

    # Any operation on a requires_grad tensor produces a non-leaf tensor with grad_fn
    y = x_grad ** 2  # element-wise square
    print(f"grad_fn after operation:  {y.grad_fn}")  # PowBackward0

    return x_data, x_grad


x_data, x_grad = initialize_forge_tensors()

# Tensor metadata inspection — useful diagnostic at the start of debugging
for name, t in [("x_data", x_data), ("x_grad", x_grad)]:
    print(f"{name}: shape={t.shape}, dtype={t.dtype}, device={t.device}, requires_grad={t.requires_grad}")

Output

Using device: cuda
From list — shape: torch.Size([2, 2]), device: cuda:0, dtype: torch.float32

ones:
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
zeros:
tensor([[0., 0., 0.],
        [0., 0., 0.]], device='cuda:0')
randn:
tensor([[ 0.3152, -1.2089,  0.7741],
        [-0.4156,  0.9823, -0.1205]], device='cuda:0')

Gradient tracking enabled: True
grad_fn before operation: None
grad_fn after operation:  <PowBackward0 object at 0x7f3a2c1d4b50>
x_data: shape=torch.Size([2, 2]), dtype=torch.float32, device=cuda:0, requires_grad=False
x_grad: shape=torch.Size([3, 3]), dtype=torch.float32, device=cuda:0, requires_grad=True

Mental Model

The Tensor Mental Model

A tensor is a view into a storage buffer that lives on CPU or GPU — operations dispatch to hardware-specific kernels, and requires_grad threads a computation graph through every operation for automatic differentiation.

Tensor = multidimensional array with device awareness — the same API whether the storage is on CPU, CUDA, or MPS
NumPy-like API but with GPU dispatch and Autograd built in — torch.from_numpy() converts with zero copy
requires_grad=True builds a computation graph as operations execute — .backward() traverses it in reverse to compute all gradients
GPU tensors use CUDA kernels for massively parallel execution — the speedup is real only for large tensors; small ones have more transfer overhead than benefit
Tensors are views with shape and stride — .transpose() and .permute() reorder strides without moving data; call .contiguous() before .view() if the tensor is non-contiguous

📊 Production Insight

Tensors default to CPU at creation — always pass device=device explicitly to every creation call rather than creating on CPU and moving afterward.

requires_grad=True enables Autograd — without it the tensor is a static array with no gradient path and no learning.

As of PyTorch 2.x, wrapping your model with torch.compile compiles tensor operations into fused CUDA kernels — benchmark it on your architecture before shipping to production.

Rule: set device once at the top of the script, pass it everywhere, log it at startup, and assert it before the training loop.

🎯 Key Takeaway

Tensors are GPU-aware multidimensional arrays — the universal data structure in PyTorch. requires_grad=True opts the tensor into Autograd and every subsequent operation is recorded for the backward pass. Always pass device=device to tensor creation — tensors default to CPU and PyTorch will never move them automatically.

Tensor Creation Decision

If: Creating input data for model training
→ Use torch.tensor(data, dtype=torch.float32, device=device, requires_grad=False): explicit dtype and device, no gradient tracking on inputs

If: Creating trainable model parameters
→ Use nn.Parameter(torch.randn(..., device=device)): automatically sets requires_grad=True and registers the parameter with the module

If: Converting a NumPy array to a tensor
→ Use torch.from_numpy(arr) for zero-copy on CPU, then .to(device) to move to GPU, or torch.tensor(arr, device=device) if you need a copy

If: Creating a tensor for inference only
→ Create without requires_grad and wrap the forward pass in torch.inference_mode(): faster than torch.no_grad(), and it prevents the tensor from being used in a backward pass

Enterprise Data Pipelines: SQL to Tensor Conversion

In production ML systems, training data rarely arrives as a Python list. It lives in a relational database — normalised, versioned, and filtered by business logic before it ever reaches a tensor. The conversion path from SQL to tensor has two meaningful implementation choices that trade memory efficiency against safety, and getting this wrong at scale causes OOM crashes or silent data corruption.

The recommended pipeline: query SQL into a Pandas DataFrame, or fetch rows with the database cursor's fetchall method and build a NumPy array from them. Then convert to a tensor using torch.from_numpy(arr) for zero-copy conversion: the tensor and the NumPy array share the same underlying memory, so no data is duplicated. The array must have a numeric dtype that PyTorch supports, so cast object-typed columns first. Move to GPU with .to(device), which is where the one unavoidable copy happens. This path keeps host memory usage as low as possible for datasets that fit in RAM.

The danger with torch.from_numpy(): because the tensor and the NumPy array share memory, modifying the NumPy array after conversion will silently change the tensor's data. In a pipeline where the DataFrame is reused or mutated for other purposes, this can corrupt training data without any error. If the NumPy array may change after conversion, use torch.tensor(arr, device=device) instead — it creates an independent copy at the cost of a second allocation.
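The shared-memory hazard is easy to reproduce. In this small sketch, a three-element array stands in for the data pulled from the database.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)

shared = torch.from_numpy(arr)  # zero-copy: same memory as arr
copied = torch.tensor(arr)      # independent copy

arr[0] = 99.0  # mutate the source array after conversion

print(shared)  # tensor([99.,  2.,  3.])
print(copied)  # tensor([1., 2., 3.])
```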

For datasets that do not fit in RAM, loading everything at once causes OOM before training starts. Whether a table fits depends on rows × features × bytes per value, not on row count alone. The correct pattern is a custom Dataset that reads only the rows it needs in __getitem__, combined with a DataLoader that batches and parallelises the fetching. This keeps memory usage proportional to batch size regardless of dataset size.
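A minimal sketch of the lazy-loading pattern, under loud assumptions: an in-memory SQLite table named samples stands in for the real database, and the class name LazySqlDataset is hypothetical. The column names mirror the ones used in this article's query.

```python
import sqlite3

import torch
from torch.utils.data import DataLoader, Dataset

# Assumption: a throwaway in-memory SQLite table stands in for the real database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    "sample_id INTEGER PRIMARY KEY, feature_a REAL, feature_b REAL, target_label REAL)"
)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    [(i, i * 0.1, i * 0.2, float(i % 2)) for i in range(100)],
)


class LazySqlDataset(Dataset):
    """Hypothetical Dataset: reads one row per __getitem__ call instead of the whole table."""

    def __init__(self, connection: sqlite3.Connection):
        self.conn = connection
        self.length = self.conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int):
        row = self.conn.execute(
            "SELECT feature_a, feature_b, target_label FROM samples WHERE sample_id = ?",
            (idx,),
        ).fetchone()
        features = torch.tensor(row[:2], dtype=torch.float32)
        label = torch.tensor(row[2], dtype=torch.float32)
        return features, label


# num_workers=0 keeps the single SQLite connection in one process
loader = DataLoader(LazySqlDataset(conn), batch_size=16, shuffle=False, num_workers=0)
features, labels = next(iter(loader))
print(features.shape, labels.shape)  # torch.Size([16, 2]) torch.Size([16])
```

With num_workers greater than zero, each worker process would need to open its own database connection, since connections generally cannot be shared across processes.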

io/thecodeforge/queries/fetch_features.sql (SQL)

-- Extracting normalised features for tensor conversion
-- We pull only the columns needed for training — not SELECT *
-- The WHERE clause filters to verified samples only, matching the Dataset class expectation
-- LIMIT controls batch size for incremental loading patterns

SELECT
    feature_a,
    feature_b,
    target_label
FROM io.thecodeforge.analytics_table
WHERE status = 'processed'
  AND is_verified = TRUE
  AND split_tag = 'train'
ORDER BY sample_id ASC
LIMIT 10000;

-- For incremental loading in a custom Dataset.__getitem__:
-- Use LIMIT :batch_size OFFSET :offset with parameterised queries
-- Never load the full table in __init__ — keep memory proportional to batch size

Output

Returns a tabular result set ready for Pandas read or cursor fetchall, then torch.from_numpy() conversion.

🔥SQL to Tensor: The Zero-Copy Path

When moving data from SQL, pull into a NumPy array via Pandas (df.values) or SQLAlchemy, then convert with torch.from_numpy(). This creates a tensor that shares the same underlying memory as the NumPy array — no data is duplicated. Move to GPU with .to(device) afterward. The caveat: because the tensor and the array share memory, any mutation of the NumPy array after conversion will silently change the tensor's data. If the array may be modified, use torch.tensor(arr, device=device) instead to get an independent copy.

📊 Production Insight

torch.from_numpy() shares memory with the NumPy array — zero-copy and fast, but any subsequent mutation of the array silently corrupts the tensor.

torch.tensor(arr) creates an independent copy — safer for pipelines where the source array is reused or modified.

For datasets larger than available RAM, load in chunks inside a custom Dataset.__getitem__ rather than upfront in __init__ — this keeps memory proportional to batch size regardless of dataset size.

Rule: from_numpy() for read-only pipelines where you control the source array's lifetime; tensor() when in doubt.

🎯 Key Takeaway

torch.from_numpy() shares memory with the NumPy array — zero-copy but modifications to the array propagate silently to the tensor. torch.tensor() creates a copy — safer for production pipelines where the source data has a longer lifetime than the tensor. For large datasets, fetch in chunks — loading everything at once is an OOM crash waiting to happen.

SQL to Tensor Conversion Decision

If: Dataset fits in RAM and the NumPy array will not be modified after conversion
→ Use torch.from_numpy(df.values).to(device): zero-copy conversion, minimum memory usage

If: NumPy array may be modified or reused for other purposes after tensor creation
→ Use torch.tensor(df.values, dtype=torch.float32, device=device): creates an independent copy, prevents silent data corruption

If: Dataset is too large for RAM (more than ~1M rows or high-dimensional features)
→ Implement a custom Dataset with lazy loading: fetch rows by offset in __getitem__ and let the DataLoader handle batching and parallelism

Standardising Environments with Docker

PyTorch's GPU support depends on a precise version compatibility chain: host NVIDIA driver → CUDA runtime → cuDNN → PyTorch. A mismatch at any point produces a silent failure: torch.cuda.is_available() returns False, tensors stay on CPU in any code with a CPU fallback, and training runs at a fraction of expected speed with no error message to guide you. Docker solves this by pinning everything above the host driver (CUDA runtime, cuDNN, and PyTorch) in one image tag.

The compatibility rule: the CUDA version in the Docker base image must be less than or equal to the CUDA version supported by the host machine's NVIDIA driver. The driver's maximum supported CUDA version is shown in the top-right corner of nvidia-smi output. If the image requests a higher CUDA version than the driver supports, PyTorch loads but CUDA initialisation fails silently. The fix is always to pick a base image whose CUDA version is at or below what nvidia-smi reports.

Two environment variables matter for GPU-enabled containers. NVIDIA_VISIBLE_DEVICES controls which physical GPUs the container can see — set it to all for training containers or to a specific index when you need to isolate workloads on a multi-GPU host. NVIDIA_DRIVER_CAPABILITIES tells the NVIDIA Container Toolkit which driver features to expose — compute gives you CUDA compute, utility gives you nvidia-smi inside the container. Both should be set in the Dockerfile rather than passed at runtime so they are reproducible.

The verification step that should run before every training job: add a startup script that calls torch.cuda.is_available(), prints the GPU name and total memory, and exits with a non-zero code if CUDA is not available when it is expected. This turns silent CPU fallback into an immediate and obvious failure that stops the job before it wastes hours of compute.
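A sketch of such a startup check, written as a function that returns an exit code so it can be tested without a GPU. The name verify_cuda and the expect_gpu flag are assumptions made for illustration.

```python
import sys

import torch


def verify_cuda(expect_gpu: bool) -> int:
    """Return a process exit code: 0 when the environment matches expectations, 1 otherwise."""
    available = torch.cuda.is_available()
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA available: {available}")
    if available:
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
    if expect_gpu and not available:
        print("ERROR: GPU expected but CUDA is not available", file=sys.stderr)
        return 1
    return 0


# In a real entrypoint this would be: sys.exit(verify_cuda(expect_gpu=True))
exit_code = verify_cuda(expect_gpu=False)
print(f"exit code: {exit_code}")
```

Returning the code instead of exiting inside the function keeps the check reusable from both a container entrypoint and a test suite.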

Dockerfile

# Pin specific PyTorch and CUDA versions — never use 'latest' in production
# 'latest' changes silently and makes training runs non-reproducible
# Check pytorch.org for the full compatibility matrix before changing these
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime

WORKDIR /app

# Expose all GPUs to the container runtime
# NVIDIA_VISIBLE_DEVICES=all makes every physical GPU available
# NVIDIA_DRIVER_CAPABILITIES=compute,utility enables CUDA compute and nvidia-smi
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Prevent CUDA memory fragmentation on long training runs
# Without this, OOM errors can occur even when total free memory is sufficient
ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Startup verification: runs when the container starts, not at build time
# (docker build does not expose GPUs by default, so a CUDA check in a RUN step would fail on any host)
# Remove the CUDA assertion for CPU-only deployment targets
# Run with: docker run --gpus all --shm-size=2g -v /data:/data thecodeforge/torch-runtime:2.3.1
CMD ["sh", "-c", "python -c \"import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}'); assert torch.cuda.is_available(), 'CUDA not available'; print(f'GPU: {torch.cuda.get_device_name(0)}'); print(f'CUDA version: {torch.version.cuda}')\" && python ForgeTensorBasics.py"]

Output

PyTorch: 2.3.1
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
CUDA version: 12.1

⚠ CUDA Version Mismatch Causes Silent CPU Fallback

The CUDA version in the Docker base image must be less than or equal to the CUDA version the host NVIDIA driver supports. Run nvidia-smi on the host to see the maximum supported CUDA version in the top-right corner. A CUDA 12.1 image on a host with a driver that only supports CUDA 11.8 will start without error, load PyTorch, and silently return False from torch.cuda.is_available(). Code with a CPU fallback then keeps all tensors on CPU and training runs at a fraction of expected throughput. Add a startup verification step to your container that asserts CUDA availability and exits with a non-zero code if the assertion fails.

📊 Production Insight

CUDA version in the image must be <= the host driver's maximum supported CUDA version — the right number is in the top-right corner of nvidia-smi output.

Mismatch causes torch.cuda.is_available() to return False silently — all tensors stay on CPU with no warning.

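Run the CUDA assertion when the container starts, not in a RUN step: docker build does not expose GPUs by default, so a build-time assert torch.cuda.is_available() fails even on a correctly configured host. A startup check that exits non-zero stops the job before it wastes compute.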

Rule: pin PyTorch and CUDA versions, set NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES, verify at container startup.

🎯 Key Takeaway

Docker fixes the CUDA compatibility chain in one image tag — use it. CUDA version in the image must be at or below what the host driver supports. Always pin both PyTorch and CUDA versions explicitly. Add a startup verification that asserts CUDA availability and fails loudly if it is missing — silent CPU fallback in a GPU training container is the most expensive misconfiguration in production ML.

Docker CUDA Version Selection

If: Host driver supports CUDA 12.x (nvidia-smi shows CUDA Version: 12.x)
→ Use pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime, the image pinned in the Dockerfile above, or a newer pinned tag once you have checked it against the compatibility matrix

If: Host driver supports CUDA 11.x only (nvidia-smi shows CUDA Version: 11.x)
→ Use pytorch/pytorch:2.2.0-cuda11.8-cudnn8-runtime; do NOT use a CUDA 12 image on an 11.x driver

If: No GPU on the host (CPU-only deployment or CI environment)
→ Use a CPU-only PyTorch build: several GB smaller, faster to pull and start. Confirm the exact image tag or wheel on the registry before pinning it

Common Mistakes and How to Avoid Them

Most tensor bugs in production trace back to a handful of patterns. Knowing them in advance is the difference between a 30-second fix and a four-hour debugging session.

Device mismatch is the most common runtime crash. The error message is unambiguous — RuntimeError: Expected all tensors to be on the same device — but identifying which tensor is on the wrong device requires checking .device on each one. The usual culprit is a target tensor or loss buffer created inside the training loop without .to(device), while the model output is correctly on GPU. The fix: establish device once and pass it to every tensor creation call in the loop, not just to the model.

torch.Tensor (capital T) versus torch.tensor (lowercase t) trips up developers coming from other frameworks. torch.Tensor is the tensor class. Calling torch.Tensor([1, 2, 3]) constructs a tensor of the default dtype (float32 unless changed), so it behaves like torch.FloatTensor and performs no dtype inference. torch.tensor (lowercase) is the factory function: it infers dtype from the input, always creates a copy, and accepts dtype, device, and requires_grad arguments. It is the correct way to create a tensor from data. In production code, always use torch.tensor().

In-place operations on gradient-tracked tensors conflict with autograd. When you call a.add_(b), the previous value of a is overwritten, and autograd may need that value to compute gradients during backpropagation. On a leaf tensor with requires_grad=True, PyTorch raises RuntimeError: a leaf Variable that requires grad is being used in an in-place operation immediately. On an intermediate tensor, autograd's version counter detects the change and the error surfaces later, at .backward(), far from the line that caused it. The rule: avoid trailing underscores (add_, mul_, fill_, zero_) on any tensor with requires_grad=True.
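The sketch below shows the intermediate-tensor case, where the in-place call itself succeeds and the failure only appears at .backward(). It uses exp() because that operation saves its own output for the backward pass; the variable names are illustrative.

```python
import torch

a = torch.randn(3, requires_grad=True)

# exp() saves its output because the derivative of exp(a) is exp(a)
b = a.exp()
b.add_(1)  # in-place edit of a tensor autograd saved for the backward pass

raised = False
try:
    b.sum().backward()
except RuntimeError as err:
    raised = True
    print(f"backward failed: {str(err)[:60]}...")

# Out-of-place version: the saved output is untouched, so backward works
a.grad = None
c = a.exp() + 1
c.sum().backward()
print(torch.allclose(a.grad, a.detach().exp()))  # True
```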

Views versus copies is the final major source of confusion. .view(), slicing, and .transpose() all return views that share storage with the original tensor. Modifying a view modifies the original. .clone() creates an independent copy. If you need to modify a slice without affecting the source tensor — common when building augmented versions of a batch — always call .clone() first.

io/thecodeforge/ml/common_tensor_mistakes.py (Python)

import torch

# Device selection — establish once, pass everywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

# --- Mistake 1: Device mismatch ---
# WRONG: x is on GPU, y is on CPU — crashes on the addition
# x = torch.ones(5, device='cuda')
# y = torch.zeros(5)  # CPU by default
# z = x + y  # RuntimeError: Expected all tensors to be on the same device

# CORRECT: both created on the same device
x = torch.ones(5, device=device)
y = torch.zeros(5, device=device)
z = x + y
print(f"Device mismatch fixed: z = {z}")

# --- Mistake 2: torch.Tensor vs torch.tensor ---
# WRONG: torch.Tensor (class) — confusing, no dtype inference, no device argument
# confused = torch.Tensor([1, 2, 3])  # always float32, always CPU

# CORRECT: torch.tensor (factory function) — explicit dtype, device, requires_grad
clear = torch.tensor([1, 2, 3], dtype=torch.float32, device=device)
print(f"torch.tensor result: {clear}, device: {clear.device}")

# --- Mistake 3: In-place operation on a gradient-tracked tensor ---
a = torch.randn(2, 2, requires_grad=True)

# WRONG: in-place add destroys the value autograd needs for gradient computation
# a.add_(torch.ones(2, 2))  # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation

# CORRECT: create a new tensor — preserves the computation graph
b = a + torch.ones(2, 2)  # new tensor, a unchanged, graph intact
loss = b.sum()
loss.backward()  # gradients computed correctly
print(f"Gradient after correct operation: {a.grad}")

# --- Mistake 4: View vs clone confusion ---
original = torch.tensor([1.0, 2.0, 3.0, 4.0])
view     = original[:2]   # view — shares storage
copied   = original[:2].clone()  # independent copy

view[0] = 99.0   # modifies original too
print(f"After modifying view — original: {original}")  # [99., 2., 3., 4.]

copied[0] = 77.0  # does NOT modify original
print(f"After modifying clone — original: {original}")  # unchanged

# --- Mistake 5: Contiguity and .view() ---
t = torch.randn(3, 4)
t_transposed = t.T   # transpose — non-contiguous, strides reordered
# t_transposed.view(12)  # RuntimeError: view size is not compatible with input tensor's size and stride
t_contiguous = t_transposed.contiguous()  # copies data into contiguous layout
reshaped = t_contiguous.view(12)          # works correctly
print(f"Contiguous reshape: {reshaped.shape}")

Output

Device: cuda
Device mismatch fixed: z = tensor([1., 1., 1., 1., 1.], device='cuda:0')
torch.tensor result: tensor([1., 2., 3.], device='cuda:0'), device: cuda:0
Gradient after correct operation: tensor([[1., 1.],
        [1., 1.]])
After modifying view — original: tensor([99.,  2.,  3.,  4.])
After modifying clone — original: tensor([99.,  2.,  3.,  4.])
Contiguous reshape: torch.Size([12])

⚠ When Tensors Are the Wrong Tool

The most expensive tensor mistake is not a bug — it is unnecessary complexity. If your computation does not need GPU acceleration or automatic differentiation, NumPy is simpler, has less overhead, and integrates more broadly with the scientific Python ecosystem. Only create a PyTorch tensor when you are feeding data into a model, running gradient-based optimisation, or need CUDA parallelism for a large matrix operation. For data preprocessing, statistics, and exploratory analysis, NumPy is usually the right choice and you can convert with torch.from_numpy() when the data reaches the model.

📊 Production Insight

Device mismatch is the most common runtime crash — print .device on every tensor in the failing line before changing anything else.

In-place operations with trailing underscores (add_, mul_) on requires_grad tensors break the backward pass, and the error may only surface at .backward(), far from the line that caused it. Avoid them entirely on gradient-tracked tensors.

torch.tensor (lowercase) is the correct factory function for creating tensors from data. torch.Tensor (uppercase) is the class and should not appear in data creation code.

Views share storage — use .clone() before modifying any slice you do not want to affect the source tensor.

🎯 Key Takeaway

Device mismatch is the most common runtime error — always pass device=device at tensor creation and never mix CPU and GPU tensors in the same operation. In-place operations break autograd on gradient-tracked tensors. Use torch.tensor() (lowercase) for all data creation. Views share storage with the original — use .clone() when you need an independent copy.

Debugging Tensor Mistakes

If: RuntimeError: Expected all tensors to be on the same device

→

Use: Print .device on all tensors in the failing operation, then add .to(device) at the creation point of the mismatched tensor, not after the fact

If: RuntimeError: a leaf Variable that requires grad is being used in an in-place operation

→

Use: Replace a.add_(b) with a = a + b. The non-in-place version creates a new tensor and preserves the computation graph

If: .view() fails with RuntimeError about size not compatible with stride

→

Use: Call .contiguous() before .view(), since the tensor is non-contiguous after .transpose() or .permute(). Or use .reshape(), which handles non-contiguous tensors automatically.

If: Modifying a slice changes the original tensor unexpectedly

→

Use: Call .clone() to create an independent copy before modification. Slices are views that share storage with the original
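Three of the four rows above can be reproduced on CPU (the device-mismatch row needs a GPU to trigger). The following is a small sketch with made-up tensor values, not a definitive test suite:

```python
import torch

# 1. In-place op on a leaf tensor that requires grad raises; the out-of-place form does not
a = torch.ones(3, requires_grad=True)
try:
    a.add_(1.0)
    in_place_failed = False
except RuntimeError:
    in_place_failed = True
b = a + 1.0  # new tensor, graph intact
b.sum().backward()

# 2. .view() on a non-contiguous tensor raises; .reshape() copies when it has to
t = torch.arange(12.0).reshape(3, 4).t()  # transpose makes it non-contiguous
try:
    t.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True
flat = t.reshape(12)

# 3. A slice is a view; .clone() gives an independent copy
source = torch.tensor([1.0, 2.0, 3.0, 4.0])
safe = source[:2].clone()
safe[0] = 99.0  # source is untouched

print(in_place_failed, view_failed, a.grad, flat.shape, source)
```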

Control Memory, Control Your Model: The Real Cost of Tensor Shapes

Your model doesn't crash because of a bug. It crashes because you ran out of VRAM at batch 47. Every senior engineer has been there. The fix isn't buying more GPUs; it's understanding how tensor shapes wreck your memory budget. A single (1024, 1024) float32 tensor costs 4 MB (1024 × 1024 elements × 4 bytes). Blow that up to (1024, 2048) and you're at 8 MB. That's fine. But chain a few of these in a transformer and suddenly you're holding 4 GB of intermediate activations. PyTorch's torch.cuda.max_memory_allocated() is your best friend. Call it after every forward pass during development. Watch for the silent killer: broadcasting. A (64, 1) column multiplied elementwise with a (1, 512) row produces a full (64, 512) output, 512 times larger than the first operand. The broadcast inputs are never copied, but the output is allocated at the full broadcast shape, so one unintended size-1 dimension can multiply the size of every downstream activation. Profile before you optimize. Guesswork is for people who enjoy swapping GPUs out of racks at 3 AM.
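The size arithmetic is easy to check by hand. A small sketch using only the standard library (the helper name tensor_megabytes is made up for illustration; on a real tensor the same number comes from t.element_size() * t.nelement()):

```python
import math

def tensor_megabytes(shape, bytes_per_element=4):
    """Size in MB of a dense tensor; 4 bytes per element corresponds to float32."""
    return math.prod(shape) * bytes_per_element / (1024 * 1024)

print(tensor_megabytes((1024, 1024)))     # 4.0
print(tensor_megabytes((1024, 2048)))     # 8.0
print(tensor_megabytes((64, 128, 1024)))  # 32.0
```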

MemoryProfiler.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
import gc

def profile_tensor_memory():
    # Reset the peak memory tracker so earlier allocations don't skew the reading
    torch.cuda.reset_peak_memory_stats()
    gc.collect()  # Force Python garbage collector before measurement

    # A typical hidden state in a transformer layer: 64 * 128 * 1024 * 4 bytes = 32 MB
    batch_size, seq_len, hidden_dim = 64, 128, 1024
    input_tensor = torch.randn(batch_size, seq_len, hidden_dim, device='cuda')

    # Broadcasted addition: bias (1, 1, hidden_dim) is never materialised at full size,
    # but the result is a new full (64, 128, 1024) tensor, another 32 MB
    bias = torch.randn(1, 1, hidden_dim, device='cuda')
    result = input_tensor + bias

    # Peak memory in MB after the operation
    peak_mb = torch.cuda.max_memory_allocated() / (1024 * 1024)
    print(f"Peak GPU memory after broadcasted add: {peak_mb:.2f} MB")

    # Free the first result so both measurements start from the same baseline
    del result
    torch.cuda.reset_peak_memory_stats()
    gc.collect()

    # Same operation with the broadcast written out: expand_as returns a view (no copy),
    # so the memory cost is identical, but the intended shape is explicit
    result_direct = input_tensor + bias.expand_as(input_tensor)
    peak_mb_explicit = torch.cuda.max_memory_allocated() / (1024 * 1024)
    print(f"Peak GPU memory with explicit expand: {peak_mb_explicit:.2f} MB")

profile_tensor_memory()

Output

Peak GPU memory after broadcasted add: 64.00 MB

Peak GPU memory with explicit expand: 64.00 MB

// Both are the same here (32 MB input + 32 MB result), but the explicit version avoids silent shape mismatches in larger nets

⚠ Production Trap: Silent Broadcasting Floods Memory

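Don't rely on PyTorch's implicit broadcasting in production loops. It hides shape mismatches: subtract an (N,) target from an (N, 1) prediction and you silently get an (N, N) result, so memory grows with the square of the batch size instead of raising an error. Make the intended shape explicit with expand_as() or unsqueeze(). Your future self will thank you when the memory profiler doesn't scream.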
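A small CPU-only illustration of how an implicit broadcast can hide a shape mismatch. The tensors are made up, and checked_sub is a hypothetical helper, not a PyTorch API:

```python
import torch

n = 4
predictions = torch.zeros(n, 1)   # model output, shape (N, 1)
targets = torch.ones(n)           # labels, shape (N,)

# Implicit broadcasting: (N, 1) - (N,) silently becomes (N, N)
silent = predictions - targets
print(silent.shape)  # torch.Size([4, 4])

# Making the shape explicit keeps the result at (N, 1)
explicit = predictions - targets.unsqueeze(1)
print(explicit.shape)  # torch.Size([4, 1])

def checked_sub(a, b):
    """Hypothetical helper: refuse to broadcast, require identical shapes."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}")
    return a - b
```

At N = 4 the accidental (N, N) result is harmless; at a batch size in the thousands it is a quadratic allocation and a wrong loss value.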

🎯 Key Takeaway

Use torch.cuda.max_memory_allocated() after every training step. Monitor tensor shapes like you monitor latency. Explicit expansions are your debugging armor.

Pin Memory or Pay the Price: The Hidden Cost of CPU-to-GPU Transfer

You think you wrote a fast DataLoader. It uses 8 workers, prefetches 2 batches. But your GPU idle time is 20%. That's because transfers from CPU RAM to GPU VRAM are synchronous by default. Every to(device='cuda') call on ordinary pageable memory blocks the host thread until the copy finishes, so the next batch cannot be queued while the GPU waits. The fix is pinning memory. When you set pin_memory=True in your DataLoader, PyTorch places each batch in page-locked host memory. That memory is directly accessible by the GPU DMA engine. No page faults, no staging copy. Combined with non_blocking=True on the to() call, the transfer becomes asynchronous: the copy overlaps with GPU compute while the next batch is being prepared. Benchmark this: a ResNet-50 training loop with pin_memory=False vs True. On input-bound workloads the difference can be in the 15-25% throughput range; on compute-bound ones it may be negligible. Don't let your DataLoader steal cycles from your backprop.

PinMemoryDemo.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
from torch.utils.data import DataLoader, TensorDataset
import time

# Simulate a dataset: 10,000 samples, each 3x224x224 image (standard ImageNet size)
dummy_data = torch.randn(10000, 3, 224, 224)
dummy_labels = torch.randint(0, 1000, (10000,))
dataset = TensorDataset(dummy_data, dummy_labels)

def time_dataloader(pin_memory, num_workers=4):
    loader = DataLoader(
        dataset,
        batch_size=64,
        num_workers=num_workers,
        pin_memory=pin_memory,
        shuffle=True
    )

    start = time.perf_counter()
    for batch_idx, (data, targets) in enumerate(loader):
        # Simulate the transfer to GPU that happens in every training loop.
        # non_blocking=True only overlaps the copy when the source memory is pinned.
        data = data.to('cuda', non_blocking=pin_memory)
        targets = targets.to('cuda', non_blocking=pin_memory)
        # Don't actually train, just time the transfer
        if batch_idx >= 49:  # stop after exactly 50 batches
            break
    torch.cuda.synchronize()  # wait for outstanding async copies before stopping the clock
    end = time.perf_counter()
    return end - start

print("Pinned memory disabled:")
time_no_pin = time_dataloader(pin_memory=False)
print(f"Time for 50 batches: {time_no_pin:.2f}s")

print("\nPinned memory enabled:")
time_pin = time_dataloader(pin_memory=True)
print(f"Time for 50 batches: {time_pin:.2f}s")

print(f"\nSpeedup: {(time_no_pin / time_pin):.2f}x")

Output

Pinned memory disabled:

Time for 50 batches: 3.45s

Pinned memory enabled:

Time for 50 batches: 2.12s

Speedup: 1.63x

💡Senior Shortcut: Always Pair pin_memory with non_blocking=True

Setting pin_memory=True only lets transfers overlap with compute if you also use non_blocking=True in your to(device) calls. Otherwise, the host still waits for each copy. Add that parameter to every batch transfer in your training loop. It's a one-line change, and on input-bound loops the throughput lift can reach the 20% range.
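One way to wire the two settings together so the same script also runs on a CPU-only machine. This is a sketch with a tiny stand-in dataset; to_device is a hypothetical helper, not part of PyTorch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')

# Tiny stand-in dataset (illustrative sizes)
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

# Pinning only helps when a GPU is the destination, so key it off CUDA availability
loader = DataLoader(dataset, batch_size=64, pin_memory=use_cuda)

def to_device(batch, device):
    """Hypothetical helper: move every tensor in a batch, overlapping copies on CUDA."""
    non_blocking = device.type == 'cuda'
    return tuple(t.to(device, non_blocking=non_blocking) for t in batch)

for batch in loader:
    data, targets = to_device(batch, device)
    # forward/backward would go here

print(data.device, targets.shape)
```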

🎯 Key Takeaway

Always set pin_memory=True in DataLoader. Pair it with non_blocking=True in your to(device) calls. That's the cheapest 15-25% performance gain you'll ever get.

Stop Guessing: Profile Tensor Memory Before You Deploy

Memory leaks in production ML don't crash your training job — they crash your inference API at 3 AM when traffic spikes. Most engineers waste days debugging OOM errors that could be caught with one line of instrumentation.

PyTorch's built-in memory report (torch.cuda.memory_summary()) shows how much memory the caching allocator has handed out and how much it is holding in reserve. Run it after your model's forward pass. Look for memory that persists when it shouldn't; those are your memory anchors. The usual suspects: activations saved for a backward pass that never happens because inference ran under model.eval() without torch.no_grad() (eval mode alone does not switch autograd off), or outputs appended to a list or cache without .detach().

Don't trust nvidia-smi. That shows total GPU allocation, not per-tensor breakdown. Use the PyTorch profiler to see allocation by operation. Then kill the hidden tensors. Your production budget will thank you.
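A quick CPU-runnable check of whether an inference path is still recording a graph (a sketch; the layer sizes are arbitrary):

```python
import torch

model = torch.nn.Linear(16, 16)
model.eval()  # changes layer behaviour (dropout, batch norm) but does NOT disable autograd
inputs = torch.randn(4, 16)

tracked = model(inputs)
print(tracked.requires_grad, tracked.grad_fn is not None)  # True True

with torch.no_grad():
    untracked = model(inputs)
print(untracked.requires_grad, untracked.grad_fn is None)  # False True

# If an output must outlive the request (cache, metrics), store a detached copy
cached = tracked.detach()
```

If an output in your serving path reports requires_grad=True, the graph behind it is being kept alive for as long as that output is referenced.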

profile_memory.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
import gc

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Simulate leak: hold reference to intermediate tensor
model = torch.nn.Linear(1024, 1024).to(device)
inputs = torch.randn(256, 1024, device=device)

gc.collect()
torch.cuda.empty_cache()

# One forward pass with memory tracking
output = model(inputs)
hidden = output.relu()  # This tensor persists; kill it

del hidden  # Explicitly free
print(torch.cuda.memory_summary(device=device, abbreviated=True))

Output

|-----------|-------------|----------------|----------------|

| cuda:0 | 47 | 1.2 MB | 8.0 MB |

⚠ Production Trap:

Calling torch.cuda.empty_cache() mid-request is a sign you're leaking. Fix the leak, don't sweep it under the GPU.

🎯 Key Takeaway

Profile tensor memory before you ship. One memory_summary() call saves you a production incident.

The One Transform That Breaks Your Batch Norm (And How to Fix It)

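Batch normalization tracks running mean and variance per channel. When you reshape a tensor from (N, C, H, W) to (N*H, W, C) for a sequence model, the running statistics inside the BatchNorm layer are untouched, but the data you hand downstream is not: .view() and .reshape() reinterpret memory in its existing order rather than moving axes, so the last dimension ends up holding a mix of spatial positions instead of channels. Shapes match, nothing raises, and every later per-channel or per-feature layer sees scrambled input. Your model trains fine but serves garbage.

The fix: never reshape across the channel dimension. If you must flatten spatial dimensions, permute first to move channels to the last axis. Then reshape, and each vector along the last axis still holds one position's channels. Or switch to LayerNorm, which normalizes over feature dimensions and keeps no running batch statistics.

Check the values, not just the shapes, before and after a reshape. Compare the result against a permute-based reference with torch.equal(); if they differ, you've got a silent bug. Trust your checks, not your intuition.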

batchnorm_reshape_fix.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
import torch.nn as nn

class BrokenBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # BAD: view reinterprets (N, C, H, W) memory as (N*H, W, C) without moving the channel axis
        b, c, h, w = x.shape
        return self.bn(x).view(b * h, w, c)  # last axis no longer holds channels

class FixedBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        # Permute channels to last, then reshape spatial without touching C
        x = self.bn(x)
        x = x.permute(0, 2, 3, 1)  # (N, H, W, C)
        return x.reshape(b * h, w, c)

x = torch.randn(2, 4, 8, 16)
broken = BrokenBlock(4)
fixed = FixedBlock(4)
print('Broken output shape:', broken(x).shape)  # (16, 16, 4)
print('Fixed output shape:', fixed(x).shape)    # (16, 16, 4)
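# Shapes match and the BatchNorm running stats are identical; only the values' layout differs.
# The broken block's last axis mixes spatial positions instead of holding channels.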

Output

Broken output shape: torch.Size([16, 16, 4])

Fixed output shape: torch.Size([16, 16, 4])

💡Senior Shortcut:

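When in doubt, use torch.nn.LayerNorm in mixed spatial-sequence architectures. It normalizes each sample over its trailing feature dimensions and keeps no running statistics, so training and inference behave the same. It still expects features on the last axis, so permute before you reshape.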

🎯 Key Takeaway

Never reshape across the channel dimension when batch norm is active. Permute first, then flatten spatial dimensions.
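The layout problem is easiest to see with a tensor whose values name their own channel. A minimal sketch, with sizes chosen small enough to read by eye:

```python
import torch

n, c, h, w = 1, 3, 2, 2
# Fill each channel with its own index so the layout is easy to read
x = torch.arange(c, dtype=torch.float32).view(1, c, 1, 1).expand(n, c, h, w).contiguous()

scrambled = x.view(n * h, w, c)                        # reinterprets memory
correct = x.permute(0, 2, 3, 1).reshape(n * h, w, c)   # moves the axis, then flattens

print(correct[0, 0])    # tensor([0., 1., 2.])
print(scrambled[0, 0])  # tensor([0., 0., 0.])
```

Both results have shape (2, 2, 3), which is why a shape check alone does not catch the bug; torch.equal(scrambled, correct) does.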

Installation: Why the Wrong Build Costs You 10x Latency

Installing PyTorch seems trivial (pip install torch), but the wheel you get determines whether your code ever touches the GPU. The critical decision is selecting a CUDA build that your driver supports. Use nvidia-smi to check the highest CUDA version the driver supports, then install a PyTorch build from pytorch.org whose CUDA version is at or below it. On CPU-only systems, avoid the CUDA build entirely; it pulls in large GPU libraries you will never use. For edge devices, building from source with USE_CUDA=0 shrinks the binary considerably. Always verify with torch.cuda.is_available() and torch.backends.cudnn.version(). If the build and the driver do not match, torch.cuda.is_available() returns False, and any script that uses the 'cuda' if available else 'cpu' pattern quietly runs on CPU, multiplying training time. The why: PyTorch is a C++ engine and the Python wheel is a wrapper around compiled libraries, so a build compiled against a CUDA runtime the driver cannot serve either fails to initialise CUDA or crashes. Test on a small tensor: torch.randn(3,3).cuda() should return within a few seconds (the first call pays for CUDA context creation). If it hangs or errors, your installation is broken.

VerifyCudaInstall.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")

# Get device count
print(f"GPU count: {torch.cuda.device_count()}")

# Quick sanity: create tensor on GPU
try:
    t = torch.randn(4, 4).cuda()
    print("GPU tensor created successfully")
    print(f"Device: {t.device}")
except (RuntimeError, AssertionError) as e:  # CPU-only builds raise AssertionError here
    print(f"GPU failure: {e}")

# Display CUDA version used by PyTorch
print(f"PyTorch CUDA version: {torch.version.cuda}")

Output

CUDA available: True

GPU count: 1

GPU tensor created successfully

Device: cuda:0

PyTorch CUDA version: 12.1

⚠ Production Trap:

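Installing a CUDA 12.x wheel on a driver that only supports CUDA 11.x makes torch.cuda.is_available() return False, and any 'cuda if available else cpu' script then falls back to CPU without complaint. The reverse (a CUDA 11.8 wheel on a CUDA 12 driver) works, because drivers are backward compatible. Pick a PyTorch CUDA build at or below your driver's maximum supported version; the system's installed toolkit does not matter, since the wheel ships its own CUDA runtime.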

🎯 Key Takeaway

Check torch.cuda.is_available() immediately after install to catch driver mismatch before training.
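To turn that check into a hard failure instead of a quiet fallback, something along these lines can sit at the top of a training script. pick_device and its require_gpu flag are hypothetical names used for illustration:

```python
import torch

def pick_device(require_gpu: bool) -> torch.device:
    """Hypothetical helper: return the training device, refusing a silent CPU fallback."""
    if torch.cuda.is_available():
        return torch.device('cuda')
    if require_gpu:
        raise RuntimeError(
            f"CUDA not available (PyTorch built for CUDA: {torch.version.cuda}); "
            "refusing to fall back to CPU"
        )
    return torch.device('cpu')

device = pick_device(require_gpu=False)
print(f"Using device: {device}")
```

On a GPU training job you would call pick_device(require_gpu=True), so a broken install stops the run in the first second rather than after hours of CPU training.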

Enhancing Data Diversity through Augmentation: Why Random Noise Beats Fixed Pipelines

Data augmentation is not about random flips; it's about forcing your model to learn invariances that generalize. Static augmentation pipelines (e.g., always rotate 30°) create spurious correlations. Instead, use stochastic augmentation with per-sample randomness, seeded through torch.manual_seed() or a torch.Generator passed to the DataLoader. The why: deterministic transforms let the model memorize augmentations as features. Drawing fresh random parameters for every sample breaks that pattern. For images, combine geometric (random affine, perspective) with photometric (color jitter, Gaussian noise) transforms. Apply augmentations on the CPU inside DataLoader workers (num_workers>0) so they overlap with GPU compute instead of blocking it. Critical: never augment validation or test sets, only training. torchvision.transforms.RandAugment is a solid production default: it samples num_ops operations per image from a fixed pool of 14 transforms and applies them at a single magnitude you set (the magnitude is a hyperparameter, not learned). Profile memory: most torchvision.transforms.functional calls return new tensors, so long pipelines create intermediates; a few, such as normalize, accept inplace=True. The hidden cost: heavy augmentation such as RandomResizedCrop can dominate data loading time when num_workers is too low.

StochasticAugmentation.pyPYTHON

# io.thecodeforge - ml-ai tutorial

import torch
from torchvision import transforms

# torchvision's random transforms draw from PyTorch's global RNG and take no generator
# argument, so seed the global RNG for reproducible runs
torch.manual_seed(42)

# Stochastic augmentation pipeline (training only); expects PIL images as input
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Apply per image, then stack into a batch of shape (B, C, H, W)
# batch_tensors = torch.stack([train_transform(img) for img in batch])

Output

Augmentation applied with per-sample randomness. Seed 42 makes the sequence of random draws reproducible across runs.

⚠ Production Trap:

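Setting torch.manual_seed once is not enough. Each DataLoader worker is a separate process: PyTorch seeds its torch RNG with a base seed plus the worker id, but other libraries your transforms use (NumPy in particular) may not be reseeded for you, so workers can produce identical or irreproducible augmentations. Pass a seeded torch.Generator to the DataLoader and use worker_init_fn to seed NumPy and random from torch.initial_seed().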
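A sketch of that wiring, following the pattern in PyTorch's reproducibility notes. The dataset is a stand-in, and num_workers is left at 0 here so the snippet runs anywhere:

```python
import random
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Inside a worker, torch.initial_seed() is already base_seed + worker_id;
    # reuse it for libraries PyTorch may not reseed for you
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    # np.random.seed(worker_seed) would go here if your transforms use NumPy

def make_loader(seed):
    g = torch.Generator()
    g.manual_seed(seed)  # controls shuffling and the workers' base seed
    dataset = TensorDataset(torch.arange(16))
    return DataLoader(dataset, batch_size=4, shuffle=True,
                      num_workers=0,  # raise this in real training
                      worker_init_fn=seed_worker, generator=g)

first = [batch[0].tolist() for batch in make_loader(42)]
second = [batch[0].tolist() for batch in make_loader(42)]
print(first == second)  # True
```

With num_workers=0 the worker_init_fn is never called; the generator alone fixes the shuffle order. The init function starts to matter once workers are enabled.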

🎯 Key Takeaway

Use stochastic augmentation with reproducible seeding (torch.manual_seed plus a DataLoader generator) to prevent the network from memorizing fixed transforms as features.

● Production incident · POST-MORTEM · severity: high

Training silently runs on CPU — GPU utilisation stays at 0%

Symptom

Training throughput is 8x slower than expected benchmarks for the model class. nvidia-smi shows 0% GPU utilisation throughout the run. No error is raised — training completes normally, loss decreases, and checkpoints are saved. Nothing in the output indicates anything is wrong.

Assumption

The GPU is faulty or the CUDA driver is incompatible with the installed PyTorch version. The team spent several hours running nvidia-smi diagnostics, checking driver logs, and reinstalling CUDA before looking at the training code itself.

Root cause

Input tensors were created with torch.tensor(data), which defaults to CPU. The model was correctly moved to GPU with model.to('cuda'), but the input data remained on CPU. The first forward pass raised a device mismatch RuntimeError, which was caught by a broad except Exception block that was added months earlier to 'handle data loading issues.' The except block logged a generic warning and continued, falling back to CPU computation silently. Training ran on CPU for the entirety of the job.

Fix

Established a single device variable at the top of the training script and passed it to every tensor creation call: torch.tensor(data, device=device). Removed the broad except block that was suppressing the RuntimeError — device mismatch errors should crash immediately, not be swallowed. Added a pre-training assertion: assert next(model.parameters()).is_cuda, 'Model is not on GPU'. Added a startup log that prints the device of the first model parameter and the first input batch so device placement is visible from the very first line of training output.

Key lesson

PyTorch tensors default to CPU — you must explicitly pass device=device to every creation call, or call .to(device) before any operation involving the model
Broad except blocks that catch Exception are one of the most dangerous patterns in training code — they suppress device mismatch errors and allow silent CPU fallback
Always verify GPU placement with assert next(model.parameters()).is_cuda before the training loop starts — this one line catches the most expensive silent failure in production ML
Log the device of model parameters and input tensors at startup — make device placement visible from the first line of output, not something you discover after 8 hours of slow training
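The startup check described in the fix could look roughly like this. check_device_placement is a hypothetical helper; it compares devices rather than asserting is_cuda so that the sketch also runs on a CPU-only machine:

```python
import torch

def check_device_placement(model: torch.nn.Module, batch: torch.Tensor) -> torch.device:
    """Hypothetical startup check: fail loudly if model and inputs live on different devices."""
    model_device = next(model.parameters()).device
    if batch.device != model_device:
        raise RuntimeError(
            f"Device mismatch: model on {model_device}, batch on {batch.device}"
        )
    print(f"Model device: {model_device} | batch device: {batch.device}")
    return model_device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(8, 2).to(device)
batch = torch.randn(4, 8, device=device)  # device passed at creation
check_device_placement(model, batch)
```

The important part is what the sketch leaves out: there is no try/except around the check, so a mismatch stops the job instead of being logged and ignored.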

Production debug guide · Common symptoms when tensor operations fail · 5 entries

Symptom · 01

RuntimeError: Expected all tensors to be on the same device

→

Fix

Print .device on every tensor in the failing operation before trying anything else: print(x.device, y.device). The mismatch is usually between model output (GPU) and a target tensor created inside the loss function (CPU). Fix by passing device=device to every tensor creation call in the training loop, including target tensors and any intermediate buffers.

Symptom · 02

CUDA out of memory error mid-training

→

Fix

Check tensor sizes with .element_size() * .nelement() to find which allocation is largest. Delete unused tensors with del; calling torch.cuda.empty_cache() between epochs only returns cached blocks to the driver and does not free memory held by live tensors. Make sure loss is logged with loss.item(): accumulating the loss tensor itself (for example total += loss) keeps every iteration's computation graph in memory. Run torch.cuda.memory_summary() to see a breakdown of current allocations.

Symptom · 03

RuntimeError: grad can be implicitly created only for scalar outputs

→

Fix

Loss must be a scalar before calling .backward(). If your loss computation returns a tensor with more than one element, reduce it with .mean() or .sum() first. This typically happens when the loss function is applied without proper aggregation — for example, calling a per-element loss without reducing across the batch dimension.

Symptom · 04

Tensor shape mismatch on matrix multiplication

→

Fix

Print .shape on both tensors before the operation: print(x.shape, y.shape). Matrix multiplication requires the inner dimensions to match — (A, B) @ (B, C) produces (A, C). Use .unsqueeze() to add missing dimensions or .permute() to reorder. If the mismatch is a batch dimension issue, check whether you need .bmm() instead of .mm() for batched matrix multiplication.

Symptom · 05

Gradients are None for some model parameters

→

Fix

Check whether those parameters are actually used in the computation that produced the loss. Parameters not in the computation graph have no gradient path and .grad remains None. Also check for accidental torch.no_grad() wrapping the forward pass, and verify that requires_grad=True is set on the parameters — freezing layers (param.requires_grad = False) is a common source of this.
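Three of the fixes above (Symptoms 02, 03 and 05) fit in one short CPU-runnable sketch. The TwoHeads model is made up for illustration and deliberately contains a layer the forward pass never uses:

```python
import torch

class TwoHeads(torch.nn.Module):
    """Illustrative model with one layer that the forward pass never uses."""
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 1)
        self.unused = torch.nn.Linear(4, 1)

    def forward(self, x):
        return self.used(x)

model = TwoHeads()
x, y = torch.randn(8, 4), torch.randn(8, 1)

# Symptom 03: a per-element loss must be reduced to a scalar before .backward()
per_element = torch.nn.functional.mse_loss(model(x), y, reduction='none')
loss = per_element.mean()
loss.backward()

# Symptom 02: log a Python float, not the tensor, so the graph can be freed
running_loss = 0.0
running_loss += loss.item()

# Symptom 05: list parameters that received no gradient
missing = [name for name, p in model.named_parameters() if p.grad is None]
print(missing)  # ['unused.weight', 'unused.bias']
```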

★ Tensor Debug Cheat Sheet · Quick commands to diagnose tensor issues

Device mismatch crash

Immediate action

Check the device of every tensor involved in the failing operation

Commands

python -c "import torch; x = torch.tensor([1.0]); print('Default device:', x.device)"

python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device count:', torch.cuda.device_count()); print('Current device:', torch.cuda.current_device() if torch.cuda.is_available() else 'N/A')"

Fix now

Set device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') at the top of the script and pass it to every tensor creation call — this eliminates the entire class of device mismatch errors


NumPy Arrays vs PyTorch Tensors

Aspect	NumPy Arrays	PyTorch Tensors
Hardware support	CPU only — no GPU path	CPU, NVIDIA GPU via CUDA, Apple Silicon via MPS — same API regardless of device
Automatic differentiation	Manual — you implement the derivative by hand	Autograd — .backward() computes all gradients via the chain rule in one pass
Deep learning ecosystem	Requires wrappers or conversion to use with PyTorch, JAX, or TensorFlow	Native — every PyTorch layer, loss function, and optimizer operates on tensors directly
Memory model	Strided arrays in CPU memory, C-order by default; slices and transposes are views	Views with shape and stride: efficient for transpose and reshape, but non-contiguous tensors require .contiguous() before .view()
Interoperability	Universal — the lingua franca of the scientific Python ecosystem	torch.from_numpy() converts with zero copy on CPU — full round-trip compatibility
When to use it	Data preprocessing, statistics, visualisation, and any computation that does not need GPU or gradients	Any workload that feeds a neural network, requires gradient-based optimisation, or benefits from GPU parallelism on large matrices

⚙ Quick Reference

10 commands from this guide

File	Command / Code	Purpose
iothecodeforgemlforge_tensor_basics.py	if torch.cuda.is_available():	What Is a PyTorch Tensor and Why Does It Exist?
iothecodeforgequeriesfetch_features.sql	SELECT	Enterprise Data Pipelines
Dockerfile	FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime	Standardising Environments with Docker
iothecodeforgemlcommon_tensor_mistakes.py	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")	Common Mistakes and How to Avoid Them
MemoryProfiler.py	def profile_tensor_memory():	Control Memory, Control Your Model
PinMemoryDemo.py	from torch.utils.data import DataLoader, TensorDataset	Pin Memory or Pay the Price
profile_memory.py	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')	Stop Guessing
batchnorm_reshape_fix.py	class BrokenBlock(nn.Module):	The One Transform That Breaks Your Batch Norm (And How to Fi
VerifyCudaInstall.py	print(f"CUDA available: {torch.cuda.is_available()}")	Installation
StochasticAugmentation.py	from torchvision import transforms	Enhancing Data Diversity through Augmentation

Key takeaways

Tensors are the universal data structure in PyTorch: every input, weight, gradient, and model output is a tensor. Understanding how they work is not optional; it is the foundation every other PyTorch concept builds on.

requires_grad=True opts a tensor into the computation graph: every subsequent operation is recorded for backpropagation. Set it only on learnable parameters, never on input data, and never on tensors used only for inference.

Always pass device=device to tensor creation calls: tensors default to CPU and PyTorch never moves them automatically. The cost of forgetting this is training silently on CPU at 1/10th expected speed.

In-place operations (add_, mul_, fill_) on gradient-tracked tensors can invalidate the computation graph, sometimes raising an error immediately, sometimes only at .backward(). Avoid them on any tensor with requires_grad=True.

Use torch.tensor() (lowercase) for creating tensors from data. torch.Tensor() (uppercase) is the class constructor and does not infer dtype or accept a device argument; it should not appear in data creation code.

Views share storage with the original tensor: .transpose(), .permute(), and slicing all return views. Use .clone() when you need a modification-safe independent copy, and .contiguous() before .view() when the tensor is non-contiguous.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

What is the difference between torch.Tensor (class constructor) and torc...

Q02SENIOR

Explain the contiguous tensor concept. Why do we often need to call .con...

Q03SENIOR

Describe the broadcasting rules in PyTorch. How does the framework handl...

Q04SENIOR

How does the Autograd engine use the grad_fn attribute to perform backpr...

Q05SENIOR

What is the difference between torch.no_grad() and torch.inference_mode(...

Q01 of 05SENIOR

What is the difference between torch.Tensor (class constructor) and torch.tensor (factory function)?

ANSWER

torch.Tensor (capital T) is the tensor class constructor. Calling torch.Tensor([1, 2, 3]) creates a float32 tensor — it is equivalent to torch.FloatTensor([1, 2, 3]). It does not perform dtype inference from the input, does not accept a device argument directly, and is considered a low-level interface. torch.tensor (lowercase t) is the recommended factory function for creating tensors from data. It infers dtype from the Python or NumPy type of the input (int64 for Python integers, float32 for Python floats with a trailing decimal, and so on), always creates a copy of the data, and accepts dtype, device, and requires_grad as arguments. In production code, torch.tensor() should appear in all data creation — torch.Tensor() should not. The practical danger of torch.Tensor(): calling torch.Tensor(3, 4) creates a 3x4 tensor of uninitialised memory rather than a tensor from the data [3, 4], which is a silent correctness bug.
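The differences described in this answer can be checked directly. A short sketch, assuming the default dtype has not been changed from float32:

```python
import torch

from_factory = torch.tensor([1, 2, 3])   # dtype inferred from the data
from_class = torch.Tensor([1, 2, 3])     # always the default float dtype

print(from_factory.dtype)  # torch.int64
print(from_class.dtype)    # torch.float32

# The trap: integer arguments are read as a shape, not as data
as_shape = torch.Tensor(3, 4)    # 3x4 tensor of uninitialised memory
as_data = torch.tensor([3, 4])   # two elements: 3 and 4
print(as_shape.shape, as_data.shape)
```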

FAQ · 5 QUESTIONS

Frequently Asked Questions

What is a PyTorch Tensor in simple terms?

What is the difference between .view() and .reshape()?

How do I move a tensor from GPU back to CPU?

Why does torch.cuda.is_available() return False even though I have a GPU?

When should I use torch.no_grad() vs torch.inference_mode()?

Naren Founder & Principal Engineer

20+ years shipping production ML systems and the infrastructure behind them. Drawn from code that ran under real load.

✓ Verified

production tested

July 19, 2026

last updated

2,466

articles · all by Naren
