PyTorch Tensors Explained
- Tensors are GPU-accelerated multidimensional arrays — the universal data structure for all PyTorch operations
- They mirror NumPy's API but add CUDA support and automatic differentiation via Autograd
- requires_grad=True enables gradient tracking — the tensor records every operation for backpropagation
- Moving tensors to GPU with .to('cuda') provides 10-100x speedup for large matrix operations
- Device mismatch (CPU tensor + GPU tensor in the same operation) is the #1 production RuntimeError — always check .device
- Small tensors incur more transfer overhead than they save — only move to GPU when the computation justifies it
Production Debug Guide

Common symptoms when tensor operations fail:

Device mismatch crash
- Check the default device: python -c "import torch; x = torch.tensor([1.0]); print('Default device:', x.device)"
- Check CUDA visibility: python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device count:', torch.cuda.device_count()); print('Current device:', torch.cuda.current_device() if torch.cuda.is_available() else 'N/A')"

CUDA out of memory
- Check GPU memory from the shell: nvidia-smi --query-gpu=memory.used,memory.total --format=csv
- Run torch.cuda.memory_summary() to see a breakdown of current allocations: python -c "import torch; print(torch.cuda.memory_summary())"
- Try torch.cuda.empty_cache() between epochs, and check whether the loss is being logged with loss.item() — logging the loss tensor directly holds the entire computation graph in memory.

Tensor shape mismatch
- Verify matmul shapes: python -c "import torch; x = torch.randn(3,4); y = torch.randn(4,5); print('x:', x.shape, 'y:', y.shape, 'result:', (x@y).shape)"
- Inspect unsqueeze behaviour: python -c "import torch; x = torch.randn(3); print('Original:', x.shape, 'Unsqueezed row:', x.unsqueeze(0).shape, 'Unsqueezed col:', x.unsqueeze(1).shape)"

Gradients are None
- Check for torch.no_grad() wrapping the forward pass, and verify that requires_grad=True is set on the parameters — freezing layers (param.requires_grad = False) is a common source of this.

Production Incident
The fix was adding assert next(model.parameters()).is_cuda, 'Model is not on GPU' before the training loop starts — this one line catches the most expensive silent failure in production ML. A startup log was also added that prints the device of the first model parameter and the first input batch, so device placement is visible from the very first line of training output — not something you discover after 8 hours of slow training.

PyTorch Tensors are the fundamental data structure in PyTorch — every input, weight, gradient, and output is a tensor. They are multidimensional arrays that mirror NumPy's API but add two capabilities that NumPy does not have: GPU acceleration via CUDA and automatic differentiation via Autograd.
The key design decision: tensors are not just data containers. When requires_grad=True, they become nodes in a dynamic computation graph. Every operation on them is recorded as the forward pass executes, enabling automatic gradient computation when you call .backward(). This is what makes neural network training tractable — without it, you would manually compute partial derivatives for every parameter on every update, which is not realistic at any modern model size.
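A minimal sketch of that mechanism, using illustrative scalar values rather than a real model:

```python
import torch

# Two learnable parameters, opted into the computation graph
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

x = torch.tensor(2.0)    # input data: no gradient tracking needed
loss = (w * x + b) ** 2  # the forward pass is recorded as it executes

loss.backward()          # traverse the graph in reverse via the chain rule
# d(loss)/dw = 2 * (w*x + b) * x = 2 * 7 * 2 = 28
# d(loss)/db = 2 * (w*x + b)     = 2 * 7     = 14
print(w.grad, b.grad)  # tensor(28.) tensor(14.)
```

One call to .backward() filled in .grad on both parameters simultaneously, which is exactly what an optimiser consumes on each update step.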
The production failure pattern: device mismatch. A tensor on CPU cannot participate in the same operation as a tensor on GPU. PyTorch raises RuntimeError: Expected all tensors to be on the same device immediately and clearly. What is not clear is which tensor is on the wrong device. The fix is always the same: establish device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') once, pass it to every tensor creation call, and add an assertion at training start that verifies model parameters and input data are on the same device.
As of 2026, there is a third dimension worth knowing: Apple Silicon. PyTorch supports MPS (Metal Performance Shaders) on M-series Macs via device='mps', which provides meaningful GPU acceleration on MacBooks without CUDA. The same .to(device) pattern applies — the device string changes, the code does not.
What Is a PyTorch Tensor and Why Does It Exist?
A PyTorch tensor is a multidimensional array that was designed to solve a problem NumPy cannot: running massive parallel computations on a GPU while simultaneously tracking every operation for automatic gradient computation.
NumPy arrays are excellent for scientific computing — fast, well-documented, universally supported. But they have two hard limits. First, they run only on CPU. Second, they have no concept of a computation graph. This means that if you want to train a neural network with NumPy, you implement backpropagation manually — computing partial derivatives by hand for every layer, every parameter, every batch. That is tractable for a two-layer toy network and completely unworkable for anything beyond it.
PyTorch tensors solve both problems. The storage layer underneath a tensor can live on CPU, on an NVIDIA GPU via CUDA, or on Apple Silicon via MPS. When the storage is on GPU, every matrix operation dispatches to CUDA kernels that execute in parallel across thousands of GPU cores — this is why large matrix multiplications are 10–100x faster on GPU for the sizes neural networks operate on. When requires_grad=True, the tensor records every operation as part of a dynamic computation graph. When .backward() is called on the loss, that graph is traversed in reverse and .grad is filled in on every participating tensor via the chain rule — one backward pass, all gradients computed simultaneously.
The architectural detail worth understanding: a tensor is a view into a storage object. The tensor knows its shape, stride, dtype, and device. The storage holds the raw bytes. Operations like .transpose() and .permute() create new tensor views with reordered strides without moving any data in memory — the storage stays identical. This is efficient but has a consequence: the resulting tensor is non-contiguous, and .view() will refuse to work on it because .view() requires elements to be laid out in memory in the same order they are addressed logically. The fix is .contiguous(), which copies the data into a new storage with the expected memory order.
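The stride mechanics can be observed directly; this short sketch uses small illustrative values:

```python
import torch

t = torch.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
print(t.stride())                  # (3, 1): row-major layout

tt = t.transpose(0, 1)                # new view: strides reordered, storage untouched
print(tt.stride())                    # (1, 3)
print(t.data_ptr() == tt.data_ptr())  # True: same underlying storage
print(tt.is_contiguous())             # False

# .view() refuses to reinterpret a non-contiguous layout
try:
    tt.view(6)
except RuntimeError:
    print("view() failed on non-contiguous tensor")

flat = tt.contiguous().view(6)     # .contiguous() copies into fresh, ordered storage
print(flat.tolist())               # [0, 3, 1, 4, 2, 5]
```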
As of PyTorch 2.x, there is a fourth dimension: compiled tensors. torch.compile traces your forward pass and compiles it into optimised kernels using TorchInductor. The tensor API is identical — you add one decorator and the same tensor operations run in a fused, optimised form. For production inference workloads in 2026, torch.compile is the highest-leverage single change you can make to a trained model.
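A sketch of what that looks like in code. The forward function here is a made-up example, and backend="eager" is chosen only so the snippet runs without a compiler toolchain; production inference would normally use the default TorchInductor backend:

```python
import torch

def forward(x, w):
    # toy forward pass: linear projection followed by ReLU
    return torch.relu(x @ w)

# torch.compile wraps the function; the tensor API is unchanged
# (guarded so the sketch also runs on pre-2.x PyTorch)
compiled_forward = torch.compile(forward, backend="eager") if hasattr(torch, "compile") else forward

x = torch.randn(4, 8)
w = torch.randn(8, 2)
out_eager = forward(x, w)
out_compiled = compiled_forward(x, w)
print(torch.allclose(out_eager, out_compiled))  # True: same results, same API
```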
```python
import torch

# Device selection — this pattern belongs at the top of every script
# Supports CUDA (NVIDIA), MPS (Apple Silicon), and CPU fallback
if torch.cuda.is_available():
    device = torch.device("cuda")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = torch.device("mps")  # Apple Silicon GPU — supported from PyTorch 1.12+
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

def initialize_forge_tensors():
    # Creating tensors from Python data — torch.tensor (lowercase) infers dtype
    data = [[1, 2], [3, 4]]
    x_data = torch.tensor(data, dtype=torch.float32, device=device)
    print(f"From list — shape: {x_data.shape}, device: {x_data.device}, dtype: {x_data.dtype}")

    # Creating tensors with specific values — pass device at creation, not after
    x_ones = torch.ones(2, 3, dtype=torch.float32, device=device)
    x_zeros = torch.zeros(2, 3, dtype=torch.float32, device=device)
    x_rand = torch.randn(2, 3, device=device)  # standard normal distribution
    print(f"\nones:\n{x_ones}")
    print(f"zeros:\n{x_zeros}")
    print(f"randn:\n{x_rand}")

    # requires_grad=True — opts this tensor into the computation graph
    # Only set this on learnable parameters, never on input data
    x_grad = torch.randn(3, 3, requires_grad=True, device=device)
    print(f"\nGradient tracking enabled: {x_grad.requires_grad}")
    print(f"grad_fn before operation: {x_grad.grad_fn}")  # None — leaf tensor

    # Any operation on a requires_grad tensor produces a non-leaf tensor with grad_fn
    y = x_grad ** 2  # element-wise square
    print(f"grad_fn after operation: {y.grad_fn}")  # PowBackward0

    return x_data, x_grad

x_data, x_grad = initialize_forge_tensors()

# Tensor metadata inspection — useful diagnostic at the start of debugging
for name, t in [("x_data", x_data), ("x_grad", x_grad)]:
    print(f"{name}: shape={t.shape}, dtype={t.dtype}, device={t.device}, requires_grad={t.requires_grad}")
```
From list — shape: torch.Size([2, 2]), device: cuda:0, dtype: torch.float32
ones:
tensor([[1., 1., 1.],
[1., 1., 1.]], device='cuda:0')
zeros:
tensor([[0., 0., 0.],
[0., 0., 0.]], device='cuda:0')
randn:
tensor([[ 0.3152, -1.2089, 0.7741],
[-0.4156, 0.9823, -0.1205]], device='cuda:0')
Gradient tracking enabled: True
grad_fn before operation: None
grad_fn after operation: <PowBackward0 object at 0x7f3a2c1d4b50>
x_data: shape=torch.Size([2, 2]), dtype=torch.float32, device=cuda:0, requires_grad=False
x_grad: shape=torch.Size([3, 3]), dtype=torch.float32, device=cuda:0, requires_grad=True
- Tensor = multidimensional array with device awareness — the same API whether the storage is on CPU, CUDA, or MPS
- NumPy-like API but with GPU dispatch and Autograd built in —
torch.from_numpy()converts with zero copy - requires_grad=True builds a computation graph as operations execute — .backward() traverses it in reverse to compute all gradients
- GPU tensors use CUDA kernels for massively parallel execution — the speedup is real only for large tensors; small ones have more transfer overhead than benefit
- Tensors are views with shape and stride — .transpose() and .permute() reorder strides without moving data; call .contiguous() before .view() if the tensor is non-contiguous
- For pure inference, wrap the forward pass in torch.inference_mode() — faster than torch.no_grad() and prevents the tensor from being used in a backward pass

Enterprise Data Pipelines: SQL to Tensor Conversion
In production ML systems, training data rarely arrives as a Python list. It lives in a relational database — normalised, versioned, and filtered by business logic before it ever reaches a tensor. The conversion path from SQL to tensor has two meaningful implementation choices that trade memory efficiency against safety, and getting this wrong at scale causes OOM crashes or silent data corruption.
The recommended pipeline: query SQL into a Pandas DataFrame or directly into a NumPy array via the database cursor's fetchall method. Then convert to a tensor using torch.from_numpy(arr) for zero-copy conversion — the tensor and the NumPy array share the same underlying memory, so no data is duplicated. Move to GPU with .to(device). This entire path keeps memory usage as low as possible for datasets that fit in RAM.
The danger with torch.from_numpy(): because the tensor and the NumPy array share memory, modifying the NumPy array after conversion will silently change the tensor's data. In a pipeline where the DataFrame is reused or mutated for other purposes, this can corrupt training data without any error. If the NumPy array may change after conversion, use torch.tensor(arr, device=device) instead — it creates an independent copy at the cost of a second allocation.
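The aliasing is easy to demonstrate with made-up values:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)

shared = torch.from_numpy(arr)   # zero-copy: shares arr's buffer
independent = torch.tensor(arr)  # full copy: owns its storage

arr[0] = 99.0                    # mutate the source array after conversion
print(shared)                    # tensor([99., 2., 3.]): silently changed
print(independent)               # tensor([1., 2., 3.]): unaffected
```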
For datasets that do not fit in RAM — anything beyond a few hundred thousand rows with high-dimensional features — loading everything at once causes OOM before training starts. The correct pattern is a custom Dataset that queries or reads one batch at a time in __getitem__, combined with a DataLoader that parallelises the fetching. This keeps memory usage proportional to batch size regardless of dataset size.
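A minimal sketch of that pattern. An in-memory SQLite table stands in for the production database here, and the table and column names are illustrative rather than taken from any real schema:

```python
import sqlite3
import torch
from torch.utils.data import Dataset, DataLoader

# In-memory SQLite table standing in for the production database
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE samples (feature_a REAL, feature_b REAL, target_label INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(float(i), float(i) * 2.0, i % 2) for i in range(1000)],
)

class SQLDataset(Dataset):
    """Fetches one row per __getitem__ — memory stays proportional to batch size."""

    def __init__(self, connection):
        self.conn = connection
        # Only a COUNT(*) in __init__ — never the full table
        self.length = self.conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Parameterised OFFSET query: one row fetched on demand
        row = self.conn.execute(
            "SELECT feature_a, feature_b, target_label FROM samples LIMIT 1 OFFSET ?",
            (idx,),
        ).fetchone()
        features = torch.tensor(row[:2], dtype=torch.float32)
        label = torch.tensor(row[2], dtype=torch.long)
        return features, label

loader = DataLoader(SQLDataset(conn), batch_size=32, num_workers=0)
features, labels = next(iter(loader))
print(features.shape, labels.shape)  # torch.Size([32, 2]) torch.Size([32])
```

With a real database you would additionally give each DataLoader worker its own connection, since most database handles are not safe to share across processes.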
```sql
-- Extracting normalised features for tensor conversion
-- We pull only the columns needed for training — not SELECT *
-- The WHERE clause filters to verified samples only, matching the Dataset class expectation
-- LIMIT controls batch size for incremental loading patterns
SELECT
    feature_a,
    feature_b,
    target_label
FROM io.thecodeforge.analytics_table
WHERE status = 'processed'
  AND is_verified = TRUE
  AND split_tag = 'train'
ORDER BY sample_id ASC
LIMIT 10000;

-- For incremental loading in a custom Dataset.__getitem__:
-- Use OFFSET :offset LIMIT :batch_size with parameterised queries
-- Never load the full table in __init__ — keep memory proportional to batch size
```
- Convert with torch.from_numpy() — this creates a tensor that shares the same underlying memory as the NumPy array, so no data is duplicated. Move to GPU with .to(device) afterward.
- The caveat: because the tensor and the array share memory, any mutation of the NumPy array after conversion will silently change the tensor's data. If the array may be modified, use torch.tensor(arr, device=device) instead to get an independent copy.
- Rule of thumb: from_numpy() for read-only pipelines where you control the source array's lifetime; tensor() when in doubt. torch.tensor() creates a copy — safer for production pipelines where the source data has a longer lifetime than the tensor.
- For large datasets, fetch in chunks — loading everything at once is an OOM crash waiting to happen.

Standardising Environments with Docker
PyTorch's GPU support depends on a precise version compatibility chain: host NVIDIA driver → CUDA runtime → cuDNN → PyTorch. A mismatch at any point produces a silent failure — torch.cuda.is_available() returns False, tensors silently stay on CPU, and training runs at a fraction of expected speed with no error message to guide you. Docker solves this by fixing the entire stack in one image tag.
The compatibility rule: the CUDA version in the Docker base image must be less than or equal to the CUDA version supported by the host machine's NVIDIA driver. The driver's maximum supported CUDA version is shown in the top-right corner of nvidia-smi output. If the image requests a higher CUDA version than the driver supports, PyTorch loads but CUDA initialisation fails silently. The fix is always to pick a base image whose CUDA version is at or below what nvidia-smi reports.
Two environment variables matter for GPU-enabled containers. NVIDIA_VISIBLE_DEVICES controls which physical GPUs the container can see — set it to all for training containers or to a specific index when you need to isolate workloads on a multi-GPU host. NVIDIA_DRIVER_CAPABILITIES tells the NVIDIA Container Toolkit which driver features to expose — compute gives you CUDA compute, utility gives you nvidia-smi inside the container. Both should be set in the Dockerfile rather than passed at runtime so they are reproducible.
The verification step that should run before every training job: add a startup script that calls torch.cuda.is_available(), prints the GPU name and total memory, and exits with a non-zero code if CUDA is not available when it is expected. This turns silent CPU fallback into an immediate and obvious failure that stops the job before it wastes hours of compute.
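One possible shape for such a script, as a sketch. REQUIRE_CUDA is an assumed environment-variable name, not an established convention:

```python
import os
import sys
import torch

# Fail loudly instead of silently training on CPU for hours.
# REQUIRE_CUDA=1 is set in environments where a GPU is mandatory.
require_cuda = os.environ.get("REQUIRE_CUDA", "0") == "1"

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB")
elif require_cuda:
    # Non-zero exit stops the job before it wastes hours of compute
    sys.exit("FATAL: REQUIRE_CUDA=1 but torch.cuda.is_available() is False")
```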
```dockerfile
# Pin specific PyTorch and CUDA versions — never use 'latest' in production
# 'latest' changes silently and makes training runs non-reproducible
# Check pytorch.org for the full compatibility matrix before changing these
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime

WORKDIR /app

# Expose all GPUs to the container runtime
# NVIDIA_VISIBLE_DEVICES=all makes every physical GPU available
# NVIDIA_DRIVER_CAPABILITIES=compute,utility enables CUDA compute and nvidia-smi
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Prevent CUDA memory fragmentation on long training runs
# Without this, OOM errors can occur even when total free memory is sufficient
ENV PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Startup verification — catches CUDA misconfiguration before wasting compute
# Remove the CUDA assertion for CPU-only deployment targets
RUN python -c "import torch; \
    print(f'PyTorch: {torch.__version__}'); \
    print(f'CUDA available: {torch.cuda.is_available()}'); \
    print(f'GPU: {torch.cuda.get_device_name(0)}' if torch.cuda.is_available() else 'GPU: none'); \
    print(f'CUDA version: {torch.version.cuda}')"

# Run with: docker run --gpus all --shm-size=2g -v /data:/data thecodeforge/torch-runtime:2.3.1
CMD ["python", "ForgeTensorBasics.py"]
```
CUDA available: True
GPU: NVIDIA A100-SXM4-40GB
CUDA version: 12.1
Successfully built image thecodeforge/torch-runtime:2.3.1-cuda12.1
- A CUDA version mismatch between image and host driver causes torch.cuda.is_available() to return False silently — all tensors stay on CPU with no warning, and training runs at 1/10th expected throughput.
- Add a startup verification step to your container that asserts CUDA availability and exits with a non-zero code if the assertion fails.
- Add a 'RUN python -c ... torch.cuda.is_available()' step to the Dockerfile build so the image fails to build on a misconfigured host rather than failing silently at training time.

Common Mistakes and How to Avoid Them
Most tensor bugs in production trace back to a handful of patterns. Knowing them in advance is the difference between a 30-second fix and a four-hour debugging session.
Device mismatch is the most common runtime crash. The error message is unambiguous — RuntimeError: Expected all tensors to be on the same device — but identifying which tensor is on the wrong device requires checking .device on each one. The usual culprit is a target tensor or loss buffer created inside the training loop without .to(device), while the model output is correctly on GPU. The fix: establish device once and pass it to every tensor creation call in the loop, not just to the model.
torch.Tensor (capital T) versus torch.tensor (lowercase t) is a confusion that trips up developers coming from other frameworks. torch.Tensor is the tensor class constructor — calling torch.Tensor([1, 2, 3]) creates a float32 tensor from data, but it behaves like torch.FloatTensor and does not perform dtype inference. torch.tensor (lowercase) is the factory function that infers dtype from the input, always creates a copy, accepts device and requires_grad arguments, and is the correct way to create a tensor from data. In production code, always use torch.tensor().
In-place operations on gradient-tracked tensors corrupt the computation graph. When you call a.add_(b), the original value of a — which autograd needs to compute a's gradient during backpropagation — is destroyed. PyTorch raises RuntimeError: a leaf Variable that requires grad is being used in an in-place operation if it catches this immediately, but in some cases the graph is silently corrupted and gradients are wrong without any error. The rule: avoid trailing underscores (add_, mul_, fill_, zero_) on any tensor with requires_grad=True.
Views versus copies is the final major source of confusion. .view(), slicing, and .transpose() all return views that share storage with the original tensor. Modifying a view modifies the original. .clone() creates an independent copy. If you need to modify a slice without affecting the source tensor — common when building augmented versions of a batch — always call .clone() first.
```python
import torch

# Device selection — establish once, pass everywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

# --- Mistake 1: Device mismatch ---
# WRONG: x is on GPU, y is on CPU — crashes on the addition
# x = torch.ones(5, device='cuda')
# y = torch.zeros(5)  # CPU by default
# z = x + y  # RuntimeError: Expected all tensors to be on the same device

# CORRECT: both created on the same device
x = torch.ones(5, device=device)
y = torch.zeros(5, device=device)
z = x + y
print(f"Device mismatch fixed: z = {z}")

# --- Mistake 2: torch.Tensor vs torch.tensor ---
# WRONG: torch.Tensor (class) — confusing, no dtype inference, no device argument
# confused = torch.Tensor([1, 2, 3])  # always float32, always CPU

# CORRECT: torch.tensor (factory function) — explicit dtype, device, requires_grad
clear = torch.tensor([1, 2, 3], dtype=torch.float32, device=device)
print(f"torch.tensor result: {clear}, device: {clear.device}")

# --- Mistake 3: In-place operation on a gradient-tracked tensor ---
a = torch.randn(2, 2, requires_grad=True)
# WRONG: in-place add destroys the value autograd needs for gradient computation
# a.add_(torch.ones(2, 2))  # RuntimeError: in-place modification of leaf variable

# CORRECT: create a new tensor — preserves the computation graph
b = a + torch.ones(2, 2)  # new tensor, a unchanged, graph intact
loss = b.sum()
loss.backward()  # gradients computed correctly
print(f"Gradient after correct operation: {a.grad}")

# --- Mistake 4: View vs clone confusion ---
original = torch.tensor([1.0, 2.0, 3.0, 4.0])
view = original[:2]            # view — shares storage
copied = original[:2].clone()  # independent copy
view[0] = 99.0                 # modifies original too
print(f"After modifying view — original: {original}")   # [99., 2., 3., 4.]
copied[0] = 77.0               # does NOT modify original
print(f"After modifying clone — original: {original}")  # unchanged

# --- Mistake 5: Contiguity and .view() ---
t = torch.randn(3, 4)
t_transposed = t.T  # transpose — non-contiguous, strides reordered
# t_transposed.view(12)  # RuntimeError: view size not compatible with non-contiguous tensor
t_contiguous = t_transposed.contiguous()  # copies data into contiguous layout
reshaped = t_contiguous.view(12)  # works correctly
print(f"Contiguous reshape: {reshaped.shape}")
```
Device mismatch fixed: z = tensor([1., 1., 1., 1., 1.], device='cuda:0')
torch.tensor result: tensor([1., 2., 3.], device='cuda:0')
Gradient after correct operation: tensor([[1., 1.],
[1., 1.]])
After modifying view — original: tensor([99., 2., 3., 4.])
After modifying clone — original: tensor([99., 2., 3., 4.])
Contiguous reshape: torch.Size([12])
- Convert with torch.from_numpy() when the data reaches the model.
- Use torch.tensor() (lowercase) for all data creation.
- Views share storage with the original — use .clone() when you need an independent copy.

| Aspect | NumPy Arrays | PyTorch Tensors |
|---|---|---|
| Hardware support | CPU only — no GPU path | CPU, NVIDIA GPU via CUDA, Apple Silicon via MPS — same API regardless of device |
| Automatic differentiation | Manual — you implement the derivative by hand | Autograd — .backward() computes all gradients via the chain rule in one pass |
| Deep learning ecosystem | Requires wrappers or conversion to use with PyTorch, JAX, or TensorFlow | Native — every PyTorch layer, loss function, and optimizer operates on tensors directly |
| Memory model | Contiguous C-order arrays in CPU memory — straightforward layout | Views with shape and stride — efficient for transpose and reshape, but non-contiguous tensors require .contiguous() before .view() |
| Interoperability | Universal — the lingua franca of the scientific Python ecosystem | torch.from_numpy() converts with zero copy on CPU — full round-trip compatibility |
| When to use it | Data preprocessing, statistics, visualisation, and any computation that does not need GPU or gradients | Any workload that feeds a neural network, requires gradient-based optimisation, or benefits from GPU parallelism on large matrices |
🎯 Key Takeaways
- Tensors are the universal data structure in PyTorch — every input, weight, gradient, and model output is a tensor. Understanding how they work is not optional; it is the foundation every other PyTorch concept builds on.
- requires_grad=True opts a tensor into the computation graph — every subsequent operation is recorded for backpropagation. Set it only on learnable parameters, never on input data, and never on tensors used only for inference.
- Always pass device=device to tensor creation calls — tensors default to CPU and PyTorch never moves them automatically. The cost of forgetting this is training silently on CPU at 1/10th expected speed.
- In-place operations (add_, mul_, fill_) on gradient-tracked tensors corrupt the computation graph — sometimes raising an error immediately, sometimes silently producing wrong gradients. Avoid them on any tensor with requires_grad=True.
- Use torch.tensor() (lowercase) for creating tensors from data. torch.Tensor() (uppercase) is the class constructor and does not infer dtype or accept a device argument — it should not appear in data creation code.
- Views share storage with the original tensor — .transpose(), .permute(), and slicing all return views. Use .clone() when you need a modification-safe independent copy, and .contiguous() before .view() when the tensor is non-contiguous.
Interview Questions on This Topic
- Q: What is the difference between torch.Tensor (class constructor) and torch.tensor (factory function)? (Mid-level)
- Q: Explain the contiguous tensor concept. Why do we often need to call .contiguous() before a .view() operation? (Senior)
- Q: Describe the broadcasting rules in PyTorch. How does the framework handle an operation between a (3, 1) tensor and a (1, 3) tensor? (Mid-level)
- Q: How does the Autograd engine use the grad_fn attribute to perform backpropagation? (Senior)
- Q: What is the difference between torch.no_grad() and torch.inference_mode(), and when should you use each? (Senior)
Frequently Asked Questions
What is a PyTorch Tensor in simple terms?
A PyTorch tensor is a multidimensional array — like a NumPy array — that can live on a GPU and automatically track every mathematical operation performed on it. The GPU part makes large matrix computations 10–100x faster. The operation tracking is what enables automatic differentiation: when you tell PyTorch 'compute gradients', it traces every step backwards and tells you exactly how to adjust each number to reduce your error. These two capabilities together are what make neural network training practical.
What is the difference between .view() and .reshape()?
.view() returns a new tensor with a different shape that shares the same underlying storage as the original. It requires the tensor to be contiguous in memory — if it is not (for example, after a .transpose()), .view() raises RuntimeError: view size is not compatible with input tensor's size and stride. .reshape() is more flexible: it returns a view if the tensor is already contiguous, and silently makes a copy if it is not. The practical rule: use .reshape() unless you explicitly need the guarantee that no data was copied. If .reshape() returns a view, modifying it modifies the original — so be aware of the view semantics either way.
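A small illustration of both behaviours:

```python
import torch

t = torch.arange(6).reshape(2, 3)

v = t.view(3, 2)       # contiguous input: view, shared storage
v[0, 0] = 100
print(t[0, 0].item())  # 100: writing through the view changed t

nc = t.transpose(0, 1)  # non-contiguous
# nc.view(6)            # would raise RuntimeError here
r = nc.reshape(6)       # silently copies, because nc is non-contiguous
r[0] = -1
print(nc[0, 0].item())  # still 100: reshape made a copy in this case
```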
How do I move a tensor from GPU back to CPU?
Call tensor.cpu() to return a new tensor on CPU, or tensor.to('cpu'). If the tensor has requires_grad=True or is part of a computation graph, call tensor.detach().cpu() first — .detach() removes it from the graph so NumPy conversion and other CPU operations work correctly. The full pattern for converting a GPU tensor to a NumPy array: tensor.detach().cpu().numpy(). This order is required: detach before numpy (to remove autograd tracking), cpu before numpy (to move off GPU).
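A minimal sketch of the full pattern; it falls back to CPU when no GPU is present, and the tensor here is a stand-in for a real model output:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A graph-attached output (illustrative, not from a real model)
pred = torch.randn(3, requires_grad=True, device=device) * 2

# pred.numpy() would raise: the tensor tracks gradients (and may live on GPU)
arr = pred.detach().cpu().numpy()  # detach, then cpu, then numpy: order matters
print(type(arr).__name__, arr.shape)  # ndarray (3,)
```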
Why does torch.cuda.is_available() return False even though I have a GPU?
The most common causes in order: (1) The NVIDIA driver is not installed or is too old — check with nvidia-smi. (2) You installed the CPU-only version of PyTorch — the package name differs; install the CUDA-enabled version from pytorch.org. (3) Inside a Docker container, the CUDA version in the base image is higher than what the host driver supports — the container starts but CUDA initialisation fails. (4) The GPU is visible to the OS but not to the current user — check with nvidia-smi -L and verify permissions. In all cases, torch.cuda.is_available() returning False means every tensor stays on CPU and training runs at 1/10th expected speed with no error.
When should I use torch.no_grad() vs torch.inference_mode()?
Use torch.no_grad() during validation loops inside a training run — it disables gradient computation without disabling version counter tracking, which provides a small safety net if other code in the same scope depends on version information. Use torch.inference_mode() for production serving and any pure inference path — it disables both gradient computation and version tracking, runs 10–20% faster, and is the semantically correct choice when you are certain no backward pass will follow. Tensors created inside inference_mode() are marked permanently and cannot be used in a backward pass even after leaving the context, which prevents an entire class of accidental training-in-inference bugs.
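A short sketch of the observable difference:

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

with torch.no_grad():
    val_out = (w * x).sum()  # no graph recorded during validation

with torch.inference_mode():
    inf_out = (w * x).sum()  # no graph, and no version tracking either

print(val_out.requires_grad)   # False
print(inf_out.requires_grad)   # False
print(inf_out.is_inference())  # True: permanently barred from autograd
```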
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.