PyTorch DataLoader and Datasets
- Dataset defines how to access a single sample — implement __len__ and __getitem__ for lazy loading
- DataLoader wraps a Dataset to provide batching, shuffling, and multi-process parallel loading
- pin_memory=True speeds up CPU-to-GPU transfers by using page-locked host memory
- num_workers > 0 parallelizes data loading on CPU — the #1 fix for GPU starvation
- The biggest production mistake is num_workers=0, which serializes loading and slows training 50%+
- In Docker, --shm-size must be increased when num_workers > 0 or you get Bus error crashes
Production Debug Guide
Common symptoms when the data pipeline goes wrong

Training is slow, GPU utilisation is low
- nvidia-smi -l 1  # watch GPU utilisation in real time — below 80% means starvation
- python -c "import torch; p = torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA]); print('profiler ready')"

Bus error crash inside Docker container
- docker exec <container> df -h /dev/shm
- docker inspect <container> | grep ShmSize

DataLoader hangs with no error after a few batches
- import pickle; pickle.dumps(dataset)  # if this raises, you have an unpickleable object
- strace -p <worker_pid>  # check for blocked syscalls in a worker process
PyTorch DataLoader and Datasets decouple data storage from batching logic, enabling scalable pipelines that keep GPUs fully utilised. The Dataset class abstracts how individual samples are accessed — one at a time, lazily, from disk or a database. The DataLoader handles batching, shuffling, and multi-process loading on top of whatever Dataset you hand it.
The core problem these tools solve: training on datasets that do not fit in memory while keeping the GPU fed continuously. If data loading is slower than GPU computation, the GPU sits idle between batches — this is called data starvation and it is one of the most common reasons a training run is 3x slower than it should be. The DataLoader solves this by pre-fetching batches in parallel worker processes while the GPU processes the current batch. That overlap is the entire point.
The architectural separation is deliberate and worth internalising early: Dataset knows how to access one sample. DataLoader knows how to batch, shuffle, and parallelise. This means you can swap your data source (disk, SQL, S3, Kafka) without touching the DataLoader, and you can tune the DataLoader's parallelism without touching the Dataset. Each side has one job.
The most common production failure I see in 2026 is the same one I saw in 2022: developers set num_workers=0 during prototyping because it is simpler, everything works, and then they deploy to a real dataset and discover training is 3–5x slower than it needs to be because data loading is serialised on the main thread. The fix is always num_workers >= 1 with pin_memory=True for GPU training — and documenting that requirement so it does not get reverted in a future PR.
What Is PyTorch DataLoader and Datasets and Why Does It Exist?
PyTorch DataLoader and Datasets exist to solve a single concrete problem: how do you train on data that is too large to fit in memory, while keeping a GPU that costs thousands of dollars per hour fully utilised?
The Dataset class — specifically the Map-style variant — requires implementing two methods: __len__ (how many samples exist) and __getitem__ (fetch one sample by index). That is the entire contract. The Dataset knows nothing about batching, shuffling, or parallelism. It just answers 'give me sample 4,217' as fast as it can.
The DataLoader wraps that Dataset and adds everything else: it selects a batch of indices (optionally shuffled), hands those indices to worker processes that call __getitem__ in parallel, collates the results into a batch tensor, and optionally pre-pins that tensor in page-locked memory for faster GPU transfer. The training loop then pulls pre-fetched batches from a queue without waiting.
The performance insight that changes how you think about this: with num_workers=4 and pin_memory=True, the DataLoader is pre-fetching batch N+1 and N+2 while the GPU is still processing batch N. That pipeline overlap is what keeps GPU utilisation above 90%. Without it — with num_workers=0 — every batch is loaded synchronously on the main thread after the GPU finishes the previous one. The GPU sits idle for however long loading takes. On a dataset of real images with augmentations, that idle time can represent 60–70% of wall-clock training time.
As of 2026, with models being trained on increasingly large datasets and GPUs being increasingly expensive, getting this right is not an optimisation — it is table stakes.
```python
import torch
from torch.utils.data import Dataset, DataLoader

# A minimal but complete custom Dataset implementation
# This is the pattern you will replicate for every new data source
class ForgeProjectDataset(Dataset):
    def __init__(self, data_list: list, labels: list):
        # Store only metadata in __init__ — never load actual data here
        # If you load data in __init__, it all lands in RAM before training starts
        self.data = data_list
        self.labels = labels

    def __len__(self) -> int:
        # DataLoader uses this to know how many batches constitute one epoch
        return len(self.data)

    def __getitem__(self, idx: int):
        # This is called once per sample, in parallel across num_workers processes
        # Keep it fast: one file read, one transform, return one sample
        sample = torch.tensor(self.data[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return sample, label

# Minimal working example — four samples, two features each
raw_data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
raw_labels = [0, 1, 0, 1]
forge_ds = ForgeProjectDataset(raw_data, raw_labels)

# Production-grade DataLoader configuration for GPU training
forge_loader = DataLoader(
    dataset=forge_ds,
    batch_size=2,
    shuffle=True,             # Reshuffle at every epoch for better generalisation
    num_workers=2,            # Two CPU processes load data in parallel
    pin_memory=True,          # Pre-pin batches in page-locked memory for faster GPU transfer
    persistent_workers=True,  # Keep workers alive between epochs — avoids 5-10s restart cost
)

# Verify the DataLoader works before starting a long training run
for batch_idx, (samples, labels) in enumerate(forge_loader):
    print(f"Batch {batch_idx}: samples shape {samples.shape}, labels {labels}")
    if batch_idx >= 1:
        break  # Just checking the first two batches
```
Batch 1: samples shape torch.Size([2, 2]), labels tensor([0, 1])
- Dataset defines how to access ONE sample — it knows nothing about batching or parallelism and should not
- DataLoader wraps the Dataset and adds batching, shuffling, and multi-process loading on top
- Worker processes (producers) load and transform data in parallel on CPU cores while the GPU works
- The training loop (consumer) pulls pre-fetched batches from a queue — ideally it never waits
- pin_memory=True pre-pins batches to page-locked memory so DMA transfers to GPU start without an extra copy step
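The producer-consumer overlap those bullets describe can be sketched with the standard library alone. This is a toy stand-in (threads and sleeps rather than real worker processes and a GPU) for what the DataLoader does: a producer pre-fetches batches into a bounded queue while the consumer trains on the current one, and total time drops from load+compute to roughly max(load, compute).

```python
import queue
import threading
import time

def load_batch(i: int) -> str:
    """Simulated CPU-side data loading (disk read + augmentation)."""
    time.sleep(0.05)
    return f"batch_{i}"

def train_step(batch: str) -> None:
    """Simulated GPU compute on one batch."""
    time.sleep(0.05)

NUM_BATCHES = 10

# Serial pipeline: load, then compute — the num_workers=0 situation
start = time.perf_counter()
for i in range(NUM_BATCHES):
    train_step(load_batch(i))
serial_time = time.perf_counter() - start

# Overlapped pipeline: a producer thread pre-fetches batches into a queue
# while the consumer (the training loop) processes the current one
prefetch_queue: "queue.Queue[str]" = queue.Queue(maxsize=2)

def producer() -> None:
    for i in range(NUM_BATCHES):
        prefetch_queue.put(load_batch(i))

start = time.perf_counter()
t = threading.Thread(target=producer)
t.start()
for _ in range(NUM_BATCHES):
    train_step(prefetch_queue.get())
t.join()
overlap_time = time.perf_counter() - start

print(f"serial: {serial_time:.2f}s, overlapped: {overlap_time:.2f}s")
```

With equal load and compute cost the overlapped run takes roughly half the serial time, which is exactly the idle-GPU gap that num_workers > 0 closes.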
Enterprise Integration: SQL-Backed Datasets
In real production environments, your training data rarely lives in a flat folder of files. It lives in a database — with labels, metadata, train/val/test splits, and versioning all managed in SQL. Implementing a Dataset that queries a SQL backend is one of the more underrated patterns in production ML engineering.
The approach: in __init__, run a single SQL query to fetch metadata only — sample IDs, file paths on disk or object storage, and labels. Store that metadata in memory as a list or DataFrame. In __getitem__, use the file path from metadata to load the actual binary data — the image, audio file, or feature array — from disk or S3. This keeps memory usage proportional to the number of samples (a few bytes per row of metadata), not the size of the data (potentially gigabytes).
The production benefit that makes this pattern worth the setup: when you add new training data, you insert a row into the SQL table and drop the corresponding file on disk. The next training run picks it up automatically via the __init__ query. There is no CSV file to regenerate, no manifest to sync, and no risk of the file list drifting from the actual filesystem state. I have seen teams spend days debugging training regressions that turned out to be a stale CSV pointing to deleted files — this pattern eliminates that entire class of issue.
One thing to watch: do not query SQL inside __getitem__. SQL connections are not thread-safe and cannot be pickled for multi-process workers. Fetch all metadata once in __init__ and do all disk or object-storage I/O in __getitem__.
```sql
-- Fetch sample metadata for Dataset __init__
-- We fetch IDs, paths, and labels here — NOT the binary data
-- Binary data is loaded lazily in __getitem__ using the file_path
-- This query runs once at the start of training, not per batch
SELECT
    sample_id,
    file_path,   -- path to the file on NVMe SSD or S3
    label_id,
    split_tag    -- 'train', 'val', or 'test'
FROM io.thecodeforge.training_data
WHERE project_tag = 'vision_v2'
  AND split_tag = 'train'
  AND is_verified = TRUE  -- exclude samples flagged as corrupted during QA
ORDER BY sample_id ASC;

-- Expected result: one lightweight metadata row per sample
-- Actual image binary data stays on disk until __getitem__ loads it
```
The binary data itself is not transferred — only enough information to load it on demand.
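The Python side of this pattern can be sketched as follows. It is a minimal illustration, not production code: SQLite stands in for the real database, and the table name and columns mirror the query above. Note that DataLoader accepts any object exposing __len__ and __getitem__, so the sketch works as a map-style dataset without a hard torch dependency.

```python
import os
import sqlite3
import tempfile

class SqlBackedDataset:
    """Map-style dataset: metadata via one SQL query in __init__, payload I/O in __getitem__."""

    def __init__(self, db_path: str):
        # Single metadata query at construction time; close the connection
        # immediately so nothing unpickleable rides along to worker processes
        conn = sqlite3.connect(db_path)
        self.rows = conn.execute(
            "SELECT sample_id, file_path, label_id FROM training_data "
            "WHERE split_tag = 'train' AND is_verified = 1 ORDER BY sample_id"
        ).fetchall()
        conn.close()

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        sample_id, file_path, label_id = self.rows[idx]
        # Production version: decode the image/audio here and apply transforms
        with open(file_path, "rb") as f:
            payload = f.read()
        return payload, label_id

# Tiny self-contained demo: one sample file on disk, one metadata row in SQLite
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_0.bin")
with open(sample_path, "wb") as f:
    f.write(b"\x01\x02\x03")

db_path = os.path.join(tmp_dir, "metadata.db")
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE training_data "
    "(sample_id INTEGER, file_path TEXT, label_id INTEGER, split_tag TEXT, is_verified INTEGER)"
)
conn.execute("INSERT INTO training_data VALUES (0, ?, 1, 'train', 1)", (sample_path,))
conn.commit()
conn.close()

forge_sql_ds = SqlBackedDataset(db_path)
payload, label = forge_sql_ds[0]
print(len(forge_sql_ds), label, payload)
```

Swapping SQLite for the production database changes only the connection line and the query string; the lazy-loading contract stays identical.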
Containerised Data Pipelines with Docker
Wrapping your training environment in Docker is the standard way to ensure the data pipeline behaves identically across a developer's laptop, a CI server, and a production GPU cluster. It also surfaces the most common PyTorch DataLoader configuration mistake before it costs you a four-hour training run.
The critical Docker configuration that almost everyone gets wrong the first time: when num_workers > 0, PyTorch uses shared memory at /dev/shm to transfer tensors between worker processes and the main process. Docker's default shared memory allocation is 64MB — a sensible default for containerised web services that never heard of PyTorch. For a training job with num_workers=4 and any real batch size, that 64MB fills up within a few epochs and the container dies with a Bus error and no Python traceback. The fix is one flag: --shm-size=2g.
The deployment checklist I use for every new training container: set --shm-size=2g or larger; mount the data directory as a Docker volume rather than copying it into the image (datasets are too large for image layers and change too frequently); pin the PyTorch version explicitly rather than using pytorch/pytorch:latest (latest changes under you in ways that are hard to reproduce); set num_workers based on the CPU cores allocated to the container, not the host machine's total CPU count; and add a pre-flight health check that verifies /dev/shm has enough free space before the training job starts.
The --shm-size flag also belongs in your docker-compose.yml, your Kubernetes pod spec under resources, and your CI job definition — anywhere the container is launched. If it lives in only one place, it will be dropped in a refactor and you will spend an afternoon diagnosing a Bus error that you already fixed six months ago.
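As an illustration of where the setting lives in each launcher (the service name and image tag below are placeholders): docker-compose exposes it as shm_size, and Kubernetes has no shm-size field at all, so the common workaround is a memory-backed emptyDir mounted over /dev/shm.

```yaml
# docker-compose.yml — shm_size is the compose equivalent of --shm-size
services:
  training:                          # placeholder service name
    image: thecodeforge/training:2.3.0
    shm_size: "2gb"

# Kubernetes pod spec — no shm-size field exists, so mount a
# memory-backed emptyDir over /dev/shm instead:
#
# volumes:
#   - name: dshm
#     emptyDir:
#       medium: Memory
#       sizeLimit: 2Gi
#
# containers[].volumeMounts:
#   - name: dshm
#     mountPath: /dev/shm
```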
```dockerfile
# Pin a specific PyTorch release — 'latest' changes under you
# and reproducing a training run six months later becomes impossible
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install dependencies before copying source code
# This layer is cached as long as requirements.txt does not change
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# IMPORTANT: This container MUST be started with --shm-size=2g
# PyTorch DataLoader uses /dev/shm to transfer tensors between worker processes
# Docker's default 64MB causes Bus error crashes when num_workers > 0
# Example: docker run --shm-size=2g --gpus all -v /data:/data thecodeforge/training:latest

# Pre-flight check: verify shared memory is sufficient before training starts
# Exits with a clear error rather than a cryptic Bus error mid-epoch
HEALTHCHECK --interval=10s --timeout=5s --retries=1 \
    CMD python -c "import shutil; free = shutil.disk_usage('/dev/shm').free; assert free > 1e9, f'Insufficient /dev/shm: {free/1e6:.0f}MB free, need 1000MB+'"

CMD ["python", "ForgeDataset.py"]
```
Healthcheck configured — container will refuse to start training if /dev/shm < 1GB free
Common Mistakes and How to Avoid Them
Most DataLoader bugs in production fall into a small set of patterns. Knowing them in advance means you spend time training models instead of debugging pipelines.
The performance mistakes: num_workers=0 is the biggest one — it serialises every sample load on the main thread and the GPU sits idle while it happens. Loading data in __init__ instead of __getitem__ is the second — it turns a lazy-loading Dataset into a greedy RAM consumer that OOMs before training even starts.
The correctness mistakes: passing unpickleable objects (open file handles, database connections, lambda functions) to a Dataset when num_workers > 0. Python's multiprocessing pickles the Dataset to send it to each worker process. If any attribute cannot be pickled, the worker hangs silently or crashes without a useful traceback. The fix is to initialise those objects inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker process.
The subtle one that costs teams debugging time: forgetting drop_last=True on the training DataLoader. The last batch of an epoch almost always has fewer samples than the configured batch_size. For most loss functions this is harmless, but for BatchNorm it is not — BatchNorm uses batch statistics during training, and a batch of size 1 produces undefined variance. Setting drop_last=True discards the last incomplete batch and ensures consistent batch sizes throughout training. For validation DataLoaders, use drop_last=False — you want to evaluate on every sample, no exceptions.
```python
import torch
from torch.utils.data import DataLoader

# Production-grade DataLoader configuration for GPU training
# Each parameter here solves a specific real problem
loader = DataLoader(
    forge_ds,
    batch_size=32,
    shuffle=True,             # Reshuffle every epoch for better generalisation
    num_workers=4,            # 4 parallel CPU processes — eliminates GPU starvation
    pin_memory=True,          # Pre-pin batches in page-locked memory for faster DMA transfer
    drop_last=True,           # Drop incomplete final batch — prevents BatchNorm issues
    persistent_workers=True,  # Keep workers alive between epochs — avoids 5-10s restart cost
    prefetch_factor=2,        # Each worker pre-fetches 2 batches ahead — reduces wait time
)

# Validation DataLoader — different settings for a reason
val_loader = DataLoader(
    val_ds,
    batch_size=64,            # Larger batch is fine for inference — no gradient storage
    shuffle=False,            # Do NOT shuffle validation — reproducible evaluation
    num_workers=4,
    pin_memory=True,
    drop_last=False,          # Evaluate on EVERY sample — no exceptions
    persistent_workers=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for samples, labels in loader:
    # non_blocking=True overlaps the CPU-to-GPU transfer with other CPU work
    # Only effective when pin_memory=True is also set
    samples = samples.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... training logic ...

# Random seed consistency across workers
# Without this, all workers inherit the same seed and produce identical augmentations
def worker_init_fn(worker_id: int) -> None:
    import numpy as np
    base_seed = torch.initial_seed() % (2 ** 32)
    np.random.seed(base_seed + worker_id)
    torch.manual_seed(base_seed + worker_id)

# Apply it to any DataLoader that uses random augmentations in __getitem__
auged_loader = DataLoader(
    forge_ds,
    batch_size=32,
    num_workers=4,
    worker_init_fn=worker_init_fn,
    pin_memory=True,
)
```
The GPU will receive pre-fetched, pre-pinned batches with consistent random augmentation across workers.
Custom collate_fn: Handling Variable-Length Data
The default collate_fn expects every sample in a batch to have the same shape so it can stack them into a uniform tensor. This assumption breaks the moment you work with NLP sequences of different lengths, graphs with different numbers of nodes, or images that have not been resized to a fixed resolution.
A custom collate_fn lets you define exactly how a list of heterogeneous samples becomes a batch. The most common pattern — one you will write or encounter in almost every NLP project — is pad-and-mask: pad all sequences to the length of the longest sequence in the batch, and return a binary mask tensor that tells downstream layers which positions are real data and which are padding. Attention layers, loss functions, and pooling operations all need this mask to avoid treating padding as signal.
The production subtlety that trips people up: collate_fn runs on the main thread, not inside the worker processes. This means even with num_workers=4, a slow collate_fn becomes the bottleneck for the entire pipeline. Keep it to reshaping and padding only. If you find yourself sorting sequences, computing complex statistics, or doing any non-trivial transformation in collate_fn, move that work into __getitem__ where it can run in parallel across workers.
For NLP work in 2026, most teams use Hugging Face's DataCollatorWithPadding which implements this pattern with tokeniser-aware padding. But understanding the underlying collate_fn contract means you can customise it when the standard collators do not fit your data structure.
```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from typing import List, Tuple

def collate_variable_length(
    batch: List[Tuple[torch.Tensor, torch.Tensor]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Pads variable-length sequences to the longest in the batch.

    Returns:
        padded — (batch_size, max_len, feature_dim) padded sequences
        labels — (batch_size,) classification labels
        mask   — (batch_size, max_len) float mask: 1.0 real, 0.0 padding

    Runs on the main thread — keep this function fast.
    Heavy transforms belong in __getitem__, not here.
    """
    sequences, labels = zip(*batch)

    # pad_sequence pads to the longest sequence in this batch
    # batch_first=True gives shape (batch, seq_len, features)
    padded = pad_sequence(sequences, batch_first=True, padding_value=0.0)

    # Build the mask: 1.0 where real data, 0.0 where padding
    # Downstream attention layers and loss functions use this to ignore padding
    lengths = torch.tensor([len(s) for s in sequences], dtype=torch.long)
    mask = torch.zeros(padded.shape[0], padded.shape[1], dtype=torch.float32)
    for i, length in enumerate(lengths):
        mask[i, :length] = 1.0

    labels_stacked = torch.stack(labels)
    return padded, labels_stacked, mask

# Toy variable-length dataset
class VariableLengthDataset(Dataset):
    def __init__(self, num_samples: int = 200, feature_dim: int = 128):
        self.num_samples = num_samples
        self.feature_dim = feature_dim

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int):
        # Sequence length varies per sample — this is what breaks the default collate_fn
        seq_len = torch.randint(10, 60, (1,)).item()
        features = torch.randn(seq_len, self.feature_dim)
        label = torch.tensor(idx % 2, dtype=torch.long)  # binary label
        return features, label

variable_length_dataset = VariableLengthDataset(num_samples=200, feature_dim=128)

loader = DataLoader(
    variable_length_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=collate_variable_length,  # custom collation for variable-length sequences
    num_workers=4,
    pin_memory=True,
)

for batch_idx, (padded_seqs, labels, mask) in enumerate(loader):
    # padded_seqs: (32, max_len_in_batch, 128)
    # mask:        (32, max_len_in_batch) — 1.0 for real tokens, 0.0 for padding
    real_token_count = mask.sum().item()
    total_positions = mask.numel()
    padding_pct = 100 * (1 - real_token_count / total_positions)
    print(f"Batch {batch_idx}: shape {padded_seqs.shape} | "
          f"padding {padding_pct:.1f}% | labels {labels[:4]}")
    if batch_idx >= 2:
        break
```
Batch 1: shape torch.Size([32, 58, 128]) | padding 36.8% | labels tensor([0, 1, 0, 1])
Batch 2: shape torch.Size([32, 57, 128]) | padding 35.1% | labels tensor([1, 0, 1, 0])
- Use pad_sequence from torch.nn.utils.rnn — it handles batch-first padding in one call and is well-tested
- Always return a mask alongside padded data — every downstream layer that touches sequences needs to know where padding starts
- collate_fn runs on the main thread — if profiling shows it as the bottleneck, move the heavy work into __getitem__ where workers can parallelise it
- Consider bucket sampling (grouping sequences by similar length before batching) to reduce padding waste — padding above 40% per batch is worth addressing
- For images of different sizes, resize in __getitem__ not in collate_fn — resizing is CPU-intensive and belongs in the parallel workers
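The bucket-sampling idea from the list above can be sketched in a few lines. This is a toy version of what library bucket samplers do (the lengths and batch size are illustrative): sort indices by sequence length, carve off contiguous batches so each batch contains similar lengths, then shuffle the batch order so training still sees batches in random order. The returned list of index lists can be passed to DataLoader via batch_sampler=.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group sample indices into batches of similar length to cut padding waste."""
    rng = random.Random(seed)
    # Sort indices by sequence length so batch neighbours have similar lengths
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle the *batch* order — lengths stay similar within each batch
    rng.shuffle(batches)
    return batches

def padding_waste(batches, lengths):
    """Fraction of padded positions that are padding rather than real data."""
    real = padded = 0
    for b in batches:
        longest = max(lengths[i] for i in b)
        padded += longest * len(b)
        real += sum(lengths[i] for i in b)
    return 1 - real / padded

rng = random.Random(42)
lengths = [rng.randint(10, 60) for _ in range(64)]

naive = [list(range(i, i + 8)) for i in range(0, 64, 8)]   # arrival order
bucketed = bucket_batches(lengths, batch_size=8)

print(f"naive padding waste:    {padding_waste(naive, lengths):.1%}")
print(f"bucketed padding waste: {padding_waste(bucketed, lengths):.1%}")
```

One caveat worth noting: because bucketing constrains which samples can share a batch, it trades a little shuffling randomness for less padding, which is usually the right trade once waste climbs past the 40% mark mentioned above.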
IterableDataset: Streaming Data Without Random Access
Map-style datasets require __len__ and __getitem__ — random access to any sample by index. This breaks when data arrives as a stream (Kafka, network logs, real-time sensor feeds) or when the dataset is genuinely too large to index. IterableDataset solves this by yielding samples sequentially without needing to know the total size.
The use case that justifies reaching for IterableDataset: training on a live event stream where the concept of 'total dataset size' does not exist, or on a dataset so large that generating a complete index would take longer than training itself. The DataLoader iterates through the __iter__ method, batches samples as they arrive, and provides limited shuffling within a buffer of recent samples.
The production trade-off that you need to understand before choosing IterableDataset: it cannot shuffle globally because it never knows the full dataset. It can shuffle within a configurable buffer of recent samples, but the model always sees data in approximately the order it arrives in the stream. If the stream has any temporal structure — and real data almost always does — the model will see a biased distribution. For training data that can be indexed, Map-style datasets with global shuffling are strictly better. Use IterableDataset only when indexing is genuinely impossible.
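The buffer-based shuffle described above (the same idea tf.data and webdataset use) can be sketched as a plain generator; the buffer size here is illustrative. Each emitted sample is drawn at random from a fixed-size reservoir of recent samples, so an item can only move a limited distance from its arrival position — which is exactly why the result is weaker than a global shuffle.

```python
import random

def shuffle_buffer(stream, buffer_size: int, seed: int = 0):
    """Approximate shuffling for a sequential stream.

    Holds up to buffer_size samples; each yielded sample is drawn at
    random from the buffer, whose slot is refilled from the stream.
    """
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            idx = rng.randrange(len(buf))
            # Swap-and-pop: O(1) removal of a random buffer element
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # Stream exhausted — drain whatever remains in the buffer
    rng.shuffle(buf)
    yield from buf

shuffled = list(shuffle_buffer(range(20), buffer_size=5))
print(shuffled)
```

Every sample comes out exactly once, but early items can never appear late in the output: with a buffer of 5, sample 0 must be emitted among the first 5 yields. That residual ordering is the temporal bias the paragraph above warns about.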
One non-obvious operational issue with num_workers > 0 and IterableDataset: each worker receives its own copy of the __iter__ method and will iterate the entire stream independently. Without sharding the stream across workers, every sample gets loaded num_workers times. You need to detect the worker ID inside __iter__ using torch.utils.data.get_worker_info() and partition the stream so each worker handles a distinct subset.
```python
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info
from typing import Iterator, Tuple

class StreamingSensorDataset(IterableDataset):
    """
    Yields samples from a simulated sensor stream. In production, replace
    __iter__ with a Kafka consumer, network socket reader, or database cursor.

    IMPORTANT: with num_workers > 0, each worker calls __iter__ independently.
    Without sharding, every sample is loaded num_workers times.
    The __iter__ below handles worker partitioning automatically.
    """

    def __init__(self, num_samples: int = 10000):
        self.num_samples = num_samples

    def __iter__(self) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
        worker_info = get_worker_info()
        if worker_info is None:
            # Single-process loading — yield the full stream
            start, end = 0, self.num_samples
        else:
            # Multi-process loading — partition stream across workers
            # Each worker gets a non-overlapping slice of the sample range
            per_worker = self.num_samples // worker_info.num_workers
            worker_id = worker_info.id
            start = worker_id * per_worker
            end = start + per_worker if worker_id < worker_info.num_workers - 1 else self.num_samples

        for i in range(start, end):
            # Simulate streaming sensor data — replace with real I/O in production
            features = torch.randn(10)
            label = torch.tensor(float(features[0] > 0), dtype=torch.float32)
            yield features, label

stream_ds = StreamingSensorDataset(num_samples=10000)

stream_loader = DataLoader(
    stream_ds,
    batch_size=64,
    num_workers=2,  # Each worker now handles a non-overlapping shard of the stream
)

for batch_idx, (features, labels) in enumerate(stream_loader):
    if batch_idx >= 5:
        break
    print(f"Batch {batch_idx}: features {features.shape}, "
          f"label distribution: {labels.sum().item():.0f}/{len(labels)} positive")
```
Batch 1: features torch.Size([64, 10]), label distribution: 31/64 positive
Batch 2: features torch.Size([64, 10]), label distribution: 33/64 positive
Batch 3: features torch.Size([64, 10]), label distribution: 30/64 positive
Batch 4: features torch.Size([64, 10]), label distribution: 32/64 positive
Shard the stream with get_worker_info() inside __iter__ — otherwise each sample is loaded num_workers times, silently doubling or quadrupling I/O load. Use Map-style datasets whenever the data can be indexed; reach for IterableDataset only when it genuinely cannot.

| Feature | Standard Python List / Loop | PyTorch DataLoader / Dataset |
|---|---|---|
| Memory Usage | High — entire dataset loaded into RAM before training starts | Low — lazy loading per sample in __getitem__, only one batch in memory at a time |
| Concurrency | Single-threaded — data loading blocks the main thread and therefore the GPU | Multi-process via num_workers — true parallelism that bypasses Python's GIL |
| Batching | Manual list slicing — you write the indexing logic and handle edge cases like the last batch | Automatic via batch_size — DataLoader handles indexing, collation, and drop_last |
| Data Shuffling | Manual random.shuffle() — must remember to call it every epoch and it operates on the full list in memory | Built-in per-epoch shuffling via shuffle=True — operates on indices, not the data itself |
| GPU Integration | Manual .to(device) on every tensor — no transfer optimisation | Optimised via pin_memory=True and non_blocking=True — DMA transfer without extra copy step |
🎯 Key Takeaways
- Dataset and DataLoader have deliberately separate responsibilities — Dataset knows how to access one sample, DataLoader knows how to batch, shuffle, and parallelise. Understanding this separation makes every configuration decision obvious.
- num_workers=0 is the default and the most common performance mistake — it serialises data loading on the main thread and leaves the GPU idle between batches. Always override it for GPU training.
- In Docker, --shm-size=2g is a required flag when num_workers > 0, not an optional optimisation. The default 64MB causes Bus error crashes and leaves no Python traceback to diagnose from.
- Load only metadata in __init__ and actual data in __getitem__ — this is what makes lazy loading work. Loading data in __init__ converts a scalable pipeline into an OOM crash before training starts.
- collate_fn runs on the main thread regardless of num_workers — keep it to padding and reshaping only, and move transforms into __getitem__ where workers can parallelise them.
- IterableDataset cannot shuffle globally and requires per-worker stream sharding via get_worker_info() when num_workers > 0. Use Map-style datasets whenever indexing is possible.
Interview Questions on This Topic
- Q: How does the DataLoader utilise Python's multi-processing to bypass the Global Interpreter Lock (GIL)? (Mid-level)
- Q: Explain the Producer-Consumer pattern as it applies to the relationship between a CPU DataLoader and a GPU training loop. (Mid-level)
- Q: What is the difference between a Map-style and an Iterable-style Dataset? When would you strictly choose the latter? (Senior)
- Q: How would you implement a custom collate_fn to handle a dataset where samples have variable sequence lengths? (Senior)
- Q: Describe the purpose of pin_memory. How does it interact with pageable vs pinned host memory during DMA transfers to the GPU? (Senior)
Frequently Asked Questions
What is the difference between Dataset and DataLoader in PyTorch?
A Dataset defines how to access a single sample — it implements __len__ (total number of samples) and __getitem__ (fetch one sample by index). It knows nothing about batching or parallelism. A DataLoader wraps a Dataset and adds everything else: batching (batch_size), per-epoch shuffling (shuffle=True), multi-process loading (num_workers), and GPU transfer optimisation (pin_memory). The Dataset is the data source; the DataLoader is the pipeline that feeds data to the training loop at the right cadence to keep the GPU busy.
How many num_workers should I use?
A practical starting point is 4 for single-GPU training. Increase it if GPU utilisation measured by nvidia-smi is below 85% after setting num_workers=4 and pin_memory=True. Do not set num_workers higher than the number of CPU cores allocated to your process — in Docker and Kubernetes, check the container's CPU limit, not the host machine's total core count. Setting num_workers too high causes CPU contention between workers and actually slows down data loading. The right number is the smallest value that keeps GPU utilisation above 85%.
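The "container limit, not host count" point can be checked in code. On Linux, os.sched_getaffinity(0) reports the CPUs this process is actually allowed to run on, while os.cpu_count() reports the host total. One caveat, noted in the comments: affinity reflects cpuset pinning but not CFS quota throttling, so also check your container runtime's CPU limit.

```python
import os

def usable_cpu_count() -> int:
    """CPUs this process may actually run on.

    On Linux, sched_getaffinity reflects cpuset/affinity restrictions
    (e.g. Kubernetes cpuset pinning). It does NOT see CFS quota limits,
    so verify against the container's configured CPU limit as well.
    """
    try:
        return len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:
        return os.cpu_count() or 1           # macOS / Windows fallback

# Leave one core free for the main training process
suggested_workers = max(1, usable_cpu_count() - 1)
print(f"usable CPUs: {usable_cpu_count()}, suggested num_workers: {suggested_workers}")
```

Treat the suggestion as a starting point, then tune against the GPU-utilisation target described above.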
When should I use TensorDataset vs a custom Dataset?
Use TensorDataset when your data is already in memory as tensors and the full dataset fits comfortably in RAM. It provides __len__ and __getitem__ for free with zero boilerplate — no custom class needed. Use a custom Dataset when data is on disk or object storage and too large to fit in RAM, when samples need per-sample transforms or augmentation, or when data comes from a non-tensor source like a database or API. The decision is straightforward: if the data is already a tensor in memory, TensorDataset. If it needs to be loaded from somewhere, custom Dataset.
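A quick sketch of the TensorDataset path, with toy in-memory tensors standing in for real data:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Data already lives in memory as tensors — TensorDataset wraps them
# directly, no custom __len__/__getitem__ boilerplate needed
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

ds = TensorDataset(features, labels)  # indexes both tensors along dim 0
loader = DataLoader(ds, batch_size=25, shuffle=True)

x, y = ds[3]            # one sample: a (8,) feature tensor and its label
batches = list(loader)  # 100 samples / batch_size 25 = 4 batches
print(len(ds), len(batches), batches[0][0].shape)
```

If __getitem__ ever needs to touch a disk, a database, or a transform pipeline, that is the signal to switch to a custom Dataset.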
Why does my DataLoader hang with num_workers > 0?
The most common cause is unpickleable objects stored in the Dataset. Python's multiprocessing pickles the Dataset to send it to each worker process. Open file handles, database connections, and lambda functions cannot be pickled — the worker hangs silently rather than raising a clear exception. Test it first: import pickle; pickle.dumps(dataset) should complete without raising. The fix is to initialise those objects inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker. A second cause is a deadlock inside __getitem__ — for example, waiting on a threading lock that is held by the main process.
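The pickle smoke test from the answer above can be wrapped as a reusable pre-flight check. This is a pure-stdlib sketch; the lambda-holding dataset is a deliberately broken example of the unpickleable-attribute failure mode.

```python
import pickle

def is_worker_safe(dataset) -> bool:
    """Return True if the dataset can be sent to DataLoader worker processes.

    multiprocessing pickles the dataset once per worker; anything that
    fails here would make workers hang or crash without a useful traceback.
    """
    try:
        pickle.dumps(dataset)
        return True
    except Exception as exc:
        print(f"dataset is not worker-safe: {exc!r}")
        return False

class BrokenDataset:
    def __init__(self):
        # Lambdas (like open files and DB connections) cannot be pickled
        self.transform = lambda x: x * 2

class SafeDataset:
    def __init__(self):
        self.scale = 2  # plain data attributes pickle fine

print(is_worker_safe(BrokenDataset()))  # False — lambda attribute
print(is_worker_safe(SafeDataset()))    # True
```

Running this check once at startup turns a silent worker hang into an immediate, readable error.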
What is persistent_workers and when should I use it?
persistent_workers=True keeps worker processes alive between epochs rather than destroying and re-creating them. Worker startup involves forking processes, importing modules, and re-initialising any objects in worker_init_fn — this typically costs 5–10 seconds per epoch on a standard training machine. With persistent_workers=True, that overhead only occurs once at the start of training. Use it whenever you are training for more than a few epochs and num_workers > 0. The only trade-off is slightly higher baseline memory usage because the workers remain resident. It is almost always worth it.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.