
PyTorch DataLoader and Datasets

📍 Part of: PyTorch → Topic 6 of 7
A comprehensive guide to PyTorch DataLoader and Datasets — learn to manage data pipelines, implement custom datasets, and optimize batch loading for ML models.
⚙️ Intermediate — basic ML / AI knowledge assumed
In this tutorial, you'll learn
  • Dataset and DataLoader have deliberately separate responsibilities — Dataset knows how to access one sample, DataLoader knows how to batch, shuffle, and parallelise. Understanding this separation makes every configuration decision obvious.
  • num_workers=0 is the default and the most common performance mistake — it serialises data loading on the main thread and leaves the GPU idle between batches. Always override it for GPU training.
  • In Docker, --shm-size=2g is a required flag when num_workers > 0, not an optional optimisation. The default 64MB causes Bus error crashes and leaves no Python traceback to diagnose from.
Quick Answer
  • Dataset defines how to access a single sample — implement __len__ and __getitem__ for lazy loading
  • DataLoader wraps a Dataset to provide batching, shuffling, and multi-process parallel loading
  • pin_memory=True speeds up CPU-to-GPU transfers by using page-locked host memory
  • num_workers > 0 parallelizes data loading on CPU — the #1 fix for GPU starvation
  • The biggest production mistake is num_workers=0, which serializes loading and slows training 50%+
  • In Docker, --shm-size must be increased when num_workers > 0 or you get Bus error crashes
🚨 START HERE
DataLoader Debug Cheat Sheet
Quick commands to diagnose data pipeline issues
🟠 Training is slow, GPU utilisation is low
Immediate Action: Profile the gap between data loading time and GPU compute time
Commands
nvidia-smi -l 1 # watch GPU utilisation in real time — below 80% means starvation
python -c "import torch; p = torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA]); print('profiler ready')"
Fix Now: Set num_workers=4 and pin_memory=True — this is the fix for the vast majority of GPU starvation cases
🟡 Bus error crash inside Docker container
Immediate Action: Check shared memory allocation and current usage before touching anything else
Commands
docker exec <container> df -h /dev/shm
docker inspect <container> | grep ShmSize
Fix Now: Restart the container with --shm-size=2g and add this flag to your docker run script or compose file permanently
🟡 DataLoader hangs with no error after a few batches
Immediate Action: Test whether the Dataset itself can be pickled — if it cannot, workers will hang silently
Commands
import pickle; pickle.dumps(dataset) # if this raises, you have an unpickleable object
strace -p <worker_pid> # check for blocked syscalls in a worker process
Fix Now: Move file handles and DB connections out of __init__ and into __getitem__, or initialise them in worker_init_fn
Production Incident: Docker container crashes with Bus error when num_workers > 0
A training job worked fine locally with num_workers=4 but crashed with SIGBUS (Bus error) inside a Docker container after 2–3 epochs.
Symptom: Training starts normally and runs for 2–3 epochs before dying with a Bus error. The crash is intermittent — sometimes it survives 10 epochs before failing. No Python traceback is produced, which makes it look like a hardware or OS issue rather than a configuration problem.
Assumption: The dataset has corrupted files or the host machine has faulty RAM. The team spent two days running memory diagnostics and re-validating the dataset before looking at the container configuration.
Root cause: PyTorch's multi-process DataLoader uses shared memory at /dev/shm to transfer tensors between worker processes and the main process. Docker's default shared memory allocation is 64MB — a number that made sense for containerised web services but is completely inadequate for ML training. With num_workers=4 and batch_size=64 on a dataset of any real size, the shared memory segment fills up within a few epochs. When a worker tries to write to a full shared memory segment, the OS sends SIGBUS. There is no Python-level exception because the failure happens at the OS level, below the Python runtime.
Fix: Added --shm-size=2g to the docker run command. Added a pre-training health check that reads /dev/shm free space and exits with a clear error message if it is below 1GB. Documented the --shm-size requirement in the Dockerfile as a comment and in the repository's README under 'Running in Docker'. Added the flag to the CI/CD job definition so it can never be accidentally dropped.
Key Lesson
  • Docker's default 64MB shared memory is too small for PyTorch multi-process DataLoader — this is not a corner case, it affects every real training job
  • Always run Docker containers with --shm-size=2g or larger when num_workers > 0, and encode this in your CI job definition so it cannot be silently removed
  • Bus error crashes with no Python traceback are the unmistakable signature of shared memory exhaustion — do not waste time on hardware diagnostics before checking /dev/shm
  • Test training inside Docker before shipping to production — local and container environments differ in ways that only surface under sustained load
Production Debug Guide
Common symptoms when the data pipeline goes wrong
Training is 3–5x slower than expected, GPU utilisation is below 50%
Fix: Check if num_workers=0. Increase to num_workers=4 and add pin_memory=True. Profile with torch.profiler to confirm data loading is the bottleneck and not something in __getitem__ itself — sometimes the bottleneck is a slow transform, not the I/O.
Bus error (SIGBUS) crash with no Python traceback inside Docker
Fix: Increase Docker shared memory: docker run --shm-size=2g. Verify with df -h /dev/shm inside the container before starting training. Add a pre-flight check to your training entry point that asserts free shared memory exceeds a minimum threshold.
RuntimeError: DataLoader worker is killed by signal
Fix: Check for OOM in worker processes — workers are separate processes and their memory usage is not reflected in the main process metrics. Reduce num_workers or batch_size. Check if __getitem__ loads entire files into memory rather than streaming or memory-mapping them.
DataLoader hangs indefinitely — no error, no progress
Fix: Check for unpickleable objects in the Dataset. Python's multiprocessing pickles the Dataset to send it to each worker — open file handles, database connections, and lambda functions all fail silently here. Use worker_init_fn for any per-worker initialisation that cannot be pickled.
Different workers produce identical random augmentations across the same epoch
Fix: Set a unique random seed per worker using worker_init_fn. Without this, all workers inherit the same base seed from the main process and generate identical augmentation sequences, which reduces effective data diversity and can silently hurt generalisation.

PyTorch DataLoader and Datasets decouple data storage from batching logic, enabling scalable pipelines that keep GPUs fully utilised. The Dataset class abstracts how individual samples are accessed — one at a time, lazily, from disk or a database. The DataLoader handles batching, shuffling, and multi-process loading on top of whatever Dataset you hand it.

The core problem these tools solve: training on datasets that do not fit in memory while keeping the GPU fed continuously. If data loading is slower than GPU computation, the GPU sits idle between batches — this is called data starvation and it is one of the most common reasons a training run is 3x slower than it should be. The DataLoader solves this by pre-fetching batches in parallel worker processes while the GPU processes the current batch. That overlap is the entire point.

The architectural separation is deliberate and worth internalising early: Dataset knows how to access one sample. DataLoader knows how to batch, shuffle, and parallelise. This means you can swap your data source (disk, SQL, S3, Kafka) without touching the DataLoader, and you can tune the DataLoader's parallelism without touching the Dataset. Each side has one job.

The most common production failure I see in 2026 is the same one I saw in 2022: developers set num_workers=0 during prototyping because it is simpler, everything works, and then they deploy to a real dataset and discover training is 3–5x slower than it needs to be because data loading is serialised on the main thread. The fix is always num_workers >= 1 with pin_memory=True for GPU training — and documenting that requirement so it does not get reverted in a future PR.

What Is PyTorch DataLoader and Datasets and Why Does It Exist?

PyTorch DataLoader and Datasets exist to solve a single concrete problem: how do you train on data that is too large to fit in memory, while keeping a GPU that costs thousands of dollars per hour fully utilised?

The Dataset class — specifically the Map-style variant — requires implementing two methods: __len__ (how many samples exist) and __getitem__ (fetch one sample by index). That is the entire contract. The Dataset knows nothing about batching, shuffling, or parallelism. It just answers 'give me sample 4,217' as fast as it can.

The DataLoader wraps that Dataset and adds everything else: it selects a batch of indices (optionally shuffled), hands those indices to worker processes that call __getitem__ in parallel, collates the results into a batch tensor, and optionally pre-pins that tensor in page-locked memory for faster GPU transfer. The training loop then pulls pre-fetched batches from a queue without waiting.

The performance insight that changes how you think about this: with num_workers=4 and pin_memory=True, the DataLoader is pre-fetching batch N+1 and N+2 while the GPU is still processing batch N. That pipeline overlap is what keeps GPU utilisation above 90%. Without it — with num_workers=0 — every batch is loaded synchronously on the main thread after the GPU finishes the previous one. The GPU sits idle for however long loading takes. On a dataset of real images with augmentations, that idle time can represent 60–70% of wall-clock training time.

As of 2026, with models being trained on increasingly large datasets and GPUs being increasingly expensive, getting this right is not an optimisation — it is table stakes.

io/thecodeforge/ml/forge_dataset.py · PYTHON
import torch
from torch.utils.data import Dataset, DataLoader

# A minimal but complete custom Dataset implementation
# This is the pattern you will replicate for every new data source
class ForgeProjectDataset(Dataset):
    def __init__(self, data_list: list, labels: list):
        # Store only metadata in __init__ — never load actual data here
        # If you load data in __init__, it all lands in RAM before training starts
        self.data = data_list
        self.labels = labels

    def __len__(self) -> int:
        # DataLoader uses this to know how many batches constitute one epoch
        return len(self.data)

    def __getitem__(self, idx: int):
        # This is called once per sample, in parallel across num_workers processes
        # Keep it fast: one file read, one transform, return one sample
        sample = torch.tensor(self.data[idx], dtype=torch.float32)
        label  = torch.tensor(self.labels[idx], dtype=torch.long)
        return sample, label


# Minimal working example — four samples, two features each
raw_data   = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
raw_labels = [0, 1, 0, 1]

forge_ds = ForgeProjectDataset(raw_data, raw_labels)

# Production-grade DataLoader configuration for GPU training
forge_loader = DataLoader(
    dataset=forge_ds,
    batch_size=2,
    shuffle=True,           # Reshuffle at every epoch for better generalisation
    num_workers=2,          # Two CPU processes load data in parallel
    pin_memory=True,        # Pre-pin batches in page-locked memory for faster GPU transfer
    persistent_workers=True # Keep workers alive between epochs — avoids 5-10s restart cost
)

# Verify the DataLoader works before starting a long training run
for batch_idx, (samples, labels) in enumerate(forge_loader):
    print(f"Batch {batch_idx}: samples shape {samples.shape}, labels {labels}")
    if batch_idx >= 1:
        break  # Just checking the first two batches
▶ Output
Batch 0: samples shape torch.Size([2, 2]), labels tensor([1, 0])
Batch 1: samples shape torch.Size([2, 2]), labels tensor([0, 1])
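To confirm that data loading (rather than the model) is the bottleneck before reaching for num_workers, one simple single-process measurement is to time how long the training loop waits for each batch versus how long it spends computing. A minimal sketch — SlowDataset and the sleep durations are illustrative stand-ins for real disk I/O and GPU work, not part of the example above:

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical dataset whose __getitem__ simulates slow I/O with a sleep
class SlowDataset(Dataset):
    def __init__(self, n: int = 32):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, idx: int):
        time.sleep(0.005)  # stand-in for a disk read plus a transform
        return torch.randn(8), torch.tensor(idx % 2)

def measure_loader_wait(loader: DataLoader, compute_time: float = 0.002):
    """Return (seconds spent waiting on data, seconds spent 'computing')
    for one pass over the loader."""
    data_wait = compute = 0.0
    t0 = time.perf_counter()
    for samples, labels in loader:
        t1 = time.perf_counter()
        data_wait += t1 - t0      # time blocked waiting for the next batch
        time.sleep(compute_time)  # stand-in for forward/backward on GPU
        t0 = time.perf_counter()
        compute += t0 - t1
    return data_wait, compute

loader = DataLoader(SlowDataset(), batch_size=8, num_workers=0)
wait, comp = measure_loader_wait(loader)
print(f"data wait {wait:.3f}s vs compute {comp:.3f}s "
      f"({100 * wait / (wait + comp):.0f}% of the loop spent waiting)")
```

With num_workers=0 the wait fraction dominates; rerunning the same measurement with num_workers > 0 (inside an `if __name__ == "__main__":` guard) shows the wait shrinking as workers pre-fetch.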
Mental Model
The Producer-Consumer Pattern
The DataLoader is a producer that prepares batches on CPU while the training loop is a consumer that processes them on GPU — the overlap between these two is where all the performance comes from.
  • Dataset defines how to access ONE sample — it knows nothing about batching or parallelism and should not
  • DataLoader wraps the Dataset and adds batching, shuffling, and multi-process loading on top
  • Worker processes (producers) load and transform data in parallel on CPU cores while the GPU works
  • The training loop (consumer) pulls pre-fetched batches from a queue — ideally it never waits
  • pin_memory=True pre-pins batches to page-locked memory so DMA transfers to GPU start without an extra copy step
📊 Production Insight
num_workers=0 is the default but it serialises data loading on the main thread — GPU sits idle between every batch.
With num_workers=4 and pin_memory=True on a typical image dataset, GPU utilisation moves from 40–50% to above 90%.
In 2026 with A100 and H100 pricing, that difference in utilisation is real money on every training run.
Rule: always set num_workers >= 1 for GPU training, pin_memory=True for CUDA, and persistent_workers=True for multi-epoch runs.
🎯 Key Takeaway
Dataset defines how to access one sample; DataLoader adds batching, shuffling, and parallelism on top. The separation is deliberate — it lets you swap data sources without touching the pipeline, and tune parallelism without touching the data logic. Always set num_workers >= 1 and pin_memory=True for GPU training; leaving both at their defaults is the most common reason training is slower than it should be.
DataLoader Configuration Decision
If: Small dataset fits entirely in RAM and is already a tensor
Use: torch.utils.data.TensorDataset — no custom class needed, zero boilerplate, and it is just as fast
If: Data is on disk (images, audio files, parquet shards) and needs lazy loading
Use: Implement a custom Dataset with __getitem__ loading one sample at a time from disk — never pre-load in __init__
If: Data is in a database or streaming source with no natural index
Use: An IterableDataset — it yields samples sequentially without needing __len__ or random access
If: Training on GPU with any custom Dataset
Use: Set num_workers=4, pin_memory=True, persistent_workers=True, and use non_blocking=True in .to(device) inside the training loop
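The IterableDataset branch above can be sketched as follows — stream_records and StreamDataset are hypothetical stand-ins for a real database cursor or Kafka consumer:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

# Hypothetical streaming source: stands in for a DB cursor or Kafka consumer
# that yields records sequentially, with no random access and no known length
def stream_records(n: int = 10):
    for i in range(n):
        yield {"features": [float(i), float(i) * 2], "label": i % 2}

class StreamDataset(IterableDataset):
    def __init__(self, source_factory):
        # Store the factory, not an open connection — a plain function pickles
        # cleanly when the DataLoader forks worker processes
        self.source_factory = source_factory

    def __iter__(self):
        for record in self.source_factory():
            yield (torch.tensor(record["features"], dtype=torch.float32),
                   torch.tensor(record["label"], dtype=torch.long))

loader = DataLoader(StreamDataset(stream_records), batch_size=4)
batches = list(loader)
print(len(batches), batches[0][0].shape)  # → 3 torch.Size([4, 2])
```

One caveat worth knowing: with num_workers > 0, every worker runs the full iterator, so samples are duplicated unless you shard the stream per worker using torch.utils.data.get_worker_info inside __iter__.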

Enterprise Integration: SQL-Backed Datasets

In real production environments, your training data rarely lives in a flat folder of files. It lives in a database — with labels, metadata, train/val/test splits, and versioning all managed in SQL. Implementing a Dataset that queries a SQL backend is one of the more underrated patterns in production ML engineering.

The approach: in __init__, run a single SQL query to fetch metadata only — sample IDs, file paths on disk or object storage, and labels. Store that metadata in memory as a list or DataFrame. In __getitem__, use the file path from metadata to load the actual binary data — the image, audio file, or feature array — from disk or S3. This keeps memory usage proportional to the number of samples (a few bytes per row of metadata), not the size of the data (potentially gigabytes).

The production benefit that makes this pattern worth the setup: when you add new training data, you insert a row into the SQL table and drop the corresponding file on disk. The next training run picks it up automatically via the __init__ query. There is no CSV file to regenerate, no manifest to sync, and no risk of the file list drifting from the actual filesystem state. I have seen teams spend days debugging training regressions that turned out to be a stale CSV pointing to deleted files — this pattern eliminates that entire class of issue.

One thing to watch: do not query SQL inside __getitem__. SQL connections are not thread-safe and cannot be pickled for multi-process workers. Fetch all metadata once in __init__ and do all disk or object-storage I/O in __getitem__.

io/thecodeforge/db/fetch_samples.sql · SQL
-- Fetch sample metadata for Dataset __init__
-- We fetch IDs, paths, and labels here — NOT the binary data
-- Binary data is loaded lazily in __getitem__ using the file_path
-- This query runs once at the start of training, not per batch

SELECT
    sample_id,
    file_path,      -- path to the file on NVMe SSD or S3
    label_id,
    split_tag       -- 'train', 'val', or 'test'
FROM io.thecodeforge.training_data
WHERE project_tag = 'vision_v2'
  AND split_tag   = 'train'
  AND is_verified = TRUE   -- exclude samples flagged as corrupted during QA
ORDER BY sample_id ASC;

-- Expected result: one lightweight metadata row per sample
-- Actual image binary data stays on disk until __getitem__ loads it
▶ Output
Returns metadata rows — sample_id, file_path, label_id for each training sample.
Binary data is not transferred; only enough information to load it on demand.
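A minimal Python sketch of the same pattern, using an in-memory sqlite3 database and a dict standing in for the filesystem. The schema is a simplified version of the SQL query above, and all names (SqlBackedDataset, FAKE_DISK) are illustrative:

```python
import sqlite3
import torch
from torch.utils.data import Dataset

# Stand-in for the NVMe disk or object store: maps file_path -> raw features.
# In a real pipeline, __getitem__ would read the file at this path instead.
FAKE_DISK = {f"/data/sample_{i}.pt": [float(i), float(i) + 1.0] for i in range(4)}

class SqlBackedDataset(Dataset):
    def __init__(self, db_path: str = ":memory:"):
        # One metadata query at init — IDs, paths, and labels only, never blobs.
        # The connection is closed before training starts, so no unpicklable
        # object is stored on the instance for worker processes to trip over.
        conn = sqlite3.connect(db_path)
        conn.executescript(
            "CREATE TABLE IF NOT EXISTS training_data "
            "(sample_id INTEGER, file_path TEXT, label_id INTEGER, split_tag TEXT);"
        )
        conn.executemany(  # seed demo rows; a real table is populated elsewhere
            "INSERT INTO training_data VALUES (?, ?, ?, ?)",
            [(i, f"/data/sample_{i}.pt", i % 2, "train") for i in range(4)],
        )
        self.rows = conn.execute(
            "SELECT file_path, label_id FROM training_data "
            "WHERE split_tag = 'train' ORDER BY sample_id"
        ).fetchall()
        conn.close()

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        file_path, label_id = self.rows[idx]
        features = FAKE_DISK[file_path]  # real code: load the file lazily here
        return (torch.tensor(features, dtype=torch.float32),
                torch.tensor(label_id, dtype=torch.long))

ds = SqlBackedDataset()
print(len(ds), ds[0])
```

Memory stays proportional to the metadata (one tuple per sample), and the binary payload is only touched when a worker asks for that index.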
🔥 Store Paths in SQL, Not Blobs
Only store file paths or object-storage keys in your SQL table. Loading actual binary blobs from SQL during __getitem__ creates a per-sample database round-trip under multi-process load — it will saturate your database connection pool and become the bottleneck faster than you expect. Keep binary data on a fast NVMe SSD or a distributed object store like S3 or GCS, and use SQL only for the lightweight metadata that tells your Dataset where to find it.
📊 Production Insight
Store file paths in SQL, not binary blobs — per-sample SQL reads under multi-process load will saturate your DB connection pool.
Load binary data from disk or object storage in __getitem__, not __init__ — lazy loading keeps memory proportional to batch size, not dataset size.
SQL connections cannot be pickled for worker processes — open them inside __getitem__ or use worker_init_fn, never store them as instance variables.
Rule: SQL for metadata and versioning, disk or S3 for binary data, Dataset for lazy access, DataLoader for batching and parallelism.
🎯 Key Takeaway
SQL stores metadata (paths, labels, splits) — disk or object storage holds binary data — Dataset loads lazily one sample at a time. This pattern scales to tens of millions of samples without loading anything into RAM upfront. New data is picked up automatically on the next training run by rerunning the __init__ query — no CSV drift, no stale manifests.

Containerised Data Pipelines with Docker

Wrapping your training environment in Docker is the standard way to ensure the data pipeline behaves identically across a developer's laptop, a CI server, and a production GPU cluster. It also surfaces the most common PyTorch DataLoader configuration mistake before it costs you a four-hour training run.

The critical Docker configuration that almost everyone gets wrong the first time: when num_workers > 0, PyTorch uses shared memory at /dev/shm to transfer tensors between worker processes and the main process. Docker's default shared memory allocation is 64MB — a sensible default for containerised web services that never heard of PyTorch. For a training job with num_workers=4 and any real batch size, that 64MB fills up within a few epochs and the container dies with a Bus error and no Python traceback. The fix is one flag: --shm-size=2g.

The deployment checklist I use for every new training container: set --shm-size=2g or larger; mount the data directory as a Docker volume rather than copying it into the image (datasets are too large for image layers and change too frequently); pin the PyTorch version explicitly rather than using pytorch/pytorch:latest (latest changes under you in ways that are hard to reproduce); set num_workers based on the CPU cores allocated to the container, not the host machine's total CPU count; and add a pre-flight health check that verifies /dev/shm has enough free space before the training job starts.

The --shm-size flag also belongs in your docker-compose.yml, your Kubernetes pod spec under resources, and your CI job definition — anywhere the container is launched. If it lives in only one place, it will be dropped in a refactor and you will spend an afternoon diagnosing a Bus error that you already fixed six months ago.

Dockerfile · DOCKERFILE
# Pin a specific PyTorch release — 'latest' changes under you
# and reproducing a training run six months later becomes impossible
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install dependencies before copying source code
# This layer is cached as long as requirements.txt does not change
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# IMPORTANT: This container MUST be started with --shm-size=2g
# PyTorch DataLoader uses /dev/shm to transfer tensors between worker processes
# Docker's default 64MB causes Bus error crashes when num_workers > 0
# Example: docker run --shm-size=2g --gpus all -v /data:/data thecodeforge/training:latest

# Health check: verify shared memory stays sufficient while the container runs
# Docker flags the container unhealthy with a readable message, rather than
# letting training die with a cryptic Bus error mid-epoch
HEALTHCHECK --interval=10s --timeout=5s --retries=1 \
  CMD python -c "import shutil; free = shutil.disk_usage('/dev/shm').free; assert free > 1e9, f'Insufficient /dev/shm: {free/1e6:.0f}MB free, need 1000MB+'"

CMD ["python", "ForgeDataset.py"]
▶ Output
Successfully built image thecodeforge/data-pipeline:latest
Healthcheck configured — the container is flagged unhealthy when /dev/shm has less than 1GB free
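The pre-flight check the checklist calls for can also live in the training entry point itself, so the job exits immediately with a readable message. A sketch — check_shared_memory is a hypothetical name, and the demo runs against the system temp directory so it also works outside Linux:

```python
import shutil
import sys
import tempfile

def check_shared_memory(path: str = "/dev/shm",
                        min_free_bytes: int = 1_000_000_000) -> int:
    """Return free bytes at `path`, exiting with a clear message if it is
    below the threshold. Run this before training starts so shared-memory
    exhaustion is a readable startup error, not a mid-epoch SIGBUS."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        sys.exit(
            f"Insufficient shared memory at {path}: {free / 1e6:.0f}MB free, "
            f"need {min_free_bytes / 1e6:.0f}MB+. "
            "Restart the container with --shm-size=2g or larger."
        )
    return free

if __name__ == "__main__":
    # Demo against the temp dir with a 1-byte threshold; in the training
    # entry point you would call check_shared_memory() with the defaults.
    free = check_shared_memory(tempfile.gettempdir(), min_free_bytes=1)
    print(f"shared memory check passed: {free / 1e6:.0f}MB free")
```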
⚠ The One Docker Flag That Prevents Most PyTorch Crashes
When using num_workers > 0 inside Docker, you must set --shm-size=2g on the docker run command. Docker's default 64MB shared memory is designed for web containers, not ML training. Without this flag, your training job will crash with a Bus error and no Python traceback after a few epochs — and it will look like a hardware or dataset problem rather than a configuration one. Add this flag to your docker run script, your docker-compose.yml, and your CI job definition. Treat it as a required argument, not an optional one.
📊 Production Insight
Docker's default 64MB shared memory causes Bus error crashes under multi-process DataLoader — this is not an edge case, it happens on every real training job.
The flag --shm-size=2g must live in every place the container is launched: docker run, compose file, Kubernetes pod spec, and CI job definition.
In Kubernetes, set the equivalent via a shm volume mount: emptyDir with medium: Memory.
Rule: add a pre-training /dev/shm health check so the failure is a clear error message at startup, not a cryptic crash after two hours of training.
🎯 Key Takeaway
Docker's default 64MB shared memory causes Bus errors with multi-process DataLoader — --shm-size=2g is not optional, it is a required flag for any real training job. Mount data as a volume rather than baking it into the image. In Kubernetes, use an emptyDir shm volume mount with medium: Memory as the equivalent of --shm-size.
Docker Configuration for Data Loading
If: num_workers=0 (single-process loading)
Use: Default 64MB shared memory is sufficient — no --shm-size flag needed, but you are leaving GPU utilisation on the table
If: num_workers > 0 (multi-process loading)
Use: Set --shm-size=2g minimum — increase to 4g+ with many workers or large batches, and verify with df -h /dev/shm
If: Large training dataset (>100GB)
Use: Mount data as a Docker volume, never COPY into the image — images have practical size limits and your dataset changes more often than your code
If: Multiple containers sharing a GPU or running on Kubernetes
Use: Set CUDA_VISIBLE_DEVICES explicitly and limit num_workers per container based on the container's CPU allocation, not the host's total core count
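The Kubernetes equivalent of --shm-size mentioned above can be sketched as a pod spec fragment. The image name and labels are placeholders; emptyDir with medium: Memory mounts a tmpfs over /dev/shm, and sizeLimit plays the role of --shm-size:

```yaml
# Illustrative pod spec fragment — names and image tag are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
    - name: trainer
      image: thecodeforge/training:2.3.0   # pin the tag, never :latest
      volumeMounts:
        - name: shm
          mountPath: /dev/shm              # DataLoader workers write tensors here
  volumes:
    - name: shm
      emptyDir:
        medium: Memory                     # tmpfs-backed, replaces the 64MB default
        sizeLimit: 2Gi                     # equivalent of --shm-size=2g
```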

Common Mistakes and How to Avoid Them

Most DataLoader bugs in production fall into a small set of patterns. Knowing them in advance means you spend time training models instead of debugging pipelines.

The performance mistakes: num_workers=0 is the biggest one — it serialises every sample load on the main thread and GPU sits idle while it happens. Loading data in __init__ instead of __getitem__ is the second — it turns a lazy-loading Dataset into a greedy RAM consumer that OOMs before training even starts.

The correctness mistakes: passing unpickleable objects (open file handles, database connections, lambda functions) to a Dataset when num_workers > 0. Python's multiprocessing pickles the Dataset to send it to each worker process. If any attribute cannot be pickled, the worker hangs silently or crashes without a useful traceback. The fix is to initialise those objects inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker process.

The subtle one that costs teams debugging time: forgetting drop_last=True on the training DataLoader. The last batch of an epoch almost always has fewer samples than the configured batch_size. For most loss functions this is harmless, but for BatchNorm it is not — BatchNorm uses batch statistics during training, and a batch of size 1 produces undefined variance. Setting drop_last=True discards the last incomplete batch and ensures consistent batch sizes throughout training. For validation DataLoaders, use drop_last=False — you want to evaluate on every sample, no exceptions.

io/thecodeforge/ml/efficient_loading.py · PYTHON
import torch
from torch.utils.data import DataLoader

# Production-grade DataLoader configuration for GPU training
# Each parameter here solves a specific real problem
loader = DataLoader(
    forge_ds,
    batch_size=32,
    shuffle=True,              # Reshuffle every epoch for better generalisation
    num_workers=4,             # 4 parallel CPU processes — eliminates GPU starvation
    pin_memory=True,           # Pre-pin batches in page-locked memory for faster DMA transfer
    drop_last=True,            # Drop incomplete final batch — prevents BatchNorm issues
    persistent_workers=True,   # Keep workers alive between epochs — avoids 5-10s restart cost
    prefetch_factor=2,         # Each worker pre-fetches 2 batches ahead — reduces wait time
)

# Validation DataLoader — different settings for a reason
val_loader = DataLoader(
    val_ds,
    batch_size=64,             # Larger batch is fine for inference — no gradient storage
    shuffle=False,             # Do NOT shuffle validation — reproducible evaluation
    num_workers=4,
    pin_memory=True,
    drop_last=False,           # Evaluate on EVERY sample — no exceptions
    persistent_workers=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for samples, labels in loader:
    # non_blocking=True overlaps the CPU-to-GPU transfer with other CPU work
    # Only effective when pin_memory=True is also set
    samples = samples.to(device, non_blocking=True)
    labels  = labels.to(device, non_blocking=True)
    # ... training logic ...

# Random seed consistency across workers
# Without this, all workers inherit the same seed and produce identical augmentations
def worker_init_fn(worker_id: int) -> None:
    import numpy as np
    base_seed = torch.initial_seed() % (2 ** 32)
    np.random.seed(base_seed + worker_id)
    torch.manual_seed(base_seed + worker_id)

# Apply it to any DataLoader that uses random augmentations in __getitem__
auged_loader = DataLoader(
    forge_ds,
    batch_size=32,
    num_workers=4,
    worker_init_fn=worker_init_fn,
    pin_memory=True,
)
▶ Output
// High-throughput data pipeline established.
// GPU will receive pre-fetched, pre-pinned batches with consistent random augmentation across workers.
⚠ When Simple Is Better
The most over-engineered DataLoader mistake is building a custom Dataset for data that already lives in memory as tensors. If your entire dataset fits in 10% of available RAM and is already in tensor format, torch.utils.data.TensorDataset(features, labels) gives you __len__ and __getitem__ for free with zero boilerplate. Only write a custom Dataset when you genuinely need lazy loading from disk, a database, or object storage. Start simple and add complexity only when you have a measurable reason to.
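As a concrete contrast with the custom Dataset pattern, the TensorDataset path needs no class at all — a minimal sketch with synthetic data:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# When data already lives in memory as tensors, TensorDataset supplies
# __len__ and __getitem__ for free — no custom class, no boilerplate
features = torch.randn(100, 8)          # 100 samples, 8 features each
labels   = torch.randint(0, 2, (100,))  # binary labels

ds = TensorDataset(features, labels)
loader = DataLoader(ds, batch_size=32, shuffle=True, drop_last=True)

for samples, targets in loader:
    print(samples.shape, targets.shape)  # → torch.Size([32, 8]) torch.Size([32])
    break
```

Because everything is already in RAM, num_workers=0 is the right choice here — spawning workers would only add overhead.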
📊 Production Insight
num_workers=0 is the default and the most common performance mistake — always override it for GPU training.
Loading data in __init__ instead of __getitem__ converts lazy loading into a greedy RAM consumer — OOMs before the first epoch.
drop_last=True on training DataLoaders prevents BatchNorm from receiving a batch of size 1 at the end of an epoch — this is not optional when BatchNorm is in your model.
Rule: load one sample in __getitem__, set num_workers=4, pin_memory=True, drop_last=True for training, and worker_init_fn for any job that uses random augmentations.
🎯 Key Takeaway
num_workers=0 serialises data loading and is the single most common reason GPU utilisation is below 50%. Load data in __getitem__ not __init__. Set drop_last=True on training DataLoaders when using BatchNorm. Use worker_init_fn to ensure random augmentations differ across workers — without it, you are training on fewer unique augmentations than you think.
Debugging Slow or Broken Data Loading
If: GPU utilisation below 50% during training
Use: Increase num_workers to 4 and add pin_memory=True — data loading is almost certainly the bottleneck
If: RAM usage grows steadily during training
Use: Check if __init__ pre-loads data — move all data loading into __getitem__ so only one batch lives in memory at a time
If: Training crashes with unpickleable object error or silent worker hang
Use: Move file handles and DB connections out of __init__ and into __getitem__ or worker_init_fn
If: BatchNorm behaves erratically on the last batch of each epoch
Use: Set drop_last=True on the training DataLoader to ensure every batch has the same size

Custom collate_fn: Handling Variable-Length Data

The default collate_fn expects every sample in a batch to have the same shape so it can stack them into a uniform tensor. This assumption breaks the moment you work with NLP sequences of different lengths, graphs with different numbers of nodes, or images that have not been resized to a fixed resolution.

A custom collate_fn lets you define exactly how a list of heterogeneous samples becomes a batch. The most common pattern — one you will write or encounter in almost every NLP project — is pad-and-mask: pad all sequences to the length of the longest sequence in the batch, and return a binary mask tensor that tells downstream layers which positions are real data and which are padding. Attention layers, loss functions, and pooling operations all need this mask to avoid treating padding as signal.

The production subtlety that trips people up: collate_fn runs on the main thread, not inside the worker processes. This means even with num_workers=4, a slow collate_fn becomes the bottleneck for the entire pipeline. Keep it to reshaping and padding only. If you find yourself sorting sequences, computing complex statistics, or doing any non-trivial transformation in collate_fn, move that work into __getitem__ where it can run in parallel across workers.

For NLP work in 2026, most teams use Hugging Face's DataCollatorWithPadding which implements this pattern with tokeniser-aware padding. But understanding the underlying collate_fn contract means you can customise it when the standard collators do not fit your data structure.

io/thecodeforge/ml/custom_collate.py · PYTHON
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from typing import List, Tuple


def collate_variable_length(
    batch: List[Tuple[torch.Tensor, torch.Tensor]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Pads variable-length sequences to the longest in the batch.
    Returns:
        padded   — (batch_size, max_len, feature_dim) padded sequences
        labels   — (batch_size,) classification labels
        mask     — (batch_size, max_len) float mask: 1.0 real, 0.0 padding

    Runs on the main thread — keep this function fast.
    Heavy transforms belong in __getitem__, not here.
    """
    sequences, labels = zip(*batch)

    # pad_sequence pads to the longest sequence in this batch
    # batch_first=True gives shape (batch, seq_len, features)
    padded = pad_sequence(sequences, batch_first=True, padding_value=0.0)

    # Build the mask: 1.0 where real data, 0.0 where padding
    # Downstream attention layers and loss functions use this to ignore padding
    lengths = torch.tensor([len(s) for s in sequences], dtype=torch.long)
    mask    = (torch.arange(padded.shape[1]).unsqueeze(0) < lengths.unsqueeze(1)).to(torch.float32)

    labels_stacked = torch.stack(labels)
    return padded, labels_stacked, mask


# Toy variable-length dataset
class VariableLengthDataset(Dataset):
    def __init__(self, num_samples: int = 200, feature_dim: int = 128):
        self.num_samples = num_samples
        self.feature_dim = feature_dim

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int):
        # Sequence length varies per sample — this is what breaks the default collate_fn
        seq_len  = torch.randint(10, 60, (1,)).item()
        features = torch.randn(seq_len, self.feature_dim)
        label    = torch.tensor(idx % 2, dtype=torch.long)  # binary label
        return features, label


variable_length_dataset = VariableLengthDataset(num_samples=200, feature_dim=128)

loader = DataLoader(
    variable_length_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=collate_variable_length,  # custom collation for variable-length sequences
    num_workers=4,
    pin_memory=True,
)

for batch_idx, (padded_seqs, labels, mask) in enumerate(loader):
    # padded_seqs: (32, max_len_in_batch, 128)
    # mask:        (32, max_len_in_batch) — 1.0 for real tokens, 0.0 for padding
    real_token_count = mask.sum().item()
    total_positions  = mask.numel()
    padding_pct      = 100 * (1 - real_token_count / total_positions)
    print(f"Batch {batch_idx}: shape {padded_seqs.shape} | "
          f"padding {padding_pct:.1f}% | labels {labels[:4]}")
    if batch_idx >= 2:
        break
▶ Output
Batch 0: shape torch.Size([32, 59, 128]) | padding 37.2% | labels tensor([1, 0, 1, 0])
Batch 1: shape torch.Size([32, 58, 128]) | padding 36.8% | labels tensor([0, 1, 0, 1])
Batch 2: shape torch.Size([32, 57, 128]) | padding 35.1% | labels tensor([1, 0, 1, 0])
💡 collate_fn Rules of Thumb
  • Use pad_sequence from torch.nn.utils.rnn — it handles batch-first padding in one call and is well-tested
  • Always return a mask alongside padded data — every downstream layer that touches sequences needs to know where padding starts
  • collate_fn runs on the main thread — if profiling shows it as the bottleneck, move the heavy work into __getitem__ where workers can parallelise it
  • Consider bucket sampling (grouping sequences by similar length before batching) to reduce padding waste — padding above 40% per batch is worth addressing
  • For images of different sizes, resize in __getitem__ not in collate_fn — resizing is CPU-intensive and belongs in the parallel workers
📊 Production Insight
collate_fn runs on the main thread regardless of num_workers — complex logic here blocks the entire pipeline even with parallel workers running.
Padding above 40% per batch is wasted compute at training time and is worth fixing with bucket sampling or sorting by length before batching.
In 2026 most NLP teams use Hugging Face DataCollatorWithPadding, but understanding the raw collate_fn contract lets you customise it when standard collators do not fit your data structure.
Rule: keep collate_fn to reshaping and padding only — move transforms to __getitem__.
🎯 Key Takeaway
Custom collate_fn handles variable-length samples — pad to the longest sequence in the batch and always return a mask so downstream layers can ignore padding positions. collate_fn runs on the main thread, not in workers — keep it fast or it becomes the bottleneck that num_workers cannot fix. Track padding percentage per batch and consider bucket sampling if it consistently exceeds 40%.
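The padding-waste fix mentioned above — bucket sampling — can be sketched as a custom batch sampler. BucketBatchSampler is a hypothetical name, and the sketch assumes each sample's length can be precomputed cheaply (e.g. from file metadata) before training starts.

```python
import torch
from torch.utils.data import Sampler


class BucketBatchSampler(Sampler):
    """Yields batches of indices with similar sequence lengths to cut padding waste.
    Hypothetical sketch — assumes lengths[i] is the length of sample i."""

    def __init__(self, lengths, batch_size, bucket_batches=50, shuffle=True):
        self.lengths = lengths
        self.batch_size = batch_size
        self.chunk = batch_size * bucket_batches  # samples per coarse bucket
        self.shuffle = shuffle

    def __iter__(self):
        n = len(self.lengths)
        order = torch.randperm(n) if self.shuffle else torch.arange(n)
        batches = []
        for i in range(0, n, self.chunk):
            # Sort only within a coarse chunk: each batch gets similar lengths,
            # while chunk boundaries preserve some global randomness
            pool = sorted(order[i:i + self.chunk].tolist(), key=lambda j: self.lengths[j])
            batches.extend(pool[j:j + self.batch_size]
                           for j in range(0, len(pool), self.batch_size))
        if self.shuffle:
            batches = [batches[k] for k in torch.randperm(len(batches)).tolist()]
        return iter(batches)

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

A batch sampler like this is passed as `DataLoader(dataset, batch_sampler=BucketBatchSampler(lengths, 32), collate_fn=...)` — note that batch_sampler is mutually exclusive with batch_size, shuffle, and drop_last, since it takes over all of those responsibilities.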

IterableDataset: Streaming Data Without Random Access

Map-style datasets require __len__ and __getitem__ — random access to any sample by index. This breaks when data arrives as a stream (Kafka, network logs, real-time sensor feeds) or when the dataset is genuinely too large to index. IterableDataset solves this by yielding samples sequentially without needing to know the total size.

The use case that justifies reaching for IterableDataset: training on a live event stream where the concept of 'total dataset size' does not exist, or a dataset so large that generating a complete index would take longer than training itself. The DataLoader pulls samples from the __iter__ method, batches them as they arrive, and can offer only limited shuffling within a buffer of recent samples.

The production trade-off that you need to understand before choosing IterableDataset: it cannot shuffle globally because it never knows the full dataset. It can shuffle within a configurable buffer of recent samples, but the model always sees data in approximately the order it arrives in the stream. If the stream has any temporal structure — and real data almost always does — the model will see a biased distribution. For training data that can be indexed, Map-style datasets with global shuffling are strictly better. Use IterableDataset only when indexing is genuinely impossible.
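The buffer-based local shuffling described above fits in a few lines. buffer_shuffle is a hypothetical generator that holds buffer_size samples and yields a random one as each new sample arrives — the same idea behind the shuffle buffers in streaming data libraries.

```python
import random


def buffer_shuffle(stream, buffer_size=1000, seed=None):
    """Approximate shuffling for a stream that cannot be indexed.
    Each yielded sample is drawn at random from a sliding buffer, so a sample
    can only move a limited distance from its arrival position — this is why
    buffer shuffling cannot remove temporal structure from an ordered stream."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end, then yield it
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)   # drain the remainder in random order
    yield from buf


# An ordered stream comes out locally — not globally — shuffled
out = list(buffer_shuffle(range(20), buffer_size=5, seed=0))
print(out)
```

Note the asymmetry: a sample arriving at position p can never be yielded before roughly position p - buffer_size, so early data always appears early. That is exactly the distribution bias the paragraph above warns about.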

One non-obvious operational issue with num_workers > 0 and IterableDataset: each worker receives its own copy of the __iter__ method and will iterate the entire stream independently. Without sharding the stream across workers, every sample gets loaded num_workers times. You need to detect the worker ID inside __iter__ using torch.utils.data.get_worker_info() and partition the stream so each worker handles a distinct subset.

io/thecodeforge/ml/iterable_dataset.py · PYTHON
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info
from typing import Iterator, Tuple


class StreamingSensorDataset(IterableDataset):
    """
    Yields samples from a simulated sensor stream.
    In production, replace __iter__ with a Kafka consumer,
    network socket reader, or database cursor.

    IMPORTANT: with num_workers > 0, each worker calls __iter__ independently.
    Without sharding, every sample is loaded num_workers times.
    The __iter__ below handles worker partitioning automatically.
    """

    def __init__(self, num_samples: int = 10000):
        self.num_samples = num_samples

    def __iter__(self) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
        worker_info = get_worker_info()

        if worker_info is None:
            # Single-process loading — yield the full stream
            start, end = 0, self.num_samples
        else:
            # Multi-process loading — partition stream across workers
            # Each worker gets a non-overlapping slice of the sample range
            per_worker = self.num_samples // worker_info.num_workers
            worker_id  = worker_info.id
            start      = worker_id * per_worker
            end        = start + per_worker if worker_id < worker_info.num_workers - 1 else self.num_samples

        for i in range(start, end):
            # Simulate streaming sensor data — replace with real I/O in production
            features = torch.randn(10)
            label    = torch.tensor(float(features[0] > 0), dtype=torch.float32)
            yield features, label


stream_ds     = StreamingSensorDataset(num_samples=10000)
stream_loader = DataLoader(
    stream_ds,
    batch_size=64,
    num_workers=2,  # Each worker now handles a non-overlapping shard of the stream
)

for batch_idx, (features, labels) in enumerate(stream_loader):
    if batch_idx >= 5:
        break
    print(f"Batch {batch_idx}: features {features.shape}, "
          f"label distribution: {labels.sum().item():.0f}/{len(labels)} positive")
▶ Output
Batch 0: features torch.Size([64, 10]), label distribution: 34/64 positive
Batch 1: features torch.Size([64, 10]), label distribution: 31/64 positive
Batch 2: features torch.Size([64, 10]), label distribution: 33/64 positive
Batch 3: features torch.Size([64, 10]), label distribution: 30/64 positive
Batch 4: features torch.Size([64, 10]), label distribution: 32/64 positive
🔥 Map-style vs Iterable Dataset — When to Use Each
Use Map-style Dataset when you have random access to all samples and can determine the dataset size — this covers the vast majority of production ML use cases. Use IterableDataset when data is a live stream (Kafka, sockets, real-time sensors) where indexing is genuinely impossible or when the dataset is so large that building a complete index is impractical. If you choose IterableDataset with num_workers > 0, you must implement worker sharding using get_worker_info() inside __iter__ — otherwise each sample is loaded num_workers times.
📊 Production Insight
IterableDataset with num_workers > 0 loads every sample num_workers times unless you shard the stream per worker using get_worker_info() — this doubles or quadruples I/O load silently.
Global shuffling is impossible with IterableDataset — only buffer-based local shuffling is available, which leaves the model exposed to distribution bias in ordered streams.
In 2026, streaming ML pipelines more often use specialised libraries (Mosaic Streaming, WebDataset) rather than raw IterableDataset for very large scale — but understanding IterableDataset is the prerequisite.
Rule: use Map-style datasets whenever data can be indexed. Reserve IterableDataset for genuinely streaming or unindexable data sources.
🎯 Key Takeaway
IterableDataset yields samples sequentially — no __len__, no __getitem__, no random access. Shuffling is limited to a buffer of recent samples, which means the model sees an approximately ordered distribution rather than a globally shuffled one. With num_workers > 0, implement get_worker_info() sharding inside __iter__ or every sample gets loaded num_workers times. Use Map-style datasets whenever the data can be indexed; reach for IterableDataset only when it genuinely cannot.
🗂 Data Loading Approaches Compared
Choosing the right data loading strategy for your use case
| Feature | Standard Python List / Loop | PyTorch DataLoader / Dataset |
| --- | --- | --- |
| Memory Usage | High — entire dataset loaded into RAM before training starts | Low — lazy loading per sample in __getitem__, only one batch in memory at a time |
| Concurrency | Single-threaded — data loading blocks the main thread and therefore the GPU | Multi-process via num_workers — true parallelism that bypasses Python's GIL |
| Batching | Manual list slicing — you write the indexing logic and handle edge cases like the last batch | Automatic via batch_size — DataLoader handles indexing, collation, and drop_last |
| Data Shuffling | Manual random.shuffle() — must be called every epoch and operates on the full list in memory | Built-in per-epoch shuffling via shuffle=True — operates on indices, not the data itself |
| GPU Integration | Manual .to(device) on every tensor — no transfer optimisation | Optimised via pin_memory=True and non_blocking=True — DMA transfer without an extra copy step |

🎯 Key Takeaways

  • Dataset and DataLoader have deliberately separate responsibilities — Dataset knows how to access one sample, DataLoader knows how to batch, shuffle, and parallelise. Understanding this separation makes every configuration decision obvious.
  • num_workers=0 is the default and the most common performance mistake — it serialises data loading on the main thread and leaves the GPU idle between batches. Always override it for GPU training.
  • In Docker, --shm-size=2g is a required flag when num_workers > 0, not an optional optimisation. The default 64MB causes Bus error crashes and leaves no Python traceback to diagnose from.
  • Load only metadata in __init__ and actual data in __getitem__ — this is what makes lazy loading work. Loading data in __init__ converts a scalable pipeline into an OOM crash before training starts.
  • collate_fn runs on the main thread regardless of num_workers — keep it to padding and reshaping only, and move transforms into __getitem__ where workers can parallelise them.
  • IterableDataset cannot shuffle globally and requires per-worker stream sharding via get_worker_info() when num_workers > 0. Use Map-style datasets whenever indexing is possible.

⚠ Common Mistakes to Avoid

    Using a custom Dataset when TensorDataset suffices
    Symptom

    Unnecessary boilerplate that adds maintenance burden. The custom class has __len__ and __getitem__ that do nothing more than index into an in-memory tensor — exactly what TensorDataset already does.

    Fix

    If your data fits in RAM and is already a tensor, use torch.utils.data.TensorDataset(features, labels) directly. It provides __len__ and __getitem__ for free. Only write a custom Dataset when you need lazy loading from disk, a database, or object storage — not for data that already lives in memory.

    Passing unpickleable objects to Dataset when num_workers > 0
    Symptom

    DataLoader hangs silently for 30–60 seconds and then crashes, or the worker process exits with no useful traceback. Open file handles, database connections, and lambda functions are the most common culprits.

    Fix

    Initialise file handles and DB connections inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker process. Never store open handles as instance variables when num_workers > 0. Test pickleability before a long training run: import pickle; pickle.dumps(dataset) should not raise.

    Ignoring error handling in __getitem__ for corrupted files
    Symptom

    Training crashes mid-epoch with FileNotFoundError, PIL.UnidentifiedImageError, or a silent tensor of zeros where a valid sample should be. The entire epoch is lost and the crash is non-deterministic — it only happens when that specific corrupted sample is drawn.

    Fix

    Add try/except in __getitem__. On exception, log the bad file path and return a placeholder sample or the nearest valid neighbour. Never silently return zeros without logging — you will not know how much of your dataset is corrupted.
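A minimal sketch of that pattern — RobustDataset and its raw-bytes loader are placeholders; in a real image pipeline the loader would be PIL or torchvision decoding, which is where UnidentifiedImageError would surface.

```python
import logging
import torch
from torch.utils.data import Dataset

logger = logging.getLogger("dataset")


class RobustDataset(Dataset):
    """Corruption-tolerant loading: log the bad path, substitute a neighbour."""

    def __init__(self, paths, labels):
        self.paths, self.labels = paths, labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        try:
            sample = self._load(self.paths[idx])
        except Exception as exc:
            # Log loudly, then fall back to the next sample rather than
            # crashing mid-epoch. (If every sample is corrupt this recurses
            # until it fails — acceptable, since that dataset is unusable.)
            logger.warning("Corrupt sample %s (%s) — substituting neighbour",
                           self.paths[idx], exc)
            return self[(idx + 1) % len(self)]
        return sample, self.labels[idx]

    def _load(self, path):
        # Placeholder loader — swap in real image decoding here
        with open(path, "rb") as f:
            data = f.read()
        if not data:
            raise ValueError("empty file")
        return torch.tensor(list(data[:16]), dtype=torch.float32)
```

The key property: every substitution is logged, so after a training run you can grep the log and quantify exactly how much of the dataset is bad.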

    Setting num_workers=0 for GPU training
    Symptom

    GPU utilisation stays below 50% throughout training. The GPU is idle for roughly as long as it is computing — data loading on the main thread blocks the entire pipeline. Training takes 3–5x longer than it should.

    Fix

    Set num_workers=4 as a starting point for single-GPU training. Add pin_memory=True for CUDA. Add persistent_workers=True to avoid the 5–10 second worker restart overhead at the start of every epoch. Profile with nvidia-smi -l 1 to confirm GPU utilisation exceeds 85% after the change.

    Loading data in __init__ instead of __getitem__
    Symptom

    Dataset initialisation takes minutes rather than milliseconds. RAM usage climbs to system limits before the first training batch is processed. OOM crashes occur before training even starts.

    Fix

    Store only metadata (file paths, labels, IDs) in __init__ — this should take milliseconds regardless of dataset size. Load actual data lazily in __getitem__, one sample at a time. If the dataset truly fits in RAM and pre-loading is intentional, use TensorDataset instead of a custom class.

Interview Questions on This Topic

  • QHow does the DataLoader utilise Python's multi-processing to bypass the Global Interpreter Lock (GIL)?Mid-levelReveal
    Python's GIL prevents true parallel execution of Python bytecode within a single process. The DataLoader bypasses this by spawning multiple separate processes — not threads — via Python's multiprocessing module. Each worker process loads and transforms data independently in its own Python interpreter with its own GIL, so they run truly in parallel without contending with each other or the main process. Workers write tensors to shared memory at /dev/shm, and the main process reads pre-fetched batches from that shared memory. This achieves genuine parallelism for both I/O-bound operations (reading files from disk) and CPU-bound operations (image decoding, augmentation). The shared memory approach is also why Docker's /dev/shm size matters — it is the actual transfer medium, not a socket or pipe.
  • QExplain the Producer-Consumer pattern as it applies to the relationship between a CPU DataLoader and a GPU training loop.Mid-levelReveal
    The DataLoader workers are producers — they load, transform, and batch data on CPU in parallel, writing completed batches to a shared memory queue. The training loop is a consumer — it pulls the next batch from the queue and sends it to GPU for forward and backward passes. The performance insight is pipeline parallelism: while the GPU is processing batch N, workers are already preparing batch N+1 and N+2. This overlap hides data loading latency behind GPU compute time. The prefetch_factor parameter controls how many batches ahead workers prepare. If the queue is empty when the training loop requests a batch, the GPU stalls — this is data starvation, and it shows up as GPU utilisation below 80% in nvidia-smi. Increasing num_workers or reducing the cost of __getitem__ resolves it.
  • QWhat is the difference between a Map-style and an Iterable-style Dataset? When would you strictly choose the latter?SeniorReveal
    A Map-style dataset implements __len__ and __getitem__, providing O(1) random access to any sample by index. The DataLoader can globally shuffle all indices before each epoch, ensuring the model sees a fully randomised data distribution. An Iterable-style dataset implements only __iter__, yielding samples sequentially. It cannot provide __len__ or random access, and shuffling is limited to a buffer of recent samples rather than the full dataset. Choose Iterable-style only when: data arrives as a live stream (Kafka, network socket) where total size is unknown; the dataset is so large that building a complete index is impractical; or data genuinely cannot be accessed by index. The trade-off is real — limited shuffling means the model sees a biased distribution relative to a globally shuffled Map-style dataset. An additional operational issue: with num_workers > 0, each worker calls __iter__ independently and will iterate the full stream unless you implement per-worker sharding using get_worker_info().
  • QHow would you implement a custom collate_fn to handle a dataset where samples have variable sequence lengths?SeniorReveal
    The default collate_fn tries to stack samples into a uniform tensor, which raises an error when sequences have different lengths. A custom collate_fn receives a list of (sequence, label) tuples and must produce a batch tensor. The standard implementation pads all sequences to the length of the longest in the batch using pad_sequence from torch.nn.utils.rnn, which handles the padding efficiently in one call. It also generates a binary mask tensor — shape (batch_size, max_len), with 1.0 for real tokens and 0.0 for padding positions — that downstream attention layers, loss functions, and pooling operations use to ignore padding. The function signature is collate_fn(batch) returning (padded_tensor, labels, mask). One production consideration: collate_fn runs on the main thread, not inside worker processes, so any computation added here is serial and can become the pipeline bottleneck. Keep it to reshaping and padding; move heavier transforms into __getitem__ where they run in parallel.
  • QDescribe the purpose of pin_memory. How does it interact with pageable vs pinned host memory during DMA transfers to the GPU?SeniorReveal
    pin_memory=True allocates DataLoader output tensors in page-locked (pinned) host memory rather than the default pageable memory. The difference matters at transfer time: pageable memory can be swapped to disk by the OS, so before the GPU's DMA engine can transfer it, the driver must first copy it to a temporary pinned buffer — adding a full CPU-to-CPU copy before the CPU-to-GPU transfer begins. With pre-pinned memory, the DMA transfer starts immediately from the original buffer, eliminating that intermediate copy. In practice this reduces CPU-to-GPU transfer latency by 2–5x on large tensors. The non_blocking=True argument in .to(device) extends this further — it tells the CUDA runtime to initiate the DMA transfer asynchronously and return control to the CPU immediately, allowing the training loop to continue CPU work while the transfer completes in the background. The trade-off: pinned memory cannot be swapped by the OS, reducing memory management flexibility under system-wide memory pressure. Use pin_memory=True for GPU training and monitor total pinned memory with torch.cuda.memory_summary() to ensure you are not exhausting it.

Frequently Asked Questions

What is the difference between Dataset and DataLoader in PyTorch?

A Dataset defines how to access a single sample — it implements __len__ (total number of samples) and __getitem__ (fetch one sample by index). It knows nothing about batching or parallelism. A DataLoader wraps a Dataset and adds everything else: batching (batch_size), per-epoch shuffling (shuffle=True), multi-process loading (num_workers), and GPU transfer optimisation (pin_memory). The Dataset is the data source; the DataLoader is the pipeline that feeds data to the training loop at the right cadence to keep the GPU busy.

How many num_workers should I use?

A practical starting point is 4 for single-GPU training. Increase it if GPU utilisation measured by nvidia-smi is below 85% after setting num_workers=4 and pin_memory=True. Do not set num_workers higher than the number of CPU cores allocated to your process — in Docker and Kubernetes, check the container's CPU limit, not the host machine's total core count. Setting num_workers too high causes CPU contention between workers and actually slows down data loading. The right number is the smallest value that keeps GPU utilisation above 85%.
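One defensive way to size num_workers in code — note that os.sched_getaffinity is Linux-only, and a cgroup CPU *quota* (as opposed to a cpuset pin) is not reflected in it, so treat the result as an upper bound rather than the container's true limit.

```python
import os

# os.cpu_count() reports the host's cores — misleading inside a container
host_cores = os.cpu_count()
try:
    # Cores this process may actually run on (respects cpusets; Linux only)
    available = len(os.sched_getaffinity(0))
except AttributeError:
    available = host_cores or 1

# Start at 4, never exceed what is available, keep one core for the main process
num_workers = max(0, min(4, available - 1))
print(f"host={host_cores} available={available} -> num_workers={num_workers}")
```

From there, raise the value only if nvidia-smi still shows GPU utilisation below 85%.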

When should I use TensorDataset vs a custom Dataset?

Use TensorDataset when your data is already in memory as tensors and the full dataset fits comfortably in RAM. It provides __len__ and __getitem__ for free with zero boilerplate — no custom class needed. Use a custom Dataset when data is on disk or object storage and too large to fit in RAM, when samples need per-sample transforms or augmentation, or when data comes from a non-tensor source like a database or API. The decision is straightforward: if the data is already a tensor in memory, TensorDataset. If it needs to be loaded from somewhere, custom Dataset.
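The in-memory case in one sketch — no custom class, just tensors:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

ds = TensorDataset(features, labels)   # __len__ and __getitem__ for free
x, y = ds[0]                           # one (feature, label) pair

loader = DataLoader(ds, batch_size=32, shuffle=True)
xb, yb = next(iter(loader))
print(x.shape, xb.shape)               # torch.Size([8]) torch.Size([32, 8])
```

Everything a trivial custom Dataset would have written by hand is already there.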

Why does my DataLoader hang with num_workers > 0?

The most common cause is unpickleable objects stored in the Dataset. Python's multiprocessing pickles the Dataset to send it to each worker process. Open file handles, database connections, and lambda functions cannot be pickled — the worker hangs silently rather than raising a clear exception. Test it first: import pickle; pickle.dumps(dataset) should complete without raising. The fix is to initialise those objects inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker. A second cause is a deadlock inside __getitem__ — for example, waiting on a threading lock that is held by the main process.
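The pickleability check in practice — BadDataset and GoodDataset are illustrative names; the point is where the file handle lives:

```python
import os
import pickle
import tempfile
from torch.utils.data import Dataset


class BadDataset(Dataset):
    def __init__(self, path):
        self.handle = open(path, "rb")      # open handle on the instance — unpicklable

    def __len__(self):
        return 1

    def __getitem__(self, idx):
        return self.handle.read(1)


class GoodDataset(Dataset):
    def __init__(self, path):
        self.path = path                    # store only metadata

    def __len__(self):
        return 1

    def __getitem__(self, idx):
        with open(self.path, "rb") as f:    # open lazily, inside each worker
            return f.read(1)


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    pickle.dumps(BadDataset(path))
except TypeError as exc:
    print("BadDataset cannot be sent to workers:", exc)
pickle.dumps(GoodDataset(path))             # succeeds — safe for num_workers > 0
print("GoodDataset pickles cleanly")
```

Running this check once before a long training job is much cheaper than discovering the hang 30 seconds into epoch 1.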

What is persistent_workers and when should I use it?

persistent_workers=True keeps worker processes alive between epochs rather than destroying and re-creating them. Worker startup involves forking processes, importing modules, and re-initialising any objects in worker_init_fn — this typically costs 5–10 seconds per epoch on a standard training machine. With persistent_workers=True, that overhead only occurs once at the start of training. Use it whenever you are training for more than a few epochs and num_workers > 0. The only trade-off is slightly higher baseline memory usage because the workers remain resident. It is almost always worth it.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
