Senior 9 min · March 09, 2026

DataLoader num_workers Bus Error — Docker shm Fix

Docker's 64MB shm triggers Bus error in PyTorch DataLoader (num_workers>0).

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Dataset defines how to access a single sample — implement __len__ and __getitem__ for lazy loading
  • DataLoader wraps a Dataset to provide batching, shuffling, and multi-process parallel loading
  • pin_memory=True speeds up CPU-to-GPU transfers by using page-locked host memory
  • num_workers > 0 parallelises data loading on CPU: the #1 fix for GPU starvation
  • The biggest production mistake is leaving num_workers=0, which serialises loading on the main thread and can slow training by 50% or more
  • In Docker, --shm-size must be increased when num_workers > 0 or you get Bus error crashes
Plain-English First

Think of PyTorch DataLoader and Datasets as the logistics layer of a large industrial kitchen. The Dataset is your pantry — it holds all the raw ingredients and knows exactly where each one lives. The DataLoader is your sous-chef — it pulls those ingredients, organises them into manageable trays (batches), shuffles the order so the kitchen never gets stuck cooking the same meal twice in a row, and hands each tray to the head chef (the GPU) at exactly the right moment so the stove never sits idle waiting. Without a well-organised sous-chef, the most powerful stove in the world spends most of its time waiting for ingredients that are not ready yet.

PyTorch DataLoader and Datasets decouple data storage from batching logic, enabling scalable pipelines that keep GPUs fully utilised. The Dataset class abstracts how individual samples are accessed — one at a time, lazily, from disk or a database. The DataLoader handles batching, shuffling, and multi-process loading on top of whatever Dataset you hand it.

The core problem these tools solve: training on datasets that do not fit in memory while keeping the GPU fed continuously. If data loading is slower than GPU computation, the GPU sits idle between batches — this is called data starvation and it is one of the most common reasons a training run is 3x slower than it should be. The DataLoader solves this by pre-fetching batches in parallel worker processes while the GPU processes the current batch. That overlap is the entire point.

The architectural separation is deliberate and worth internalising early: Dataset knows how to access one sample. DataLoader knows how to batch, shuffle, and parallelise. This means you can swap your data source (disk, SQL, S3, Kafka) without touching the DataLoader, and you can tune the DataLoader's parallelism without touching the Dataset. Each side has one job.

The most common production failure I see in 2026 is the same one I saw in 2022: developers set num_workers=0 during prototyping because it is simpler, everything works, and then they deploy to a real dataset and discover training is 3–5x slower than it needs to be because data loading is serialised on the main thread. The fix is always num_workers >= 1 with pin_memory=True for GPU training — and documenting that requirement so it does not get reverted in a future PR.

What Is PyTorch DataLoader and Datasets and Why Does It Exist?

PyTorch DataLoader and Datasets exist to solve a single concrete problem: how do you train on data that is too large to fit in memory, while keeping an expensive GPU, usually billed by the hour, fully utilised?

The Dataset class — specifically the Map-style variant — requires implementing two methods: __len__ (how many samples exist) and __getitem__ (fetch one sample by index). That is the entire contract. The Dataset knows nothing about batching, shuffling, or parallelism. It just answers 'give me sample 4,217' as fast as it can.

The DataLoader wraps that Dataset and adds everything else: it selects a batch of indices (optionally shuffled), hands those indices to worker processes that call __getitem__ in parallel, collates the results into a batch tensor, and optionally pre-pins that tensor in page-locked memory for faster GPU transfer. The training loop then pulls pre-fetched batches from a queue without waiting.

The performance insight that changes how you think about this: with num_workers=4 and pin_memory=True, the DataLoader is pre-fetching batch N+1 and N+2 while the GPU is still processing batch N. That pipeline overlap is what keeps GPU utilisation above 90%. Without it — with num_workers=0 — every batch is loaded synchronously on the main thread after the GPU finishes the previous one. The GPU sits idle for however long loading takes. On a dataset of real images with augmentations, that idle time can represent 60–70% of wall-clock training time.

As of 2026, with models being trained on increasingly large datasets and GPUs being increasingly expensive, getting this right is not an optimisation — it is table stakes.

io/thecodeforge/ml/forge_dataset.py · PYTHON
import torch
from torch.utils.data import Dataset, DataLoader

# A minimal but complete custom Dataset implementation
# This is the pattern you will replicate for every new data source
class ForgeProjectDataset(Dataset):
    def __init__(self, data_list: list, labels: list):
        # This toy example keeps small in-memory lists. For real data, store only
        # metadata (paths, IDs, labels) here: loading the data itself in __init__
        # puts all of it in RAM before training starts
        self.data = data_list
        self.labels = labels

    def __len__(self) -> int:
        # DataLoader uses this to know how many batches constitute one epoch
        return len(self.data)

    def __getitem__(self, idx: int):
        # This is called once per sample, in parallel across num_workers processes
        # Keep it fast: one file read, one transform, return one sample
        sample = torch.tensor(self.data[idx], dtype=torch.float32)
        label  = torch.tensor(self.labels[idx], dtype=torch.long)
        return sample, label


# Minimal working example — four samples, two features each
raw_data   = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
raw_labels = [0, 1, 0, 1]

forge_ds = ForgeProjectDataset(raw_data, raw_labels)

# Production-grade DataLoader configuration for GPU training
forge_loader = DataLoader(
    dataset=forge_ds,
    batch_size=2,
    shuffle=True,           # Reshuffle at every epoch for better generalisation
    num_workers=2,          # Two CPU processes load data in parallel
    pin_memory=True,        # Pre-pin batches in page-locked memory for faster GPU transfer
    persistent_workers=True # Keep workers alive between epochs — avoids 5-10s restart cost
)

# Verify the DataLoader works before starting a long training run
# The __main__ guard is required on platforms that spawn workers (Windows, macOS)
if __name__ == "__main__":
    for batch_idx, (samples, labels) in enumerate(forge_loader):
        print(f"Batch {batch_idx}: samples shape {samples.shape}, labels {labels}")
        if batch_idx >= 1:
            break  # Just checking the first two batches
Output
Batch 0: samples shape torch.Size([2, 2]), labels tensor([1, 0])
Batch 1: samples shape torch.Size([2, 2]), labels tensor([0, 1])
The Producer-Consumer Pattern
  • Dataset defines how to access ONE sample — it knows nothing about batching or parallelism and should not
  • DataLoader wraps the Dataset and adds batching, shuffling, and multi-process loading on top
  • Worker processes (producers) load and transform data in parallel on CPU cores while the GPU works
  • The training loop (consumer) pulls pre-fetched batches from a queue — ideally it never waits
  • pin_memory=True pre-pins batches to page-locked memory so DMA transfers to GPU start without an extra copy step
Production Insight
num_workers=0 is the default but it serialises data loading on the main thread — GPU sits idle between every batch.
With num_workers=4 and pin_memory=True on a typical image dataset, GPU utilisation can move from 40–50% to above 90%.
In 2026 with A100 and H100 pricing, that difference in utilisation is real money on every training run.
Rule: always set num_workers >= 1 for GPU training, pin_memory=True for CUDA, and persistent_workers=True for multi-epoch runs.
Key Takeaway
Dataset defines how to access one sample; DataLoader adds batching, shuffling, and parallelism on top. The separation is deliberate — it lets you swap data sources without touching the pipeline, and tune parallelism without touching the data logic. Always set num_workers >= 1 and pin_memory=True for GPU training; leaving both at their defaults is the most common reason training is slower than it should be.
DataLoader Configuration Decision
If: Small dataset fits entirely in RAM and is already a tensor
Use: torch.utils.data.TensorDataset (no custom class needed, zero boilerplate, and it is just as fast)
If: Data is on disk (images, audio files, parquet shards) and needs lazy loading
Use: A custom Dataset with __getitem__ loading one sample at a time from disk; never pre-load in __init__
If: Data is in a database or streaming source with no natural index
Use: An IterableDataset, which yields samples sequentially without needing __len__ or random access
If: Training on GPU with any custom Dataset
Use: num_workers=4, pin_memory=True, persistent_workers=True, and non_blocking=True in .to(device) inside the training loop
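
Before tuning any of these settings, it helps to measure how much of each training step is spent waiting for data. The sketch below is one way to do that with the standard library alone. The names measure_data_wait and slow_batches are hypothetical helpers, not PyTorch APIs; in a real run you would pass your DataLoader as the iterable and your training step as step_fn. On a GPU, CUDA calls return before the work finishes, so step_fn should end with torch.cuda.synchronize() for the split to be meaningful.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_data_wait(batches: Iterable, step_fn: Callable,
                      max_batches: int = 50) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent in step_fn)."""
    wait_s = 0.0
    step_s = 0.0
    iterator = iter(batches)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)   # blocks until the next batch is ready
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)               # forward/backward pass in a real loop
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
    return wait_s, step_s


def slow_batches(n: int, delay_s: float):
    """Stand-in for a loader whose batches each take delay_s to produce."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


# Simulated run: loading is 4x slower than the training step
wait_s, step_s = measure_data_wait(slow_batches(5, 0.02), lambda b: time.sleep(0.005))
print(f"waiting on data: {100 * wait_s / (wait_s + step_s):.0f}% of loop time")
```

If the waiting share is small, more workers will not help; if it dominates, the decision table above is the place to start.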

Enterprise Integration: SQL-Backed Datasets

In real production environments, your training data rarely lives in a flat folder of files. It lives in a database — with labels, metadata, train/val/test splits, and versioning all managed in SQL. Implementing a Dataset that queries a SQL backend is one of the more underrated patterns in production ML engineering.

The approach: in __init__, run a single SQL query to fetch metadata only — sample IDs, file paths on disk or object storage, and labels. Store that metadata in memory as a list or DataFrame. In __getitem__, use the file path from metadata to load the actual binary data — the image, audio file, or feature array — from disk or S3. This keeps memory usage proportional to the number of samples (a few bytes per row of metadata), not the size of the data (potentially gigabytes).

The production benefit that makes this pattern worth the setup: when you add new training data, you insert a row into the SQL table and drop the corresponding file on disk. The next training run picks it up automatically via the __init__ query. There is no CSV file to regenerate, no manifest to sync, and no risk of the file list drifting from the actual filesystem state. I have seen teams spend days debugging training regressions that turned out to be a stale CSV pointing to deleted files — this pattern eliminates that entire class of issue.

One thing to watch: do not query SQL inside __getitem__. SQL connections are not thread-safe and cannot be pickled for multi-process workers. Fetch all metadata once in __init__ and do all disk or object-storage I/O in __getitem__.
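
A minimal sketch of this pattern, assuming a SQLite file as the metadata store and a table named training_data with the columns used in the query in this section. The class name SqlBackedDataset and the SQLite backend are illustrative assumptions; a production setup would more likely use a server database and subclass torch.utils.data.Dataset. The map-style contract is only __len__ and __getitem__, so the sketch runs with the standard library alone.

```python
import sqlite3
from pathlib import Path
from typing import List, Tuple


class SqlBackedDataset:
    """Map-style dataset: metadata from SQL once, file bytes loaded lazily.

    In a PyTorch pipeline this would subclass torch.utils.data.Dataset;
    the contract is the same two methods.
    """

    def __init__(self, db_path: str, split: str = "train"):
        # Open, query, and close inside __init__. No connection is kept on
        # self, so the dataset stays picklable for worker processes.
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute(
                "SELECT file_path, label_id FROM training_data "
                "WHERE split_tag = ? AND is_verified = 1 ORDER BY sample_id",
                (split,),
            ).fetchall()
        finally:
            conn.close()
        # Only lightweight metadata lives in memory: one (path, label) per sample
        self.samples: List[Tuple[str, int]] = [(p, int(l)) for p, l in rows]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Tuple[bytes, int]:
        file_path, label = self.samples[idx]
        # Lazy load: only this one file is read, and only when requested.
        # A real Dataset would decode the bytes into a tensor here.
        return Path(file_path).read_bytes(), label
```

The design choice worth noting is that the connection never outlives __init__, which sidesteps both the pickling problem and per-sample database round-trips.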

io/thecodeforge/db/fetch_samples.sql · SQL
-- Fetch sample metadata for Dataset __init__
-- We fetch IDs, paths, and labels here — NOT the binary data
-- Binary data is loaded lazily in __getitem__ using the file_path
-- This query runs once at the start of training, not per batch

SELECT
    sample_id,
    file_path,      -- path to the file on NVMe SSD or S3
    label_id,
    split_tag       -- 'train', 'val', or 'test'
FROM io.thecodeforge.training_data
WHERE project_tag = 'vision_v2'
  AND split_tag   = 'train'
  AND is_verified = TRUE   -- exclude samples flagged as corrupted during QA
ORDER BY sample_id ASC;

-- Expected result: one lightweight metadata row per sample
-- Actual image binary data stays on disk until __getitem__ loads it
Output
Returns metadata rows — sample_id, file_path, label_id for each training sample.
Binary data is not transferred; only enough information to load it on demand.
Store Paths in SQL, Not Blobs
Only store file paths or object-storage keys in your SQL table. Loading actual binary blobs from SQL during __getitem__ creates a per-sample database round-trip under multi-process load — it will saturate your database connection pool and become the bottleneck faster than you expect. Keep binary data on a fast NVMe SSD or a distributed object store like S3 or GCS, and use SQL only for the lightweight metadata that tells your Dataset where to find it.
Production Insight
Store file paths in SQL, not binary blobs — per-sample SQL reads under multi-process load will saturate your DB connection pool.
Load binary data from disk or object storage in __getitem__, not __init__ — lazy loading keeps memory proportional to batch size, not dataset size.
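SQL connections cannot be pickled for worker processes, so never create them in __init__ and store them as instance variables. If a worker genuinely needs a connection, open it once per worker (lazily on first use, or in worker_init_fn) rather than once per sample.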
Rule: SQL for metadata and versioning, disk or S3 for binary data, Dataset for lazy access, DataLoader for batching and parallelism.
Key Takeaway
SQL stores metadata (paths, labels, splits) — disk or object storage holds binary data — Dataset loads lazily one sample at a time. This pattern scales to tens of millions of samples without loading anything into RAM upfront. New data is picked up automatically on the next training run by rerunning the __init__ query — no CSV drift, no stale manifests.

Containerised Data Pipelines with Docker

Wrapping your training environment in Docker is the standard way to ensure the data pipeline behaves identically across a developer's laptop, a CI server, and a production GPU cluster. It also surfaces the most common PyTorch DataLoader configuration mistake before it costs you a four-hour training run.

The critical Docker configuration that almost everyone gets wrong the first time: when num_workers > 0, PyTorch uses shared memory at /dev/shm to transfer tensors between worker processes and the main process. Docker's default shared memory allocation is 64MB — a sensible default for containerised web services that never heard of PyTorch. For a training job with num_workers=4 and any real batch size, that 64MB fills up within a few epochs and the container dies with a Bus error and no Python traceback. The fix is one flag: --shm-size=2g.

The deployment checklist I use for every new training container: set --shm-size=2g or larger; mount the data directory as a Docker volume rather than copying it into the image (datasets are too large for image layers and change too frequently); pin the PyTorch version explicitly rather than using pytorch/pytorch:latest (latest changes under you in ways that are hard to reproduce); set num_workers based on the CPU cores allocated to the container, not the host machine's total CPU count; and add a pre-flight health check that verifies /dev/shm has enough free space before the training job starts.

The shared-memory setting also belongs everywhere else the container is launched: as shm_size in your docker-compose.yml, as a memory-backed emptyDir volume mounted at /dev/shm in your Kubernetes pod spec (Kubernetes has no --shm-size flag), and in your CI job definition. If it lives in only one place, it will be dropped in a refactor and you will spend an afternoon diagnosing a Bus error that you already fixed six months ago.
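
The pre-flight check from the checklist can also live in the training script itself, so it runs no matter how the container was launched. A minimal sketch, assuming a 1GB threshold; the function name check_shared_memory and the threshold are illustrative, not part of PyTorch or Docker.

```python
import shutil
import sys


def check_shared_memory(path: str = "/dev/shm",
                        min_free_bytes: int = 1_000_000_000) -> bool:
    """Return True if the filesystem at `path` has at least min_free_bytes free."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        print(
            f"Insufficient shared memory at {path}: {free / 1e6:.0f} MB free, "
            f"need {min_free_bytes / 1e6:.0f} MB+. "
            "Restart the container with a larger --shm-size.",
            file=sys.stderr,
        )
        return False
    return True


# Typical use at the top of a training script, before any DataLoader
# with num_workers > 0 is iterated:
#
#     if not check_shared_memory():
#         raise SystemExit(1)
```

Checking at startup turns a Bus error hours into training into a one-line message in the first second of the job.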

Dockerfile · DOCKERFILE
# Pin a specific PyTorch release — 'latest' changes under you
# and reproducing a training run six months later becomes impossible
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install dependencies before copying source code
# This layer is cached as long as requirements.txt does not change
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# IMPORTANT: This container MUST be started with --shm-size=2g
# PyTorch DataLoader uses /dev/shm to transfer tensors between worker processes
# Docker's default 64MB causes Bus error crashes when num_workers > 0
# Example: docker run --shm-size=2g --gpus all -v /data:/data thecodeforge/training:latest

# Pre-flight check: verify shared memory is sufficient before training starts
# Exits with a clear error rather than a cryptic Bus error mid-epoch
# (A HEALTHCHECK would not do this: it only marks a running container as
# unhealthy and never stops CMD from starting, so the check is chained into CMD)
CMD python -c "import shutil; free = shutil.disk_usage('/dev/shm').free; assert free > 1e9, f'Insufficient /dev/shm: {free/1e6:.0f}MB free, need 1000MB+'" \
    && python ForgeDataset.py
Output
Successfully built image thecodeforge/data-pipeline:latest
Pre-flight check configured: training will not start if /dev/shm has less than 1GB free
The One Docker Flag That Prevents Most PyTorch Crashes
When using num_workers > 0 inside Docker, you must set --shm-size=2g on the docker run command. Docker's default 64MB shared memory is designed for web containers, not ML training. Without this flag, your training job will crash with a Bus error and no Python traceback after a few epochs — and it will look like a hardware or dataset problem rather than a configuration one. Add this flag to your docker run script, your docker-compose.yml, and your CI job definition. Treat it as a required argument, not an optional one.
Production Insight
Docker's default 64MB shared memory causes Bus error crashes under multi-process DataLoader — this is not an edge case, it happens on every real training job.
The flag --shm-size=2g must live in every place the container is launched: docker run, compose file, Kubernetes pod spec, and CI job definition.
Kubernetes has no --shm-size flag; the equivalent is an emptyDir volume with medium: Memory, mounted at /dev/shm.
Rule: add a pre-training /dev/shm health check so the failure is a clear error message at startup, not a cryptic crash after two hours of training.
Key Takeaway
Docker's default 64MB shared memory causes Bus errors with multi-process DataLoader — --shm-size=2g is not optional, it is a required flag for any real training job. Mount data as a volume rather than baking it into the image. In Kubernetes, use an emptyDir shm volume mount with medium: Memory as the equivalent of --shm-size.
Docker Configuration for Data Loading
If: num_workers=0 (single-process loading)
Use: Default 64MB shared memory is sufficient; no --shm-size flag needed, but you are leaving GPU utilisation on the table
If: num_workers > 0 (multi-process loading)
Use: --shm-size=2g minimum; increase to 4g+ with many workers or large batches, and verify with df -h /dev/shm
If: Large training dataset (>100GB)
Use: Mount data as a Docker volume, never COPY into the image; images have practical size limits and your dataset changes more often than your code
If: Multiple containers sharing a GPU or running on Kubernetes
Use: Set CUDA_VISIBLE_DEVICES explicitly and limit num_workers per container based on the container's CPU allocation, not the host's total core count
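
One way to derive num_workers from what the container may actually use, rather than the host's core count, is sketched below. The helper names and the cap of 8 are assumptions for illustration. Note the limitation: os.sched_getaffinity reflects CPU pinning (for example docker run --cpuset-cpus) but not CFS quotas set with --cpus, which are recorded in the container's cgroup files instead.

```python
import os


def available_cpus() -> int:
    """CPUs this process is allowed to run on, falling back to the host count."""
    if hasattr(os, "sched_getaffinity"):   # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def suggest_num_workers(cap: int = 8, reserve: int = 1) -> int:
    """Leave `reserve` cores for the main process and never exceed `cap`."""
    return max(0, min(cap, available_cpus() - reserve))


print(f"usable CPUs: {available_cpus()}, suggested num_workers: {suggest_num_workers()}")
```

Treat the result as a starting point and confirm it by measuring throughput; more workers than usable cores usually makes loading slower, not faster.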

Common Mistakes and How to Avoid Them

Most DataLoader bugs in production fall into a small set of patterns. Knowing them in advance means you spend time training models instead of debugging pipelines.

The performance mistakes: num_workers=0 is the biggest one — it serialises every sample load on the main thread and GPU sits idle while it happens. Loading data in __init__ instead of __getitem__ is the second — it turns a lazy-loading Dataset into a greedy RAM consumer that OOMs before training even starts.

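The correctness mistakes: passing unpickleable objects (open file handles, database connections, lambda functions) to a Dataset when num_workers > 0. When workers are started with the spawn or forkserver method (the default on Windows and macOS, and on newer Python versions on Linux), Python's multiprocessing pickles the Dataset to send it to each worker process. If any attribute cannot be pickled, the worker hangs or crashes, often without a useful traceback. With fork, the Dataset is copied without pickling, but handles and connections shared across forked processes still misbehave. The fix is to create those objects lazily inside __getitem__ on first use, or use worker_init_fn to set them up once per worker process.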
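
The lazy-initialisation fix can be sketched as follows. LazyHandleDataset and its fixed-size record layout are hypothetical; the point is only the pattern of opening the handle on first use and dropping it when the object is pickled. In a PyTorch pipeline the class would subclass torch.utils.data.Dataset.

```python
from typing import BinaryIO, Optional


class LazyHandleDataset:
    """Fixed-size records read from one large binary file."""

    def __init__(self, path: str, record_size: int, num_records: int):
        # Store only picklable configuration here, never the open handle
        self.path = path
        self.record_size = record_size
        self.num_records = num_records
        self._fh: Optional[BinaryIO] = None   # opened on first use, per process

    def __len__(self) -> int:
        return self.num_records

    def __getitem__(self, idx: int) -> bytes:
        if self._fh is None:                   # first call in this process
            self._fh = open(self.path, "rb")
        self._fh.seek(idx * self.record_size)
        return self._fh.read(self.record_size)

    def __getstate__(self) -> dict:
        # Drop the handle when the object is pickled so it never crosses a
        # process boundary; the receiving process reopens it lazily.
        state = self.__dict__.copy()
        state["_fh"] = None
        return state
```

With fork-based workers, avoid indexing the dataset in the main process before the workers start, otherwise every worker inherits the same open handle and file offset.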

The subtle one that costs teams debugging time: forgetting drop_last=True on the training DataLoader. The last batch of an epoch almost always has fewer samples than the configured batch_size. For most loss functions this is harmless, but for BatchNorm it is not: BatchNorm uses batch statistics during training, and a final batch of size 1 has no meaningful variance, so the layer either raises an error or normalises with degenerate statistics. Setting drop_last=True discards the last incomplete batch and ensures consistent batch sizes throughout training. For validation DataLoaders, use drop_last=False, because you want to evaluate on every sample.
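
The arithmetic behind the last-batch problem is easy to check up front. The helper below is a hypothetical illustration, not a PyTorch function; it mirrors how a map-style DataLoader counts batches (rounding up without drop_last, rounding down with it).

```python
from typing import Tuple


def batch_plan(num_samples: int, batch_size: int, drop_last: bool) -> Tuple[int, int]:
    """Return (batches per epoch, size of the final batch)."""
    full, remainder = divmod(num_samples, batch_size)
    if drop_last or remainder == 0:
        return full, (batch_size if full else 0)
    return full + 1, remainder


# 1025 samples with batch_size=32 leaves exactly one stray sample
print(batch_plan(1025, 32, drop_last=False))  # (33, 1): final batch has a single sample
print(batch_plan(1025, 32, drop_last=True))   # (32, 32): the stray sample is skipped this epoch
```

With shuffle=True a different sample is dropped each epoch, so nothing is permanently excluded from training.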

io/thecodeforge/ml/efficient_loading.py · PYTHON
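import torch
from torch.utils.data import DataLoader

# Assumes forge_ds (training) and val_ds (validation) are Dataset instances
# defined elsewhere, e.g. ForgeProjectDataset from forge_dataset.py
# Production-grade DataLoader configuration for GPU training
# Each parameter here solves a specific real problem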
loader = DataLoader(
    forge_ds,
    batch_size=32,
    shuffle=True,              # Reshuffle every epoch for better generalisation
    num_workers=4,             # 4 parallel CPU processes — eliminates GPU starvation
    pin_memory=True,           # Pre-pin batches in page-locked memory for faster DMA transfer
    drop_last=True,            # Drop incomplete final batch — prevents BatchNorm issues
    persistent_workers=True,   # Keep workers alive between epochs — avoids 5-10s restart cost
    prefetch_factor=2,         # Each worker pre-fetches 2 batches ahead — reduces wait time
)

# Validation DataLoader — different settings for a reason
val_loader = DataLoader(
    val_ds,
    batch_size=64,             # Larger batch is fine for inference — no gradient storage
    shuffle=False,             # Do NOT shuffle validation — reproducible evaluation
    num_workers=4,
    pin_memory=True,
    drop_last=False,           # Evaluate on EVERY sample — no exceptions
    persistent_workers=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for samples, labels in loader:
    # non_blocking=True overlaps the CPU-to-GPU transfer with other CPU work
    # Only effective when pin_memory=True is also set
    samples = samples.to(device, non_blocking=True)
    labels  = labels.to(device, non_blocking=True)
    # ... training logic ...

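# Random seed consistency across workers
# PyTorch already gives each worker its own torch seed (base_seed + worker_id).
# Other libraries such as NumPy and Python's random may not be reseeded,
# depending on the PyTorch version, so seed them from the worker's torch seed.
def worker_init_fn(worker_id: int) -> None:
    import random
    import numpy as np
    worker_seed = torch.initial_seed() % (2 ** 32)  # already unique per worker
    np.random.seed(worker_seed)
    random.seed(worker_seed)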

# Apply it to any DataLoader that uses random augmentations in __getitem__
auged_loader = DataLoader(
    forge_ds,
    batch_size=32,
    num_workers=4,
    worker_init_fn=worker_init_fn,
    pin_memory=True,
)
Output
// High-throughput data pipeline established.
// GPU will receive pre-fetched, pre-pinned batches with consistent random augmentation across workers.
When Simple Is Better
The most over-engineered DataLoader mistake is building a custom Dataset for data that already lives in memory as tensors. If your entire dataset fits in 10% of available RAM and is already in tensor format, torch.utils.data.TensorDataset(features, labels) gives you __len__ and __getitem__ for free with zero boilerplate. Only write a custom Dataset when you genuinely need lazy loading from disk, a database, or object storage. Start simple and add complexity only when you have a measurable reason to.
Production Insight
num_workers=0 is the default and the most common performance mistake — always override it for GPU training.
Loading data in __init__ instead of __getitem__ converts lazy loading into a greedy RAM consumer — OOMs before the first epoch.
drop_last=True on training DataLoaders prevents BatchNorm from receiving a batch of size 1 at the end of an epoch — this is not optional when BatchNorm is in your model.
Rule: load one sample in __getitem__, set num_workers=4, pin_memory=True, drop_last=True for training, and worker_init_fn for any job that uses random augmentations.
Key Takeaway
num_workers=0 serialises data loading and is the single most common reason GPU utilisation is below 50%. Load data in __getitem__ not __init__. Set drop_last=True on training DataLoaders when using BatchNorm. Use worker_init_fn to seed NumPy and any other non-torch random generators per worker: if workers share a seed, you are training on fewer unique augmentations than you think.
Debugging Slow or Broken Data Loading
If: GPU utilisation below 50% during training
Use: Increase num_workers to 4 and add pin_memory=True; data loading is almost certainly the bottleneck
If: RAM usage grows steadily during training
Use: Check if __init__ pre-loads data; move all data loading into __getitem__ so only one batch lives in memory at a time
If: Training crashes with unpickleable object error or silent worker hang
Use: Move file handles and DB connections out of __init__ and into __getitem__ or worker_init_fn
If: BatchNorm behaves erratically on the last batch of each epoch
Use: Set drop_last=True on the training DataLoader to ensure every batch has the same size

Custom collate_fn: Handling Variable-Length Data

The default collate_fn expects every sample in a batch to have the same shape so it can stack them into a uniform tensor. This assumption breaks the moment you work with NLP sequences of different lengths, graphs with different numbers of nodes, or images that have not been resized to a fixed resolution.

A custom collate_fn lets you define exactly how a list of heterogeneous samples becomes a batch. The most common pattern — one you will write or encounter in almost every NLP project — is pad-and-mask: pad all sequences to the length of the longest sequence in the batch, and return a binary mask tensor that tells downstream layers which positions are real data and which are padding. Attention layers, loss functions, and pooling operations all need this mask to avoid treating padding as signal.

The production subtlety that trips people up: with num_workers > 0, collate_fn runs inside the worker processes, once per batch, right after the worker has fetched the samples; with num_workers=0 it runs in the main process. A slow collate_fn therefore adds directly to the time each worker needs to deliver a batch, and whatever it returns has to be shipped to the main process through shared memory. Keep it to reshaping and padding only. If you find yourself computing complex statistics or doing any non-trivial per-sample transformation in collate_fn, move that work into __getitem__, where it is easier to test and profile one sample at a time.

For NLP work in 2026, most teams use Hugging Face's DataCollatorWithPadding which implements this pattern with tokeniser-aware padding. But understanding the underlying collate_fn contract means you can customise it when the standard collators do not fit your data structure.

io/thecodeforge/ml/custom_collate.py · PYTHON
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from typing import List, Tuple


def collate_variable_length(
    batch: List[Tuple[torch.Tensor, torch.Tensor]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Pads variable-length sequences to the longest in the batch.
    Returns:
        padded   — (batch_size, max_len, feature_dim) padded sequences
        labels   — (batch_size,) classification labels
        mask     — (batch_size, max_len) float mask: 1.0 real, 0.0 padding

    Runs once per batch (inside a worker process when num_workers > 0).
    Keep it fast: heavy per-sample transforms belong in __getitem__, not here.
    """
    sequences, labels = zip(*batch)

    # pad_sequence pads to the longest sequence in this batch
    # batch_first=True gives shape (batch, seq_len, features)
    padded = pad_sequence(sequences, batch_first=True, padding_value=0.0)

    # Build the mask: 1.0 where real data, 0.0 where padding
    # Downstream attention layers and loss functions use this to ignore padding
    lengths = torch.tensor([len(s) for s in sequences], dtype=torch.long)
    mask    = torch.zeros(padded.shape[0], padded.shape[1], dtype=torch.float32)
    for i, length in enumerate(lengths):
        mask[i, :length] = 1.0

    labels_stacked = torch.stack(labels)
    return padded, labels_stacked, mask


# Toy variable-length dataset
class VariableLengthDataset(Dataset):
    def __init__(self, num_samples: int = 200, feature_dim: int = 128):
        self.num_samples = num_samples
        self.feature_dim = feature_dim

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int):
        # Sequence length varies per sample — this is what breaks the default collate_fn
        seq_len  = torch.randint(10, 60, (1,)).item()
        features = torch.randn(seq_len, self.feature_dim)
        label    = torch.tensor(idx % 2, dtype=torch.long)  # binary label
        return features, label


variable_length_dataset = VariableLengthDataset(num_samples=200, feature_dim=128)

loader = DataLoader(
    variable_length_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=collate_variable_length,  # custom collation for variable-length sequences
    num_workers=4,
    pin_memory=True,
)

for batch_idx, (padded_seqs, labels, mask) in enumerate(loader):
    # padded_seqs: (32, max_len_in_batch, 128)
    # mask:        (32, max_len_in_batch) — 1.0 for real tokens, 0.0 for padding
    real_token_count = mask.sum().item()
    total_positions  = mask.numel()
    padding_pct      = 100 * (1 - real_token_count / total_positions)
    print(f"Batch {batch_idx}: shape {padded_seqs.shape} | "
          f"padding {padding_pct:.1f}% | labels {labels[:4]}")
    if batch_idx >= 2:
        break
Output
Batch 0: shape torch.Size([32, 59, 128]) | padding 37.2% | labels tensor([1, 0, 1, 0])
Batch 1: shape torch.Size([32, 58, 128]) | padding 36.8% | labels tensor([0, 1, 0, 1])
Batch 2: shape torch.Size([32, 57, 128]) | padding 35.1% | labels tensor([1, 0, 1, 0])
collate_fn Rules of Thumb
  • Use pad_sequence from torch.nn.utils.rnn — it handles batch-first padding in one call and is well-tested
  • Always return a mask alongside padded data — every downstream layer that touches sequences needs to know where padding starts
  • collate_fn runs once per batch, inside the worker when num_workers > 0: if profiling shows it as the bottleneck, move per-sample work into __getitem__ and keep collation to padding and stacking
  • Consider bucket sampling (grouping sequences by similar length before batching) to reduce padding waste — padding above 40% per batch is worth addressing
  • For images of different sizes, resize in __getitem__ not in collate_fn — resizing is CPU-intensive and belongs in the parallel workers
Production Insight
collate_fn runs in the worker processes when num_workers > 0 and in the main process when num_workers=0; either way, complex logic here delays every batch, so profile it separately from __getitem__.
Padding above 40% per batch is wasted compute at training time and is worth fixing with bucket sampling or sorting by length before batching.
In 2026 most NLP teams use Hugging Face DataCollatorWithPadding, but understanding the raw collate_fn contract lets you customise it when standard collators do not fit your data structure.
Rule: keep collate_fn to reshaping and padding only — move transforms to __getitem__.
Key Takeaway
Custom collate_fn handles variable-length samples: pad to the longest sequence in the batch and always return a mask so downstream layers can ignore padding positions. collate_fn runs once per batch (in the worker when num_workers > 0, otherwise in the main process), so keep it fast and leave per-sample transforms in __getitem__. Track padding percentage per batch and consider bucket sampling if it consistently exceeds 40%.
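
Bucket sampling can be sketched in a few lines, assuming the sequence lengths are known up front (for example from the metadata query). The function names bucket_batches and padding_fraction and the default bucket size are illustrative assumptions, not a PyTorch API.

```python
import random
from typing import List, Sequence


def bucket_batches(lengths: Sequence[int], batch_size: int,
                   bucket_size: int = 100, seed: int = 0) -> List[List[int]]:
    """Group sample indices into batches of similar sequence length."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)                              # keep epoch-level randomness
    batches: List[List[int]] = []
    for start in range(0, len(indices), bucket_size):
        # Sort only within a bucket so neighbours have similar length
        bucket = sorted(indices[start:start + bucket_size], key=lambda i: lengths[i])
        for b in range(0, len(bucket), batch_size):
            batches.append(bucket[b:b + batch_size])
    rng.shuffle(batches)                              # avoid a short-to-long ordering
    return batches


def padding_fraction(lengths: Sequence[int], batches: List[List[int]]) -> float:
    """Fraction of padded positions when each batch is padded to its longest member."""
    real = sum(lengths)
    padded = sum(max(lengths[i] for i in batch) * len(batch) for batch in batches)
    return 1 - real / padded


# Compare against plain consecutive batching on random lengths in [10, 59]
rng = random.Random(1)
lengths = [rng.randint(10, 59) for _ in range(200)]
naive = [list(range(i, min(i + 32, 200))) for i in range(0, 200, 32)]
print(f"naive padding:    {100 * padding_fraction(lengths, naive):.1f}%")
print(f"bucketed padding: {100 * padding_fraction(lengths, bucket_batches(lengths, 32)):.1f}%")
```

The returned list of index lists can be passed to DataLoader as batch_sampler=, in which case batch_size, shuffle, and drop_last must be left unset. Use a different seed each epoch so the batches change between epochs.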

IterableDataset: Streaming Data Without Random Access

Map-style datasets require __len__ and __getitem__ — random access to any sample by index. This breaks when data arrives as a stream (Kafka, network logs, real-time sensor feeds) or when the dataset is genuinely too large to index. IterableDataset solves this by yielding samples sequentially without needing to know the total size.

The use case that justifies reaching for IterableDataset: training on a live event stream where the concept of 'total dataset size' does not exist, or on a dataset so large that generating a complete index would take longer than training itself. The DataLoader iterates through the __iter__ method and batches samples as they arrive. It does not shuffle an IterableDataset (passing shuffle=True is rejected), so any shuffling, typically within a buffer of recent samples, has to be implemented inside the dataset.

The production trade-off that you need to understand before choosing IterableDataset: it cannot shuffle globally because it never knows the full dataset. It can shuffle within a configurable buffer of recent samples, but the model always sees data in approximately the order it arrives in the stream. If the stream has any temporal structure — and real data almost always does — the model will see a biased distribution. For training data that can be indexed, Map-style datasets with global shuffling are strictly better. Use IterableDataset only when indexing is genuinely impossible.
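Buffer-based shuffling can be sketched in a few lines. This is an illustration under assumptions, not a PyTorch API: shuffle_buffer is a hypothetical generator that you would call inside your own __iter__, and the buffer size and seed are arbitrary.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items from `stream` in locally shuffled order.

    An item can move at most `buffer_size - 1` positions earlier than
    its arrival position, so global ordering trends are preserved.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Stream exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(100), buffer_size=10, seed=1))
```

A larger buffer mixes samples across a wider window at the cost of memory; it never turns a time-ordered stream into a globally shuffled one.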

One non-obvious operational issue with num_workers > 0 and IterableDataset: each worker process receives its own copy of the dataset object and calls __iter__ independently, so each one iterates the entire stream. Without sharding the stream across workers, every sample gets loaded num_workers times. You need to detect the worker ID inside __iter__ using torch.utils.data.get_worker_info() and partition the stream so each worker handles a distinct subset.

io/thecodeforge/ml/iterable_dataset.py · PYTHON
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info
from typing import Iterator, Tuple


class StreamingSensorDataset(IterableDataset):
    """
    Yields samples from a simulated sensor stream.
    In production, replace __iter__ with a Kafka consumer,
    network socket reader, or database cursor.

    IMPORTANT: with num_workers > 0, each worker calls __iter__ independently.
    Without sharding, every sample is loaded num_workers times.
    The __iter__ below handles worker partitioning automatically.
    """

    def __init__(self, num_samples: int = 10000):
        self.num_samples = num_samples

    def __iter__(self) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
        worker_info = get_worker_info()

        if worker_info is None:
            # Single-process loading — yield the full stream
            start, end = 0, self.num_samples
        else:
            # Multi-process loading — partition stream across workers
            # Each worker gets a non-overlapping slice of the sample range
            per_worker = self.num_samples // worker_info.num_workers
            worker_id  = worker_info.id
            start      = worker_id * per_worker
            end        = start + per_worker if worker_id < worker_info.num_workers - 1 else self.num_samples

        for i in range(start, end):
            # Simulate streaming sensor data — replace with real I/O in production
            features = torch.randn(10)
            label    = torch.tensor(float(features[0] > 0), dtype=torch.float32)
            yield features, label


if __name__ == "__main__":
    # The guard matters on platforms that start workers with "spawn"
    # (Windows, macOS): each worker re-imports this module on startup.
    stream_ds     = StreamingSensorDataset(num_samples=10000)
    stream_loader = DataLoader(
        stream_ds,
        batch_size=64,
        num_workers=2,  # Each worker now handles a non-overlapping shard of the stream
    )

    for batch_idx, (features, labels) in enumerate(stream_loader):
        if batch_idx >= 5:
            break
        print(f"Batch {batch_idx}: features {features.shape}, "
              f"label distribution: {labels.sum().item():.0f}/{len(labels)} positive")
Output (illustrative: the stream is unseeded random data, so label counts differ between runs)
Batch 0: features torch.Size([64, 10]), label distribution: 34/64 positive
Batch 1: features torch.Size([64, 10]), label distribution: 31/64 positive
Batch 2: features torch.Size([64, 10]), label distribution: 33/64 positive
Batch 3: features torch.Size([64, 10]), label distribution: 30/64 positive
Batch 4: features torch.Size([64, 10]), label distribution: 32/64 positive
Map-style vs Iterable Dataset — When to Use Each
Use Map-style Dataset when you have random access to all samples and can determine the dataset size — this covers the vast majority of production ML use cases. Use IterableDataset when data is a live stream (Kafka, sockets, real-time sensors) where indexing is genuinely impossible or when the dataset is so large that building a complete index is impractical. If you choose IterableDataset with num_workers > 0, you must implement worker sharding using get_worker_info() inside __iter__ — otherwise each sample is loaded num_workers times.
Production Insight
IterableDataset with num_workers > 0 loads every sample num_workers times unless you shard the stream per worker using get_worker_info() — this doubles or quadruples I/O load silently.
Global shuffling is impossible with IterableDataset — only buffer-based local shuffling is available, which leaves the model exposed to distribution bias in ordered streams.
In 2026, streaming ML pipelines more often use specialised libraries (Mosaic Streaming, WebDataset) rather than raw IterableDataset for very large scale — but understanding IterableDataset is the prerequisite.
Rule: use Map-style datasets whenever data can be indexed. Reserve IterableDataset for genuinely streaming or unindexable data sources.
Key Takeaway
IterableDataset yields samples sequentially — no __len__, no __getitem__, no random access. Shuffling is limited to a buffer of recent samples, which means the model sees an approximately ordered distribution rather than a globally shuffled one. With num_workers > 0, implement get_worker_info() sharding inside __iter__ or every sample gets loaded num_workers times. Use Map-style datasets whenever the data can be indexed; reach for IterableDataset only when it genuinely cannot.
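When the stream has no known length, range-based partitioning is not available. One common alternative is round-robin sharding, sketched below under assumptions: shard_stream is a hypothetical helper, and in a real __iter__ the worker_id and num_workers values would come from get_worker_info().

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_stream(stream: Iterable[T], worker_id: int, num_workers: int) -> Iterator[T]:
    """Keep every num_workers-th item, starting at offset worker_id."""
    return islice(stream, worker_id, None, num_workers)


# Stand-in for a stream of unknown length; three simulated workers.
events = list(range(10))
shards = [list(shard_stream(iter(events), w, 3)) for w in range(3)]
# Worker 0 keeps items 0, 3, 6, 9; worker 1 keeps 1, 4, 7; worker 2 keeps 2, 5, 8.
```

Each worker still reads the whole stream and discards the items that belong to other workers, so this removes duplicate samples from training but not duplicate I/O. For a source with native partitions, assigning whole partitions to workers avoids that cost.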
● Production incident · POST-MORTEM · severity: high

Docker container crashes with Bus error when num_workers > 0

Symptom
Training starts normally and runs for 2–3 epochs before dying with a Bus error. The crash is intermittent — sometimes it survives 10 epochs before failing. No Python traceback is produced, which makes it look like a hardware or OS issue rather than a configuration problem.
Assumption
The dataset has corrupted files or the host machine has faulty RAM. The team spent two days running memory diagnostics and re-validating the dataset before looking at the container configuration.
Root cause
PyTorch's multi-process DataLoader uses shared memory at /dev/shm to transfer tensors between worker processes and the main process. Docker's default shared memory allocation is 64MB — a number that made sense for containerised web services but is completely inadequate for ML training. With num_workers=4 and batch_size=64 on a dataset of any real size, the shared memory segment fills up within a few epochs. When a worker tries to write to a full shared memory segment, the OS sends SIGBUS. There is no Python-level exception because the failure happens at the OS level, below the Python runtime.
Fix
Added --shm-size=2g to the docker run command. Added a pre-training health check that reads /dev/shm free space and exits with a clear error message if it is below 1GB. Documented the --shm-size requirement in the Dockerfile as a comment and in the repository's README under 'Running in Docker'. Added the flag to the CI/CD job definition so it can never be accidentally dropped.
Key lesson
  • Docker's default 64MB shared memory is too small for PyTorch multi-process DataLoader — this is not a corner case, it affects every real training job
  • Always run Docker containers with --shm-size=2g or larger when num_workers > 0, and encode this in your CI job definition so it cannot be silently removed
  • Bus error crashes with no Python traceback are the unmistakable signature of shared memory exhaustion — do not waste time on hardware diagnostics before checking /dev/shm
  • Test training inside Docker before shipping to production — local and container environments differ in ways that only surface under sustained load
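The pre-training health check described in the fix could look roughly like this. It is a sketch under assumptions: assert_shm_free is a hypothetical name, and the 1GB default mirrors the threshold mentioned in the post-mortem rather than a universal constant.

```python
import os
import shutil


def assert_shm_free(min_free_bytes: int = 1 << 30, path: str = "/dev/shm") -> int:
    """Return free bytes at `path`, or raise with a readable message."""
    if not os.path.isdir(path):
        raise RuntimeError(f"{path} does not exist; is this a Linux container?")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        raise RuntimeError(
            f"Only {free / 2**20:.0f} MiB free at {path}, "
            f"need at least {min_free_bytes / 2**20:.0f} MiB. "
            "Restart the container with a larger --shm-size."
        )
    return free


# Typical use: call assert_shm_free() at the top of the training entry
# point, before the first DataLoader is iterated.
```

Failing fast with a message that names --shm-size turns a traceback-free SIGBUS several epochs in into an error at startup.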
Production debug guide · Common symptoms when the data pipeline goes wrong · 5 entries
Symptom · 01
Training is 3–5x slower than expected, GPU utilisation is below 50%
Fix
Check if num_workers=0. Increase to num_workers=4 and add pin_memory=True. Profile with torch.profiler to confirm data loading is the bottleneck and not something in __getitem__ itself — sometimes the bottleneck is a slow transform, not the I/O.
Symptom · 02
Bus error (SIGBUS) crash with no Python traceback inside Docker
Fix
Increase Docker shared memory: docker run --shm-size=2g. Verify with df -h /dev/shm inside the container before starting training. Add a pre-flight check to your training entry point that asserts free shared memory exceeds a minimum threshold.
Symptom · 03
RuntimeError: DataLoader worker is killed by signal
Fix
Check for OOM in worker processes — workers are separate processes and their memory usage is not reflected in the main process metrics. Reduce num_workers or batch_size. Check if __getitem__ loads entire files into memory rather than streaming or memory-mapping them.
Symptom · 04
DataLoader hangs indefinitely — no error, no progress
Fix
Check for unpickleable objects in the Dataset. Python's multiprocessing pickles the Dataset to send it to each worker — open file handles, database connections, and lambda functions all fail silently here. Use worker_init_fn for any per-worker initialisation that cannot be pickled.
Symptom · 05
Different workers produce identical random augmentations across the same epoch
Fix
Seed every random number generator per worker using worker_init_fn. PyTorch already gives each worker its own torch seed (base_seed + worker_id), but other libraries such as NumPy can start every worker from the same inherited state, so augmentations built on them come out identical across workers. That reduces effective data diversity and can silently hurt generalisation.
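The per-worker seeding fix for Symptom 05 follows the recipe in PyTorch's reproducibility notes. The sketch below assumes NumPy is installed; make_loader, its batch size, and its worker count are hypothetical choices for illustration.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


def seed_worker(worker_id: int) -> None:
    """Derive NumPy and `random` seeds from this worker's PyTorch seed."""
    # Inside a worker, torch.initial_seed() is already base_seed + worker_id.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(dataset: Dataset, seed: int = 0) -> DataLoader:
    """Build a DataLoader whose shuffling and worker seeds are reproducible."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=32,      # hypothetical value
        num_workers=4,      # hypothetical value
        shuffle=True,
        worker_init_fn=seed_worker,
        generator=generator,
    )
```

Passing the generator fixes the base seed, so each worker gets a distinct but repeatable seed from run to run.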
★ DataLoader Debug Cheat Sheet · Quick commands to diagnose data pipeline issues
Training is slow, GPU utilisation is low
Immediate action
Profile the gap between data loading time and GPU compute time
Commands
nvidia-smi -l 1 # watch GPU utilisation in real time — below 80% means starvation
python -c "import torch; p = torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA]); print('profiler ready')"
Fix now
Set num_workers=4 and pin_memory=True — this is the fix for the vast majority of GPU starvation cases
Bus error crash inside Docker container
Immediate action
Check shared memory allocation and current usage before touching anything else
Commands
docker exec <container> df -h /dev/shm
docker inspect <container> | grep ShmSize
Fix now
Restart the container with --shm-size=2g and add this flag to your docker run script or compose file permanently
DataLoader hangs with no error after a few batches
Immediate action
Test whether the Dataset itself can be pickled — if it cannot, workers will hang silently
Commands
import pickle; pickle.dumps(dataset) # if this raises, you have an unpickleable object
strace -p <worker_pid> # check for blocked syscalls in a worker process
Fix now
Move file handles and DB connections out of __init__ and into __getitem__, or initialise them in worker_init_fn
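For the first cheat sheet entry, a rough way to quantify "the gap between data loading time and GPU compute time" is to time how long the loop blocks waiting for the next batch. This is a sketch, not a replacement for torch.profiler: data_wait_fraction is a hypothetical helper, and train_step stands for whatever your loop does with a batch.

```python
import time
from typing import Any, Callable, Iterable


def data_wait_fraction(loader: Iterable[Any],
                       train_step: Callable[[Any], None],
                       max_batches: int = 50) -> float:
    """Fraction of wall-clock time spent waiting for the next batch."""
    waited = 0.0
    start = time.perf_counter()
    iterator = iter(loader)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            break
        waited += time.perf_counter() - t0
        train_step(batch)
    total = time.perf_counter() - start
    return waited / total if total > 0 else 0.0
```

A value close to 1.0 points at the data pipeline; a value close to 0.0 points at the model. CUDA kernels launch asynchronously, so for a GPU step the numbers are only meaningful if train_step ends with a synchronisation such as torch.cuda.synchronize().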
Data Loading Approaches Compared
| Feature | Standard Python List / Loop | PyTorch DataLoader / Dataset |
| --- | --- | --- |
| Memory Usage | High: entire dataset loaded into RAM before training starts | Low: lazy loading per sample in __getitem__, only one batch in memory at a time |
| Concurrency | Single-threaded: data loading blocks the main thread and therefore the GPU | Multi-process via num_workers: true parallelism that bypasses Python's GIL |
| Batching | Manual list slicing: you write the indexing logic and handle edge cases like the last batch | Automatic via batch_size: DataLoader handles indexing, collation, and drop_last |
| Data Shuffling | Manual random.shuffle(): must remember to call it every epoch and it operates on the full list in memory | Built-in per-epoch shuffling via shuffle=True: operates on indices, not the data itself |
| GPU Integration | Manual .to(device) on every tensor: no transfer optimisation | Optimised via pin_memory=True and non_blocking=True: DMA transfer without extra copy step |

Key takeaways

1
Dataset and DataLoader have deliberately separate responsibilities
Dataset knows how to access one sample, DataLoader knows how to batch, shuffle, and parallelise. Understanding this separation makes every configuration decision obvious.
2
num_workers=0 is the default and the most common performance mistake
it serialises data loading on the main thread and leaves the GPU idle between batches. Always override it for GPU training.
3
In Docker, --shm-size=2g is a required flag when num_workers > 0, not an optional optimisation. The default 64MB causes Bus error crashes and leaves no Python traceback to diagnose from.
4
Load only metadata in __init__ and actual data in __getitem__
this is what makes lazy loading work. Loading data in __init__ converts a scalable pipeline into an OOM crash before training starts.
5
collate_fn runs once per batch, inside the workers when num_workers > 0
keep it to padding and reshaping only, and move per-sample transforms into __getitem__ where they are easier to test and reuse.
6
IterableDataset cannot shuffle globally and requires per-worker stream sharding via get_worker_info() when num_workers > 0. Use Map-style datasets whenever indexing is possible.

Common mistakes to avoid

5 patterns
×

Using a custom Dataset when TensorDataset suffices

Symptom
Unnecessary boilerplate that adds maintenance burden. The custom class has __len__ and __getitem__ that do nothing more than index into an in-memory tensor — exactly what TensorDataset already does.
Fix
If your data fits in RAM and is already a tensor, use torch.utils.data.TensorDataset(features, labels) directly. It provides __len__ and __getitem__ for free. Only write a custom Dataset when you need lazy loading from disk, a database, or object storage — not for data that already lives in memory.
×

Passing unpickleable objects to Dataset when num_workers > 0

Symptom
DataLoader hangs silently for 30–60 seconds and then crashes, or the worker process exits with no useful traceback. Open file handles, database connections, and lambda functions are the most common culprits.
Fix
Initialise file handles and DB connections inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker process. Never store open handles as instance variables when num_workers > 0. Test pickleability before a long training run: import pickle; pickle.dumps(dataset) should not raise.
×

Ignoring error handling in __getitem__ for corrupted files

Symptom
Training crashes mid-epoch with FileNotFoundError, PIL.UnidentifiedImageError, or a silent tensor of zeros where a valid sample should be. The entire epoch is lost and the crash is non-deterministic — it only happens when that specific corrupted sample is drawn.
Fix
Add try/except in __getitem__. On exception, log the bad file path and return a placeholder sample or the nearest valid neighbour. Never silently return zeros without logging — you will not know how much of your dataset is corrupted.
×

Setting num_workers=0 for GPU training

Symptom
GPU utilisation stays below 50% throughout training. The GPU is idle for roughly as long as it is computing — data loading on the main thread blocks the entire pipeline. Training takes 3–5x longer than it should.
Fix
Set num_workers=4 as a starting point for single-GPU training. Add pin_memory=True for CUDA. Add persistent_workers=True to avoid the 5–10 second worker restart overhead at the start of every epoch. Profile with nvidia-smi -l 1 to confirm GPU utilisation exceeds 85% after the change.
×

Loading data in __init__ instead of __getitem__

Symptom
Dataset initialisation takes minutes rather than milliseconds. RAM usage climbs to system limits before the first training batch is processed. OOM crashes occur before training even starts.
Fix
Store only metadata (file paths, labels, IDs) in __init__ — this should take milliseconds regardless of dataset size. Load actual data lazily in __getitem__, one sample at a time. If the dataset truly fits in RAM and pre-loading is intentional, use TensorDataset instead of a custom class.
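Two of the fixes above (metadata only in __init__, no open handles stored before workers start) combine naturally. The class below is a sketch under assumptions: LazyLineDataset is a hypothetical name, it treats one line of a text file as one sample, and it relies only on the __len__ and __getitem__ contract of a map-style dataset.

```python
import pickle
from typing import BinaryIO, List, Optional


class LazyLineDataset:
    """Map-style dataset over a text file, one sample per line."""

    def __init__(self, path: str):
        self.path = path
        self.offsets: List[int] = []             # metadata only: byte offsets
        self._handle: Optional[BinaryIO] = None  # opened lazily, per process
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                if not f.readline():
                    break
                self.offsets.append(pos)

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, idx: int) -> str:
        if self._handle is None:
            # First access in this process: open a private handle.
            self._handle = open(self.path, "rb")
        self._handle.seek(self.offsets[idx])
        return self._handle.readline().decode("utf-8").rstrip("\r\n")

    def __getstate__(self) -> dict:
        # Drop the handle so the dataset can be pickled for worker startup.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state
```

Because the handle is created on first access, each worker opens its own file descriptor as long as the main process has not indexed into the dataset before the workers start. Running pickle.dumps(dataset) before a long job remains a cheap sanity check.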
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
How does the DataLoader utilise Python's multi-processing to bypass the ...
Q02 · SENIOR
Explain the Producer-Consumer pattern as it applies to the relationship ...
Q03 · SENIOR
What is the difference between a Map-style and an Iterable-style Dataset...
Q04 · SENIOR
How would you implement a custom collate_fn to handle a dataset where sa...
Q05 · SENIOR
Describe the purpose of pin_memory. How does it interact with pageable v...
Q01 of 05 · SENIOR

How does the DataLoader utilise Python's multi-processing to bypass the Global Interpreter Lock (GIL)?

ANSWER
Python's GIL prevents true parallel execution of Python bytecode within a single process. The DataLoader bypasses this by spawning multiple separate processes — not threads — via Python's multiprocessing module. Each worker process loads and transforms data independently in its own Python interpreter with its own GIL, so they run truly in parallel without contending with each other or the main process. Workers write tensors to shared memory at /dev/shm, and the main process reads pre-fetched batches from that shared memory. This achieves genuine parallelism for both I/O-bound operations (reading files from disk) and CPU-bound operations (image decoding, augmentation). The shared memory approach is also why Docker's /dev/shm size matters — it is the actual transfer medium, not a socket or pipe.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between Dataset and DataLoader in PyTorch?
02
How many num_workers should I use?
03
When should I use TensorDataset vs a custom Dataset?
04
Why does my DataLoader hang with num_workers > 0?
05
What is persistent_workers and when should I use it?