
PyTorch DataLoader and Datasets

📍 Part of: PyTorch → Topic 6 of 7
A comprehensive guide to PyTorch DataLoader and Datasets — learn to manage data pipelines, implement custom datasets, and optimize batch loading for ML models.
⚙️ Intermediate — basic ML / AI knowledge assumed
In this tutorial, you'll learn
  • Dataset and DataLoader have deliberately separate responsibilities — Dataset knows how to access one sample, DataLoader knows how to batch, shuffle, and parallelise. Understanding this separation makes every configuration decision obvious.
  • num_workers=0 is the default and the most common performance mistake — it serialises data loading on the main thread and leaves the GPU idle between batches. Always override it for GPU training.
  • In Docker, --shm-size=2g is a required flag when num_workers > 0, not an optional optimisation. The default 64MB causes Bus error crashes and leaves no Python traceback to diagnose from.
Quick Answer
  • Dataset defines how to access a single sample — implement __len__ and __getitem__ for lazy loading
  • DataLoader wraps a Dataset to provide batching, shuffling, and multi-process parallel loading
  • pin_memory=True speeds up CPU-to-GPU transfers by using page-locked host memory
  • num_workers > 0 parallelizes data loading on CPU — the #1 fix for GPU starvation
  • The biggest production mistake is num_workers=0, which serializes loading and slows training 50%+
  • In Docker, --shm-size must be increased when num_workers > 0 or you get Bus error crashes
🚨 START HERE
DataLoader Debug Cheat Sheet
Quick commands to diagnose data pipeline issues
🟠 Training is slow, GPU utilisation is low
Immediate Action: Profile the gap between data loading time and GPU compute time
Commands
nvidia-smi -l 1 # watch GPU utilisation in real time — below 80% means starvation
python -c "import torch; p = torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA]); print('profiler ready')"
Fix Now: Set num_workers=4 and pin_memory=True — this is the fix for the vast majority of GPU starvation cases
🟡 Bus error crash inside Docker container
Immediate Action: Check shared memory allocation and current usage before touching anything else
Commands
docker exec <container> df -h /dev/shm
docker inspect <container> | grep ShmSize
Fix Now: Restart the container with --shm-size=2g and add this flag to your docker run script or compose file permanently
🟡 DataLoader hangs with no error after a few batches
Immediate Action: Test whether the Dataset itself can be pickled — if it cannot, workers will hang silently
Commands
import pickle; pickle.dumps(dataset) # if this raises, you have an unpickleable object
strace -p <worker_pid> # check for blocked syscalls in a worker process
Fix Now: Move file handles and DB connections out of __init__ and into __getitem__, or initialise them in worker_init_fn
Production Incident: Docker container crashes with Bus error when num_workers > 0
A training job worked fine locally with num_workers=4 but crashed with SIGBUS (Bus error) inside a Docker container after 2–3 epochs.
Symptom: Training starts normally and runs for 2–3 epochs before dying with a Bus error. The crash is intermittent — sometimes it survives 10 epochs before failing. No Python traceback is produced, which makes it look like a hardware or OS issue rather than a configuration problem.
Assumption: The dataset has corrupted files or the host machine has faulty RAM. The team spent two days running memory diagnostics and re-validating the dataset before looking at the container configuration.
Root cause: PyTorch's multi-process DataLoader uses shared memory at /dev/shm to transfer tensors between worker processes and the main process. Docker's default shared memory allocation is 64MB — a number that made sense for containerised web services but is completely inadequate for ML training. With num_workers=4 and batch_size=64 on a dataset of any real size, the shared memory segment fills up within a few epochs. When a worker tries to write to a full shared memory segment, the OS sends SIGBUS. There is no Python-level exception because the failure happens at the OS level, below the Python runtime.
Fix: Added --shm-size=2g to the docker run command. Added a pre-training health check that reads /dev/shm free space and exits with a clear error message if it is below 1GB. Documented the --shm-size requirement in the Dockerfile as a comment and in the repository's README under 'Running in Docker'. Added the flag to the CI/CD job definition so it can never be accidentally dropped.
Key Lesson
  • Docker's default 64MB shared memory is too small for PyTorch multi-process DataLoader — this is not a corner case, it affects every real training job
  • Always run Docker containers with --shm-size=2g or larger when num_workers > 0, and encode this in your CI job definition so it cannot be silently removed
  • Bus error crashes with no Python traceback are the unmistakable signature of shared memory exhaustion — do not waste time on hardware diagnostics before checking /dev/shm
  • Test training inside Docker before shipping to production — local and container environments differ in ways that only surface under sustained load
Production Debug Guide
Common symptoms when the data pipeline goes wrong
Training is 3–5x slower than expected, GPU utilisation is below 50%
Fix: Check if num_workers=0. Increase to num_workers=4 and add pin_memory=True. Profile with torch.profiler to confirm data loading is the bottleneck and not something in __getitem__ itself — sometimes the bottleneck is a slow transform, not the I/O.
Bus error (SIGBUS) crash with no Python traceback inside Docker
Fix: Increase Docker shared memory: docker run --shm-size=2g. Verify with df -h /dev/shm inside the container before starting training. Add a pre-flight check to your training entry point that asserts free shared memory exceeds a minimum threshold.
RuntimeError: DataLoader worker is killed by signal
Fix: Check for OOM in worker processes — workers are separate processes and their memory usage is not reflected in the main process metrics. Reduce num_workers or batch_size. Check if __getitem__ loads entire files into memory rather than streaming or memory-mapping them.
DataLoader hangs indefinitely — no error, no progress
Fix: Check for unpickleable objects in the Dataset. Python's multiprocessing pickles the Dataset to send it to each worker — open file handles, database connections, and lambda functions all fail silently here. Use worker_init_fn for any per-worker initialisation that cannot be pickled.
Different workers produce identical random augmentations across the same epoch
Fix: Set a unique random seed per worker using worker_init_fn. Without this, all workers inherit the same base seed from the main process and generate identical augmentation sequences, which reduces effective data diversity and can silently hurt generalisation.

PyTorch DataLoader and Datasets decouple data storage from batching logic, enabling scalable pipelines that keep GPUs fully utilised. The Dataset class abstracts how individual samples are accessed — one at a time, lazily, from disk or a database. The DataLoader handles batching, shuffling, and multi-process loading on top of whatever Dataset you hand it.

The core problem these tools solve: training on datasets that do not fit in memory while keeping the GPU fed continuously. If data loading is slower than GPU computation, the GPU sits idle between batches — this is called data starvation and it is one of the most common reasons a training run is 3x slower than it should be. The DataLoader solves this by pre-fetching batches in parallel worker processes while the GPU processes the current batch. That overlap is the entire point.

The architectural separation is deliberate and worth internalising early: Dataset knows how to access one sample. DataLoader knows how to batch, shuffle, and parallelise. This means you can swap your data source (disk, SQL, S3, Kafka) without touching the DataLoader, and you can tune the DataLoader's parallelism without touching the Dataset. Each side has one job.

The most common production failure I see in 2026 is the same one I saw in 2022: developers set num_workers=0 during prototyping because it is simpler, everything works, and then they deploy to a real dataset and discover training is 3–5x slower than it needs to be because data loading is serialised on the main thread. The fix is always num_workers >= 1 with pin_memory=True for GPU training — and documenting that requirement so it does not get reverted in a future PR.

What Is PyTorch DataLoader and Datasets and Why Does It Exist?

PyTorch DataLoader and Datasets exist to solve a single concrete problem: how do you train on data that is too large to fit in memory, while keeping a GPU that costs thousands of dollars per hour fully utilised?

The Dataset class — specifically the Map-style variant — requires implementing two methods: __len__ (how many samples exist) and __getitem__ (fetch one sample by index). That is the entire contract. The Dataset knows nothing about batching, shuffling, or parallelism. It just answers 'give me sample 4,217' as fast as it can.

The DataLoader wraps that Dataset and adds everything else: it selects a batch of indices (optionally shuffled), hands those indices to worker processes that call __getitem__ in parallel, collates the results into a batch tensor, and optionally pre-pins that tensor in page-locked memory for faster GPU transfer. The training loop then pulls pre-fetched batches from a queue without waiting.

The performance insight that changes how you think about this: with num_workers=4 and pin_memory=True, the DataLoader is pre-fetching batch N+1 and N+2 while the GPU is still processing batch N. That pipeline overlap is what keeps GPU utilisation above 90%. Without it — with num_workers=0 — every batch is loaded synchronously on the main thread after the GPU finishes the previous one. The GPU sits idle for however long loading takes. On a dataset of real images with augmentations, that idle time can represent 60–70% of wall-clock training time.

As of 2026, with models being trained on increasingly large datasets and GPUs being increasingly expensive, getting this right is not an optimisation — it is table stakes.

io/thecodeforge/ml/forge_dataset.py · PYTHON
import torch
from torch.utils.data import Dataset, DataLoader

# A minimal but complete custom Dataset implementation
# This is the pattern you will replicate for every new data source
class ForgeProjectDataset(Dataset):
    def __init__(self, data_list: list, labels: list):
        # Store only metadata in __init__ — never load actual data here
        # If you load data in __init__, it all lands in RAM before training starts
        self.data = data_list
        self.labels = labels

    def __len__(self) -> int:
        # DataLoader uses this to know how many batches constitute one epoch
        return len(self.data)

    def __getitem__(self, idx: int):
        # This is called once per sample, in parallel across num_workers processes
        # Keep it fast: one file read, one transform, return one sample
        sample = torch.tensor(self.data[idx], dtype=torch.float32)
        label  = torch.tensor(self.labels[idx], dtype=torch.long)
        return sample, label


# Minimal working example — four samples, two features each
raw_data   = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
raw_labels = [0, 1, 0, 1]

forge_ds = ForgeProjectDataset(raw_data, raw_labels)

# Production-grade DataLoader configuration for GPU training
forge_loader = DataLoader(
    dataset=forge_ds,
    batch_size=2,
    shuffle=True,           # Reshuffle at every epoch for better generalisation
    num_workers=2,          # Two CPU processes load data in parallel
    pin_memory=True,        # Pre-pin batches in page-locked memory for faster GPU transfer
    persistent_workers=True # Keep workers alive between epochs — avoids 5-10s restart cost
)

# Verify the DataLoader works before starting a long training run
for batch_idx, (samples, labels) in enumerate(forge_loader):
    print(f"Batch {batch_idx}: samples shape {samples.shape}, labels {labels}")
    if batch_idx >= 1:
        break  # Just checking the first two batches
▶ Output
Batch 0: samples shape torch.Size([2, 2]), labels tensor([1, 0])
Batch 1: samples shape torch.Size([2, 2]), labels tensor([0, 1])
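To confirm that data loading (rather than the model) is the bottleneck before reaching for num_workers, one simple single-process measurement is to time how long the training loop waits for each batch versus how long it spends computing. A minimal sketch — SlowDataset and the sleep durations are illustrative stand-ins for real disk I/O and GPU work, not part of the example above:

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical dataset whose __getitem__ simulates slow I/O with a sleep
class SlowDataset(Dataset):
    def __init__(self, n: int = 32):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, idx: int):
        time.sleep(0.005)  # stand-in for a disk read plus a transform
        return torch.randn(8), torch.tensor(idx % 2)

def measure_loader_wait(loader: DataLoader, compute_time: float = 0.002):
    """Return (seconds spent waiting on data, seconds spent 'computing')
    for one pass over the loader."""
    data_wait = compute = 0.0
    t0 = time.perf_counter()
    for samples, labels in loader:
        t1 = time.perf_counter()
        data_wait += t1 - t0      # time blocked waiting for the next batch
        time.sleep(compute_time)  # stand-in for forward/backward on GPU
        t0 = time.perf_counter()
        compute += t0 - t1
    return data_wait, compute

loader = DataLoader(SlowDataset(), batch_size=8, num_workers=0)
wait, comp = measure_loader_wait(loader)
print(f"data wait {wait:.3f}s vs compute {comp:.3f}s "
      f"({100 * wait / (wait + comp):.0f}% of the loop spent waiting)")
```

With num_workers=0 the wait fraction dominates; rerunning the same measurement with num_workers > 0 (inside an `if __name__ == "__main__":` guard) shows the wait shrinking as workers pre-fetch.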
Mental Model
The Producer-Consumer Pattern
The DataLoader is a producer that prepares batches on CPU while the training loop is a consumer that processes them on GPU — the overlap between these two is where all the performance comes from.
  • Dataset defines how to access ONE sample — it knows nothing about batching or parallelism and should not
  • DataLoader wraps the Dataset and adds batching, shuffling, and multi-process loading on top
  • Worker processes (producers) load and transform data in parallel on CPU cores while the GPU works
  • The training loop (consumer) pulls pre-fetched batches from a queue — ideally it never waits
  • pin_memory=True pre-pins batches to page-locked memory so DMA transfers to GPU start without an extra copy step
📊 Production Insight
num_workers=0 is the default but it serialises data loading on the main thread — GPU sits idle between every batch.
With num_workers=4 and pin_memory=True on a typical image dataset, GPU utilisation moves from 40–50% to above 90%.
In 2026 with A100 and H100 pricing, that difference in utilisation is real money on every training run.
Rule: always set num_workers >= 1 for GPU training, pin_memory=True for CUDA, and persistent_workers=True for multi-epoch runs.
🎯 Key Takeaway
Dataset defines how to access one sample; DataLoader adds batching, shuffling, and parallelism on top. The separation is deliberate — it lets you swap data sources without touching the pipeline, and tune parallelism without touching the data logic. Always set num_workers >= 1 and pin_memory=True for GPU training; leaving both at their defaults is the most common reason training is slower than it should be.
DataLoader Configuration Decision
If: Small dataset fits entirely in RAM and is already a tensor
Use: torch.utils.data.TensorDataset — no custom class needed, zero boilerplate, and it is just as fast
If: Data is on disk (images, audio files, parquet shards) and needs lazy loading
Use: Implement a custom Dataset with __getitem__ loading one sample at a time from disk — never pre-load in __init__
If: Data is in a database or streaming source with no natural index
Use: An IterableDataset — it yields samples sequentially without needing __len__ or random access
If: Training on GPU with any custom Dataset
Use: Set num_workers=4, pin_memory=True, persistent_workers=True, and use non_blocking=True in .to(device) inside the training loop
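The IterableDataset branch above can be sketched as follows — stream_records and StreamDataset are hypothetical stand-ins for a real database cursor or Kafka consumer:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

# Hypothetical streaming source: stands in for a DB cursor or Kafka consumer
# that yields records sequentially, with no random access and no known length
def stream_records(n: int = 10):
    for i in range(n):
        yield {"features": [float(i), float(i) * 2], "label": i % 2}

class StreamDataset(IterableDataset):
    def __init__(self, source_factory):
        # Store the factory, not an open connection — a plain function pickles
        # cleanly when the DataLoader forks worker processes
        self.source_factory = source_factory

    def __iter__(self):
        for record in self.source_factory():
            yield (torch.tensor(record["features"], dtype=torch.float32),
                   torch.tensor(record["label"], dtype=torch.long))

loader = DataLoader(StreamDataset(stream_records), batch_size=4)
batches = list(loader)
print(len(batches), batches[0][0].shape)  # → 3 torch.Size([4, 2])
```

One caveat worth knowing: with num_workers > 0, every worker runs the full iterator, so samples are duplicated unless you shard the stream per worker using torch.utils.data.get_worker_info inside __iter__.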

Enterprise Integration: SQL-Backed Datasets

In real production environments, your training data rarely lives in a flat folder of files. It lives in a database — with labels, metadata, train/val/test splits, and versioning all managed in SQL. Implementing a Dataset that queries a SQL backend is one of the more underrated patterns in production ML engineering.

The approach: in __init__, run a single SQL query to fetch metadata only — sample IDs, file paths on disk or object storage, and labels. Store that metadata in memory as a list or DataFrame. In __getitem__, use the file path from metadata to load the actual binary data — the image, audio file, or feature array — from disk or S3. This keeps memory usage proportional to the number of samples (a few bytes per row of metadata), not the size of the data (potentially gigabytes).

The production benefit that makes this pattern worth the setup: when you add new training data, you insert a row into the SQL table and drop the corresponding file on disk. The next training run picks it up automatically via the __init__ query. There is no CSV file to regenerate, no manifest to sync, and no risk of the file list drifting from the actual filesystem state. I have seen teams spend days debugging training regressions that turned out to be a stale CSV pointing to deleted files — this pattern eliminates that entire class of issue.

One thing to watch: do not query SQL inside __getitem__. SQL connections are not thread-safe and cannot be pickled for multi-process workers. Fetch all metadata once in __init__ and do all disk or object-storage I/O in __getitem__.

io/thecodeforge/db/fetch_samples.sql · SQL
-- Fetch sample metadata for Dataset __init__
-- We fetch IDs, paths, and labels here — NOT the binary data
-- Binary data is loaded lazily in __getitem__ using the file_path
-- This query runs once at the start of training, not per batch

SELECT
    sample_id,
    file_path,      -- path to the file on NVMe SSD or S3
    label_id,
    split_tag       -- 'train', 'val', or 'test'
FROM io.thecodeforge.training_data
WHERE project_tag = 'vision_v2'
  AND split_tag   = 'train'
  AND is_verified = TRUE   -- exclude samples flagged as corrupted during QA
ORDER BY sample_id ASC;

-- Expected result: one lightweight metadata row per sample
-- Actual image binary data stays on disk until __getitem__ loads it
▶ Output
Returns metadata rows — sample_id, file_path, label_id for each training sample.
Binary data is not transferred; only enough information to load it on demand.
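A minimal Python sketch of the same pattern, using an in-memory sqlite3 database and a dict standing in for the filesystem. The schema is a simplified version of the SQL query above, and all names (SqlBackedDataset, FAKE_DISK) are illustrative:

```python
import sqlite3
import torch
from torch.utils.data import Dataset

# Stand-in for the NVMe disk or object store: maps file_path -> raw features.
# In a real pipeline, __getitem__ would read the file at this path instead.
FAKE_DISK = {f"/data/sample_{i}.pt": [float(i), float(i) + 1.0] for i in range(4)}

class SqlBackedDataset(Dataset):
    def __init__(self, db_path: str = ":memory:"):
        # One metadata query at init — IDs, paths, and labels only, never blobs.
        # The connection is closed before training starts, so no unpicklable
        # object is stored on the instance for worker processes to trip over.
        conn = sqlite3.connect(db_path)
        conn.executescript(
            "CREATE TABLE IF NOT EXISTS training_data "
            "(sample_id INTEGER, file_path TEXT, label_id INTEGER, split_tag TEXT);"
        )
        conn.executemany(  # seed demo rows; a real table is populated elsewhere
            "INSERT INTO training_data VALUES (?, ?, ?, ?)",
            [(i, f"/data/sample_{i}.pt", i % 2, "train") for i in range(4)],
        )
        self.rows = conn.execute(
            "SELECT file_path, label_id FROM training_data "
            "WHERE split_tag = 'train' ORDER BY sample_id"
        ).fetchall()
        conn.close()

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        file_path, label_id = self.rows[idx]
        features = FAKE_DISK[file_path]  # real code: load the file lazily here
        return (torch.tensor(features, dtype=torch.float32),
                torch.tensor(label_id, dtype=torch.long))

ds = SqlBackedDataset()
print(len(ds), ds[0])
```

Memory stays proportional to the metadata (one tuple per sample), and the binary payload is only touched when a worker asks for that index.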
🔥 Store Paths in SQL, Not Blobs
Only store file paths or object-storage keys in your SQL table. Loading actual binary blobs from SQL during __getitem__ creates a per-sample database round-trip under multi-process load — it will saturate your database connection pool and become the bottleneck faster than you expect. Keep binary data on a fast NVMe SSD or a distributed object store like S3 or GCS, and use SQL only for the lightweight metadata that tells your Dataset where to find it.
📊 Production Insight
Store file paths in SQL, not binary blobs — per-sample SQL reads under multi-process load will saturate your DB connection pool.
Load binary data from disk or object storage in __getitem__, not __init__ — lazy loading keeps memory proportional to batch size, not dataset size.
SQL connections cannot be pickled for worker processes — open them inside __getitem__ or use worker_init_fn, never store them as instance variables.
Rule: SQL for metadata and versioning, disk or S3 for binary data, Dataset for lazy access, DataLoader for batching and parallelism.
🎯 Key Takeaway
SQL stores metadata (paths, labels, splits) — disk or object storage holds binary data — Dataset loads lazily one sample at a time. This pattern scales to tens of millions of samples without loading anything into RAM upfront. New data is picked up automatically on the next training run by rerunning the __init__ query — no CSV drift, no stale manifests.

Containerised Data Pipelines with Docker

Wrapping your training environment in Docker is the standard way to ensure the data pipeline behaves identically across a developer's laptop, a CI server, and a production GPU cluster. It also surfaces the most common PyTorch DataLoader configuration mistake before it costs you a four-hour training run.

The critical Docker configuration that almost everyone gets wrong the first time: when num_workers > 0, PyTorch uses shared memory at /dev/shm to transfer tensors between worker processes and the main process. Docker's default shared memory allocation is 64MB — a sensible default for containerised web services that never heard of PyTorch. For a training job with num_workers=4 and any real batch size, that 64MB fills up within a few epochs and the container dies with a Bus error and no Python traceback. The fix is one flag: --shm-size=2g.

The deployment checklist I use for every new training container: set --shm-size=2g or larger; mount the data directory as a Docker volume rather than copying it into the image (datasets are too large for image layers and change too frequently); pin the PyTorch version explicitly rather than using pytorch/pytorch:latest (latest changes under you in ways that are hard to reproduce); set num_workers based on the CPU cores allocated to the container, not the host machine's total CPU count; and add a pre-flight health check that verifies /dev/shm has enough free space before the training job starts.

The --shm-size flag also belongs in your docker-compose.yml, your Kubernetes pod spec under resources, and your CI job definition — anywhere the container is launched. If it lives in only one place, it will be dropped in a refactor and you will spend an afternoon diagnosing a Bus error that you already fixed six months ago.

Dockerfile · DOCKERFILE
# Pin a specific PyTorch release — 'latest' changes under you
# and reproducing a training run six months later becomes impossible
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install dependencies before copying source code
# This layer is cached as long as requirements.txt does not change
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# IMPORTANT: This container MUST be started with --shm-size=2g
# PyTorch DataLoader uses /dev/shm to transfer tensors between worker processes
# Docker's default 64MB causes Bus error crashes when num_workers > 0
# Example: docker run --shm-size=2g --gpus all -v /data:/data thecodeforge/training:latest

# Health check: verify shared memory stays sufficient while the container runs
# Docker flags the container unhealthy with a readable message, rather than
# letting training die with a cryptic Bus error mid-epoch
HEALTHCHECK --interval=10s --timeout=5s --retries=1 \
  CMD python -c "import shutil; free = shutil.disk_usage('/dev/shm').free; assert free > 1e9, f'Insufficient /dev/shm: {free/1e6:.0f}MB free, need 1000MB+'"

CMD ["python", "ForgeDataset.py"]
▶ Output
Successfully built image thecodeforge/data-pipeline:latest
Healthcheck configured — the container is flagged unhealthy when /dev/shm has less than 1GB free
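The pre-flight check the checklist calls for can also live in the training entry point itself, so the job exits immediately with a readable message. A sketch — check_shared_memory is a hypothetical name, and the demo runs against the system temp directory so it also works outside Linux:

```python
import shutil
import sys
import tempfile

def check_shared_memory(path: str = "/dev/shm",
                        min_free_bytes: int = 1_000_000_000) -> int:
    """Return free bytes at `path`, exiting with a clear message if it is
    below the threshold. Run this before training starts so shared-memory
    exhaustion is a readable startup error, not a mid-epoch SIGBUS."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        sys.exit(
            f"Insufficient shared memory at {path}: {free / 1e6:.0f}MB free, "
            f"need {min_free_bytes / 1e6:.0f}MB+. "
            "Restart the container with --shm-size=2g or larger."
        )
    return free

if __name__ == "__main__":
    # Demo against the temp dir with a 1-byte threshold; in the training
    # entry point you would call check_shared_memory() with the defaults.
    free = check_shared_memory(tempfile.gettempdir(), min_free_bytes=1)
    print(f"shared memory check passed: {free / 1e6:.0f}MB free")
```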
⚠ The One Docker Flag That Prevents Most PyTorch Crashes
When using num_workers > 0 inside Docker, you must set --shm-size=2g on the docker run command. Docker's default 64MB shared memory is designed for web containers, not ML training. Without this flag, your training job will crash with a Bus error and no Python traceback after a few epochs — and it will look like a hardware or dataset problem rather than a configuration one. Add this flag to your docker run script, your docker-compose.yml, and your CI job definition. Treat it as a required argument, not an optional one.
📊 Production Insight
Docker's default 64MB shared memory causes Bus error crashes under multi-process DataLoader — this is not an edge case, it happens on every real training job.
The flag --shm-size=2g must live in every place the container is launched: docker run, compose file, Kubernetes pod spec, and CI job definition.
In Kubernetes, set the equivalent via a shm volume mount: emptyDir with medium: Memory.
Rule: add a pre-training /dev/shm health check so the failure is a clear error message at startup, not a cryptic crash after two hours of training.
🎯 Key Takeaway
Docker's default 64MB shared memory causes Bus errors with multi-process DataLoader — --shm-size=2g is not optional, it is a required flag for any real training job. Mount data as a volume rather than baking it into the image. In Kubernetes, use an emptyDir shm volume mount with medium: Memory as the equivalent of --shm-size.
Docker Configuration for Data Loading
If: num_workers=0 (single-process loading)
Use: Default 64MB shared memory is sufficient — no --shm-size flag needed, but you are leaving GPU utilisation on the table
If: num_workers > 0 (multi-process loading)
Use: Set --shm-size=2g minimum — increase to 4g+ with many workers or large batches, and verify with df -h /dev/shm
If: Large training dataset (>100GB)
Use: Mount data as a Docker volume, never COPY into the image — images have practical size limits and your dataset changes more often than your code
If: Multiple containers sharing a GPU or running on Kubernetes
Use: Set CUDA_VISIBLE_DEVICES explicitly and limit num_workers per container based on the container's CPU allocation, not the host's total core count
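The Kubernetes equivalent of --shm-size mentioned above can be sketched as a pod spec fragment. The image name and labels are placeholders; emptyDir with medium: Memory mounts a tmpfs over /dev/shm, and sizeLimit plays the role of --shm-size:

```yaml
# Illustrative pod spec fragment — names and image tag are placeholders
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
    - name: trainer
      image: thecodeforge/training:2.3.0   # pin the tag, never :latest
      volumeMounts:
        - name: shm
          mountPath: /dev/shm              # DataLoader workers write tensors here
  volumes:
    - name: shm
      emptyDir:
        medium: Memory                     # tmpfs-backed, replaces the 64MB default
        sizeLimit: 2Gi                     # equivalent of --shm-size=2g
```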

Common Mistakes and How to Avoid Them

Most DataLoader bugs in production fall into a small set of patterns. Knowing them in advance means you spend time training models instead of debugging pipelines.

The performance mistakes: num_workers=0 is the biggest one — it serialises every sample load on the main thread and GPU sits idle while it happens. Loading data in __init__ instead of __getitem__ is the second — it turns a lazy-loading Dataset into a greedy RAM consumer that OOMs before training even starts.

The correctness mistakes: passing unpickleable objects (open file handles, database connections, lambda functions) to a Dataset when num_workers > 0. Python's multiprocessing pickles the Dataset to send it to each worker process. If any attribute cannot be pickled, the worker hangs silently or crashes without a useful traceback. The fix is to initialise those objects inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker process.

The subtle one that costs teams debugging time: forgetting drop_last=True on the training DataLoader. The last batch of an epoch almost always has fewer samples than the configured batch_size. For most loss functions this is harmless, but for BatchNorm it is not — BatchNorm uses batch statistics during training, and a batch of size 1 produces undefined variance. Setting drop_last=True discards the last incomplete batch and ensures consistent batch sizes throughout training. For validation DataLoaders, use drop_last=False — you want to evaluate on every sample, no exceptions.

io/thecodeforge/ml/efficient_loading.py · PYTHON
import torch
from torch.utils.data import DataLoader

# Production-grade DataLoader configuration for GPU training
# Each parameter here solves a specific real problem
loader = DataLoader(
    forge_ds,
    batch_size=32,
    shuffle=True,              # Reshuffle every epoch for better generalisation
    num_workers=4,             # 4 parallel CPU processes — eliminates GPU starvation
    pin_memory=True,           # Pre-pin batches in page-locked memory for faster DMA transfer
    drop_last=True,            # Drop incomplete final batch — prevents BatchNorm issues
    persistent_workers=True,   # Keep workers alive between epochs — avoids 5-10s restart cost
    prefetch_factor=2,         # Each worker pre-fetches 2 batches ahead — reduces wait time
)

# Validation DataLoader — different settings for a reason
val_loader = DataLoader(
    val_ds,
    batch_size=64,             # Larger batch is fine for inference — no gradient storage
    shuffle=False,             # Do NOT shuffle validation — reproducible evaluation
    num_workers=4,
    pin_memory=True,
    drop_last=False,           # Evaluate on EVERY sample — no exceptions
    persistent_workers=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for samples, labels in loader:
    # non_blocking=True overlaps the CPU-to-GPU transfer with other CPU work
    # Only effective when pin_memory=True is also set
    samples = samples.to(device, non_blocking=True)
    labels  = labels.to(device, non_blocking=True)
    # ... training logic ...

# Random seed consistency across workers
# Without this, all workers inherit the same seed and produce identical augmentations
def worker_init_fn(worker_id: int) -> None:
    import numpy as np
    base_seed = torch.initial_seed() % (2 ** 32)
    np.random.seed(base_seed + worker_id)
    torch.manual_seed(base_seed + worker_id)

# Apply it to any DataLoader that uses random augmentations in __getitem__
auged_loader = DataLoader(
    forge_ds,
    batch_size=32,
    num_workers=4,
    worker_init_fn=worker_init_fn,
    pin_memory=True,
)
▶ Output
// High-throughput data pipeline established.
// GPU will receive pre-fetched, pre-pinned batches with consistent random augmentation across workers.
⚠ When Simple Is Better
The most over-engineered DataLoader mistake is building a custom Dataset for data that already lives in memory as tensors. If your entire dataset fits in 10% of available RAM and is already in tensor format, torch.utils.data.TensorDataset(features, labels) gives you __len__ and __getitem__ for free with zero boilerplate. Only write a custom Dataset when you genuinely need lazy loading from disk, a database, or object storage. Start simple and add complexity only when you have a measurable reason to.
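As a concrete contrast with the custom Dataset pattern, the TensorDataset path needs no class at all — a minimal sketch with synthetic data:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# When data already lives in memory as tensors, TensorDataset supplies
# __len__ and __getitem__ for free — no custom class, no boilerplate
features = torch.randn(100, 8)          # 100 samples, 8 features each
labels   = torch.randint(0, 2, (100,))  # binary labels

ds = TensorDataset(features, labels)
loader = DataLoader(ds, batch_size=32, shuffle=True, drop_last=True)

for samples, targets in loader:
    print(samples.shape, targets.shape)  # → torch.Size([32, 8]) torch.Size([32])
    break
```

Because everything is already in RAM, num_workers=0 is the right choice here — spawning workers would only add overhead.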
📊 Production Insight
num_workers=0 is the default and the most common performance mistake — always override it for GPU training.
Loading data in __init__ instead of __getitem__ converts lazy loading into a greedy RAM consumer — OOMs before the first epoch.
drop_last=True on training DataLoaders prevents BatchNorm from receiving a batch of size 1 at the end of an epoch — this is not optional when BatchNorm is in your model.
Rule: load one sample in __getitem__, set num_workers=4, pin_memory=True, drop_last=True for training, and worker_init_fn for any job that uses random augmentations.
🎯 Key Takeaway
num_workers=0 serialises data loading and is the single most common reason GPU utilisation is below 50%. Load data in __getitem__ not __init__. Set drop_last=True on training DataLoaders when using BatchNorm. Use worker_init_fn to ensure random augmentations differ across workers — without it, you are training on fewer unique augmentations than you think.
Debugging Slow or Broken Data Loading
If: GPU utilisation below 50% during training
Use: Increase num_workers to 4 and add pin_memory=True — data loading is almost certainly the bottleneck
If: RAM usage grows steadily during training
Use: Check if __init__ pre-loads data — move all data loading into __getitem__ so only one batch lives in memory at a time
If: Training crashes with unpickleable object error or silent worker hang
Use: Move file handles and DB connections out of __init__ and into __getitem__ or worker_init_fn
If: BatchNorm behaves erratically on the last batch of each epoch
Use: Set drop_last=True on the training DataLoader to ensure every batch has the same size

Custom collate_fn: Handling Variable-Length Data

The default collate_fn expects every sample in a batch to have the same shape so it can stack them into a uniform tensor. This assumption breaks the moment you work with NLP sequences of different lengths, graphs with different numbers of nodes, or images that have not been resized to a fixed resolution.

A custom collate_fn lets you define exactly how a list of heterogeneous samples becomes a batch. The most common pattern — one you will write or encounter in almost every NLP project — is pad-and-mask: pad all sequences to the length of the longest sequence in the batch, and return a binary mask tensor that tells downstream layers which positions are real data and which are padding. Attention layers, loss functions, and pooling operations all need this mask to avoid treating padding as signal.

The production subtlety that trips people up: collate_fn runs on the main thread, not inside the worker processes. This means even with num_workers=4, a slow collate_fn becomes the bottleneck for the entire pipeline. Keep it to reshaping and padding only. If you find yourself sorting sequences, computing complex statistics, or doing any non-trivial transformation in collate_fn, move that work into __getitem__ where it can run in parallel across workers.

For NLP work in 2026, most teams use Hugging Face's DataCollatorWithPadding which implements this pattern with tokeniser-aware padding. But understanding the underlying collate_fn contract means you can customise it when the standard collators do not fit your data structure.

io/thecodeforge/ml/custom_collate.py · PYTHON
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from typing import List, Tuple


def collate_variable_length(
    batch: List[Tuple[torch.Tensor, torch.Tensor]]
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Pads variable-length sequences to the longest in the batch.
    Returns:
        padded   — (batch_size, max_len, feature_dim) padded sequences
        labels   — (batch_size,) classification labels
        mask     — (batch_size, max_len) float mask: 1.0 real, 0.0 padding

    Runs on the main thread — keep this function fast.
    Heavy transforms belong in __getitem__, not here.
    """
    sequences, labels = zip(*batch)

    # pad_sequence pads to the longest sequence in this batch
    # batch_first=True gives shape (batch, seq_len, features)
    padded = pad_sequence(sequences, batch_first=True, padding_value=0.0)

    # Build the mask: 1.0 where real data, 0.0 where padding
    # Downstream attention layers and loss functions use this to ignore padding
    lengths = torch.tensor([len(s) for s in sequences], dtype=torch.long)
    mask    = (torch.arange(padded.shape[1]).unsqueeze(0) < lengths.unsqueeze(1)).to(torch.float32)

    labels_stacked = torch.stack(labels)
    return padded, labels_stacked, mask


# Toy variable-length dataset
class VariableLengthDataset(Dataset):
    def __init__(self, num_samples: int = 200, feature_dim: int = 128):
        self.num_samples = num_samples
        self.feature_dim = feature_dim

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int):
        # Sequence length varies per sample — this is what breaks the default collate_fn
        seq_len  = torch.randint(10, 60, (1,)).item()
        features = torch.randn(seq_len, self.feature_dim)
        label    = torch.tensor(idx % 2, dtype=torch.long)  # binary label
        return features, label


variable_length_dataset = VariableLengthDataset(num_samples=200, feature_dim=128)

loader = DataLoader(
    variable_length_dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=collate_variable_length,  # custom collation for variable-length sequences
    num_workers=4,
    pin_memory=True,
)

for batch_idx, (padded_seqs, labels, mask) in enumerate(loader):
    # padded_seqs: (32, max_len_in_batch, 128)
    # mask:        (32, max_len_in_batch) — 1.0 for real tokens, 0.0 for padding
    real_token_count = mask.sum().item()
    total_positions  = mask.numel()
    padding_pct      = 100 * (1 - real_token_count / total_positions)
    print(f"Batch {batch_idx}: shape {padded_seqs.shape} | "
          f"padding {padding_pct:.1f}% | labels {labels[:4]}")
    if batch_idx >= 2:
        break
▶ Output
Batch 0: shape torch.Size([32, 59, 128]) | padding 37.2% | labels tensor([1, 0, 1, 0])
Batch 1: shape torch.Size([32, 58, 128]) | padding 36.8% | labels tensor([0, 1, 0, 1])
Batch 2: shape torch.Size([32, 57, 128]) | padding 35.1% | labels tensor([1, 0, 1, 0])
💡 collate_fn Rules of Thumb
  • Use pad_sequence from torch.nn.utils.rnn — it handles batch-first padding in one call and is well-tested
  • Always return a mask alongside padded data — every downstream layer that touches sequences needs to know where padding starts
  • collate_fn runs on the main thread — if profiling shows it as the bottleneck, move the heavy work into __getitem__ where workers can parallelise it
  • Consider bucket sampling (grouping sequences by similar length before batching) to reduce padding waste — padding above 40% per batch is worth addressing
  • For images of different sizes, resize in __getitem__ not in collate_fn — resizing is CPU-intensive and belongs in the parallel workers
📊 Production Insight
collate_fn runs on the main thread regardless of num_workers — complex logic here blocks the entire pipeline even with parallel workers running.
Padding above 40% per batch is wasted compute at training time and is worth fixing with bucket sampling or sorting by length before batching.
In 2026 most NLP teams use Hugging Face DataCollatorWithPadding, but understanding the raw collate_fn contract lets you customise it when standard collators do not fit your data structure.
Rule: keep collate_fn to reshaping and padding only — move transforms to __getitem__.
🎯 Key Takeaway
Custom collate_fn handles variable-length samples — pad to the longest sequence in the batch and always return a mask so downstream layers can ignore padding positions. collate_fn runs on the main thread, not in workers — keep it fast or it becomes the bottleneck that num_workers cannot fix. Track padding percentage per batch and consider bucket sampling if it consistently exceeds 40%.
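The padding-waste fix mentioned above — bucket sampling — can be sketched as a custom batch sampler. BucketBatchSampler is a hypothetical name, and the sketch assumes each sample's length can be precomputed cheaply (e.g. from file metadata) before training starts.

```python
import torch
from torch.utils.data import Sampler


class BucketBatchSampler(Sampler):
    """Yields batches of indices with similar sequence lengths to cut padding waste.
    Hypothetical sketch — assumes lengths[i] is the length of sample i."""

    def __init__(self, lengths, batch_size, bucket_batches=50, shuffle=True):
        self.lengths = lengths
        self.batch_size = batch_size
        self.chunk = batch_size * bucket_batches  # samples per coarse bucket
        self.shuffle = shuffle

    def __iter__(self):
        n = len(self.lengths)
        order = torch.randperm(n) if self.shuffle else torch.arange(n)
        batches = []
        for i in range(0, n, self.chunk):
            # Sort only within a coarse chunk: each batch gets similar lengths,
            # while chunk boundaries preserve some global randomness
            pool = sorted(order[i:i + self.chunk].tolist(), key=lambda j: self.lengths[j])
            batches.extend(pool[j:j + self.batch_size]
                           for j in range(0, len(pool), self.batch_size))
        if self.shuffle:
            batches = [batches[k] for k in torch.randperm(len(batches)).tolist()]
        return iter(batches)

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

A batch sampler like this is passed as `DataLoader(dataset, batch_sampler=BucketBatchSampler(lengths, 32), collate_fn=...)` — note that batch_sampler is mutually exclusive with batch_size, shuffle, and drop_last, since it takes over all of those responsibilities.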

IterableDataset: Streaming Data Without Random Access

Map-style datasets require __len__ and __getitem__ — random access to any sample by index. This breaks when data arrives as a stream (Kafka, network logs, real-time sensor feeds) or when the dataset is genuinely too large to index. IterableDataset solves this by yielding samples sequentially without needing to know the total size.

The use case that justifies reaching for IterableDataset: training on a live event stream where the concept of 'total dataset size' does not exist, or a dataset so large that generating a complete index would take longer than training itself. The DataLoader pulls samples from the __iter__ method, batches them as they arrive, and can offer only limited shuffling within a buffer of recent samples.

The production trade-off that you need to understand before choosing IterableDataset: it cannot shuffle globally because it never knows the full dataset. It can shuffle within a configurable buffer of recent samples, but the model always sees data in approximately the order it arrives in the stream. If the stream has any temporal structure — and real data almost always does — the model will see a biased distribution. For training data that can be indexed, Map-style datasets with global shuffling are strictly better. Use IterableDataset only when indexing is genuinely impossible.
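The buffer-based local shuffling described above fits in a few lines. buffer_shuffle is a hypothetical generator that holds buffer_size samples and yields a random one as each new sample arrives — the same idea behind the shuffle buffers in streaming data libraries.

```python
import random


def buffer_shuffle(stream, buffer_size=1000, seed=None):
    """Approximate shuffling for a stream that cannot be indexed.
    Each yielded sample is drawn at random from a sliding buffer, so a sample
    can only move a limited distance from its arrival position — this is why
    buffer shuffling cannot remove temporal structure from an ordered stream."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end, then yield it
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)   # drain the remainder in random order
    yield from buf


# An ordered stream comes out locally — not globally — shuffled
out = list(buffer_shuffle(range(20), buffer_size=5, seed=0))
print(out)
```

Note the asymmetry: a sample arriving at position p can never be yielded before roughly position p - buffer_size, so early data always appears early. That is exactly the distribution bias the paragraph above warns about.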

One non-obvious operational issue with num_workers > 0 and IterableDataset: each worker receives its own copy of the __iter__ method and will iterate the entire stream independently. Without sharding the stream across workers, every sample gets loaded num_workers times. You need to detect the worker ID inside __iter__ using torch.utils.data.get_worker_info() and partition the stream so each worker handles a distinct subset.

io/thecodeforge/ml/iterable_dataset.py · PYTHON
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info
from typing import Iterator, Tuple


class StreamingSensorDataset(IterableDataset):
    """
    Yields samples from a simulated sensor stream.
    In production, replace __iter__ with a Kafka consumer,
    network socket reader, or database cursor.

    IMPORTANT: with num_workers > 0, each worker calls __iter__ independently.
    Without sharding, every sample is loaded num_workers times.
    The __iter__ below handles worker partitioning automatically.
    """

    def __init__(self, num_samples: int = 10000):
        self.num_samples = num_samples

    def __iter__(self) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
        worker_info = get_worker_info()

        if worker_info is None:
            # Single-process loading — yield the full stream
            start, end = 0, self.num_samples
        else:
            # Multi-process loading — partition stream across workers
            # Each worker gets a non-overlapping slice of the sample range
            per_worker = self.num_samples // worker_info.num_workers
            worker_id  = worker_info.id
            start      = worker_id * per_worker
            end        = start + per_worker if worker_id < worker_info.num_workers - 1 else self.num_samples

        for i in range(start, end):
            # Simulate streaming sensor data — replace with real I/O in production
            features = torch.randn(10)
            label    = torch.tensor(float(features[0] > 0), dtype=torch.float32)
            yield features, label


stream_ds     = StreamingSensorDataset(num_samples=10000)
stream_loader = DataLoader(
    stream_ds,
    batch_size=64,
    num_workers=2,  # Each worker now handles a non-overlapping shard of the stream
)

for batch_idx, (features, labels) in enumerate(stream_loader):
    if batch_idx >= 5:
        break
    print(f"Batch {batch_idx}: features {features.shape}, "
          f"label distribution: {labels.sum().item():.0f}/{len(labels)} positive")
▶ Output
Batch 0: features torch.Size([64, 10]), label distribution: 34/64 positive
Batch 1: features torch.Size([64, 10]), label distribution: 31/64 positive
Batch 2: features torch.Size([64, 10]), label distribution: 33/64 positive
Batch 3: features torch.Size([64, 10]), label distribution: 30/64 positive
Batch 4: features torch.Size([64, 10]), label distribution: 32/64 positive
🔥 Map-style vs Iterable Dataset — When to Use Each
Use Map-style Dataset when you have random access to all samples and can determine the dataset size — this covers the vast majority of production ML use cases. Use IterableDataset when data is a live stream (Kafka, sockets, real-time sensors) where indexing is genuinely impossible or when the dataset is so large that building a complete index is impractical. If you choose IterableDataset with num_workers > 0, you must implement worker sharding using get_worker_info() inside __iter__ — otherwise each sample is loaded num_workers times.
📊 Production Insight
IterableDataset with num_workers > 0 loads every sample num_workers times unless you shard the stream per worker using get_worker_info() — this doubles or quadruples I/O load silently.
Global shuffling is impossible with IterableDataset — only buffer-based local shuffling is available, which leaves the model exposed to distribution bias in ordered streams.
In 2026, streaming ML pipelines more often use specialised libraries (Mosaic Streaming, WebDataset) rather than raw IterableDataset for very large scale — but understanding IterableDataset is the prerequisite.
Rule: use Map-style datasets whenever data can be indexed. Reserve IterableDataset for genuinely streaming or unindexable data sources.
🎯 Key Takeaway
IterableDataset yields samples sequentially — no __len__, no __getitem__, no random access. Shuffling is limited to a buffer of recent samples, which means the model sees an approximately ordered distribution rather than a globally shuffled one. With num_workers > 0, implement get_worker_info() sharding inside __iter__ or every sample gets loaded num_workers times. Use Map-style datasets whenever the data can be indexed; reach for IterableDataset only when it genuinely cannot.
🗂 Data Loading Approaches Compared
Choosing the right data loading strategy for your use case
| Feature | Standard Python List / Loop | PyTorch DataLoader / Dataset |
| --- | --- | --- |
| Memory Usage | High — entire dataset loaded into RAM before training starts | Low — lazy loading per sample in __getitem__, only one batch in memory at a time |
| Concurrency | Single-threaded — data loading blocks the main thread and therefore the GPU | Multi-process via num_workers — true parallelism that bypasses Python's GIL |
| Batching | Manual list slicing — you write the indexing logic and handle edge cases like the last batch | Automatic via batch_size — DataLoader handles indexing, collation, and drop_last |
| Data Shuffling | Manual random.shuffle() — must be called every epoch and operates on the full list in memory | Built-in per-epoch shuffling via shuffle=True — operates on indices, not the data itself |
| GPU Integration | Manual .to(device) on every tensor — no transfer optimisation | Optimised via pin_memory=True and non_blocking=True — DMA transfer without an extra copy step |

🎯 Key Takeaways

  • Dataset and DataLoader have deliberately separate responsibilities — Dataset knows how to access one sample, DataLoader knows how to batch, shuffle, and parallelise. Understanding this separation makes every configuration decision obvious.
  • num_workers=0 is the default and the most common performance mistake — it serialises data loading on the main thread and leaves the GPU idle between batches. Always override it for GPU training.
  • In Docker, --shm-size=2g is a required flag when num_workers > 0, not an optional optimisation. The default 64MB causes Bus error crashes and leaves no Python traceback to diagnose from.
  • Load only metadata in __init__ and actual data in __getitem__ — this is what makes lazy loading work. Loading data in __init__ converts a scalable pipeline into an OOM crash before training starts.
  • collate_fn runs on the main thread regardless of num_workers — keep it to padding and reshaping only, and move transforms into __getitem__ where workers can parallelise them.
  • IterableDataset cannot shuffle globally and requires per-worker stream sharding via get_worker_info() when num_workers > 0. Use Map-style datasets whenever indexing is possible.

⚠ Common Mistakes to Avoid

    Using a custom Dataset when TensorDataset suffices
    Symptom

    Unnecessary boilerplate that adds maintenance burden. The custom class has __len__ and __getitem__ that do nothing more than index into an in-memory tensor — exactly what TensorDataset already does.

    Fix

    If your data fits in RAM and is already a tensor, use torch.utils.data.TensorDataset(features, labels) directly. It provides __len__ and __getitem__ for free. Only write a custom Dataset when you need lazy loading from disk, a database, or object storage — not for data that already lives in memory.

    Passing unpickleable objects to Dataset when num_workers > 0
    Symptom

    DataLoader hangs silently for 30–60 seconds and then crashes, or the worker process exits with no useful traceback. Open file handles, database connections, and lambda functions are the most common culprits.

    Fix

    Initialise file handles and DB connections inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker process. Never store open handles as instance variables when num_workers > 0. Test pickleability before a long training run: import pickle; pickle.dumps(dataset) should not raise.

    Ignoring error handling in __getitem__ for corrupted files
    Symptom

    Training crashes mid-epoch with FileNotFoundError, PIL.UnidentifiedImageError, or a silent tensor of zeros where a valid sample should be. The entire epoch is lost and the crash is non-deterministic — it only happens when that specific corrupted sample is drawn.

    Fix

    Add try/except in __getitem__. On exception, log the bad file path and return a placeholder sample or the nearest valid neighbour. Never silently return zeros without logging — you will not know how much of your dataset is corrupted.
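A minimal sketch of that pattern — RobustDataset and its raw-bytes loader are placeholders; in a real image pipeline the loader would be PIL or torchvision decoding, which is where UnidentifiedImageError would surface.

```python
import logging
import torch
from torch.utils.data import Dataset

logger = logging.getLogger("dataset")


class RobustDataset(Dataset):
    """Corruption-tolerant loading: log the bad path, substitute a neighbour."""

    def __init__(self, paths, labels):
        self.paths, self.labels = paths, labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        try:
            sample = self._load(self.paths[idx])
        except Exception as exc:
            # Log loudly, then fall back to the next sample rather than
            # crashing mid-epoch. (If every sample is corrupt this recurses
            # until it fails — acceptable, since that dataset is unusable.)
            logger.warning("Corrupt sample %s (%s) — substituting neighbour",
                           self.paths[idx], exc)
            return self[(idx + 1) % len(self)]
        return sample, self.labels[idx]

    def _load(self, path):
        # Placeholder loader — swap in real image decoding here
        with open(path, "rb") as f:
            data = f.read()
        if not data:
            raise ValueError("empty file")
        return torch.tensor(list(data[:16]), dtype=torch.float32)
```

The key property: every substitution is logged, so after a training run you can grep the log and quantify exactly how much of the dataset is bad.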

    Setting num_workers=0 for GPU training
    Symptom

    GPU utilisation stays below 50% throughout training. The GPU is idle for roughly as long as it is computing — data loading on the main thread blocks the entire pipeline. Training takes 3–5x longer than it should.

    Fix

    Set num_workers=4 as a starting point for single-GPU training. Add pin_memory=True for CUDA. Add persistent_workers=True to avoid the 5–10 second worker restart overhead at the start of every epoch. Profile with nvidia-smi -l 1 to confirm GPU utilisation exceeds 85% after the change.

    Loading data in __init__ instead of __getitem__
    Symptom

    Dataset initialisation takes minutes rather than milliseconds. RAM usage climbs to system limits before the first training batch is processed. OOM crashes occur before training even starts.

    Fix

    Store only metadata (file paths, labels, IDs) in __init__ — this should take milliseconds regardless of dataset size. Load actual data lazily in __getitem__, one sample at a time. If the dataset truly fits in RAM and pre-loading is intentional, use TensorDataset instead of a custom class.

Interview Questions on This Topic

  • QHow does the DataLoader utilise Python's multi-processing to bypass the Global Interpreter Lock (GIL)?Mid-levelReveal
    Python's GIL prevents true parallel execution of Python bytecode within a single process. The DataLoader bypasses this by spawning multiple separate processes — not threads — via Python's multiprocessing module. Each worker process loads and transforms data independently in its own Python interpreter with its own GIL, so they run truly in parallel without contending with each other or the main process. Workers write tensors to shared memory at /dev/shm, and the main process reads pre-fetched batches from that shared memory. This achieves genuine parallelism for both I/O-bound operations (reading files from disk) and CPU-bound operations (image decoding, augmentation). The shared memory approach is also why Docker's /dev/shm size matters — it is the actual transfer medium, not a socket or pipe.
  • QExplain the Producer-Consumer pattern as it applies to the relationship between a CPU DataLoader and a GPU training loop.Mid-levelReveal
    The DataLoader workers are producers — they load, transform, and batch data on CPU in parallel, writing completed batches to a shared memory queue. The training loop is a consumer — it pulls the next batch from the queue and sends it to GPU for forward and backward passes. The performance insight is pipeline parallelism: while the GPU is processing batch N, workers are already preparing batch N+1 and N+2. This overlap hides data loading latency behind GPU compute time. The prefetch_factor parameter controls how many batches ahead workers prepare. If the queue is empty when the training loop requests a batch, the GPU stalls — this is data starvation, and it shows up as GPU utilisation below 80% in nvidia-smi. Increasing num_workers or reducing the cost of __getitem__ resolves it.
  • QWhat is the difference between a Map-style and an Iterable-style Dataset? When would you strictly choose the latter?SeniorReveal
    A Map-style dataset implements __len__ and __getitem__, providing O(1) random access to any sample by index. The DataLoader can globally shuffle all indices before each epoch, ensuring the model sees a fully randomised data distribution. An Iterable-style dataset implements only __iter__, yielding samples sequentially. It cannot provide __len__ or random access, and shuffling is limited to a buffer of recent samples rather than the full dataset. Choose Iterable-style only when: data arrives as a live stream (Kafka, network socket) where total size is unknown; the dataset is so large that building a complete index is impractical; or data genuinely cannot be accessed by index. The trade-off is real — limited shuffling means the model sees a biased distribution relative to a globally shuffled Map-style dataset. An additional operational issue: with num_workers > 0, each worker calls __iter__ independently and will iterate the full stream unless you implement per-worker sharding using get_worker_info().
  • QHow would you implement a custom collate_fn to handle a dataset where samples have variable sequence lengths?SeniorReveal
    The default collate_fn tries to stack samples into a uniform tensor, which raises an error when sequences have different lengths. A custom collate_fn receives a list of (sequence, label) tuples and must produce a batch tensor. The standard implementation pads all sequences to the length of the longest in the batch using pad_sequence from torch.nn.utils.rnn, which handles the padding efficiently in one call. It also generates a binary mask tensor — shape (batch_size, max_len), with 1.0 for real tokens and 0.0 for padding positions — that downstream attention layers, loss functions, and pooling operations use to ignore padding. The function signature is collate_fn(batch) returning (padded_tensor, labels, mask). One production consideration: collate_fn runs on the main thread, not inside worker processes, so any computation added here is serial and can become the pipeline bottleneck. Keep it to reshaping and padding; move heavier transforms into __getitem__ where they run in parallel.
  • QDescribe the purpose of pin_memory. How does it interact with pageable vs pinned host memory during DMA transfers to the GPU?SeniorReveal
    pin_memory=True allocates DataLoader output tensors in page-locked (pinned) host memory rather than the default pageable memory. The difference matters at transfer time: pageable memory can be swapped to disk by the OS, so before the GPU's DMA engine can transfer it, the driver must first copy it to a temporary pinned buffer — adding a full CPU-to-CPU copy before the CPU-to-GPU transfer begins. With pre-pinned memory, the DMA transfer starts immediately from the original buffer, eliminating that intermediate copy. In practice this reduces CPU-to-GPU transfer latency by 2–5x on large tensors. The non_blocking=True argument in .to(device) extends this further — it tells the CUDA runtime to initiate the DMA transfer asynchronously and return control to the CPU immediately, allowing the training loop to continue CPU work while the transfer completes in the background. The trade-off: pinned memory cannot be swapped by the OS, reducing memory management flexibility under system-wide memory pressure. Use pin_memory=True for GPU training and monitor total pinned memory with torch.cuda.memory_summary() to ensure you are not exhausting it.

Frequently Asked Questions

What is the difference between Dataset and DataLoader in PyTorch?

A Dataset defines how to access a single sample — it implements __len__ (total number of samples) and __getitem__ (fetch one sample by index). It knows nothing about batching or parallelism. A DataLoader wraps a Dataset and adds everything else: batching (batch_size), per-epoch shuffling (shuffle=True), multi-process loading (num_workers), and GPU transfer optimisation (pin_memory). The Dataset is the data source; the DataLoader is the pipeline that feeds data to the training loop at the right cadence to keep the GPU busy.

How many num_workers should I use?

A practical starting point is 4 for single-GPU training. Increase it if GPU utilisation measured by nvidia-smi is below 85% after setting num_workers=4 and pin_memory=True. Do not set num_workers higher than the number of CPU cores allocated to your process — in Docker and Kubernetes, check the container's CPU limit, not the host machine's total core count. Setting num_workers too high causes CPU contention between workers and actually slows down data loading. The right number is the smallest value that keeps GPU utilisation above 85%.
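One defensive way to size num_workers in code — note that os.sched_getaffinity is Linux-only, and a cgroup CPU *quota* (as opposed to a cpuset pin) is not reflected in it, so treat the result as an upper bound rather than the container's true limit.

```python
import os

# os.cpu_count() reports the host's cores — misleading inside a container
host_cores = os.cpu_count()
try:
    # Cores this process may actually run on (respects cpusets; Linux only)
    available = len(os.sched_getaffinity(0))
except AttributeError:
    available = host_cores or 1

# Start at 4, never exceed what is available, keep one core for the main process
num_workers = max(0, min(4, available - 1))
print(f"host={host_cores} available={available} -> num_workers={num_workers}")
```

From there, raise the value only if nvidia-smi still shows GPU utilisation below 85%.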

When should I use TensorDataset vs a custom Dataset?

Use TensorDataset when your data is already in memory as tensors and the full dataset fits comfortably in RAM. It provides __len__ and __getitem__ for free with zero boilerplate — no custom class needed. Use a custom Dataset when data is on disk or object storage and too large to fit in RAM, when samples need per-sample transforms or augmentation, or when data comes from a non-tensor source like a database or API. The decision is straightforward: if the data is already a tensor in memory, TensorDataset. If it needs to be loaded from somewhere, custom Dataset.
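The in-memory case in one sketch — no custom class, just tensors:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))

ds = TensorDataset(features, labels)   # __len__ and __getitem__ for free
x, y = ds[0]                           # one (feature, label) pair

loader = DataLoader(ds, batch_size=32, shuffle=True)
xb, yb = next(iter(loader))
print(x.shape, xb.shape)               # torch.Size([8]) torch.Size([32, 8])
```

Everything a trivial custom Dataset would have written by hand is already there.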

Why does my DataLoader hang with num_workers > 0?

The most common cause is unpickleable objects stored in the Dataset. Python's multiprocessing pickles the Dataset to send it to each worker process. Open file handles, database connections, and lambda functions cannot be pickled — the worker hangs silently rather than raising a clear exception. Test it first: import pickle; pickle.dumps(dataset) should complete without raising. The fix is to initialise those objects inside __getitem__ (called per sample in each worker) or use worker_init_fn to set them up once per worker. A second cause is a deadlock inside __getitem__ — for example, waiting on a threading lock that is held by the main process.
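The pickleability check in practice — BadDataset and GoodDataset are illustrative names; the point is where the file handle lives:

```python
import os
import pickle
import tempfile
from torch.utils.data import Dataset


class BadDataset(Dataset):
    def __init__(self, path):
        self.handle = open(path, "rb")      # open handle on the instance — unpicklable

    def __len__(self):
        return 1

    def __getitem__(self, idx):
        return self.handle.read(1)


class GoodDataset(Dataset):
    def __init__(self, path):
        self.path = path                    # store only metadata

    def __len__(self):
        return 1

    def __getitem__(self, idx):
        with open(self.path, "rb") as f:    # open lazily, inside each worker
            return f.read(1)


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    pickle.dumps(BadDataset(path))
except TypeError as exc:
    print("BadDataset cannot be sent to workers:", exc)
pickle.dumps(GoodDataset(path))             # succeeds — safe for num_workers > 0
print("GoodDataset pickles cleanly")
```

Running this check once before a long training job is much cheaper than discovering the hang 30 seconds into epoch 1.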

What is persistent_workers and when should I use it?

persistent_workers=True keeps worker processes alive between epochs rather than destroying and re-creating them. Worker startup involves forking processes, importing modules, and re-initialising any objects in worker_init_fn — this typically costs 5–10 seconds per epoch on a standard training machine. With persistent_workers=True, that overhead only occurs once at the start of training. Use it whenever you are training for more than a few epochs and num_workers > 0. The only trade-off is slightly higher baseline memory usage because the workers remain resident. It is almost always worth it.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
