Docker Daemon Bottleneck — 50 Concurrent Builds Crash
50 concurrent docker builds consumed 12GB RAM and froze the daemon for 20 minutes.
20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.
- ✓ Solid grasp of DevOps fundamentals
- ✓ Comfortable with command-line tools
- ✓ Basic Linux administration knowledge
- docker build: CLI sends context to dockerd -> dockerd executes Dockerfile instructions -> each instruction creates a cached layer -> layers stored under /var/lib/docker/overlay2/
- docker push: dockerd uploads layers to a registry -> registry stores layers by digest -> tags point to manifests
- docker run: CLI sends API request to dockerd -> dockerd delegates to containerd -> containerd invokes runc -> runc configures namespaces/cgroups/filesystem -> exec starts the application
- Docker CLI: HTTP client that talks to the daemon via Unix socket
- dockerd (daemon): manages images, networks, volumes, and the REST API
- containerd: manages container lifecycle, image pulling, and snapshot management
- runc: creates containers by calling kernel syscalls (clone, pivot_root, exec)
Docker's architecture is a layered client-server model where the Docker CLI communicates with a central daemon (dockerd) that manages containers, images, and builds. The daemon delegates low-level container operations to containerd (a container runtime supervisor) and runc (the OCI-compliant container spawner), while the daemon itself handles the Docker API, image management, and storage orchestration.
This design centralizes control but creates a single point of failure: all CLI commands, image pulls, and concurrent builds funnel through one daemon process, which can saturate CPU, memory, or I/O under load—like 50 parallel builds exhausting layer cache locks and filesystem operations. The OCI Spec standardizes the image format and runtime behavior, allowing containerd and runc to be swapped with alternatives like CRI-O or Kata Containers, but the daemon remains the bottleneck in Docker's default stack.
When you run docker build, the daemon parses the Dockerfile, executes each instruction in a temporary container via containerd, commits layers as overlay2 diffs, and caches them in /var/lib/docker. The storage stack uses overlay2 for copy-on-write union mounts, where each layer is a read-only filesystem diff, and a container's writable layer sits on top.
Volumes bypass this stack by mounting host directories directly, avoiding the performance overhead of the layered filesystem. For distribution, the daemon interacts with a registry (e.g., Docker Hub) using the OCI Distribution Spec: it fetches a manifest listing layer digests (SHA256 hashes), deduplicates layers already cached locally, and streams missing layers as compressed tarballs.
This architecture works well for single-host workflows but breaks under concurrent load because the daemon serializes layer operations and registry interactions, making it unsuitable for high-scale CI/CD without external orchestration or daemon-per-build patterns.
Think of Docker architecture as a shipping company. The Docker CLI is the customer placing an order. The daemon (dockerd) is the dispatch center that receives orders and coordinates everything. containerd is the warehouse manager that tracks inventory (images) and assembles shipments (containers). runc is the forklift operator who physically moves items into the shipping container. The kernel is the warehouse floor — the physical space where everything is built. Each layer has a specific job, and the handoff between layers is standardized (the OCI spec) so you can swap one forklift brand (runc) for another (crun, gVisor) without redesigning the warehouse.
Most Docker documentation treats the architecture as a black box — run a command, get a container. This abstraction breaks down when containers fail to start, images pull slowly, or the daemon crashes under load. Understanding the component stack and the data flow between components is essential for production debugging.
Docker is not one program. It is a chain of specialized components: the CLI sends API requests to the daemon, the daemon delegates to containerd, containerd invokes runc, and runc configures the Linux kernel to create an isolated process. Each handoff is a potential failure point. The OCI (Open Container Initiative) spec standardizes the interface between containerd and runc, enabling runtime replaceability.
This article traces the complete end-to-end flow: what happens when you run docker build, how images are stored and distributed, what happens when you run docker run, how networking and storage are wired, and where each component lives on the filesystem. Every section includes production failure scenarios and debugging commands.
Why Docker Daemon Architecture Becomes a Single Point of Failure
Docker uses a client-server architecture in which the Docker daemon (dockerd) is the central orchestrator. It manages images, containers, networks, and volumes through a REST API. dockerd is a multi-threaded Go program, so it can accept many requests at once, but every docker build, run, or pull still passes through that one process and its shared state: the image store, the layer cache, and the storage driver. Under load, especially with concurrent builds, lock contention on that shared state and filesystem work (layer management, image extraction) become the limiting factors. Steps within a single build run in order, and separate builds must synchronize whenever they touch the image cache or the overlay filesystem. With 50 concurrent builds, memory use and latency climb sharply: builds time out, the daemon runs out of file descriptors, or the entire host becomes unresponsive. This architecture works at low concurrency but fails when you treat Docker as a CI build orchestrator without accounting for its single-daemon limits.
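One common mitigation is to cap how many builds reach the daemon at once. The sketch below is a minimal illustration, not a tuned CI setup: build_limited and the BUILD_CMD variable are hypothetical names introduced here, and it assumes an xargs that supports -0 and -P (GNU and BSD xargs both do).

```shell
#!/bin/bash
# Hypothetical helper: run one build per context directory, with at most
# $1 builds in flight. BUILD_CMD is an illustrative override so the
# function can be dry-run; it defaults to a quiet docker build.
build_limited() {
  local max_parallel="$1"
  shift
  # NUL-delimit the paths so directory names with spaces survive xargs.
  # BUILD_CMD is left unquoted on purpose so it can carry flags.
  printf '%s\0' "$@" | xargs -0 -P "$max_parallel" -n 1 ${BUILD_CMD:-docker build -q}
}

# Dry run: print the contexts instead of building them
# (output order may vary because the jobs run in parallel).
BUILD_CMD=echo build_limited 2 ./svc-a ./svc-b ./svc-c
```

A real invocation might look like build_limited 4 ./services/*/ where the directory layout is an assumption; the right cap depends on host memory and disk throughput.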
Component Stack: CLI, Daemon, containerd, runc, and the OCI Spec
Docker is a chain of five components, each with a specific responsibility. Understanding this chain is the foundation for debugging any Docker issue.
Docker CLI (docker): A Go binary that sends HTTP requests to the Docker daemon via a Unix socket (/var/run/docker.sock) or TCP. The CLI does not create containers, build images, or manage networks — it is a thin client. You can replace it with curl: curl --unix-socket /var/run/docker.sock http://localhost/containers/json.
Docker daemon (dockerd): A long-running Go process that manages the Docker API, image storage, network configuration, volume management, and build orchestration. The daemon listens on the Unix socket and processes all API requests. It delegates container lifecycle operations to containerd. The daemon runs as root and has full access to the host.
containerd: A container runtime daemon that manages the complete container lifecycle — pulling images, managing snapshots, creating containers, and handling execution. containerd was originally part of Docker but was extracted as a CNCF project in 2017. It is now used independently by Docker, Kubernetes (via CRI), AWS ECS, GKE, and other platforms. containerd invokes runc to actually create containers.
runc: A lightweight CLI tool that creates a single container from an OCI runtime specification (config.json). runc calls clone() to create a new process with namespaces, configures cgroups, sets up the mounts listed in the spec, pivots into the root filesystem that the layers above have already assembled with overlay2, drops privileges, and exec's the application process. runc exits after creating the container; it does not manage the lifecycle.
OCI spec: The Open Container Initiative defines two standards: the image spec (how images are packaged as layers + manifest) and the runtime spec (how containers are created as config.json). This standardization enables runtime replaceability — you can swap runc for crun, kata-runtime, or runsc without changing Docker or containerd.
The handoff chain: docker run -> dockerd (API) -> containerd (lifecycle) -> runc (creation) -> kernel (namespaces, cgroups, overlay2). Each handoff is a potential failure point. If dockerd crashes, all operations fail. If containerd crashes, new containers cannot be created but existing ones keep running. If runc fails, the specific container creation fails but the stack above is unaffected.
#!/bin/bash
# Trace the complete Docker architecture flow

# ── 1. CLI -> Daemon communication ───────────────────────────────────────────
# The CLI sends HTTP requests to the daemon socket
curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m json.tool
# Shows: Version, ApiVersion, GoVersion, Os, Arch, KernelVersion

# List containers via the API (same as docker ps)
curl --unix-socket /var/run/docker.sock http://localhost/containers/json | python3 -m json.tool

# ── 2. Daemon -> containerd communication ────────────────────────────────────
# containerd runs as a separate process, communicating via gRPC
ps aux | grep containerd
# root 1234 0.3 0.5 ... /usr/bin/containerd

# Check the containerd socket
ls -la /run/containerd/containerd.sock
# srw-rw---- 1 root containerd /run/containerd/containerd.sock

# List containers managed by containerd (via ctr)
sudo ctr -n moby containers ls
# Shows containers that containerd is managing on behalf of Docker

# ── 3. containerd -> runc communication ──────────────────────────────────────
# runc is invoked by containerd to create each container
which runc
# /usr/bin/runc
runc --version
# runc version 1.1.9

# List containers managed by runc. Docker keeps runc state under its own
# root directory; the path below is typical but may differ by version.
sudo runc --root /run/docker/runtime-runc/moby list
# Shows: container ID, PID, status, bundle path

# ── 4. Inspect the OCI runtime spec for a running container ──────────────────
CONTAINER_ID=$(docker ps -q | head -1)
# Find the container's bundle directory
sudo find /run/containerd -path "*${CONTAINER_ID}*" -name config.json 2>/dev/null | head -3
# /run/containerd/io.containerd.runtime.v2.task/moby/<id>/config.json

# ── 5. Check the daemon process tree ─────────────────────────────────────────
pstree -p "$(pidof dockerd)"
# Shows dockerd and its threads. On systemd hosts, containerd and the
# per-container containerd-shim processes usually run outside this tree,
# so list them separately:
ps -ef | grep containerd-shim | grep -v grep

# ── 6. Check all components are running ──────────────────────────────────────
echo "dockerd: $(systemctl is-active docker)"
echo "containerd: $(systemctl is-active containerd)"
echo "runc: $(command -v runc >/dev/null && echo 'installed' || echo 'missing')"

# ── 7. Check the daemon's storage driver and root directory ──────────────────
docker info --format '{{.Driver}} {{.DockerRootDir}}'
# overlay2 /var/lib/docker

# ── 8. Check the daemon's configured runtimes ────────────────────────────────
docker info --format '{{json .Runtimes}}' | python3 -m json.tool
# Shows: runc (default), and any custom runtimes (runsc, kata)
- containerd was extracted from Docker in 2017 to become a standalone CNCF project.
- Kubernetes can use containerd directly (via CRI) without dockerd — reducing overhead and complexity.
- Separation allows independent scaling — containerd can be updated without restarting dockerd.
- containerd manages the lifecycle; dockerd manages the user-facing API and image building.
Image Build Flow: From Dockerfile to Cached Layers
When you run docker build, a precise sequence of operations transforms a Dockerfile into a cached, layered image. Understanding this flow explains why builds are slow, why layers are cached, and why image size matters.
Step 1: Send build context. The CLI tars the build context (the directory passed as the final argument to docker build, usually .) and sends it to the daemon over the Unix socket. This is the 'Sending build context to Docker daemon' message. The .dockerignore file filters out excluded files before sending. Without .dockerignore, the entire directory (including .git and node_modules) is sent. Note that -f selects the Dockerfile, not the context.
Step 2: Parse the Dockerfile. The daemon parses the Dockerfile and executes each instruction sequentially. Each instruction is evaluated against the layer cache.
Step 3: Cache lookup. For each instruction, the daemon checks whether a cached layer exists with the same instruction text and the same parent layer; for COPY and ADD, it also compares checksums of the files being copied. On a cache hit, the layer is reused without execution. On a cache miss, the instruction is executed and a new layer is created. The cache is sequential: a miss invalidates all subsequent layers.
Step 4: Execute the instruction. For RUN, the daemon creates a temporary container from the previous layer, executes the command, and captures the filesystem diff as a new layer. For COPY/ADD, the daemon copies files from the build context into a new layer. For ENV/EXPOSE/LABEL, the daemon creates a metadata-only layer (no filesystem change).
Step 5: Commit the layer. The filesystem diff is committed as a new layer under /var/lib/docker/overlay2/. Each layer is a directory containing only the files that changed from the previous layer. The layer is identified by a SHA256 digest.
Step 6: Tag the image. After all instructions are executed, the final layer is tagged with the image name and tag (e.g., my-app:1.0.0). The tag points to a manifest — a JSON file that lists all layers in order.
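The sequential cache rule from Step 3 can be modeled by chaining each instruction's key to its parent's key. The sketch below is a toy model under that assumption, not Docker's actual cache-key algorithm; cache_keys is a hypothetical name, and it assumes sha256sum from GNU coreutils is available.

```shell
#!/bin/bash
# Toy model: read Dockerfile instructions from stdin and print one chained
# key per instruction, where key = sha256(parent key + instruction text).
# Because each key depends on its parent, changing one instruction changes
# the keys of every instruction after it.
cache_keys() {
  local parent="" line
  while IFS= read -r line; do
    parent=$(printf '%s\n%s' "$parent" "$line" | sha256sum | cut -d' ' -f1)
    printf '%s\n' "$parent"
  done
}

# Prints three keys, one per instruction:
printf 'FROM alpine:3.19\nCOPY app /app\nRUN echo done\n' | cache_keys
```

Feeding in a second Dockerfile that differs only in its COPY line yields the same first key and different second and third keys, even though the final RUN text is identical. That is the "a miss invalidates all subsequent layers" behavior.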
BuildKit vs legacy builder: The legacy builder executes instructions sequentially. BuildKit (DOCKER_BUILDKIT=1) builds a dependency graph and runs independent steps, such as unrelated stages of a multi-stage build, in parallel. BuildKit also supports --mount=type=secret for build-time secrets without baking them into layers. BuildKit is the default builder in Docker Desktop and in Docker Engine 23.0 and later, and is recommended for all builds.
#!/bin/bash
# Trace the complete image build flow

# ── 1. Build context size (before and after .dockerignore) ───────────────────
# Without .dockerignore:
tar -cf - . | wc -c
# May be 500MB+ if node_modules and .git are included

# With .dockerignore:
cat .dockerignore
# node_modules/
# .git/
# *.log
# tar's --exclude-from only approximates .dockerignore matching rules
tar -cf - --exclude-from=.dockerignore . | wc -c
# Should be <10MB for a typical project

# ── 2. Build with cache inspection ───────────────────────────────────────────
# Build with BuildKit and progress=plain to see every step
DOCKER_BUILDKIT=1 docker build --progress=plain -t io.thecodeforge/api:1.0 . 2>&1 | tee /tmp/build.log

# Count cached vs executed steps:
grep -c 'CACHED' /tmp/build.log
grep -c 'RUN\|COPY' /tmp/build.log

# ── 3. Inspect the image layers ──────────────────────────────────────────────
# List layers in the image
docker inspect io.thecodeforge/api:1.0 --format '{{json .RootFS.Layers}}' | python3 -m json.tool
# Each entry is a SHA256 digest of a layer

# Show layer sizes
docker history io.thecodeforge/api:1.0 --format '{{.Size}}\t{{.CreatedBy}}' | head -10
# Shows the size contribution of each instruction

# ── 4. Find layers on disk ───────────────────────────────────────────────────
sudo ls /var/lib/docker/overlay2/ | head -10
# Each directory is a layer. Shared layers are stored once and referenced
# by every image that uses them.

# Check disk usage per layer (the glob must expand as root):
sudo sh -c 'du -sh /var/lib/docker/overlay2/* | sort -hr | head -10'

# ── 5. Inspect the image manifest ────────────────────────────────────────────
# Save the image and inspect its manifest
docker save io.thecodeforge/api:1.0 | tar -xO manifest.json | python3 -m json.tool
# Shows: Config (image config), RepoTags, Layers (ordered list of layer tar files)

# ── 6. Compare BuildKit vs legacy builder performance ────────────────────────
# Legacy builder:
time DOCKER_BUILDKIT=0 docker build -t test:legacy .
# Sequential execution: slower for multi-step builds

# BuildKit:
time DOCKER_BUILDKIT=1 docker build -t test:buildkit .
# Parallel execution: faster for independent steps

# ── 7. Check the build cache ─────────────────────────────────────────────────
docker builder du
# Shows disk usage of the build cache
docker builder prune
# Removes unused build cache entries
- The daemon needs access to files referenced by COPY and ADD instructions.
- The daemon runs on the host (or a remote machine) — it cannot access the CLI's local filesystem directly.
- The CLI tars the context and sends it over the Unix socket. This is why .dockerignore is critical for build speed.
- BuildKit optimizes this by only sending files referenced by COPY/ADD, not the entire context.
Image Distribution: Registry, Manifest, and Layer Deduplication
Once an image is built, it needs to be distributed to other machines — CI servers, staging environments, production clusters. This is the registry's job.
Image format: An image is not a single file. It is a collection of:
- Layers: compressed tar archives, each identified by a SHA256 digest
- Manifest: a JSON file that lists the layers in order and points to the image config
- Image config: a JSON file that defines the runtime configuration (env vars, entrypoint, exposed ports, user)
The registry protocol: Docker registries implement the OCI Distribution Spec, an HTTP API for pushing and pulling images. The push flow:
1. Client asks the registry which layers it already has (by digest)
2. Client uploads only the missing layers
3. Registry stores layers by digest
4. Client uploads the manifest last, and the registry links it to the stored layers
Layer deduplication: This is the key efficiency mechanism. If two images share the same base layer (e.g., both use node:20-alpine), the layer is stored once on the registry and once on the local machine. When you pull a second image that shares layers with an existing image, only the unique layers are downloaded. This is why pulling a new version of your app is fast — only the top layers (containing your code) change.
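At its core, deduplication is a set difference over digests: whatever the manifest lists, minus whatever is already cached. The sketch below illustrates that idea with plain text files; missing_layers is a hypothetical helper and the digests are made up, so this is a model of the bookkeeping rather than how dockerd implements it.

```shell
#!/bin/bash
# Hypothetical helper: given a file of locally cached layer digests ($1)
# and a file of digests listed in a manifest ($2), print the digests that
# still need to be downloaded, in manifest order.
missing_layers() {
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
  grep -vxF -f "$1" "$2"
}

# Example with made-up digests (not real layers):
local_cache=$(mktemp)
manifest=$(mktemp)
printf 'sha256:aaa\nsha256:bbb\n' > "$local_cache"
printf 'sha256:aaa\nsha256:bbb\nsha256:ccc\n' > "$manifest"
missing_layers "$local_cache" "$manifest"   # prints: sha256:ccc
rm -f "$local_cache" "$manifest"
```

Only the one digest that is absent locally is reported, which mirrors why pulling a new app version downloads just the top layers.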
Docker Hub pull-rate limits: Docker Hub rate-limits image pulls. The limits cited here are 100 pulls per 6 hours for anonymous users and 200 for authenticated free users; Docker has revised these figures over time, so check the current documentation. Anonymous pulls are counted per IP address, not per user, so a NAT gateway makes multiple machines appear as one client. For CI/CD pipelines, this limit is hit quickly. The fix: authenticate with docker login, use a pull-through cache, or mirror images to a private registry.
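A CI job can check its remaining quota before pulling. The sketch below only parses response headers of the form "ratelimit-remaining: 76;w=21600" from stdin; ratelimit_remaining is a hypothetical helper name, and the sample headers are illustrative values, not a live response.

```shell
#!/bin/bash
# Hypothetical helper: read HTTP response headers on stdin and print the
# number of pulls remaining in the current window.
ratelimit_remaining() {
  # Strip carriage returns from HTTP line endings, then split on ':' and ';'
  tr -d '\r' | awk -F'[:;]' 'tolower($1) == "ratelimit-remaining" { gsub(/ /, "", $2); print $2 }'
}

# Illustrative headers:
printf 'ratelimit-limit: 100;w=21600\r\nratelimit-remaining: 76;w=21600\r\n' | ratelimit_remaining
# prints: 76
```

In a pipeline, the headers would come from an authenticated HEAD request to the registry, and the job could wait or switch to a mirror when the number gets low.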
Content trust (DCT): Docker Content Trust uses digital signatures to verify image integrity. When DOCKER_CONTENT_TRUST=1, Docker only pulls signed images. This prevents supply chain attacks where a malicious image is pushed with the same tag as a legitimate image.
#!/bin/bash
# Trace the image distribution flow

# ── 1. Inspect the local image manifest ──────────────────────────────────────
# Save the image and extract it into its own directory
docker save io.thecodeforge/api:1.0 -o /tmp/api-image.tar
mkdir -p /tmp/api-image && tar -xf /tmp/api-image.tar -C /tmp/api-image && cd /tmp/api-image

# The manifest.json lists all components:
cat manifest.json | python3 -m json.tool
# [
#   {
#     "Config": "sha256:abc123...json",          <- image config
#     "RepoTags": ["io.thecodeforge/api:1.0"],
#     "Layers": [                                <- ordered layer list
#       "sha256:def456.../layer.tar",
#       "sha256:ghi789.../layer.tar"
#     ]
#   }
# ]

# ── 2. Inspect the image config ──────────────────────────────────────────────
# Use the Config path printed in manifest.json (placeholder shown here)
cat sha256:abc123*.json | python3 -m json.tool | head -30
# Shows: architecture, os, config (env, cmd, entrypoint), rootfs (diff_ids)

# ── 3. Push to a registry ────────────────────────────────────────────────────
# Login to Docker Hub
docker login

# Tag the image for the registry
docker tag io.thecodeforge/api:1.0 youruser/io-thecodeforge-api:1.0

# Push and watch which layers are pushed vs already exist
docker push youruser/io-thecodeforge-api:1.0
# Output shows:
#   Layer already exists (shared with base image)
#   Pushing layer (unique to this image)

# ── 4. Pull from a registry ──────────────────────────────────────────────────
# Pull on a different machine
docker pull youruser/io-thecodeforge-api:1.0
# Output shows:
#   Already exists (layers shared with local images)
#   Downloading (unique layers)

# ── 5. Check pull-rate limit status ──────────────────────────────────────────
TOKEN=$(curl -s 'https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')
curl -s -I \
  -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
  | grep -i ratelimit
# ratelimit-limit: 100;w=21600
# ratelimit-remaining: 76;w=21600

# ── 6. Enable Docker Content Trust ───────────────────────────────────────────
export DOCKER_CONTENT_TRUST=1
# Now docker pull only fetches signed images
docker pull youruser/io-thecodeforge-api:1.0
# If the image is not signed, the pull fails with a trust error

# ── 7. Check layer deduplication ─────────────────────────────────────────────
# Compare layers between two images
docker inspect node:20-alpine --format '{{.RootFS.Layers}}' | tr ' ' '\n' | wc -l
docker inspect io.thecodeforge/api:1.0 --format '{{.RootFS.Layers}}' | tr ' ' '\n' | wc -l
# The API image shares base layers with node:20-alpine
- Without deduplication, every image would store a full copy of its base OS — wasting disk and bandwidth.
- With deduplication, shared layers (like node:20-alpine) are stored once and referenced by multiple images.
- Pulling a new app version only downloads the changed layers (typically your code — a few MB), not the entire image.
- This is why Docker images are practical at scale — the overhead per image is only the unique layers.
Container Creation Flow: From Image to Running Process
When you run docker run, a precise sequence of operations creates an isolated process from an image. This is the most critical flow to understand for production debugging.
Step 1: API request. The CLI sends a POST /containers/create request to dockerd. The request includes the image name, command, environment variables, port mappings, volume mounts, and resource limits.
Step 2: Image resolution. dockerd checks if the image exists locally. If not, it pulls the image from the registry. The image's layers are unpacked into /var/lib/docker/overlay2/.
Step 3: Create container metadata. dockerd creates a container configuration (container JSON) that includes the merged overlay2 directory, network settings, volume mounts, and resource limits. This metadata is stored in /var/lib/docker/containers/<container-id>/.
Step 4: Delegate to containerd. dockerd sends a gRPC request to containerd to create the container. containerd generates the OCI runtime spec (config.json) — a JSON file that defines namespaces, cgroups, mounts, and the process to execute.
Step 5: Invoke runc. containerd invokes runc create with the OCI spec. runc reads config.json and executes kernel syscalls:
- clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC): creates a new process with namespaces
- mount(): mounts /proc, /sys, /dev inside the container
- pivot_root(): changes the root to the overlay2 merge directory
- setuid()/setgid(): drops privileges (if non-root)
- execve(): starts the application process
Step 6: Network setup. dockerd (via libnetwork) creates a veth pair — one end in the container's network namespace, one end on the Docker bridge (docker0). The container gets an IP address from the bridge's subnet. iptables rules are added for port publishing (-p) and inter-container communication.
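Step 7: Monitor the process. containerd-shim monitors the container process, captures stdout/stderr, and handles signals. The container's namespaces stay alive for as long as its init process runs; plain Docker does not use a separate pause process (that is a Kubernetes pod mechanism for sharing namespaces between containers). When the application process exits, containerd-shim reports the exit code to containerd, which reports to dockerd.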
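The namespaces created in Step 5 are visible on the host as symlinks under /proc/<pid>/ns, so isolation can be checked by comparing two processes. The sketch below is Linux-only and assumes /proc is mounted; same_ns is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if two PIDs share the given namespace type
# (net, pid, mnt, uts, ipc, ...). Returns 2 if either process cannot be
# read; inspecting another user's process may require root.
same_ns() {
  local a b
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ]
}

# A process always shares namespaces with itself:
same_ns $$ $$ net && echo "same net namespace"
```

Run as root with a container's host PID (from docker inspect --format '{{.State.Pid}}') and PID 1, the net check should fail for a bridge-networked container and succeed for one started with --network host.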
#!/bin/bash
# Trace the complete container creation flow

# ── 1. Create a container (without starting it) ──────────────────────────────
docker create --name flow-demo \
  --cpus=1.0 \
  --memory=256m \
  -p 8080:3000 \
  -v demo-data:/app/data \
  alpine:3.19 sleep 3600

# ── 2. Inspect the container metadata ────────────────────────────────────────
# Container config stored by the daemon:
CONTAINER_ID=$(docker inspect flow-demo --format '{{.Id}}')
sudo ls "/var/lib/docker/containers/${CONTAINER_ID}/"
# config.v2.json hostconfig.json hostname hosts resolv.conf ...

# ── 3. Start the container and trace the flow ────────────────────────────────
docker start flow-demo

# Get the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' flow-demo)
echo "Container process PID on host: $CONTAINER_PID"

# ── 4. Inspect the OCI runtime spec ──────────────────────────────────────────
# Find the container's bundle in containerd's state (paths use the ID, not the name):
sudo find /run/containerd -path "*${CONTAINER_ID}*" -name config.json 2>/dev/null

# Inspect the OCI spec (namespaces, mounts, process).
# Docker uses the "moby" containerd namespace, so the bundle is typically here:
sudo cat "/run/containerd/io.containerd.runtime.v2.task/moby/${CONTAINER_ID}/config.json" 2>/dev/null | python3 -m json.tool | head -60

# ── 5. Inspect the overlay2 filesystem ───────────────────────────────────────
docker inspect flow-demo --format '{{json .GraphDriver.Data}}' | python3 -m json.tool
# MergedDir: what the container sees as /
# UpperDir: writable layer (container-specific changes)
# LowerDir: read-only image layers

# ── 6. Inspect the network setup ─────────────────────────────────────────────
# Container's network config:
docker inspect flow-demo --format '{{json .NetworkSettings}}' | python3 -m json.tool | head -20
# Shows: IPAddress, Gateway, Ports, Networks

# veth pair on the host:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0

# iptables rules for port publishing:
sudo iptables -t nat -L -n | grep 8080
# DNAT rule forwarding host:8080 to container:3000

# ── 7. Look for a pause process ──────────────────────────────────────────────
# Plain Docker does not start one; pause containers come from Kubernetes pods.
ps aux | grep '/pause' | grep -v grep
# Usually prints nothing on a Docker-only host

# The container's namespaces are held by its own init process:
sudo ls -la "/proc/$CONTAINER_PID/ns/"
# Compare with the host's namespaces (different inode numbers):
sudo ls -la /proc/1/ns/

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f flow-demo
- runc's job is to create the container, not to manage it. After execve(), runc is replaced by the application process.
- containerd-shim monitors the application process, captures output, and handles signals.
- In Kubernetes pods, a pause process holds the shared namespaces open so they survive application restarts; a standalone Docker container relies on its own init process instead.
- This separation allows containerd to manage the lifecycle without being PID 1 in the container.
clone() for namespaces, pivot_root() for filesystem, execve() for the application. Network setup creates veth pairs and iptables rules. The container's init process keeps its namespaces alive. Every container is a real Linux process on the host.
Storage Architecture: overlay2, Volumes, and the Filesystem Stack
Docker's storage architecture has three layers: the image layers (read-only, cached), the container layer (writable, per-container), and volumes (persistent, managed separately). Understanding this stack explains why containers start fast, why data disappears, and why database performance differs between containers and bare metal.
overlay2 driver: The default storage driver. It stacks directories (layers) and presents a merged view. Each image layer is a directory under /var/lib/docker/overlay2/. The container's writable layer is a separate directory. The merged view is what the container sees as its root filesystem.
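The merged view follows a simple rule: for any path, the topmost layer that contains it wins. The sketch below models only that lookup with ordinary directories; overlay_lookup is a hypothetical helper, and real overlayfs additionally handles whiteouts (deletions) and directory merging in the kernel.

```shell
#!/bin/bash
# Toy model of an overlay lookup: print the first layer directory that
# contains the given relative path. Pass layers top-down (upper first).
overlay_lookup() {
  local path="$1" layer
  shift
  for layer in "$@"; do
    if [ -e "$layer/$path" ]; then
      printf '%s\n' "$layer/$path"
      return 0
    fi
  done
  return 1
}

# Demo with throwaway directories standing in for layers:
root=$(mktemp -d)
mkdir -p "$root/lower/etc" "$root/upper/etc"
echo "from image" > "$root/lower/etc/motd"
echo "from image" > "$root/lower/etc/hosts"
echo "from container" > "$root/upper/etc/motd"
cat "$(overlay_lookup etc/motd "$root/upper" "$root/lower")"   # prints: from container
cat "$(overlay_lookup etc/hosts "$root/upper" "$root/lower")"  # prints: from image
rm -rf "$root"
```

The file changed by the container is served from the writable layer, while the untouched file still comes from the shared image layer.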
Layer sharing: Multiple containers from the same image share the same read-only layers. Each container has its own writable layer. This is why starting a second container from the same image is nearly instant — no data is copied, only a new writable directory is created.
Volumes: Named volumes are directories under /var/lib/docker/volumes/<volume-name>/_data. They are mounted into the container at the specified path. Volumes bypass overlay2 entirely — reads and writes go directly to the host filesystem. This is why databases should use volumes: no copy-up overhead, no overlay2 performance penalty, and data survives container deletion.
Bind mounts: Bind mounts map a specific host directory into the container. They also bypass overlay2. Bind mounts are ideal for development (live code reload) but risky in production (the container can modify host files).
tmpfs mounts: tmpfs mounts store data in memory only. They never touch the disk. Useful for sensitive data (secrets, session tokens) that should not persist.
Storage driver alternatives: overlay2 is the default on all modern Linux distributions. Other drivers include fuse-overlayfs (rootless containers), devicemapper (legacy, deprecated), btrfs (Btrfs filesystem), and zfs (ZFS filesystem). overlay2 is recommended for all use cases unless you have a specific reason to use another driver.
#!/bin/bash
# Inspect the complete Docker storage architecture

# ── 1. Check the storage driver ──────────────────────────────────────────────
docker info --format '{{.Driver}}'
# overlay2 (default on modern Linux)

# ── 2. Inspect the overlay2 directory structure ──────────────────────────────
sudo ls /var/lib/docker/overlay2/ | head -10
# Each directory is a layer (image or container writable layer)

# Pick one layer directory to look inside (the "l" entry only holds short-name symlinks):
LAYER=$(sudo ls /var/lib/docker/overlay2/ | grep -vx 'l' | head -1)
sudo ls "/var/lib/docker/overlay2/$LAYER/"
# diff/   : the actual filesystem content (only files that changed)
# link    : short name for the layer (used for path length limits)
# lower   : references to parent layers
# merged/ : the combined view (only for container layers)
# work/   : overlay2 internal working directory

# ── 3. Compare container vs image layers ─────────────────────────────────────
# Image layers are read-only and shared:
IMAGE_LAYERS=$(docker inspect alpine:3.19 --format '{{.RootFS.Layers}}')
echo "Image has $(echo "$IMAGE_LAYERS" | tr ' ' '\n' | wc -l) layers"

# Container adds one writable layer:
docker create --name storage-test alpine:3.19 sleep 3600
CONTAINER_UPPER=$(docker inspect storage-test --format '{{.GraphDriver.Data.UpperDir}}')
echo "Container writable layer: $CONTAINER_UPPER"

# ── 4. Demonstrate layer sharing between containers ──────────────────────────
# Create two containers from the same image
docker create --name storage-a alpine:3.19 sleep 3600
docker create --name storage-b alpine:3.19 sleep 3600

# Compare their lower layers (should be identical):
LOWER_A=$(docker inspect storage-a --format '{{.GraphDriver.Data.LowerDir}}')
LOWER_B=$(docker inspect storage-b --format '{{.GraphDriver.Data.LowerDir}}')
echo "Container A lower: $LOWER_A"
echo "Container B lower: $LOWER_B"
# Same layers: shared, not duplicated

# Compare their upper layers (should be different):
UPPER_A=$(docker inspect storage-a --format '{{.GraphDriver.Data.UpperDir}}')
UPPER_B=$(docker inspect storage-b --format '{{.GraphDriver.Data.UpperDir}}')
echo "Container A upper: $UPPER_A"
echo "Container B upper: $UPPER_B"
# Different directories: each container has its own writable layer

# ── 5. Inspect volumes ───────────────────────────────────────────────────────
docker volume create demo-volume

# Volume location on host:
docker volume inspect demo-volume --format '{{.Mountpoint}}'
# /var/lib/docker/volumes/demo-volume/_data

# Volumes bypass overlay2 and give direct host filesystem access:
docker run --rm -v demo-volume:/data alpine:3.19 sh -c 'echo hello > /data/test'
sudo cat /var/lib/docker/volumes/demo-volume/_data/test
# hello (directly accessible on the host)

# ── 6. Compare performance: overlay2 vs volume vs bind mount ─────────────────
# Timings in the comments are indicative and vary by host.
# overlay2 write (container writable layer):
time docker run --rm alpine:3.19 sh -c 'dd if=/dev/zero of=/tmp/test bs=1M count=100'
# ~0.3s

# Volume write:
time docker run --rm -v demo-volume:/data alpine:3.19 sh -c 'dd if=/dev/zero of=/data/test bs=1M count=100'
# ~0.2s (slightly faster, no overlay2 overhead)

# Bind mount write:
time docker run --rm -v "$(pwd)":/data alpine:3.19 sh -c 'dd if=/dev/zero of=/data/test bs=1M count=100'
# ~0.2s (direct host filesystem)

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f storage-test storage-a storage-b
docker volume rm demo-volume
- overlay2 has a copy-up penalty: modifying a file from a lower layer requires copying it to the upper layer first.
- For multi-GB database files, copy-up causes seconds of latency on first write.
- Volumes bypass overlay2 entirely — reads and writes go directly to the host filesystem.
- Volumes survive container deletion. The overlay2 writable layer is deleted when the container is removed.
Network Architecture: Bridge, veth, iptables, and DNS
Docker networking is built on Linux networking primitives: virtual bridges, veth pairs, iptables rules, and an embedded DNS server. Understanding these primitives explains why containers can communicate, how published ports work, and why the default bridge network lacks DNS.
The Docker bridge (docker0): When the Docker daemon starts, it creates a Linux bridge called docker0 on the host if one does not already exist. This bridge acts as a virtual switch. Each container on the default network connects to this bridge via a veth pair.
veth pairs: A veth (virtual Ethernet) pair is a pair of connected network interfaces — packets sent to one end appear on the other. Docker creates a veth pair for each container: one end (eth0) is inside the container's network namespace, the other end (vethXXXX) is attached to the docker0 bridge. This is how containers communicate with each other and the outside world.
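Matching a container's eth0 to its host-side veth is easier with the interface index than by guessing from names. The following is a minimal sketch: it assumes the `eth0@ifN` naming that `ip` prints (N being the peer's interface index) and the one-line format of `ip -o link`. Both helper names are hypothetical.

```shell
#!/bin/bash
# Extract the peer interface index from a name such as "eth0@if7" (or "eth0@if7:").
peer_ifindex() {
  local name="${1%:}"          # drop a trailing colon if present
  case "$name" in
    *@if*) printf '%s\n' "${name##*@if}" ;;
    *)     return 1 ;;         # not a veth-style name
  esac
}

# Read `ip -o link` output on stdin and print the interface with the given index.
host_iface_by_index() {
  awk -v idx="$1" -F': ' '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# Illustration with made-up sample data:
peer_ifindex 'eth0@if7'
printf '%s\n' '1: lo: <LOOPBACK,UP> mtu 65536' \
              '7: veth1a2b3c@if6: <BROADCAST,MULTICAST,UP> mtu 1500' |
  host_iface_by_index 7
```

On a real host the same pairing can usually be obtained by reading /sys/class/net/eth0/iflink inside the container and feeding that number to `ip -o link | host_iface_by_index`.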
iptables rules: Docker adds iptables rules for:
- Port publishing (-p): DNAT rules forward traffic from the host port to the container's IP and port.
- Inter-container communication: the default bridge allows all containers to communicate unless the daemon is started with --icc=false. User-defined bridge networks can block it with the com.docker.network.bridge.enable_icc=false driver option.
- Outbound NAT: MASQUERADE rules allow containers to reach the internet via the host's network interface.
DNS resolution: The default bridge network has no container-name resolution: containers on it can reach each other only by IP (or through legacy --link entries). User-defined bridge networks have an embedded DNS server (127.0.0.11) that resolves container names to IP addresses. This is why docker-compose.yml services can reference each other by service name: Compose attaches them to a user-defined network.
Network drivers: Docker supports multiple network drivers:
- bridge: default for single-host setups. Creates a virtual bridge.
- host: container shares the host's network stack. No network isolation, best performance.
- none: no network interface other than loopback.
- overlay: VXLAN tunnel for multi-host communication (Docker Swarm).
- macvlan: assigns a MAC address to the container, making it appear as a physical device on the network.
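To make the port-publishing rule concrete, the sketch below prints roughly the DNAT rule that a `-p 8080:3000` mapping would produce. The rule shape is an approximation of what `iptables -t nat -S DOCKER` typically shows; exact rules vary with Docker version and firewall backend, and `dnat_rule` is a hypothetical helper that only formats text (it does not touch iptables).

```shell
#!/bin/bash
# Print an approximation of the nat-table rule Docker adds for a published port.
# Arguments: host port, container IP, container port, protocol (default tcp).
dnat_rule() {
  local host_port="$1" container_ip="$2" container_port="$3" proto="${4:-tcp}"
  printf -- '-A DOCKER ! -i docker0 -p %s -m %s --dport %s -j DNAT --to-destination %s:%s\n' \
    "$proto" "$proto" "$host_port" "$container_ip" "$container_port"
}

# Host port 8080 forwarded to a container at 172.17.0.2 listening on 3000:
dnat_rule 8080 172.17.0.2 3000
```

The `! -i docker0` match is why traffic arriving from the bridge itself is not rewritten: only packets entering from other interfaces are redirected to the container.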
#!/bin/bash
# Inspect the complete Docker network architecture

# ── 1. Check the Docker bridge ───────────────────────────────────────────────
ip addr show docker0
# docker0: <BROADCAST,MULTICAST,UP> mtu 1500
#   inet 172.17.0.1/16
# The bridge has the gateway IP for the container subnet

# ── 2. Create a container and inspect its veth pair ──────────────────────────
docker run -d --name net-demo alpine:3.19 sleep 3600

# Get the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' net-demo)

# Inside the container, see eth0 (one end of the veth pair):
docker exec net-demo ip addr show eth0
# eth0@if7: <BROADCAST,MULTICAST,UP> mtu 1500
#   inet 172.17.0.2/16

# On the host, see the other end of the veth pair:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0
# The host end is attached to the docker0 bridge

# ── 3. Inspect iptables rules ────────────────────────────────────────────────
# NAT rules for port publishing (populated only for containers started with -p;
# the sample output assumes a container published with -p 8080:3000):
sudo iptables -t nat -L DOCKER -n -v
# DNAT tcp -- anywhere anywhere tcp dpt:8080 to:172.17.0.2:3000

# Forward rules:
sudo iptables -L DOCKER -n -v
# ACCEPT tcp -- anywhere 172.17.0.2 tcp dpt:3000

# MASQUERADE for outbound traffic:
sudo iptables -t nat -L POSTROUTING -n -v | grep 172.17
# MASQUERADE all -- 172.17.0.0/16 !172.17.0.0/16

# ── 4. Check DNS resolution (default vs user-defined network) ────────────────
# Default bridge (no container-name DNS):
docker exec net-demo cat /etc/resolv.conf
# nameserver 8.8.8.8 (example: inherited from the host's DNS, not container-specific)
docker exec net-demo nslookup other-container
# Fails: no embedded DNS on the default bridge

# User-defined network (embedded DNS):
docker network create app-net
docker run -d --name api --network app-net alpine:3.19 sleep 3600
docker run -d --name db --network app-net alpine:3.19 sleep 3600
docker exec api cat /etc/resolv.conf
# nameserver 127.0.0.11 (embedded DNS server)
docker exec api nslookup db
# Name: db  Address: 172.18.0.3

# ── 5. Inspect network configuration ─────────────────────────────────────────
docker network inspect bridge --format '{{json .IPAM.Config}}' | python3 -m json.tool
# [{"Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1"}]
docker network inspect app-net --format '{{json .IPAM.Config}}' | python3 -m json.tool
# [{"Subnet": "172.18.0.0/16", "Gateway": "172.18.0.1"}]

# ── 6. Trace network traffic ─────────────────────────────────────────────────
# Capture packets on the docker0 bridge:
sudo tcpdump -i docker0 -n -c 10
# Shows ARP requests, TCP SYN packets between containers

# Capture packets inside a container (uses the host's tcpdump in the container's netns):
sudo nsenter --net --target "$CONTAINER_PID" tcpdump -i eth0 -n -c 10

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f net-demo api db
docker network rm app-net
- The default bridge is a legacy design from before Docker had user-defined networks.
- Docker chose not to add DNS to the default bridge to avoid breaking backward compatibility.
- User-defined networks were introduced later with DNS as a built-in feature.
- The default bridge is effectively deprecated for production use — always create a user-defined network.
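A quick way to tell which kind of network a container landed on is to look for the embedded resolver in its resolv.conf. This is a small sketch under the assumption stated above that user-defined networks use 127.0.0.11; `uses_embedded_dns` is a hypothetical helper that reads resolv.conf content on stdin.

```shell
#!/bin/bash
# Succeed if the resolv.conf content on stdin lists Docker's embedded DNS server.
uses_embedded_dns() {
  grep -Eq '^[[:space:]]*nameserver[[:space:]]+127\.0\.0\.11([[:space:]]|$)'
}

# Illustration with sample resolv.conf content:
if printf 'nameserver 127.0.0.11\noptions ndots:0\n' | uses_embedded_dns; then
  echo "user-defined network: container names resolve"
fi
if ! printf 'nameserver 8.8.8.8\n' | uses_embedded_dns; then
  echo "default bridge: containers reachable by IP only"
fi
```

Against a live container the input would come from `docker exec <container> cat /etc/resolv.conf`.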
The Core Architectural Model — Why Client-Server Matters in Production
Docker uses a client-server architecture. The Docker client talks to the Docker Daemon, which builds, runs, and manages containers. They communicate through a REST API via UNIX sockets or a network interface. This is the fundamental model underlying everything else.
You don't install a monolithic "Docker." You install a client and a daemon. The daemon runs as root. The client runs as you. Same host, different security contexts.
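Your docker run command is a REST call. Nothing more. By default, when the daemon terminates it shuts down the containers it manages. No graceful degradation unless you enable live-restore in daemon.json, which keeps containers running while the daemon is down.
Remote daemon support exists but introduces latency. Every command becomes at least one network round trip, so a daemon on another continent can add hundreds of milliseconds per command. Your CI pipeline hates this.
Most production setups pin the daemon to local sockets. UNIX sockets are faster than TCP and don't expose a network attack surface, though access to the socket is still root-equivalent and must be restricted. Your security team appreciates this.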
When you debug "Docker not responding," check the socket, not the client. The client is almost never the problem.
# io.thecodeforge: devops tutorial
# Start the daemon with an explicit config file
# (systemd normally does this for you; do not run it while another dockerd is active)
sudo dockerd --config-file /etc/docker/daemon.json

# Typical daemon.json for production
{
  "hosts": ["unix:///var/run/docker.sock"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "storage-driver": "overlay2"
}

# Verify the socket exists
ls -la /var/run/docker.sock
# Output:
# srw-rw---- 1 root docker 0 Jan 15 14:23 /var/run/docker.sock
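When the symptom is "Docker not responding", a first check on the socket can be scripted. The sketch below assumes the default socket path shown above; `check_docker_socket` is a hypothetical helper that only inspects the file and does not contact the daemon.

```shell
#!/bin/bash
# Report whether the Docker socket exists and is usable by the current user.
# Returns 1 if the path is not a UNIX socket, 2 if permissions block access.
check_docker_socket() {
  local sock="${1:-/var/run/docker.sock}"
  if [ ! -S "$sock" ]; then
    echo "missing: $sock is not a UNIX socket"
    return 1
  fi
  if [ ! -r "$sock" ] || [ ! -w "$sock" ]; then
    echo "permission: current user cannot read/write $sock"
    return 2
  fi
  echo "ok: $sock exists and is accessible"
}

check_docker_socket || true
```

If the socket looks fine but commands still hang, `curl -s --unix-socket /var/run/docker.sock http://localhost/_ping` asks the daemon itself; a healthy daemon answers `OK`, while a hung one typically leaves the request waiting.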
Images, Containers, Networks, Volumes — They're All Objects with Metadata
Docker tracks everything as objects whose metadata lives on the daemon host, by default under /var/lib/docker. Images, containers, networks, volumes, secrets, configs. Every object has an ID, metadata, and a lifecycle.
Images are read-only templates. Think of them as frozen filesystem snapshots with metadata about ports, environment variables, and entrypoints. They're stored in layers. Each layer is a diff. Pulling an image means fetching layers. You can inspect an image's layers with docker history.
Containers are runnable instances of images. A running container is a process with namespaces and cgroups applied, plus a writable layer on top of the image layers. When you stop a container, the writable layer persists until the container is removed, either with docker rm or automatically if it was started with --rm.
Networks are virtual Layer 2 segments. Bridge networks connect containers on the same host. Overlay networks span hosts in Swarm mode. Each network object has an IPAM config, subnet, and gateway.
Volumes are persistent data stores managed by Docker. They exist outside the container's union filesystem. Bind mounts are not volumes — they're host directory references. Don't confuse them.
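The distinction shows up in the `-v` argument itself. The sketch below classifies a `-v` spec using a simplified version of the rule Docker applies: an absolute host path as the source means a bind mount, a plain name means a named volume, and no source at all means an anonymous volume. `classify_mount` is a hypothetical helper, and relative sources are flagged rather than guessed because their handling depends on the Docker version.

```shell
#!/bin/bash
# Classify a `docker run -v` argument as bind mount, named volume, or anonymous volume.
classify_mount() {
  local spec="$1" src
  case "$spec" in
    *:*) src="${spec%%:*}" ;;                 # text before the first colon
    *)   echo "anonymous volume"; return 0 ;; # only a container path was given
  esac
  case "$src" in
    /*) echo "bind mount" ;;
    .*) echo "relative path (behaviour depends on Docker version)" ;;
    *)  echo "named volume" ;;
  esac
}

classify_mount 'demo-volume:/data'
classify_mount '/srv/pgdata:/var/lib/postgresql/data:ro'
classify_mount '/data'
```

Only the first of these three is listed by `docker volume ls` under a name you chose; the bind mount never appears there at all.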
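Secrets and configs are objects available to Swarm services. Secrets are encrypted at rest in the Swarm Raft log; configs are not encrypted. Both are mounted as files inside containers. Avoid storing secrets in environment variables, which leak through docker inspect and process listings.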
# io.thecodeforge: devops tutorial
# Inspect an image's layers
$ docker history nginx:alpine
IMAGE          CREATED       CREATED BY                                      SIZE
605c77e624dd   2 weeks ago   /bin/sh -c #(nop) CMD ["nginx" "-g" "daemon…    0B
cfbafb0ab33a   2 weeks ago   /bin/sh -c #(nop) STOPSIGNAL SIGQUIT            0B
...

# Show all objects on the system
$ docker system df
TYPE            TOTAL   ACTIVE   SIZE      RECLAIMABLE
Images          12      4        2.345GB   1.2GB (51%)
Containers      8       3        1.2GB     800MB (67%)
Local Volumes   5       2        500MB     300MB (60%)
Build Cache     23      0        0B        0B
Docker Daemon Crashes Under Load — All Container Operations Fail for 20 Minutes
- The Docker daemon is a single process that handles all operations — builds, runs, networking, volumes. Under high concurrency, it becomes a bottleneck.
- Limit concurrent docker build operations per host. 50 concurrent builds can exhaust the daemon's memory and CPU.
- Use BuildKit for builds. It is the default builder since Docker Engine 23.0; on older versions set DOCKER_BUILDKIT=1. It parallelizes independent steps and reduces daemon load compared to the legacy builder.
- Separate build hosts from runtime hosts. Build operations are more resource-intensive than container lifecycle operations.
- Monitor dockerd resource usage (CPU, memory, open file descriptors). A daemon consuming >2GB RSS is a sign of overload.
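One way to cap concurrent builds per host is to push the build list through `xargs -P`, which never runs more than a fixed number of commands at once. This is a minimal sketch: `run_limited` is a hypothetical wrapper, the limit of 2 is an arbitrary illustration rather than a recommendation from the incident, and the demo uses `echo` in place of `docker build` so it is safe to run anywhere.

```shell
#!/bin/bash
# Run a command once per stdin line, with at most MAX invocations in parallel.
# Usage: printf '%s\n' item... | run_limited MAX command [args...]
run_limited() {
  local max="$1"
  shift
  xargs -P "$max" -I{} "$@" {}
}

# Demo with echo standing in for `docker build`:
printf '%s\n' svc-a svc-b svc-c | run_limited 2 echo "would build:"
```

On a CI host the same wrapper could be fed build context directories, for example `printf '%s\n' services/*/ | run_limited 4 docker build`, where the directory layout is an assumption.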
systemctl status docker && ps aux | grep dockerd
journalctl -u docker --since '10 minutes ago' --no-pager | tail -30
du -sh . (in build directory)
cat .dockerignore 2>/dev/null || echo 'NO .dockerignore'
curl -s -I https://registry-1.docker.io/v2/library/alpine/manifests/latest | grep -i ratelimit
docker info | grep -i 'registry\|mirror\|proxy'
docker inspect <container> --format '{{.State.ExitCode}} {{.State.Error}}'
docker logs <container>
docker system df -v
du -sh /var/lib/docker/* | sort -hr
systemctl status containerd
journalctl -u containerd --since '10 minutes ago' --no-pager | tail -20

| Component | Role | Failure Impact | Runs As |
|---|---|---|---|
| Docker CLI | Sends API requests to the daemon | CLI commands fail — containers unaffected | User process |
| dockerd (daemon) | Manages API, images, networks, volumes | All CLI operations fail; running containers are shut down with the daemon unless live-restore is enabled | Root process |
| containerd | Manages container lifecycle, image pulling | New containers cannot be created — existing ones keep running | Root process |
| runc | Creates a single container from OCI spec | The specific container creation fails — others unaffected | Short-lived (exits after creation) |
| containerd-shim | Monitors container process, captures output | Container loses stdout/stderr capture — process still runs | Per-container process |
| pause (Kubernetes pods, not plain Docker) | Holds a pod's shared namespaces open across container restarts | Pod's shared namespaces are destroyed; its containers must be recreated | Per-pod process |
| File | Command / Code | Purpose |
|---|---|---|
| io | curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m js... | Component Stack |
| io | tar -cf - . | wc -c | Image Build Flow |
| io | docker save io.thecodeforge/api:1.0 -o /tmp/api-image.tar | Image Distribution |
| io | docker create --name flow-demo \ | Container Creation Flow |
| io | docker info --format '{{.Driver}}' | Storage Architecture |
| io | ip addr show docker0 | Network Architecture |
| DaemonConfigCheck.yml | sudo dockerd --config-file /etc/docker/daemon.json | The Core Architectural Model |
| ObjectInspection.yml | $ docker history nginx:alpine | Images, Containers, Networks, Volumes |
Key takeaways
Common mistakes to avoid
7 patterns
Not understanding that the daemon is a single point of failure
Sending a 500MB build context without .dockerignore
Using the default bridge network and expecting DNS resolution
Writing database data to the overlay2 filesystem
Not setting resource limits on containers
Flushing iptables while Docker is running
Not authenticating CI runners for Docker Hub pulls
Interview Questions on This Topic
Walk me through the complete flow from 'docker build' to a cached image on disk. What happens at each step, and how does layer caching work?
Explain the Docker component stack: CLI, daemon, containerd, runc. What does each component do, and what happens when each one fails?
How does the OCI spec enable runtime replaceability? Why can you swap runc for gVisor or Kata without changing Docker?
Trace the container creation flow from 'docker run' to a running process. What kernel syscalls does runc make?
How does Docker networking work at the Linux level? Explain veth pairs, the docker0 bridge, and iptables rules for port publishing.
Why should databases use named volumes instead of the overlay2 filesystem? What is the copy-up problem?
Your CI pipeline runs 50 concurrent docker build operations and the daemon becomes unresponsive. What is happening and how do you fix it?
Frequently Asked Questions
dockerd (the Docker daemon) is the user-facing server that manages the Docker API, image building, networking, and volumes. containerd is the container runtime that manages the container lifecycle: pulling images, creating containers, and handling execution. dockerd delegates to containerd for container operations. containerd was split out of Docker as a standalone project, donated to the CNCF in 2017, and is now used independently by Kubernetes and other platforms.
The OCI (Open Container Initiative) spec defines two standards: the image spec (how images are packaged as layers + manifest) and the runtime spec (how containers are created as config.json). This standardization means any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can run any OCI-compliant image. It enables runtime replaceability — you can swap runc for gVisor without changing Docker.
Yes. containerd provides its own CLI (ctr) and API (gRPC). Kubernetes uses containerd directly via the CRI plugin, bypassing dockerd entirely. You can use ctr to pull images, create containers, and manage snapshots. This reduces overhead and removes the daemon as a single point of failure.
The build context is the directory you pass as the final argument to docker build, usually the current directory (the -f flag only selects the Dockerfile, not the context). Without a .dockerignore file, this includes node_modules (often 500MB+), .git history (often 100MB+), and other large files. The CLI tars this directory and sends it to the daemon over the Unix socket. Create a .dockerignore file to exclude unnecessary files. This alone can reduce build time from minutes to seconds.
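A rough size check before building can catch an oversized context early. The sketch below uses the same `tar -cf - . | wc -c` idea listed in the code index above; `context_size_bytes` is a hypothetical helper, and because plain tar ignores .dockerignore the number is an upper bound on what the CLI would actually send.

```shell
#!/bin/bash
# Print the size in bytes of a tar stream of the given directory (default: current).
# This approximates an unfiltered build context; .dockerignore is NOT applied.
context_size_bytes() {
  local dir="${1:-.}"
  tar -cf - -C "$dir" . 2>/dev/null | wc -c | tr -d ' '
}

# Example usage (not executed here):
#   size=$(context_size_bytes .)
#   [ "$size" -gt $((100 * 1024 * 1024)) ] && echo "context over 100MB: check .dockerignore"
```

The 100MB threshold in the usage comment is an arbitrary illustration; pick one that matches what your builds genuinely need.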