Docker Architecture Explained: Complete End-to-End Flow (Images, Containers, Engine)
- Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each component has a specific role. The OCI spec standardizes the interface between containerd and runc.
- Image build flow: CLI sends context -> daemon parses Dockerfile -> cache lookup per instruction -> execute on miss -> commit layer -> tag image. A .dockerignore file is effectively mandatory for keeping the context small.
- Container creation flow: CLI -> dockerd API -> containerd -> runc -> kernel syscalls (clone, pivot_root, execve). Every container is a real Linux process on the host.
- docker build: CLI sends context to dockerd -> dockerd executes Dockerfile instructions -> each instruction creates a cached layer -> layers stored under /var/lib/docker/overlay2/
- docker push: dockerd uploads layers to a registry -> registry stores layers by digest -> tags point to manifests
- docker run: CLI sends API request to dockerd -> dockerd delegates to containerd -> containerd invokes runc -> runc configures namespaces/cgroups/filesystem -> exec starts the application
- Docker CLI: HTTP client that talks to the daemon via Unix socket
- dockerd (daemon): manages images, networks, volumes, and the REST API
- containerd: manages container lifecycle, image pulling, and snapshot management
- runc: creates containers by calling kernel syscalls (clone, pivot_root, exec)
Production Debug Guide
From daemon crashes to slow builds — systematic debugging paths through the component stack.

Docker daemon is unresponsive or crashed.
  systemctl status docker && ps aux | grep dockerd
  journalctl -u docker --since '10 minutes ago' --no-pager | tail -30

docker build hangs at 'Sending build context'.
  du -sh .   # in the build directory
  cat .dockerignore 2>/dev/null || echo 'NO .dockerignore'

docker pull fails or is extremely slow.
  curl -s -I https://registry-1.docker.io/v2/library/alpine/manifests/latest | grep -i ratelimit
  docker info | grep -i 'registry\|mirror\|proxy'

Container exits immediately after start.
  docker inspect <container> --format '{{.State.ExitCode}} {{.State.Error}}'
  docker logs <container>

Docker disk usage is growing rapidly.
  docker system df -v
  du -sh /var/lib/docker/* | sort -hr

containerd is not running or crashing.
  systemctl status containerd
  journalctl -u containerd --since '10 minutes ago' --no-pager | tail -20
Most Docker documentation treats the architecture as a black box — run a command, get a container. This abstraction breaks down when containers fail to start, images pull slowly, or the daemon crashes under load. Understanding the component stack and the data flow between components is essential for production debugging.
Docker is not one program. It is a chain of specialized components: the CLI sends API requests to the daemon, the daemon delegates to containerd, containerd invokes runc, and runc configures the Linux kernel to create an isolated process. Each handoff is a potential failure point. The OCI (Open Container Initiative) spec standardizes the interface between containerd and runc, enabling runtime replaceability.
This article traces the complete end-to-end flow: what happens when you run docker build, how images are stored and distributed, what happens when you run docker run, how networking and storage are wired, and where each component lives on the filesystem. Every section includes production failure scenarios and debugging commands.
Component Stack: CLI, Daemon, containerd, runc, and the OCI Spec
Docker is a chain of five components, each with a specific responsibility. Understanding this chain is the foundation for debugging any Docker issue.
Docker CLI (docker): A Go binary that sends HTTP requests to the Docker daemon via a Unix socket (/var/run/docker.sock) or TCP. The CLI does not create containers, build images, or manage networks — it is a thin client. You can replace it with curl: curl --unix-socket /var/run/docker.sock http://localhost/containers/json.
Docker daemon (dockerd): A long-running Go process that manages the Docker API, image storage, network configuration, volume management, and build orchestration. The daemon listens on the Unix socket and processes all API requests. It delegates container lifecycle operations to containerd. The daemon runs as root and has full access to the host.
containerd: A container runtime daemon that manages the complete container lifecycle — pulling images, managing snapshots, creating containers, and handling execution. containerd was originally part of Docker but was extracted as a CNCF project in 2017. It is now used independently by Docker, Kubernetes (via CRI), AWS ECS, GKE, and other platforms. containerd invokes runc to actually create containers.
runc: A lightweight CLI tool that creates a single container from an OCI runtime specification (config.json). runc calls clone() to create a new process with namespaces, configures cgroups, mounts the filesystem via overlay2, drops privileges, and exec's the application process. runc exits after creating the container — it does not manage the lifecycle.
OCI spec: The Open Container Initiative defines two standards: the image spec (how images are packaged as layers + manifest) and the runtime spec (how containers are created from config.json). This standardization enables runtime replaceability — you can swap runc for crun, kata-runtime, or runsc without changing Docker or containerd.
The handoff chain: docker run -> dockerd (API) -> containerd (lifecycle) -> runc (creation) -> kernel (namespaces, cgroups, overlay2). Each handoff is a potential failure point. If dockerd crashes, all operations fail. If containerd crashes, new containers cannot be created but existing ones keep running. If runc fails, the specific container creation fails but the stack above is unaffected.
#!/bin/bash
# Trace the complete Docker architecture flow

# ── 1. CLI -> Daemon communication ───────────────────────────────────────────
# The CLI sends HTTP requests to the daemon socket
curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m json.tool
# Shows: Version, ApiVersion, GoVersion, Os, Arch, KernelVersion

# List containers via the API (same as docker ps)
curl --unix-socket /var/run/docker.sock http://localhost/containers/json | python3 -m json.tool

# ── 2. Daemon -> containerd communication ────────────────────────────────────
# containerd runs as a separate process, communicating via gRPC
ps aux | grep containerd
# root 1234 0.3 0.5 ... /usr/bin/containerd

# Check the containerd socket
ls -la /run/containerd/containerd.sock
# srw-rw---- 1 root containerd /run/containerd/containerd.sock

# List containers managed by containerd (via ctr)
sudo ctr -n moby containers ls
# Shows containers that containerd is managing on behalf of Docker

# ── 3. containerd -> runc communication ──────────────────────────────────────
# runc is invoked by containerd to create each container
which runc
# /usr/bin/runc
runc --version
# runc version 1.1.9

# List containers managed by runc
sudo runc list
# Shows: container ID, PID, status, bundle path

# ── 4. Inspect the OCI runtime spec for a running container ──────────────────
CONTAINER_ID=$(docker ps -q | head -1)
# Find the container's bundle directory
sudo find /run/containerd -name config.json 2>/dev/null | head -3
# /run/containerd/io.containerd.runtime.v2.task/default/<id>/config.json

# ── 5. Check the daemon process tree ─────────────────────────────────────────
pstree -p $(pidof dockerd)
# dockerd(1234)───containerd(1235)───containerd-shim(5678)─┬─node(5679)
#                                                          └─pause(5677)

# ── 6. Check all components are running ──────────────────────────────────────
echo "dockerd: $(systemctl is-active docker)"
echo "containerd: $(systemctl is-active containerd)"
echo "runc: $(which runc && echo 'installed' || echo 'missing')"

# ── 7. Check the daemon's storage driver and root directory ──────────────────
docker info --format '{{.Driver}} {{.DockerRootDir}}'
# overlay2 /var/lib/docker

# ── 8. Check the daemon's configured runtimes ────────────────────────────────
docker info --format '{{json .Runtimes}}' | python3 -m json.tool
# Shows: runc (default), and any custom runtimes (runsc, kata)
{
"Version": "24.0.7",
"ApiVersion": "1.43",
"GoVersion": "go1.20.10",
"Os": "linux",
"Arch": "amd64",
"KernelVersion": "6.1.0-18-amd64"
}
# Component status:
dockerd: active
containerd: active
runc: installed
# Storage driver:
overlay2 /var/lib/docker
# Process tree:
dockerd(1234)───containerd(1235)───containerd-shim(5678)─┬─node(5679)
                                                         └─pause(5677)
- containerd was extracted from Docker in 2017 to become a standalone CNCF project.
- Kubernetes can use containerd directly (via CRI) without dockerd — reducing overhead and complexity.
- Separation allows independent scaling — containerd can be updated without restarting dockerd.
- containerd manages the lifecycle; dockerd manages the user-facing API and image building.
Image Build Flow: From Dockerfile to Cached Layers
When you run docker build, a precise sequence of operations transforms a Dockerfile into a cached, layered image. Understanding this flow explains why builds are slow, why layers are cached, and why image size matters.
Step 1: Send build context. The CLI tars the build context (the PATH argument to docker build; the -f flag only selects the Dockerfile) and sends it to the daemon via the Unix socket. This is the 'Sending build context to Docker daemon' message. The .dockerignore file filters out excluded files before sending. Without .dockerignore, the entire directory (including .git and node_modules) is sent.
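The context-size effect of .dockerignore can be measured without a daemon at all — the context is just a tar stream. A minimal local sketch (the directory and file names are hypothetical stand-ins for a real project):

```shell
#!/bin/sh
# Simulate a build context with and without exclusions.
set -eu
ctx=$(mktemp -d)
mkdir -p "$ctx/src" "$ctx/node_modules"
echo 'console.log("hi")' > "$ctx/src/app.js"
# 2MB of fake "dependencies" standing in for node_modules
dd if=/dev/zero of="$ctx/node_modules/blob.bin" bs=1024 count=2048 2>/dev/null

# What the CLI would send with no .dockerignore:
full=$(tar -cf - -C "$ctx" . | wc -c | tr -d ' ')
# What it would send with node_modules excluded:
trimmed=$(tar -cf - --exclude=node_modules -C "$ctx" . | wc -c | tr -d ' ')

echo "full context:    $full bytes"
echo "trimmed context: $trimmed bytes"
rm -rf "$ctx"
```

The exclusion cuts the stream from megabytes to kilobytes, which is exactly the difference between a fast and a hanging 'Sending build context' step.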
Step 2: Parse the Dockerfile. The daemon parses the Dockerfile and executes each instruction sequentially. Each instruction is evaluated against the layer cache.
Step 3: Cache lookup. For each instruction, the daemon checks whether a cached layer exists with the same instruction text and the same parent layer. On a cache hit, the layer is reused with no execution. On a miss, the instruction is executed and a new layer is created. The cache is sequential — a miss invalidates all subsequent layers.
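This sequential rule is why well-ordered Dockerfiles copy dependency manifests before source code: a change in src/ then invalidates only the final layers. A sketch of the idea, assuming a hypothetical Node.js project layout:

```dockerfile
FROM node:20-alpine
WORKDIR /app
# Dependency manifests change rarely -> this COPY usually hits the cache
COPY package.json package-lock.json ./
# Re-runs only when the COPY above misses
RUN npm ci
# Source changes often -> only this layer and later ones rebuild
COPY src/ ./src/
CMD ["node", "src/app.js"]
```

Inverting the order (COPY src/ before RUN npm ci) would re-run the dependency install on every source edit.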
Step 4: Execute the instruction. For RUN, the daemon creates a temporary container from the previous layer, executes the command, and captures the filesystem diff as a new layer. For COPY/ADD, the daemon copies files from the build context into a new layer. For ENV/EXPOSE/LABEL, the daemon creates a metadata-only layer (no filesystem change).
Step 5: Commit the layer. The filesystem diff is committed as a new layer under /var/lib/docker/overlay2/. Each layer is a directory containing only the files that changed from the previous layer. The layer is identified by a SHA256 digest.
Step 6: Tag the image. After all instructions are executed, the final layer is tagged with the image name and tag (e.g., my-app:1.0.0). The tag points to a manifest β a JSON file that lists all layers in order.
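The tag-to-manifest relationship can be sketched locally with a toy layer: a manifest is just JSON listing layer digests in order. Everything below is fabricated content — only the digest arithmetic is real:

```shell
#!/bin/sh
# Build a toy "layer" and a manifest that references it by digest,
# loosely mirroring the structure docker save produces (simplified).
set -eu
work=$(mktemp -d)
printf 'layer content' > "$work/layer.tar"     # stand-in for a real layer tarball
digest=$(sha256sum "$work/layer.tar" | cut -d' ' -f1)

cat > "$work/manifest.json" <<EOF
[{"RepoTags": ["my-app:1.0.0"], "Layers": ["sha256:$digest/layer.tar"]}]
EOF

python3 -m json.tool "$work/manifest.json"
rm -rf "$work"
```

Because the digest is derived from the layer's bytes, any change to the layer produces a different digest, and the manifest no longer matches — this is what makes layers content-addressed.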
BuildKit vs legacy builder: The legacy builder executes instructions sequentially. BuildKit (DOCKER_BUILDKIT=1) builds a dependency graph and executes independent instructions in parallel. BuildKit also supports --mount=type=secret for build-time secrets without baking them into layers. BuildKit is the default in Docker Desktop and is recommended for all builds.
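As a sketch of the secret mount (the secret id npm_token and its usage are hypothetical), the secret file exists only for the duration of the RUN and is never committed to a layer:

```dockerfile
# syntax=docker/dockerfile:1
FROM alpine:3.19
# /run/secrets/npm_token exists only while this RUN executes;
# it never appears in the resulting image layer.
RUN --mount=type=secret,id=npm_token \
    sh -c 'NPM_TOKEN=$(cat /run/secrets/npm_token) && echo "token consumed, not persisted"'
```

Built with something like `DOCKER_BUILDKIT=1 docker build --secret id=npm_token,src=$HOME/.npm-token .` — the src path here is illustrative.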
#!/bin/bash
# Trace the complete image build flow

# ── 1. Build context size (before and after .dockerignore) ───────────────────
# Without .dockerignore:
tar -cf - . | wc -c
# May be 500MB+ if node_modules and .git are included

# With .dockerignore:
cat .dockerignore
# node_modules/
# .git/
# *.log
tar -cf - --exclude-from=.dockerignore . | wc -c
# Should be <10MB for a typical project

# ── 2. Build with cache inspection ───────────────────────────────────────────
# Build with BuildKit and progress=plain to see every step
DOCKER_BUILDKIT=1 docker build --progress=plain -t io.thecodeforge/api:1.0 . 2>&1 | tee /tmp/build.log

# Count cached vs executed steps:
grep -c 'CACHED' /tmp/build.log
grep -c 'RUN\|COPY' /tmp/build.log

# ── 3. Inspect the image layers ──────────────────────────────────────────────
# List layers in the image
docker inspect io.thecodeforge/api:1.0 --format '{{json .RootFS.Layers}}' | python3 -m json.tool
# Each entry is a SHA256 digest of a layer

# Show layer sizes
docker history io.thecodeforge/api:1.0 --format '{{.Size}}\t{{.CreatedBy}}' | head -10
# Shows the size contribution of each instruction

# ── 4. Find layers on disk ───────────────────────────────────────────────────
ls /var/lib/docker/overlay2/ | head -10
# Each directory is a layer. Shared layers are hard-linked.

# Check disk usage per layer:
du -sh /var/lib/docker/overlay2/* | sort -hr | head -10

# ── 5. Inspect the image manifest ────────────────────────────────────────────
# Save the image and inspect its manifest
docker save io.thecodeforge/api:1.0 | tar -xO manifest.json | python3 -m json.tool
# Shows: Config (image config), RepoTags, Layers (ordered list of layer tar files)

# ── 6. Compare BuildKit vs legacy builder performance ────────────────────────
# Legacy builder:
time DOCKER_BUILDKIT=0 docker build -t test:legacy .
# Sequential execution — slower for multi-step builds

# BuildKit:
time DOCKER_BUILDKIT=1 docker build -t test:buildkit .
# Parallel execution — faster for independent steps

# ── 7. Check the build cache ─────────────────────────────────────────────────
docker builder du
# Shows disk usage of the build cache
docker builder prune
# Removes unused build cache entries
8543232 (8.1MB with .dockerignore)
524288000 (500MB without .dockerignore)
# Build with cache:
#5 [2/6] COPY package.json ./ CACHED
#6 [3/6] RUN npm ci CACHED
#7 [4/6] COPY src/ ./src/ 0.3s
# Only the COPY src/ step was rebuilt
# Image layers:
[
"sha256:abc123...",
"sha256:def456...",
"sha256:ghi789..."
]
# Layer sizes:
142MB COPY --from=builder /app/node_modules ./node_modules
12MB COPY --from=builder /app/dist ./dist
7MB FROM node:20-alpine
# BuildKit vs legacy:
Legacy: 42.3s
BuildKit: 28.1s (33% faster)
- The daemon needs access to files referenced by COPY and ADD instructions.
- The daemon runs on the host (or a remote machine) — it cannot access the CLI's local filesystem directly.
- The CLI tars the context and sends it over the Unix socket. This is why .dockerignore is critical for build speed.
- BuildKit optimizes this by only sending files referenced by COPY/ADD, not the entire context.
Image Distribution: Registry, Manifest, and Layer Deduplication
Once an image is built, it needs to be distributed to other machines — CI servers, staging environments, production clusters. This is the registry's job.
Image format: An image is not a single file. It is a collection of:
- Layers: compressed tar archives, each identified by a SHA256 digest
- Manifest: a JSON file that lists the layers in order and points to the image config
- Image config: a JSON file that defines the runtime configuration (env vars, entrypoint, exposed ports, user)
The registry protocol: Docker registries implement the OCI Distribution Spec — an HTTP API for pushing and pulling images. The push flow:
1. Client checks (by digest) which layers the registry already has
2. Client uploads only the missing layers
3. Client uploads the manifest, which references the layers by digest
4. Registry stores layers by digest and links them to the manifest
Layer deduplication: This is the key efficiency mechanism. If two images share the same base layer (e.g., both use node:20-alpine), the layer is stored once on the registry and once on the local machine. When you pull a second image that shares layers with an existing image, only the unique layers are downloaded. This is why pulling a new version of your app is fast — only the top layers (containing your code) change.
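The storage model behind deduplication can be sketched as a toy content-addressed store — not the registry API, just the idea that blobs are filed under their own digest, so a shared base layer is stored only once:

```shell
#!/bin/sh
# Toy content-addressed blob store: layers are stored once per digest.
set -eu
store=$(mktemp -d)

push_blob() {
    # "Upload" a layer only if the store lacks its digest.
    digest=$(sha256sum "$1" | cut -d' ' -f1)
    if [ -f "$store/$digest" ]; then
        echo "layer $digest: already exists"
    else
        cp "$1" "$store/$digest"
        echo "layer $digest: pushed"
    fi
}

base=$(mktemp); printf 'node:20-alpine base layer' > "$base"
app1=$(mktemp); printf 'app one code'              > "$app1"
app2=$(mktemp); printf 'app two code'              > "$app2"

# Image 1 = base + app1; Image 2 = base + app2.
push_blob "$base"; push_blob "$app1"   # both pushed
push_blob "$base"; push_blob "$app2"   # base already exists; only app2 pushed

echo "blobs stored: $(ls "$store" | wc -l | tr -d ' ')"
```

Two two-layer images end up as three stored blobs, which is the same effect you see as 'Layer already exists' in docker push output.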
Docker Hub pull-rate limits: Docker Hub limits pulls: 100 per 6 hours per IP for anonymous users, and 200 per 6 hours per account for authenticated free users. The anonymous limit is per IP, not per user — a NAT gateway makes multiple machines appear as one IP. For CI/CD pipelines, these limits are hit quickly. The fix: authenticate with docker login, use a pull-through cache, or mirror images to a private registry.
Content trust (DCT): Docker Content Trust uses digital signatures to verify image integrity. When DOCKER_CONTENT_TRUST=1, Docker only pulls signed images. This prevents supply chain attacks where a malicious image is pushed with the same tag as a legitimate image.
#!/bin/bash
# Trace the image distribution flow

# ── 1. Inspect the local image manifest ──────────────────────────────────────
# Save the image and extract the manifest
docker save io.thecodeforge/api:1.0 -o /tmp/api-image.tar
cd /tmp && tar xf api-image.tar

# The manifest.json lists all components:
cat manifest.json | python3 -m json.tool
# [
#   {
#     "Config": "sha256:abc123...json",        <- image config
#     "RepoTags": ["io.thecodeforge/api:1.0"],
#     "Layers": [                              <- ordered layer list
#       "sha256:def456.../layer.tar",
#       "sha256:ghi789.../layer.tar"
#     ]
#   }
# ]

# ── 2. Inspect the image config ──────────────────────────────────────────────
cat sha256:abc123*.json | python3 -m json.tool | head -30
# Shows: architecture, os, config (env, cmd, entrypoint), rootfs (diff_ids)

# ── 3. Push to a registry ────────────────────────────────────────────────────
# Login to Docker Hub
docker login

# Tag the image for the registry
docker tag io.thecodeforge/api:1.0 youruser/io-thecodeforge-api:1.0

# Push — watch which layers are pushed vs already exist
docker push youruser/io-thecodeforge-api:1.0
# Output shows:
#   Layer already exists  (shared with base image)
#   Pushing layer         (unique to this image)

# ── 4. Pull from a registry ──────────────────────────────────────────────────
# Pull on a different machine
docker pull youruser/io-thecodeforge-api:1.0
# Output shows:
#   Already exists  (layers shared with local images)
#   Downloading     (unique layers)

# ── 5. Check pull-rate limit status ──────────────────────────────────────────
curl -s -I \
  -H "Authorization: Bearer $(curl -s 'https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull' | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
  | grep -i ratelimit
# ratelimit-limit: 100;w=21600
# ratelimit-remaining: 76;w=21600

# ── 6. Enable Docker Content Trust ───────────────────────────────────────────
export DOCKER_CONTENT_TRUST=1
# Now docker pull only fetches signed images
docker pull youruser/io-thecodeforge-api:1.0
# If the image is not signed, the pull fails with a trust error

# ── 7. Check layer deduplication ─────────────────────────────────────────────
# Compare layers between two images
docker inspect node:20-alpine --format '{{.RootFS.Layers}}' | tr ' ' '\n' | wc -l
docker inspect io.thecodeforge/api:1.0 --format '{{.RootFS.Layers}}' | tr ' ' '\n' | wc -l
# The API image shares base layers with node:20-alpine
[
{
"Config": "sha256:a1b2c3d4e5f6...json",
"RepoTags": ["io.thecodeforge/api:1.0"],
"Layers": [
"sha256:f1e2d3c4b5a6.../layer.tar",
"sha256:a7b8c9d0e1f2.../layer.tar",
"sha256:d3e4f5a6b7c8.../layer.tar"
]
}
]
# Push output:
The push refers to repository [docker.io/youruser/io-thecodeforge-api]
f1e2d3c4b5a6: Mounted from library/node (shared layer)
a7b8c9d0e1f2: Pushed (unique layer)
d3e4f5a6b7c8: Pushed (unique layer)
1.0: digest: sha256:abc123... size: 1570
# Rate limit:
ratelimit-limit: 100;w=21600
ratelimit-remaining: 76;w=21600
- Without deduplication, every image would store a full copy of its base OS — wasting disk and bandwidth.
- With deduplication, shared layers (like node:20-alpine) are stored once and referenced by multiple images.
- Pulling a new app version only downloads the changed layers (typically your code — a few MB), not the entire image.
- This is why Docker images are practical at scale — the overhead per image is only the unique layers.
Container Creation Flow: From Image to Running Process
When you run docker run, a precise sequence of operations creates an isolated process from an image. This is the most critical flow to understand for production debugging.
Step 1: API request. The CLI sends a POST /containers/create request to dockerd. The request includes the image name, command, environment variables, port mappings, volume mounts, and resource limits.
Step 2: Image resolution. dockerd checks if the image exists locally. If not, it pulls the image from the registry. The image's layers are unpacked into /var/lib/docker/overlay2/.
Step 3: Create container metadata. dockerd creates a container configuration (container JSON) that includes the merged overlay2 directory, network settings, volume mounts, and resource limits. This metadata is stored in /var/lib/docker/containers/<container-id>/.
Step 4: Delegate to containerd. dockerd sends a gRPC request to containerd to create the container. containerd generates the OCI runtime spec (config.json) β a JSON file that defines namespaces, cgroups, mounts, and the process to execute.
Step 5: Invoke runc. containerd invokes runc create with the OCI spec. runc reads config.json and executes kernel syscalls:
- clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC) — creates a new process with namespaces
- mount() — mounts /proc, /sys, /dev inside the container
- pivot_root() — changes the root to the overlay2 merged directory
- setuid()/setgid() — drops privileges (if running as non-root)
- execve() — starts the application process
Step 6: Network setup. dockerd (via libnetwork) creates a veth pair — one end in the container's network namespace, one end on the Docker bridge (docker0). The container gets an IP address from the bridge's subnet. iptables rules are added for port publishing (-p) and inter-container communication.
Step 7: Monitor the process. containerd-shim monitors the container process, captures stdout/stderr, and handles signals. (In Kubernetes pods, an additional pause container holds the pod's shared namespaces open.) When the application process exits, containerd-shim reports the exit code to containerd, which reports to dockerd.
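The namespace-identity check used in this flow works on any Linux process, container or not: /proc/&lt;pid&gt;/ns/ exposes each namespace as a symlink whose target encodes an inode, and two processes share a namespace exactly when the targets match. A minimal sketch on the host:

```shell
#!/bin/sh
# Namespaces are identified by inode: same link target => same namespace.
# A child shell spawned normally shares every namespace with its parent.
set -eu
parent_net=$(readlink /proc/$$/ns/net)
child_net=$(sh -c 'readlink /proc/$$/ns/net')   # $$ expands inside the child

echo "parent: $parent_net"
echo "child:  $child_net"

if [ "$parent_net" = "$child_net" ]; then
    echo "same network namespace"
fi
```

This is the same comparison the debugging script below performs between a container process and its sibling: a containerized process would show a different net:[inode] target than the host shell.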
#!/bin/bash
# Trace the complete container creation flow

# ── 1. Create a container (without starting it) ──────────────────────────────
docker create --name flow-demo \
  --cpus=1.0 \
  --memory=256m \
  -p 8080:3000 \
  -v demo-data:/app/data \
  alpine:3.19 sleep 3600

# ── 2. Inspect the container metadata ────────────────────────────────────────
# Container config stored by the daemon:
ls /var/lib/docker/containers/$(docker inspect flow-demo --format '{{.Id}}')/
# config.v2.json hostconfig.json hostname hosts resolv.conf ...

# ── 3. Start the container and trace the flow ────────────────────────────────
docker start flow-demo

# Get the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' flow-demo)
echo "Container process PID on host: $CONTAINER_PID"

# ── 4. Inspect the OCI runtime spec ──────────────────────────────────────────
# Find the container's bundle in containerd's state:
sudo find /run/containerd -path '*flow-demo*' -name config.json 2>/dev/null

# Inspect the OCI spec (namespaces, mounts, process):
sudo cat /run/containerd/io.containerd.runtime.v2.task/default/*/config.json 2>/dev/null | python3 -m json.tool | head -60

# ── 5. Inspect the overlay2 filesystem ───────────────────────────────────────
docker inspect flow-demo --format '{{json .GraphDriver.Data}}' | python3 -m json.tool
# MergedDir: what the container sees as /
# UpperDir: writable layer (container-specific changes)
# LowerDir: read-only image layers

# ── 6. Inspect the network setup ─────────────────────────────────────────────
# Container's network config:
docker inspect flow-demo --format '{{json .NetworkSettings}}' | python3 -m json.tool | head -20
# Shows: IPAddress, Gateway, Ports, Networks

# veth pair on the host:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0

# iptables rules for port publishing:
sudo iptables -t nat -L -n | grep 8080
# DNAT rule forwarding host:8080 to container:3000

# ── 7. Find the pause process ────────────────────────────────────────────────
ps aux | grep pause | grep -v grep
# root 5677 0.0 0.0 1024 4 ? Ss 10:23 0:00 /pause

# The pause process and container process share namespaces:
ls -la /proc/$CONTAINER_PID/ns/net
ls -la /proc/$(pgrep -f '/pause' | head -1)/ns/net
# Both point to the same namespace inode

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f flow-demo
flow-demo
# Host PID:
Container process PID on host: 5679
# Overlay2:
{
"LowerDir": "/var/lib/docker/overlay2/.../layers",
"MergedDir": "/var/lib/docker/overlay2/.../merged",
"UpperDir": "/var/lib/docker/overlay2/.../diff",
"WorkDir": "/var/lib/docker/overlay2/.../work"
}
# Network:
{
"IPAddress": "172.17.0.2",
"Gateway": "172.17.0.1",
"Ports": {"3000/tcp": [{"HostIp": "0.0.0.0", "HostPort": "8080"}]}
}
# iptables:
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.2:3000
- runc's job is to create the container, not to manage it. After execve(), runc is replaced by the application process.
- containerd-shim monitors the application process, captures output, and handles signals.
- In Kubernetes pods, the pause container holds the pod's shared namespaces open so they survive application container restarts.
- This separation allows containerd to manage the lifecycle without being PID 1 in the container.
Storage Architecture: overlay2, Volumes, and the Filesystem Stack
Docker's storage architecture has three layers: the image layers (read-only, cached), the container layer (writable, per-container), and volumes (persistent, managed separately). Understanding this stack explains why containers start fast, why data disappears, and why database performance differs between containers and bare metal.
overlay2 driver: The default storage driver. It stacks directories (layers) and presents a merged view. Each image layer is a directory under /var/lib/docker/overlay2/. The container's writable layer is a separate directory. The merged view is what the container sees as its root filesystem.
Layer sharing: Multiple containers from the same image share the same read-only layers. Each container has its own writable layer. This is why starting a second container from the same image is nearly instant — no data is copied, only a new writable directory is created.
Volumes: Named volumes are directories under /var/lib/docker/volumes/<volume-name>/_data. They are mounted into the container at the specified path. Volumes bypass overlay2 entirely — reads and writes go directly to the host filesystem. This is why databases should use volumes: no copy-up overhead, no overlay2 performance penalty, and data survives container deletion.
Bind mounts: Bind mounts map a specific host directory into the container. They also bypass overlay2. Bind mounts are ideal for development (live code reload) but risky in production (the container can modify host files).
tmpfs mounts: tmpfs mounts store data in memory only. They never touch the disk. Useful for sensitive data (secrets, session tokens) that should not persist.
Storage driver alternatives: overlay2 is the default on all modern Linux distributions. Other drivers include fuse-overlayfs (rootless containers), devicemapper (legacy, deprecated), btrfs (Btrfs filesystem), and zfs (ZFS filesystem). overlay2 is recommended for all use cases unless you have a specific reason to use another driver.
#!/bin/bash
# Inspect the complete Docker storage architecture

# ── 1. Check the storage driver ──────────────────────────────────────────────
docker info --format '{{.Driver}}'
# overlay2 (default on modern Linux)

# ── 2. Inspect the overlay2 directory structure ──────────────────────────────
ls /var/lib/docker/overlay2/ | head -10
# Each directory is a layer (image or container writable layer)

# Each layer directory contains:
ls /var/lib/docker/overlay2/<layer-hash>/
# diff/   — the actual filesystem content (only files that changed)
# link    — short name for the layer (used for path length limits)
# lower   — references to parent layers
# merged/ — the combined view (only for container layers)
# work/   — overlay2 internal working directory

# ── 3. Compare container vs image layers ─────────────────────────────────────
# Image layers are read-only and shared:
IMAGE_LAYERS=$(docker inspect alpine:3.19 --format '{{.RootFS.Layers}}')
echo "Image has $(echo $IMAGE_LAYERS | tr ' ' '\n' | wc -l) layers"

# Container adds one writable layer:
docker create --name storage-test alpine:3.19 sleep 3600
CONTAINER_UPPER=$(docker inspect storage-test --format '{{.GraphDriver.Data.UpperDir}}')
echo "Container writable layer: $CONTAINER_UPPER"

# ── 4. Demonstrate layer sharing between containers ──────────────────────────
# Create two containers from the same image
docker create --name storage-a alpine:3.19 sleep 3600
docker create --name storage-b alpine:3.19 sleep 3600

# Compare their lower layers (should be identical):
LOWER_A=$(docker inspect storage-a --format '{{.GraphDriver.Data.LowerDir}}')
LOWER_B=$(docker inspect storage-b --format '{{.GraphDriver.Data.LowerDir}}')
echo "Container A lower: $LOWER_A"
echo "Container B lower: $LOWER_B"
# Same layers — shared, not duplicated

# Compare their upper layers (should be different):
UPPER_A=$(docker inspect storage-a --format '{{.GraphDriver.Data.UpperDir}}')
UPPER_B=$(docker inspect storage-b --format '{{.GraphDriver.Data.UpperDir}}')
echo "Container A upper: $UPPER_A"
echo "Container B upper: $UPPER_B"
# Different directories — each container has its own writable layer

# ── 5. Inspect volumes ───────────────────────────────────────────────────────
docker volume create demo-volume

# Volume location on host:
docker volume inspect demo-volume --format '{{.Mountpoint}}'
# /var/lib/docker/volumes/demo-volume/_data

# Volumes bypass overlay2 — direct host filesystem access:
docker run --rm -v demo-volume:/data alpine:3.19 sh -c 'echo hello > /data/test'
cat /var/lib/docker/volumes/demo-volume/_data/test
# hello — directly accessible on the host

# ── 6. Compare performance: overlay2 vs volume vs bind mount ─────────────────
# overlay2 write (container writable layer):
time docker run --rm alpine:3.19 sh -c 'dd if=/dev/zero of=/tmp/test bs=1M count=100'
# ~0.3s

# Volume write:
time docker run --rm -v demo-volume:/data alpine:3.19 sh -c 'dd if=/dev/zero of=/data/test bs=1M count=100'
# ~0.2s (slightly faster — no overlay2 overhead)

# Bind mount write:
time docker run --rm -v $(pwd):/data alpine:3.19 sh -c 'dd if=/dev/zero of=/data/test bs=1M count=100'
# ~0.2s (direct host filesystem)

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f storage-test storage-a storage-b
docker volume rm demo-volume
overlay2
# Image layers:
Image has 1 layers
# Layer sharing:
Container A lower: /var/lib/docker/overlay2/abc123/layers
Container B lower: /var/lib/docker/overlay2/abc123/layers
# Same layers — shared
Container A upper: /var/lib/docker/overlay2/def456/diff
Container B upper: /var/lib/docker/overlay2/ghi789/diff
# Different writable layers
# Volume:
/var/lib/docker/volumes/demo-volume/_data
# Performance:
overlay2: 0.31s
volume: 0.22s
bind: 0.21s
- overlay2 has a copy-up penalty: modifying a file from a lower layer requires copying it to the upper layer first.
- For multi-GB database files, copy-up causes seconds of latency on first write.
- Volumes bypass overlay2 entirely — reads and writes go directly to the host filesystem.
- Volumes survive container deletion. The overlay2 writable layer is deleted when the container is removed.
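The copy-up rule can be modeled with plain directories — no overlay mount or root privileges needed. The point: the first write to a lower-layer file copies the entire file into the upper layer, regardless of how small the write is (the file name and sizes below are arbitrary):

```shell
#!/bin/sh
# Model overlay2 copy-up with two plain directories:
# lower/ stands in for the read-only image layer,
# upper/ for the container's writable layer.
set -eu
root=$(mktemp -d)
mkdir "$root/lower" "$root/upper"
# A 10MB "database file" living in the image layer:
dd if=/dev/zero of="$root/lower/db.dat" bs=1024 count=10240 2>/dev/null

# First write: overlayfs copies the whole file up, then applies the change.
cp "$root/lower/db.dat" "$root/upper/db.dat"                           # the copy-up
printf 'x' | dd of="$root/upper/db.dat" bs=1 count=1 conv=notrunc 2>/dev/null

upper_size=$(stat -c %s "$root/upper/db.dat")
echo "bytes copied up for a 1-byte write: $upper_size"
rm -rf "$root"
```

A 1-byte write forced a 10MB copy — which is exactly why multi-GB database files belong on volumes, where writes hit the host filesystem directly.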
Network Architecture: Bridge, veth, iptables, and DNS
Docker networking is built on Linux networking primitives — virtual bridges, veth pairs, iptables rules, and an embedded DNS server. Understanding these primitives explains why containers can communicate, why ports are published, and why the default bridge network lacks DNS.
The Docker bridge (docker0): When Docker is installed, it creates a Linux bridge called docker0 on the host. This bridge acts as a virtual switch. Each container connects to this bridge via a veth pair.
veth pairs: A veth (virtual Ethernet) pair is a pair of connected network interfaces — packets sent to one end appear on the other. Docker creates a veth pair for each container: one end (eth0) is inside the container's network namespace, the other end (vethXXXX) is attached to the docker0 bridge. This is how containers communicate with each other and the outside world.
iptables rules: Docker adds iptables rules for:
- Port publishing (-p): DNAT rules forward traffic from the host port to the container's IP and port.
- Inter-container communication: the default bridge allows all containers to communicate; this can be disabled daemon-wide with --icc=false. On user-defined networks it is controlled per network with the com.docker.network.bridge.enable_icc option.
- Outbound NAT: MASQUERADE rules allow containers to reach the internet via the host's network interface.
DNS resolution: The default bridge network has no DNS resolution; containers can only reach each other by IP. User-defined bridge networks have an embedded DNS server (127.0.0.11) that resolves container names to IP addresses. This is why docker-compose.yml services can reference each other by service name.
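This is why Compose services can talk by name. A minimal, hypothetical docker-compose.yml (the service names api and db are invented for illustration):

```yaml
# Compose attaches both services to a project-scoped user-defined network,
# so the embedded DNS at 127.0.0.11 resolves 'db' for 'api'.
services:
  api:
    image: alpine:3.19
    command: sh -c 'nslookup db && sleep 3600'
  db:
    image: alpine:3.19
    command: sleep 3600
```

No explicit networks: block is needed; Compose creates a default user-defined network per project.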
Network drivers: Docker supports multiple network drivers:
- bridge: default for single-host setups. Creates a virtual bridge.
- host: container shares the host's network stack. No isolation, best performance.
- none: no network. Completely air-gapped.
- overlay: VXLAN tunnels for multi-host communication (Docker Swarm).
- macvlan: assigns a MAC address to the container, making it appear as a physical device on the network.
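As a quick sketch of working with drivers (the network name demo-bridge is invented; the host and none lines are shown commented because their output depends on the host):

```shell
#!/bin/sh
# Sketch: create a network with an explicit driver and inspect it.
# Skips gracefully on machines without a running Docker daemon.
set -eu
docker info >/dev/null 2>&1 || { echo "no Docker daemon available; skipping"; exit 0; }
docker network rm demo-bridge >/dev/null 2>&1 || true   # pre-clean

docker network create -d bridge demo-bridge   # explicit bridge driver
docker network inspect demo-bridge --format '{{.Driver}}'

# host driver needs no network object -- attach at run time:
# docker run --rm --network host alpine:3.19 ip addr   # sees the host's interfaces

# none driver: fully air-gapped container:
# docker run --rm --network none alpine:3.19 ip addr   # only loopback

docker network rm demo-bridge
```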
#!/bin/bash
# Inspect the complete Docker network architecture

# ── 1. Check the Docker bridge ──────────────────────────────────────────────
ip addr show docker0
# docker0: <BROADCAST,MULTICAST,UP> mtu 1500
#   inet 172.17.0.1/16
# The bridge has the gateway IP for the container subnet

# ── 2. Create a container and inspect its veth pair ─────────────────────────
docker run -d --name net-demo alpine:3.19 sleep 3600

# Get the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' net-demo)

# Inside the container, see eth0 (one end of the veth pair):
docker exec net-demo ip addr show eth0
# eth0@if7: <BROADCAST,MULTICAST,UP> mtu 1500
#   inet 172.17.0.2/16

# On the host, see the other end of the veth pair:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0
# The host end is attached to the docker0 bridge

# ── 3. Inspect iptables rules ───────────────────────────────────────────────
# NAT rules for port publishing:
sudo iptables -t nat -L DOCKER -n -v
# DNAT tcp -- anywhere anywhere tcp dpt:8080 to:172.17.0.2:3000

# Forward rules:
sudo iptables -L DOCKER -n -v
# ACCEPT tcp -- anywhere 172.17.0.2 tcp dpt:3000

# MASQUERADE for outbound traffic:
sudo iptables -t nat -L POSTROUTING -n -v | grep 172.17
# MASQUERADE all -- 172.17.0.0/16 !172.17.0.0/16

# ── 4. Check DNS resolution (default vs user-defined network) ───────────────
# Default bridge: no DNS:
docker exec net-demo cat /etc/resolv.conf
# nameserver 8.8.8.8 (host's DNS, not container-specific)
docker exec net-demo nslookup other-container
# Fails: no embedded DNS on default bridge

# User-defined network: embedded DNS:
docker network create app-net
docker run -d --name api --network app-net alpine:3.19 sleep 3600
docker run -d --name db --network app-net alpine:3.19 sleep 3600
docker exec api cat /etc/resolv.conf
# nameserver 127.0.0.11 (embedded DNS server)
docker exec api nslookup db
# Name: db  Address: 172.18.0.3

# ── 5. Inspect network configuration ────────────────────────────────────────
docker network inspect bridge --format '{{json .IPAM.Config}}' | python3 -m json.tool
# [{"Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1"}]
docker network inspect app-net --format '{{json .IPAM.Config}}' | python3 -m json.tool
# [{"Subnet": "172.18.0.0/16", "Gateway": "172.18.0.1"}]

# ── 6. Trace network traffic ────────────────────────────────────────────────
# Capture packets on the docker0 bridge:
sudo tcpdump -i docker0 -n -c 10
# Shows ARP requests, TCP SYN packets between containers

# Capture packets inside a container:
sudo nsenter --net --target $CONTAINER_PID tcpdump -i eth0 -n -c 10

# ── Cleanup ─────────────────────────────────────────────────────────────────
docker rm -f net-demo api db
docker network rm app-net
docker0: inet 172.17.0.1/16
# Container eth0:
eth0@if7: inet 172.17.0.2/16
# veth pair on host:
vethXXXX@if4: master docker0
# iptables NAT:
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.2:3000
# Default bridge DNS:
nameserver 8.8.8.8
# User-defined network DNS:
nameserver 127.0.0.11
Name: db Address: 172.18.0.3
- The default bridge is a legacy design from before Docker had user-defined networks.
- Docker chose not to add DNS to the default bridge to avoid breaking backward compatibility.
- User-defined networks were introduced later with DNS as a built-in feature.
- The default bridge is effectively deprecated for production use; always create a user-defined network.
| Component | Role | Failure Impact | Runs As |
|---|---|---|---|
| Docker CLI | Sends API requests to the daemon | CLI commands fail; containers unaffected | User process |
| dockerd (daemon) | Manages API, images, networks, volumes | All CLI operations fail; existing containers keep running | Root process |
| containerd | Manages container lifecycle, image pulling | New containers cannot be created; existing ones keep running | Root process |
| runc | Creates a single container from an OCI spec | That container's creation fails; others unaffected | Short-lived (exits after creation) |
| containerd-shim | Monitors the container process, captures output | Container loses stdout/stderr capture; process still runs | Per-container process |
| pause (Kubernetes pods, not plain Docker) | Holds the pod's namespaces open across container restarts | Containers in the pod lose their shared namespaces; the pod must be recreated | Per-pod process |
🎯 Key Takeaways
- Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each component has a specific role. The OCI spec standardizes the interface between containerd and runc.
- Image build flow: CLI sends context -> daemon parses Dockerfile -> cache lookup per instruction -> execute miss -> commit layer -> tag image. .dockerignore is mandatory.
- Container creation flow: CLI -> dockerd API -> containerd -> runc -> kernel syscalls (clone, pivot_root, execve). Every container is a real Linux process on the host.
- overlay2 stacks read-only image layers with a writable container layer. Multiple containers share read-only layers; this is the key to Docker's density advantage.
- Docker networking uses a Linux bridge, veth pairs, and iptables. The default bridge has no DNS; always use user-defined networks.
- The daemon is a single point of failure. Limit concurrent operations, use BuildKit, and monitor daemon resource usage in production.
❌ Common Mistakes to Avoid
- ❌ Mistake 1: Not understanding that the daemon is a single point of failure -> Symptom: all Docker operations hang when the daemon is overloaded -> Fix: limit concurrent builds, use BuildKit, separate build hosts from runtime hosts, monitor daemon resource usage.
- ❌ Mistake 2: Sending a 500MB build context without .dockerignore -> Symptom: docker build hangs at 'Sending build context' for minutes -> Fix: create .dockerignore with node_modules/, .git/, *.log, coverage/. This alone can reduce build time from 5 minutes to 10 seconds.
- ❌ Mistake 3: Using the default bridge network and expecting DNS resolution -> Symptom: containers cannot reach each other by hostname -> Fix: create a user-defined bridge network. The embedded DNS server only works on user-defined networks.
- ❌ Mistake 4: Writing database data to the overlay2 filesystem -> Symptom: slow writes due to copy-up, data loss on container removal -> Fix: use named volumes for databases. Volumes bypass overlay2 and survive container deletion.
- ❌ Mistake 5: Not setting resource limits on containers -> Symptom: one container consumes all host RAM, OOM-killing unrelated containers -> Fix: set --cpus and --memory on every production container. Monitor with docker stats.
- ❌ Mistake 6: Flushing iptables while Docker is running -> Symptom: container port forwarding breaks, containers become unreachable from the host -> Fix: restart dockerd to recreate iptables rules. Configure firewalld to not manage the docker0 bridge.
- ❌ Mistake 7: Not authenticating CI runners for Docker Hub pulls -> Symptom: CI builds fail with 'toomanyrequests' after hitting the 100-pull-per-6-hours limit -> Fix: run docker login on all CI agents. Consider a pull-through cache registry.
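For Mistake 2, a typical starting-point .dockerignore for a Node.js project might look like this (adjust the entries to your repository):

```
node_modules/
.git/
*.log
coverage/
dist/
.env
```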
Interview Questions on This Topic
- Q: Walk me through the complete flow from 'docker build' to a cached image on disk. What happens at each step, and how does layer caching work?
- Q: Explain the Docker component stack: CLI, daemon, containerd, runc. What does each component do, and what happens when each one fails?
- Q: How does the OCI spec enable runtime replaceability? Why can you swap runc for gVisor or Kata without changing Docker?
- Q: Trace the container creation flow from 'docker run' to a running process. What kernel syscalls does runc make?
- Q: How does Docker networking work at the Linux level? Explain veth pairs, the docker0 bridge, and iptables rules for port publishing.
- Q: Why should databases use named volumes instead of the overlay2 filesystem? What is the copy-up problem?
- Q: Your CI pipeline runs 50 concurrent docker build operations and the daemon becomes unresponsive. What is happening and how do you fix it?
Frequently Asked Questions
What is the difference between containerd and dockerd?
dockerd (the Docker daemon) is the user-facing server that manages the Docker API, image building, networking, and volumes. containerd is the container runtime that manages the container lifecycle: pulling images, creating containers, and handling execution. dockerd delegates to containerd for container operations. containerd was extracted from Docker in 2017 and is now used independently by Kubernetes and other platforms.
What is the OCI spec and why does it matter?
The OCI (Open Container Initiative) spec defines two standards: the image spec (how images are packaged as layers plus a manifest) and the runtime spec (how a container is created from a filesystem bundle and a config.json). This standardization means any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can run any OCI-compliant image. It enables runtime replaceability: you can swap runc for gVisor without changing Docker.
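For a feel of the runtime spec, here is a heavily trimmed sketch of a config.json (real files, such as those generated by runc spec, contain many more fields):

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["sh"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": false },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ]
  }
}
```

Any runtime that understands this file layout can create the container, which is exactly the replaceability the spec buys you.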
Can I use containerd directly without dockerd?
Yes. containerd provides its own CLI (ctr) and API (gRPC). Kubernetes uses containerd directly via the CRI plugin, bypassing dockerd entirely. You can use ctr to pull images, create containers, and manage snapshots. This reduces overhead and removes the daemon as a single point of failure.
Why is my docker build so slow at 'Sending build context'?
The build context is the entire current directory (or the path argument you pass to docker build; -f only selects the Dockerfile). Without a .dockerignore file, this includes node_modules (500MB+), .git history (100MB+), and other large files. The CLI tars this directory and sends it to the daemon over the Unix socket. Create a .dockerignore file to exclude unnecessary files. This alone can reduce build time from minutes to seconds.
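You can approximate what the CLI sends by tarring the directory with and without exclusions. Note that tar's --exclude matching is only an approximation of .dockerignore semantics, and the directory layout below is fabricated for the demo:

```shell
#!/bin/sh
# Approximate build-context size with and without exclusions.
# (tar --exclude is not identical to .dockerignore matching -- just a demo.)
set -eu
ctx=$(mktemp -d)
mkdir -p "$ctx/src" "$ctx/node_modules"
echo 'console.log("hi")' > "$ctx/src/app.js"
dd if=/dev/zero of="$ctx/node_modules/blob.bin" bs=1M count=5 2>/dev/null

full=$(tar -C "$ctx" -cf - . | wc -c)
slim=$(tar -C "$ctx" --exclude node_modules -cf - . | wc -c)

echo "full context:  $full bytes"
echo "with exclude:  $slim bytes"
rm -rf "$ctx"
```

The gap between the two numbers is exactly the time 'Sending build context' spends on files your Dockerfile never uses.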
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.