
Docker Architecture Explained: Complete End-to-End Flow (Images, Containers, Engine)

πŸ“ Part of: Docker β†’ Topic 5 of 17
Docker architecture deep-dive: trace the complete flow from docker build to docker run — CLI, daemon, containerd, runc, image layers, networking, and storage with real commands.
βš™οΈ Intermediate β€” basic DevOps knowledge assumed
In this tutorial, you'll learn
  • Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each component has a specific role. The OCI spec standardizes the interface between containerd and runc.
  • Image build flow: CLI sends context -> daemon parses Dockerfile -> cache lookup per instruction -> execute on miss -> commit layer -> tag image. .dockerignore is mandatory.
  • Container creation flow: CLI -> dockerd API -> containerd -> runc -> kernel syscalls (clone, pivot_root, execve). Every container is a real Linux process on the host.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
  • docker build: CLI sends context to dockerd -> dockerd executes Dockerfile instructions -> each instruction creates a cached layer -> layers stored under /var/lib/docker/overlay2/
  • docker push: dockerd uploads layers to a registry -> registry stores layers by digest -> tags point to manifests
  • docker run: CLI sends API request to dockerd -> dockerd delegates to containerd -> containerd invokes runc -> runc configures namespaces/cgroups/filesystem -> exec starts the application
  • Docker CLI: HTTP client that talks to the daemon via Unix socket
  • dockerd (daemon): manages images, networks, volumes, and the REST API
  • containerd: manages container lifecycle, image pulling, and snapshot management
  • runc: creates containers by calling kernel syscalls (clone, pivot_root, exec)
🚨 START HERE
Docker Architecture Triage Cheat Sheet
First-response commands when the daemon, builds, pulls, or container creation fail.
🟡 Docker daemon is unresponsive or crashed.
Immediate Action: Check daemon process and resource usage.
Commands
systemctl status docker && ps aux | grep dockerd
journalctl -u docker --since '10 minutes ago' --no-pager | tail -30
Fix Now: If daemon is running but unresponsive, restart: systemctl restart docker. If OOM-killed, increase memory limits or reduce concurrent operations.
🟡 docker build hangs at 'Sending build context'.
Immediate Action: Check build context size and .dockerignore.
Commands
du -sh . (in build directory)
cat .dockerignore 2>/dev/null || echo 'NO .dockerignore'
Fix Now: If context is >100MB, add node_modules/, .git/, *.log to .dockerignore. Enable BuildKit: DOCKER_BUILDKIT=1 docker build .
🟠 docker pull fails or is extremely slow.
Immediate Action: Check registry connectivity and rate limits.
Commands
curl -s -I https://registry-1.docker.io/v2/library/alpine/manifests/latest | grep -i ratelimit
docker info | grep -i 'registry\|mirror\|proxy'
Fix Now: If rate-limited, authenticate: docker login. If it is a network issue, configure a registry mirror in daemon.json.
🟡 Container exits immediately after start.
Immediate Action: Check exit code and container logs.
Commands
docker inspect <container> --format '{{.State.ExitCode}} {{.State.Error}}'
docker logs <container>
Fix Now: Exit code 0 = process completed normally (wrong CMD). Exit code 1 = app error. Exit code 137 = OOM-killed. Exit code 143 = SIGTERM.
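The exit-code conventions above can be wrapped in a small triage helper (a sketch; `explain_exit` is a hypothetical name, and the mapping follows the standard 128+signal convention):

```bash
#!/bin/bash
# explain_exit: translate a container exit code into its most likely cause.
explain_exit() {
  case "$1" in
    0)   echo "process completed normally (wrong CMD for a service?)" ;;
    1)   echo "application error (check docker logs)" ;;
    125) echo "docker run itself failed" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in the image" ;;
    137) echo "SIGKILL (128+9), usually OOM-killed" ;;
    143) echo "SIGTERM (128+15), graceful stop" ;;
    *)   if [ "$1" -gt 128 ]; then
           echo "killed by signal $(( $1 - 128 ))"
         else
           echo "application-defined exit code $1"
         fi ;;
  esac
}

explain_exit 137   # SIGKILL (128+9), usually OOM-killed
```

Feed it the value from docker inspect &lt;container&gt; --format '{{.State.ExitCode}}'.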
🟡 Docker disk usage is growing rapidly.
Immediate Action: Check what is consuming space.
Commands
docker system df -v
du -sh /var/lib/docker/* | sort -hr
Fix Now: Prune unused resources: docker system prune -a. Check for dangling volumes: docker volume prune. Check for build cache: docker builder prune.
🟡 containerd is not running or crashing.
Immediate Action: Check containerd status and logs.
Commands
systemctl status containerd
journalctl -u containerd --since '10 minutes ago' --no-pager | tail -20
Fix Now: Restart containerd: systemctl restart containerd. If it keeps crashing, check for corrupted snapshots in /var/lib/containerd/.
Production Incident: Docker Daemon Crashes Under Load — All Container Operations Fail for 20 Minutes
A CI/CD pipeline running 50 concurrent docker build and docker run operations per minute caused the Docker daemon to become unresponsive. All container operations (build, run, stop, ps) hung indefinitely. The daemon consumed 100% CPU and 12GB RAM. The pipeline queued 200 jobs before the team noticed.
Symptom: CI builds started hanging at the 'docker build' step. The build command did not return an error — it just hung indefinitely. docker ps from the host also hung. systemctl status docker showed the daemon as 'active (running)' but docker info returned 'Cannot connect to the Docker daemon'. The daemon process (dockerd) was consuming 100% of one CPU core and 12GB of RAM (normally 200MB).
Assumption: The team assumed a network issue — perhaps the Docker registry was unreachable and the pull was hanging. They checked network connectivity to Docker Hub — it was fine. They assumed a disk space issue — df -h showed 40% disk usage, well within limits. They assumed a corrupted image cache — they tried docker system prune, but the command also hung.
Root cause: The CI pipeline was running 50 concurrent docker build operations, each sending a 500MB build context to the daemon via the Unix socket. The daemon serialized all API requests through a single goroutine pool. With 50 concurrent builds, each requiring image layer extraction, filesystem operations, and metadata updates, the daemon's internal queue grew unbounded. The daemon's memory usage grew from 200MB to 12GB as it buffered build contexts and layer data. The Go runtime's garbage collector could not keep up, and the daemon became CPU-bound on GC cycles. The root cause was the daemon's single-process architecture — all operations (build, run, network, volume) share the same process and resource pool.
Fix:
1. Limited concurrent builds to 10 per host using a semaphore in the CI pipeline.
2. Moved image builds to dedicated build servers separate from container runtime hosts.
3. Enabled BuildKit (DOCKER_BUILDKIT=1), which parallelizes build steps and reduces daemon load.
4. Added daemon resource monitoring: alert when dockerd RSS exceeds 2GB.
5. Configured the daemon with max-concurrent-downloads and max-concurrent-uploads to limit registry operations.
6. Considered migrating to containerd directly (bypassing dockerd) for high-throughput CI environments.
Key Lesson
  • The Docker daemon is a single process that handles all operations — builds, runs, networking, volumes. Under high concurrency, it becomes a bottleneck.
  • Limit concurrent docker build operations per host. 50 concurrent builds can exhaust the daemon's memory and CPU.
  • Use BuildKit (DOCKER_BUILDKIT=1) for builds — it parallelizes steps and reduces daemon load compared to the legacy builder.
  • Separate build hosts from runtime hosts. Build operations are more resource-intensive than container lifecycle operations.
  • Monitor dockerd resource usage (CPU, memory, open file descriptors). A daemon consuming >2GB RSS is a sign of overload.
Production Debug Guide: From daemon crashes to slow builds — systematic debugging paths through the component stack.
Docker daemon is unresponsive — all commands hang. → Check daemon process status and resource usage. Run ps aux | grep dockerd to find the PID. Check memory: cat /proc/<pid>/status | grep VmRSS. Check open file descriptors: ls /proc/<pid>/fd | wc -l. Check daemon logs: journalctl -u docker --since '10 minutes ago'. If the daemon is OOM-killed, check dmesg | grep -i oom. Restart: systemctl restart docker.
docker build is slow — hangs at 'Sending build context'. → Check build context size: du -sh . in the build directory. Check if .dockerignore exists and excludes large directories (node_modules, .git). Check if BuildKit is enabled: echo $DOCKER_BUILDKIT. Enable it: DOCKER_BUILDKIT=1 docker build . Check daemon logs for layer extraction errors: journalctl -u docker | grep -i 'error\|failed'.
docker pull is slow or times out. → Check network connectivity to the registry: curl -v https://registry-1.docker.io/v2/. Check if pull-rate limits are hit: inspect the RateLimit-Remaining header. Check the daemon's concurrent download limit: docker info | grep 'Max Concurrent Downloads'. Increase it if needed in /etc/docker/daemon.json. Check if a proxy or mirror is configured.
Container starts but immediately exits with code 0. → Check if the entrypoint/command is correct: docker inspect <image> --format '{{.Config.Cmd}} {{.Config.Entrypoint}}'. Check if the process completes immediately (e.g., echo instead of a long-running server). Check container logs: docker logs <container>. If using a shell-form CMD, the application may not run as PID 1 and may not receive signals correctly.
Docker daemon disk usage is growing unboundedly. → Check disk usage: docker system df. Check the detailed breakdown: docker system df -v. Identify unused images: docker images --filter dangling=true. Check for orphaned volumes: docker volume ls --filter dangling=true. Prune: docker system prune -a --volumes (WARNING: removes all unused images, containers, networks, and volumes).
containerd is crashing or not responding. → Check containerd status: systemctl status containerd. Check containerd logs: journalctl -u containerd --since '10 minutes ago'. Check if the containerd socket exists: ls -la /run/containerd/containerd.sock. If containerd crashes, dockerd cannot create or manage containers. Restart: systemctl restart containerd. If it keeps crashing, check for corrupted snapshots: ls /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/.

Most Docker documentation treats the architecture as a black box — run a command, get a container. This abstraction breaks down when containers fail to start, images pull slowly, or the daemon crashes under load. Understanding the component stack and the data flow between components is essential for production debugging.

Docker is not one program. It is a chain of specialized components: the CLI sends API requests to the daemon, the daemon delegates to containerd, containerd invokes runc, and runc configures the Linux kernel to create an isolated process. Each handoff is a potential failure point. The OCI (Open Container Initiative) spec standardizes the interface between containerd and runc, enabling runtime replaceability.

This article traces the complete end-to-end flow: what happens when you run docker build, how images are stored and distributed, what happens when you run docker run, how networking and storage are wired, and where each component lives on the filesystem. Every section includes production failure scenarios and debugging commands.

Component Stack: CLI, Daemon, containerd, runc, and the OCI Spec

Docker is a chain of five components, each with a specific responsibility. Understanding this chain is the foundation for debugging any Docker issue.

Docker CLI (docker): A Go binary that sends HTTP requests to the Docker daemon via a Unix socket (/var/run/docker.sock) or TCP. The CLI does not create containers, build images, or manage networks — it is a thin client. You can replace it with curl: curl --unix-socket /var/run/docker.sock http://localhost/containers/json.

Docker daemon (dockerd): A long-running Go process that manages the Docker API, image storage, network configuration, volume management, and build orchestration. The daemon listens on the Unix socket and processes all API requests. It delegates container lifecycle operations to containerd. The daemon runs as root and has full access to the host.

containerd: A container runtime daemon that manages the complete container lifecycle — pulling images, managing snapshots, creating containers, and handling execution. containerd was originally part of Docker but was extracted as a CNCF project in 2017. It is now used independently by Docker, Kubernetes (via CRI), AWS ECS, GKE, and other platforms. containerd invokes runc to actually create containers.

runc: A lightweight CLI tool that creates a single container from an OCI runtime specification (config.json). runc calls clone() to create a new process with namespaces, configures cgroups, mounts the filesystem via overlay2, drops privileges, and exec's the application process. runc exits after creating the container — it does not manage the lifecycle.

OCI spec: The Open Container Initiative defines two standards: the image spec (how images are packaged as layers + manifest) and the runtime spec (how containers are created from config.json). This standardization enables runtime replaceability — you can swap runc for crun, kata-runtime, or runsc without changing Docker or containerd.
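In practice, registering an alternative OCI runtime is a small daemon.json change (a sketch; the runtime name and binary path are illustrative and depend on your install):

```json
{
  "runtimes": {
    "kata": {
      "path": "/usr/bin/kata-runtime"
    }
  }
}
```

After systemctl restart docker, docker run --runtime=kata ... selects it per container; containers started without the flag keep using runc, the default.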

The handoff chain: docker run -> dockerd (API) -> containerd (lifecycle) -> runc (creation) -> kernel (namespaces, cgroups, overlay2). Each handoff is a potential failure point. If dockerd crashes, all operations fail. If containerd crashes, new containers cannot be created but existing ones keep running. If runc fails, the specific container creation fails but the stack above is unaffected.

io/thecodeforge/architecture_flow.sh · BASH
#!/bin/bash
# Trace the complete Docker architecture flow

# ── 1. CLI -> Daemon communication ───────────────────────────────────────────
# The CLI sends HTTP requests to the daemon socket
curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m json.tool
# Shows: Version, ApiVersion, GoVersion, Os, Arch, KernelVersion

# List containers via the API (same as docker ps)
curl --unix-socket /var/run/docker.sock http://localhost/containers/json | python3 -m json.tool

# ── 2. Daemon -> containerd communication ────────────────────────────────────
# containerd runs as a separate process, communicating via gRPC
ps aux | grep containerd
# root  1234  0.3  0.5  ... /usr/bin/containerd

# Check the containerd socket
ls -la /run/containerd/containerd.sock
# srw-rw---- 1 root containerd /run/containerd/containerd.sock

# List containers managed by containerd (via ctr)
sudo ctr -n moby containers ls
# Shows containers that containerd is managing on behalf of Docker

# ── 3. containerd -> runc communication ──────────────────────────────────────
# runc is invoked by containerd to create each container
which runc
# /usr/bin/runc

runc --version
# runc version 1.1.9

# List containers managed by runc
sudo runc list
# Shows: container ID, PID, status, bundle path

# ── 4. Inspect the OCI runtime spec for a running container ──────────────────
CONTAINER_ID=$(docker ps -q | head -1)

# Find the container's bundle directory
sudo find /run/containerd -name config.json 2>/dev/null | head -3
# /run/containerd/io.containerd.runtime.v2.task/moby/<id>/config.json

# ── 5. Check the daemon process tree ─────────────────────────────────────────
pstree -p $(pidof dockerd)
# dockerd(1234)───containerd(1235)───containerd-shim(5678)───node(5679)
#                                                           └─pause(5677)

# ── 6. Check all components are running ──────────────────────────────────────
echo "dockerd: $(systemctl is-active docker)"
echo "containerd: $(systemctl is-active containerd)"
echo "runc: $(which runc && echo 'installed' || echo 'missing')"

# ── 7. Check the daemon's storage driver and root directory ──────────────────
docker info --format '{{.Driver}} {{.DockerRootDir}}'
# overlay2 /var/lib/docker

# ── 8. Check the daemon's configured runtimes ────────────────────────────────
docker info --format '{{json .Runtimes}}' | python3 -m json.tool
# Shows: runc (default), and any custom runtimes (runsc, kata)
▶ Output
# Daemon version:
{
"Version": "24.0.7",
"ApiVersion": "1.43",
"GoVersion": "go1.20.10",
"Os": "linux",
"Arch": "amd64",
"KernelVersion": "6.1.0-18-amd64"
}

# Component status:
dockerd: active
containerd: active
runc: installed

# Storage driver:
overlay2 /var/lib/docker

# Process tree:
dockerd(1234)───containerd(1235)───containerd-shim(5678)───node(5679)
└─pause(5677)
Mental Model
The Component Chain as an Assembly Line
Why does containerd exist separately from dockerd?
  • containerd was extracted from Docker in 2017 to become a standalone CNCF project.
  • Kubernetes can use containerd directly (via CRI) without dockerd — reducing overhead and complexity.
  • Separation allows independent scaling — containerd can be updated without restarting dockerd.
  • containerd manages the lifecycle; dockerd manages the user-facing API and image building.
📊 Production Insight
The daemon is a single point of failure. If dockerd crashes, all Docker operations (build, run, stop, ps, logs) fail. Existing containers keep running (they are managed by containerd, not dockerd), but you cannot interact with them via the Docker CLI. For high-availability environments, consider using containerd directly (via ctr or crictl) to bypass the daemon for container lifecycle operations.
🎯 Key Takeaway
Docker is a chain: CLI -> dockerd -> containerd -> runc -> kernel. Each component has a specific role. The OCI spec standardizes the interface between containerd and runc. If dockerd crashes, CLI operations fail but existing containers keep running because containerd manages them independently.
Component Failure Impact
If dockerd crashes → All CLI operations fail. Existing containers keep running (managed by containerd). Restart dockerd to recover.
If containerd crashes → New containers cannot be created. Existing containers keep running (processes are still alive). Restart containerd to recover.
If runc fails to create a container → The specific container creation fails. Other containers are unaffected. Check the OCI spec and kernel logs.
If the Docker socket (/var/run/docker.sock) is deleted → All CLI operations fail with 'Cannot connect to the Docker daemon'. Restart dockerd to recreate the socket.

Image Build Flow: From Dockerfile to Cached Layers

When you run docker build, a precise sequence of operations transforms a Dockerfile into a cached, layered image. Understanding this flow explains why builds are slow, why layers are cached, and why image size matters.

Step 1: Send build context. The CLI tars the build context directory (the path argument to docker build; -f only selects the Dockerfile) and sends it to the daemon via the Unix socket. This is the 'Sending build context to Docker daemon' message. The .dockerignore file filters out excluded files before sending. Without .dockerignore, the entire directory (including .git and node_modules) is sent.

Step 2: Parse the Dockerfile. The daemon parses the Dockerfile and executes each instruction sequentially. Each instruction is evaluated against the layer cache.

Step 3: Cache lookup. For each instruction, the daemon checks whether a cached layer exists with the same instruction text and the same parent layer. On a cache hit, the layer is reused (no execution). On a cache miss, the instruction is executed and a new layer is created. The cache is sequential — a miss invalidates all subsequent layers.
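A toy model of the cache key makes the sequential-invalidation rule concrete (a sketch, not Docker's exact algorithm; real builds also hash the contents of files referenced by COPY/ADD):

```bash
#!/bin/bash
# A layer is reusable only when BOTH the parent layer and the instruction match.
cache_key() {
  printf '%s\n%s' "$1" "$2" | sha256sum | awk '{print $1}'
}

base=$(cache_key "scratch" "FROM node:20-alpine")
k1=$(cache_key "$base" "RUN npm ci")
k2=$(cache_key "$base" "RUN npm ci")           # same parent + instruction
k3=$(cache_key "$base" "RUN npm ci --silent")  # instruction changed

[ "$k1" = "$k2" ] && echo "cache hit: layer reused"
[ "$k1" != "$k3" ] && echo "cache miss: this layer and all later layers rebuild"
```

Because the parent digest feeds into every child's key, changing one instruction changes every key after it, which is exactly why builds restart from the first modified line.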

Step 4: Execute the instruction. For RUN, the daemon creates a temporary container from the previous layer, executes the command, and captures the filesystem diff as a new layer. For COPY/ADD, the daemon copies files from the build context into a new layer. For ENV/EXPOSE/LABEL, the daemon creates a metadata-only layer (no filesystem change).

Step 5: Commit the layer. The filesystem diff is committed as a new layer under /var/lib/docker/overlay2/. Each layer is a directory containing only the files that changed from the previous layer. The layer is identified by a SHA256 digest.

Step 6: Tag the image. After all instructions are executed, the final layer is tagged with the image name and tag (e.g., my-app:1.0.0). The tag points to a manifest — a JSON file that lists all layers in order.

BuildKit vs legacy builder: The legacy builder executes instructions sequentially. BuildKit (DOCKER_BUILDKIT=1) builds a dependency graph and executes independent instructions in parallel. BuildKit also supports --mount=type=secret for build-time secrets without baking them into layers. BuildKit is the default builder in Docker Desktop and in Docker Engine 23.0+, and is recommended for all builds.
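A sketch of the secret mount (the npm_token id, the token file, and the npm workflow are illustrative):

```dockerfile
# syntax=docker/dockerfile:1
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
# The secret exists only while this RUN executes; it is never written to a layer.
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN="$(cat /run/secrets/npm_token)" npm ci
```

Build with DOCKER_BUILDKIT=1 docker build --secret id=npm_token,src=$HOME/.npm_token . — BuildKit mounts the file at /run/secrets/npm_token for that single step, so the token never appears in docker history.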

io/thecodeforge/build_flow.sh · BASH
#!/bin/bash
# Trace the complete image build flow

# ── 1. Build context size (before and after .dockerignore) ───────────────────
# Without .dockerignore:
tar -cf - . | wc -c
# May be 500MB+ if node_modules and .git are included

# With .dockerignore:
cat .dockerignore
# node_modules/
# .git/
# *.log

tar -cf - --exclude-from=.dockerignore . | wc -c
# Should be <10MB for a typical project

# ── 2. Build with cache inspection ───────────────────────────────────────────
# Build with BuildKit and progress=plain to see every step
DOCKER_BUILDKIT=1 docker build --progress=plain -t io.thecodeforge/api:1.0 . 2>&1 | tee /tmp/build.log

# Count cached vs executed steps:
grep -c 'CACHED' /tmp/build.log
grep -c 'RUN\|COPY' /tmp/build.log

# ── 3. Inspect the image layers ──────────────────────────────────────────────
# List layers in the image
docker inspect io.thecodeforge/api:1.0 --format '{{json .RootFS.Layers}}' | python3 -m json.tool
# Each entry is a SHA256 digest of a layer

# Show layer sizes
docker history io.thecodeforge/api:1.0 --format '{{.Size}}\t{{.CreatedBy}}' | head -10
# Shows the size contribution of each instruction

# ── 4. Find layers on disk ───────────────────────────────────────────────────
ls /var/lib/docker/overlay2/ | head -10
# Each directory is a layer; layers shared between images are stored once.

# Check disk usage per layer:
du -sh /var/lib/docker/overlay2/* | sort -hr | head -10

# ── 5. Inspect the image manifest ────────────────────────────────────────────
# Save the image and inspect its manifest
docker save io.thecodeforge/api:1.0 | tar -xO manifest.json | python3 -m json.tool
# Shows: Config (image config), RepoTags, Layers (ordered list of layer tar files)

# ── 6. Compare BuildKit vs legacy builder performance ────────────────────────
# Legacy builder:
time DOCKER_BUILDKIT=0 docker build -t test:legacy .
# Sequential execution — slower for multi-step builds

# BuildKit:
time DOCKER_BUILDKIT=1 docker build -t test:buildkit .
# Parallel execution — faster for independent steps

# ── 7. Check the build cache ─────────────────────────────────────────────────
docker builder du
# Shows disk usage of the build cache

docker builder prune
# Removes unused build cache entries
▶ Output
# Build context size:
8543232 (8.1MB with .dockerignore)
524288000 (500MB without .dockerignore)

# Build with cache:
#5 [2/6] COPY package.json ./
#5 CACHED
#6 [3/6] RUN npm ci
#6 CACHED
#7 [4/6] COPY src/ ./src/
#7 DONE 0.3s
# Only the COPY src/ step was rebuilt

# Image layers:
[
"sha256:abc123...",
"sha256:def456...",
"sha256:ghi789..."
]

# Layer sizes:
142MB COPY --from=builder /app/node_modules ./node_modules
12MB COPY --from=builder /app/dist ./dist
7MB FROM node:20-alpine

# BuildKit vs legacy:
Legacy: 42.3s
BuildKit: 28.1s (33% faster)
Mental Model
Build Context as a Delivery Truck
Why is the build context sent to the daemon before any instruction executes?
  • The daemon needs access to files referenced by COPY and ADD instructions.
  • The daemon runs on the host (or a remote machine) — it cannot access the CLI's local filesystem directly.
  • The CLI tars the context and sends it over the Unix socket. This is why .dockerignore is critical for build speed.
  • BuildKit optimizes this by only sending files referenced by COPY/ADD, not the entire context.
📊 Production Insight
The build context is the most common cause of slow builds. A 500MB build context (including node_modules and .git) takes minutes to transfer over the Unix socket before any instruction executes. The fix: add .dockerignore with at minimum: node_modules/, .git/, *.log, coverage/, .env. This alone can reduce build time from 5 minutes to 10 seconds.
🎯 Key Takeaway
The build flow is: CLI sends context -> daemon parses Dockerfile -> cache lookup per instruction -> execute on miss -> commit layer -> tag image. The build context is the most common bottleneck. .dockerignore is mandatory. BuildKit parallelizes independent steps and is 30-50% faster than the legacy builder.

Image Distribution: Registry, Manifest, and Layer Deduplication

Once an image is built, it needs to be distributed to other machines β€” CI servers, staging environments, production clusters. This is the registry's job.

Image format: An image is not a single file. It is a collection of:
  • Layers: compressed tar archives, each identified by a SHA256 digest
  • Manifest: a JSON file that lists the layers in order and points to the image config
  • Image config: a JSON file that defines the runtime configuration (env vars, entrypoint, exposed ports, user)

The registry protocol: Docker registries implement the OCI Distribution Spec — an HTTP API for pushing and pulling images. The push flow:
  1. Client checks which layers the registry already has (by digest)
  2. Client uploads only the missing layers
  3. Client uploads the manifest, which references the layers by digest
  4. Registry stores layers by digest and links them to the manifest

Layer deduplication: This is the key efficiency mechanism. If two images share the same base layer (e.g., both use node:20-alpine), the layer is stored once on the registry and once on the local machine. When you pull a second image that shares layers with an existing image, only the unique layers are downloaded. This is why pulling a new version of your app is fast — only the top layers (containing your code) change.
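Deduplication is just content addressing; a toy blob store shows the mechanism (a sketch: put_blob and the temp paths are illustrative, and real registries apply the same digest-keyed idea over HTTP):

```bash
#!/bin/bash
# Blobs are stored under their SHA256 digest, so identical layers are kept once.
store=$(mktemp -d)

put_blob() {
  local digest
  digest=$(sha256sum "$1" | awk '{print $1}')
  if [ -e "$store/$digest" ]; then
    echo "layer already exists: skipping upload"
  else
    cp "$1" "$store/$digest"
    echo "pushed layer $digest"
  fi
}

layer=$(mktemp)
printf 'contents of a shared base layer' > "$layer"
put_blob "$layer"    # first push stores the blob
put_blob "$layer"    # second push is skipped: same digest
ls "$store" | wc -l  # the blob is stored once
```

This mirrors the 'Layer already exists' lines you see during docker push: the client asks for the digest first and uploads only when the registry does not have it.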

Docker Hub pull-rate limits: Docker Hub limits pulls: 100 per 6 hours for anonymous users (enforced per source IP) and 200 per 6 hours for authenticated free users (enforced per account). Because the anonymous limit is per IP, a NAT gateway makes multiple machines appear as one IP. For CI/CD pipelines, this limit is hit quickly. The fix: authenticate with docker login, use a pull-through cache, or mirror images to a private registry.

Content trust (DCT): Docker Content Trust uses digital signatures to verify image integrity. When DOCKER_CONTENT_TRUST=1, Docker only pulls signed images. This prevents supply chain attacks where a malicious image is pushed with the same tag as a legitimate image.

io/thecodeforge/registry_flow.sh · BASH
#!/bin/bash
# Trace the image distribution flow

# ── 1. Inspect the local image manifest ──────────────────────────────────────
# Save the image and extract the manifest
docker save io.thecodeforge/api:1.0 -o /tmp/api-image.tar
cd /tmp && tar xf api-image.tar

# The manifest.json lists all components:
cat manifest.json | python3 -m json.tool
# [
#   {
#     "Config": "sha256:abc123...json",      <- image config
#     "RepoTags": ["io.thecodeforge/api:1.0"],
#     "Layers": [                              <- ordered layer list
#       "sha256:def456.../layer.tar",
#       "sha256:ghi789.../layer.tar"
#     ]
#   }
# ]

# ── 2. Inspect the image config ──────────────────────────────────────────────
cat sha256:abc123*.json | python3 -m json.tool | head -30
# Shows: architecture, os, config (env, cmd, entrypoint), rootfs (diff_ids)

# ── 3. Push to a registry ────────────────────────────────────────────────────
# Login to Docker Hub
docker login

# Tag the image for the registry
docker tag io.thecodeforge/api:1.0 youruser/io-thecodeforge-api:1.0

# Push β€” watch which layers are pushed vs already exist
docker push youruser/io-thecodeforge-api:1.0
# Output shows:
# Layer already exists (shared with base image)
# Pushing layer (unique to this image)

# ── 4. Pull from a registry ──────────────────────────────────────────────────
# Pull on a different machine
docker pull youruser/io-thecodeforge-api:1.0
# Output shows:
# Already exists (layers shared with local images)
# Downloading (unique layers)

# ── 5. Check pull-rate limit status ──────────────────────────────────────────
curl -s -I \
  -H "Authorization: Bearer $(curl -s 'https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull' | python3 -c 'import sys,json; print(json.load(sys.stdin)["token"])')" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
  | grep -i ratelimit
# ratelimit-limit: 100;w=21600
# ratelimit-remaining: 76;w=21600

# ── 6. Enable Docker Content Trust ───────────────────────────────────────────
export DOCKER_CONTENT_TRUST=1

# Now docker pull only fetches signed images
docker pull youruser/io-thecodeforge-api:1.0
# If the image is not signed, the pull fails with a trust error

# ── 7. Check layer deduplication ─────────────────────────────────────────────
# Compare layers between two images
docker inspect node:20-alpine --format '{{.RootFS.Layers}}' | tr ' ' '\n' | wc -l
docker inspect io.thecodeforge/api:1.0 --format '{{.RootFS.Layers}}' | tr ' ' '\n' | wc -l
# The API image shares base layers with node:20-alpine
▶ Output
# Manifest:
[
{
"Config": "sha256:a1b2c3d4e5f6...json",
"RepoTags": ["io.thecodeforge/api:1.0"],
"Layers": [
"sha256:f1e2d3c4b5a6.../layer.tar",
"sha256:a7b8c9d0e1f2.../layer.tar",
"sha256:d3e4f5a6b7c8.../layer.tar"
]
}
]

# Push output:
The push refers to repository [docker.io/youruser/io-thecodeforge-api]
f1e2d3c4b5a6: Mounted from library/node (shared layer)
a7b8c9d0e1f2: Pushed (unique layer)
d3e4f5a6b7c8: Pushed (unique layer)
1.0: digest: sha256:abc123... size: 1570

# Rate limit:
ratelimit-limit: 100;w=21600
ratelimit-remaining: 76;w=21600
Mental Model
Registry as a Library with ISBN Numbers
Why is layer deduplication the most important efficiency mechanism in Docker?
  • Without deduplication, every image would store a full copy of its base OS — wasting disk and bandwidth.
  • With deduplication, shared layers (like node:20-alpine) are stored once and referenced by multiple images.
  • Pulling a new app version only downloads the changed layers (typically your code — a few MB), not the entire image.
  • This is why Docker images are practical at scale β€” the overhead per image is only the unique layers.
📊 Production Insight
The anonymous pull-rate limit is per IP, not per user. A CI server behind a NAT gateway with 50 engineers hits the 100-pull limit in minutes. The fix: always authenticate CI runners with docker login (doubles the limit to 200, tracked per account), deploy a pull-through cache registry, or mirror critical base images to a private registry (AWS ECR, GCR).
🎯 Key Takeaway
An image is layers + manifest + config. The registry protocol deduplicates layers by SHA256 digest. Pulling a new version only downloads changed layers. Docker Hub rate limits are per IP for anonymous pulls — authenticate CI runners and consider a pull-through cache. Docker Content Trust verifies image signatures to prevent supply chain attacks.

Container Creation Flow: From Image to Running Process

When you run docker run, a precise sequence of operations creates an isolated process from an image. This is the most critical flow to understand for production debugging.

Step 1: API request. The CLI sends a POST /containers/create request to dockerd. The request includes the image name, command, environment variables, port mappings, volume mounts, and resource limits.

Step 2: Image resolution. dockerd checks if the image exists locally. If not, it pulls the image from the registry. The image's layers are unpacked into /var/lib/docker/overlay2/.

Step 3: Create container metadata. dockerd creates a container configuration (container JSON) that includes the merged overlay2 directory, network settings, volume mounts, and resource limits. This metadata is stored in /var/lib/docker/containers/<container-id>/.

Step 4: Delegate to containerd. dockerd sends a gRPC request to containerd to create the container. containerd generates the OCI runtime spec (config.json) — a JSON file that defines namespaces, cgroups, mounts, and the process to execute.
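A heavily trimmed, illustrative config.json looks like this (the field names are real OCI runtime-spec names, but a full spec also carries mounts, capabilities, and cgroup resources):

```shell
# Minimal slice of an OCI runtime spec (config.json), illustrative only:
cat <<'JSON' | python3 -m json.tool
{
  "ociVersion": "1.0.2",
  "process": {"args": ["sleep", "3600"], "cwd": "/"},
  "root": {"path": "rootfs"},
  "linux": {
    "namespaces": [
      {"type": "pid"},
      {"type": "network"},
      {"type": "mount"},
      {"type": "uts"},
      {"type": "ipc"}
    ]
  }
}
JSON
```

Any OCI-compliant runtime (runc, crun, kata-runtime) consumes this same file, which is what makes the runtime swappable.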

Step 5: Invoke runc. containerd invokes runc create with the OCI spec. runc reads config.json and executes kernel syscalls:
  • clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC) — creates a new process with namespaces
  • mount() — mounts /proc, /sys, /dev inside the container
  • pivot_root() — changes the root to the overlay2 merged directory
  • setuid()/setgid() — drops privileges (if non-root)
  • execve() — starts the application process
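You can observe namespace membership without Docker or root: the kernel exposes each process's namespaces as inodes under /proc/<pid>/ns/. A quick unprivileged sketch (Linux only):

```shell
# Two ordinary host processes share the same network namespace,
# so their namespace inodes match:
readlink /proc/self/ns/net   # e.g. net:[4026531840]
readlink /proc/$$/ns/net     # same inode
# After runc calls clone(CLONE_NEWNET) for a container, the container
# process's inode here differs from the host's.
```

Comparing these inodes between a container's host PID and PID 1 is exactly how you verify which namespaces a container actually received.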

Step 6: Network setup. dockerd (via libnetwork) creates a veth pair — one end in the container's network namespace, one end on the Docker bridge (docker0). The container gets an IP address from the bridge's subnet. iptables rules are added for port publishing (-p) and inter-container communication.

Step 7: Monitor the process. containerd-shim monitors the container process, captures stdout/stderr, and handles signals. The shim survives daemon restarts, which is what allows containers to keep running while dockerd is upgraded. When the application process exits, containerd-shim reports the exit code to containerd, which reports it to dockerd. (A dedicated pause process that holds shared namespaces open is a Kubernetes-pod construct; a standalone docker run container has no pause process.)
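The shim's job can be sketched in a few lines of plain shell: start a child, capture its output, wait for it, and pass the exit code upward. This is illustrative only, not containerd-shim's actual implementation:

```shell
# Shim-in-miniature (illustrative): supervise a child process, capture its
# stdout, and report its exit code to the layer above.
log=$(mktemp)
sh -c 'echo "app output"; exit 42' > "$log" 2>&1 &
app_pid=$!
wait "$app_pid"
code=$?
echo "captured: $(cat "$log")"                  # -> captured: app output
echo "exit code reported upstream: $code"       # -> exit code reported upstream: 42
rm -f "$log"
```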

io/thecodeforge/container_creation_flow.sh · BASH
#!/bin/bash
# Trace the complete container creation flow

# ── 1. Create a container (without starting it) ─────────────────────────────
docker create --name flow-demo \
  --cpus=1.0 \
  --memory=256m \
  -p 8080:3000 \
  -v demo-data:/app/data \
  alpine:3.19 sleep 3600

# ── 2. Inspect the container metadata ────────────────────────────────────────
# Container config stored by the daemon:
ls /var/lib/docker/containers/$(docker inspect flow-demo --format '{{.Id}}')/
# config.v2.json  hostconfig.json  hostname  hosts  resolv.conf  ...

# ── 3. Start the container and trace the flow ────────────────────────────────
docker start flow-demo

# Get the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' flow-demo)
echo "Container process PID on host: $CONTAINER_PID"

# ── 4. Inspect the OCI runtime spec ──────────────────────────────────────────
# Find the container's bundle in containerd's state (bundle dirs are named by container ID, not name):
sudo find /run/containerd -path "*$(docker inspect flow-demo --format '{{.Id}}')*" -name config.json 2>/dev/null

# Inspect the OCI spec (namespaces, mounts, process):
sudo cat /run/containerd/io.containerd.runtime.v2.task/moby/*/config.json 2>/dev/null | python3 -m json.tool | head -60

# ── 5. Inspect the overlay2 filesystem ───────────────────────────────────────
docker inspect flow-demo --format '{{json .GraphDriver.Data}}' | python3 -m json.tool
# MergedDir: what the container sees as /
# UpperDir: writable layer (container-specific changes)
# LowerDir: read-only image layers

# ── 6. Inspect the network setup ─────────────────────────────────────────────
# Container's network config:
docker inspect flow-demo --format '{{json .NetworkSettings}}' | python3 -m json.tool | head -20
# Shows: IPAddress, Gateway, Ports, Networks

# veth pair on the host:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0

# iptables rules for port publishing:
sudo iptables -t nat -L -n | grep 8080
# DNAT rule forwarding host:8080 to container:3000

# ── 7. Find the containerd-shim process ──────────────────────────────────────
ps aux | grep containerd-shim | grep -v grep
# root  5677 ... /usr/bin/containerd-shim-runc-v2 -namespace moby -id <container-id> ...

# The shim is the parent of the container's init process:
ps -o ppid= -p $CONTAINER_PID
# PPID matches the shim's PID; the shim keeps running even if dockerd restarts

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f flow-demo
▶ Output
# Container created:
flow-demo

# Host PID:
Container process PID on host: 5679

# Overlay2:
{
"LowerDir": "/var/lib/docker/overlay2/.../layers",
"MergedDir": "/var/lib/docker/overlay2/.../merged",
"UpperDir": "/var/lib/docker/overlay2/.../diff",
"WorkDir": "/var/lib/docker/overlay2/.../work"
}

# Network:
{
"IPAddress": "172.17.0.2",
"Gateway": "172.17.0.1",
"Ports": {"3000/tcp": [{"HostIp": "0.0.0.0", "HostPort": "8080"}]}
}

# iptables:
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.2:3000
Mental Model
Container Creation as Furnishing a Room
Why does runc exit after creating the container?
  • runc's job is to create the container, not to manage it. After execve(), runc is replaced by the application process.
  • containerd-shim monitors the application process, captures output, and handles signals.
  • In Kubernetes pods, a dedicated pause container holds the pod's shared namespaces open so they survive container restarts; a standalone Docker container relies on the shim as the long-lived parent instead.
  • This separation allows containerd to manage the lifecycle without being PID 1 in the container.
📊 Production Insight
The iptables rules for port publishing (-p) are created by dockerd when the container starts. If dockerd crashes and restarts, it recreates the rules for running containers. But if iptables is flushed (iptables -F) while dockerd is running, the port forwarding breaks and containers become unreachable from the host. The fix: restart dockerd to recreate the rules, or use docker network connect to re-attach containers to networks.
🎯 Key Takeaway
Container creation flow: CLI -> dockerd API -> containerd -> runc -> kernel syscalls. runc calls clone() for namespaces, pivot_root() for the filesystem, execve() for the application. Network setup creates veth pairs and iptables rules. containerd-shim monitors the process and survives daemon restarts. Every container is a real Linux process on the host.
Container Creation Failure Points
  • If image not found locally or in registry → docker pull fails. Check image name, tag, and registry credentials. Check network connectivity.
  • If port already allocated → container creation fails. Check docker ps for conflicting port mappings. Change the host port.
  • If volume mount path does not exist → Docker creates the path as a directory (for bind mounts) or fails (for named volumes). Check volume existence.
  • If runc fails to create namespaces → kernel error; check dmesg. May indicate cgroup or namespace limits. Check /proc/sys/user/max_user_namespaces.

Storage Architecture: overlay2, Volumes, and the Filesystem Stack

Docker's storage architecture has three layers: the image layers (read-only, cached), the container layer (writable, per-container), and volumes (persistent, managed separately). Understanding this stack explains why containers start fast, why data disappears, and why database performance differs between containers and bare metal.

overlay2 driver: The default storage driver. It stacks directories (layers) and presents a merged view. Each image layer is a directory under /var/lib/docker/overlay2/. The container's writable layer is a separate directory. The merged view is what the container sees as its root filesystem.

Layer sharing: Multiple containers from the same image share the same read-only layers. Each container has its own writable layer. This is why starting a second container from the same image is nearly instant — no data is copied, only a new writable directory is created.
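The sharing model can be simulated with plain directories, no mounts or root required: one shared "lower" directory standing in for the read-only image layers and one private "upper" directory per container (a sketch of the lookup rule, not real overlayfs):

```shell
# Simulated layer sharing: a shared read-only "lower" dir plus a private
# "upper" dir per container; the merged view prefers upper, else lower.
base=$(mktemp -d)
mkdir -p "$base/lower" "$base/upper_a" "$base/upper_b"
echo "from image"   > "$base/lower/app.conf"
echo "changed by A" > "$base/upper_a/app.conf"   # container A's private copy
merged() { [ -f "$base/$1/app.conf" ] && cat "$base/$1/app.conf" || cat "$base/lower/app.conf"; }
merged upper_a   # -> changed by A
merged upper_b   # -> from image (B still reads the shared lower layer)
rm -rf "$base"
```

Container A's change never touches the shared lower directory, which is why B (and the image itself) stay pristine.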

Volumes: Named volumes are directories under /var/lib/docker/volumes/<volume-name>/_data. They are mounted into the container at the specified path. Volumes bypass overlay2 entirely — reads and writes go directly to the host filesystem. This is why databases should use volumes: no copy-up overhead, no overlay2 performance penalty, and data survives container deletion.

Bind mounts: Bind mounts map a specific host directory into the container. They also bypass overlay2. Bind mounts are ideal for development (live code reload) but risky in production (the container can modify host files).

tmpfs mounts: tmpfs mounts store data in memory only. They never touch the disk. Useful for sensitive data (secrets, session tokens) that should not persist.

Storage driver alternatives: overlay2 is the default on all modern Linux distributions. Other drivers include fuse-overlayfs (rootless containers), devicemapper (legacy, deprecated), btrfs (Btrfs filesystem), and zfs (ZFS filesystem). overlay2 is recommended for all use cases unless you have a specific reason to use another driver.

io/thecodeforge/storage_architecture.sh · BASH
#!/bin/bash
# Inspect the complete Docker storage architecture

# ── 1. Check the storage driver ──────────────────────────────────────────────
docker info --format '{{.Driver}}'
# overlay2 (default on modern Linux)

# ── 2. Inspect the overlay2 directory structure ──────────────────────────────
ls /var/lib/docker/overlay2/ | head -10
# Each directory is a layer (image or container writable layer)

# Each layer directory contains:
ls /var/lib/docker/overlay2/<layer-hash>/
# diff/   — the actual filesystem content (only files that changed)
# link    — short name for the layer (used for path length limits)
# lower   — references to parent layers
# merged/ — the combined view (only for container layers)
# work/   — overlay2 internal working directory

# ── 3. Compare container vs image layers ─────────────────────────────────────
# Image layers are read-only and shared:
IMAGE_LAYERS=$(docker inspect alpine:3.19 --format '{{.RootFS.Layers}}')
echo "Image has $(echo $IMAGE_LAYERS | tr ' ' '\n' | wc -l) layers"

# Container adds one writable layer:
docker create --name storage-test alpine:3.19 sleep 3600
CONTAINER_UPPER=$(docker inspect storage-test --format '{{.GraphDriver.Data.UpperDir}}')
echo "Container writable layer: $CONTAINER_UPPER"

# ── 4. Demonstrate layer sharing between containers ──────────────────────────
# Create two containers from the same image
docker create --name storage-a alpine:3.19 sleep 3600
docker create --name storage-b alpine:3.19 sleep 3600

# Compare their lower layers (should be identical):
LOWER_A=$(docker inspect storage-a --format '{{.GraphDriver.Data.LowerDir}}')
LOWER_B=$(docker inspect storage-b --format '{{.GraphDriver.Data.LowerDir}}')
echo "Container A lower: $LOWER_A"
echo "Container B lower: $LOWER_B"
# Same layers — shared, not duplicated

# Compare their upper layers (should be different):
UPPER_A=$(docker inspect storage-a --format '{{.GraphDriver.Data.UpperDir}}')
UPPER_B=$(docker inspect storage-b --format '{{.GraphDriver.Data.UpperDir}}')
echo "Container A upper: $UPPER_A"
echo "Container B upper: $UPPER_B"
# Different directories — each container has its own writable layer

# ── 5. Inspect volumes ───────────────────────────────────────────────────────
docker volume create demo-volume

# Volume location on host:
docker volume inspect demo-volume --format '{{.Mountpoint}}'
# /var/lib/docker/volumes/demo-volume/_data

# Volumes bypass overlay2 — direct host filesystem access:
docker run --rm -v demo-volume:/data alpine:3.19 sh -c 'echo hello > /data/test'
cat /var/lib/docker/volumes/demo-volume/_data/test
# hello — directly accessible on the host

# ── 6. Compare performance: overlay2 vs volume vs bind mount ─────────────────
# overlay2 write (container writable layer):
time docker run --rm alpine:3.19 sh -c 'dd if=/dev/zero of=/tmp/test bs=1M count=100'
# ~0.3s

# Volume write:
time docker run --rm -v demo-volume:/data alpine:3.19 sh -c 'dd if=/dev/zero of=/data/test bs=1M count=100'
# ~0.2s (slightly faster — no overlay2 overhead)

# Bind mount write:
time docker run --rm -v $(pwd):/data alpine:3.19 sh -c 'dd if=/dev/zero of=/data/test bs=1M count=100'
# ~0.2s (direct host filesystem)

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f storage-test storage-a storage-b
docker volume rm demo-volume
▶ Output
# Storage driver:
overlay2

# Image layers:
Image has 1 layers

# Layer sharing:
Container A lower: /var/lib/docker/overlay2/abc123/layers
Container B lower: /var/lib/docker/overlay2/abc123/layers
# Same layers — shared

Container A upper: /var/lib/docker/overlay2/def456/diff
Container B upper: /var/lib/docker/overlay2/ghi789/diff
# Different writable layers

# Volume:
/var/lib/docker/volumes/demo-volume/_data

# Performance:
overlay2: 0.31s
volume: 0.22s
bind: 0.21s
Mental Model
Storage Architecture as a Building
Why should databases use volumes instead of the overlay2 filesystem?
  • overlay2 has a copy-up penalty: modifying a file from a lower layer requires copying it to the upper layer first.
  • For multi-GB database files, copy-up causes seconds of latency on first write.
  • Volumes bypass overlay2 entirely — reads and writes go directly to the host filesystem.
  • Volumes survive container deletion. The overlay2 writable layer is deleted when the container is removed.
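The copy-up cost is easy to demonstrate with ordinary files (a simulation of the mechanism, not a real overlay mount): a one-byte write to a file that exists only in the lower layer first copies the entire file into the upper layer.

```shell
# Simulated copy-up: modifying a lower-only file requires copying the whole
# file to the upper layer first, even for a one-byte change.
base=$(mktemp -d)
mkdir "$base/lower" "$base/upper"
dd if=/dev/zero of="$base/lower/db.dat" bs=1024 count=64 2>/dev/null  # 64 KB "database file"
cp "$base/lower/db.dat" "$base/upper/db.dat"                          # the copy-up step
printf 'x' | dd of="$base/upper/db.dat" bs=1 count=1 conv=notrunc 2>/dev/null
wc -c < "$base/upper/db.dat"   # -> 65536 (the whole file now lives in upper)
rm -rf "$base"
```

Scale the 64 KB file to a multi-GB database and the one-time copy becomes seconds of first-write latency, which is exactly what volumes avoid.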
📊 Production Insight
The layer sharing mechanism means that running 10 containers from the same image uses only one copy of the read-only layers plus 10 small writable layers. This is why Docker achieves 10-50x better density than VMs — shared layers are not duplicated. Monitor /var/lib/docker/overlay2/ disk usage to ensure shared layers are not consuming excessive space.
🎯 Key Takeaway
overlay2 stacks read-only image layers with a writable container layer. Multiple containers share read-only layers — this is the key to Docker's density advantage. Volumes bypass overlay2 and go directly to the host filesystem. Databases should always use volumes to avoid the copy-up overhead and ensure data persistence.
Storage Strategy by Use Case
  • Stateless application (API, web server) → default overlay2 writable layer. No volumes needed.
  • Database or persistent data → named volume. Bypasses overlay2. Survives container deletion.
  • Development with live code reload → bind mount (-v ./src:/app/src). Direct host access for fast iteration.
  • Sensitive data (secrets, tokens) → tmpfs mount (--tmpfs /secrets:size=1m). In-memory only, never on disk.

Network Architecture: Bridge, veth, iptables, and DNS

Docker networking is built on Linux networking primitives — virtual bridges, veth pairs, iptables rules, and an embedded DNS server. Understanding these primitives explains why containers can communicate, how ports are published, and why the default bridge network lacks DNS.

The Docker bridge (docker0): When Docker is installed, it creates a Linux bridge called docker0 on the host. This bridge acts as a virtual switch. Each container connects to this bridge via a veth pair.

veth pairs: A veth (virtual Ethernet) pair is a pair of connected network interfaces — packets sent to one end appear on the other. Docker creates a veth pair for each container: one end (eth0) is inside the container's network namespace, the other end (vethXXXX) is attached to the docker0 bridge. This is how containers communicate with each other and the outside world.

iptables rules: Docker adds iptables rules for:
  • Port publishing (-p): DNAT rules forward traffic from the host port to the container's IP and port.
  • Inter-container communication: the default bridge allows all containers to communicate unless the daemon is started with --icc=false. User-defined networks isolate containers by network membership instead.
  • Outbound NAT: MASQUERADE rules allow containers to reach the internet via the host's network interface.

DNS resolution: The default bridge network has no DNS resolution — containers can only reach each other by IP. User-defined bridge networks have an embedded DNS server (127.0.0.11) that resolves container names to IP addresses. This is why docker-compose.yml services can reference each other by service name.

Network drivers: Docker supports multiple network drivers:
  • bridge: default for single-host setups. Creates a virtual bridge.
  • host: container shares the host's network stack. No isolation, best performance.
  • none: no network. Completely air-gapped.
  • overlay: VXLAN tunnel for multi-host communication (Docker Swarm, Kubernetes).
  • macvlan: assigns a MAC address to the container, making it appear as a physical device on the network.

io/thecodeforge/network_architecture.sh · BASH
#!/bin/bash
# Inspect the complete Docker network architecture

# ── 1. Check the Docker bridge ───────────────────────────────────────────────
ip addr show docker0
# docker0: <BROADCAST,MULTICAST,UP> mtu 1500
#     inet 172.17.0.1/16
# The bridge has the gateway IP for the container subnet

# ── 2. Create a container and inspect its veth pair ──────────────────────────
docker run -d --name net-demo alpine:3.19 sleep 3600

# Get the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' net-demo)

# Inside the container, see eth0 (one end of the veth pair):
docker exec net-demo ip addr show eth0
# eth0@if7: <BROADCAST,MULTICAST,UP> mtu 1500
#     inet 172.17.0.2/16

# On the host, see the other end of the veth pair:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0
# The host end is attached to the docker0 bridge

# ── 3. Inspect iptables rules ────────────────────────────────────────────────

# NAT rules for port publishing:
sudo iptables -t nat -L DOCKER -n -v
# DNAT  tcp  --  anywhere  anywhere  tcp dpt:8080 to:172.17.0.2:3000

# Forward rules:
sudo iptables -L DOCKER -n -v
# ACCEPT  tcp  --  anywhere  172.17.0.2  tcp dpt:3000

# MASQUERADE for outbound traffic:
sudo iptables -t nat -L POSTROUTING -n -v | grep 172.17
# MASQUERADE  all  --  172.17.0.0/16  !172.17.0.0/16

# ── 4. Check DNS resolution (default vs user-defined network) ────────────────

# Default bridge β€” no DNS:
docker exec net-demo cat /etc/resolv.conf
# nameserver 8.8.8.8 (host's DNS, not container-specific)
docker exec net-demo nslookup other-container
# Fails — no embedded DNS on default bridge

# User-defined network β€” embedded DNS:
docker network create app-net
docker run -d --name api --network app-net alpine:3.19 sleep 3600
docker run -d --name db --network app-net alpine:3.19 sleep 3600

docker exec api cat /etc/resolv.conf
# nameserver 127.0.0.11 (embedded DNS server)
docker exec api nslookup db
# Name: db  Address: 172.18.0.3

# ── 5. Inspect network configuration ─────────────────────────────────────────
docker network inspect bridge --format '{{json .IPAM.Config}}' | python3 -m json.tool
# [{"Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1"}]

docker network inspect app-net --format '{{json .IPAM.Config}}' | python3 -m json.tool
# [{"Subnet": "172.18.0.0/16", "Gateway": "172.18.0.1"}]

# ── 6. Trace network traffic ─────────────────────────────────────────────────
# Capture packets on the docker0 bridge:
sudo tcpdump -i docker0 -n -c 10
# Shows ARP requests, TCP SYN packets between containers

# Capture packets inside a container:
sudo nsenter --net --target $CONTAINER_PID tcpdump -i eth0 -n -c 10

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f net-demo api db
docker network rm app-net
▶ Output
# Docker bridge:
docker0: inet 172.17.0.1/16

# Container eth0:
eth0@if7: inet 172.17.0.2/16

# veth pair on host:
vethXXXX@if4: master docker0

# iptables NAT:
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.2:3000

# Default bridge DNS:
nameserver 8.8.8.8

# User-defined network DNS:
nameserver 127.0.0.11
Name: db Address: 172.18.0.3
Mental Model
Networking as a Building's Phone System
Why does the default bridge network lack DNS resolution?
  • The default bridge is a legacy design from before Docker had user-defined networks.
  • Docker chose not to add DNS to the default bridge to avoid breaking backward compatibility.
  • User-defined networks were introduced later with DNS as a built-in feature.
  • The default bridge is effectively deprecated for production use β€” always create a user-defined network.
📊 Production Insight
The iptables rules for port publishing are managed by dockerd. If iptables is flushed (iptables -F) while containers are running, port forwarding breaks. If dockerd restarts, it recreates the rules. But if a third-party firewall tool (ufw, firewalld) modifies iptables, Docker's rules may be overwritten. The fix: configure firewalld to not manage the docker0 bridge, or use Docker's --iptables=false flag and manage rules manually.
🎯 Key Takeaway
Docker networking uses a Linux bridge (docker0), veth pairs, and iptables rules. The default bridge has no DNS — always use user-defined networks. iptables rules for port publishing are managed by dockerd — third-party firewalls can interfere. The embedded DNS server (127.0.0.11) resolves container names on user-defined networks.
🗂 Docker Component Responsibilities
What each component in the Docker stack does and what happens when it fails.
Component | Role | Failure Impact | Runs As
Docker CLI | Sends API requests to the daemon | CLI commands fail — containers unaffected | User process
dockerd (daemon) | Manages API, images, networks, volumes | All CLI operations fail — existing containers keep running | Root process
containerd | Manages container lifecycle, image pulling | New containers cannot be created — existing ones keep running | Root process
runc | Creates a single container from the OCI spec | That container's creation fails — others unaffected | Short-lived (exits after creation)
containerd-shim | Monitors container process, captures output | Container loses stdout/stderr capture — process still runs | Per-container process
pause (Kubernetes pods) | Holds the pod's shared namespaces open | Pod containers cannot restart — namespaces destroyed on exit | Per-pod process

🎯 Key Takeaways

  • Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each component has a specific role. The OCI spec standardizes the interface between containerd and runc.
  • Image build flow: CLI sends context -> daemon parses Dockerfile -> cache lookup per instruction -> execute miss -> commit layer -> tag image. .dockerignore is mandatory.
  • Container creation flow: CLI -> dockerd API -> containerd -> runc -> kernel syscalls (clone, pivot_root, execve). Every container is a real Linux process on the host.
  • overlay2 stacks read-only image layers with a writable container layer. Multiple containers share read-only layers — this is the key to Docker's density advantage.
  • Docker networking uses a Linux bridge, veth pairs, and iptables. The default bridge has no DNS — always use user-defined networks.
  • The daemon is a single point of failure. Limit concurrent operations, use BuildKit, and monitor daemon resource usage in production.

⚠ Common Mistakes to Avoid

  • ✕ Mistake 1: Not understanding that the daemon is a single point of failure — Symptom: all Docker operations hang when the daemon is overloaded — Fix: limit concurrent builds, use BuildKit, separate build hosts from runtime hosts, monitor daemon resource usage.
  • ✕ Mistake 2: Sending a 500MB build context without .dockerignore — Symptom: docker build hangs at 'Sending build context' for minutes — Fix: create .dockerignore with node_modules/, .git/, *.log, coverage/. This alone can reduce build time from 5 minutes to 10 seconds.
  • ✕ Mistake 3: Using the default bridge network and expecting DNS resolution — Symptom: containers cannot reach each other by hostname — Fix: create a user-defined bridge network. The embedded DNS server only works on user-defined networks.
  • ✕ Mistake 4: Writing database data to the overlay2 filesystem — Symptom: slow writes due to copy-up, data loss on container removal — Fix: use named volumes for databases. Volumes bypass overlay2 and survive container deletion.
  • ✕ Mistake 5: Not setting resource limits on containers — Symptom: one container consumes all host RAM, OOM-killing unrelated containers — Fix: set --cpus and --memory on every production container. Monitor with docker stats.
  • ✕ Mistake 6: Flushing iptables while Docker is running — Symptom: container port forwarding breaks, containers become unreachable from the host — Fix: restart dockerd to recreate iptables rules. Configure firewalld to not manage the docker0 bridge.
  • ✕ Mistake 7: Not authenticating CI runners for Docker Hub pulls — Symptom: CI builds fail with 'toomanyrequests' after hitting the 100-pull-per-6-hours limit — Fix: run docker login on all CI agents. Consider a pull-through cache registry.

Interview Questions on This Topic

  • Q: Walk me through the complete flow from 'docker build' to a cached image on disk. What happens at each step, and how does layer caching work?
  • Q: Explain the Docker component stack: CLI, daemon, containerd, runc. What does each component do, and what happens when each one fails?
  • Q: How does the OCI spec enable runtime replaceability? Why can you swap runc for gVisor or Kata without changing Docker?
  • Q: Trace the container creation flow from 'docker run' to a running process. What kernel syscalls does runc make?
  • Q: How does Docker networking work at the Linux level? Explain veth pairs, the docker0 bridge, and iptables rules for port publishing.
  • Q: Why should databases use named volumes instead of the overlay2 filesystem? What is the copy-up problem?
  • Q: Your CI pipeline runs 50 concurrent docker build operations and the daemon becomes unresponsive. What is happening and how do you fix it?

Frequently Asked Questions

What is the difference between containerd and dockerd?

dockerd (the Docker daemon) is the user-facing server that manages the Docker API, image building, networking, and volumes. containerd is the container runtime that manages the container lifecycle — pulling images, creating containers, and handling execution. dockerd delegates to containerd for container operations. containerd was extracted from Docker in 2017 and is now used independently by Kubernetes and other platforms.

What is the OCI spec and why does it matter?

The OCI (Open Container Initiative) spec defines two standards: the image spec (how images are packaged as layers + manifest) and the runtime spec (how containers are created as config.json). This standardization means any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can run any OCI-compliant image. It enables runtime replaceability — you can swap runc for gVisor without changing Docker.

Can I use containerd directly without dockerd?

Yes. containerd provides its own CLI (ctr) and API (gRPC). Kubernetes uses containerd directly via the CRI plugin, bypassing dockerd entirely. You can use ctr to pull images, create containers, and manage snapshots. This reduces overhead and removes the daemon as a single point of failure.

Why is my docker build so slow at 'Sending build context'?

The build context is the directory you pass as the positional argument to docker build (the -f flag only selects the Dockerfile). Without a .dockerignore file, this includes node_modules (500MB+), .git history (100MB+), and other large files. The CLI tars this directory and sends it to the daemon over the Unix socket. Create a .dockerignore file to exclude unnecessary files. This alone can reduce build time from minutes to seconds.
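A minimal .dockerignore for a typical Node.js project might look like this (the entries are suggestions; adjust them to your stack):

```
node_modules/
.git/
coverage/
dist/
*.log
.env
```

Patterns follow the same matching rules as Dockerfile paths; anything matched is never tarred into the build context.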

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
