How Docker Works Internally: Architecture, Namespaces, and Containers Explained
- Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls.
- Namespaces isolate what a container can see (PID tree, network, filesystem, hostname, IPC). cgroups limit what a container can consume (CPU, memory, I/O, processes).
- Every container is a real Linux process visible in ps aux on the host. The kernel does not know what a 'container' is; it only knows processes, namespaces, and cgroups.
- Docker CLI sends API requests to the Docker daemon (dockerd)
- dockerd delegates container lifecycle to containerd
- runc reads the OCI runtime spec and configures namespaces, cgroups, and filesystem
- Namespaces: isolate PID, network, mount, user, UTS, and IPC views
- cgroups: limit CPU, memory, I/O, and process count per container
- Union filesystem (overlay2): stack read-only image layers with a writable top layer
- seccomp: filter syscalls at the kernel level
Production Debug Guide
Systematic debugging paths using /proc, /sys, and namespace inspection.

- Container is consuming too much CPU or memory:
  `docker inspect --format '{{.State.Pid}}' <container>`
  `cat /proc/<pid>/cgroup && cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes`
- Container cannot reach the network:
  `docker inspect --format '{{.State.Pid}}' <container>`
  `nsenter --net --target <pid> ip addr show && nsenter --net --target <pid> ip route show`
- Container cannot write to a volume (permission denied):
  `docker exec <container> id`
  `ls -la /var/lib/docker/volumes/<volume>/_data`
- Container exits immediately with code 137:
  `dmesg | grep -i 'oom\|killed process'`
  `cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes`
- Container filesystem shows unexpected or missing files:
  `docker inspect --format '{{.GraphDriver.Data.MergedDir}}' <container>`
  `ls -la $(docker inspect --format '{{.GraphDriver.Data.UpperDir}}' <container>)`
- Docker daemon is unresponsive:
  `systemctl status docker && systemctl status containerd`
  `journalctl -u docker --since '10 minutes ago' --no-pager | tail -50`
Most Docker tutorials stop at 'docker run' and never explain what happens inside the kernel. This creates a dangerous gap: when containers misbehave, engineers without kernel-level understanding cannot diagnose the root cause. They restart containers, rebuild images, and escalate to platform teams for problems that a single /proc inspection would have solved.
Docker is a stack of components: the CLI, the daemon (dockerd), containerd, runc, and the Linux kernel. Each layer has a specific responsibility. The daemon manages images and the API. containerd manages container lifecycle. runc creates containers by configuring kernel primitives: namespaces for isolation, cgroups for resource limits, and overlay2 for the filesystem. The kernel does the actual work.
Understanding this stack is essential for production debugging. When a container cannot resolve DNS, the answer is in the network namespace. When a container is OOM-killed, the answer is in the cgroup memory controller. When a container starts slowly, the answer is in the overlay2 filesystem or image pull. Every container problem has a kernel-level root cause.
The Docker Stack: From CLI to Kernel, Every Component Explained
Docker is not a single program. It is a stack of components, each with a specific responsibility. Understanding this stack is the foundation for debugging any container issue.
Docker CLI (docker): The command-line interface. It sends HTTP API requests to the Docker daemon. The CLI does not create containers; it is a client that talks to the server. You can replace it with curl, Postman, or any HTTP client.
Docker daemon (dockerd): The server that manages images, networks, volumes, and the container API. It listens on a Unix socket (/var/run/docker.sock) or a TCP port. The daemon does not create containers directly; it delegates to containerd.
containerd: A container runtime that manages the complete container lifecycle: pulling images, creating containers, managing snapshots, and handling container execution. containerd was originally part of Docker but was extracted as a standalone project. It is now used by Docker, Kubernetes (via CRI), and other orchestration platforms.
runc: A lightweight container runtime that creates containers using Linux kernel primitives. runc reads an OCI (Open Container Initiative) runtime specification, a JSON file that describes the container's namespaces, cgroups, mounts, and environment. runc calls clone() to create a new process, configures namespaces and cgroups, calls pivot_root() to change the filesystem, and exec()s the application. runc exits after creating the container; it does not manage the container's lifecycle.
The OCI spec: The Open Container Initiative defines two standards: the image spec (how images are packaged) and the runtime spec (how containers are created). runc implements the runtime spec. This standardization means any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can run OCI-compliant images.
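For orientation, a runtime spec looks roughly like the following. This is an abridged, hand-written sketch; a real config.json generated by containerd also carries mounts, environment, capabilities, seccomp profiles, and many more fields:

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "user": { "uid": 0, "gid": 0 },
    "args": ["/bin/sh"],
    "cwd": "/"
  },
  "root": { "path": "rootfs", "readonly": false },
  "hostname": "demo",
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" },
      { "type": "uts" },
      { "type": "ipc" }
    ],
    "resources": {
      "memory": { "limit": 536870912 }
    }
  }
}
```

Everything runc does at container creation is driven by this file: one namespace entry per isolation layer, one resources entry per cgroup controller.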
The flow: docker run -> dockerd API -> containerd creates container spec -> runc reads OCI spec -> runc calls clone() with namespaces -> runc configures cgroups -> runc pivot_root to overlay2 filesystem -> runc exec the application process -> runc exits -> containerd monitors the container process.
```bash
#!/bin/bash
# Inspect every layer of the Docker stack

# -- Docker CLI -> Daemon communication --------------------------------------
# The CLI sends HTTP requests to the daemon. You can do this manually:
curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m json.tool
# Shows: Docker version, API version, Go version, OS, architecture
curl --unix-socket /var/run/docker.sock http://localhost/containers/json | python3 -m json.tool
# Shows: all running containers (same as docker ps)

# -- Check if containerd is running ------------------------------------------
systemctl status containerd
# containerd is the container runtime daemon
# It manages container lifecycle independently of dockerd

# -- Find the runc binary ----------------------------------------------------
which runc
# Typically: /usr/bin/runc or /usr/local/bin/runc
runc --version
# Shows: runc version, commit, spec version (OCI 1.0.2)

# -- Inspect the OCI runtime spec for a running container --------------------
# containerd stores the OCI spec for each container
CONTAINER_ID=$(docker ps -q | head -1)
# Find the container's bundle directory (contains config.json)
find /run/containerd/io.containerd.runtime.v2.task/default/ -name config.json 2>/dev/null | head -1
# This file is the OCI runtime spec: it defines namespaces, cgroups, mounts

# -- Trace the container creation flow ---------------------------------------
# Start a container and watch the kernel calls
strace -f -e trace=clone,unshare,pivot_root,chroot,execve \
    -o /tmp/container-trace.log \
    runc run test-container &
# The trace shows:
#   clone(CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|...) = <child-pid>
#   pivot_root(".", "/old-root")                     = 0
#   execve("/app/server", ["server"], ...)           = 0

# -- Check the daemon socket -------------------------------------------------
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker /var/run/docker.sock
# The socket is owned by root:docker group
# Any process in the docker group can control ALL containers

# -- Check the daemon process tree -------------------------------------------
pstree -p $(pidof dockerd)
# dockerd-+-containerd-+-containerd-shim-runc-v2-+-<app-pid>
#         |            |                         `-pause
#         |            `-containerd-shim-runc-v2-+-<app-pid>
#         |                                      `-pause
#         `-docker-proxy (for published ports)
```
```
# /version API response:
{
  "Version": "24.0.7",
  "ApiVersion": "1.43",
  "MinAPIVersion": "1.12",
  "GitCommit": "afdd53b",
  "GoVersion": "go1.20.10",
  "Os": "linux",
  "Arch": "amd64"
}

# runc version:
runc version 1.1.9
commit: v1.1.9-0-gccaecfc
spec: 1.0.2-dev

# Process tree:
dockerd(1234)---containerd(1235)---containerd-shim(5678)---node(5679)
# The container process (node, PID 5679) is a real Linux process on the host
```
- Separation of concerns: the daemon manages the API and images, containerd manages lifecycle, runc creates containers.
- Replaceability: you can swap runc for crun (faster), kata-runtime (VM isolation), or runsc (gVisor) without changing Docker.
- Standardization: the OCI spec ensures any compliant runtime can run any compliant image.
- Kubernetes reuses containerd directly; it does not need dockerd. This is why containerd was extracted.
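Swapping the runtime is a daemon configuration change, not a code change. A sketch of /etc/docker/daemon.json registering crun as an additional runtime (assumes crun is installed at /usr/bin/crun; select it per container with `docker run --runtime=crun ...`):

```json
{
  "runtimes": {
    "crun": {
      "path": "/usr/bin/crun"
    }
  }
}
```

After editing the file, restart the daemon (`systemctl restart docker`) for the new runtime to be registered.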
Linux Namespaces: The Isolation Mechanism Behind Every Container
Namespaces are the Linux kernel feature that provides process isolation. Each namespace gives a process its own view of a system resource. A container is a regular Linux process that runs inside a set of namespaces: it sees its own PID tree, its own network stack, its own filesystem mount points, and its own hostname, even though it shares the host kernel.
There are seven namespace types in Linux. Docker uses six of them by default:
PID namespace (CLONE_NEWPID): Each container has its own PID tree. The first process inside the container is PID 1. Processes inside the container cannot see processes outside the container. On the host, the container process has a real PID; you can see it with ps aux. The PID namespace is hierarchical: a child namespace can see parent PIDs if configured, but not sibling PIDs.
Network namespace (CLONE_NEWNET): Each container gets its own network stack: its own interfaces, routing table, firewall rules, and /proc/net. When Docker creates a container, it creates a veth (virtual Ethernet) pair, one end inside the container's network namespace, one end connected to the Docker bridge. This is how containers communicate with each other and the outside world.
Mount namespace (CLONE_NEWNS): Each container has its own mount table. The container's root filesystem is a union mount (overlay2) that layers the image's read-only layers with a writable top layer. The container cannot see the host's filesystem unless explicitly mounted. pivot_root changes the container's root directory to the overlay2 merge directory.
User namespace (CLONE_NEWUSER): Maps container UIDs to different host UIDs. Container UID 0 (root) can be mapped to host UID 100000 (unprivileged). This means even a container escape results in an unprivileged host user. User namespace remapping is not enabled by default in Docker because it breaks some workflows (volume permissions, Docker-in-Docker).
UTS namespace (CLONE_NEWUTS): Each container has its own hostname. The hostname is set during container creation and can be changed inside the container without affecting the host or other containers.
IPC namespace (CLONE_NEWIPC): Each container has its own System V IPC and POSIX message queues. Processes in different containers cannot share shared memory segments or message queues.
Cgroup namespace (CLONE_NEWCGROUP): Virtualizes the /proc/self/cgroup view. The container sees its own cgroup path as '/' instead of the real path (/docker/<container-id>). This prevents the container from seeing or manipulating other containers' cgroups.
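Namespace membership is literally an inode: two processes share a namespace exactly when their /proc/&lt;pid&gt;/ns/* links resolve to the same object. A minimal check that needs no Docker and no privileges (a sketch; any Linux host with /proc will do):

```shell
#!/bin/sh
# Compare the UTS namespace of this shell with that of a child process.
# No unshare()/clone() with namespace flags happens here, so the child
# inherits the parent's namespaces and both links resolve to the same inode.
shell_uts=$(readlink /proc/$$/ns/uts)          # e.g. uts:[4026531838]
child_uts=$(sh -c 'readlink /proc/$$/ns/uts')  # child shell's UTS namespace

if [ "$shell_uts" = "$child_uts" ]; then
  echo "shared UTS namespace: $shell_uts"
else
  echo "different UTS namespaces"
fi
```

Comparing these links between a container's host PID and host PID 1 is exactly how you verify which isolation layers a container actually has.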
```bash
#!/bin/bash
# Inspect and compare namespaces for containers and the host

# -- Get a container's host PID ----------------------------------------------
CONTAINER_NAME="my-api"
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' $CONTAINER_NAME)
echo "Container $CONTAINER_NAME is PID $CONTAINER_PID on the host"

# -- List all namespaces for the container process ---------------------------
ls -la /proc/$CONTAINER_PID/ns/
# Output:
#   lrwxrwxrwx ... ipc  -> 'ipc:[4026532XXX]'
#   lrwxrwxrwx ... mnt  -> 'mnt:[4026532XXX]'
#   lrwxrwxrwx ... net  -> 'net:[4026532XXX]'
#   lrwxrwxrwx ... pid  -> 'pid:[4026532XXX]'
#   lrwxrwxrwx ... user -> 'user:[4026531XXX]'
#   lrwxrwxrwx ... uts  -> 'uts:[4026532XXX]'

# -- Compare with host namespaces --------------------------------------------
ls -la /proc/1/ns/
# The host PID 1 (systemd) has different namespace IDs than the container
# If namespace IDs match, the container shares that namespace with the host

# -- PID namespace: container sees its own PID tree --------------------------
docker exec $CONTAINER_NAME ps aux
# PID 1 is the container's entrypoint process
# The container cannot see host processes
# On the host, the same process has a different PID:
ps aux | grep $(docker exec $CONTAINER_NAME cat /proc/1/cmdline | tr '\0' ' ')
# The host sees the real PID, the container sees PID 1

# -- Network namespace: container has its own network stack ------------------
docker exec $CONTAINER_NAME ip addr show
# Shows: lo (loopback) and eth0 (veth pair inside the container)
# On the host, inspect the veth pair:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0
# One end is in the container's net namespace, one end is on the docker0 bridge

# -- Enter a container's network namespace from the host ---------------------
sudo nsenter --net --target $CONTAINER_PID ip addr show
# Shows the same network config as docker exec, but from the host
# Useful for debugging without a shell inside the container

# -- Mount namespace: inspect the overlay2 filesystem ------------------------
docker inspect --format '{{.GraphDriver.Data}}' $CONTAINER_NAME
# Shows: MergedDir, UpperDir, LowerDir, WorkDir
# MergedDir is what the container sees as /
# UpperDir is the writable layer (container-specific changes)
# LowerDir is the read-only image layers (colon-separated)

# -- User namespace: check if remapping is enabled ---------------------------
cat /etc/subuid
# If userns-remap is enabled: dockremap:100000:65536
# This maps container UID 0 to host UID 100000

# -- UTS namespace: container has its own hostname ---------------------------
docker exec $CONTAINER_NAME hostname
# Shows the container's hostname (usually the container ID)
hostname
# Shows the host's hostname, different from the container

# -- IPC namespace: container has its own IPC resources ----------------------
docker exec $CONTAINER_NAME ipcs
# Shows only IPC resources created inside the container
ipcs
# Shows host IPC resources, not visible inside the container
```
```
Container my-api is PID 5679 on the host

# Container namespaces:
lrwxrwxrwx 1 root root 0 Jan 15 10:23 ipc -> 'ipc:[4026532847]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 mnt -> 'mnt:[4026532849]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 net -> 'net:[4026532851]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 pid -> 'pid:[4026532852]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 uts -> 'uts:[4026532853]'

# Container process list:
PID   USER   COMMAND
1     node   node dist/index.js

# Container network:
1: lo: <LOOPBACK,UP> mtu 65536
4: eth0@if7: <BROADCAST,MULTICAST,UP> mtu 1500
    inet 172.17.0.2/16

# Overlay2 filesystem:
MergedDir: /var/lib/docker/overlay2/abc123/merged
UpperDir:  /var/lib/docker/overlay2/abc123/diff
LowerDir:  /var/lib/docker/overlay2/def456/layers:/var/lib/docker/overlay2/ghi789/layers
```
- --pid=host: the container sees ALL host processes. Its PID 1 is the host's PID 1 (systemd).
- --net=host: the container shares the host's network stack. It can bind to any host port.
- --ipc=host: the container can access host shared memory segments. Potential data leak.
- Each flag removes one isolation layer. --privileged removes ALL of them.
cgroups: Resource Limits That Prevent Noisy Neighbors
While namespaces provide isolation (what a container can see), cgroups provide resource limits (how much a container can consume). Without cgroups, a container with a memory leak can consume all host RAM and trigger the OOM killer on unrelated containers.
cgroup v1 vs v2: Linux has two cgroup versions. cgroup v1 has separate hierarchies for each resource controller (cpu, memory, blkio, pids). cgroup v2 has a unified hierarchy. Docker supports both, but cgroup v2 is the default on newer Linux distributions (Ubuntu 22.04+, Fedora 31+, RHEL 9+).
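You can tell which version a host runs from the filesystem type mounted at /sys/fs/cgroup. A sketch (GNU coreutils `stat` assumed; on cgroup v2 the type is `cgroup2fs`, on v1 the mount point is typically a tmpfs holding per-controller hierarchies):

```shell
#!/bin/sh
# Detect the host's cgroup version from the /sys/fs/cgroup filesystem type.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null)
if [ "$fstype" = "cgroup2fs" ]; then
  cgver="v2 (unified hierarchy)"
else
  cgver="v1 (or hybrid)"   # separate hierarchies per controller
fi
echo "cgroup $cgver"
```

Knowing the version tells you which file names to read: cpu.cfs_quota_us and memory.limit_in_bytes on v1 versus cpu.max and memory.max on v2.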
CPU controller: Limits CPU usage in two ways:
- cpu.shares: relative weight. Default is 1024. A container with 2048 gets twice the CPU of a container with 1024 when there is contention. Does not limit absolute CPU usage.
- cpu.cfs_quota_us / cpu.cfs_period_us: absolute limit. --cpus=1.0 sets a quota of 100ms per 100ms period, limiting the container to one CPU core.
Memory controller: Limits memory usage:
- memory.limit_in_bytes: hard limit. If the container exceeds this, the kernel OOM-kills the process.
- memory.soft_limit_in_bytes: soft limit. The kernel tries to reclaim memory from the container before other containers, but does not kill it.
- memory.oom_control: controls whether the OOM killer is invoked or the container is frozen.
blkio controller: Limits block device I/O:
- blkio.throttle.read_bps_device: limits read throughput in bytes per second.
- blkio.throttle.write_bps_device: limits write throughput.
pids controller: Limits the number of processes:
- pids.max: maximum number of processes (including threads) the container can create. Prevents fork bombs.
The noisy neighbor problem: Without cgroup limits, one container can starve others. A container with a CPU-bound loop consumes 100% of all CPUs. A container with a memory leak consumes all host RAM, triggering the kernel OOM killer, which may kill unrelated containers. cgroup limits prevent this by enforcing per-container resource ceilings.
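The flag-to-cgroup mapping is plain arithmetic, which makes the values in /sys/fs/cgroup easy to sanity-check by hand. A sketch using the mappings described above:

```shell
#!/bin/sh
# --cpus=1.5 with the default 100ms period:
#   quota = 1.5 * 100000us = 150000us
period_us=100000
cpus_x10=15                                  # 1.5 CPUs, scaled to stay in integers
quota_us=$(( cpus_x10 * period_us / 10 ))
echo "cpu.cfs_quota_us=$quota_us cpu.cfs_period_us=$period_us"
# -> cpu.cfs_quota_us=150000 cpu.cfs_period_us=100000

# --memory=512m:
#   memory.limit_in_bytes = 512 * 1024 * 1024
mem_bytes=$(( 512 * 1024 * 1024 ))
echo "memory.limit_in_bytes=$mem_bytes"
# -> memory.limit_in_bytes=536870912
```

If the numbers you compute do not match what you read back from the cgroup files, the limit you think you set is not the limit the kernel is enforcing.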
```bash
#!/bin/bash
# Inspect and configure cgroup resource limits for containers

# -- Get a container's cgroup path -------------------------------------------
CONTAINER_ID=$(docker ps -q | head -1)
# cgroup v1 path:
ls /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/
# cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us
# cgroup v2 path:
ls /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/
# cpu.max, memory.max, pids.max

# -- Check CPU limits --------------------------------------------------------
# cgroup v1:
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.shares
# Default: 1024 (1 CPU share). Set with --cpu-shares
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_quota_us
# -1 means no limit. Set with --cpus=1.0 (becomes 100000)
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_period_us
# Default: 100000 (100ms)
# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# Format: quota period. Example: 100000 100000 (1 CPU limit)

# -- Check memory limits -----------------------------------------------------
# cgroup v1:
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.limit_in_bytes
# 9223372036854771712 means no limit (max int64, page-aligned)
# Set with --memory=512m (becomes 536870912)
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.usage_in_bytes
# Current memory usage in bytes
# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# max means no limit. Set with --memory=512m
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
# Current memory usage

# -- Check OOM events --------------------------------------------------------
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.oom_control
# oom_kill_disable: 0 (OOM killer enabled)
# under_oom: 0 (not currently under OOM pressure)
# Check kernel OOM log:
dmesg | grep -i 'oom\|killed process' | tail -10

# -- Check PID limits --------------------------------------------------------
# cgroup v1:
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.max
# max means no limit. Set with --pids-limit=256
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.current
# Current number of processes

# -- Run a container with resource limits ------------------------------------
docker run -d \
    --name resource-limited \
    --cpus=1.0 \
    --memory=512m \
    --memory-swap=512m \
    --pids-limit=256 \
    alpine:3.19 sleep 3600
# --memory-swap equal to --memory disables swap for the container

# Verify the limits:
docker inspect resource-limited --format '{{.HostConfig.NanoCpus}}'
# 1000000000 = 1 CPU (NanoCpus: billionths of a CPU, not nanoseconds)
docker inspect resource-limited --format '{{.HostConfig.Memory}}'
# 536870912 = 512MB (in bytes)
docker stats resource-limited --no-stream
# Shows: MEM USAGE / LIMIT, e.g. 43.2MiB / 512MiB
```
```
# CPU limits (shares, quota, period):
1024
100000
100000

# Memory limits:
536870912
45219840

# PID limits:
256
3

# docker stats:
NAME               CPU %   MEM USAGE / LIMIT   MEM %
resource-limited   0.00%   43.2MiB / 512MiB    8.44%
```
- Without a memory limit, a container can consume all host RAM.
- The kernel OOM killer then selects a process to kill; it may kill an unrelated container, not the leaking one.
- With --memory=512m, the kernel kills only the container that exceeded its limit.
- Without limits, the OOM killer uses a heuristic (oom_score) that may choose the wrong victim.
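That heuristic is visible per process: the kernel exposes the current badness score and its adjustable bias under /proc. A quick look, no Docker required:

```shell
#!/bin/sh
# oom_score:     kernel-computed badness; the highest-scoring process is killed first.
# oom_score_adj: bias in [-1000, 1000]; -1000 exempts the process entirely.
#                Docker exposes this as `docker run --oom-score-adj`.
score=$(cat /proc/self/oom_score)
adj=$(cat /proc/self/oom_score_adj)
echo "oom_score=$score oom_score_adj=$adj"   # adj is usually 0 for a normal shell
```

When an OOM kill surprises you, comparing oom_score across the host's container processes shows why the kernel picked the victim it did.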
Union Filesystem and overlay2: How Docker Images Work Without Copying
The union filesystem is the reason Docker images are lightweight and containers start in milliseconds. Instead of copying files, Docker overlays multiple read-only directories and presents them as a single merged filesystem.
overlay2 driver: The default storage driver in modern Docker. It stacks directories (layers) and presents a merged view. Each layer is a directory on the host filesystem. The bottom layers are read-only (image layers). The top layer is writable (container-specific changes).
How it works: When a container reads a file, overlay2 checks the top (writable) layer first. If the file exists there, it is returned. If not, overlay2 checks each lower layer in order until the file is found. When a container writes a file, the write goes to the top layer only; lower layers are never modified. When a container deletes a file, overlay2 places a whiteout (a character device with major/minor 0/0 bearing the deleted file's name) in the top layer to mask the lower layer's copy; in image layer tarballs the same deletion is recorded as a .wh.-prefixed entry.
The four directories:
- lowerdir: colon-separated list of read-only image layers (bottom to top)
- upperdir: the writable layer (container-specific changes)
- workdir: overlay2 internal working directory (must be empty, used for atomic operations)
- merged: the combined view that the container sees as its root filesystem
Performance implications: Read performance is slightly slower than native because overlay2 must check multiple layers. Write performance is native (writes go directly to the upperdir on the host filesystem). The performance difference is negligible for most workloads but can matter for I/O-intensive applications (databases, search engines).
The copy-up problem: When a container modifies a file from a lower layer, overlay2 must first copy the entire file to the upperdir (copy-up), then modify the copy. For large files (multi-GB database files), copy-up can cause a noticeable delay on first write. This is why databases should use volumes (bind mounts) instead of the container's overlay2 filesystem.
```bash
#!/bin/bash
# Inspect the overlay2 filesystem for a running container

# -- Get the overlay2 paths for a container ----------------------------------
CONTAINER_ID=$(docker ps -q | head -1)
GRAPH_DATA=$(docker inspect --format '{{json .GraphDriver.Data}}' $CONTAINER_ID)
echo $GRAPH_DATA | python3 -m json.tool

# Extract individual paths:
MERGED_DIR=$(echo $GRAPH_DATA | python3 -c "import sys,json; print(json.load(sys.stdin)['MergedDir'])")
UPPER_DIR=$(echo $GRAPH_DATA | python3 -c "import sys,json; print(json.load(sys.stdin)['UpperDir'])")
LOWER_DIR=$(echo $GRAPH_DATA | python3 -c "import sys,json; print(json.load(sys.stdin)['LowerDir'])")
WORK_DIR=$(echo $GRAPH_DATA | python3 -c "import sys,json; print(json.load(sys.stdin)['WorkDir'])")
echo "Merged (container sees this as /): $MERGED_DIR"
echo "Upper (writable layer): $UPPER_DIR"
echo "Lower (read-only layers): $LOWER_DIR"
echo "Work (overlay2 internal): $WORK_DIR"

# -- Inspect the writable layer (upperdir) -----------------------------------
ls -la $UPPER_DIR/
# Shows files the container has created or modified
# Deleted files are masked by whiteouts (see below)

# -- Inspect the merged view -------------------------------------------------
ls -la $MERGED_DIR/
# This is what the container sees as its root filesystem
# It is the combination of all lower layers + the upper layer

# -- Demonstrate the copy-up behavior ----------------------------------------
# Create a file in the container
docker exec $CONTAINER_ID sh -c 'echo "hello" > /tmp/test-file'
# The file appears in the writable layer (upperdir):
ls -la $UPPER_DIR/tmp/test-file
# The file is in the upper layer, not in any lower layer

# -- Demonstrate the whiteout behavior ---------------------------------------
# Delete a file that exists in a lower layer
docker exec $CONTAINER_ID rm /etc/hostname
# A whiteout appears in the upper layer: a character device (0/0) with the
# deleted file's name (image tarballs record the same deletion as .wh.hostname):
ls -la $UPPER_DIR/etc/
# c--------- 1 root root 0, 0 ... hostname
# This character device tells overlay2 to hide the lower layer's file

# -- Check the number of layers in an image ----------------------------------
docker inspect <image> --format '{{len .RootFS.Layers}} layers'
# Each layer is a directory under /var/lib/docker/overlay2/

# -- Check disk usage per layer ----------------------------------------------
du -sh /var/lib/docker/overlay2/* | sort -hr | head -10
# Shows disk usage for each layer (shared layers are counted once)

# -- Compare overlay2 with native filesystem performance ---------------------
# Write performance test:
time docker exec $CONTAINER_ID dd if=/dev/zero of=/tmp/test bs=1M count=100
# Overlay2 write: ~0.3s (writes to upperdir on host filesystem)
# Read performance test:
time docker exec $CONTAINER_ID dd if=/tmp/test of=/dev/null bs=1M
# Overlay2 read: ~0.1s (slightly slower than native due to layer lookup)
```
```
{
    "LowerDir": "/var/lib/docker/overlay2/def456/layers:/var/lib/docker/overlay2/ghi789/layers",
    "MergedDir": "/var/lib/docker/overlay2/abc123/merged",
    "UpperDir": "/var/lib/docker/overlay2/abc123/diff",
    "WorkDir": "/var/lib/docker/overlay2/abc123/work"
}

# Writable layer contents after the copy-up and whiteout demos:
drwxr-xr-x 4 root root 4096 Jan 15 10:25 tmp
drwxr-xr-x 2 root root 4096 Jan 15 10:25 etc
c--------- 1 root root 0, 0 Jan 15 10:25 etc/hostname
# The 0/0 character device etc/hostname hides /etc/hostname from the lower layers
```
- Each layer is additive: deleting a file in layer N+1 does not remove it from layer N.
- The delete creates a whiteout marker in layer N+1, but the data still exists in layer N.
- The only way to truly remove data is to not include it in any layer (use multi-stage builds or .dockerignore).
- This is why RUN apt-get install ... && rm -rf /var/lib/apt/lists/* must be in the same RUN; separate RUNs create separate layers.
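The lookup-and-whiteout rules above can be simulated with plain directories, no mounts and no root. This is a sketch of the algorithm, not real overlayfs; it borrows the image-tarball `.wh.` prefix as its deletion marker, whereas real overlayfs marks deletions on disk with a 0/0 character device:

```shell
#!/bin/sh
# Pure-directory simulation of overlay2 lookup: upper layer wins, a whiteout
# marker masks lower layers, otherwise fall through to the lower layer.
set -eu
tmp=$(mktemp -d); trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/lower" "$tmp/upper"

echo "from-lower" > "$tmp/lower/app.conf"
echo "from-lower" > "$tmp/lower/hostname"
echo "from-upper" > "$tmp/upper/app.conf"    # modified file: copy-up lives in upper
: > "$tmp/upper/.wh.hostname"                # deletion marker for hostname

lookup() {  # lookup <name>: print merged-view contents or ENOENT
  if   [ -e "$tmp/upper/.wh.$1" ]; then echo "ENOENT"
  elif [ -e "$tmp/upper/$1" ];    then cat  "$tmp/upper/$1"
  elif [ -e "$tmp/lower/$1" ];    then cat  "$tmp/lower/$1"
  else echo "ENOENT"; fi
}

lookup app.conf    # -> from-upper  (upper layer shadows lower)
lookup hostname    # -> ENOENT     (whiteout hides the lower file)
lookup os-release  # -> ENOENT     (present in no layer)
```

The same three-way decision (whiteout, upper, lower) is what overlay2 performs on every path lookup in the merged directory.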
The Container Lifecycle: From Clone to Exit, Every Kernel Call
When you run docker run, a precise sequence of kernel calls creates the container. Understanding this sequence is the key to debugging startup failures, permission errors, and namespace issues.
Step 1: Image pull and unpack. The Docker daemon pulls the image layers from the registry and unpacks them into /var/lib/docker/overlay2/. Each layer is a directory. If the layers already exist locally (cached), this step is skipped.
Step 2: Create the OCI runtime spec. containerd generates a config.json file, the OCI runtime specification. This JSON file defines:
- The namespaces to create (PID, network, mount, user, UTS, IPC)
- The cgroup limits (CPU, memory, pids)
- The root filesystem path (the overlay2 merge directory)
- The environment variables, working directory, and command to execute
- The mount points (volumes, /proc, /sys, /dev)
Step 3: runc creates the container. runc reads config.json and executes the following kernel calls:
- clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC) creates a new process with new namespaces
- sethostname() sets the container's hostname (UTS namespace)
- mount() mounts /proc, /sys, /dev inside the container's mount namespace
- pivot_root() changes the container's root directory to the overlay2 merge directory
- chdir("/") moves to the new root
- setuid() / setgid() drop privileges to the container's user (if non-root)
- execve() replaces the runc process with the container's entrypoint command
Step 4: runc exits, containerd monitors. After execve(), the runc child has been replaced by the container's entrypoint process, and the remaining runc process exits. containerd (via containerd-shim) monitors the container process, captures stdout/stderr, and handles signals.
Step 5: The container process runs. The application process is now running inside a set of namespaces with cgroup limits and an overlay2 filesystem. It has PID 1 inside the container's PID namespace. On the host, it has a real PID visible in ps aux.
The pause process: In Kubernetes pods, a 'pause' process holds the pod's shared namespaces open. If an application container exits, the pause process keeps the namespaces alive (for restart). You can see pause processes on a Kubernetes node: ps aux | grep pause. Plain Docker containers have no pause process; containerd-shim fills the keep-alive role.
```bash
#!/bin/bash
# Trace the complete container lifecycle from clone to exec

# -- Step 1: Pull and inspect image layers -----------------------------------
docker pull alpine:3.19
# Inspect the image layers:
docker inspect alpine:3.19 --format '{{json .RootFS.Layers}}' | python3 -m json.tool
# Each entry is a layer (SHA256 digest)
# Find the layers on disk:
ls /var/lib/docker/overlay2/ | head -5
# Each directory is a layer. Shared layers are hard-linked or reflinked.

# -- Step 2: Create a container and inspect the OCI spec ---------------------
# Create a container without starting it:
docker create --name lifecycle-demo alpine:3.19 echo 'hello'
# Find the OCI runtime spec:
find /run/containerd -name config.json -path '*lifecycle-demo*' 2>/dev/null
# This file is the OCI runtime spec that runc reads
# Inspect the spec (if found):
cat /run/containerd/io.containerd.runtime.v2.task/default/lifecycle-demo/config.json | python3 -m json.tool | head -50
# Shows: namespaces, mounts, cgroups, process config, root filesystem

# -- Step 3: Trace runc's kernel calls ---------------------------------------
# Start a container with strace to see the kernel calls:
sudo strace -f -e trace=clone,clone3,unshare,sethostname,mount,pivot_root,setuid,setgid,execve \
    -o /tmp/runc-trace.log \
    runc run --bundle /path/to/bundle test-trace
# The trace shows:
#   clone3({flags=CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|..., ...}) = 12345
#   sethostname("container-id", 12) = 0
#   mount("proc", "/proc", "proc", ...) = 0
#   mount("sysfs", "/sys", "sysfs", ...) = 0
#   pivot_root(".", "/old-root") = 0
#   setuid(1000) = 0
#   setgid(1000) = 0
#   execve("/bin/sh", ["sh"], ...) = 0

# -- Step 4: Find the container process on the host --------------------------
docker start lifecycle-demo
# Find the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' lifecycle-demo)
echo "Container process PID: $CONTAINER_PID"
# Find the pause process (present when the node runs Kubernetes pods):
ps aux | grep pause | grep -v grep
# root 5678 0.0 0.0 1024 4 ? Ss 10:23 0:00 /pause
# The pause process and the pod's containers share the same namespaces:
ls -la /proc/$CONTAINER_PID/ns/net
ls -la /proc/$(pgrep -f '/pause' | head -1)/ns/net
# Both point to the same namespace inode

# -- Step 5: Watch the container process on the host -------------------------
ps aux | grep $CONTAINER_PID
# root 5679 0.0 0.1 ... echo hello
# This is the REAL process on the host, running inside namespaces

# -- Cleanup -----------------------------------------------------------------
docker rm -f lifecycle-demo
```
```
[
  "sha256:abc123def456..."
]

# Container process on host:
Container process PID: 5679

# Pause process:
root 5678 0.0 0.0 1024 4 ? Ss 10:23 0:00 /pause

# Host process:
root 5679 0.0 0.1 4520 1820 ? Ss 10:23 0:00 echo hello
```
- runc's job is to create the container, not to manage it. After execve(), runc is replaced by the application process.
- containerd (via containerd-shim) monitors the container process, captures output, and handles signals.
- On Kubernetes nodes, the pause process holds the pod's shared namespaces open so they survive application restarts.
- This separation allows containerd to manage the lifecycle without being PID 1 in the container.
| Namespace | Flag | Isolates | Docker Default | Host Flag to Disable |
|---|---|---|---|---|
| PID | CLONE_NEWPID | Process ID tree | Enabled | --pid=host |
| Network | CLONE_NEWNET | Network stack (interfaces, routes, iptables) | Enabled | --net=host |
| Mount | CLONE_NEWNS | Filesystem mount points | Enabled | --volume /:/host (partial) |
| User | CLONE_NEWUSER | UID/GID mapping | Disabled (opt-in) | N/A (disabled by default) |
| UTS | CLONE_NEWUTS | Hostname and domain name | Enabled | --uts=host |
| IPC | CLONE_NEWIPC | System V IPC and POSIX message queues | Enabled | --ipc=host |
| Cgroup | CLONE_NEWCGROUP | cgroup root directory view | Enabled (cgroup v2) | --cgroupns=host |
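Each row in the table corresponds to a file under /proc/&lt;pid&gt;/ns/. A quick, unprivileged way to see which namespaces a process belongs to (a sketch that works on any modern Linux host; note the mount namespace file is named `mnt`):

```shell
# Print the namespace identity of the current process. Each symlink target
# has the form "<type>:[<inode>]"; the inode uniquely identifies the namespace.
for ns in pid net mnt uts ipc user cgroup; do
    printf '%-7s %s\n' "$ns" "$(readlink "/proc/self/ns/$ns")"
done
```

Running the same loop against a container's host PID (from `docker inspect --format '{{.State.Pid}}'`) shows different inodes for every namespace Docker enables.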
🎯 Key Takeaways
- Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls.
- Namespaces isolate what a container can see (PID tree, network, filesystem, hostname, IPC). cgroups limit what a container can consume (CPU, memory, I/O, processes).
- Every container is a real Linux process visible in ps aux on the host. The kernel does not know what a 'container' is: it only knows processes, namespaces, and cgroups.
- overlay2 stacks read-only image layers with a writable top layer. No data is copied on container creation. Writes go to the top layer. Deletes create whiteout files.
- The pause process holds namespaces open so they survive application restarts. runc exits after creating the container. containerd monitors the process.
- The Docker socket (/var/run/docker.sock) is equivalent to root access on the host. Never mount it into containers without a socket proxy.
❌ Common Mistakes to Avoid
- ❌ Mistake 1: Assuming containers are VMs → Symptom: treating containers as isolated machines with their own kernel, expecting kernel-level isolation, and being surprised when a kernel CVE affects all containers → Fix: understand that containers are Linux processes with namespaces and cgroups. They share the host kernel. A kernel vulnerability affects all containers on the host.
- ❌ Mistake 2: Using --pid=host or --net=host in production → Symptom: all host processes visible inside the container, or container sharing the host network stack with no isolation → Fix: these flags are debugging tools. Never use them in production unless on a dedicated host. Use nsenter from the host for debugging instead.
- ❌ Mistake 3: Not setting memory limits (--memory) → Symptom: one container's memory leak consumes all host RAM, triggering the OOM killer on unrelated containers → Fix: set --memory on every production container. Without limits, the OOM killer's victim selection heuristic may kill the wrong container.
- ❌ Mistake 4: Writing database data to the overlay2 filesystem → Symptom: slow first writes due to copy-up, data loss on container removal → Fix: use named volumes or bind mounts for database data. The overlay2 writable layer is deleted when the container is removed.
- ❌ Mistake 5: Deleting files in a Dockerfile layer to reduce image size → Symptom: image size does not decrease because the deleted files persist in the previous layer → Fix: chain download and cleanup in the same RUN with &&. Use multi-stage builds to exclude build tools from the final image.
- ❌ Mistake 6: Not understanding that the Docker socket is root access → Symptom: mounting /var/run/docker.sock into a container for convenience, giving the container full control over the Docker daemon → Fix: the Docker socket is equivalent to root access on the host. Use a socket proxy that restricts API access, or avoid mounting it entirely.
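Mistake 5 in Dockerfile form. This is a hypothetical example (the tool name and URL are illustrative): the first variant commits the downloaded archive in an earlier layer, so deleting it later only adds a whiteout; the second removes it within the same RUN, so the archive never reaches any layer.

```dockerfile
# BAD: the archive is committed in the first RUN layer; the later
# deletion only adds a whiteout file, so the image does not shrink.
RUN curl -fsSL -o /tmp/tool.tar.gz https://example.com/tool.tar.gz
RUN tar -xzf /tmp/tool.tar.gz -C /usr/local
RUN rm /tmp/tool.tar.gz

# GOOD: download, extract, and clean up in a single RUN, producing
# one layer that never contains the archive.
RUN curl -fsSL -o /tmp/tool.tar.gz https://example.com/tool.tar.gz \
    && tar -xzf /tmp/tool.tar.gz -C /usr/local \
    && rm /tmp/tool.tar.gz
```

You can verify the difference with `docker history <image>`, which shows the size each instruction contributed.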
Interview Questions on This Topic
- Q: Walk me through what happens at the kernel level when you run 'docker run alpine echo hello'. What syscalls does runc make?
- Q: Explain the difference between namespaces and cgroups. What does each one isolate or limit?
- Q: How does the overlay2 union filesystem work? What happens when a container reads a file, writes a new file, modifies an existing file, and deletes a file?
- Q: What is the pause process in Docker? Why does every container have one?
- Q: Your container is OOM-killed but the host has plenty of free memory. How do you diagnose this using cgroup files in /sys/fs/cgroup?
- Q: Explain how a veth pair connects a container to the Docker bridge network. What kernel mechanisms are involved?
- Q: What is the OCI runtime spec? How does it enable runtime replaceability (swapping runc for gVisor or Kata)?
Frequently Asked Questions
Is a Docker container a virtual machine?
No. A container is a Linux process running inside namespaces and cgroups. It shares the host kernel. A VM runs a full guest operating system with its own kernel on top of a hypervisor. Containers start in milliseconds and use megabytes of memory. VMs take minutes to start and use gigabytes. The security trade-off: containers share the host kernel (a kernel CVE affects all containers), while VMs have a separate kernel per instance.
What is the difference between a namespace and a cgroup?
Namespaces isolate what a process can see. The PID namespace hides other processes. The network namespace gives the process its own network stack. The mount namespace gives it its own filesystem view. cgroups limit what a process can consume. The memory cgroup limits RAM usage. The CPU cgroup limits CPU time. The pids cgroup limits process count. Together, they provide isolation (namespaces) and resource control (cgroups).
Can I see a container's process on the host?
Yes. Every container is a real Linux process. Run docker inspect --format '{{.State.Pid}}' <container> to get the host PID. Then run ps aux | grep <pid> to see the process. The process runs inside namespaces, so it has a restricted view of the system, but it is a real process with a real PID on the host.
What happens when a container exceeds its memory limit?
The kernel's OOM killer terminates the container process with SIGKILL (exit code 137). The cgroup memory controller enforces the limit set by --memory. Without a limit, the container can consume all host RAM, and the OOM killer may kill an unrelated process based on its oom_score heuristic. Always set --memory in production.
What is the OCI runtime spec?
The OCI (Open Container Initiative) runtime spec is a JSON file (config.json) that describes how to create a container. It defines the namespaces, cgroups, mounts, environment, and command. runc reads this spec and creates the container by calling kernel syscalls. Any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can read the same spec and create the container.
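A heavily trimmed, illustrative excerpt of what such a config.json can look like (the values here are hypothetical, and the full spec defines many more sections, such as mounts, hooks, and seccomp):

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["echo", "hello"],
    "cwd": "/",
    "user": { "uid": 0, "gid": 0 }
  },
  "root": { "path": "rootfs", "readonly": false },
  "hostname": "lifecycle-demo",
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" },
      { "type": "uts" },
      { "type": "ipc" }
    ],
    "resources": {
      "memory": { "limit": 268435456 }
    }
  }
}
```

The runtime reads this file from the bundle directory, so swapping runc for crun or runsc requires no change to the spec itself.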