Docker Internals: PID Namespace Bug Kills Containers
A --pid=host debugging flag left in production caused container SIGTERM kills every 5-10 minutes.
20+ years shipping production infrastructure and CI/CD at scale. Everything here is grounded in real deployments.
- ✓Production DevOps experience
- ✓Deep understanding of the tool's internals
- ✓Experience debugging distributed systems
- Docker CLI sends API requests to the Docker daemon (dockerd)
- dockerd delegates container lifecycle to containerd
- runc reads the OCI runtime spec and configures namespaces, cgroups, and filesystem
- Namespaces: isolate PID, network, mount, user, UTS, and IPC views
- cgroups: limit CPU, memory, I/O, and process count per container
- Union filesystem (overlay2): stack read-only image layers with a writable top layer
- seccomp: filter syscalls at the kernel level
Docker is a containerization platform that packages applications and their dependencies into isolated environments called containers. Unlike virtual machines, which emulate entire operating systems with separate kernels, Docker containers share the host's Linux kernel while using kernel features like namespaces and cgroups to create lightweight, portable execution environments.
This design allows containers to start in milliseconds and consume minimal overhead compared to VMs, making them ideal for microservices, CI/CD pipelines, and cloud-native deployments. Docker's ecosystem includes tools like Docker Compose for multi-container applications, Docker Swarm for orchestration, and integration with Kubernetes for production-scale management.
At its core, Docker leverages Linux namespaces to provide process isolation — each container gets its own PID, network, mount, and other namespaces, making processes inside appear to have their own system. cgroups (control groups) enforce resource limits on CPU, memory, and I/O to prevent any single container from starving others. The union filesystem (typically overlay2) enables Docker images to be built from read-only layers, sharing common base layers across containers without duplication.
This layered approach means pulling a new image often only downloads the delta, drastically reducing bandwidth and storage.
However, Docker's isolation is not a security boundary. PID namespace isolation, for example, prevents a container from seeing host processes but doesn't protect against kernel exploits — a compromised container can escape to the host. For production security, you need additional measures like seccomp profiles, AppArmor/SELinux, user namespaces, and running containers as non-root.
Docker is not a replacement for VMs in multi-tenant environments where strong isolation is required; for that, consider gVisor, Kata Containers, or Firecracker microVMs. Understanding these internals — from the Docker CLI's REST API calls to containerd's shim processes and runc's clone() syscalls — is essential for debugging container failures and building secure, efficient deployments.
Think of Docker as a building contractor. The Docker daemon is the project manager — it takes your blueprints (Dockerfile), coordinates the workers, and hands off the actual construction. containerd is the foreman who manages the construction site. runc is the worker who actually lays the foundation (namespaces), installs the walls (cgroups), and puts on the roof (union filesystem). The Linux kernel is the land itself — it provides the raw materials (system calls, filesystem, networking) that everything else is built on top of. None of them work alone; they are a chain of specialized components.
Most Docker tutorials stop at 'docker run' and never explain what happens inside the kernel. This creates a dangerous gap — when containers misbehave, engineers without kernel-level understanding cannot diagnose the root cause. They restart containers, rebuild images, and escalate to platform teams for problems that a single /proc inspection would have solved.
Docker is a stack of components: the CLI, the daemon (dockerd), containerd, runc, and the Linux kernel. Each layer has a specific responsibility. The daemon manages images and the API. containerd manages container lifecycle. runc creates containers by configuring kernel primitives — namespaces for isolation, cgroups for resource limits, and overlay2 for the filesystem. The kernel does the actual work.
Understanding this stack is essential for production debugging. When a container cannot resolve DNS, the answer is in the network namespace. When a container is OOM-killed, the answer is in the cgroup memory controller. When a container starts slowly, the answer is in the overlay2 filesystem or image pull. Every container problem has a kernel-level root cause.
Why PID Namespace Isolation Is Not a Security Boundary
Docker uses Linux namespaces to give each container its own view of system resources. The PID namespace is the mechanism that makes processes inside a container see only their own process tree, starting at PID 1. But this is purely a visibility filter — it does not limit what a container can do to processes on the host if other capabilities or mounts are misconfigured.
PID namespaces nest hierarchically. A container's PID 1 is a real process on the host with a different PID, and the kernel translates between namespaces. The critical property: a process that holds CAP_SYS_ADMIN and can open a host namespace file such as /proc/<pid>/ns/pid (for example through a bind-mounted host /proc) can call setns() on it and join the host PID namespace. Container breakouts are not theoretical either: CVE-2019-5736, disclosed in 2019, let a malicious container overwrite the host's runc binary through /proc/self/exe, and PID namespace isolation did nothing to stop it.
Use PID namespaces for process isolation, not security. They prevent accidental signal delivery between containers and keep 'ps' output clean. But never rely on them to contain a malicious process. Always pair with seccomp profiles, AppArmor, and user namespaces. In production, drop CAP_SYS_ADMIN from all containers unless absolutely required.
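As a rough way to verify the advice above, the effective capability mask of a container's main process can be read on the host from the CapEff line of /proc/<pid>/status and tested for CAP_SYS_ADMIN, which is capability number 21. The helper name has_cap_sys_admin is made up for this sketch, and the two sample masks are values commonly seen for a default Docker container and for a fully privileged process; treat them as illustrative rather than as output from a real host.

```shell
# Sketch: test a hex capability mask (the CapEff value from /proc/<pid>/status)
# for CAP_SYS_ADMIN. has_cap_sys_admin is a hypothetical helper name.

CAP_SYS_ADMIN_BIT=21   # capability number from linux/capability.h

has_cap_sys_admin() {
  mask_hex="$1"
  # Shift the mask right by 21 bits and test the lowest remaining bit
  [ $(( (0x$mask_hex >> $CAP_SYS_ADMIN_BIT) & 1 )) -eq 1 ]
}

# Illustrative masks: typical default container, then fully privileged
for mask in 00000000a80425fb 000001ffffffffff; do
  if has_cap_sys_admin "$mask"; then
    echo "$mask: CAP_SYS_ADMIN present"
  else
    echo "$mask: CAP_SYS_ADMIN absent"
  fi
done
```

On a live host the mask would come from something like grep CapEff /proc/$(docker inspect --format '{{.State.Pid}}' <container>)/status.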
[...] setns() to join the host PID namespace, then spawned a reverse shell visible only on the host. The symptom: no container logs, but host 'ps' showed an unknown bash process. Rule: never mount /proc from the host into a container, and drop CAP_SYS_ADMIN unconditionally.
[...] setns() if it can access host /proc.
The Docker Stack: From CLI to Kernel, Every Component Explained
Docker is not a single program. It is a stack of components, each with a specific responsibility. Understanding this stack is the foundation for debugging any container issue.
Docker CLI (docker): The command-line interface. It sends HTTP API requests to the Docker daemon. The CLI does not create containers — it is a client that talks to the server. You can replace it with curl, Postman, or any HTTP client.
Docker daemon (dockerd): The server that manages images, networks, volumes, and the container API. It listens on a Unix socket (/var/run/docker.sock) or a TCP port. The daemon does not create containers directly — it delegates to containerd.
containerd: A container runtime that manages the complete container lifecycle — pulling images, creating containers, managing snapshots, and handling container execution. containerd was originally part of Docker but was extracted as a standalone project. It is now used by Docker, Kubernetes (via CRI), and other orchestration platforms.
runc: A lightweight container runtime that creates containers using Linux kernel primitives. runc reads an OCI (Open Container Initiative) runtime specification, a JSON file that describes the container's namespaces, cgroups, mounts, and environment. runc calls clone() to create a new process, configures namespaces and cgroups, calls pivot_root() to switch the root filesystem, and finally calls execve() to start the application. runc exits after creating the container; it does not manage the container's lifecycle.
The OCI spec: The Open Container Initiative defines two standards: the image spec (how images are packaged) and the runtime spec (how containers are created). runc implements the runtime spec. This standardization means any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can run OCI-compliant images.
The flow: docker run -> dockerd API -> containerd creates container spec -> runc reads OCI spec -> runc calls clone() with namespaces -> runc configures cgroups -> runc pivot_root to overlay2 filesystem -> runc exec the application process -> runc exits -> containerd monitors the container process.
#!/bin/bash
# Inspect every layer of the Docker stack

# ── Docker CLI -> Daemon communication ──
# The CLI sends HTTP requests to the daemon. You can do this manually:
curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m json.tool
# Shows: Docker version, API version, Go version, OS, architecture

curl --unix-socket /var/run/docker.sock http://localhost/containers/json | python3 -m json.tool
# Shows: all running containers (same as docker ps)

# ── Check if containerd is running ──
systemctl status containerd
# containerd is the container runtime daemon
# It manages container lifecycle independently of dockerd

# ── Find the runc binary ──
which runc
# Typically: /usr/bin/runc or /usr/local/bin/runc
runc --version
# Shows: runc version, commit, spec version (OCI 1.0.2)

# ── Inspect the OCI runtime spec for a running container ──
# containerd stores the OCI spec for each running container
CONTAINER_ID=$(docker ps -q | head -1)
# Find the container's bundle directory (contains config.json).
# Docker normally uses the "moby" containerd namespace, so the exact path varies by setup.
find /run/containerd /run/docker -name config.json -path "*${CONTAINER_ID}*" 2>/dev/null | head -1
# This file is the OCI runtime spec: it defines namespaces, cgroups, mounts

# ── Trace the container creation flow ──
# Run from an OCI bundle directory (config.json plus rootfs/) and watch the kernel calls
sudo strace -f -e trace=clone,clone3,unshare,pivot_root,chroot,execve \
  -o /tmp/container-trace.log \
  runc run test-container
# The trace shows:
# clone(CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|...) = <child-pid>
# pivot_root(".", "/old-root") = 0
# execve("/app/server", ["server"], ...) = 0

# ── Check the daemon socket ──
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker /var/run/docker.sock
# The socket is owned by root:docker group
# Any process in the docker group can control ALL containers

# ── Check the daemon process tree ──
pstree -p $(pidof dockerd)
# When dockerd launches its own containerd, the tree looks roughly like:
# dockerd ─┬─ containerd
#          └─ docker-proxy (for published ports)
# Each container process is the child of a containerd-shim-runc-v2 process:
# containerd-shim-runc-v2 ─── <app-pid>
# If containerd runs as its own systemd service, it does not appear under dockerd.
# Plain Docker containers have no pause process; that belongs to Kubernetes pods.
- Separation of concerns: the daemon manages the API and images, containerd manages lifecycle, runc creates containers.
- Replaceability: you can swap runc for crun (faster), kata-runtime (VM isolation), or runsc (gVisor) without changing Docker.
- Standardization: the OCI spec ensures any compliant runtime can run any compliant image.
- Kubernetes reuses containerd directly — it does not need dockerd. This is why containerd was extracted.
Linux Namespaces: The Isolation Mechanism Behind Every Container
Namespaces are the Linux kernel feature that provides process isolation. Each namespace gives a process its own view of a system resource. A container is a regular Linux process that runs inside a set of namespaces — it sees its own PID tree, its own network stack, its own filesystem mount points, and its own hostname, even though it shares the host kernel.
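Linux currently defines eight namespace types; the seven below matter for containers (the eighth, the time namespace, is not used by Docker). By default Docker creates new PID, network, mount, UTS, and IPC namespaces for every container, adds a private cgroup namespace on cgroup v2 hosts, and uses a user namespace only when userns-remap or rootless mode is enabled:
PID namespace (CLONE_NEWPID): Each container has its own PID tree. The first process inside the container is PID 1. Processes inside the container cannot see processes outside the container. On the host, the container process has a real PID, which you can see with ps aux. PID namespaces are hierarchical: a parent namespace (the host) can see every process in its child namespaces, but a child namespace cannot see processes in its parent or in sibling namespaces.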
Network namespace (CLONE_NEWNET): Each container gets its own network stack — its own interfaces, routing table, firewall rules, and /proc/net. When Docker creates a container, it creates a veth (virtual Ethernet) pair — one end inside the container's network namespace, one end connected to the Docker bridge. This is how containers communicate with each other and the outside world.
Mount namespace (CLONE_NEWNS): Each container has its own mount table. The container's root filesystem is a union mount (overlay2) that layers the image's read-only layers with a writable top layer. The container cannot see the host's filesystem unless explicitly mounted. pivot_root changes the container's root directory to the overlay2 merge directory.
User namespace (CLONE_NEWUSER): Maps container UIDs to different host UIDs. Container UID 0 (root) can be mapped to host UID 100000 (unprivileged). This means even a container escape results in an unprivileged host user. User namespace remapping is not enabled by default in Docker because it breaks some workflows (volume permissions, Docker-in-Docker).
UTS namespace (CLONE_NEWUTS): Each container has its own hostname. The hostname is set during container creation and can be changed inside the container without affecting the host or other containers.
IPC namespace (CLONE_NEWIPC): Each container has its own System V IPC and POSIX message queues. Processes in different containers cannot share shared memory segments or message queues.
Cgroup namespace (CLONE_NEWCGROUP): Virtualizes the /proc/self/cgroup view. The container sees its own cgroup path as '/' instead of the real path (/docker/<container-id>). This prevents the container from seeing or manipulating other containers' cgroups.
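The user namespace mapping described above is plain arithmetic: a subordinate ID entry of the form name:start:count maps container UID n to host UID start + n, as long as n is below count. The sketch below works through that calculation. map_container_uid is a hypothetical helper, and dockremap:100000:65536 is the example value used in this article, not something read from a real host.

```shell
# Sketch: translate a container UID to the host UID implied by a subuid entry.
# map_container_uid is a hypothetical helper name.
map_container_uid() {
  entry="$1"            # e.g. dockremap:100000:65536
  container_uid="$2"
  start=$(echo "$entry" | cut -d: -f2)
  count=$(echo "$entry" | cut -d: -f3)
  if [ "$container_uid" -ge "$count" ]; then
    echo "unmapped"     # UID falls outside the delegated range
    return 1
  fi
  echo $(( start + container_uid ))
}

map_container_uid "dockremap:100000:65536" 0              # container root
map_container_uid "dockremap:100000:65536" 1000           # an ordinary container user
map_container_uid "dockremap:100000:65536" 70000 || true  # outside the range
```

The first call shows why remapping limits the damage of an escape: container root lands on an unprivileged host UID.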
#!/bin/bash
# Inspect and compare namespaces for containers and the host

# ── Get a container's host PID ──
CONTAINER_NAME="my-api"
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' "$CONTAINER_NAME")
echo "Container $CONTAINER_NAME is PID $CONTAINER_PID on the host"

# ── List all namespaces for the container process ──
sudo ls -la /proc/$CONTAINER_PID/ns/
# Output:
# lrwxrwxrwx ... ipc -> 'ipc:[4026532XXX]'
# lrwxrwxrwx ... mnt -> 'mnt:[4026532XXX]'
# lrwxrwxrwx ... net -> 'net:[4026532XXX]'
# lrwxrwxrwx ... pid -> 'pid:[4026532XXX]'
# lrwxrwxrwx ... user -> 'user:[4026531XXX]'
# lrwxrwxrwx ... uts -> 'uts:[4026532XXX]'

# ── Compare with host namespaces ──
sudo ls -la /proc/1/ns/
# The host PID 1 (systemd) has different namespace IDs than the container
# If namespace IDs match, the container shares that namespace with the host

# ── PID namespace: container sees its own PID tree ──
docker exec "$CONTAINER_NAME" ps aux
# PID 1 is the container's entrypoint process
# The container cannot see host processes
# On the host, the same process has a different PID:
ps -o pid,cmd -p "$CONTAINER_PID"
# The host sees the real PID, the container sees PID 1

# ── Network namespace: container has its own network stack ──
docker exec "$CONTAINER_NAME" ip addr show
# Shows: lo (loopback) and eth0 (veth pair inside the container)
# On the host, inspect the veth pair:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0
# One end is in the container's net namespace, one end is on the docker0 bridge

# ── Enter a container's network namespace from the host ──
sudo nsenter --net --target "$CONTAINER_PID" ip addr show
# Shows the same network config as docker exec, but from the host
# Useful for debugging without a shell inside the container

# ── Mount namespace: inspect the overlay2 filesystem ──
docker inspect --format '{{.GraphDriver.Data}}' "$CONTAINER_NAME"
# Shows: MergedDir, UpperDir, LowerDir, WorkDir
# MergedDir is what the container sees as /
# UpperDir is the writable layer (container-specific changes)
# LowerDir is the read-only image layers (colon-separated)

# ── User namespace: check if remapping is enabled ──
cat /etc/subuid
# If userns-remap is enabled: dockremap:100000:65536
# This maps container UID 0 to host UID 100000

# ── UTS namespace: container has its own hostname ──
docker exec "$CONTAINER_NAME" hostname
# Shows the container's hostname (usually the container ID)
hostname
# Shows the host's hostname, which differs from the container's

# ── IPC namespace: container has its own IPC resources ──
docker exec "$CONTAINER_NAME" ipcs
# Shows only IPC resources created inside the container
ipcs
# Shows host IPC resources, which are not visible inside the container
- --pid=host: the container sees ALL host processes. It gets no PID namespace of its own, so PID 1 inside it is the host's PID 1 (systemd) and the entrypoint runs under an ordinary host PID.
- --net=host: the container shares the host's network stack. It can bind to any host port.
- --ipc=host: the container can access host shared memory segments. Potential data leak.
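- Each flag removes one isolation layer. --privileged does not remove the namespaces, but it grants every capability, exposes host devices, and turns off seccomp and AppArmor confinement, so the remaining layers are trivial to bypass.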
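To catch a leftover debugging flag like the --pid=host from the incident in the title, the isolation-related settings of running containers can be dumped with docker inspect and filtered. The sketch below keeps the filtering in a function that reads pipe-delimited lines, so it can be tried on sample data without a Docker host. flag_shared_namespaces and the sample container names are invented for illustration.

```shell
# Sketch: flag containers that share host namespaces or run privileged.
# Input lines: name|PidMode|NetworkMode|IpcMode|Privileged
# flag_shared_namespaces is a hypothetical helper name.
flag_shared_namespaces() {
  awk -F'|' '
    {
      issues = ""
      if ($2 == "host") issues = issues " pid=host"
      if ($3 == "host") issues = issues " net=host"
      if ($4 == "host") issues = issues " ipc=host"
      if ($5 == "true") issues = issues " privileged"
      if (issues != "") print $1 ":" issues
    }'
}

# Sample data (made up), in the shape the docker command below would produce
printf '%s\n' \
  '/my-api||bridge|private|false' \
  '/debug-sidecar|host|bridge|private|false' \
  '/node-agent|host|host|host|true' | flag_shared_namespaces
```

On a Docker host the input could come from docker ps -q | xargs -r docker inspect --format '{{.Name}}|{{.HostConfig.PidMode}}|{{.HostConfig.NetworkMode}}|{{.HostConfig.IpcMode}}|{{.HostConfig.Privileged}}'. The pipe delimiter matters because PidMode is an empty string for containers that use the default.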
cgroups: Resource Limits That Prevent Noisy Neighbors
While namespaces provide isolation (what a container can see), cgroups provide resource limits (how much a container can consume). Without cgroups, a container with a memory leak can consume all host RAM and trigger the OOM killer on unrelated containers.
cgroup v1 vs v2: Linux has two cgroup versions. cgroup v1 has separate hierarchies for each resource controller (cpu, memory, blkio, pids). cgroup v2 has a unified hierarchy. Docker supports both, but cgroup v2 is the default on newer Linux distributions (Ubuntu 22.04+, Fedora 31+, RHEL 9+).
CPU controller: Limits CPU usage in two ways:
- cpu.shares (cpu.weight on cgroup v2): relative weight. Default is 1024. A container with 2048 gets twice the CPU of a container with 1024 when there is contention. Does not limit absolute CPU usage.
- cpu.cfs_quota_us / cpu.cfs_period_us (cpu.max on cgroup v2): absolute limit. --cpus=1.0 sets a quota of 100ms per 100ms period, limiting the container to one CPU core.
Memory controller: Limits memory usage:
- memory.limit_in_bytes (memory.max on cgroup v2): hard limit. If the container exceeds this and memory cannot be reclaimed, the kernel OOM-kills a process in the container.
- memory.soft_limit_in_bytes: soft limit. The kernel tries to reclaim memory from the container before other containers, but does not kill it.
- memory.oom_control (cgroup v1 only): controls whether the OOM killer is invoked or the container is frozen.
blkio controller: Limits block device I/O:
- blkio.throttle.read_bps_device: limits read throughput in bytes per second.
- blkio.throttle.write_bps_device: limits write throughput.
pids controller: Limits the number of processes:
- pids.max: maximum number of processes (including threads) the container can create. Prevents fork bombs.
The noisy neighbor problem: Without cgroup limits, one container can starve others. A container with a CPU-bound loop consumes 100% of all CPUs. A container with a memory leak consumes all host RAM, triggering the kernel OOM killer, which may kill unrelated containers. cgroup limits prevent this by enforcing per-container resource ceilings.
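The CPU quota arithmetic above can be checked by hand: --cpus=N becomes a quota of N times the period, and a cgroup v2 cpu.max line of the form "quota period" converts back to a CPU count by division. A minimal sketch follows, with cpus_to_quota and cpu_max_to_cpus as hypothetical helper names and the default period of 100000 microseconds assumed.

```shell
# Sketch: convert between --cpus values and CFS quota/period numbers.
# Helper names are hypothetical; the period defaults to 100000 us (100 ms).
cpus_to_quota() {
  cpus="$1"
  period="${2:-100000}"
  # awk handles the fractional multiplication; +0.5 rounds to the nearest us
  awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%d\n", c * p + 0.5 }'
}

cpu_max_to_cpus() {
  # Input: the content of a cgroup v2 cpu.max file, e.g. "150000 100000"
  echo "$1" | awk '{
    if ($1 == "max") print "unlimited"
    else printf "%.2f\n", $1 / $2
  }'
}

cpus_to_quota 1.0                  # quota for one full CPU
cpus_to_quota 0.5                  # quota for half a CPU
cpu_max_to_cpus "150000 100000"    # one and a half CPUs
cpu_max_to_cpus "max 100000"       # no limit configured
```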
#!/bin/bash
# Inspect and configure cgroup resource limits for containers

# ── Get a container's cgroup path ──
# The cgroup directory name uses the full container ID, so disable truncation
CONTAINER_ID=$(docker ps -q --no-trunc | head -1)
# cgroup v1 path (cgroupfs driver):
ls /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/
# cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us
# cgroup v2 path (systemd driver):
ls /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/
# cpu.max, memory.max, pids.max

# ── Check CPU limits ──
# cgroup v1:
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.shares
# Default: 1024 (a relative weight). Set with --cpu-shares
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_quota_us
# -1 means no limit. Set with --cpus=1.0 (becomes 100000)
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_period_us
# Default: 100000 (100ms)
# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# Format: quota period. Example: 100000 100000 (1 CPU limit)

# ── Check memory limits ──
# cgroup v1:
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.limit_in_bytes
# 9223372036854771712 means no limit (max int64 rounded down to the page size)
# Set with --memory=512m (becomes 536870912)
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.usage_in_bytes
# Current memory usage in bytes
# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# max means no limit. Set with --memory=512m
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
# Current memory usage

# ── Check OOM events ──
# cgroup v1 only:
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.oom_control
# oom_kill_disable: 0 (OOM killer enabled)
# under_oom: 0 (not currently under OOM pressure)
# Check kernel OOM log:
dmesg | grep -i 'oom\|killed process' | tail -10

# ── Check PID limits ──
# cgroup v1:
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.max
# max means no limit. Set with --pids-limit=256
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.current
# Current number of processes

# ── Run a container with resource limits ──
# --memory-swap equal to --memory disables swap for the container.
# (A comment cannot sit between backslash-continued lines, so it goes here.)
docker run -d \
  --name resource-limited \
  --cpus=1.0 \
  --memory=512m \
  --memory-swap=512m \
  --pids-limit=256 \
  alpine:3.19 sleep 3600

# Verify the limits:
docker inspect resource-limited --format '{{.HostConfig.NanoCpus}}'
# 1000000000 = 1 CPU (NanoCpus counts billionths of a CPU)
docker inspect resource-limited --format '{{.HostConfig.Memory}}'
# 536870912 = 512MB (in bytes)
docker stats resource-limited --no-stream
# Shows: MEM USAGE / LIMIT, for example <usage> / 512MiB
- Without a memory limit, a container can consume all host RAM.
- The kernel OOM killer then selects a process to kill — it may kill an unrelated container, not the leaking one.
- With --memory=512m, the kernel kills only the container that exceeded its limit.
- Without limits, the OOM killer uses a heuristic (oom_score) that may choose the wrong victim.
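On cgroup v2 hosts, whether a container was OOM-killed inside its own cgroup is recorded in that cgroup's memory.events file as an oom_kill counter. The sketch below pulls that counter out of such a file. oom_kill_count is a hypothetical helper, and the sample file is written by the script itself so that it runs anywhere; its numbers are made up, not output from a real container.

```shell
# Sketch: read the oom_kill counter from a cgroup v2 memory.events file.
# oom_kill_count is a hypothetical helper name.
oom_kill_count() {
  # $1: path to a memory.events file
  awk '$1 == "oom_kill" { print $2 }' "$1"
}

# Build a sample file with the same key/value layout (values are invented)
sample=$(mktemp)
cat > "$sample" <<'EOF'
low 0
high 0
max 14
oom 2
oom_kill 2
EOF

echo "oom_kill events: $(oom_kill_count "$sample")"
rm -f "$sample"
```

On a real host with the systemd cgroup driver, the file would typically sit at /sys/fs/cgroup/system.slice/docker-<full-id>.scope/memory.events. A non-zero counter means the kill happened inside that container's own limit rather than as a host-wide OOM.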
Union Filesystem and overlay2: How Docker Images Work Without Copying
The union filesystem is the reason Docker images are lightweight and containers start in milliseconds. Instead of copying files, Docker overlays multiple read-only directories and presents them as a single merged filesystem.
overlay2 driver: The default storage driver in modern Docker. It stacks directories (layers) and presents a merged view. Each layer is a directory on the host filesystem. The bottom layers are read-only (image layers). The top layer is writable (container-specific changes).
How it works: When a container reads a file, overlay2 checks the top (writable) layer first. If the file exists there, it is returned. If not, overlay2 checks each lower layer in order until the file is found. When a container writes a file, the write goes to the top layer only; lower layers are never modified. When a container deletes a file that lives in a lower layer, overlay2 creates a whiteout in the top layer to mask it: a character device with major/minor 0/0 that has the same name as the deleted file. (The .wh. filename prefix is the equivalent marker inside image layer tarballs.)
The four directories:
- lowerdir: colon-separated list of read-only image layers (top-most layer first)
- upperdir: the writable layer (container-specific changes)
- workdir: overlay2 internal working directory (must be empty, used for atomic operations)
- merged: the combined view that the container sees as its root filesystem
Performance implications: Opening a file is slightly slower than native because overlay2 may have to look it up in several layers; once the file is open, reads go straight to the underlying filesystem. Write performance is native after copy-up (writes go directly to the upperdir on the host filesystem). The difference is negligible for most workloads but can matter for I/O-intensive applications (databases, search engines).
The copy-up problem: When a container modifies a file from a lower layer, overlay2 must first copy the entire file to the upperdir (copy-up), then modify the copy. For large files (multi-GB database files), copy-up can cause a noticeable delay on first write. This is why databases should use volumes or bind mounts instead of the container's overlay2 filesystem.
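The lookup order described above can be mimicked with ordinary directories. The following is a toy model only: it does not mount overlayfs, it just checks an upper directory and then the lower directories from top to bottom, and treats a character device in the upper directory as a whiteout. overlay_lookup and the directory layout are made up for the example.

```shell
# Toy model of overlay lookup: upper layer first, then lower layers top-down.
# overlay_lookup is a hypothetical helper; no real overlay mount is involved.
overlay_lookup() {
  path="$1"; upper="$2"; shift 2
  if [ -c "$upper/$path" ]; then          # whiteout: the file was deleted
    echo "deleted"; return 1
  fi
  for layer in "$upper" "$@"; do          # remaining args are lower layers
    if [ -e "$layer/$path" ]; then
      echo "$layer/$path"; return 0
    fi
  done
  echo "not found"; return 1
}

root=$(mktemp -d)
mkdir -p "$root/upper/etc" "$root/lower2/etc" "$root/lower1/etc"
echo "base"    > "$root/lower1/etc/app.conf"    # bottom image layer
echo "patched" > "$root/lower2/etc/app.conf"    # higher image layer shadows it
echo "base"    > "$root/lower1/etc/os-release"

overlay_lookup etc/app.conf   "$root/upper" "$root/lower2" "$root/lower1"
overlay_lookup etc/os-release "$root/upper" "$root/lower2" "$root/lower1"

# A write lands in the upper layer, which then wins the lookup
echo "local edit" > "$root/upper/etc/app.conf"
overlay_lookup etc/app.conf   "$root/upper" "$root/lower2" "$root/lower1"
rm -rf "$root"
```

The first lookup resolves to lower2, the second falls through to lower1, and the third resolves to the upper directory once the file has been written there.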
#!/bin/bash
# Inspect the overlay2 filesystem for a running container

# ── Get the overlay2 paths for a container ──
CONTAINER_ID=$(docker ps -q | head -1)
GRAPH_DATA=$(docker inspect --format '{{json .GraphDriver.Data}}' "$CONTAINER_ID")
echo "$GRAPH_DATA" | python3 -m json.tool
# Extract individual paths:
MERGED_DIR=$(echo "$GRAPH_DATA" | python3 -c "import sys,json; print(json.load(sys.stdin)['MergedDir'])")
UPPER_DIR=$(echo "$GRAPH_DATA" | python3 -c "import sys,json; print(json.load(sys.stdin)['UpperDir'])")
LOWER_DIR=$(echo "$GRAPH_DATA" | python3 -c "import sys,json; print(json.load(sys.stdin)['LowerDir'])")
WORK_DIR=$(echo "$GRAPH_DATA" | python3 -c "import sys,json; print(json.load(sys.stdin)['WorkDir'])")
echo "Merged (container sees this as /): $MERGED_DIR"
echo "Upper (writable layer): $UPPER_DIR"
echo "Lower (read-only layers): $LOWER_DIR"
echo "Work (overlay2 internal): $WORK_DIR"

# ── Inspect the writable layer (upperdir) ──
# /var/lib/docker is readable by root only, hence sudo
sudo ls -la "$UPPER_DIR/"
# Shows files the container has created or modified
# Character devices with major/minor 0, 0 are whiteouts (files deleted from lower layers)

# ── Inspect the merged view ──
sudo ls -la "$MERGED_DIR/"
# This is what the container sees as its root filesystem
# It is the combination of all lower layers + the upper layer

# ── Demonstrate a write to the upper layer ──
# Create a file in the container
docker exec "$CONTAINER_ID" sh -c 'echo "hello" > /tmp/test-file'
# The file appears in the writable layer (upperdir):
sudo ls -la "$UPPER_DIR/tmp/test-file"
# The file is in the upper layer, not in any lower layer

# ── Demonstrate the whiteout behavior ──
# Delete a file that exists in an image layer.
# TARGET is a placeholder: pick a file that ships with the image.
# (/etc/hostname will not work, Docker bind-mounts it into the container.)
TARGET=/etc/motd
docker exec "$CONTAINER_ID" rm "$TARGET"
# A whiteout with the same name appears in the upper layer:
sudo ls -la "$UPPER_DIR$TARGET"
# c--------- ... 0, 0 ... motd
# This character device (0/0) tells overlay2 to hide the lower layer's file

# ── Check the number of layers in an image ──
docker inspect <image> --format '{{len .RootFS.Layers}} layers'
# Each layer is a directory under /var/lib/docker/overlay2/

# ── Check disk usage per layer ──
sudo sh -c 'du -sh /var/lib/docker/overlay2/*' | sort -hr | head -10
# Shows disk usage for each layer directory

# ── Compare overlay2 with native filesystem performance ──
# Write performance test:
time docker exec "$CONTAINER_ID" dd if=/dev/zero of=/tmp/test bs=1M count=100
# Overlay2 write: ~0.3s (writes to upperdir on host filesystem)
# Read performance test:
time docker exec "$CONTAINER_ID" dd if=/tmp/test of=/dev/null bs=1M
# Overlay2 read: ~0.1s (timings vary with the host and the page cache)
- Each layer is additive — deleting a file in layer N+1 does not remove it from layer N.
- The delete creates a whiteout marker in layer N+1, but the data still exists in layer N.
- The only way to truly remove data is to not include it in any layer (use multi-stage builds or .dockerignore).
- This is why RUN apt-get install ... && rm -rf /var/lib/apt/lists/* must be in the same RUN — separate RUNs create separate layers.
The Container Lifecycle: From Clone to Exit — Every Kernel Call
When you run docker run, a precise sequence of kernel calls creates the container. Understanding this sequence is the key to debugging startup failures, permission errors, and namespace issues.
Step 1: Image pull and unpack. The Docker daemon pulls the image layers from the registry and unpacks them into /var/lib/docker/overlay2/. Each layer is a directory. If the layers already exist locally (cached), this step is skipped.
Step 2: Create the OCI runtime spec. containerd generates a config.json file, the OCI runtime specification. This JSON file defines:
- The namespaces to create (PID, network, mount, user, UTS, IPC)
- The cgroup limits (CPU, memory, pids)
- The root filesystem path (the overlay2 merge directory)
- The environment variables, working directory, and command to execute
- The mount points (volumes, /proc, /sys, /dev)
Step 3: runc creates the container. runc reads config.json and executes the following kernel calls:
- clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC): creates a new process with new namespaces
- sethostname(): sets the container's hostname (UTS namespace)
- mount(): mounts /proc, /sys, /dev inside the container's mount namespace
- pivot_root(): changes the container's root directory to the overlay2 merge directory
- chdir("/"): moves to the new root
- setgid() / setuid(): drops privileges to the container's user (if non-root); the group is changed first, because setgid() would fail once root has been given up
- execve(): replaces the runc init process with the container's entrypoint command
Step 4: runc exits, containerd monitors. The execve() happens in a child process (runc init) that runs inside the new namespaces and becomes the container's PID 1. The parent runc process exits once the container has started. containerd (via containerd-shim) monitors the container process, captures stdout/stderr, and handles signals.
Step 5: The container process runs. The application process is now running inside a set of namespaces with cgroup limits and an overlay2 filesystem. It has PID 1 inside the container's PID namespace. On the host, it has a real PID visible in ps aux.
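The pause process: Kubernetes pods, not plain Docker containers, use a 'pause' container to hold shared namespaces open while the other containers in the pod restart. A standalone Docker container has no pause process: its namespaces live as long as a process inside them (or an open reference to them) exists, and a restarted container gets fresh namespaces. On a Kubernetes node you can see the pause processes with ps aux | grep pause.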
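When reading an strace log of container creation, the namespace flags are usually the first thing to check, since a missing CLONE_NEWPID or CLONE_NEWNET explains many "why can the container see that" questions. Below is a small sketch that extracts the distinct CLONE_NEW* flags from trace text on standard input. namespace_flags is a hypothetical helper, and the sample lines are hand-written in strace's style rather than captured from a real run.

```shell
# Sketch: list the distinct namespace flags that appear in strace output.
# namespace_flags is a hypothetical helper name.
namespace_flags() {
  grep -o 'CLONE_NEW[A-Z]*' | sort -u
}

# Hand-written sample in strace's output style (not a real capture)
cat <<'EOF' | namespace_flags
1201 unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|CLONE_NEWPID) = 0
1201 clone(child_stack=NULL, flags=CLONE_PARENT|SIGCHLD) = 1203
1203 execve("/bin/sh", ["sh"], 0xc000180000 /* 4 vars */) = 0
EOF
```

Against a real log the call would be namespace_flags < /tmp/runc-trace.log. A container started with --pid=host would show no CLONE_NEWPID in the result.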
#!/bin/bash
# Trace the complete container lifecycle from clone to exec

# ── Step 1: Pull and inspect image layers ──
docker pull alpine:3.19
# Inspect the image layers:
docker inspect alpine:3.19 --format '{{json .RootFS.Layers}}' | python3 -m json.tool
# Each entry is a layer (SHA256 digest)
# Find the layers on disk:
sudo ls /var/lib/docker/overlay2/ | head -5
# Each directory is a layer. A shared layer is stored once and reused by every image that references it.

# ── Step 2: Create a container and inspect the OCI spec ──
# Use a long-running command so the process can be inspected afterwards:
docker create --name lifecycle-demo alpine:3.19 sleep 300
docker start lifecycle-demo
FULL_ID=$(docker inspect --format '{{.Id}}' lifecycle-demo)
# Find the OCI runtime spec (written when the container starts; the path varies by setup):
SPEC=$(sudo find /run/containerd /run/docker -name config.json -path "*${FULL_ID}*" 2>/dev/null | head -1)
# Inspect the spec (if found):
[ -n "$SPEC" ] && sudo cat "$SPEC" | python3 -m json.tool | head -50
# Shows: namespaces, mounts, cgroups, process config, root filesystem

# ── Step 3: Trace runc's kernel calls ──
# Start a container with strace to see the kernel calls:
sudo strace -f -e trace=clone,clone3,unshare,sethostname,mount,pivot_root,setuid,setgid,execve \
  -o /tmp/runc-trace.log \
  runc run --bundle /path/to/bundle test-trace
# The trace shows:
# clone3({flags=CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|..., ...}) = 12345
# sethostname("container-id", 12) = 0
# mount("proc", "/proc", "proc", ...) = 0
# mount("sysfs", "/sys", "sysfs", ...) = 0
# pivot_root(".", "/old-root") = 0
# setgid(1000) = 0
# setuid(1000) = 0
# execve("/bin/sh", ["sh"], ...) = 0

# ── Step 4: Find the container process and its shim on the host ──
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' lifecycle-demo)
echo "Container process PID: $CONTAINER_PID"
# The parent of the container process is the containerd shim:
SHIM_PID=$(ps -o ppid= -p "$CONTAINER_PID" | tr -d ' ')
ps -o pid,cmd -p "$SHIM_PID"
# Expect a containerd-shim-runc-v2 command line
# (plain Docker containers have no pause process; that is a Kubernetes pod concept)
# The shim stays in the host's namespaces, the container process does not:
sudo ls -la /proc/$CONTAINER_PID/ns/net
sudo ls -la /proc/$SHIM_PID/ns/net
# The two links point to different namespace inodes

# ── Step 5: Watch the container process on the host ──
ps -o pid,user,cmd -p "$CONTAINER_PID"
# <pid> root sleep 300
# This is the REAL process on the host, running inside namespaces

# ── Cleanup ──
docker rm -f lifecycle-demo
- runc's job is to create the container, not to manage it. After execve(), the runc init process is replaced by the application process.
- containerd (via containerd-shim) monitors the container process, captures output, and handles signals.
- In Kubernetes pods, a pause process holds the shared namespaces open so they survive application restarts. Plain Docker has no pause process: a container's namespaces live as long as its processes do.
- This separation allows containerd to manage the lifecycle without being PID 1 in the container.
Docker Networking: How Packets Escape the Container Without a Firewall Meltdown
Docker networking isn't magic. It's a carefully orchestrated lie. By default, Docker creates a bridge network (docker0) on the host. Every container gets a virtual Ethernet pair (veth). One end lives inside the container's namespace. The other plugs into the bridge. Packets traverse the bridge, get NAT'd through iptables, and hit your physical NIC.
The mistake most junior engineers make is thinking containers are fully isolated at the network layer. They're not. A container can ARP scan the entire bridge subnet, and nothing stops it from reaching its neighbors: Docker sets --icc=true by default, so containers on the same bridge can talk to each other unhindered.
For production: never use the default bridge. Create user-defined networks. They give you built-in DNS resolution (no more deprecated --link flags) and proper isolation between container groups. Set --iptables=false on the daemon only when you know exactly what you're doing: with it, Docker stops managing NAT and forwarding rules and you own them yourself. Left at the default, Docker writes its own iptables rules, and those will make your security team twitch unless someone reviews them.

    # io.thecodeforge - devops tutorial
    name: audit-default-docker-bridge
    on:
      schedule:
        - cron: '0 6 * * 1'
    jobs:
      check-bridge:
        runs-on: ubuntu-22.04
        steps:
          - name: Inspect default bridge
            run: |
              docker network inspect bridge --format '{{json .IPAM.Config}}'
              echo "---"
              # The icc setting lives in the bridge network options, not in Swarm state
              echo "icc (inter-container comms): $(docker network inspect bridge --format '{{index .Options "com.docker.network.bridge.enable_icc"}}')"
          - name: List iptables for Docker
            run: |
              sudo iptables -t nat -L -n | grep -i docker

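The user-defined network advice above can be sketched as a pair of shell helpers. This is a minimal sketch, not a hardened setup: the function names (setup_backend_network, run_on_network) and every network, container, and image name are placeholders invented for illustration. The flags themselves (docker network create --driver and --internal, docker run --network) are standard Docker CLI options.

```shell
#!/bin/sh
# Sketch: keep backend containers off the default docker0 bridge.
# Function names and arguments are hypothetical; only the docker flags are real.

setup_backend_network() {
    # $1: name of the user-defined network to create
    # --internal: containers on this network get no route to the outside world
    docker network create --driver bridge --internal "$1"
}

run_on_network() {
    # $1: network name, $2: container name, $3: image
    # Containers on the same user-defined network resolve each other by name.
    docker run -d --name "$2" --network "$1" "$3"
}

# Example (requires a running Docker daemon):
#   setup_backend_network backend
#   run_on_network backend api alpine:3.19
```

Containers started on the same network reach each other by container name through Docker's embedded DNS. --internal blocks outbound traffic entirely, so leave it off for services that must reach the outside world.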
The Image Layer Cache: Why Your Docker Builds Take 45 Minutes (And How to Fix It)
Every Dockerfile instruction creates a layer. Each layer is a delta of the filesystem. When you rebuild, Docker checks if the instruction's context has changed. If not, it uses the cached layer. Sounds great. Breaks constantly.
The problem? Most people put COPY . . before running apt-get update or npm install. Any file change in the source dir invalidates every subsequent layer. Now you're reinstalling all system packages and rebuilding all node_modules. Every. Single. Build.
Fix your layer ordering: pin your dependencies and copy the manifests first. Copy package.json (or requirements.txt) and the lockfile before the rest of your source, and run your install commands immediately after. That way, changes to your application code don't trigger a full dependency reinstall. Use BuildKit's RUN --mount=type=cache to persist apt and npm caches across builds. Multi-stage builds aren't just for final image size: they also cache compiled artifacts, so you don't recompile the entire Go project when you changed one string in a config file.

    # io.thecodeforge - devops tutorial
    FROM node:20-alpine AS builder
    WORKDIR /app

    # Dependency layers: cached unless package.json or yarn.lock changes
    COPY package.json yarn.lock ./
    RUN yarn install --frozen-lockfile --no-cache

    # Source layers: invalidate only on code changes
    COPY src/ ./src/
    RUN yarn build

    FROM node:20-alpine AS runner
    WORKDIR /app
    COPY --from=builder /app/dist ./dist
    COPY --from=builder /app/node_modules ./node_modules
    EXPOSE 3000
    CMD ["node", "dist/server.js"]

    # Build with: DOCKER_BUILDKIT=1 docker build --cache-from myapp:cache -t myapp:latest .

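The --mount=type=cache suggestion is easier to judge with a concrete file in front of you. The helper below is a sketch under assumptions: write_cached_dockerfile is a hypothetical name, /cache/yarn is an arbitrary path picked for this example, and the project layout (package.json, yarn.lock, src/) mirrors the listing above. The cache-mount syntax and the YARN_CACHE_FOLDER variable are real BuildKit and Yarn features.

```shell
#!/bin/sh
# Sketch: generate a Dockerfile that uses a BuildKit cache mount for yarn.
# The function name and the /cache/yarn path are assumptions for illustration.

write_cached_dockerfile() {
    # $1: path of the Dockerfile to write
    cat > "$1" <<'EOF'
# syntax=docker/dockerfile:1
FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json yarn.lock ./
# The cache mount survives between builds but is never committed to a layer.
RUN --mount=type=cache,target=/cache/yarn \
    YARN_CACHE_FOLDER=/cache/yarn yarn install --frozen-lockfile
COPY src/ ./src/
RUN yarn build
EOF
}

# Example:
#   write_cached_dockerfile Dockerfile.cached
#   DOCKER_BUILDKIT=1 docker build -f Dockerfile.cached -t myapp:latest .
```

The dependency layer is still invalidated when the lockfile changes, but the reinstall then reads packages from the mounted cache instead of downloading them again.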
Avoid placing COPY . . before package manifests.
The Underlying Technology
Docker's internals rest on two kernel features: namespaces and cgroups. PID, network, mount, and user namespaces each provide an isolated view of system resources, making a container feel like a separate machine. cgroups enforce hard limits on CPU, memory, and I/O, preventing any single container from starving the host. The third piece is the union filesystem, typically overlay2, which stacks read-only image layers under a writable container layer using copy-on-write. Multiple containers share the same base image files on disk, which cuts storage use and keeps startup fast because nothing is copied until a container writes to it. Understanding these primitives explains why containers are not lightweight VMs: they share the host kernel, so a kernel panic takes down all containers. The performance gain comes from skipping hardware virtualization, not from magic.

    # io.thecodeforge - devops tutorial
    # Docker image layers stored in /var/lib/docker/overlay2/
    lowerdir: /var/lib/docker/overlay2/l/LAYER1:/var/lib/docker/overlay2/l/LAYER2
    upperdir: /var/lib/docker/overlay2/CONTAINER_ID/diff
    workdir: /var/lib/docker/overlay2/CONTAINER_ID/work
    merged: /var/lib/docker/overlay2/CONTAINER_ID/merged
    # On write, data is copied from lower to upper dir
    # Deletions create a whiteout file in upperdir

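On a live host the same four directories appear as mount options in /proc/mounts. The helpers below are a rough sketch for pulling them apart; overlay_field and lower_layers are hypothetical names, and the parsing assumes paths without commas or colons, which holds for Docker's generated layer paths.

```shell
#!/bin/sh
# Sketch: split an overlay mount option string into its directories.
# Function names are illustrative; the option keys are overlayfs's own.

overlay_field() {
    # $1: comma-separated mount options, $2: key (lowerdir, upperdir, workdir)
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p"
}

lower_layers() {
    # Prints one read-only layer per line; the first line is the topmost layer.
    overlay_field "$1" lowerdir | tr ':' '\n'
}

# Example on a Docker host (the first overlay mount found):
#   opts=$(awk '$3 == "overlay" {print $4; exit}' /proc/mounts)
#   overlay_field "$opts" upperdir
#   lower_layers "$opts"
```

Files you find under the upperdir are exactly what the container has written or modified since it started, which makes this a quick way to see what a container changed.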
Putting It All Together
When you run docker run nginx, the Docker CLI sends a REST call to dockerd, which pulls any missing layers from a registry over HTTPS and assembles them into a rootfs on overlay2. dockerd then hands the container to containerd, and runc does the kernel work: it creates new PID, mount, and network namespaces (clone() or unshare() with CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET), applies cgroup limits, and execs the entrypoint. The new process sees only its own namespace tree, its root filesystem, and one end of a virtual Ethernet pair. The container's init process (PID 1 inside the container) runs as a normal process on the host but cannot see sibling containers or host processes. Docker's networking uses a bridge or overlay driver and forwards packets through iptables rules. cgroups limit resource contention through CPU shares and memory limits. When the container exits, the kernel tears down each namespace once its last process is gone, and dockerd unmounts the overlay2 filesystem and optionally removes the writable layer. Once the image is local, this entire cycle typically takes under a second because no hardware emulation is involved.

    # io.thecodeforge - devops tutorial
    # Simplified call flow for 'docker run nginx'
    # (dockerd hands off to containerd and runc before step 4)
    1. CLI -> dockerd:      REST POST /containers/create
    2. dockerd -> registry: GET /v2/library/nginx/manifests/latest
    3. dockerd -> overlay:  mount -t overlay overlay \
                              -o lowerdir=/layers/nginx,upperdir=/diff,workdir=/work /merged
    4. runc -> kernel:      unshare(CLONE_NEWNS|CLONE_NEWPID|CLONE_NEWNET)
    5. runc -> cgroup:      echo $PID > /sys/fs/cgroup/memory/docker/<container-id>/cgroup.procs
    6. runc -> kernel:      exec /usr/sbin/nginx inside new namespace

If clone() fails due to user namespace restrictions, check /proc/sys/kernel/unprivileged_userns_clone (present on Debian and Ubuntu kernels).
Conclusion: The Orchestrated Symphony of Kernel Primitives
Docker containers are not lightweight VMs. They are a clever orchestration of Linux kernel primitives: namespaces for isolation, cgroups for resource control, and union filesystems for efficient image layering. Understanding this machinery turns debugging from guesswork into a methodical process. When a container leaks memory, check cgroup limits and the OOM killer. When networking fails, trace the veth pair and iptables NAT rules. When builds are slow, audit instruction ordering and the layer cache. This knowledge separates engineers who merely run containers from those who master them. The Docker CLI is a user-friendly wrapper around syscalls like clone(CLONE_NEWNS) and pivot_root, plus iptables rules. Most container-related production incidents trace back to a specific kernel mechanism: no magic, just well-documented Linux internals. By internalizing these concepts, you move from operator to architect, designing resilient containerized systems rather than debugging black boxes.

    # io.thecodeforge - devops tutorial
    # Deep kernel interaction for container lifecycle
    container_init:
      step_1_namespaces:
        - "clone(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET)"
        - "Separates mount table, PID tree, network stack"
      step_2_cgroups:
        - "Write limits to /sys/fs/cgroup/{cpu,memory,pids}/"
        - "Kernel enforces 2 CPU cores, 512MB RAM"
      step_3_chroot:
        - "pivot_root to container rootfs"
        - "overlay2 merges layers read-only, top writable"
      step_4_networking:
        - "veth pair: eth0 <-> host bridge (docker0)"
        - "iptables -t nat -A POSTROUTING -j MASQUERADE"
    exit_path:
      - "Namespace destroyed on last process exit"
      - "cgroup cleanup via release_agent"

Further Reading: Deep Dives Into the Linux Kernel Underpinnings
To solidify your internal model, start with the canonical sources. Michael Kerrisk's 'The Linux Programming Interface' covers clone() and the process model with exact syscall semantics, and his namespaces(7) and cgroups(7) man pages cover namespaces and cgroups in detail. For kernel code, read drivers/net/veth.c for veth pair creation and kernel/nsproxy.c for namespace lifecycle. The 'Docker and Go: Why We Chose Go' talk explains the initial design decisions. For advanced debugging, strace -f on runc reveals the exact syscall sequence; strace -f docker run... on its own only traces the CLI client, because the container is created on the daemon side. The 'Container Security' book by Liz Rice details when namespace isolation fails (e.g., /proc mounts). For cgroups v2, the kernel documentation under Documentation/admin-guide/cgroup-v2.rst is authoritative. The overlay filesystem code in fs/overlayfs/ shows how copy-up works on writes. Finally, the 'BPF Performance Tools' book by Brendan Gregg explains how eBPF can trace container syscalls with low overhead, which is essential for production profiling. Working through these resources takes you from Docker user to container platform engineer.

    # io.thecodeforge - devops tutorial
    # Essential resources for deep container understanding
    resources:
      books:
        - title: "The Linux Programming Interface"
          author: "Michael Kerrisk"
          focus: "clone() syscall semantics; see also namespaces(7), cgroups(7)"
        - title: "Container Security"
          author: "Liz Rice"
          focus: "Namespace boundary weaknesses, /proc escapes"
      kernel_source:
        - path: "kernel/nsproxy.c"
          desc: "Namespace lifecycle management"
        - path: "drivers/net/veth.c"
          desc: "veth pair creation code"
        - path: "fs/overlayfs/"
          desc: "Copy-up on write logic"
      tools:
        - "strace -f docker run alpine sh"
        - "lsns && cat /proc/1/cgroup"
        - "ebpf tracing: bpftrace -e 't:syscalls:sys_enter_clone* { @[comm] = count(); }'"

Container Process Visible on Host — PID Namespace Misconfiguration Exposes All Container Processes
- --pid=host disables PID namespace isolation. All host processes become visible inside the container. This is a debugging flag, not a production configuration.
- A monitoring container with --pid=host can see and interact with all processes on the host, including other containers' processes.
- Always audit namespace flags (--pid, --net, --ipc, --uts) in production deployments. Any flag that disables namespace isolation increases the blast radius of a compromised container.
- Add automated pre-deployment checks for dangerous flags. Manual review is insufficient — flags added during debugging are easily forgotten.
- The PID namespace is one of the most important isolation boundaries. Without it, a container can see every process on the host and, when running as root, signal them. Most of the practical isolation is gone.
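The automated pre-deployment check recommended above can start very small. The function below is a minimal sketch, assuming deployments are launched from docker run command lines that CI can read as text; audit_run_flags and its OK/FAIL output format are invented for this example, and it covers only the four namespace flags named in the list.

```shell
#!/bin/sh
# Sketch: flag docker run command lines that disable namespace isolation.
# Hypothetical helper; the flags it looks for are real Docker CLI flags.

audit_run_flags() {
    # $1: a full docker run command line as one string
    findings=""
    prev=""
    # Unquoted on purpose: split the command line into words.
    for word in $1; do
        case "$word" in
            --pid=host|--net=host|--network=host|--ipc=host|--uts=host)
                findings="$findings ${word%%=*}"
                ;;
            host)
                # Space-separated form, e.g. "--pid host"
                case "$prev" in
                    --pid|--net|--network|--ipc|--uts)
                        findings="$findings $prev"
                        ;;
                esac
                ;;
        esac
        prev="$word"
    done
    if [ -n "$findings" ]; then
        echo "FAIL:$findings"
        return 1
    fi
    echo "OK"
}

# Example:
#   audit_run_flags "docker run -d --pid=host monitoring:1.0" || exit 1
```

A nonzero return code lets the check fail a pipeline stage directly. For containers that are already running, docker inspect exposes the same settings under HostConfig (PidMode, NetworkMode, IpcMode, UTSMode).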
Debug commands:
- docker inspect --format '{{.State.Pid}}' <container>
- cat /proc/<pid>/cgroup && cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
- docker inspect --format '{{.State.Pid}}' <container>
- nsenter --net --target <pid> ip addr show && nsenter --net --target <pid> ip route show
- docker exec <container> id
- ls -la /var/lib/docker/volumes/<volume>/_data
- dmesg | grep -i 'oom\|killed process'
- cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
- docker inspect --format '{{.GraphDriver.Data.MergedDir}}' <container>
- ls -la $(docker inspect --format '{{.GraphDriver.Data.UpperDir}}' <container>)
- systemctl status docker && systemctl status containerd
- journalctl -u docker --since '10 minutes ago' --no-pager | tail -50

| Namespace | Flag | Isolates | Docker Default | Host Flag to Disable |
|---|---|---|---|---|
| PID | CLONE_NEWPID | Process ID tree | Enabled | --pid=host |
| Network | CLONE_NEWNET | Network stack (interfaces, routes, iptables) | Enabled | --net=host |
| Mount | CLONE_NEWNS | Filesystem mount points | Enabled | --volume /:/host (partial) |
| User | CLONE_NEWUSER | UID/GID mapping | Disabled (opt-in) | N/A (disabled by default) |
| UTS | CLONE_NEWUTS | Hostname and domain name | Enabled | --uts=host |
| IPC | CLONE_NEWIPC | System V IPC and POSIX message queues | Enabled | --ipc=host |
| Cgroup | CLONE_NEWCGROUP | cgroup root directory view | Enabled (cgroup v2) | --cgroupns=host |
| File | Command / Code | Purpose |
|---|---|---|
| io | curl --unix-socket /var/run/docker.sock http://localhost/version \| python3 -m js... | The Docker Stack: From CLI to Kernel |
| io | CONTAINER_NAME="my-api" | Linux Namespaces |
| io | CONTAINER_ID=$(docker ps -q \| head -1) | cgroups |
| io | CONTAINER_ID=$(docker ps -q \| head -1) | Union Filesystem and overlay2 |
| io | docker pull alpine:3.19 | The Container Lifecycle: From Clone to Exit |
| AuditDefaultNetworking.yml | name: audit-default-docker-bridge | Docker Networking |
| OptimizedDockerfile.yml | FROM node:20-alpine AS builder | The Image Layer Cache |
| overlay2_structure_example.yml | lowerdir: /var/lib/docker/overlay2/l/LAYER1:/var/lib/docker/overlay2/l/LAYER2 | The Underlying Technology |
| docker_run_kernel_call_sequence.yml | 1. CLI -> dockerd: REST POST /containers/create | Putting It All Together |
| container_internals_cheat.yml | container_init: | Conclusion |
| further_reading_curated.yml | resources: | Further Reading |
Key takeaways
Interview Questions on This Topic
Frequently Asked Questions
No. A container is a Linux process running inside namespaces and cgroups. It shares the host kernel. A VM runs a full guest operating system with its own kernel on top of a hypervisor. Containers start in milliseconds and use megabytes of memory. VMs take minutes to start and use gigabytes. The security trade-off: containers share the host kernel (a kernel CVE affects all containers), while VMs have a separate kernel per instance.
Namespaces isolate what a process can see. The PID namespace hides other processes. The network namespace gives the process its own network stack. The mount namespace gives it its own filesystem view. cgroups limit what a process can consume. The memory cgroup limits RAM usage. The CPU cgroup limits CPU time. The pids cgroup limits process count. Together, they provide isolation (namespaces) and resource control (cgroups).
Yes. Every container is a real Linux process. Run docker inspect --format '{{.State.Pid}}' <container> to get the host PID. Then run ps aux | grep <pid> to see the process. The process runs inside namespaces, so it has a restricted view of the system, but it is a real process with a real PID on the host.
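Once you have the host PID, you can check which namespaces the process actually lives in by reading the symlinks under /proc/<pid>/ns. The helpers below are a small sketch of that comparison; ns_id and same_namespace are hypothetical names, and reading another user's process (including PID 1) generally requires root.

```shell
#!/bin/sh
# Sketch: compare the namespaces of two processes via /proc/<pid>/ns/<type>.
# Function names are illustrative; the /proc layout is standard Linux.

ns_id() {
    # $1: pid, $2: namespace type (pid, net, mnt, uts, ipc, user, cgroup)
    # Prints something like "pid:[4026531836]"
    readlink "/proc/$1/ns/$2"
}

same_namespace() {
    # $1, $2: pids to compare, $3: namespace type
    # Returns 0 if both share the namespace, 1 if not, 2 if either is unreadable.
    a=$(ns_id "$1" "$3") || return 2
    b=$(ns_id "$2" "$3") || return 2
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example (as root on a Docker host):
#   pid=$(docker inspect --format '{{.State.Pid}}' <container>)
#   same_namespace "$pid" 1 pid && echo "container shares the host PID namespace"
```

A match against PID 1 for the pid type is what a container started with --pid=host looks like from the host side.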
The kernel's OOM killer terminates the container process with SIGKILL (exit code 137). The cgroup memory controller enforces the limit set by --memory. Without a limit, the container can consume all host RAM, and the OOM killer may kill an unrelated process based on its oom_score heuristic. Always set --memory in production.
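The 137 is not arbitrary: by shell convention, an exit code above 128 means the process died from signal number (code - 128), so 137 is signal 9 (SIGKILL) and 143 is signal 15 (SIGTERM). A small decoder makes this explicit. explain_exit_code is a hypothetical helper, and the convention is a heuristic, since an application can also choose to exit with a code above 128 on its own.

```shell
#!/bin/sh
# Sketch: translate a container exit code into a likely cause.
# Hypothetical helper based on the 128 + signal-number convention.

explain_exit_code() {
    # $1: numeric exit code
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        case "$sig" in
            9)  name="SIGKILL" ;;
            15) name="SIGTERM" ;;
            *)  name="signal $sig" ;;
        esac
        echo "killed by $name"
    elif [ "$code" -eq 0 ]; then
        echo "clean exit"
    else
        echo "application error $code"
    fi
}

# Example:
#   explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' <container>)"
```

SIGKILL alone does not prove an OOM kill. Confirm with docker inspect --format '{{.State.OOMKilled}}' <container> or the kernel log before raising the memory limit.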
The OCI (Open Container Initiative) runtime spec is a JSON file (config.json) that describes how to create a container. It defines the namespaces, cgroups, mounts, environment, and command. runc reads this spec and creates the container by calling kernel syscalls. Any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can read the same spec and create the container.