Advanced 10 min · April 05, 2026

Docker Internals: PID Namespace Bug Kills Containers

A --pid=host debugging flag left in production caused container SIGTERM kills every 5-10 minutes.

Naren · Founder
Quick Answer
  • Docker CLI sends API requests to the Docker daemon (dockerd)
  • dockerd delegates container lifecycle to containerd
  • runc reads the OCI runtime spec and configures namespaces, cgroups, and filesystem
  • Namespaces: isolate PID, network, mount, user, UTS, and IPC views
  • cgroups: limit CPU, memory, I/O, and process count per container
  • Union filesystem (overlay2): stack read-only image layers with a writable top layer
  • seccomp: filter syscalls at the kernel level
Plain-English First

Think of Docker as a building contractor. The Docker daemon is the project manager — it takes your blueprints (Dockerfile), coordinates the workers, and hands off the actual construction. containerd is the foreman who manages the construction site. runc is the worker who actually lays the foundation (namespaces), installs the walls (cgroups), and puts on the roof (union filesystem). The Linux kernel is the land itself — it provides the raw materials (system calls, filesystem, networking) that everything else is built on top of. None of them work alone; they are a chain of specialized components.

Most Docker tutorials stop at 'docker run' and never explain what happens inside the kernel. This creates a dangerous gap — when containers misbehave, engineers without kernel-level understanding cannot diagnose the root cause. They restart containers, rebuild images, and escalate to platform teams for problems that a single /proc inspection would have solved.

Docker is a stack of components: the CLI, the daemon (dockerd), containerd, runc, and the Linux kernel. Each layer has a specific responsibility. The daemon manages images and the API. containerd manages container lifecycle. runc creates containers by configuring kernel primitives — namespaces for isolation, cgroups for resource limits, and overlay2 for the filesystem. The kernel does the actual work.

Understanding this stack is essential for production debugging. When a container cannot resolve DNS, the answer is in the network namespace. When a container is OOM-killed, the answer is in the cgroup memory controller. When a container starts slowly, the answer is in the overlay2 filesystem or image pull. Most container problems trace back to one of these kernel-level mechanisms.

Why PID Namespace Isolation Is Not a Security Boundary

Docker uses Linux namespaces to give each container its own view of system resources. The PID namespace is the mechanism that makes processes inside a container see only their own process tree, starting at PID 1. But this is purely a visibility filter — it does not limit what a container can do to processes on the host if other capabilities or mounts are misconfigured.

PID namespaces nest hierarchically. A container's PID 1 is a real process on the host with a different PID, and the kernel translates between namespaces. The critical property: a process that holds CAP_SYS_ADMIN and can open a host process's /proc/<pid>/ns/pid file can call setns() on that file descriptor, after which every child it forks runs in the host PID namespace. This is not a theoretical risk. Real breakouts have abused host /proc entries in the same spirit: the 2019 runc vulnerability CVE-2019-5736 let a container overwrite the host's runc binary through /proc/self/exe.

Use PID namespaces for process isolation, not security. They prevent accidental signal delivery between containers and keep 'ps' output clean. But never rely on them to contain a malicious process. Always pair them with seccomp profiles, AppArmor, and user namespaces. Docker's default capability set does not include CAP_SYS_ADMIN; in production, do not add it back (with --cap-add SYS_ADMIN or --privileged) unless absolutely required.

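PID 1 Is Special Only Inside Its Namespace
PID 1 inside a container is still just a process on the host, where it has an ordinary PID. Inside its own namespace, though, the kernel treats it as init: orphaned processes are reparented to it, and signals for which it has not installed a handler are not delivered (SIGKILL and SIGSTOP sent from the host excepted). An application that was not written as an init therefore has to reap zombies and handle SIGTERM explicitly, or run under a minimal init such as the one docker run --init provides.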
Production Insight
A team ran a container with CAP_SYS_ADMIN and a bind mount of /proc. An attacker inside the container opened /proc/1/ns/pid and called setns() to join the host PID namespace, then spawned a reverse shell visible only on the host. The symptom: no container logs, but host 'ps' showed an unknown bash process. Rule: never mount /proc from the host into a container, and drop CAP_SYS_ADMIN unconditionally.
Key Takeaway
PID namespaces hide processes but do not restrict them — they are a visibility layer, not a security boundary.
A container with CAP_SYS_ADMIN can escape its PID namespace via setns() if it can access host /proc.
Always combine PID namespaces with user namespaces, seccomp, and AppArmor for real isolation.
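
The translation between a host PID and an in-namespace PID is visible in /proc. As a minimal sketch (the helper name ns_pids is illustrative, not part of any tool), the NSpid field of /proc/<pid>/status, available on Linux 4.1 and later, lists a process's PID in each PID namespace it belongs to, starting with the namespace of the /proc mount being read:

```shell
#!/bin/bash
# ns_pids is a hypothetical helper: print the PID of a process as seen from
# each nested PID namespace, outermost first (needs the NSpid field, Linux 4.1+).
ns_pids() {
  local pid="$1"
  # The raw line looks like "NSpid:  5679  1" for a container's init read from the host.
  awk '$1 == "NSpid:" { $1 = ""; sub(/^ +/, ""); print }' "/proc/${pid}/status"
}

# Inspect the current shell. On a Docker host, passing a container's
# .State.Pid would typically print two numbers: the host PID, then 1.
ns_pids "$$"
```

A single number means the process lives in the same PID namespace as the reader, which is also what a container started with --pid=host would show.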

The Docker Stack: From CLI to Kernel — Every Component Explained

Docker is not a single program. It is a stack of components, each with a specific responsibility. Understanding this stack is the foundation for debugging any container issue.

Docker CLI (docker): The command-line interface. It sends HTTP API requests to the Docker daemon. The CLI does not create containers — it is a client that talks to the server. You can replace it with curl, Postman, or any HTTP client.

Docker daemon (dockerd): The server that manages images, networks, volumes, and the container API. It listens on a Unix socket (/var/run/docker.sock) or a TCP port. The daemon does not create containers directly — it delegates to containerd.

containerd: A container runtime that manages the complete container lifecycle — pulling images, creating containers, managing snapshots, and handling container execution. containerd was originally part of Docker but was extracted as a standalone project. It is now used by Docker, Kubernetes (via CRI), and other orchestration platforms.

runc: A lightweight container runtime that creates containers using Linux kernel primitives. runc reads an OCI (Open Container Initiative) runtime specification, a JSON file that describes the container's namespaces, cgroups, mounts, and environment. It calls clone() to create a new process in the requested namespaces, configures cgroups, calls pivot_root() to switch the root filesystem, and calls exec to start the application. runc exits once the container process is running; it does not supervise the container afterwards (the containerd shim does).

The OCI spec: The Open Container Initiative defines two standards: the image spec (how images are packaged) and the runtime spec (how containers are created). runc implements the runtime spec. This standardization means any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can run OCI-compliant images.

The flow: docker run -> dockerd API -> containerd creates container spec -> runc reads OCI spec -> runc calls clone() with namespaces -> runc configures cgroups -> runc pivot_root to overlay2 filesystem -> runc exec the application process -> runc exits -> containerd monitors the container process.
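
To make the "runc reads OCI spec" step concrete, here is a small sketch that lists the namespaces a config.json asks the runtime to create. The helper name oci_namespaces and the spec fragment are illustrative assumptions; a real config.json contains many more fields. It relies on python3 for JSON parsing, as the inspection scripts in this article do.

```shell
#!/bin/bash
# oci_namespaces is a hypothetical helper: list the namespace types that an
# OCI runtime spec (config.json) asks the runtime to create or join.
oci_namespaces() {
  python3 -c '
import json, sys
with open(sys.argv[1]) as f:
    spec = json.load(f)
# linux.namespaces is the list the runtime turns into clone()/unshare() flags.
for ns in spec.get("linux", {}).get("namespaces", []):
    print(ns["type"])
' "$1"
}

# Illustrative spec fragment, not taken from a real container.
demo_spec="$(mktemp)"
cat > "$demo_spec" <<'EOF'
{"ociVersion": "1.0.2",
 "linux": {"namespaces": [{"type": "pid"}, {"type": "network"},
                          {"type": "ipc"}, {"type": "uts"}, {"type": "mount"}]}}
EOF
oci_namespaces "$demo_spec"
rm -f "$demo_spec"
```

A container started with --pid=host would be expected to have no "pid" entry in this list, which is one way to spot the flag from the spec alone.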

io/thecodeforge/docker_stack_inspection.sh (Bash)
#!/bin/bash
# Inspect every layer of the Docker stack

# ── Docker CLI -> Daemon communication ───────────────────────────────────────
# The CLI sends HTTP requests to the daemon. You can do this manually:
curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m json.tool
# Shows: Docker version, API version, Go version, OS, architecture

curl --unix-socket /var/run/docker.sock http://localhost/containers/json | python3 -m json.tool
# Shows: all running containers (same as docker ps)

# ── Check if containerd is running ───────────────────────────────────────────
systemctl status containerd
# containerd is the container runtime daemon
# It manages container lifecycle independently of dockerd

# ── Find the runc binary ─────────────────────────────────────────────────────
which runc
# Typically: /usr/bin/runc or /usr/local/bin/runc

runc --version
# Shows: runc version, commit, spec version (OCI 1.0.2)

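# ── Inspect the OCI runtime spec for a running container ─────────────────────
# containerd keeps a bundle directory (with config.json) for each container
CONTAINER_ID=$(docker ps -q --no-trunc | head -1)

# Find the container's bundle directory (contains config.json).
# Docker normally uses the "moby" containerd namespace; the exact path can
# differ between installs, so search by container ID instead of hard-coding it.
find /run/containerd/io.containerd.runtime.v2.task/ -path "*${CONTAINER_ID}*" -name config.json 2>/dev/null | head -1
# This file is the OCI runtime spec: it defines namespaces, cgroups, mounts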

# ── Trace the container creation flow ────────────────────────────────────────
# Start a container and watch the kernel calls
strace -f -e trace=clone,unshare,pivot_root,chroot,execve \
  -o /tmp/container-trace.log \
  runc run test-container &

# The trace shows:
# clone(CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|...) = <child-pid>
# pivot_root(".", "/old-root") = 0
# execve("/app/server", ["server"], ...) = 0

# ── Check the daemon socket ─────────────────────────────────────────────────
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker /var/run/docker.sock
# The socket is owned by root:docker group
# Any process in the docker group can control ALL containers

# ── Check the daemon process tree ────────────────────────────────────────────
pstree -p $(pidof dockerd)
# dockerd ─┬─ containerd ─┬─ containerd-shim-runc-v2 ─┬─ <app-pid>
#          │              │                            └─ pause
#          │              └─ containerd-shim-runc-v2 ─┬─ <app-pid>
#          │                                          └─ pause
#          └─ docker-proxy (for published ports)
Output
# Docker daemon version:
{
"Version": "24.0.7",
"ApiVersion": "1.43",
"MinAPIVersion": "1.12",
"GitCommit": "afdd53b",
"GoVersion": "go1.20.10",
"Os": "linux",
"Arch": "amd64"
}
# runc version:
runc version 1.1.9
commit: v1.1.9-0-gccaecfc
spec: 1.0.2-dev
# Process tree:
dockerd(1234)───containerd(1235)───containerd-shim(5678)───node(5679)
# The container process (node, PID 5679) is a real Linux process on the host
The Docker Stack as a Restaurant Chain
  • Separation of concerns: the daemon manages the API and images, containerd manages lifecycle, runc creates containers.
  • Replaceability: you can swap runc for crun (faster), kata-runtime (VM isolation), or runsc (gVisor) without changing Docker.
  • Standardization: the OCI spec ensures any compliant runtime can run any compliant image.
  • Kubernetes reuses containerd directly — it does not need dockerd. This is why containerd was extracted.
Production Insight
The /var/run/docker.sock socket is the most dangerous file on a Docker host. Any process with access to this socket can create, stop, and delete containers — effectively root access to the host. In production, never mount this socket into containers unless absolutely necessary. If you must, use a socket proxy that restricts the API calls the container can make.
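
A quick way to audit for this is to check each container's mount sources. The sketch below assumes a hypothetical helper named mounts_docker_sock and made-up mount lists; the commented usage line shows how real data from docker inspect could be fed into it.

```shell
#!/bin/bash
# mounts_docker_sock is a hypothetical helper: read mount source paths on
# stdin (one per line) and succeed if any of them is the Docker socket.
mounts_docker_sock() {
  grep -Eq '^(/var)?/run/docker\.sock$'
}

# Usage against a real container (requires Docker; my-api is a placeholder):
#   docker inspect --format '{{range .Mounts}}{{println .Source}}{{end}}' my-api \
#     | mounts_docker_sock && echo "WARNING: Docker socket is mounted"

# Self-contained demonstration with made-up mount lists:
printf '%s\n' /srv/data /var/run/docker.sock | mounts_docker_sock && echo "socket mounted"
printf '%s\n' /srv/data /etc/ssl | mounts_docker_sock || echo "socket not mounted"
```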
Key Takeaway
Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls (clone, pivot_root, exec). The OCI spec standardizes the interface between layers. Understanding this stack is the foundation for debugging any container issue.
Runtime Selection by Use Case
If: Standard single-tenant application workload
Use: runc (default). Fast, lightweight, standard namespace isolation.
If: Multi-tenant workload running untrusted code
Use: runsc (gVisor) or kata-runtime. User-space kernel or VM isolation.
If: Performance-critical workload, low syscall overhead
Use: crun (written in C, faster than runc's Go implementation).
If: Serverless platform, short-lived functions
Use: Firecracker via containerd Firecracker shim. 125ms VM startup.

Linux Namespaces: The Isolation Mechanism Behind Every Container

Namespaces are the Linux kernel feature that provides process isolation. Each namespace gives a process its own view of a system resource. A container is a regular Linux process that runs inside a set of namespaces — it sees its own PID tree, its own network stack, its own filesystem mount points, and its own hostname, even though it shares the host kernel.

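Linux currently defines eight namespace types. Seven of them matter for containers and are described below; the eighth, the time namespace (Linux 5.6+), is not used by Docker. Docker gives each container its own instance of six of them by default (five on cgroup v1 hosts, where the cgroup namespace is shared with the host); the user namespace is opt-in: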

PID namespace (CLONE_NEWPID): Each container has its own PID tree. The first process inside the container is PID 1. Processes inside the container cannot see processes outside the container. On the host, the container process has a real PID, which you can see with ps aux. PID namespaces are hierarchical: a parent namespace can see the processes in its child namespaces (which is why the host sees every container process), but a child namespace cannot see processes in its parent or in sibling namespaces.

Network namespace (CLONE_NEWNET): Each container gets its own network stack — its own interfaces, routing table, firewall rules, and /proc/net. When Docker creates a container, it creates a veth (virtual Ethernet) pair — one end inside the container's network namespace, one end connected to the Docker bridge. This is how containers communicate with each other and the outside world.

Mount namespace (CLONE_NEWNS): Each container has its own mount table. The container's root filesystem is a union mount (overlay2) that layers the image's read-only layers with a writable top layer. The container cannot see the host's filesystem unless explicitly mounted. pivot_root changes the container's root directory to the overlay2 merge directory.

User namespace (CLONE_NEWUSER): Maps container UIDs to different host UIDs. Container UID 0 (root) can be mapped to host UID 100000 (unprivileged). This means even a container escape results in an unprivileged host user. User namespace remapping is not enabled by default in Docker because it breaks some workflows (volume permissions, Docker-in-Docker).

UTS namespace (CLONE_NEWUTS): Each container has its own hostname. The hostname is set during container creation and can be changed inside the container without affecting the host or other containers.

IPC namespace (CLONE_NEWIPC): Each container has its own System V IPC and POSIX message queues. Processes in different containers cannot share shared memory segments or message queues.

Cgroup namespace (CLONE_NEWCGROUP): Virtualizes the /proc/self/cgroup view. The container sees its own cgroup path as '/' instead of the real path (/docker/<container-id>). This prevents the container from seeing or manipulating other containers' cgroups.
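
Whether two processes share a namespace can be checked by comparing the link targets under /proc/<pid>/ns. The following is a minimal sketch; the helper name shared_namespaces and the container name my-api are assumptions for illustration.

```shell
#!/bin/bash
# shared_namespaces is a hypothetical helper: print the namespace types that
# two processes share, by comparing the /proc/<pid>/ns/* link targets.
shared_namespaces() {
  local a="$1" b="$2" ns ida idb
  for ns in cgroup ipc mnt net pid user uts; do
    ida="$(readlink "/proc/${a}/ns/${ns}" 2>/dev/null)" || continue
    idb="$(readlink "/proc/${b}/ns/${ns}" 2>/dev/null)" || continue
    # Identical targets such as pid:[4026531836] mean the same namespace.
    [ "$ida" = "$idb" ] && echo "$ns"
  done
  return 0
}

# A process trivially shares every namespace with itself:
shared_namespaces "$$" "$$"

# On a Docker host (as root), compare the host's init with a container:
#   shared_namespaces 1 "$(docker inspect --format '{{.State.Pid}}' my-api)"
# With default settings this typically prints only "user" (plus "cgroup" on
# cgroup v1 hosts). "pid" in the output points to --pid=host.
```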

io/thecodeforge/namespace_inspection.sh (Bash)
#!/bin/bash
# Inspect and compare namespaces for containers and the host

# ── Get a container's host PID ───────────────────────────────────────────────
CONTAINER_NAME="my-api"
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' $CONTAINER_NAME)
echo "Container $CONTAINER_NAME is PID $CONTAINER_PID on the host"

# ── List all namespaces for the container process ────────────────────────────
ls -la /proc/$CONTAINER_PID/ns/
# Output:
# lrwxrwxrwx ... ipc -> 'ipc:[4026532XXX]'
# lrwxrwxrwx ... mnt -> 'mnt:[4026532XXX]'
# lrwxrwxrwx ... net -> 'net:[4026532XXX]'
# lrwxrwxrwx ... pid -> 'pid:[4026532XXX]'
# lrwxrwxrwx ... user -> 'user:[4026531XXX]'
# lrwxrwxrwx ... uts -> 'uts:[4026532XXX]'

# ── Compare with host namespaces ─────────────────────────────────────────────
ls -la /proc/1/ns/
# The host PID 1 (systemd) has different namespace IDs than the container
# If namespace IDs match, the container shares that namespace with the host

# ── PID namespace: container sees its own PID tree ────────────────────────────
docker exec $CONTAINER_NAME ps aux
# PID 1 is the container's entrypoint process
# The container cannot see host processes

# On the host, the same process has a different PID:
ps -o pid,user,args -p "$CONTAINER_PID"
# The host sees the real PID, the container sees PID 1

# ── Network namespace: container has its own network stack ───────────────────
docker exec $CONTAINER_NAME ip addr show
# Shows: lo (loopback) and eth0 (veth pair inside the container)

# On the host, inspect the veth pair:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0
# One end is in the container's net namespace, one end is on the docker0 bridge

# ── Enter a container's network namespace from the host ──────────────────────
sudo nsenter --net --target $CONTAINER_PID ip addr show
# Shows the same network config as docker exec, but from the host
# Useful for debugging without a shell inside the container

# ── Mount namespace: inspect the overlay2 filesystem ─────────────────────────
docker inspect --format '{{.GraphDriver.Data}}' $CONTAINER_NAME
# Shows: MergedDir, UpperDir, LowerDir, WorkDir
# MergedDir is what the container sees as /
# UpperDir is the writable layer (container-specific changes)
# LowerDir is the read-only image layers (colon-separated)

# ── User namespace: check if remapping is enabled ────────────────────────────
cat /etc/subuid
# If userns-remap is enabled: dockremap:100000:65536
# This maps container UID 0 to host UID 100000

# ── UTS namespace: container has its own hostname ─────────────────────────────
docker exec $CONTAINER_NAME hostname
# Shows the container's hostname (usually the container ID)

hostname
# Shows the host's hostname — different from the container

# ── IPC namespace: container has its own IPC resources ───────────────────────
docker exec $CONTAINER_NAME ipcs
# Shows only IPC resources created inside the container

ipcs
# Shows host IPC resources — not visible inside the container
Output
# Container PID on host:
Container my-api is PID 5679 on the host
# Container namespaces:
lrwxrwxrwx 1 root root 0 Jan 15 10:23 ipc -> 'ipc:[4026532847]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 mnt -> 'mnt:[4026532849]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 net -> 'net:[4026532851]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 pid -> 'pid:[4026532852]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 uts -> 'uts:[4026532853]'
# Container process list:
PID USER COMMAND
1 node node dist/index.js
# Container network:
1: lo: <LOOPBACK,UP> mtu 65536
4: eth0@if7: <BROADCAST,MULTICAST,UP> mtu 1500
inet 172.17.0.2/16
# Overlay2 filesystem:
MergedDir: /var/lib/docker/overlay2/abc123/merged
UpperDir: /var/lib/docker/overlay2/abc123/diff
LowerDir: /var/lib/docker/overlay2/def456/layers:/var/lib/docker/overlay2/ghi789/layers
Namespaces as Tinted Windows
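  • --pid=host: the container sees ALL host processes. PID 1 in its view is the host's init (systemd), and its own entrypoint runs under an ordinary host PID instead of PID 1.
  • --net=host: the container shares the host's network stack. It can bind to any host port.
  • --ipc=host: the container can access host shared memory segments. Potential data leak.
  • Each flag removes one isolation layer. --privileged does not remove namespaces, but it grants every capability, disables the default seccomp and AppArmor profiles, and exposes host devices, so the remaining isolation is easy to bypass.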
Production Insight
The --pid=host and --net=host flags are debugging tools, not production configurations. --pid=host exposes all host processes to the container — a compromised container can kill any process on the host. --net=host removes network isolation — a container can sniff traffic from other containers. Always use the default namespace isolation in production. If you need host networking for performance, use it only on dedicated hosts.
Key Takeaway
Namespaces are the isolation mechanism. PID namespace isolates the process tree. Network namespace isolates the network stack. Mount namespace isolates the filesystem. User namespace maps UIDs. Every container is a Linux process running inside these namespaces. Disabling a namespace (with --pid=host, --net=host) removes that isolation layer.
Namespace Configuration Decisions
If: Standard production container
Use: All six namespaces enabled (default). Maximum isolation.
If: Monitoring container that needs host process visibility
Use: --pid=host only on dedicated monitoring hosts. Never on shared hosts.
If: Performance-critical proxy or load balancer
Use: --net=host to avoid NAT overhead. Accept the security trade-off.
If: Debugging a container issue interactively
Use: nsenter from the host: nsenter --net --target <pid> bash. No need to modify the container.
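
A debugging flag left in a production configuration is easy to miss, so it can help to audit the namespace modes of running containers. The sketch below is one possible shape for such a check; audit_isolation is a hypothetical helper that only inspects the pipe-separated string it is given, and my-api is a placeholder container name.

```shell
#!/bin/bash
# audit_isolation is a hypothetical helper: given "PidMode|NetworkMode|IpcMode|Privileged",
# print one warning per weakened isolation layer.
audit_isolation() {
  local pid_mode net_mode ipc_mode privileged
  # Fields are pipe-separated so that empty values (the default PidMode) survive.
  IFS='|' read -r pid_mode net_mode ipc_mode privileged <<EOF
$1
EOF
  [ "$pid_mode" = "host" ]   && echo "pid: shares the host PID namespace (--pid=host)"
  [ "$net_mode" = "host" ]   && echo "net: shares the host network stack (--net=host)"
  [ "$ipc_mode" = "host" ]   && echo "ipc: shares host IPC resources (--ipc=host)"
  [ "$privileged" = "true" ] && echo "privileged: all capabilities granted (--privileged)"
  return 0
}

# Usage against a real container (requires Docker):
#   audit_isolation "$(docker inspect --format \
#     '{{.HostConfig.PidMode}}|{{.HostConfig.NetworkMode}}|{{.HostConfig.IpcMode}}|{{.HostConfig.Privileged}}' my-api)"

# Self-contained demonstration with made-up values:
audit_isolation 'host|bridge|private|false'
```

Running a check like this across all containers in CI or on a schedule is one way to catch a --pid=host flag before it reaches production.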

cgroups: Resource Limits That Prevent Noisy Neighbors

While namespaces provide isolation (what a container can see), cgroups provide resource limits (how much a container can consume). Without cgroups, a container with a memory leak can consume all host RAM and trigger the OOM killer on unrelated containers.

cgroup v1 vs v2: Linux has two cgroup versions. cgroup v1 has separate hierarchies for each resource controller (cpu, memory, blkio, pids). cgroup v2 has a unified hierarchy. Docker supports both, but cgroup v2 is the default on newer Linux distributions (Ubuntu 22.04+, Fedora 31+, RHEL 9+).
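
Because the file names differ between the two versions, it is worth checking which one a host uses before inspecting limits. A minimal sketch, assuming GNU stat and a hypothetical helper name cgroup_version:

```shell
#!/bin/bash
# cgroup_version is a hypothetical helper: report which cgroup hierarchy is
# mounted at the given path (default /sys/fs/cgroup).
cgroup_version() {
  local fstype
  fstype="$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null)" || fstype=""
  case "$fstype" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # tmpfs holding one mount per controller
    *)         echo "unknown" ;;  # not mounted, or stat could not inspect it
  esac
}

cgroup_version
```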

CPU controller: Limits CPU usage in two ways (file names are from cgroup v1; cgroup v2 uses cpu.weight and cpu.max):
- cpu.shares: relative weight. Default is 1024. A container with 2048 gets twice the CPU of a container with 1024 when there is contention. Does not limit absolute CPU usage.
- cpu.cfs_quota_us / cpu.cfs_period_us: absolute limit. --cpus=1.0 sets a quota of 100ms per 100ms period, limiting the container to one CPU core's worth of time.
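
The quota arithmetic is simple enough to sketch: quota = cpus × period. The helper name cpus_to_quota is an assumption for illustration; Docker performs this conversion internally.

```shell
#!/bin/bash
# cpus_to_quota is a hypothetical helper: convert a --cpus value into the CFS
# quota in microseconds for a given period (default 100000 us = 100 ms).
cpus_to_quota() {
  local cpus="$1" period="${2:-100000}"
  awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%d\n", c * p + 0.5 }'
}

cpus_to_quota 1.0    # 100000: one full core per 100 ms period
cpus_to_quota 0.5    # 50000: half a core
cpus_to_quota 2.5    # 250000: two and a half cores

# On cgroup v2 the same pair is written to cpu.max as "<quota> <period>":
echo "$(cpus_to_quota 1.5) 100000"
```

A quota larger than the period is valid and simply means the container may use more than one core's worth of time per period.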

Memory controller: Limits memory usage (cgroup v1 names; the cgroup v2 hard limit is memory.max):
- memory.limit_in_bytes: hard limit. If the container exceeds this and the kernel cannot reclaim enough memory, it OOM-kills a process in the container.
- memory.soft_limit_in_bytes: soft limit. Under host memory pressure the kernel reclaims memory from containers above their soft limit first, but does not kill them.
- memory.oom_control: controls whether the OOM killer is invoked or the container's processes are paused until memory is freed.
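
The values in these files are plain byte counts, so a --memory=512m flag shows up as 536870912. A small sketch of that conversion, assuming a hypothetical helper mem_to_bytes that handles the b, k, m, and g suffixes as binary units:

```shell
#!/bin/bash
# mem_to_bytes is a hypothetical helper: convert a size such as 512m or 2g
# (binary units) into the byte count that appears in the cgroup files.
mem_to_bytes() {
  local num="${1%[bkmgBKMG]}"
  case "$1" in
    *[kK]) echo $(( num * 1024 )) ;;
    *[mM]) echo $(( num * 1024 * 1024 )) ;;
    *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)     echo "$num" ;;   # plain bytes, with or without a trailing b
  esac
}

mem_to_bytes 512m   # 536870912
mem_to_bytes 2g     # 2147483648
```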

blkio controller: Limits block device I/O:
- blkio.throttle.read_bps_device: limits read throughput in bytes per second.
- blkio.throttle.write_bps_device: limits write throughput in bytes per second.

pids controller: Limits the number of processes:
- pids.max: maximum number of processes (including threads) the container can create. Prevents fork bombs.

The noisy neighbor problem: Without cgroup limits, one container can starve others. A container with a CPU-bound loop consumes 100% of all CPUs. A container with a memory leak consumes all host RAM, triggering the kernel OOM killer, which may kill unrelated containers. cgroup limits prevent this by enforcing per-container resource ceilings.

io/thecodeforge/cgroup_inspection.sh (Bash)
#!/bin/bash
# Inspect and configure cgroup resource limits for containers

# ── Get a container's cgroup path ────────────────────────────────────────────
CONTAINER_ID=$(docker ps -q | head -1)

# cgroup v1 path:
ls /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/
# cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us

# cgroup v2 path:
ls /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/
# cpu.max, memory.max, pids.max

# ── Check CPU limits ─────────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.shares
# Default: 1024 (1 CPU share). Set with --cpu-shares
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_quota_us
# -1 means no limit. Set with --cpus=1.0 (becomes 100000)
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_period_us
# Default: 100000 (100ms)

# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# Format: quota period. Example: 100000 100000 (1 CPU limit)

# ── Check memory limits ─────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.limit_in_bytes
# 9223372036854771712 means no limit (max int64)
# Set with --memory=512m (becomes 536870912)
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.usage_in_bytes
# Current memory usage in bytes

# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# max means no limit. Set with --memory=512m
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
# Current memory usage

# ── Check OOM events ────────────────────────────────────────────────────────
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.oom_control
# oom_kill_disable: 0 (OOM killer enabled)
# under_oom: 0 (not currently under OOM pressure)

# Check kernel OOM log:
dmesg | grep -i 'oom\|killed process' | tail -10

# ── Check PID limits ────────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.max
# max means no limit. Set with --pids-limit=256
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.current
# Current number of processes

# ── Run a container with resource limits ─────────────────────────────────────
# --memory-swap equal to --memory disables swap (the value is memory + swap).
# Comments cannot sit between backslash-continued lines, so this note goes first.
docker run -d \
  --name resource-limited \
  --cpus=1.0 \
  --memory=512m \
  --memory-swap=512m \
  --pids-limit=256 \
  alpine:3.19 sleep 3600

# Verify the limits:
docker inspect resource-limited --format '{{.HostConfig.NanoCpus}}'
# 1000000000 = 1 CPU (NanoCpus is in units of 10^-9 CPUs)

docker inspect resource-limited --format '{{.HostConfig.Memory}}'
# 536870912 = 512MiB (in bytes)

docker stats resource-limited --no-stream
# Shows: MEM USAGE / LIMIT, for example 43.2MiB / 512MiB
Output
# CPU limits (cgroup v1):
1024
100000
100000
# Memory limits:
536870912
45219840
# PID limits:
256
3
# docker stats:
NAME CPU % MEM USAGE / LIMIT MEM %
resource-limited 0.00% 43.2MiB / 512MiB 8.44%
cgroups as Budget Limits
  • Without a memory limit, a container can consume all host RAM.
  • The kernel OOM killer then selects a process to kill — it may kill an unrelated container, not the leaking one.
  • With --memory=512m, the kernel kills only the container that exceeded its limit.
  • Without limits, the OOM killer uses a heuristic (oom_score) that may choose the wrong victim.
Production Insight
The OOM killer's victim selection heuristic (oom_score) favors killing processes with high memory usage and low importance. But it does not know which container is the problem — it sees host PIDs, not container boundaries. Without cgroup memory limits, a memory leak in container A can cause the OOM killer to kill container B (which happens to have a higher oom_score). Always set --memory on every production container to ensure the OOM killer targets the right process.
Key Takeaway
cgroups limit how much CPU, memory, I/O, and processes a container can consume. Without cgroup limits, one container can starve others (noisy neighbor problem). Always set --memory in production — without it, the OOM killer may kill the wrong container. Use --cpus for CPU limits and --pids-limit to prevent fork bombs.
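
To confirm which container the OOM killer actually hit, cgroup v2 keeps a per-cgroup counter in memory.events. The sketch below parses that file; oom_kill_count is a hypothetical helper and the counters in the demonstration are made up. Docker also records the result per container in the .State.OOMKilled field of docker inspect.

```shell
#!/bin/bash
# oom_kill_count is a hypothetical helper: print the oom_kill counter from a
# cgroup v2 memory.events file (0 if the field is missing).
oom_kill_count() {
  awk '$1 == "oom_kill" { n = $2 } END { print n + 0 }' "$1"
}

# Self-contained demonstration with made-up counters:
events="$(mktemp)"
printf 'low 0\nhigh 0\nmax 12\noom 3\noom_kill 2\n' > "$events"
echo "OOM kills in this cgroup: $(oom_kill_count "$events")"
rm -f "$events"

# On a cgroup v2 host using the systemd cgroup driver, the real file is at
#   /sys/fs/cgroup/system.slice/docker-<full-container-id>.scope/memory.events
```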
Resource Limit Strategy
If: Stateless web API with predictable resource usage
Use: Set --cpus and --memory based on load testing. Use --memory-swap=limit to disable swap.
If: Database or cache with memory-based eviction
Use: Set --memory to the expected working set size. Do not set --memory-swap (allow swap for eviction).
If: Worker process that may fork subprocesses
Use: Set --pids-limit=256 to prevent fork bombs. Set --cpus to limit total CPU across all forks.
If: Development/testing environment
Use: Skip resource limits. They add complexity without benefit in non-production environments.

Union Filesystem and overlay2: How Docker Images Work Without Copying

The union filesystem is the reason Docker images are lightweight and containers start in milliseconds. Instead of copying files, Docker overlays multiple read-only directories and presents them as a single merged filesystem.

overlay2 driver: The default storage driver in modern Docker. It stacks directories (layers) and presents a merged view. Each layer is a directory on the host filesystem. The bottom layers are read-only (image layers). The top layer is writable (container-specific changes).

How it works: When a container reads a file, overlay2 checks the top (writable) layer first. If the file exists there, it is returned. If not, overlay2 checks each lower layer from top to bottom until the file is found. When a container writes a file, the write goes to the top layer only; lower layers are never modified. When a container deletes a file that lives in a lower layer, overlay2 creates a whiteout in the top layer to mask it: a character device with major/minor 0/0 that has the same name as the deleted file. (The .wh. filename prefix is the whiteout convention used inside image layer tarballs, not what appears in the upperdir on disk.)

The four directories:
- lowerdir: colon-separated list of read-only image layers (listed top to bottom: the leftmost entry is the uppermost layer)
- upperdir: the writable layer (container-specific changes)
- workdir: overlay2 internal working directory (must be empty and on the same filesystem as upperdir; used for atomic operations)
- merged: the combined view that the container sees as its root filesystem
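
The lookup order can be emulated in user space with ordinary directories. This is a sketch of the decision rule only, not of the kernel implementation: overlay_lookup is a hypothetical helper, it treats any character device as a whiteout, and it ignores opaque directories.

```shell
#!/bin/bash
# overlay_lookup is a hypothetical user-space emulation of overlay lookup order.
# Usage: overlay_lookup <upperdir> <lowerdir[:lowerdir...]> <relative-path>
overlay_lookup() {
  local upper="$1" lowers="$2" rel="$3" layer
  # Upper layer first: a whiteout hides the file, a regular entry wins.
  if [ -c "$upper/$rel" ]; then echo "deleted (whiteout)"; return 1; fi
  if [ -e "$upper/$rel" ]; then echo "$upper/$rel"; return 0; fi
  # Lower layers are searched left to right; the leftmost is the topmost.
  local IFS=':'
  for layer in $lowers; do
    if [ -c "$layer/$rel" ]; then echo "deleted (whiteout)"; return 1; fi
    if [ -e "$layer/$rel" ]; then echo "$layer/$rel"; return 0; fi
  done
  echo "not found"; return 1
}

# Demonstration with temporary directories standing in for layers:
root="$(mktemp -d)"
mkdir -p "$root/upper/etc" "$root/l1/etc" "$root/l2/etc"
echo "from l2" > "$root/l2/etc/app.conf"
echo "from l1" > "$root/l1/etc/app.conf"
overlay_lookup "$root/upper" "$root/l1:$root/l2" etc/app.conf   # resolves in l1
echo "modified" > "$root/upper/etc/app.conf"
overlay_lookup "$root/upper" "$root/l1:$root/l2" etc/app.conf   # now resolves in upper
rm -rf "$root"
```

The second call mirrors what happens after a copy-up: once a file exists in the upper layer, the lower copies are never consulted again.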

Performance implications: The layer lookup happens when a file is opened, so opening files is slightly slower than on a native filesystem, more so with many layers. Once a file is open, reads and writes go straight to the underlying host filesystem at native speed. The difference is negligible for most workloads but can matter for I/O-intensive applications (databases, search engines).

The copy-up problem: When a container modifies a file from a lower layer, overlay2 must first copy the entire file to the upperdir (copy-up), then modify the copy. For large files (multi-GB database files), copy-up can cause a noticeable delay on first write. This is why databases should keep their data on volumes or bind mounts, which bypass overlay2, instead of the container's overlay2 filesystem.

io/thecodeforge/overlay2_inspection.sh (Bash)
#!/bin/bash
# Inspect the overlay2 filesystem for a running container

# ── Get the overlay2 paths for a container ───────────────────────────────────
CONTAINER_ID=$(docker ps -q | head -1)

GRAPH_DATA=$(docker inspect --format '{{json .GraphDriver.Data}}' $CONTAINER_ID)
echo $GRAPH_DATA | python3 -m json.tool

# Extract individual paths:
MERGED_DIR=$(echo $GRAPH_DATA | python3 -c "import sys,json; print(json.load(sys.stdin)['MergedDir'])")
UPPER_DIR=$(echo $GRAPH_DATA | python3 -c "import sys,json; print(json.load(sys.stdin)['UpperDir'])")
LOWER_DIR=$(echo $GRAPH_DATA | python3 -c "import sys,json; print(json.load(sys.stdin)['LowerDir'])")
WORK_DIR=$(echo $GRAPH_DATA | python3 -c "import sys,json; print(json.load(sys.stdin)['WorkDir'])")

echo "Merged (container sees this as /): $MERGED_DIR"
echo "Upper (writable layer): $UPPER_DIR"
echo "Lower (read-only layers): $LOWER_DIR"
echo "Work (overlay2 internal): $WORK_DIR"

# ── Inspect the writable layer (upperdir) ────────────────────────────────────
ls -la "$UPPER_DIR"/
# Shows files the container has created or modified
# Character devices with major/minor 0, 0 are whiteouts (files deleted from lower layers)

# ── Inspect the merged view ──────────────────────────────────────────────────
ls -la "$MERGED_DIR"/
# This is what the container sees as its root filesystem
# It is the combination of all lower layers + the upper layer

# ── Demonstrate that new files land in the upper layer ───────────────────────
# Create a file in the container
docker exec "$CONTAINER_ID" sh -c 'echo "hello" > /tmp/test-file'

# The file appears in the writable layer (upperdir):
ls -la "$UPPER_DIR"/tmp/test-file
# The file is in the upper layer, not in any lower layer

# ── Demonstrate the whiteout behavior ────────────────────────────────────────
# Delete a file that exists in a lower layer. /etc/issue is an example; use any
# file shipped in the image. (/etc/hostname, /etc/hosts and /etc/resolv.conf are
# bind-mounted by Docker and cannot be removed.)
docker exec "$CONTAINER_ID" rm /etc/issue

# A whiteout appears in the upper layer under the same name:
ls -la "$UPPER_DIR"/etc/issue
# This character device (0/0) tells overlay2 to hide the lower layer's file

# ── Check the number of layers in an image ───────────────────────────────────
IMAGE=$(docker inspect --format '{{.Config.Image}}' "$CONTAINER_ID")
docker image inspect --format '{{len .RootFS.Layers}} layers' "$IMAGE"
# Each layer is a directory under /var/lib/docker/overlay2/

# ── Check disk usage per layer ───────────────────────────────────────────────
du -sh /var/lib/docker/overlay2/* | sort -hr | head -10
# Shows disk usage for each layer (shared layers are counted once)

# ── Compare overlay2 with native filesystem performance ─────────────────────
# Write performance test:
time docker exec "$CONTAINER_ID" dd if=/dev/zero of=/tmp/test bs=1M count=100
# Overlay2 write: ~0.3s (writes to upperdir on host filesystem)

# Read performance test:
time docker exec "$CONTAINER_ID" dd if=/tmp/test of=/dev/null bs=1M
# Overlay2 read: ~0.1s (slightly slower than native due to layer lookup)
Output
# Overlay2 paths:
{
"LowerDir": "/var/lib/docker/overlay2/def456/layers:/var/lib/docker/overlay2/ghi789/layers",
"MergedDir": "/var/lib/docker/overlay2/abc123/merged",
"UpperDir": "/var/lib/docker/overlay2/abc123/diff",
"WorkDir": "/var/lib/docker/overlay2/abc123/work"
}
# Writable layer contents:
drwxr-xr-x 4 root root 4096 Jan 15 10:25 tmp
drwxr-xr-x 2 root root 4096 Jan 15 10:25 etc
c--------- 1 root root 0, 0 Jan 15 10:25 etc/issue
# The 0/0 character device etc/issue hides /etc/issue from the lower layers
Overlay2 as Transparent Acetate Sheets
  • Each layer is additive — deleting a file in layer N+1 does not remove it from layer N.
  • The delete creates a whiteout marker in layer N+1, but the data still exists in layer N.
  • The only way to truly remove data is to not include it in any layer (use multi-stage builds or .dockerignore).
  • This is why RUN apt-get install ... && rm -rf /var/lib/apt/lists/* must be in the same RUN — separate RUNs create separate layers.
Production Insight
Copy-up is the most common performance problem with overlay2. The first time a container modifies a file that lives in a lower layer (for example, a config file or a data file baked into the image), overlay2 copies the entire file into upperdir before the write proceeds. For multi-GB database files, that first write can stall for seconds. The fix: keep database data directories on named volumes or bind mounts, which bypass overlay2.
Key Takeaway
overlay2 stacks read-only image layers with a writable top layer. Reads check the top layer first, then fall through to lower layers. Writes always go to the top layer. Deletes create whiteout files. Deleting files in a later layer does not reclaim space — the data persists in the earlier layer. Use volumes for databases to avoid the copy-up overhead.
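The lookup order in this takeaway can be sketched with plain directories. This is an illustration only, not how the kernel implements overlayfs: resolve_path is a hypothetical helper, and because real on-disk whiteouts are character devices that need root to create, the sketch borrows the .wh.<name> marker convention from image layer tarballs as a stand-in.

```shell
#!/bin/bash
# resolve_path REL_PATH LAYER...
# Layers are listed top-most first (upperdir, then each lowerdir).
# Prints the layer directory that supplies REL_PATH, or "absent".
resolve_path() {
  local rel=$1
  shift
  local dir base layer
  dir=$(dirname "$rel")
  base=$(basename "$rel")
  for layer in "$@"; do
    # A file present in this layer wins over everything below it
    if [ -e "$layer/$rel" ]; then
      echo "$layer"
      return 0
    fi
    # Stand-in whiteout marker: hides the file in all lower layers
    if [ -e "$layer/$dir/.wh.$base" ]; then
      echo "absent"
      return 0
    fi
  done
  echo "absent"
}
```

Called as resolve_path etc/issue "$upper" "$lower", it reports the lower layer until the file is modified (then the upper layer) or deleted (then absent), which mirrors the read, write, and delete rules described here.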
Filesystem Strategy by Workload
If: Stateless application (API, web server)
Use: overlay2 (default). The writable layer is sufficient for temporary files and logs.
If: Database or persistent storage
Use: Named volumes or bind mounts. Bypass overlay2 entirely. Avoid copy-up overhead.
If: Build process creating many temporary files
Use: tmpfs mounts for temporary data. Avoids disk I/O entirely.
If: High-security environment
Use: The --read-only flag to make the overlay2 filesystem read-only. All writes must go to explicit tmpfs or volume mounts.
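To check which row a running container actually falls into, one option is to look up the filesystem type behind a path in /proc/<pid>/mountinfo. The sketch below assumes the standard mountinfo layout (mount point in field 5, filesystem type right after the "-" separator); fs_type_for is a hypothetical helper name, and the PostgreSQL path in the usage note is only an example.

```shell
#!/bin/bash
# fs_type_for PATH MOUNTINFO_FILE
# Prints the filesystem type of the mount that contains PATH, using the
# longest matching mount point. PATH is as seen from inside the container.
fs_type_for() {
  local path=$1 mountinfo=$2
  awk -v p="$path" '
    {
      mp = $5
      fstype = ""
      # Optional fields end at a lone "-"; the filesystem type follows it
      for (i = 7; i <= NF; i++) {
        if ($i == "-") { fstype = $(i + 1); break }
      }
      # The mount point must be "/" or a whole-component prefix of the path
      if (mp == "/" || p == mp || index(p, mp "/") == 1) {
        # ">=" lets a later mount on the same mount point win
        if (length(mp) >= best_len) { best_len = length(mp); best = fstype }
      }
    }
    END { print best }
  ' "$mountinfo"
}
```

For example, fs_type_for /var/lib/postgresql/data /proc/<pid>/mountinfo (with <pid> taken from docker inspect --format '{{.State.Pid}}') prints overlay when the data directory lives in the writable layer, and the host filesystem type such as ext4 when a volume or bind mount backs it.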

The Container Lifecycle: From Clone to Exit — Every Kernel Call

When you run docker run, a precise sequence of kernel calls creates the container. Understanding this sequence is the key to debugging startup failures, permission errors, and namespace issues.

Step 1: Image pull and unpack. The Docker daemon pulls the image layers from the registry and unpacks them into /var/lib/docker/overlay2/. Each layer is a directory. If the layers already exist locally (cached), this step is skipped.

Step 2: Create the OCI runtime spec. dockerd assembles the OCI runtime specification and hands it to containerd, which writes it into the container's bundle directory as config.json. This JSON file defines:
- The namespaces to create (PID, network, mount, UTS, IPC, plus user when remapping is enabled)
- The cgroup limits (CPU, memory, pids)
- The root filesystem path (the overlay2 merge directory)
- The environment variables, working directory, and command to execute
- The mount points (volumes, /proc, /sys, /dev)
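As a rough illustration of the namespace part of the spec, the following sketch lists the entries under linux.namespaces in a config.json. It assumes python3 is available (the scripts in this article already rely on it); spec_namespaces is a hypothetical helper name.

```shell
#!/bin/bash
# spec_namespaces CONFIG_JSON
# Prints one line per namespace in the OCI spec: "<type> new" when runc is
# asked to create it, "<type> join <path>" when it should join an existing one.
spec_namespaces() {
  python3 -c '
import json
import sys

with open(sys.argv[1]) as f:
    spec = json.load(f)

for ns in spec.get("linux", {}).get("namespaces", []):
    # An entry that carries a "path" joins that namespace instead of creating one
    mode = "join " + ns["path"] if "path" in ns else "new"
    print(ns["type"], mode)
' "$1"
}
```

Run against the config.json of a container started with --pid=host, the pid entry would be expected to be missing from the list, which is a quick way to confirm from the spec itself that a container shares the host's PID namespace.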

Step 3: runc creates the container. runc reads config.json and makes the following kernel calls:
- clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC): creates a new process in new namespaces
- sethostname(): sets the container's hostname (UTS namespace)
- mount(): mounts /proc, /sys, /dev inside the container's mount namespace
- pivot_root(): changes the container's root directory to the overlay2 merge directory
- chdir("/"): moves to the new root
- setgid() / setuid(): drops privileges to the container's user (if non-root). The group is set first, because a process that has already given up root can no longer change its GID
- execve(): replaces the runc init process with the container's entrypoint command

Step 4: runc exits, containerd monitors. execve() replaces runc's child process (runc init) with the container's entrypoint, and the parent runc process exits. containerd (via containerd-shim) monitors the container process, captures stdout/stderr, and handles signals.

Step 5: The container process runs. The application process is now running inside a set of namespaces with cgroup limits and an overlay2 filesystem. It has PID 1 inside the container's PID namespace. On the host, it has a real PID visible in ps aux.
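The two PIDs from Step 5 can be read from a single place: the NSpid line in /proc/<pid>/status lists the process ID in each nested PID namespace, outermost first. A minimal sketch, assuming a kernel new enough to expose NSpid (4.1 or later); ns_pids is a hypothetical helper name.

```shell
#!/bin/bash
# ns_pids STATUS_FILE
# STATUS_FILE is /proc/<pid>/status (or a copy of it).
# Prints the NSpid values on one line: host-side PID first, innermost last.
ns_pids() {
  awk '$1 == "NSpid:" {
    for (i = 2; i <= NF; i++) {
      printf "%s%s", $i, (i < NF ? " " : "\n")
    }
  }' "$1"
}
```

For a container's main process, ns_pids /proc/<host-pid>/status run on the host should print two numbers, the host PID followed by 1. A container started with --pid=host shows only one number, because no new PID namespace was created.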

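The pause process: The 'pause' container is a Kubernetes mechanism. Each pod gets one, and it holds the pod's shared namespaces open so application containers can restart without losing them. Standalone Docker containers have no pause process: a container's namespaces live only as long as its PID 1, and a restart creates new ones. The long-lived helper on a Docker host is containerd-shim, one per running container: ps aux | grep containerd-shim.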

io/thecodeforge/container_lifecycle.sh · BASH
#!/bin/bash
# Trace the complete container lifecycle from clone to exec

# ── Step 1: Pull and inspect image layers ────────────────────────────────────
docker pull alpine:3.19

# Inspect the image layers:
docker inspect alpine:3.19 --format '{{json .RootFS.Layers}}' | python3 -m json.tool
# Each entry is a layer (SHA256 digest)

# Find the layers on disk:
ls /var/lib/docker/overlay2/ | head -5
# Each directory is a layer. A shared layer is stored once and reused as a
# lowerdir by every container whose image includes it.

# ── Step 2: Create a container and inspect the OCI spec ──────────────────────
# Use a long-running command so the process is still alive when we inspect it:
docker create --name lifecycle-demo alpine:3.19 sleep 300
docker start lifecycle-demo

# The bundle directory is keyed by the full container ID, not the name:
CONTAINER_ID=$(docker inspect --format '{{.Id}}' lifecycle-demo)

# Find the OCI runtime spec (the location varies by install, so search for it):
SPEC=$(sudo find /run -name config.json -path "*$CONTAINER_ID*" 2>/dev/null | head -1)
echo "OCI spec: $SPEC"
# Typically /run/containerd/io.containerd.runtime.v2.task/moby/<container-id>/config.json

# Inspect the spec (if found):
[ -n "$SPEC" ] && sudo python3 -m json.tool "$SPEC" | head -50
# Shows: namespaces, mounts, cgroups, process config, root filesystem

# ── Step 3: Trace runc's kernel calls ────────────────────────────────────────
# Start a container with strace to see the kernel calls:
sudo strace -f -e trace=clone,clone3,unshare,sethostname,mount,pivot_root,setuid,setgid,execve \
  -o /tmp/runc-trace.log \
  runc run --bundle /path/to/bundle test-trace

# An abridged, illustrative trace looks like:
# clone3({flags=CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|..., ...}) = 12345
# sethostname("container-id", 12) = 0
# mount("proc", "/proc", "proc", ...) = 0
# mount("sysfs", "/sys", "sysfs", ...) = 0
# pivot_root(".", "/old-root") = 0
# setgid(1000) = 0
# setuid(1000) = 0
# execve("/bin/sh", ["sh"], ...) = 0

# ── Step 4: Find the container process and its shim on the host ──────────────
# Find the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' lifecycle-demo)
echo "Container process PID: $CONTAINER_PID"

# Find the parent process. Docker has no pause process (that is a Kubernetes
# pod concept); the long-lived helper is containerd-shim:
SHIM_PID=$(ps -o ppid= -p "$CONTAINER_PID" | tr -d ' ')
ps -o pid,cmd -p "$SHIM_PID"

# The shim runs in the host's namespaces; the container process does not:
sudo ls -la "/proc/$SHIM_PID/ns/pid" "/proc/$CONTAINER_PID/ns/pid"
# The two symlinks point to different namespace inodes

# ── Step 5: Watch the container process on the host ──────────────────────────
ps -o pid,user,cmd -p "$CONTAINER_PID"
# This is the REAL process on the host, running inside namespaces

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f lifecycle-demo
Output
# Image layers:
[
"sha256:abc123def456..."
]
# Container process on host:
Container process PID: 5679
# Parent shim process:
PID CMD
5678 /usr/bin/containerd-shim-runc-v2 -namespace moby -id <container-id> ...
# Host process:
PID USER CMD
5679 root sleep 300
Container Creation as Building a Room
  • runc's job is to create the container, not to manage it. execve() replaces the runc init process with the application process, and the parent runc process exits.
  • containerd (via containerd-shim) monitors the container process, captures output, and handles signals.
  • Docker has no pause process. A container's namespaces last as long as its PID 1, and a restart builds new ones.
  • This separation allows containerd to manage the lifecycle without being PID 1 in the container.
Production Insight
Standalone Docker has no pause process. A container's namespaces exist only while a process lives in them: when PID 1 exits, the kernel kills every other process in the PID namespace and tears it down, and a restart runs runc again to create fresh namespaces. The pause container is a Kubernetes pattern, one per pod, that holds the pod's shared namespaces so application containers can restart without losing them. On a Docker host the long-lived per-container helper is containerd-shim: ps aux | grep containerd-shim shows one per running container.
Key Takeaway
Container creation is a sequence of kernel calls: clone (namespaces), mount (filesystem), pivot_root (change root), setgid/setuid (drop privileges), execve (start application). runc creates the container and exits. containerd-shim monitors the process. Docker has no pause process: namespaces end with the container's PID 1 and are recreated on restart. Every container is a real Linux process visible in ps aux on the host.
● Production incident · POST-MORTEM · severity: high

Host Processes Visible Inside a Container: PID Namespace Misconfiguration Exposes Every Container's Processes

Symptom
Production containers were being killed randomly — sometimes the API container, sometimes the worker container, sometimes the database. The kills occurred every 5-10 minutes with no pattern. The team checked application logs — no errors. They checked OOM killer logs (dmesg | grep -i oom) — no OOM kills. They checked Docker events (docker events) — containers were being killed with SIGTERM, not SIGKILL. The kills were coming from the host, not from inside the containers.
Assumption
The team assumed a rogue cron job on the host was killing processes. They checked crontab -l for all users — nothing suspicious. They assumed a Kubernetes liveness probe was misconfigured — but they were not running Kubernetes. They assumed a Docker daemon bug and restarted dockerd — the kills continued.
Root cause
A monitoring container was started with --pid=host, which disables PID namespace isolation. This made all processes on the host (including other containers' processes) visible inside the monitoring container. The monitoring script used ps aux to collect process metrics and passed PIDs matching a certain pattern to a cleanup script that killed orphaned processes. Because the monitoring container could see all container processes (not just its own), the cleanup script identified legitimate container processes as 'orphans' and killed them. The kills succeeded because the container shared the host's PID namespace and presumably ran as root, Docker's default, which carries CAP_KILL. The --pid=host flag was added during debugging a week earlier and never removed.
Fix
1. Removed --pid=host from the monitoring container. PID namespace isolation was restored, so the monitoring container could see only its own processes.
2. Added a pre-deployment check that scans docker run commands and docker-compose.yml files for --pid=host, --net=host, and --privileged (in Compose: pid: host, network_mode: host, privileged: true) and requires explicit approval.
3. Modified the cleanup script to verify that PIDs belong to the expected namespace before killing them.
4. Added docker events monitoring that alerts on unexpected container kills.
5. Documented that --pid=host is a debugging flag that must never be used in production.
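The namespace check in the cleanup script could look something like the sketch below. The incident write-up does not say which namespace the team compared, so this version assumes the mount namespace (each container normally has its own, even when the PID namespace is shared); ns_id and guarded_kill are hypothetical names. Reading /proc/<pid>/ns for another user's process generally requires root.

```shell
#!/bin/bash
# ns_id TYPE PID
# Prints the namespace identifier, e.g. "mnt:[4026531841]"; empty if unreadable.
ns_id() {
  readlink "/proc/$2/ns/$1" 2>/dev/null
}

# guarded_kill PID [EXPECTED_MNT_NS]
# Sends SIGTERM only if PID lives in the expected mount namespace.
# EXPECTED_MNT_NS defaults to the namespace of the calling shell.
guarded_kill() {
  local pid=$1
  local expected=${2:-$(ns_id mnt $$)}
  local actual
  actual=$(ns_id mnt "$pid")
  if [ -z "$actual" ] || [ "$actual" != "$expected" ]; then
    echo "refusing to signal $pid: mount namespace '$actual' is not '$expected'" >&2
    return 1
  fi
  kill -TERM "$pid"
}
```

With a guard like this, a cleanup script that suddenly sees every process on the host would refuse to signal anything outside its own container instead of treating those processes as orphans.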
Key lesson
  • --pid=host disables PID namespace isolation. All host processes become visible inside the container. This is a debugging flag, not a production configuration.
  • A monitoring container with --pid=host can see and interact with all processes on the host, including other containers' processes.
  • Always audit namespace flags (--pid, --net, --ipc, --uts) in production deployments. Any flag that disables namespace isolation increases the blast radius of a compromised container.
  • Add automated pre-deployment checks for dangerous flags. Manual review is insufficient — flags added during debugging are easily forgotten.
  • The PID namespace is one of the most important isolation boundaries. Without it, the container shares the host's process tree, and a root process inside it can signal any process on the host.
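An automated pre-deployment check along these lines can be a few lines of grep. The sketch below is a deliberately simple pattern match, not a YAML or shell parser, so it can miss unusual spellings; scan_dangerous_flags and its pattern list are assumptions for illustration.

```shell
#!/bin/bash
# scan_dangerous_flags FILE...
# Prints file:line:text for every docker run flag or Compose key that shares
# a host namespace or grants privileged mode.
# Returns 1 if anything was found, so it can fail a CI step; 0 otherwise.
scan_dangerous_flags() {
  local pattern
  # docker run flags, written as --flag=host or --flag host
  pattern='--pid[= ]host|--(net|network)[= ]host|--ipc[= ]host|--uts[= ]host|--privileged'
  # docker-compose.yml equivalents
  pattern="$pattern"'|pid:[[:space:]]*"?host|network_mode:[[:space:]]*"?host'
  pattern="$pattern"'|ipc:[[:space:]]*"?host|privileged:[[:space:]]*true'
  if grep -nHE -e "$pattern" "$@"; then
    return 1
  fi
  return 0
}
```

A pipeline step could run scan_dangerous_flags deploy/*.sh docker-compose.yml and block the release on a non-zero status until someone approves the flagged lines.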
Production debug guide · Systematic debugging paths using /proc, /sys, and namespace inspection. · 6 entries
Symptom · 01
Container process is consuming unexpected CPU or memory.
Fix
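Find the container's host PID and inspect its cgroup. Run docker inspect --format '{{.State.Pid}}' <container> to get the host PID. Then cat /proc/<pid>/cgroup to see which cgroup the process belongs to. Check CPU usage with top -p <pid> and memory with grep -i vm /proc/<pid>/status. Compare with the cgroup limit. On cgroup v1: cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes. On cgroup v2, read memory.max in the directory that /proc/<pid>/cgroup names under /sys/fs/cgroup (with the systemd cgroup driver: /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.max).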
Symptom · 02
Container cannot reach the network or resolve DNS.
Fix
Inspect the container's network namespace. Get the container PID: docker inspect --format '{{.State.Pid}}' <container>. Enter the network namespace: nsenter --net --target <pid> ip addr show. Check the veth pair: nsenter --net --target <pid> ip link show. Check the DNS configuration: cat /proc/<pid>/root/etc/resolv.conf (nsenter --net enters only the network namespace, so a plain cat /etc/resolv.conf would read the host's file). Check routing: nsenter --net --target <pid> ip route show.
Symptom · 03
Container cannot write to a mounted volume — permission denied.
Fix
Check the container's user namespace and the volume ownership. Run docker exec <container> id to see the container's UID. Check the volume ownership on the host: ls -la /var/lib/docker/volumes/<volume>/_data. If the container runs as UID 1000 but the volume is owned by root (UID 0), the container cannot write. Check if user namespace remapping is enabled: cat /etc/docker/daemon.json | grep userns-remap.
Symptom · 04
Container starts but the application process is not running.
Fix
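Check if the container process exited immediately. Run docker ps -a to see if the container status is 'Exited'. Check the exit code: docker inspect --format '{{.State.ExitCode}}' <container>. Check the logs: docker logs <container>. If the exit code is 137, the process received SIGKILL (128 + 9). That is often the OOM killer, but docker kill and a docker stop that hits its timeout produce the same code, so confirm with docker inspect --format '{{.State.OOMKilled}}' <container>, then check the cgroup memory limit and the OOM killer log: dmesg | grep -i oom.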
Symptom · 05
Container filesystem shows unexpected files or missing files.
Fix
Inspect the overlay2 layers. Get the container's merge directory: docker inspect --format '{{.GraphDriver.Data.MergedDir}}' <container>. List the layers: docker inspect --format '{{.GraphDriver.Data.LowerDir}}' <container>. Check the writable layer: docker inspect --format '{{.GraphDriver.Data.UpperDir}}' <container>. Files in the writable layer override files in lower layers. Files 'deleted' in the writable layer are whiteout files (character device 0/0).
Symptom · 06
Docker daemon is unresponsive or containers cannot be created.
Fix
Check the daemon status: systemctl status docker. Check daemon logs: journalctl -u docker --since '10 minutes ago'. Check if containerd is running: systemctl status containerd. Check if the daemon socket is accessible: curl --unix-socket /var/run/docker.sock http://localhost/version. Check if the daemon is out of disk space: df -h /var/lib/docker.
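For the first entry in this guide, the memory limit file sits in a different place on cgroup v1 and cgroup v2 hosts. The sketch below derives the path from the contents of /proc/<pid>/cgroup; it assumes the usual /sys/fs/cgroup mount point and must read the file from the host, since a container with a private cgroup namespace only sees "/". memory_limit_path is a hypothetical helper name.

```shell
#!/bin/bash
# memory_limit_path CGROUP_FILE
# CGROUP_FILE is /proc/<pid>/cgroup as read on the host (or a copy of it).
# Prints the path of the file that holds the memory limit for that process.
memory_limit_path() {
  awk -F: '
    # cgroup v2: a single line of the form 0::<path>
    $1 == "0" && $2 == "" { v2 = $3 }
    # cgroup v1: <id>:<controllers>:<path>, controllers are comma-separated
    $2 ~ /(^|,)memory(,|$)/ { v1 = $3 }
    END {
      if (v1 != "") {
        print "/sys/fs/cgroup/memory" v1 "/memory.limit_in_bytes"
      } else if (v2 != "") {
        print "/sys/fs/cgroup" v2 "/memory.max"
      }
    }
  ' "$1"
}
```

On the host, cat "$(memory_limit_path /proc/<pid>/cgroup)" then shows the limit in bytes, or max on cgroup v2 when no limit is set.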
★ Docker Internals Triage Cheat Sheet · First-response commands for container, namespace, cgroup, and filesystem issues.
Container is consuming too much CPU or memory.
Immediate action
Find the container's host PID and inspect its cgroup.
Commands
docker inspect --format '{{.State.Pid}}' <container>
cat /proc/<pid>/cgroup && cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
(cgroup v2 with the systemd driver: cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.current)
Fix now
If memory usage is at or near the limit, increase --memory. If no limit is set, add --memory=512m to prevent one container from consuming all host RAM.
Container cannot reach the network.
Immediate action
Inspect the container's network namespace.
Commands
docker inspect --format '{{.State.Pid}}' <container>
nsenter --net --target <pid> ip addr show && nsenter --net --target <pid> ip route show
Fix now
If no veth interface exists, the container is not connected to a network. If routes are missing, check docker network inspect <network>.
Container cannot write to volume — permission denied.
Immediate action
Check UID mapping between container and host.
Commands
docker exec <container> id
ls -la /var/lib/docker/volumes/<volume>/_data
Fix now
If UID mismatch, chown the volume to match the container's UID. Check if userns-remap is enabled in daemon.json.
Container exits immediately with code 137.
Immediate action
Check if the process was OOM-killed.
Commands
dmesg | grep -i 'oom\|killed process'
docker inspect --format '{{.State.OOMKilled}} {{.HostConfig.Memory}}' <container>
Fix now
If OOMKilled is true, increase the --memory limit or fix the memory leak. If it is false, something else sent SIGKILL (docker kill, or docker stop after its timeout). Check docker stats for actual memory usage before the kill.
Container filesystem shows unexpected or missing files.
Immediate action
Inspect overlay2 layers and the writable layer.
Commands
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' <container>
ls -la $(docker inspect --format '{{.GraphDriver.Data.UpperDir}}' <container>)
Fix now
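Files in UpperDir override LowerDir. Whiteouts mark deleted files: on disk they are character devices with device number 0, 0 (ls -l shows a leading c), so list them with find <UpperDir> -type c. The .wh. prefix appears only inside image layer tarballs, not in the overlay2 directories.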
Docker daemon is unresponsive.
Immediate action
Check daemon and containerd status.
Commands
systemctl status docker && systemctl status containerd
journalctl -u docker --since '10 minutes ago' --no-pager | tail -50
Fix now
If the daemon is hung, restart it: systemctl restart docker. If the disk is full, prune: docker system prune -a (this removes every unused image, not just dangling ones, so expect re-pulls). If containerd has crashed, restart it: systemctl restart containerd.
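Several entries in this cheat sheet hinge on reading an exit code correctly. As a small sketch, exit codes above 128 conventionally mean "terminated by signal (code minus 128)", which is why 137 maps to SIGKILL and 143 to SIGTERM. explain_exit_code is a hypothetical helper; an application can also choose such a code itself, so treat the result as a hint.

```shell
#!/bin/bash
# explain_exit_code CODE
# Translates a container exit code into a short description.
explain_exit_code() {
  local code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
    # 128 + N means the process was terminated by signal N
    local sig=$((code - 128))
    echo "killed by signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exited with application error code $code"
  fi
}
```

For example, explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' <container>)" turns the raw number into something you can act on before reaching for dmesg.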
Linux Namespace Types Used by Docker
Namespace | Flag | Isolates | Docker Default | Host Flag to Disable
PID | CLONE_NEWPID | Process ID tree | Enabled | --pid=host
Network | CLONE_NEWNET | Network stack (interfaces, routes, iptables) | Enabled | --net=host
Mount | CLONE_NEWNS | Filesystem mount points | Enabled | None (--volume /:/host exposes the host filesystem but keeps the namespace)
User | CLONE_NEWUSER | UID/GID mapping | Disabled (opt-in via userns-remap) | --userns=host (only relevant when remapping is enabled)
UTS | CLONE_NEWUTS | Hostname and domain name | Enabled | --uts=host
IPC | CLONE_NEWIPC | System V IPC and POSIX message queues | Enabled | --ipc=host
Cgroup | CLONE_NEWCGROUP | cgroup root directory view | Private on cgroup v2 hosts, host on cgroup v1 | --cgroupns=host

Key takeaways

1. Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls.
2. Namespaces isolate what a container can see (PID tree, network, filesystem, hostname, IPC). cgroups limit what a container can consume (CPU, memory, I/O, processes).
3. Every container is a real Linux process visible in ps aux on the host. The kernel does not know what a 'container' is: it only knows processes, namespaces, and cgroups.
4. overlay2 stacks read-only image layers with a writable top layer. No data is copied on container creation. Writes go to the top layer. Deletes create whiteout files.
5. runc exits after creating the container, and containerd-shim monitors the process. Docker has no pause process (that is a Kubernetes pod mechanism): a container's namespaces end with its PID 1 and are recreated on restart.
6. The Docker socket (/var/run/docker.sock) is equivalent to root access on the host. Never mount it into containers without a socket proxy.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Is a Docker container a virtual machine?
02
What is the difference between a namespace and a cgroup?
03
Can I see a container's process on the host?
04
What happens when a container exceeds its memory limit?
05
What is the OCI runtime spec?