Advanced 14 min · April 05, 2026

How Docker Works Internally

Docker Internals: PID Namespace Bug Kills Containers

Q: Is a Docker container a virtual machine?

No. A container is a Linux process running inside namespaces and cgroups. It shares the host kernel. A VM runs a full guest operating system with its own kernel on top of a hypervisor. Containers start in milliseconds and add megabytes of overhead. Traditional VMs take seconds to minutes to boot and reserve gigabytes of memory. The security trade-off: containers share the host kernel (a kernel CVE affects all containers), while VMs have a separate kernel per instance.

Q: What is the difference between a namespace and a cgroup?

Namespaces isolate what a process can see. The PID namespace hides other processes. The network namespace gives the process its own network stack. The mount namespace gives it its own filesystem view. cgroups limit what a process can consume. The memory cgroup limits RAM usage. The CPU cgroup limits CPU time. The pids cgroup limits process count. Together, they provide isolation (namespaces) and resource control (cgroups).

Q: Can I see a container's process on the host?

Yes. Every container is a real Linux process. Run docker inspect --format '{{.State.Pid}}' <container> to get the host PID. Then run ps -fp <pid> to see the process. The process runs inside namespaces, so it has a restricted view of the system, but it is a real process with a real PID on the host.

Q: What happens when a container exceeds its memory limit?

The kernel's OOM killer terminates the container process with SIGKILL (exit code 137). The cgroup memory controller enforces the limit set by --memory. Without a limit, the container can consume all host RAM, and the OOM killer may kill an unrelated process based on its oom_score heuristic. Always set --memory in production.
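The 137 above is not arbitrary: exit codes above 128 encode "terminated by signal (code minus 128)", and 137 minus 128 is 9, which is SIGKILL. A minimal sketch of a decoder follows. The function name describe_exit_code is hypothetical, and only a few common signals are spelled out.

```bash
#!/bin/bash
# Decode a container exit code into a human-readable cause.
# Codes above 128 mean "terminated by signal (code - 128)".
describe_exit_code() {
  local code="$1"
  if [ "$code" -le 128 ]; then
    echo "exited with status $code"
    return
  fi
  case $((code - 128)) in
    9)  echo "killed by SIGKILL (OOM kill or docker kill)" ;;
    15) echo "terminated by SIGTERM" ;;
    2)  echo "interrupted by SIGINT" ;;
    *)  echo "terminated by signal $((code - 128))" ;;
  esac
}

describe_exit_code 137
describe_exit_code 143
describe_exit_code 1
```

On a real host, feed it the value of docker inspect --format '{{.State.ExitCode}}' <container>. SIGKILL alone does not prove an OOM kill, so check {{.State.OOMKilled}} as well.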

Q: What is the OCI runtime spec?

The OCI (Open Container Initiative) runtime spec defines a JSON configuration file (config.json) that describes how to create a container. It lists the namespaces, cgroups, mounts, environment, and command. runc reads this file and creates the container by calling kernel syscalls. Any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can read the same file and create the container.

A --pid=host debugging flag left in production caused container SIGTERM kills every 5-10 minutes.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Everything here is grounded in real deployments.

✓ Production tested · Last updated July 04, 2026

Before you start (⏱ 30 min)

✓ Production DevOps experience
✓ Deep understanding of the tool's internals
✓ Experience debugging distributed systems

⚡Quick Answer

Docker CLI sends API requests to the Docker daemon (dockerd)
dockerd delegates container lifecycle to containerd
runc reads the OCI runtime spec and configures namespaces, cgroups, and filesystem
Namespaces: isolate PID, network, mount, user, UTS, and IPC views
cgroups: limit CPU, memory, I/O, and process count per container
Union filesystem (overlay2): stack read-only image layers with a writable top layer
seccomp: filter syscalls at the kernel level

✦ Definition

What Is Docker, and How Does It Work Internally?

Docker is a containerization platform that packages applications and their dependencies into isolated environments called containers. Unlike virtual machines, which emulate entire operating systems with separate kernels, Docker containers share the host's Linux kernel while using kernel features like namespaces and cgroups to create lightweight, portable execution environments.

This design allows containers to start in milliseconds and consume minimal overhead compared to VMs, making them ideal for microservices, CI/CD pipelines, and cloud-native deployments. Docker's ecosystem includes tools like Docker Compose for multi-container applications, Docker Swarm for orchestration, and integration with Kubernetes for production-scale management.

At its core, Docker leverages Linux namespaces to provide process isolation — each container gets its own PID, network, mount, and other namespaces, making processes inside appear to have their own system. cgroups (control groups) enforce resource limits on CPU, memory, and I/O to prevent any single container from starving others. The union filesystem (typically overlay2) enables Docker images to be built from read-only layers, sharing common base layers across containers without duplication.

This layered approach means pulling a new image often only downloads the delta, drastically reducing bandwidth and storage.

However, Docker's isolation is not a security boundary. PID namespace isolation, for example, prevents a container from seeing host processes but doesn't protect against kernel exploits — a compromised container can escape to the host. For production security, you need additional measures like seccomp profiles, AppArmor/SELinux, user namespaces, and running containers as non-root.

Docker is not a replacement for VMs in multi-tenant environments where strong isolation is required; for that, consider gVisor, Kata Containers, or Firecracker microVMs. Understanding these internals — from the Docker CLI's REST API calls to containerd's shim processes and runc's clone() syscalls — is essential for debugging container failures and building secure, efficient deployments.

Plain-English First

Think of Docker as a building contractor. The Docker daemon is the project manager — it takes your blueprints (Dockerfile), coordinates the workers, and hands off the actual construction. containerd is the foreman who manages the construction site. runc is the worker who actually lays the foundation (namespaces), installs the walls (cgroups), and puts on the roof (union filesystem). The Linux kernel is the land itself — it provides the raw materials (system calls, filesystem, networking) that everything else is built on top of. None of them work alone; they are a chain of specialized components.

Most Docker tutorials stop at 'docker run' and never explain what happens inside the kernel. This creates a dangerous gap — when containers misbehave, engineers without kernel-level understanding cannot diagnose the root cause. They restart containers, rebuild images, and escalate to platform teams for problems that a single /proc inspection would have solved.

Docker is a stack of components: the CLI, the daemon (dockerd), containerd, runc, and the Linux kernel. Each layer has a specific responsibility. The daemon manages images and the API. containerd manages container lifecycle. runc creates containers by configuring kernel primitives — namespaces for isolation, cgroups for resource limits, and overlay2 for the filesystem. The kernel does the actual work.

Understanding this stack is essential for production debugging. When a container cannot resolve DNS, the answer is in the network namespace. When a container is OOM-killed, the answer is in the cgroup memory controller. When a container starts slowly, the answer is in the overlay2 filesystem or image pull. Every container problem has a kernel-level root cause.

Why PID Namespace Isolation Is Not a Security Boundary

Docker uses Linux namespaces to give each container its own view of system resources. The PID namespace is the mechanism that makes processes inside a container see only their own process tree, starting at PID 1. But this is purely a visibility filter — it does not limit what a container can do to processes on the host if other capabilities or mounts are misconfigured.

PID namespaces nest hierarchically. A container's PID 1 is a real process on the host with a different PID, and the kernel translates between namespaces. The critical property: a process with CAP_SYS_ADMIN that can reach a host-level namespace handle such as /proc/<pid>/ns/pid can call setns() on it, after which the children it forks run in the host PID namespace. Escapes through /proc are not theoretical: the 2019 runc breakout (CVE-2019-5736) abused /proc/self/exe to overwrite the host's runc binary from inside a container.

Use PID namespaces for process isolation, not security. They prevent accidental signal delivery between containers and keep 'ps' output clean. But never rely on them to contain a malicious process. Always pair with seccomp profiles, AppArmor, and user namespaces. In production, drop CAP_SYS_ADMIN from all containers unless absolutely required.

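PID 1 Is Still an Ordinary Host Process

PID 1 inside a container is an ordinary process on the host, but inside its PID namespace the kernel treats it as init: orphaned processes are reparented to it, and signals sent from inside the namespace are ignored unless it has installed a handler for them. Most applications are not written to act as init, so zombie reaping must be done explicitly, for example by running the container with docker run --init.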

Production Insight

A team ran a container with CAP_SYS_ADMIN and a bind mount of /proc. An attacker inside the container opened /proc/1/ns/pid and called setns() to join the host PID namespace, then spawned a reverse shell visible only on the host. The symptom: no container logs, but host 'ps' showed an unknown bash process. Rule: never mount /proc from the host into a container, and drop CAP_SYS_ADMIN unconditionally.

Key Takeaway

PID namespaces hide processes but do not restrict them — they are a visibility layer, not a security boundary.

A container with CAP_SYS_ADMIN can escape its PID namespace via setns() if it can access host /proc.

Always combine PID namespaces with user namespaces, seccomp, and AppArmor for real isolation.

The Docker Stack: From CLI to Kernel — Every Component Explained

Docker is not a single program. It is a stack of components, each with a specific responsibility. Understanding this stack is the foundation for debugging any container issue.

Docker CLI (docker): The command-line interface. It sends HTTP API requests to the Docker daemon. The CLI does not create containers — it is a client that talks to the server. You can replace it with curl, Postman, or any HTTP client.

Docker daemon (dockerd): The server that manages images, networks, volumes, and the container API. It listens on a Unix socket (/var/run/docker.sock) or a TCP port. The daemon does not create containers directly — it delegates to containerd.

containerd: A container runtime that manages the complete container lifecycle — pulling images, creating containers, managing snapshots, and handling container execution. containerd was originally part of Docker but was extracted as a standalone project. It is now used by Docker, Kubernetes (via CRI), and other orchestration platforms.

runc: A lightweight container runtime that creates containers using Linux kernel primitives. runc reads an OCI (Open Container Initiative) runtime specification — a JSON file that describes the container's namespaces, cgroups, mounts, and environment. runc calls clone() to create a new process, configures namespaces and cgroups, pivot_root to change the filesystem, and exec to start the application. runc exits after creating the container — it does not manage the container's lifecycle.

The OCI spec: The Open Container Initiative defines two standards: the image spec (how images are packaged) and the runtime spec (how containers are created). runc implements the runtime spec. This standardization means any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can run OCI-compliant images.

The flow: docker run -> dockerd API -> containerd creates container spec -> runc reads OCI spec -> runc calls clone() with namespaces -> runc configures cgroups -> runc pivot_root to overlay2 filesystem -> runc exec the application process -> runc exits -> containerd monitors the container process.
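To make the "runc reads OCI spec" step concrete, the sketch below pulls the requested namespace types out of a config.json. It is a quick-look helper under stated assumptions: the sample file is a hypothetical, heavily trimmed fragment, list_namespaces is an illustrative name, and the awk/sed parsing only works on pretty-printed JSON with one entry per line. Use a real JSON parser for anything more serious.

```bash
#!/bin/bash
# List the namespace types requested in a pretty-printed OCI config.json.
# Assumption: the "namespaces" array is spread over several lines, one
# entry per line. This is line-based text matching, not JSON parsing.
list_namespaces() {
  awk '/"namespaces"[[:space:]]*:/ { in_ns = 1 }
       in_ns { print }
       in_ns && /\]/ { exit }' "$1" |
    sed -n 's/.*"type"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Hypothetical, heavily trimmed config.json fragment for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{
  "process": { "args": ["/app/server"] },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ]
  }
}
EOF

list_namespaces "$sample"
rm -f "$sample"
```

Each type listed corresponds to one CLONE_NEW* flag that the runtime passes to the kernel when it creates the container process.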

io/thecodeforge/docker_stack_inspection.sh · BASH

#!/bin/bash
# Inspect every layer of the Docker stack

# ── Docker CLI -> Daemon communication ───────────────────────────────────────
# The CLI sends HTTP requests to the daemon. You can do this manually:
curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m json.tool
# Shows: Docker version, API version, Go version, OS, architecture

curl --unix-socket /var/run/docker.sock http://localhost/containers/json | python3 -m json.tool
# Shows: all running containers (same as docker ps)

# ── Check if containerd is running ───────────────────────────────────────────
systemctl status containerd
# containerd is the container runtime daemon
# It manages container lifecycle independently of dockerd

# ── Find the runc binary ─────────────────────────────────────────────────────
which runc
# Typically: /usr/bin/runc or /usr/local/bin/runc

runc --version
# Shows: runc version, commit, spec version (OCI 1.0.2)

# ── Inspect the OCI runtime spec for a running container ─────────────────────
# containerd keeps each container's bundle (including config.json) under /run.
# The directory is named after the full container ID, so disable truncation.
CONTAINER_ID=$(docker ps -q --no-trunc | head -1)

# Docker-managed containers live in containerd's "moby" namespace
sudo ls -l "/run/containerd/io.containerd.runtime.v2.task/moby/$CONTAINER_ID/config.json"
# This file is the OCI runtime spec: it defines namespaces, cgroups, mounts

# ── Trace the container creation flow ────────────────────────────────────────
# Run this from an OCI bundle directory (config.json plus rootfs/) and watch
# the kernel calls that runc makes
sudo strace -f -e trace=clone,unshare,pivot_root,chroot,execve \
  -o /tmp/container-trace.log \
  runc run test-container &

# The trace shows:
# clone(CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|...) = <child-pid>
# pivot_root(".", "/old-root") = 0
# execve("/app/server", ["server"], ...) = 0

# ── Check the daemon socket ─────────────────────────────────────────────────
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker /var/run/docker.sock
# The socket is owned by root:docker group
# Any process in the docker group can control ALL containers

# ── Check the daemon process tree ────────────────────────────────────────────
pstree -p $(pidof dockerd)
# dockerd ─── docker-proxy (for published ports)

# On most installs containerd runs as its own systemd service, and each
# containerd-shim-runc-v2 re-parents itself to PID 1, so list the shims by name:
ps -ef | grep '[c]ontainerd-shim'
# The container's main process is a child of its containerd-shim-runc-v2

Output

# Docker daemon version:
{
  "Version": "24.0.7",
  "ApiVersion": "1.43",
  "MinAPIVersion": "1.12",
  "GitCommit": "afdd53b",
  "GoVersion": "go1.20.10",
  "Os": "linux",
  "Arch": "amd64"
}

# runc version:
runc version 1.1.9
commit: v1.1.9-0-gccaecfc
spec: 1.0.2-dev

# Process tree (shim and container process):
systemd(1)───containerd-shim(5678)───node(5679)

# The container process (node, PID 5679) is a real Linux process on the host

The Docker Stack as a Restaurant Chain

Separation of concerns: the daemon manages the API and images, containerd manages lifecycle, runc creates containers.
Replaceability: you can swap runc for crun (faster), kata-runtime (VM isolation), or runsc (gVisor) without changing Docker.
Standardization: the OCI spec ensures any compliant runtime can run any compliant image.
Kubernetes reuses containerd directly — it does not need dockerd. This is why containerd was extracted.

Production Insight

The /var/run/docker.sock socket is the most dangerous file on a Docker host. Any process with access to this socket can create, stop, and delete containers — effectively root access to the host. In production, never mount this socket into containers unless absolutely necessary. If you must, use a socket proxy that restricts the API calls the container can make.

Key Takeaway

Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls (clone, pivot_root, exec). The OCI spec standardizes the interface between layers. Understanding this stack is the foundation for debugging any container issue.

Runtime Selection by Use Case

If: Standard single-tenant application workload → Use: runc (default). Fast, lightweight, standard namespace isolation.

If: Multi-tenant workload running untrusted code → Use: runsc (gVisor) or kata-runtime. User-space kernel or VM isolation.

If: Performance-critical workload, low syscall overhead → Use: crun (written in C, faster than runc's Go implementation).

If: Serverless platform, short-lived functions → Use: Firecracker via the containerd Firecracker shim. 125ms VM startup.

Linux Namespaces: The Isolation Mechanism Behind Every Container

Namespaces are the Linux kernel feature that provides process isolation. Each namespace gives a process its own view of a system resource. A container is a regular Linux process that runs inside a set of namespaces — it sees its own PID tree, its own network stack, its own filesystem mount points, and its own hostname, even though it shares the host kernel.

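Linux has eight namespace types (the time namespace, added in kernel 5.6, is the newest). The seven below are the ones Docker works with. PID, network, mount, UTS, and IPC are unshared for every container by default; the user namespace is opt-in, and a private cgroup namespace is the default only on cgroup v2 hosts:

PID namespace (CLONE_NEWPID): Each container has its own PID tree. The first process inside the container is PID 1. Processes inside the container cannot see processes outside the container. On the host, the container process has a real PID, and you can see it with ps aux. The PID namespace is hierarchical: a parent namespace can see the processes in its child namespaces (which is why the host sees container processes), but a child cannot see processes in its parent or in sibling namespaces.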

Network namespace (CLONE_NEWNET): Each container gets its own network stack — its own interfaces, routing table, firewall rules, and /proc/net. When Docker creates a container, it creates a veth (virtual Ethernet) pair — one end inside the container's network namespace, one end connected to the Docker bridge. This is how containers communicate with each other and the outside world.

Mount namespace (CLONE_NEWNS): Each container has its own mount table. The container's root filesystem is a union mount (overlay2) that layers the image's read-only layers with a writable top layer. The container cannot see the host's filesystem unless explicitly mounted. pivot_root changes the container's root directory to the overlay2 merge directory.

User namespace (CLONE_NEWUSER): Maps container UIDs to different host UIDs. Container UID 0 (root) can be mapped to host UID 100000 (unprivileged). This means even a container escape results in an unprivileged host user. User namespace remapping is not enabled by default in Docker because it breaks some workflows (volume permissions, Docker-in-Docker).

UTS namespace (CLONE_NEWUTS): Each container has its own hostname. The hostname is set during container creation and can be changed inside the container without affecting the host or other containers.

IPC namespace (CLONE_NEWIPC): Each container has its own System V IPC and POSIX message queues. Processes in different containers cannot share shared memory segments or message queues.

Cgroup namespace (CLONE_NEWCGROUP): Virtualizes the /proc/self/cgroup view. The container sees its own cgroup path as '/' instead of the real path (/docker/<container-id>). This prevents the container from seeing or manipulating other containers' cgroups.
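The UID translation described for the user namespace is plain arithmetic: host UID = range start + container UID, as long as the container UID falls inside the mapped range. A small sketch, assuming a subordinate-UID entry in the name:start:count format used by /etc/subuid (map_container_uid is a hypothetical helper name):

```bash
#!/bin/bash
# Translate a container UID to its host UID for a subordinate-UID entry
# of the form name:start:count (the format used in /etc/subuid).
map_container_uid() {
  local container_uid="$1" entry="$2"
  local start count
  start=$(echo "$entry" | cut -d: -f2)
  count=$(echo "$entry" | cut -d: -f3)
  if [ "$container_uid" -ge "$count" ]; then
    echo "UID $container_uid is outside the mapped range" >&2
    return 1
  fi
  echo $((start + container_uid))
}

map_container_uid 0 "dockremap:100000:65536"     # container root -> 100000
map_container_uid 1000 "dockremap:100000:65536"  # application user -> 101000
```

This is why a process that escapes as container root still lands on the host as an unprivileged UID when remapping is on.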

io/thecodeforge/namespace_inspection.sh · BASH

#!/bin/bash
# Inspect and compare namespaces for containers and the host

# ── Get a container's host PID ───────────────────────────────────────────────
CONTAINER_NAME="my-api"
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' $CONTAINER_NAME)
echo "Container $CONTAINER_NAME is PID $CONTAINER_PID on the host"

# ── List all namespaces for the container process ────────────────────────────
ls -la /proc/$CONTAINER_PID/ns/
# Output:
# lrwxrwxrwx ... ipc -> 'ipc:[4026532XXX]'
# lrwxrwxrwx ... mnt -> 'mnt:[4026532XXX]'
# lrwxrwxrwx ... net -> 'net:[4026532XXX]'
# lrwxrwxrwx ... pid -> 'pid:[4026532XXX]'
# lrwxrwxrwx ... user -> 'user:[4026531XXX]'
# lrwxrwxrwx ... uts -> 'uts:[4026532XXX]'

# ── Compare with host namespaces ─────────────────────────────────────────────
ls -la /proc/1/ns/
# The host PID 1 (systemd) has different namespace IDs than the container
# If namespace IDs match, the container shares that namespace with the host

# ── PID namespace: container sees its own PID tree ────────────────────────────
docker exec $CONTAINER_NAME ps aux
# PID 1 is the container's entrypoint process
# The container cannot see host processes

# On the host, the same process has a different PID:
ps -fp "$CONTAINER_PID"
# The host sees the real PID, the container sees PID 1

# ── Network namespace: container has its own network stack ───────────────────
docker exec $CONTAINER_NAME ip addr show
# Shows: lo (loopback) and eth0 (veth pair inside the container)

# On the host, inspect the veth pair:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0
# One end is in the container's net namespace, one end is on the docker0 bridge

# ── Enter a container's network namespace from the host ──────────────────────
sudo nsenter --net --target $CONTAINER_PID ip addr show
# Shows the same network config as docker exec, but from the host
# Useful for debugging without a shell inside the container

# ── Mount namespace: inspect the overlay2 filesystem ─────────────────────────
docker inspect --format '{{.GraphDriver.Data}}' $CONTAINER_NAME
# Shows: MergedDir, UpperDir, LowerDir, WorkDir
# MergedDir is what the container sees as /
# UpperDir is the writable layer (container-specific changes)
# LowerDir is the read-only image layers (colon-separated)

# ── User namespace: check if remapping is enabled ────────────────────────────
cat /etc/subuid
# If userns-remap is enabled: dockremap:100000:65536
# This maps container UID 0 to host UID 100000

# ── UTS namespace: container has its own hostname ─────────────────────────────
docker exec $CONTAINER_NAME hostname
# Shows the container's hostname (usually the container ID)

hostname
# Shows the host's hostname — different from the container

# ── IPC namespace: container has its own IPC resources ───────────────────────
docker exec $CONTAINER_NAME ipcs
# Shows only IPC resources created inside the container

ipcs
# Shows host IPC resources — not visible inside the container

Output

# Container PID on host:
Container my-api is PID 5679 on the host

# Container namespaces:
lrwxrwxrwx 1 root root 0 Jan 15 10:23 ipc -> 'ipc:[4026532847]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 mnt -> 'mnt:[4026532849]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 net -> 'net:[4026532851]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 pid -> 'pid:[4026532852]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 uts -> 'uts:[4026532853]'

# Container process list:
PID USER COMMAND
1 node node dist/index.js

# Container network:
1: lo: <LOOPBACK,UP> mtu 65536
4: eth0@if7: <BROADCAST,MULTICAST,UP> mtu 1500
inet 172.17.0.2/16

# Overlay2 filesystem:
MergedDir: /var/lib/docker/overlay2/abc123/merged
UpperDir: /var/lib/docker/overlay2/abc123/diff
LowerDir: /var/lib/docker/overlay2/def456/diff:/var/lib/docker/overlay2/ghi789/diff

Namespaces as Tinted Windows

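--pid=host: the container sees ALL host processes. PID 1 as seen from inside is the host's PID 1 (systemd), and the container's entrypoint gets an ordinary host PID.
--net=host: the container shares the host's network stack. It can bind to any host port.
--ipc=host: the container can access host shared memory segments. Potential data leak.
Each flag removes one isolation layer. --privileged works differently: the namespaces stay in place, but the container gets every capability, access to host devices, and no seccomp or AppArmor confinement, so the remaining isolation is easy to bypass.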

Production Insight

The --pid=host and --net=host flags are debugging tools, not production configurations. --pid=host exposes all host processes to the container — a compromised container can kill any process on the host. --net=host removes network isolation — a container can sniff traffic from other containers. Always use the default namespace isolation in production. If you need host networking for performance, use it only on dedicated hosts.
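A quick way to audit for leftover --pid=host or --net=host flags is to compare namespace identifiers: two processes share a namespace exactly when the links under /proc/<pid>/ns/ point at the same ID. The sketch below assumes a Linux host with /proc mounted. shared_namespaces is a hypothetical helper, and reading another user's /proc/<pid>/ns entries generally requires root.

```bash
#!/bin/bash
# Print the namespace types that two processes share, by comparing the
# namespace identifiers exposed under /proc/<pid>/ns/.
shared_namespaces() {
  local pid_a="$1" pid_b="$2"
  local ns id_a id_b
  for ns in cgroup ipc mnt net pid user uts; do
    # Skip types that are unreadable or not supported by this kernel.
    id_a=$(readlink "/proc/$pid_a/ns/$ns" 2>/dev/null) || continue
    id_b=$(readlink "/proc/$pid_b/ns/$ns" 2>/dev/null) || continue
    if [ "$id_a" = "$id_b" ]; then
      echo "$ns"
    fi
  done
}

# Demo: a process trivially shares every namespace with itself.
shared_namespaces "$$" "$$"
```

In practice, run it as root with PID 1 and the container's host PID (from docker inspect --format '{{.State.Pid}}' <container>). With default settings, expect at most user (and cgroup on some hosts) to be shared. If pid or net shows up, the container was started with the corresponding host flag.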

Key Takeaway

Namespaces are the isolation mechanism. PID namespace isolates the process tree. Network namespace isolates the network stack. Mount namespace isolates the filesystem. User namespace maps UIDs. Every container is a Linux process running inside these namespaces. Disabling a namespace (with --pid=host, --net=host) removes that isolation layer.

Namespace Configuration Decisions

If: Standard production container → Use: The default namespace set (no --pid=host, --net=host, or --ipc=host). Maximum isolation.

If: Monitoring container that needs host process visibility → Use: --pid=host only on dedicated monitoring hosts. Never on shared hosts.

If: Performance-critical proxy or load balancer → Use: --net=host to avoid NAT overhead. Accept the security trade-off.

If: Debugging a container issue interactively → Use: nsenter from the host: nsenter --net --target <pid> bash. No need to modify the container.

cgroups: Resource Limits That Prevent Noisy Neighbors

While namespaces provide isolation (what a container can see), cgroups provide resource limits (how much a container can consume). Without cgroups, a container with a memory leak can consume all host RAM and trigger the OOM killer on unrelated containers.

cgroup v1 vs v2: Linux has two cgroup versions. cgroup v1 has separate hierarchies for each resource controller (cpu, memory, blkio, pids). cgroup v2 has a unified hierarchy. Docker supports both, but cgroup v2 is the default on newer Linux distributions (Ubuntu 22.04+, Fedora 31+, RHEL 9+).
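Because the file names and paths differ between the two versions, it helps to check which one a host runs before reading any cgroup file. A common check is the filesystem type mounted at /sys/fs/cgroup: cgroup2fs indicates the unified v2 hierarchy, tmpfs indicates v1. A minimal sketch, assuming a stat that supports -f and -c (GNU coreutils does); classify_cgroup_fs is an illustrative name:

```bash
#!/bin/bash
# Map the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
classify_cgroup_fs() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # per-controller hierarchies mounted below
    *)         echo "unknown" ;;
  esac
}

# stat -f reports on the filesystem; %T prints its type as a name.
fs_type=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "none")
echo "cgroup version: $(classify_cgroup_fs "$fs_type")"
```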

CPU controller: Limits CPU usage in two ways:
- cpu.shares: relative weight. Default is 1024. A container with 2048 gets twice the CPU of a container with 1024 when there is contention. Does not limit absolute CPU usage.
- cpu.cfs_quota_us / cpu.cfs_period_us: absolute limit. --cpus=1.0 sets a quota of 100ms per 100ms period, limiting the container to one CPU core.

Memory controller: Limits memory usage:
- memory.limit_in_bytes: hard limit. If the container exceeds this, the kernel OOM-kills the process.
- memory.soft_limit_in_bytes: soft limit. The kernel tries to reclaim memory from the container before other containers, but does not kill it.
- memory.oom_control: controls whether the OOM killer is invoked or the container is frozen.

blkio controller: Limits block device I/O:
- blkio.throttle.read_bps_device: limits read throughput in bytes per second.
- blkio.throttle.write_bps_device: limits write throughput.

pids controller: Limits the number of processes:
- pids.max: maximum number of processes (including threads) the container can create. Prevents fork bombs.
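The relationship between Docker's flags and the numbers that end up in these cgroup files is simple arithmetic. The sketch below reproduces it for --cpus and --memory, assuming the default 100000 microsecond CFS period and binary (1024-based) size suffixes. Both function names are hypothetical helpers, not part of Docker.

```bash
#!/bin/bash
# Convert Docker resource flags into the values written to cgroup files.

# --cpus=N  ->  CFS quota in microseconds for a given period (default 100000).
cpus_to_quota() {
  local cpus="$1" period="${2:-100000}"
  awk -v c="$cpus" -v p="$period" 'BEGIN { printf "%d\n", c * p + 0.5 }'
}

# --memory=512m style values -> bytes (suffixes b, k, m, g; powers of 1024).
memory_to_bytes() {
  local value="$1" number suffix
  number=${value%[bkmgBKMG]}
  suffix=${value#"$number"}
  case "$suffix" in
    ""|b|B) echo "$number" ;;
    k|K)    echo $((number * 1024)) ;;
    m|M)    echo $((number * 1024 * 1024)) ;;
    g|G)    echo $((number * 1024 * 1024 * 1024)) ;;
  esac
}

quota=$(cpus_to_quota 1.0)
echo "cgroup v1 cpu.cfs_quota_us: $quota"
echo "cgroup v2 cpu.max:          $quota 100000"
echo "memory limit for 512m:      $(memory_to_bytes 512m)"
```

So --cpus=1.0 becomes a quota of 100000 per 100000 period, and --memory=512m becomes 536870912 bytes. When a container behaves as if it had a different limit than intended, compare these computed values with what the cgroup files report.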

The noisy neighbor problem: Without cgroup limits, one container can starve others. A container with a CPU-bound loop consumes 100% of all CPUs. A container with a memory leak consumes all host RAM, triggering the kernel OOM killer, which may kill unrelated containers. cgroup limits prevent this by enforcing per-container resource ceilings.

io/thecodeforge/cgroup_inspection.sh · BASH

#!/bin/bash
# Inspect and configure cgroup resource limits for containers

# ── Get a container's cgroup path ────────────────────────────────────────────
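# cgroup directories are named after the full 64-character container ID,
# so disable truncation
CONTAINER_ID=$(docker ps -q --no-trunc | head -1)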

# cgroup v1 path:
ls /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/
# cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us

# cgroup v2 path:
ls /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/
# cpu.max, memory.max, pids.max

# ── Check CPU limits ─────────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.shares
# Default: 1024 (1 CPU share). Set with --cpu-shares
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_quota_us
# -1 means no limit. Set with --cpus=1.0 (becomes 100000)
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_period_us
# Default: 100000 (100ms)

# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# Format: quota period. Example: 100000 100000 (1 CPU limit)

# ── Check memory limits ─────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.limit_in_bytes
# 9223372036854771712 means no limit (max int64 rounded down to the page size)
# Set with --memory=512m (becomes 536870912)
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.usage_in_bytes
# Current memory usage in bytes

# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# max means no limit. Set with --memory=512m
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
# Current memory usage

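# ── Check OOM events ────────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.oom_control
# oom_kill_disable: 0 (OOM killer enabled)
# under_oom: 0 (not currently under OOM pressure)

# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.events
# The oom_kill line counts processes in this cgroup killed by the OOM killer

# Check kernel OOM log:
dmesg | grep -i 'oom\|killed process' | tail -10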

# ── Check PID limits ────────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.max
# max means no limit. Set with --pids-limit=256
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.current
# Current number of processes

# ── Run a container with resource limits ─────────────────────────────────────
# Setting --memory-swap equal to --memory disables swap for the container
docker run -d \
  --name resource-limited \
  --cpus=1.0 \
  --memory=512m \
  --memory-swap=512m \
  --pids-limit=256 \
  alpine:3.19 sleep 3600

# Verify the limits:
docker inspect resource-limited --format '{{.HostConfig.NanoCpus}}'
# 1000000000 = 1 CPU (in units of 1e-9 CPUs)

docker inspect resource-limited --format '{{.HostConfig.Memory}}'
# 536870912 = 512MB (in bytes)

docker stats resource-limited --no-stream
# Shows: MEM USAGE / LIMIT, for example 43.2MiB / 512MiB

Output

# CPU limits (cgroup v1):
1024
100000

# Memory limits:
536870912
45219840

# PID limits:
256

# docker stats:
NAME              CPU %   MEM USAGE / LIMIT   MEM %
resource-limited  0.00%   43.2MiB / 512MiB    8.44%

cgroups as Budget Limits

Without a memory limit, a container can consume all host RAM.
The kernel OOM killer then selects a process to kill — it may kill an unrelated container, not the leaking one.
With --memory=512m, the kernel kills only the container that exceeded its limit.
Without limits, the OOM killer uses a heuristic (oom_score) that may choose the wrong victim.

Production Insight

The OOM killer's victim selection heuristic (oom_score) favors killing processes with high memory usage and low importance. But it does not know which container is the problem — it sees host PIDs, not container boundaries. Without cgroup memory limits, a memory leak in container A can cause the OOM killer to kill container B (which happens to have a higher oom_score). Always set --memory on every production container to ensure the OOM killer targets the right process.

Key Takeaway

cgroups limit how much CPU, memory, I/O, and processes a container can consume. Without cgroup limits, one container can starve others (noisy neighbor problem). Always set --memory in production — without it, the OOM killer may kill the wrong container. Use --cpus for CPU limits and --pids-limit to prevent fork bombs.

Resource Limit Strategy

If: Stateless web API with predictable resource usage → Use: Set --cpus and --memory based on load testing. Use --memory-swap=limit to disable swap.

If: Database or cache with memory-based eviction → Use: Set --memory to the expected working set size. Do not set --memory-swap (allow swap for eviction).

If: Worker process that may fork subprocesses → Use: Set --pids-limit=256 to prevent fork bombs. Set --cpus to limit total CPU across all forks.

If: Development/testing environment → Use: Skip resource limits. They add complexity without benefit in non-production environments.

Union Filesystem and overlay2: How Docker Images Work Without Copying

The union filesystem is the reason Docker images are lightweight and containers start in milliseconds. Instead of copying files, Docker overlays multiple read-only directories and presents them as a single merged filesystem.

overlay2 driver: The default storage driver in modern Docker. It stacks directories (layers) and presents a merged view. Each layer is a directory on the host filesystem. The bottom layers are read-only (image layers). The top layer is writable (container-specific changes).

How it works: When a container reads a file, overlay2 checks the top (writable) layer first. If the file exists there, it is returned. If not, overlay2 checks each lower layer in order until the file is found. When a container writes a file, the write goes to the top layer only; lower layers are never modified. When a container deletes a file that lives in a lower layer, overlay2 creates a whiteout in the top layer: a character device with device number 0/0 that has the same name as the deleted file and masks it. (The .wh. filename prefix is how whiteouts are encoded inside image layer tarballs, not how they appear in the upperdir on disk.)

The four directories:
- lowerdir: colon-separated list of read-only image layers (topmost layer first)
- upperdir: the writable layer (container-specific changes)
- workdir: overlay2 internal working directory (must be empty and on the same filesystem as upperdir; used for atomic operations)
- merged: the combined view that the container sees as its root filesystem
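The lookup order can be imitated in userspace with plain directories standing in for layers. This is a sketch of the idea only, not how the kernel implements overlayfs: overlay_lookup is a hypothetical function, and it treats any character device as a whiteout, whereas the kernel requires device number 0/0.

```shell
#!/bin/sh
# overlay_lookup LAYERS RELPATH   (hypothetical helper, for illustration only)
#   LAYERS:  colon-separated directories, topmost first (upperdir, then each lowerdir)
#   Prints the path in the first layer that provides RELPATH; fails if none does.
overlay_lookup() (
  layers=$1
  rel=$2
  IFS=:
  for layer in $layers; do
    candidate="$layer/$rel"
    # A character device under this name acts as a whiteout: stop searching.
    if [ -c "$candidate" ]; then
      exit 1
    fi
    if [ -e "$candidate" ]; then
      printf '%s\n' "$candidate"
      exit 0
    fi
  done
  exit 1
)

# Demo with throwaway directories.
demo=$(mktemp -d)
mkdir -p "$demo/upper" "$demo/lower1" "$demo/lower2"
echo "from lower2" > "$demo/lower2/app.conf"
echo "from lower1" > "$demo/lower1/app.conf"
echo "only in lower2" > "$demo/lower2/base.txt"

cat "$(overlay_lookup "$demo/upper:$demo/lower1:$demo/lower2" app.conf)"   # from lower1
cat "$(overlay_lookup "$demo/upper:$demo/lower1:$demo/lower2" base.txt)"   # only in lower2
rm -rf "$demo"
```

The first hit wins, which is why a file written to the upperdir hides every version of it in the image layers below.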

Performance implications: The layer lookup cost is paid when a file is opened, not on every read. Once a file is open, reads and writes go straight to the underlying host filesystem at native speed. The exception is the first write to a file that only exists in a lower layer, which triggers a copy-up. The difference is negligible for most workloads but can matter for I/O-intensive applications (databases, search engines).

The copy-up problem: When a container modifies a file from a lower layer, overlay2 must first copy the entire file to the upperdir (copy-up), then modify the copy. For large files (multi-GB database files), copy-up can cause a noticeable delay on first write. This is why databases should keep their data on volumes or bind mounts instead of the container's overlay2 filesystem.
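Copy-up is easy to picture with two ordinary directories. The sketch below imitates the behavior in userspace; overlay_append is a hypothetical function and the paths are made up for the demo. The kernel does the equivalent transparently inside overlayfs.

```shell
#!/bin/sh
# overlay_append UPPER LOWER RELPATH TEXT   (hypothetical helper, for illustration only)
#   Appends TEXT to RELPATH the way a container write would land:
#   the file is first copied into UPPER if it only exists in LOWER.
overlay_append() (
  upper=$1
  lower=$2
  rel=$3
  text=$4
  mkdir -p "$(dirname "$upper/$rel")"
  if [ ! -e "$upper/$rel" ] && [ -e "$lower/$rel" ]; then
    # Copy-up: the whole file is copied, however small the change is.
    cp -p "$lower/$rel" "$upper/$rel"
    echo "copy-up: $rel" >&2
  fi
  printf '%s\n' "$text" >> "$upper/$rel"
)

# Demo with throwaway directories.
demo=$(mktemp -d)
mkdir -p "$demo/upper" "$demo/lower/etc"
echo "port=5432" > "$demo/lower/etc/db.conf"

overlay_append "$demo/upper" "$demo/lower" etc/db.conf "max_connections=200"
cat "$demo/lower/etc/db.conf"   # still just: port=5432
cat "$demo/upper/etc/db.conf"   # port=5432, then max_connections=200
rm -rf "$demo"
```

A one-line append cost a full file copy here. With a config file that is trivial; with a multi-GB data file it is the first-write stall described above.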

io/thecodeforge/overlay2_inspection.shBASH

#!/bin/bash
# Inspect the overlay2 filesystem for a running container

# ── Get the overlay2 paths for a container ───────────────────────────────────
CONTAINER_ID=$(docker ps -q | head -1)
[ -n "$CONTAINER_ID" ] || { echo "No running containers found" >&2; exit 1; }

GRAPH_DATA=$(docker inspect --format '{{json .GraphDriver.Data}}' "$CONTAINER_ID")
echo "$GRAPH_DATA" | python3 -m json.tool

# Extract individual paths:
MERGED_DIR=$(docker inspect --format '{{.GraphDriver.Data.MergedDir}}' "$CONTAINER_ID")
UPPER_DIR=$(docker inspect --format '{{.GraphDriver.Data.UpperDir}}' "$CONTAINER_ID")
LOWER_DIR=$(docker inspect --format '{{.GraphDriver.Data.LowerDir}}' "$CONTAINER_ID")
WORK_DIR=$(docker inspect --format '{{.GraphDriver.Data.WorkDir}}' "$CONTAINER_ID")

echo "Merged (container sees this as /): $MERGED_DIR"
echo "Upper (writable layer): $UPPER_DIR"
echo "Lower (read-only layers): $LOWER_DIR"
echo "Work (overlay2 internal): $WORK_DIR"

# ── Inspect the writable layer (upperdir) ────────────────────────────────────
# /var/lib/docker is root-owned, so the listings below use sudo
sudo ls -la "$UPPER_DIR/"
# Shows files the container has created or modified
# Character devices with device number 0, 0 are whiteouts (files deleted from lower layers)

# ── Inspect the merged view ──────────────────────────────────────────────────
sudo ls -la "$MERGED_DIR/"
# This is what the container sees as its root filesystem
# It is the combination of all lower layers + the upper layer

# ── Demonstrate writes landing in the upperdir ───────────────────────────────
# Create a file in the container
docker exec "$CONTAINER_ID" sh -c 'echo "hello" > /tmp/test-file'

# The file appears in the writable layer (upperdir):
sudo ls -la "$UPPER_DIR/tmp/test-file"
# The file is in the upper layer, not in any lower layer

# ── Demonstrate the whiteout behavior ────────────────────────────────────────
# Delete a file that exists in a lower layer.
# /etc/issue is an assumption about the image. /etc/hostname will not work:
# Docker bind-mounts it into the container, so rm fails with "Device or resource busy".
docker exec "$CONTAINER_ID" rm /etc/issue

# A whiteout appears in the upper layer under the same name:
sudo ls -la "$UPPER_DIR/etc/issue"
# This character device (0, 0) tells overlay2 to hide the lower layer's file

# ── Check the number of layers in an image ───────────────────────────────────
IMAGE=$(docker inspect --format '{{.Config.Image}}' "$CONTAINER_ID")
docker image inspect --format '{{len .RootFS.Layers}} layers' "$IMAGE"
# Each layer is a directory under /var/lib/docker/overlay2/

# ── Check disk usage per layer ───────────────────────────────────────────────
sudo sh -c 'du -sh /var/lib/docker/overlay2/* | sort -hr | head -10'
# Shows disk usage for each layer directory

# ── Time reads and writes through the overlay2 filesystem ────────────────────
# Write test (assumes /tmp is not a tmpfs or volume mount):
time docker exec "$CONTAINER_ID" dd if=/dev/zero of=/tmp/test bs=1M count=100
# The data lands in the upperdir on the host filesystem

# Read test:
time docker exec "$CONTAINER_ID" dd if=/tmp/test of=/dev/null bs=1M
# Likely served from the page cache because the file was just written; timings vary by host

Output

# Overlay2 paths (IDs shortened for illustration):
{
    "LowerDir": "/var/lib/docker/overlay2/def456/diff:/var/lib/docker/overlay2/ghi789/diff",
    "MergedDir": "/var/lib/docker/overlay2/abc123/merged",
    "UpperDir": "/var/lib/docker/overlay2/abc123/diff",
    "WorkDir": "/var/lib/docker/overlay2/abc123/work"
}

# Writable layer contents:
drwxr-xr-x 2 root root 4096 Jan 15 10:25 etc
drwxr-xr-x 4 root root 4096 Jan 15 10:25 tmp

# Whiteout for the deleted file:
c--------- 1 root root 0, 0 Jan 15 10:25 /var/lib/docker/overlay2/abc123/diff/etc/issue

# The 0, 0 character device hides /etc/issue from the lower layers

Overlay2 as Transparent Acetate Sheets

Each layer is additive — deleting a file in layer N+1 does not remove it from layer N.
The delete creates a whiteout marker in layer N+1, but the data still exists in layer N.
The only way to truly remove data is to not include it in any layer (use multi-stage builds or .dockerignore).
This is why RUN apt-get install ... && rm -rf /var/lib/apt/lists/* must be in the same RUN — separate RUNs create separate layers.

Production Insight

The copy-up problem is one of the most common performance issues with overlay2. When a container writes to a file that exists in a lower layer (e.g., modifying a config file from the image), overlay2 must first copy the entire file to the upperdir. For multi-GB database files, this can cause seconds of latency on first write. The fix: put database data directories on named volumes or bind mounts instead of writing to the overlay2 filesystem.

Key Takeaway

overlay2 stacks read-only image layers with a writable top layer. Reads check the top layer first, then fall through to lower layers. Writes always go to the top layer. Deletes create whiteout files. Deleting files in a later layer does not reclaim space — the data persists in the earlier layer. Use volumes for databases to avoid the copy-up overhead.

Filesystem Strategy by Workload

If: Stateless application (API, web server)

→

Use: Use overlay2 (default). The writable layer is sufficient for temporary files and logs.

If: Database or persistent storage

→

Use: Use named volumes or bind mounts. Bypass overlay2 entirely. Avoid copy-up overhead.

If: Build process creating many temporary files

→

Use: Use tmpfs mounts for temporary data. Avoids disk I/O entirely.

If: High-security environment

→

Use: Use --read-only flag to make the overlay2 filesystem read-only. All writes must go to explicit tmpfs or volume mounts.

The Container Lifecycle: From Clone to Exit — Every Kernel Call

When you run docker run, a precise sequence of kernel calls creates the container. Understanding this sequence is the key to debugging startup failures, permission errors, and namespace issues.

Step 1: Image pull and unpack. The Docker daemon pulls the image layers from the registry and unpacks them into /var/lib/docker/overlay2/. Each layer is a directory. If the layers already exist locally (cached), this step is skipped.

Step 2: Create the OCI runtime spec. dockerd builds the OCI runtime specification and hands it to containerd, which writes it into the container's bundle as config.json. This JSON file defines:
- The namespaces to create (PID, network, mount, UTS, IPC; user only when user-namespace remapping is enabled)
- The cgroup limits (CPU, memory, pids)
- The root filesystem path (the overlay2 merged directory)
- The environment variables, working directory, and command to execute
- The mount points (volumes, /proc, /sys, /dev)

Step 3: runc creates the container. runc reads config.json and makes the following kernel calls:
- clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC): creates a new process in new namespaces
- sethostname(): sets the container's hostname (UTS namespace)
- mount(): mounts /proc, /sys, /dev inside the container's mount namespace
- pivot_root(): changes the container's root directory to the overlay2 merged directory
- chdir("/"): moves to the new root
- setgid() / setuid(): drops privileges to the container's user (if non-root); the group change comes first, because a process that has already given up root can no longer change its GID
- execve(): replaces runc's init process (the child created by clone) with the container's entrypoint command

Step 4: runc exits, containerd monitors. execve() turns runc's child into the container process, and the parent runc process exits. containerd-shim stays running for the lifetime of the container as its parent: it captures stdout/stderr, forwards signals, and reports the exit status to containerd.

Step 5: The container process runs. The application process is now running inside a set of namespaces with cgroup limits and an overlay2 filesystem. It has PID 1 inside the container's PID namespace. On the host, it has a real PID visible in ps aux.
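Whether two processes share a namespace can be checked from /proc, because each namespace has an identifier exposed as a symlink under /proc/PID/ns/. The two helpers below are a minimal sketch with hypothetical names (ns_id, same_ns); they assume a Linux host with /proc mounted, and reading another user's process needs root.

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier, e.g. "pid:[4026531836]".
#   TYPE is one of the entries in /proc/PID/ns (pid, net, mnt, uts, ipc, ...).
ns_id() {
  readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed when both processes are in the same TYPE namespace.
same_ns() {
  ns_a=$(ns_id "$1" "$3") || return 2
  ns_b=$(ns_id "$2" "$3") || return 2
  [ "$ns_a" = "$ns_b" ]
}

# A child shell started normally inherits our namespaces:
child_ns=$(sh -c 'readlink /proc/$$/ns/pid')
if [ "$child_ns" = "$(ns_id $$ pid)" ]; then
  echo "child shell shares our PID namespace"
fi
```

On a Docker host, running same_ns as root with PID 1 and a container's host PID (from docker inspect --format '{{.State.Pid}}') should fail for pid and net on a default container, and succeed for pid when the container was started with --pid=host.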

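The pause process: A plain Docker container does not have one. "pause" is a Kubernetes construct: every pod runs a tiny pause container that holds the pod's shared namespaces (network, IPC) open, so the application containers in the pod can crash and restart without losing the pod's network identity. With standalone Docker, the only long-lived helper is containerd-shim, and the container's namespaces disappear when its last process exits; a restart creates fresh ones. On a Kubernetes node, ps aux | grep pause shows one pause process per pod.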

io/thecodeforge/container_lifecycle.shBASH

#!/bin/bash
# Trace the complete container lifecycle from clone to exec

# ── Step 1: Pull and inspect image layers ────────────────────────────────────
docker pull alpine:3.19

# Inspect the image layers:
docker inspect alpine:3.19 --format '{{json .RootFS.Layers}}' | python3 -m json.tool
# Each entry is a layer (SHA256 digest)

# Find the layers on disk:
sudo ls /var/lib/docker/overlay2/ | head -5
# Each directory is an image layer or a container's writable layer.
# A shared layer is stored once and listed in the lowerdir of every container that uses it.

# ── Step 2: Create a container and inspect the OCI spec ──────────────────────
# Use a long-running command so the process is still alive when we inspect it:
docker create --name lifecycle-demo alpine:3.19 sleep 300
docker start lifecycle-demo

# The bundle (and its config.json) exists only while the container is running.
# The path below is the usual location when Docker drives containerd; it is keyed by the full container ID.
CONTAINER_ID=$(docker inspect --format '{{.Id}}' lifecycle-demo)
SPEC=/run/containerd/io.containerd.runtime.v2.task/moby/$CONTAINER_ID/config.json

# Inspect the spec:
sudo cat "$SPEC" | python3 -m json.tool | head -50
# Shows: namespaces, mounts, cgroups, process config, root filesystem

# ── Step 3: Trace runc's kernel calls ────────────────────────────────────────
# Run runc directly under strace (replace /path/to/bundle with an OCI bundle directory):
sudo strace -f -e trace=clone,clone3,unshare,sethostname,mount,pivot_root,setuid,setgid,execve \
  -o /tmp/runc-trace.log \
  runc run --bundle /path/to/bundle test-trace

# The trace looks roughly like this (abridged and illustrative):
# clone3({flags=CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|..., ...}) = 12345
# sethostname("container-id", 12) = 0
# mount("proc", "/proc", "proc", ...) = 0
# mount("sysfs", "/sys", "sysfs", ...) = 0
# pivot_root(".", "/old-root") = 0
# setgid(1000) = 0
# setuid(1000) = 0
# execve("/bin/sh", ["sh"], ...) = 0

# ── Step 4: Find the container process and its shim on the host ──────────────
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' lifecycle-demo)
echo "Container process PID: $CONTAINER_PID"

# The parent of the container process is containerd-shim, not runc (runc has exited):
SHIM_PID=$(ps -o ppid= -p "$CONTAINER_PID" | tr -d ' ')
ps -o pid,cmd -p "$SHIM_PID"

# The container runs in its own namespaces, so these two identifiers differ:
sudo readlink /proc/1/ns/net "/proc/$CONTAINER_PID/ns/net"

# ── Step 5: Watch the container process on the host ──────────────────────────
ps -o user,pid,stat,cmd -p "$CONTAINER_PID"
# This is the REAL process on the host, running inside namespaces

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f lifecycle-demo

Output

# Image layers:
[
    "sha256:abc123def456..."
]

# Container process on host:
Container process PID: 5679

# Shim process (parent of the container process):
  PID CMD
 5650 /usr/bin/containerd-shim-runc-v2 -namespace moby -id ...

# Host process:
USER   PID STAT CMD
root  5679 Ss   sleep 300

Container Creation as Building a Room

runc's job is to create the container, not to manage it. Once the container process has been started with execve(), runc exits.
containerd (via containerd-shim) monitors the container process, captures output, and handles signals.
No pause process is involved in plain Docker; that helper belongs to Kubernetes pods, where it holds the pod's namespaces open across container restarts.
This separation allows containerd to manage the lifecycle without being PID 1 in the container.

Production Insight

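containerd-shim is what keeps a container independent of the daemons above it. Because the shim, not dockerd or containerd, is the parent of the container process, containerd can be restarted without killing containers, and so can dockerd when live-restore is enabled. The shim does not hold namespaces open, though: when the container's PID 1 exits, the kernel tears down the PID namespace and kills every remaining process in it, and a restart creates new namespaces from scratch. The pause process that many write-ups describe exists only in Kubernetes, where one pause container per pod keeps the pod's network and IPC namespaces alive while application containers restart.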

Key Takeaway

Container creation is a sequence of kernel calls: clone (namespaces), mount (filesystem), pivot_root (change root), setgid/setuid (drop privileges), execve (start application). runc creates the container and exits. containerd-shim stays behind as the parent, and containerd monitors the process. Plain Docker has no pause process; that is a Kubernetes pod mechanism. Every container is a real Linux process visible in ps aux on the host.

Docker Networking: How Packets Escape the Container Without a Firewall Meltdown

Docker networking isn't magic. It's a carefully orchestrated lie. By default, Docker creates a bridge network (docker0) on the host. Every container gets a virtual Ethernet pair (veth). One end lives inside the container's namespace. The other plugs into the bridge. Packets traverse the bridge, get NAT'd through iptables, and hit your physical NIC.

The mistake most junior engineers make is thinking containers are fully isolated at the network layer. They're not. A container can ARP scan the entire bridge subnet. You might expect inter-container communication to be off by default, but Docker ships with --icc=true. That means containers on the same bridge can talk to each other unhindered.

For production: never use the default bridge. Create user-defined networks. They give you built-in DNS resolution (no more deprecated --link hacks) and proper isolation between container groups. Set --iptables=false only when you know exactly what you're doing: Docker then stops managing iptables, and published ports and outbound NAT stop working until you write those rules yourself. Leave it at the default and Docker writes the rules for you, so review them before your security team starts to twitch.

AuditDefaultNetworking.ymlYAML

# io.thecodeforge - devops tutorial

name: audit-default-docker-bridge
on:
  schedule:
    - cron: '0 6 * * 1'

jobs:
  check-bridge:
    runs-on: ubuntu-22.04
    steps:
      - name: Inspect default bridge
        run: |
          docker network inspect bridge --format '{{json .IPAM.Config}}'
          echo "---"
          # The icc setting is a bridge option, not part of `docker info`
          ICC=$(docker network inspect bridge --format '{{index .Options "com.docker.network.bridge.enable_icc"}}')
          echo "icc (inter-container comms): $ICC"
      - name: List iptables for Docker
        run: |
          sudo iptables -t nat -L DOCKER -n

Output

[{"Subnet":"172.17.0.0/16","Gateway":"172.17.0.1"}]
---
icc (inter-container comms): true
Chain DOCKER (2 references)
target prot opt source destination
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080 to:172.17.0.2:80

Production Trap: Default Bridge is a Party Subnet

Containers on the default bridge can talk to each other on any port. No encryption. No firewall. If one gets popped, they all get popped. Put each application on its own user-defined network, and create networks that need no outbound access with --internal.

Key Takeaway

Docker's default bridge is for development only. User-defined networks give you DNS and isolation. Verify your iptables rules are not accidentally wide open every time you add a --publish.
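One way to make that check routine is to scan the DOCKER nat chain for DNAT rules whose destination is 0.0.0.0/0, meaning the port was published on every host address. The filter below is a sketch with a hypothetical name (open_publishes); it assumes the column layout of iptables -t nat -L DOCKER -n and reads the listing from stdin, so it can be tried on sample text first.

```shell
#!/bin/sh
# open_publishes: read `iptables -t nat -L DOCKER -n` output on stdin and print
# every DNAT rule that accepts traffic for any host address (destination 0.0.0.0/0).
# Columns assumed: target prot opt source destination [match options...]
open_publishes() {
  awk '$1 == "DNAT" && $5 == "0.0.0.0/0" {
         port = ""; dest = ""
         for (i = 6; i <= NF; i++) {
           if ($i ~ /^dpt:/) port = substr($i, 5)   # host port
           if ($i ~ /^to:/)  dest = substr($i, 4)   # container ip:port
         }
         if (port != "") print $2 "/" port " -> " dest
       }'
}

# Sample listing: one port published everywhere, one bound to 127.0.0.1 only.
open_publishes <<'EOF'
Chain DOCKER (2 references)
target     prot opt source               destination
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.2:80
DNAT       tcp  --  0.0.0.0/0            127.0.0.1            tcp dpt:9090 to:172.17.0.3:9090
EOF
# prints: tcp/8080 -> 172.17.0.2:80
```

On a host you would pipe the real listing in: sudo iptables -t nat -L DOCKER -n | open_publishes. Anything it prints is reachable on every interface of the host; publishing with an explicit address such as -p 127.0.0.1:9090:9090 keeps a port off that list.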

The Image Layer Cache: Why Your Docker Builds Take 45 Minutes (And How to Fix It)

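Every Dockerfile instruction that changes the filesystem (RUN, COPY, ADD) creates a layer; the rest only record metadata. Each layer is a delta of the filesystem. When you rebuild, Docker checks whether the instruction itself, the layer beneath it, and (for COPY and ADD) the checksums of the copied files have changed. If not, it uses the cached layer. Sounds great. Breaks constantly.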

The problem? Most people put COPY . . before running apt-get update or npm install. Any file change in the source dir invalidates every subsequent layer. Now you're reinstalling all system packages and rebuilding all node_modules. Every. Single. Build.

Fix your layer ordering: pin your package manager dependencies first. Copy package.json and requirements.txt before the rest of your source. Run your install commands immediately after. That way, changes to your application code don't trigger a full dependency reinstall. Use --mount=type=cache for buildkit to persist apt and npm caches across builds. Multi-stage builds aren't just for final image size — they're for caching compiled artifacts so you don't recompile the entire Go project when you changed one string in a config file.
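Why ordering matters becomes obvious once you model the cache as a chain: each step's key depends on its parent's key, so one change invalidates everything after it. The toy model below is a sketch under that assumption, not Docker's real cache implementation; cache_keys and rebuilt_steps are hypothetical names, and the content fingerprints are made-up labels.

```shell
#!/bin/sh
# cache_keys: read steps on stdin, one per line, as "INSTRUCTION|CONTENT_FINGERPRINT".
# Each key is a checksum of (parent key + instruction + fingerprint of copied files).
cache_keys() {
  parent=root
  while IFS='|' read -r instruction fingerprint; do
    parent=$(printf '%s\n%s\n%s\n' "$parent" "$instruction" "$fingerprint" | cksum | cut -d' ' -f1)
    printf '%s  %s\n' "$parent" "$instruction"
  done
}

# rebuilt_steps OLD_KEYS NEW_KEYS: print the instructions whose key is missing from OLD_KEYS.
rebuilt_steps() {
  printf '%s\n' "$2" | while IFS= read -r line; do
    printf '%s\n' "$1" | grep -qxF -- "$line" || printf '%s\n' "${line#*  }"
  done
}

# Manifests first: a source change (src-v1 -> src-v2) only rebuilds the last step.
before=$(printf '%s\n' 'COPY package.json|manifest-v1' 'RUN yarn install|' 'COPY src|src-v1' | cache_keys)
after=$(printf '%s\n' 'COPY package.json|manifest-v1' 'RUN yarn install|' 'COPY src|src-v2' | cache_keys)
rebuilt_steps "$before" "$after"     # COPY src

# COPY . . first: the same change also forces the dependency install to rerun.
before=$(printf '%s\n' 'COPY . .|tree-v1' 'RUN yarn install|' | cache_keys)
after=$(printf '%s\n' 'COPY . .|tree-v2' 'RUN yarn install|' | cache_keys)
rebuilt_steps "$before" "$after"     # COPY . . and RUN yarn install
```

The RUN yarn install line is byte-for-byte identical in both runs of the second example, yet it still rebuilds, because its parent changed. That is the whole argument for copying manifests before source.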

OptimizedDockerfile.ymlYAML

# io.thecodeforge - devops tutorial

FROM node:20-alpine AS builder
WORKDIR /app

# Dependency layers: cached unless package.json or yarn.lock changes
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile

# Source layers: invalidated only by code changes
COPY src/ ./src/
RUN yarn build

FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules

EXPOSE 3000
CMD ["node", "dist/server.js"]

# Build with: DOCKER_BUILDKIT=1 docker build --cache-from myapp:cache -t myapp:latest .

Output

# Classic builder output shown; BuildKit marks reused steps with CACHED instead
Step 3/12 : COPY package.json yarn.lock ./
 ---> Using cache
Step 4/12 : RUN yarn install --frozen-lockfile
 ---> Using cache
Step 5/12 : COPY src/ ./src/
 ---> d6a2f1b3c8e9
Step 6/12 : RUN yarn build
 ---> Running in 5f2e0a1b3c4d

Senior Shortcut: Cache Bomb Detection

Key Takeaway

Order your Dockerfile from least to most volatile dependencies. Cache mounts and multi-stage builds are free performance wins. Never COPY . . before package manifests.

The Underlying Technology

Docker's internals rest on two kernel features: namespaces and cgroups. PID, network, mount, and user namespaces each provide isolated views of system resources, making a container feel like a separate machine. cgroups enforce hard limits on CPU, memory, and I/O, preventing any single container from starving the host. The real magic is the union filesystem, typically overlay2, which layers read-only image layers with a writable container layer using copy-on-write. This means multiple containers share the same base image blocks on disk, reducing storage and startup time to milliseconds. Understanding these primitives explains why containers are not lightweight VMs: they share the host kernel, so a kernel panic takes down all containers. The performance gain comes from skipping hardware virtualization, not from magic.

overlay2_structure_example.ymlYAML

# io.thecodeforge - devops tutorial

# Docker image layers stored in /var/lib/docker/overlay2/
lowerdir: /var/lib/docker/overlay2/l/LAYER1:/var/lib/docker/overlay2/l/LAYER2
upperdir: /var/lib/docker/overlay2/CONTAINER_ID/diff
workdir: /var/lib/docker/overlay2/CONTAINER_ID/work
merged: /var/lib/docker/overlay2/CONTAINER_ID/merged

# On the first write to a file from a lower layer, the whole file is copied up to upperdir
# Deletions create a whiteout (character device 0/0) in upperdir

Production Trap:

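Overlay2 requires a backing filesystem with d_type support: ext4, or xfs formatted with ftype=1. The legacy aufs driver is deprecated and has been dropped from recent Docker Engine releases, so hosts still running it should migrate to overlay2.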

Key Takeaway

Containers share the host kernel. Overlay2 saves disk space, but kernel crashes kill all containers.

Putting It All Together

When you run docker run nginx, the Docker CLI sends REST calls to dockerd, which pulls any missing layers from a registry over HTTPS and prepares a root filesystem on overlay2. dockerd passes the container spec to containerd, and runc applies the cgroup limits and calls clone() with CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET (plus UTS and IPC) to start the process in new namespaces. The new process sees only its isolated namespace tree, its root filesystem, and one end of a virtual Ethernet pair. The container's init process (PID 1) runs as a normal process on the host but cannot see sibling containers or host processes. Docker's networking uses a bridge or overlay driver and forwards packets through iptables rules. cgroups keep resource contention in check by weighting and throttling CPU time and capping memory. When the container exits, the kernel releases the namespaces once their last process is gone, and Docker unmounts the overlay2 filesystem and, with --rm, removes the writable layer. With the image already cached, this entire cycle takes well under a second because no hardware emulation is involved.

docker_run_kernel_call_sequence.ymlYAML

# io.thecodeforge - devops tutorial

# Simplified call flow for 'docker run nginx'
1. CLI -> dockerd: REST POST /containers/create, then POST /containers/{id}/start
2. dockerd -> registry: GET /v2/library/nginx/manifests/latest
3. dockerd -> overlay: mount -t overlay overlay \
  -olowerdir=/layers/nginx,upperdir=/diff,workdir=/work /merged
4. runc -> kernel: clone()/unshare() with CLONE_NEWNS|CLONE_NEWPID|CLONE_NEWNET
5. runc -> cgroup: write the PID to cgroup.procs in the container's cgroup under /sys/fs/cgroup/
6. runc -> kernel: execve /usr/sbin/nginx inside the new namespaces

Performance Insight:

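If clone() fails because of user namespace restrictions (for example with rootless Docker or userns-remap), check /proc/sys/user/max_user_namespaces and, on Debian and Ubuntu kernels, /proc/sys/kernel/unprivileged_userns_clone.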

Key Takeaway

The entire container lifecycle is a sequence of kernel syscalls, orchestrated by dockerd, containerd, and runc with no virtualization overhead.

Conclusion: The Orchestrated Symphony of Kernel Primitives

Docker containers are not lightweight VMs. They are a clever orchestration of Linux kernel primitives: namespaces for isolation, cgroups for resource control, and union filesystems for efficient image layering. Understanding this internal machinery transforms debugging from guesswork into science. When a container leaks memory, check cgroup limits and the OOM killer. When networking fails, trace the veth pair and iptables NAT rules. When builds are slow, audit the Dockerfile's layer ordering and the build cache. This knowledge separates engineers who merely run containers from those who master them. The Docker CLI is just a user-friendly wrapper around syscalls like clone(CLONE_NEWNS) and pivot_root, plus a set of iptables rules. Most production incidents related to containers can be traced back to a specific kernel mechanism: no magic, just well-documented Linux internals. By internalizing these concepts, you move from operator to architect, designing resilient containerized systems rather than debugging black boxes.

container_internals_cheat.ymlYAML

# io.thecodeforge - devops tutorial
# Deep kernel interaction for container lifecycle
container_init:
  step_1_namespaces:
    - "clone(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET)"
    - "Separates mount table, PID tree, network stack"
  step_2_cgroups:
    - "Write limits under /sys/fs/cgroup/ (cgroup v1: separate cpu, memory, pids trees; v2: one unified tree)"
    - "Kernel enforces them, e.g. 2 CPU cores and 512MB RAM"
  step_3_pivot_root:
    - "pivot_root to container rootfs"
    - "overlay2 merges layers read-only, top writable"
  step_4_networking:
    - "veth pair: eth0 <-> host bridge (docker0)"
    - "iptables -t nat -A POSTROUTING -j MASQUERADE"
  exit_path:
    - "Namespace destroyed on last process exit"
    - "cgroup cleanup via release_agent (cgroup v1) or by the runtime/systemd (cgroup v2)"

Production Trap:

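Containers are not sandboxes. A shared kernel means a single kernel exploit can break out of every container on the host. Keep workloads with different trust levels on separate hosts or VMs, or run untrusted ones under a sandboxed runtime such as runsc (gVisor) or kata-runtime.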

Key Takeaway

Docker is a userspace abstraction over kernel syscalls—master the primitives, master the container.

Namespace	Flag	Isolates	Docker Default	Host Flag to Disable
PID	CLONE_NEWPID	Process ID tree	Enabled	--pid=host
Network	CLONE_NEWNET	Network stack (interfaces, routes, iptables)	Enabled	--net=host
Mount	CLONE_NEWNS	Filesystem mount points	Enabled	None (bind mounts such as --volume /:/host expose host paths instead)
User	CLONE_NEWUSER	UID/GID mapping	Disabled (opt-in)	N/A (disabled by default)
UTS	CLONE_NEWUTS	Hostname and domain name	Enabled	--uts=host
IPC	CLONE_NEWIPC	System V IPC and POSIX message queues	Enabled	--ipc=host
Cgroup	CLONE_NEWCGROUP	cgroup root directory view	Enabled (cgroup v2)	--cgroupns=host

File	Command / Code	Purpose
io/thecodeforge/docker_stack_inspection.sh	curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m js...	The Docker Stack: From CLI to Kernel
io/thecodeforge/namespace_inspection.sh	CONTAINER_NAME="my-api"	Linux Namespaces
io/thecodeforge/cgroup_inspection.sh	CONTAINER_ID=$(docker ps -q | head -1)	cgroups
io/thecodeforge/overlay2_inspection.sh	CONTAINER_ID=$(docker ps -q | head -1)	Union Filesystem and overlay2
io/thecodeforge/container_lifecycle.sh	docker pull alpine:3.19	The Container Lifecycle: From Clone to Exit
AuditDefaultNetworking.yml	name: audit-default-docker-bridge	Docker Networking
OptimizedDockerfile.yml	FROM node:20-alpine AS builder	The Image Layer Cache
overlay2_structure_example.yml	lowerdir: /var/lib/docker/overlay2/l/LAYER1:/var/lib/docker/overlay2/l/LAYER2	The Underlying Technology
docker_run_kernel_call_sequence.yml	1. CLI -> dockerd: REST POST /containers/create	Putting It All Together
container_internals_cheat.yml	container_init:	Conclusion
further_reading_curated.yml	resources:	Further Reading

Key takeaways

Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls.

Namespaces isolate what a container can see (PID tree, network, filesystem, hostname, IPC). cgroups limit what a container can consume (CPU, memory, I/O, processes).

Every container is a real Linux process visible in ps aux on the host. The kernel does not know what a 'container' is; it only knows processes, namespaces, and cgroups.

overlay2 stacks read-only image layers with a writable top layer. No data is copied on container creation. Writes go to the top layer. Deletes create whiteout files.

runc exits after creating the container. containerd-shim stays as the parent of the container process, and containerd monitors it. The pause process that holds namespaces open across restarts is a Kubernetes pod feature, not part of plain Docker.

The Docker socket (/var/run/docker.sock) is equivalent to root access on the host. Never mount it into containers without a socket proxy.


Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Everything here is grounded in real deployments.

Last updated: July 04, 2026
