
How Docker Works Internally: Architecture, Namespaces, and Containers Explained

πŸ“ Part of: Docker β†’ Topic 4 of 17
Docker internals deep-dive: containerd, runc, Linux namespaces, cgroups, the overlay2 union filesystem, and the OCI runtime spec — understand what actually happens when you run docker run.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls.
  • Namespaces isolate what a container can see (PID tree, network, filesystem, hostname, IPC). cgroups limit what a container can consume (CPU, memory, I/O, processes).
  • Every container is a real Linux process visible in ps aux on the host. The kernel does not know what a 'container' is — it only knows processes, namespaces, and cgroups.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
  • Docker CLI sends API requests to the Docker daemon (dockerd)
  • dockerd delegates container lifecycle to containerd
  • runc reads the OCI runtime spec and configures namespaces, cgroups, and filesystem
  • Namespaces: isolate PID, network, mount, user, UTS, and IPC views
  • cgroups: limit CPU, memory, I/O, and process count per container
  • Union filesystem (overlay2): stack read-only image layers with a writable top layer
  • seccomp: filter syscalls at the kernel level
🚨 START HERE
Docker Internals Triage Cheat Sheet
First-response commands for container, namespace, cgroup, and filesystem issues.
🟠 Container is consuming too much CPU or memory.
Immediate Action: Find the container's host PID and inspect its cgroup.
Commands
docker inspect --format '{{.State.Pid}}' <container>
cat /proc/<pid>/cgroup && cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
Fix Now: If memory usage exceeds the limit, increase --memory. If no limit is set, add --memory=512m to prevent one container from consuming all host RAM.
🟡 Container cannot reach the network.
Immediate Action: Inspect the container's network namespace.
Commands
docker inspect --format '{{.State.Pid}}' <container>
nsenter --net --target <pid> ip addr show && nsenter --net --target <pid> ip route show
Fix Now: If no veth interface exists, the container is not connected to a network. If routes are missing, check docker network inspect <network>.
🟡 Container cannot write to volume — permission denied.
Immediate Action: Check the UID mapping between container and host.
Commands
docker exec <container> id
ls -la /var/lib/docker/volumes/<volume>/_data
Fix Now: If the UIDs mismatch, chown the volume to match the container's UID. Check whether userns-remap is enabled in daemon.json.
🟡 Container exits immediately with code 137.
Immediate Action: Check whether the process was OOM-killed.
Commands
dmesg | grep -i 'oom\|killed process'
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
Fix Now: If OOM-killed, increase the --memory limit or fix the memory leak. Check docker stats for actual memory usage before the kill.
🟡 Container filesystem shows unexpected or missing files.
Immediate Action: Inspect the overlay2 layers and the writable layer.
Commands
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' <container>
ls -la $(docker inspect --format '{{.GraphDriver.Data.UpperDir}}' <container>)
Fix Now: Files in UpperDir override LowerDir. Whiteout files (character devices with device number 0/0) mark deleted files in overlay2's on-disk layers; in image layer tarballs, the same deletions appear as .wh.-prefixed files.
🟡 Docker daemon is unresponsive.
Immediate Action: Check daemon and containerd status.
Commands
systemctl status docker && systemctl status containerd
journalctl -u docker --since '10 minutes ago' --no-pager | tail -50
Fix Now: If the daemon is hung, restart it: systemctl restart docker. If disk space is full, prune: docker system prune -a. If containerd has crashed, restart it: systemctl restart containerd.
Production Incident: Container Processes Visible on Host — PID Namespace Misconfiguration Exposes All Container Processes
A monitoring container was accidentally started with --pid=host. All container processes became visible in the host's process tree. The monitoring script collected PIDs from containers and sent them to a kill script designed to clean up zombie processes on the host, resulting in production containers being killed unexpectedly.
Symptom: Production containers were being killed randomly — sometimes the API container, sometimes the worker container, sometimes the database. The kills occurred every 5-10 minutes with no pattern. The team checked application logs — no errors. They checked OOM killer logs (dmesg | grep -i oom) — no OOM kills. They checked Docker events (docker events) — containers were being killed with SIGTERM, not SIGKILL. The kills were coming from the host, not from inside the containers.
Assumption: The team assumed a rogue cron job on the host was killing processes. They checked crontab -l for all users — nothing suspicious. They assumed a Kubernetes liveness probe was misconfigured — but they were not running Kubernetes. They assumed a Docker daemon bug and restarted dockerd — the kills continued.
Root cause: A monitoring container was started with --pid=host, which disables PID namespace isolation. This made all processes on the host (including other containers' processes) visible inside the monitoring container. The monitoring script used ps aux to collect process metrics and passed PIDs matching a certain pattern to a cleanup script that killed orphaned processes. Because the monitoring container could see all container processes (not just its own), the cleanup script identified legitimate container processes as 'orphans' and killed them. The --pid=host flag had been added during debugging a week earlier and never removed.
Fix:
1. Removed --pid=host from the monitoring container. PID namespace isolation was restored — the monitoring container could only see its own processes.
2. Added a pre-deployment check that scans docker run commands and docker-compose.yml for --pid=host, --net=host, and --privileged flags and requires explicit approval.
3. Modified the cleanup script to verify that PIDs belong to the expected namespace before killing them.
4. Added docker events monitoring that alerts on unexpected container kills.
5. Documented that --pid=host is a debugging flag that must never be used in production.
Key Lesson
  • --pid=host disables PID namespace isolation. All host processes become visible inside the container. This is a debugging flag, not a production configuration.
  • A monitoring container with --pid=host can see and interact with all processes on the host, including other containers' processes.
  • Always audit namespace flags (--pid, --net, --ipc, --uts) in production deployments. Any flag that disables namespace isolation increases the blast radius of a compromised container.
  • Add automated pre-deployment checks for dangerous flags. Manual review is insufficient — flags added during debugging are easily forgotten.
  • The PID namespace is the most important isolation boundary. Without it, a container is not isolated — it is just a chroot.
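An automated pre-deployment scan for namespace-disabling flags can be sketched in a few lines of shell. This is a minimal sketch: the flag list, function name, and file paths are illustrative, not a complete audit of every dangerous Docker option.

```shell
#!/bin/sh
# Pre-deployment check: reject deploy files that disable namespace isolation.
# scan_for_dangerous_flags is an illustrative helper; extend the pattern
# list (e.g. cap_add, device mounts) for a real audit.
scan_for_dangerous_flags() {
    # Prints any matches and returns 1 if a namespace-disabling flag is present.
    if grep -nE -- '--pid[= ]host|--net(work)?[= ]host|--ipc[= ]host|--uts[= ]host|--privileged' "$1"; then
        echo "blocked: $1 uses a namespace-disabling flag (needs explicit approval)" >&2
        return 1
    fi
}

# Demo against a sample deploy script containing the incident's flag:
cat > /tmp/deploy-demo.sh <<'EOF'
docker run -d --name monitor --pid=host monitoring:latest
EOF
scan_for_dangerous_flags /tmp/deploy-demo.sh && echo "clean" || echo "rejected"
```

Wiring a check like this into CI makes the "flag added during debugging and forgotten" failure mode impossible to merge silently.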
Production Debug Guide
Systematic debugging paths using /proc, /sys, and namespace inspection.
Container process is consuming unexpected CPU or memory. → Find the container's host PID and inspect its cgroup. Run docker inspect --format '{{.State.Pid}}' <container> to get the host PID. Then cat /proc/<pid>/cgroup to see which cgroup the process belongs to. Check CPU usage with top -p <pid> and memory with cat /proc/<pid>/status | grep -i vm. Compare with the cgroup limits: cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes.
Container cannot reach the network or resolve DNS. → Inspect the container's network namespace. Get the container PID: docker inspect --format '{{.State.Pid}}' <container>. Enter the network namespace: nsenter --net --target <pid> ip addr show. Check the veth pair: nsenter --net --target <pid> ip link show. Check DNS resolution: nsenter --net --target <pid> cat /etc/resolv.conf. Check routing: nsenter --net --target <pid> ip route show.
Container cannot write to a mounted volume — permission denied. → Check the container's user namespace and the volume ownership. Run docker exec <container> id to see the container's UID. Check the volume ownership on the host: ls -la /var/lib/docker/volumes/<volume>/_data. If the container runs as UID 1000 but the volume is owned by root (UID 0), the container cannot write. Check if user namespace remapping is enabled: cat /etc/docker/daemon.json | grep userns-remap.
Container starts but the application process is not running. → Check if the container process exited immediately. Run docker ps -a to see if the container status is 'Exited'. Check the exit code: docker inspect --format '{{.State.ExitCode}}' <container>. Check the logs: docker logs <container>. If the exit code is 137, the process was OOM-killed — check the cgroup memory limit and the OOM killer log: dmesg | grep -i oom.
Container filesystem shows unexpected files or missing files. → Inspect the overlay2 layers. Get the container's merge directory: docker inspect --format '{{.GraphDriver.Data.MergedDir}}' <container>. List the layers: docker inspect --format '{{.GraphDriver.Data.LowerDir}}' <container>. Check the writable layer: docker inspect --format '{{.GraphDriver.Data.UpperDir}}' <container>. Files in the writable layer override files in lower layers. Files 'deleted' in the writable layer are whiteout files (character device 0/0).
Docker daemon is unresponsive or containers cannot be created. → Check the daemon status: systemctl status docker. Check daemon logs: journalctl -u docker --since '10 minutes ago'. Check if containerd is running: systemctl status containerd. Check if the daemon socket is accessible: curl --unix-socket /var/run/docker.sock http://localhost/version. Check if the daemon is out of disk space: df -h /var/lib/docker.
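The overlay2 lookup order described above (the writable UpperDir shadows the read-only LowerDir) can be simulated with plain directories, no mount or root required. The /tmp paths and the merged_read helper are scratch illustrations, not real Docker state:

```shell
#!/bin/sh
# Simulate overlay2's per-file lookup order with ordinary directories.
# /tmp/ovl-demo is scratch space; real layers live under /var/lib/docker/overlay2.
mkdir -p /tmp/ovl-demo/lower /tmp/ovl-demo/upper
echo "from the read-only image layer" > /tmp/ovl-demo/lower/app.conf
echo "written by the container"       > /tmp/ovl-demo/upper/app.conf

merged_read() {
    # The merged view: the writable upper layer shadows the lower image layers.
    if [ -e "/tmp/ovl-demo/upper/$1" ]; then cat "/tmp/ovl-demo/upper/$1"
    elif [ -e "/tmp/ovl-demo/lower/$1" ]; then cat "/tmp/ovl-demo/lower/$1"
    else return 1
    fi
}

merged_read app.conf   # -> written by the container
# In a real container, list whiteouts (deleted files) in the writable layer:
#   find "$(docker inspect --format '{{.GraphDriver.Data.UpperDir}}' <container>)" -type c
```

Deleting the upper copy makes the lower (image) version visible again, which is exactly why removing a file inside a container never shrinks the image layers.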

Most Docker tutorials stop at 'docker run' and never explain what happens inside the kernel. This creates a dangerous gap — when containers misbehave, engineers without kernel-level understanding cannot diagnose the root cause. They restart containers, rebuild images, and escalate to platform teams for problems that a single /proc inspection would have solved.

Docker is a stack of components: the CLI, the daemon (dockerd), containerd, runc, and the Linux kernel. Each layer has a specific responsibility. The daemon manages images and the API. containerd manages container lifecycle. runc creates containers by configuring kernel primitives — namespaces for isolation, cgroups for resource limits, and overlay2 for the filesystem. The kernel does the actual work.

Understanding this stack is essential for production debugging. When a container cannot resolve DNS, the answer is in the network namespace. When a container is OOM-killed, the answer is in the cgroup memory controller. When a container starts slowly, the answer is in the overlay2 filesystem or image pull. Every container problem has a kernel-level root cause.

The Docker Stack: From CLI to Kernel — Every Component Explained

Docker is not a single program. It is a stack of components, each with a specific responsibility. Understanding this stack is the foundation for debugging any container issue.

Docker CLI (docker): The command-line interface. It sends HTTP API requests to the Docker daemon. The CLI does not create containers — it is a client that talks to the server. You can replace it with curl, Postman, or any HTTP client.

Docker daemon (dockerd): The server that manages images, networks, volumes, and the container API. It listens on a Unix socket (/var/run/docker.sock) or a TCP port. The daemon does not create containers directly — it delegates to containerd.

containerd: A container runtime that manages the complete container lifecycle — pulling images, creating containers, managing snapshots, and handling container execution. containerd was originally part of Docker but was extracted as a standalone project. It is now used by Docker, Kubernetes (via CRI), and other orchestration platforms.

runc: A lightweight container runtime that creates containers using Linux kernel primitives. runc reads an OCI (Open Container Initiative) runtime specification — a JSON file that describes the container's namespaces, cgroups, mounts, and environment. runc calls clone() to create a new process, configures namespaces and cgroups, calls pivot_root to switch to the container's root filesystem, and execs the application. runc exits after creating the container — it does not manage the container's lifecycle.

The OCI spec: The Open Container Initiative defines two standards: the image spec (how images are packaged) and the runtime spec (how containers are created). runc implements the runtime spec. This standardization means any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can run OCI-compliant images.

The flow: docker run -> dockerd API -> containerd creates container spec -> runc reads OCI spec -> runc calls clone() with namespaces -> runc configures cgroups -> runc pivot_root to overlay2 filesystem -> runc exec the application process -> runc exits -> containerd monitors the container process.

io/thecodeforge/docker_stack_inspection.sh · BASH
#!/bin/bash
# Inspect every layer of the Docker stack

# ── Docker CLI -> Daemon communication ───────────────────────────────────────
# The CLI sends HTTP requests to the daemon. You can do this manually:
curl --unix-socket /var/run/docker.sock http://localhost/version | python3 -m json.tool
# Shows: Docker version, API version, Go version, OS, architecture

curl --unix-socket /var/run/docker.sock http://localhost/containers/json | python3 -m json.tool
# Shows: all running containers (same as docker ps)

# ── Check if containerd is running ───────────────────────────────────────────
systemctl status containerd
# containerd is the container runtime daemon
# It manages container lifecycle independently of dockerd

# ── Find the runc binary ─────────────────────────────────────────────────────
which runc
# Typically: /usr/bin/runc or /usr/local/bin/runc

runc --version
# Shows: runc version, commit, spec version (OCI 1.0.2)

# ── Inspect the OCI runtime spec for a running container ─────────────────────
# containerd stores the OCI spec for each container
CONTAINER_ID=$(docker ps -q | head -1)

# Find the container's bundle directory (contains config.json)
find /run/containerd/io.containerd.runtime.v2.task/default/ -name config.json 2>/dev/null | head -1
# This file is the OCI runtime spec β€” it defines namespaces, cgroups, mounts

# ── Trace the container creation flow ────────────────────────────────────────
# Start a container and watch the kernel calls
# (runc run needs an OCI bundle: a directory containing config.json and a rootfs)
strace -f -e trace=clone,unshare,pivot_root,chroot,execve \
  -o /tmp/container-trace.log \
  runc run test-container &

# The trace shows:
# clone(CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|...) = <child-pid>
# pivot_root(".", "/old-root") = 0
# execve("/app/server", ["server"], ...) = 0

# ── Check the daemon socket ──────────────────────────────────────────────────
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker /var/run/docker.sock
# The socket is owned by root:docker group
# Any process in the docker group can control ALL containers

# ── Check the daemon process tree ────────────────────────────────────────────
pstree -p $(pidof dockerd)
# dockerd ─┬─ containerd ─┬─ containerd-shim-runc-v2 ─┬─ <app-pid>
#          │              │                           └─ pause
#          │              └─ containerd-shim-runc-v2 ─┬─ <app-pid>
#          │                                          └─ pause
#          └─ docker-proxy (for published ports)
▶ Output
# Docker daemon version:
{
"Version": "24.0.7",
"ApiVersion": "1.43",
"MinAPIVersion": "1.12",
"GitCommit": "afdd53b",
"GoVersion": "go1.20.10",
"Os": "linux",
"Arch": "amd64"
}

# runc version:
runc version 1.1.9
commit: v1.1.9-0-gccaecfc
spec: 1.0.2-dev

# Process tree:
dockerd(1234)───containerd(1235)───containerd-shim(5678)───node(5679)

# The container process (node, PID 5679) is a real Linux process on the host
Mental Model
The Docker Stack as a Restaurant Chain
Why does Docker have so many components instead of one program?
  • Separation of concerns: the daemon manages the API and images, containerd manages lifecycle, runc creates containers.
  • Replaceability: you can swap runc for crun (faster), kata-runtime (VM isolation), or runsc (gVisor) without changing Docker.
  • Standardization: the OCI spec ensures any compliant runtime can run any compliant image.
  • Kubernetes reuses containerd directly — it does not need dockerd. This is why containerd was extracted.
📊 Production Insight
The /var/run/docker.sock socket is the most dangerous file on a Docker host. Any process with access to this socket can create, stop, and delete containers — effectively root access to the host. In production, never mount this socket into containers unless absolutely necessary. If you must, use a socket proxy that restricts the API calls the container can make.
🎯 Key Takeaway
Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls (clone, pivot_root, exec). The OCI spec standardizes the interface between layers. Understanding this stack is the foundation for debugging any container issue.
Runtime Selection by Use Case
If: Standard single-tenant application workload
→
Use: runc (default). Fast, lightweight, standard namespace isolation.
If: Multi-tenant workload running untrusted code
→
Use: runsc (gVisor) or kata-runtime. User-space kernel or VM isolation.
If: Performance-critical workload, low syscall overhead
→
Use: crun (written in C, faster than runc's Go implementation).
If: Serverless platform, short-lived functions
→
Use: Firecracker via the containerd Firecracker shim. ~125ms microVM startup.
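Swapping runtimes in practice is a daemon.json entry plus a per-container flag. A minimal sketch, assuming crun and kata-runtime are installed at the usual paths (verify with `which crun` on your host):

```shell
#!/bin/sh
# Sketch: registering alternate OCI runtimes with dockerd.
# The binary paths below are assumptions, not guaranteed install locations.
RUNTIMES_JSON='{
  "runtimes": {
    "crun": { "path": "/usr/bin/crun" },
    "kata": { "path": "/usr/bin/kata-runtime" }
  }
}'
printf '%s\n' "$RUNTIMES_JSON"   # merge this into /etc/docker/daemon.json

# After systemctl restart docker, select a runtime per container:
#   docker run --runtime=crun alpine:3.19 echo hello
#   docker info --format '{{.DefaultRuntime}}'   # normally "runc"
```

Because every runtime implements the same OCI runtime spec, the image and the rest of the Docker stack are unchanged; only the component that issues the kernel syscalls is swapped.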

Linux Namespaces: The Isolation Mechanism Behind Every Container

Namespaces are the Linux kernel feature that provides process isolation. Each namespace gives a process its own view of a system resource. A container is a regular Linux process that runs inside a set of namespaces — it sees its own PID tree, its own network stack, its own filesystem mount points, and its own hostname, even though it shares the host kernel.

Linux provides eight namespace types (the newest, the time namespace, is not used by Docker). Of the seven described below, Docker enables six by default — the user namespace requires opting in via userns-remap:

PID namespace (CLONE_NEWPID): Each container has its own PID tree. The first process inside the container is PID 1. Processes inside the container cannot see processes outside the container. On the host, the container process has a real PID — you can see it with ps aux. The PID namespace is hierarchical — a child namespace can see parent PIDs if configured, but not sibling PIDs.

Network namespace (CLONE_NEWNET): Each container gets its own network stack — its own interfaces, routing table, firewall rules, and /proc/net. When Docker creates a container, it creates a veth (virtual Ethernet) pair — one end inside the container's network namespace, one end connected to the Docker bridge. This is how containers communicate with each other and the outside world.

Mount namespace (CLONE_NEWNS): Each container has its own mount table. The container's root filesystem is a union mount (overlay2) that layers the image's read-only layers with a writable top layer. The container cannot see the host's filesystem unless explicitly mounted. pivot_root changes the container's root directory to the overlay2 merge directory.

User namespace (CLONE_NEWUSER): Maps container UIDs to different host UIDs. Container UID 0 (root) can be mapped to host UID 100000 (unprivileged). This means even a container escape results in an unprivileged host user. User namespace remapping is not enabled by default in Docker because it breaks some workflows (volume permissions, Docker-in-Docker).
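You can read the mapping table directly: /proc/<pid>/uid_map has one line per mapped range (UID inside the namespace, UID outside, range length). A quick check, runnable unprivileged on any Linux host; the remapped values shown in comments are illustrative:

```shell
#!/bin/sh
# Read this shell's own UID mapping: "inside-uid  outside-uid  range-length".
cat /proc/self/uid_map
# An identity mapping (0 0 4294967295) means no user namespace remap is active.
# With userns-remap enabled, a container process would instead show something like:
#   0     100000      65536     (container root = unprivileged host uid 100000)

# For a running container (PID from docker inspect --format '{{.State.Pid}}'):
#   cat /proc/<pid>/uid_map
```

This file is also the fastest way to confirm the volume-permission failures described earlier: if the container's UID maps to an unexpected host UID, host-side file ownership will not line up.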

UTS namespace (CLONE_NEWUTS): Each container has its own hostname. The hostname is set during container creation and can be changed inside the container without affecting the host or other containers.

IPC namespace (CLONE_NEWIPC): Each container has its own System V IPC and POSIX message queues. Processes in different containers cannot share shared memory segments or message queues.

Cgroup namespace (CLONE_NEWCGROUP): Virtualizes the /proc/self/cgroup view. The container sees its own cgroup path as '/' instead of the real path (/docker/<container-id>). This prevents the container from seeing or manipulating other containers' cgroups.
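All of these namespaces appear as symlinks under /proc/<pid>/ns, and two processes share a namespace exactly when the symlink targets (inode numbers) match. A small sketch; same_ns is an illustrative helper, and the unshare command is shown commented because it needs unprivileged user namespaces enabled (or root):

```shell
#!/bin/sh
# Every process's namespace membership is visible as /proc/<pid>/ns/* symlinks.
readlink /proc/self/ns/pid    # e.g. pid:[4026531836]
readlink /proc/self/ns/net

# Two processes are in the same namespace iff the symlink targets match:
same_ns() { [ "$(readlink /proc/$1/ns/$3)" = "$(readlink /proc/$2/ns/$3)" ]; }
same_ns $$ $$ uts && echo "same UTS namespace"

# util-linux can create namespaces without Docker:
#   unshare --user --pid --fork --mount-proc sh -c 'echo "inside, I am PID $$"; ps'
# Inside, the shell is PID 1 of a brand-new PID namespace -- the same kernel
# primitive runc uses, with no container runtime involved.
```

Comparing a container PID's ns links against /proc/1/ns (as root) is the definitive way to verify which isolation boundaries a container actually has.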

io/thecodeforge/namespace_inspection.sh · BASH
#!/bin/bash
# Inspect and compare namespaces for containers and the host

# ── Get a container's host PID ───────────────────────────────────────────────
CONTAINER_NAME="my-api"
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' $CONTAINER_NAME)
echo "Container $CONTAINER_NAME is PID $CONTAINER_PID on the host"

# ── List all namespaces for the container process ────────────────────────────
ls -la /proc/$CONTAINER_PID/ns/
# Output:
# lrwxrwxrwx ... ipc -> 'ipc:[4026532XXX]'
# lrwxrwxrwx ... mnt -> 'mnt:[4026532XXX]'
# lrwxrwxrwx ... net -> 'net:[4026532XXX]'
# lrwxrwxrwx ... pid -> 'pid:[4026532XXX]'
# lrwxrwxrwx ... user -> 'user:[4026531XXX]'
# lrwxrwxrwx ... uts -> 'uts:[4026532XXX]'

# ── Compare with host namespaces ─────────────────────────────────────────────
ls -la /proc/1/ns/
# The host PID 1 (systemd) has different namespace IDs than the container
# If namespace IDs match, the container shares that namespace with the host

# ── PID namespace: container sees its own PID tree ───────────────────────────
docker exec $CONTAINER_NAME ps aux
# PID 1 is the container's entrypoint process
# The container cannot see host processes

# On the host, the same process has a different PID:
ps aux | grep -F "$(docker exec $CONTAINER_NAME cat /proc/1/comm)"
# The host sees the real PID, the container sees PID 1

# ── Network namespace: container has its own network stack ───────────────────
docker exec $CONTAINER_NAME ip addr show
# Shows: lo (loopback) and eth0 (veth pair inside the container)

# On the host, inspect the veth pair:
ip link show | grep veth
# vethXXXX@if4: <BROADCAST,MULTICAST,UP> ... master docker0
# One end is in the container's net namespace, one end is on the docker0 bridge

# ── Enter a container's network namespace from the host ──────────────────────
sudo nsenter --net --target $CONTAINER_PID ip addr show
# Shows the same network config as docker exec, but from the host
# Useful for debugging without a shell inside the container

# ── Mount namespace: inspect the overlay2 filesystem ─────────────────────────
docker inspect --format '{{.GraphDriver.Data}}' $CONTAINER_NAME
# Shows: MergedDir, UpperDir, LowerDir, WorkDir
# MergedDir is what the container sees as /
# UpperDir is the writable layer (container-specific changes)
# LowerDir is the read-only image layers (colon-separated)

# ── User namespace: check if remapping is enabled ────────────────────────────
cat /etc/subuid
# If userns-remap is enabled: dockremap:100000:65536
# This maps container UID 0 to host UID 100000

# ── UTS namespace: container has its own hostname ────────────────────────────
docker exec $CONTAINER_NAME hostname
# Shows the container's hostname (usually the container ID)

hostname
# Shows the host's hostname β€” different from the container

# ── IPC namespace: container has its own IPC resources ───────────────────────
docker exec $CONTAINER_NAME ipcs
# Shows only IPC resources created inside the container

ipcs
# Shows host IPC resources β€” not visible inside the container
▶ Output
# Container PID on host:
Container my-api is PID 5679 on the host

# Container namespaces:
lrwxrwxrwx 1 root root 0 Jan 15 10:23 ipc -> 'ipc:[4026532847]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 mnt -> 'mnt:[4026532849]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 net -> 'net:[4026532851]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 pid -> 'pid:[4026532852]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 Jan 15 10:23 uts -> 'uts:[4026532853]'

# Container process list:
PID USER COMMAND
1 node node dist/index.js

# Container network:
1: lo: <LOOPBACK,UP> mtu 65536
4: eth0@if7: <BROADCAST,MULTICAST,UP> mtu 1500
inet 172.17.0.2/16

# Overlay2 filesystem:
MergedDir: /var/lib/docker/overlay2/abc123/merged
UpperDir: /var/lib/docker/overlay2/abc123/diff
LowerDir: /var/lib/docker/overlay2/def456/layers:/var/lib/docker/overlay2/ghi789/layers
Mental Model
Namespaces as Tinted Windows
What happens when you disable a namespace with a Docker flag?
  • --pid=host: the container sees ALL host processes. Its PID 1 is the host's PID 1 (systemd).
  • --net=host: the container shares the host's network stack. It can bind to any host port.
  • --ipc=host: the container can access host shared memory segments. Potential data leak.
  • Each flag removes one isolation layer. --privileged removes ALL of them.
📊 Production Insight
The --pid=host and --net=host flags are debugging tools, not production configurations. --pid=host exposes all host processes to the container — a compromised container can kill any process on the host. --net=host removes network isolation — a container can sniff traffic from other containers. Always use the default namespace isolation in production. If you need host networking for performance, use it only on dedicated hosts.
🎯 Key Takeaway
Namespaces are the isolation mechanism. PID namespace isolates the process tree. Network namespace isolates the network stack. Mount namespace isolates the filesystem. User namespace maps UIDs. Every container is a Linux process running inside these namespaces. Disabling a namespace (with --pid=host, --net=host) removes that isolation layer.
Namespace Configuration Decisions
If: Standard production container
→
Use: All six namespaces enabled (default). Maximum isolation.
If: Monitoring container that needs host process visibility
→
Use: --pid=host only on dedicated monitoring hosts. Never on shared hosts.
If: Performance-critical proxy or load balancer
→
Use: --net=host to avoid NAT overhead. Accept the security trade-off.
If: Debugging a container issue interactively
→
Use: nsenter from the host: nsenter --net --target <pid> bash. No need to modify the container.

cgroups: Resource Limits That Prevent Noisy Neighbors

While namespaces provide isolation (what a container can see), cgroups provide resource limits (how much a container can consume). Without cgroups, a container with a memory leak can consume all host RAM and trigger the OOM killer on unrelated containers.

cgroup v1 vs v2: Linux has two cgroup versions. cgroup v1 has separate hierarchies for each resource controller (cpu, memory, blkio, pids). cgroup v2 has a unified hierarchy. Docker supports both, but cgroup v2 is the default on newer Linux distributions (Ubuntu 22.04+, Fedora 31+, RHEL 9+).

CPU controller: Limits CPU usage in two ways:
  • cpu.shares: relative weight. Default is 1024. A container with 2048 gets twice the CPU of a container with 1024 when there is contention. Does not limit absolute CPU usage.
  • cpu.cfs_quota_us / cpu.cfs_period_us: absolute limit. --cpus=1.0 sets a quota of 100ms per 100ms period, limiting the container to one CPU core.
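The translation from --cpus to quota and period is just multiplication over a fixed 100ms period. A sketch of the arithmetic (cpus_to_cpu_max is an illustrative helper, not a Docker command):

```shell
#!/bin/sh
# How --cpus=N becomes a CFS quota: quota = N * period, default period 100000us.
# The resulting pair is what lands in cgroup v2's cpu.max
# (or v1's cpu.cfs_quota_us / cpu.cfs_period_us).
cpus_to_cpu_max() {
    period=100000
    # POSIX sh has no float arithmetic, so delegate to awk:
    quota=$(awk -v c="$1" -v p="$period" 'BEGIN { printf "%d", c * p }')
    echo "$quota $period"
}
cpus_to_cpu_max 1.0   # -> 100000 100000  (one full core per 100ms period)
cpus_to_cpu_max 0.5   # -> 50000 100000   (throttled after 50ms of CPU per 100ms)
cpus_to_cpu_max 2.0   # -> 200000 100000  (may run on two cores simultaneously)
```

The throttling consequence is worth internalizing: a --cpus=0.5 container that burns its 50ms early in a period simply stops running until the next period, which shows up as latency spikes rather than slow steady execution.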

Memory controller: Limits memory usage:
  • memory.limit_in_bytes: hard limit. If the container exceeds it, the kernel OOM-kills the process.
  • memory.soft_limit_in_bytes: soft limit. The kernel reclaims memory from the container before others under pressure, but does not kill it.
  • memory.oom_control: controls whether the OOM killer is invoked or the container is frozen.

blkio controller: Limits block device I/O:
  • blkio.throttle.read_bps_device: limits read throughput in bytes per second.
  • blkio.throttle.write_bps_device: limits write throughput.

pids controller: Limits the number of processes:
  • pids.max: maximum number of processes (including threads) the container can create. Prevents fork bombs.

The noisy neighbor problem: Without cgroup limits, one container can starve others. A container with a CPU-bound loop consumes 100% of all CPUs. A container with a memory leak consumes all host RAM, triggering the kernel OOM killer, which may kill unrelated containers. cgroup limits prevent this by enforcing per-container resource ceilings.

io/thecodeforge/cgroup_inspection.sh · BASH
#!/bin/bash
# Inspect and configure cgroup resource limits for containers

# ── Get a container's cgroup path ────────────────────────────────────────────
CONTAINER_ID=$(docker ps -q | head -1)

# cgroup v1 path:
ls /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/
# cpu.shares, cpu.cfs_quota_us, cpu.cfs_period_us

# cgroup v2 path:
ls /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/
# cpu.max, memory.max, pids.max

# ── Check CPU limits ─────────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.shares
# Default: 1024 (1 CPU share). Set with --cpu-shares
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_quota_us
# -1 means no limit. Set with --cpus=1.0 (becomes 100000)
cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.cfs_period_us
# Default: 100000 (100ms)

# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
# Format: quota period. Example: 100000 100000 (1 CPU limit)

# ── Check memory limits ──────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.limit_in_bytes
# 9223372036854771712 means no limit (max int64)
# Set with --memory=512m (becomes 536870912)
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.usage_in_bytes
# Current memory usage in bytes

# cgroup v2:
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
# max means no limit. Set with --memory=512m
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
# Current memory usage

# ── Check OOM events ─────────────────────────────────────────────────────────
cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.oom_control
# oom_kill_disable: 0 (OOM killer enabled)
# under_oom: 0 (not currently under OOM pressure)

# Check kernel OOM log:
dmesg | grep -i 'oom\|killed process' | tail -10

# ── Check PID limits ─────────────────────────────────────────────────────────

# cgroup v1:
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.max
# max means no limit. Set with --pids-limit=256
cat /sys/fs/cgroup/pids/docker/$CONTAINER_ID/pids.current
# Current number of processes

# ── Run a container with resource limits ─────────────────────────────────────
# Setting --memory-swap equal to --memory disables swap (swap allowance = 0)
docker run -d \
  --name resource-limited \
  --cpus=1.0 \
  --memory=512m \
  --pids-limit=256 \
  --memory-swap=512m \
  alpine:3.19 sleep 3600

# Verify the limits:
docker inspect resource-limited --format '{{.HostConfig.NanoCpus}}'
# 1000000000 nano-CPUs = 1 CPU

docker inspect resource-limited --format '{{.HostConfig.Memory}}'
# 536870912 = 512MB (in bytes)

docker stats resource-limited --no-stream
# Shows MEM USAGE / LIMIT, e.g. 43.2MiB / 512MiB
β–Ά Output
# CPU limits (cgroup v1, with --cpus=1.0):
100000
100000

# Memory limits:
536870912
45219840

# PID limits:
256
3

# docker stats:
NAME CPU % MEM USAGE / LIMIT MEM %
resource-limited 0.00% 43.2MiB / 512MiB 8.44%
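The flag-to-cgroup arithmetic above can be sketched in plain shell. The helper names (cpus_to_quota, mem_to_bytes) are illustrative, not Docker APIs; they reproduce the conversions Docker performs for --cpus and --memory:

```shell
#!/bin/sh
# cpus_to_quota: --cpus=N becomes quota = N * period, with period = 100000us (100ms).
cpus_to_quota() {
  awk -v c="$1" 'BEGIN { p = 100000; printf "%d %d\n", c * p, p }'
}

# mem_to_bytes: --memory accepts k/m/g suffixes; Docker stores plain bytes.
mem_to_bytes() {
  awk -v s="$1" 'BEGIN {
    n = s + 0                                  # numeric prefix of the string
    u = tolower(substr(s, length(s), 1))       # last character = unit suffix
    mult = (u == "k") ? 1024 : (u == "m") ? 1048576 : (u == "g") ? 1073741824 : 1
    printf "%d\n", n * mult
  }'
}

cpus_to_quota 1.5   # -> 150000 100000
mem_to_bytes 512m   # -> 536870912
```

These are the same numbers you see in cpu.max, cpu.cfs_quota_us, and memory.limit_in_bytes above.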
Mental Model
cgroups as Budget Limits
Why should you always set memory limits in production?
  • Without a memory limit, a container can consume all host RAM.
  • The kernel OOM killer then selects a process to kill β€” it may kill an unrelated container, not the leaking one.
  • With --memory=512m, the kernel kills only the container that exceeded its limit.
  • Without limits, the OOM killer uses a heuristic (oom_score) that may choose the wrong victim.
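When the memory cgroup kills a container, docker wait and docker inspect report exit code 137 (128 plus SIGKILL's signal number 9). A small decoding helper, plus a hedged way to provoke an OOM kill yourself (the tail /dev/zero trick buffers input until the limit is hit):

```shell
#!/bin/sh
# Decode a container exit code: codes above 128 mean "killed by signal (code - 128)".
explain_exit() {
  if [ "$1" -gt 128 ]; then
    echo "killed by signal $(($1 - 128))"
  else
    echo "exited normally with status $1"
  fi
}

explain_exit 137   # -> killed by signal 9
explain_exit 0     # -> exited normally with status 0

# To see 137 in practice (requires Docker; not run here):
#   docker run --rm --memory=64m --memory-swap=64m alpine:3.19 tail /dev/zero
#   echo $?   # 137 once the memory cgroup triggers the OOM kill
```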
πŸ“Š Production Insight
The OOM killer's victim selection heuristic (oom_score) favors killing processes with high memory usage and low importance. But it does not know which container is the problem β€” it sees host PIDs, not container boundaries. Without cgroup memory limits, a memory leak in container A can cause the OOM killer to kill container B (which happens to have a higher oom_score). Always set --memory on every production container to ensure the OOM killer targets the right process.
🎯 Key Takeaway
cgroups limit how much CPU, memory, I/O, and processes a container can consume. Without cgroup limits, one container can starve others (noisy neighbor problem). Always set --memory in production β€” without it, the OOM killer may kill the wrong container. Use --cpus for CPU limits and --pids-limit to prevent fork bombs.
Resource Limit Strategy
If: Stateless web API with predictable resource usage
β†’ Use: Set --cpus and --memory based on load testing. Use --memory-swap=limit to disable swap.
If: Database or cache with memory-based eviction
β†’ Use: Set --memory to the expected working set size. Do not set --memory-swap (allow swap for eviction).
If: Worker process that may fork subprocesses
β†’ Use: Set --pids-limit=256 to prevent fork bombs. Set --cpus to limit total CPU across all forks.
If: Development/testing environment
β†’ Use: Skip resource limits. They add complexity without benefit in non-production environments.

Union Filesystem and overlay2: How Docker Images Work Without Copying

The union filesystem is the reason Docker images are lightweight and containers start in milliseconds. Instead of copying files, Docker overlays multiple read-only directories and presents them as a single merged filesystem.

overlay2 driver: The default storage driver in modern Docker. It stacks directories (layers) and presents a merged view. Each layer is a directory on the host filesystem. The bottom layers are read-only (image layers). The top layer is writable (container-specific changes).

How it works: When a container reads a file, overlay2 checks the top (writable) layer first. If the file exists there, it is returned. If not, overlay2 checks each lower layer in order until the file is found. When a container writes a file, the write goes to the top layer only β€” lower layers are never modified. When a container deletes a file, a whiteout file (a character device with major/minor 0/0, prefixed with .wh.) is created in the top layer to mask the lower layer's file.
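The lookup order just described can be mimicked with ordinary directories. This is a toy sketch, not real overlayfs (a real mount is `mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... /merged` and needs root); the lookup function and file names are illustrative:

```shell
#!/bin/sh
# Toy overlay lookup: check the upper (writable) dir first, then each lower
# dir left to right, returning the first match -- the same order overlay2 uses.
lookup() {
  file="$1"; shift
  for layer in "$@"; do
    if [ -e "$layer/$file" ]; then
      cat "$layer/$file"
      return 0
    fi
  done
  echo "ENOENT: $file" >&2
  return 1
}

demo=$(mktemp -d)
mkdir -p "$demo/upper" "$demo/lower1" "$demo/lower2"
echo "from the image layer"  > "$demo/lower2/config.ini"
echo "written by container"  > "$demo/upper/notes.txt"

lookup config.ini "$demo/upper" "$demo/lower1" "$demo/lower2"  # -> from the image layer
lookup notes.txt  "$demo/upper" "$demo/lower1" "$demo/lower2"  # -> written by container
rm -rf "$demo"
```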

The four directories:
  • lowerdir: colon-separated list of read-only image layers (the leftmost entry is the topmost lower layer)
  • upperdir: the writable layer (container-specific changes)
  • workdir: overlay2 internal working directory (must be empty, used for atomic operations)
  • merged: the combined view that the container sees as its root filesystem

Performance implications: Read performance is slightly slower than native because overlay2 must check multiple layers. Write performance is native (writes go directly to the upperdir on the host filesystem). The performance difference is negligible for most workloads but can matter for I/O-intensive applications (databases, search engines).

The copy-up problem: When a container modifies a file from a lower layer, overlay2 must first copy the entire file to the upperdir (copy-up), then modify the copy. For large files (multi-GB database files), copy-up can cause a noticeable delay on first write. This is why databases should use volumes or bind mounts instead of the container's overlay2 filesystem.
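Copy-up can be sketched with ordinary directories as well (a toy model, not real overlayfs; write_file is an illustrative name): before the first modification of a lower-layer file, the whole file is copied into the upper layer, and every subsequent write touches only that copy.

```shell
#!/bin/sh
# Toy copy-up: if the file is not yet in the upper layer, copy it up from the
# first lower layer that has it, then append to the upper copy only.
write_file() {
  file="$1"; data="$2"; upper="$3"; shift 3
  if [ ! -e "$upper/$file" ]; then
    for layer in "$@"; do
      if [ -e "$layer/$file" ]; then
        cp "$layer/$file" "$upper/$file"   # the copy-up step
        break
      fi
    done
  fi
  printf '%s\n' "$data" >> "$upper/$file"
}

demo=$(mktemp -d)
mkdir -p "$demo/upper" "$demo/lower"
printf 'line-from-image\n' > "$demo/lower/app.conf"

write_file app.conf "line-from-container" "$demo/upper" "$demo/lower"
cat "$demo/upper/app.conf"   # the upper copy now has both lines
cat "$demo/lower/app.conf"   # -> line-from-image (lower layer untouched)
rm -rf "$demo"
```

Note that the entire file is copied before the append; for a multi-GB file, this copy is the first-write latency the paragraph above describes.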

io/thecodeforge/overlay2_inspection.sh Β· BASH
#!/bin/bash
# Inspect the overlay2 filesystem for a running container

# ── Get the overlay2 paths for a container ───────────────────────────────────
CONTAINER_ID=$(docker ps -q | head -1)

GRAPH_DATA=$(docker inspect --format '{{json .GraphDriver.Data}}' "$CONTAINER_ID")
echo "$GRAPH_DATA" | python3 -m json.tool

# Extract individual paths:
MERGED_DIR=$(echo "$GRAPH_DATA" | python3 -c "import sys,json; print(json.load(sys.stdin)['MergedDir'])")
UPPER_DIR=$(echo "$GRAPH_DATA" | python3 -c "import sys,json; print(json.load(sys.stdin)['UpperDir'])")
LOWER_DIR=$(echo "$GRAPH_DATA" | python3 -c "import sys,json; print(json.load(sys.stdin)['LowerDir'])")
WORK_DIR=$(echo "$GRAPH_DATA" | python3 -c "import sys,json; print(json.load(sys.stdin)['WorkDir'])")

echo "Merged (container sees this as /): $MERGED_DIR"
echo "Upper (writable layer): $UPPER_DIR"
echo "Lower (read-only layers): $LOWER_DIR"
echo "Work (overlay2 internal): $WORK_DIR"

# ── Inspect the writable layer (upperdir) ────────────────────────────────────
ls -la $UPPER_DIR/
# Shows files the container has created or modified
# Files prefixed with .wh. are whiteout files (deleted from lower layers)

# ── Inspect the merged view ──────────────────────────────────────────────────
ls -la $MERGED_DIR/
# This is what the container sees as its root filesystem
# It is the combination of all lower layers + the upper layer

# ── Demonstrate the copy-up behavior ─────────────────────────────────────────
# Create a file in the container
docker exec $CONTAINER_ID sh -c 'echo "hello" > /tmp/test-file'

# The file appears in the writable layer (upperdir):
ls -la $UPPER_DIR/tmp/test-file
# The file is in the upper layer, not in any lower layer

# ── Demonstrate the whiteout behavior ────────────────────────────────────────
# Delete a file that exists in a lower layer
docker exec $CONTAINER_ID rm /etc/hostname

# A whiteout file appears in the upper layer:
ls -la $UPPER_DIR/etc/.wh.hostname
# This character device (0/0) tells overlay2 to hide the lower layer's file

# ── Check the number of layers in an image ───────────────────────────────────
docker inspect <image> --format '{{len .RootFS.Layers}} layers'
# Each layer is a directory under /var/lib/docker/overlay2/

# ── Check disk usage per layer ───────────────────────────────────────────────
du -sh /var/lib/docker/overlay2/* | sort -hr | head -10
# Shows disk usage for each layer (shared layers are counted once)

# ── Compare overlay2 with native filesystem performance ─────────────────────
# Write performance test:
time docker exec $CONTAINER_ID dd if=/dev/zero of=/tmp/test bs=1M count=100
# Overlay2 write: ~0.3s (writes to upperdir on host filesystem)

# Read performance test:
time docker exec $CONTAINER_ID dd if=/tmp/test of=/dev/null bs=1M
# Overlay2 read: ~0.1s (slightly slower than native due to layer lookup)
β–Ά Output
# Overlay2 paths:
{
"LowerDir": "/var/lib/docker/overlay2/def456/layers:/var/lib/docker/overlay2/ghi789/layers",
"MergedDir": "/var/lib/docker/overlay2/abc123/merged",
"UpperDir": "/var/lib/docker/overlay2/abc123/diff",
"WorkDir": "/var/lib/docker/overlay2/abc123/work"
}

# Writable layer contents:
drwxrwxrwt 2 root root 4096 Jan 15 10:25 tmp
drwxr-xr-x 2 root root 4096 Jan 15 10:25 etc
c--------- 1 root root 0, 0 Jan 15 10:25 etc/.wh.hostname

# The whiteout character device etc/.wh.hostname hides /etc/hostname from lower layers
Mental Model
Overlay2 as Transparent Acetate Sheets
Why can't you reduce image size by deleting files in a Dockerfile layer?
  • Each layer is additive β€” deleting a file in layer N+1 does not remove it from layer N.
  • The delete creates a whiteout marker in layer N+1, but the data still exists in layer N.
  • The only way to truly remove data is to not include it in any layer (use multi-stage builds or .dockerignore).
  • This is why RUN apt-get install ... && rm -rf /var/lib/apt/lists/* must be in the same RUN β€” separate RUNs create separate layers.
πŸ“Š Production Insight
The copy-up problem is the most common performance issue with overlay2. When a database container writes to a file that exists in a lower layer (e.g., modifying a config file from the image), overlay2 must first copy the entire file to the upperdir. For multi-GB database files, this can cause seconds of latency on first write. The fix: use named volumes or bind mounts for database data directories instead of writing to the overlay2 filesystem.
🎯 Key Takeaway
overlay2 stacks read-only image layers with a writable top layer. Reads check the top layer first, then fall through to lower layers. Writes always go to the top layer. Deletes create whiteout files. Deleting files in a later layer does not reclaim space β€” the data persists in the earlier layer. Use volumes for databases to avoid the copy-up overhead.
Filesystem Strategy by Workload
If: Stateless application (API, web server)
β†’ Use: overlay2 (default). The writable layer is sufficient for temporary files and logs.
If: Database or persistent storage
β†’ Use: Named volumes or bind mounts. Bypass overlay2 entirely and avoid copy-up overhead.
If: Build process creating many temporary files
β†’ Use: tmpfs mounts for temporary data. Avoids disk I/O entirely.
If: High-security environment
β†’ Use: The --read-only flag to make the overlay2 filesystem read-only. All writes must go to explicit tmpfs or volume mounts.

The Container Lifecycle: From Clone to Exit β€” Every Kernel Call

When you run docker run, a precise sequence of kernel calls creates the container. Understanding this sequence is the key to debugging startup failures, permission errors, and namespace issues.

Step 1: Image pull and unpack. The Docker daemon pulls the image layers from the registry and unpacks them into /var/lib/docker/overlay2/. Each layer is a directory. If the layers already exist locally (cached), this step is skipped.

Step 2: Create the OCI runtime spec. containerd generates a config.json file β€” the OCI runtime specification. This JSON file defines:
  • The namespaces to create (PID, network, mount, user, UTS, IPC)
  • The cgroup limits (CPU, memory, pids)
  • The root filesystem path (the overlay2 merged directory)
  • The environment variables, working directory, and command to execute
  • The mount points (volumes, /proc, /sys, /dev)

Step 3: runc creates the container. runc reads config.json and executes the following kernel calls:
  • clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC) β€” creates a new process with new namespaces
  • sethostname() β€” sets the container's hostname (UTS namespace)
  • mount() β€” mounts /proc, /sys, /dev inside the container's mount namespace
  • pivot_root() β€” changes the container's root directory to the overlay2 merged directory
  • chdir("/") β€” moves to the new root
  • setgid() / setuid() β€” drops privileges to the container's group and user (if non-root)
  • execve() β€” replaces the runc init process with the container's entrypoint command

Step 4: runc exits, containerd monitors. runc forks an init process; after execve(), that child becomes the container's process, and the parent runc exits. containerd (via containerd-shim) monitors the container process, captures stdout/stderr, and handles signals.

Step 5: The container process runs. The application process is now running inside a set of namespaces with cgroup limits and an overlay2 filesystem. It has PID 1 inside the container's PID namespace. On the host, it has a real PID visible in ps aux.

The shim and the pause process: In plain Docker there is no pause process. The long-lived parent of every container is containerd-shim, which keeps stdio open, reports the exit status, and survives daemon restarts. The 'pause' process belongs to Kubernetes: each pod runs a pause container that holds the pod's shared namespaces open so application containers can be restarted or added without recreating them. On a Kubernetes node you can see them with ps aux | grep pause.
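You can see the namespace plumbing on any Linux box, containerized or not: every process lists its namespaces as symlinks in /proc/<pid>/ns, and two processes share a namespace exactly when those links point to the same inode. A minimal check:

```shell
#!/bin/sh
# List this process's namespaces. readlink output looks like pid:[4026531836];
# the number in brackets is the namespace inode.
for ns in pid net mnt uts ipc; do
  printf '%-4s -> %s\n' "$ns" "$(readlink /proc/self/ns/$ns)"
done

# Compare against another process: matching inodes mean the same namespace.
# Inside a container, these inodes differ from the host's PID 1.
# readlink /proc/1/ns/pid   # (may need root to read another process's ns links)
```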

io/thecodeforge/container_lifecycle.sh Β· BASH
#!/bin/bash
# Trace the complete container lifecycle from clone to exec

# ── Step 1: Pull and inspect image layers ────────────────────────────────────
docker pull alpine:3.19

# Inspect the image layers:
docker inspect alpine:3.19 --format '{{json .RootFS.Layers}}' | python3 -m json.tool
# Each entry is a layer (SHA256 digest)

# Find the layers on disk:
ls /var/lib/docker/overlay2/ | head -5
# Each directory holds one layer's content; images that share a layer reference the same directory

# ── Step 2: Create a container and inspect the OCI spec ──────────────────────
# Create a container without starting it (sleep keeps it alive for the later steps):
docker create --name lifecycle-demo alpine:3.19 sleep 3600

# Find the OCI runtime spec (containerd keys bundles by namespace and full
# container ID; Docker uses the "moby" namespace, not the container name):
FULL_ID=$(docker inspect --format '{{.Id}}' lifecycle-demo)
find /run/containerd -name config.json -path "*$FULL_ID*" 2>/dev/null
# This file is the OCI runtime spec that runc reads
# (it appears once the task is created, i.e. after docker start)

# Inspect the spec (if found):
cat /run/containerd/io.containerd.runtime.v2.task/moby/$FULL_ID/config.json | python3 -m json.tool | head -50
# Shows: namespaces, mounts, cgroups, process config, root filesystem

# ── Step 3: Trace runc's kernel calls ────────────────────────────────────────
# Start a container with strace to see the kernel calls:
sudo strace -f -e trace=clone,clone3,unshare,sethostname,mount,pivot_root,setuid,setgid,execve \
  -o /tmp/runc-trace.log \
  runc run --bundle /path/to/bundle test-trace

# The trace shows:
# clone3({flags=CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS|..., ...}) = 12345
# sethostname("container-id", 12) = 0
# mount("proc", "/proc", "proc", ...) = 0
# mount("sysfs", "/sys", "sysfs", ...) = 0
# pivot_root(".", "/old-root") = 0
# setuid(1000) = 0
# setgid(1000) = 0
# execve("/bin/sh", ["sh"], ...) = 0

# ── Step 4: Find the container process and its shim on the host ──────────────
docker start lifecycle-demo

# Find the container's host PID:
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' lifecycle-demo)
echo "Container process PID: $CONTAINER_PID"

# Find the containerd-shim that monitors the container (plain Docker has no
# pause process; pause containers are a Kubernetes pod concept):
ps aux | grep containerd-shim | grep -v grep
# root  5678  ...  /usr/bin/containerd-shim-runc-v2 -namespace moby -id abc123...

# The shim is the parent of the container process:
ps -o ppid= -p $CONTAINER_PID
# Prints the shim's PID
# ── Step 5: Watch the container process on the host ──────────────────────────
ps aux | grep $CONTAINER_PID
# root  5679  0.0  0.1  ...  sleep 3600
# This is the REAL process on the host, running inside namespaces

# ── Cleanup ──────────────────────────────────────────────────────────────────
docker rm -f lifecycle-demo
β–Ά Output
# Image layers:
[
"sha256:abc123def456..."
]

# Container process on host:
Container process PID: 5679

# containerd-shim (one per container):
root 5678 0.0 0.0 ... /usr/bin/containerd-shim-runc-v2 -namespace moby -id abc123...

# Host process:
root 5679 0.0 0.1 4520 1820 ? Ss 10:23 0:00 sleep 3600
Mental Model
Container Creation as Building a Room
Why does runc exit after creating the container?
  • runc's job is to create the container, not to manage it. After execve(), runc's init process is replaced by the application process and the parent runc exits.
  • containerd (via containerd-shim) monitors the container process, captures output, and handles signals.
  • The shim remains the container's parent, so dockerd and containerd can restart or upgrade without killing running containers.
  • In Kubernetes, a per-pod pause container additionally holds the pod's shared namespaces open so application containers can be restarted or added without recreating them.
πŸ“Š Production Insight
The containerd-shim is what makes daemonless restarts possible. Because the shim, not dockerd, is the container's parent, the Docker daemon can be restarted or upgraded (with live-restore enabled) without killing running containers. The 'pause' process you may have seen in ps output is a Kubernetes concept: each pod runs a pause container whose only job is to hold the pod's shared network and IPC namespaces open, so application containers can crash, restart, or be added without those namespaces being recreated. On a plain Docker host you will see one containerd-shim per container, not a pause process.
🎯 Key Takeaway
Container creation is a sequence of kernel calls: clone (namespaces), mount (filesystem), pivot_root (change root), setgid/setuid (drop privileges), execve (start application). runc creates the container and exits. containerd, via containerd-shim, monitors the process; the shim stays as the container's parent so the daemon can restart without killing it. In Kubernetes, the pod's pause container holds shared namespaces open. Every container is a real Linux process visible in ps aux on the host.
πŸ—‚ Linux Namespace Types Used by Docker
Each namespace isolates a different system resource.
Namespace | Flag            | Isolates                                     | Docker Default      | Host Flag to Disable
PID       | CLONE_NEWPID    | Process ID tree                              | Enabled             | --pid=host
Network   | CLONE_NEWNET    | Network stack (interfaces, routes, iptables) | Enabled             | --net=host
Mount     | CLONE_NEWNS     | Filesystem mount points                      | Enabled             | --volume /:/host (partial)
User      | CLONE_NEWUSER   | UID/GID mapping                              | Disabled (opt-in)   | N/A (disabled by default)
UTS       | CLONE_NEWUTS    | Hostname and domain name                     | Enabled             | --uts=host
IPC       | CLONE_NEWIPC    | System V IPC and POSIX message queues        | Enabled             | --ipc=host
Cgroup    | CLONE_NEWCGROUP | cgroup root directory view                   | Enabled (cgroup v2) | --cgroupns=host

🎯 Key Takeaways

  • Docker is a stack: CLI -> dockerd -> containerd -> runc -> kernel. Each layer has a specific responsibility. runc creates containers by calling kernel syscalls.
  • Namespaces isolate what a container can see (PID tree, network, filesystem, hostname, IPC). cgroups limit what a container can consume (CPU, memory, I/O, processes).
  • Every container is a real Linux process visible in ps aux on the host. The kernel does not know what a 'container' is β€” it only knows processes, namespaces, and cgroups.
  • overlay2 stacks read-only image layers with a writable top layer. No data is copied on container creation. Writes go to the top layer. Deletes create whiteout files.
  • containerd-shim is the long-lived parent of each container, so the daemon can restart without killing it. runc exits after creating the container. In Kubernetes, the pod's pause container holds shared namespaces open across container restarts.
  • The Docker socket (/var/run/docker.sock) is equivalent to root access on the host. Never mount it into containers without a socket proxy.

⚠ Common Mistakes to Avoid

  • βœ•Mistake 1: Assuming containers are VMs β€” Symptom: treating containers as isolated machines with their own kernel, expecting kernel-level isolation, and being surprised when a kernel CVE affects all containers β€” Fix: understand that containers are Linux processes with namespaces and cgroups. They share the host kernel. A kernel vulnerability affects all containers on the host.
  • βœ•Mistake 2: Using --pid=host or --net=host in production β€” Symptom: all host processes visible inside the container, or container sharing the host network stack with no isolation β€” Fix: these flags are debugging tools. Never use them in production unless on a dedicated host. Use nsenter from the host for debugging instead.
  • βœ•Mistake 3: Not setting memory limits (--memory) β€” Symptom: one container's memory leak consumes all host RAM, triggering the OOM killer on unrelated containers β€” Fix: set --memory on every production container. Without limits, the OOM killer's victim selection heuristic may kill the wrong container.
  • βœ•Mistake 4: Writing database data to the overlay2 filesystem β€” Symptom: slow first writes due to copy-up, data loss on container removal β€” Fix: use named volumes or bind mounts for database data. The overlay2 writable layer is deleted when the container is removed.
  • βœ•Mistake 5: Deleting files in a Dockerfile layer to reduce image size β€” Symptom: image size does not decrease because the deleted files persist in the previous layer β€” Fix: chain download and cleanup in the same RUN with &&. Use multi-stage builds to exclude build tools from the final image.
  • βœ•Mistake 6: Not understanding that the Docker socket is root access β€” Symptom: mounting /var/run/docker.sock into a container for convenience, giving the container full control over the Docker daemon β€” Fix: the Docker socket is equivalent to root access on the host. Use a socket proxy that restricts API access, or avoid mounting it entirely.

Interview Questions on This Topic

  • QWalk me through what happens at the kernel level when you run 'docker run alpine echo hello'. What syscalls does runc make?
  • QExplain the difference between namespaces and cgroups. What does each one isolate or limit?
  • QHow does the overlay2 union filesystem work? What happens when a container reads a file, writes a new file, modifies an existing file, and deletes a file?
  • QWhat is the role of containerd-shim in Docker? How does it differ from the pause container in a Kubernetes pod?
  • QYour container is OOM-killed but the host has plenty of free memory. How do you diagnose this using cgroup files in /sys/fs/cgroup?
  • QExplain how a veth pair connects a container to the Docker bridge network. What kernel mechanisms are involved?
  • QWhat is the OCI runtime spec? How does it enable runtime replaceability (swapping runc for gVisor or Kata)?

Frequently Asked Questions

Is a Docker container a virtual machine?

No. A container is a Linux process running inside namespaces and cgroups. It shares the host kernel. A VM runs a full guest operating system with its own kernel on top of a hypervisor. Containers start in milliseconds and use megabytes of memory. VMs take minutes to start and use gigabytes. The security trade-off: containers share the host kernel (a kernel CVE affects all containers), while VMs have a separate kernel per instance.

What is the difference between a namespace and a cgroup?

Namespaces isolate what a process can see. The PID namespace hides other processes. The network namespace gives the process its own network stack. The mount namespace gives it its own filesystem view. cgroups limit what a process can consume. The memory cgroup limits RAM usage. The CPU cgroup limits CPU time. The pids cgroup limits process count. Together, they provide isolation (namespaces) and resource control (cgroups).

Can I see a container's process on the host?

Yes. Every container is a real Linux process. Run docker inspect --format '{{.State.Pid}}' <container> to get the host PID. Then run ps aux | grep <pid> to see the process. The process runs inside namespaces, so it has a restricted view of the system, but it is a real process with a real PID on the host.

What happens when a container exceeds its memory limit?

The kernel's OOM killer terminates the container process with SIGKILL (exit code 137). The cgroup memory controller enforces the limit set by --memory. Without a limit, the container can consume all host RAM, and the OOM killer may kill an unrelated process based on its oom_score heuristic. Always set --memory in production.

What is the OCI runtime spec?

The OCI (Open Container Initiative) runtime spec is a JSON file (config.json) that describes how to create a container. It defines the namespaces, cgroups, mounts, environment, and command. runc reads this spec and creates the container by calling kernel syscalls. Any OCI-compliant runtime (runc, crun, kata-runtime, runsc) can read the same spec and create the container.
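A heavily trimmed, illustrative config.json sketch (the field names follow the OCI runtime spec; the specific values here are made up):

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["sleep", "3600"],
    "cwd": "/",
    "user": { "uid": 0, "gid": 0 }
  },
  "root": { "path": "rootfs" },
  "hostname": "demo",
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "mount" },
      { "type": "uts" }, { "type": "ipc" }
    ],
    "resources": {
      "memory": { "limit": 536870912 },
      "cpu": { "quota": 100000, "period": 100000 },
      "pids": { "limit": 256 }
    }
  }
}
```

Everything covered in this article appears here: the namespaces to clone, the cgroup limits, the root filesystem path, and the process to execve.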

πŸ”₯
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousDocker vs Virtual MachineNext β†’Docker Architecture Explained
Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged