Docker Overlay Network MTU — Silent Failures at Scale
VXLAN fragmentation silently dropped 100% of traffic at 10x load while docker service ls showed healthy.
- ✓ Production DevOps experience
- ✓ Deep understanding of the tool's internals
- ✓ Experience debugging distributed systems
- Orchestration layer (Kubernetes, ECS, Docker Swarm) manages container scheduling, scaling, and self-healing
- Container runtime (containerd, runc) executes containers with namespace isolation and cgroup limits
- Image registry (ECR, GCR, private Harbor) stores and distributes images with vulnerability scanning
- Service mesh (Istio, Linkerd) handles mTLS, traffic management, and observability
- Resource limits are mandatory — without --memory and --cpus, one container can starve all others on the host
- Logging must go to stdout/stderr: never write logs to the container filesystem (they are lost when the container is removed or replaced)
- Images must be immutable — tag with git SHA, never use :latest in production
- Health checks must be liveness + readiness — liveness restarts, readiness removes from load balancer
Docker overlay networks are the backbone of multi-host container communication in Docker Swarm, and Kubernetes CNI plugins such as Flannel rely on the same VXLAN technique. They create a virtual Layer 2 network across physical hosts using VXLAN encapsulation, allowing containers on different machines to communicate as if they were on the same subnet.
The default MTU for these overlay networks is 1450 bytes, which accounts for the 50-byte VXLAN header overhead on top of the typical 1500-byte Ethernet MTU. This works fine in controlled environments, but in production—especially across cloud providers, VPNs, or physical networks with jumbo frames or non-standard MTUs—the mismatch causes silent packet fragmentation or drops.
You won't see errors in application logs; you'll just notice intermittent timeouts, degraded throughput, or mysterious connection resets that only appear under load. The problem compounds at scale: a 0.1% packet loss rate from MTU mismatches can cascade into retransmission storms, TCP window scaling failures, and eventual service degradation that's nearly impossible to trace without deep packet inspection.
Tools like ip link show, tcpdump, and ping -M do are your first line of defense, but the fix often requires aligning three things: the overlay network's MTU (set per network with the com.docker.network.driver.mtu option; the daemon's --mtu flag only affects the default bridge network), the host network interface MTU, and any intermediate network gear. That alignment is easy to overlook during automated provisioning. If you're running overlay networks across AWS (9001 MTU), GCP (1460), or Azure (1500), you've likely hit this already and blamed the network team.
The reality is simpler: Docker's default MTU assumption may not match your infrastructure, and the silence of the failure is what makes it dangerous.
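The ping -M do check mentioned above can be sketched as a small probe. This is a minimal sketch, assuming Linux iputils ping; the function names and the peer address in the usage comment are hypothetical. It relies on the fact that an ICMP echo adds 28 bytes (20-byte IPv4 header plus 8-byte ICMP header) on top of the payload.

```shell
#!/bin/bash
# Sketch: probe the usable path MTU toward a peer host with DF-bit pings.
# Assumes Linux iputils ping (-M do sets Don't Fragment). Names are illustrative.

# ICMP payload size that yields an IPv4 packet of exactly <mtu> bytes:
# 20-byte IPv4 header + 8-byte ICMP header = 28 bytes of overhead.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Try candidate MTUs in the order given; print the first one that gets a reply.
probe_path_mtu() {
  local host="$1"
  shift
  local mtu
  for mtu in "$@"; do
    if ping -M do -c 1 -W 1 -s "$(icmp_payload_for_mtu "$mtu")" "$host" >/dev/null 2>&1; then
      echo "$mtu"
      return 0
    fi
  done
  return 1
}

# Usage (hypothetical peer address, candidates from largest to smallest):
#   probe_path_mtu 10.0.0.2 9001 1500 1460 1450 1400
```

Run it host to host on the underlay first. Whatever value it reports is the ceiling from which the VXLAN overhead must still be subtracted.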
Running Docker in production is like running a restaurant kitchen versus cooking at home. At home, you can leave dishes in the sink, ignore the smoke alarm, and run to the store if you forgot an ingredient. In a restaurant kitchen, every dish must be tracked, every station must be clean, every appliance must be monitored, and if the oven breaks at 7 PM on Saturday, you need a backup plan immediately. The cooking technique (Docker) is the same — the operational requirements are completely different.
Docker works out of the box for development. Running a single container on a laptop requires no orchestration, no monitoring, and no security hardening. Production is a different problem entirely — hundreds of containers across dozens of hosts, with requirements for zero-downtime deploys, automatic scaling, persistent data, and compliance auditing.
Most Docker-in-production failures fall into five categories: resource exhaustion (no limits set), networking misconfiguration (DNS, overlay MTU, port conflicts), logging gaps (logs in container filesystem, not shipped to central system), security exposure (root containers, exposed daemon socket, unpinned images), and deployment errors (no health checks, no rollback strategy, no canary testing).
This article covers the architecture decisions, operational patterns, and failure scenarios that determine whether your Docker deployment survives production traffic or collapses under it. Every section includes real debugging commands and failure stories.
What Docker Overlay Networking Actually Does
Docker overlay networking creates a virtual Layer 2 network across multiple Docker hosts using VXLAN encapsulation. Each container gets its own IP from a private subnet, and traffic between containers on different hosts is wrapped in UDP packets (typically port 4789) by the kernel's VXLAN implementation. This allows containers to communicate as if they're on the same switch, regardless of physical host placement.
The key mechanic is the VXLAN Tunnel Endpoint (VTEP) — each Docker host runs a VTEP that maps container IPs to host IPs. When a container sends a packet to another container on a different host, the source VTEP encapsulates the original Ethernet frame inside a UDP packet with the destination host's IP. The destination VTEP decapsulates and delivers it. This adds 50 bytes of overhead per packet (20 IP + 8 UDP + 8 VXLAN + 14 inner Ethernet). That overhead is invisible to applications but directly impacts MTU: if the physical network's MTU is 1500, the effective MTU for containers becomes 1450. Ignoring this causes silent packet fragmentation or drops.
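The overhead arithmetic above can be written out as a short calculation. This is an illustrative sketch using the IPv4 figures from the paragraph; the function name is made up for the example, and the three underlay values are the cloud defaults quoted earlier in the article.

```shell
#!/bin/bash
# Sketch: derive the overlay (container) MTU from the underlay MTU.
# Overhead figures are the IPv4 VXLAN values quoted in the text.
VXLAN_OVERHEAD=$(( 20 + 8 + 8 + 14 ))   # outer IPv4 + UDP + VXLAN + inner Ethernet = 50

overlay_mtu() {
  echo $(( $1 - VXLAN_OVERHEAD ))
}

for underlay in 1500 1460 9001; do
  echo "underlay=${underlay} overlay=$(overlay_mtu "$underlay")"
done
# underlay=1500 overlay=1450
# underlay=1460 overlay=1410
# underlay=9001 overlay=8951
```

The 1460 case is the interesting one: the default overlay MTU of 1450 is 40 bytes too large for that underlay.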
Use overlay networks when you need multi-host container communication without modifying the underlying network infrastructure, which is typical in Docker Swarm clusters (and in Kubernetes clusters whose CNI plugin uses VXLAN) where hosts span different subnets or cloud regions. The critical production concern is MTU mismatch: if your hosts are configured for jumbo frames (9000 MTU) but some hop in the underlay caps at 1500, or if containers keep the default 1450 MTU while the underlay MTU is below 1500 (GCP's 1460, or a VPN tunnel), you'll see intermittent TCP timeouts, slow transfers, and mysterious connection resets that only appear under load. This is not a theoretical issue: it is one of the most common causes of silent networking failures in Docker overlay deployments.
Production Architecture: Single Host to Multi-Host Orchestration
Running Docker in production requires an orchestration layer that manages container scheduling, networking, scaling, and self-healing across multiple hosts. Without orchestration, you are managing containers manually — which does not scale beyond 10-20 containers.
Single-host Docker (development only): Running docker run on a single host works for development but fails in production. There is no self-healing (if a container crashes, it stays dead unless you add --restart=always). There is no load balancing (all traffic goes to one container). There is no horizontal scaling (you must manually start more containers). There is no rolling deployment (you must stop the old container before starting the new one, causing downtime).
Docker Swarm: Docker's built-in orchestrator. Manages a cluster of Docker hosts as a single virtual host. Supports service definitions (desired state), rolling updates, and overlay networking. Swarm is simpler than Kubernetes but has fewer features: no custom resource definitions, limited networking options, and a smaller ecosystem. Swarm is adequate for small-to-medium deployments (< 50 services).
Kubernetes (K8s): The industry-standard orchestrator. Manages containers across a cluster with declarative configuration, automated scaling, self-healing, and a rich ecosystem of networking, storage, and observability tools. Kubernetes has a steep learning curve and significant operational overhead — it requires dedicated platform engineers to operate. Kubernetes is the right choice for large deployments (> 50 services) or when you need the ecosystem (service mesh, GitOps, custom operators).
AWS ECS / Fargate: AWS's managed container orchestration. ECS manages container scheduling on EC2 instances. Fargate abstracts the hosts entirely: you pay per container, not per host. ECS is simpler than Kubernetes (no control plane to manage) but locks you to AWS. Fargate eliminates host management entirely but typically costs more per unit of compute than self-managed EC2 (20-30% is a commonly quoted figure).
Architecture pattern: The production architecture stack is: Load Balancer -> Ingress Controller -> Orchestrator -> Container Runtime -> Host. Each layer has specific failure modes and debugging approaches. Understanding the full stack is essential for production debugging.
#!/bin/bash
# Production architecture setup and verification

# ── Docker Swarm: production setup ───────────────────────────────────────────

# Initialize the swarm on the manager node
docker swarm init --advertise-addr <manager-ip>
# Output: join token for worker nodes

# Join worker nodes
docker swarm join --token <token> <manager-ip>:2377

# Verify cluster status
docker node ls
# ID     HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
# abc*   manager1   Ready    Active         Leader
# def    worker1    Ready    Active
# ghi    worker2    Ready    Active

# Create a production service with resource limits
# (the app-overlay network must already exist)
docker service create \
  --name io-thecodeforge-api \
  --replicas 3 \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --reserve-cpu 0.5 \
  --reserve-memory 256m \
  --publish published=80,target=3000 \
  --update-parallelism 1 \
  --update-delay 10s \
  --update-failure-action rollback \
  --update-max-failure-ratio 0.25 \
  --health-cmd 'curl -f http://localhost:3000/health || exit 1' \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  --network app-overlay \
  registry.example.com/io-thecodeforge/api:1.0.0

# Verify service status
docker service ls
# ID    NAME                  MODE         REPLICAS   IMAGE
# abc   io-thecodeforge-api   replicated   3/3        registry.example.com/io-thecodeforge/api:1.0.0

# Check service logs across all replicas
docker service logs io-thecodeforge-api --tail 50 --follow

# ── Kubernetes: production deployment manifest ───────────────────────────────
cat <<'EOF' > /tmp/io-thecodeforge-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      containers:
        - name: api
          image: registry.example.com/io-thecodeforge/api:1.0.0
          ports:
            - containerPort: 3000
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              cpu: 1000m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
EOF

kubectl apply -f /tmp/io-thecodeforge-api-deployment.yaml
kubectl -n production rollout status deployment/io-thecodeforge-api

# ── AWS ECS: production task definition ───────────────────────────────────────
# Note: the awsfirelens log driver expects a FireLens log-router container
# (for example Fluent Bit) in the same task; it is omitted here for brevity.
cat <<'EOF' > /tmp/io-thecodeforge-api-task.json
{
  "family": "io-thecodeforge-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge/api:1.0.0",
      "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "us-east-1",
          "log_group_name": "/ecs/io-thecodeforge-api",
          "auto_create_group": "true"
        }
      }
    }
  ]
}
EOF

aws ecs register-task-definition --cli-input-json file:///tmp/io-thecodeforge-api-task.json
- Kubernetes manages not just containers but networking (CNI), storage (CSI), service discovery, ingress, RBAC, and custom resources.
- Swarm is simpler because it delegates networking and storage to Docker's built-in drivers.
- Kubernetes' complexity is the cost of flexibility — it can model any production topology.
- For simple deployments (< 50 services), Swarm is sufficient and far easier to operate.
Resource Management: CPU, Memory, OOM, and Noisy Neighbors
Resource management is the most critical production concern for shared container hosts. Without explicit resource limits, one misbehaving container can starve every other container on the same host.
CPU limits: Docker uses cgroups to enforce CPU limits. --cpus=1.0 gives the container access to 1 CPU core worth of time. Without a limit, a container can consume all available CPU. CPU is a compressible resource: the kernel throttles a container that hits its limit but does not kill it. A CPU-hungry container with a limit therefore slows itself down, while one without a limit slows down every other container on the host.
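Throttling is easiest to reason about as a percentage of scheduling periods. The following is a minimal sketch that reads a cgroup cpu.stat file on standard input; the function name is hypothetical, and it relies only on the nr_periods and nr_throttled fields, which appear in both cgroup v1 and v2.

```shell
#!/bin/bash
# Sketch: percentage of CFS periods in which a container was throttled.
# Reads cpu.stat content on stdin; the function name is illustrative.
throttle_percent() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) print int(throttled * 100 / periods)
      else print 0
    }'
}

# Usage on a cgroup v1 host (container ID left as a placeholder):
#   throttle_percent < /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat
```

A value that stays above roughly 10% suggests the limit is too tight for the workload, which matches the alert threshold used later in the article.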
Memory limits: Memory is an incompressible resource. When a container exceeds its memory limit, the kernel OOM killer terminates a process inside it. The OOM killer selects processes based on oom_score, a heuristic driven mainly by how much memory the process uses and shifted by oom_score_adj. Without a memory limit, a leaking container consumes all host memory, and the OOM killer may kill unrelated containers or critical host processes (kubelet, containerd).
Requests vs limits (Kubernetes): Requests guarantee a minimum allocation — the scheduler places the pod on a node with enough available resources. Limits set the maximum — the container is throttled (CPU) or killed (memory) if exceeded. Best practice: set requests equal to limits for critical services (guaranteed QoS). Set requests lower than limits for burstable services (burstable QoS).
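The QoS rules above can be expressed as a small decision function. This is a simplified sketch for a single-container pod: the function name is made up, values are compared as plain strings (so 1000m and 1 would not be recognized as equal), and it assumes the Kubernetes behavior of copying a limit into a missing request.

```shell
#!/bin/bash
# Sketch: classify a single-container pod into a Kubernetes QoS class.
# Arguments: cpu request, cpu limit, memory request, memory limit ("" = unset).
# Values are compared as strings, so use the same notation on both sides.
qos_class() {
  local cpu_req="$1" cpu_lim="$2" mem_req="$3" mem_lim="$4"

  # Nothing set at all: BestEffort.
  if [ -z "${cpu_req}${cpu_lim}${mem_req}${mem_lim}" ]; then
    echo BestEffort
    return
  fi

  # A missing request defaults to the limit.
  cpu_req="${cpu_req:-$cpu_lim}"
  mem_req="${mem_req:-$mem_lim}"

  if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
     && [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 1000m 1000m 512Mi 512Mi   # Guaranteed
qos_class 500m 1000m 256Mi 512Mi    # Burstable
qos_class "" "" "" ""               # BestEffort
```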
Noisy neighbor problem: Multiple containers on the same host compete for CPU, memory, disk I/O, and network bandwidth. Without resource limits, one container's spike affects all others. The fix: set limits on every production container. Monitor host-level resource usage with docker stats and Prometheus node_exporter.
OOM score and priority: The kernel computes an oom_score for each process, and higher scores are killed first. The score can be shifted with oom_score_adj (-1000 to 1000), which Docker sets per container through --oom-score-adj. Critical services (databases) can run with oom_score_adj=-999 (almost never killed). Non-critical services can run with oom_score_adj=1000 (killed first).
#!/bin/bash
# Production resource management configuration and monitoring

# ── CPU limits ───────────────────────────────────────────────────────────────

# Run with 1 CPU core limit
# (alpine:3.19 does not bundle the stress binary; substitute an image that includes it)
docker run --cpus=1.0 --name cpu-test alpine:3.19 stress --cpu 2 --timeout 10s
# The container is throttled to 1 CPU even if stress spawns 2 workers

# Check CPU throttling (cgroup v1 path; on cgroup v2 with the systemd driver look under
# /sys/fs/cgroup/system.slice/docker-<container-id>.scope/)
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat
# nr_periods: total scheduling periods
# nr_throttled: periods where the container was throttled
# throttled_time: total time throttled (nanoseconds)

# Check CPU shares (relative priority)
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.shares
# Default: 1024. Set with --cpu-shares=512 for lower priority

# ── Memory limits ────────────────────────────────────────────────────────────

# Run with 256MB memory limit
docker run --memory=256m --memory-swap=256m --name mem-test alpine:3.19 stress --vm 1 --vm-bytes 300M --timeout 10s
# Container is OOM-killed because it exceeds 256MB limit

# Check memory usage before OOM (cgroup v1 file names; cgroup v2 uses memory.current and memory.max)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.max_usage_in_bytes

# Check OOM events
dmesg | grep -i 'oom\|killed process' | tail -10
# [12345.678] Out of memory: Killed process 5678 (node) total-vm:123456kB, anon-rss:98765kB

# ── OOM score management ────────────────────────────────────────────────────

# Check a container's OOM score
CONTAINER_PID=$(docker inspect <container> --format '{{.State.Pid}}')
cat "/proc/$CONTAINER_PID/oom_score"
# higher = more likely to be killed
cat "/proc/$CONTAINER_PID/oom_score_adj"
# -1000 to 1000: adjust the score

# Set OOM priority for critical services (database)
docker run --oom-score-adj=-999 --name critical-db postgres:16
# This container is almost never killed by the OOM killer

# Set OOM priority for non-critical services (cache)
docker run --oom-score-adj=1000 --name expendable-cache redis:7
# This container is killed first in an OOM situation

# ── Kubernetes resource management ──────────────────────────────────────────

# Guaranteed QoS: requests == limits (last to be evicted under node pressure)
cat <<'EOF'
resources:
  requests:
    cpu: 1000m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 512Mi
EOF

# Burstable QoS: requests < limits (can burst but may be throttled/killed)
cat <<'EOF'
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
EOF

# ── Monitor resource usage across all containers ─────────────────────────────

# Real-time resource usage
docker stats --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}'

# Find containers without resource limits
docker ps -q | xargs -I{} docker inspect {} --format '{{.Name}}: CPU={{.HostConfig.NanoCpus}} MEM={{.HostConfig.Memory}}'
# Containers with NanoCpus=0 or Memory=0 have no limits

# Host-level resource check
free -h
cat /proc/loadavg
uptime
- CPU is compressible — the kernel throttles a CPU-hungry container but does not kill it.
- Memory is incompressible — when physical memory is exhausted, the kernel must kill a process.
- Without memory limits, the OOM killer may kill critical host processes (containerd, kubelet).
- With memory limits, only the offending container is killed — other containers are unaffected.
Networking in Production: DNS, Overlay, Load Balancing, and Service Mesh
Production Docker networking requires reliable DNS resolution, load balancing, and health-aware traffic routing. The default bridge network provides none of these — production deployments must use user-defined networks or an orchestrator's networking layer.
DNS-based service discovery: User-defined Docker networks and Kubernetes provide DNS-based service discovery. Containers resolve service names to IP addresses via an embedded DNS server (127.0.0.11 in Docker, CoreDNS in Kubernetes). The default bridge network has no DNS — containers can only reach each other by IP, which changes on every restart.
Overlay networking: For multi-host deployments, overlay networks use VXLAN encapsulation to create a virtual Layer 2 network across hosts. Each overlay network defaults to an MTU of 1450 (a 1500-byte underlay minus 50 bytes of VXLAN overhead). Misconfigured MTU is a common production failure: when the overlay MTU plus the 50-byte overhead exceeds what the underlay can carry, the encapsulated packets are fragmented, and under high load the fragment queue can overflow, causing silent packet drops.
Load balancing: Docker Swarm provides built-in load balancing via a routing mesh — any node can route traffic to any service replica. Kubernetes provides kube-proxy (iptables/IPVS-based) and ingress controllers (NGINX, Traefik, Envoy) for external traffic. For production, an ingress controller with TLS termination, rate limiting, and circuit breaking is mandatory.
Service mesh: A service mesh (Istio, Linkerd) adds mTLS between services, traffic splitting (canary deployments), circuit breaking, and observability (distributed tracing, metrics). The trade-off: added latency (1-3ms per hop) and operational complexity. Use a service mesh when you need mTLS or traffic splitting. Do not add one 'just in case.'
Network policies: In Kubernetes, NetworkPolicy resources restrict which pods can communicate with each other. Without network policies, all pods can communicate — a compromised pod can reach the database directly. Default-deny network policies are a production best practice.
#!/bin/bash
# Production networking configuration and debugging

# ── Docker Swarm overlay network ─────────────────────────────────────────────

# Create an overlay network with correct MTU
docker network create \
  --driver overlay \
  --opt com.docker.network.driver.mtu=8950 \
  --subnet 10.0.0.0/24 \
  --gateway 10.0.0.1 \
  app-overlay
# MTU calculation: VPC MTU (9001) - VXLAN overhead (50) = 8951, round to 8950

# Verify overlay network
docker network inspect app-overlay --format '{{.Driver}} {{.Options}}'
# overlay map[com.docker.network.driver.mtu:8950]

# ── DNS resolution verification ──────────────────────────────────────────────

# Check embedded DNS server
docker exec <container> cat /etc/resolv.conf
# nameserver 127.0.0.11
# options ndots:0

# Resolve a service name
docker exec <container> nslookup io-thecodeforge-api
# Server: 127.0.0.11
# Address: 10.0.0.5

# Check DNS query logs (Docker daemon)
sudo journalctl -u docker | grep 'DNS query' | tail -10

# ── Network health checks ───────────────────────────────────────────────────

# Check overlay network peer status
docker network inspect app-overlay --format '{{.Peers}}'
# Shows all nodes participating in the overlay

# Check IP fragmentation and reassembly counters (critical for overlay networks)
grep '^Ip:' /proc/net/snmp
# The first line names the counters, the second line holds their values.
# If ReasmFails or FragFails keeps rising, fragments are being dropped.

# Check MTU of container interface
docker exec <container> cat /sys/class/net/eth0/mtu
# Should match the overlay network MTU (8950)

# ── Traffic debugging with tcpdump ───────────────────────────────────────────

# Capture traffic on the overlay bridge
sudo tcpdump -i docker_gwbridge -n -c 20

# Capture traffic inside a container's namespace
CONTAINER_PID=$(docker inspect <container> --format '{{.State.Pid}}')
sudo nsenter --net --target "$CONTAINER_PID" tcpdump -i eth0 -n -c 20

# ── Kubernetes network policies (default-deny) ──────────────────────────────
cat <<'EOF' > /tmp/default-deny-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF

cat <<'EOF' > /tmp/allow-api-to-db.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: io-thecodeforge-db
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: io-thecodeforge-api
      ports:
        - protocol: TCP
          port: 5432
EOF

kubectl apply -f /tmp/default-deny-policy.yaml
kubectl apply -f /tmp/allow-api-to-db.yaml

# ── Load balancer health verification ────────────────────────────────────────

# Check which containers are receiving traffic
curl -s http://localhost:80/health | jq .hostname

# Repeat 10 times: should show different hostnames (round-robin)
for i in $(seq 1 10); do
  curl -s http://localhost:80/health | jq -r .hostname
done
- VXLAN encapsulation adds 50 bytes of overhead, so the overlay MTU must be at least 50 bytes less than the smallest MTU on the underlay path.
- If the overlay MTU is too large, packets are fragmented at the VXLAN boundary.
- Under normal load, fragmentation is slow but functional. Under high load, the fragment queue overflows and packets are silently dropped.
- The failure is silent — containers appear healthy but inter-service communication fails.
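Because the failure is silent at the container level, the kernel's IP counters are one of the few places it shows up. The sketch below pulls a single named counter out of /proc/net/snmp, whose Ip section is a header line of counter names followed by a line of values. The function name is hypothetical; ReasmFails and FragFails are the standard counter names for failed reassembly and failed fragmentation.

```shell
#!/bin/bash
# Sketch: extract one counter from the "Ip:" section of /proc/net/snmp.
# Reads the file content on stdin; the function name is illustrative.
ip_counter() {
  awk -v want="$1" '
    # First "Ip:" line: remember the column of every counter name.
    $1 == "Ip:" && !have_header {
      for (i = 2; i <= NF; i++) col[$i] = i
      have_header = 1
      next
    }
    # Second "Ip:" line: print the value in the remembered column.
    $1 == "Ip:" {
      if (want in col) print $(col[want])
      exit
    }'
}

# On a Linux host, print both counters; sample them twice to see whether they grow.
if [ -r /proc/net/snmp ]; then
  echo "ReasmFails=$(ip_counter ReasmFails < /proc/net/snmp)" \
       "FragFails=$(ip_counter FragFails < /proc/net/snmp)"
fi
```

A counter that increases between two samples taken under load is a stronger signal than a non-zero absolute value, since the counters accumulate from boot.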
Logging, Monitoring, and Observability
Production observability is the difference between debugging a failure in 5 minutes and debugging it in 5 hours. Docker provides basic logging — production requires a centralized logging pipeline, metrics collection, and distributed tracing.
Container logging model: Docker captures stdout and stderr from each container and writes them to JSON files under /var/lib/docker/containers/<id>/<id>-json.log. Applications must write logs to stdout/stderr, never to a file inside the container. Log files inside the container are lost when the container is removed or replaced, which happens on every redeploy.
Log rotation: Docker's default JSON log driver has no size limit — log files grow unbounded until disk is full. Production deployments must configure log rotation in daemon.json: max-size (e.g., 10m) and max-file (e.g., 3). Without rotation, a chatty application can fill the host disk in hours.
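The rotation settings translate directly into a disk budget. As a rough sketch, the worst case per host is max-size times max-file times the number of containers; the function name and the container count of 50 are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: upper bound on json-file log disk usage for one host.
# Arguments: max-size in MB, max-file count, number of containers.
log_budget_mb() {
  echo $(( $1 * $2 * $3 ))
}

# With max-size=10m and max-file=3, 50 containers need at most:
echo "$(log_budget_mb 10 3 50) MB"
# 1500 MB
```

This is an upper bound; if compression of rotated files is enabled, actual usage is lower.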
Centralized logging: Container logs must be shipped to a centralized system (ELK, Datadog, CloudWatch, Loki) for search, alerting, and retention. Use a logging agent (Fluentd, Filebeat, FireLens) as a DaemonSet or sidecar. The agent reads container logs and ships them to the central system.
Metrics collection: Container metrics (CPU, memory, network, disk I/O) are exposed by Docker (docker stats) and cAdvisor. For production, use Prometheus with node_exporter (host metrics) and cAdvisor (container metrics). Kubernetes exposes metrics via the metrics-server. Alert on: container memory usage > 80% of limit, CPU throttling > 10%, restart count > 5 in 1 hour.
Distributed tracing: For microservices, distributed tracing (Jaeger, Zipkin, OpenTelemetry) tracks a request across multiple services. Each service adds a trace ID to outgoing requests and logs. The tracing system aggregates these logs into a single trace view. Essential for debugging latency issues in multi-service architectures.
Structured logging: Applications should emit structured logs (JSON) with fields: timestamp, level, message, trace_id, service, request_id. Unstructured logs (plain text) are impossible to parse and alert on at scale.
#!/bin/bash
# Production logging, monitoring, and observability setup

# ── Docker daemon log rotation ───────────────────────────────────────────────
# Note: this overwrites any existing daemon.json, and the new log options
# apply only to containers created after the restart.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}
EOF

sudo systemctl restart docker

# Verify log rotation is configured
docker info --format '{{.LoggingDriver}}'
# json-file

# ── Check container log size ────────────────────────────────────────────────

# Find large container logs
find /var/lib/docker/containers -name '*-json.log' -exec ls -lhS {} + | head -10
# If any log is > 100MB, rotation is not working

# Check total log disk usage
du -sh /var/lib/docker/containers/*

# ── Fluentd DaemonSet logging agent (Kubernetes) ────────────────────────────
cat <<'EOF' > /tmp/fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: dockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: dockercontainers
          hostPath:
            path: /var/lib/docker/containers
EOF

kubectl apply -f /tmp/fluentd-daemonset.yaml

# ── Prometheus alerting rules ────────────────────────────────────────────────
cat <<'EOF' > /tmp/container-alerts.yaml
groups:
  - name: container-alerts
    rules:
      - alert: ContainerMemoryHigh
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} memory usage above 80%'
      - alert: ContainerCPUThrottled
        expr: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Container {{ $labels.name }} CPU throttled > 10%'
      - alert: ContainerRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: 'Container {{ $labels.container }} restarted {{ $value }} times in 1 hour'
EOF

# ── Structured logging example (Java) ────────────────────────────────────────
cat <<'EOF'
// io.thecodeforge.logging.StructuredLogger.java
package io.thecodeforge.logging;

import com.fasterxml.jackson.databind.ObjectMapper;
import java.time.Instant;
import java.util.Map;

public class StructuredLogger {
    private static final ObjectMapper mapper = new ObjectMapper();
    private final String serviceName;

    public StructuredLogger(String serviceName) {
        this.serviceName = serviceName;
    }

    public void info(String message, String traceId, Map<String, Object> fields) {
        try {
            Map<String, Object> logEntry = Map.of(
                "timestamp", Instant.now().toString(),
                "level", "INFO",
                "message", message,
                "service", serviceName,
                "trace_id", traceId != null ? traceId : "",
                "fields", fields != null ? fields : Map.of()
            );
            System.out.println(mapper.writeValueAsString(logEntry));
        } catch (Exception e) {
            System.err.println("LOG_ERROR: " + e.getMessage());
        }
    }

    public void error(String message, String traceId, Throwable throwable) {
        try {
            // Map.of rejects null values, so guard the fields that may be absent.
            StackTraceElement[] stack = throwable.getStackTrace();
            Map<String, Object> logEntry = Map.of(
                "timestamp", Instant.now().toString(),
                "level", "ERROR",
                "message", message,
                "service", serviceName,
                "trace_id", traceId != null ? traceId : "",
                "error_class", throwable.getClass().getName(),
                "error_message", String.valueOf(throwable.getMessage()),
                "stack_trace", stack.length > 0 ? stack[0].toString() : ""
            );
            System.err.println(mapper.writeValueAsString(logEntry));
        } catch (Exception e) {
            System.err.println("LOG_ERROR: " + e.getMessage());
        }
    }
}
EOF
- Unstructured logs (plain text) cannot be parsed by log aggregation systems at scale.
- Structured logs (JSON) allow filtering by service, level, trace_id, and custom fields.
- Alerts on structured logs (e.g., 'error rate > 5% in 5 minutes') require parseable fields.
- Without structured logging, you are grepping through terabytes of text files.
CI/CD Pipeline: Image Building, Scanning, and Deployment Strategies
Production deployments require a CI/CD pipeline that builds, scans, tests, and deploys container images with zero downtime. Manual docker build && docker push does not scale and introduces human error.
Image building best practices:
- Use multi-stage builds to separate build dependencies from runtime. The final image should contain only the application binary and runtime dependencies.
- Pin base image versions (node:20.11-alpine, not node:latest). Latest tags change without notice.
- Use .dockerignore to exclude build context bloat (node_modules, .git, *.log).
- Enable BuildKit (DOCKER_BUILDKIT=1) for parallel builds and secret mounting.
- Tag images with git SHA (not :latest, not :v1). The git SHA is immutable and traceable.
Image scanning: Every image must be scanned for known CVEs before deployment. Tools: Trivy, Snyk, AWS ECR scanning, Grype. Block deployment if critical or high CVEs are found. Scan the base image AND the application dependencies.
Deployment strategies:
- Rolling update: replace containers one at a time. Simple but no rollback guarantee.
- Blue-green: deploy new version alongside old, switch traffic atomically. Instant rollback.
- Canary: deploy to 5% of traffic, monitor for errors, then gradually increase. Best for catching regressions.
- A/B testing: deploy two versions simultaneously, split traffic by user segment. Best for feature testing.
Rollback: Every deployment must have a one-command rollback. In Kubernetes: kubectl rollout undo. In Docker Swarm: docker service rollback. In ECS: update the service to the previous task definition. If rollback requires a new build, you do not have a rollback strategy.
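For a canary, the rollback decision itself is worth automating so that nobody has to eyeball a dashboard. The following is a minimal sketch of such a gate; the function name, the three verdict strings, and the idea of feeding it raw error and request counts are assumptions for illustration, not part of any orchestrator's API.

```shell
#!/bin/bash
# Sketch: decide what to do with a canary from its error and request counts.
# Arguments: error count, total request count, maximum tolerated error rate in percent.
canary_verdict() {
  local errors="$1" total="$2" max_pct="$3"

  # No traffic yet: there is nothing to judge.
  if [ "$total" -eq 0 ]; then
    echo wait
    return
  fi

  # Integer-only comparison: errors/total > max_pct/100
  # is equivalent to errors*100 > max_pct*total.
  if [ $(( errors * 100 )) -gt $(( max_pct * total )) ]; then
    echo rollback
  else
    echo promote
  fi
}

canary_verdict 3 1000 1    # promote  (0.3% errors, threshold 1%)
canary_verdict 20 1000 1   # rollback (2% errors, threshold 1%)
```

In a pipeline, a rollback verdict would trigger the one-command rollback for the orchestrator in use.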
Image immutability: Never push to the same tag twice. If you rebuild an image, use a new tag (new git SHA). Mutable tags (pushing to :latest or :v1 twice) cause 'works on my machine' bugs because different hosts have different image layers cached.
#!/bin/bash
# Production CI/CD pipeline for Docker images

# ── Multi-stage Dockerfile ───────────────────────────────────────────────────
cat <<'EOF' > /tmp/Dockerfile
# Stage 1: Build
FROM node:20.11-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install all dependencies: the build step usually needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies before node_modules is copied into the runtime image
RUN npm prune --omit=dev

# Stage 2: Runtime (minimal image)
FROM node:20.11-alpine AS runtime
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup
WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=10s --timeout=5s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
EOF

# ── Build with BuildKit ──────────────────────────────────────────────────────
GIT_SHA=$(git rev-parse --short HEAD)
IMAGE_TAG="registry.example.com/io-thecodeforge/api:${GIT_SHA}"

DOCKER_BUILDKIT=1 docker build \
  --tag "${IMAGE_TAG}" \
  --label "io.thecodeforge.build.sha=${GIT_SHA}" \
  --label "io.thecodeforge.build.date=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --file /tmp/Dockerfile \
  .

# ── Image scanning with Trivy (before the image is pushed) ──────────────────
trivy image --severity HIGH,CRITICAL --exit-code 1 "${IMAGE_TAG}" || exit 1
# Exit code 1 = vulnerabilities found, block deployment
# Exit code 0 = no critical/high vulnerabilities

docker push "${IMAGE_TAG}"

# ── Deployment: rolling update (Kubernetes) ──────────────────────────────────
kubectl -n production set image deployment/io-thecodeforge-api \
  api="${IMAGE_TAG}"

# Monitor rollout
kubectl -n production rollout status deployment/io-thecodeforge-api --timeout=300s

# ── Rollback (one command) ──────────────────────────────────────────────────
kubectl -n production rollout undo deployment/io-thecodeforge-api

# ── Canary deployment (Kubernetes with Istio) ───────────────────────────────
# The stable and canary subsets must be defined in a matching DestinationRule.
cat <<'EOF' > /tmp/canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  hosts:
    - api.example.com
  http:
    - route:
        - destination:
            host: io-thecodeforge-api
            subset: stable
          weight: 95
        - destination:
            host: io-thecodeforge-api
            subset: canary
          weight: 5
EOF

kubectl apply -f /tmp/canary-virtualservice.yaml

# Monitor canary error rate
# If error rate > 1%, roll back: set the canary weight to 0 (stable: 100) and re-apply.
# Deleting the VirtualService alone (kubectl delete -f /tmp/canary-virtualservice.yaml)
# removes the weighted routing but may still leave canary pods receiving traffic.

# ── Verify image immutability ────────────────────────────────────────────────
# A single host maps each tag to exactly one image, so compare the registry digest
# of the tag across hosts instead. Run this on every host:
docker image inspect --format '{{index .RepoDigests 0}}' "${IMAGE_TAG}"
# If two hosts print different digests for the same tag, the tag was overwritten (bad)
- Mutable tags cause different hosts to run different image versions — the same tag means different things on different machines.
- Immutable tags (git SHA) guarantee that every host runs the exact same binary.
- Rollback is trivial with immutable tags — just point to the previous SHA.
- Debugging is deterministic — the git SHA maps directly to the source code that produced the image.
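The immutability rule can be enforced mechanically before a deploy. The sketch below is a hypothetical helper, not part of the pipeline above: it assumes an immutable tag is a 7 to 40 character lowercase hex git SHA and rejects missing tags and :latest. Digest references (image@sha256:...) are out of scope for this check.

```shell
#!/bin/bash
# Hypothetical pre-deploy guard (assumption: immutable tags are git SHAs).
# Returns 0 when the image reference ends in a 7-40 char lowercase hex tag.
is_immutable_tag() {
  image_ref=$1
  tag=${image_ref##*:}

  # No colon at all: Docker would silently default to :latest.
  if [ "$tag" = "$image_ref" ]; then
    return 1
  fi

  # A registry port without a tag (registry:5000/api) leaves "5000/api" here,
  # which the hex check below rejects as well.
  printf '%s' "$tag" | grep -Eq '^[0-9a-f]{7,40}$'
}

for ref in \
  "registry.example.com/io-thecodeforge/api:3f2a9c1" \
  "registry.example.com/io-thecodeforge/api:latest"; do
  if is_immutable_tag "$ref"; then
    echo "ok       $ref"
  else
    echo "rejected $ref"
  fi
done
```

A CI job could call is_immutable_tag on the image reference right before the deploy step and fail the job on a non-zero status.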
Security Hardening: Root, Secrets, Network, and Supply Chain
Production Docker security is a layered defense — no single measure is sufficient. Each layer (image, runtime, network, host) must be hardened independently.
Run as non-root: Containers running as root (uid 0) can exploit kernel vulnerabilities with maximum privileges. Every production container should run as a non-root user. Set USER in the Dockerfile or use --user in the run command. Drop all capabilities and add back only what is needed: --cap-drop=ALL --cap-add=NET_BIND_SERVICE.
Secrets management: Never bake secrets (API keys, passwords, certificates) into Docker images. Secrets in images are visible to anyone who can pull the image. Use: Docker secrets (Swarm), Kubernetes secrets (with external secret managers like Vault or AWS Secrets Manager), or environment variables injected at runtime from a secret manager. Use --mount=type=secret for build-time secrets in BuildKit.
Image provenance: Verify the source of base images. Use official images or images from trusted registries. Enable Docker Content Trust (DOCKER_CONTENT_TRUST=1) to verify image signatures. Use SBOM (Software Bill of Materials) tools to track all components in your images.
Runtime security:
- seccomp: filters syscalls. The default profile blocks ~44 dangerous syscalls. Use custom profiles for stricter filtering.
- AppArmor/SELinux: mandatory access control. The docker-default AppArmor profile restricts container capabilities.
- Read-only filesystem: --read-only prevents the container from modifying its filesystem. Use tmpfs for writable directories.
- No new privileges: --security-opt=no-new-privileges prevents privilege escalation.
Daemon security: The Docker daemon runs as root and has full access to the host. The daemon socket (/var/run/docker.sock) is equivalent to root access. Never mount the daemon socket into containers. Never expose the daemon over TCP without TLS client authentication. Use rootless Docker for environments where daemon root access is unacceptable.
#!/bin/bash
# Production security hardening for Docker

# ── Non-root container ──
# Dockerfile best practice:
#   RUN addgroup -g 1001 -S appgroup && adduser -S appuser -u 1001 -G appgroup
#   USER appuser

# Runtime: run detached as non-root with dropped capabilities
docker run -d \
  --user 1001:1001 \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt=no-new-privileges \
  --read-only \
  --tmpfs /tmp:size=64m \
  --name hardened-api \
  io-thecodeforge/api:1.0.0

# Verify non-root
docker exec hardened-api id
# uid=1001(appuser) gid=1001(appgroup)

# Verify capabilities
docker inspect hardened-api --format '{{.HostConfig.CapAdd}} {{.HostConfig.CapDrop}}'
# [NET_BIND_SERVICE] [ALL]

# ── Secrets management ──
# Docker Swarm secrets (printf avoids the trailing newline that echo would add)
printf '%s' 'my-database-password' | docker secret create db-password -
docker service create --secret db-password --name io-thecodeforge-api \
  io-thecodeforge/api:1.0.0
# Secret is available at /run/secrets/db-password inside the container

# BuildKit: mount secrets during build (not in final image)
DOCKER_BUILDKIT=1 docker build --secret id=npmrc,src=$HOME/.npmrc -t api:1.0 .
# In Dockerfile (mount the secret at the target path; copying it would bake it into the layer):
#   RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci

# ── Image scanning and provenance ──
# Scan for vulnerabilities
trivy image --severity HIGH,CRITICAL io-thecodeforge/api:1.0.0

# Generate SBOM (Software Bill of Materials)
syft io-thecodeforge/api:1.0.0 -o spdx-json > sbom.json

# Verify image signature (Docker Content Trust)
export DOCKER_CONTENT_TRUST=1
docker pull io-thecodeforge/api:1.0.0
# Fails if the image is not signed

# ── Seccomp profile ──
# The default seccomp profile is built into the daemon and applied automatically.
# Pass --security-opt seccomp=<path> only for a profile you keep on disk yourself.
docker run -d io-thecodeforge/api:1.0.0

# Create a custom seccomp profile (allow only required syscalls).
# Illustrative list only: most applications need more syscalls than this.
# Build the real list by tracing the workload before enforcing it.
cat <<'EOF' > /tmp/seccomp-api.json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "stat", "fstat", "mmap",
                "mprotect", "munmap", "brk", "ioctl", "access", "socket",
                "connect", "sendto", "recvfrom", "clone", "execve", "exit",
                "exit_group", "futex", "epoll_create1", "epoll_ctl",
                "epoll_wait", "accept4", "listen", "bind", "setsockopt"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF
docker run -d --security-opt seccomp=/tmp/seccomp-api.json \
  io-thecodeforge/api:1.0.0

# ── Daemon security ──
# Check if daemon socket is exposed
curl --unix-socket /var/run/docker.sock http://localhost/version
# If this works, the daemon is accessible: anyone with socket access has root

# Check if daemon is exposed over TCP
netstat -tlnp | grep 2375
# Port 2375 = unencrypted Docker API (NEVER expose this)
netstat -tlnp | grep 2376
# Port 2376 = TLS-encrypted Docker API (OK if TLS client auth is configured)

# Verify daemon configuration
cat /etc/docker/daemon.json
# Should NOT contain: "hosts": ["tcp://0.0.0.0:2375"]
- The daemon runs as root and has full access to the host filesystem, network, and processes.
- Anyone who can access the socket can create a container with the host filesystem mounted.
- Mounting the host filesystem into a container gives the container root access to the host.
- Never mount /var/run/docker.sock into containers. Never expose the daemon over TCP without TLS.
Scaling Strategies: Horizontal, Vertical, and Auto-Scaling
Scaling Docker in production means adding capacity to handle increased traffic. The strategy depends on the workload pattern: predictable traffic, bursty traffic, or event-driven traffic.
Horizontal scaling (scale out): Add more container replicas. Each replica handles a portion of the traffic. Horizontal scaling is preferred for stateless workloads: it provides redundancy (if one replica fails, others continue), and throughput grows close to linearly as long as shared dependencies such as the database keep up. Docker Swarm: docker service scale api=10. Kubernetes: kubectl scale deployment/api --replicas=10. ECS: update the service desired count.
Vertical scaling (scale up): Increase the resources (CPU, memory) of existing containers. Vertical scaling is simpler but limited by the host's capacity. It usually also means recreating the container with new resource limits; docker update can change limits on a running container, but many applications only pick up the extra memory or CPU after a restart. Vertical scaling is appropriate for stateful workloads (databases) that cannot easily distribute across replicas.
Auto-scaling: Automatically adjust replica count based on metrics. The most common triggers:
- CPU utilization > 70% for 5 minutes -> add replicas
- Request rate > 1000 req/s -> add replicas
- Queue depth > 100 messages -> add worker replicas
- Custom metrics (response latency, error rate) -> add or remove replicas
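To make the CPU trigger concrete, here is a minimal sketch of the replica calculation the Kubernetes Horizontal Pod Autoscaler documents: desired = ceil(current_replicas * current_metric / target_metric), clamped to the configured minimum and maximum. The function name and the sample numbers are assumptions for illustration, and the real controller also applies a tolerance band and stabilization windows that this sketch ignores.

```shell
#!/bin/bash
# Sketch of the documented HPA formula (ignores tolerance and stabilization).
# Usage: desired_replicas <current_replicas> <current_util%> <target_util%> <min> <max>
desired_replicas() {
  current_replicas=$1
  current_util=$2
  target_util=$3
  min_replicas=$4
  max_replicas=$5

  # Integer ceiling of current_replicas * current_util / target_util
  want=$(( (current_replicas * current_util + target_util - 1) / target_util ))

  if [ "$want" -lt "$min_replicas" ]; then
    want=$min_replicas
  fi
  if [ "$want" -gt "$max_replicas" ]; then
    want=$max_replicas
  fi
  echo "$want"
}

# 5 replicas at 90% CPU against a 70% target, bounds 3..20
desired_replicas 5 90 70 3 20   # prints 7
# 5 replicas at 45% CPU against a 70% target: scale down
desired_replicas 5 45 70 3 20   # prints 4
```

Running the numbers before setting thresholds shows how aggressively a given target reacts: at a 70% target, a jump from 70% to 90% CPU adds two replicas to a five-replica service.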
Pre-warming: Container startup is fast (0.3-2s) but application cold start can be 10-60s (JVM startup, dependency initialization, connection pool warmup). Pre-warm containers by pulling images before scaling events and using readiness probes that wait for full initialization. For JVM applications, use class data sharing (CDS) or GraalVM native images to reduce cold start.
Scale-down strategy: Removing replicas must be graceful. The replica should stop accepting new requests, drain in-flight requests, close connections, and then exit. Kubernetes handles this with preStop hooks and terminationGracePeriodSeconds. Without graceful shutdown, in-flight requests are dropped during scale-down, causing user-facing errors.
Capacity planning: Monitor resource usage trends over weeks. If average CPU usage is growing 5% per week, plan to add capacity before it reaches 80%. Auto-scaling handles burst traffic, but baseline capacity must be planned manually.
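The capacity-planning rule is simple arithmetic. The sketch below assumes the growth is linear and measured in percentage points per week (so "5% per week" means 55% becomes 60%); if your growth compounds, the deadline arrives sooner. The function name is hypothetical.

```shell
#!/bin/bash
# Weeks until average CPU reaches the planning threshold.
# Assumption: linear growth in percentage points per week.
# Usage: weeks_until_threshold <current%> <growth_points_per_week> <threshold%>
weeks_until_threshold() {
  current=$1
  growth=$2
  threshold=$3

  if [ "$growth" -le 0 ]; then
    echo "growth must be positive" >&2
    return 1
  fi
  if [ "$current" -ge "$threshold" ]; then
    echo 0
    return 0
  fi
  # Integer ceiling of (threshold - current) / growth
  echo $(( (threshold - current + growth - 1) / growth ))
}

# 55% today, growing 5 points per week, plan for 80%
weeks_until_threshold 55 5 80   # prints 5
```

If provisioning new hosts takes three weeks, a result of 5 means the order has to go out within two weeks.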
#!/bin/bash
# Production scaling strategies and configuration

# ── Horizontal scaling (Docker Swarm) ──
# Scale to 10 replicas
docker service scale io-thecodeforge-api=10

# Verify replicas are distributed across nodes
docker service ps io-thecodeforge-api --format '{{.Node}} {{.CurrentState}}'
# manager1 Running
# worker1 Running
# worker2 Running
# (distributed across 3 nodes)

# ── Horizontal scaling (Kubernetes) ──
# Manual scaling
kubectl -n production scale deployment/io-thecodeforge-api --replicas=10

# Auto-scaling based on CPU and memory utilization
cat <<'EOF' > /tmp/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: io-thecodeforge-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: io-thecodeforge-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
EOF
kubectl apply -f /tmp/hpa.yaml

# ── Graceful shutdown (preStop hook) ──
cat <<'EOF' > /tmp/graceful-shutdown-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: registry.example.com/io-thecodeforge/api:1.0.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
# The shutdown sequence:
# 1. preStop sleeps 5s (allows the load balancer to stop routing to the pod)
# 2. Kubernetes sends SIGTERM to the container once preStop returns
# 3. Application drains in-flight requests and exits
# 4. Kubernetes waits up to terminationGracePeriodSeconds (30s, preStop included), then sends SIGKILL
EOF
kubectl apply -f /tmp/graceful-shutdown-deployment.yaml

# ── Pre-warming: pull images before scaling events ──
# Pre-pull images on all nodes (Kubernetes DaemonSet)
cat <<'EOF' > /tmp/prepull-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      initContainers:
        - name: prepull
          image: registry.example.com/io-thecodeforge/api:1.0.0
          command: ["true"]
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
EOF
kubectl apply -f /tmp/prepull-daemonset.yaml
# This DaemonSet runs on every node and pulls the image into the node's cache

# ── Monitor scaling effectiveness ──
# Check current replica count and resource usage
kubectl -n production get deployment io-thecodeforge-api -o wide
# NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
# io-thecodeforge-api   10/10   10           10          5d

# Check HPA status
kubectl -n production get hpa io-thecodeforge-api-hpa
# NAME                      REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS
# io-thecodeforge-api-hpa   Deployment/io-thecodeforge-api   45%/70%   3         20        5
# (45% CPU is below the 70% target, so the HPA scales down after the stabilization window)
- Without a preStop delay, Kubernetes sends SIGTERM and removes the pod from the Service endpoints in parallel, so the load balancer can keep routing new requests to a pod that is already shutting down.
- In-flight requests (requests that have already been routed to the pod) are dropped mid-processing.
- The preStop hook adds a delay, allowing the load balancer to stop sending new requests before the pod exits.
- The application must also handle SIGTERM by stopping new request acceptance and draining in-flight requests.
High Availability and Disaster Recovery
Production Docker deployments must survive host failures, network partitions, and data center outages. High availability (HA) ensures continuous operation during failures. Disaster recovery (DR) ensures data and service restoration after catastrophic failures.
Multi-host redundancy: Run multiple replicas of each service across multiple hosts. If one host fails, the orchestrator reschedules containers on healthy hosts. Docker Swarm: use --replicas=3 and ensure the swarm has 3+ manager nodes. Kubernetes: use pod anti-affinity to spread replicas across nodes and zones.
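The "3+ manager nodes" advice follows from Raft, which Swarm managers use for consensus: a cluster of N managers needs a majority of floor(N/2) + 1 to make decisions, so it survives floor((N-1)/2) manager failures. A small sketch of that arithmetic (function names are illustrative):

```shell
#!/bin/bash
# Raft majority arithmetic for Swarm manager counts.
swarm_quorum() {
  echo $(( $1 / 2 + 1 ))
}

swarm_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for managers in 1 3 4 5; do
  echo "managers=$managers quorum=$(swarm_quorum "$managers") tolerated_failures=$(swarm_fault_tolerance "$managers")"
done
# managers=1 quorum=1 tolerated_failures=0
# managers=3 quorum=2 tolerated_failures=1
# managers=4 quorum=3 tolerated_failures=1
# managers=5 quorum=3 tolerated_failures=2
```

Four managers tolerate no more failures than three, which is why odd manager counts are the usual recommendation.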
Multi-AZ deployment: Deploy across multiple availability zones (data centers within a region). If one AZ fails, services continue in other AZs. AWS: use ECS/Kubernetes with nodes in 3+ AZs. Use Application Load Balancer (ALB) to distribute traffic across AZs.
Stateful services (databases): Databases require special HA strategies:
- PostgreSQL: streaming replication with automatic failover (Patroni, pg_auto_failover)
- MySQL: Group Replication or Galera Cluster
- Redis: Redis Sentinel or Redis Cluster
- Use volumes for data persistence. Back up volumes to object storage (S3) regularly.
Data backup and recovery:
- Volume snapshots: Docker has no built-in volume snapshot command, so archive named volumes from a helper container that mounts them read-only, or use cloud provider snapshots of the underlying disk.
- Database backups: pg_dump, mysqldump, or continuous WAL archiving to object storage.
- Image registry backup: replicate images across regions (ECR replication, Harbor replication).
- Configuration backup: store all configuration (Docker Compose, Kubernetes manifests, daemon.json) in version control.
Health checks and self-healing: The orchestrator uses health checks to detect unhealthy containers and automatically restart or reschedule them. Liveness probes detect deadlocked processes (restart the container). Readiness probes detect services that are not ready to receive traffic (remove from load balancer). Startup probes detect slow-starting applications (give them more time before health checking).
Failover testing: HA is only as good as your last failover test. Regularly simulate failures: kill a container, drain a node, shut down an AZ. Measure the time to recovery and the error rate during failover. If you have never tested failover, you do not have HA.
#!/bin/bash
# High availability and disaster recovery configuration

# ── Multi-host redundancy (Docker Swarm) ──
# Create a service with replicas spread across nodes.
# Assumption: nodes carry a "zone" label. Placement preferences spread over labels;
# Swarm already balances tasks across individual nodes by default.
docker service create \
  --name io-thecodeforge-api \
  --replicas 6 \
  --constraint 'node.role==worker' \
  --placement-pref 'spread=node.labels.zone' \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --update-parallelism 2 \
  --update-delay 10s \
  --update-failure-action rollback \
  --restart-condition on-failure \
  --restart-delay 5s \
  --restart-max-attempts 3 \
  registry.example.com/io-thecodeforge/api:1.0.0

# Verify distribution across nodes
docker service ps io-thecodeforge-api --format '{{.Node}} {{.CurrentState}}'
# worker1 Running
# worker2 Running
# worker3 Running
# worker1 Running
# worker2 Running
# worker3 Running

# ── Multi-AZ spread (Kubernetes) ──
cat <<'EOF' > /tmp/multi-az-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      # Spread evenly across zones. A required anti-affinity rule on the zone key
      # would allow only one pod per zone and leave the other replicas Pending.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: io-thecodeforge-api
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - io-thecodeforge-api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: registry.example.com/io-thecodeforge/api:1.0.0
EOF
kubectl apply -f /tmp/multi-az-deployment.yaml

# ── Volume backup (named volume to S3) ──
# Create a backup container that mounts the volume and uploads to S3.
# Copying a live database directory is not a consistent backup; stop the
# database first or use the pg_dump job below.
docker run --rm \
  -v postgres-data:/data:ro \
  -e AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}" \
  -e AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}" \
  amazon/aws-cli s3 cp /data "s3://my-backups/postgres-data/$(date +%Y-%m-%d)/" --recursive

# ── Kubernetes CronJob for automated backups ──
# The stock postgres image does not ship the AWS CLI. The image name below is
# hypothetical: use an image that contains both pg_dump and aws.
cat <<'EOF' > /tmp/backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: registry.example.com/io-thecodeforge/pg-backup:1.0.0
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h db-host -U postgres -d mydb | \
                    gzip | \
                    aws s3 cp - s3://my-backups/db/mydb-$(date +%Y%m%d-%H%M%S).sql.gz
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: password
          restartPolicy: OnFailure
EOF
kubectl apply -f /tmp/backup-cronjob.yaml

# ── Failover testing ──
# Kill one container of the service and verify self-healing
docker service scale io-thecodeforge-api=3
sleep 5
# Kill a container that belongs to the service (not an arbitrary one)
docker kill "$(docker ps -q --filter name=io-thecodeforge-api | head -1)"
# Watch the orchestrator reschedule
watch docker service ps io-thecodeforge-api
# A new container should start within seconds

# Kubernetes: simulate node failure
kubectl cordon k8s-worker-2                    # Mark node as unschedulable
kubectl drain k8s-worker-2 --ignore-daemonsets # Evict all pods from the node
# Pods are rescheduled to other nodes

# Verify all pods are running on remaining nodes
kubectl get pods -o wide | grep io-thecodeforge-api
- Configuration without testing is an assumption. Failover may fail due to DNS TTL, connection pool exhaustion, or split-brain scenarios.
- Regular failover tests reveal hidden dependencies that are not visible in configuration.
- Measure time-to-recovery (TTR) and error-rate-during-failover. If TTR > 30s or error rate > 5%, the failover is inadequate.
- Run failover tests monthly. Test killing containers, draining nodes, and simulating AZ failures.
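The pass/fail rule above (TTR over 30s or error rate over 5% means the failover is inadequate) is easy to encode so every monthly test is judged the same way. A minimal sketch, assuming you already measured the recovery time and counted failed and total requests during the test window; the function name is hypothetical.

```shell
#!/bin/bash
# Judge a failover test against the thresholds from the list above.
# Usage: failover_verdict <ttr_seconds> <failed_requests> <total_requests>
failover_verdict() {
  ttr_seconds=$1
  failed=$2
  total=$3

  # error rate > 5%  is equivalent to  failed * 100 > total * 5 (integer math)
  if [ "$ttr_seconds" -gt 30 ] || [ $(( failed * 100 )) -gt $(( total * 5 )) ]; then
    echo "inadequate"
  else
    echo "ok"
  fi
}

failover_verdict 12 10 1000   # prints ok          (12s, 1% errors)
failover_verdict 45 0 1000    # prints inadequate  (recovery too slow)
failover_verdict 10 60 1000   # prints inadequate  (6% errors)
```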
Why Your Containers Die Without Resource Limits
You’ve seen it: a container runs fine on your laptop, then murders every process on the production host. That’s not a bug — it’s a missing --memory flag. The Linux kernel gives containers a shared pool of RAM and CPU. One greedy process can starve the rest, trigger the OOM killer, and take down your orchestrator’s health checks with it.
The fix is brutal and simple. Set hard limits. --memory=512m caps the container. --cpus=0.5 throttles CPU before it throttles your neighbors. Never rely on “requests” alone — those are scheduling hints, not enforcement. Limits are the difference between a noisy neighbor and a dead cluster.
Why does this matter in production? Because orchestrators like Kubernetes schedule on requests and enforce limits. With neither set, pods run unconstrained, nodes get overcommitted, the OOM killer picks victims unpredictably, and your incident alert is just a rerun of last week's outage. Set limits. Test with stress. Verify with docker stats. Your cluster's life depends on it.
# io.thecodeforge: devops tutorial
# Prevent OOM killer from taking down your payment service
services:
  payment-api:
    image: payment-api:v3.2.1
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M
# Without these, Docker assumes unlimited, which is bad news on shared hosts

--memory-reservation without --memory is like locking your car but leaving the windows open. Reservations are soft targets; limits are enforced. If you see OOM kills, check your limits first.

Storage Gotchas: When Containers Forget Everything
Containers are ephemeral by design. Write something inside /var/lib/data? It lives in the container's writable layer and is gone the moment the container is removed or replaced by a new one. That's fine for stateless apps, but your database, logs, and cached assets need persistence. This is where volumes come in, but most devs get them wrong the first time.
A bind mount ties a container directory directly to a host path. It's fast, simple, and a security risk: a container with write access can modify anything under the mounted host path, including system files if you mount too broadly. The better option is a named volume managed by Docker. It isolates storage and survives container restarts and removals without exposing the host filesystem.
Where the wheels fall off is multi-host. On a single node, a named volume works great. In a swarm or Kubernetes cluster, containers move between nodes — and your local volume doesn’t follow. That’s why production stacks use network-attached storage (NFS, EBS, Ceph) or orchestrator-backed volumes. Never rely on local host paths for stateful workloads across nodes. You will lose data.
# io.thecodeforge: devops tutorial
# Avoid data loss when payment-db moves to another node
services:
  payment-db:
    image: postgres:16-alpine
    volumes:
      - pg_data_prod:/var/lib/postgresql/data
    deploy:
      placement:
        constraints: [node.labels.storage == ssd]
volumes:
  pg_data_prod:
    driver: rexray/ebs  # or nfs, ceph; never local host paths
Containerization vs. Virtualization: Why You’re Paying for an Entire OS You Don’t Need
Virtual machines run a full guest OS. Containers share the host kernel. That’s not a minor optimization — it’s the difference between 50 MB overhead and 5 GB overhead per instance.
In production, that cost shows up in density, boot time, and patching cadence. A single VM takes minutes to provision, then needs its own security updates. A container starts in milliseconds and inherits the host kernel's patches. If you’re running 50 microservices as VMs, you’re burning CPU cycles on 50 redundant kernels that do nothing but wait for systemd.
Virtualization still wins when you need strong isolation boundaries — different kernels, different OS families, or hardware-level security. But if your workloads are Linux-on-Linux, containers deliver better resource utilization and faster deployments. The question isn’t which is better. It’s which problem you’re solving.
# io.thecodeforge: devops tutorial
vm:
  type: "m5.xlarge"
  vcpu: 4
  memory_gb: 16
  guest_os: "Ubuntu 22.04"
  boot_time_sec: 45
  image_size_gb: 5
  instances_per_host: 10
container:
  base_image: "ubuntu:22.04"
  layers: 3
  image_size_mb: 180
  start_time_ms: 200
  instances_per_host: 250
Container Runtimes: runc, containerd, and Why Docker Is Just the Tip of the Stack
Docker is not a container runtime. Docker is a UX layer over containerd, which uses runc to actually spawn containers. When you run docker start, you're talking to dockerd, which sets up networking through its own libnetwork (not CNI) and hands the container lifecycle to containerd, which then calls runc.
In production, you don't need the Docker CLI on every host. Kubernetes talks to containerd directly through the CRI, so there is no dockerd and no Docker socket to protect. The shift to containerd reduces attack surface and cuts memory usage by 50-100 MB per node. For a 100-node cluster, that's 5-10 GB of RAM freed.
The runtime chain matters for security too. runc runs container processes with cgroups and namespaces. But vulnerabilities like CVE-2019-5736 let a malicious container overwrite the host's runc binary. Modern setups use gVisor or Kata Containers for extra isolation: gVisor intercepts syscalls in a user-space kernel, and Kata runs each container inside a lightweight VM with its own guest kernel. That's your last line of defense when a container goes rogue.
# io.thecodeforge: devops tutorial
runtime_stack:
  - layer: "docker cli"
    daemon: "dockerd"
    purpose: "user interface, image build, compose"
  - layer: "containerd"
    daemon: "containerd"
    purpose: "image management, container lifecycle"
  - layer: "runc"
    binary: "runc"
    purpose: "OCI spec execution, cgroup/namespace setup"
production_node:
  runtime: "containerd"
  snapshotters: "overlayfs, devmapper"
  cni_plugins: "calico, flannel"
Docker Swarm Overlay Network Failure — 45-Minute Outage During Black Friday Traffic
- Overlay network MTU is invisible until traffic volume exceeds the fragmentation queue capacity. Always calculate overlay MTU as: underlay_MTU - 50 (VXLAN overhead).
- docker service ls shows containers as 'running' even when the overlay network is broken. Network health is not visible in Docker's built-in status checks.
- Add network-level health checks (TCP connectivity to downstream services) in addition to HTTP health checks. A container can be HTTP-healthy but network-unreachable.
- Load test at 2x expected peak traffic before any high-traffic event. The MTU issue had been dormant for 6 months — it only manifested under 10x traffic.
- Monitor IP fragment queue drops (netstat -s | grep -i frag) on all container hosts. Fragment queue exhaustion is a silent failure mode.
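Two of the lessons above translate directly into small checks. The sketch below computes the overlay MTU from the underlay MTU using the 50-byte VXLAN overhead, and reads the ReasmFails counter from /proc/net/snmp so it can be graphed or alerted on. Function names are illustrative, and the commented docker command assumes the com.docker.network.driver.mtu driver option is how you set the MTU on your overlay network; verify it against your Docker version.

```shell
#!/bin/bash
# Overlay MTU = underlay MTU - 50 bytes of VXLAN encapsulation overhead.
overlay_mtu() {
  echo $(( $1 - 50 ))
}

# Print the IP ReasmFails counter (failed fragment reassemblies).
# Reads /proc/net/snmp by default; pass a file path to parse a saved copy.
reasm_fails() {
  awk '
    $1 == "Ip:" && !header_seen {
      # Header line: find the column that holds ReasmFails
      for (i = 1; i <= NF; i++) if ($i == "ReasmFails") col = i
      header_seen = 1
      next
    }
    $1 == "Ip:" && header_seen {
      # Value line: print the matching column if it was found
      if (col) print $col
      exit
    }
  ' "${1:-/proc/net/snmp}"
}

overlay_mtu 1500   # prints 1450 (standard Ethernet underlay)
overlay_mtu 9001   # prints 8951 (jumbo-frame underlay)

# Assumed usage when creating the network:
# docker network create -d overlay \
#   --opt com.docker.network.driver.mtu="$(overlay_mtu 1500)" my-overlay
```

Sampling reasm_fails once a minute and alerting when the value increases catches the fragmentation problem long before traffic reaches the level that caused the outage.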
docker inspect <container> --format '{{.HostConfig.Memory}}'
docker stats --no-stream <container>
docker exec <container> ss -tlnp
docker inspect <container> --format '{{.State.Health.Status}}'
docker network inspect <network> | grep -A5 <container>
docker exec <container> nslookup <target-service>
docker system df -v
du -sh /var/lib/docker/* | sort -hr
docker pull <image> 2>&1
docker login <registry> --username <user>
docker inspect <container> --format '{{.State.ExitCode}}'
docker logs <container>
cat /proc/net/snmp | grep -i frag
docker exec <container> cat /sys/class/net/eth0/mtu
ps aux --sort=-%cpu | head -10
docker stats --no-stream

| Aspect | Docker Swarm | Kubernetes | AWS ECS | AWS Fargate |
|---|---|---|---|---|
| Complexity | Low | High | Medium | Low |
| Learning curve | 1-2 weeks | 2-6 months | 2-4 weeks | 1-2 weeks |
| Self-healing | Yes | Yes | Yes | Yes |
| Auto-scaling | Limited (external) | HPA, VPA, KEDA | Service Auto Scaling | Service Auto Scaling |
| Service mesh | No | Istio, Linkerd | App Mesh | App Mesh |
| Multi-AZ | Manual | Built-in (topology spread) | Built-in | Built-in |
| Host management | Self-managed | Self-managed (or EKS/GKE) | EC2 instances | No hosts (serverless) |
| Cost | Lowest (self-managed) | Medium (EKS $73/mo + nodes) | Medium (EC2 + ECS) | Highest (20-30% premium) |
| Ecosystem | Small | Massive | AWS-native | AWS-native |
| Best for | Small teams, simple deployments | Large teams, complex workloads | AWS-native, medium complexity | Minimal ops, AWS-native |
| File | Command / Code | Purpose |
|---|---|---|
| io | docker swarm init --advertise-addr | Production Architecture |
| io | docker run --cpus=1.0 --name cpu-test alpine:3.19 stress --cpu 2 --timeout 10s | Resource Management |
| io | docker network create \ | Networking in Production |
| io | cat <<'EOF' \| sudo tee /etc/docker/daemon.json | Logging, Monitoring, and Observability |
| io | cat <<'EOF' > /tmp/Dockerfile | CI/CD Pipeline |
| io | docker run \ | Security Hardening |
| io | docker service scale io-thecodeforge-api=10 | Scaling Strategies |
| io | docker service create \ | High Availability and Disaster Recovery |
| ResourceLimitGuard.yml | services: | Why Your Containers Die Without Resource Limits |
| VolumeStrategyProd.yml | services: | Storage Gotchas |
| vm-vs-container.yml | vm: | Containerization vs. Virtualization |
| runtime-stack.yml | runtime_stack: | Container Runtimes |
Frequently Asked Questions
Use Docker Swarm if you have a small team (< 10 engineers), fewer than 50 services, and want simplicity. Swarm is easier to learn and operate. Use Kubernetes if you need the ecosystem (service mesh, GitOps, custom operators), have a platform team, or run more than 50 services. Kubernetes has a steeper learning curve but provides more flexibility and a larger ecosystem.
Use named volumes (docker volume create) for persistent data. Volumes survive container restarts and removals. Back up volumes regularly to object storage (S3) using a backup container or CronJob. For databases, use cloud-managed database services (RDS, Cloud SQL) when possible — they handle replication, backups, and failover automatically.
Check in order: (1) Is the process listening on the correct port and interface? (docker exec <container> ss -tlnp). (2) Is the health check passing? (docker inspect <container> --format '{{.State.Health.Status}}'). (3) Is the container on the correct network? (docker network inspect <network>). (4) Are there iptables rules blocking traffic? (iptables -L -n). (5) Is the application actually started? (docker logs <container>).
A liveness probe checks if the container is alive. If it fails, Kubernetes restarts the container. Use liveness for detecting deadlocked processes. A readiness probe checks if the container is ready to serve traffic. If it fails, Kubernetes removes the container from the load balancer but does not restart it. Use readiness for detecting initialization issues or temporary overload. A startup probe gives slow-starting applications extra time before liveness checks begin.
Use rolling updates with health checks. The orchestrator starts new containers, waits for them to pass health checks, then stops old containers. Add preStop hooks with a sleep delay to allow load balancers to drain connections. Handle SIGTERM in your application to stop accepting new requests and drain in-flight requests. Set terminationGracePeriodSeconds to the maximum drain time.
Collect three types of signals: (1) Logs — ship stdout/stderr to ELK, Datadog, or Loki. (2) Metrics — use Prometheus with node_exporter and cAdvisor. Alert on memory > 80% of limit, CPU throttling > 10%, restart count > 5/hour. (3) Traces — use OpenTelemetry with Jaeger or Zipkin for distributed tracing across microservices.
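The alert thresholds in the metrics item can be written down as one check. This is a minimal sketch with a hypothetical function name; it assumes you feed it memory usage and limit in the same unit (MiB here), CPU throttled and total periods from the container's cgroup statistics, and the restart count for the last hour.

```shell
#!/bin/bash
# Evaluate the three alert rules: memory > 80% of limit, CPU throttling > 10%,
# restarts > 5 per hour. Prints the names of the rules that fire, space-separated.
# Usage: container_alerts <mem_used> <mem_limit> <throttled_periods> <total_periods> <restarts_last_hour>
container_alerts() {
  mem_used=$1
  mem_limit=$2
  throttled_periods=$3
  total_periods=$4
  restarts_last_hour=$5
  alerts=""

  if [ $(( mem_used * 100 )) -gt $(( mem_limit * 80 )) ]; then
    alerts="$alerts memory"
  fi
  if [ "$total_periods" -gt 0 ] && \
     [ $(( throttled_periods * 100 )) -gt $(( total_periods * 10 )) ]; then
    alerts="$alerts cpu_throttling"
  fi
  if [ "$restarts_last_hour" -gt 5 ]; then
    alerts="$alerts restarts"
  fi

  # Strip the leading space left by the first append
  echo "${alerts# }"
}

container_alerts 450 512 5 100 0    # prints memory
container_alerts 200 512 20 100 6   # prints cpu_throttling restarts
```

In practice these rules live in Prometheus alerting rules; the shell version is only meant to make the thresholds unambiguous.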