Docker in Production: Real-World Architecture, Failures, and Scaling Strategies
- Production Docker requires an orchestrator (Swarm, Kubernetes, ECS) for scheduling, scaling, self-healing, and networking. Choose based on team size and ecosystem needs.
- Resource limits are mandatory: set --memory and --cpus on every container. Without limits, one container can starve all others on the host.
- Logging must go to stdout/stderr with rotation. Ship to a centralized system. Never write logs to the container filesystem.
- Orchestration layer (Kubernetes, ECS, Docker Swarm) manages container scheduling, scaling, and self-healing
- Container runtime (containerd, runc) executes containers with namespace isolation and cgroup limits
- Image registry (ECR, GCR, private Harbor) stores and distributes images with vulnerability scanning
- Service mesh (Istio, Linkerd) handles mTLS, traffic management, and observability
- Resource limits are mandatory: without --memory and --cpus, one container can starve all others on the host
- Logging must go to stdout/stderr: never write logs to the container filesystem (lost on restart)
- Images must be immutable: tag with the git SHA, never use :latest in production
- Health checks must cover both liveness and readiness: liveness failures restart the container, readiness failures remove it from the load balancer
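The liveness/readiness split reduces to a small decision rule: only after the failure threshold is reached does anything happen, and what happens depends on the probe type. A minimal sketch of that logic (the `probe_action` helper is hypothetical, not a real orchestrator API):

```shell
#!/bin/sh
# Sketch of orchestrator probe handling: liveness failures restart the
# container, readiness failures only pull it out of the load balancer.
probe_action() {
  probe_type=$1; consecutive_failures=$2; threshold=$3
  if [ "$consecutive_failures" -lt "$threshold" ]; then
    echo "none"
  elif [ "$probe_type" = "liveness" ]; then
    echo "restart-container"
  else
    echo "remove-from-load-balancer"
  fi
}

probe_action liveness 3 3   # restart-container
probe_action readiness 3 3  # remove-from-load-balancer
probe_action liveness 1 3   # none (threshold not reached)
```

This is why a service should never use the same endpoint for both probes: a slow dependency should fail readiness (shed traffic) without failing liveness (forcing a restart loop).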
Production Debug Guide
From container crashes to network partitions: real debugging paths through production Docker deployments.
- Container is OOM-killed repeatedly:
  docker inspect <container> --format '{{.HostConfig.Memory}}'
  docker stats --no-stream <container>
- Container is running but not serving traffic:
  docker exec <container> ss -tlnp
  docker inspect <container> --format '{{.State.Health.Status}}'
- Inter-container communication failing:
  docker network inspect <network> | grep -A5 <container>
  docker exec <container> nslookup <target-service>
- Docker daemon disk usage growing rapidly:
  docker system df -v
  du -sh /var/lib/docker/* | sort -hr
- Image pull failing in CI/CD pipeline:
  docker pull <image> 2>&1
  docker login <registry> --username <user>
- Container immediately exits on start:
  docker inspect <container> --format '{{.State.ExitCode}}'
  docker logs <container>
- Overlay network latency spike:
  cat /proc/net/snmp | grep -i frag
  docker exec <container> cat /sys/class/net/eth0/mtu
- Host CPU is 100% but individual containers show low usage:
  ps aux --sort=-%cpu | head -10
  docker stats --no-stream
Docker works out of the box for development. Running a single container on a laptop requires no orchestration, no monitoring, and no security hardening. Production is a different problem entirely: hundreds of containers across dozens of hosts, with requirements for zero-downtime deploys, automatic scaling, persistent data, and compliance auditing.
Most Docker-in-production failures fall into five categories: resource exhaustion (no limits set), networking misconfiguration (DNS, overlay MTU, port conflicts), logging gaps (logs in container filesystem, not shipped to central system), security exposure (root containers, exposed daemon socket, unpinned images), and deployment errors (no health checks, no rollback strategy, no canary testing).
This article covers the architecture decisions, operational patterns, and failure scenarios that determine whether your Docker deployment survives production traffic or collapses under it. Every section includes real debugging commands and failure stories.
Production Architecture: Single Host to Multi-Host Orchestration
Running Docker in production requires an orchestration layer that manages container scheduling, networking, scaling, and self-healing across multiple hosts. Without orchestration, you are managing containers manually, which does not scale beyond 10-20 containers.
Single-host Docker (development only): Running docker run on a single host works for development but fails in production. There is no self-healing (if a container crashes, it stays dead unless you add --restart=always). There is no load balancing (all traffic goes to one container). There is no horizontal scaling (you must manually start more containers). There is no rolling deployment (you must stop the old container before starting the new one, causing downtime).
Docker Swarm: Docker's built-in orchestrator. Manages a cluster of Docker hosts as a single virtual host. Supports service definitions (desired state), rolling updates, and overlay networking. Swarm is simpler than Kubernetes but has fewer features: no custom resource definitions, limited networking options, and a smaller ecosystem. Swarm is adequate for small-to-medium deployments (< 100 services).
Kubernetes (K8s): The industry-standard orchestrator. Manages containers across a cluster with declarative configuration, automated scaling, self-healing, and a rich ecosystem of networking, storage, and observability tools. Kubernetes has a steep learning curve and significant operational overhead; it requires dedicated platform engineers to operate. Kubernetes is the right choice for large deployments (> 50 services) or when you need the ecosystem (service mesh, GitOps, custom operators).
AWS ECS / Fargate: AWS's managed container orchestration. ECS manages container scheduling on EC2 instances. Fargate abstracts the hosts entirely β you pay per container, not per host. ECS is simpler than Kubernetes (no control plane to manage) but locks you to AWS. Fargate eliminates host management entirely but costs 20-30% more than self-managed EC2.
Architecture pattern: The production architecture stack is: Load Balancer -> Ingress Controller -> Orchestrator -> Container Runtime -> Host. Each layer has specific failure modes and debugging approaches. Understanding the full stack is essential for production debugging.
#!/bin/bash
# Production architecture setup and verification

# ── Docker Swarm: production setup ──────────────────────────────────────────
# Initialize the swarm on the manager node
docker swarm init --advertise-addr <manager-ip>
# Output: join token for worker nodes

# Join worker nodes
docker swarm join --token <token> <manager-ip>:2377

# Verify cluster status
docker node ls
# ID     HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
# abc*   manager1   Ready    Active         Leader
# def    worker1    Ready    Active
# ghi    worker2    Ready    Active

# Create a production service with resource limits
docker service create \
  --name io-thecodeforge-api \
  --replicas 3 \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --reserve-cpu 0.5 \
  --reserve-memory 256m \
  --publish published=80,target=3000 \
  --update-parallelism 1 \
  --update-delay 10s \
  --update-failure-action rollback \
  --update-max-failure-ratio 0.25 \
  --health-cmd 'curl -f http://localhost:3000/health || exit 1' \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  --network app-overlay \
  registry.example.com/io-thecodeforge/api:1.0.0

# Verify service status
docker service ls
# ID    NAME                  MODE        REPLICAS  IMAGE
# abc   io-thecodeforge-api   replicated  3/3       registry.example.com/io-thecodeforge/api:1.0.0

# Check service logs across all replicas
docker service logs io-thecodeforge-api --tail 50 --follow

# ── Kubernetes: production deployment manifest ──────────────────────────────
cat <<'EOF' > /tmp/io-thecodeforge-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      containers:
      - name: api
        image: registry.example.com/io-thecodeforge/api:1.0.0
        ports:
        - containerPort: 3000
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
          limits:
            cpu: 1000m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 15
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
EOF
kubectl apply -f /tmp/io-thecodeforge-api-deployment.yaml
kubectl -n production rollout status deployment/io-thecodeforge-api

# ── AWS ECS: production task definition ─────────────────────────────────────
cat <<'EOF' > /tmp/io-thecodeforge-api-task.json
{
  "family": "io-thecodeforge-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge/api:1.0.0",
      "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "us-east-1",
          "log_group_name": "/ecs/io-thecodeforge-api",
          "auto_create_group": "true"
        }
      }
    }
  ]
}
EOF
aws ecs register-task-definition --cli-input-json file:///tmp/io-thecodeforge-api-task.json
# Swarm cluster nodes:
ID     HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
abc*   manager1   Ready    Active         Leader
def    worker1    Ready    Active
ghi    worker2    Ready    Active
# Service status:
ID NAME MODE REPLICAS IMAGE
abc io-thecodeforge-api replicated 3/3 registry.example.com/io-thecodeforge/api:1.0.0
# Kubernetes deployment:
deployment.apps/io-thecodeforge-api configured
Waiting for deployment "io-thecodeforge-api" rollout to finish: 0 of 3 updated replicas are available...
Waiting for deployment "io-thecodeforge-api" rollout to finish: 1 of 3 updated replicas are available...
Waiting for deployment "io-thecodeforge-api" rollout to finish: 2 of 3 updated replicas are available...
deployment "io-thecodeforge-api" successfully rolled out
- Kubernetes manages not just containers but networking (CNI), storage (CSI), service discovery, ingress, RBAC, and custom resources.
- Swarm is simpler because it delegates networking and storage to Docker's built-in drivers.
- Kubernetes' complexity is the cost of flexibility: it can model any production topology.
- For simple deployments (< 50 services), Swarm is sufficient and far easier to operate.
Resource Management: CPU, Memory, OOM, and Noisy Neighbors
Resource management is the most critical production concern for shared container hosts. Without explicit resource limits, one misbehaving container can starve every other container on the same host.
CPU limits: Docker uses cgroups to enforce CPU limits. --cpus=1.0 gives the container access to one CPU core's worth of time. Without a limit, a container can consume all available CPU. CPU is a compressible resource: the kernel throttles CPU-intensive containers but does not kill them, so a CPU-hungry container slows its neighbors down rather than taking them out.
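Throttling shows up in the cgroup's cpu.stat counters, and the ratio of throttled to total scheduling periods tells you whether a limit is too tight. A quick sketch of that arithmetic (the counter values below are illustrative, not from a real container):

```shell
#!/bin/sh
# Compute the throttling ratio from cpu.stat-style counters.
# Illustrative values, as if read from the container's cgroup:
nr_periods=1000
nr_throttled=350
throttle_pct=$(( nr_throttled * 100 / nr_periods ))
echo "container throttled in ${throttle_pct}% of scheduling periods"
# Sustained throttling above roughly 10% usually means --cpus is set too low
```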
Memory limits: Memory is an incompressible resource. When a container exceeds its memory limit, the kernel OOM killer terminates it. The OOM killer selects processes based on oom_score, a heuristic that considers memory usage, process age, and oom_score_adj. Without a memory limit, a leaking container consumes all host memory, and the OOM killer may kill unrelated containers or critical host processes (kubelet, containerd).
Requests vs limits (Kubernetes): Requests guarantee a minimum allocation: the scheduler places the pod on a node with enough available resources. Limits set the maximum: the container is throttled (CPU) or killed (memory) if it exceeds them. Best practice: set requests equal to limits for critical services (guaranteed QoS). Set requests lower than limits for burstable services (burstable QoS).
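The QoS class follows mechanically from how requests and limits compare. A simplified single-resource sketch of the classification rule (real Kubernetes evaluates every container in the pod and both CPU and memory; the `qos_class` helper is hypothetical):

```shell
#!/bin/sh
# Simplified Kubernetes QoS classification for one resource.
# Values are plain integers (e.g. millicores or MiB); 0 means "not set".
qos_class() {
  req=$1; lim=$2
  if [ "$req" -eq 0 ] && [ "$lim" -eq 0 ]; then
    echo "BestEffort"    # no requests, no limits: first to be evicted
  elif [ "$req" -eq "$lim" ]; then
    echo "Guaranteed"    # requests == limits: last to be evicted
  else
    echo "Burstable"     # requests < limits: can burst, may be reclaimed
  fi
}

qos_class 1000 1000  # Guaranteed
qos_class 500 1000   # Burstable
qos_class 0 0        # BestEffort
```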
Noisy neighbor problem: Multiple containers on the same host compete for CPU, memory, disk I/O, and network bandwidth. Without resource limits, one container's spike affects all others. The fix: set limits on every production container. Monitor host-level resource usage with docker stats and Prometheus node_exporter.
OOM score and priority: The kernel assigns each process an oom_score from 0 to 1000. Higher scores are killed first. Docker sets oom_score_adj for each container: containers with higher scores are killed before lower-scored ones. Critical services (databases) should have oom_score_adj=-999 (almost never killed). Non-critical services should have oom_score_adj=1000 (killed first).
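The selection rule is simple: the highest oom_score dies first. A sketch with made-up scores, showing why the --oom-score-adj ordering between a database and a cache matters:

```shell
#!/bin/sh
# Simulate OOM victim selection over (process, oom_score) pairs.
# Scores are invented to illustrate the ordering, not read from /proc.
victim=$(printf '%s\n' \
  'critical-db 1' \
  'api 300' \
  'expendable-cache 999' \
  | sort -k2 -nr | head -1 | cut -d' ' -f1)
echo "OOM killer victim: $victim"
# The cache (score 999) is sacrificed; the database (score 1) survives
```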
#!/bin/bash
# Production resource management configuration and monitoring

# ── CPU limits ──────────────────────────────────────────────────────────────
# Run with 1 CPU core limit
docker run --cpus=1.0 --name cpu-test alpine:3.19 stress --cpu 2 --timeout 10s
# The container is throttled to 1 CPU even if stress spawns 2 workers

# Check CPU throttling
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat
# nr_periods: total scheduling periods
# nr_throttled: periods where the container was throttled
# throttled_time: total time throttled (nanoseconds)

# Check CPU shares (relative priority)
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.shares
# Default: 1024. Set with --cpu-shares=512 for lower priority

# ── Memory limits ───────────────────────────────────────────────────────────
# Run with 256MB memory limit
docker run --memory=256m --memory-swap=256m --name mem-test alpine:3.19 stress --vm 1 --vm-bytes 300M --timeout 10s
# Container is OOM-killed because it exceeds the 256MB limit

# Check memory usage before OOM
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.max_usage_in_bytes

# Check OOM events
dmesg | grep -i 'oom\|killed process' | tail -10
# [12345.678] Out of memory: Killed process 5678 (node) total-vm:123456kB, anon-rss:98765kB

# ── OOM score management ────────────────────────────────────────────────────
# Check a container's OOM score
CONTAINER_PID=$(docker inspect <container> --format '{{.State.Pid}}')
cat /proc/$CONTAINER_PID/oom_score      # 0-1000: higher = more likely to be killed
cat /proc/$CONTAINER_PID/oom_score_adj  # -1000 to 1000: adjust the score

# Set OOM priority for critical services (database)
docker run --oom-score-adj=-999 --name critical-db postgres:16
# This container is almost never killed by the OOM killer

# Set OOM priority for non-critical services (cache)
docker run --oom-score-adj=1000 --name expendable-cache redis:7
# This container is killed first in an OOM situation

# ── Kubernetes resource management ──────────────────────────────────────────
# Guaranteed QoS: requests == limits (never evicted for resource reasons)
cat <<'EOF'
resources:
  requests:
    cpu: 1000m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 512Mi
EOF

# Burstable QoS: requests < limits (can burst but may be throttled/killed)
cat <<'EOF'
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
EOF

# ── Monitor resource usage across all containers ────────────────────────────
# Real-time resource usage
docker stats --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}'

# Find containers without resource limits
docker ps -q | xargs -I{} docker inspect {} --format '{{.Name}}: CPU={{.HostConfig.NanoCpus}} MEM={{.HostConfig.Memory}}'
# Containers with NanoCpus=0 or Memory=0 have no limits

# Host-level resource check
free -h
cat /proc/loadavg
uptime
# CPU throttling:
nr_periods: 1000
nr_throttled: 350
throttled_time: 3500000000 (3.5 seconds)
# Memory OOM:
[12345.678] Out of memory: Killed process 5678 (stress) total-vm:312320kB, anon-rss:262144kB
# OOM scores:
Container PID: 5678
oom_score: 300
oom_score_adj: 0
# Docker stats:
NAME CPU % MEM USAGE / LIMIT NET I/O BLOCK I/O
io-api-1 12.34% 256MiB / 512MiB 1.2GB / 800MB 50MB / 100MB
io-api-2 8.76% 198MiB / 512MiB 900MB / 600MB 30MB / 80MB
io-db-1 45.67% 1.2GiB / 2GiB 5GB / 3GB 2GB / 500MB
io-cache-1 2.34% 85MiB / 128MiB 500MB / 400MB 10MB / 5MB
- CPU is compressible: the kernel throttles a CPU-hungry container but does not kill it.
- Memory is incompressible: when physical memory is exhausted, the kernel must kill a process.
- Without memory limits, the OOM killer may kill critical host processes (containerd, kubelet).
- With memory limits, only the offending container is killed; other containers are unaffected.
Networking in Production: DNS, Overlay, Load Balancing, and Service Mesh
Production Docker networking requires reliable DNS resolution, load balancing, and health-aware traffic routing. The default bridge network provides none of these β production deployments must use user-defined networks or an orchestrator's networking layer.
DNS-based service discovery: User-defined Docker networks and Kubernetes provide DNS-based service discovery. Containers resolve service names to IP addresses via an embedded DNS server (127.0.0.11 in Docker, CoreDNS in Kubernetes). The default bridge network has no DNS: containers can only reach each other by IP, which changes on every restart.
Overlay networking: For multi-host deployments, overlay networks use VXLAN encapsulation to create a virtual Layer 2 network across hosts. Each overlay network has an MTU of 1450 by default (VXLAN adds 50 bytes of overhead to the standard 1500-byte Ethernet MTU). Misconfigured MTU is a common production failure: packets larger than the overlay MTU are fragmented, and under high load the fragment queue can overflow, causing silent packet drops.
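The MTU budget is fixed arithmetic: whatever the underlay carries, the overlay gets 50 bytes less. A tiny sketch of the calculation for the two common underlays (the `overlay_mtu` helper is just illustration):

```shell
#!/bin/sh
# VXLAN adds a fixed 50-byte header, so overlay MTU = underlay MTU - 50
overlay_mtu() { echo $(( $1 - 50 )); }

overlay_mtu 1500   # 1450: default Ethernet underlay
overlay_mtu 9001   # 8951: AWS VPC jumbo-frame underlay
```

Setting the overlay MTU at or below this budget is what prevents fragmentation at the VXLAN boundary in the first place.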
Load balancing: Docker Swarm provides built-in load balancing via a routing mesh β any node can route traffic to any service replica. Kubernetes provides kube-proxy (iptables/IPVS-based) and ingress controllers (NGINX, Traefik, Envoy) for external traffic. For production, an ingress controller with TLS termination, rate limiting, and circuit breaking is mandatory.
Service mesh: A service mesh (Istio, Linkerd) adds mTLS between services, traffic splitting (canary deployments), circuit breaking, and observability (distributed tracing, metrics). The trade-off: added latency (1-3ms per hop) and operational complexity. Use a service mesh when you need mTLS or traffic splitting. Do not add one 'just in case.'
Network policies: In Kubernetes, NetworkPolicy resources restrict which pods can communicate with each other. Without network policies, all pods can communicate, so a compromised pod can reach the database directly. Default-deny network policies are a production best practice.
#!/bin/bash
# Production networking configuration and debugging

# ── Docker Swarm overlay network ────────────────────────────────────────────
# Create an overlay network with correct MTU
docker network create \
  --driver overlay \
  --opt com.docker.network.driver.mtu=8950 \
  --subnet 10.0.0.0/24 \
  --gateway 10.0.0.1 \
  app-overlay
# MTU calculation: VPC MTU (9001) - VXLAN overhead (50) = 8951, round to 8950

# Verify overlay network
docker network inspect app-overlay --format '{{.Driver}} {{.Options}}'
# overlay map[com.docker.network.driver.mtu:8950]

# ── DNS resolution verification ─────────────────────────────────────────────
# Check embedded DNS server
docker exec <container> cat /etc/resolv.conf
# nameserver 127.0.0.11
# options ndots:0

# Resolve a service name
docker exec <container> nslookup io-thecodeforge-api
# Server:  127.0.0.11
# Address: 10.0.0.5

# Check DNS query logs (Docker daemon)
sudo journalctl -u docker | grep 'DNS query' | tail -10

# ── Network health checks ───────────────────────────────────────────────────
# Check overlay network peer status
docker network inspect app-overlay --format '{{.Peers}}'
# Shows all nodes participating in the overlay

# Check IP fragment queue (critical for overlay networks)
cat /proc/net/snmp | grep -i frag
# Ip: FragCreates FragOKs FragFails
# If FragFails > 0, packets are being dropped due to fragment queue overflow

# Check MTU of container interface
docker exec <container> cat /sys/class/net/eth0/mtu
# Should match the overlay network MTU (8950)

# ── Traffic debugging with tcpdump ──────────────────────────────────────────
# Capture traffic on the overlay bridge
sudo tcpdump -i docker_gwbridge -n -c 20

# Capture traffic inside a container's namespace
CONTAINER_PID=$(docker inspect <container> --format '{{.State.Pid}}')
sudo nsenter --net --target $CONTAINER_PID tcpdump -i eth0 -n -c 20

# ── Kubernetes network policies (default-deny) ──────────────────────────────
cat <<'EOF' > /tmp/default-deny-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF

cat <<'EOF' > /tmp/allow-api-to-db.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: io-thecodeforge-db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: io-thecodeforge-api
    ports:
    - protocol: TCP
      port: 5432
EOF

kubectl apply -f /tmp/default-deny-policy.yaml
kubectl apply -f /tmp/allow-api-to-db.yaml

# ── Load balancer health verification ───────────────────────────────────────
# Check which containers are receiving traffic
curl -s http://localhost:80/health | jq .hostname
# Repeat 10 times; should show different hostnames (round-robin)
for i in $(seq 1 10); do
  curl -s http://localhost:80/health | jq -r .hostname
done
# Overlay network created (network ID):
abc123def456
# DNS resolution:
Server: 127.0.0.11
Address: 10.0.0.5
# Fragment queue:
Ip: FragCreates FragOKs FragFails
123456789 123456000 789
# FragFails > 0 indicates fragment queue overflow
# MTU check:
8950
# Network policies:
networkpolicy.networking.k8s.io/default-deny-all created
networkpolicy.networking.k8s.io/allow-api-to-db created
# Load balancer verification:
api-1
api-2
api-3
api-1
api-2
# Round-robin distribution confirmed
- VXLAN encapsulation adds 50 bytes of overhead, so the overlay MTU must be 50 bytes less than the underlay MTU.
- If the overlay MTU is too large, packets are fragmented at the VXLAN boundary.
- Under normal load, fragmentation is slow but functional. Under high load, the fragment queue overflows and packets are silently dropped.
- The failure is silent: containers appear healthy but inter-service communication fails.
Logging, Monitoring, and Observability
Production observability is the difference between debugging a failure in 5 minutes and debugging it in 5 hours. Docker provides basic logging; production requires a centralized logging pipeline, metrics collection, and distributed tracing.
Container logging model: Docker captures stdout and stderr from each container and writes them to JSON files under /var/lib/docker/containers/<id>/<id>-json.log. Applications must write logs to stdout/stderr, never to a file inside the container. Log files inside the container are lost when the container restarts.
Log rotation: Docker's default JSON log driver has no size limit, so log files grow unbounded until the disk is full. Production deployments must configure log rotation in daemon.json: max-size (e.g., 10m) and max-file (e.g., 3). Without rotation, a chatty application can fill the host disk in hours.
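With rotation configured, the worst-case log footprint becomes a simple bound: max-size times max-file per container. A sketch of the capacity math (the container count is illustrative; the size values mirror the daemon.json recommendation):

```shell
#!/bin/sh
# Worst-case log disk usage once rotation is configured:
#   per container = max-size * max-file
max_size_mb=10
max_file=3
containers=50   # illustrative host density
per_container_mb=$(( max_size_mb * max_file ))
host_total_mb=$(( per_container_mb * containers ))
echo "bounded at ${per_container_mb}MB per container, ${host_total_mb}MB for ${containers} containers"
```

This is the number to check against the disk budget for /var/lib/docker before choosing rotation values.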
Centralized logging: Container logs must be shipped to a centralized system (ELK, Datadog, CloudWatch, Loki) for search, alerting, and retention. Use a logging agent (Fluentd, Filebeat, FireLens) as a DaemonSet or sidecar. The agent reads container logs and ships them to the central system.
Metrics collection: Container metrics (CPU, memory, network, disk I/O) are exposed by Docker (docker stats) and cAdvisor. For production, use Prometheus with node_exporter (host metrics) and cAdvisor (container metrics). Kubernetes exposes metrics via the metrics-server. Alert on: container memory usage > 80% of limit, CPU throttling > 10%, restart count > 5 in 1 hour.
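The memory alert above is just a usage/limit ratio crossing 0.8. A sketch of that condition in plain integer arithmetic (MiB values are illustrative; the `mem_alert` helper is hypothetical, not a Prometheus API):

```shell
#!/bin/sh
# Evaluate the "memory usage > 80% of limit" alert condition.
mem_alert() {
  usage_mib=$1; limit_mib=$2
  if [ $(( usage_mib * 100 / limit_mib )) -gt 80 ]; then
    echo "firing"
  else
    echo "ok"
  fi
}

mem_alert 450 512  # firing (~88% of the limit)
mem_alert 256 512  # ok (50% of the limit)
```

Alerting at 80% rather than 100% is the point: it leaves headroom to act before the OOM killer does.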
Distributed tracing: For microservices, distributed tracing (Jaeger, Zipkin, OpenTelemetry) tracks a request across multiple services. Each service adds a trace ID to outgoing requests and logs. The tracing system aggregates these logs into a single trace view. Essential for debugging latency issues in multi-service architectures.
Structured logging: Applications should emit structured logs (JSON) with fields: timestamp, level, message, trace_id, service, request_id. Unstructured logs (plain text) are impossible to parse and alert on at scale.
#!/bin/bash
# Production logging, monitoring, and observability setup

# ── Docker daemon log rotation ──────────────────────────────────────────────
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}
EOF
sudo systemctl restart docker

# Verify log rotation is configured
docker info --format '{{.LoggingDriver}}'
# json-file

# ── Check container log size ────────────────────────────────────────────────
# Find large container logs
find /var/lib/docker/containers -name '*-json.log' -exec ls -lhS {} + | head -10
# If any log is > 100MB, rotation is not working

# Check total log disk usage
du -sh /var/lib/docker/containers/*

# ── Fluentd DaemonSet logging agent (Kubernetes) ────────────────────────────
cat <<'EOF' > /tmp/fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: dockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: dockercontainers
        hostPath:
          path: /var/lib/docker/containers
EOF
kubectl apply -f /tmp/fluentd-daemonset.yaml

# ── Prometheus alerting rules ───────────────────────────────────────────────
cat <<'EOF' > /tmp/container-alerts.yaml
groups:
- name: container-alerts
  rules:
  - alert: ContainerMemoryHigh
    expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Container {{ $labels.name }} memory usage above 80%'
  - alert: ContainerCPUThrottled
    expr: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Container {{ $labels.name }} CPU throttled > 10%'
  - alert: ContainerRestarting
    expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: 'Container {{ $labels.container }} restarted {{ $value }} times in 1 hour'
EOF

# ── Structured logging example (Java) ───────────────────────────────────────
cat <<'EOF'
// io.thecodeforge.logging.StructuredLogger.java
package io.thecodeforge.logging;

import com.fasterxml.jackson.databind.ObjectMapper;
import java.time.Instant;
import java.util.Map;

public class StructuredLogger {
    private static final ObjectMapper mapper = new ObjectMapper();
    private final String serviceName;

    public StructuredLogger(String serviceName) {
        this.serviceName = serviceName;
    }

    public void info(String message, String traceId, Map<String, Object> fields) {
        try {
            Map<String, Object> logEntry = Map.of(
                "timestamp", Instant.now().toString(),
                "level", "INFO",
                "message", message,
                "service", serviceName,
                "trace_id", traceId != null ? traceId : "",
                "fields", fields != null ? fields : Map.of()
            );
            System.out.println(mapper.writeValueAsString(logEntry));
        } catch (Exception e) {
            System.err.println("LOG_ERROR: " + e.getMessage());
        }
    }

    public void error(String message, String traceId, Throwable throwable) {
        try {
            Map<String, Object> logEntry = Map.of(
                "timestamp", Instant.now().toString(),
                "level", "ERROR",
                "message", message,
                "service", serviceName,
                "trace_id", traceId != null ? traceId : "",
                "error_class", throwable.getClass().getName(),
                "error_message", throwable.getMessage(),
                "stack_trace", throwable.getStackTrace()[0].toString()
            );
            System.err.println(mapper.writeValueAsString(logEntry));
        } catch (Exception e) {
            System.err.println("LOG_ERROR: " + e.getMessage());
        }
    }
}
EOF
# Logging driver:
json-file
# Container log sizes:
-rw-r----- 1 root root 8.2M /var/lib/docker/containers/abc/abc-json.log
-rw-r----- 1 root root 3.1M /var/lib/docker/containers/def/def-json.log
# All under 10MB: rotation working
# Fluentd DaemonSet:
daemonset.apps/fluentd configured
# Prometheus alerts:
Alert rules written to /tmp/container-alerts.yaml
# Structured log output:
{"timestamp":"2026-04-05T10:30:00Z","level":"INFO","message":"Request processed","service":"io-thecodeforge-api","trace_id":"abc123","fields":{"duration_ms":45,"status":200}}
- Unstructured logs (plain text) cannot be parsed by log aggregation systems at scale.
- Structured logs (JSON) allow filtering by service, level, trace_id, and custom fields.
- Alerts on structured logs (e.g., 'error rate > 5% in 5 minutes') require parseable fields.
- Without structured logging, you are grepping through terabytes of text files.
CI/CD Pipeline: Image Building, Scanning, and Deployment Strategies
Production deployments require a CI/CD pipeline that builds, scans, tests, and deploys container images with zero downtime. Manual docker build && docker push does not scale and introduces human error.
Image building best practices:
- Use multi-stage builds to separate build dependencies from runtime. The final image should contain only the application binary and runtime dependencies.
- Pin base image versions (node:20.11-alpine, not node:latest). Latest tags change without notice.
- Use .dockerignore to exclude build context bloat (node_modules, .git, *.log).
- Enable BuildKit (DOCKER_BUILDKIT=1) for parallel builds and secret mounting.
- Tag images with the git SHA (not :latest, not :v1). The git SHA is immutable and traceable.
Image scanning: Every image must be scanned for known CVEs before deployment. Tools: Trivy, Grype, Snyk, AWS ECR scanning. Block deployment if critical or high CVEs are found. Scan the base image AND the application dependencies.
Deployment strategies:
- Rolling update: replace containers one at a time. Simple, but no rollback guarantee.
- Blue-green: deploy the new version alongside the old, switch traffic atomically. Instant rollback.
- Canary: deploy to 5% of traffic, monitor for errors, then gradually increase. Best for catching regressions.
- A/B testing: deploy two versions simultaneously, split traffic by user segment. Best for feature testing.
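A canary rollout is a weight ladder where each promotion is gated on the observed error rate. A sketch of the ladder (the 5-25-50-100 steps are a common convention, not mandated by any tool; the error-rate gate is noted as a comment):

```shell
#!/bin/sh
# Progressive canary weight ladder: stable weight is whatever the canary
# does not take. In a real pipeline, each step would first check the
# canary's error rate and abort (weight back to 0) if it exceeds ~1%.
ramp=$(for canary in 5 25 50 100; do
  echo "stable=$(( 100 - canary ))% canary=${canary}%"
done)
echo "$ramp"
```

The final rung, stable=0% / canary=100%, is the promotion: the canary becomes the new stable version.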
Rollback: Every deployment must have a one-command rollback. In Kubernetes: kubectl rollout undo. In Docker Swarm: docker service rollback. In ECS: update the service to the previous task definition. If rollback requires a new build, you do not have a rollback strategy.
Image immutability: Never push to the same tag twice. If you rebuild an image, use a new tag (new git SHA). Mutable tags (pushing to :latest or :v1 twice) cause 'works on my machine' bugs because different hosts have different image layers cached.
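The policy can be enforced mechanically: a push to an existing tag is simply refused. A sketch simulating an immutable registry with one file per tag (the `push_tag` helper and paths are hypothetical; real registries such as ECR expose tag immutability as a repository setting):

```shell
#!/bin/sh
# Simulate a registry that refuses to overwrite an existing tag.
REGISTRY_DIR=$(mktemp -d)
push_tag() {
  tag=$1; digest=$2
  if [ -e "$REGISTRY_DIR/$tag" ]; then
    echo "refused: tag $tag is immutable"
    return 1
  fi
  echo "$digest" > "$REGISTRY_DIR/$tag"
  echo "pushed: $tag -> $digest"
}

push_tag api-3f9c2ab sha256:aaa          # first push succeeds
push_tag api-3f9c2ab sha256:bbb || true  # re-push to the same tag is refused
```

A rebuilt image gets a new git SHA and therefore a new tag; the old tag keeps pointing at the exact bytes it always did.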
#!/bin/bash
# Production CI/CD pipeline for Docker images

# ── Multi-stage Dockerfile ──────────────────────────────────────────────────
cat <<'EOF' > /tmp/Dockerfile
# Stage 1: Build
FROM node:20.11-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci   # install all deps; the build step needs devDependencies
COPY . .
RUN npm run build
# Drop devDependencies before the runtime stage copies node_modules
RUN npm prune --omit=dev

# Stage 2: Runtime (minimal image)
FROM node:20.11-alpine AS runtime
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup
WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=10s --timeout=5s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
EOF

# ── Build with BuildKit ─────────────────────────────────────────────────────
GIT_SHA=$(git rev-parse --short HEAD)
IMAGE_TAG="registry.example.com/io-thecodeforge/api:${GIT_SHA}"
DOCKER_BUILDKIT=1 docker build \
  --tag ${IMAGE_TAG} \
  --label "io.thecodeforge.build.sha=${GIT_SHA}" \
  --label "io.thecodeforge.build.date=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --file /tmp/Dockerfile \
  .
docker push ${IMAGE_TAG}

# ── Image scanning with Trivy ───────────────────────────────────────────────
trivy image --severity HIGH,CRITICAL --exit-code 1 ${IMAGE_TAG}
# Exit code 1 = vulnerabilities found, block deployment
# Exit code 0 = no critical/high vulnerabilities

# ── Deployment: rolling update (Kubernetes) ─────────────────────────────────
kubectl -n production set image deployment/io-thecodeforge-api \
  api=${IMAGE_TAG}

# Monitor rollout
kubectl -n production rollout status deployment/io-thecodeforge-api --timeout=300s

# ── Rollback (one command) ──────────────────────────────────────────────────
kubectl -n production rollout undo deployment/io-thecodeforge-api

# ── Canary deployment (Kubernetes with Istio) ───────────────────────────────
cat <<'EOF' > /tmp/canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  hosts:
  - api.example.com
  http:
  - route:
    - destination:
        host: io-thecodeforge-api
        subset: stable
      weight: 95
    - destination:
        host: io-thecodeforge-api
        subset: canary
      weight: 5
EOF
kubectl apply -f /tmp/canary-virtualservice.yaml
# Monitor canary error rate
# If error rate > 1%, rollback:
kubectl delete -f /tmp/canary-virtualservice.yaml

# ── Verify image immutability ───────────────────────────────────────────────
# Check that no two images share the same tag
docker images --format '{{.Repository}}:{{.Tag}} {{.ID}}' | sort | uniq -w 50 -d
# If output is non-empty, the same tag points to different image IDs (bad)
#5 [builder 4/4] RUN npm run build
#5 DONE 12.3s
#7 [runtime 5/5] CMD ["node", "dist/server.js"]
#7 DONE 0.1s
# Push:
The push refers to registry.example.com/io-thecodeforge/api
abc123: Pushed
def456: Pushed
abc789: digest: sha256:xyz... size: 1570
# Scan:
Total: 0 (HIGH: 0, CRITICAL: 0)
# No critical vulnerabilities: deployment approved
# Rollout:
Waiting for deployment "io-thecodeforge-api" rollout to finish...
deployment "io-thecodeforge-api" successfully rolled out
# Rollback:
deployment.apps/io-thecodeforge-api rolled back
# Immutability check:
# (empty output = no duplicate tags = good)
- Mutable tags cause different hosts to run different image versions: the same tag means different things on different machines.
- Immutable tags (git SHA) guarantee that every host runs the exact same binary.
- Rollback is trivial with immutable tags: just point to the previous SHA.
- Debugging is deterministic: the git SHA maps directly to the source code that produced the image.
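The drift that mutable tags cause can be detected mechanically by comparing what each host reports for the same tag. A minimal Python sketch (the hosts, tags, and image IDs are hypothetical; in practice the input would come from `docker images --format '{{.Repository}}:{{.Tag}} {{.ID}}'` on each host):

```python
# Sketch: flag any tag that resolves to more than one image ID across a fleet.

def find_tag_drift(host_images: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return tags that map to different image IDs on different hosts."""
    seen: dict[str, set[str]] = {}
    for images in host_images.values():
        for tag, image_id in images.items():
            seen.setdefault(tag, set()).add(image_id)
    return {tag: ids for tag, ids in seen.items() if len(ids) > 1}

fleet = {
    "host-a": {"api:latest": "abc123", "api:9f2c1d4": "abc123"},
    "host-b": {"api:latest": "def456", "api:9f2c1d4": "abc123"},  # :latest drifted
}
drift = find_tag_drift(fleet)
print(drift)  # {'api:latest': {'abc123', 'def456'}}
```

The SHA-tagged image resolves identically everywhere; only the mutable `:latest` tag drifts.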
Security Hardening: Root, Secrets, Network, and Supply Chain
Production Docker security is a layered defense: no single measure is sufficient. Each layer (image, runtime, network, host) must be hardened independently.
Run as non-root: Containers running as root (uid 0) can exploit kernel vulnerabilities with maximum privileges. Every production container should run as a non-root user. Set USER in the Dockerfile or use --user in the run command. Drop all capabilities and add back only what is needed: --cap-drop=ALL --cap-add=NET_BIND_SERVICE.
Secrets management: Never bake secrets (API keys, passwords, certificates) into Docker images. Secrets in images are visible to anyone who can pull the image. Use: Docker secrets (Swarm), Kubernetes secrets (with external secret managers like Vault or AWS Secrets Manager), or environment variables injected at runtime from a secret manager. Use --mount=type=secret for build-time secrets in BuildKit.
Image provenance: Verify the source of base images. Use official images or images from trusted registries. Enable Docker Content Trust (DOCKER_CONTENT_TRUST=1) to verify image signatures. Use SBOM (Software Bill of Materials) tools to track all components in your images.
Runtime security:
- seccomp: filters syscalls. The default profile blocks ~44 dangerous syscalls. Use custom profiles for stricter filtering.
- AppArmor/SELinux: mandatory access control. The docker-default AppArmor profile restricts container capabilities.
- Read-only filesystem: --read-only prevents the container from modifying its filesystem. Use tmpfs for writable directories.
- No new privileges: --security-opt=no-new-privileges prevents privilege escalation.
Daemon security: The Docker daemon runs as root and has full access to the host. The daemon socket (/var/run/docker.sock) is equivalent to root access. Never mount the daemon socket into containers. Never expose the daemon over TCP without TLS client authentication. Use rootless Docker for environments where daemon root access is unacceptable.
#!/bin/bash
# Production security hardening for Docker

# ── Non-root container ──────────────────────────────────────────────────────
# Dockerfile best practice:
# RUN addgroup -g 1001 -S appgroup && adduser -S appuser -u 1001 -G appgroup
# USER appuser

# Runtime: run detached as non-root with dropped capabilities
docker run -d \
  --user 1001:1001 \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt=no-new-privileges \
  --read-only \
  --tmpfs /tmp:size=64m \
  --name hardened-api \
  io-thecodeforge/api:1.0.0

# Verify non-root
docker exec hardened-api id
# uid=1001(appuser) gid=1001(appgroup)

# Verify capabilities
docker inspect hardened-api --format '{{.HostConfig.CapAdd}} {{.HostConfig.CapDrop}}'
# [NET_BIND_SERVICE] [ALL]

# ── Secrets management ──────────────────────────────────────────────────────
# Docker Swarm secrets
echo 'my-database-password' | docker secret create db-password -
docker service create --secret db-password --name io-thecodeforge-api \
  io-thecodeforge/api:1.0.0
# Secret is available at /run/secrets/db-password inside the container

# BuildKit: mount secrets during build (not in final image)
DOCKER_BUILDKIT=1 docker build --secret id=npmrc,src=$HOME/.npmrc -t api:1.0 .
# In Dockerfile: RUN --mount=type=secret,id=npmrc cp /run/secrets/npmrc $HOME/.npmrc && npm ci

# ── Image scanning and provenance ───────────────────────────────────────────
# Scan for vulnerabilities
trivy image --severity HIGH,CRITICAL io-thecodeforge/api:1.0.0

# Generate SBOM (Software Bill of Materials)
syft io-thecodeforge/api:1.0.0 -o spdx-json > sbom.json

# Verify image signature (Docker Content Trust)
export DOCKER_CONTENT_TRUST=1
docker pull io-thecodeforge/api:1.0.0
# Fails if the image is not signed

# ── Seccomp profile ─────────────────────────────────────────────────────────
# The default seccomp profile is applied automatically; pass a profile
# explicitly to override it with a stricter one
docker run --security-opt seccomp=/etc/docker/seccomp/default.json \
  io-thecodeforge/api:1.0.0

# Create a custom seccomp profile (allow only required syscalls)
cat <<'EOF' > /tmp/seccomp-api.json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "stat", "fstat",
                "mmap", "mprotect", "munmap", "brk", "ioctl", "access",
                "socket", "connect", "sendto", "recvfrom", "clone",
                "execve", "exit", "exit_group", "futex", "epoll_create1",
                "epoll_ctl", "epoll_wait", "accept4", "listen", "bind",
                "setsockopt"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF
docker run --security-opt seccomp=/tmp/seccomp-api.json \
  io-thecodeforge/api:1.0.0

# ── Daemon security ─────────────────────────────────────────────────────────
# Check if daemon socket is exposed
curl --unix-socket /var/run/docker.sock http://localhost/version
# If this works, the daemon is accessible; anyone with socket access has root

# Check if daemon is exposed over TCP
netstat -tlnp | grep 2375
# Port 2375 = unencrypted Docker API (NEVER expose this)
netstat -tlnp | grep 2376
# Port 2376 = TLS-encrypted Docker API (OK if TLS client auth is configured)

# Verify daemon configuration
cat /etc/docker/daemon.json
# Should NOT contain: "hosts": ["tcp://0.0.0.0:2375"]
uid=1001(appuser) gid=1001(appgroup)
# Capabilities:
[NET_BIND_SERVICE] [ALL]
# Image scan:
Total: 0 (HIGH: 0, CRITICAL: 0)
# Daemon socket check:
{"Version":"24.0.7","ApiVersion":"1.43"}
# Socket is accessible; ensure proper file permissions
# TCP exposure:
# (no output on 2375 = good, daemon not exposed unencrypted)
- The daemon runs as root and has full access to the host filesystem, network, and processes.
- Anyone who can access the socket can create a container with the host filesystem mounted.
- Mounting the host filesystem into a container gives the container root access to the host.
- Never mount /var/run/docker.sock into containers. Never expose the daemon over TCP without TLS.
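The TCP-exposure rule can be codified as a lint over daemon.json. A Python sketch with hypothetical config snippets (port 2375 is the unencrypted API; 2376 is acceptable only when TLS client authentication is configured):

```python
# Sketch: audit daemon.json content for a plain-TCP API listener.
import json

def insecure_hosts(daemon_json: str) -> list[str]:
    """Return any 'hosts' entries that expose the API over unencrypted TCP."""
    config = json.loads(daemon_json)
    return [h for h in config.get("hosts", [])
            if h.startswith("tcp://") and h.endswith(":2375")]

bad = '{"hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2375"]}'
good = '{"hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:2376"]}'
print(insecure_hosts(bad))   # ['tcp://0.0.0.0:2375']
print(insecure_hosts(good))  # []
```

A check like this belongs in host-provisioning CI, so an insecure daemon configuration never reaches production.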
Scaling Strategies: Horizontal, Vertical, and Auto-Scaling
Scaling Docker in production means adding capacity to handle increased traffic. The strategy depends on the workload pattern: predictable traffic, bursty traffic, or event-driven traffic.
Horizontal scaling (scale out): Add more container replicas. Each replica handles a portion of the traffic. Horizontal scaling is preferred for stateless workloads β it provides redundancy (if one replica fails, others continue), and it scales linearly. Docker Swarm: docker service scale api=10. Kubernetes: kubectl scale deployment/api --replicas=10. ECS: update the service desired count.
Vertical scaling (scale up): Increase the resources (CPU, memory) of existing containers. Vertical scaling is simpler but limited by the host's capacity. It also requires restarting the container with new resource limits. Vertical scaling is appropriate for stateful workloads (databases) that cannot easily distribute across replicas.
Auto-scaling: Automatically adjust replica count based on metrics. The most common triggers:
- CPU utilization > 70% for 5 minutes -> add replicas
- Request rate > 1000 req/s -> add replicas
- Queue depth > 100 messages -> add worker replicas
- Custom metrics (response latency, error rate) -> add or remove replicas
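The trigger list above can be sketched as a replica-count decision function. The scale-out thresholds mirror the bullets; the scale-in conditions, step sizes, and bounds are illustrative assumptions, not HPA semantics:

```python
# Sketch: metric-driven replica decision with min/max bounds.

def desired_replicas(current: int, cpu_pct: float, req_per_s: float,
                     queue_depth: int, min_r: int = 3, max_r: int = 20) -> int:
    """Scale out on any trigger; scale in only when every metric is low."""
    if cpu_pct > 70 or req_per_s > 1000 or queue_depth > 100:
        target = current + 2   # scale out in steps of 2 (assumed policy)
    elif cpu_pct < 30 and req_per_s < 300 and queue_depth == 0:
        target = current - 1   # scale in conservatively, one at a time
    else:
        target = current
    return max(min_r, min(max_r, target))  # clamp to [min_r, max_r]

print(desired_replicas(5, cpu_pct=85, req_per_s=400, queue_depth=10))  # 7
print(desired_replicas(5, cpu_pct=20, req_per_s=100, queue_depth=0))   # 4
print(desired_replicas(3, cpu_pct=20, req_per_s=100, queue_depth=0))   # 3 (floor)
```

Real autoscalers add stabilization windows on top of this logic (as the HPA manifest later in this section does) so replicas do not flap on noisy metrics.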
Pre-warming: Container startup is fast (0.3-2s) but application cold start can be 10-60s (JVM startup, dependency initialization, connection pool warmup). Pre-warm containers by pulling images before scaling events and using readiness probes that wait for full initialization. For JVM applications, use class data sharing (CDS) or GraalVM native images to reduce cold start.
Scale-down strategy: Removing replicas must be graceful. The replica should stop accepting new requests, drain in-flight requests, close connections, and then exit. Kubernetes handles this with preStop hooks and terminationGracePeriodSeconds. Without graceful shutdown, in-flight requests are dropped during scale-down, causing user-facing errors.
Capacity planning: Monitor resource usage trends over weeks. If average CPU usage is growing 5% per week, plan to add capacity before it reaches 80%. Auto-scaling handles burst traffic, but baseline capacity must be planned manually.
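The planning arithmetic is simple. Assuming linear growth in percentage points per week (an assumption; a real forecast should regress over the monitored trend), the remaining headroom is:

```python
# Sketch: weeks until average CPU utilization crosses the planning threshold.

def weeks_until(threshold_pct: float, current_pct: float,
                growth_pct_per_week: float) -> int:
    """Count whole weeks until utilization reaches the threshold."""
    if current_pct >= threshold_pct:
        return 0
    weeks = 0
    while current_pct < threshold_pct:
        current_pct += growth_pct_per_week
        weeks += 1
    return weeks

# At 55% average CPU, growing 5 points/week, 80% is ~5 weeks away:
print(weeks_until(80, 55, 5))  # 5
```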
#!/bin/bash
# Production scaling strategies and configuration

# ── Horizontal scaling (Docker Swarm) ───────────────────────────────────────
# Scale to 10 replicas
docker service scale io-thecodeforge-api=10

# Verify replicas are distributed across nodes
docker service ps io-thecodeforge-api --format '{{.Node}} {{.CurrentState}}'
# manager1 Running
# worker1 Running
# worker2 Running
# (distributed across 3 nodes)

# ── Horizontal scaling (Kubernetes) ─────────────────────────────────────────
# Manual scaling
kubectl -n production scale deployment/io-thecodeforge-api --replicas=10

# Auto-scaling based on CPU utilization
cat <<'EOF' > /tmp/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: io-thecodeforge-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: io-thecodeforge-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
EOF
kubectl apply -f /tmp/hpa.yaml

# ── Graceful shutdown (preStop hook) ────────────────────────────────────────
cat <<'EOF' > /tmp/graceful-shutdown-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: api
        image: registry.example.com/io-thecodeforge/api:1.0.0
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5 && kill -SIGTERM 1"]
# The preStop hook:
# 1. Sleep 5s (allow load balancer to drain the pod)
# 2. Send SIGTERM to PID 1 (the application)
# 3. Application drains in-flight requests and exits
# 4. Kubernetes waits up to terminationGracePeriodSeconds (30s)
EOF
kubectl apply -f /tmp/graceful-shutdown-deployment.yaml

# ── Pre-warming: pull images before scaling events ──────────────────────────
# Pre-pull images on all nodes (Kubernetes DaemonSet)
cat <<'EOF' > /tmp/prepull-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      initContainers:
      - name: prepull
        image: registry.example.com/io-thecodeforge/api:1.0.0
        command: ["true"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
EOF
kubectl apply -f /tmp/prepull-daemonset.yaml
# This DaemonSet runs on every node and pulls the image into the node's cache

# ── Monitor scaling effectiveness ───────────────────────────────────────────
# Check current replica count and resource usage
kubectl -n production get deployment io-thecodeforge-api -o wide
# NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
# io-thecodeforge-api   10/10   10           10          5d

# Check HPA status
kubectl -n production get hpa io-thecodeforge-api-hpa
# NAME                      REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS
# io-thecodeforge-api-hpa   Deployment/io-thecodeforge-api   45%/70%   3         20        5
# (45% CPU is below the 70% threshold, so the HPA scales down after the stabilization window)
io-thecodeforge-api scaled to 10
# HPA status:
horizontalpodautoscaler.autoscaling/io-thecodeforge-api-hpa created
# Graceful shutdown:
deployment.apps/io-thecodeforge-api configured
# Pre-pull:
daemonset.apps/image-prepull created
# Scaling status:
NAME READY UP-TO-DATE AVAILABLE AGE
io-thecodeforge-api 10/10 10 10 5d
# HPA:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
io-thecodeforge-api-hpa Deployment/io-thecodeforge-api 45%/70% 3 20 5
- Without graceful shutdown, Kubernetes sends SIGTERM and immediately removes the pod from the load balancer.
- In-flight requests (requests that have already been routed to the pod) are dropped mid-processing.
- The preStop hook adds a delay, allowing the load balancer to stop sending new requests before the pod exits.
- The application must also handle SIGTERM by stopping new request acceptance and draining in-flight requests.
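The application-side drain described above can be simulated without an orchestrator. A Python sketch (the class, method names, and timings are illustrative; a real service would register the shutdown logic as a SIGTERM handler via the `signal` module):

```python
# Simulation sketch: on shutdown, stop accepting new work, wait for
# in-flight requests to finish, then exit within the grace period.
import threading
import time

class Drainer:
    def __init__(self):
        self.accepting = True
        self.in_flight = 0
        self.lock = threading.Lock()

    def start_request(self) -> bool:
        with self.lock:
            if not self.accepting:
                return False        # rejected: load balancer retries elsewhere
            self.in_flight += 1
            return True

    def end_request(self):
        with self.lock:
            self.in_flight -= 1

    def shutdown(self, grace_s: float) -> bool:
        """Mimic SIGTERM handling: drain within the grace period."""
        with self.lock:
            self.accepting = False  # step 1: refuse new requests
        deadline = time.monotonic() + grace_s
        while time.monotonic() < deadline:
            with self.lock:
                if self.in_flight == 0:
                    return True     # step 2: drained, clean exit
            time.sleep(0.01)
        return False                # grace period exceeded: SIGKILL would follow

d = Drainer()
assert d.start_request()
threading.Timer(0.05, d.end_request).start()  # request finishes after 50 ms
print(d.shutdown(grace_s=1.0))  # True: drained well before the grace period ends
assert not d.start_request()    # new work is rejected once shutdown begins
```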
High Availability and Disaster Recovery
Production Docker deployments must survive host failures, network partitions, and data center outages. High availability (HA) ensures continuous operation during failures. Disaster recovery (DR) ensures data and service restoration after catastrophic failures.
Multi-host redundancy: Run multiple replicas of each service across multiple hosts. If one host fails, the orchestrator reschedules containers on healthy hosts. Docker Swarm: use --replicas=3 and ensure the swarm has 3+ manager nodes. Kubernetes: use pod anti-affinity to spread replicas across nodes and zones.
Multi-AZ deployment: Deploy across multiple availability zones (data centers within a region). If one AZ fails, services continue in other AZs. AWS: use ECS/Kubernetes with nodes in 3+ AZs. Use Application Load Balancer (ALB) to distribute traffic across AZs.
Stateful services (databases): Databases require special HA strategies:
- PostgreSQL: streaming replication with automatic failover (Patroni, pg_auto_failover)
- MySQL: Group Replication or Galera Cluster
- Redis: Redis Sentinel or Redis Cluster
- Use volumes for data persistence. Back up volumes to object storage (S3) regularly.
Data backup and recovery:
- Volume snapshots: snapshot the storage backing named volumes regularly (cloud provider disk snapshots, or archive the volume contents with a backup container; the Docker CLI has no built-in volume snapshot command).
- Database backups: pg_dump, mysqldump, or continuous WAL archiving to object storage.
- Image registry backup: replicate images across regions (ECR replication, Harbor replication).
- Configuration backup: store all configuration (Docker Compose, Kubernetes manifests, daemon.json) in version control.
Health checks and self-healing: The orchestrator uses health checks to detect unhealthy containers and automatically restart or reschedule them. Liveness probes detect deadlocked processes (restart the container). Readiness probes detect services that are not ready to receive traffic (remove from load balancer). Startup probes detect slow-starting applications (give them more time before health checking).
Failover testing: HA is only as good as your last failover test. Regularly simulate failures: kill a container, drain a node, shut down an AZ. Measure the time to recovery and the error rate during failover. If you have never tested failover, you do not have HA.
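A failover test is only useful if it is scored. A Python sketch that computes time-to-recovery and the error rate during the failover window from a request log (the log data and the last-error recovery heuristic are hypothetical; the 30 s / 5% thresholds follow the guidance in this section):

```python
# Sketch: score a failover drill from (timestamp, ok) request records.

def score_failover(events, fail_t, threshold_ttr=30.0, threshold_err=0.05):
    """events: list of (timestamp_s, ok) requests; fail_t: failure injection time."""
    window = [(t, ok) for t, ok in events if t >= fail_t]
    errors = [t for t, ok in window if not ok]
    err_rate = len(errors) / len(window) if window else 0.0
    ttr = (max(errors) - fail_t) if errors else 0.0  # last error marks recovery
    return {"ttr_s": ttr, "error_rate": err_rate,
            "pass": ttr <= threshold_ttr and err_rate <= threshold_err}

healthy = [(float(t), True) for t in range(100)]     # steady traffic pre-failure
failover = [(100.0 + t, t >= 1) for t in range(20)]  # one failed request at t=100
result = score_failover(healthy + failover, fail_t=100.0)
print(result["pass"])  # True: TTR 0 s, error rate 5%
```

Feeding real load-balancer access logs through a scorer like this turns "we think failover worked" into a pass/fail number you can trend across monthly drills.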
#!/bin/bash
# High availability and disaster recovery configuration

# ── Multi-host redundancy (Docker Swarm) ────────────────────────────────────
# Create a service with replicas spread across nodes
docker service create \
  --name io-thecodeforge-api \
  --replicas 6 \
  --constraint 'node.role==worker' \
  --placement-pref 'spread=node.id' \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --update-parallelism 2 \
  --update-delay 10s \
  --update-failure-action rollback \
  --restart-condition on-failure \
  --restart-delay 5s \
  --restart-max-attempts 3 \
  registry.example.com/io-thecodeforge/api:1.0.0

# Verify distribution across nodes
docker service ps io-thecodeforge-api --format '{{.Node}} {{.CurrentState}}'
# worker1 Running
# worker2 Running
# worker3 Running
# worker1 Running
# worker2 Running
# worker3 Running

# ── Multi-AZ pod anti-affinity (Kubernetes) ─────────────────────────────────
cat <<'EOF' > /tmp/multi-az-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - io-thecodeforge-api
            topologyKey: topology.kubernetes.io/zone
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - io-thecodeforge-api
              topologyKey: kubernetes.io/hostname
EOF
kubectl apply -f /tmp/multi-az-deployment.yaml

# ── Volume backup (named volume to S3) ──────────────────────────────────────
# Create a backup container that mounts the volume and uploads to S3
docker run --rm \
  -v postgres-data:/data:ro \
  -e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
  -e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
  amazon/aws-cli s3 cp /data s3://my-backups/postgres-data/$(date +%Y-%m-%d)/ --recursive

# ── Kubernetes CronJob for automated backups ────────────────────────────────
cat <<'EOF' > /tmp/backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:16
            command:
            - /bin/sh
            - -c
            - |
              pg_dump -h db-host -U postgres -d mydb | \
              gzip | \
              aws s3 cp - s3://my-backups/db/mydb-$(date +%Y%m%d-%H%M%S).sql.gz
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
          restartPolicy: OnFailure
EOF
# NOTE: the backup image must provide both pg_dump and the AWS CLI;
# postgres:16 alone does not ship the aws command, so use a custom image
kubectl apply -f /tmp/backup-cronjob.yaml

# ── Failover testing ────────────────────────────────────────────────────────
# Kill a random container and verify self-healing
docker service scale io-thecodeforge-api=3
sleep 5

# Kill a container
docker kill $(docker ps -q | head -1)

# Watch the orchestrator reschedule
watch docker service ps io-thecodeforge-api
# A new container should start within seconds

# Kubernetes: simulate node failure
kubectl cordon k8s-worker-2   # Mark node as unschedulable
kubectl drain k8s-worker-2 --ignore-daemonsets --delete-emptydir-data  # Evict all pods
# Pods are rescheduled to other nodes

# Verify all pods are running on remaining nodes
kubectl get pods -o wide | grep io-thecodeforge-api
worker1 Running
worker2 Running
worker3 Running
worker1 Running
worker2 Running
worker3 Running
# Evenly distributed across 3 nodes
# Multi-AZ:
deployment.apps/io-thecodeforge-api configured
# Backup:
upload: data/ to s3://my-backups/postgres-data/2026-04-05/
# Failover test:
# Container killed on worker2
# New container started on worker1 within 3 seconds
# Service remained available throughout
- Configuration without testing is an assumption. Failover may fail due to DNS TTL, connection pool exhaustion, or split-brain scenarios.
- Regular failover tests reveal hidden dependencies that are not visible in configuration.
- Measure time-to-recovery (TTR) and the error rate during failover. If TTR > 30s or the error rate > 5%, the failover is inadequate.
- Run failover tests monthly. Test killing containers, draining nodes, and simulating AZ failures.
| Aspect | Docker Swarm | Kubernetes | AWS ECS | AWS Fargate |
|---|---|---|---|---|
| Complexity | Low | High | Medium | Low |
| Learning curve | 1-2 weeks | 2-6 months | 2-4 weeks | 1-2 weeks |
| Self-healing | Yes | Yes | Yes | Yes |
| Auto-scaling | Limited (external) | HPA, VPA, KEDA | Service Auto Scaling | Service Auto Scaling |
| Service mesh | No | Istio, Linkerd | App Mesh | App Mesh |
| Multi-AZ | Manual | Built-in (topology spread) | Built-in | Built-in |
| Host management | Self-managed | Self-managed (or EKS/GKE) | EC2 instances | No hosts (serverless) |
| Cost | Lowest (self-managed) | Medium (EKS $73/mo + nodes) | Medium (EC2 + ECS) | Highest (20-30% premium) |
| Ecosystem | Small | Massive | AWS-native | AWS-native |
| Best for | Small teams, simple deployments | Large teams, complex workloads | AWS-native, medium complexity | Minimal ops, AWS-native |
π― Key Takeaways
- Production Docker requires an orchestrator (Swarm, Kubernetes, ECS) for scheduling, scaling, self-healing, and networking. Choose based on team size and ecosystem needs.
- Resource limits are mandatory: set --memory and --cpus on every container. Without limits, one container can starve all others on the host.
- Logging must go to stdout/stderr with rotation. Ship to a centralized system. Never write logs to the container filesystem.
- CI/CD requires multi-stage builds, git SHA tags, image scanning, one-command rollback, and immutable images. Never use :latest in production.
- Security requires non-root users, dropped capabilities, secret managers, seccomp profiles, and daemon socket protection. Defense in depth at every layer.
- Scaling requires horizontal replicas for stateless workloads, auto-scaling based on metrics, pre-warming for cold start, and graceful shutdown for zero-downtime deploys.
- HA requires multi-host replicas with cross-node placement, multi-AZ deployment, regular backups, and tested failover. Untested failover is not HA.
❌ Common Mistakes to Avoid
- ❌ Mistake 1: Using :latest tags in production → Symptom: different hosts run different image versions because :latest was updated between deployments → Fix: tag images with the git SHA (immutable). Never use :latest in production. Pin base images to specific versions.
- ❌ Mistake 2: No resource limits on containers → Symptom: one container's memory leak causes OOM kills on unrelated containers → Fix: set --memory and --cpus on every production container. Monitor host-level resource usage. Use Guaranteed QoS (requests == limits) for critical services.
- ❌ Mistake 3: Writing logs to the container filesystem → Symptom: logs are lost when the container restarts → Fix: write all logs to stdout/stderr. Configure log rotation in daemon.json (max-size: 10m, max-file: 3). Ship logs to a centralized system.
- ❌ Mistake 4: No health checks → Symptom: a container is running but the application inside has crashed, and the load balancer keeps sending traffic to a dead container → Fix: configure liveness and readiness probes. Liveness restarts deadlocked processes. Readiness removes unhealthy containers from the load balancer.
- ❌ Mistake 5: No graceful shutdown → Symptom: in-flight requests are dropped during deployments or scale-down → Fix: add preStop hooks with a sleep delay. Handle SIGTERM in the application to drain requests. Set terminationGracePeriodSeconds appropriately.
- ❌ Mistake 6: Mounting /var/run/docker.sock into containers → Symptom: a compromised container gains root access to the host via the daemon socket → Fix: never mount the daemon socket. Use a restricted Docker API proxy if containers genuinely need to manage sibling containers.
- ❌ Mistake 7: No overlay network MTU configuration → Symptom: silent packet drops under high load due to fragment queue overflow → Fix: calculate the overlay MTU as underlay_MTU - 50. Monitor IP fragment queue drops (cat /proc/net/snmp | grep -i frag).
- ❌ Mistake 8: No rollback strategy → Symptom: a bad deploy takes 30 minutes to recover because you must rebuild and redeploy → Fix: maintain a one-command rollback (kubectl rollout undo, docker service rollback). Use immutable image tags so rollback is just pointing to a previous tag.
Interview Questions on This Topic
- Q: Walk me through your production Docker architecture, from the load balancer to the container runtime. What are the failure modes at each layer?
- Q: A container is OOM-killed every 2 hours. Walk me through your debugging process. What commands do you run, and what are the most likely root causes?
- Q: Your team is migrating from Docker Swarm to Kubernetes. What are the key architectural differences, and what production concerns must be addressed during the migration?
- Q: Explain the difference between liveness, readiness, and startup probes. Give a concrete example of when each is needed.
- Q: Your CI/CD pipeline deploys 10 times per day with zero downtime. Describe the deployment strategy, health checks, and rollback mechanism.
- Q: A container is running but not serving traffic. Walk me through the debugging steps, from the load balancer to the application process.
- Q: How do you handle secrets in Docker production deployments? Compare Docker Swarm secrets, Kubernetes secrets, and external secret managers.
- Q: Your overlay network is dropping packets under high load. What is the most likely cause, and how do you fix it?
Frequently Asked Questions
Should I use Docker Swarm or Kubernetes in production?
Use Docker Swarm if you have a small team (< 10 engineers), fewer than 50 services, and want simplicity. Swarm is easier to learn and operate. Use Kubernetes if you need the ecosystem (service mesh, GitOps, custom operators), have a platform team, or run more than 50 services. Kubernetes has a steeper learning curve but provides more flexibility and a larger ecosystem.
How do I handle persistent data in Docker production?
Use named volumes (docker volume create) for persistent data. Volumes survive container restarts and removals. Back up volumes regularly to object storage (S3) using a backup container or CronJob. For databases, use cloud-managed database services (RDS, Cloud SQL) when possible β they handle replication, backups, and failover automatically.
How do I debug a container that is running but not serving traffic?
Check in order: (1) Is the process listening on the correct port and interface? (docker exec <container> ss -tlnp). (2) Is the health check passing? (docker inspect <container> --format '{{.State.Health.Status}}'). (3) Is the container on the correct network? (docker network inspect <network>). (4) Are there iptables rules blocking traffic? (iptables -L -n). (5) Is the application actually started? (docker logs <container>).
What is the difference between liveness and readiness probes?
A liveness probe checks if the container is alive. If it fails, Kubernetes restarts the container. Use liveness for detecting deadlocked processes. A readiness probe checks if the container is ready to serve traffic. If it fails, Kubernetes removes the container from the load balancer but does not restart it. Use readiness for detecting initialization issues or temporary overload. A startup probe gives slow-starting applications extra time before liveness checks begin.
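The division of labor between the two probes can be summarized as a small decision table. A Python sketch (the action strings are descriptive labels, not Kubernetes API values):

```python
# Sketch: map probe outcomes to the orchestrator's responses.

def probe_actions(liveness_ok: bool, readiness_ok: bool) -> list[str]:
    actions = []
    if not liveness_ok:
        actions.append("restart container")          # deadlocked: restart
    if not readiness_ok:
        actions.append("remove from load balancer")  # not ready: stop traffic
    return actions or ["serve traffic"]

print(probe_actions(True, True))    # ['serve traffic']
print(probe_actions(True, False))   # ['remove from load balancer']
print(probe_actions(False, False))  # ['restart container', 'remove from load balancer']
```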
How do I achieve zero-downtime deployments with Docker?
Use rolling updates with health checks. The orchestrator starts new containers, waits for them to pass health checks, then stops old containers. Add preStop hooks with a sleep delay to allow load balancers to drain connections. Handle SIGTERM in your application to stop accepting new requests and drain in-flight requests. Set terminationGracePeriodSeconds to the maximum drain time.
How do I monitor Docker in production?
Collect three types of signals: (1) Logs β ship stdout/stderr to ELK, Datadog, or Loki. (2) Metrics β use Prometheus with node_exporter and cAdvisor. Alert on memory > 80% of limit, CPU throttling > 10%, restart count > 5/hour. (3) Traces β use OpenTelemetry with Jaeger or Zipkin for distributed tracing across microservices.
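The alert thresholds above can be expressed as an evaluation function. A Python sketch with hypothetical metric names and sample values (in practice the numbers would come from cAdvisor and node_exporter via Prometheus):

```python
# Sketch: evaluate the three container alert rules from the answer above.

def evaluate_alerts(m: dict) -> list[str]:
    """Return the list of firing alerts for one container's metrics."""
    alerts = []
    if m["mem_bytes"] > 0.8 * m["mem_limit_bytes"]:
        alerts.append("memory > 80% of limit")
    if m["cpu_throttled_ratio"] > 0.10:
        alerts.append("CPU throttling > 10%")
    if m["restarts_last_hour"] > 5:
        alerts.append("restart count > 5/hour")
    return alerts

metrics = {"mem_bytes": 450 * 2**20, "mem_limit_bytes": 512 * 2**20,
           "cpu_throttled_ratio": 0.02, "restarts_last_hour": 0}
print(evaluate_alerts(metrics))  # ['memory > 80% of limit']
```

In a real deployment these rules live in Prometheus alerting configuration rather than application code; the sketch just makes the thresholds concrete.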
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.