Container runtime (containerd, runc) executes containers with namespace isolation and cgroup limits
Image registry (ECR, GCR, private Harbor) stores and distributes images with vulnerability scanning
Service mesh (Istio, Linkerd) handles mTLS, traffic management, and observability
Resource limits are mandatory — without --memory and --cpus, one container can starve all others on the host
Logging must go to stdout/stderr — never write logs to container filesystem (lost on restart)
Images must be immutable — tag with git SHA, never use :latest in production
Health checks must be liveness + readiness — liveness restarts, readiness removes from load balancer
Plain-English First
Running Docker in production is like running a restaurant kitchen versus cooking at home. At home, you can leave dishes in the sink, ignore the smoke alarm, and run to the store if you forgot an ingredient. In a restaurant kitchen, every dish must be tracked, every station must be clean, every appliance must be monitored, and if the oven breaks at 7 PM on Saturday, you need a backup plan immediately. The cooking technique (Docker) is the same — the operational requirements are completely different.
Docker works out of the box for development. Running a single container on a laptop requires no orchestration, no monitoring, and no security hardening. Production is a different problem entirely — hundreds of containers across dozens of hosts, with requirements for zero-downtime deploys, automatic scaling, persistent data, and compliance auditing.
Most Docker-in-production failures fall into five categories: resource exhaustion (no limits set), networking misconfiguration (DNS, overlay MTU, port conflicts), logging gaps (logs in container filesystem, not shipped to central system), security exposure (root containers, exposed daemon socket, unpinned images), and deployment errors (no health checks, no rollback strategy, no canary testing).
This article covers the architecture decisions, operational patterns, and failure scenarios that determine whether your Docker deployment survives production traffic or collapses under it. Every section includes real debugging commands and failure stories.
What Docker Overlay Networking Actually Does
Docker overlay networking creates a virtual Layer 2 network across multiple Docker hosts using VXLAN encapsulation. Each container gets its own IP from a private subnet, and traffic between containers on different hosts is wrapped in UDP packets (typically port 4789) by the kernel's VXLAN implementation. This allows containers to communicate as if they're on the same switch, regardless of physical host placement.
The key mechanic is the VXLAN Tunnel Endpoint (VTEP) — each Docker host runs a VTEP that maps container IPs to host IPs. When a container sends a packet to another container on a different host, the source VTEP encapsulates the original Ethernet frame inside a UDP packet with the destination host's IP. The destination VTEP decapsulates and delivers it. This adds 50 bytes of overhead per packet (20 IP + 8 UDP + 8 VXLAN + 14 inner Ethernet). That overhead is invisible to applications but directly impacts MTU: if the physical network's MTU is 1500, the effective MTU for containers becomes 1450. Ignoring this causes silent packet fragmentation or drops.
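The overhead arithmetic above can be sketched as a small helper. This is a minimal sketch rather than anything Docker ships: the function name `vxlan_overlay_mtu` and the network name `app-overlay` are illustrative assumptions, and the command is only printed, not executed.

```shell
#!/bin/bash
# Hypothetical helper: derive the overlay MTU from the underlay (physical) MTU.
# VXLAN overhead = 20 (outer IP) + 8 (UDP) + 8 (VXLAN) + 14 (inner Ethernet) = 50 bytes.
vxlan_overlay_mtu() {
  local underlay_mtu="$1"
  local overhead=$((20 + 8 + 8 + 14))
  echo $((underlay_mtu - overhead))
}

# Print (not run) the matching network-create command for a 1500-byte underlay.
mtu="$(vxlan_overlay_mtu 1500)"
echo "docker network create --driver overlay --opt com.docker.network.driver.mtu=${mtu} app-overlay"
```

Printing the command first makes it easy to review the computed value per environment before any network is created.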
Use overlay networks when you need multi-host container communication without modifying the underlying network infrastructure, which is typical in Docker Swarm or Kubernetes clusters where hosts span different subnets or cloud regions. The critical production concern is MTU mismatch: if your hosts use jumbo frames (9000 MTU) but part of the underlay path caps at 1500, or if the container MTU is left at 1500 while the host's physical interface is also 1500, you'll see intermittent TCP timeouts, slow transfers, and mysterious connection resets that only appear under load. This is not a theoretical issue: it is one of the most common causes of silent networking failures in Docker overlay deployments.
MTU Mismatch Is Invisible
Docker does not auto-detect the physical network MTU. If your host MTU is 1500, set overlay MTU to 1450. Otherwise, packets >1450 bytes silently fragment or drop.
Production Insight
Teams migrating from bare-metal (jumbo frames) to AWS (1500 MTU) sometimes keep an overlay MTU sized for the old network, causing 50% packet loss for large writes.
Symptom: TCP connections succeed for small payloads but stall on any transfer >1450 bytes — no errors in application logs, only 'connection reset' or 'timeout'.
Rule: Always set 'com.docker.network.driver.mtu' to (physical MTU - 50) on every overlay network creation.
Key Takeaway
Overlay MTU must be 50 bytes less than the physical network MTU to avoid fragmentation.
Silent packet drops from MTU mismatch look like application bugs — always check MTU first.
Docker does not enforce or warn about MTU; it's your responsibility to configure it correctly per environment.
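One way to check MTU first is a don't-fragment ping sized to the overlay MTU. The sketch below only builds the probe command; it assumes Linux iputils `ping` (where `-M do` forbids fragmentation), and the function names and the target address are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: largest ICMP payload that fits in a given MTU.
# 28 bytes = 20 (IPv4 header) + 8 (ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Print a don't-fragment probe command for a target and MTU.
# Assumes Linux iputils ping, where -M do sets the don't-fragment behaviour.
mtu_probe_command() {
  local target="$1" mtu="$2"
  echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$mtu") ${target}"
}

mtu_probe_command 10.0.0.5 1450
```

If a probe sized for the configured overlay MTU fails while a smaller one succeeds, the path is narrower than the configuration assumes.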
Production Architecture: Single Host to Multi-Host Orchestration
Running Docker in production requires an orchestration layer that manages container scheduling, networking, scaling, and self-healing across multiple hosts. Without orchestration, you are managing containers manually — which does not scale beyond 10-20 containers.
Single-host Docker (development only): Running docker run on a single host works for development but fails in production. There is no self-healing (if a container crashes, it stays dead unless you add --restart=always). There is no load balancing (all traffic goes to one container). There is no horizontal scaling (you must manually start more containers). There is no rolling deployment (you must stop the old container before starting the new one, causing downtime).
Docker Swarm: Docker's built-in orchestrator. Manages a cluster of Docker hosts as a single virtual host. Supports service definitions (desired state), rolling updates, and overlay networking. Swarm is simpler than Kubernetes but has fewer features: no custom resource definitions, limited networking options, and a smaller ecosystem. Swarm is adequate for small-to-medium deployments (< 50 services).
Kubernetes (K8s): The industry-standard orchestrator. Manages containers across a cluster with declarative configuration, automated scaling, self-healing, and a rich ecosystem of networking, storage, and observability tools. Kubernetes has a steep learning curve and significant operational overhead — it requires dedicated platform engineers to operate. Kubernetes is the right choice for large deployments (> 50 services) or when you need the ecosystem (service mesh, GitOps, custom operators).
AWS ECS / Fargate: AWS's managed container orchestration. ECS manages container scheduling on EC2 instances. Fargate abstracts the hosts entirely — you pay per container, not per host. ECS is simpler than Kubernetes (no control plane to manage) but locks you to AWS. Fargate eliminates host management entirely but costs 20-30% more than self-managed EC2.
Architecture pattern: The production architecture stack is: Load Balancer -> Ingress Controller -> Orchestrator -> Container Runtime -> Host. Each layer has specific failure modes and debugging approaches. Understanding the full stack is essential for production debugging.
Waiting for deployment "io-thecodeforge-api" rollout to finish: 0 of 3 updated replicas are available...
Waiting for deployment "io-thecodeforge-api" rollout to finish: 1 of 3 updated replicas are available...
Waiting for deployment "io-thecodeforge-api" rollout to finish: 2 of 3 updated replicas are available...
deployment "io-thecodeforge-api" successfully rolled out
Orchestration as a Conductor
Kubernetes manages not just containers but networking (CNI), storage (CSI), service discovery, ingress, RBAC, and custom resources.
Swarm is simpler because it delegates networking and storage to Docker's built-in drivers.
Kubernetes' complexity is the cost of flexibility — it can model any production topology.
For simple deployments (< 50 services), Swarm is sufficient and far easier to operate.
Production Insight
The choice of orchestrator determines your operational complexity for years. Migrating from Swarm to Kubernetes (or vice versa) is a multi-month project. Choose based on team size (Swarm for small teams, Kubernetes for platform teams), ecosystem needs (Kubernetes wins on ecosystem), and cloud strategy (ECS for AWS-only, Kubernetes for multi-cloud). Do not default to Kubernetes because it is popular — default to it because your workload requires it.
Key Takeaway
Production Docker requires an orchestrator — Swarm for simplicity, Kubernetes for flexibility, ECS/Fargate for AWS-native. The orchestrator manages scheduling, scaling, self-healing, and networking. Choose based on team size and ecosystem needs, not popularity.
Orchestrator Selection
If: Small team (< 10 engineers), < 50 services, single cloud → Use: Docker Swarm or AWS ECS. Simpler to operate, lower learning curve.
If: Platform team available, > 50 services, need ecosystem (service mesh, GitOps) → Use: Kubernetes. Higher complexity but maximum flexibility and ecosystem support.
If: AWS-only, want to minimize infrastructure management → Use: AWS Fargate. No hosts to manage, pay per container, but higher cost.
If: Multi-cloud or on-premises requirement → Use: Kubernetes. Portable across clouds with consistent API.
Resource Management: CPU, Memory, OOM, and Noisy Neighbors
Resource management is the most critical production concern for shared container hosts. Without explicit resource limits, one misbehaving container can starve every other container on the same host.
CPU limits: Docker uses cgroups to enforce CPU limits. --cpus=1.0 gives the container access to 1 CPU core worth of time. Without a limit, a container can consume all available CPU. CPU is a compressible resource: the kernel throttles a container that hits its limit but does not kill it. A throttled container runs slower, whereas an unlimited CPU-hungry container slows down its neighbors instead.
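Throttling can be turned into a single number by dividing throttled periods by total periods. The helper below is a minimal sketch: the name `throttle_percent` is hypothetical, and it assumes the kernel's "key value" `cpu.stat` layout (for example `nr_throttled 350`).

```shell
#!/bin/bash
# Hypothetical helper: percentage of CFS periods in which a cgroup was throttled.
# Expects the kernel's "key value" cpu.stat format, e.g. "nr_throttled 350".
throttle_percent() {
  local stat_file="$1"
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%d\n", (throttled * 100) / periods
      else             print 0
    }
  ' "$stat_file"
}

# Demo against a sample file instead of a live cgroup.
sample="$(mktemp)"
printf 'nr_periods 1000\nnr_throttled 350\nthrottled_time 3500000000\n' > "$sample"
throttle_percent "$sample"   # prints 35
rm -f "$sample"
```

On a real host the argument would be the container's `cpu.stat` path, which differs between cgroup v1 and v2.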
Memory limits: Memory is an incompressible resource. When a container exceeds its memory limit, the kernel OOM killer terminates it. The OOM killer selects processes based on oom_score, a heuristic driven mainly by memory usage and adjusted by oom_score_adj. Without a memory limit, a leaking container consumes all host memory, and the OOM killer may kill unrelated containers or critical host processes (kubelet, containerd).
Requests vs limits (Kubernetes): Requests guarantee a minimum allocation — the scheduler places the pod on a node with enough available resources. Limits set the maximum — the container is throttled (CPU) or killed (memory) if exceeded. Best practice: set requests equal to limits for critical services (guaranteed QoS). Set requests lower than limits for burstable services (burstable QoS).
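The requests-versus-limits rule can be sketched as a classifier. This is a simplified illustration for a single-container pod: the function `qos_class` is hypothetical, it compares values as plain strings (so `1` and `1000m` are treated as different), and it also reports BestEffort, the class Kubernetes assigns when nothing is set.

```shell
#!/bin/bash
# Hypothetical helper: classify a single-container pod into a Kubernetes QoS class.
# Arguments: cpu_request cpu_limit mem_request mem_limit (empty string = unset).
qos_class() {
  local cpu_req="$1" cpu_lim="$2" mem_req="$3" mem_lim="$4"
  if [ -z "${cpu_req}${cpu_lim}${mem_req}${mem_lim}" ]; then
    echo "BestEffort"
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
       && [ "${cpu_req:-$cpu_lim}" = "$cpu_lim" ] \
       && [ "${mem_req:-$mem_lim}" = "$mem_lim" ]; then
    # An unset request defaults to the limit, so limits alone are enough.
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}

qos_class 1000m 1000m 512Mi 512Mi   # prints Guaranteed
qos_class 500m 1000m 256Mi 512Mi    # prints Burstable
```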
Noisy neighbor problem: Multiple containers on the same host compete for CPU, memory, disk I/O, and network bandwidth. Without resource limits, one container's spike affects all others. The fix: set limits on every production container. Monitor host-level resource usage with docker stats and Prometheus node_exporter.
OOM score and priority: The kernel assigns each process an oom_score from 0 to 1000. Higher scores are killed first. Docker sets oom_score_adj for each container — containers with higher scores are killed before lower-scored containers. Critical services (databases) should have oom_score_adj=-999 (almost never killed). Non-critical services should have oom_score_adj=1000 (killed first).
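To see which processes the kernel would pick first, the scores under `/proc` can be ranked. The sketch below assumes a Linux procfs layout; the function name `oom_ranking` and its optional directory argument (useful for testing against a fake tree) are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list processes by oom_score, highest (killed first) at the top.
# proc_root defaults to /proc; a different directory can be passed for testing.
# Output format: "<oom_score> <pid>", one process per line.
oom_ranking() {
  local proc_root="${1:-/proc}" limit="${2:-5}" dir pid score
  for dir in "$proc_root"/[0-9]*; do
    [ -r "$dir/oom_score" ] || continue
    pid="${dir##*/}"
    score="$(cat "$dir/oom_score" 2>/dev/null)" || continue
    [ -n "$score" ] && echo "$score $pid"
  done | sort -rn | head -n "$limit"
}
```

Running `oom_ranking /proc 10` on a container host gives a quick view of the ten most likely OOM victims, which can then be compared against the priorities you intended.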
io/thecodeforge/resource_management.sh (Bash)
#!/bin/bash
# Production resource management configuration and monitoring
# ── CPU limits ───────────────────────────────────────────────────────────────
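# Run with a 1 CPU core limit
# (stock alpine has no stress tool; this assumes stress-ng from the Alpine community repo)
docker run --cpus=1.0 --name cpu-test alpine:3.19 sh -c 'apk add --no-cache stress-ng && stress-ng --cpu 2 --timeout 10s'
# The container is throttled to 1 CPU even if stress-ng spawns 2 workers
# Check CPU throttling (cgroup v1 path; cgroup v2 hosts use a different layout)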
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat
# nr_periods: total scheduling periods
# nr_throttled: periods where the container was throttled
# throttled_time: total time throttled (nanoseconds)
# Check CPU shares (relative priority)
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.shares
# Default: 1024. Set with --cpu-shares=512 for lower priority
# ── Memory limits ────────────────────────────────────────────────────────────
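# Run with a 256 MB memory limit (stress-ng assumed from the Alpine community repo)
docker run --memory=256m --memory-swap=256m --name mem-test alpine:3.19 sh -c 'apk add --no-cache stress-ng && stress-ng --vm 1 --vm-bytes 300M --timeout 10s'
# The stress worker is OOM-killed when it exceeds the 256 MB limit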
# Check memory usage before OOM
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.max_usage_in_bytes
# Check OOM events
dmesg | grep -i 'oom\|killed process' | tail -10
# [12345.678] Out of memory: Killed process 5678 (node) total-vm:123456kB, anon-rss:98765kB
# ── OOM score management ────────────────────────────────────────────────────
# Check a container's OOM score
CONTAINER_PID=$(docker inspect <container> --format '{{.State.Pid}}')
cat /proc/$CONTAINER_PID/oom_score
# 0-1000: higher = more likely to be killed
cat /proc/$CONTAINER_PID/oom_score_adj
# -1000 to 1000: adjust the score
# Set OOM priority for critical services (database)
# (postgres will not start without a password; the value here is a placeholder)
docker run -d --oom-score-adj=-999 -e POSTGRES_PASSWORD=change-me --name critical-db postgres:16
# This container is almost never killed by the OOM killer
# Set OOM priority for non-critical services (cache)
docker run -d --oom-score-adj=1000 --name expendable-cache redis:7
# This container is killed first in an OOM situation
# ── Kubernetes resource management ──────────────────────────────────────────
# Guaranteed QoS: requests == limits (last to be evicted under node pressure)
cat <<'EOF'
resources:
  requests:
    cpu: 1000m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 512Mi
EOF
# Burstable QoS: requests < limits (can burst but may be throttled/killed)
cat <<'EOF'
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
EOF
# ── Monitor resource usage across all containers ─────────────────────────────
# Real-time resource usage
docker stats --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}'
# Find containers without resource limits
docker ps -q | xargs -I{} docker inspect {} --format '{{.Name}}: CPU={{.HostConfig.NanoCpus}} MEM={{.HostConfig.Memory}}'
# Containers with NanoCpus=0 or Memory=0 have no limits
# Host-level resource check
free -h
cat /proc/loadavg
uptime
Output
# CPU throttling:
nr_periods: 1000
nr_throttled: 350
throttled_time: 3500000000 (3.5 seconds)
# Memory OOM:
[12345.678] Out of memory: Killed process 5678 (stress) total-vm:312320kB, anon-rss:262144kB
CPU is compressible — the kernel throttles a CPU-hungry container but does not kill it.
Memory is incompressible — when physical memory is exhausted, the kernel must kill a process.
Without memory limits, the OOM killer may kill critical host processes (containerd, kubelet).
With memory limits, only the offending container is killed — other containers are unaffected.
Production Insight
The OOM killer's process selection is not random: it uses oom_score to decide which process to kill. Without explicit oom_score_adj, the OOM killer may kill a critical database before killing a non-critical cache. Set oom_score_adj=-999 for databases and oom_score_adj=1000 for expendable services. In Kubernetes, Guaranteed QoS pods (requests == limits) are the last candidates for eviction under node resource pressure, so use this class for critical services.
Key Takeaway
CPU limits cap a container by throttling it rather than letting it starve its neighbors. Memory limits confine an OOM kill to the container that exceeded its limit. Without limits, one container can starve all others on the host. Set requests == limits for critical services (Guaranteed QoS). Set oom_score_adj for priority-based OOM protection. Monitor host-level resource usage, because container-level metrics miss cross-container contention.
Resource Limit Strategy
If: Critical stateful service (database, queue) → Use: Guaranteed QoS: requests == limits. Set oom_score_adj=-999. Use dedicated node pools.
If: Stateless API with predictable load → Use: Guaranteed QoS: requests == limits based on load testing. Monitor for throttling.
If: Batch job or worker with variable load → Use: Burstable QoS: requests < limits. Allow bursting but set sensible limits.
If: Development or testing environment → Use: No limits acceptable. But never deploy without limits to production.
Networking in Production: DNS, Overlay, Load Balancing, and Service Mesh
Production Docker networking requires reliable DNS resolution, load balancing, and health-aware traffic routing. The default bridge network provides none of these — production deployments must use user-defined networks or an orchestrator's networking layer.
DNS-based service discovery: User-defined Docker networks and Kubernetes provide DNS-based service discovery. Containers resolve service names to IP addresses via an embedded DNS server (127.0.0.11 in Docker, CoreDNS in Kubernetes). The default bridge network has no DNS — containers can only reach each other by IP, which changes on every restart.
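A quick way to confirm that a container is on a user-defined network is to look for the embedded resolver in its `resolv.conf`. This is a minimal sketch; the function name `uses_embedded_dns` is hypothetical, and it inspects a file path so it can be run against a copy taken from a container.

```shell
#!/bin/bash
# Hypothetical helper: report whether a resolv.conf points at Docker's embedded DNS.
# Returns 0 if a "nameserver 127.0.0.11" line is present, 1 otherwise.
uses_embedded_dns() {
  local resolv_file="${1:-/etc/resolv.conf}"
  grep -Eq '^nameserver[[:space:]]+127\.0\.0\.11[[:space:]]*$' "$resolv_file"
}

# Demo against a sample file.
sample="$(mktemp)"
printf 'nameserver 127.0.0.11\noptions ndots:0\n' > "$sample"
if uses_embedded_dns "$sample"; then
  echo "embedded DNS in use"
fi
rm -f "$sample"
```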
Overlay networking: For multi-host deployments, overlay networks use VXLAN encapsulation to create a virtual Layer 2 network across hosts. On a 1500-byte underlay the overlay MTU should be 1450, because VXLAN adds 50 bytes of overhead. Misconfigured MTU is a common production failure: packets larger than the overlay MTU are fragmented, and under high load, the fragment queue can overflow, causing silent packet drops.
Load balancing: Docker Swarm provides built-in load balancing via a routing mesh — any node can route traffic to any service replica. Kubernetes provides kube-proxy (iptables/IPVS-based) and ingress controllers (NGINX, Traefik, Envoy) for external traffic. For production, an ingress controller with TLS termination, rate limiting, and circuit breaking is mandatory.
Service mesh: A service mesh (Istio, Linkerd) adds mTLS between services, traffic splitting (canary deployments), circuit breaking, and observability (distributed tracing, metrics). The trade-off: added latency (1-3ms per hop) and operational complexity. Use a service mesh when you need mTLS or traffic splitting. Do not add one 'just in case.'
Network policies: In Kubernetes, NetworkPolicy resources restrict which pods can communicate with each other. Without network policies, all pods can communicate — a compromised pod can reach the database directly. Default-deny network policies are a production best practice.
io/thecodeforge/production_networking.sh (Bash)
#!/bin/bash
# Production networking configuration and debugging
# ── Docker Swarm overlay network ─────────────────────────────────────────────
# Create an overlay network with correct MTU
docker network create \
  --driver overlay \
  --opt com.docker.network.driver.mtu=8950 \
  --subnet 10.0.0.0/24 \
  --gateway 10.0.0.1 \
  app-overlay
# MTU calculation: VPC MTU (9001) - VXLAN overhead (50) = 8951, round to 8950
# Verify overlay network
docker network inspect app-overlay --format '{{.Driver}} {{.Options}}'
# overlay map[com.docker.network.driver.mtu:8950]
# ── DNS resolution verification ──────────────────────────────────────────────
# Check embedded DNS server
docker exec <container> cat /etc/resolv.conf
# nameserver 127.0.0.11
# options ndots:0
# Resolve a service name
docker exec <container> nslookup io-thecodeforge-api
# Server: 127.0.0.11
# Address: 10.0.0.5
# Check DNS query logs (Docker daemon)
sudo journalctl -u docker | grep 'DNS query' | tail -10
# ── Network health checks ───────────────────────────────────────────────────
# Check overlay network peer status
docker network inspect app-overlay --format '{{.Peers}}'
# Shows all nodes participating in the overlay
# Check IP fragmentation counters (critical for overlay networks)
grep '^Ip:' /proc/net/snmp
# Prints a header row and a value row; read the FragFails and ReasmFails columns
# If either counter keeps rising, fragmented packets are being dropped
# Check MTU of container interface
docker exec <container> cat /sys/class/net/eth0/mtu
# Should match the overlay network MTU (8950)
# ── Traffic debugging with tcpdump ───────────────────────────────────────────
# Capture traffic on the overlay bridge
sudo tcpdump -i docker_gwbridge -n -c 20
# Capture traffic inside a container's namespace
CONTAINER_PID=$(docker inspect <container> --format '{{.State.Pid}}')
sudo nsenter --net --target $CONTAINER_PID tcpdump -i eth0 -n -c 20
# ── Kubernetes network policies (default-deny) ──────────────────────────────
cat <<'EOF' > /tmp/default-deny-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF
cat <<'EOF' > /tmp/allow-api-to-db.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: io-thecodeforge-db
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: io-thecodeforge-api
      ports:
        - protocol: TCP
          port: 5432
EOF
kubectl apply -f /tmp/default-deny-policy.yaml
kubectl apply -f /tmp/allow-api-to-db.yaml
# ── Load balancer health verification ────────────────────────────────────────
# Check which containers are receiving traffic
curl -s http://localhost:80/health | jq .hostname
# Repeat 10 times: should show different hostnames (round-robin)
for i in $(seq 1 10); do
  curl -s http://localhost:80/health | jq -r .hostname
done
Output
# Overlay network created:
abc123def456
# DNS resolution:
Server: 127.0.0.11
Address: 10.0.0.5
# Fragment queue:
Ip: FragCreates FragOKs FragFails
123456789 123456000 789
# FragFails > 0 indicates fragment queue overflow
# MTU check:
8950
# Network policies:
networkpolicy.networking.k8s.io/default-deny-all created
networkpolicy.networking.k8s.io/allow-api-to-db created
# Load balancer verification:
api-1
api-2
api-3
api-1
api-2
# Round-robin distribution confirmed
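Eyeballing hostnames works for five requests; for larger samples a count per replica is easier to read. The sketch below is illustrative: the function name `replica_distribution` is hypothetical, and the sample input mirrors the hostnames shown in the output above.

```shell
#!/bin/bash
# Hypothetical helper: count how many responses each replica served.
# Reads one hostname per line on stdin, prints "count hostname" sorted by hostname.
replica_distribution() {
  sort | uniq -c | awk '{ print $1, $2 }'
}

# Demo with the five hostnames from the sample output.
printf 'api-1\napi-2\napi-3\napi-1\napi-2\n' | replica_distribution
```

A heavily skewed count across replicas suggests sticky sessions, a replica failing health checks, or a connection being reused rather than load-balanced per request.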
Production Networking as a Postal System
VXLAN encapsulation adds 50 bytes of overhead — the overlay MTU must be 50 bytes less than the underlay MTU.
If the overlay MTU is too large, packets are fragmented at the VXLAN boundary.
Under normal load, fragmentation is slow but functional. Under high load, the fragment queue overflows and packets are silently dropped.
The failure is silent — containers appear healthy but inter-service communication fails.
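Because the failure is silent, it helps to read the kernel's fragmentation counters directly. The sketch below parses the header-row/value-row layout of `/proc/net/snmp`; the function name `snmp_ip_counter` and its optional file argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read one Ip counter (e.g. FragFails) from /proc/net/snmp.
# The file stores a header row and a value row per protocol, both prefixed "Ip:".
snmp_ip_counter() {
  local name="$1" snmp_file="${2:-/proc/net/snmp}"
  awk -v name="$name" '
    $1 == "Ip:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    $1 == "Ip:" && seen  { if (name in col) print $(col[name]); exit }
  ' "$snmp_file"
}
```

Sampling `snmp_ip_counter FragFails` and `snmp_ip_counter ReasmFails` a minute apart shows whether drops are still occurring, which matters more than the absolute value since the counters accumulate from boot.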
Production Insight
Default-deny network policies are the single most impactful security improvement for Kubernetes deployments. Without them, any compromised pod can reach any other pod, including databases and secrets stores. Start with a default-deny policy, then add allow rules for each required communication path. This is zero-trust networking at the pod level.
Key Takeaway
Production networking requires DNS-based service discovery, correct overlay MTU, load balancing with health checks, and network policies. Default-deny network policies are mandatory for security. Overlay MTU miscalculation is the most common silent failure — always calculate overlay_MTU = underlay_MTU - 50.
Logging, Monitoring, and Observability
Production observability is the difference between debugging a failure in 5 minutes and debugging it in 5 hours. Docker provides basic logging — production requires a centralized logging pipeline, metrics collection, and distributed tracing.
Container logging model: Docker captures stdout and stderr from each container and writes them to JSON files under /var/lib/docker/containers/<id>/<id>-json.log. Applications must write logs to stdout/stderr — never to a file inside the container. Log files inside the container are lost when the container restarts.
Log rotation: Docker's default JSON log driver has no size limit — log files grow unbounded until disk is full. Production deployments must configure log rotation in daemon.json: max-size (e.g., 10m) and max-file (e.g., 3). Without rotation, a chatty application can fill the host disk in hours.
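The rotation settings described above might look like the following. This is a sketch under stated assumptions: the function name `write_log_rotation_config` and the scratch path are hypothetical, and the sizes are the example values from the paragraph rather than a recommendation.

```shell
#!/bin/bash
# Hypothetical helper: write a daemon.json fragment that enables log rotation.
# Keys are options of Docker's json-file log driver; sizes are example values.
write_log_rotation_config() {
  local target="$1" max_size="${2:-10m}" max_file="${3:-3}"
  cat > "$target" <<EOF
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "${max_size}",
    "max-file": "${max_file}"
  }
}
EOF
}

# Write to a scratch path for review instead of touching /etc/docker/daemon.json.
write_log_rotation_config /tmp/daemon.json.example
cat /tmp/daemon.json.example
```

After merging the fragment into the real daemon.json and restarting the daemon, the options apply to newly created containers; existing containers keep the settings they were created with.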
Centralized logging: Container logs must be shipped to a centralized system (ELK, Datadog, CloudWatch, Loki) for search, alerting, and retention. Use a logging agent (Fluentd, Filebeat, FireLens) as a DaemonSet or sidecar. The agent reads container logs and ships them to the central system.
Metrics collection: Container metrics (CPU, memory, network, disk I/O) are exposed by Docker (docker stats) and cAdvisor. For production, use Prometheus with node_exporter (host metrics) and cAdvisor (container metrics). Kubernetes exposes metrics via the metrics-server. Alert on: container memory usage > 80% of limit, CPU throttling > 10%, restart count > 5 in 1 hour.
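The memory alert rule can be sketched as plain arithmetic. In practice this logic would live in a Prometheus alert rule; the shell function below, with the hypothetical name `memory_alert`, only illustrates the threshold check.

```shell
#!/bin/bash
# Hypothetical helper: flag a container whose memory usage exceeds a share of its limit.
# Arguments: usage_bytes limit_bytes [threshold_percent, default 80].
memory_alert() {
  local usage="$1" limit="$2" threshold="${3:-80}"
  if [ "$limit" -le 0 ]; then
    # A limit of 0 means no limit was configured for the container.
    echo "NO_LIMIT"
    return 0
  fi
  local percent=$(( usage * 100 / limit ))
  if [ "$percent" -gt "$threshold" ]; then
    echo "ALERT ${percent}%"
  else
    echo "OK ${percent}%"
  fi
}

memory_alert 230686720 268435456   # 220 MiB of 256 MiB, prints ALERT 85%
memory_alert 134217728 268435456   # 128 MiB of 256 MiB, prints OK 50%
```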
Distributed tracing: For microservices, distributed tracing (Jaeger, Zipkin, OpenTelemetry) tracks a request across multiple services. Each service adds a trace ID to outgoing requests and logs. The tracing system aggregates these logs into a single trace view. Essential for debugging latency issues in multi-service architectures.
Structured logging: Applications should emit structured logs (JSON) with fields: timestamp, level, message, trace_id, service, request_id. Unstructured logs (plain text) are impossible to parse and alert on at scale.
Unstructured logs (plain text) cannot be parsed by log aggregation systems at scale.
Structured logs (JSON) allow filtering by service, level, trace_id, and custom fields.
Alerts on structured logs (e.g., 'error rate > 5% in 5 minutes') require parseable fields.
Without structured logging, you are grepping through terabytes of text files.
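A structured log line with the fields listed above could be emitted like this. It is a minimal sketch: `SERVICE_NAME` and `TRACE_ID` are assumed environment variables rather than a Docker convention, the function names are hypothetical, and the escaping handles only backslashes and double quotes (not newlines or control characters), so a real service should use its language's JSON encoder.

```shell
#!/bin/bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Handles backslashes and double quotes only.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: emit one structured JSON log line on stdout.
log_json() {
  local level="$1" message="$2"
  printf '{"timestamp":"%s","level":"%s","message":"%s","service":"%s","trace_id":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(json_escape "$level")" \
    "$(json_escape "$message")" \
    "$(json_escape "${SERVICE_NAME:-unknown}")" \
    "$(json_escape "${TRACE_ID:-}")"
}

log_json error 'payment "declined" by gateway'
```

Because the line goes to stdout, Docker's log driver captures it unchanged and the central system can index `level`, `service`, and `trace_id` as fields.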
Production Insight
Log rotation is the most overlooked production configuration. A team deployed 50 containers without log rotation. Within 48 hours, the host disk was full. The Docker daemon crashed because it could not write log files. All containers became unreachable. The fix was 3 lines in daemon.json. Set log rotation on every production host before deploying the first container.
Key Takeaway
Production logging requires: stdout/stderr output, log rotation (max-size, max-file), centralized aggregation (ELK, Datadog, Loki), structured JSON format, and trace IDs. Metrics require Prometheus with alerts on memory > 80%, CPU throttling > 10%, and restart count > 5/hour. Distributed tracing is mandatory for microservices debugging.
CI/CD Pipeline: Image Building, Scanning, and Deployment Strategies
Production deployments require a CI/CD pipeline that builds, scans, tests, and deploys container images with zero downtime. Manual docker build && docker push does not scale and introduces human error.
Image building best practices:
- Use multi-stage builds to separate build dependencies from runtime. The final image should contain only the application binary and runtime dependencies.
- Pin base image versions (node:20.11-alpine, not node:latest). Latest tags change without notice.
- Use .dockerignore to exclude build context bloat (node_modules, .git, *.log).
- Enable BuildKit (DOCKER_BUILDKIT=1) for parallel builds and secret mounting.
- Tag images with git SHA (not :latest, not :v1). The git SHA is immutable and traceable.
Image scanning: Every image must be scanned for known CVEs before deployment. Tools: Trivy, Snyk, AWS ECR scanning, Grype. Block deployment if critical or high CVEs are found. Scan the base image AND the application dependencies.
Deployment strategies:
- Rolling update: replace containers one at a time. Simple but no rollback guarantee.
- Blue-green: deploy new version alongside old, switch traffic atomically. Instant rollback.
- Canary: deploy to 5% of traffic, monitor for errors, then gradually increase. Best for catching regressions.
- A/B testing: deploy two versions simultaneously, split traffic by user segment. Best for feature testing.
Rollback: Every deployment must have a one-command rollback. In Kubernetes: kubectl rollout undo. In Docker Swarm: docker service rollback. In ECS: update the service to the previous task definition. If rollback requires a new build, you do not have a rollback strategy.
Image immutability: Never push to the same tag twice. If you rebuild an image, use a new tag (new git SHA). Mutable tags (pushing to :latest or :v1 twice) cause 'works on my machine' bugs because different hosts have different image layers cached.
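A pipeline can reject obviously mutable references before deploying. The check below is a naming-level sketch only: the function `is_immutable_ref` is hypothetical, and it cannot prove that a tag was never overwritten in the registry; it just refuses untagged references and `:latest`.

```shell
#!/bin/bash
# Hypothetical helper: reject image references that are untagged or use :latest.
# Returns 0 for an explicit, non-latest tag (or a digest); 1 otherwise.
is_immutable_ref() {
  local ref="$1"
  local last="${ref##*/}"   # strip registry/path so a registry port is not read as a tag
  case "$last" in
    *:latest) return 1 ;;
    *:?*)     return 0 ;;
    *)        return 1 ;;
  esac
}

for ref in registry.example.com/io-thecodeforge/api:3f2a1bc registry.example.com/io-thecodeforge/api:latest; do
  if is_immutable_ref "$ref"; then
    echo "ok:     $ref"
  else
    echo "reject: $ref"
  fi
done
```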
io/thecodeforge/cicd_pipeline.sh (Bash)
#!/bin/bash
# Production CI/CD pipeline for Docker images
# ── Multi-stage Dockerfile ───────────────────────────────────────────────────
cat <<'EOF' > /tmp/Dockerfile
# Stage 1: Build
FROM node:20.11-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install all dependencies: the build step usually needs devDependencies
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only runtime packages reach the final image
RUN npm prune --omit=dev
# Stage 2: Runtime (minimal image)
FROM node:20.11-alpine AS runtime
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup
WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=10s --timeout=5s --retries=3 \
    CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
EOF
# ── Build with BuildKit ──────────────────────────────────────────────────────
GIT_SHA=$(git rev-parse --short HEAD)
IMAGE_TAG="registry.example.com/io-thecodeforge/api:${GIT_SHA}"
DOCKER_BUILDKIT=1 docker build \
  --tag "${IMAGE_TAG}" \
  --label "io.thecodeforge.build.sha=${GIT_SHA}" \
  --label "io.thecodeforge.build.date=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --file /tmp/Dockerfile \
  .
docker push ${IMAGE_TAG}
# ── Image scanning with Trivy ───────────────────────────────────────────────
trivy image --severity HIGH,CRITICAL --exit-code 1 ${IMAGE_TAG}
# Exit code 1 = vulnerabilities found, block deployment
# Exit code 0 = no critical/high vulnerabilities
# ── Deployment: rolling update (Kubernetes) ──────────────────────────────────
kubectl -n production set image deployment/io-thecodeforge-api \
api=${IMAGE_TAG}
# Monitor rollout
kubectl -n production rollout status deployment/io-thecodeforge-api --timeout=300s
# ── Rollback (one command) ──────────────────────────────────────────────────
kubectl -n production rollout undo deployment/io-thecodeforge-api
# ── Canary deployment (Kubernetes with Istio) ───────────────────────────────
cat <<'EOF' > /tmp/canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  hosts:
    - api.example.com
  http:
    - route:
        - destination:
            host: io-thecodeforge-api
            subset: stable
          weight: 95
        - destination:
            host: io-thecodeforge-api
            subset: canary
          weight: 5
EOF
kubectl apply -f /tmp/canary-virtualservice.yaml
# Monitor canary error rate
# If error rate > 1%, rollback:
kubectl delete -f /tmp/canary-virtualservice.yaml
# ── Verify image immutability ────────────────────────────────────────────────
# Check that no two images share the same tag
docker images --format '{{.Repository}}:{{.Tag}} {{.ID}}' | sort | uniq -w 50 -d
# If output is non-empty, the same tag points to different image IDs (bad)
Output
# Build:
#5 [builder 4/4] RUN npm run build
#5 DONE 12.3s
#7 [runtime 5/5] CMD ["node", "dist/server.js"]
#7 DONE 0.1s
# Push:
The push refers to registry.example.com/io-thecodeforge/api
abc123: Pushed
def456: Pushed
abc789: digest: sha256:xyz... size: 1570
# Scan:
Total: 0 (HIGH: 0, CRITICAL: 0)
# No critical vulnerabilities — deployment approved
# Rollout:
Waiting for deployment "io-thecodeforge-api" rollout to finish...
deployment "io-thecodeforge-api" successfully rolled out
# Rollback:
deployment.apps/io-thecodeforge-api rolled back
# Immutability check:
# (empty output = no duplicate tags = good)
CI/CD Pipeline as a Manufacturing Assembly Line
Mutable tags cause different hosts to run different image versions — the same tag means different things on different machines.
Immutable tags (git SHA) guarantee that every host runs the exact same binary.
Rollback is trivial with immutable tags — just point to the previous SHA.
Debugging is deterministic — the git SHA maps directly to the source code that produced the image.
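The tagging rule above can be enforced mechanically before a deploy. The following is a minimal sketch, assuming that "immutable" means the tag is a 7 to 40 character lowercase hex git SHA; the function name `is_immutable_ref` is hypothetical, and digest-pinned references (`repo@sha256:...`) are out of scope here.

```bash
#!/bin/bash
# Sketch: reject image references that do not carry a git-SHA tag.
# Assumption: a valid tag is 7-40 lowercase hex characters (short or full SHA).
is_immutable_ref() {
  local ref="$1" tag
  case "$ref" in
    *:*) tag="${ref##*:}" ;;   # text after the last colon
    *)   return 1 ;;           # no tag: Docker would resolve this to :latest
  esac
  case "$tag" in
    */*) return 1 ;;           # the colon was a registry port, so there is no tag
  esac
  printf '%s\n' "$tag" | grep -Eq '^[0-9a-f]{7,40}$'
}

# Example: check candidate references before handing them to the deploy step
for ref in "registry.example.com/io-thecodeforge/api:3f2a9c1" \
           "registry.example.com/io-thecodeforge/api:latest"; do
  if is_immutable_ref "$ref"; then
    echo "ok       $ref"
  else
    echo "REJECTED $ref"
  fi
done
```

A check like this would sit in the pipeline just before the deploy command, so a mutable tag fails the build instead of reaching production.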
Production Insight
Image scanning must be part of the CI pipeline, not a separate process. If scanning is optional, it will be skipped under deadline pressure. Make scanning a gate: the pipeline fails if critical CVEs are found. Scan both the base image (OS-level CVEs) and the application dependencies (npm, pip, Maven CVEs). Update base images weekly — they accumulate CVEs over time.
Key Takeaway
Production CI/CD requires: multi-stage builds, pinned base images, git SHA tags, image scanning (Trivy), one-command rollback, and immutable images. Never use :latest in production. Deployment strategy choice (rolling, blue-green, canary) depends on your rollback tolerance and traffic pattern.
Security Hardening: Root, Secrets, Network, and Supply Chain
Production Docker security is a layered defense — no single measure is sufficient. Each layer (image, runtime, network, host) must be hardened independently.
Run as non-root: Containers running as root (uid 0) can exploit kernel vulnerabilities with maximum privileges. Every production container should run as a non-root user. Set USER in the Dockerfile or use --user in the run command. Drop all capabilities and add back only what is needed: --cap-drop=ALL --cap-add=NET_BIND_SERVICE.
Secrets management: Never bake secrets (API keys, passwords, certificates) into Docker images. Secrets in images are visible to anyone who can pull the image. Use: Docker secrets (Swarm), Kubernetes secrets (with external secret managers like Vault or AWS Secrets Manager), or environment variables injected at runtime from a secret manager. Use --mount=type=secret for build-time secrets in BuildKit.
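On the application side, a container entrypoint can read the mounted secret file at startup instead of relying on anything baked into the image. This is a minimal sketch: `read_secret` and `SECRETS_DIR` are illustrative names, and `/run/secrets` is assumed as the default because that is where Swarm mounts secrets.

```bash
#!/bin/bash
# Sketch: load a runtime secret from a mounted file.
# Assumptions: read_secret and SECRETS_DIR are hypothetical names for illustration.
read_secret() {
  local name="$1"
  local file="${SECRETS_DIR:-/run/secrets}/$name"
  if [ ! -r "$file" ]; then
    echo "secret '$name' not found at $file" >&2
    return 1
  fi
  # Command substitution strips trailing newlines from the file content
  printf '%s' "$(cat "$file")"
}

# Typical use in an entrypoint script (not executed here):
#   DB_PASSWORD="$(read_secret db-password)" || exit 1
#   export DB_PASSWORD
```

Failing fast when the file is missing keeps a misconfigured deployment from starting with an empty credential.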
Image provenance: Verify the source of base images. Use official images or images from trusted registries. Enable Docker Content Trust (DOCKER_CONTENT_TRUST=1) to verify image signatures. Use SBOM (Software Bill of Materials) tools to track all components in your images.
Runtime security:
- seccomp: filters syscalls. The default profile blocks ~44 dangerous syscalls. Use custom profiles for stricter filtering.
- AppArmor/SELinux: mandatory access control. The docker-default AppArmor profile restricts container capabilities.
- Read-only filesystem: --read-only prevents the container from modifying its filesystem. Use tmpfs for writable directories.
- No new privileges: --security-opt=no-new-privileges prevents privilege escalation.
Daemon security: The Docker daemon runs as root and has full access to the host. The daemon socket (/var/run/docker.sock) is equivalent to root access. Never mount the daemon socket into containers. Never expose the daemon over TCP without TLS client authentication. Use rootless Docker for environments where daemon root access is unacceptable.
io/thecodeforge/security_hardening.sh · BASH
#!/bin/bash
# Production security hardening for Docker
# ── Non-root container ───────────────────────────────────────────────────────
# Dockerfile best practice
# RUN addgroup -g 1001 -S appgroup && adduser -S appuser -u 1001 -G appgroup
# USER appuser
# Runtime: run detached as non-root with dropped capabilities
# (-d keeps the shell free for the verification commands below)
docker run -d \
  --user 1001:1001 \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt=no-new-privileges \
  --read-only \
  --tmpfs /tmp:size=64m \
  --name hardened-api \
  io-thecodeforge/api:1.0.0
# Verify non-root
docker exec hardened-api id
# uid=1001(appuser) gid=1001(appgroup)
# Verify capabilities
docker inspect hardened-api --format '{{.HostConfig.CapAdd}} {{.HostConfig.CapDrop}}'
# [NET_BIND_SERVICE] [ALL]
# ── Secrets management ──────────────────────────────────────────────────────
# Docker Swarm secrets (printf avoids the trailing newline that echo would add)
printf '%s' 'my-database-password' | docker secret create db-password -
docker service create --secret db-password --name io-thecodeforge-api \
  io-thecodeforge/api:1.0.0
# Secret is available at /run/secrets/db-password inside the container
# BuildKit: mount secrets during build (not in final image)
DOCKER_BUILDKIT=1 docker build --secret id=npmrc,src="$HOME/.npmrc" -t api:1.0 .
# In Dockerfile: RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci
# (mount the secret where npm reads it; copying it with cp would persist it in the image layer.
#  target=/root/.npmrc assumes the build stage runs as root)
# ── Image scanning and provenance ───────────────────────────────────────────
# Scan for vulnerabilities
trivy image --severity HIGH,CRITICAL io-thecodeforge/api:1.0.0
# Generate SBOM (Software Bill of Materials)
syft io-thecodeforge/api:1.0.0 -o spdx-json > sbom.json
# Verify image signature (Docker Content Trust)
export DOCKER_CONTENT_TRUST=1
docker pull io-thecodeforge/api:1.0.0
# Fails if the image is not signed
# ── Seccomp profile ─────────────────────────────────────────────────────────
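# Docker applies its built-in default seccomp profile automatically when no
# seccomp option is given. To pin it explicitly, pass a copy saved on disk
# (the path below is an example location, not one Docker creates):
docker run --security-opt seccomp=/etc/docker/seccomp/default.json \
  io-thecodeforge/api:1.0.0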
# Create a custom seccomp profile (allow only required syscalls)
cat <<'EOF' > /tmp/seccomp-api.json
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{
"names": ["read", "write", "open", "close", "stat", "fstat",
"mmap", "mprotect", "munmap", "brk", "ioctl",
"access", "socket", "connect", "sendto", "recvfrom",
"clone", "execve", "exit", "exit_group", "futex",
"epoll_create1", "epoll_ctl", "epoll_wait",
"accept4", "listen", "bind", "setsockopt"],
"action": "SCMP_ACT_ALLOW"
}
]
}
EOF
docker run --security-opt seccomp=/tmp/seccomp-api.json \
io-thecodeforge/api:1.0.0
# ── Daemon security ─────────────────────────────────────────────────────────
# Check if daemon socket is exposed
curl --unix-socket /var/run/docker.sock http://localhost/version
# If this works, the daemon is accessible: anyone with socket access has root
# Check if daemon is exposed over TCP
netstat -tlnp | grep 2375
# Port 2375 = unencrypted Docker API (NEVER expose this)
netstat -tlnp | grep 2376
# Port 2376 = TLS-encrypted Docker API (OK if TLS client auth is configured)
# Verify daemon configuration
cat /etc/docker/daemon.json
# Should NOT contain: "hosts": ["tcp://0.0.0.0:2375"]
Output
# Non-root verification:
uid=1001(appuser) gid=1001(appgroup)
# Capabilities:
[NET_BIND_SERVICE] [ALL]
# Image scan:
Total: 0 (HIGH: 0, CRITICAL: 0)
# Daemon socket check:
{"Version":"24.0.7","ApiVersion":"1.43"}
# Socket is accessible — ensure proper file permissions
# TCP exposure:
# (no output on 2375 = good, daemon not exposed unencrypted)
Security as a Castle Defense
The daemon runs as root and has full access to the host filesystem, network, and processes.
Anyone who can access the socket can create a container with the host filesystem mounted.
Mounting the host filesystem into a container gives the container root access to the host.
Never mount /var/run/docker.sock into containers. Never expose the daemon over TCP without TLS.
Production Insight
The default seccomp profile blocks ~44 dangerous syscalls but allows ~260 others. For high-security environments, create a custom profile that allows only the syscalls your application uses. Use strace or auditd to determine the required syscalls, then build a minimal profile. This can shrink the exposed syscall surface well below the default profile; how much depends on how many syscalls your application actually needs.
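Once the required syscalls have been collected, turning the list into a profile is mechanical. The sketch below assumes the names arrive one per line on stdin (for example, extracted by hand from strace output); the function name `seccomp_profile_from_names` is hypothetical, and the emitted JSON follows the same allow-list shape as the custom profile shown earlier.

```bash
#!/bin/bash
# Sketch: build an allow-list seccomp profile from syscall names on stdin.
# Assumption: input is one syscall name per line; duplicates and blank lines are tolerated.
seccomp_profile_from_names() {
  sort -u | awk '
    NF { names[++n] = $1 }
    END {
      printf "{\n  \"defaultAction\": \"SCMP_ACT_ERRNO\",\n  \"syscalls\": [\n    {\n      \"names\": ["
      for (i = 1; i <= n; i++) printf "%s\"%s\"", (i > 1 ? ", " : ""), names[i]
      printf "],\n      \"action\": \"SCMP_ACT_ALLOW\"\n    }\n  ]\n}\n"
    }'
}

# Example: three observed syscalls, one of them duplicated
printf 'read\nwrite\nread\nclose\n' | seccomp_profile_from_names
```

A profile generated this way is only as complete as the trace behind it, so it is worth exercising every code path (startup, shutdown, error handling) while collecting syscalls and testing the result in staging first.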
Key Takeaway
Production security requires: non-root users, dropped capabilities, secret managers (never secrets baked into images), image scanning in CI, seccomp profiles, read-only filesystems, and daemon socket protection. The Docker daemon socket is root access: never expose it. Defense in depth means every layer must be hardened independently.
Scaling Strategies: Horizontal, Vertical, and Auto-Scaling
Scaling Docker in production means adding capacity to handle increased traffic. The strategy depends on the workload pattern: predictable traffic, bursty traffic, or event-driven traffic.
Horizontal scaling (scale out): Add more container replicas. Each replica handles a portion of the traffic. Horizontal scaling is preferred for stateless workloads — it provides redundancy (if one replica fails, others continue), and it scales linearly. Docker Swarm: docker service scale api=10. Kubernetes: kubectl scale deployment/api --replicas=10. ECS: update the service desired count.
Vertical scaling (scale up): Increase the resources (CPU, memory) of existing containers. Vertical scaling is simpler but limited by the host's capacity. It also requires restarting the container with new resource limits. Vertical scaling is appropriate for stateful workloads (databases) that cannot easily distribute across replicas.
Auto-scaling: Automatically adjust replica count based on metrics. The most common triggers:
- CPU utilization > 70% for 5 minutes -> add replicas
- Request rate > 1000 req/s -> add replicas
- Queue depth > 100 messages -> add worker replicas
- Custom metrics (response latency, error rate) -> add or remove replicas
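To make the trigger logic concrete, the Kubernetes Horizontal Pod Autoscaler computes its target as desired = ceil(current replicas x current metric / target metric). The sketch below reproduces only that core calculation with min/max clamping; the function name is hypothetical, and it deliberately ignores the autoscaler's tolerance band and stabilization windows.

```bash
#!/bin/bash
# Sketch: core HPA replica calculation with integer ceiling division.
# Args: current_replicas current_metric target_metric [min] [max]
# Assumption: metrics are whole-number percentages; tolerance and stabilization are ignored.
hpa_desired_replicas() {
  local current="$1" metric="$2" target="$3" min="${4:-1}" max="${5:-1000000}"
  local desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# Example: 5 replicas at 90% CPU against a 70% target, bounded to 3..20 replicas
hpa_desired_replicas 5 90 70 3 20
```

Running the numbers by hand like this is a quick way to sanity-check whether a proposed target and replica range can absorb an expected traffic multiple.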
Pre-warming: Container startup is fast (0.3-2s) but application cold start can be 10-60s (JVM startup, dependency initialization, connection pool warmup). Pre-warm containers by pulling images before scaling events and using readiness probes that wait for full initialization. For JVM applications, use class data sharing (CDS) or GraalVM native images to reduce cold start.
Scale-down strategy: Removing replicas must be graceful. The replica should stop accepting new requests, drain in-flight requests, close connections, and then exit. Kubernetes handles this with preStop hooks and terminationGracePeriodSeconds. Without graceful shutdown, in-flight requests are dropped during scale-down, causing user-facing errors.
Capacity planning: Monitor resource usage trends over weeks. If average CPU usage is growing 5% per week, plan to add capacity before it reaches 80%. Auto-scaling handles burst traffic, but baseline capacity must be planned manually.
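The lead time in that rule of thumb is simple arithmetic. This sketch assumes "5% per week" means five percentage points of linear growth per week (an interpretation, not something the text specifies); the function name `weeks_until_threshold` is hypothetical.

```bash
#!/bin/bash
# Sketch: weeks until average CPU usage crosses a planning threshold.
# Args: current_pct growth_points_per_week [threshold_pct, default 80]
# Assumption: growth is linear, measured in percentage points per week.
weeks_until_threshold() {
  local current="$1" growth="$2" threshold="${3:-80}"
  if [ "$growth" -le 0 ]; then
    echo "growth must be positive" >&2
    return 1
  fi
  if [ "$current" -ge "$threshold" ]; then
    echo 0
    return 0
  fi
  # Integer ceiling of (threshold - current) / growth
  echo $(( (threshold - current + growth - 1) / growth ))
}

# Example: 55% average CPU today, growing 5 points per week, planning threshold 80%
weeks_until_threshold 55 5 80
```

If the result is shorter than the time it takes to procure or provision new hosts, capacity work is already overdue.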
io/thecodeforge/scaling_strategies.sh · BASH
#!/bin/bash
# Production scaling strategies and configuration
# ── Horizontal scaling (Docker Swarm) ───────────────────────────────────────
# Scale to 10 replicas
docker service scale io-thecodeforge-api=10
# Verify replicas are distributed across nodes
docker service ps io-thecodeforge-api --format '{{.Node}} {{.CurrentState}}'
# manager1 Running
# worker1 Running
# worker2 Running
# (distributed across 3 nodes)
# ── Horizontal scaling (Kubernetes) ─────────────────────────────────────────
# Manual scaling
kubectl -n production scale deployment/io-thecodeforge-api --replicas=10
# Auto-scaling based on CPU and memory utilization
cat <<'EOF' > /tmp/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: io-thecodeforge-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: io-thecodeforge-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
EOF
kubectl apply -f /tmp/hpa.yaml
# ── Graceful shutdown (preStop hook) ────────────────────────────────────────
cat <<'EOF' > /tmp/graceful-shutdown-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  # selector and labels added so the manifest is valid on its own (app label assumed)
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: registry.example.com/io-thecodeforge/api:1.0.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
# Shutdown sequence:
# 1. The preStop hook sleeps 5s (allows the load balancer to drain the pod)
# 2. When the hook returns, the kubelet sends SIGTERM to PID 1 (the application)
# 3. Application drains in-flight requests and exits
# 4. Kubernetes waits up to terminationGracePeriodSeconds (30s, hook time included) before SIGKILL
EOF
kubectl apply -f /tmp/graceful-shutdown-deployment.yaml
# ── Pre-warming: pull images before scaling events ───────────────────────────
# Pre-pull images on all nodes (Kubernetes DaemonSet)
cat <<'EOF' > /tmp/prepull-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      initContainers:
        - name: prepull
          image: registry.example.com/io-thecodeforge/api:1.0.0
          command: ["true"]
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
EOF
kubectl apply -f /tmp/prepull-daemonset.yaml
# This DaemonSet runs on every node and pulls the image into the node's cache
# ── Monitor scaling effectiveness ────────────────────────────────────────────
# Check current replica count and resource usage
kubectl -n production get deployment io-thecodeforge-api -o wide
# NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
# io-thecodeforge-api   10/10   10           10          5d
# Check HPA status
kubectl -n production get hpa io-thecodeforge-api-hpa
# NAME                      REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS
# io-thecodeforge-api-hpa   Deployment/io-thecodeforge-api   45%/70%   3         20        5
# (45% CPU is below the 70% target, so the HPA will scale down after the stabilization window)
Output
# Swarm scaling:
io-thecodeforge-api scaled to 10
# HPA status:
horizontalpodautoscaler.autoscaling/io-thecodeforge-api-hpa created
Without graceful shutdown, Kubernetes sends SIGTERM while the pod's removal from the load balancer is still propagating, so the pod can keep receiving new requests after it has started to exit.
In-flight requests (requests that have already been routed to the pod) are dropped mid-processing.
The preStop hook adds a delay, allowing the load balancer to stop sending new requests before the pod exits.
The application must also handle SIGTERM by stopping new request acceptance and draining in-flight requests.
Production Insight
Auto-scaling down is more dangerous than scaling up. Scaling up adds capacity — worst case, you waste money. Scaling down removes capacity — worst case, you drop traffic. Set a longer stabilization window for scale-down (300s) than scale-up (60s). Monitor error rates during scale-down events — if errors spike, increase the stabilization window.
Key Takeaway
Horizontal scaling adds replicas (preferred for stateless). Vertical scaling adds resources (for stateful). Auto-scaling adjusts replicas based on metrics. Pre-warm by pulling images before scaling events. Graceful shutdown with preStop hooks prevents dropped requests during scale-down. Scale-down stabilization window should be 3-5x longer than scale-up.
High Availability and Disaster Recovery
Production Docker deployments must survive host failures, network partitions, and data center outages. High availability (HA) ensures continuous operation during failures. Disaster recovery (DR) ensures data and service restoration after catastrophic failures.
Multi-host redundancy: Run multiple replicas of each service across multiple hosts. If one host fails, the orchestrator reschedules containers on healthy hosts. Docker Swarm: use --replicas=3 and ensure the swarm has 3+ manager nodes. Kubernetes: use pod anti-affinity to spread replicas across nodes and zones.
Multi-AZ deployment: Deploy across multiple availability zones (data centers within a region). If one AZ fails, services continue in other AZs. AWS: use ECS/Kubernetes with nodes in 3+ AZs. Use Application Load Balancer (ALB) to distribute traffic across AZs.
Stateful services (databases): Databases require special HA strategies:
- PostgreSQL: streaming replication with automatic failover (Patroni, pg_auto_failover)
- MySQL: Group Replication or Galera Cluster
- Redis: Redis Sentinel or Redis Cluster
- Use volumes for data persistence. Back up volumes to object storage (S3) regularly.
Data backup and recovery:
- Volume snapshots: snapshot named volumes regularly using cloud provider or storage-level snapshots (Docker has no built-in volume snapshot command).
- Database backups: pg_dump, mysqldump, or continuous WAL archiving to object storage.
- Image registry backup: replicate images across regions (ECR replication, Harbor replication).
- Configuration backup: store all configuration (Docker Compose, Kubernetes manifests, daemon.json) in version control.
Health checks and self-healing: The orchestrator uses health checks to detect unhealthy containers and automatically restart or reschedule them. Liveness probes detect deadlocked processes (restart the container). Readiness probes detect services that are not ready to receive traffic (remove from load balancer). Startup probes detect slow-starting applications (give them more time before health checking).
Failover testing: HA is only as good as your last failover test. Regularly simulate failures: kill a container, drain a node, shut down an AZ. Measure the time to recovery and the error rate during failover. If you have never tested failover, you do not have HA.
io/thecodeforge/ha_dr.sh · BASH
#!/bin/bash
# High availability and disaster recovery configuration
# ── Multi-host redundancy (Docker Swarm) ────────────────────────────────────
# Create a service with replicas spread across nodes
docker service create \
--name io-thecodeforge-api \
--replicas 6 \
--constraint 'node.role==worker' \
--placement-pref 'spread=node.id' \
--limit-cpu 1.0 \
--limit-memory 512m \
--update-parallelism 2 \
--update-delay 10s \
--update-failure-action rollback \
--restart-condition on-failure \
--restart-delay 5s \
--restart-max-attempts 3 \
registry.example.com/io-thecodeforge/api:1.0.0
# Verify distribution across nodes
docker service ps io-thecodeforge-api --format '{{.Node}} {{.CurrentState}}'
# worker1 Running
# worker2 Running
# worker3 Running
# worker1 Running
# worker2 Running
# worker3 Running
# ── Multi-AZ pod anti-affinity (Kubernetes) ──────────────────────────────────
cat <<'EOF' > /tmp/multi-az-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  replicas: 6
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      affinity:
        podAntiAffinity:
          # Both rules are "preferred": a required zone rule would allow only one
          # pod per zone and leave the remaining replicas Pending.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - io-thecodeforge-api
                topologyKey: topology.kubernetes.io/zone
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - io-thecodeforge-api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: registry.example.com/io-thecodeforge/api:1.0.0
EOF
kubectl apply -f /tmp/multi-az-deployment.yaml
# ── Volume backup (named volume to S3) ──────────────────────────────────────
# Create a backup container that mounts the volume and uploads to S3
docker run --rm \
-v postgres-data:/data:ro \
-e AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-e AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
amazon/aws-cli s3 cp /data s3://my-backups/postgres-data/$(date +%Y-%m-%d)/ --recursive
# ── Kubernetes CronJob for automated backups ────────────────────────────────
cat <<'EOF' > /tmp/backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              # Note: the stock postgres image does not ship the AWS CLI;
              # this assumes an image that has both pg_dump and aws installed.
              image: postgres:16
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h db-host -U postgres -d mydb | \
                  gzip | \
                  aws s3 cp - s3://my-backups/db/mydb-$(date +%Y%m%d-%H%M%S).sql.gz
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: password
          restartPolicy: OnFailure
EOF
kubectl apply -f /tmp/backup-cronjob.yaml
# ── Failover testing ────────────────────────────────────────────────────────
# Kill a random container and verify self-healing
docker service scale io-thecodeforge-api=3
sleep 5
# Kill one of the service's containers (run on a node that hosts a replica)
docker kill "$(docker ps -q --filter name=io-thecodeforge-api | head -1)"
# Watch the orchestrator reschedule
watch docker service ps io-thecodeforge-api
# A new container should start within seconds
# Kubernetes: simulate node failure
kubectl cordon k8s-worker-2                     # Mark node as unschedulable
kubectl drain k8s-worker-2 --ignore-daemonsets  # Evict all pods from the node
# Pods are rescheduled to other nodes
# Verify all pods are running on remaining nodes
kubectl -n production get pods -o wide | grep io-thecodeforge-api
Output
# Service distribution:
worker1 Running
worker2 Running
worker3 Running
worker1 Running
worker2 Running
worker3 Running
# Evenly distributed across 3 nodes
# Multi-AZ:
deployment.apps/io-thecodeforge-api configured
# Backup:
upload: data/ to s3://my-backups/postgres-data/2026-04-05/
# Failover test:
# Container killed on worker2
# New container started on worker1 within 3 seconds
# Service remained available throughout
HA as a Safety Net
Configuration without testing is an assumption. Failover may fail due to DNS TTL, connection pool exhaustion, or split-brain scenarios.
Regular failover tests reveal hidden dependencies that are not visible in configuration.
Measure time-to-recovery (TTR) and error-rate-during-failover. If TTR > 30s or error rate > 5%, the failover is inadequate.
Run failover tests monthly. Test killing containers, draining nodes, and simulating AZ failures.
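The pass/fail rule above (time-to-recovery over 30s or error rate over 5% means the failover is inadequate) is easy to encode so that every drill produces the same verdict. A minimal sketch, assuming you have already measured TTR in whole seconds and counted total and failed requests during the drill; `failover_verdict` is a hypothetical name.

```bash
#!/bin/bash
# Sketch: grade a failover drill against the thresholds stated above.
# Args: ttr_seconds total_requests failed_requests
# Prints "ok" (exit 0) or "inadequate" (exit 1); exit 2 if no requests were recorded.
failover_verdict() {
  local ttr="$1" total="$2" errors="$3"
  if [ "$total" -le 0 ]; then
    echo "no requests recorded" >&2
    return 2
  fi
  # error rate > 5%  <=>  errors * 100 > total * 5  (integer math, no rounding)
  if [ "$ttr" -gt 30 ] || [ $(( errors * 100 )) -gt $(( total * 5 )) ]; then
    echo "inadequate"
    return 1
  fi
  echo "ok"
}

# Example: recovered in 12s with 20 failures out of 1000 requests (2%)
failover_verdict 12 1000 20
```

Recording the verdict alongside the raw numbers after each monthly drill makes regressions visible over time.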
Production Insight
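The most common HA failure: all replicas are on the same host. Docker Swarm's default placement does not guarantee cross-node distribution. Use --placement-pref 'spread=node.id' to state the intent explicitly (it is a preference, not a hard guarantee). In Kubernetes, use pod anti-affinity keyed on kubernetes.io/hostname: requiredDuringSchedulingIgnoredDuringExecution enforces one replica per node and therefore needs at least as many nodes as replicas, while preferredDuringSchedulingIgnoredDuringExecution spreads replicas on a best-effort basis. Without explicit anti-affinity, the scheduler may place all replicas on the same node.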
Key Takeaway
HA requires multi-host replicas with cross-node placement (spread or anti-affinity). DR requires regular backups to object storage and tested restore procedures. Failover testing is mandatory — untested failover is an assumption, not a guarantee. Measure time-to-recovery and error rate during every failover test.
● Production incident · POST-MORTEM · severity: high
Docker Swarm Overlay Network Failure — 45-Minute Outage During Black Friday Traffic
Symptom
All services reported connection refused errors to their downstream dependencies. HTTP requests timed out after 30 seconds. The load balancer returned 502 errors for 100% of external traffic. docker service ls showed all services as 'running' with the correct replica count. Containers were healthy — they just could not communicate with each other.
Assumption
The team assumed a container crash or OOM kill — but all containers were running. They assumed a DNS failure — but DNS resolution worked from the host. They assumed a firewall rule change — but iptables rules were unchanged. They assumed a cloud provider network issue — but VPC connectivity was healthy.
Root cause
The overlay network was created with the default MTU of 1450 (VXLAN encapsulation overhead). The underlying VPC had an MTU of 9001 (jumbo frames). When traffic spiked, large packets were being fragmented at the VXLAN boundary. Under normal load, the fragmentation overhead was negligible. Under Black Friday load (10x normal), the fragmentation rate exceeded the kernel's IP fragmentation queue capacity (net.ipv4.ipfrag_high_thresh). Packets were dropped silently. The overlay network appeared healthy (containers were running) but all inter-container traffic was being dropped.
Fix
1. Immediate: recreated the overlay network with the correct MTU: docker network create --driver overlay --opt com.docker.network.driver.mtu=8950 app-overlay.
2. Drained and rejoined all swarm nodes to flush corrupted network state.
3. Set net.ipv4.ipfrag_high_thresh to 4x the default on all nodes.
4. Added MTU verification to the deployment pipeline: any overlay network with MTU < 8900 blocks the deploy.
5. Added Prometheus alerts for IP fragment queue drops.
6. Documented that overlay MTU must be calculated: VPC_MTU - VXLAN_OVERHEAD(50) = overlay_MTU.
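The MTU rule from the fix list can be written down as a small helper so the value is computed rather than remembered. This is a sketch only: `overlay_mtu` is a hypothetical name, and the 50-byte default is the VXLAN overhead figure used in this post-mortem.

```bash
#!/bin/bash
# Sketch: overlay MTU = underlay MTU - encapsulation overhead.
# Args: underlay_mtu [overhead_bytes, default 50 for VXLAN as stated above]
overlay_mtu() {
  local underlay="$1" overhead="${2:-50}"
  echo $(( underlay - overhead ))
}

# Example: print (not run) the create command for a jumbo-frame VPC
mtu="$(overlay_mtu 9001)"
echo "docker network create --driver overlay --opt com.docker.network.driver.mtu=${mtu} app-overlay"
```

For a 9001-byte underlay this yields 8951; the incident fix rounded down to 8950, which is also safe because any value at or below the computed maximum avoids fragmentation at the VXLAN boundary.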
Key lesson
Overlay network MTU is invisible until traffic volume exceeds the fragmentation queue capacity. Always calculate overlay MTU as: underlay_MTU - 50 (VXLAN overhead).
docker service ls shows containers as 'running' even when the overlay network is broken. Network health is not visible in Docker's built-in status checks.
Add network-level health checks (TCP connectivity to downstream services) in addition to HTTP health checks. A container can be HTTP-healthy but network-unreachable.
Load test at 2x expected peak traffic before any high-traffic event. The MTU issue had been dormant for 6 months — it only manifested under 10x traffic.
Monitor IP fragment queue drops (netstat -s | grep -i frag) on all container hosts. Fragment queue exhaustion is a silent failure mode.
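For alerting, the reassembly failure counter can be read directly from the kernel's SNMP table instead of grepping netstat output. The sketch below assumes the Linux `/proc/net/snmp` layout, where an `Ip:` header row of field names is followed by an `Ip:` row of values; `reasm_fails` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: extract the IP ReasmFails counter from /proc/net/snmp.
# Args: [path, default /proc/net/snmp] (a path argument makes the parser testable)
reasm_fails() {
  local snmp="${1:-/proc/net/snmp}"
  awk '
    # First "Ip:" row is the header: remember which column holds ReasmFails
    $1 == "Ip:" && !col { for (i = 2; i <= NF; i++) if ($i == "ReasmFails") col = i; next }
    # Second "Ip:" row holds the values: print that column and stop
    $1 == "Ip:" && col  { print $col; exit }
  ' "$snmp"
}

# Typical use on a container host (not executed here):
#   fails="$(reasm_fails)"
#   [ "${fails:-0}" -gt 0 ] && echo "WARNING: ${fails} IP reassembly failures"
```

Because the counter is cumulative since boot, an alert is more useful on its rate of change than on its absolute value.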
Production debug guide
From container crashes to network partitions: real debugging paths through production Docker deployments.
Symptom · 01
Container is running but not serving traffic.
→
Fix
Check if the container is listening on the correct port and interface. Run docker exec <container> ss -tlnp to verify the process is listening. Check if the process is listening on 0.0.0.0 (all interfaces) or 127.0.0.1 (localhost only — unreachable from outside the container). Check if health checks are passing: docker inspect <container> --format '{{.State.Health.Status}}'. If unhealthy, the container may be removed from the load balancer.
Symptom · 02
Container is OOM-killed repeatedly.
→
Fix
Check the container's memory limit: docker inspect <container> --format '{{.HostConfig.Memory}}'. Check the container's memory usage before crash: docker stats --no-stream <container>. Check if the limit is too low for the application's peak memory. Check for memory leaks: monitor RSS over time with docker stats. Fix: increase the memory limit or fix the leak. Add --oom-kill-disable only if you want the entire host to freeze instead of killing the container.
Symptom · 03
Inter-container communication is failing.
→
Fix
Check if containers are on the same network: docker network inspect <network> | grep -A5 <container>. Check DNS resolution: docker exec <container> nslookup <target-service>. Check if the overlay network is healthy: docker network inspect <network> --format '{{.Peers}}'. Check MTU: docker exec <container> cat /sys/class/net/eth0/mtu. Check IP fragment queue: cat /proc/net/snmp | grep -i frag on the host.
Symptom · 04
Docker daemon is consuming excessive disk space.
→
Fix
Check Docker disk usage: docker system df. Check detailed breakdown: docker system df -v. Check for dangling images: docker images --filter dangling=true. Check for orphaned volumes: docker volume ls --filter dangling=true. Check for large log files: ls -lhS /var/lib/docker/containers/*/*.log. Fix: docker system prune -a --volumes (WARNING: removes all unused resources). Set log rotation in daemon.json.
Symptom · 05
Deploy is stuck — new containers are not starting.
→
Fix
Check if there are enough resources: docker node ls (for Swarm) or kubectl describe nodes (for Kubernetes). Check if the image pull is failing: docker pull <image>. Check if port conflicts exist: docker ps -a | grep <port>. Check if the container immediately crashes: docker logs <container>. Check if the health check is failing: docker inspect <container> --format '{{.State.Health}}'.
Symptom · 06
Container logs are missing or incomplete.
→
Fix
Check if the application writes to stdout/stderr (Docker captures these). Check if logs are in the container filesystem (lost on restart). Check the log driver: docker info --format '{{.LoggingDriver}}'. Check if log rotation is configured: cat /etc/docker/daemon.json | grep log. Check if the logging agent (Fluentd, Filebeat) is running and healthy.
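Several entries above point at log rotation in daemon.json. As a rough sketch of what that configuration can look like, the helper below prints a json-file logging block; the function name is hypothetical, and the default limits (10m per file, 3 files) are illustrative assumptions rather than values from this guide.

```bash
#!/bin/bash
# Sketch: render a daemon.json fragment that enables json-file log rotation.
# Args: [max_size, default 10m] [max_file, default 3]  (defaults are assumptions)
render_log_rotation_config() {
  local max_size="${1:-10m}" max_file="${2:-3}"
  cat <<EOF
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "${max_size}",
    "max-file": "${max_file}"
  }
}
EOF
}

# Example: print the fragment for review before merging it into /etc/docker/daemon.json
render_log_rotation_config 10m 3
```

The output is meant to be merged into the existing daemon.json by hand, followed by a daemon restart; the new settings apply to containers created afterwards, so long-running containers need to be recreated to pick them up.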
★ Docker Production Triage Cheat Sheet
First-response commands when containers are crashing, networking is broken, or resources are exhausted in production.
If ReasmFails > 0, fragment queue is exhausted. Recreate overlay with correct MTU: VPC_MTU - 50.
Host CPU is 100% but individual containers show low usage.
Immediate action
Check for daemon overhead and system processes.
Commands
ps aux --sort=-%cpu | head -10
docker stats --no-stream
Fix now
If dockerd is consuming high CPU, check for concurrent builds. If containerd is high, check for snapshot corruption.
Docker Production: Orchestrator Comparison
| Aspect | Docker Swarm | Kubernetes | AWS ECS | AWS Fargate |
|---|---|---|---|---|
| Complexity | Low | High | Medium | Low |
| Learning curve | 1-2 weeks | 2-6 months | 2-4 weeks | 1-2 weeks |
| Self-healing | Yes | Yes | Yes | Yes |
| Auto-scaling | Limited (external) | HPA, VPA, KEDA | Service Auto Scaling | Service Auto Scaling |
| Service mesh | No | Istio, Linkerd | App Mesh | App Mesh |
| Multi-AZ | Manual | Built-in (topology spread) | Built-in | Built-in |
| Host management | Self-managed | Self-managed (or EKS/GKE) | EC2 instances | No hosts (serverless) |
| Cost | Lowest (self-managed) | Medium (EKS $73/mo + nodes) | Medium (EC2 + ECS) | Highest (20-30% premium) |
| Ecosystem | Small | Massive | AWS-native | AWS-native |
| Best for | Small teams, simple deployments | Large teams, complex workloads | AWS-native, medium complexity | Minimal ops, AWS-native |
Key takeaways
1. Production Docker requires an orchestrator (Swarm, Kubernetes, ECS) for scheduling, scaling, self-healing, and networking. Choose based on team size and ecosystem needs.
2. Resource limits are mandatory: set --memory and --cpus on every container. Without limits, one container can starve all others on the host.
3. Logging must go to stdout/stderr with rotation. Ship to a centralized system. Never write logs to the container filesystem.
4. CI/CD requires multi-stage builds, git SHA tags, image scanning, one-command rollback, and immutable images. Never use :latest in production.
5. Security requires non-root users, dropped capabilities, secret managers, seccomp profiles, and daemon socket protection. Defense in depth at every layer.
6. Scaling requires horizontal replicas for stateless workloads, auto-scaling based on metrics, pre-warming for cold start, and graceful shutdown for zero-downtime deploys.
7. HA requires multi-host replicas with cross-node placement, multi-AZ deployment, regular backups, and tested failover. Untested failover is not HA.
FAQ · 6 QUESTIONS
Frequently Asked Questions
01
Should I use Docker Swarm or Kubernetes in production?
Use Docker Swarm if you have a small team (< 10 engineers), fewer than 50 services, and want simplicity. Swarm is easier to learn and operate. Use Kubernetes if you need the ecosystem (service mesh, GitOps, custom operators), have a platform team, or run more than 50 services. Kubernetes has a steeper learning curve but provides more flexibility and a larger ecosystem.
02
How do I handle persistent data in Docker production?
Use named volumes (docker volume create) for persistent data. Volumes survive container restarts and removals. Back up volumes regularly to object storage (S3) using a backup container or CronJob. For databases, use cloud-managed database services (RDS, Cloud SQL) when possible — they handle replication, backups, and failover automatically.
03
How do I debug a container that is running but not serving traffic?
Check in order: (1) Is the process listening on the correct port and interface? (docker exec <container> ss -tlnp). (2) Is the health check passing? (docker inspect <container> --format '{{.State.Health.Status}}'). (3) Is the container on the correct network? (docker network inspect <network>). (4) Are there iptables rules blocking traffic? (iptables -L -n). (5) Is the application actually started? (docker logs <container>).
04
What is the difference between liveness and readiness probes?
A liveness probe checks if the container is alive. If it fails, Kubernetes restarts the container. Use liveness for detecting deadlocked processes. A readiness probe checks if the container is ready to serve traffic. If it fails, Kubernetes removes the container from the load balancer but does not restart it. Use readiness for detecting initialization issues or temporary overload. A startup probe gives slow-starting applications extra time before liveness checks begin.
05
How do I achieve zero-downtime deployments with Docker?
Use rolling updates with health checks. The orchestrator starts new containers, waits for them to pass health checks, then stops old containers. Add preStop hooks with a sleep delay to allow load balancers to drain connections. Handle SIGTERM in your application to stop accepting new requests and drain in-flight requests. Set terminationGracePeriodSeconds to the maximum drain time.
06
How do I monitor Docker in production?
Collect three types of signals: (1) Logs — ship stdout/stderr to ELK, Datadog, or Loki. (2) Metrics — use Prometheus with node_exporter and cAdvisor. Alert on memory > 80% of limit, CPU throttling > 10%, restart count > 5/hour. (3) Traces — use OpenTelemetry with Jaeger or Zipkin for distributed tracing across microservices.
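The "memory > 80% of limit" alert mentioned above boils down to one division. The sketch below assumes both values are in bytes, as reported by docker inspect for the limit; `mem_pct_of_limit` is a hypothetical helper, and a limit of 0 is treated as "no limit set" because that is what docker inspect reports for an unlimited container.

```bash
#!/bin/bash
# Sketch: memory usage as a whole-number percentage of the container's limit.
# Args: used_bytes limit_bytes
# A limit of 0 means no limit is configured, which is itself worth alerting on.
mem_pct_of_limit() {
  local used="$1" limit="$2"
  if [ "$limit" -le 0 ]; then
    echo "no memory limit set" >&2
    return 1
  fi
  echo $(( used * 100 / limit ))
}

# Typical use against a live container (not executed here):
#   limit="$(docker inspect <container> --format '{{.HostConfig.Memory}}')"
#   pct="$(mem_pct_of_limit "$used_bytes" "$limit")" && [ "$pct" -gt 80 ] && echo "ALERT: ${pct}% of limit"
```

In practice the same ratio is usually computed in Prometheus from cAdvisor metrics; the shell version is mainly useful for ad hoc checks during an incident.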