Advanced 14 min · April 05, 2026

Docker Overlay Network MTU — Silent Failures at Scale

Q: Should I use Docker Swarm or Kubernetes in production?

Use Docker Swarm if you have a small team (< 10 engineers), fewer than 50 services, and want simplicity. Swarm is easier to learn and operate. Use Kubernetes if you need the ecosystem (service mesh, GitOps, custom operators), have a platform team, or run more than 50 services. Kubernetes has a steeper learning curve but provides more flexibility and a larger ecosystem.

Q: How do I handle persistent data in Docker production?

Use named volumes (docker volume create) for persistent data. Volumes survive container restarts and removals. Back up volumes regularly to object storage (S3) using a backup container or CronJob. For databases, use cloud-managed database services (RDS, Cloud SQL) when possible — they handle replication, backups, and failover automatically.

Q: How do I debug a container that is running but not serving traffic?

Check in order: (1) Is the process listening on the correct port and interface? (docker exec <container> ss -tlnp). (2) Is the health check passing? (docker inspect --format '{{.State.Health.Status}}' <container>). (3) Is the container on the correct network? (docker network inspect <network>). (4) Are there iptables rules blocking traffic? (iptables -L -n). (5) Is the application actually started? (docker logs <container>).

Q: What is the difference between liveness and readiness probes?

A liveness probe checks if the container is alive. If it fails, Kubernetes restarts the container. Use liveness for detecting deadlocked processes. A readiness probe checks if the container is ready to serve traffic. If it fails, Kubernetes removes the container from the load balancer but does not restart it. Use readiness for detecting initialization issues or temporary overload. A startup probe gives slow-starting applications extra time before liveness checks begin.

Q: How do I achieve zero-downtime deployments with Docker?

Use rolling updates with health checks. The orchestrator starts new containers, waits for them to pass health checks, then stops old containers. Handle SIGTERM in your application to stop accepting new requests and drain in-flight requests. In Kubernetes, add a preStop hook with a short sleep so load balancers can drain connections, and set terminationGracePeriodSeconds to at least the preStop delay plus the maximum drain time. In Docker Swarm, the equivalent setting is --stop-grace-period on the service.

Q: How do I monitor Docker in production?

Collect three types of signals: (1) Logs — ship stdout/stderr to ELK, Datadog, or Loki. (2) Metrics — use Prometheus with node_exporter and cAdvisor. Alert on memory > 80% of limit, CPU throttling > 10%, restart count > 5/hour. (3) Traces — use OpenTelemetry with Jaeger or Zipkin for distributed tracing across microservices.

VXLAN fragmentation silently dropped 100% of traffic at 10x load while docker service ls showed healthy.

Naren · Founder

⚡Quick Answer

Orchestration layer (Kubernetes, ECS, Docker Swarm) manages container scheduling, scaling, and self-healing
Container runtime (containerd, runc) executes containers with namespace isolation and cgroup limits
Image registry (ECR, GCR, private Harbor) stores and distributes images with vulnerability scanning
Service mesh (Istio, Linkerd) handles mTLS, traffic management, and observability
Resource limits are mandatory — without --memory and --cpus, one container can starve all others on the host
Logging must go to stdout/stderr: never write logs to the container filesystem (they are lost when the container is removed or replaced)
Images must be immutable — tag with git SHA, never use :latest in production
Health checks must be liveness + readiness — liveness restarts, readiness removes from load balancer

Plain-English First

Running Docker in production is like running a restaurant kitchen versus cooking at home. At home, you can leave dishes in the sink, ignore the smoke alarm, and run to the store if you forgot an ingredient. In a restaurant kitchen, every dish must be tracked, every station must be clean, every appliance must be monitored, and if the oven breaks at 7 PM on Saturday, you need a backup plan immediately. The cooking technique (Docker) is the same — the operational requirements are completely different.

Docker works out of the box for development. Running a single container on a laptop requires no orchestration, no monitoring, and no security hardening. Production is a different problem entirely — hundreds of containers across dozens of hosts, with requirements for zero-downtime deploys, automatic scaling, persistent data, and compliance auditing.

Most Docker-in-production failures fall into five categories: resource exhaustion (no limits set), networking misconfiguration (DNS, overlay MTU, port conflicts), logging gaps (logs in container filesystem, not shipped to central system), security exposure (root containers, exposed daemon socket, unpinned images), and deployment errors (no health checks, no rollback strategy, no canary testing).

This article covers the architecture decisions, operational patterns, and failure scenarios that determine whether your Docker deployment survives production traffic or collapses under it. Every section includes real debugging commands and failure stories.

What Docker Overlay Networking Actually Does

Docker overlay networking creates a virtual Layer 2 network across multiple Docker hosts using VXLAN encapsulation. Each container gets its own IP from a private subnet, and traffic between containers on different hosts is wrapped in UDP packets (typically port 4789) by the kernel's VXLAN implementation. This allows containers to communicate as if they're on the same switch, regardless of physical host placement.

The key mechanic is the VXLAN Tunnel Endpoint (VTEP) — each Docker host runs a VTEP that maps container IPs to host IPs. When a container sends a packet to another container on a different host, the source VTEP encapsulates the original Ethernet frame inside a UDP packet with the destination host's IP. The destination VTEP decapsulates and delivers it. This adds 50 bytes of overhead per packet (20 IP + 8 UDP + 8 VXLAN + 14 inner Ethernet). That overhead is invisible to applications but directly impacts MTU: if the physical network's MTU is 1500, the effective MTU for containers becomes 1450. Ignoring this causes silent packet fragmentation or drops.
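The overhead arithmetic above can be checked with a few lines of shell. This is a minimal sketch that assumes an IPv4 underlay (an IPv6 outer header is larger); the function name overlay_mtu and the variable names are illustrative and not part of Docker.

```shell
# Per-packet VXLAN overhead, using the byte counts from the paragraph above.
OUTER_IP=20
OUTER_UDP=8
VXLAN_HEADER=8
INNER_ETHERNET=14
VXLAN_OVERHEAD=$((OUTER_IP + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET))

# overlay_mtu UNDERLAY_MTU: print the largest MTU containers can use
# without the encapsulated packet exceeding the underlay MTU.
overlay_mtu() {
  echo $(( $1 - VXLAN_OVERHEAD ))
}

echo "overhead: ${VXLAN_OVERHEAD} bytes"
echo "1500-byte underlay -> overlay MTU $(overlay_mtu 1500)"
echo "9001-byte underlay -> overlay MTU $(overlay_mtu 9001)"
```

For a 1500-byte underlay this prints an overlay MTU of 1450, and for a 9001-byte underlay 8951.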

Use overlay networks when you need multi-host container communication without modifying the underlying network infrastructure, which is typical in Docker Swarm or Kubernetes clusters where hosts span different subnets or cloud regions. The critical production concern is MTU mismatch: if your hosts are configured for jumbo frames (9000 MTU) but the underlay path caps at 1500, or if containers keep an MTU of 1500 while the host's physical interface is also 1500 (leaving no room for the 50 bytes of VXLAN overhead), you'll see intermittent TCP timeouts, slow transfers, and mysterious connection resets that only appear under load. This is not a theoretical issue: it is one of the most common causes of silent networking failures in Docker overlay deployments.

MTU Mismatch Is Invisible

Docker does not auto-detect the physical network MTU. If your host MTU is 1500, set overlay MTU to 1450. Otherwise, packets >1450 bytes silently fragment or drop.

Production Insight

Teams migrating from bare-metal (jumbo frames) to a cloud network path capped at 1500 MTU often carry over an overlay MTU sized for the old network, causing heavy packet loss for large writes.

Symptom: TCP connections succeed for small payloads but stall on any transfer >1450 bytes — no errors in application logs, only 'connection reset' or 'timeout'.
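One way to confirm this symptom is to send pings with the Don't Fragment bit set at sizes just under and just over the suspected limit. The sketch below only computes the payload sizes and prints the commands. The helper names and the peer address 10.0.0.5 are assumptions for illustration, and ping -M do is the Linux iputils syntax.

```shell
# An ICMP echo adds 20 bytes of IPv4 header and 8 bytes of ICMP header,
# so the payload that exactly fills a given MTU is MTU - 28.
ICMP_OVERHEAD=28

icmp_payload_for_mtu() {
  echo $(( $1 - ICMP_OVERHEAD ))
}

# df_probe_cmd PEER MTU: print (not run) a DF-bit probe sized for MTU.
df_probe_cmd() {
  peer=$1
  mtu=$2
  echo "ping -M do -c 3 -s $(icmp_payload_for_mtu "$mtu") $peer"
}

# Run the printed commands from inside a container on the overlay network.
df_probe_cmd 10.0.0.5 1450   # expected to succeed on a correctly sized overlay
df_probe_cmd 10.0.0.5 1500   # expected to fail when the usable path MTU is below 1500
```

If the smaller probe succeeds and the larger one fails or times out, the path cannot carry full-size packets and the overlay MTU needs to be lowered.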

Rule: Always set 'com.docker.network.driver.mtu' to (physical MTU - 50) on every overlay network creation.
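To apply this rule without hard-coding numbers, the overlay MTU can be derived from the host interface when the network is created. A minimal sketch, assuming a Linux host that exposes the interface MTU under /sys/class/net. The function names, the SYSFS_NET override, and the network name app-overlay are illustrative. The script prints the command rather than running it, and the interface MTU is only an upper bound: the path between hosts can be smaller.

```shell
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# read_iface_mtu IFACE: print the MTU the kernel reports for IFACE.
read_iface_mtu() {
  cat "$SYSFS_NET/$1/mtu"
}

# overlay_create_cmd NETWORK UNDERLAY_MTU: print a docker network create
# command with the overlay MTU set to the underlay MTU minus 50 bytes.
overlay_create_cmd() {
  echo "docker network create --driver overlay --opt com.docker.network.driver.mtu=$(( $2 - 50 )) $1"
}

# Example: print the command for a 1500-byte underlay.
overlay_create_cmd app-overlay 1500
```

On a real host you might call overlay_create_cmd app-overlay "$(read_iface_mtu eth0)", where eth0 stands for whichever interface carries traffic between your Docker hosts.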

Key Takeaway

Overlay MTU must be 50 bytes less than the physical network MTU to avoid fragmentation.

Silent packet drops from MTU mismatch look like application bugs — always check MTU first.

Docker does not enforce or warn about MTU; it's your responsibility to configure it correctly per environment.

Production Architecture: Single Host to Multi-Host Orchestration

Running Docker in production requires an orchestration layer that manages container scheduling, networking, scaling, and self-healing across multiple hosts. Without orchestration, you are managing containers manually — which does not scale beyond 10-20 containers.

Single-host Docker (development only): Running docker run on a single host works for development but fails in production. There is no self-healing (if a container crashes, it stays dead unless you add --restart=always). There is no load balancing (all traffic goes to one container). There is no horizontal scaling (you must manually start more containers). There is no rolling deployment (you must stop the old container before starting the new one, causing downtime).

Docker Swarm: Docker's built-in orchestrator. Manages a cluster of Docker hosts as a single virtual host. Supports service definitions (desired state), rolling updates, and overlay networking. Swarm is simpler than Kubernetes but has fewer features: no custom resource definitions, limited networking options, and a smaller ecosystem. Swarm is adequate for small-to-medium deployments (< 50 services, the threshold used in the rest of this article).

Kubernetes (K8s): The industry-standard orchestrator. Manages containers across a cluster with declarative configuration, automated scaling, self-healing, and a rich ecosystem of networking, storage, and observability tools. Kubernetes has a steep learning curve and significant operational overhead — it requires dedicated platform engineers to operate. Kubernetes is the right choice for large deployments (> 50 services) or when you need the ecosystem (service mesh, GitOps, custom operators).

AWS ECS / Fargate: AWS's managed container orchestration. ECS manages container scheduling on EC2 instances. Fargate abstracts the hosts entirely — you pay per container, not per host. ECS is simpler than Kubernetes (no control plane to manage) but locks you to AWS. Fargate eliminates host management entirely but costs 20-30% more than self-managed EC2.

Architecture pattern: The production architecture stack is: Load Balancer -> Ingress Controller -> Orchestrator -> Container Runtime -> Host. Each layer has specific failure modes and debugging approaches. Understanding the full stack is essential for production debugging.

io/thecodeforge/production_architecture.sh (BASH)

#!/bin/bash
# Production architecture setup and verification

# ── Docker Swarm: production setup ───────────────────────────────────────────

# Initialize the swarm on the manager node
docker swarm init --advertise-addr <manager-ip>
# Output: join token for worker nodes

# Join worker nodes
docker swarm join --token <token> <manager-ip>:2377

# Verify cluster status
docker node ls
# ID    HOSTNAME    STATUS    AVAILABILITY    MANAGER STATUS
# abc*  manager1    Ready     Active          Leader
# def   worker1     Ready     Active
# ghi   worker2     Ready     Active

# Create a production service with resource limits
docker service create \
  --name io-thecodeforge-api \
  --replicas 3 \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --reserve-cpu 0.5 \
  --reserve-memory 256m \
  --publish published=80,target=3000 \
  --update-parallelism 1 \
  --update-delay 10s \
  --update-failure-action rollback \
  --update-max-failure-ratio 0.25 \
  --health-cmd 'curl -f http://localhost:3000/health || exit 1' \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  --network app-overlay \
  registry.example.com/io-thecodeforge/api:1.0.0

# Verify service status
docker service ls
# ID    NAME                    MODE        REPLICAS  IMAGE
# abc   io-thecodeforge-api     replicated  3/3       registry.example.com/io-thecodeforge/api:1.0.0

# Check service logs across all replicas
docker service logs io-thecodeforge-api --tail 50 --follow

# ── Kubernetes: production deployment manifest ───────────────────────────────
cat <<'EOF' > /tmp/io-thecodeforge-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      containers:
      - name: api
        image: registry.example.com/io-thecodeforge/api:1.0.0
        ports:
        - containerPort: 3000
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
          limits:
            cpu: 1000m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 15
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 3
EOF

kubectl apply -f /tmp/io-thecodeforge-api-deployment.yaml
kubectl -n production rollout status deployment/io-thecodeforge-api

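# ── AWS ECS: production task definition ───────────────────────────────────────
# The awsfirelens log driver requires a FireLens log router container in the same
# task, and Fargate needs an execution role to pull the image from ECR. The two
# role ARNs below are placeholders to fill in.
cat <<'EOF' > /tmp/io-thecodeforge-api-task.json
{
  "family": "io-thecodeforge-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "<execution-role-arn>",
  "taskRoleArn": "<task-role-arn>",
  "containerDefinitions": [
    {
      "name": "log_router",
      "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
      "essential": true,
      "firelensConfiguration": {"type": "fluentbit"}
    },
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge/api:1.0.0",
      "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "us-east-1",
          "log_group_name": "/ecs/io-thecodeforge-api",
          "log_stream_prefix": "api-",
          "auto_create_group": "true"
        }
      }
    }
  ]
}
EOF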

aws ecs register-task-definition --cli-input-json file:///tmp/io-thecodeforge-api-task.json

Output

# Swarm cluster:
ID    HOSTNAME    STATUS    AVAILABILITY    MANAGER STATUS
abc*  manager1    Ready     Active          Leader
def   worker1     Ready     Active
ghi   worker2     Ready     Active

# Service status:
ID    NAME                    MODE        REPLICAS  IMAGE
abc   io-thecodeforge-api     replicated  3/3       registry.example.com/io-thecodeforge/api:1.0.0

# Kubernetes deployment:
deployment.apps/io-thecodeforge-api configured
Waiting for deployment "io-thecodeforge-api" rollout to finish: 0 of 3 updated replicas are available...
Waiting for deployment "io-thecodeforge-api" rollout to finish: 1 of 3 updated replicas are available...
Waiting for deployment "io-thecodeforge-api" rollout to finish: 2 of 3 updated replicas are available...
deployment "io-thecodeforge-api" successfully rolled out

Orchestration as a Conductor

Kubernetes manages not just containers but networking (CNI), storage (CSI), service discovery, ingress, RBAC, and custom resources.
Swarm is simpler because it delegates networking and storage to Docker's built-in drivers.
Kubernetes' complexity is the cost of flexibility — it can model any production topology.
For simple deployments (< 50 services), Swarm is sufficient and far easier to operate.

Production Insight

The choice of orchestrator determines your operational complexity for years. Migrating from Swarm to Kubernetes (or vice versa) is a multi-month project. Choose based on team size (Swarm for small teams, Kubernetes for platform teams), ecosystem needs (Kubernetes wins on ecosystem), and cloud strategy (ECS for AWS-only, Kubernetes for multi-cloud). Do not default to Kubernetes because it is popular — default to it because your workload requires it.

Key Takeaway

Production Docker requires an orchestrator — Swarm for simplicity, Kubernetes for flexibility, ECS/Fargate for AWS-native. The orchestrator manages scheduling, scaling, self-healing, and networking. Choose based on team size and ecosystem needs, not popularity.

Orchestrator Selection

If: Small team (< 10 engineers), < 50 services, single cloud
→ Use: Docker Swarm or AWS ECS. Simpler to operate, lower learning curve.

If: Platform team available, > 50 services, need ecosystem (service mesh, GitOps)
→ Use: Kubernetes. Higher complexity but maximum flexibility and ecosystem support.

If: AWS-only, want to minimize infrastructure management
→ Use: AWS Fargate. No hosts to manage, pay per container, but higher cost.

If: Multi-cloud or on-premises requirement
→ Use: Kubernetes. Portable across clouds with consistent API.

Resource Management: CPU, Memory, OOM, and Noisy Neighbors

Resource management is the most critical production concern for shared container hosts. Without explicit resource limits, one misbehaving container can starve every other container on the same host.

CPU limits: Docker uses cgroups to enforce CPU limits. --cpus=1.0 gives the container access to 1 CPU core worth of time. Without a limit, a container can consume all available CPU. CPU is a compressible resource — the kernel throttles CPU-intensive containers, but does not kill them. This means a CPU-hungry container slows down other containers but does not kill them.
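Throttling can be quantified from the cgroup's cpu.stat file. A minimal sketch: the helper name throttle_pct is hypothetical, and it reads cpu.stat content on standard input so it works with whichever cgroup path your host uses. The nr_periods and nr_throttled fields appear in both cgroup v1 and v2.

```shell
# throttle_pct: read cpu.stat content on stdin and print the percentage
# of enforcement periods in which the cgroup was throttled (rounded down).
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%d\n", (throttled * 100) / periods
      else print 0
    }'
}

# Sample cpu.stat content with illustrative numbers.
printf 'nr_periods 1000\nnr_throttled 350\nthrottled_time 3500000000\n' | throttle_pct
```

With the sample input, 350 of 1000 periods were throttled, so the function prints 35. On a live host you would pipe the container's real cpu.stat file into it.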

Memory limits: Memory is an incompressible resource. When a container exceeds its memory limit, the kernel OOM killer terminates a process inside it. When the host as a whole runs out of memory, the OOM killer selects victims by oom_score, a heuristic driven mainly by each process's memory footprint and shifted by oom_score_adj. Without a memory limit, a leaking container consumes all host memory, and the OOM killer may kill unrelated containers or critical host processes (kubelet, containerd).

Requests vs limits (Kubernetes): Requests guarantee a minimum allocation — the scheduler places the pod on a node with enough available resources. Limits set the maximum — the container is throttled (CPU) or killed (memory) if exceeded. Best practice: set requests equal to limits for critical services (guaranteed QoS). Set requests lower than limits for burstable services (burstable QoS).
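The QoS class follows mechanically from how requests compare with limits. The sketch below is a simplified model for a single container. The function name qos_class and its argument convention (an empty string means "not set") are assumptions for illustration, and values are compared as strings, so 1000m and 1 would not be treated as equal. Kubernetes itself computes the class per pod across all containers.

```shell
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM
# Pass an empty string for a value that is not set.
qos_class() {
  cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4

  # Nothing set at all: BestEffort.
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo BestEffort
    return
  fi

  # When only a limit is given, Kubernetes defaults the request to the limit.
  if [ -z "$cpu_req" ]; then cpu_req=$cpu_lim; fi
  if [ -z "$mem_req" ]; then mem_req=$mem_lim; fi

  # Guaranteed needs both limits set and each request equal to its limit.
  if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
     [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 1000m 1000m 512Mi 512Mi
qos_class 500m 1000m 256Mi 512Mi
qos_class "" "" "" ""
```

The three calls print Guaranteed, Burstable, and BestEffort in that order.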

Noisy neighbor problem: Multiple containers on the same host compete for CPU, memory, disk I/O, and network bandwidth. Without resource limits, one container's spike affects all others. The fix: set limits on every production container. Monitor host-level resource usage with docker stats and Prometheus node_exporter.

OOM score and priority: The kernel assigns each process an oom_score from 0 to 1000. Higher scores are killed first. Docker lets you set oom_score_adj per container with --oom-score-adj, and processes with a higher resulting score are killed before lower-scored ones. Critical services (databases) should have oom_score_adj=-999 (almost never killed). Non-critical services should have oom_score_adj=1000 (killed first).
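Both values can be read straight from /proc for any process. A minimal sketch: the function names oom_info and is_valid_oom_adj are hypothetical, and the optional second argument to oom_info exists only so the function can be pointed at a different proc root.

```shell
# oom_info PID [PROC_ROOT]: print the kernel's current OOM score and the
# adjustment for a process. PROC_ROOT defaults to /proc.
oom_info() {
  pid=$1
  proc_root=${2:-/proc}
  score=$(cat "$proc_root/$pid/oom_score")
  adj=$(cat "$proc_root/$pid/oom_score_adj")
  echo "pid=$pid oom_score=$score oom_score_adj=$adj"
}

# is_valid_oom_adj VALUE: succeed only for integers in -1000..1000,
# the range the kernel accepts for oom_score_adj.
is_valid_oom_adj() {
  case $1 in
    ''|*[!0-9-]*) return 1 ;;
  esac
  [ "$1" -ge -1000 ] 2>/dev/null && [ "$1" -le 1000 ]
}

is_valid_oom_adj -999 && echo "-999 is accepted"
is_valid_oom_adj 1500 || echo "1500 is rejected"
```

For a running container, pass the PID from docker inspect, for example oom_info "$(docker inspect <container> --format '{{.State.Pid}}')".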

io/thecodeforge/resource_management.sh (BASH)

#!/bin/bash
# Production resource management configuration and monitoring

# ── CPU limits ───────────────────────────────────────────────────────────────

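# Run with 1 CPU core limit
# (alpine:3.19 does not ship a stress tool, so install stress-ng first)
docker run --cpus=1.0 --name cpu-test alpine:3.19 \
  sh -c 'apk add --no-cache stress-ng >/dev/null && stress-ng --cpu 2 --timeout 10s'
# The container is throttled to 1 CPU even though stress-ng spawns 2 workers

# Check CPU throttling while the container is running (cgroup v1 path shown;
# on cgroup v2 with the systemd driver look under
# /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.stat)
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.stat
# nr_periods: total scheduling periods
# nr_throttled: periods where the container was throttled
# throttled_time: total time throttled (nanoseconds; cgroup v2 reports throttled_usec)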

# Check CPU shares (relative priority)
cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.shares
# Default: 1024. Set with --cpu-shares=512 for lower priority

# ── Memory limits ────────────────────────────────────────────────────────────

# Run with 256MB memory limit
docker run --memory=256m --memory-swap=256m --name mem-test alpine:3.19 \
  sh -c 'apk add --no-cache stress-ng >/dev/null && stress-ng --vm 1 --vm-bytes 300M --timeout 10s'
# The stress-ng worker is OOM-killed when it exceeds the 256MB limit

# Check memory usage before OOM
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.max_usage_in_bytes

# Check OOM events
dmesg | grep -i 'oom\|killed process' | tail -10
# [12345.678] Out of memory: Killed process 5678 (node) total-vm:123456kB, anon-rss:98765kB

# ── OOM score management ────────────────────────────────────────────────────

# Check a container's OOM score
CONTAINER_PID=$(docker inspect <container> --format '{{.State.Pid}}')
cat /proc/$CONTAINER_PID/oom_score
# 0-1000: higher = more likely to be killed

cat /proc/$CONTAINER_PID/oom_score_adj
# -1000 to 1000: adjust the score

# Set OOM priority for critical services (database)
docker run --oom-score-adj=-999 --name critical-db postgres:16
# This container is almost never killed by the OOM killer

# Set OOM priority for non-critical services (cache)
docker run --oom-score-adj=1000 --name expendable-cache redis:7
# This container is killed first in an OOM situation

# ── Kubernetes resource management ──────────────────────────────────────────

# Guaranteed QoS: requests == limits (last to be evicted under node resource pressure)
cat <<'EOF'
resources:
  requests:
    cpu: 1000m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 512Mi
EOF

# Burstable QoS: requests < limits (can burst but may be throttled/killed)
cat <<'EOF'
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 1000m
    memory: 512Mi
EOF

# ── Monitor resource usage across all containers ─────────────────────────────

# Real-time resource usage
docker stats --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}'

# Find containers without resource limits
docker ps -q | xargs -I{} docker inspect {} --format '{{.Name}}: CPU={{.HostConfig.NanoCpus}} MEM={{.HostConfig.Memory}}'
# Containers with NanoCpus=0 or Memory=0 have no limits

# Host-level resource check
free -h
cat /proc/loadavg
uptime

Output

# CPU throttling:
nr_periods: 1000
nr_throttled: 350
throttled_time: 3500000000 (3.5 seconds)

# Memory OOM:
[12345.678] Out of memory: Killed process 5678 (stress) total-vm:312320kB, anon-rss:262144kB

# OOM scores:
Container PID: 5678
oom_score: 300
oom_score_adj: 0

# Docker stats:
NAME        CPU %   MEM USAGE / LIMIT  NET I/O        BLOCK I/O
io-api-1    12.34%  256MiB / 512MiB    1.2GB / 800MB  50MB / 100MB
io-api-2    8.76%   198MiB / 512MiB    900MB / 600MB  30MB / 80MB
io-db-1     45.67%  1.2GiB / 2GiB      5GB / 3GB      2GB / 500MB
io-cache-1  2.34%   85MiB / 128MiB     500MB / 400MB  10MB / 5MB

Resources as a Shared Apartment Building

CPU is compressible — the kernel throttles a CPU-hungry container but does not kill it.
Memory is incompressible — when physical memory is exhausted, the kernel must kill a process.
Without memory limits, the OOM killer may kill critical host processes (containerd, kubelet).
With memory limits, only the offending container is killed — other containers are unaffected.

Production Insight

The OOM killer's process selection is not random: it uses oom_score to decide which process to kill. Without explicit oom_score_adj, the OOM killer may kill a critical database before killing a non-critical cache. Set oom_score_adj=-999 for databases and oom_score_adj=1000 for expendable services. In Kubernetes, Guaranteed QoS pods (requests == limits) are the last to be evicted under node resource pressure, so use this class for critical services.

Key Takeaway

CPU limits throttle a greedy container so it cannot starve its neighbors. Memory limits confine OOM kills to the container that exceeded its limit. Without limits, one container can starve all others on the host. Set requests == limits for critical services (Guaranteed QoS). Set oom_score_adj for priority-based OOM protection. Monitor host-level resource usage, because container-level metrics miss cross-container contention.

Resource Limit Strategy

If: Critical stateful service (database, queue)
→ Use: Guaranteed QoS: requests == limits. Set oom_score_adj=-999. Use dedicated node pools.

If: Stateless API with predictable load
→ Use: Guaranteed QoS: requests == limits based on load testing. Monitor for throttling.

If: Batch job or worker with variable load
→ Use: Burstable QoS: requests < limits. Allow bursting but set sensible limits.

If: Development or testing environment
→ Use: No limits acceptable. But never deploy without limits to production.

Networking in Production: DNS, Overlay, Load Balancing, and Service Mesh

Production Docker networking requires reliable DNS resolution, load balancing, and health-aware traffic routing. The default bridge network provides none of these — production deployments must use user-defined networks or an orchestrator's networking layer.

DNS-based service discovery: User-defined Docker networks and Kubernetes provide DNS-based service discovery. Containers resolve service names to IP addresses via an embedded DNS server (127.0.0.11 in Docker, CoreDNS in Kubernetes). The default bridge network has no DNS — containers can only reach each other by IP, which changes on every restart.

Overlay networking: For multi-host deployments, overlay networks use VXLAN encapsulation to create a virtual Layer 2 network across hosts. On a 1500-byte underlay the overlay MTU should be 1450, because VXLAN adds 50 bytes of overhead. Misconfigured MTU is a common production failure: packets larger than the path allows are fragmented, and under high load the fragment reassembly queue can overflow, causing silent packet drops.

Load balancing: Docker Swarm provides built-in load balancing via a routing mesh — any node can route traffic to any service replica. Kubernetes provides kube-proxy (iptables/IPVS-based) and ingress controllers (NGINX, Traefik, Envoy) for external traffic. For production, an ingress controller with TLS termination, rate limiting, and circuit breaking is mandatory.

Service mesh: A service mesh (Istio, Linkerd) adds mTLS between services, traffic splitting (canary deployments), circuit breaking, and observability (distributed tracing, metrics). The trade-off: added latency (1-3ms per hop) and operational complexity. Use a service mesh when you need mTLS or traffic splitting. Do not add one 'just in case.'

Network policies: In Kubernetes, NetworkPolicy resources restrict which pods can communicate with each other. Without network policies, all pods can communicate — a compromised pod can reach the database directly. Default-deny network policies are a production best practice.

io/thecodeforge/production_networking.sh (BASH)

#!/bin/bash
# Production networking configuration and debugging

# ── Docker Swarm overlay network ─────────────────────────────────────────────

# Create an overlay network with correct MTU
docker network create \
  --driver overlay \
  --opt com.docker.network.driver.mtu=8950 \
  --subnet 10.0.0.0/24 \
  --gateway 10.0.0.1 \
  app-overlay
# MTU calculation: VPC MTU (9001) - VXLAN overhead (50) = 8951, round to 8950

# Verify overlay network
docker network inspect app-overlay --format '{{.Driver}} {{.Options}}'
# overlay map[com.docker.network.driver.mtu:8950]

# ── DNS resolution verification ──────────────────────────────────────────────

# Check embedded DNS server
docker exec <container> cat /etc/resolv.conf
# nameserver 127.0.0.11
# options ndots:0

# Resolve a service name
docker exec <container> nslookup io-thecodeforge-api
# Server: 127.0.0.11
# Address: 10.0.0.5

# Check DNS query logs (Docker daemon)
sudo journalctl -u docker | grep 'DNS query' | tail -10

# ── Network health checks ───────────────────────────────────────────────────

# Check overlay network peer status
docker network inspect app-overlay --format '{{.Peers}}'
# Shows all nodes participating in the overlay

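# Check IP fragmentation counters (critical for overlay networks)
# A plain grep on /proc/net/snmp matches only the header row, so pair each name with its value
awk '/^Ip:/ { if (!n) { n = split($0, h) } else { for (i = 2; i <= n; i++) if (h[i] ~ /Frag|Reasm/) print h[i], $i } }' /proc/net/snmp
# FragFails > 0: packets needed fragmentation but could not be fragmented (for example DF set)
# ReasmFails > 0: reassembly failed, which includes fragment queue overflow and timeouts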

# Check MTU of container interface
docker exec <container> cat /sys/class/net/eth0/mtu
# Should match the overlay network MTU (8950)

# ── Traffic debugging with tcpdump ───────────────────────────────────────────

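# Capture traffic on docker_gwbridge (container egress and published ports, not overlay traffic)
sudo tcpdump -i docker_gwbridge -n -c 20
# Capture VXLAN-encapsulated overlay traffic between hosts (UDP port 4789)
sudo tcpdump -i any -n -c 20 udp port 4789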

# Capture traffic inside a container's namespace
CONTAINER_PID=$(docker inspect <container> --format '{{.State.Pid}}')
sudo nsenter --net --target $CONTAINER_PID tcpdump -i eth0 -n -c 20

# ── Kubernetes network policies (default-deny) ──────────────────────────────
cat <<'EOF' > /tmp/default-deny-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF

cat <<'EOF' > /tmp/allow-api-to-db.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: io-thecodeforge-db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: io-thecodeforge-api
    ports:
    - protocol: TCP
      port: 5432
EOF

kubectl apply -f /tmp/default-deny-policy.yaml
kubectl apply -f /tmp/allow-api-to-db.yaml

# ── Load balancer health verification ────────────────────────────────────────

# Check which containers are receiving traffic
curl -s http://localhost:80/health | jq .hostname
# Repeat 10 times — should show different hostnames (round-robin)
for i in $(seq 1 10); do
  curl -s http://localhost:80/health | jq -r .hostname
done

Output

# Overlay network created:
abc123def456

# DNS resolution:
Server: 127.0.0.11
Address: 10.0.0.5

# Fragment queue:
Ip: FragCreates FragOKs FragFails
123456789 123456000 789
# FragFails > 0 indicates fragment queue overflow

# MTU check:
8950

# Network policies:
networkpolicy.networking.k8s.io/default-deny-all created
networkpolicy.networking.k8s.io/allow-api-to-db created

# Load balancer verification:
api-1
api-2
api-3
api-1
api-2
# Round-robin distribution confirmed
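Eyeballing a handful of lines works for a quick check, but counting responses per replica makes skew obvious. A minimal sketch, assuming the /health endpoint returns a JSON hostname field as in the verification loop. The function name tally_hostnames is illustrative.

```shell
# tally_hostnames: read one hostname per line on stdin and print
# "count hostname" pairs, highest count first.
tally_hostnames() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Sample input standing in for ten curl responses.
printf '%s\n' api-1 api-2 api-3 api-1 api-2 api-3 api-1 api-2 api-3 api-1 | tally_hostnames
```

Against a live service, the same function can consume the loop output directly: for i in $(seq 1 100); do curl -s http://localhost:80/health | jq -r .hostname; done | tally_hostnames. Roughly equal counts suggest traffic is being spread across replicas. A replica that is missing from the list is not receiving traffic.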

Production Networking as a Postal System

VXLAN encapsulation adds 50 bytes of overhead — the overlay MTU must be 50 bytes less than the underlay MTU.
If the overlay MTU is too large, packets are fragmented at the VXLAN boundary.
Under normal load, fragmentation is slow but functional. Under high load, the fragment queue overflows and packets are silently dropped.
The failure is silent — containers appear healthy but inter-service communication fails.

Production Insight

Default-deny network policies are the single most impactful security improvement for Kubernetes deployments. Without them, any compromised pod can reach any other pod, including databases and secrets stores. Start with a default-deny policy, then add allow rules for each required communication path. This is zero-trust networking at the pod level.

Key Takeaway

Production networking requires DNS-based service discovery, correct overlay MTU, load balancing with health checks, and network policies. Default-deny network policies are mandatory for security. Overlay MTU miscalculation is the most common silent failure — always calculate overlay_MTU = underlay_MTU - 50.

Logging, Monitoring, and Observability

Production observability is the difference between debugging a failure in 5 minutes and debugging it in 5 hours. Docker provides basic logging — production requires a centralized logging pipeline, metrics collection, and distributed tracing.

Container logging model: Docker captures stdout and stderr from each container and writes them to JSON files under /var/lib/docker/containers/<id>/<id>-json.log. Applications must write logs to stdout/stderr, never to a file inside the container. Log files in the container's writable layer are lost when the container is removed or replaced, which is what orchestrators do on every redeploy or reschedule.

Log rotation: Docker's default JSON log driver has no size limit — log files grow unbounded until disk is full. Production deployments must configure log rotation in daemon.json: max-size (e.g., 10m) and max-file (e.g., 3). Without rotation, a chatty application can fill the host disk in hours.
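The worst-case disk footprint of rotated logs is simple arithmetic: max-size times max-file per container, times the number of containers on the host. A minimal sketch with hypothetical helper names. It assumes the k, m, and g suffixes are powers of 1024, which gives a conservative upper bound if the log driver interprets them as powers of 1000.

```shell
# size_to_bytes SIZE: convert a size such as 10m, 512k or 1g to bytes.
# Assumption: suffixes are treated as powers of 1024.
size_to_bytes() {
  num=${1%[kKmMgG]}
  case $1 in
    *[kK]) echo $(( num * 1024 )) ;;
    *[mM]) echo $(( num * 1024 * 1024 )) ;;
    *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)     echo "$num" ;;
  esac
}

# max_log_bytes MAX_SIZE MAX_FILE: worst-case log bytes for one container.
max_log_bytes() {
  echo $(( $(size_to_bytes "$1") * $2 ))
}

# The example settings from the paragraph above: 10m and 3 files.
max_log_bytes 10m 3
```

With max-size 10m and max-file 3, one container can hold at most 31457280 bytes (30 MiB) of logs under this assumption, so a host running 100 such containers should budget about 3 GiB for container logs.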

Centralized logging: Container logs must be shipped to a centralized system (ELK, Datadog, CloudWatch, Loki) for search, alerting, and retention. Use a logging agent (Fluentd, Filebeat, FireLens) as a DaemonSet or sidecar. The agent reads container logs and ships them to the central system.

Metrics collection: Container metrics (CPU, memory, network, disk I/O) are exposed by Docker (docker stats) and cAdvisor. For production, use Prometheus with node_exporter (host metrics) and cAdvisor (container metrics). Kubernetes exposes metrics via the metrics-server. Alert on: container memory usage > 80% of limit, CPU throttling > 10%, restart count > 5 in 1 hour.

Distributed tracing: For microservices, distributed tracing (Jaeger, Zipkin, OpenTelemetry) tracks a request across multiple services. Each service propagates a trace ID on outgoing requests, includes it in its logs, and reports timed spans. The tracing system assembles the spans that share a trace ID into a single trace view. Essential for debugging latency issues in multi-service architectures.

Structured logging: Applications should emit structured logs (JSON) with fields: timestamp, level, message, trace_id, service, request_id. Unstructured logs (plain text) are hard to parse reliably and to alert on at scale.
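
For shell-based jobs and entrypoint scripts, the same field set can be emitted with a small function. This is a minimal sketch under stated assumptions: the function names `json_escape` and `log_json` are hypothetical, and the escaping covers only backslashes and double quotes, not control characters.

```shell
#!/bin/bash
# Escape backslashes and double quotes only (control characters are not handled).
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Emit one JSON log line on stdout with the fields listed above.
log_json() {
  local level="$1" service="$2" trace_id="$3" request_id="$4" message="$5"
  printf '{"timestamp":"%s","level":"%s","message":"%s","trace_id":"%s","service":"%s","request_id":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$(json_escape "$level")" \
    "$(json_escape "$message")" \
    "$(json_escape "$trace_id")" \
    "$(json_escape "$service")" \
    "$(json_escape "$request_id")"
}

log_json INFO io-thecodeforge-api abc123 req-42 'Request "GET /health" processed'
```

Because the line goes to stdout, Docker's log driver captures it like any other container output, and the logging agent can parse it as JSON without a custom pattern.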

io/thecodeforge/production_logging.sh BASH

#!/bin/bash
# Production logging, monitoring, and observability setup

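# ── Docker daemon log rotation ───────────────────────────────────────────────
# Note: tee overwrites any existing daemon.json; merge by hand if the file exists
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}
EOF
sudo systemctl restart docker
# The new log-opts apply only to containers created after the restart;
# recreate existing containers to pick them up

# Confirm the default logging driver (this does not show max-size/max-file)
docker info --format '{{.LoggingDriver}}'
# json-file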

# ── Check container log size ────────────────────────────────────────────────

# Find large container logs
find /var/lib/docker/containers -name '*-json.log' -exec ls -lhS {} + | head -10
# If any log is > 100MB, rotation is not working

# Check total log disk usage
du -sh /var/lib/docker/containers/*

# ── Fluentd DaemonSet logging agent (Kubernetes) ────────────────────────────
cat <<'EOF' > /tmp/fluentd-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch8-1
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: dockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: dockercontainers
        hostPath:
          path: /var/lib/docker/containers
EOF
kubectl apply -f /tmp/fluentd-daemonset.yaml

# ── Prometheus alerting rules ────────────────────────────────────────────────
cat <<'EOF' > /tmp/container-alerts.yaml
groups:
- name: container-alerts
  rules:
  - alert: ContainerMemoryHigh
    # Divide only by limits > 0; containers without a limit report 0 and would yield +Inf
    expr: container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Container {{ $labels.name }} memory usage above 80%'

  - alert: ContainerCPUThrottled
    expr: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Container {{ $labels.name }} CPU throttled > 10%'

  - alert: ContainerRestarting
    # kube_pod_container_status_restarts_total comes from kube-state-metrics, not cAdvisor
    expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: 'Container {{ $labels.container }} restarted {{ $value }} times in 1 hour'
EOF

# ── Structured logging example (Java) ────────────────────────────────────────
cat <<'EOF'
// io.thecodeforge.logging.StructuredLogger.java
package io.thecodeforge.logging;

import com.fasterxml.jackson.databind.ObjectMapper;
import java.time.Instant;
import java.util.Map;

public class StructuredLogger {
    private static final ObjectMapper mapper = new ObjectMapper();
    private final String serviceName;

    public StructuredLogger(String serviceName) {
        this.serviceName = serviceName;
    }

    public void info(String message, String traceId, Map<String, Object> fields) {
        try {
            Map<String, Object> logEntry = Map.of(
                "timestamp", Instant.now().toString(),
                "level", "INFO",
                "message", message,
                "service", serviceName,
                "trace_id", traceId != null ? traceId : "",
                "fields", fields != null ? fields : Map.of()
            );
            System.out.println(mapper.writeValueAsString(logEntry));
        } catch (Exception e) {
            System.err.println("LOG_ERROR: " + e.getMessage());
        }
    }

    public void error(String message, String traceId, Throwable throwable) {
        try {
            StackTraceElement[] stack = throwable.getStackTrace();
            Map<String, Object> logEntry = Map.of(
                "timestamp", Instant.now().toString(),
                "level", "ERROR",
                "message", message,
                "service", serviceName,
                "trace_id", traceId != null ? traceId : "",
                "error_class", throwable.getClass().getName(),
                // Map.of rejects null values and getMessage() may return null
                "error_message", String.valueOf(throwable.getMessage()),
                // Guard against an empty stack trace before taking the top frame
                "stack_trace", stack.length > 0 ? stack[0].toString() : ""
            );
            System.err.println(mapper.writeValueAsString(logEntry));
        } catch (Exception e) {
            System.err.println("LOG_ERROR: " + e.getMessage());
        }
    }
}
EOF

Output

# Logging driver:

json-file

# Container log sizes:

-rw-r----- 1 root root 8.2M /var/lib/docker/containers/abc/abc-json.log

-rw-r----- 1 root root 3.1M /var/lib/docker/containers/def/def-json.log

# All under 10MB — rotation working

# Fluentd DaemonSet:

daemonset.apps/fluentd configured

# Prometheus alerts:

Alert rules written to /tmp/container-alerts.yaml

# Structured log output:

{"timestamp":"2026-04-05T10:30:00Z","level":"INFO","message":"Request processed","service":"io-thecodeforge-api","trace_id":"abc123","fields":{"duration_ms":45,"status":200}}

Observability as a Hospital

Unstructured logs (plain text) cannot be parsed by log aggregation systems at scale.
Structured logs (JSON) allow filtering by service, level, trace_id, and custom fields.
Alerts on structured logs (e.g., 'error rate > 5% in 5 minutes') require parseable fields.
Without structured logging, you are grepping through terabytes of text files.

Production Insight

Log rotation is the most overlooked production configuration. A team deployed 50 containers without log rotation. Within 48 hours, the host disk was full. The Docker daemon crashed because it could not write log files. All containers became unreachable. The fix was 3 lines in daemon.json. Set log rotation on every production host before deploying the first container.

Key Takeaway

Production logging requires: stdout/stderr output, log rotation (max-size, max-file), centralized aggregation (ELK, Datadog, Loki), structured JSON format, and trace IDs. Metrics require Prometheus with alerts on memory > 80%, CPU throttling > 10%, and restart count > 5/hour. Distributed tracing is mandatory for microservices debugging.

CI/CD Pipeline: Image Building, Scanning, and Deployment Strategies

Production deployments require a CI/CD pipeline that builds, scans, tests, and deploys container images with zero downtime. Manual docker build && docker push does not scale and introduces human error.

Image building best practices:
- Use multi-stage builds to separate build dependencies from runtime. The final image should contain only the application binary and runtime dependencies.
- Pin base image versions (node:20.11-alpine, not node:latest). Latest tags change without notice.
- Use .dockerignore to exclude build context bloat (node_modules, .git, *.log).
- Enable BuildKit (DOCKER_BUILDKIT=1) for parallel builds and secret mounting.
- Tag images with git SHA (not :latest, not :v1). The git SHA is immutable and traceable.

Image scanning: Every image must be scanned for known CVEs before deployment. Tools: Trivy, Snyk, AWS ECR scanning, Grype. Block deployment if critical or high CVEs are found. Scan the base image AND the application dependencies.

Deployment strategies:
- Rolling update: replace containers one at a time. Simple but no rollback guarantee.
- Blue-green: deploy new version alongside old, switch traffic atomically. Instant rollback.
- Canary: deploy to 5% of traffic, monitor for errors, then gradually increase. Best for catching regressions.
- A/B testing: deploy two versions simultaneously, split traffic by user segment. Best for feature testing.

Rollback: Every deployment must have a one-command rollback. In Kubernetes: kubectl rollout undo. In Docker Swarm: docker service rollback. In ECS: update the service to the previous task definition. If rollback requires a new build, you do not have a rollback strategy.

Image immutability: Never push to the same tag twice. If you rebuild an image, use a new tag (new git SHA). Mutable tags (pushing to :latest or :v1 twice) cause 'works on my machine' bugs because different hosts have different image layers cached.
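
One way to enforce the tagging rule in a pipeline is to build the image reference through a function that refuses mutable tags. The sketch below is illustrative only: `image_ref` is a hypothetical name, and the rejection pattern (empty, `latest`, or anything starting with `v` plus a digit) is an assumption that mirrors the examples in the text.

```shell
#!/bin/bash
# Hypothetical helper: build an image reference and refuse mutable-looking tags.
image_ref() {
  local registry="$1" repo="$2" tag="$3"
  case "$tag" in
    ""|latest|v[0-9]*)
      echo "refusing mutable or empty tag: '${tag}'" >&2
      return 1
      ;;
  esac
  echo "${registry}/${repo}:${tag}"
}

# A short git SHA is accepted; 'latest' is rejected with a non-zero status
image_ref registry.example.com io-thecodeforge/api 3f2a9c1
image_ref registry.example.com io-thecodeforge/api latest || echo "rejected"
```

In a real pipeline the tag would come from `git rev-parse --short HEAD`, and the non-zero return would fail the job before any build or push happens.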

io/thecodeforge/cicd_pipeline.sh BASH

#!/bin/bash
# Production CI/CD pipeline for Docker images

# ── Multi-stage Dockerfile ───────────────────────────────────────────────────
cat <<'EOF' > /tmp/Dockerfile
# Stage 1: Build
FROM node:20.11-alpine AS builder
WORKDIR /app
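COPY package*.json ./
# Install all dependencies: the build step usually needs devDependencies
RUN npm ci
COPY . .
# Build, then drop devDependencies so only runtime packages are copied forward
RUN npm run build && npm prune --omit=dev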

# Stage 2: Runtime (minimal image)
FROM node:20.11-alpine AS runtime
RUN addgroup -g 1001 -S appgroup && \
    adduser -S appuser -u 1001 -G appgroup
WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=10s --timeout=5s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
EOF

# ── Build with BuildKit ──────────────────────────────────────────────────────
GIT_SHA=$(git rev-parse --short HEAD)
IMAGE_TAG="registry.example.com/io-thecodeforge/api:${GIT_SHA}"

DOCKER_BUILDKIT=1 docker build \
  --tag ${IMAGE_TAG} \
  --label "io.thecodeforge.build.sha=${GIT_SHA}" \
  --label "io.thecodeforge.build.date=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --file /tmp/Dockerfile \
  .

docker push ${IMAGE_TAG}

# ── Image scanning with Trivy ───────────────────────────────────────────────
trivy image --severity HIGH,CRITICAL --exit-code 1 "${IMAGE_TAG}" || exit 1
# Exit code 1 = vulnerabilities found; `|| exit 1` stops the script so the
# deployment below never runs (the script does not use `set -e`)
# Exit code 0 = no critical/high vulnerabilities

# ── Deployment: rolling update (Kubernetes) ──────────────────────────────────
kubectl -n production set image deployment/io-thecodeforge-api \
  api=${IMAGE_TAG}

# Monitor rollout
kubectl -n production rollout status deployment/io-thecodeforge-api --timeout=300s

# ── Rollback (one command) ──────────────────────────────────────────────────
kubectl -n production rollout undo deployment/io-thecodeforge-api

# ── Canary deployment (Kubernetes with Istio) ───────────────────────────────
cat <<'EOF' > /tmp/canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  hosts:
  - api.example.com
  http:
  - route:
    - destination:
        host: io-thecodeforge-api
        subset: stable
      weight: 95
    - destination:
        host: io-thecodeforge-api
        subset: canary
      weight: 5
EOF
kubectl apply -f /tmp/canary-virtualservice.yaml

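# Monitor canary error rate
# If error rate > 1%, roll back by sending all traffic to the stable subset.
# Deleting the VirtualService would only remove the split, and default Service
# routing could still reach canary pods.
kubectl -n production patch virtualservice io-thecodeforge-api --type=json -p='[
  {"op": "replace", "path": "/spec/http/0/route/0/weight", "value": 100},
  {"op": "replace", "path": "/spec/http/0/route/1/weight", "value": 0}
]'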

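# ── Verify image immutability ────────────────────────────────────────────────
# List local tags and image IDs. A Docker host maps each repo:tag to exactly one
# image ID, so this flags only names that collide in their first 50 characters;
# it cannot detect a tag that was overwritten in the registry.
docker images --format '{{.Repository}}:{{.Tag}} {{.ID}}' | sort | uniq -w 50 -d
# To detect overwritten tags, compare the digest recorded at build time with the
# digest the registry serves now, or enable tag immutability in the registry.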

Output

# Build:

#5 [builder 4/4] RUN npm run build

#5 DONE 12.3s

#7 [runtime 5/5] CMD ["node", "dist/server.js"]

#7 DONE 0.1s

# Push:

The push refers to registry.example.com/io-thecodeforge/api

abc123: Pushed

def456: Pushed

abc789: digest: sha256:xyz... size: 1570

# Scan:

Total: 0 (HIGH: 0, CRITICAL: 0)

# No critical vulnerabilities — deployment approved

# Rollout:

Waiting for deployment "io-thecodeforge-api" rollout to finish...

deployment "io-thecodeforge-api" successfully rolled out

# Rollback:

deployment.apps/io-thecodeforge-api rolled back

# Immutability check:

# (empty output = no duplicate tags = good)

CI/CD Pipeline as a Manufacturing Assembly Line

Mutable tags cause different hosts to run different image versions — the same tag means different things on different machines.
Immutable tags (git SHA) guarantee that every host runs the exact same binary.
Rollback is trivial with immutable tags — just point to the previous SHA.
Debugging is deterministic — the git SHA maps directly to the source code that produced the image.

Production Insight

Image scanning must be part of the CI pipeline, not a separate process. If scanning is optional, it will be skipped under deadline pressure. Make scanning a gate: the pipeline fails if critical CVEs are found. Scan both the base image (OS-level CVEs) and the application dependencies (npm, pip, Maven CVEs). Update base images weekly — they accumulate CVEs over time.

Key Takeaway

Production CI/CD requires: multi-stage builds, pinned base images, git SHA tags, image scanning (Trivy), one-command rollback, and immutable images. Never use :latest in production. Deployment strategy choice (rolling, blue-green, canary) depends on your rollback tolerance and traffic pattern.

Security Hardening: Root, Secrets, Network, and Supply Chain

Production Docker security is a layered defense — no single measure is sufficient. Each layer (image, runtime, network, host) must be hardened independently.

Run as non-root: Containers running as root (uid 0) can exploit kernel vulnerabilities with maximum privileges. Every production container should run as a non-root user. Set USER in the Dockerfile or use --user in the run command. Drop all capabilities and add back only what is needed: --cap-drop=ALL --cap-add=NET_BIND_SERVICE.

Secrets management: Never bake secrets (API keys, passwords, certificates) into Docker images. Secrets in images are visible to anyone who can pull the image. Use: Docker secrets (Swarm), Kubernetes secrets (with external secret managers like Vault or AWS Secrets Manager), or environment variables injected at runtime from a secret manager. Use --mount=type=secret for build-time secrets in BuildKit.

Image provenance: Verify the source of base images. Use official images or images from trusted registries. Enable Docker Content Trust (DOCKER_CONTENT_TRUST=1) to verify image signatures. Use SBOM (Software Bill of Materials) tools to track all components in your images.

Runtime security:
- seccomp: filters syscalls. The default profile blocks ~44 dangerous syscalls. Use custom profiles for stricter filtering.
- AppArmor/SELinux: mandatory access control. The docker-default AppArmor profile restricts container capabilities.
- Read-only filesystem: --read-only prevents the container from modifying its filesystem. Use tmpfs for writable directories.
- No new privileges: --security-opt=no-new-privileges prevents privilege escalation.

Daemon security: The Docker daemon runs as root and has full access to the host. The daemon socket (/var/run/docker.sock) is equivalent to root access. Never mount the daemon socket into containers. Never expose the daemon over TCP without TLS client authentication. Use rootless Docker for environments where daemon root access is unacceptable.

io/thecodeforge/security_hardening.sh BASH

#!/bin/bash
# Production security hardening for Docker

# ── Non-root container ───────────────────────────────────────────────────────

# Dockerfile best practice
# RUN addgroup -g 1001 -S appgroup && adduser -S appuser -u 1001 -G appgroup
# USER appuser

# Runtime: run detached as non-root with dropped capabilities
# (-d keeps the container in the background so the checks below can run)
docker run -d \
  --user 1001:1001 \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --security-opt=no-new-privileges \
  --read-only \
  --tmpfs /tmp:size=64m \
  --name hardened-api \
  io-thecodeforge/api:1.0.0

# Verify non-root
docker exec hardened-api id
# uid=1001(appuser) gid=1001(appgroup)

# Verify capabilities
docker inspect hardened-api --format '{{.HostConfig.CapAdd}} {{.HostConfig.CapDrop}}'
# [NET_BIND_SERVICE] [ALL]

# ── Secrets management ──────────────────────────────────────────────────────

# Docker Swarm secrets
# printf avoids the trailing newline that echo would store in the secret
printf '%s' 'my-database-password' | docker secret create db-password -
docker service create --secret db-password --name io-thecodeforge-api \
  io-thecodeforge/api:1.0.0
# Secret is available at /run/secrets/db-password inside the container

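# BuildKit: mount secrets during build (not in final image)
DOCKER_BUILDKIT=1 docker build --secret id=npmrc,src=$HOME/.npmrc -t api:1.0 .
# In Dockerfile: RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci
# Mount the secret where npm reads it (assumes the build user is root);
# copying it with cp would persist the secret in the image layer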

# ── Image scanning and provenance ───────────────────────────────────────────

# Scan for vulnerabilities
trivy image --severity HIGH,CRITICAL io-thecodeforge/api:1.0.0

# Generate SBOM (Software Bill of Materials)
syft io-thecodeforge/api:1.0.0 -o spdx-json > sbom.json

# Verify image signature (Docker Content Trust)
export DOCKER_CONTENT_TRUST=1
docker pull io-thecodeforge/api:1.0.0
# Fails if the image is not signed

# ── Seccomp profile ─────────────────────────────────────────────────────────

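# The default seccomp profile is built into Docker and applied automatically
# when no seccomp option is passed
docker run io-thecodeforge/api:1.0.0

# Create a custom seccomp profile (allow only required syscalls).
# The list below is illustrative and shorter than most real workloads need;
# derive the real list from strace or auditd output.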
cat <<'EOF' > /tmp/seccomp-api.json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "stat", "fstat",
                "mmap", "mprotect", "munmap", "brk", "ioctl",
                "access", "socket", "connect", "sendto", "recvfrom",
                "clone", "execve", "exit", "exit_group", "futex",
                "epoll_create1", "epoll_ctl", "epoll_wait",
                "accept4", "listen", "bind", "setsockopt"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF
docker run --security-opt seccomp=/tmp/seccomp-api.json \
  io-thecodeforge/api:1.0.0

# ── Daemon security ─────────────────────────────────────────────────────────

# Check if daemon socket is exposed
curl --unix-socket /var/run/docker.sock http://localhost/version
# If this works, the daemon is accessible — anyone with socket access has root

# Check if daemon is exposed over TCP
netstat -tlnp | grep 2375
# Port 2375 = unencrypted Docker API (NEVER expose this)
netstat -tlnp | grep 2376
# Port 2376 = TLS-encrypted Docker API (OK if TLS client auth is configured)

# Verify daemon configuration
cat /etc/docker/daemon.json
# Should NOT contain: "hosts": ["tcp://0.0.0.0:2375"]

Output

# Non-root verification:

uid=1001(appuser) gid=1001(appgroup)

# Capabilities:

[NET_BIND_SERVICE] [ALL]

# Image scan:

Total: 0 (HIGH: 0, CRITICAL: 0)

# Daemon socket check:

{"Version":"24.0.7","ApiVersion":"1.43"}

# Socket is accessible — ensure proper file permissions

# TCP exposure:

# (no output on 2375 = good, daemon not exposed unencrypted)

Security as a Castle Defense

The daemon runs as root and has full access to the host filesystem, network, and processes.
Anyone who can access the socket can create a container with the host filesystem mounted.
Mounting the host filesystem into a container gives the container root access to the host.
Never mount /var/run/docker.sock into containers. Never expose the daemon over TCP without TLS.

Production Insight

The default seccomp profile blocks ~44 dangerous syscalls but allows ~260 others. For high-security environments, create a custom profile that allows only the syscalls your application uses. Use strace or auditd to determine the required syscalls, then build a minimal profile. This reduces the attack surface by 80% compared to the default profile.

Key Takeaway

Production security requires: non-root users, dropped capabilities, secrets supplied at runtime by a secret manager (never baked into images), image scanning in CI, seccomp profiles, read-only filesystems, and daemon socket protection. The Docker daemon socket is root access: never expose it. Defense in depth means every layer must be hardened independently.

Scaling Strategies: Horizontal, Vertical, and Auto-Scaling

Scaling Docker in production means adding capacity to handle increased traffic. The strategy depends on the workload pattern: predictable traffic, bursty traffic, or event-driven traffic.

Horizontal scaling (scale out): Add more container replicas. Each replica handles a portion of the traffic. Horizontal scaling is preferred for stateless workloads — it provides redundancy (if one replica fails, others continue), and it scales linearly. Docker Swarm: docker service scale api=10. Kubernetes: kubectl scale deployment/api --replicas=10. ECS: update the service desired count.

Vertical scaling (scale up): Increase the resources (CPU, memory) of existing containers. Vertical scaling is simpler but limited by the host's capacity. In most orchestrated setups it also requires restarting the container with new resource limits. Vertical scaling is appropriate for stateful workloads (databases) that cannot easily distribute across replicas.

Auto-scaling: Automatically adjust replica count based on metrics. The most common triggers:
- CPU utilization > 70% for 5 minutes -> add replicas
- Request rate > 1000 req/s -> add replicas
- Queue depth > 100 messages -> add worker replicas
- Custom metrics (response latency, error rate) -> add or remove replicas
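
For a sense of how a metric turns into a replica count, the Kubernetes Horizontal Pod Autoscaler computes the desired count as ceil(currentReplicas × currentMetric / targetMetric) and clamps it to the configured minimum and maximum. The sketch below reproduces only that core formula; the function name is hypothetical, and the real controller additionally applies a tolerance band, stabilization windows, and scaling policies.

```shell
#!/bin/bash
# desired = ceil(current_replicas * current_metric / target_metric), clamped to [min, max]
desired_replicas() {
  local current="$1" metric="$2" target="$3" min="$4" max="$5"
  local desired=$(( (current * metric + target - 1) / target ))  # integer ceiling
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

desired_replicas 5 45 70 3 20   # CPU at 45% against a 70% target
desired_replicas 5 90 70 3 20   # CPU at 90% against a 70% target
```

With 5 replicas, 45% utilization against a 70% target gives 4 replicas, while 90% gives 7; the clamp keeps the result between 3 and 20 regardless of the metric.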

Pre-warming: Container startup is fast (0.3-2s) but application cold start can be 10-60s (JVM startup, dependency initialization, connection pool warmup). Pre-warm containers by pulling images before scaling events and using readiness probes that wait for full initialization. For JVM applications, use class data sharing (CDS) or GraalVM native images to reduce cold start.

Scale-down strategy: Removing replicas must be graceful. The replica should stop accepting new requests, drain in-flight requests, close connections, and then exit. Kubernetes handles this with preStop hooks and terminationGracePeriodSeconds. Without graceful shutdown, in-flight requests are dropped during scale-down, causing user-facing errors.
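
The timing constraint behind graceful shutdown can be sketched as a simple budget check: the preStop delay and the application's worst-case drain time both count against terminationGracePeriodSeconds. The helper name and the sample numbers below are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical check: does preStop sleep plus worst-case drain time fit in the grace period?
fits_grace_period() {
  local prestop_s="$1" drain_s="$2" grace_s="$3"
  [ $(( prestop_s + drain_s )) -le "$grace_s" ]
}

if fits_grace_period 5 20 30; then
  echo "ok: 5s preStop + 20s drain fits in 30s"
fi
if ! fits_grace_period 5 40 30; then
  echo "too long: raise terminationGracePeriodSeconds or shorten the drain"
fi
```

If the budget does not fit, the container is killed before it finishes draining, which is the same user-facing failure as having no graceful shutdown at all.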

Capacity planning: Monitor resource usage trends over weeks. If average CPU usage is growing 5% per week, plan to add capacity before it reaches 80%. Auto-scaling handles burst traffic, but baseline capacity must be planned manually.
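
The planning horizon can be estimated with simple arithmetic. The sketch below assumes linear growth measured in percentage points per week, which is one reading of "5% per week"; compounding growth would reach the threshold sooner. The function name and the starting value of 55% are hypothetical.

```shell
#!/bin/bash
# Weeks until average CPU reaches the threshold, assuming linear growth
# in percentage points per week (rounded up to whole weeks).
weeks_until_threshold() {
  local current_pct="$1" growth_pts="$2" threshold_pct="$3"
  if [ "$growth_pts" -le 0 ]; then
    echo "growth must be positive" >&2
    return 1
  fi
  if [ "$current_pct" -ge "$threshold_pct" ]; then
    echo 0
    return 0
  fi
  echo $(( (threshold_pct - current_pct + growth_pts - 1) / growth_pts ))
}

# Average CPU at 55%, growing 5 points per week, 80% planning threshold
weeks_until_threshold 55 5 80
```

For the example values the answer is 5 weeks, which is the lead time available to order hardware or raise node-pool limits.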

io/thecodeforge/scaling_strategies.sh BASH

#!/bin/bash
# Production scaling strategies and configuration

# ── Horizontal scaling (Docker Swarm) ────────────────────────────────────────

# Scale to 10 replicas
docker service scale io-thecodeforge-api=10

# Verify replicas are distributed across nodes
docker service ps io-thecodeforge-api --format '{{.Node}} {{.CurrentState}}'
# manager1  Running
# worker1   Running
# worker2   Running
# (distributed across 3 nodes)

# ── Horizontal scaling (Kubernetes) ──────────────────────────────────────────

# Manual scaling
kubectl -n production scale deployment/io-thecodeforge-api --replicas=10

# Auto-scaling based on CPU utilization
cat <<'EOF' > /tmp/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: io-thecodeforge-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: io-thecodeforge-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
EOF
kubectl apply -f /tmp/hpa.yaml

# ── Graceful shutdown (preStop hook) ─────────────────────────────────────────
cat <<'EOF' > /tmp/graceful-shutdown-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
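spec:
  # selector and matching pod labels are required for a valid Deployment
  selector:
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: api
        image: registry.example.com/io-thecodeforge/api:1.0.0
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]
        # Shutdown sequence:
        # 1. preStop sleeps 5s (allow load balancer to drain the pod)
        # 2. After the hook returns, the kubelet sends SIGTERM to the container
        # 3. Application drains in-flight requests and exits
        # 4. Kubernetes sends SIGKILL once terminationGracePeriodSeconds (30s)
        #    has elapsed; the preStop time counts against that budget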
EOF
kubectl apply -f /tmp/graceful-shutdown-deployment.yaml

# ── Pre-warming: pull images before scaling events ───────────────────────────

# Pre-pull images on all nodes (Kubernetes DaemonSet)
cat <<'EOF' > /tmp/prepull-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      initContainers:
      - name: prepull
        image: registry.example.com/io-thecodeforge/api:1.0.0
        command: ["true"]
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
EOF
kubectl apply -f /tmp/prepull-daemonset.yaml
# This DaemonSet runs on every node and pulls the image into the node's cache

# ── Monitor scaling effectiveness ────────────────────────────────────────────

# Check current replica count and resource usage
kubectl -n production get deployment io-thecodeforge-api -o wide
# NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
# io-thecodeforge-api   10/10   10           10          5d

# Check HPA status
kubectl -n production get hpa io-thecodeforge-api-hpa
# NAME                       REFERENCE                  TARGETS       MINPODS   MAXPODS   REPLICAS
# io-thecodeforge-api-hpa   Deployment/io-thecodeforge-api   45%/70%       3         20        5
# (45% CPU — below 70% threshold — HPA will scale down after stabilization window)

Output

# Swarm scaling:

io-thecodeforge-api scaled to 10

# HPA status:

horizontalpodautoscaler.autoscaling/io-thecodeforge-api-hpa created

# Graceful shutdown:

deployment.apps/io-thecodeforge-api configured

# Pre-pull:

daemonset.apps/image-prepull created

# Scaling status:

NAME READY UP-TO-DATE AVAILABLE AGE

io-thecodeforge-api 10/10 10 10 5d

# HPA:

NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS

io-thecodeforge-api-hpa Deployment/io-thecodeforge-api 45%/70% 3 20 5

Scaling as a Restaurant

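Without graceful shutdown handling, Kubernetes sends SIGTERM while the pod is still being removed from the load balancer; endpoint removal propagates asynchronously, so new requests can keep arriving at a pod that is already exiting.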
In-flight requests (requests that have already been routed to the pod) are dropped mid-processing.
The preStop hook adds a delay, allowing the load balancer to stop sending new requests before the pod exits.
The application must also handle SIGTERM by stopping new request acceptance and draining in-flight requests.

Production Insight

Auto-scaling down is more dangerous than scaling up. Scaling up adds capacity — worst case, you waste money. Scaling down removes capacity — worst case, you drop traffic. Set a longer stabilization window for scale-down (300s) than scale-up (60s). Monitor error rates during scale-down events — if errors spike, increase the stabilization window.

Key Takeaway

Horizontal scaling adds replicas (preferred for stateless). Vertical scaling adds resources (for stateful). Auto-scaling adjusts replicas based on metrics. Pre-warm by pulling images before scaling events. Graceful shutdown with preStop hooks prevents dropped requests during scale-down. Scale-down stabilization window should be 3-5x longer than scale-up.

High Availability and Disaster Recovery

Production Docker deployments must survive host failures, network partitions, and data center outages. High availability (HA) ensures continuous operation during failures. Disaster recovery (DR) ensures data and service restoration after catastrophic failures.

Multi-host redundancy: Run multiple replicas of each service across multiple hosts. If one host fails, the orchestrator reschedules containers on healthy hosts. Docker Swarm: use --replicas=3 and ensure the swarm has 3+ manager nodes. Kubernetes: use pod anti-affinity to spread replicas across nodes and zones.
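
The "3+ manager nodes" guidance follows from Raft quorum arithmetic: a swarm with N managers needs a majority (floor(N/2) + 1) to stay operational, so it tolerates floor((N-1)/2) manager failures. The helper names below are hypothetical; the sketch only prints the arithmetic for a few cluster sizes.

```shell
#!/bin/bash
# Raft majority: quorum = floor(N/2) + 1, tolerated failures = floor((N-1)/2)
swarm_quorum() { echo $(( $1 / 2 + 1 )); }
swarm_fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for managers in 1 3 4 5 7; do
  echo "managers=${managers} quorum=$(swarm_quorum "$managers") tolerated_failures=$(swarm_fault_tolerance "$managers")"
done
```

The output shows why odd manager counts are preferred: 4 managers tolerate one failure, the same as 3, while adding one more node that has to take part in consensus.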

Multi-AZ deployment: Deploy across multiple availability zones (data centers within a region). If one AZ fails, services continue in other AZs. AWS: use ECS/Kubernetes with nodes in 3+ AZs. Use Application Load Balancer (ALB) to distribute traffic across AZs.

Stateful services (databases): Databases require special HA strategies:
- PostgreSQL: streaming replication with automatic failover (Patroni, pg_auto_failover)
- MySQL: Group Replication or Galera Cluster
- Redis: Redis Sentinel or Redis Cluster
- Use volumes for data persistence. Back up volumes to object storage (S3) regularly.

Data backup and recovery:
- Volume snapshots: snapshot named volumes regularly. Docker has no built-in volume snapshot command, so use a backup container or storage-level snapshots from the cloud provider.
- Database backups: pg_dump, mysqldump, or continuous WAL archiving to object storage.
- Image registry backup: replicate images across regions (ECR replication, Harbor replication).
- Configuration backup: store all configuration (Docker Compose, Kubernetes manifests, daemon.json) in version control.

Health checks and self-healing: The orchestrator uses health checks to detect unhealthy containers and automatically restart or reschedule them. Liveness probes detect deadlocked processes (restart the container). Readiness probes detect services that are not ready to receive traffic (remove from load balancer). Startup probes detect slow-starting applications (give them more time before health checking).

Failover testing: HA is only as good as your last failover test. Regularly simulate failures: kill a container, drain a node, shut down an AZ. Measure the time to recovery and the error rate during failover. If you have never tested failover, you do not have HA.

io/thecodeforge/ha_dr.sh BASH

#!/bin/bash
# High availability and disaster recovery configuration

# ── Multi-host redundancy (Docker Swarm) ─────────────────────────────────────

# Create a service with replicas spread across nodes
docker service create \
  --name io-thecodeforge-api \
  --replicas 6 \
  --constraint 'node.role==worker' \
  --placement-pref 'spread=node.id' \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --update-parallelism 2 \
  --update-delay 10s \
  --update-failure-action rollback \
  --restart-condition on-failure \
  --restart-delay 5s \
  --restart-max-attempts 3 \
  registry.example.com/io-thecodeforge/api:1.0.0

# Verify distribution across nodes
docker service ps io-thecodeforge-api --format '{{.Node}} {{.CurrentState}}'
# worker1  Running
# worker2  Running
# worker3  Running
# worker1  Running
# worker2  Running
# worker3  Running

# ── Multi-AZ pod anti-affinity (Kubernetes) ──────────────────────────────────
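cat <<'EOF' > /tmp/multi-az-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: io-thecodeforge-api
  namespace: production
spec:
  replicas: 6
  selector:                      # required for apps/v1; must match the pod labels
    matchLabels:
      app: io-thecodeforge-api
  template:
    metadata:
      labels:
        app: io-thecodeforge-api
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: at most one replica per node (needs >= 6 schedulable nodes).
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - io-thecodeforge-api
            topologyKey: kubernetes.io/hostname
          # Soft rule: prefer spreading across zones. A hard zone rule would allow
          # only one pod per zone and leave the remaining replicas Pending.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - io-thecodeforge-api
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: api
        image: registry.example.com/io-thecodeforge/api:1.0.0
EOF
kubectl apply -f /tmp/multi-az-deployment.yaml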

# ── Volume backup (named volume to S3) ──────────────────────────────────────

# Create a backup container that mounts the volume and uploads to S3.
# Copying a live database directory is not crash-consistent: stop the database
# first, or prefer pg_dump / WAL archiving for PostgreSQL.
docker run --rm \
  -v postgres-data:/data:ro \
  -e AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}" \
  -e AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}" \
  amazon/aws-cli s3 cp /data "s3://my-backups/postgres-data/$(date +%Y-%m-%d)/" --recursive

# ── Kubernetes CronJob for automated backups ─────────────────────────────────
cat <<'EOF' > /tmp/backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
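          containers:
          - name: backup
            # Assumption: a custom image (hypothetical name) that contains both
            # pg_dump and the AWS CLI. The stock postgres:16 image does not ship `aws`.
            # AWS credentials must also be supplied to the pod.
            image: registry.example.com/io-thecodeforge/pg-backup:16
            command:
            - /bin/bash
            - -c
            - |
              set -o pipefail   # fail the job if pg_dump fails, not only if the upload fails
              pg_dump -h db-host -U postgres -d mydb | \
              gzip | \
              aws s3 cp - s3://my-backups/db/mydb-$(date +%Y%m%d-%H%M%S).sql.gz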
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
          restartPolicy: OnFailure
EOF
kubectl apply -f /tmp/backup-cronjob.yaml

# ── Failover testing ────────────────────────────────────────────────────────

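# Kill one of the service's containers and verify self-healing
# (run this on a node that hosts at least one task of the service)
docker service scale io-thecodeforge-api=3
sleep 5

# Kill a container that belongs to the service, not just the newest container on the host
docker kill "$(docker ps -q --filter label=com.docker.swarm.service.name=io-thecodeforge-api | head -1)"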

# Watch the orchestrator reschedule
watch docker service ps io-thecodeforge-api
# A new container should start within seconds

# Kubernetes: simulate node failure
kubectl cordon k8s-worker-2  # Mark node as unschedulable
kubectl drain k8s-worker-2 --ignore-daemonsets --delete-emptydir-data  # Evict all pods from the node
# Pods are rescheduled to other nodes

# Verify all pods are running on remaining nodes
kubectl get pods -n production -o wide | grep io-thecodeforge-api

Output

# Service distribution:
worker1 Running
worker2 Running
worker3 Running
worker1 Running
worker2 Running
worker3 Running
# Evenly distributed across 3 nodes

# Multi-AZ:
deployment.apps/io-thecodeforge-api configured

# Backup:
upload: data/ to s3://my-backups/postgres-data/2026-04-05/

# Failover test:
# Container killed on worker2
# New container started on worker1 within 3 seconds
# Service remained available throughout

HA as a Safety Net

Configuration without testing is an assumption. Failover may fail due to DNS TTL, connection pool exhaustion, or split-brain scenarios.
Regular failover tests reveal hidden dependencies that are not visible in configuration.
Measure time-to-recovery (TTR) and error-rate-during-failover. If TTR > 30s or error rate > 5%, the failover is inadequate.
Run failover tests monthly. Test killing containers, draining nodes, and simulating AZ failures.
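
The thresholds above can be turned into a pass/fail gate for a failover drill. This is a minimal sketch, assuming the time to recovery (in whole seconds) and the failed and total request counts were already collected during the test. The function name failover_verdict is hypothetical and not part of any tool.

```bash
#!/usr/bin/env bash
# Sketch: apply the thresholds from the callout (TTR > 30s or error rate > 5%)
# to numbers measured during a failover drill.
# failover_verdict is a hypothetical helper name.
failover_verdict() {
  local ttr_seconds=$1   # measured time to recovery, whole seconds
  local failed=$2        # requests that failed during the drill
  local total=$3         # all requests sent during the drill

  if [ "$total" -le 0 ]; then
    echo "no-data"
    return 1
  fi

  # Integer-only comparison: failed/total > 5/100  <=>  failed*100 > total*5
  if [ "$ttr_seconds" -gt 30 ] || [ $((failed * 100)) -gt $((total * 5)) ]; then
    echo "inadequate"
    return 1
  fi
  echo "ok"
}
```

For example, `failover_verdict 12 3 1000` reports ok, while a 45-second recovery or 60 failures out of 1000 requests reports inadequate and returns a non-zero status that a test script can act on.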

Production Insight

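The most common HA failure: all replicas are on the same host. Docker Swarm's scheduler spreads tasks across nodes on a best-effort basis, which does not guarantee cross-node or cross-zone distribution. Label your nodes and use --placement-pref 'spread=node.labels.<label>' to spread across zones or racks (placement preferences spread over node or engine labels rather than node IDs), and add --replicas-max-per-node to cap how many replicas can land on one node. In Kubernetes, use pod anti-affinity with requiredDuringSchedulingIgnoredDuringExecution and topologyKey kubernetes.io/hostname to enforce cross-node placement. Without explicit anti-affinity, the scheduler may place all replicas on the same node.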

Key Takeaway

HA requires multi-host replicas with cross-node placement (spread or anti-affinity). DR requires regular backups to object storage and tested restore procedures. Failover testing is mandatory — untested failover is an assumption, not a guarantee. Measure time-to-recovery and error rate during every failover test.

Production incident · POST-MORTEM · severity: high

Docker Swarm Overlay Network Failure — 45-Minute Outage During Black Friday Traffic

Symptom

All services reported connection refused errors to their downstream dependencies. HTTP requests timed out after 30 seconds. The load balancer returned 502 errors for 100% of external traffic. docker service ls showed all services as 'running' with the correct replica count. Containers were healthy — they just could not communicate with each other.

Assumption

The team assumed a container crash or OOM kill — but all containers were running. They assumed a DNS failure — but DNS resolution worked from the host. They assumed a firewall rule change — but iptables rules were unchanged. They assumed a cloud provider network issue — but VPC connectivity was healthy.

Root cause

The overlay network was created with the default MTU of 1450 (the standard 1500-byte Ethernet MTU minus 50 bytes of VXLAN encapsulation overhead). The underlying VPC had an MTU of 9001 (jumbo frames). When traffic spiked, large packets were being fragmented at the VXLAN boundary. Under normal load, the fragmentation overhead was negligible. Under Black Friday load (10x normal), the fragmentation rate exceeded the kernel's memory limit for IP fragment reassembly (net.ipv4.ipfrag_high_thresh). Packets were dropped silently. The overlay network appeared healthy (containers were running) but all inter-container traffic was being dropped.

Fix

1. Immediate: recreated the overlay network with the correct MTU: docker network create --driver overlay --opt com.docker.network.driver.mtu=8950 app-overlay.
2. Drained and rejoined all swarm nodes to flush corrupted network state.
3. Set net.ipv4.ipfrag_high_thresh to 4x the default on all nodes.
4. Added MTU verification to the deployment pipeline: any overlay network with MTU < 8900 blocks the deploy.
5. Added Prometheus alerts for IP fragment queue drops.
6. Documented that overlay MTU must be calculated: VPC_MTU - VXLAN_OVERHEAD(50) = overlay_MTU.
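
The MTU formula and the pipeline gate from the fix list can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the team's actual pipeline code: the 50-byte overhead and the 8900 floor come from the write-up above, and the function names overlay_mtu and mtu_gate are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: derive the overlay MTU from the underlay MTU and gate a deploy on it.
# overlay_mtu and mtu_gate are hypothetical helper names.
VXLAN_OVERHEAD=50

overlay_mtu() {
  # $1 = underlay (VPC / host NIC) MTU in bytes
  echo $(( $1 - VXLAN_OVERHEAD ))
}

mtu_gate() {
  # $1 = MTU configured on the overlay network, $2 = minimum allowed MTU
  local configured=$1 minimum=${2:-8900}
  if [ "$configured" -lt "$minimum" ]; then
    echo "BLOCK: overlay MTU $configured is below $minimum" >&2
    return 1
  fi
  echo "OK: overlay MTU $configured"
}

# Example with the underlay MTU from the incident:
overlay_mtu 9001   # prints 8951
```

The incident fix used 8950, which sits just under the computed ceiling of 8951. In a pipeline, the configured value would have to be read from the network itself (for example from the Options shown by docker network inspect) before it is passed to mtu_gate.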

Key lesson

Overlay network MTU is invisible until traffic volume exceeds the fragmentation queue capacity. Always calculate overlay MTU as: underlay_MTU - 50 (VXLAN overhead).
docker service ls shows containers as 'running' even when the overlay network is broken. Network health is not visible in Docker's built-in status checks.
Add network-level health checks (TCP connectivity to downstream services) in addition to HTTP health checks. A container can be HTTP-healthy but network-unreachable.
Load test at 2x expected peak traffic before any high-traffic event. The MTU issue had been dormant for 6 months — it only manifested under 10x traffic.
Monitor IP fragment queue drops (netstat -s | grep -i frag) on all container hosts. Fragment queue exhaustion is a silent failure mode.
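
As one way to watch for the silent failure mode described above, the kernel's IPv4 counters can be read directly from /proc/net/snmp on a Linux host. The sketch below assumes the usual layout of that file (a header line and a value line, both starting with "Ip:"); the helper name ip_counter is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: read one IPv4 counter (default: ReasmFails) from /proc/net/snmp.
# The first "Ip:" line lists column names, the second holds the values.
# ip_counter is a hypothetical helper name.
ip_counter() {
  local name=${1:-ReasmFails} file=${2:-/proc/net/snmp}
  awk -v want="$name" '
    $1 == "Ip:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    $1 == "Ip:" && seen  { if (want in col) print $(col[want]); exit }
  ' "$file"
}
```

Calling `ip_counter ReasmFails` twice, a few seconds apart, and comparing the two values shows whether reassembly failures are increasing. A steadily growing number under load is the signal an alert should be built on.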

Production debug guide

From container crashes to network partitions: real debugging paths through production Docker deployments.

Symptom · 01

Container is running but not serving traffic.

Fix

Check if the container is listening on the correct port and interface. Run docker exec <container> ss -tlnp to verify the process is listening. Check if the process is listening on 0.0.0.0 (all interfaces) or 127.0.0.1 (localhost only — unreachable from outside the container). Check if health checks are passing: docker inspect <container> --format '{{.State.Health.Status}}'. If unhealthy, the container may be removed from the load balancer.
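
The loopback check can be automated by filtering the ss output. The following is a rough sketch that assumes the column layout of `ss -tln` (state in the first field, local address and port in the fourth); the helper name loopback_listeners is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: list TCP listeners that are bound to loopback only.
# Reads `ss -tln` (or `ss -tlnp`) output on stdin.
# loopback_listeners is a hypothetical helper name.
loopback_listeners() {
  awk '$1 == "LISTEN" && ($4 ~ /^127\./ || $4 ~ /^\[::1\]/) { print $4 }'
}
```

A possible invocation is `docker exec <container> ss -tln | loopback_listeners`. Any address it prints is reachable only from inside the container, so a service port in that list explains why outside traffic never arrives.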

Symptom · 02

Container is OOM-killed repeatedly.

Fix

Check the container's memory limit: docker inspect <container> --format '{{.HostConfig.Memory}}'. Check the container's memory usage before crash: docker stats --no-stream <container>. Check if the limit is too low for the application's peak memory. Check for memory leaks: monitor RSS over time with docker stats. Fix: increase the memory limit or fix the leak. Add --oom-kill-disable only if you want the entire host to freeze instead of killing the container.
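
To judge whether a limit is too low, it helps to express current usage as a percentage of the limit. This sketch assumes both numbers are available in bytes: the limit as printed by the docker inspect command above (where 0 means no limit) and the usage from whatever source you trust. The helper name mem_usage_pct is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: express memory usage as a percentage of the container limit.
# mem_usage_pct is a hypothetical helper name.
mem_usage_pct() {
  # $1 = limit in bytes (0 means no limit), $2 = current usage in bytes
  local limit=$1 used=$2
  if [ "$limit" -eq 0 ]; then
    echo "unlimited"
    return 0
  fi
  echo $(( used * 100 / limit ))
}

# Example: 384 MiB used against a 512 MiB limit
mem_usage_pct 536870912 402653184   # prints 75
```

A value that sits close to 100 shortly before each kill suggests the limit is simply too tight, while a value that climbs steadily over hours points toward a leak.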

Symptom · 03

Inter-container communication is failing.

Fix

Check if containers are on the same network: docker network inspect <network> | grep -A5 <container>. Check DNS resolution: docker exec <container> nslookup <target-service>. Check if the overlay network is healthy: docker network inspect <network> --format '{{.Peers}}'. Check MTU: docker exec <container> cat /sys/class/net/eth0/mtu. Check IP fragment counters on the host: grep '^Ip:' /proc/net/snmp (the Reasm* and Frag* columns; grepping for "frag" alone matches only the header line, not the values).
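
Beyond reading the configured MTU, the path can be probed with a ping that forbids fragmentation. The sketch below only computes the payload size for such a probe, assuming IPv4 without IP options (20-byte IP header plus 8-byte ICMP header); the helper name df_ping_payload is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: ICMP payload size that exactly fills a given MTU over IPv4.
# 20 bytes IPv4 header + 8 bytes ICMP header = 28 bytes of overhead.
# df_ping_payload is a hypothetical helper name.
df_ping_payload() {
  # $1 = MTU to test, in bytes
  echo $(( $1 - 28 ))
}

df_ping_payload 1450   # prints 1422
```

With an iputils ping inside the container (an assumption, since minimal images often ship a different ping or none), a probe could look like `ping -M do -c 3 -s "$(df_ping_payload 1450)" <target-service>`. If that size fails while a smaller one succeeds, something on the path has a lower MTU than the interface reports.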

Symptom · 04

Docker daemon is consuming excessive disk space.

Fix

Check Docker disk usage: docker system df. Check detailed breakdown: docker system df -v. Check for dangling images: docker images --filter dangling=true. Check for orphaned volumes: docker volume ls --filter dangling=true. Check for large log files: ls -lhS /var/lib/docker/containers/*/*.log. Fix: docker system prune -a --volumes (WARNING: removes all unused resources). Set log rotation in daemon.json.
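
For the log rotation step, a daemon.json fragment with the json-file driver's max-size and max-file options is usually enough. The sketch below writes such a fragment to a path of your choice; the sizes (10m, 3 files), the default output path, and the helper name write_log_rotation are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch: write a daemon.json fragment that enables log rotation for the
# json-file log driver. This overwrites the target file, so merge by hand
# if a daemon.json already exists.
# write_log_rotation is a hypothetical helper name.
write_log_rotation() {
  local path=${1:-./daemon.json.example}
  cat > "$path" <<'JSON'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
JSON
}
```

The daemon reads the file at startup, so the Docker daemon has to be restarted after the settings are placed in /etc/docker/daemon.json, and only containers created afterwards pick up the new log options.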

Symptom · 05

Deploy is stuck — new containers are not starting.

Fix

Check if there are enough resources: docker node ls (for Swarm) or kubectl describe nodes (for Kubernetes). Check if the image pull is failing: docker pull <image>. Check if port conflicts exist: docker ps -a | grep <port>. Check if the container immediately crashes: docker logs <container>. Check if the health check is failing: docker inspect <container> --format '{{.State.Health}}'.

Symptom · 06

Container logs are missing or incomplete.

Fix

Check if the application writes to stdout/stderr (Docker captures these). Check if logs are in the container filesystem (lost on restart). Check the log driver: docker info --format '{{.LoggingDriver}}'. Check if log rotation is configured: cat /etc/docker/daemon.json | grep log. Check if the logging agent (Fluentd, Filebeat) is running and healthy.

Docker Production Triage Cheat Sheet

First-response commands when containers are crashing, networking is broken, or resources are exhausted in production.

Container is OOM-killed repeatedly.

Immediate action

Check memory limit and current usage.

Commands

docker inspect <container> --format '{{.HostConfig.Memory}}'

docker stats --no-stream <container>

Fix now

Increase --memory limit or fix the memory leak. Check dmesg | grep -i oom for host-level OOM events.

Docker Production: Orchestrator Comparison

Aspect	Docker Swarm	Kubernetes	AWS ECS	AWS Fargate
Complexity	Low	High	Medium	Low
Learning curve	1-2 weeks	2-6 months	2-4 weeks	1-2 weeks
Self-healing	Yes	Yes	Yes	Yes
Auto-scaling	Limited (external)	HPA, VPA, KEDA	Service Auto Scaling	Service Auto Scaling
Service mesh	No	Istio, Linkerd	App Mesh	App Mesh
Multi-AZ	Manual	Built-in (topology spread)	Built-in	Built-in
Host management	Self-managed	Self-managed (or EKS/GKE)	EC2 instances	No hosts (serverless)
Cost	Lowest (self-managed)	Medium (EKS $73/mo + nodes)	Medium (EC2 + ECS)	Highest (20-30% premium)
Ecosystem	Small	Massive	AWS-native	AWS-native
Best for	Small teams, simple deployments	Large teams, complex workloads	AWS-native, medium complexity	Minimal ops, AWS-native

Key takeaways

Production Docker requires an orchestrator (Swarm, Kubernetes, ECS) for scheduling, scaling, self-healing, and networking. Choose based on team size and ecosystem needs.

Resource limits are mandatory: set --memory and --cpus on every container. Without limits, one container can starve all others on the host.

Logging must go to stdout/stderr with rotation. Ship to a centralized system. Never write logs to the container filesystem.

CI/CD requires multi-stage builds, git SHA tags, image scanning, one-command rollback, and immutable images. Never use :latest in production.

Security requires non-root users, dropped capabilities, secret managers, seccomp profiles, and daemon socket protection. Defense in depth at every layer.

Scaling requires horizontal replicas for stateless workloads, auto-scaling based on metrics, pre-warming for cold start, and graceful shutdown for zero-downtime deploys.

HA requires multi-host replicas with cross-node placement, multi-AZ deployment, regular backups, and tested failover. Untested failover is not HA.

Frequently Asked Questions

Should I use Docker Swarm or Kubernetes in production?

How do I handle persistent data in Docker production?

How do I debug a container that is running but not serving traffic?

What is the difference between liveness and readiness probes?

How do I achieve zero-downtime deployments with Docker?

How do I monitor Docker in production?
