Advanced 11 min · March 06, 2026

Docker Swarm — Why 4 Managers Caused a 3-Hour Outage

4 manager nodes lost quorum when 2 failed — freezing all deployments for 3 hours.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

Last updated: May 23, 2026
Quick Answer
  • Manager nodes: run the Raft consensus algorithm, maintain cluster state, schedule services
  • Worker nodes: execute tasks (containers) assigned by managers
  • Services: the declarative unit — you define desired state, Swarm converges reality to match
  • Tasks: the atomic scheduling unit — one task = one container
  • Raft consensus requires a quorum (majority) of managers to agree on state changes
  • Overlay networks span hosts so containers can communicate across nodes
  • Ingress routing mesh load-balances published ports across all nodes
  • Rolling updates replace containers incrementally with zero downtime
Definition
What is Docker Swarm?

Docker Swarm is Docker's native clustering and orchestration solution, bundled directly into the Docker Engine since version 1.12. It turns a pool of Docker hosts into a single, logical virtual server. You deploy services declaratively, and Swarm handles scheduling, scaling, networking, and maintaining desired state across the cluster.

Its killer feature is simplicity: you don't need to install a separate orchestrator or manage external dependencies like etcd or ZooKeeper, because the Raft consensus group runs inside the manager nodes themselves. This makes Swarm ideal for teams that want container orchestration without the operational overhead of Kubernetes, especially in small-to-medium deployments or edge computing scenarios where a full K8s control plane would be overkill.

Swarm's architecture splits nodes into managers and workers. Managers run the Raft consensus algorithm to maintain cluster state — they're the brain. Workers just execute tasks. The Raft group requires a majority (quorum) to function: with 3 managers, you can lose 1; with 5, you can lose 2.

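The outage described in this article happened because 4 managers were deployed. Four managers need a quorum of 3, so the cluster survives only one failure, exactly as a 3-manager cluster does. An even count also adds a second hazard: a network partition can split the managers 2-2, leaving neither side with a majority. When quorum is lost, the control plane freezes: running containers keep serving traffic, but nothing can be deployed, scaled, or rescheduled. This is why production Swarm clusters use an odd number of managers (3 or 5).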

The article walks through exactly how Raft consensus breaks with 4 managers and why that caused a 3-hour outage.

Beyond manager count, Swarm provides built-in service discovery, load balancing, and overlay networking via VXLAN. You can pin services to specific nodes with placement constraints, set CPU/memory limits, and perform rolling updates with health checks and automatic rollback.

Secrets are encrypted at rest and in transit, stored in the Raft log, and mounted into containers as in-memory (tmpfs) files. Configs are stored in the Raft log as well, but they are not encrypted at rest and are mounted directly into the container's filesystem. Both are immutable: you rotate one by creating a new object and updating the service to reference it, with no image rebuild. Swarm is not as feature-rich as Kubernetes (no custom resource definitions, no built-in service mesh, no built-in autoscaling), but for teams that need a simple, reliable orchestrator with minimal moving parts, it's a solid choice.

Just don't use an even number of managers.

Plain-English First

Imagine a restaurant chain with one head office (the manager) and ten kitchens across the city (the workers). A customer order comes in — the head office decides which kitchen handles it, monitors the food being made, and if one kitchen burns down, it quietly reroutes the order to another kitchen without the customer ever knowing. Docker Swarm is exactly that: one command-and-control brain (the manager node) coordinating a fleet of worker nodes, making sure your containers keep running no matter what breaks.

Every production app eventually outgrows a single server. Traffic spikes, hardware fails, deployments need to happen without downtime. Docker Swarm is the native clustering and orchestration layer baked directly into the Docker Engine.

Swarm solves coordination across multiple hosts. When you have ten nodes, you need something to decide where a container lands, what happens when a node dies, how containers on different hosts communicate, and how you push a new image without dropping requests. Swarm encodes those answers into a distributed state machine backed by the Raft consensus algorithm.

Common misconceptions: Swarm is not deprecated (Docker continues to maintain it alongside Compose). Swarm is not Kubernetes-lite (it has a fundamentally different architecture — no pods, no CRDs, no etcd). Swarm's simplicity is its strength for small-to-medium deployments that do not need Kubernetes' complexity.

Why Docker Swarm's Manager Count Matters More Than You Think

Docker Swarm is a container orchestration engine built into Docker Engine that groups multiple hosts into a single virtual cluster. Its core mechanic is the Raft consensus algorithm: manager nodes elect a leader to coordinate all cluster state changes. Every service definition, secret, and configuration update must pass through the leader, which replicates it to a majority of managers before it's committed.

Swarm tolerates up to floor((N-1)/2) manager failures, so an even N buys nothing over N-1. With 4 managers, quorum is 3: a single failure drops you to 3, which is still a majority. But if another fails, you're at 2, which is not a majority, and the control plane freezes. No deployments, no scaling, no rescheduling of failed tasks. Existing containers keep running, but the cluster is brain-dead. Raft requires a strict majority of all configured managers, not just the ones currently online.

Use Swarm when you need a simple, low-overhead orchestrator for a small-to-medium cluster (under 50 nodes) and you want zero external dependencies — no etcd, no ZooKeeper. It's ideal for teams that already run Docker and need basic HA without the operational complexity of Kubernetes. But the manager count is not a scaling knob; it's a fault-tolerance decision. Run 3 or 5, never 4.

Even-numbered managers are a trap
A 4-manager cluster has the same fault tolerance as a 3-manager cluster (one failure), but with four nodes there are more ways for two of them to be down at once. The extra manager makes quorum loss more likely, not less.
Production Insight
A team added a fourth manager for 'extra capacity' during a holiday sale.
The cluster lost quorum after two managers went down for routine patching — all deployments froze for 3 hours.
Always run an odd number of managers; even numbers increase failure risk without adding tolerance.
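A pre-maintenance check would have caught this before the second manager was taken down. The following is a minimal sketch, not a Docker feature: `safe_to_stop` is a hypothetical helper that only does the quorum arithmetic and never talks to the Docker daemon.

```shell
#!/bin/bash
# Pre-maintenance guard: would stopping some managers cost the cluster its quorum?
# safe_to_stop is a hypothetical helper; it only does arithmetic.

safe_to_stop() {
  local total=$1 reachable=$2 stopping=$3
  local quorum=$(( total / 2 + 1 ))               # majority of ALL configured managers
  [ $(( reachable - stopping )) -ge "$quorum" ]   # exit status 0 when quorum survives
}

# The incident: 4 managers, all reachable, 2 taken down for patching.
if safe_to_stop 4 4 2; then
  echo "ok to proceed"
else
  echo "refusing: stopping 2 of 4 managers would break quorum"
fi
```

In practice the first two arguments could be derived from the MANAGER STATUS column of docker node ls --filter role=manager. Patching one manager at a time and waiting for it to return as Reachable keeps the check passing.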
Key Takeaway
Swarm uses Raft consensus — a majority of managers must be alive for any state change.
3 managers tolerate 1 failure; 5 tolerate 2; 4 tolerate only 1, with one more node that can fail.
Manager count is about fault tolerance, not performance — never run 2, 4, or 6 managers.
[Figure: Docker Swarm Manager Count and Raft Consensus. Why 4 managers caused a 3-hour outage due to Raft quorum loss. Panels: Raft Consensus, Manager Nodes, 4 Managers, Quorum Loss, 3-Hour Outage, Best Practice.]

Raft Consensus and Manager Node Architecture

Swarm's cluster state is stored in a distributed log managed by the Raft consensus algorithm. Every manager node runs a full copy of the Raft log. State changes (service updates, node joins, secret creation) are proposed by the leader, replicated to a quorum of followers, and then committed.

The quorum formula is floor(n/2) + 1, where n is the number of managers. With 3 managers, quorum is 2. With 5 managers, quorum is 3. The cluster can tolerate floor((n-1)/2) manager failures. With 3 managers, you can lose 1. With 5 managers, you can lose 2.

An even number of managers provides no additional fault tolerance over the next lower odd number. With 4 managers, quorum is 3 — you can still only lose 1 manager, same as with 3 managers. The 4th node is wasted.
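The arithmetic above is easy to tabulate. Here is a minimal Bash sketch; the function names `quorum` and `tolerated_failures` are illustrative and not part of the Docker CLI.

```shell
#!/bin/bash
# Quorum arithmetic for a Raft group of n managers.
# quorum and tolerated_failures are hypothetical helper names for illustration.

quorum() {               # floor(n/2) + 1
  echo $(( $1 / 2 + 1 ))
}

tolerated_failures() {   # floor((n-1)/2)
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 2 3 4 5 6 7; do
  printf 'managers=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

The rows for 3 and 4 managers report the same number of tolerated failures, as do the rows for 5 and 6. Each even count raises the quorum without raising the tolerance.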

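Leader election: When the leader fails or becomes unreachable, the remaining managers hold an election. Each follower waits for a randomized election timeout. The first one to time out becomes a candidate, and it wins if a majority of managers vote for it, which they do only when its Raft log is at least as up to date as their own. Swarm expresses these timers in ticks (see Heartbeat Tick and Election Tick in the docker info output). A network partition cannot produce two working leaders: only the side that holds a quorum can elect a leader and commit new state changes, while the minority side stops accepting changes.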

Failure scenario — manager resource starvation: A team ran a memory-intensive batch job on a manager node. The job consumed all available RAM, causing the Docker daemon to be OOM-killed. The daemon restart triggered a Raft leader election. During the election window (1-2 seconds), no state changes could be committed. The team noticed brief delays in service updates. The fix: cordon manager nodes from workloads using docker node update --availability drain <manager-node>.

io/thecodeforge/swarm-manager-setup.sh (Bash)
#!/bin/bash
# Swarm cluster bootstrap with proper manager configuration

# ── Initialize the Swarm on the first manager ─────────────────────
# --advertise-addr: the IP other nodes will use to reach this manager
# --listen-addr: the interface the manager binds to
docker swarm init \
  --advertise-addr 10.0.1.10 \
  --listen-addr 0.0.0.0:2377 \
  --data-path-addr 10.0.1.10

# Get the join tokens
docker swarm join-token manager  # For other managers
docker swarm join-token worker   # For workers

# ── Join additional managers (run on each new manager node) ────────
docker swarm join \
  --token SWMTKN-1-xxxxx-manager-token-xxxxx \
  --advertise-addr 10.0.1.11 \
  10.0.1.10:2377

# ── Verify manager count (should be 3 or 5, never even) ──────────
docker node ls --filter role=manager
# ID    HOSTNAME   STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
# abc * manager-1  Ready     Active         Leader           24.0.7
# def   manager-2  Ready     Active         Reachable        24.0.7
# ghi   manager-3  Ready     Active         Reachable        24.0.7

# ── Drain manager nodes to prevent workloads from running on them ─
for node in $(docker node ls --filter role=manager -q); do
  docker node update --availability drain "$node"
done
# Drained managers cannot run tasks — they are dedicated to orchestration

# ── Check Raft cluster health ─────────────────────────────────────
docker info --format '{{.Swarm.Managers}} managers, {{.Swarm.Nodes}} total'
# Or inspect the Raft status on each manager
docker node inspect self --format '{{.ManagerStatus.Leader}}'
# true on the leader, false on followers

# ── Configure auto-lock (protect the key that encrypts Raft logs) ─
docker swarm update --autolock=true
# This requires unlocking the swarm after daemon restart:
# docker swarm unlock
# Enter unlock key: SWMKEY-1-xxxxx
Output
Swarm initialized: current node (abc123) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-xxxxx 10.0.1.10:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
abc * manager-1 Ready Active Leader 24.0.7
def manager-2 Ready Active Reachable 24.0.7
ghi manager-3 Ready Active Reachable 24.0.7
Raft as a Committee Vote
  • Quorum = floor(n/2) + 1. With 3 managers, quorum is 2. With 4 managers, quorum is 3.
  • With 3 managers, you can lose 1 and still have quorum (2 >= 2).
  • With 4 managers, you can lose 1 and still have quorum (3 >= 3). But losing 2 breaks quorum (2 < 3).
  • The 4th manager adds cost (server, maintenance) without adding fault tolerance. Always use 3 or 5.
Production Insight
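Swarm encrypts the Raft log at rest by default, but the key that decrypts it is stored on the manager's disk next to the data. The autolock feature (--autolock=true) encrypts that key with an unlock key that you hold. Without autolock, anyone with access to the manager's disk can recover the key and read the Raft data, which includes secrets and service definitions. The trade-off: after a daemon restart, you must enter the unlock key with docker swarm unlock. Automate this with a secrets manager or a secure boot script.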
Key Takeaway
Raft consensus requires a quorum of managers to agree on state changes. Always use an odd number of managers (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes — resource contention can starve the Raft process.
Manager Node Count Decision
If: Small cluster, 1-10 nodes, budget-conscious
Use: 3 managers. Tolerates 1 failure. Minimal overhead.
If: Medium cluster, 10-100 nodes, production-critical
Use: 5 managers. Tolerates 2 failures, at the cost of more replication work per commit (quorum of 3 instead of 2).
If: Large cluster, 100+ nodes
Use: Consider migrating to Kubernetes. Swarm's consensus model does not scale well beyond ~100 nodes.
If: Development/testing environment
Use: 1 manager is sufficient. No quorum concerns. Not suitable for production.

Service Scheduling, Placement Constraints and Resource Limits

A Swarm service is a declarative specification of the desired state: which image to run, how many replicas, resource limits, placement constraints, and update policy. The Swarm scheduler assigns tasks (individual containers) to nodes that satisfy the constraints and have available resources.

Scheduling algorithm: Swarm uses a spread scheduler by default — it places tasks on the node with the fewest existing tasks of the same service. This provides natural load distribution. You can override this with placement constraints and preferences.
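The spread idea can be modelled in a few lines. This is a simplified sketch, not Swarm's actual scheduler: the node names, the starting counts, and the `pick_index` helper are all assumptions for illustration, and the real scheduler also weighs constraints, resources, and preferences.

```shell
#!/bin/bash
# Simplified model of spread placement: each new task goes to the node that
# currently runs the fewest tasks of the service.
# nodes, counts, and pick_index are hypothetical names for illustration.

nodes=(worker-1 worker-2 worker-3)
counts=(0 0 0)   # tasks of this service per node, same order as nodes

pick_index() {
  local best=0 i
  for i in "${!counts[@]}"; do
    # Strict comparison, so ties go to the earliest node in the list.
    if (( counts[i] < counts[best] )); then
      best=$i
    fi
  done
  echo "$best"
}

for replica in 1 2 3 4 5 6; do
  i=$(pick_index)
  counts[i]=$(( counts[i] + 1 ))
  echo "task $replica -> ${nodes[i]}"
done
```

With three empty nodes and six replicas, the model ends with two tasks on each node, which is the even distribution described above.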

Placement constraints: Hard requirements that a node must satisfy. Examples:
  • node.role==manager: only run on manager nodes
  • node.labels.zone==us-east-1a: only run in a specific availability zone
  • node.hostname==worker-3: pin to a specific node

Placement preferences: Soft preferences that guide scheduling but do not prevent placement. Example: --placement-pref 'spread=node.labels.zone' distributes tasks evenly across zones.

Resource limits:
  • --limit-cpu: maximum CPU a task can consume (e.g., 0.5 = half a core)
  • --limit-memory: maximum memory (e.g., 512m)
  • --reserve-cpu: guaranteed CPU allocation
  • --reserve-memory: guaranteed memory allocation

Without resource limits, a single misbehaving container can consume all resources on a node, starving other tasks. Resource reservations ensure critical services always have the resources they need.
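Reservations and limits answer different questions: the scheduler places tasks according to reservations, while limits cap what a running task may use. The sketch below works through the gap between the two. The 4096 MB worker is an assumed figure, and the 128m reservation and 512m limit match the values used in this article's deployment example.

```shell
#!/bin/bash
# Reservations drive scheduling; limits cap runtime usage.
# node_mb is an assumed figure for a hypothetical worker.

node_mb=4096      # memory available to tasks on the worker (assumption)
reserve_mb=128    # --reserve-memory per task
limit_mb=512      # --limit-memory per task

# Tasks the scheduler could place before reservations are exhausted.
schedulable=$(( node_mb / reserve_mb ))

# Memory those tasks could consume if every one grew to its limit.
worst_case_mb=$(( schedulable * limit_mb ))

echo "schedulable tasks by reservation: $schedulable"
echo "worst-case usage at the limit: ${worst_case_mb} MB on a ${node_mb} MB node"
```

The wider the gap between reservation and limit, the more a node can be overcommitted. For critical services, a reservation close to the limit trades density for predictability.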

Failure scenario — no resource limits, noisy neighbor: A team deployed a memory-intensive analytics service without --limit-memory. The service gradually consumed all available RAM on a worker node. The kernel OOM-killed other containers on the same node, including a critical payment service. The payment service was rescheduled to another node (Swarm's self-healing), but the 30-second rescheduling delay caused a brief payment outage. The fix: add --limit-memory to all services and --reserve-memory for critical services.

io/thecodeforge/swarm-service-deploy.sh (Bash)
#!/bin/bash
# Production-grade service deployment with constraints and resource limits
# Assumes the overlay network io-thecodeforge-overlay and the secret
# io-thecodeforge-db-password already exist, and DATABASE_URL is set in this shell.

# ── Label nodes first so the placement constraints can be satisfied ─
docker node update --label-add zone=us-east-1a worker-1
docker node update --label-add zone=us-east-1b worker-2
docker node update --label-add zone=us-east-1a worker-3
docker node update --label-add disk=ssd worker-1
docker node update --label-add disk=ssd worker-2
# worker-3 needs the label too, or the disk==ssd constraint excludes it
docker node update --label-add disk=ssd worker-3

# ── Deploy a service with full production settings ────────────────
# Flags live in an array because Bash does not allow comments between
# backslash-continued lines. The image is a positional argument, not --image.
service_args=(
  --name io-thecodeforge-api
  --replicas 6

  # Resource limits: hard ceiling
  --limit-cpu 1.0
  --limit-memory 512m

  # Resource reservations: guaranteed allocation
  --reserve-cpu 0.25
  --reserve-memory 128m

  # Placement: spread across availability zones
  --placement-pref 'spread=node.labels.zone'

  # Constraint: never run on manager nodes
  --constraint 'node.role!=manager'

  # Constraint: only run on nodes with SSD label
  --constraint 'node.labels.disk==ssd'

  # Health check
  --health-cmd 'curl -f http://localhost:8080/health || exit 1'
  --health-interval 10s
  --health-timeout 5s
  --health-retries 3
  --health-start-period 30s

  # Rolling update policy
  --update-parallelism 2
  --update-delay 10s
  --update-failure-action rollback
  --update-order start-first

  # Rollback policy
  --rollback-parallelism 1
  --rollback-delay 5s

  # Network
  --network io-thecodeforge-overlay
  --publish published=8080,target=8080

  # Environment (value taken from the deploying shell)
  --env "DATABASE_URL=${DATABASE_URL}"
  --secret io-thecodeforge-db-password

  # Restart policy
  --restart-condition on-failure
  --restart-delay 5s
  --restart-max-attempts 3
  --restart-window 60s
)
docker service create "${service_args[@]}" io.thecodeforge/api:v2.3.1

# ── Verify placement ──────────────────────────────────────────────
docker service ps io-thecodeforge-api --format '{{.Name}} {{.Node}} {{.CurrentState}}'
# io-thecodeforge-api.1  worker-1  Running
# io-thecodeforge-api.2  worker-2  Running
# io-thecodeforge-api.3  worker-1  Running
# io-thecodeforge-api.4  worker-2  Running
# io-thecodeforge-api.5  worker-3  Running
# io-thecodeforge-api.6  worker-3  Running
Output
overall progress: 6 out of 6 tasks
1/6: running [==================================================>]
2/6: running [==================================================>]
3/6: running [==================================================>]
4/6: running [==================================================>]
5/6: running [==================================================>]
6/6: running [==================================================>]
verify: Service converged
Scheduling as Hotel Room Assignment
  • Constraints are hard requirements. If no node satisfies the constraint, the task stays in 'Pending' state forever.
  • Preferences are soft guidelines. Swarm tries to satisfy them but can place the task on any node if no preference match exists.
  • Use constraints for critical requirements: 'must run on SSD', 'must not run on managers'.
  • Use preferences for optimization: 'prefer to spread across zones', 'prefer nodes with fewer tasks'.
Production Insight
The --update-order start-first flag starts the new container before stopping the old one. This provides zero-downtime deployments but temporarily raises resource usage, because old and new tasks overlap. With --limit-memory 512m and 6 replicas, steady state is up to 3 GB. During an update the ceiling rises by 512 MB for every task updated in parallel: about 4 GB with --update-parallelism 2, and up to 6 GB if all six are replaced at once. Ensure your cluster has enough headroom for rolling updates. If headroom is limited, use stop-first order instead.
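The headroom needed for a start-first update depends on how many tasks are replaced in parallel. A rough model, assuming every task may reach its memory limit and that at most --update-parallelism extra tasks exist at any moment (`peak_memory_mb` is a hypothetical helper):

```shell
#!/bin/bash
# Peak memory ceiling during a start-first rolling update (simplified model).
# peak_memory_mb is a hypothetical helper, not a Docker command.

peak_memory_mb() {
  local replicas=$1 parallelism=$2 limit_mb=$3
  echo $(( (replicas + parallelism) * limit_mb ))
}

echo "steady state:        $(( 6 * 512 )) MB"
echo "update, 2 at a time: $(peak_memory_mb 6 2 512) MB"
echo "update, 6 at a time: $(peak_memory_mb 6 6 512) MB"
```

Doubling is the worst case, reached only when every replica is updated at once. Smaller batches lower the peak at the cost of a longer rollout.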
Key Takeaway
Always set resource limits on production services. Without limits, a single misbehaving container can OOM-kill other containers on the same node. Use placement constraints to isolate critical services and spread across availability zones. The spread scheduler distributes tasks evenly by default.
Resource Limit Strategy
If: Stateless web API with predictable resource usage
Use: Set --limit-cpu and --limit-memory based on load testing. Use --reserve-memory for critical services.
If: Memory-intensive batch processing
Use: Set generous --limit-memory but low --limit-cpu. Use placement constraints to isolate on dedicated nodes.
If: Latency-sensitive service (trading, real-time)
Use: Reserve CPU with --reserve-cpu. Consider host-mode publishing to bypass routing mesh. Pin to dedicated nodes.
If: Development/testing
Use: Skip resource limits. They add complexity without benefit in non-production environments.

Overlay Networks and Cross-Host Container Communication

Docker Swarm uses overlay networks to enable containers on different hosts to communicate as if they were on the same network. The overlay network uses VXLAN (Virtual Extensible LAN) encapsulation to tunnel Layer 2 traffic over the underlying Layer 3 network.

How it works: When container A on node 1 sends a packet to container B on node 2, the VXLAN driver encapsulates the packet in a UDP datagram on port 4789 and sends it to node 2. Node 2 decapsulates the packet and delivers it to container B. The containers see each other's overlay IP addresses as if they were on the same LAN.

The ingress routing mesh: When you publish a port with --publish, Swarm creates a route in the ingress network that load-balances incoming traffic across all nodes running the service. Any node in the cluster can receive traffic for any service, regardless of whether that node is running the service's containers. The routing mesh forwards the traffic to a node that is running a healthy task.

The extra-hop problem: The routing mesh adds one network hop. A request to node 1 may be routed to a container on node 3. This adds latency. For latency-sensitive services, use host-mode publishing: --publish published=8080,target=8080,mode=host. This bypasses the routing mesh and binds directly to the host's port. The trade-off: only nodes running the service's containers accept traffic — you lose the any-node routing benefit.

Failure scenario — VXLAN port blocked by firewall: A team deployed a 3-node Swarm cluster across two data centers. Containers in data center A could not reach containers in data center B. The team spent 4 hours debugging DNS, service discovery, and overlay configuration. The root cause: the firewall between data centers blocked UDP port 4789 (VXLAN). After opening the port, overlay connectivity was restored immediately.
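A checklist of the ports Swarm needs between nodes makes this class of failure quicker to rule out. The sketch below prints firewall rules rather than applying them. It assumes ufw as the host firewall, which the article does not specify, and `print_swarm_rules` is a hypothetical helper name.

```shell
#!/bin/bash
# Print (not apply) firewall rules for the ports Swarm needs between nodes.
# Assumes ufw as the host firewall; adapt the syntax for other firewalls.
# Encrypted overlays additionally need IP protocol 50 (ESP), not covered here.

print_swarm_rules() {
  echo "ufw allow 2377/tcp   # cluster management (Raft)"
  echo "ufw allow 7946/tcp   # node gossip"
  echo "ufw allow 7946/udp   # node gossip"
  echo "ufw allow 4789/udp   # VXLAN overlay data plane"
}

print_swarm_rules
```

Printing the rules first keeps the review step explicit. The same list applies to any firewall that sits between data centers, not only to the hosts themselves.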

io/thecodeforge/swarm-networking.sh (Bash)
#!/bin/bash
# Overlay network setup and troubleshooting

# ── Create an overlay network with encryption ─────────────────────
docker network create \
  --driver overlay \
  --attachable \
  --opt encrypted \
  --subnet 10.0.10.0/24 \
  io-thecodeforge-overlay
# --driver overlay: VXLAN-based cross-host networking
# --attachable: allows standalone containers to join (useful for debugging)
# --opt encrypted: encrypts VXLAN traffic with IPsec (adds ~10% overhead)
# --subnet: explicit IP range for the overlay network

# ── Deploy a service on the overlay network ───────────────────────
docker service create \
  --name api \
  --network io-thecodeforge-overlay \
  --replicas 3 \
  io.thecodeforge/api:v2.3.1

# ── Verify overlay network peers (should list all nodes) ──────────
docker network inspect io-thecodeforge-overlay --format '{{json .Peers}}' | python3 -m json.tool
# Each peer represents a node participating in the overlay network
# If a peer is missing, that node cannot communicate on the overlay

# ── Test cross-host connectivity ──────────────────────────────────
# From any node, run a debug container on the overlay network
docker run --rm -it --network io-thecodeforge-overlay alpine sh
# Inside the container:
# ping <overlay-ip-of-service-task>
# nslookup tasks.api  # DNS round-robin for all service tasks

# ── Required ports for Swarm networking ───────────────────────────
# TCP 2377: Swarm cluster management (Raft)
# TCP/UDP 7946: Gossip-based node discovery
# UDP 4789: VXLAN overlay network traffic
# Protocol 50 (ESP): IPsec encryption (if --opt encrypted)

# ── Host-mode publishing (bypass routing mesh) ────────────────────
docker service create \
  --name api-latency-sensitive \
  --network io-thecodeforge-overlay \
  --publish published=8080,target=8080,mode=host \
  --mode global \
  io.thecodeforge/api:v2.3.1
# mode=global: one task per node (every node runs the service)
# mode=host: binds directly to host port 8080, no routing mesh hop
Output
Network io-thecodeforge-overlay created
[
{
"Name": "manager-1",
"IP": "10.0.1.10"
},
{
"Name": "worker-1",
"IP": "10.0.1.11"
},
{
"Name": "worker-2",
"IP": "10.0.1.12"
}
]
# All 3 peers are present — overlay network is healthy
Overlay Network as a Virtual Office Floor
When host-mode publishing fits:
  • Latency-sensitive services where the extra routing mesh hop adds unacceptable delay.
  • Services that need to bind to specific host ports for external load balancer integration.
  • Services running in --mode global (one per node) where every node already has a container.
  • Trade-off: you lose the any-node routing benefit. Traffic only reaches nodes running the service.
Production Insight
The --opt encrypted flag adds IPsec encryption to VXLAN traffic. This is important for multi-data-center or cloud deployments where traffic crosses untrusted networks. The overhead is approximately 10% throughput reduction and slightly higher CPU usage. For single-data-center deployments on a trusted network, skip encryption to avoid the overhead.
Key Takeaway
Overlay networks use VXLAN on UDP port 4789. If this port is blocked by firewalls, containers on different nodes cannot communicate. The routing mesh adds one network hop — use host-mode publishing for latency-sensitive services. Always use --opt encrypted for cross-data-center overlays.

Rolling Updates, Rollback and Zero-Downtime Deployments

Swarm's rolling update mechanism replaces old containers with new ones incrementally, ensuring the service remains available throughout the deployment. The update configuration controls the pace and failure behavior.

Update parameters:
  • --update-parallelism: how many tasks to update simultaneously (default: 1)
  • --update-delay: wait time between updating batches (default: 0s)
  • --update-failure-action: what to do if a new task fails (pause, continue, or rollback; default: pause)
  • --update-order: start-first (new container starts before old stops) or stop-first (old stops before new starts; the default)
  • --update-max-failure-ratio: fraction of failed tasks to tolerate before the failure action triggers (default: 0)

The start-first vs stop-first trade-off:
  • start-first: zero downtime, but extra resource usage during deployment because old and new tasks overlap
  • stop-first: lower resource usage, but a brief window where fewer replicas are running
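The pacing of an update follows from the parallelism and delay settings. A rough sketch, ignoring task start-up and health-check time (`update_batches` is a hypothetical helper):

```shell
#!/bin/bash
# Rough pacing of a rolling update: number of batches and the minimum time
# spent waiting in --update-delay between them.
# update_batches is a hypothetical helper, not a Docker command.

update_batches() {
  local replicas=$1 parallelism=$2
  echo $(( (replicas + parallelism - 1) / parallelism ))   # ceiling division
}

replicas=6
parallelism=2
delay_s=10

batches=$(update_batches "$replicas" "$parallelism")
echo "batches: $batches"
echo "minimum total delay between batches: $(( (batches - 1) * delay_s ))s"
```

Larger batches shorten the rollout but put more replicas at risk at once. A non-zero delay gives monitoring time to catch a bad batch before the next one starts.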

Rollback: If a rolling update fails and --update-failure-action rollback is set, Swarm automatically rolls back to the previous version. The rollback configuration mirrors the update configuration. Manual rollback: docker service rollback <service>.

Failure scenario: a rolling update with no safety net causes a cascading failure. A team deployed a new API version with a startup bug and had not configured --health-start-period, so health checks began failing before the app was ready and Swarm marked each new task as failed. The service had been set to --update-failure-action continue (the default is pause), so Swarm kept replacing healthy containers with the failing version. Within 2 minutes, all containers were running the broken version. The fix: set --update-failure-action rollback and configure --health-start-period to allow startup time.

io/thecodeforge/swarm-rolling-update.sh (Bash)
#!/bin/bash
# Zero-downtime rolling update with automatic rollback
# Assumes the overlay network io-thecodeforge-overlay already exists.

# ── Initial deployment ────────────────────────────────────────────
# Flags live in an array because Bash does not allow comments between
# backslash-continued lines. The image is a positional argument, not --image.
create_args=(
  --name io-thecodeforge-api
  --replicas 6
  --limit-cpu 1.0
  --limit-memory 512m
  --health-cmd 'curl -f http://localhost:8080/health || exit 1'
  --health-interval 10s
  --health-timeout 5s
  --health-retries 3
  --health-start-period 40s

  # Rolling update: 2 at a time, 10s delay, auto-rollback on failure
  --update-parallelism 2
  --update-delay 10s
  --update-failure-action rollback
  --update-max-failure-ratio 0.25
  --update-order start-first

  # Rollback policy
  --rollback-parallelism 1
  --rollback-delay 5s
  --rollback-order stop-first

  --network io-thecodeforge-overlay
  --publish published=8080,target=8080
)
docker service create "${create_args[@]}" io.thecodeforge/api:v2.3.0

# ── Rolling update to new version ─────────────────────────────────
docker service update \
  --image io.thecodeforge/api:v2.3.1 \
  --update-parallelism 2 \
  --update-delay 10s \
  io-thecodeforge-api

# ── Monitor the update progress ───────────────────────────────────
docker service ps io-thecodeforge-api \
  --format '{{.Name}} {{.Image}} {{.CurrentState}} {{.Error}}' \
  | head -20
# You will see old tasks shutting down and new tasks starting

# ── Manual rollback if needed ─────────────────────────────────────
docker service rollback io-thecodeforge-api
# Reverts to the previous image and configuration

# ── Force update (redeploy without changing image) ────────────────
docker service update --force io-thecodeforge-api
# Useful when container config has changed but image tag is the same
Output
overall progress: 6 out of 6 tasks
1/6: running [==================================================>]
2/6: running [==================================================>]
3/6: running [==================================================>]
4/6: running [==================================================>]
5/6: running [==================================================>]
6/6: running [==================================================>]
verify: Service converged
# Rolling update in progress:
# api.1 io.thecodeforge/api:v2.3.1 Running
# api.2 io.thecodeforge/api:v2.3.1 Running
# api.3 io.thecodeforge/api:v2.3.0 Running (waiting for delay)
# api.4 io.thecodeforge/api:v2.3.0 Running (waiting for delay)
# api.5 io.thecodeforge/api:v2.3.0 Running
# api.6 io.thecodeforge/api:v2.3.0 Running
Rolling Update as Renovating a Hotel Floor by Floor
  • Without rollback, a failing update continues replacing all healthy containers with the broken version.
  • With rollback, Swarm detects failures and automatically reverts to the previous working version.
  • The --update-max-failure-ratio flag controls the failure threshold. 0.25 means 25% failure triggers rollback.
  • Always pair rollback with health checks. Without health checks, Swarm cannot detect a broken container.
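To see what a ratio of 0.25 means for a 6-replica service, the check can be written with integer math. This sketch assumes the failure action fires once the failed fraction is strictly greater than the ratio. That reading of "tolerate" is an assumption, and `exceeds_failure_ratio` is a hypothetical helper.

```shell
#!/bin/bash
# Does a given number of failed tasks exceed --update-max-failure-ratio?
# Assumption: the action fires when failed/total is strictly greater than the ratio.
# The ratio is passed as a percentage to avoid floating point in Bash.

exceeds_failure_ratio() {
  local failed=$1 total=$2 ratio_percent=$3
  [ $(( failed * 100 )) -gt $(( ratio_percent * total )) ]
}

for failed in 1 2; do
  if exceeds_failure_ratio "$failed" 6 25; then
    echo "$failed of 6 failed: failure action triggers"
  else
    echo "$failed of 6 failed: tolerated"
  fi
done
```

Under this assumption, one failed task out of six stays below the threshold and a second one crosses it. With the default ratio of 0, the first failure already triggers the action.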
Production Insight
The --health-start-period flag is essential for services with slow startup times (JVM warmup, database migrations, cache hydration). Without it, the health check runs immediately and may fail before the application is ready, triggering an unnecessary rollback. Set it to the expected maximum startup time plus a buffer.
Key Takeaway
Always set --update-failure-action rollback in production. Without it, a broken update replaces all healthy containers. Use --health-start-period for services with slow startup. start-first provides zero downtime but doubles resource usage during deployment — ensure cluster headroom.

Swarm Secrets and Configs — Immutable, Encrypted, Rotatable

Docker Swarm provides built-in secrets management through the Raft log. Secrets are encrypted at rest and in transit, mounted as files in /run/secrets/ inside containers, and never written to image layers.

How secrets work: - docker secret create: stores the secret in the Raft log (encrypted with the swarm unlock key) - The secret is distributed to every manager node (encrypted) - When a service references a secret, it is mounted as a file at /run/secrets/<secret-name> - Secrets are immutable — updating a secret creates a new version

How configs work: - docker config create: stores configuration files in the Raft log - Configs are mounted as files in the container (not encrypted at rest — use secrets for sensitive data) - Useful for nginx.conf, application.yaml, or any configuration file

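Secret rotation: because secrets are immutable, rotating one means swapping it for a new secret.
  1. Create a new secret: docker secret create db-password-v2 - (the trailing - reads the value from stdin).
  2. Update the service to use the new secret: docker service update --secret-rm db-password --secret-add db-password-v2 <service>
  3. The service's tasks restart with the new secret mounted at /run/secrets/db-password-v2. To keep the original path, add it as --secret-add source=db-password-v2,target=db-password instead.
  4. Delete the old secret: docker secret rm db-password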

Failure scenario: secret not updating in a running service. A team rotated a database password by creating a new secret and updating the service, yet the application inside the container still read the old password from /run/secrets/db-password. The team had not realized that secrets are immutable: creating a new secret changes nothing for the old one, and the old secret file stays mounted until the service is explicitly updated to remove it. The fix is to pass --secret-rm for the old secret and --secret-add for the new one in the same docker service update command.
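The rotation above changes the file name inside the container unless the new secret is given an explicit target. The sketch below is a dry-run helper that only prints the update command, so it can be reviewed before it is run against a cluster. The function name rotation_cmd and the example secret and service names are hypothetical; the source=...,target=... form of --secret-add is the part that keeps the mount path stable.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the docker command that swaps an old secret for a
# new one while keeping the ORIGINAL file name under /run/secrets/.
# Nothing is executed against Docker here.

# rotation_cmd <service> <old-secret> <new-secret>
rotation_cmd() {
  local service="$1" old="$2" new="$3"
  # target=<old> mounts the new secret at /run/secrets/<old>,
  # so the application does not need a config change.
  printf 'docker service update --secret-rm %s --secret-add source=%s,target=%s %s\n' \
    "$old" "$new" "$old" "$service"
}

# Example with hypothetical names:
rotation_cmd io-thecodeforge-api db-password db-password-v2
```

Printing the command rather than running it is a deliberate choice for a sketch like this: the service restarts its tasks on update, so the command is worth reading before it is executed.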

io/thecodeforge/swarm-secrets.shBASH
#!/bin/bash
# Secrets management in Docker Swarm

# ── Create a secret from stdin ────────────────────────────────────
# printf avoids storing the trailing newline that echo would append
printf '%s' 's3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password -

# ── Create a secret from a file ───────────────────────────────────
docker secret create io-thecodeforge-tls-cert /path/to/cert.pem

# ── Create a config (non-sensitive configuration) ─────────────────
docker config create io-thecodeforge-nginx-conf /path/to/nginx.conf

# ── Deploy a service with secrets and configs ─────────────────────
docker service create \
  --name io-thecodeforge-api \
  --secret io-thecodeforge-db-password \
  --secret io-thecodeforge-tls-cert \
  --config source=io-thecodeforge-nginx-conf,target=/etc/nginx/nginx.conf \
  io.thecodeforge/api:v2.3.1

# ── Access secrets inside the container ────────────────────────────
docker exec <container> cat /run/secrets/io-thecodeforge-db-password
# Output: s3cret_p@ssw0rd

docker exec <container> ls /run/secrets/
# io-thecodeforge-db-password
# io-thecodeforge-tls-cert

# ── Rotate a secret ───────────────────────────────────────────────
# Step 1: Create new version
printf '%s' 'new_s3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password-v2 -

# Step 2: Update service — remove old, add new
docker service update \
  --secret-rm io-thecodeforge-db-password \
  --secret-add io-thecodeforge-db-password-v2 \
  io-thecodeforge-api

# Step 3: Verify new secret is mounted
docker exec <container> cat /run/secrets/io-thecodeforge-db-password-v2
# Output: new_s3cret_p@ssw0rd

# Step 4: Clean up old secret
docker secret rm io-thecodeforge-db-password

# ── List all secrets ──────────────────────────────────────────────
docker secret ls
# ID          NAME                          CREATED
# abc123      io-thecodeforge-db-password   2 hours ago
# def456      io-thecodeforge-tls-cert      2 hours ago
Output
Secret io-thecodeforge-db-password created
Config io-thecodeforge-nginx-conf created
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service converged
s3cret_p@ssw0rd
Secrets as Sealed Envelopes
  • Secrets are encrypted at rest in the Raft log. ENV variables are stored in plaintext in container metadata.
  • Secrets are mounted as files, so their values do not appear in docker inspect, docker ps, or process listings (only the secret names do).
  • Secrets are distributed only to nodes running tasks that reference them. ENV variables are visible to anyone who can inspect the service or container.
  • Secrets are immutable and versioned. ENV variables can be accidentally changed or logged.
Production Insight
Docker secrets are Swarm-only. If you use standalone Docker (not Swarm), you must use alternative secrets management: Docker Compose secrets (file-based, not encrypted), HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets. Plan your secrets strategy before choosing an orchestration platform.
Key Takeaway
Docker secrets are encrypted, immutable, and mounted as files in /run/secrets/. Never use ENV for secrets — they are visible in docker inspect. To rotate a secret, create a new version and update the service with --secret-rm and --secret-add. Secrets are Swarm-only — standalone Docker requires alternative solutions.

Tasks and Services: The Two Abstractions You Can't Afford to Confuse

Newcomers treat 'service' and 'task' like synonyms. They're not. Get this wrong and your rolling updates will silently fail, your health checks will fire at ghosts, and you'll be debugging at 2 AM while your manager asks why production is serving 503s.

A Service is the declarative spec. You define the image, replicas, network, ports, resource limits — the desired state. Docker Swarm reconciles actual state to match. A Task is a running instance of that service. One replica = one task. When you scale to 10, you get 10 tasks, each with a unique ID tied to a specific node.

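Here's the nasty bit: tasks are ephemeral. They fail, get rescheduled, get replaced during updates. Your monitoring must track task IDs, not container names. If you're scraping logs by container name, you'll lose the trail after any reschedule. Tag your logs with the task ID and service name instead. Swarm does not inject these automatically, but it expands template placeholders such as {{.Task.ID}} and {{.Service.Name}} in a service's environment values, so you can expose them as environment variables.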

TaskServiceExample.ymlYAML
# io.thecodeforge - devops tutorial

version: '3.8'

services:
  auth-api:
    image: registry.thecodeforge.io/auth-api:v2.4.1
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: '0.5'
          memory: 256M
        reservations:
          cpus: '0.25'
          memory: 128M
    environment:
      - SERVICE_NAME=auth-api
      - LOG_FORMAT=json
      # Swarm expands this template for each task when it is scheduled
      - "TASK_ID={{.Task.ID}}"
    logging:
      driver: json-file
      options:
        tag: "{{.Name}}/{{.ID}}"
Output
Task ID: z8xk3v9m0n1a2b4c
Container: auth-api.1.z8xk3v9m0n1a2b4c
Node: worker-03
Status: Running 2h 7m 34s
Production Trap: Overlapping Task IDs During Rollback
During rollback, old and new task IDs coexist for seconds. Your log aggregator will see duplicate entries unless you filter by service version label or task creation timestamp.
Key Takeaway
A service is what you deploy. A task is what runs. Never confuse the declarative spec with the ephemeral instance.

Ports and Protocols: The Firewall Dance That Breaks Your Swarm

You've initialized your swarm, added workers, and everything works on your laptop. Then you deploy to bare metal in a colo and nodes can't talk to each other. Welcome to networking hell.

Swarm mode needs specific ports open between all nodes: not just manager to worker, but worker to worker and manager to manager. Cluster management and Raft traffic uses TCP port 2377. Node-to-node gossip uses port 7946 over both TCP and UDP. Overlay network traffic, including the ingress routing mesh, is carried in VXLAN on UDP port 4789.
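As a quick reference, the port list can be kept as data and turned into firewall rules. This is a minimal sketch that only prints ufw commands for review. It assumes ufw is the host firewall and that 10.0.0.0/24 is the subnet your nodes share; both are assumptions, as are the function names. The port and protocol pairs follow Docker's documented Swarm requirements.

```shell
#!/usr/bin/env bash
# Swarm port requirements, one "port/proto purpose" row per line.
swarm_ports() {
  cat <<'EOF'
2377/tcp cluster management and Raft
7946/tcp node gossip
7946/udp node gossip
4789/udp overlay VXLAN data
EOF
}

# firewall_rules <peer-subnet>
# Prints (does not run) one ufw rule per required port.
firewall_rules() {
  local subnet="$1" portproto port proto
  swarm_ports | while read -r portproto _; do
    port="${portproto%/*}"    # text before the slash
    proto="${portproto#*/}"   # text after the slash
    printf 'ufw allow from %s to any port %s proto %s\n' "$subnet" "$port" "$proto"
  done
}

# Example with an assumed node subnet:
firewall_rules 10.0.0.0/24
```

Restricting the rules to the node subnet, rather than opening the ports to everyone, keeps the cluster management port off the public internet.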

Here's what the docs won't scream at you: opening these ports on cloud firewalls isn't enough. If your nodes are in different subnets with network ACLs between them, VXLAN encapsulation might get dropped. Check your MTU too — overlay networks add 50 bytes of overhead. Standard 1500 MTU on the underlay will fragment packets if you're not careful, and some cloud providers drop fragments silently.
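The MTU arithmetic is simple enough to script. The sketch below subtracts the 50-byte VXLAN overhead from the underlay MTU and prints a network-create command using the result. The function name is hypothetical, and the com.docker.network.driver.mtu driver option is the one commonly used for this purpose; treat it as an assumption to verify against your Docker version.

```shell
#!/usr/bin/env bash
# VXLAN encapsulation overhead in bytes, as stated in the text above.
VXLAN_OVERHEAD=50

# overlay_mtu <underlay-mtu>
# Largest MTU the overlay can use without fragmenting on the underlay.
overlay_mtu() {
  local underlay_mtu="$1"
  echo $(( underlay_mtu - VXLAN_OVERHEAD ))
}

mtu="$(overlay_mtu 1500)"
echo "overlay MTU: $mtu"

# Printed for review only, not executed:
echo "docker network create --driver overlay --opt com.docker.network.driver.mtu=$mtu swarm-overlay"
```

On a jumbo-frame underlay the same function gives 8950 for an input of 9000.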

Test with a simple service that pings between nodes before you declare victory.

SwarmPortCheck.shBASH
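# io.thecodeforge - devops tutorial

# The overlay network must exist before the service can attach to it.
docker network create --driver overlay swarm-overlay

# tasks.netcheck resolves to the task IPs of the service. The bare service
# name resolves to its virtual IP, which may not answer ping.
docker service create \
  --name netcheck \
  --network swarm-overlay \
  --replicas 3 \
  alpine sh -c "while true; do \
    ping -c1 tasks.netcheck; sleep 5; done"

# Verify with:
docker service logs netcheck | grep -E "(time=|unreachable)"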
Output
64 bytes from 10.0.0.7: seq=0 ttl=64 time=0.853 ms
64 bytes from 10.0.0.9: seq=1 ttl=64 time=0.921 ms
64 bytes from 10.0.0.8: seq=2 ttl=64 time=0.887 ms
Senior Shortcut: Use Your Cloud Service Mesh Routing
Skip VXLAN headaches in multi-region setups. Deploy a global ingress service (like HAProxy or Envoy) on each region's swarm and route traffic via DNS-based load balancing. Overlay networks across data centers will destroy your latency budget.
Key Takeaway
Three ports control your swarm's life: 2377/tcp for cluster management and Raft, 7946/tcp+udp for gossip, 4789/udp for overlay. Miss any and your cluster degrades silently.

Three Host Machines — Don't Even Think About Fewer

Your swarm needs at least three manager nodes. Not two. Not one. Three. This is the minimum to survive a single node failure without losing the Raft quorum you read about earlier.

Why three? Raft consensus requires a majority. With three managers, you can lose one and still have two, which is a majority. With two managers, the quorum is also two, so losing one leaves a single manager that cannot form a majority. The swarm freezes. No scheduling, no updates, nothing. You're down. Production shops that run two managers are one disk failure away from a cluster-wide lockup.

Managers hold the cluster state. That state is distributed via Raft logs. Even if your applications run on worker nodes, the managers coordinate everything — service discovery, scheduling, scaling. Three hosts means you can reboot one for patches and the swarm keeps chewing. Anything less is gambling with your production pipeline.
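The majority rule is easy to check with a few lines of arithmetic. This sketch prints the quorum size and the number of manager failures a cluster can absorb for 1 to 7 managers. The function names are illustrative, and the formula is the standard Raft majority, floor(n/2) + 1.

```shell
#!/usr/bin/env bash
# quorum <managers>: smallest majority of n managers, floor(n/2) + 1
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerance <managers>: managers that can fail while a majority remains
tolerance() {
  echo $(( $1 - $(quorum "$1") ))
}

for n in 1 2 3 4 5 6 7; do
  printf '%d managers: quorum %d, tolerates %d failure(s)\n' \
    "$n" "$(quorum "$n")" "$(tolerance "$n")"
done
```

The table this prints shows why even counts buy nothing: 2 managers tolerate no failures, and 4 tolerate 1, the same as 3.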

docker-stack.ymlYAML
# io.thecodeforge - devops tutorial

version: '3.9'

services:
  api:
    image: internals-api:2.1.4
    deploy:
      mode: replicated
      replicas: 3
      placement:
        constraints:
          - node.role == worker
      resources:
        limits:
          cpus: '1.5'
          memory: 1024M
        reservations:
          cpus: '0.5'
          memory: 512M
    ports:
      - "8080:8080"
    # Attach the service to the overlay network declared below
    networks:
      - app-net

networks:
  app-net:
    driver: overlay
    attachable: true
Output
Service api-svc scheduled across 3 worker nodes
Manager nodes: 3 (node-m1, node-m2, node-m3)
Raft quorum: healthy (3/3)
Production Trap:
Docker Swarm allows a single-manager setup for development. Never take that to production. You will hit a network partition and your swarm will become a paperweight.
Key Takeaway
Three managers or go home. Two is a quorum bomb waiting to explode.

Don't Run Apps on Managers — That's Not Their Job

Managers run the control plane. They gossip cluster state, maintain Raft logs, and serve the Docker API. They are not compute nodes. You wouldn't run your web server on the Kubernetes control plane, so don't do it in Swarm.

By default, manager nodes are also eligible to run service tasks, so the scheduler treats them like any other node. You must explicitly drain managers or add placement constraints to force workloads onto worker nodes. The node.role == worker constraint in your compose file or service create command does exactly that. Without it, your API container could land on a manager during a rolling update, consuming CPU and memory that your cluster brain needs to stay responsive.

Separate concerns = separate node roles. Managers handle orchestration. Workers run containers. If one manager crashes under load because your app ate its memory, you lose not just that host but potentially the quorum. Keep managers lean, dedicated, and isolated from application traffic. Your future self — and your on-call team — will thank you.

service-constraint.ymlYAML
# io.thecodeforge - devops tutorial

version: '3.9'

services:
  payment-worker:
    image: payment-backend:3.0.1
    deploy:
      replicas: 4
      placement:
        constraints:
          - node.role == worker
    environment:
      - NODE_ENV=production
    networks:
      - backend-net

networks:
  backend-net:
    driver: overlay
Output
payment-worker scheduled on worker-node-01
payment-worker scheduled on worker-node-02
payment-worker scheduled on worker-node-03
payment-worker scheduled on worker-node-04
No managers used for workload
Senior Shortcut:
After initializing your swarm, immediately run docker node update --availability drain <manager-hostname> on all managers to prevent any service from landing there accidentally.
Key Takeaway
Managers orchestrate, workers execute. Never blur the line.
● Production incident · POST-MORTEM · severity: high

Cluster Quorum Loss After Losing 2 of 4 Manager Nodes: All Services Unreachable for 3 Hours

Symptom
After a planned data center maintenance window, the operations team could not deploy new services. docker service ls hung for 30 seconds then returned 'Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded'. Existing services continued running but were unreachable via the routing mesh. docker node ls on the surviving managers showed the 2 offline managers as 'Down' but 'Reachable' was false for all managers.
Assumption
Team assumed the offline managers would come back after maintenance and the cluster would self-heal. They waited 2 hours. The managers came back online, but the cluster was still unresponsive. They assumed a Docker daemon bug and considered rebuilding the entire cluster from scratch.
Root cause
With 4 manager nodes, the Raft quorum requires at least 3 managers to agree on any state change (quorum = floor(n/2) + 1 = floor(4/2) + 1 = 3). When 2 managers went offline, only 2 remained — insufficient for quorum. The Raft consensus algorithm froze. No new state changes could be committed. When the offline managers returned, they had stale Raft logs. The cluster needed manual intervention to re-establish consensus. The root design flaw was using an even number of managers (4) instead of an odd number (3 or 5).
Fix
  1. Demoted one offline manager to worker: docker node demote <node-id>. This reduced the manager count to 3, making quorum = 2, which the 2 surviving managers could satisfy.
  2. Promoted a worker to manager: docker node promote <worker-id>. This restored the manager count to 3 (odd).
  3. Added a monitoring alert for Raft quorum health: docker node ls | grep -c 'Leader\|Reachable' to detect quorum loss early.
  4. Documented the rule: always use 3 or 5 managers, never 4 or 6.
  5. Migrated critical services to Kubernetes for the long term, as the team's scale exceeded Swarm's sweet spot.
Note that docker node demote is itself a Raft write and is normally rejected while quorum is lost. If that happens, the recovery path Docker documents is docker swarm init --force-new-cluster on a surviving manager, followed by re-adding managers.
Key lesson
  • Always use an odd number of manager nodes: 3 or 5. An even number (4, 6) wastes a node without improving fault tolerance.
  • Quorum = floor(n/2) + 1. With 3 managers, you can lose 1. With 5 managers, you can lose 2. With 4 managers, you can still only lose 1 — the 4th node provides no additional resilience.
  • Monitor Raft quorum health proactively. A cluster that loses quorum cannot schedule, scale, or update services — even though existing containers keep running.
  • Never run application workloads on manager nodes. Resource contention can starve the Raft process and cause the manager to appear unreachable, triggering unnecessary leader elections.
  • When quorum is lost, do not reboot all managers simultaneously. Restore one manager at a time and verify Raft log consistency before bringing up the next.
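The monitoring idea from the fix can be sketched as a small check that compares healthy managers against the quorum size. The function below reads one manager status per line on stdin, which is the shape produced by docker node ls --filter role=manager --format '{{.ManagerStatus}}'. The function name and the output wording are hypothetical. One caveat: once quorum is already lost, docker node ls itself fails, so a real alert should also treat a failed or timed-out command as an alarm.

```shell
#!/usr/bin/env bash
# quorum_state: reads manager statuses (Leader, Reachable, Unreachable),
# one per line, and reports whether a Raft majority is healthy.
# Returns 0 when quorum holds, 1 when it does not.
quorum_state() {
  local total=0 healthy=0 status need
  while read -r status; do
    [ -z "$status" ] && continue          # skip blank lines
    total=$(( total + 1 ))
    case "$status" in
      Leader|Reachable) healthy=$(( healthy + 1 )) ;;
    esac
  done
  need=$(( total / 2 + 1 ))               # floor(n/2) + 1
  if [ "$healthy" -ge "$need" ]; then
    echo "OK $healthy/$total (need $need)"
  else
    echo "QUORUM_LOST $healthy/$total (need $need)"
    return 1
  fi
}

# Simulated input matching the incident: 4 managers, 2 offline.
printf 'Leader\nReachable\nUnreachable\nUnreachable\n' | quorum_state || true

# Against a live cluster (run on a manager):
#   docker node ls --filter role=manager --format '{{.ManagerStatus}}' | quorum_state
```

Feeding the function from stdin keeps it testable without a cluster and lets the same check run from cron or a monitoring agent.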
Production debug guide
From quorum loss to service scheduling failures: systematic debugging paths.
Symptom · 01
docker service ls hangs or returns 'DeadlineExceeded'.
Fix
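Check Raft quorum health. Run docker node ls on each manager. If fewer than a quorum of managers show 'Leader' or 'Reachable', the cluster has lost quorum. Check whether the missing managers are reachable via SSH, and restart the Docker daemon on them one at a time. If quorum cannot be restored that way, docker node demote will not help because it needs quorum itself; recover by running docker swarm init --force-new-cluster on a healthy manager, then re-add managers until the count is odd again.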
Symptom · 02
Service tasks are stuck in 'Pending' state and never start.
Fix
Check resource constraints: docker service ps <service> --no-trunc. Look for 'no suitable node' errors. Verify node availability: docker node ls. Check if nodes have enough CPU/memory: docker node inspect <node> --format '{{.Description.Resources}}'. Check placement constraints: docker service inspect <service> --format '{{.Spec.TaskTemplate.Placement.Constraints}}'.
Symptom · 03
Service is running but unreachable via published port.
Fix
Check if the routing mesh is functioning: curl http://<any-node-ip>:<published-port>. If it works on some nodes but not others, the ingress network may be misconfigured. Inspect the ingress network: docker network inspect ingress. Check if the service has healthy tasks: docker service ps <service> --filter desired-state=running. Verify the container is listening: docker exec <container> ss -tlnp.
Symptom · 04
Rolling update is stuck and not progressing.
Fix
Check update status: docker service ps <service> --no-trunc. Look for tasks in 'Failed' or 'Rejected' state and read the error column. Check that the new image exists and is pullable: docker pull <image>. Check the update state and message, which are populated once an update has been attempted: docker service inspect <service> --format '{{.UpdateStatus.State}}: {{.UpdateStatus.Message}}'. Review the update settings with docker service inspect <service> --format '{{.Spec.UpdateConfig}}', then adjust parallelism and delay: docker service update --update-parallelism 1 --update-delay 30s <service>.
Symptom · 05
Node shows 'Down' but the server is online.
Fix
Check Docker daemon status on the node: systemctl status docker. Check if the node's IP changed (common in cloud environments with dynamic IPs). Swarm uses the IP from docker swarm init/join. If the IP changed, the node must rejoin the cluster. Check firewall rules: Swarm requires ports 2377 (Raft), 7946 (gossip), 4789 (overlay VXLAN) to be open between all nodes.
Symptom · 06
Secrets or configs not updating in running services.
Fix
Docker secrets and configs are immutable. Updating a secret creates a new version. The service must be updated to reference the new secret: docker service update --secret-rm <old-secret> --secret-add <new-secret> <service>. Verify the secret is mounted: docker exec <container> ls /run/secrets/.
★ Docker Swarm Triage Cheat Sheet
First-response commands when Swarm cluster or service issues are reported.
Cluster unresponsive — docker service ls hangs.
Immediate action
Check Raft quorum across all manager nodes.
Commands
docker node ls
docker info --format '{{.Swarm.ControlAvailable}}' (run on each manager)
Fix now
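If fewer than a quorum of managers are reachable, restart the Docker daemon on one manager at a time. If managers are permanently dead and quorum cannot be regained, docker node demote will be rejected; run docker swarm init --force-new-cluster on a healthy manager, then remove the dead nodes and re-add managers.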
Service tasks stuck in 'Pending' or 'Failed' state.
Immediate action
Check task failure reason and node resource availability.
Commands
docker service ps <service> --no-trunc
docker node inspect <node> --format '{{.Status.Addr}} {{.Spec.Availability}}'
Fix now
If 'no suitable node', check constraints and resource limits. Remove constraints or add nodes. If container crashes, check logs: docker service logs <service> --tail 50.
Service unreachable via published port on specific nodes.
Immediate action
Test routing mesh and ingress network.
Commands
curl -s -o /dev/null -w '%{http_code}' http://<node-ip>:<port>
docker network inspect ingress --format '{{.Peers}}'
Fix now
If ingress network peers are missing, restart Docker on affected nodes. If VXLAN port 4789 is blocked, open it in firewall.
Rolling update stuck: old tasks not being replaced.
Immediate action
Check update configuration and new image availability.
Commands
docker service ps <service> --filter desired-state=running
docker service inspect <service> --format '{{.Spec.UpdateConfig}}'
Fix now
Force rollback: docker service rollback <service>. Then fix the new image and retry with --update-parallelism 1 --update-delay 10s.
Node shows 'Down' but server is reachable via SSH.
Immediate action
Check Docker daemon and network connectivity.
Commands
ssh <node> 'systemctl status docker'
ssh <node> 'docker info --format "{{.Swarm.LocalNodeState}}"'
Fix now
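If the daemon is stopped, restart it. If the IP changed, the node must leave and rejoin: run docker swarm leave --force on the node, then docker swarm join with a current join token (docker swarm join-token worker on a manager prints the full command).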
Overlay network connectivity issues between containers on different nodes.
Immediate action
Check VXLAN port and overlay network peer status.
Commands
docker network inspect <network> --format '{{.Peers}}'
nc -zuv <other-node-ip> 4789
Fix now
If VXLAN port is blocked, open UDP 4789 in firewall. If peers are missing, restart Docker on affected nodes. Consider using host networking for latency-sensitive services.
Docker Swarm vs Kubernetes — When to Choose Which
Aspect | Docker Swarm | Kubernetes
Setup complexity | Single command: docker swarm init | Requires kubeadm, kops, or managed service (EKS, GKE)
Learning curve | Low: uses standard Docker CLI | Steep: new concepts (pods, deployments, services, ingress)
Built-in features | Service discovery, load balancing, secrets, rolling updates | All of the above plus CRDs, operators, admission controllers
Networking | VXLAN overlay with routing mesh | CNI plugin model (Calico, Cilium, Flannel)
State management | Raft consensus (embedded in Docker daemon) | etcd (external cluster)
Scaling | Good up to ~100 nodes | Designed for 1000+ nodes
Ecosystem | Limited: fewer third-party tools | Massive: Helm, ArgoCD, Istio, Prometheus, etc.
Best for | Small-to-medium teams, simple deployments, Docker-native workflows | Large-scale, complex workloads, teams with dedicated platform engineers

Key takeaways

1. Docker Swarm is the native orchestration layer built into the Docker Engine. It uses Raft consensus for state management and VXLAN overlay networks for cross-host communication.
2. Always use an odd number of manager nodes (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes.
3. The ingress routing mesh adds one network hop. For latency-sensitive services, use host-mode publishing. Always open UDP 4789, TCP/UDP 7946, and TCP 2377 between nodes.
4. Always set --update-failure-action rollback and health checks with --health-start-period. Without rollback, a broken update replaces all healthy containers.
5. Docker secrets are encrypted, immutable, and mounted as files. Never use ENV for secrets. Secrets are Swarm-only; standalone Docker requires alternatives.
6. Swarm is ideal for small-to-medium deployments. For 100+ nodes or complex workloads, consider migrating to Kubernetes.
Frequently Asked Questions

1. Is Docker Swarm still maintained?
2. How many manager nodes should I use?
3. What is the difference between a service and a task in Docker Swarm?
4. How does the routing mesh work?
5. Can I use Docker Swarm in production?