
Docker Swarm Explained: Clustering, Orchestration and Production Gotchas

📍 Part of: Docker → Topic 13 of 18
Docker Swarm orchestration deep-dive: learn how Raft consensus, service scheduling, overlay networks and rolling updates work under the hood — with real production tips.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • Docker Swarm is the native orchestration layer built into the Docker Engine. It uses Raft consensus for state management and VXLAN overlay networks for cross-host communication.
  • Always use an odd number of manager nodes (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes.
  • The ingress routing mesh adds one network hop. For latency-sensitive services, use host-mode publishing. Always open UDP 4789, TCP/UDP 7946, and TCP 2377 between nodes.
Quick Answer
  • Manager nodes: run the Raft consensus algorithm, maintain cluster state, schedule services
  • Worker nodes: execute tasks (containers) assigned by managers
  • Services: the declarative unit — you define desired state, Swarm converges reality to match
  • Tasks: the atomic scheduling unit — one task = one container
  • Raft consensus requires a quorum (majority) of managers to agree on state changes
  • Overlay networks span hosts so containers can communicate across nodes
  • Ingress routing mesh load-balances published ports across all nodes
  • Rolling updates replace containers incrementally with zero downtime
🚨 START HERE
Docker Swarm Triage Cheat Sheet
First-response commands when Swarm cluster or service issues are reported.
🟡 Cluster unresponsive — docker service ls hangs.
Immediate Action: Check Raft quorum across all manager nodes.
Commands
docker node ls
docker info --format '{{.Swarm.ControlAvailable}}' (run on each manager)
Fix Now: If fewer than a quorum of managers are reachable, restart the Docker daemon on one manager at a time. If a manager is permanently dead, demote it: docker node demote <node-id>.
🟡 Service tasks stuck in 'Pending' or 'Failed' state.
Immediate Action: Check the task failure reason and node resource availability.
Commands
docker service ps <service> --no-trunc
docker node inspect <node> --format '{{.Status.Addr}} {{.Spec.Availability}}'
Fix Now: If the error is 'no suitable node', check constraints and resource limits; remove constraints or add nodes. If the container crashes, check logs: docker service logs <service> --tail 50.
🟡 Service unreachable via published port on specific nodes.
Immediate Action: Test the routing mesh and ingress network.
Commands
curl -s -o /dev/null -w '%{http_code}' http://<node-ip>:<port>
docker network inspect ingress --format '{{.Peers}}'
Fix Now: If ingress network peers are missing, restart Docker on the affected nodes. If VXLAN port 4789 is blocked, open it in the firewall.
🟡 Rolling update stuck — old tasks not being replaced.
Immediate Action: Check the update configuration and new image availability.
Commands
docker service ps <service> --filter desired-state=running
docker service inspect <service> --format '{{.Spec.UpdateConfig}}'
Fix Now: Force a rollback: docker service rollback <service>. Then fix the new image and retry with --update-parallelism 1 --update-delay 10s.
🟡 Node shows 'Down' but the server is reachable via SSH.
Immediate Action: Check the Docker daemon and network connectivity.
Commands
ssh <node> 'systemctl status docker'
ssh <node> 'docker info --format "{{.Swarm.LocalNodeState}}"'
Fix Now: If the daemon is stopped, restart it. If the node's IP changed, the node must leave and rejoin: docker swarm leave --force, then rejoin with the new token.
🟡 Overlay network connectivity issues between containers on different nodes.
Immediate Action: Check the VXLAN port and overlay network peer status.
Commands
docker network inspect <network> --format '{{.Peers}}'
nc -zuv <other-node-ip> 4789
Fix Now: If the VXLAN port is blocked, open UDP 4789 in the firewall. If peers are missing, restart Docker on the affected nodes. Consider host-mode publishing for latency-sensitive services.
Production Incident: Cluster Split-Brain After Losing 2 of 4 Manager Nodes — All Services Unreachable for 3 Hours
A team ran a six-node Swarm cluster with 4 managers and 2 workers. During data center maintenance, 2 managers went offline simultaneously. The remaining 2 managers could not form a quorum (2 < the 3 required for a 4-manager cluster), so the entire cluster became unresponsive: no new deployments, no scaling, no failover.
Symptom: After a planned data center maintenance window, the operations team could not deploy new services. docker service ls hung for 30 seconds, then returned 'Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded'. Existing services continued running but were unreachable via the routing mesh. docker node ls on the surviving managers showed the 2 offline managers as 'Down', and no manager showed 'Reachable'.
Assumption: The team assumed the offline managers would come back after maintenance and the cluster would self-heal. They waited 2 hours. The managers came back online, but the cluster was still unresponsive. They assumed a Docker daemon bug and considered rebuilding the entire cluster from scratch.
Root cause: With 4 manager nodes, the Raft quorum requires at least 3 managers to agree on any state change (quorum = floor(n/2) + 1 = floor(4/2) + 1 = 3). When 2 managers went offline, only 2 remained, which was insufficient for quorum, and the Raft log froze: no new state changes could be committed. Because demoting or removing a manager is itself a Raft state change, the cluster could not repair its own membership. When the offline managers returned, they had stale Raft logs, and the cluster needed manual intervention to re-establish consensus. The underlying design flaw was using an even number of managers (4) instead of an odd number (3 or 5).
Fix:
1. Recovered quorum using Docker's documented disaster-recovery path: docker swarm init --force-new-cluster on a surviving manager, which recreates a single-manager cluster that retains all services and state.
2. Demoted the stale managers and promoted nodes until the manager count was 3 (odd): docker node demote / docker node promote.
3. Added a monitoring alert for Raft quorum health: docker node ls | grep -c 'Leader\|Reachable' to detect quorum loss early.
4. Documented the rule: always use 3 or 5 managers, never 4 or 6.
5. Migrated critical services to Kubernetes in the long term, as the team's scale exceeded Swarm's sweet spot.
Key Lesson
  • Always use an odd number of manager nodes: 3 or 5. An even number (4, 6) wastes a node without improving fault tolerance.
  • Quorum = floor(n/2) + 1. With 3 managers you can lose 1; with 5 you can lose 2. With 4 managers you can still only lose 1: the 4th node provides no additional resilience.
  • Monitor Raft quorum health proactively. A cluster that loses quorum cannot schedule, scale, or update services, even though existing containers keep running.
  • Never run application workloads on manager nodes. Resource contention can starve the Raft process and cause the manager to appear unreachable, triggering unnecessary leader elections.
  • When quorum is lost, do not reboot all managers simultaneously. Restore one manager at a time and verify Raft log consistency before bringing up the next.
Production Debug Guide: from quorum loss to service scheduling failures, systematic debugging paths.
docker service ls hangs or returns 'DeadlineExceeded'.
Check Raft quorum health. Run docker node ls on each manager. If fewer than a quorum of managers show 'Reachable', the cluster has lost quorum. Check whether the managers are reachable via SSH, then restart the Docker daemon on unreachable managers one at a time. If quorum cannot be restored, re-initialize from a surviving manager with docker swarm init --force-new-cluster.
Service tasks are stuck in 'Pending' state and never start.
Check resource constraints: docker service ps <service> --no-trunc and look for 'no suitable node' errors. Verify node availability: docker node ls. Check whether nodes have enough CPU/memory: docker node inspect <node> --format '{{.Description.Resources}}'. Check placement constraints: docker service inspect <service> --format '{{.Spec.TaskTemplate.Placement.Constraints}}'.
Service is running but unreachable via its published port.
Check whether the routing mesh is functioning: curl http://<any-node-ip>:<published-port>. If it works on some nodes but not others, the ingress network may be misconfigured; inspect it with docker network inspect ingress. Check that the service has healthy tasks: docker service ps <service> --filter desired-state=running. Verify the container is listening: docker exec <container> ss -tlnp.
Rolling update is stuck and not progressing.
Check update status: docker service ps <service> --filter desired-state=running and look for tasks in 'Failed' state. Check that the new image exists and is pullable: docker pull <image>. Check whether the new container fails health checks, and review the update policy: docker service inspect <service> --format '{{.Spec.UpdateConfig}}'. Adjust update parallelism and delay: docker service update --update-parallelism 1 --update-delay 30s <service>.
Node shows 'Down' but the server is online.
Check Docker daemon status on the node: systemctl status docker. Check whether the node's IP changed (common in cloud environments with dynamic IPs); Swarm uses the IP given at docker swarm init/join, so if it changed the node must rejoin the cluster. Check firewall rules: Swarm requires TCP 2377 (cluster management), TCP/UDP 7946 (gossip), and UDP 4789 (overlay VXLAN) to be open between all nodes.
Secrets or configs are not updating in running services.
Docker secrets and configs are immutable; updating one means creating a new version. The service must be updated to reference the new secret: docker service update --secret-rm <old-secret> --secret-add <new-secret> <service>. Verify the secret is mounted: docker exec <container> ls /run/secrets/.

Every production app eventually outgrows a single server. Traffic spikes, hardware fails, deployments need to happen without downtime. Docker Swarm is the native clustering and orchestration layer baked directly into the Docker Engine.

Swarm solves coordination across multiple hosts. When you have ten nodes, you need something to decide where a container lands, what happens when a node dies, how containers on different hosts communicate, and how you push a new image without dropping requests. Swarm encodes those answers into a distributed state machine backed by the Raft consensus algorithm.

Common misconceptions: Swarm is not deprecated (Docker continues to maintain it alongside Compose). Swarm is not Kubernetes-lite (it has a fundamentally different architecture — no pods, no CRDs, no etcd). Swarm's simplicity is its strength for small-to-medium deployments that do not need Kubernetes' complexity.

Raft Consensus and Manager Node Architecture

Swarm's cluster state is stored in a distributed log managed by the Raft consensus algorithm. Every manager node runs a full copy of the Raft log. State changes (service updates, node joins, secret creation) are proposed by the leader, replicated to a quorum of followers, and then committed.

The quorum formula is floor(n/2) + 1, where n is the number of managers. With 3 managers, quorum is 2. With 5 managers, quorum is 3. The cluster can tolerate floor((n-1)/2) manager failures. With 3 managers, you can lose 1. With 5 managers, you can lose 2.

An even number of managers provides no additional fault tolerance over the next lower odd number. With 4 managers, quorum is 3 — you can still only lose 1 manager, same as with 3 managers. The 4th node is wasted.
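The quorum arithmetic is easy to sanity-check in the shell. This is a standalone sketch of the formulas above, not a Swarm command:

```shell
# Raft sizing sketch: quorum and fault tolerance for n managers
quorum()    { echo $(( $1 / 2 + 1 )); }      # floor(n/2) + 1
tolerates() { echo $(( ($1 - 1) / 2 )); }    # floor((n-1)/2)

for n in 3 4 5; do
  echo "$n managers -> quorum $(quorum $n), tolerates $(tolerates $n) failure(s)"
done
# Note: 4 managers need a quorum of 3 yet still tolerate only 1 failure,
# the same as 3 managers.
```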

Leader election: When the leader fails or becomes unreachable, the remaining managers hold an election. Each follower waits a randomized election timeout; the first to time out becomes a candidate and requests votes, and it can only win if its Raft log is at least as up to date as those of a voting majority. A network partition cannot produce two committing leaders: a minority partition may start elections, but only the partition that holds quorum can elect a leader and commit new state changes.

Failure scenario — manager resource starvation: A team ran a memory-intensive batch job on a manager node. The job consumed all available RAM, causing the Docker daemon to be OOM-killed. The daemon restart triggered a Raft leader election. During the election window (typically a second or two), no state changes could be committed, and the team noticed brief delays in service updates. The fix: cordon manager nodes from workloads using docker node update --availability drain <manager-node>.

io/thecodeforge/swarm-manager-setup.sh · BASH
#!/bin/bash
# Swarm cluster bootstrap with proper manager configuration

# ── Initialize the Swarm on the first manager ─────────────────────
# --advertise-addr: the IP other nodes will use to reach this manager
# --listen-addr: the interface the manager binds to
docker swarm init \
  --advertise-addr 10.0.1.10 \
  --listen-addr 0.0.0.0:2377 \
  --data-path-addr 10.0.1.10

# Get the join tokens
docker swarm join-token manager  # For other managers
docker swarm join-token worker   # For workers

# ── Join additional managers (run on each new manager node) ────────
docker swarm join \
  --token SWMTKN-1-xxxxx-manager-token-xxxxx \
  --advertise-addr 10.0.1.11 \
  10.0.1.10:2377

# ── Verify manager count (should be 3 or 5, never even) ──────────
docker node ls --filter role=manager
# ID    HOSTNAME   STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
# abc * manager-1  Ready     Active         Leader           24.0.7
# def   manager-2  Ready     Active         Reachable        24.0.7
# ghi   manager-3  Ready     Active         Reachable        24.0.7

# ── Drain manager nodes to prevent workloads from running on them ─
for node in $(docker node ls --filter role=manager -q); do
  docker node update --availability drain $node
done
# Drained managers cannot run tasks — they are dedicated to orchestration

# ── Check Raft cluster health ─────────────────────────────────────
docker info --format '{{.Swarm.Managers}} managers, {{.Swarm.Nodes}} total'
# Or inspect the Raft status on each manager
docker node inspect self --format '{{.ManagerStatus.Leader}}'
# true on the leader, false on followers

# ── Configure auto-lock (encrypt Raft logs at rest) ───────────────
docker swarm update --autolock=true
# This requires unlocking the swarm after daemon restart:
# docker swarm unlock
# Enter unlock key: SWMKEY-1-xxxxx
▶ Output
Swarm initialized: current node (abc123) is now a manager.

To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-xxxxx 10.0.1.10:2377

To add a manager to this swarm, run:
docker swarm join --token SWMTKN-1-yyyyy 10.0.1.10:2377

ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
abc * manager-1 Ready Active Leader 24.0.7
def manager-2 Ready Active Reachable 24.0.7
ghi manager-3 Ready Active Reachable 24.0.7
Mental Model
Raft as a Committee Vote
Why does an even number of managers not improve fault tolerance?
  • Quorum = floor(n/2) + 1. With 3 managers, quorum is 2. With 4 managers, quorum is 3.
  • With 3 managers, you can lose 1 and still have quorum (2 >= 2).
  • With 4 managers, you can lose 1 and still have quorum (3 >= 3). But losing 2 breaks quorum (2 < 3).
  • The 4th manager adds cost (server, maintenance) without adding fault tolerance. Always use 3 or 5.
📊 Production Insight
The autolock feature (--autolock=true) encrypts the Raft log at rest. Without it, anyone with access to the manager's disk can read the Raft data, which includes secrets and service definitions. The trade-off: after a daemon restart, you must manually enter the unlock key. Automate this with a secrets manager or a secure boot script.
🎯 Key Takeaway
Raft consensus requires a quorum of managers to agree on state changes. Always use an odd number of managers (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes — resource contention can starve the Raft process.
Manager Node Count Decision
If: Small cluster, 1-10 nodes, budget-conscious
Use: 3 managers. Tolerates 1 failure. Minimal overhead.
If: Medium cluster, 10-100 nodes, production-critical
Use: 5 managers. Tolerates 2 failures. Better consensus performance under load.
If: Large cluster, 100+ nodes
Use: Consider migrating to Kubernetes. Swarm's consensus model does not scale well beyond ~100 nodes.
If: Development/testing environment
Use: 1 manager is sufficient. No quorum concerns. Not suitable for production.

Service Scheduling, Placement Constraints and Resource Limits

A Swarm service is a declarative specification of the desired state: which image to run, how many replicas, resource limits, placement constraints, and update policy. The Swarm scheduler assigns tasks (individual containers) to nodes that satisfy the constraints and have available resources.

Scheduling algorithm: Swarm uses a spread scheduler by default — it places tasks on the node with the fewest existing tasks of the same service. This provides natural load distribution. You can override this with placement constraints and preferences.

Placement constraints: hard requirements that a node must satisfy. Examples:
  • node.role==manager: only run on manager nodes
  • node.labels.zone==us-east-1a: only run in a specific availability zone
  • node.hostname==worker-3: pin to a specific node

Placement preferences: Soft preferences that guide scheduling but do not prevent placement. Example: --placement-pref 'spread=node.labels.zone' distributes tasks evenly across zones.

Resource limits:
  • --limit-cpu: maximum CPU a task can consume (e.g., 0.5 = half a core)
  • --limit-memory: maximum memory (e.g., 512m)
  • --reserve-cpu: guaranteed CPU allocation
  • --reserve-memory: guaranteed memory allocation

Without resource limits, a single misbehaving container can consume all resources on a node, starving other tasks. Resource reservations ensure critical services always have the resources they need.
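Reservations also drive scheduling: Swarm places a task only on a node with enough unreserved capacity for its --reserve-* values, while limits cap what the task may actually use. A back-of-the-envelope capacity sketch (the node and overhead sizes here are hypothetical):

```shell
# Capacity sketch: the scheduler places tasks by reservation, not limit.
# Hypothetical node: 8 GiB RAM, ~1 GiB kept back for the OS and daemon.
NODE_MB=8192
SYSTEM_MB=1024
RESERVE_MB=128            # --reserve-memory 128m per task

FIT=$(( (NODE_MB - SYSTEM_MB) / RESERVE_MB ))
echo "tasks schedulable by reservation: $FIT"
# Limits (--limit-memory 512m) bound worst-case usage: if every task hit
# its limit at once, demand would be FIT * 512 MB, far above the node.
# That gap is why limits alone do not prevent overcommit.
```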

Failure scenario — no resource limits, noisy neighbor: A team deployed a memory-intensive analytics service without --limit-memory. The service gradually consumed all available RAM on a worker node. The kernel OOM-killed other containers on the same node, including a critical payment service. The payment service was rescheduled to another node (Swarm's self-healing), but the 30-second rescheduling delay caused a brief payment outage. The fix: add --limit-memory to all services and --reserve-memory for critical services.

io/thecodeforge/swarm-service-deploy.sh · BASH
#!/bin/bash
# Production-grade service deployment with constraints and resource limits

# ── Deploy a service with full production settings ────────────────
# Flags, grouped: resource limits (hard ceiling) and reservations
# (guaranteed allocation); placement (spread across zones, never on
# managers, SSD nodes only); health check; rolling-update policy;
# rollback policy; networking; secrets; restart policy.
# The image is the final positional argument.
docker service create \
  --name io-thecodeforge-api \
  --replicas 6 \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --reserve-cpu 0.25 \
  --reserve-memory 128m \
  --placement-pref 'spread=node.labels.zone' \
  --constraint 'node.role!=manager' \
  --constraint 'node.labels.disk==ssd' \
  --health-cmd 'curl -f http://localhost:8080/health || exit 1' \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  --health-start-period 30s \
  --update-parallelism 2 \
  --update-delay 10s \
  --update-failure-action rollback \
  --update-order start-first \
  --rollback-parallelism 1 \
  --rollback-delay 5s \
  --network io-thecodeforge-overlay \
  --publish published=8080,target=8080 \
  --env DATABASE_URL='{{DATABASE_URL}}' \
  --secret io-thecodeforge-db-password \
  --restart-condition on-failure \
  --restart-delay 5s \
  --restart-max-attempts 3 \
  --restart-window 60s \
  io.thecodeforge/api:v2.3.1

# ── Label nodes for placement constraints ─────────────────────────
docker node update --label-add zone=us-east-1a worker-1
docker node update --label-add zone=us-east-1b worker-2
docker node update --label-add zone=us-east-1a worker-3
docker node update --label-add disk=ssd worker-1
docker node update --label-add disk=ssd worker-2

# ── Verify placement ──────────────────────────────────────────────
docker service ps io-thecodeforge-api --format '{{.Name}} {{.Node}} {{.CurrentState}}'
# api.1  worker-1  Running
# api.2  worker-2  Running
# api.3  worker-1  Running
# api.4  worker-2  Running
# api.5  worker-3  Running
# api.6  worker-3  Running
▶ Output
overall progress: 6 out of 6 tasks
1/6: running [==================================================>]
2/6: running [==================================================>]
3/6: running [==================================================>]
4/6: running [==================================================>]
5/6: running [==================================================>]
6/6: running [==================================================>]
verify: Service converged
Mental Model
Scheduling as Hotel Room Assignment
What is the difference between a constraint and a preference?
  • Constraints are hard requirements. If no node satisfies the constraint, the task stays in 'Pending' state forever.
  • Preferences are soft guidelines. Swarm tries to satisfy them but can place the task on any node if no preference match exists.
  • Use constraints for critical requirements: 'must run on SSD', 'must not run on managers'.
  • Use preferences for optimization: 'prefer to spread across zones', 'prefer nodes with fewer tasks'.
📊 Production Insight
The --update-order start-first flag starts the new container before stopping the old one. This gives zero-downtime deployments but temporarily increases resource usage: each task being replaced runs twice during the overlap, so with --limit-memory 512m and --update-parallelism 2 you need roughly 1 GB of extra headroom, and in the worst case (parallelism equal to the replica count) usage doubles. Ensure your cluster has enough headroom for rolling updates. If headroom is limited, use stop-first order instead.
🎯 Key Takeaway
Always set resource limits on production services. Without limits, a single misbehaving container can OOM-kill other containers on the same node. Use placement constraints to isolate critical services and spread across availability zones. The spread scheduler distributes tasks evenly by default.
Resource Limit Strategy
If: Stateless web API with predictable resource usage
Use: Set --limit-cpu and --limit-memory based on load testing. Use --reserve-memory for critical services.
If: Memory-intensive batch processing
Use: Set a generous --limit-memory but a low --limit-cpu. Use placement constraints to isolate it on dedicated nodes.
If: Latency-sensitive service (trading, real-time)
Use: Use --reserve-cpu to guarantee CPU. Consider host-mode publishing to bypass the routing mesh. Pin to dedicated nodes.
If: Development/testing
Use: Skip resource limits. They add complexity without benefit in non-production environments.

Overlay Networks and Cross-Host Container Communication

Docker Swarm uses overlay networks to enable containers on different hosts to communicate as if they were on the same network. The overlay network uses VXLAN (Virtual Extensible LAN) encapsulation to tunnel Layer 2 traffic over the underlying Layer 3 network.

How it works: When container A on node 1 sends a packet to container B on node 2, the VXLAN driver encapsulates the packet in a UDP datagram on port 4789 and sends it to node 2. Node 2 decapsulates the packet and delivers it to container B. The containers see each other's overlay IP addresses as if they were on the same LAN.
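Encapsulation is not free: the outer Ethernet/IP/UDP/VXLAN headers add roughly 50 bytes per frame, which is why Docker's default overlay MTU is 1450 on a standard 1500-byte underlay. A quick sketch of the arithmetic:

```shell
# MTU sketch: VXLAN wraps each frame in outer headers (~50 bytes),
# so the usable overlay MTU shrinks by that amount.
UNDERLAY_MTU=1500
VXLAN_OVERHEAD=50
OVERLAY_MTU=$(( UNDERLAY_MTU - VXLAN_OVERHEAD ))
echo "overlay MTU: $OVERLAY_MTU"
# If the underlay MTU is smaller (e.g. some cloud networks or VPN links),
# set the overlay MTU explicitly with:
#   docker network create --driver overlay --opt com.docker.network.driver.mtu=<N> ...
```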

The ingress routing mesh: When you publish a port with --publish, Swarm creates a route in the ingress network that load-balances incoming traffic across all nodes running the service. Any node in the cluster can receive traffic for any service, regardless of whether that node is running the service's containers. The routing mesh forwards the traffic to a node that is running a healthy task.

The extra-hop problem: The routing mesh adds one network hop. A request to node 1 may be routed to a container on node 3. This adds latency. For latency-sensitive services, use host-mode publishing: --publish published=8080,target=8080,mode=host. This bypasses the routing mesh and binds directly to the host's port. The trade-off: only nodes running the service's containers accept traffic — you lose the any-node routing benefit.

Failure scenario — VXLAN port blocked by firewall: A team deployed a 3-node Swarm cluster across two data centers. Containers in data center A could not reach containers in data center B. The team spent 4 hours debugging DNS, service discovery, and overlay configuration. The root cause: the firewall between data centers blocked UDP port 4789 (VXLAN). After opening the port, overlay connectivity was restored immediately.

io/thecodeforge/swarm-networking.sh · BASH
#!/bin/bash
# Overlay network setup and troubleshooting

# ── Create an overlay network with encryption ─────────────────────
docker network create \
  --driver overlay \
  --attachable \
  --opt encrypted \
  --subnet 10.0.10.0/24 \
  io-thecodeforge-overlay
# --driver overlay: VXLAN-based cross-host networking
# --attachable: allows standalone containers to join (useful for debugging)
# --opt encrypted: encrypts VXLAN traffic with IPsec (adds ~10% overhead)
# --subnet: explicit IP range for the overlay network

# ── Deploy a service on the overlay network ───────────────────────
docker service create \
  --name api \
  --network io-thecodeforge-overlay \
  --replicas 3 \
  io.thecodeforge/api:v2.3.1

# ── Verify overlay network peers (should list all nodes) ──────────
docker network inspect io-thecodeforge-overlay --format '{{json .Peers}}' | python3 -m json.tool
# Each peer represents a node participating in the overlay network
# If a peer is missing, that node cannot communicate on the overlay

# ── Test cross-host connectivity ──────────────────────────────────
# From any node, run a debug container on the overlay network
docker run --rm -it --network io-thecodeforge-overlay alpine sh
# Inside the container:
# ping <overlay-ip-of-service-task>
# nslookup tasks.api  # DNS round-robin for all service tasks

# ── Required ports for Swarm networking ───────────────────────────
# TCP 2377: Swarm cluster management (Raft)
# TCP/UDP 7946: Gossip-based node discovery
# UDP 4789: VXLAN overlay network traffic
# Protocol 50 (ESP): IPsec encryption (if --opt encrypted)

# ── Host-mode publishing (bypass routing mesh) ────────────────────
docker service create \
  --name api-latency-sensitive \
  --network io-thecodeforge-overlay \
  --publish published=8080,target=8080,mode=host \
  --mode global \
  io.thecodeforge/api:v2.3.1
# mode=global: one task per node (every node runs the service)
# mode=host: binds directly to host port 8080, no routing mesh hop
▶ Output
Network io-thecodeforge-overlay created

[
{
"Name": "manager-1",
"IP": "10.0.1.10"
},
{
"Name": "worker-1",
"IP": "10.0.1.11"
},
{
"Name": "worker-2",
"IP": "10.0.1.12"
}
]

# All 3 peers are present — overlay network is healthy
Mental Model
Overlay Network as a Virtual Office Floor
When should you use host-mode publishing instead of the routing mesh?
  • Latency-sensitive services where the extra routing mesh hop adds unacceptable delay.
  • Services that need to bind to specific host ports for external load balancer integration.
  • Services running in --mode global (one per node) where every node already has a container.
  • Trade-off: you lose the any-node routing benefit. Traffic only reaches nodes running the service.
📊 Production Insight
The --opt encrypted flag adds IPsec encryption to VXLAN traffic. This is important for multi-data-center or cloud deployments where traffic crosses untrusted networks. The overhead is approximately 10% throughput reduction and slightly higher CPU usage. For single-data-center deployments on a trusted network, skip encryption to avoid the overhead.
🎯 Key Takeaway
Overlay networks use VXLAN on UDP port 4789. If this port is blocked by firewalls, containers on different nodes cannot communicate. The routing mesh adds one network hop — use host-mode publishing for latency-sensitive services. Always use --opt encrypted for cross-data-center overlays.

Rolling Updates, Rollback and Zero-Downtime Deployments

Swarm's rolling update mechanism replaces old containers with new ones incrementally, ensuring the service remains available throughout the deployment. The update configuration controls the pace and failure behavior.

Update parameters:
  • --update-parallelism: how many tasks to update simultaneously (default: 1)
  • --update-delay: wait time between update batches (default: 0s)
  • --update-failure-action: what to do if a new task fails (pause, continue, rollback)
  • --update-order: start-first (new container starts before old stops) or stop-first (old stops before new starts)
  • --update-max-failure-ratio: the fraction of failed tasks (e.g. 0.25) that triggers the failure action

The start-first vs stop-first trade-off:
  • start-first: zero downtime, but temporarily higher resource usage while old and new tasks overlap
  • stop-first: lower resource usage, but a brief window in which fewer replicas are running
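These parameters compose into simple deployment math: parallelism sets the batch size, delay sets the minimum time between batches, and start-first adds one extra running task per slot in the current batch. A sketch with hypothetical numbers (6 replicas, 512 MB limit, parallelism 2, 10s delay):

```shell
# Rolling-update sketch (hypothetical service settings)
REPLICAS=6
LIMIT_MB=512
PARALLELISM=2
DELAY_S=10

# ceil(REPLICAS / PARALLELISM) batches
BATCHES=$(( (REPLICAS + PARALLELISM - 1) / PARALLELISM ))
# start-first: one extra task per update slot during the overlap
EXTRA_MB=$(( PARALLELISM * LIMIT_MB ))
# lower bound on duration from the inter-batch delays alone
MIN_TIME_S=$(( (BATCHES - 1) * DELAY_S ))

echo "batches: $BATCHES"
echo "extra start-first headroom: ${EXTRA_MB} MB"
echo "at least ${MIN_TIME_S}s of inter-batch delay"
```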

Rollback: If a rolling update fails, Swarm can automatically roll back to the previous version. The rollback configuration mirrors the update configuration. Manual rollback: docker service rollback <service>.

Failure scenario — update without health check causes cascading failure: A team deployed a new API version with a startup bug that caused the health check to fail after 30 seconds. The team had not configured --health-start-period, so the health check ran immediately (before the app was ready) and Swarm marked the task as failed. The service was configured with --update-failure-action continue (note: the default is pause), so Swarm kept replacing healthy containers with the failing new version. Within 2 minutes, all containers were running the broken version. The fix: set --update-failure-action rollback and configure --health-start-period to allow startup time.

io/thecodeforge/swarm-rolling-update.sh · BASH
#!/bin/bash
# Zero-downtime rolling update with automatic rollback

# ── Initial deployment ────────────────────────────────────────────
# Rolling update policy: 2 tasks at a time, 10s delay, auto-rollback
# on failure; rollback reverts 1 task at a time, stop-first.
# The image is the final positional argument.
docker service create \
  --name io-thecodeforge-api \
  --replicas 6 \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --health-cmd 'curl -f http://localhost:8080/health || exit 1' \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  --health-start-period 40s \
  --update-parallelism 2 \
  --update-delay 10s \
  --update-failure-action rollback \
  --update-max-failure-ratio 0.25 \
  --update-order start-first \
  --rollback-parallelism 1 \
  --rollback-delay 5s \
  --rollback-order stop-first \
  --network io-thecodeforge-overlay \
  --publish published=8080,target=8080 \
  io.thecodeforge/api:v2.3.0

# ── Rolling update to new version ─────────────────────────────────
docker service update \
  --image io.thecodeforge/api:v2.3.1 \
  --update-parallelism 2 \
  --update-delay 10s \
  io-thecodeforge-api

# ── Monitor the update progress ───────────────────────────────────
docker service ps io-thecodeforge-api \
  --format '{{.Name}} {{.Image}} {{.CurrentState}} {{.Error}}' \
  | head -20
# You will see old tasks shutting down and new tasks starting

# ── Manual rollback if needed ─────────────────────────────────────
docker service rollback io-thecodeforge-api
# Reverts to the previous image and configuration

# ── Force update (redeploy without changing image) ────────────────
docker service update --force io-thecodeforge-api
# Useful when container config has changed but image tag is the same
▶ Output
overall progress: 6 out of 6 tasks
1/6: running [==================================================>]
2/6: running [==================================================>]
3/6: running [==================================================>]
4/6: running [==================================================>]
5/6: running [==================================================>]
6/6: running [==================================================>]
verify: Service converged

# Rolling update in progress:
# api.1 io.thecodeforge/api:v2.3.1 Running
# api.2 io.thecodeforge/api:v2.3.1 Running
# api.3 io.thecodeforge/api:v2.3.0 Running (waiting for delay)
# api.4 io.thecodeforge/api:v2.3.0 Running (waiting for delay)
# api.5 io.thecodeforge/api:v2.3.0 Running
# api.6 io.thecodeforge/api:v2.3.0 Running
Mental Model
Rolling Update as Renovating a Hotel Floor by Floor
Why is --update-failure-action rollback critical for production?
  • With the default pause, a failing update stalls partway through, leaving a mix of old and new tasks until you intervene; with continue, Swarm keeps replacing healthy containers with the broken version. Only rollback recovers automatically.
  • With rollback, Swarm detects failures and automatically reverts to the previous working version.
  • The --update-max-failure-ratio flag controls the failure threshold: 0.25 means the failure action fires once more than 25% of updated tasks have failed.
  • Always pair rollback with health checks. Without health checks, Swarm cannot detect a broken container.
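The failure-ratio arithmetic is easy to check by hand. A minimal sketch in plain bash/awk, assuming the failure action fires once the fraction of failed tasks strictly exceeds the ratio; the helper name is illustrative, not a Docker command:

```shell
#!/bin/bash
# How many failed tasks trip --update-max-failure-ratio?
# Assumes the action fires once failed/total strictly exceeds the ratio.
failures_to_trigger() {
  local replicas=$1 ratio=$2
  # smallest integer f with f/replicas > ratio
  awk -v r="$replicas" -v x="$ratio" 'BEGIN { print int(r * x) + 1 }'
}

failures_to_trigger 6 0.25    # 6 replicas, ratio 0.25 -> prints 2
failures_to_trigger 10 0.25   # 10 replicas, ratio 0.25 -> prints 3
```

So with the 6-replica service above, two failed tasks (33%) are enough to trigger the rollback.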
📊 Production Insight
The --health-start-period flag is essential for services with slow startup times (JVM warmup, database migrations, cache hydration). Without it, the health check runs immediately and may fail before the application is ready, triggering an unnecessary rollback. Set it to the expected maximum startup time plus a buffer.
🎯 Key Takeaway
Always set --update-failure-action rollback in production; without it, a failed update stalls half-deployed (the default pause) or, with continue, replaces every healthy container with the broken version. Use --health-start-period for services with slow startup. start-first provides zero downtime but doubles resource usage during deployment — ensure cluster headroom.

Swarm Secrets and Configs — Immutable, Encrypted, Rotatable

Docker Swarm provides built-in secrets management through the Raft log. Secrets are encrypted at rest and in transit, mounted as files in /run/secrets/ inside containers, and never written to image layers.

How secrets work:
  • docker secret create stores the secret in the Raft log, which is encrypted at rest
  • The secret is distributed, still encrypted, to every manager node
  • When a service references a secret, it is mounted as a file at /run/secrets/<secret-name>
  • Secrets are immutable — updating a secret creates a new version

How configs work:
  • docker config create stores configuration files in the Raft log
  • Configs are mounted as files in the container (not encrypted at rest — use secrets for sensitive data)
  • Useful for nginx.conf, application.yaml, or any configuration file
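The same wiring can be expressed declaratively for docker stack deploy. A minimal sketch that writes such a stack file, assuming a pre-created external secret and a local nginx.conf; service and object names are illustrative:

```shell
#!/bin/bash
# Write a stack file that mounts one secret and one config.
cat > stack.yml <<'EOF'
version: "3.8"
services:
  api:
    image: io.thecodeforge/api:v2.3.1
    secrets:
      - db-password
    configs:
      - source: nginx-conf
        target: /etc/nginx/nginx.conf
secrets:
  db-password:
    external: true      # created beforehand with `docker secret create`
configs:
  nginx-conf:
    file: ./nginx.conf  # uploaded into the Raft log at deploy time
EOF

# Deploy on a manager node (needs a running Swarm, hence commented out):
# docker stack deploy -c stack.yml io-thecodeforge
echo "stack.yml written"
```

Marking the secret as external keeps its value out of the stack file and out of version control.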

Secret rotation: Secrets are immutable, so to rotate one:
  1. Create a new secret: docker secret create db-password-v2 - (the trailing - reads the value from stdin)
  2. Update the service in one step: docker service update --secret-rm db-password --secret-add db-password-v2 <service>
  3. The service's tasks restart with the new secret mounted
  4. Delete the old secret: docker secret rm db-password

Failure scenario — secret not updating in running service: A team updated a database password by creating a new secret and updating the service. However, the application inside the container still read the old password from /run/secrets/db-password. The team did not realize that Docker secrets are immutable — the old secret file remained mounted until the service was explicitly updated to remove it. The fix: use --secret-rm to remove the old secret and --secret-add to add the new one in the same update command.

io/thecodeforge/swarm-secrets.sh · BASH
#!/bin/bash
# Secrets management in Docker Swarm

# ── Create a secret from stdin ────────────────────────────────────
echo 's3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password -

# ── Create a secret from a file ───────────────────────────────────
docker secret create io-thecodeforge-tls-cert /path/to/cert.pem

# ── Create a config (non-sensitive configuration) ─────────────────
docker config create io-thecodeforge-nginx-conf /path/to/nginx.conf

# ── Deploy a service with secrets and configs ─────────────────────
docker service create \
  --name io-thecodeforge-api \
  --secret io-thecodeforge-db-password \
  --secret io-thecodeforge-tls-cert \
  --config source=io-thecodeforge-nginx-conf,target=/etc/nginx/nginx.conf \
  io.thecodeforge/api:v2.3.1

# ── Access secrets inside the container ────────────────────────────
docker exec <container> cat /run/secrets/io-thecodeforge-db-password
# Output: s3cret_p@ssw0rd

docker exec <container> ls /run/secrets/
# io-thecodeforge-db-password
# io-thecodeforge-tls-cert

# ── Rotate a secret ───────────────────────────────────────────────
# Step 1: Create new version
echo 'new_s3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password-v2 -

# Step 2: Update service — remove old, add new
docker service update \
  --secret-rm io-thecodeforge-db-password \
  --secret-add io-thecodeforge-db-password-v2 \
  io-thecodeforge-api

# Step 3: Verify new secret is mounted
docker exec <container> cat /run/secrets/io-thecodeforge-db-password-v2
# Output: new_s3cret_p@ssw0rd

# Step 4: Clean up old secret
docker secret rm io-thecodeforge-db-password

# ── List all secrets ──────────────────────────────────────────────
docker secret ls
# ID          NAME                          CREATED
# abc123      io-thecodeforge-db-password   2 hours ago
# def456      io-thecodeforge-tls-cert      2 hours ago
▶ Output
# docker secret create and docker config create print the new object's ID:
abc123
ghi789

overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service converged

s3cret_p@ssw0rd
Mental Model
Secrets as Sealed Envelopes
Why are Docker secrets more secure than environment variables?
  • Secrets are encrypted at rest in the Raft log. ENV variables are stored in plaintext in container metadata.
  • Secrets are mounted as files — they do not appear in docker inspect, docker ps, or process listings.
  • Secrets are distributed only to nodes running tasks that reference them. ENV variables are visible to anyone with image access.
  • Secrets are immutable and versioned. ENV variables can be accidentally changed or logged.
📊 Production Insight
Docker secrets are Swarm-only. If you use standalone Docker (not Swarm), you must use alternative secrets management: Docker Compose secrets (file-based, not encrypted), HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets. Plan your secrets strategy before choosing an orchestration platform.
🎯 Key Takeaway
Docker secrets are encrypted, immutable, and mounted as files in /run/secrets/. Never use ENV for secrets — they are visible in docker inspect. To rotate a secret, create a new version and update the service with --secret-rm and --secret-add. Secrets are Swarm-only — standalone Docker requires alternative solutions.
🗂 Docker Swarm vs Kubernetes — When to Choose Which
Architectural differences, operational complexity, and sweet spots for each orchestrator.
Aspect            | Docker Swarm                                                        | Kubernetes
Setup complexity  | Single command: docker swarm init                                   | Requires kubeadm, kops, or a managed service (EKS, GKE)
Learning curve    | Low — uses the standard Docker CLI                                  | Steep — new concepts (pods, deployments, services, ingress)
Built-in features | Service discovery, load balancing, secrets, rolling updates         | All of the above plus CRDs, operators, admission controllers
Networking        | VXLAN overlay with routing mesh                                     | CNI plugin model (Calico, Cilium, Flannel)
State management  | Raft consensus (embedded in the Docker daemon)                      | etcd (external cluster)
Scaling           | Good up to ~100 nodes                                               | Designed for 1000+ nodes
Ecosystem         | Limited — fewer third-party tools                                   | Massive — Helm, ArgoCD, Istio, Prometheus, etc.
Best for          | Small-to-medium teams, simple deployments, Docker-native workflows  | Large-scale, complex workloads, teams with dedicated platform engineers

🎯 Key Takeaways

  • Docker Swarm is the native orchestration layer built into the Docker Engine. It uses Raft consensus for state management and VXLAN overlay networks for cross-host communication.
  • Always use an odd number of manager nodes (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes.
  • The ingress routing mesh adds one network hop. For latency-sensitive services, use host-mode publishing. Always open UDP 4789, TCP/UDP 7946, and TCP 2377 between nodes.
  • Always set --update-failure-action rollback and configure health checks with --health-start-period. Without rollback, a failed update stalls half-deployed (the default pause) or, with continue, replaces all healthy containers.
  • Docker secrets are encrypted, immutable, and mounted as files. Never use ENV for secrets. Secrets are Swarm-only — standalone Docker requires alternatives.
  • Swarm is ideal for small-to-medium deployments. For 100+ nodes or complex workloads, consider migrating to Kubernetes.

⚠ Common Mistakes to Avoid

    Using an even number of manager nodes
    Symptom

    Losing 2 of 4 managers breaks quorum, just as losing 2 of 3 does, but you paid for an extra node.

    Fix

    Always use 3 or 5 managers. An even number provides no additional fault tolerance.

    Running workloads on manager nodes
    Symptom

    Resource contention starves the Raft consensus process, causing leader elections and cluster instability.

    Fix

    Drain manager nodes: docker node update --availability drain <manager-node>. Dedicate managers to orchestration only.

    Not setting --update-failure-action rollback
    Symptom

    A broken update replaces all healthy containers with the failing new version, causing a total service outage.

    Fix

    Always set --update-failure-action rollback and configure health checks with --health-start-period.

    Using ENV for secrets instead of docker secret
    Symptom

    Secrets are visible in docker inspect, docker history, and process listings.

    Fix

    Use docker secret create and mount secrets as files in /run/secrets/. Never put secrets in ENV, ARG, or the Dockerfile.

    Not opening UDP port 4789 between nodes
    Symptom

    Containers on different nodes cannot communicate over the overlay network.

    Fix

    Open UDP 4789 (VXLAN), TCP/UDP 7946 (gossip), and TCP 2377 (Raft) between all Swarm nodes.

    Not setting resource limits on services
    Symptom

    A misbehaving container consumes all CPU or memory on a node, OOM-killing other containers.

    Fix

    Set --limit-cpu and --limit-memory on every production service. Use --reserve-memory for critical services.

    Using :latest tag in production services
    Symptom

    docker service update --image <image> does not pull a new image when the tag is unchanged, even if the image content has changed.

    Fix

    Use specific version tags (v2.3.1) or SHA digests. Use --with-registry-auth when pulling from private registries.

Interview Questions on This Topic

  • Q: Explain how Raft consensus works in Docker Swarm. What happens when a manager node fails? How many managers can you lose with 3 vs 5 managers?
  • Q: Your team deployed a 4-node Swarm cluster with 4 managers. During maintenance, 2 managers went offline and the cluster became unresponsive. What happened and how would you fix it?
  • Q: What is the ingress routing mesh? How does it work and what is the performance trade-off? When would you use host-mode publishing instead?
  • Q: Walk me through a zero-downtime rolling update in Docker Swarm. What parameters control the update behavior and what happens if the new version fails health checks?
  • Q: How do Docker Swarm secrets differ from environment variables for storing sensitive data? How would you rotate a secret without downtime?
  • Q: Your service tasks are stuck in 'Pending' state. Walk me through the debugging steps you would take to identify the root cause.

Frequently Asked Questions

Is Docker Swarm still maintained?

Yes. Docker continues to maintain Swarm as part of the Docker Engine. It is not deprecated. Swarm is the right choice for small-to-medium deployments that do not need Kubernetes' complexity. Docker Compose also supports deploying to Swarm with docker stack deploy.

How many manager nodes should I use?

Always use 3 or 5 manager nodes, never an even number. With 3 managers, you can tolerate 1 failure. With 5 managers, you can tolerate 2 failures. An even number (4, 6) provides no additional fault tolerance over the next lower odd number. Never run workloads on manager nodes.
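The quorum arithmetic behind this rule can be tabulated directly. A quick bash sketch using pure integer math, no Swarm required:

```shell
#!/bin/bash
# Raft quorum: N managers need floor(N/2)+1 in agreement,
# so the cluster tolerates floor((N-1)/2) manager failures.
for n in 1 2 3 4 5 6 7; do
  printf '%d managers: quorum %d, tolerates %d failure(s)\n' \
    "$n" $(( n / 2 + 1 )) $(( (n - 1) / 2 ))
done
```

The table it prints shows why even counts are wasted: 4 managers tolerate no more failures than 3, and 6 no more than 5.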

What is the difference between a service and a task in Docker Swarm?

A service is the declarative specification: which image, how many replicas, resource limits, update policy. A task is the atomic scheduling unit — one task equals one container. A service with 6 replicas has 6 tasks. The Swarm scheduler assigns tasks to nodes.

How does the routing mesh work?

The routing mesh is an ingress network that load-balances published ports across all nodes in the cluster. Any node can receive traffic for any service, regardless of whether that node is running the service's containers. The mesh forwards traffic to a node with a healthy task. The trade-off: one extra network hop.
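When that extra hop matters, you can bypass the mesh with host-mode publishing, typically paired with a global service so every node runs exactly one task. A sketch, printed as a dry run so it can be inspected without a cluster; the service name is illustrative:

```shell
#!/bin/bash
# Host-mode publishing: each task binds its node's port 8080 directly,
# skipping the routing-mesh hop. You lose mesh load balancing, so put
# an external load balancer in front of the nodes.
cmd=(docker service create
  --name io-thecodeforge-edge
  --mode global                                    # one task per node
  --publish mode=host,published=8080,target=8080   # bypass ingress mesh
  io.thecodeforge/api:v2.3.1)

# Dry run: print the command instead of executing it
echo "${cmd[*]}" | tee edge-cmd.txt
```

With mode=host, only nodes actually running a task answer on port 8080, which is why a global service (or an external health-checked balancer) usually accompanies it.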

Can I use Docker Swarm in production?

Yes, for small-to-medium deployments (up to ~100 nodes). Swarm provides self-healing, rolling updates, secrets management, and overlay networking. For larger scale or complex workloads (custom controllers, CRDs, advanced networking), Kubernetes is a better fit.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Multi-stage Docker Builds · Next: Optimising Docker Images →
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged