Manager nodes: run the Raft consensus algorithm, maintain cluster state, schedule services
Worker nodes: execute tasks (containers) assigned by managers
Services: the declarative unit — you define desired state, Swarm converges reality to match
Tasks: the atomic scheduling unit — one task = one container
Raft consensus requires a quorum (majority) of managers to agree on state changes
Overlay networks span hosts so containers can communicate across nodes
Ingress routing mesh load-balances published ports across all nodes
Rolling updates replace containers incrementally with zero downtime
Plain-English First
Imagine a restaurant chain with one head office (the manager) and ten kitchens across the city (the workers). A customer order comes in — the head office decides which kitchen handles it, monitors the food being made, and if one kitchen burns down, it quietly reroutes the order to another kitchen without the customer ever knowing. Docker Swarm is exactly that: one command-and-control brain (the manager node) coordinating a fleet of worker nodes, making sure your containers keep running no matter what breaks.
Every production app eventually outgrows a single server. Traffic spikes, hardware fails, deployments need to happen without downtime. Docker Swarm is the native clustering and orchestration layer baked directly into the Docker Engine.
Swarm solves coordination across multiple hosts. When you have ten nodes, you need something to decide where a container lands, what happens when a node dies, how containers on different hosts communicate, and how you push a new image without dropping requests. Swarm encodes those answers into a distributed state machine backed by the Raft consensus algorithm.
Common misconceptions: Swarm is not deprecated (Docker continues to maintain it alongside Compose). Swarm is not Kubernetes-lite (it has a fundamentally different architecture — no pods, no CRDs, no etcd). Swarm's simplicity is its strength for small-to-medium deployments that do not need Kubernetes' complexity.
Raft Consensus and Manager Node Architecture
Swarm's cluster state is stored in a distributed log managed by the Raft consensus algorithm. Every manager node runs a full copy of the Raft log. State changes (service updates, node joins, secret creation) are proposed by the leader, replicated to a quorum of followers, and then committed.
The quorum formula is floor(n/2) + 1, where n is the number of managers. With 3 managers, quorum is 2. With 5 managers, quorum is 3. The cluster can tolerate floor((n-1)/2) manager failures. With 3 managers, you can lose 1. With 5 managers, you can lose 2.
An even number of managers provides no additional fault tolerance over the next lower odd number. With 4 managers, quorum is 3 — you can still only lose 1 manager, same as with 3 managers. The 4th node is wasted.
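The quorum arithmetic above can be checked with a few lines of shell. This is a minimal sketch: the function names quorum and tolerated_failures are illustrative helpers, not Docker commands.

```bash
#!/usr/bin/env bash
# Quorum arithmetic for a Raft group of n managers.
# Function names are illustrative; they are not part of the Docker CLI.

# quorum N -> floor(N/2) + 1 (bash integer division already floors)
quorum() {
  local n=$1
  echo $(( n / 2 + 1 ))
}

# tolerated_failures N -> floor((N-1)/2)
tolerated_failures() {
  local n=$1
  echo $(( (n - 1) / 2 ))
}

# Print the table for 1 to 7 managers
for n in 1 2 3 4 5 6 7; do
  printf '%d managers: quorum %d, tolerates %d failure(s)\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Running it shows that 3 and 4 managers both tolerate a single failure, and 5 and 6 both tolerate two, which is why the extra even-numbered manager buys nothing.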
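Leader election: When the leader fails or becomes unreachable, the remaining managers hold an election. Each follower waits for a randomized election timeout; the first to time out becomes a candidate and wins if a majority of managers vote for it, which they do only if its Raft log is at least as up-to-date as their own. Because a leader needs votes from a majority, a network partition cannot produce two leaders that both commit changes: only the side that holds quorum can elect a leader and commit new state, while the minority side stops accepting changes until the partition heals.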
Failure scenario — manager resource starvation: A team ran a memory-intensive batch job on a manager node. The job consumed all available RAM, causing the Docker daemon to be OOM-killed. The daemon restart triggered a Raft leader election. During the election window (1-2 seconds), no state changes could be committed. The team noticed brief delays in service updates. The fix: cordon manager nodes from workloads using docker node update --availability drain <manager-node>.
io/thecodeforge/swarm-manager-setup.sh (Bash)
#!/bin/bash
# Swarm cluster bootstrap with proper manager configuration

# ── Initialize the Swarm on the first manager ─────────────────────
# --advertise-addr: the IP other nodes will use to reach this manager
# --listen-addr: the interface the manager binds to
# --data-path-addr: the IP used for overlay data traffic
docker swarm init \
  --advertise-addr 10.0.1.10 \
  --listen-addr 0.0.0.0:2377 \
  --data-path-addr 10.0.1.10

# Get the join tokens
docker swarm join-token manager   # For other managers
docker swarm join-token worker    # For workers

# ── Join additional managers (run on each new manager node) ────────
docker swarm join \
  --token SWMTKN-1-xxxxx-manager-token-xxxxx \
  --advertise-addr 10.0.1.11 \
  10.0.1.10:2377

# ── Verify manager count (should be 3 or 5, never even) ──────────
docker node ls --filter role=manager
# ID      HOSTNAME    STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
# abc *   manager-1   Ready    Active         Leader           24.0.7
# def     manager-2   Ready    Active         Reachable        24.0.7
# ghi     manager-3   Ready    Active         Reachable        24.0.7

# ── Drain manager nodes to prevent workloads from running on them ─
for node in $(docker node ls --filter role=manager -q); do
  docker node update --availability drain "$node"
done
# Drained managers cannot run tasks; they are dedicated to orchestration

# ── Check Raft cluster health ─────────────────────────────────────
docker info --format '{{.Swarm.Managers}} managers, {{.Swarm.Nodes}} total'
# Or check on each manager whether it is the current leader
docker node inspect self --format '{{.ManagerStatus.Leader}}'
# true on the leader, false on followers

# ── Enable autolock (protect the Raft encryption key at rest) ─────
docker swarm update --autolock=true
# This requires unlocking the swarm after daemon restart:
# docker swarm unlock
# Enter unlock key: SWMKEY-1-xxxxx
Output
Swarm initialized: current node (abc123) is now a manager.
To add a worker to this swarm, run the following command:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
abc * manager-1 Ready Active Leader 24.0.7
def manager-2 Ready Active Reachable 24.0.7
ghi manager-3 Ready Active Reachable 24.0.7
Raft as a Committee Vote
Quorum = floor(n/2) + 1. With 3 managers, quorum is 2. With 4 managers, quorum is 3.
With 3 managers, you can lose 1 and still have quorum (2 >= 2).
With 4 managers, you can lose 1 and still have quorum (3 >= 3). But losing 2 breaks quorum (2 < 3).
The 4th manager adds cost (server, maintenance) without adding fault tolerance. Always use 3 or 5.
Production Insight
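Swarm encrypts the Raft log on disk by default, but the key that decrypts it is stored on the manager alongside the data. Enabling autolock (--autolock=true) protects that key, and the managers' TLS key, with an unlock key that you hold. Without autolock, anyone with access to the manager's disk can recover the Raft data, which includes secrets and service definitions. The trade-off: after a daemon restart, you must supply the unlock key with docker swarm unlock. Keep the key in a secrets manager, and automate unlocking only through a channel you trust.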
Key Takeaway
Raft consensus requires a quorum of managers to agree on state changes. Always use an odd number of managers (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes — resource contention can starve the Raft process.
Use: 5 managers. Tolerates 2 failures, at the cost of slightly slower writes because every commit needs acknowledgement from 3 managers.
If: Large cluster, 100+ nodes
→
Use: Consider migrating to Kubernetes. Swarm's consensus model does not scale well beyond ~100 nodes.
If: Development/testing environment
→
Use: 1 manager is sufficient. No quorum concerns. Not suitable for production.
Service Scheduling, Placement Constraints and Resource Limits
A Swarm service is a declarative specification of the desired state: which image to run, how many replicas, resource limits, placement constraints, and update policy. The Swarm scheduler assigns tasks (individual containers) to nodes that satisfy the constraints and have available resources.
Scheduling algorithm: Swarm uses a spread scheduler by default — it places tasks on the node with the fewest existing tasks of the same service. This provides natural load distribution. You can override this with placement constraints and preferences.
Placement constraints: Hard requirements that a node must satisfy. Examples:
- node.role==manager: only run on manager nodes
- node.labels.zone==us-east-1a: only run in a specific availability zone
- node.hostname==worker-3: pin to a specific node
Placement preferences: Soft preferences that guide scheduling but do not prevent placement. Example: --placement-pref 'spread=node.labels.zone' distributes tasks evenly across zones.
Resource limits:
- --limit-cpu: maximum CPU a task can consume (e.g., 0.5 = half a core)
- --limit-memory: maximum memory (e.g., 512m)
- --reserve-cpu: guaranteed CPU allocation
- --reserve-memory: guaranteed memory allocation
Without resource limits, a single misbehaving container can consume all resources on a node, starving other tasks. Resource reservations ensure critical services always have the resources they need.
Failure scenario — no resource limits, noisy neighbor: A team deployed a memory-intensive analytics service without --limit-memory. The service gradually consumed all available RAM on a worker node. The kernel OOM-killed other containers on the same node, including a critical payment service. The payment service was rescheduled to another node (Swarm's self-healing), but the 30-second rescheduling delay caused a brief payment outage. The fix: add --limit-memory to all services and --reserve-memory for critical services.
Constraints are hard requirements. If no node satisfies the constraint, the task stays in 'Pending' state until a matching node becomes available.
Preferences are soft guidelines. Swarm tries to satisfy them but can place the task on any node if no preference match exists.
Use constraints for critical requirements: 'must run on SSD', 'must not run on managers'.
Use preferences for optimization: 'prefer to spread across zones', 'prefer nodes with fewer tasks'.
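As a sketch of how constraints, preferences, limits, and reservations combine on one service, the function below wraps a single docker service create call. The service name, image, and the zone node label are assumptions made up for illustration; the DOCKER variable is a hypothetical override so the command can be printed instead of executed.

```bash
#!/usr/bin/env bash
# Sketch: one service with a hard constraint, a soft spread preference,
# and resource limits/reservations.
# Assumptions: service name "payments", image "example/payments:1.0.0",
# and a node label called "zone" are placeholders, not from the article.
# Set DOCKER=echo to print the command instead of running it.

deploy_payments() {
  "${DOCKER:-docker}" service create \
    --name payments \
    --replicas 3 \
    --constraint 'node.role==worker' \
    --placement-pref 'spread=node.labels.zone' \
    --limit-cpu 0.5 \
    --limit-memory 512m \
    --reserve-cpu 0.25 \
    --reserve-memory 256m \
    example/payments:1.0.0
}
```

Calling deploy_payments on a manager node creates the service; DOCKER=echo deploy_payments prints the command for review. The constraint keeps tasks off managers outright, while the preference only asks the scheduler to balance across zones where it can.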
Production Insight
The --update-order start-first flag starts the new container before stopping the old one. This provides zero-downtime deployments but temporarily raises resource usage, because up to --update-parallelism extra tasks run alongside the old ones. With --limit-memory 512m and 6 replicas, steady state is at most 3 GB; updating 2 tasks at a time peaks at 8 tasks (4 GB), and updating all 6 at once peaks at 12 tasks (6 GB). Ensure your cluster has enough headroom for rolling updates. If headroom is limited, use stop-first order instead.
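A rough way to size that headroom is to compute the worst case directly. The sketch below assumes every task uses its full --limit-memory and that start-first runs up to --update-parallelism new tasks alongside the old ones; the function name peak_memory_mb is illustrative.

```bash
#!/usr/bin/env bash
# Sketch: worst-case memory during a start-first rolling update,
# assuming every task consumes its full --limit-memory.
# Usage: peak_memory_mb REPLICAS PARALLELISM LIMIT_MB
peak_memory_mb() {
  local replicas=$1 parallelism=$2 limit_mb=$3
  # --update-parallelism 0 means "update all tasks at once";
  # the extra tasks can never exceed the replica count
  if (( parallelism == 0 || parallelism > replicas )); then
    parallelism=$replicas
  fi
  echo $(( (replicas + parallelism) * limit_mb ))
}

peak_memory_mb 6 2 512   # 6 replicas, 2 at a time, 512 MB limit
```

For 6 replicas at 512 MB, updating 2 at a time peaks at 4096 MB, while updating all 6 at once peaks at 6144 MB, double the steady-state 3072 MB.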
Key Takeaway
Always set resource limits on production services. Without limits, a single misbehaving container can OOM-kill other containers on the same node. Use placement constraints to isolate critical services and spread across availability zones. The spread scheduler distributes tasks evenly by default.
Resource Limit Strategy
If: Stateless web API with predictable resource usage
→
Use: Set --limit-cpu and --limit-memory based on load testing. Use --reserve-memory for critical services.
If: Memory-intensive batch processing
→
Use: Set generous --limit-memory but low --limit-cpu. Use placement constraints to isolate on dedicated nodes.
If: Latency-sensitive service (trading, real-time)
→
Use: Use --reserve-cpu to guarantee CPU. Consider host-mode publishing to bypass routing mesh. Pin to dedicated nodes.
If: Development/testing
→
Use: Skip resource limits. They add complexity without benefit in non-production environments.
Overlay Networks and Cross-Host Container Communication
Docker Swarm uses overlay networks to enable containers on different hosts to communicate as if they were on the same network. The overlay network uses VXLAN (Virtual Extensible LAN) encapsulation to tunnel Layer 2 traffic over the underlying Layer 3 network.
How it works: When container A on node 1 sends a packet to container B on node 2, the VXLAN driver encapsulates the packet in a UDP datagram on port 4789 and sends it to node 2. Node 2 decapsulates the packet and delivers it to container B. The containers see each other's overlay IP addresses as if they were on the same LAN.
The ingress routing mesh: When you publish a port with --publish, Swarm creates a route in the ingress network that load-balances incoming traffic across all nodes running the service. Any node in the cluster can receive traffic for any service, regardless of whether that node is running the service's containers. The routing mesh forwards the traffic to a node that is running a healthy task.
The extra-hop problem: The routing mesh adds one network hop. A request to node 1 may be routed to a container on node 3. This adds latency. For latency-sensitive services, use host-mode publishing: --publish published=8080,target=8080,mode=host. This bypasses the routing mesh and binds directly to the host's port. The trade-off: only nodes running the service's containers accept traffic — you lose the any-node routing benefit.
Failure scenario — VXLAN port blocked by firewall: A team deployed a 3-node Swarm cluster across two data centers. Containers in data center A could not reach containers in data center B. The team spent 4 hours debugging DNS, service discovery, and overlay configuration. The root cause: the firewall between data centers blocked UDP port 4789 (VXLAN). After opening the port, overlay connectivity was restored immediately.
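To make the port requirements concrete, the sketch below prints (and does not apply) iptables rules for traffic between Swarm nodes. It assumes all nodes sit in one subnet passed as an argument and that the hosts use iptables; the function name and the example subnet are illustrative, and real firewalls (cloud security groups, nftables, ufw) will need the equivalent rules in their own syntax.

```bash
#!/usr/bin/env bash
# Sketch: print iptables rules for the ports Swarm needs between nodes.
# Nothing is applied; review the output before using it anywhere.
# Usage: swarm_firewall_rules PEER_CIDR   (subnet containing all Swarm nodes)
swarm_firewall_rules() {
  local peer_cidr=$1 rule
  # proto:port pairs: cluster management, gossip (TCP and UDP), VXLAN
  for rule in tcp:2377 tcp:7946 udp:7946 udp:4789; do
    echo "iptables -A INPUT -s $peer_cidr -p ${rule%%:*} --dport ${rule##*:} -j ACCEPT"
  done
  # ESP (IP protocol 50) is only needed for overlays created with --opt encrypted
  echo "iptables -A INPUT -s $peer_cidr -p esp -j ACCEPT"
}

swarm_firewall_rules 10.0.1.0/24
```

Comparing this list against the rules on the firewall between data centers is a quick first check before digging into DNS or overlay configuration.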
io/thecodeforge/swarm-networking.sh (Bash)
#!/bin/bash
# Overlay network setup and troubleshooting
# ── Create an overlay network with encryption ─────────────────────
docker network create \
--driver overlay \
--attachable \
--opt encrypted \
--subnet 10.0.10.0/24 \
io-thecodeforge-overlay
# --driver overlay: VXLAN-based cross-host networking
# --attachable: allows standalone containers to join (useful for debugging)
# --opt encrypted: encrypts VXLAN traffic with IPsec (adds ~10% overhead)
# --subnet: explicit IP range for the overlay network
# ── Deploy a service on the overlay network ───────────────────────
docker service create \
--name api \
--network io-thecodeforge-overlay \
--replicas 3 \
io.thecodeforge/api:v2.3.1
# ── Verify overlay network peers (should list all nodes) ──────────
docker network inspect io-thecodeforge-overlay --format '{{json .Peers}}' | python3 -m json.tool
# Each peer represents a node participating in the overlay network
# If a peer is missing, that node cannot communicate on the overlay
# ── Test cross-host connectivity ──────────────────────────────────
# From any node, run a debug container on the overlay network
docker run --rm -it --network io-thecodeforge-overlay alpine sh
# Inside the container:
# ping <overlay-ip-of-service-task>
# nslookup tasks.api # DNS round-robin for all service tasks
# ── Required ports for Swarm networking ───────────────────────────
# TCP 2377: Swarm cluster management (Raft)
# TCP/UDP 7946: Gossip-based node discovery
# UDP 4789: VXLAN overlay network traffic
# Protocol 50 (ESP): IPsec encryption (if --opt encrypted)
# ── Host-mode publishing (bypass routing mesh) ────────────────────
docker service create \
--name api-latency-sensitive \
--network io-thecodeforge-overlay \
--publish published=8080,target=8080,mode=host \
--mode global \
io.thecodeforge/api:v2.3.1
# mode=global: one task per node (every node runs the service)
# mode=host: binds directly to host port 8080, no routing mesh hop
Output
Network io-thecodeforge-overlay created
[
{
"Name": "manager-1",
"IP": "10.0.1.10"
},
{
"Name": "worker-1",
"IP": "10.0.1.11"
},
{
"Name": "worker-2",
"IP": "10.0.1.12"
}
]
# All 3 peers are present — overlay network is healthy
Overlay Network as a Virtual Office Floor
Latency-sensitive services where the extra routing mesh hop adds unacceptable delay.
Services that need to bind to specific host ports for external load balancer integration.
Services running in --mode global (one per node) where every node already has a container.
Trade-off: you lose the any-node routing benefit. Traffic only reaches nodes running the service.
Production Insight
The --opt encrypted flag adds IPsec encryption to VXLAN traffic. This is important for multi-data-center or cloud deployments where traffic crosses untrusted networks. The overhead is approximately 10% throughput reduction and slightly higher CPU usage. For single-data-center deployments on a trusted network, skip encryption to avoid the overhead.
Key Takeaway
Overlay networks use VXLAN on UDP port 4789. If this port is blocked by firewalls, containers on different nodes cannot communicate. The routing mesh adds one network hop — use host-mode publishing for latency-sensitive services. Always use --opt encrypted for cross-data-center overlays.
Rolling Updates, Rollback and Zero-Downtime Deployments
Swarm's rolling update mechanism replaces old containers with new ones incrementally, ensuring the service remains available throughout the deployment. The update configuration controls the pace and failure behavior.
Update parameters:
- --update-parallelism: how many tasks to update simultaneously (default: 1)
- --update-delay: wait time between updating batches (default: 0s)
- --update-failure-action: what to do if a new task fails (pause, continue, or rollback; default: pause)
- --update-order: start-first (new container starts before old stops) or stop-first (old stops before new starts)
- --update-max-failure-ratio: fraction of updated tasks that may fail before the failure action triggers (0.25 = 25%)
The start-first vs stop-first trade-off:
- start-first: zero downtime, but temporarily increases resource usage during deployment
- stop-first: lower resource usage, but a brief window where one fewer replica is running
Rollback: If a rolling update fails, Swarm can automatically roll back to the previous version. The rollback configuration mirrors the update configuration. Manual rollback: docker service rollback <service>.
Failure scenario (update without a health-check start period causes cascading failure): A team deployed a new API version whose health endpoint only became ready about 30 seconds after startup. The team did not configure --health-start-period, so the health check failed before the app was ready and Swarm marked each new task as failed. The service had been configured with --update-failure-action continue (the default is pause), so Swarm kept replacing healthy containers with the failing new version. Within 2 minutes, every replica was running a version that never passed its health check. The fix: set --update-failure-action rollback and configure --health-start-period to allow startup time.
io/thecodeforge/swarm-rolling-update.sh (Bash)
#!/bin/bash
# Zero-downtime rolling update with automatic rollback
# ── Initial deployment ────────────────────────────────────────────
# Rolling update: 2 at a time, 10s delay, auto-rollback on failure
# Rollback policy: 1 at a time, 5s delay, stop-first
docker service create \
  --name io-thecodeforge-api \
  --replicas 6 \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --health-cmd 'curl -f http://localhost:8080/health || exit 1' \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  --health-start-period 40s \
  --update-parallelism 2 \
  --update-delay 10s \
  --update-failure-action rollback \
  --update-max-failure-ratio 0.25 \
  --update-order start-first \
  --rollback-parallelism 1 \
  --rollback-delay 5s \
  --rollback-order stop-first \
  --network io-thecodeforge-overlay \
  --publish published=8080,target=8080 \
  io.thecodeforge/api:v2.3.0
# ── Rolling update to new version ─────────────────────────────────
docker service update \
--image io.thecodeforge/api:v2.3.1 \
--update-parallelism 2 \
--update-delay 10s \
io-thecodeforge-api
# ── Monitor the update progress ───────────────────────────────────
docker service ps io-thecodeforge-api \
--format '{{.Name}} {{.Image}} {{.CurrentState}} {{.Error}}' \
| head -20
# You will see old tasks shutting down and new tasks starting
# ── Manual rollback if needed ─────────────────────────────────────
docker service rollback io-thecodeforge-api
# Reverts to the previous image and configuration
# ── Force update (redeploy without changing image) ────────────────
docker service update --force io-thecodeforge-api
# Useful when container config has changed but image tag is the same
# api.3 io.thecodeforge/api:v2.3.0 Running (waiting for delay)
# api.4 io.thecodeforge/api:v2.3.0 Running (waiting for delay)
# api.5 io.thecodeforge/api:v2.3.0 Running
# api.6 io.thecodeforge/api:v2.3.0 Running
Rolling Update as Renovating a Hotel Floor by Floor
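With --update-failure-action continue, a failing update keeps replacing healthy containers with the broken version. With the default (pause), the update stops and leaves the service half-updated until someone intervenes.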
With rollback, Swarm detects failures and automatically reverts to the previous working version.
The --update-max-failure-ratio flag controls the failure threshold. 0.25 means 25% failure triggers rollback.
Always pair rollback with health checks. Without health checks, Swarm cannot detect a broken container.
Production Insight
The --health-start-period flag is essential for services with slow startup times (JVM warmup, database migrations, cache hydration). Without it, the health check runs immediately and may fail before the application is ready, triggering an unnecessary rollback. Set it to the expected maximum startup time plus a buffer.
Key Takeaway
Always set --update-failure-action rollback in production. Without it, a failed update pauses by default and leaves the service half-updated, and with continue it replaces every healthy container. Use --health-start-period for services with slow startup. start-first provides zero downtime but raises resource usage during deployment, so ensure cluster headroom.
Swarm Secrets and Configs — Immutable, Encrypted, Rotatable
Docker Swarm provides built-in secrets management through the Raft log. Secrets are encrypted at rest and in transit, mounted as files in /run/secrets/ inside containers, and never written to image layers.
How secrets work:
- docker secret create: stores the secret in the Raft log, which is encrypted at rest on the managers
- The secret is replicated to every manager node (encrypted)
- When a service references a secret, it is mounted as a file at /run/secrets/<secret-name> in each of the service's containers
- Secrets are immutable: changing a value means creating a new secret under a new name
How configs work:
- docker config create: stores configuration files in the Raft log
- Configs are mounted as files in the container (not encrypted at rest; use secrets for sensitive data)
- Useful for nginx.conf, application.yaml, or any configuration file
Secret rotation: Secrets are immutable. To rotate a secret:
1. Create a new secret: docker secret create db-password-v2 -
2. Update the service to use the new secret: docker service update --secret-rm db-password --secret-add db-password-v2 <service>
3. The service restarts with the new secret mounted
4. Delete the old secret: docker secret rm db-password
Failure scenario (secret not updating in running service): A team rotated a database password by creating a new secret and adding it to the service. However, the application inside the container still read the old password from /run/secrets/db-password. Because Docker secrets are immutable, the old secret file stays mounted under its original name until the service is explicitly updated to remove it, and the new secret appears under its own name (/run/secrets/db-password-v2) unless a target is given. The fix: use --secret-rm to remove the old secret and --secret-add to add the new one in the same update command, with source=db-password-v2,target=db-password if the application expects the original path.
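A minimal sketch of that fix is shown below. It swaps the secret in one docker service update and uses the source=...,target=... form of --secret-add so the file path inside the container does not change. The function name rotate_secret and the DOCKER override are illustrative assumptions, and the new secret must already exist (created with docker secret create) before this runs.

```bash
#!/usr/bin/env bash
# Sketch: rotate a secret while keeping the in-container path stable.
# Assumes the new secret was already created with `docker secret create`.
# Set DOCKER=echo to print the command instead of running it.
# Usage: rotate_secret SERVICE OLD_SECRET NEW_SECRET MOUNT_NAME
rotate_secret() {
  local service=$1 old=$2 new=$3 mount_name=$4
  "${DOCKER:-docker}" service update \
    --secret-rm "$old" \
    --secret-add "source=${new},target=${mount_name}" \
    "$service"
}
```

For example, rotate_secret api db-password db-password-v2 db-password leaves the application reading /run/secrets/db-password, now backed by the new value. Once the update has rolled out, the old secret can be removed with docker secret rm.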
io/thecodeforge/swarm-secrets.sh (Bash)
#!/bin/bash
# Secrets management in Docker Swarm
# ── Create a secret from stdin ────────────────────────────────────
# printf avoids the trailing newline that echo would store in the secret
printf '%s' 's3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password -
# ── Create a secret from a file ───────────────────────────────────
docker secret create io-thecodeforge-tls-cert /path/to/cert.pem
# ── Create a config (non-sensitive configuration) ─────────────────
docker config create io-thecodeforge-nginx-conf /path/to/nginx.conf
# ── Deploy a service with secrets and configs ─────────────────────
docker service create \
--name io-thecodeforge-api \
--secret io-thecodeforge-db-password \
--secret io-thecodeforge-tls-cert \
--config source=io-thecodeforge-nginx-conf,target=/etc/nginx/nginx.conf \
io.thecodeforge/api:v2.3.1
# ── Access secrets inside the container ────────────────────────────
docker exec <container> cat /run/secrets/io-thecodeforge-db-password
# Output: s3cret_p@ssw0rd
docker exec <container> ls /run/secrets/
# io-thecodeforge-db-password
# io-thecodeforge-tls-cert
# ── Rotate a secret ───────────────────────────────────────────────
# Step 1: Create new version
printf '%s' 'new_s3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password-v2 -
# Step 2: Update service (remove old, add new)
docker service update \
  --secret-rm io-thecodeforge-db-password \
  --secret-add io-thecodeforge-db-password-v2 \
  io-thecodeforge-api
# Step 3: Verify new secret is mounted
docker exec <container> cat /run/secrets/io-thecodeforge-db-password-v2
# Output: new_s3cret_p@ssw0rd
# Step 4: Clean up old secret
docker secret rm io-thecodeforge-db-password
# ── List all secrets ──────────────────────────────────────────────
docker secret ls
# ID       NAME                          CREATED
# abc123   io-thecodeforge-db-password   2 hours ago
# def456   io-thecodeforge-tls-cert      2 hours ago
Secrets are encrypted at rest in the Raft log. ENV variables are stored in plaintext in container metadata.
Secrets are mounted as files. Their values do not appear in docker inspect, docker ps, or process listings; only the secret names do.
Secrets are distributed only to nodes running tasks that reference them. ENV variables are visible to anyone who can inspect the service or container.
Secrets are immutable and versioned. ENV variables can be accidentally changed or logged.
Production Insight
Docker secrets are Swarm-only. If you use standalone Docker (not Swarm), you must use alternative secrets management: Docker Compose secrets (file-based, not encrypted), HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets. Plan your secrets strategy before choosing an orchestration platform.
Key Takeaway
Docker secrets are encrypted, immutable, and mounted as files in /run/secrets/. Never use ENV for secrets — they are visible in docker inspect. To rotate a secret, create a new version and update the service with --secret-rm and --secret-add. Secrets are Swarm-only — standalone Docker requires alternative solutions.
Production incident · Post-mortem · Severity: high
Quorum Loss After Losing 2 of 4 Manager Nodes: All Services Unreachable for 3 Hours
Symptom
After a planned data center maintenance window, the operations team could not deploy new services. docker service ls hung for 30 seconds then returned 'Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded'. Existing services continued running but were unreachable via the routing mesh. docker node ls on the surviving managers showed the 2 offline managers as 'Down' but 'Reachable' was false for all managers.
Assumption
Team assumed the offline managers would come back after maintenance and the cluster would self-heal. They waited 2 hours. The managers came back online, but the cluster was still unresponsive. They assumed a Docker daemon bug and considered rebuilding the entire cluster from scratch.
Root cause
With 4 manager nodes, the Raft quorum requires at least 3 managers to agree on any state change (quorum = floor(n/2) + 1 = floor(4/2) + 1 = 3). When 2 managers went offline, only 2 remained — insufficient for quorum. The Raft consensus algorithm froze. No new state changes could be committed. When the offline managers returned, they had stale Raft logs. The cluster needed manual intervention to re-establish consensus. The root design flaw was using an even number of managers (4) instead of an odd number (3 or 5).
Fix
1. Demoted one offline manager to worker: docker node demote <node-id>. This reduced the manager count to 3, making quorum = 2, which the 2 surviving managers could satisfy. Note that docker node demote is itself a Raft write and only succeeds while a quorum exists; if quorum cannot be regained by bringing managers back online, the documented recovery is docker swarm init --force-new-cluster on a surviving manager.
2. Promoted a worker to manager: docker node promote <worker-id>. This restored the manager count to 3 (odd).
3. Added a monitoring alert for Raft quorum health: docker node ls | grep -c 'Leader\|Reachable' to detect quorum loss early.
4. Documented the rule: always use 3 or 5 managers, never 4 or 6.
5. Migrated critical services to Kubernetes for the long term, as the team's scale exceeded Swarm's sweet spot.
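The quorum alert from the fix list can be sketched as a small filter over manager statuses. This assumes the statuses come from docker node ls --filter role=manager --format '{{.ManagerStatus}}' and take the values Leader, Reachable, or Unreachable; the function name has_quorum and its output format are illustrative choices.

```bash
#!/usr/bin/env bash
# Sketch: decide whether the swarm still has quorum from manager statuses
# read on stdin, one per line, e.g. from:
#   docker node ls --filter role=manager --format '{{.ManagerStatus}}'
# Returns 0 when healthy managers >= floor(total/2) + 1, otherwise 1.
has_quorum() {
  local total=0 healthy=0 status
  while IFS= read -r status; do
    [ -z "$status" ] && continue          # skip blank lines
    total=$(( total + 1 ))
    case "$status" in
      Leader|Reachable) healthy=$(( healthy + 1 )) ;;
    esac
  done
  local needed=$(( total / 2 + 1 ))
  echo "healthy=$healthy total=$total quorum=$needed"
  (( total > 0 && healthy >= needed ))
}
```

A monitoring job could pipe the docker node ls output into has_quorum and alert on a non-zero exit. Since docker node ls itself fails once quorum is gone, the job should treat a failed or timed-out docker command as an alert too, and ideally warn earlier, when the healthy count is only one above the quorum.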
Key lesson
Always use an odd number of manager nodes: 3 or 5. An even number (4, 6) wastes a node without improving fault tolerance.
Quorum = floor(n/2) + 1. With 3 managers, you can lose 1. With 5 managers, you can lose 2. With 4 managers, you can still only lose 1 — the 4th node provides no additional resilience.
Monitor Raft quorum health proactively. A cluster that loses quorum cannot schedule, scale, or update services — even though existing containers keep running.
Never run application workloads on manager nodes. Resource contention can starve the Raft process and cause the manager to appear unreachable, triggering unnecessary leader elections.
When quorum is lost, do not reboot all managers simultaneously. Restore one manager at a time and verify Raft log consistency before bringing up the next.
Production debug guide
From quorum loss to service scheduling failures: systematic debugging paths.
Symptom · 01
docker service ls hangs or returns 'DeadlineExceeded'.
→
Fix
Check Raft quorum health. Run docker node ls on each manager. If fewer than a quorum of managers show 'Leader' or 'Reachable', the cluster has lost quorum. Check if managers are reachable via SSH. Restart the Docker daemon on unreachable managers one at a time. If quorum cannot be restored that way, run docker swarm init --force-new-cluster on a healthy manager to rebuild a single-manager cluster from its state, then add managers back until the count is odd again.
Symptom · 02
Service tasks are stuck in 'Pending' state and never start.
→
Fix
Check resource constraints: docker service ps <service> --no-trunc. Look for 'no suitable node' errors. Verify node availability: docker node ls. Check if nodes have enough CPU/memory: docker node inspect <node> --format '{{.Description.Resources}}'. Check placement constraints: docker service inspect <service> --format '{{.Spec.TaskTemplate.Placement.Constraints}}'.
Symptom · 03
Service is running but unreachable via published port.
→
Fix
Check if the routing mesh is functioning: curl http://<any-node-ip>:<published-port>. If it works on some nodes but not others, the ingress network may be misconfigured. Inspect the ingress network: docker network inspect ingress. Check if the service has healthy tasks: docker service ps <service> --filter desired-state=running. Verify the container is listening: docker exec <container> ss -tlnp.
Symptom · 04
Rolling update is stuck and not progressing.
→
Fix
Check update status: docker service ps <service> --no-trunc. Look for tasks in 'Failed' or 'Rejected' state and read the error column. Check the new image exists and is pullable: docker pull <image>. Check whether the update was paused by failing tasks or health checks: docker service inspect --pretty <service> and read the UpdateStatus section. Adjust update parallelism and delay: docker service update --update-parallelism 1 --update-delay 30s <service>.
Symptom · 05
Node shows 'Down' but the server is online.
→
Fix
Check Docker daemon status on the node: systemctl status docker. Check if the node's IP changed (common in cloud environments with dynamic IPs). Swarm uses the IP from docker swarm init/join. If the IP changed, the node must rejoin the cluster. Check firewall rules: Swarm requires TCP 2377 (cluster management), TCP/UDP 7946 (gossip), and UDP 4789 (overlay VXLAN) to be open between all nodes.
Symptom · 06
Secrets or configs not updating in running services.
→
Fix
Docker secrets and configs are immutable. Updating a secret creates a new version. The service must be updated to reference the new secret: docker service update --secret-rm <old-secret> --secret-add <new-secret> <service>. Verify the secret is mounted: docker exec <container> ls /run/secrets/.
Docker Swarm Triage Cheat Sheet
First-response commands when Swarm cluster or service issues are reported.
Cluster unresponsive: docker service ls hangs.
Immediate action
Check Raft quorum across all manager nodes.
Commands
docker node ls
docker info --format '{{.Swarm.ControlAvailable}}' (run on each manager)
Fix now
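If fewer than a quorum of managers are reachable, restart the Docker daemon on one manager at a time. If a manager is permanently dead and quorum still holds, demote and remove it: docker node demote <node-id>, then docker node rm <node-id>. If quorum is lost, recover with docker swarm init --force-new-cluster on a healthy manager.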
Service tasks stuck in 'Pending' or 'Failed' state.
Immediate action
Check task failure reason and node resource availability.
Fix now
If 'no suitable node', check constraints and resource limits. Remove constraints or add nodes. If container crashes, check logs: docker service logs <service> --tail 50.
Service unreachable via published port on specific nodes.
Fix now
If VXLAN port is blocked, open UDP 4789 in firewall. If peers are missing, restart Docker on affected nodes. Consider using host-mode publishing for latency-sensitive services.
Docker Swarm vs Kubernetes — When to Choose Which
Aspect | Docker Swarm | Kubernetes
Setup complexity | Single command: docker swarm init | Requires kubeadm, kops, or managed service (EKS, GKE)
Learning curve | Low: uses standard Docker CLI | Steep: new concepts (pods, deployments, services, ingress)
Built-in features | Service discovery, load balancing, secrets, rolling updates | All of the above plus CRDs, operators, admission controllers
Large-scale, complex workloads, teams with dedicated platform engineers
Key takeaways
1. Docker Swarm is the native orchestration layer built into the Docker Engine. It uses Raft consensus for state management and VXLAN overlay networks for cross-host communication.
2. Always use an odd number of manager nodes (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes.
3. The ingress routing mesh adds one network hop. For latency-sensitive services, use host-mode publishing. Always open UDP 4789, TCP/UDP 7946, and TCP 2377 between nodes.
4. Always set --update-failure-action rollback and health checks with --health-start-period. Without rollback, a failed update pauses half-finished by default, or with continue replaces all healthy containers.
5. Docker secrets are encrypted, immutable, and mounted as files. Never use ENV for secrets. Secrets are Swarm-only; standalone Docker requires alternatives.
6. Swarm is ideal for small-to-medium deployments. For 100+ nodes or complex workloads, consider migrating to Kubernetes.
Frequently Asked Questions
01
Is Docker Swarm still maintained?
Yes. Docker continues to maintain Swarm as part of the Docker Engine. It is not deprecated. Swarm is the right choice for small-to-medium deployments that do not need Kubernetes' complexity. Docker Compose also supports deploying to Swarm with docker stack deploy.
02
How many manager nodes should I use?
Always use 3 or 5 manager nodes, never an even number. With 3 managers, you can tolerate 1 failure. With 5 managers, you can tolerate 2 failures. An even number (4, 6) provides no additional fault tolerance over the next lower odd number. Never run workloads on manager nodes.
03
What is the difference between a service and a task in Docker Swarm?
A service is the declarative specification: which image, how many replicas, resource limits, update policy. A task is the atomic scheduling unit — one task equals one container. A service with 6 replicas has 6 tasks. The Swarm scheduler assigns tasks to nodes.
04
How does the routing mesh work?
The routing mesh is an ingress network that load-balances published ports across all nodes in the cluster. Any node can receive traffic for any service, regardless of whether that node is running the service's containers. The mesh forwards traffic to a node with a healthy task. The trade-off: one extra network hop.
05
Can I use Docker Swarm in production?
Yes, for small-to-medium deployments (up to ~100 nodes). Swarm provides self-healing, rolling updates, secrets management, and overlay networking. For larger scale or complex workloads (custom controllers, CRDs, advanced networking), Kubernetes is a better fit.