Docker Swarm — Why 4 Managers Caused a 3-Hour Outage
4 manager nodes lost quorum when 2 failed — freezing all deployments for 3 hours.
20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.
- Manager nodes: run the Raft consensus algorithm, maintain cluster state, schedule services
- Worker nodes: execute tasks (containers) assigned by managers
- Services: the declarative unit — you define desired state, Swarm converges reality to match
- Tasks: the atomic scheduling unit — one task = one container
- Raft consensus requires a quorum (majority) of managers to agree on state changes
- Overlay networks span hosts so containers can communicate across nodes
- Ingress routing mesh load-balances published ports across all nodes
- Rolling updates replace containers incrementally with zero downtime
Imagine a restaurant chain with one head office (the manager) and ten kitchens across the city (the workers). A customer order comes in — the head office decides which kitchen handles it, monitors the food being made, and if one kitchen burns down, it quietly reroutes the order to another kitchen without the customer ever knowing. Docker Swarm is exactly that: one command-and-control brain (the manager node) coordinating a fleet of worker nodes, making sure your containers keep running no matter what breaks.
Every production app eventually outgrows a single server. Traffic spikes, hardware fails, deployments need to happen without downtime. Docker Swarm is the native clustering and orchestration layer baked directly into the Docker Engine.
Swarm solves coordination across multiple hosts. When you have ten nodes, you need something to decide where a container lands, what happens when a node dies, how containers on different hosts communicate, and how you push a new image without dropping requests. Swarm encodes those answers into a distributed state machine backed by the Raft consensus algorithm.
Common misconceptions: Swarm is not deprecated (Docker continues to maintain it alongside Compose). Swarm is not Kubernetes-lite (it has a fundamentally different architecture — no pods, no CRDs, no etcd). Swarm's simplicity is its strength for small-to-medium deployments that do not need Kubernetes' complexity.
Why Docker Swarm's Manager Count Matters More Than You Think
Docker Swarm is a container orchestration engine built into Docker Engine that groups multiple hosts into a single virtual cluster. Its core mechanic is the Raft consensus algorithm: manager nodes elect a leader to coordinate all cluster state changes. Every service definition, secret, and configuration update must pass through the leader, which replicates it to a majority of managers before it's committed.
Swarm tolerates up to floor((N-1)/2) manager failures, and an even manager count buys nothing over the odd count below it. With 4 managers, a single failure drops you to 3, which is still a majority of 4. But if another fails, you're at 2 of 4: no majority, and the control plane freezes. No deployments, no scaling, no rescheduling of failed tasks. Containers that are already running keep running, so the system is alive but brain-dead. Raft requires a strict majority of all configured managers, not just the ones currently online.
Use Swarm when you need a simple, low-overhead orchestrator for a small-to-medium cluster (under 50 nodes) and you want zero external dependencies — no etcd, no ZooKeeper. It's ideal for teams that already run Docker and need basic HA without the operational complexity of Kubernetes. But the manager count is not a scaling knob; it's a fault-tolerance decision. Run 3 or 5, never 4.
Raft Consensus and Manager Node Architecture
Swarm's cluster state is stored in a distributed log managed by the Raft consensus algorithm. Every manager node runs a full copy of the Raft log. State changes (service updates, node joins, secret creation) are proposed by the leader, replicated to a quorum of followers, and then committed.
The quorum formula is floor(n/2) + 1, where n is the number of managers. With 3 managers, quorum is 2. With 5 managers, quorum is 3. The cluster can tolerate floor((n-1)/2) manager failures. With 3 managers, you can lose 1. With 5 managers, you can lose 2.
An even number of managers provides no additional fault tolerance over the next lower odd number. With 4 managers, quorum is 3 — you can still only lose 1 manager, same as with 3 managers. The 4th node is wasted.
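The arithmetic above is easy to check by hand, but a small script makes the odd-versus-even pattern obvious. This is a minimal sketch in POSIX shell; the function names quorum and tolerated_failures are made up for illustration and are not part of the Docker CLI.

```shell
#!/bin/sh
# Quorum size for n managers: floor(n/2) + 1.
# Shell integer division already rounds down for positive numbers.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Managers that can fail while a majority survives: floor((n-1)/2).
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# Print one row per manager count so 3 vs 4 and 5 vs 6 can be compared.
for n in 1 2 3 4 5 6 7; do
  printf 'managers=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

The rows for 3 and 4 managers report the same tolerated failure count, which is the whole argument for never running 4.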
Leader election: When the leader fails or becomes unreachable, the remaining managers hold an election. A manager whose randomized election timeout expires first stands as a candidate, and it wins only if a majority of managers vote for it; managers refuse to vote for a candidate whose Raft log is less up to date than their own. Election timeouts are on the order of seconds. A network partition does not produce two working leaders: only the side that still holds quorum can elect a leader and commit new state changes, while the minority side stalls.
Failure scenario — manager resource starvation: A team ran a memory-intensive batch job on a manager node. The job consumed all available RAM, causing the Docker daemon to be OOM-killed. The daemon restart triggered a Raft leader election. During the election window (1-2 seconds), no state changes could be committed. The team noticed brief delays in service updates. The fix: cordon manager nodes from workloads using docker node update --availability drain <manager-node>.
- Quorum = floor(n/2) + 1. With 3 managers, quorum is 2. With 4 managers, quorum is 3.
- With 3 managers, you can lose 1 and still have quorum (2 >= 2).
- With 4 managers, you can lose 1 and still have quorum (3 >= 3). But losing 2 breaks quorum (2 < 3).
- The 4th manager adds cost (server, maintenance) without adding fault tolerance. Always use 3 or 5.
Service Scheduling, Placement Constraints and Resource Limits
A Swarm service is a declarative specification of the desired state: which image to run, how many replicas, resource limits, placement constraints, and update policy. The Swarm scheduler assigns tasks (individual containers) to nodes that satisfy the constraints and have available resources.
Scheduling algorithm: Swarm uses a spread scheduler by default — it places tasks on the node with the fewest existing tasks of the same service. This provides natural load distribution. You can override this with placement constraints and preferences.
Placement constraints: Hard requirements that a node must satisfy. Examples:
- node.role==manager: only run on manager nodes
- node.labels.zone==us-east-1a: only run in a specific availability zone
- node.hostname==worker-3: pin to a specific node
Placement preferences: Soft preferences that guide scheduling but do not prevent placement. Example: --placement-pref 'spread=node.labels.zone' distributes tasks evenly across zones.
Resource limits:
- --limit-cpu: maximum CPU a task can consume (e.g., 0.5 = half a core)
- --limit-memory: maximum memory (e.g., 512m)
- --reserve-cpu: guaranteed CPU allocation
- --reserve-memory: guaranteed memory allocation
Without resource limits, a single misbehaving container can consume all resources on a node, starving other tasks. Resource reservations ensure critical services always have the resources they need.
Failure scenario — no resource limits, noisy neighbor: A team deployed a memory-intensive analytics service without --limit-memory. The service gradually consumed all available RAM on a worker node. The kernel OOM-killed other containers on the same node, including a critical payment service. The payment service was rescheduled to another node (Swarm's self-healing), but the 30-second rescheduling delay caused a brief payment outage. The fix: add --limit-memory to all services and --reserve-memory for critical services.
- Constraints are hard requirements. If no node satisfies the constraint, the task stays in 'Pending' state until a matching node appears.
- Preferences are soft guidelines. Swarm tries to satisfy them but can place the task on any node if no preference match exists.
- Use constraints for critical requirements: 'must run on SSD', 'must not run on managers'.
- Use preferences for optimization: 'prefer to spread across zones', 'prefer nodes with fewer tasks'.
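The constraints, preferences, limits, and reservations from this section can all go on one docker service create call. The sketch below wraps that call in a shell function so nothing runs until you invoke it. The service name payments, the image example/payments:1.0, the zone label, and the sizes are all placeholders invented for illustration.

```shell
#!/bin/sh
# Hypothetical service: name, image, and sizes are placeholders to adjust.
create_payment_service() {
  # --constraint is a hard rule: tasks never land on managers.
  # --placement-pref is a soft rule: spread across the zone label if possible.
  # --limit-* caps what a task may use; --reserve-* is what the scheduler
  # sets aside on the node before placing the task.
  docker service create \
    --name payments \
    --replicas 3 \
    --constraint 'node.role==worker' \
    --placement-pref 'spread=node.labels.zone' \
    --limit-cpu 0.5 \
    --limit-memory 512m \
    --reserve-cpu 0.25 \
    --reserve-memory 256m \
    example/payments:1.0
}

# Usage (on a manager node): create_payment_service
```

Keeping the reservation below the limit is a deliberate choice here: the scheduler guarantees the smaller number while the container is still allowed to burst up to the larger one.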
Overlay Networks and Cross-Host Container Communication
Docker Swarm uses overlay networks to enable containers on different hosts to communicate as if they were on the same network. The overlay network uses VXLAN (Virtual Extensible LAN) encapsulation to tunnel Layer 2 traffic over the underlying Layer 3 network.
How it works: When container A on node 1 sends a packet to container B on node 2, the VXLAN driver encapsulates the packet in a UDP datagram on port 4789 and sends it to node 2. Node 2 decapsulates the packet and delivers it to container B. The containers see each other's overlay IP addresses as if they were on the same LAN.
The ingress routing mesh: When you publish a port with --publish, Swarm creates a route in the ingress network that load-balances incoming traffic across all nodes running the service. Any node in the cluster can receive traffic for any service, regardless of whether that node is running the service's containers. The routing mesh forwards the traffic to a node that is running a healthy task.
The extra-hop problem: The routing mesh adds one network hop. A request to node 1 may be routed to a container on node 3. This adds latency. For latency-sensitive services, use host-mode publishing: --publish published=8080,target=8080,mode=host. This bypasses the routing mesh and binds directly to the host's port. The trade-off: only nodes running the service's containers accept traffic — you lose the any-node routing benefit.
Failure scenario — VXLAN port blocked by firewall: A team deployed a 3-node Swarm cluster across two data centers. Containers in data center A could not reach containers in data center B. The team spent 4 hours debugging DNS, service discovery, and overlay configuration. The root cause: the firewall between data centers blocked UDP port 4789 (VXLAN). After opening the port, overlay connectivity was restored immediately.
When host-mode publishing makes sense:
- Latency-sensitive services where the extra routing mesh hop adds unacceptable delay.
- Services that need to bind to specific host ports for external load balancer integration.
- Services running in --mode global (one per node) where every node already has a container.
- Trade-off: you lose the any-node routing benefit. Traffic only reaches nodes running the service.
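As a rough illustration of host-mode publishing combined with a global service, here is a sketch wrapped in a shell function. The service name edge-proxy and the nginx:stable image are stand-ins chosen for the example; the --publish syntax is the one quoted earlier in this section.

```shell
#!/bin/sh
# Hypothetical edge service: one task per node, bound straight to host port 8080.
publish_host_mode() {
  # --mode global runs exactly one task on every eligible node, so every node
  # has a local container to answer on the published port.
  # mode=host skips the ingress routing mesh for this port.
  docker service create \
    --name edge-proxy \
    --mode global \
    --publish published=8080,target=8080,mode=host \
    nginx:stable
}

# Usage (on a manager node): publish_host_mode
```

With this layout an external load balancer can target port 8080 on every node directly, and a node with no healthy task simply stops answering instead of forwarding elsewhere.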
Rolling Updates, Rollback and Zero-Downtime Deployments
Swarm's rolling update mechanism replaces old containers with new ones incrementally, ensuring the service remains available throughout the deployment. The update configuration controls the pace and failure behavior.
Update parameters:
- --update-parallelism: how many tasks to update simultaneously (default: 1)
- --update-delay: wait time between updating batches (default: 0s)
- --update-failure-action: what to do if a new task fails (pause, continue, rollback; default: pause)
- --update-order: start-first (new container starts before old stops) or stop-first (old stops before new starts)
- --update-max-failure-ratio: fraction of failed tasks (0 to 1) tolerated before the failure action fires
The start-first vs stop-first trade-off:
- start-first: zero downtime, but temporarily doubles resource usage during deployment
- stop-first: lower resource usage, but brief window where one fewer replica is running
Rollback: If a rolling update fails, Swarm can automatically roll back to the previous version. The rollback configuration mirrors the update configuration. Manual rollback: docker service rollback <service>.
Failure scenario (update without a health start period causes cascading failure): A team deployed a new API version that needed about 30 seconds before it could pass its health check, but did not configure --health-start-period. The health check failed before the app was ready, causing Swarm to mark each new task as failed. The service had been configured with --update-failure-action continue (the default is pause), so Swarm kept replacing healthy containers with the failing new version. Within 2 minutes, all containers were running the broken version. The fix: set --update-failure-action rollback and configure --health-start-period to allow startup time.
- Without rollback, a failing update continues replacing all healthy containers with the broken version.
- With rollback, Swarm detects failures and automatically reverts to the previous working version.
- The --update-max-failure-ratio flag sets the tolerated failure rate. 0.25 means the failure action (here, rollback) fires once more than 25% of updated tasks fail.
- Always pair rollback with health checks. Without health checks, Swarm cannot detect a broken container.
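Putting the update flags together, a cautious rollout might look like the sketch below. The function name safe_update and the specific durations are assumptions picked for illustration, and --update-monitor (how long Swarm watches each updated task for failure) is a flag not discussed above, included so a slow-failing task is still caught.

```shell
#!/bin/sh
# safe_update SERVICE IMAGE
# Rolls SERVICE to IMAGE one task at a time and reverts on failure.
# Durations below are illustrative, not recommendations.
safe_update() {
  docker service update \
    --image "$2" \
    --update-parallelism 1 \
    --update-delay 10s \
    --update-order start-first \
    --update-failure-action rollback \
    --update-max-failure-ratio 0.25 \
    --update-monitor 30s \
    --health-start-period 30s \
    "$1"
}

# Usage (on a manager node): safe_update api example/api:2.0
# Manual revert if needed:   docker service rollback api
```

The start period only helps if the image or service defines a health check in the first place, so treat this as one half of the setup.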
Swarm Secrets and Configs — Immutable, Encrypted, Rotatable
Docker Swarm provides built-in secrets management through the Raft log. Secrets are encrypted at rest and in transit, mounted as files in /run/secrets/ inside containers, and never written to image layers.
How secrets work:
- docker secret create: stores the secret in the Raft log, which is encrypted at rest on managers (with autolock enabled, the swarm unlock key protects the encryption keys)
- The secret is replicated to every manager node (encrypted)
- When a service references a secret, it is mounted as a file at /run/secrets/<secret-name>
- Secrets are immutable: to change a value, you create a new secret
How configs work:
- docker config create: stores configuration files in the Raft log
- Configs are mounted as files in the container (not encrypted at rest; use secrets for sensitive data)
- Useful for nginx.conf, application.yaml, or any configuration file
Secret rotation: Secrets are immutable. To rotate a secret:
1. Create a new secret: docker secret create db-password-v2 - (the trailing - reads the value from stdin)
2. Update the service to use the new secret: docker service update --secret-rm db-password --secret-add db-password-v2 <service>
3. The service restarts with the new secret mounted
4. Delete the old secret: docker secret rm db-password
Failure scenario — secret not updating in running service: A team updated a database password by creating a new secret and updating the service. However, the application inside the container still read the old password from /run/secrets/db-password. The team did not realize that Docker secrets are immutable — the old secret file remained mounted until the service was explicitly updated to remove it. The fix: use --secret-rm to remove the old secret and --secret-add to add the new one in the same update command.
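One way to avoid the stale-path trap is to mount the new secret under the old file name, so the application keeps reading /run/secrets/db-password. The sketch below assumes that approach; rotate_secret is a made-up helper name, and the source=...,target=... form of --secret-add is what sets the mounted file name.

```shell
#!/bin/sh
# rotate_secret SERVICE OLD_SECRET NEW_SECRET
# Reads the new secret value from stdin.
rotate_secret() {
  # 1. Store the new value under a new name (secrets are immutable).
  docker secret create "$3" - &&
  # 2. Swap secrets in a single update; target keeps the old file name,
  #    so the app's path under /run/secrets/ does not change.
  docker service update \
    --secret-rm "$2" \
    --secret-add "source=$3,target=$2" \
    "$1" &&
  # 3. Remove the old secret once no service references it.
  docker secret rm "$2"
}

# Usage (on a manager node):
#   printf '%s' "$NEW_PASSWORD" | rotate_secret api db-password db-password-v2
```

Each step is chained with &&, so a failed update leaves the old secret in place rather than deleting something the service still depends on.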
- Secrets are encrypted at rest in the Raft log. ENV variables are stored in plaintext in container metadata.
- Secrets are mounted as files: their values do not appear in docker inspect output, docker ps, or process listings.
- Secrets are delivered only to nodes running tasks that reference them. ENV variables are visible to anyone who can inspect the service, container, or image.
- Secrets are immutable; rotation means creating a new, explicitly named secret. ENV variables can be accidentally changed or logged.
Tasks and Services: The Two Abstractions You Can't Afford to Confuse
Newcomers treat 'service' and 'task' like synonyms. They're not. Get this wrong and your rolling updates will silently fail, your health checks will fire at ghosts, and you'll be debugging at 2 AM while your manager asks why production is serving 503s.
A Service is the declarative spec. You define the image, replicas, network, ports, resource limits — the desired state. Docker Swarm reconciles actual state to match. A Task is a running instance of that service. One replica = one task. When you scale to 10, you get 10 tasks, each with a unique ID tied to a specific node.
Here's the nasty bit: tasks are ephemeral. They fail, get rescheduled, get replaced during updates. Your monitoring must track task IDs, not container names. If you're scraping logs by container name, you'll lose the trail after any reschedule. Tag your logs with task ID and service name. Swarm does not inject these on its own, but you can expose them as environment variables with template placeholders such as {{.Task.ID}} and {{.Service.Name}} in --env.
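Swarm can hand each task its own identity through template placeholders in --env, which gives a logger something stable to tag lines with. A minimal sketch follows; the service name api, the image example/api:1.0, and the environment variable names are placeholders chosen for illustration.

```shell
#!/bin/sh
# Hypothetical service whose tasks each receive their own identity as env vars.
create_with_task_identity() {
  # Single quotes keep the shell from touching the {{...}} placeholders;
  # Swarm expands them per task when it creates the container.
  docker service create \
    --name api \
    --replicas 3 \
    --env 'SERVICE_NAME={{.Service.Name}}' \
    --env 'TASK_ID={{.Task.ID}}' \
    --env 'TASK_SLOT={{.Task.Slot}}' \
    --env 'NODE_HOSTNAME={{.Node.Hostname}}' \
    example/api:1.0
}

# Usage (on a manager node): create_with_task_identity
```

The application then reads TASK_ID and SERVICE_NAME like any other environment variable and adds them to every log line.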
Ports and Protocols: The Firewall Dance That Breaks Your Swarm
You've initialized your swarm, added workers, and everything works on your laptop. Then you deploy to bare metal in a colo and nodes can't talk to each other. Welcome to networking hell.
Swarm mode needs specific ports open between all nodes: not just manager to worker, but worker to worker, and manager to manager. Cluster management and Raft traffic uses TCP port 2377. Node-to-node gossip uses TCP and UDP port 7946. Overlay network data traffic, including ingress, travels over VXLAN on UDP port 4789.
Here's what the docs won't scream at you: opening these ports on cloud firewalls isn't enough. If your nodes are in different subnets with network ACLs between them, VXLAN encapsulation might get dropped. Check your MTU too — overlay networks add 50 bytes of overhead. Standard 1500 MTU on the underlay will fragment packets if you're not careful, and some cloud providers drop fragments silently.
Test with a simple service that pings between nodes before you declare victory.
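The MTU arithmetic is simple enough to script. This sketch subtracts the 50-byte VXLAN overhead from the underlay MTU and passes the result to the overlay driver through its com.docker.network.driver.mtu option. The function names and the network name in the usage line are invented for the example.

```shell
#!/bin/sh
# Bytes of VXLAN encapsulation overhead, as described above.
VXLAN_OVERHEAD=50

# overlay_mtu UNDERLAY_MTU
# Largest overlay MTU that avoids fragmentation on the underlay.
overlay_mtu() {
  echo $(( $1 - VXLAN_OVERHEAD ))
}

# create_overlay NAME UNDERLAY_MTU
# Creates an overlay network sized to fit inside the underlay MTU.
create_overlay() {
  docker network create \
    --driver overlay \
    --opt "com.docker.network.driver.mtu=$(overlay_mtu "$2")" \
    "$1"
}

# Usage (on a manager node): create_overlay app-overlay 1500
```

If the underlay path itself has a smaller MTU than the interface reports (tunnels and VPNs are the usual suspects), feed that smaller number in instead.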
Three Host Machines — Don't Even Think About Fewer
Your swarm needs at least three manager nodes. Not two. Not one. Three. This is the minimum to survive a single node failure without losing the Raft quorum you read about earlier.
Why three? Raft consensus requires a majority. With three managers, you can lose one and still have two, which is a majority. With two managers, quorum is also two, so losing one leaves you with one of two: not a majority. The swarm's control plane freezes. No scheduling, no updates, nothing. Running containers stay up, but you can't change anything. Production shops that run two managers are one disk failure away from a cluster-wide lockup.
Managers hold the cluster state. That state is distributed via Raft logs. Even if your applications run on worker nodes, the managers coordinate everything — service discovery, scheduling, scaling. Three hosts means you can reboot one for patches and the swarm keeps chewing. Anything less is gambling with your production pipeline.
Don't Run Apps on Managers — That's Not Their Job
Managers run the control plane. They gossip cluster state, maintain Raft logs, and serve the Docker API. They are not compute nodes. You wouldn't run your web server on the Kubernetes control plane, so don't do it in Swarm.
By default, manager nodes are also eligible to run service tasks. You must explicitly drain managers or add placement constraints to force workloads onto worker nodes. The node.role == worker constraint in your compose file or service create command does exactly that. Without it, your API container could land on a manager during a rolling update, consuming CPU and memory that your cluster brain needs to stay responsive.
Separate concerns = separate node roles. Managers handle orchestration. Workers run containers. If one manager crashes under load because your app ate its memory, you lose not just that host but potentially the quorum. Keep managers lean, dedicated, and isolated from application traffic. Your future self — and your on-call team — will thank you.
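Draining every manager by hand gets tedious past three nodes, so here is a sketch that loops over whatever docker node ls reports as a manager. The function name drain_managers is made up for the example; it assumes you run it on a manager with quorum intact.

```shell
#!/bin/sh
# Set every manager node to drain so the scheduler stops placing tasks there.
# Tasks already running on managers are rescheduled onto other nodes.
drain_managers() {
  docker node ls --filter role=manager --format '{{.Hostname}}' |
  while read -r host; do
    docker node update --availability drain "$host"
  done
}

# Usage (on a manager node): drain_managers
# Verify afterwards with:    docker node ls
```

Draining affects only service tasks; the managers keep participating in Raft and serving the API exactly as before.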
Run docker node update --availability drain <manager-hostname> on all managers to prevent any service from landing there accidentally.
Cluster Split-Brain After Losing 2 of 4 Manager Nodes: All Services Unreachable for 3 Hours
- Always use an odd number of manager nodes: 3 or 5. An even number (4, 6) wastes a node without improving fault tolerance.
- Quorum = floor(n/2) + 1. With 3 managers, you can lose 1. With 5 managers, you can lose 2. With 4 managers, you can still only lose 1 — the 4th node provides no additional resilience.
- Monitor Raft quorum health proactively. A cluster that loses quorum cannot schedule, scale, or update services — even though existing containers keep running.
- Never run application workloads on manager nodes. Resource contention can starve the Raft process and cause the manager to appear unreachable, triggering unnecessary leader elections.
- When quorum is lost, do not reboot all managers simultaneously. Restore one manager at a time and verify Raft log consistency before bringing up the next.
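For proactive monitoring, one option is to measure how many more manager failures the cluster can absorb and alert when that number hits zero. The sketch below does this from the ManagerStatus column of docker node ls. The function names and message wording are invented for illustration, and note that once quorum is actually lost, docker node ls itself returns an error, which the script treats as critical.

```shell
#!/bin/sh
# quorum_margin TOTAL REACHABLE
# How many more managers can fail before quorum is lost.
quorum_margin() {
  echo $(( $2 - ($1 / 2 + 1) ))
}

# Returns 0 when there is headroom, 1 when the next failure loses quorum,
# 2 when the manager list cannot be read at all.
check_swarm_quorum() {
  statuses=$(docker node ls --filter role=manager --format '{{.ManagerStatus}}') || {
    echo 'CRITICAL: docker node ls failed; quorum may already be lost'
    return 2
  }
  total=$(printf '%s\n' "$statuses" | grep -c .)
  reachable=$(printf '%s\n' "$statuses" | grep -c -E '^(Leader|Reachable)$')
  margin=$(quorum_margin "$total" "$reachable")
  if [ "$margin" -le 0 ]; then
    echo "WARNING: $reachable of $total managers reachable; no failures can be absorbed"
    return 1
  fi
  echo "OK: $reachable of $total managers reachable; can absorb $margin more failure(s)"
}

# Usage (on a manager node, e.g. from cron): check_swarm_quorum
```

A 4-manager cluster with one node down reports a margin of zero, which is exactly the state the outage in this article passed through before the second failure.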
docker node ls
docker info --format '{{.Swarm.ControlAvailable}}' (run on each manager)