Advanced 16 min · March 06, 2026

Docker Swarm Basics

Docker Swarm — Why 4 Managers Caused a 3-Hour Outage

Q: Is Docker Swarm still maintained?

Yes. Docker continues to maintain Swarm as part of the Docker Engine. It is not deprecated. Swarm is the right choice for small-to-medium deployments that do not need Kubernetes' complexity. Compose files can also be deployed to a Swarm with docker stack deploy.

Q: How many manager nodes should I use?

Always use 3 or 5 manager nodes, never an even number. With 3 managers, you can tolerate 1 failure. With 5 managers, you can tolerate 2 failures. An even number (4, 6) provides no additional fault tolerance over the next lower odd number. Never run workloads on manager nodes.

Q: What is the difference between a service and a task in Docker Swarm?

A service is the declarative specification: which image, how many replicas, resource limits, update policy. A task is the atomic scheduling unit — one task equals one container. A service with 6 replicas has 6 tasks. The Swarm scheduler assigns tasks to nodes.

Q: How does the routing mesh work?

The routing mesh is an ingress network that load-balances published ports across all nodes in the cluster. Any node can receive traffic for any service, regardless of whether that node is running the service's containers. The mesh forwards traffic to a node with a healthy task. The trade-off: one extra network hop.

Q: Can I use Docker Swarm in production?

Yes, for small-to-medium deployments (up to ~100 nodes). Swarm provides self-healing, rolling updates, secrets management, and overlay networking. For larger scale or complex workloads (custom controllers, CRDs, advanced networking), Kubernetes is a better fit.

Q: How does the swarm-external-secrets Vault plugin work?

The swarm-external-secrets Docker plugin replaces the default secrets backend with HashiCorp Vault. Secrets are stored in Vault (not the Raft log), fetched at container startup. It supports Vault KV v2 with versioning, AppRole authentication (RoleID + SecretID), and SHA256 hash-based rotation detection. When a secret is rotated in Vault, the plugin detects the hash change and can restart the service task to pick up the new value. Install the plugin on every manager, configure Vault connection details via a Docker config, and create secrets with --driver vault-secrets.

Q: How does the cost of Docker Swarm compare to Kubernetes?

Docker Swarm is significantly cheaper for small-to-medium deployments. Swarm is built into the Docker Engine — no external control plane costs. A 6-node Swarm cluster (3 managers + 3 workers) running 24 containers costs roughly $800-$1,200/year on AWS EC2. The same workload on EKS would cost $876/year just for the EKS control plane ($73/month) plus the worker nodes ($500+/year). On self-managed K8s, you avoid the control plane fee but pay for the etcd instances and a dedicated ops engineer — K8s operational overhead is typically 0.5-1 FTE. Swarm's simplicity costs less in both infrastructure and engineering time, but K8s scales to 1,000+ nodes while Swarm tops out around 100 nodes.

4 manager nodes lost quorum when 2 failed — freezing all deployments for 3 hours.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

July 18, 2026

last updated

Before you start⏱ 30 min

✓Production DevOps experience
✓Deep understanding of the tool's internals
✓Experience debugging distributed systems

⚡Quick Answer

Manager nodes: run the Raft consensus algorithm, maintain cluster state, schedule services
Worker nodes: execute tasks (containers) assigned by managers
Services: the declarative unit — you define desired state, Swarm converges reality to match
Tasks: the atomic scheduling unit — one task = one container
Raft consensus requires a quorum (majority) of managers to agree on state changes
Overlay networks span hosts so containers can communicate across nodes
Ingress routing mesh load-balances published ports across all nodes
Rolling updates replace containers incrementally with zero downtime

✦ Definition~90s read

What is Docker Swarm?

Docker Swarm is Docker's native clustering and orchestration solution, bundled directly into the Docker Engine since version 1.12. It turns a pool of Docker hosts into a single, logical virtual server. You deploy services declaratively, and Swarm handles scheduling, scaling, networking, and maintaining desired state across the cluster.

Its killer feature is simplicity: you don't need to install a separate orchestrator or manage external dependencies like etcd or ZooKeeper, because the Raft consensus group runs inside the manager nodes themselves. This makes Swarm ideal for teams that want container orchestration without the operational overhead of Kubernetes, especially in small-to-medium deployments or edge computing scenarios where a full K8s control plane would be overkill.

Swarm's architecture splits nodes into managers and workers. Managers run the Raft consensus algorithm to maintain cluster state — they're the brain. Workers just execute tasks. The Raft group requires a majority (quorum) to function: with 3 managers, you can lose 1; with 5, you can lose 2.

The outage described in this article happened because 4 managers were deployed: an even number that raises the quorum to 3 without adding any fault tolerance. If two managers fail, or the cluster partitions 2-2, no group can form a majority and the control plane freezes. This is why production Swarm clusters always use an odd number of managers (3 or 5).

The article walks through exactly how Raft consensus breaks with 4 managers and why that caused a 3-hour outage.

Beyond manager count, Swarm provides built-in service discovery, load balancing, and overlay networking via VXLAN. You can pin services to specific nodes with placement constraints, set CPU/memory limits, and perform rolling updates with health checks and automatic rollback.

Secrets are encrypted in transit and at rest in the Raft log, and are mounted into containers as in-memory (tmpfs) files. Configs are stored in the Raft log as well, but they are mounted directly into the container's filesystem rather than a RAM disk. Both are immutable: to rotate one, create a new secret or config and update the service to reference it. Swarm is not as feature-rich as Kubernetes (no custom resource definitions, no built-in service mesh, no built-in autoscaling), but for teams that need a simple, reliable orchestrator with minimal moving parts, it's a solid choice.

Just don't use an even number of managers.

Plain-English First

Imagine a restaurant chain with one head office (the manager) and ten kitchens across the city (the workers). A customer order comes in — the head office decides which kitchen handles it, monitors the food being made, and if one kitchen burns down, it quietly reroutes the order to another kitchen without the customer ever knowing. Docker Swarm is exactly that: one command-and-control brain (the manager node) coordinating a fleet of worker nodes, making sure your containers keep running no matter what breaks.

Every production app eventually outgrows a single server. Traffic spikes, hardware fails, deployments need to happen without downtime. Docker Swarm is the native clustering and orchestration layer baked directly into the Docker Engine.

Swarm solves coordination across multiple hosts. When you have ten nodes, you need something to decide where a container lands, what happens when a node dies, how containers on different hosts communicate, and how you push a new image without dropping requests. Swarm encodes those answers into a distributed state machine backed by the Raft consensus algorithm.

Common misconceptions: Swarm is not deprecated (Docker continues to maintain it alongside Compose). Swarm is not Kubernetes-lite (it has a fundamentally different architecture — no pods, no CRDs, no etcd). Swarm's simplicity is its strength for small-to-medium deployments that do not need Kubernetes' complexity.

Why Docker Swarm's Manager Count Matters More Than You Think

Docker Swarm is a container orchestration engine built into Docker Engine that groups multiple hosts into a single virtual cluster. Its core mechanic is the Raft consensus algorithm: manager nodes elect a leader to coordinate all cluster state changes. Every service definition, secret, and configuration update must pass through the leader, which replicates it to a majority of managers before it's committed.

Swarm tolerates up to floor((N-1)/2) manager failures, so an even manager count buys nothing over the odd number below it. With 4 managers, a single failure drops you to 3, which is still a majority. But if another fails, you're at 2: no majority, and the control plane freezes. No deployments, no scaling, no rescheduling of failed tasks. Containers that are already running keep running, but nothing about the cluster can change. Raft requires a strict majority of all configured managers, not just the ones currently online.

Use Swarm when you need a simple, low-overhead orchestrator for a small-to-medium cluster (up to roughly 100 nodes) and you want zero external dependencies: no etcd, no ZooKeeper. It's ideal for teams that already run Docker and need basic HA without the operational complexity of Kubernetes. But the manager count is not a scaling knob; it's a fault-tolerance decision. Run 3 or 5, never 4.

⚠ Even-numbered managers are a trap

A 4-manager cluster has the same fault tolerance as a 3-manager cluster (one failure), but it has one more node that can fail. Both lose quorum at the second failure, and with four machines that second failure is more likely, which is worse, not better.

📊 Production Insight

A team added a fourth manager for 'extra capacity' during a holiday sale.

The cluster lost quorum after two managers went down for routine patching — all deployments froze for 3 hours.

Always run an odd number of managers; even numbers increase failure risk without adding tolerance.
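A check like the following could run before a patching window to catch the mistake described above. It is a minimal sketch: can_take_down is a hypothetical helper, not a Docker command, and the manager counts are passed in by hand rather than read from the cluster.

```bash
#!/usr/bin/env bash
# Pre-patching guard: would taking managers offline break quorum?
# can_take_down is a hypothetical helper, not part of the Docker CLI.

# can_take_down TOTAL ALREADY_DOWN WANT_DOWN
# Succeeds (exit 0) only if a majority of TOTAL managers stays online.
can_take_down() {
  local total=$1 already_down=$2 want_down=$3
  local quorum=$(( total / 2 + 1 ))
  local remaining=$(( total - already_down - want_down ))
  [ "$remaining" -ge "$quorum" ]
}

# The incident: 4 managers, two taken down together for patching.
if can_take_down 4 0 2; then
  echo "4 managers, 2 down: quorum kept"
else
  echo "4 managers, 2 down: quorum LOST"
fi

# Patching one at a time keeps a majority of 3 online.
if can_take_down 4 0 1; then
  echo "4 managers, 1 down: quorum kept"
fi
```

On a real manager, the total could come from docker info --format '{{.Swarm.Managers}}'; the sketch keeps the numbers explicit so the arithmetic is visible.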

🎯 Key Takeaway

Swarm uses Raft consensus — a majority of managers must be alive for any state change.

3 managers tolerate 1 failure; 5 tolerate 2; 4 still tolerate only 1, and with an extra node that can fail, the quorum-breaking second failure becomes more likely.

Manager count is about fault tolerance, not performance — never run 2, 4, or 6 managers.

thecodeforge.io

Docker Swarm Basics

Raft Consensus and Manager Node Architecture

Swarm's cluster state is stored in a distributed log managed by the Raft consensus algorithm. Every manager node runs a full copy of the Raft log. State changes (service updates, node joins, secret creation) are proposed by the leader, replicated to a quorum of followers, and then committed.

The quorum formula is floor(n/2) + 1, where n is the number of managers. With 3 managers, quorum is 2. With 5 managers, quorum is 3. The cluster can tolerate floor((n-1)/2) manager failures. With 3 managers, you can lose 1. With 5 managers, you can lose 2.

An even number of managers provides no additional fault tolerance over the next lower odd number. With 4 managers, quorum is 3 — you can still only lose 1 manager, same as with 3 managers. The 4th node is wasted.
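The two formulas above are easy to tabulate. The sketch below does so in Bash for 1 to 7 managers; the function names are illustrative only.

```bash
#!/usr/bin/env bash
# Quorum and fault-tolerance arithmetic for a Raft manager group.
# Function names are illustrative, not part of any Docker tooling.

# quorum N -> floor(N/2) + 1 (shell integer division already floors)
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N -> floor((N-1)/2)
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

printf '%-9s %-7s %s\n' managers quorum tolerated
for n in 1 2 3 4 5 6 7; do
  printf '%-9s %-7s %s\n' "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Reading the table row by row shows the pattern: 3 and 4 managers both tolerate one failure, and 5 and 6 both tolerate two.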

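Leader election: When the leader fails or becomes unreachable, the remaining managers hold an election. Each follower waits for a randomized election timeout; the first to time out becomes a candidate and wins if a majority of managers vote for it, which they do only when its Raft log is at least as up to date as their own. The timing is controlled by the heartbeat and election tick settings that docker info reports for Raft. A network partition cannot produce two leaders that both commit: winning requires a majority of votes, so only the side with quorum can elect a leader and commit new state changes.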

Failure scenario: manager resource starvation. A team ran a memory-intensive batch job on a manager node. The job consumed all available RAM, causing the Docker daemon to be OOM-killed. The daemon restart triggered a Raft leader election. During the election window (1-2 seconds), no state changes could be committed. The team noticed brief delays in service updates. The fix: keep workloads off manager nodes with docker node update --availability drain <manager-node>.

io/thecodeforge/swarm-manager-setup.sh · Bash

#!/bin/bash
# Swarm cluster bootstrap with proper manager configuration

# ── Initialize the Swarm on the first manager ─────────────────────
# --advertise-addr: the IP other nodes will use to reach this manager
# --listen-addr: the interface the manager binds to
docker swarm init \
  --advertise-addr 10.0.1.10 \
  --listen-addr 0.0.0.0:2377 \
  --data-path-addr 10.0.1.10

# Get the join tokens
docker swarm join-token manager  # For other managers
docker swarm join-token worker   # For workers

# ── Join additional managers (run on each new manager node) ────────
docker swarm join \
  --token SWMTKN-1-xxxxx-manager-token-xxxxx \
  --advertise-addr 10.0.1.11 \
  10.0.1.10:2377

# ── Verify manager count (should be 3 or 5, never even) ──────────
docker node ls --filter role=manager
# ID    HOSTNAME   STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
# abc * manager-1  Ready     Active         Leader           24.0.7
# def   manager-2  Ready     Active         Reachable        24.0.7
# ghi   manager-3  Ready     Active         Reachable        24.0.7

# ── Drain manager nodes to prevent workloads from running on them ─
for node in $(docker node ls --filter role=manager -q); do
  docker node update --availability drain "$node"
done
# Drained managers cannot run tasks — they are dedicated to orchestration

# ── Check Raft cluster health ─────────────────────────────────────
docker info --format '{{.Swarm.Managers}} managers, {{.Swarm.Nodes}} total'
# Or inspect the Raft status on each manager
docker node inspect self --format '{{.ManagerStatus.Leader}}'
# true on the leader, false on followers

# ── Configure auto-lock (protect the keys that decrypt Raft logs) ─
docker swarm update --autolock=true
# This requires unlocking the swarm after daemon restart:
# docker swarm unlock
# Enter unlock key: SWMKEY-1-xxxxx

Output

Swarm initialized: current node (abc123) is now a manager.

To add a worker to this swarm, run the following command:

docker swarm join --token SWMTKN-1-xxxxx 10.0.1.10:2377

To add a manager to this swarm, run:

docker swarm join --token SWMTKN-1-yyyyy 10.0.1.10:2377

ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION

abc * manager-1 Ready Active Leader 24.0.7

def manager-2 Ready Active Reachable 24.0.7

ghi manager-3 Ready Active Reachable 24.0.7

Mental Model

Raft as a Committee Vote

Why does an even number of managers not improve fault tolerance?

Quorum = floor(n/2) + 1. With 3 managers, quorum is 2. With 4 managers, quorum is 3.
With 3 managers, you can lose 1 and still have quorum (2 >= 2).
With 4 managers, you can lose 1 and still have quorum (3 >= 3). But losing 2 breaks quorum (2 < 3).
The 4th manager adds cost (server, maintenance) without adding fault tolerance. Always use 3 or 5.

📊 Production Insight

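Raft logs are encrypted at rest by default, but the key that decrypts them is stored on the manager's disk. The autolock feature (--autolock=true) protects that key with an unlock key. Without it, anyone with access to the manager's disk can recover the Raft data, which includes secrets and service definitions. The trade-off: after a daemon restart, you must supply the unlock key with docker swarm unlock before the manager can resume its role. Automate this with a secrets manager or a secure boot script.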

🎯 Key Takeaway

Raft consensus requires a quorum of managers to agree on state changes. Always use an odd number of managers (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes — resource contention can starve the Raft process.

Manager Node Count Decision

If: Small cluster, 1-10 nodes, budget-conscious → Use: 3 managers. Tolerates 1 failure. Minimal overhead.

If: Medium cluster, 10-100 nodes, production-critical → Use: 5 managers. Tolerates 2 failures, at the cost of more Raft replication traffic per state change.

If: Large cluster, 100+ nodes → Use: Consider migrating to Kubernetes. Swarm's consensus model does not scale well beyond ~100 nodes.

If: Development/testing environment → Use: 1 manager is sufficient. No quorum concerns. Not suitable for production.

Service Scheduling, Placement Constraints and Resource Limits

A Swarm service is a declarative specification of the desired state: which image to run, how many replicas, resource limits, placement constraints, and update policy. The Swarm scheduler assigns tasks (individual containers) to nodes that satisfy the constraints and have available resources.

Scheduling algorithm: Swarm uses a spread scheduler by default — it places tasks on the node with the fewest existing tasks of the same service. This provides natural load distribution. You can override this with placement constraints and preferences.
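The spread behaviour described above can be illustrated with a toy model. This is a deliberately simplified sketch: the node names are made up, the state is a plain text variable, and the real scheduler also weighs constraints, reservations, and node health.

```bash
#!/usr/bin/env bash
# Toy spread scheduler: each new task goes to the node that currently
# holds the fewest tasks. Not Swarm's actual implementation.

# state holds one "<task-count> <node>" line per node.
state="0 worker-1
0 worker-2
0 worker-3"

# pick_node prints the node with the fewest tasks (ties broken by name).
pick_node() {
  printf '%s\n' "$state" | sort -k1,1n -k2,2 | head -n 1 | cut -d' ' -f2
}

# place_task assigns one task: sets $placed_node and bumps that node's count.
place_task() {
  placed_node=$(pick_node)
  state=$(printf '%s\n' "$state" |
    awk -v n="$placed_node" '$2 == n { $1 += 1 } { print }')
}

for task in 1 2 3 4 5 6; do
  place_task
  echo "api.$task -> $placed_node"
done
```

With six tasks and three empty nodes, the loop ends with two tasks on each node, which is the even distribution the default strategy aims for.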

Placement constraints: Hard requirements that a node must satisfy. Examples:
- node.role==manager: only run on manager nodes
- node.labels.zone==us-east-1a: only run in a specific availability zone
- node.hostname==worker-3: pin to a specific node

Placement preferences: Soft preferences that guide scheduling but do not prevent placement. Example: --placement-pref 'spread=node.labels.zone' distributes tasks evenly across zones.

Resource limits:
- --limit-cpu: maximum CPU a task can consume (e.g., 0.5 = half a core)
- --limit-memory: maximum memory (e.g., 512m)
- --reserve-cpu: guaranteed CPU allocation
- --reserve-memory: guaranteed memory allocation

Without resource limits, a single misbehaving container can consume all resources on a node, starving other tasks. Resource reservations ensure critical services always have the resources they need.

Failure scenario — no resource limits, noisy neighbor: A team deployed a memory-intensive analytics service without --limit-memory. The service gradually consumed all available RAM on a worker node. The kernel OOM-killed other containers on the same node, including a critical payment service. The payment service was rescheduled to another node (Swarm's self-healing), but the 30-second rescheduling delay caused a brief payment outage. The fix: add --limit-memory to all services and --reserve-memory for critical services.

io/thecodeforge/swarm-service-deploy.sh · Bash

#!/bin/bash
# Production-grade service deployment with constraints and resource limits

# ── Deploy a service with full production settings ────────────────
# Prerequisites: the overlay network, the secret, and the node labels
# used below must already exist, and DATABASE_URL must be set in this shell.
# The flags live in an array because a comment between backslash-continued
# lines would end the command. docker service create has no --image flag;
# the image is the final positional argument.
service_args=(
  --name io-thecodeforge-api
  --replicas 6

  # Resource limits: hard ceiling
  --limit-cpu 1.0
  --limit-memory 512m

  # Resource reservations: guaranteed allocation
  --reserve-cpu 0.25
  --reserve-memory 128m

  # Placement: spread across availability zones
  --placement-pref 'spread=node.labels.zone'

  # Constraint: never run on manager nodes
  --constraint 'node.role!=manager'

  # Constraint: only run on nodes with SSD label
  --constraint 'node.labels.disk==ssd'

  # Health check
  --health-cmd 'curl -f http://localhost:8080/health || exit 1'
  --health-interval 10s
  --health-timeout 5s
  --health-retries 3
  --health-start-period 30s

  # Rolling update policy
  --update-parallelism 2
  --update-delay 10s
  --update-failure-action rollback
  --update-order start-first

  # Rollback policy
  --rollback-parallelism 1
  --rollback-delay 5s

  # Network
  --network io-thecodeforge-overlay
  --publish published=8080,target=8080

  # Environment: value taken from the calling shell; abort if it is unset
  --env "DATABASE_URL=${DATABASE_URL:?set DATABASE_URL before running}"
  --secret io-thecodeforge-db-password

  # Restart policy
  --restart-condition on-failure
  --restart-delay 5s
  --restart-max-attempts 3
  --restart-window 60s
)

docker service create "${service_args[@]}" io.thecodeforge/api:v2.3.1

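# ── Label nodes for placement constraints ─────────────────────────
# Labels must exist before tasks can be placed: a task stays pending
# until some node matches every constraint.
docker node update --label-add zone=us-east-1a worker-1
docker node update --label-add zone=us-east-1b worker-2
docker node update --label-add zone=us-east-1a worker-3
docker node update --label-add disk=ssd worker-1
docker node update --label-add disk=ssd worker-2
docker node update --label-add disk=ssd worker-3  # worker-3 runs tasks in the listing below, so it needs the SSD label too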

# ── Verify placement ──────────────────────────────────────────────
docker service ps io-thecodeforge-api --format '{{.Name}} {{.Node}} {{.CurrentState}}'
# api.1  worker-1  Running
# api.2  worker-2  Running
# api.3  worker-1  Running
# api.4  worker-2  Running
# api.5  worker-3  Running
# api.6  worker-3  Running

Output

overall progress: 6 out of 6 tasks

1/6: running [==================================================>]

2/6: running [==================================================>]

3/6: running [==================================================>]

4/6: running [==================================================>]

5/6: running [==================================================>]

6/6: running [==================================================>]

verify: Service converged

Mental Model

Scheduling as Hotel Room Assignment

What is the difference between a constraint and a preference?

Constraints are hard requirements. If no node satisfies the constraint, the task stays in the Pending state until a matching node becomes available.
Preferences are soft guidelines. Swarm tries to satisfy them but can place the task on any node if no preference match exists.
Use constraints for critical requirements: 'must run on SSD', 'must not run on managers'.
Use preferences for optimization: 'prefer to spread across zones', 'prefer nodes with fewer tasks'.

📊 Production Insight

The --update-order start-first flag starts the new container before stopping the old one. This provides zero-downtime deployments but temporarily raises resource usage, because old and new tasks overlap. With --limit-memory 512m and 6 replicas, steady state is up to 3 GB. During a start-first update with --update-parallelism 2, up to 8 tasks run at once (about 4 GB); usage approaches 6 GB only if parallelism equals the replica count. Ensure your cluster has enough headroom for rolling updates. If headroom is limited, use stop-first order instead.
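The headroom question comes down to simple arithmetic. The sketch below estimates the peak under one stated assumption: during a start-first update, at most --update-parallelism new tasks run alongside the full set of old replicas, and every task may use its whole memory limit. peak_memory_mb is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Rough peak-memory estimate for a start-first rolling update.
# Assumption: at most `parallelism` new tasks overlap with all old
# replicas, and each task may consume its full memory limit.

# peak_memory_mb REPLICAS PARALLELISM LIMIT_MB
peak_memory_mb() {
  local replicas=$1 parallelism=$2 limit_mb=$3
  echo $(( (replicas + parallelism) * limit_mb ))
}

steady=$(( 6 * 512 ))
peak=$(peak_memory_mb 6 2 512)
echo "steady state: ${steady} MB, peak during update: ${peak} MB"
```

Setting parallelism equal to the replica count (6 here) gives the worst case, where usage is twice the steady state.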

🎯 Key Takeaway

Always set resource limits on production services. Without limits, a single misbehaving container can OOM-kill other containers on the same node. Use placement constraints to isolate critical services and spread across availability zones. The spread scheduler distributes tasks evenly by default.

Resource Limit Strategy

If: Stateless web API with predictable resource usage → Use: Set --limit-cpu and --limit-memory based on load testing. Use --reserve-memory for critical services.

If: Memory-intensive batch processing → Use: Set generous --limit-memory but low --limit-cpu. Use placement constraints to isolate on dedicated nodes.

If: Latency-sensitive service (trading, real-time) → Use: Use --reserve-cpu to guarantee CPU. Consider host-mode publishing to bypass routing mesh. Pin to dedicated nodes.

If: Development/testing → Use: Skip resource limits. They add complexity without benefit in non-production environments.

Overlay Networks and Cross-Host Container Communication

Docker Swarm uses overlay networks to enable containers on different hosts to communicate as if they were on the same network. The overlay network uses VXLAN (Virtual Extensible LAN) encapsulation to tunnel Layer 2 traffic over the underlying Layer 3 network.

How it works: When container A on node 1 sends a packet to container B on node 2, the VXLAN driver encapsulates the packet in a UDP datagram on port 4789 and sends it to node 2. Node 2 decapsulates the packet and delivers it to container B. The containers see each other's overlay IP addresses as if they were on the same LAN.

The ingress routing mesh: When you publish a port with --publish, Swarm creates a route in the ingress network that load-balances incoming traffic across all nodes running the service. Any node in the cluster can receive traffic for any service, regardless of whether that node is running the service's containers. The routing mesh forwards the traffic to a node that is running a healthy task.

The extra-hop problem: The routing mesh adds one network hop. A request to node 1 may be routed to a container on node 3. This adds latency. For latency-sensitive services, use host-mode publishing: --publish published=8080,target=8080,mode=host. This bypasses the routing mesh and binds directly to the host's port. The trade-off: only nodes running the service's containers accept traffic — you lose the any-node routing benefit.

Failure scenario — VXLAN port blocked by firewall: A team deployed a 3-node Swarm cluster across two data centers. Containers in data center A could not reach containers in data center B. The team spent 4 hours debugging DNS, service discovery, and overlay configuration. The root cause: the firewall between data centers blocked UDP port 4789 (VXLAN). After opening the port, overlay connectivity was restored immediately.
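A firewall review is cheaper than four hours of debugging. The sketch below prints, without executing anything, one rule per port that Swarm nodes need open between each other. Using ufw and the 10.0.1.0/24 peer subnet are assumptions for illustration; adapt the output to whatever firewall sits between your hosts. Encrypted overlays additionally need IP protocol 50 (ESP), which is left out here.

```bash
#!/usr/bin/env bash
# Dry-run helper: print (do not execute) ufw rules for inter-node
# Swarm traffic. ufw and the example subnet are assumptions.

# swarm_ports lists "<port>/<proto> <purpose>" for each required port.
swarm_ports() {
  cat <<'EOF'
2377/tcp cluster management (Raft)
7946/tcp node gossip
7946/udp node gossip
4789/udp VXLAN overlay traffic
EOF
}

# print_ufw_rules PEER_SUBNET -> one "ufw allow" line per required port
print_ufw_rules() {
  local subnet=$1
  swarm_ports | while read -r spec purpose; do
    port=${spec%/*}    # text before the slash
    proto=${spec#*/}   # text after the slash
    echo "ufw allow from $subnet to any port $port proto $proto"
  done
}

print_ufw_rules 10.0.1.0/24
```

Reviewing the printed rules against the firewall between data centers would have surfaced the missing UDP 4789 entry directly.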

io/thecodeforge/swarm-networking.sh · Bash

#!/bin/bash
# Overlay network setup and troubleshooting

# ── Create an overlay network with encryption ─────────────────────
docker network create \
  --driver overlay \
  --attachable \
  --opt encrypted \
  --subnet 10.0.10.0/24 \
  io-thecodeforge-overlay
# --driver overlay: VXLAN-based cross-host networking
# --attachable: allows standalone containers to join (useful for debugging)
# --opt encrypted: encrypts VXLAN traffic with IPsec (adds ~10% overhead)
# --subnet: explicit IP range for the overlay network

# ── Deploy a service on the overlay network ───────────────────────
docker service create \
  --name api \
  --network io-thecodeforge-overlay \
  --replicas 3 \
  io.thecodeforge/api:v2.3.1

# ── Verify overlay network peers ──────────────────────────────────
docker network inspect io-thecodeforge-overlay --format '{{json .Peers}}' | python3 -m json.tool
# Each peer is a node that currently has a task or container on this network
# If a node runs a task on the overlay but is missing here, it cannot communicate on it

# ── Test cross-host connectivity ──────────────────────────────────
# From any node, run a debug container on the overlay network
docker run --rm -it --network io-thecodeforge-overlay alpine sh
# Inside the container:
# ping <overlay-ip-of-service-task>
# nslookup tasks.api  # DNS round-robin for all service tasks

# ── Required ports for Swarm networking ───────────────────────────
# TCP 2377: Swarm cluster management (Raft)
# TCP/UDP 7946: Gossip-based node discovery
# UDP 4789: VXLAN overlay network traffic
# Protocol 50 (ESP): IPsec encryption (if --opt encrypted)

# ── Host-mode publishing (bypass routing mesh) ────────────────────
docker service create \
  --name api-latency-sensitive \
  --network io-thecodeforge-overlay \
  --publish published=8080,target=8080,mode=host \
  --mode global \
  io.thecodeforge/api:v2.3.1
# mode=global: one task per node (every node runs the service)
# mode=host: binds directly to host port 8080, no routing mesh hop

Output

Network io-thecodeforge-overlay created

[
    {
        "Name": "manager-1",
        "IP": "10.0.1.10"
    },
    {
        "Name": "worker-1",
        "IP": "10.0.1.11"
    },
    {
        "Name": "worker-2",
        "IP": "10.0.1.12"
    }
]

# All 3 peers are present — overlay network is healthy

Mental Model

Overlay Network as a Virtual Office Floor

When should you use host-mode publishing instead of the routing mesh?

Latency-sensitive services where the extra routing mesh hop adds unacceptable delay.
Services that need to bind to specific host ports for external load balancer integration.
Services running in --mode global (one per node) where every node already has a container.
Trade-off: you lose the any-node routing benefit. Traffic only reaches nodes running the service.

📊 Production Insight

The --opt encrypted flag adds IPsec encryption to VXLAN traffic. This is important for multi-data-center or cloud deployments where traffic crosses untrusted networks. The overhead is approximately 10% throughput reduction and slightly higher CPU usage. For single-data-center deployments on a trusted network, skip encryption to avoid the overhead.

🎯 Key Takeaway

Overlay networks use VXLAN on UDP port 4789. If this port is blocked by firewalls, containers on different nodes cannot communicate. The routing mesh adds one network hop — use host-mode publishing for latency-sensitive services. Always use --opt encrypted for cross-data-center overlays.

Rolling Updates, Rollback and Zero-Downtime Deployments

Swarm's rolling update mechanism replaces old containers with new ones incrementally, ensuring the service remains available throughout the deployment. The update configuration controls the pace and failure behavior.

Update parameters:
- --update-parallelism: how many tasks to update simultaneously (default: 1)
- --update-delay: wait time between updating batches (default: 0s)
- --update-failure-action: what to do if a new task fails (pause, continue, rollback; default: pause)
- --update-order: start-first (new container starts before old stops) or stop-first (old stops before new starts)
- --update-max-failure-ratio: fraction of tasks (0 to 1) that may fail before the failure action triggers

The start-first vs stop-first trade-off:
- start-first: zero downtime, but resource usage rises during deployment because old and new tasks overlap
- stop-first: lower resource usage, but a brief window where fewer replicas are running (as many fewer as --update-parallelism)
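To get a feel for how parallelism and delay set the pace, the sketch below computes the number of batches and the time spent purely in --update-delay. It is a lower bound under the assumption that the delay applies between batches; container startup and health-check time come on top. The helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Pace of a rolling update: batches and minimum time spent waiting.
# Lower bound only; startup and health-check time are not included.

# update_batches REPLICAS PARALLELISM -> ceil(REPLICAS / PARALLELISM)
update_batches() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# min_delay_seconds REPLICAS PARALLELISM DELAY_S
# Time spent in the delay between batches: (batches - 1) * delay.
min_delay_seconds() {
  local batches
  batches=$(update_batches "$1" "$2")
  echo $(( (batches - 1) * $3 ))
}

echo "6 replicas, 2 at a time: $(update_batches 6 2) batches," \
     "at least $(min_delay_seconds 6 2 10)s of update delay"
```

For the 6-replica service used throughout this article, parallelism 2 with a 10s delay means three batches and at least 20 seconds of waiting.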

Rollback: If a rolling update fails, Swarm can automatically roll back to the previous version. The rollback configuration mirrors the update configuration. Manual rollback: docker service rollback <service>.

Failure scenario: an update without a working health check setup causes a cascading failure. A team deployed a new API version with a startup bug that made its health check fail. The team had also not configured --health-start-period, so checks began failing before the app was ready and Swarm marked the tasks as failed. Because the service had been set to --update-failure-action continue (the default is pause), Swarm kept replacing healthy containers with the failing new version. Within 2 minutes, all containers were running the broken version. The fix: set --update-failure-action rollback and configure --health-start-period to allow startup time.

io/thecodeforge/swarm-rolling-update.sh · Bash

#!/bin/bash
# Zero-downtime rolling update with automatic rollback

# ── Initial deployment ────────────────────────────────────────────
# Update policy: 2 tasks at a time, 10s between batches, automatic
# rollback when the failure ratio passes 0.25.
# Rollback policy: 1 task at a time, 5s apart, stop-first.
# Comments stay up here because a comment between backslash-continued
# lines would end the command. The image is the final positional
# argument: docker service create has no --image flag.
docker service create \
  --name io-thecodeforge-api \
  --replicas 6 \
  --limit-cpu 1.0 \
  --limit-memory 512m \
  --health-cmd 'curl -f http://localhost:8080/health || exit 1' \
  --health-interval 10s \
  --health-timeout 5s \
  --health-retries 3 \
  --health-start-period 40s \
  --update-parallelism 2 \
  --update-delay 10s \
  --update-failure-action rollback \
  --update-max-failure-ratio 0.25 \
  --update-order start-first \
  --rollback-parallelism 1 \
  --rollback-delay 5s \
  --rollback-order stop-first \
  --network io-thecodeforge-overlay \
  --publish published=8080,target=8080 \
  io.thecodeforge/api:v2.3.0

# ── Rolling update to new version ─────────────────────────────────
docker service update \
  --image io.thecodeforge/api:v2.3.1 \
  --update-parallelism 2 \
  --update-delay 10s \
  io-thecodeforge-api

# ── Monitor the update progress ───────────────────────────────────
docker service ps io-thecodeforge-api \
  --format '{{.Name}} {{.Image}} {{.CurrentState}} {{.Error}}' \
  | head -20
# You will see old tasks shutting down and new tasks starting

# ── Manual rollback if needed ─────────────────────────────────────
docker service rollback io-thecodeforge-api
# Reverts to the previous image and configuration

# ── Force update (redeploy without changing image) ────────────────
docker service update --force io-thecodeforge-api
# Useful when container config has changed but image tag is the same

Output

overall progress: 6 out of 6 tasks

1/6: running [==================================================>]

2/6: running [==================================================>]

3/6: running [==================================================>]

4/6: running [==================================================>]

5/6: running [==================================================>]

6/6: running [==================================================>]

verify: Service converged

# Rolling update in progress:

# api.1 io.thecodeforge/api:v2.3.1 Running

# api.2 io.thecodeforge/api:v2.3.1 Running

# api.3 io.thecodeforge/api:v2.3.0 Running (waiting for delay)

# api.4 io.thecodeforge/api:v2.3.0 Running (waiting for delay)

# api.5 io.thecodeforge/api:v2.3.0 Running

# api.6 io.thecodeforge/api:v2.3.0 Running

Mental Model

Rolling Update as Renovating a Hotel Floor by Floor

Why is --update-failure-action rollback critical for production?

Without rollback, a failing update continues replacing all healthy containers with the broken version.
With rollback, Swarm detects failures and automatically reverts to the previous working version.
The --update-max-failure-ratio flag controls the failure threshold. 0.25 means 25% failure triggers rollback.
Always pair rollback with health checks. Without health checks, Swarm cannot detect a broken container.
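
To act on an automatic rollback, a deploy script can poll the service's update status after calling docker service update. The sketch below assumes the UpdateStatus.State field reported by docker service inspect; the helper name wait_for_update and the default 5-second interval are illustrative, not part of Docker.

```bash
#!/bin/bash
# Sketch: block until a rolling update finishes and report whether Swarm
# rolled it back. wait_for_update is a hypothetical helper name.
wait_for_update() {
  local service="$1" interval="${2:-5}" state
  while true; do
    # UpdateStatus is absent on a service that has never been updated,
    # so the lookup is guarded inside the template.
    state=$(docker service inspect \
      --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}' "$service") || return 2
    case "$state" in
      completed|"")           echo "update completed"; return 0 ;;
      rollback_completed)     echo "update failed: rolled back"; return 1 ;;
      paused|rollback_paused) echo "update paused: $state"; return 1 ;;
      *)                      sleep "$interval" ;;  # updating or rollback_started
    esac
  done
}
```

A CI job could call wait_for_update io-thecodeforge-api and fail the pipeline on a non-zero exit code, so a rolled-back release does not pass unnoticed.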

📊 Production Insight

The --health-start-period flag is essential for services with slow startup times (JVM warmup, database migrations, cache hydration). Without it, the health check runs immediately and may fail before the application is ready, triggering an unnecessary rollback. Set it to the expected maximum startup time plus a buffer.

🎯 Key Takeaway

Always set --update-failure-action rollback in production. Without it, a broken update replaces all healthy containers. Use --health-start-period for services with slow startup. start-first avoids downtime but temporarily runs new tasks alongside the old ones (up to --update-parallelism extra tasks at a time), so ensure the cluster has headroom.

Swarm Secrets and Configs — Immutable, Encrypted, Rotatable

Docker Swarm provides built-in secrets management through the Raft log. Secrets are encrypted at rest and in transit, mounted as files in /run/secrets/ inside containers, and never written to image layers.

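How secrets work:
- docker secret create stores the secret in the Raft log, which is encrypted at rest on the managers. With autolock enabled, the keys protecting that log must be unlocked with the swarm unlock key after a manager restarts.
- The secret is replicated to every manager node (encrypted) as part of the Raft log.
- When a service references a secret, it is mounted as a file at /run/secrets/<secret-name> inside that service's containers.
- Secrets are immutable: a secret's value cannot be changed in place. To update one, create a new secret under a new name and swap it into the service.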

How configs work:
- docker config create stores configuration files in the Raft log.
- Configs are mounted as files in the container. Unlike secrets, they are not encrypted at rest, so use secrets for sensitive data.
- Useful for nginx.conf, application.yaml, or any configuration file.

Secret rotation: Secrets are immutable. To rotate a secret:
1. Create a new secret, reading the value from stdin (the trailing - means stdin): docker secret create db-password-v2 -
2. Update the service to use the new secret: docker service update --secret-rm db-password --secret-add db-password-v2 <service>
3. The service's tasks restart with the new secret mounted at /run/secrets/db-password-v2.
4. Delete the old secret once no service references it: docker secret rm db-password

Failure scenario — secret not updating in running service: A team updated a database password by creating a new secret and updating the service. However, the application inside the container still read the old password from /run/secrets/db-password. The team did not realize that Docker secrets are immutable — the old secret file remained mounted until the service was explicitly updated to remove it. The fix: use --secret-rm to remove the old secret and --secret-add to add the new one in the same update command.
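
One way to avoid this class of failure is to keep the mount path stable across rotations, so the application never needs to know the versioned secret name. The sketch below uses the long form of --secret-add (source=...,target=...); the helper name rotate_secret and its argument order are assumptions for illustration.

```bash
#!/bin/bash
# Sketch: rotate a secret while keeping the file path inside the container
# unchanged. rotate_secret is a hypothetical helper; the new value is read
# from stdin.
rotate_secret() {
  local service="$1" old="$2" new="$3" target="$4"
  docker secret create "$new" - || return 1
  # Swap both secrets in a single update. target= keeps the file at
  # /run/secrets/<target> even though the secret itself has a new name.
  docker service update \
    --secret-rm "$old" \
    --secret-add "source=$new,target=$target" \
    "$service" || return 1
  # Swarm refuses to remove a secret that a service still references,
  # so this only succeeds after the update above.
  docker secret rm "$old"
}
```

For example, printf '%s' "$NEW_PASSWORD" | rotate_secret io-thecodeforge-api db-password db-password-v2 db-password leaves the application reading /run/secrets/db-password before and after the rotation.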

io/thecodeforge/swarm-secrets.shBASH

#!/bin/bash
# Secrets management in Docker Swarm

# ── Create a secret from stdin ────────────────────────────────────
# printf avoids the trailing newline that echo would store as part of the secret
printf '%s' 's3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password -

# ── Create a secret from a file ───────────────────────────────────
docker secret create io-thecodeforge-tls-cert /path/to/cert.pem

# ── Create a config (non-sensitive configuration) ─────────────────
docker config create io-thecodeforge-nginx-conf /path/to/nginx.conf

# ── Deploy a service with secrets and configs ─────────────────────
docker service create \
  --name io-thecodeforge-api \
  --secret io-thecodeforge-db-password \
  --secret io-thecodeforge-tls-cert \
  --config source=io-thecodeforge-nginx-conf,target=/etc/nginx/nginx.conf \
  io.thecodeforge/api:v2.3.1

# ── Access secrets inside the container ────────────────────────────
docker exec <container> cat /run/secrets/io-thecodeforge-db-password
# Output: s3cret_p@ssw0rd

docker exec <container> ls /run/secrets/
# io-thecodeforge-db-password
# io-thecodeforge-tls-cert

# ── Rotate a secret ───────────────────────────────────────────────
# Step 1: Create new version
printf '%s' 'new_s3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password-v2 -

# Step 2: Update service — remove old, add new
docker service update \
  --secret-rm io-thecodeforge-db-password \
  --secret-add io-thecodeforge-db-password-v2 \
  io-thecodeforge-api

# Step 3: Verify new secret is mounted
docker exec <container> cat /run/secrets/io-thecodeforge-db-password-v2
# Output: new_s3cret_p@ssw0rd

# Step 4: Clean up old secret
docker secret rm io-thecodeforge-db-password

# ── List all secrets ──────────────────────────────────────────────
docker secret ls
# ID          NAME                          CREATED
# abc123      io-thecodeforge-db-password   2 hours ago
# def456      io-thecodeforge-tls-cert      2 hours ago

Output

Secret io-thecodeforge-db-password created

Config io-thecodeforge-nginx-conf created

overall progress: 1 out of 1 tasks

1/1: running [==================================================>]

verify: Service converged

s3cret_p@ssw0rd

Mental Model

Secrets as Sealed Envelopes

Why are Docker secrets more secure than environment variables?

Secrets are encrypted at rest in the Raft log. ENV variables are stored in plaintext in container metadata.
Secrets are mounted as files — they do not appear in docker inspect, docker ps, or process listings.
Secrets are distributed only to nodes running tasks that reference them. ENV variables are visible to anyone who can inspect the service or its containers.
Secrets are immutable and versioned. ENV variables can be accidentally changed or logged.

📊 Production Insight

Docker secrets are Swarm-only. If you use standalone Docker (not Swarm), you must use alternative secrets management: Docker Compose secrets (file-based, not encrypted), HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets. Plan your secrets strategy before choosing an orchestration platform.

🎯 Key Takeaway

Docker secrets are encrypted, immutable, and mounted as files in /run/secrets/. Never use ENV for secrets — they are visible in docker inspect. To rotate a secret, create a new version and update the service with --secret-rm and --secret-add. Secrets are Swarm-only — standalone Docker requires alternative solutions.

Tasks and Services: The Two Abstractions You Can't Afford to Confuse

Newcomers treat 'service' and 'task' like synonyms. They're not. Get this wrong and your rolling updates will silently fail, your health checks will fire at ghosts, and you'll be debugging at 2 AM while your manager asks why production is serving 503s.

A Service is the declarative spec. You define the image, replicas, network, ports, resource limits — the desired state. Docker Swarm reconciles actual state to match. A Task is a running instance of that service. One replica = one task. When you scale to 10, you get 10 tasks, each with a unique ID tied to a specific node.

Here's the nasty bit: tasks are ephemeral. They fail, get rescheduled, get replaced during updates. Your monitoring must track task IDs, not container names. If you're scraping logs by container name, you'll lose the trail after any reschedule. Tag your logs with the task ID and service name, which you can pass into the container as environment variables using Swarm's template placeholders such as {{.Task.ID}} and {{.Service.Name}}.
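
One way to get the task ID and service name into the environment is the Go-template placeholders that docker service create --env accepts. The sketch below shows the deploy-time flags as a comment and a small logging helper for an entrypoint script; the variable names SWARM_SERVICE, SWARM_TASK_ID, SWARM_NODE and the function log_json are assumptions, not Swarm conventions.

```bash
#!/bin/bash
# Assumed deploy-time flags (shown as a comment, not executed here):
#   docker service create \
#     --env SWARM_SERVICE='{{.Service.Name}}' \
#     --env SWARM_TASK_ID='{{.Task.ID}}' \
#     --env SWARM_NODE='{{.Node.Hostname}}' ...

# log_json is a hypothetical helper: it stamps every log line with the task
# identity so the trail survives a reschedule. It does not escape quotes in
# the message, so keep messages simple or swap in a real JSON encoder.
log_json() {
  local level="$1" message="$2"
  printf '{"level":"%s","service":"%s","task":"%s","node":"%s","msg":"%s"}\n' \
    "$level" "${SWARM_SERVICE:-unknown}" "${SWARM_TASK_ID:-unknown}" \
    "${SWARM_NODE:-unknown}" "$message"
}
```

Calling log_json info "started" inside a task then yields one JSON line that a log aggregator can group by task rather than by container name.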

TaskServiceExample.ymlYAML

# io.thecodeforge - devops tutorial

version: '3.8'

services:
  auth-api:
    image: registry.thecodeforge.io/auth-api:v2.4.1
    deploy:
      replicas: 5
      resources:
        limits:
          cpus: '0.5'
          memory: 256M
        reservations:
          cpus: '0.25'
          memory: 128M
    environment:
      - SERVICE_NAME=auth-api
      - LOG_FORMAT=json
    logging:
      driver: json-file
      options:
        tag: "{{.Name}}/{{.ID}}"

Output

Task ID: z8xk3v9m0n1a2b4c

Container: auth-api.1.z8xk3v9m0n1a2b4c

Node: worker-03

Status: Running 2h 7m 34s

⚠ Production Trap: Overlapping Task IDs During Rollback

During rollback, old and new task IDs coexist for seconds. Your log aggregator will see duplicate entries unless you filter by service version label or task creation timestamp.

🎯 Key Takeaway

A service is what you deploy. A task is what runs. Never confuse the declarative spec with the ephemeral instance.

Ports and Protocols: The Firewall Dance That Breaks Your Swarm

You've initialized your swarm, added workers, and everything works on your laptop. Then you deploy to bare metal in a colo and nodes can't talk to each other. Welcome to networking hell.

Swarm mode needs specific ports open between all nodes: not just manager to worker, but worker to worker, and manager to manager. Cluster management and Raft consensus traffic uses TCP port 2377. Overlay network (VXLAN) data traffic, including the ingress routing mesh, uses UDP port 4789. The node-to-node gossip protocol uses port 7946 on both TCP and UDP.
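
As a rough sketch, the port list from Docker's swarm documentation (2377/tcp, 7946/tcp and udp, 4789/udp) can be turned into firewall rules with a small loop. The example assumes ufw as the host firewall and a single CIDR that covers all swarm nodes; both are illustrative choices, and the function names are hypothetical.

```bash
#!/bin/bash
# Ports Swarm needs open between all nodes, one port/protocol pair per line.
swarm_ports() {
  cat <<'EOF'
2377/tcp
7946/tcp
7946/udp
4789/udp
EOF
}

# open_swarm_ports assumes ufw and a CIDR that contains every swarm node.
# Restricting the source keeps these ports closed to the public internet.
open_swarm_ports() {
  local cidr="$1" entry
  for entry in $(swarm_ports); do
    ufw allow from "$cidr" to any port "${entry%/*}" proto "${entry#*/}"
  done
}
```

Run open_swarm_ports 10.0.0.0/24 (as root) on every node. Overlay networks created with --opt encrypted additionally need IP protocol 50 (ESP) allowed between nodes.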

Here's what the docs won't scream at you: opening these ports on cloud firewalls isn't enough. If your nodes are in different subnets with network ACLs between them, VXLAN encapsulation might get dropped. Check your MTU too — overlay networks add 50 bytes of overhead. Standard 1500 MTU on the underlay will fragment packets if you're not careful, and some cloud providers drop fragments silently.
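
The MTU arithmetic can be made explicit when creating the overlay network. This is a minimal sketch that assumes the com.docker.network.driver.mtu driver option; the helper names overlay_mtu and create_overlay are hypothetical.

```bash
#!/bin/bash
# VXLAN adds 50 bytes of headers, so the overlay MTU should be at least
# 50 bytes smaller than the underlay MTU.
overlay_mtu() {
  local underlay_mtu="$1"
  echo $(( underlay_mtu - 50 ))
}

# create_overlay is a hypothetical wrapper; the network name is the caller's.
create_overlay() {
  local name="$1" underlay_mtu="${2:-1500}"
  docker network create --driver overlay \
    --opt com.docker.network.driver.mtu="$(overlay_mtu "$underlay_mtu")" \
    "$name"
}
```

With a standard 1500-byte underlay, create_overlay swarm-overlay creates the network with an MTU of 1450, which keeps encapsulated packets under the underlay limit.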

Test with a simple service that pings between nodes before you declare victory.

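SwarmPortCheck.shBASH

# io.thecodeforge - devops tutorial

# Assumes an overlay network named swarm-overlay already exists.
# tasks.netcheck resolves to the IPs of this service's tasks via Swarm DNS.
docker service create \
  --name netcheck \
  --network swarm-overlay \
  --replicas 3 \
  alpine sh -c "while true; do \
    ping -c1 tasks.netcheck; sleep 5; done"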

# Verify with:
docker service logs netcheck | grep -E "(time=|unreachable)"

Output

64 bytes from 10.0.0.7: seq=0 ttl=64 time=0.853 ms

64 bytes from 10.0.0.9: seq=1 ttl=64 time=0.921 ms

64 bytes from 10.0.0.8: seq=2 ttl=64 time=0.887 ms

🔥Senior Shortcut: Use Your Cloud Service Mesh Routing

Skip VXLAN headaches in multi-region setups. Deploy a global ingress service (like HAProxy or Envoy) on each region's swarm and route traffic via DNS-based load balancing. Overlay networks across data centers will destroy your latency budget.

🎯 Key Takeaway

Three ports control your swarm's life: 2377/tcp for cluster management and Raft, 7946/tcp+udp for gossip, 4789/udp for overlay traffic. Miss any and your cluster degrades silently.

Three Host Machines — Don't Even Think About Fewer

Your swarm needs at least three manager nodes. Not two. Not one. Three. This is the minimum to survive a single node failure without losing the Raft quorum you read about earlier.

Why three? Raft consensus requires a majority of managers. With three managers, you can lose one and still have two, which is a majority. With two, losing one leaves one of two, which is not a majority, so quorum is lost. The control plane freezes: no scheduling, no updates, no rescheduling of failed tasks. Containers that are already running keep running, but you can no longer manage the cluster. Production shops that run two managers are one disk failure away from a cluster-wide lockup.

Managers hold the cluster state. That state is distributed via Raft logs. Even if your applications run on worker nodes, the managers coordinate everything — service discovery, scheduling, scaling. Three hosts means you can reboot one for patches and the swarm keeps chewing. Anything less is gambling with your production pipeline.
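
The quorum arithmetic is small enough to check in a shell. The sketch below uses the standard Raft majority rule (quorum is floor(N/2) + 1); the function names are illustrative.

```bash
#!/bin/bash
# Majority needed for N managers to make decisions.
quorum() { echo $(( $1 / 2 + 1 )); }

# Managers that can fail before quorum is lost.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

# Print the table for 1 to 5 managers.
quorum_table() {
  local n
  for n in 1 2 3 4 5; do
    echo "managers=$n quorum=$(quorum "$n") tolerated=$(tolerated_failures "$n")"
  done
}
```

Running quorum_table shows why even counts buy nothing: 4 managers need a quorum of 3 and tolerate 1 failure, exactly like 3 managers, while adding one more machine that can fail.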

docker-stack.ymlYAML

# io.thecodeforge - devops tutorial

version: '3.9'

services:
  api:
    image: internals-api:2.1.4
    deploy:
      mode: replicated
      replicas: 3
      placement:
        constraints:
          - node.role == worker
      resources:
        limits:
          cpus: '1.5'
          memory: 1024M
        reservations:
          cpus: '0.5'
          memory: 512M
    ports:
      - "8080:8080"

networks:
  app-net:
    driver: overlay
    attachable: true

Output

Service api-svc scheduled across 3 worker nodes

Manager nodes: 3 (node-m1, node-m2, node-m3)

Raft quorum: healthy (3/3)

⚠ Production Trap:

Docker Swarm allows a single-manager setup for development. Never take that to production. When that one manager fails or becomes unreachable, there is no quorum to fall back on and your swarm becomes a paperweight.

🎯 Key Takeaway

Three managers or go home. Two is a quorum bomb waiting to explode.

Don't Run Apps on Managers — That's Not Their Job

Managers run the control plane. They gossip cluster state, maintain Raft logs, and serve the Docker API. They are not compute nodes. You wouldn't run your web server on the Kubernetes control plane, so don't do it in Swarm.

By default, Swarm schedules services onto manager nodes. You must explicitly drain managers or add placement constraints to force workloads onto worker nodes. The node.role == worker constraint in your compose file or service create command does exactly that. Without it, your API container could land on a manager during a rolling update, consuming CPU and memory that your cluster brain needs to stay responsive.

Separate concerns = separate node roles. Managers handle orchestration. Workers run containers. If one manager crashes under load because your app ate its memory, you lose not just that host but potentially the quorum. Keep managers lean, dedicated, and isolated from application traffic. Your future self — and your on-call team — will thank you.

service-constraint.ymlYAML

# io.thecodeforge - devops tutorial

version: '3.9'

services:
  payment-worker:
    image: payment-backend:3.0.1
    deploy:
      replicas: 4
      placement:
        constraints:
          - node.role == worker
    environment:
      - NODE_ENV=production
    networks:
      - backend-net

networks:
  backend-net:
    driver: overlay

Output

payment-worker scheduled on worker-node-01

payment-worker scheduled on worker-node-02

payment-worker scheduled on worker-node-03

payment-worker scheduled on worker-node-04

No managers used for workload

💡Senior Shortcut:

After initializing your swarm, immediately run docker node update --availability drain <manager-hostname> on all managers to prevent any service from landing there accidentally.
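
Draining every manager by hand is easy to forget when a new manager joins. A minimal sketch of a loop that does it for all current managers follows; drain_managers is a hypothetical helper name, and it must run on a manager node.

```bash
#!/bin/bash
# Sketch: set every manager to drain so no service task is scheduled there.
drain_managers() {
  local node
  docker node ls --filter role=manager --format '{{.Hostname}}' |
    while read -r node; do
      docker node update --availability drain "$node"
    done
}
```

Draining also evicts tasks already running on those managers, so run it before deploying workloads or expect those tasks to be rescheduled onto workers.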

🎯 Key Takeaway

Managers orchestrate, workers execute. Never blur the line.

Autoscaling in Swarm — Script-Based Scaling and Adaptive Polling

Docker Swarm lacks a built-in autoscaler (unlike Kubernetes HPA). Teams must build their own using the Docker API, external metrics, and cron-driven or event-driven triggers. The standard pattern: a monitoring script polls CPU, memory, or custom metrics (queue depth, request latency) and calls docker service scale when thresholds are crossed.

Script-based autoscaling pattern:
- A polling agent runs on a manager node or a dedicated monitoring host.
- The agent queries a metrics source (Prometheus, Datadog, CloudWatch, docker stats).
- It evaluates scale-up/down rules, for example: if CPU > 80% for 3 consecutive intervals, scale up by 2.
- It executes docker service scale <service>=<replicas>.
- An optional cooldown period prevents thrashing.

Adaptive polling rate: Static intervals (every 60s) waste resources during steady state and react too slowly during traffic spikes. Adaptive polling adjusts frequency based on metric volatility:
- Low volatility (steady traffic): poll every 120s
- Medium volatility (gradual ramp): poll every 30s
- High volatility (flash crowd): poll every 10s
- Use the standard deviation of the last 5 data points as the volatility signal
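
The volatility rule above could be sketched as a function that maps recent samples to the next polling interval. The standard-deviation cut-offs (5 and 15 percentage points) are assumptions chosen for illustration; the text does not specify them, so tune them for your metric.

```bash
#!/bin/bash
# poll_interval prints the next polling interval in seconds, given the most
# recent metric samples as arguments (for example the last 5 CPU readings).
poll_interval() {
  printf '%s\n' "$@" | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0          # guard against floating-point rounding
      sd = sqrt(var)                # population standard deviation
      if (sd < 5)       print 120   # low volatility: steady traffic
      else if (sd < 15) print 30    # medium volatility: gradual ramp
      else              print 10    # high volatility: flash crowd
    }'
}
```

A polling loop would keep a sliding window of five samples and call sleep "$(poll_interval "${window[@]}")" between checks.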

docker service scale patterns:
- Scale up aggressively, scale down conservatively: scale up 2 at a time when CPU > 80%, scale down 1 at a time when CPU < 30% for 5 minutes.
- Minimum replicas floor: never scale below 2 (HA) or 3 (rolling update headroom).
- Maximum replicas ceiling: cap at cluster capacity minus headroom for rolling updates. With --update-order start-first, new tasks start before old ones stop, so leave room for at least --update-parallelism extra tasks.
- Global services cannot be scaled with docker service scale; they run one task per node by definition.
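
These rules reduce to a small pure function, which makes them easy to test separately from any Docker calls. In this sketch the function name next_replicas is hypothetical, CPU is an integer percentage, and the "below 30% for 5 minutes" duration check is assumed to happen in the caller.

```bash
#!/bin/bash
# next_replicas CPU CURRENT MIN MAX
# Prints the replica count to scale to: +2 above 80% CPU, -1 below 30%,
# otherwise unchanged, always clamped to the [MIN, MAX] range.
next_replicas() {
  local cpu="$1" current="$2" min="$3" max="$4" desired
  desired=$current
  if (( cpu > 80 )); then
    desired=$(( current + 2 ))   # scale up aggressively
  elif (( cpu < 30 )); then
    desired=$(( current - 1 ))   # scale down conservatively
  fi
  if (( desired < min )); then desired=$min; fi
  if (( desired > max )); then desired=$max; fi
  echo "$desired"
}
```

For example, next_replicas 90 6 3 20 prints 8, while next_replicas 85 19 3 20 prints 20 because of the ceiling.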

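Production autoscaling script example:

```bash
#!/bin/bash
# Simple CPU-based autoscaler. docker stats only sees containers on the local
# node, so run this where the tasks run or swap in a cluster-wide metrics
# source such as Prometheus.
SERVICE="io-thecodeforge-api"
MAX_REPLICAS=20
MIN_REPLICAS=3

while true; do
  # docker stats needs container IDs, not the task IDs from docker service ps
  CONTAINERS=$(docker ps -q --filter "label=com.docker.swarm.service.name=$SERVICE")
  if [ -z "$CONTAINERS" ]; then sleep 30; continue; fi

  CPU=$(docker stats --no-stream --format '{{.CPUPerc}}' $CONTAINERS \
    | sed 's/%//' | awk '{s+=$1} END {print s/NR}')
  # Read the desired replica count from the service spec (exact match on name)
  REPLICAS=$(docker service inspect --format '{{.Spec.Mode.Replicated.Replicas}}' "$SERVICE")

  if (( $(echo "$CPU > 80" | bc -l) )) && (( REPLICAS < MAX_REPLICAS )); then
    TARGET=$(( REPLICAS + 2 > MAX_REPLICAS ? MAX_REPLICAS : REPLICAS + 2 ))
    docker service scale "$SERVICE=$TARGET"
    sleep 60   # cooldown
  elif (( $(echo "$CPU < 30" | bc -l) )) && (( REPLICAS > MIN_REPLICAS )); then
    docker service scale "$SERVICE=$((REPLICAS - 1))"
    sleep 120  # cooldown
  fi
  sleep 30
done
```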

Limitation: docker service scale does not roll tasks out gradually. Scaling from 6 to 20 replicas in a single command asks the scheduler to create 14 new tasks at once, which may overwhelm the scheduler or hit resource limits. The --update-parallelism setting applies to service updates, not to scaling, so pace it yourself by stepping: scale to 10, wait 30s, scale to 14, and so on.
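
The stepping idea can be sketched as a loop around docker service scale. The helper name scale_in_steps, the default step of 4, and the 30-second pause are illustrative assumptions.

```bash
#!/bin/bash
# scale_in_steps SERVICE CURRENT TARGET [STEP] [PAUSE_SECONDS]
# Scales up in increments instead of creating every new task at once.
scale_in_steps() {
  local service="$1" current="$2" target="$3" step="${4:-4}" pause="${5:-30}"
  while (( current < target )); do
    current=$(( current + step ))
    if (( current > target )); then current=$target; fi
    docker service scale "$service=$current" || return 1
    # Pause between steps, but not after the final one.
    if (( current < target )); then sleep "$pause"; fi
  done
  return 0
}
```

Calling scale_in_steps io-thecodeforge-api 6 20 issues four scale commands (10, 14, 18, 20) with a pause between them, giving the scheduler and image pulls time to settle.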

⚠ Autoscaling is not built-in — plan your custom solution early

Docker Swarm has no HPA equivalent. Every autoscaling solution is a custom script or external tool. The Swarm API is simple (docker service scale), but you must build the metric collection, evaluation logic, cooldown, and safety guards yourself. For production, consider a community-maintained Swarm autoscaler or wrap the Docker SDK in a small service of your own rather than relying on a shell loop.

📊 Production Insight

A team built an autoscaler that checked CPU every 10 seconds. During a flash sale, the metric spiked for one polling cycle and triggered a scale-up, then dropped, but the cooldown prevented a scale-down for 2 minutes. The service was over-provisioned for 2 minutes, which was acceptable. The opposite case was worse: a scale-down triggered on a transient dip, followed immediately by a scale-up, thrashed the cluster. Solution: require 3 consecutive intervals above or below the threshold before acting.
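
A minimal sketch of that guard is a counter that resets whenever a sample falls back under the threshold. The names BREACHES, REQUIRED, and record_sample are hypothetical, and values are integer percentages.

```bash
#!/bin/bash
# State for the consecutive-breach counter.
BREACHES=0
REQUIRED=3

# record_sample VALUE THRESHOLD
# Returns 0 (act now) only after REQUIRED consecutive samples exceed the
# threshold. Any sample at or below the threshold resets the counter.
record_sample() {
  local value="$1" threshold="$2"
  if (( value > threshold )); then
    BREACHES=$(( BREACHES + 1 ))
  else
    BREACHES=0
  fi
  (( BREACHES >= REQUIRED ))
}
```

In a polling loop, if record_sample "$CPU" 80; then ...; BREACHES=0; fi scales up only on a sustained spike. Call it directly rather than inside $(...), since a subshell would discard the counter. A second counter with the comparison reversed covers the scale-down side.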

🎯 Key Takeaway

Swarm lacks built-in autoscaling — you must build custom scripts or use third-party tools. Use adaptive polling rates to balance reactivity and resource cost. Scale up aggressively, scale down conservatively, and always enforce min/max replica bounds. Require multiple consecutive threshold crossings to avoid thrashing.

Swarm External Secrets Plugin — HashiCorp Vault Integration

Docker's built-in secrets store secrets in the Raft log. For enterprise teams, this is insufficient — they need secrets stored in a dedicated secrets manager like HashiCorp Vault with audit logging, automatic rotation, and fine-grained access control. The swarm-external-secrets plugin bridges this gap.

swarm-external-secrets plugin:
- A Docker Engine plugin that replaces the default secrets backend with HashiCorp Vault
- Secrets are never stored in the Raft log; they are fetched from Vault at container startup
- Supports Vault KV v2 (key-value with versioning) and KV v1
- Integrates with Vault AppRole authentication (RoleID + SecretID)
- SHA256 hash-based rotation detection: the plugin periodically checksums the secret in Vault and compares it to the mounted version. If the hash differs, the plugin signals the service to restart with the updated secret

Vault KV v2 integration:
- Secrets are stored under a Vault path like secret/data/docker/swarm/db-password
- Versioning is automatic: each write creates a new version
- The plugin reads the latest version unless a specific version is requested
- Supports rollback: specify version=X in the secret options to pin to a specific version

AppRole authentication:
- Vault AppRole provides machine-to-machine authentication
- RoleID is like a username (public, stored in Docker config)
- SecretID is like a password (sensitive, rotated regularly)
- The plugin authenticates with RoleID + SecretID to obtain a Vault token
- The token has constrained policies that limit which secrets the plugin can read
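
To see what an AppRole login looks like outside the plugin, the same exchange can be done with the Vault CLI. This sketch assumes the vault binary is installed, VAULT_ADDR is set, and the AppRole auth method is mounted at its default path auth/approle; approle_login is a hypothetical helper and is not the plugin's own code.

```bash
#!/bin/bash
# approle_login ROLE_ID SECRET_ID
# Exchanges the RoleID + SecretID pair for a Vault token and prints it.
approle_login() {
  local role_id="$1" secret_id="$2"
  vault write -field=token auth/approle/login \
    role_id="$role_id" secret_id="$secret_id"
}
```

This is handy for verifying that a RoleID and SecretID pair works, and that its policies allow reading the expected path, before handing the credentials to the plugin.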

SHA256 hash-based rotation detection:
- The plugin caches the secret content in memory and stores its SHA256 hash
- At each polling interval (default: 60s), it re-reads the secret from Vault and compares the SHA256 hash
- If the hash differs, the secret has been rotated
- The plugin can take one of three actions:
  1. Emit an event (monitoring alert)
  2. Signal running containers to reload the secret (SIGHUP)
  3. Force restart the service task (a forced service update)
- Action is configured per-secret via the secret driver options
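
The hash comparison itself is simple. The following sketch illustrates the idea with sha256sum from GNU coreutils; it is not the plugin's implementation, and the function names are hypothetical.

```bash
#!/bin/bash
# secret_hash prints the SHA256 hex digest of whatever arrives on stdin.
secret_hash() {
  sha256sum | awk '{print $1}'
}

# secret_rotated CACHED_HASH CURRENT_VALUE
# Returns 0 when the freshly fetched value no longer matches the cached hash.
secret_rotated() {
  local cached_hash="$1" current_value="$2" current_hash
  current_hash=$(printf '%s' "$current_value" | secret_hash)
  [ "$current_hash" != "$cached_hash" ]
}
```

Comparing hashes rather than values means the poller only needs to keep a digest between polls, not a second copy of the secret.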

Installation and setup:

```bash
# Install the plugin
docker plugin install swarm-external-secrets --alias vault-secrets

# Create a Docker config with Vault connection details
# (docker config create reads the heredoc from stdin via the trailing -)
docker config create vault-config.yaml - <<EOF
# …
vault_secret_id: ""
auth_method: approle
kv_version: 2
poll_interval: 60s
EOF

# Create a secret referencing Vault path
docker secret create \
  --driver vault-secrets \
  --template-driver vault-secrets \
  --label vault.path=secret/data/docker/swarm/db-password \
  --label vault.poll=true \
  --label vault.onchange=restart \
  db-password -
```

Production caveats:
- Plugin must be installed on every manager node
- Vault must be highly available (Vault HA cluster)
- Network latency to Vault adds to container startup time (typically 100-500ms per secret)
- If Vault is unreachable, new container starts will fail, so build Vault health monitoring into your Swarm alerting
- SHA256 polling adds load to Vault; for 100+ secrets, increase poll_interval to 300s

🔥When to use swarm-external-secrets vs built-in secrets

Built-in secrets are fine for most teams: simple, zero infrastructure, encrypted in Raft. Use swarm-external-secrets when you need Vault audit logs for compliance (SOC2, PCI-DSS), cross-platform secrets that K8s and Swarm share, or automatic rotation without service redeployment. The external plugin adds operational complexity — a Vault cluster to maintain, network latency, and a failure domain that can block container starts.

📊 Production Insight

A fintech team used swarm-external-secrets with SHA256 polling set to restart containers on secret rotation. Their compliance policy required database password rotation every 30 days. When the credential rotated, the plugin detected it within 60s, restarted all service tasks (one by one, respecting update parallelism), and the application picked up the new password — zero downtime, no manual intervention. The Vault audit log showed exactly when each task read the secret, satisfying the SOC2 audit requirement.

🎯 Key Takeaway

swarm-external-secrets plugin replaces the Raft-backed secret store with HashiCorp Vault. It supports KV v2, AppRole auth, and SHA256 hash-based rotation detection. Use it for compliance requirements (audit logs), cross-platform secrets, or automatic rotation. The trade-off: you must maintain a Vault cluster and the plugin adds startup latency and a new failure domain.

Placement Constraints and Preferences — Controlling Where Containers Land

In a multi-node Swarm cluster, the default spread scheduler places tasks on the node with the fewest existing tasks of the same service. For most services, this is sufficient. But production deployments require fine-grained control: pin critical services to SSD nodes, distribute replicas across availability zones, keep batch jobs on dedicated nodes, or reserve certain nodes for stateful workloads.

Node labels: The foundation of placement control. Labels are key-value pairs attached to nodes. Apply them with docker node update --label-add:

```bash
docker node update --label-add disk=ssd worker-1
docker node update --label-add zone=us-east-1a worker-1
docker node update --label-add zone=us-east-1b worker-2
docker node update --label-add tier=frontend worker-1
docker node update --label-add tier=backend worker-2
docker node update --label-add gpu=true worker-3
```

Labels are persistent until removed. They survive node reboots and daemon restarts. View labels with: docker node inspect --format '{{.Spec.Labels}}' <node>

--constraint flag (hard requirements): If the constraint cannot be satisfied, the task stays in 'Pending' state until some node satisfies it. Common constraint patterns:
- node.role==worker: never run on managers (essential for production)
- node.role==manager: run only on managers (for monitoring agents that need API access)
- node.labels.zone==us-east-1a: specific availability zone
- node.labels.zone!=us-east-1a: exclude a zone
- node.labels.disk==ssd: SSD only
- node.platform.os==linux: OS type filter
- node.hostname==specific-node: pin to one node (avoid unless necessary, because it creates a single point of failure)

Constraints are ANDed. Multiple constraints must all be satisfied. If you specify --constraint node.labels.zone==us-east-1a --constraint node.labels.disk==ssd, the node must have both zone=us-east-1a AND disk=ssd labels.

--placement-pref flag (soft preferences): Preferences are optimization hints, not hard rules. Swarm tries to satisfy them but can place the task on any eligible node. The only supported strategy is spread, which distributes tasks evenly across the values of a node label. Example: --placement-pref 'spread=node.labels.zone' spreads tasks across all zones. Swarm has no binpack placement preference; to concentrate tasks on fewer nodes, narrow the eligible set with constraints instead.

Multiple preferences are evaluated in order. The first preference has the highest priority.

Global vs replicated mode: Understanding placement requires knowing the service mode:
- --mode replicated: defines a specific number of replicas. The scheduler places them according to constraints and preferences. Best for stateless services where you control the count.
- --mode global: runs exactly one task per node (matching constraints). The scheduler does not choose which nodes; every eligible node gets one task. Best for monitoring agents, log shippers, node-level daemons.

Global services respect constraints. A global service with --constraint node.labels.zone==us-east-1a runs one task on every node that has zone=us-east-1a. Nodes without that label get zero tasks.

Failure scenario — constraint prevents scheduling entirely: A team added --constraint node.labels.disk==ssd to their database service but forgot to label any nodes with disk=ssd. The service stayed in 'Pending' for hours. The team assumed Swarm was broken and considered rebuilding the cluster. The fix: label the appropriate nodes (docker node update --label-add disk=ssd worker-2) and the tasks started immediately. Lesson: always verify labels exist before adding constraints.
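
A deploy script can run that verification before it creates the service. The sketch below lists nodes by label using docker node inspect with a Go template; the helper names nodes_with_label and require_labeled_nodes are hypothetical, and it must run on a manager node.

```bash
#!/bin/bash
# nodes_with_label KEY VALUE
# Prints the hostname of every node whose label KEY equals VALUE.
nodes_with_label() {
  local key="$1" value="$2" id
  for id in $(docker node ls -q); do
    docker node inspect \
      --format "{{.Description.Hostname}} {{index .Spec.Labels \"$key\"}}" "$id"
  done | awk -v want="$value" '$2 == want {print $1}'
}

# require_labeled_nodes KEY VALUE
# Fails when no node matches, so the script can stop before creating a
# service whose tasks would sit in Pending.
require_labeled_nodes() {
  local matches
  matches=$(nodes_with_label "$1" "$2")
  if [ -z "$matches" ]; then
    echo "no nodes labeled $1=$2" >&2
    return 1
  fi
  echo "$matches"
}
```

Placing require_labeled_nodes disk ssd || exit 1 ahead of docker service create --constraint 'node.labels.disk==ssd' turns hours of silent Pending into an immediate, explicit error.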

io/thecodeforge/swarm-placement.shBASH

#!/bin/bash
# Placement constraints and preferences for production

# ── Label nodes for availability zones ────────────────────────────
docker node update --label-add az=us-east-1a node-1
docker node update --label-add az=us-east-1a node-2
docker node update --label-add az=us-east-1b node-3
docker node update --label-add az=us-east-1b node-4
docker node update --label-add az=us-east-1c node-5
docker node update --label-add az=us-east-1c node-6

# ── Label nodes for disk type ─────────────────────────────────────
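# docker node update accepts one node per call, so loop over the nodes
for node in node-1 node-3 node-5; do docker node update --label-add disk=ssd "$node"; done
for node in node-2 node-4 node-6; do docker node update --label-add disk=hdd "$node"; done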

# ── Deploy with constraints and preferences ───────────────────────
docker service create \
  --name io-thecodeforge-api \
  --replicas 9 \
  --constraint 'node.role==worker' \
  --constraint 'node.labels.disk==ssd' \
  --placement-pref 'spread=node.labels.az' \
  --placement-pref 'spread=node.hostname' \
  io.thecodeforge/api:v2.3.1
# 9 replicas spread across 3 SSD nodes in 3 AZs
# Only nodes with disk=ssd and role!=manager are eligible
# Swarm spreads evenly across AZs first, then across hostnames

# ── Global service (one task per eligible node) ───────────────────
docker service create \
  --name io-thecodeforge-log-agent \
  --mode global \
  --constraint 'node.labels.disk==ssd' \
  --mount type=bind,source=/var/log,target=/var/log \
  io.thecodeforge/log-agent:latest
# Runs on every node with disk=ssd

# ── Verify placement ──────────────────────────────────────────────
docker service ps io-thecodeforge-api \
  --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}'

# ── Update constraints on a running service ───────────────────────
docker service update \
  --constraint-add 'node.labels.tier==frontend' \
  --constraint-rm 'node.labels.disk==ssd' \
  io-thecodeforge-api

Output

NAME NODE CURRENT STATE

io-thecodeforge-api.1 node-1 Running 2m

io-thecodeforge-api.2 node-3 Running 2m

io-thecodeforge-api.3 node-5 Running 2m

io-thecodeforge-api.4 node-1 Running 1m

io-thecodeforge-api.5 node-3 Running 1m

io-thecodeforge-api.6 node-5 Running 1m

io-thecodeforge-api.7 node-1 Running 1m

io-thecodeforge-api.8 node-3 Running 1m

io-thecodeforge-api.9 node-5 Running 1m

# 3 tasks per node, 3 nodes, 3 AZs — perfect spread

Mental Model

Constraints are gates, preferences are gravity

What's the difference between --mode global and --mode replicated with replicas=N and spread across all nodes?

global runs exactly one task per eligible node, always. If you add a node, a new task starts automatically.
replicated runs N tasks total. Adding a node does not automatically increase tasks — you must docker service scale.
global cannot be scaled with docker service scale. replicated can be scaled up and down.
global is ideal for node-level daemons (log collectors, monitoring agents). replicated is ideal for stateless application services.

📊 Production Insight

A team ran a database service with --mode replicated --replicas 3 and --constraint node.labels.disk==ssd. They added 3 new nodes without the SSD label. The tasks stayed pinned to the original 3 SSD nodes. When one SSD node failed, the task was rescheduled to another SSD node — there were only 2 left, so that node got 2 tasks. This is correct behavior but reduces fault tolerance. For stateful services, pin to specific nodes and use health checks to detect rescheduling delays.
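
If two replicas sharing a node is unacceptable, one option is the --replicas-max-per-node flag available in recent Docker releases. With a cap of 1, a task that cannot get a node of its own stays Pending instead of doubling up. The wrapper below is a sketch; create_capped_service is a hypothetical name and the image is whatever the caller passes.

```bash
#!/bin/bash
# create_capped_service NAME IMAGE REPLICAS
# Creates a replicated service limited to one task per SSD-labeled node.
create_capped_service() {
  local name="$1" image="$2" replicas="$3"
  docker service create \
    --name "$name" \
    --replicas "$replicas" \
    --replicas-max-per-node 1 \
    --constraint 'node.labels.disk==ssd' \
    "$image"
}
```

The trade-off is explicit: after a node failure the service runs below its replica count until a labeled node returns, which is usually easier to alert on than two replicas quietly sharing one failure domain.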

🎯 Key Takeaway

Use node labels as the foundation for placement control. Constraints are hard requirements (tasks stay Pending if unmet). Preferences are soft optimization hints (spread is the only supported strategy). --mode global runs one task per eligible node; --mode replicated runs a fixed number. Verify labels exist before adding constraints to avoid invisible scheduling failures.

Multi-Region Swarm Architecture — Production Deployment Across Continents

Docker Swarm is designed for single-region deployments. Its VXLAN overlay network, gossip protocol, and Raft consensus all assume low-latency, high-bandwidth links between nodes. Across continents, the WAN round trip adds latency and packet loss that the overlay cannot hide. But teams do run Swarm across regions: the trick is to treat each region's Swarm as an independent cluster and use external load balancing to route traffic between them.

Dual-continent production pattern (US + EU):
- Two independent Swarm clusters: swarm-us (3 managers, N workers in us-east-1) and swarm-eu (3 managers, N workers in eu-west-1)
- Each cluster has its own overlay network, secrets, and services
- A global DNS load balancer (Route53, Cloudflare) routes traffic to the nearest region (latency-based routing)
- Each cluster runs the same services with the same stack definition
- Database and cache are shared (RDS, ElastiCache, or a global database like CockroachDB)
- No overlay network spans continents; only the external load balancer connects the regions

docker stack deploy vs docker compose up:
- docker stack deploy: deploys to a Swarm cluster. Uses the deploy: section in docker-compose.yml. Supports all Swarm features (constraints, secrets, modes, rolling updates). Best for production.
- docker compose up: deploys to a single Docker host. Does not use the deploy: section for Swarm scheduling and ignores Swarm-specific fields. Best for local development.
- Common mistake: using docker compose up on a Swarm node expecting Swarm scheduling. The containers run on that one host only and are not managed by Swarm.
- Multi-region pattern: each region has a docker-compose.yml with deploy: sections. Deploy with: DOCKER_HOST=ssh://swarm-us-manager docker stack deploy -c docker-compose.yml app (one per region). Prefer ssh:// or a TLS-protected endpoint over plain tcp on port 2375, which exposes the Docker API without authentication.

Cost data: the $166/year claim for 24 containers versus a realistic breakdown. A production Swarm cluster running 24 containers across 3 worker nodes (t3.medium, 2 vCPU, 4GB RAM each) in us-east-1:
- 3 manager nodes (t3.small, 2 vCPU, 2GB RAM): ~$28/month in total = $336/year
- 3 worker nodes (t3.medium): ~$42/month in total = $504/year
- 6 nodes total, 24 containers (8 containers per worker, modest resource limits)
- Networking (NLB + data transfer): ~$25/year
- Total: ~$865/year
- Per-container cost: ~$36/year

The $166/year figure for 24 containers assumes minimal instance types (t3.nano for managers, t3.micro for workers) in a low-cost region, or spot instances. The realistic cost for a production-grade setup with HA, EBS volumes, and standard instance types is $800-$1,200/year.
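The arithmetic behind the breakdown above can be checked with a few lines of shell. The figures are copied from the text as stated; the variable names are illustrative only.

```bash
#!/bin/sh
# Figures copied from the cost breakdown above (USD).
managers_monthly=28    # 3 x t3.small, combined
workers_monthly=42     # 3 x t3.medium, combined
network_yearly=25      # NLB + data transfer
containers=24

total_yearly=$(( (managers_monthly + workers_monthly) * 12 + network_yearly ))
per_container=$(( total_yearly / containers ))   # integer division, rounds down

echo "total: \$${total_yearly}/year"
echo "per container: ~\$${per_container}/year"
```

This prints a total of $865/year and roughly $36/year per container, matching the list.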

Stack deployment across regions:

```bash
# Configure contexts once, one per regional cluster.
# Note: tcp://...:2375 is unencrypted; prefer ssh:// or TLS in production.
docker context create swarm-us --docker host=tcp://us-manager:2375
docker context create swarm-eu --docker host=tcp://eu-manager:2375

# Deploy to US cluster
docker --context swarm-us stack deploy -c docker-compose.yml app

# Deploy to EU cluster
docker --context swarm-eu stack deploy -c docker-compose.yml app
```

Failure scenario: cross-region overlay. A team joined worker nodes from us-east-1 and eu-west-1 to the same Swarm cluster. The overlay network worked, but request latency jumped from 5ms to 150ms because overlay traffic between tasks now crossed roughly 8,000 km. Worse, the gossip protocol (TCP/UDP 7946) timed out frequently, causing nodes to be marked as 'Unreachable'. The team fixed it by splitting into two clusters and using an external load balancer.

⚠ Never span a single Swarm cluster across continents

VXLAN overlay networks assume LAN-latency links. Cross-continent links add 100-200ms of latency to overlay traffic, the gossip protocol times out, Raft commits slow down, and nodes are frequently marked unreachable. Always deploy independent Swarm clusters per region and use an external DNS load balancer to route traffic.

📊 Production Insight

A SaaS company ran 24 containers across 3 nodes in their US cluster and 24 containers across 3 nodes in their EU cluster. Total cost: ~$1,000/year (including the DNS load balancer). For a startup with $5M ARR and two engineers, this was the right call — Kubernetes would have added $400+/month in EKS control plane costs alone, requiring a dedicated platform engineer. Swarm's simplicity kept their infra spend under control while serving customers on two continents.

🎯 Key Takeaway

For multi-region deployments, treat each region as an independent Swarm cluster with its own overlay network. Use external DNS load balancing to route traffic. docker stack deploy targets Swarm (uses deploy: section); docker compose up targets a single host. Realistic cost for a 24-container production Swarm cluster is $800-$1,200/year; the $166/year figure applies only to minimal instance types.

● Production incident · POST-MORTEM · severity: high

Quorum Loss After Losing 2 of 4 Manager Nodes: All Services Unreachable for 3 Hours

Symptom

After a planned data center maintenance window, the operations team could not deploy new services. docker service ls hung for 30 seconds then returned 'Error response from daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded'. Existing services continued running but were unreachable via the routing mesh. docker node ls on the surviving managers showed the 2 offline managers as 'Down' but 'Reachable' was false for all managers.

Assumption

Team assumed the offline managers would come back after maintenance and the cluster would self-heal. They waited 2 hours. The managers came back online, but the cluster was still unresponsive. They assumed a Docker daemon bug and considered rebuilding the entire cluster from scratch.

Root cause

With 4 manager nodes, the Raft quorum requires at least 3 managers to agree on any state change (quorum = floor(n/2) + 1 = floor(4/2) + 1 = 3). When 2 managers went offline, only 2 remained — insufficient for quorum. The Raft consensus algorithm froze. No new state changes could be committed. When the offline managers returned, they had stale Raft logs. The cluster needed manual intervention to re-establish consensus. The root design flaw was using an even number of managers (4) instead of an odd number (3 or 5).
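The quorum formula above is easy to tabulate. This small sketch (the function names quorum and tolerance are illustrative) prints the majority size and the number of manager failures tolerated for common cluster sizes.

```bash
#!/bin/sh
# quorum N: managers required for a Raft majority, floor(N/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerance N: managers that can fail while a majority still survives.
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5 6 7; do
    echo "managers=$n quorum=$(quorum "$n") tolerated_failures=$(tolerance "$n")"
done
```

The output shows 3 and 4 managers both tolerating 1 failure, and 5 and 6 both tolerating 2, which is why the even-numbered sizes add cost without adding resilience.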

Fix

1. Demoted one offline manager to worker: docker node demote <node-id>. This reduced the manager count to 3, making quorum = 2, which the 2 surviving managers could satisfy. Demotion is itself a Raft write, so it only succeeds while a quorum exists; when quorum cannot be regained, the documented recovery path is docker swarm init --force-new-cluster on a surviving manager.
2. Promoted a worker to manager: docker node promote <worker-id>. This restored the manager count to 3 (odd).
3. Added a monitoring alert for Raft quorum health: docker node ls | grep -c 'Leader\|Reachable' to detect quorum loss early.
4. Documented the rule: always use 3 or 5 managers, never 4 or 6.
5. Migrated critical services to Kubernetes for the long term, as the team's scale exceeded Swarm's sweet spot.

Key lesson

Always use an odd number of manager nodes: 3 or 5. An even number (4, 6) wastes a node without improving fault tolerance.
Quorum = floor(n/2) + 1. With 3 managers, you can lose 1. With 5 managers, you can lose 2. With 4 managers, you can still only lose 1 — the 4th node provides no additional resilience.
Monitor Raft quorum health proactively. A cluster that loses quorum cannot schedule, scale, or update services — even though existing containers keep running.
Never run application workloads on manager nodes. Resource contention can starve the Raft process and cause the manager to appear unreachable, triggering unnecessary leader elections.
When quorum is lost, do not reboot all managers simultaneously. Restore one manager at a time and verify Raft log consistency before bringing up the next.
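The monitoring lesson above can be sketched as a small check that compares healthy managers against the quorum size. This is a minimal sketch, not the team's actual alert: the function name check_quorum and the alert message are hypothetical, and it assumes the check runs on a manager node where docker node ls is available.

```bash
#!/bin/sh
# check_quorum: reads one ManagerStatus value per line on stdin
# (Leader, Reachable, Unreachable, or empty for workers). Prints a
# summary and returns 0 if healthy managers still form a quorum.
check_quorum() {
    total=0
    healthy=0
    while IFS= read -r status; do
        case "$status" in
            Leader|Reachable) total=$((total + 1)); healthy=$((healthy + 1)) ;;
            Unreachable)      total=$((total + 1)) ;;
        esac
    done
    needed=$((total / 2 + 1))
    echo "managers=$total healthy=$healthy quorum=$needed"
    [ "$healthy" -ge "$needed" ]
}

# Usage on a manager node (for example from cron):
#   docker node ls --format '{{.ManagerStatus}}' | check_quorum \
#       || echo "ALERT: Swarm quorum lost or at risk"
```

Once quorum is actually gone, docker node ls itself fails and produces no output; the function then sees zero managers and returns non-zero, so the alert still fires.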

Production debug guide · 6 entries

From quorum loss to service scheduling failures: systematic debugging paths.

Symptom · 01

docker service ls hangs or returns 'DeadlineExceeded'.

→

Fix

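Check Raft quorum health. Run docker node ls on each manager. If the command fails, or fewer than a quorum of managers show 'Leader' or 'Reachable', the cluster has lost quorum. Check whether the affected managers are reachable via SSH, then restart the Docker daemon on them one at a time. If quorum cannot be restored, run docker swarm init --force-new-cluster on a healthy manager to rebuild a single-manager cluster from its existing state, then demote and remove the failed managers and add replacements.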
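When a majority of managers is gone for good, demotion is not available because it is itself a Raft write. Docker's documented recovery path is docker swarm init --force-new-cluster on one healthy manager. The sketch below wraps that in a dry-run helper; recover_quorum, run, and the address 10.0.0.11 are placeholders, and DRY_RUN defaults to 1 so the script only prints what it would do.

```bash
#!/bin/sh
# run: print the command when DRY_RUN=1 (the default here), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_quorum ADDR (hypothetical helper): run on ONE healthy manager only.
recover_quorum() {
    addr="$1"
    # Rebuild a single-manager cluster from this node's existing Raft state.
    run docker swarm init --force-new-cluster --advertise-addr "$addr"
    # List nodes so dead managers can then be demoted and removed.
    run docker node ls
}

recover_quorum 10.0.0.11   # placeholder advertise address
```

Set DRY_RUN=0 only after confirming the chosen node holds up-to-date state; afterwards, add managers back until the count is 3 or 5.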

Symptom · 02

Service tasks are stuck in 'Pending' state and never start.

→

Fix

Check resource constraints: docker service ps <service> --no-trunc. Look for 'no suitable node' errors. Verify node availability: docker node ls. Check if nodes have enough CPU/memory: docker node inspect <node> --format '{{.Description.Resources}}'. Check placement constraints: docker service inspect <service> --format '{{.Spec.TaskTemplate.Placement.Constraints}}'.

Symptom · 03

Service is running but unreachable via published port.

→

Fix

Check if the routing mesh is functioning: curl http://<any-node-ip>:<published-port>. If it works on some nodes but not others, the ingress network may be misconfigured. Inspect the ingress network: docker network inspect ingress. Check if the service has healthy tasks: docker service ps <service> --filter desired-state=running. Verify the container is listening: docker exec <container> ss -tlnp.

Symptom · 04

Rolling update is stuck and not progressing.

→

Fix

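Check update status: docker service ps <service> --no-trunc. Look for tasks in 'Failed' or 'Rejected' state and read the error column (filtering on desired-state=running hides them). Check that the new image exists and is pullable: docker pull <image>. Review the update policy (parallelism, delay, failure action): docker service inspect <service> --format '{{.Spec.UpdateConfig}}'. If new tasks start but fail health checks, inspect the health log on the node running the task: docker inspect --format '{{json .State.Health}}' <container>. Adjust update parallelism and delay: docker service update --update-parallelism 1 --update-delay 30s <service>.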

Symptom · 05

Node shows 'Down' but the server is online.

→

Fix

Check Docker daemon status on the node: systemctl status docker. Check if the node's IP changed (common in cloud environments with dynamic IPs). Swarm uses the IP from docker swarm init/join. If the IP changed, the node must rejoin the cluster. Check firewall rules: Swarm requires TCP 2377 (cluster management and Raft), TCP/UDP 7946 (gossip), and UDP 4789 (overlay VXLAN) to be open between all nodes.
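The firewall check above can be partly scripted. This sketch lists the ports Swarm needs and probes the TCP ones; it assumes an nc binary that supports -z and -w, and the helper names and the address in the usage line are placeholders. UDP ports cannot be probed reliably this way, so they are listed but not tested.

```bash
#!/bin/sh
# swarm_ports: ports that must be open between all Swarm nodes.
swarm_ports() {
    cat <<'EOF'
2377/tcp cluster management (Raft)
7946/tcp node gossip
7946/udp node gossip
4789/udp overlay VXLAN
EOF
}

# check_tcp HOST: probe the TCP ports on HOST with nc (assumed installed).
check_tcp() {
    host="$1"
    swarm_ports | while read -r spec rest; do
        case "$spec" in
            */tcp)
                port="${spec%/tcp}"
                if nc -z -w 2 "$host" "$port" 2>/dev/null; then
                    echo "$host:$port open"
                else
                    echo "$host:$port closed or filtered"
                fi ;;
        esac
    done
}

swarm_ports
# Usage from another node: check_tcp 10.0.0.12   (placeholder address)
```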

Symptom · 06

Secrets or configs not updating in running services.

→

Fix

Docker secrets and configs are immutable and are not versioned. To rotate one, create a new secret under a new name, then update the service to reference it: docker service update --secret-rm <old-secret> --secret-add <new-secret> <service>. Verify the secret is mounted: docker exec <container> ls /run/secrets/.
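The rotation step above can be sketched as a three-command sequence. The helper rotate_secret and all names passed to it are hypothetical; DRY_RUN defaults to 1, so the script only prints the commands. The target= option keeps the file name under /run/secrets/ stable, so the application does not need to know the secret was renamed.

```bash
#!/bin/sh
# run: print the command when DRY_RUN=1 (the default here), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rotate_secret SERVICE OLD NEW FILE TARGET (hypothetical helper)
rotate_secret() {
    service="$1"; old="$2"; new="$3"; file="$4"; target="$5"
    # 1. Create the replacement secret under a new name.
    run docker secret create "$new" "$file"
    # 2. Swap it into the service, keeping the mounted file name.
    run docker service update --secret-rm "$old" \
        --secret-add "source=$new,target=$target" "$service"
    # 3. Remove the old secret once no service references it.
    run docker secret rm "$old"
}

rotate_secret api db_password_v1 db_password_v2 ./db_password.txt db_password
```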

★ Docker Swarm Triage Cheat Sheet

First-response commands when Swarm cluster or service issues are reported.

Cluster unresponsive: docker service ls hangs.

Immediate action

Check Raft quorum across all manager nodes.

Commands

docker node ls

docker info --format '{{.Swarm.ControlAvailable}}' (run on each manager)

Fix now

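If fewer than a quorum of managers are reachable, restart the Docker daemon on one manager at a time. If quorum cannot be restored, run docker swarm init --force-new-cluster on a healthy manager, then clean up dead managers with docker node demote <node-id> and docker node rm <node-id>.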

Service tasks stuck in 'Pending' or 'Failed' state.

Service unreachable via published port on specific nodes.

Rolling update stuck: old tasks not being replaced.

Node shows 'Down' but server is reachable via SSH.

Overlay network connectivity issues between containers on different nodes.

Docker Swarm vs Kubernetes — When to Choose Which

Aspect	Docker Swarm	Kubernetes
Setup complexity	Single command: docker swarm init	Requires kubeadm, kops, or managed service (EKS, GKE)
Learning curve	Low — uses standard Docker CLI	Steep — new concepts (pods, deployments, services, ingress)
Built-in features	Service discovery, load balancing, secrets, rolling updates	All of the above plus CRDs, operators, admission controllers
Networking	VXLAN overlay with routing mesh	CNI plugin model (Calico, Cilium, Flannel)
State management	Raft consensus (embedded in Docker daemon)	etcd (external cluster)
Scaling	Good up to ~100 nodes	Designed for 1000+ nodes
Ecosystem	Limited — fewer third-party tools	Massive — Helm, ArgoCD, Istio, Prometheus, etc.
Best for	Small-to-medium teams, simple deployments, Docker-native workflows	Large-scale, complex workloads, teams with dedicated platform engineers

⚙ Quick Reference

10 commands from this guide

File	Command / Code	Purpose
iothecodeforgeswarm-manager-setup.sh	docker swarm init \	Raft Consensus and Manager Node Architecture
iothecodeforgeswarm-service-deploy.sh	docker service create \	Service Scheduling, Placement Constraints and Resource Limit
iothecodeforgeswarm-networking.sh	docker network create \	Overlay Networks and Cross-Host Container Communication
iothecodeforgeswarm-rolling-update.sh	docker service create \	Rolling Updates, Rollback and Zero-Downtime Deployments
iothecodeforgeswarm-secrets.sh	echo 's3cret_p@ssw0rd' | docker secret create io-thecodeforge-db-password -	Swarm Secrets and Configs
TaskServiceExample.yml	version: '3.8'	Tasks and Services
SwarmPortCheck.yml	docker service create \	Ports and Protocols
docker-stack.yml	version: '3.9'	Three Host Machines
service-constraint.yml	version: '3.9'	Don't Run Apps on Managers
iothecodeforgeswarm-placement.sh	docker node update --label-add az=us-east-1a node-1	Placement Constraints and Preferences

Key takeaways

Docker Swarm is the native orchestration layer built into the Docker Engine. It uses Raft consensus for state management and VXLAN overlay networks for cross-host communication.

Always use an odd number of manager nodes (3 or 5). Even numbers waste a node without improving fault tolerance. Never run workloads on manager nodes.

The ingress routing mesh adds one network hop. For latency-sensitive services, use host-mode publishing. Always open UDP 4789, TCP/UDP 7946, and TCP 2377 between nodes.

Always set --update-failure-action rollback and health checks with --health-start-period. Without rollback, a broken update replaces all healthy containers.

Docker secrets are encrypted, immutable, and mounted as files. Never use ENV for secrets. Secrets are Swarm-only; standalone Docker requires alternatives.

Swarm is ideal for small-to-medium deployments. For 100+ nodes or complex workloads, consider migrating to Kubernetes.

Swarm has no built-in autoscaling; implement custom scripts using docker service scale, adaptive polling rates, and cooldowns to avoid thrashing. Scale up aggressively, down conservatively.

For multi-region deployments, use independent Swarm clusters per region with external DNS load balancing. Never span a single Swarm across continents; VXLAN latency and gossip timeouts will break the cluster.
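The autoscaling takeaway above ("scale up aggressively, down conservatively") can be sketched as a small decision function. Everything here is an illustrative assumption rather than a Swarm default: the function name next_replicas, the CPU thresholds, and the replica bounds. How the CPU percentage is collected is left out.

```bash
#!/bin/sh
# Hypothetical policy values; tune for the actual workload.
MIN_REPLICAS=2
MAX_REPLICAS=20
SCALE_UP_CPU=75      # at or above this average CPU %, scale up
SCALE_DOWN_CPU=30    # at or below this average CPU %, scale down

# next_replicas CURRENT CPU_PCT: print the desired replica count.
next_replicas() {
    current="$1"; cpu="$2"
    if [ "$cpu" -ge "$SCALE_UP_CPU" ]; then
        target=$((current * 2))      # scale up aggressively: double
    elif [ "$cpu" -le "$SCALE_DOWN_CPU" ]; then
        target=$((current - 1))      # scale down conservatively: one at a time
    else
        target="$current"
    fi
    if [ "$target" -lt "$MIN_REPLICAS" ]; then target="$MIN_REPLICAS"; fi
    if [ "$target" -gt "$MAX_REPLICAS" ]; then target="$MAX_REPLICAS"; fi
    echo "$target"
}

# Usage inside a polling loop (service name "web" is a placeholder):
#   docker service scale web="$(next_replicas 4 82)"
#   sleep 300   # cooldown before the next decision, to avoid thrashing
```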


FAQ · 7 QUESTIONS

Frequently Asked Questions

Is Docker Swarm still maintained?

How many manager nodes should I use?

What is the difference between a service and a task in Docker Swarm?

How does the routing mesh work?

Can I use Docker Swarm in production?

How does the swarm-external-secrets Vault plugin work?

How does the cost of Docker Swarm compare to Kubernetes?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

✓ Verified · production tested

Last updated: July 18, 2026

2,466 articles · all by Naren
