Kubernetes StatefulSets — Kafka's 14-Hour Split-Brain
Kafka split-brain from Deployment: 14 hours lost.
20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.
- StatefulSets provide stable network identity, persistent storage, and ordered deployment/scaling for stateful workloads
- Each pod gets a deterministic name (ordinal index) and a dedicated PVC that survives rescheduling
- A headless Service (clusterIP: None) creates DNS A records per pod: pod-name.service-name.namespace.svc.cluster.local
- Ordered rollouts: pods are created sequentially (0, 1, 2...) and deleted in reverse order, not in parallel like Deployments
- Rolling updates follow reverse ordinal order (highest to lowest) — pod N-1 updates first, pod 0 last
- The biggest mistake: running databases as Deployments — pod identity loss causes split-brain, data corruption, or cluster rejoin failures
Imagine a hotel where every guest always gets the same room number, the same locker, and is always checked in and out in the exact same order. That is a StatefulSet. Unlike a regular Deployment — where pods are interchangeable guests who can sleep in any available room — a StatefulSet guarantees each pod has a permanent name, its own private storage that follows it everywhere, and a predictable position in line. Think of it as the difference between a row of numbered safety deposit boxes (StatefulSet) versus a pile of shopping carts where you grab whichever one is closest (Deployment). The box number matters. The cart number does not.
Stateful applications in Kubernetes demand persistent identity, stable network names, and ordered lifecycle—none of which vanilla Deployments provide. Without StatefulSets, your database pods will fight over the same volumes, restart in random order, and break quorum. StatefulSets enforce pod-ordering and bind storage to identity, letting you run Postgres, Cassandra, or Kafka the same way you run stateless microservices, without data corruption or split-brain disasters.
What is a Kubernetes StatefulSet?
A StatefulSet is a Kubernetes workload API object designed specifically for managing stateful applications. The name is deliberately chosen to contrast with the default Kubernetes mental model — stateless pods that are interchangeable, ephemeral, and replaceable. StatefulSets break that model intentionally and provide three guarantees that distributed stateful systems depend on at the protocol level.
First: stable, unique network identity. Each pod in a StatefulSet receives a deterministic name based on its ordinal index — web-0, web-1, web-2. This name is not a random hash suffix. It survives pod restarts, node failures, and rescheduling. Combined with a headless Service, this creates a stable DNS entry that other pods can use to discover and connect to a specific instance — which is critical for systems like ZooKeeper, etcd, and Kafka where cluster membership is a first-class concept baked into the protocol.
Second: stable, persistent storage. volumeClaimTemplates provision a dedicated PersistentVolumeClaim per pod using a naming convention tied to the pod's ordinal. When a pod is deleted and recreated, it reattaches to the same PVC. The storage follows the identity, not the node.
Third: ordered deployment and scaling. Pods are created sequentially — pod 0 must be Running and Ready before pod 1 is created. Pods are deleted in reverse order — pod N-1 before pod N-2 before pod 0. This ordering is not a soft preference — it is enforced by the StatefulSet controller at the API level.
Understanding which of these three guarantees your workload actually needs is more important than knowing the YAML syntax. A single-instance PostgreSQL database needs storage persistence but does not need the ordering guarantees. A ZooKeeper ensemble needs all three.
- Pod identity = ordinal index: web-0, web-1, web-2 — deterministic, survives restarts, rescheduling, and node failures
- DNS identity = pod-name.service-name.namespace.svc.cluster.local — per-pod A record created by the headless Service
- Storage identity = volumeClaimTemplate name + StatefulSet name + ordinal: data-web-0, data-web-1. PVCs are not shared and are not deleted when the pod is
- Ordering guarantee: pods are created sequentially (0 before 1 before 2) and deleted in reverse (2 before 1 before 0) — enforced by the controller, not advisory
- Rolling update order: pod N-1 updates first, pod 0 updates last — preserving the stability of lower-ordinal pods that often hold more critical cluster roles
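The three identities above can be traced through a minimal manifest. This is a sketch under assumptions: the StatefulSet name web and template name data match the examples in the list, while the headless Service name (also web), the app label, the nginx image and the 1Gi size are placeholders chosen for illustration.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                   # pods are named web-0, web-1, web-2
spec:
  serviceName: web            # assumed headless Service name; DNS: web-0.web.<namespace>.svc.cluster.local
  replicas: 3
  selector:
    matchLabels:
      app: web                # assumed label, must match the template labels below
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27   # placeholder image for illustration
          volumeMounts:
            - name: data      # refers to the volumeClaimTemplate below
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data            # PVCs are named data-web-0, data-web-1, data-web-2
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi      # placeholder size
```

Each comment maps one field to the name it produces, which is usually the quickest way to predict what kubectl get pods and kubectl get pvc will show.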
Ordered Operations: Creation, Deletion, and Rolling Updates
StatefulSet operations are strictly ordered in ways that feel unusual if you are used to Deployment behaviour. Understanding this ordering is essential for both operating StatefulSets correctly and for debugging when something gets stuck.
Pod creation is strictly sequential: the controller creates pod 0, then waits until it is Running and Ready before creating pod 1. Pod 1 must be Running and Ready before pod 2 is created. This is not just about startup — it reflects the dependency structure of many distributed systems where node 0 bootstraps the cluster and node 1 joins as a follower. If you scale a StatefulSet from 3 to 5 replicas, pods 3 and 4 are created sequentially in that order.
Pod deletion is strictly reverse-sequential: the controller deletes pod N-1, waits for it to fully terminate, then deletes pod N-2, and so on to pod 0. This preserves quorum during scale-down — a ZooKeeper ensemble being scaled from 5 to 3 nodes loses its two highest-ordinal nodes first, maintaining the 3-node quorum throughout the process rather than potentially losing the primary.
Rolling updates follow a special reverse-ordinal pattern that is different from what most engineers expect. The controller updates pod N-1 first, waits for it to become Ready, then proceeds to pod N-2, down to pod 0 last. Pod 0, which often holds the most critical role in distributed systems (initial voter in ZooKeeper, partition leader for critical topics in Kafka), is updated last — preserving cluster stability for as long as possible during the rollout.
The partition field in the rolling update strategy is one of the most useful and most forgotten StatefulSet features. Setting partition: 3 in a 5-pod StatefulSet means only pods 3 and 4 update when you apply a new template. Pods 0, 1, and 2 remain on the old version. This is a native canary mechanism built into the StatefulSet API: you validate pods 3 and 4 under production traffic, then lower partition to 0 to complete the rollout. The failure mode to remember: if you set partition for a canary and forget to reset it to 0, your cluster keeps running in a split-version state until someone notices.
- A single failing readiness probe on any pod blocks the entire rollout — the controller will not advance to the next ordinal until the current pod passes
- terminationGracePeriodSeconds must be long enough for clean shutdown — databases need 60-120 seconds minimum for WAL flush and checkpoint, not the Kubernetes default of 30
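- If you manually delete a pod during a rolling update, the version it comes back with depends on the partition: a pod with an ordinal below the partition is recreated on the OLD version, while a pod at or above the partition is recreated from the updated template. This is by design but can be confusing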
- OnDelete strategy means pods only update when you manually delete them — useful for workloads that need a human gate between each pod update, but easy to forget and end up with a stalled fleet
- Rule: always define a readiness probe on StatefulSet pods — without one, a CrashLooping pod is considered Ready immediately after its container starts and the rollout advances to a broken state
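The rollout controls discussed in this section sit in two places in the spec: updateStrategy at the top level, and the probe and grace period inside the pod template. The fragment below is a sketch, not a complete manifest; the container name, the postgres:16 image, the pg_isready probe and the 120-second grace period are assumptions picked to match the database examples in the text.

```yaml
# Fragment of a StatefulSet spec (not a complete manifest).
spec:
  replicas: 5
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 3            # canary: only ordinals 3 and 4 receive the new template
  template:
    spec:
      terminationGracePeriodSeconds: 120   # Kubernetes default is 30
      containers:
        - name: db                          # assumed container name
          image: postgres:16                # assumed image
          readinessProbe:
            exec:
              # pg_isready ships with the postgres image and exits 0 once the server accepts connections
              command: ["pg_isready", "-U", "postgres"]
            periodSeconds: 10
            failureThreshold: 3
```

Lowering partition to 0 in a later apply lets the controller continue the rollout through ordinals 2, 1 and 0.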
PVC Lifecycle: Storage That Follows Identity
StatefulSets use volumeClaimTemplates to provision one PersistentVolumeClaim per pod. The naming convention is deterministic: template-name-statefulset-name-ordinal. For a StatefulSet named io-thecodeforge-db with a template named data, the PVCs are data-io-thecodeforge-db-0, data-io-thecodeforge-db-1, and data-io-thecodeforge-db-2. This naming is not configurable — it is generated by the StatefulSet controller.
The storage lifecycle has two properties that every engineer running StatefulSets in production must understand deeply.
First, PVCs survive pod deletion. When a pod is deleted — whether by a rolling update, a manual kubectl delete pod, a node failure, or a scale-down — its PVC is not deleted. When the pod is recreated with the same ordinal, it reattaches to the same PVC. This is the 'sticky storage' guarantee. It is what makes your PostgreSQL data directory survive node failures without data loss.
Second, PVCs are NOT deleted when the StatefulSet is deleted or scaled down. This is a deliberate safety mechanism: accidentally deleting a StatefulSet should not destroy production databases. But it means that scaling down a StatefulSet from 5 to 3 replicas leaves two orphaned PVCs (data-name-3 and data-name-4) that consume cloud storage indefinitely. At $0.08-0.15 per GB per month on most cloud providers, ten orphaned 1 TB PVCs (roughly 10,000 GB in total) accumulate $800-1500 per month in silent storage waste. This is one of the most common cost anomalies in Kubernetes clusters and one of the least visible.
The persistentVolumeClaimRetentionPolicy field on StatefulSets (alpha in Kubernetes 1.23, enabled by default as beta since 1.27, stable in 1.32) allows you to configure automatic PVC deletion on scale-down (whenScaled) or StatefulSet deletion (whenDeleted). Both settings default to Retain. For most production stateful workloads, setting whenScaled to Delete is appropriate, because the PVC for a scaled-down pod can reasonably be considered ephemeral. Setting whenDeleted to Delete is more dangerous and should only be used when you have confirmed out-of-cluster backups.
- Scaling down a StatefulSet leaves PVCs orphaned — they are not deleted automatically unless persistentVolumeClaimRetentionPolicy is configured
- Orphaned PVCs consume cloud storage at full price indefinitely — at $0.10/GB/month, a 50GB PVC costs $5/month sitting unused, and teams typically have dozens
- If the StorageClass has a volume count limit or quota, orphaned PVCs can block new pod scheduling silently
- Deleting a StatefulSet does NOT delete its PVCs by default — you must clean up manually or configure whenDeleted: Delete with confirmed backup coverage
- Rule: after every scale-down or StatefulSet deletion, verify PVC state: kubectl get pvc -l app=<name> and reconcile against expected counts
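The retention policy described above is a two-field block on the StatefulSet spec. A minimal sketch of the conservative combination the text recommends, shown as a fragment and not a full manifest:

```yaml
# Fragment of a StatefulSet spec (not a complete manifest).
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete    # delete the PVC of a pod removed by a scale-down
    whenDeleted: Retain   # keep all PVCs if the StatefulSet itself is deleted (the default)
```

With this setting a scale-down from 5 to 3 removes the PVCs for ordinals 3 and 4, so scaling back up starts those pods on empty volumes.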
Headless Services and DNS: How Pods Find Each Other
A headless Service is the DNS backbone of a StatefulSet. Without it, StatefulSet pods have stable names but no way for other pods to resolve those names to IP addresses. With it, every pod in the StatefulSet gets an individual DNS A record that points directly to that pod's IP — bypassing the load-balancing layer that regular Services add.
The distinction between a regular Service and a headless Service is important to internalise. A regular Service (one with an assigned clusterIP) creates one DNS name that resolves to a virtual IP, and kube-proxy load-balances traffic from that virtual IP to any matching pod. You cannot target a specific pod by DNS with a regular Service. A headless Service (clusterIP: None) creates no virtual IP and no load balancing. Instead, CoreDNS serves one A record per pod, each pointing to that pod's actual IP address. This is how pod-to-pod targeting works.
The DNS naming convention for StatefulSet pods is: pod-name.service-name.namespace.svc.cluster.local. For a StatefulSet named db with a headless Service named db-headless in namespace prod, pod 0's DNS entry is db-0.db-headless.prod.svc.cluster.local. This entry is stable — it always points to the pod with that identity, regardless of which node it runs on. When the pod is rescheduled to a different node with a different IP, CoreDNS updates the A record to reflect the new IP. The DNS name remains constant; only the IP it resolves to changes.
For every named port, the headless Service also gets SRV records of the form _port-name._protocol.service-name.namespace.svc.cluster.local, with one entry per pod. SRV records carry both the hostname and port, so a client that supports SRV-based discovery can query a single name to find all current pod hostnames without needing to know the ordinal count in advance. etcd supports DNS SRV discovery natively; Kafka and ZooKeeper deployments more commonly list the per-pod DNS names explicitly in their configuration.
One practical detail: CoreDNS caches DNS responses with a TTL. For headless Services the default TTL is 5 seconds. If a pod is rescheduled quickly, there is a brief window where other pods may try to connect to the old IP before the cache expires. Design your application's connection retry logic to tolerate this — most database connection pools handle it correctly if configured with appropriate connection timeouts.
- Regular Service: one DNS name, one virtual IP, load-balanced across pods — you cannot target a specific pod
- Headless Service: individual DNS A records per pod — you CAN target a specific pod by its stable name
- Pod DNS pattern: pod-name.headless-svc.namespace.svc.cluster.local — stable across restarts and rescheduling
- SRV records allow applications to discover all pods dynamically without knowing the replica count, which helps any system whose bootstrap supports SRV-based discovery, such as etcd
- CoreDNS TTL for headless Services defaults to 5 seconds — design connection retry logic to tolerate this brief staleness window after pod rescheduling
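The db-headless example from the text could be declared roughly as follows. The Service name and namespace come from the example above; the app: db selector label and the PostgreSQL port 5432 are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db-headless
  namespace: prod
spec:
  clusterIP: None         # this is what makes the Service headless
  selector:
    app: db               # assumed pod label on the StatefulSet's pods
  ports:
    - name: postgres      # named port, so SRV records exist at _postgres._tcp.db-headless.prod.svc.cluster.local
      port: 5432          # assumed port
```

The StatefulSet named db would then set serviceName: db-headless, which is what ties db-0.db-headless.prod.svc.cluster.local to pod db-0.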
PodDisruptionBudget: Protecting Quorum During Voluntary Disruptions
PodDisruptionBudgets (PDBs) are among the most important and most skipped Kubernetes objects for StatefulSets in production. A PDB is a policy object that limits how many pods in a set can be simultaneously unavailable during voluntary disruptions — node drains, cluster upgrades, autoscaler scale-downs, and manual evictions.
For a quorum-based system, the requirement is concrete: a 3-node ZooKeeper cluster needs at least 2 nodes to maintain quorum. A 5-node etcd cluster needs at least 3 nodes. A PDB with minAvailable: 2 tells Kubernetes — during any voluntary disruption, you must ensure at least 2 of my pods are running and Ready before proceeding with the eviction. If a node drain would cause a second pod to become unavailable while the first is still evicting, the drain blocks until the first pod is rescheduled and Ready elsewhere.
This blocking behaviour is the point. Without a PDB, kubectl drain proceeds freely, evicting pods without regard for quorum. A node drain during a cluster upgrade with three StatefulSet pods all on the same node (a common situation if pod anti-affinity is not configured) will evict all three simultaneously, destroying quorum completely.
The combination of PDB plus pod anti-affinity is the production-grade pattern. The PDB protects against voluntary disruptions. Pod anti-affinity with requiredDuringSchedulingIgnoredDuringExecution and topologyKey: kubernetes.io/hostname protects against involuntary disruptions by ensuring no two pods land on the same node. Together, they mean: node drains cannot break quorum, and a single node failure cannot take out more than one pod.
One operational detail: kubectl drain respects PDBs. kubectl delete pod does not. If your operational runbook uses kubectl delete pod to perform maintenance, PDBs provide no protection. Use kubectl drain or Kubernetes Eviction API calls for any maintenance that should respect disruption budgets.
- PDBs only apply to voluntary disruptions: node drains initiated by kubectl drain, cluster upgrades, autoscaler scale-downs, and Kubernetes Eviction API calls
- Node crashes, kernel panics, OOM kills, and hardware failures are involuntary — PDBs do not block or delay them in any way
- For protection against involuntary disruptions, use pod anti-affinity with topologyKey: topology.kubernetes.io/zone to spread pods across availability zones
- A PDB with minAvailable: 1 on a 3-pod StatefulSet allows two simultaneous evictions — that is two out of three nodes gone, which breaks quorum for any majority-voting system
- Rule: set minAvailable to exactly your quorum threshold (floor(n/2) + 1), not to 1 unless you have explicitly accepted the quorum implications
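The PDB plus anti-affinity pattern can be sketched for the 3-member ZooKeeper case, where the quorum threshold is floor(3/2) + 1 = 2. The object name zk-pdb and the app: zk label are hypothetical; the second document is a fragment of the StatefulSet's pod template, not a complete manifest.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb              # hypothetical name
spec:
  minAvailable: 2           # floor(3/2) + 1 for a 3-member ensemble
  selector:
    matchLabels:
      app: zk               # assumed label on the StatefulSet's pods
---
# Fragment of the StatefulSet spec: keep ensemble members on separate nodes.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: zk
              topologyKey: kubernetes.io/hostname   # one pod per node
```

Swapping the topologyKey for topology.kubernetes.io/zone spreads the pods across zones instead, at the cost of needing at least as many zones as replicas for the required rule to be satisfiable.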
Before You Begin: The Prerequisites Nobody Tells You About
StatefulSets look simple until they break your database at 3 AM. Before you touch one, you need to understand three things that most tutorials gloss over.
First, your cluster needs a dynamic PersistentVolume provisioner with a StorageClass that supports ReadWriteOnce. If you're on a bare-metal cluster without a proper CSI driver, your PVCs will stay Pending forever. Check with kubectl get storageclass and verify your provisioner supports volumeBindingMode: WaitForFirstConsumer for StatefulSets with topology constraints.
Second, you need a headless Service. This isn't optional: the StatefulSet's serviceName points at it, and it is what publishes stable per-pod DNS names like statefulset-name-0.service-name.namespace.svc.cluster.local. Without a headless Service (spec.clusterIP: None), your pods keep their ordinal names but those names won't resolve in DNS.
Third, understand that StatefulSets require ordered pod management by default. This means scaling down waits for pod-2 to terminate before touching pod-1. If your app doesn't need ordering, you're paying a latency tax for nothing — consider setting spec.podManagementPolicy: Parallel.
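For the first prerequisite, a StorageClass with delayed binding might look like the sketch below. Everything cluster-specific here is an assumption: the class name fast-ssd is hypothetical, and the provisioner and parameters assume the AWS EBS CSI driver; substitute whatever kubectl get storageclass reports for your cluster.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                          # hypothetical class name
provisioner: ebs.csi.aws.com              # assumption: AWS EBS CSI driver is installed
volumeBindingMode: WaitForFirstConsumer   # provision the volume only once the pod is scheduled
reclaimPolicy: Retain                     # keep the underlying disk if the PVC is deleted
allowVolumeExpansion: true
parameters:
  type: gp3                               # driver-specific parameter, assumed
```

WaitForFirstConsumer matters for topology-constrained pods because the volume is created in the zone where the scheduler actually places the pod.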
Creating a StatefulSet: The Manifest That Actually Works
Stop copy-pasting random YAML from blogs. Here's how a real production StatefulSet looks for a PostgreSQL cluster — and why every field exists.
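A manifest along these lines fits the description. It is a sketch under assumptions, not the exact production file: the StatefulSet name postgres-cluster matches the scaling examples later in the article, while the Service name postgres-headless, the labels, the postgres:16 image, the Secret postgres-credentials, the storage class fast-ssd and the sizes are all hypothetical. It starts three independent PostgreSQL instances; replication setup is out of scope here.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless            # hypothetical name, referenced by serviceName below
spec:
  clusterIP: None                    # headless: per-pod DNS records, no virtual IP
  selector:
    app: postgres-cluster
  ports:
    - name: postgres
      port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-cluster
spec:
  serviceName: postgres-headless     # must match the headless Service name exactly
  replicas: 3
  podManagementPolicy: OrderedReady  # the default, written out for clarity
  selector:
    matchLabels:
      app: postgres-cluster          # must match template.metadata.labels
  template:
    metadata:
      labels:
        app: postgres-cluster
    spec:
      terminationGracePeriodSeconds: 120   # time for a clean shutdown
      containers:
        - name: postgres
          image: postgres:16               # assumed image
          ports:
            - name: postgres
              containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # hypothetical Secret, create it separately
                  key: password
            - name: PGDATA                      # subdirectory avoids the volume's lost+found
              value: /var/lib/postgresql/data/pgdata
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            periodSeconds: 10
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data                   # PVCs: data-postgres-cluster-0, -1, -2
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # hypothetical, use a class that exists in your cluster
        resources:
          requests:
            storage: 100Gi           # placeholder size
```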
The critical parts: serviceName must match your headless Service's name exactly. Kubernetes uses this to generate DNS records. spec.selector.matchLabels must match the pod template labels; on a mismatch the API server rejects the StatefulSet with a validation error instead of creating it.
Notice volumeClaimTemplates instead of volumes. This is the magic that gives each pod its own PVC. Each template creates one PVC per replica, named {volume-name}-{statefulset-name}-{ordinal}. The PVC survives pod deletion, so when the pod comes back (maybe on a different node), it reattaches to the same storage.
The podManagementPolicy defaults to OrderedReady, which waits for pod-0 to be Running and Ready before creating pod-1. This is required for quorum-based apps like databases. If you're running a queue worker or cache, switch to Parallel to avoid the serial bottleneck.
Scaling a StatefulSet: Why Your Database Will Thank You for Ordered Pod Termination
Scaling a StatefulSet is not like scaling a Deployment. Deployment scaling is stupid-fast: it kills or spawns pods in parallel. StatefulSet scaling is surgical and ordered, and that matters when your app has a quorum.
When you scale down from 3 to 1 replica, Kubernetes terminates the highest ordinal pod first: postgres-cluster-2, then postgres-cluster-1. It waits for each pod to be fully terminated and removed before touching the next. No race conditions. No two databases trying to write to the same volume.
There's a hidden gotcha: kubectl scale statefulset postgres-cluster --replicas=0 doesn't delete the PVCs. The volumes hang around with their data intact. This is by design — you can scale back up and the pods reattach to the same disks. But if you manually delete PVCs while the pods are gone, you lose data permanently.
For applications that don't care about ordering (like a distributed cache), set spec.podManagementPolicy: Parallel — scaling goes from O(n) to O(1) time. But for databases, leave it as OrderedReady and let the controller handle the serial dance.
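Opting out of ordering is a single field. The fragment below is a sketch for a workload such as a cache; the replica count is a placeholder.

```yaml
# Fragment of a StatefulSet spec (not a complete manifest).
spec:
  podManagementPolicy: Parallel   # default is OrderedReady
  replicas: 6                     # placeholder: all six pods are launched without waiting on each other
```

Two caveats worth knowing: the field cannot be changed on an existing StatefulSet, so it has to be chosen at creation time, and it only affects scaling operations, so rolling updates still proceed one pod at a time.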
OnDelete: Manual Pod Replacement When You Control the Timing
The OnDelete update strategy exists because automatic rolling updates can be risky for stateful workloads. Unlike RollingUpdate, which replaces pods in reverse ordinal order automatically, OnDelete requires you to delete each pod manually before it gets recreated with the new spec. Why use this? When upgrading a database cluster, you may need to drain connections, wait for replication lag to subside, or run pre-upgrade health checks on each node. OnDelete gives you precise per-pod control over when updates happen. Set updateStrategy.type: OnDelete in your StatefulSet spec. After updating the pod template, no pods change until you run kubectl delete pod <pod-name>. The controller then rebuilds that pod using the new template, preserving its identity and storage. You, not the controller, decide the pace of your production upgrade.
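In manifest form the strategy is a short block. This is a fragment, not a complete manifest:

```yaml
# Fragment of a StatefulSet spec (not a complete manifest).
spec:
  updateStrategy:
    type: OnDelete   # template changes apply only to pods you delete by hand
```

A typical sequence under this strategy is to apply the new template, run whatever per-node checks your runbook requires, and then delete pods one at a time starting from the highest ordinal.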
Cleaning Up: Why StatefulSets Refuse to Die Cleanly
Deleting a StatefulSet does not automatically delete its PersistentVolumeClaims. By design, PVCs survive deletion to prevent accidental data loss. Run kubectl delete statefulset <name> and the pods are removed, but Kubernetes gives no ordering guarantee for that teardown; if you need pods to terminate in reverse order (pod-2, pod-1, pod-0), scale the StatefulSet to 0 first and then delete it. Either way, the PVCs remain. If you recreate the StatefulSet with the same name and volumeClaimTemplates, it will reclaim those exact PVCs. To fully clean up, you must delete the PVCs manually after the StatefulSet is gone: kubectl delete pvc -l app=<your-app>. For cascading cleanup, set persistentVolumeClaimRetentionPolicy in the StatefulSet spec. This policy defines what happens to PVCs when the StatefulSet is scaled down (whenScaled) or deleted (whenDeleted). With whenDeleted: Delete, PVCs are removed automatically. Without this, orphaned PVCs can rack up cloud storage costs indefinitely. Always audit orphaned resources after cleanup.
whenDeleted: Delete does not protect against accidental deletion. If someone deletes the StatefulSet, PVCs are destroyed instantly, with no confirmation. For critical data, use Retain and build an external backup pipeline.
The Kafka Split-Brain: How a Deployment Killed 14 Hours of Messages
- Never run stateful distributed systems — Kafka, ZooKeeper, Elasticsearch, etcd, Redis Cluster — as Deployments; pod identity loss is catastrophic at the protocol level and the failure mode is silent until it is not
- PodDisruptionBudget is mandatory for StatefulSets in production — without one, a node drain can evict every pod simultaneously and break quorum with no warning
- Node drains respect PodDisruptionBudgets, but kubectl delete pod does not — operational runbooks must distinguish between the two
- Test failure modes in staging by simulating node drains before they happen in production: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data and verify quorum is maintained throughout
kubectl describe pod <pod-name> | grep -A15 Events
kubectl get pvc -o wide --selector app=<statefulset-label>
Key takeaways
Common mistakes to avoid
- Running stateful workloads (Kafka, ZooKeeper, Elasticsearch, Redis Cluster) as Deployments
- Not setting a PodDisruptionBudget on production StatefulSets
- Using a regular ClusterIP Service instead of a headless Service for StatefulSet DNS
- Forgetting to clean up orphaned PVCs after scaling down or deleting a StatefulSet
- Not setting terminationGracePeriodSeconds high enough for database pods
- Setting partition for a canary update and forgetting to reset it to 0
Interview Questions on This Topic
What are the three guarantees that a StatefulSet provides that a Deployment does not?