Kubernetes StatefulSets — Kafka's 14-Hour Split-Brain
Running Kafka as a Deployment led to a split-brain that cost 14 hours of messages.
- StatefulSets provide stable network identity, persistent storage, and ordered deployment/scaling for stateful workloads
- Each pod gets a deterministic name (ordinal index) and a dedicated PVC that survives rescheduling
- A headless Service (clusterIP: None) creates DNS A records per pod: pod-name.service-name.namespace.svc.cluster.local
- Ordered rollouts: pods are created sequentially (0, 1, 2...) and deleted in reverse order, not in parallel like Deployments
- Rolling updates follow reverse ordinal order (highest to lowest) — pod N-1 updates first, pod 0 last
- The biggest mistake: running databases as Deployments — pod identity loss causes split-brain, data corruption, or cluster rejoin failures
Imagine a hotel where every guest always gets the same room number, the same locker, and is always checked in and out in the exact same order. That is a StatefulSet. Unlike a regular Deployment — where pods are interchangeable guests who can sleep in any available room — a StatefulSet guarantees each pod has a permanent name, its own private storage that follows it everywhere, and a predictable position in line. Think of it as the difference between a row of numbered safety deposit boxes (StatefulSet) versus a pile of shopping carts where you grab whichever one is closest (Deployment). The box number matters. The cart number does not.
Stateless apps are easy. You spin up ten identical pods, kill any three, Kubernetes replaces them and nobody cares which replacement is which. But the real world is full of systems that refuse to be stateless: databases, message brokers, distributed caches, search engines. These systems have opinions about identity. Elasticsearch node 2 needs to rejoin the cluster as Elasticsearch node 2 — not as some random newcomer that triggers a full shard rebalance. Kafka broker 0 owns specific partitions and cannot pretend to be a fresh broker without corrupting the consumer offset records for every topic it leads. Ignoring this reality and running stateful workloads as Deployments is one of the most expensive mistakes teams make on Kubernetes, and it almost always surfaces at 2am during a production incident when an on-call engineer is staring at split-brain metrics they do not immediately understand.
StatefulSets exist precisely to give Kubernetes the vocabulary to reason about identity, ordering, and sticky storage. They provide three guarantees that Deployments fundamentally cannot: a stable, unique network identity that survives pod restarts and rescheduling; stable, persistent storage that follows the pod regardless of which node it lands on; and ordered, graceful deployment and scaling that respects cluster quorum requirements. These are not conveniences — they are load-bearing architectural properties that distributed consensus protocols depend on at the wire level.
By the end of this article you will understand how StatefulSets work under the hood: the controller loop, the role of the headless service in DNS, how PVC ownership is tracked via OwnerReferences, why pod ordinals matter for rolling updates, and the exact failure modes that bite teams in production. You will also have complete, runnable manifests with every significant field explained.
What is a Kubernetes StatefulSet?
A StatefulSet is a Kubernetes workload API object designed specifically for managing stateful applications. The name is deliberately chosen to contrast with the default Kubernetes mental model — stateless pods that are interchangeable, ephemeral, and replaceable. StatefulSets break that model intentionally and provide three guarantees that distributed stateful systems depend on at the protocol level.
First: stable, unique network identity. Each pod in a StatefulSet receives a deterministic name based on its ordinal index — web-0, web-1, web-2. This name is not a random hash suffix. It survives pod restarts, node failures, and rescheduling. Combined with a headless Service, this creates a stable DNS entry that other pods can use to discover and connect to a specific instance — which is critical for systems like ZooKeeper, etcd, and Kafka where cluster membership is a first-class concept baked into the protocol.
Second: stable, persistent storage. volumeClaimTemplates provision a dedicated PersistentVolumeClaim per pod using a naming convention tied to the pod's ordinal. When a pod is deleted and recreated, it reattaches to the same PVC. The storage follows the identity, not the node.
Third: ordered deployment and scaling. Pods are created sequentially — pod 0 must be Running and Ready before pod 1 is created. Pods are deleted in reverse order — pod N-1 before pod N-2 before pod 0. This ordering is not a soft preference — it is enforced by the StatefulSet controller at the API level.
Understanding which of these three guarantees your workload actually needs is more important than knowing the YAML syntax. A single-instance PostgreSQL database needs storage persistence but does not need the ordering guarantees. A ZooKeeper ensemble needs all three.
- Pod identity = ordinal index: web-0, web-1, web-2 — deterministic, survives restarts, rescheduling, and node failures
- DNS identity = pod-name.service-name.namespace.svc.cluster.local — per-pod A record created by the headless Service
- Storage identity = volumeClaimTemplate name + ordinal: data-web-0, data-web-1 — PVCs are not shared and are not deleted when the pod is
- Ordering guarantee: pods are created sequentially (0 before 1 before 2) and deleted in reverse (2 before 1 before 0) — enforced by the controller, not advisory
- Rolling update order: pod N-1 updates first, pod 0 updates last — preserving the stability of lower-ordinal pods that often hold more critical cluster roles
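The naming rules in the list above are mechanical enough to sketch in a few lines of Python. This is only an illustration of the convention, not controller code; the helper names `pod_name`, `pvc_name`, and `identities` are made up for this example.

```python
def pod_name(statefulset: str, ordinal: int) -> str:
    """Pod identity: <statefulset-name>-<ordinal>."""
    return f"{statefulset}-{ordinal}"


def pvc_name(template: str, statefulset: str, ordinal: int) -> str:
    """Storage identity: <volumeClaimTemplate-name>-<pod-name>."""
    return f"{template}-{pod_name(statefulset, ordinal)}"


def identities(statefulset: str, template: str, replicas: int) -> list:
    """Return (pod, pvc) pairs for ordinals 0..replicas-1."""
    return [
        (pod_name(statefulset, i), pvc_name(template, statefulset, i))
        for i in range(replicas)
    ]


for pod, pvc in identities("web", "data", 3):
    print(pod, pvc)
```

For a StatefulSet named web with a template named data, this prints the pairs web-0/data-web-0, web-1/data-web-1, and web-2/data-web-2.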
Ordered Operations: Creation, Deletion, and Rolling Updates
StatefulSet operations are strictly ordered in ways that feel unusual if you are used to Deployment behaviour. Understanding this ordering is essential for both operating StatefulSets correctly and for debugging when something gets stuck.
Pod creation is strictly sequential: the controller creates pod 0, then waits until it is Running and Ready before creating pod 1. Pod 1 must be Running and Ready before pod 2 is created. This is not just about startup — it reflects the dependency structure of many distributed systems where node 0 bootstraps the cluster and node 1 joins as a follower. If you scale a StatefulSet from 3 to 5 replicas, pods 3 and 4 are created sequentially in that order.
Pod deletion is strictly reverse-sequential: the controller deletes pod N-1, waits for it to fully terminate, then deletes pod N-2, and so on to pod 0. This preserves quorum during scale-down — a ZooKeeper ensemble being scaled from 5 to 3 nodes loses its two highest-ordinal nodes first, maintaining the 3-node quorum throughout the process rather than potentially losing the primary.
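The creation and deletion order described above can be modelled as a small planning function. This is a sketch that assumes the default ordered pod management behaviour and ignores readiness waits; `scale_plan` is a hypothetical helper, not part of any Kubernetes client.

```python
def scale_plan(current: int, desired: int) -> list:
    """Return the ordered (action, ordinal) steps for a replica change.

    Scale-up creates ordinals in ascending order; scale-down deletes
    them in descending order. Each step is assumed to finish before
    the next one starts.
    """
    if desired >= current:
        return [("create", i) for i in range(current, desired)]
    return [("delete", i) for i in range(current - 1, desired - 1, -1)]


print(scale_plan(3, 5))
print(scale_plan(5, 3))
```

Scaling from 3 to 5 yields create 3 then create 4; scaling from 5 to 3 yields delete 4 then delete 3, so the lowest ordinals are never touched.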
Rolling updates follow a special reverse-ordinal pattern that is different from what most engineers expect. The controller updates pod N-1 first, waits for it to become Ready, then proceeds to pod N-2, down to pod 0 last. Pod 0, which often holds the most critical role in distributed systems (initial voter in ZooKeeper, partition leader for critical topics in Kafka), is updated last — preserving cluster stability for as long as possible during the rollout.
The partition field in the rolling update strategy is one of the most useful and most forgotten StatefulSet features. Setting partition: 3 in a 5-pod StatefulSet means only pods with an ordinal of 3 or higher (pods 3 and 4) update when you apply a new template. Pods 0, 1, and 2 remain on the old version, and are recreated from the old version if they are deleted. This is a native canary mechanism built into the StatefulSet API: you validate pods 3 and 4 under production traffic, then lower partition to 0 to complete the rollout. The failure mode to remember: if you set partition for a canary and forget to reset it to 0, the lower-ordinal pods stay on the old version indefinitely and the cluster keeps running two versions side by side.
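A minimal sketch of how partition selects which pods a rolling update touches, and in what order. The function names are illustrative only; the rule being modelled is simply "ordinals greater than or equal to partition, highest first".

```python
def rolling_update_order(replicas: int, partition: int = 0) -> list:
    """Ordinals that receive the new template, in update order."""
    return [i for i in range(replicas - 1, -1, -1) if i >= partition]


def pods_left_on_old_version(replicas: int, partition: int = 0) -> list:
    """Ordinals that stay on the old template until partition is lowered."""
    return [i for i in range(replicas) if i < partition]


print(rolling_update_order(5, partition=3))
print(pods_left_on_old_version(5, partition=3))
print(rolling_update_order(5))
```

With five replicas and partition 3, only ordinals 4 and 3 update and ordinals 0, 1, and 2 stay behind. With partition 0 (the default), every pod updates from ordinal 4 down to ordinal 0.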
- A single failing readiness probe on any pod blocks the entire rollout — the controller will not advance to the next ordinal until the current pod passes
- terminationGracePeriodSeconds must be long enough for clean shutdown — databases need 60-120 seconds minimum for WAL flush and checkpoint, not the Kubernetes default of 30
- If you manually delete a pod that the rollout has not reached yet, the controller may recreate it from the OLD revision first and update it only when its turn comes (pods below the partition are always restored to the old revision); this is by design but can be confusing
- OnDelete strategy means pods only update when you manually delete them — useful for workloads that need a human gate between each pod update, but easy to forget and end up with a stalled fleet
- Rule: always define a readiness probe on StatefulSet pods — without one, a CrashLooping pod is considered Ready immediately after its container starts and the rollout advances to a broken state
PVC Lifecycle: Storage That Follows Identity
StatefulSets use volumeClaimTemplates to provision one PersistentVolumeClaim per pod. The naming convention is deterministic: <template-name>-<statefulset-name>-<ordinal>. For a StatefulSet named io-thecodeforge-db with a template named data, the PVCs are data-io-thecodeforge-db-0, data-io-thecodeforge-db-1, and data-io-thecodeforge-db-2. This naming is not configurable; it is generated by the StatefulSet controller.
The storage lifecycle has two properties that every engineer running StatefulSets in production must understand deeply.
First, PVCs survive pod deletion. When a pod is deleted — whether by a rolling update, a manual kubectl delete pod, a node failure, or a scale-down — its PVC is not deleted. When the pod is recreated with the same ordinal, it reattaches to the same PVC. This is the 'sticky storage' guarantee. It is what makes your PostgreSQL data directory survive node failures without data loss.
Second, PVCs are NOT deleted when the StatefulSet is deleted or scaled down. This is a deliberate safety mechanism: accidentally deleting a StatefulSet should not destroy production databases. But it means that scaling down a StatefulSet from 5 to 3 replicas leaves two orphaned PVCs (data-name-3 and data-name-4) that consume cloud storage indefinitely. At $0.08-0.15 per GB per month on most cloud providers, ten orphaned 1 TB PVCs (roughly 10,000 GB) accumulate $800-1500 per month in silent storage waste. This is one of the most common cost anomalies in Kubernetes clusters and one of the least visible.
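The cost figure above is plain multiplication, and it can help to keep the arithmetic in a small helper when auditing a cluster. This sketch assumes 1 TB is counted as 1000 GB and takes the price in whole cents to avoid floating-point rounding; `orphaned_storage_cost` is a hypothetical name.

```python
def orphaned_storage_cost(pvc_count: int, size_gb: int, cents_per_gb_month: int) -> float:
    """Monthly cost in dollars of PVCs that no pod is using."""
    return pvc_count * size_gb * cents_per_gb_month / 100


# Ten orphaned 1 TB volumes at the low and high ends of the quoted price range.
low = orphaned_storage_cost(10, 1000, 8)
high = orphaned_storage_cost(10, 1000, 15)
print(f"${low:.0f}-${high:.0f} per month")
```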
The persistentVolumeClaimRetentionPolicy field on StatefulSets, introduced as alpha in Kubernetes 1.23 and enabled by default as beta since 1.27 (stable in 1.32), allows you to configure automatic PVC deletion on scale-down (whenScaled) or StatefulSet deletion (whenDeleted). Both settings default to Retain. For most production stateful workloads, setting whenScaled to Delete is appropriate, because the PVC for a scaled-down pod can reasonably be considered ephemeral. Setting whenDeleted to Delete is more dangerous and should only be used when you have confirmed out-of-cluster backups.
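As a rough illustration, the retention policy can be built as a plain dictionary that mirrors the StatefulSet spec and serialised to JSON. The helper name `retention_policy_patch` is an assumption made for this example; the field names and the Retain/Delete values are the ones discussed above.

```python
import json

ALLOWED = {"Retain", "Delete"}


def retention_policy_patch(when_scaled: str = "Delete", when_deleted: str = "Retain") -> dict:
    """Build a StatefulSet spec fragment for PVC retention.

    The defaults here follow the recommendation in the text: clean up
    on scale-down, keep volumes if the whole StatefulSet is deleted.
    """
    if when_scaled not in ALLOWED or when_deleted not in ALLOWED:
        raise ValueError("policy values must be 'Retain' or 'Delete'")
    return {
        "spec": {
            "persistentVolumeClaimRetentionPolicy": {
                "whenScaled": when_scaled,
                "whenDeleted": when_deleted,
            }
        }
    }


print(json.dumps(retention_policy_patch(), indent=2))
```

The printed JSON has the same shape as the corresponding YAML block in a manifest, so it can be compared against what kubectl get statefulset <name> -o json reports for a live object.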
- Scaling down a StatefulSet leaves PVCs orphaned — they are not deleted automatically unless persistentVolumeClaimRetentionPolicy is configured
- Orphaned PVCs consume cloud storage at full price indefinitely — at $0.10/GB/month, a 50GB PVC costs $5/month sitting unused, and teams typically have dozens
- If the StorageClass has a volume count limit or quota, orphaned PVCs can block new pod scheduling silently
- Deleting a StatefulSet does NOT delete its PVCs by default — you must clean up manually or configure whenDeleted: Delete with confirmed backup coverage
- Rule: after every scale-down or StatefulSet deletion, verify PVC state: kubectl get pvc -l app=<name> and reconcile against expected counts
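The reconciliation step in the rule above can be sketched as a pure function over a list of PVC names, for example names copied from kubectl get pvc output. This assumes the standard naming convention and a StatefulSet whose ordinals start at 0; `orphaned_pvcs` is a hypothetical helper.

```python
import re


def orphaned_pvcs(pvc_names: list, template: str, statefulset: str, replicas: int) -> list:
    """Return PVCs of this StatefulSet whose ordinal is >= the replica count."""
    pattern = re.compile(
        rf"^{re.escape(template)}-{re.escape(statefulset)}-(\d+)$"
    )
    found = []
    for name in pvc_names:
        match = pattern.match(name)
        if match and int(match.group(1)) >= replicas:
            found.append((int(match.group(1)), name))
    # Sort by ordinal so the output is stable regardless of input order.
    return [name for _, name in sorted(found)]


names = ["data-db-0", "data-db-1", "data-db-2", "data-db-4", "data-db-3", "logs-other-7"]
print(orphaned_pvcs(names, "data", "db", replicas=3))
```

Here data-db-3 and data-db-4 are reported as orphaned for a 3-replica StatefulSet, while the unrelated logs-other-7 claim is ignored.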
Headless Services and DNS: How Pods Find Each Other
A headless Service is the DNS backbone of a StatefulSet. Without it, StatefulSet pods have stable names but no way for other pods to resolve those names to IP addresses. With it, every pod in the StatefulSet gets an individual DNS A record that points directly to that pod's IP — bypassing the load-balancing layer that regular Services add.
The distinction between a regular Service and a headless Service is important to internalise. A regular Service (one with a clusterIP assigned) creates one DNS name that resolves to a virtual IP, and kube-proxy load-balances traffic from that virtual IP to any matching pod. You cannot target a specific pod by DNS with a regular Service. A headless Service (clusterIP: None) creates no virtual IP and no load balancing. Instead, CoreDNS serves one A record per pod, each pointing to that pod's actual IP address. This is how pod-to-pod targeting works.
The DNS naming convention for StatefulSet pods is: pod-name.service-name.namespace.svc.cluster.local. For a StatefulSet named db with a headless Service named db-headless in namespace prod, pod 0's DNS entry is db-0.db-headless.prod.svc.cluster.local. This entry is stable — it always points to the pod with that identity, regardless of which node it runs on. When the pod is rescheduled to a different node with a different IP, CoreDNS updates the A record to reflect the new IP. The DNS name remains constant; only the IP it resolves to changes.
The headless Service also publishes SRV records for each named port: _port-name._protocol.service-name.namespace.svc.cluster.local. An SRV lookup returns one entry per pod backing the Service, carrying both the pod's hostname and the port. This gives clients and bootstrap scripts a way to discover all current pod hostnames without knowing the replica count in advance; etcd, for example, supports DNS SRV discovery for cluster bootstrap.
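The two DNS naming patterns above can be written down as string builders. This is a sketch that assumes the default cluster domain cluster.local; the port name client is a made-up example, and the function names are illustrative.

```python
def pod_fqdn(pod: str, service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Per-pod A record name behind a headless Service."""
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


def srv_name(port_name: str, protocol: str, service: str, namespace: str,
             cluster_domain: str = "cluster.local") -> str:
    """SRV record name for a named port on the Service."""
    return f"_{port_name}._{protocol}.{service}.{namespace}.svc.{cluster_domain}"


print(pod_fqdn("db-0", "db-headless", "prod"))
print(srv_name("client", "tcp", "db-headless", "prod"))
```

The first call reproduces the db-0.db-headless.prod.svc.cluster.local example from the text; the second shows where the port name and protocol sit in an SRV query.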
One practical detail: CoreDNS caches DNS responses with a TTL. For headless Services the default TTL is 5 seconds. If a pod is rescheduled quickly, there is a brief window where other pods may try to connect to the old IP before the cache expires. Design your application's connection retry logic to tolerate this — most database connection pools handle it correctly if configured with appropriate connection timeouts.
- Regular Service: one DNS name, one virtual IP, load-balanced across pods — you cannot target a specific pod
- Headless Service: individual DNS A records per pod — you CAN target a specific pod by its stable name
- Pod DNS pattern: pod-name.headless-svc.namespace.svc.cluster.local — stable across restarts and rescheduling
- SRV records allow applications to discover all pods dynamically without knowing the replica count; etcd, for example, supports this style of DNS discovery for bootstrap
- CoreDNS TTL for headless Services defaults to 5 seconds — design connection retry logic to tolerate this brief staleness window after pod rescheduling
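One way to tolerate the staleness window is to resolve the pod's DNS name again on every retry instead of caching the first IP. The sketch below injects the resolver and the connect function so it runs anywhere; in a real client they might wrap socket.gethostbyname and a driver's connect call. All names here are hypothetical.

```python
import time


def connect_with_reresolve(host, resolve, connect, attempts=5, delay_s=1.0, sleep=time.sleep):
    """Try to connect, re-resolving the hostname before each attempt."""
    last_error = None
    for attempt in range(attempts):
        try:
            ip = resolve(host)  # fresh lookup, so a new pod IP is picked up
            return connect(ip)
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_s)
    raise ConnectionError(f"could not reach {host} after {attempts} attempts") from last_error


# Demo with fakes: the first lookup returns the stale IP, the second the new one.
answers = iter(["10.0.0.1", "10.0.0.2"])


def demo_connect(ip):
    if ip == "10.0.0.1":
        raise ConnectionRefusedError("old pod IP")
    return f"connected to {ip}"


print(connect_with_reresolve("db-0.db-headless.prod.svc.cluster.local",
                             lambda host: next(answers), demo_connect,
                             sleep=lambda seconds: None))
```

The design choice is that the hostname, not the IP, is the stable identity, so the retry loop treats every resolved address as disposable.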
PodDisruptionBudget: Protecting Quorum During Voluntary Disruptions
PodDisruptionBudgets (PDBs) are among the most important and most skipped Kubernetes objects for StatefulSets in production. A PDB is a policy object that limits how many pods in a set can be simultaneously unavailable during voluntary disruptions — node drains, cluster upgrades, autoscaler scale-downs, and manual evictions.
For a quorum-based system, the requirement is concrete: a 3-node ZooKeeper cluster needs at least 2 nodes to maintain quorum. A 5-node etcd cluster needs at least 3 nodes. A PDB with minAvailable: 2 tells Kubernetes — during any voluntary disruption, you must ensure at least 2 of my pods are running and Ready before proceeding with the eviction. If a node drain would cause a second pod to become unavailable while the first is still evicting, the drain blocks until the first pod is rescheduled and Ready elsewhere.
This blocking behaviour is the point. Without a PDB, kubectl drain proceeds freely, evicting pods without regard for quorum. A node drain during a cluster upgrade with three StatefulSet pods all on the same node (a common situation if pod anti-affinity is not configured) will evict all three simultaneously, destroying quorum completely.
The combination of PDB plus pod anti-affinity is the production-grade pattern. The PDB protects against voluntary disruptions. Pod anti-affinity with requiredDuringSchedulingIgnoredDuringExecution and topologyKey: kubernetes.io/hostname protects against involuntary disruptions by ensuring no two pods land on the same node. Together, they mean: node drains cannot break quorum, and a single node failure cannot take out more than one pod.
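The PDB plus anti-affinity pattern can be sketched as two dictionaries that mirror the manifest structure. The app label and the -pdb name suffix are assumptions made for this example; the affinity fragment is meant to sit under spec.template.spec.affinity of the StatefulSet.

```python
import json


def quorum_protection(app: str, min_available: int):
    """Return (pdb_manifest, affinity_fragment) for pods labelled app=<app>."""
    labels = {"app": app}
    pdb = {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": labels},
        },
    }
    affinity = {
        "podAntiAffinity": {
            # Hard rule: never schedule two matching pods on the same node.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": labels},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
    return pdb, affinity


pdb, affinity = quorum_protection("zookeeper", min_available=2)
print(json.dumps(pdb, indent=2))
print(json.dumps(affinity, indent=2))
```

Both objects select pods by the same label, which is what ties the eviction limit and the scheduling rule to the same set of pods.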
One operational detail: kubectl drain respects PDBs. kubectl delete pod does not. If your operational runbook uses kubectl delete pod to perform maintenance, PDBs provide no protection. Use kubectl drain or Kubernetes Eviction API calls for any maintenance that should respect disruption budgets.
- PDBs only apply to voluntary disruptions: node drains initiated by kubectl drain, cluster upgrades, autoscaler scale-downs, and Kubernetes Eviction API calls
- Node crashes, kernel panics, OOM kills, and hardware failures are involuntary — PDBs do not block or delay them in any way
- For protection against involuntary disruptions, use pod anti-affinity with topologyKey: topology.kubernetes.io/zone to spread pods across availability zones
- A PDB with minAvailable: 1 on a 3-pod StatefulSet allows two simultaneous evictions; that is two of the three members gone, which breaks quorum for any majority-voting system
- Rule: set minAvailable to exactly your quorum threshold (floor(n/2) + 1), not to 1 unless you have explicitly accepted the quorum implications
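The quorum rule above is easy to check numerically. This sketch assumes every replica is currently healthy, so the number of evictions a PDB permits is simply replicas minus minAvailable; the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Majority threshold: floor(n/2) + 1."""
    return members // 2 + 1


def allowed_evictions(replicas: int, min_available: int) -> int:
    """Simultaneous voluntary evictions a PDB permits when all pods are healthy."""
    return max(0, replicas - min_available)


def pdb_keeps_quorum(replicas: int, min_available: int) -> bool:
    """True if the worst case the PDB allows still leaves a majority."""
    return replicas - allowed_evictions(replicas, min_available) >= quorum(replicas)


print(quorum(3), quorum(5))
print(allowed_evictions(3, 1), pdb_keeps_quorum(3, 1))
print(allowed_evictions(3, quorum(3)), pdb_keeps_quorum(3, quorum(3)))
```

For three replicas, minAvailable: 1 permits two evictions and fails the quorum check, while minAvailable: 2 permits one eviction and passes it.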
The Kafka Split-Brain: How a Deployment Killed 14 Hours of Messages
- Never run stateful distributed systems — Kafka, ZooKeeper, Elasticsearch, etcd, Redis Cluster — as Deployments; pod identity loss is catastrophic at the protocol level and the failure mode is silent until it is not
- PodDisruptionBudget is mandatory for StatefulSets in production — without one, a node drain can evict every pod simultaneously and break quorum with no warning
- Node drains respect PodDisruptionBudgets, but kubectl delete pod does not — operational runbooks must distinguish between the two
- Test failure modes in staging by simulating node drains before they happen in production: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data and verify quorum is maintained throughout
Common mistakes to avoid
- Running stateful workloads (Kafka, ZooKeeper, Elasticsearch, Redis Cluster) as Deployments
- Not setting a PodDisruptionBudget on production StatefulSets
- Using a regular ClusterIP Service instead of a headless Service for StatefulSet DNS
- Forgetting to clean up orphaned PVCs after scaling down or deleting a StatefulSet
- Not setting terminationGracePeriodSeconds high enough for database pods
- Setting partition for a canary update and forgetting to reset it to 0
Interview Questions on This Topic
What are the three guarantees that a StatefulSet provides that a Deployment does not?