Senior 9 min · March 06, 2026

Kubernetes StatefulSets — Kafka's 14-Hour Split-Brain

Kafka split-brain from Deployment: 14 hours lost.

Naren · Founder
Quick Answer
  • StatefulSets provide stable network identity, persistent storage, and ordered deployment/scaling for stateful workloads
  • Each pod gets a deterministic name (ordinal index) and a dedicated PVC that survives rescheduling
  • A headless Service (clusterIP: None) creates DNS A records per pod: pod-name.service-name.namespace.svc.cluster.local
  • Ordered operations: pods are created sequentially (0, 1, 2...) and deleted in reverse order, not in parallel like Deployments
  • Rolling updates follow reverse ordinal order (highest to lowest) — pod N-1 updates first, pod 0 last
  • The biggest mistake: running databases as Deployments — pod identity loss causes split-brain, data corruption, or cluster rejoin failures
Plain-English First

Imagine a hotel where every guest always gets the same room number, the same locker, and is always checked in and out in the exact same order. That is a StatefulSet. Unlike a regular Deployment — where pods are interchangeable guests who can sleep in any available room — a StatefulSet guarantees each pod has a permanent name, its own private storage that follows it everywhere, and a predictable position in line. Think of it as the difference between a row of numbered safety deposit boxes (StatefulSet) versus a pile of shopping carts where you grab whichever one is closest (Deployment). The box number matters. The cart number does not.

Stateless apps are easy. You spin up ten identical pods, kill any three, Kubernetes replaces them and nobody cares which replacement is which. But the real world is full of systems that refuse to be stateless: databases, message brokers, distributed caches, search engines. These systems have opinions about identity. Elasticsearch node 2 needs to rejoin the cluster as Elasticsearch node 2 — not as some random newcomer that triggers a full shard rebalance. Kafka broker 0 owns specific partitions and cannot pretend to be a fresh broker without corrupting the consumer offset records for every topic it leads. Ignoring this reality and running stateful workloads as Deployments is one of the most expensive mistakes teams make on Kubernetes, and it almost always surfaces at 2am during a production incident when an on-call engineer is staring at split-brain metrics they do not immediately understand.

StatefulSets exist precisely to give Kubernetes the vocabulary to reason about identity, ordering, and sticky storage. They provide three guarantees that Deployments fundamentally cannot: a stable, unique network identity that survives pod restarts and rescheduling; stable, persistent storage that follows the pod regardless of which node it lands on; and ordered, graceful deployment and scaling that respects cluster quorum requirements. These are not conveniences — they are load-bearing architectural properties that distributed consensus protocols depend on at the wire level.

By the end of this article you will understand how StatefulSets work under the hood: the controller loop, the role of the headless service in DNS, how PVC ownership is tracked via OwnerReferences, why pod ordinals matter for rolling updates, and the exact failure modes that bite teams in production. You will also have complete, runnable manifests with every significant field explained.

What is a Kubernetes StatefulSet?

A StatefulSet is a Kubernetes workload API object designed specifically for managing stateful applications. The name is deliberately chosen to contrast with the default Kubernetes mental model — stateless pods that are interchangeable, ephemeral, and replaceable. StatefulSets break that model intentionally and provide three guarantees that distributed stateful systems depend on at the protocol level.

First: stable, unique network identity. Each pod in a StatefulSet receives a deterministic name based on its ordinal index — web-0, web-1, web-2. This name is not a random hash suffix. It survives pod restarts, node failures, and rescheduling. Combined with a headless Service, this creates a stable DNS entry that other pods can use to discover and connect to a specific instance — which is critical for systems like ZooKeeper, etcd, and Kafka where cluster membership is a first-class concept baked into the protocol.

Second: stable, persistent storage. volumeClaimTemplates provision a dedicated PersistentVolumeClaim per pod using a naming convention tied to the pod's ordinal. When a pod is deleted and recreated, it reattaches to the same PVC. The storage follows the identity, not the node.

Third: ordered deployment and scaling. Pods are created sequentially — pod 0 must be Running and Ready before pod 1 is created. Pods are deleted in reverse order — pod N-1 before pod N-2 before pod 0. This ordering is not a soft preference — it is enforced by the StatefulSet controller at the API level.

Understanding which of these three guarantees your workload actually needs is more important than knowing the YAML syntax. A single-instance PostgreSQL database needs storage persistence but does not need the ordering guarantees. A ZooKeeper ensemble needs all three.

io_thecodeforge/statefulset-basic.yaml (YAML)
# Create the headless Service BEFORE the StatefulSet.
# Kubernetes does not check that the serviceName reference exists; creating it is your responsibility.
# Without the headless Service, per-pod DNS records are never created.
apiVersion: v1
kind: Service
metadata:
  name: io-thecodeforge-headless
  labels:
    app: io-thecodeforge-db
spec:
  clusterIP: None  # This is what makes it headless — no virtual IP, direct pod DNS records
  selector:
    app: io-thecodeforge-db  # Must match the StatefulSet pod labels exactly
  ports:
    - port: 5432
      name: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
spec:
  serviceName: io-thecodeforge-headless  # References the headless Service above
  replicas: 3
  selector:
    matchLabels:
      app: io-thecodeforge-db
  template:
    metadata:
      labels:
        app: io-thecodeforge-db
    spec:
      terminationGracePeriodSeconds: 120  # Give PostgreSQL time to flush WAL before SIGKILL
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: io-thecodeforge-db-secret
                  key: password
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  # volumeClaimTemplates creates one PVC per pod:
  # data-io-thecodeforge-db-0, data-io-thecodeforge-db-1, data-io-thecodeforge-db-2
  # These PVCs survive pod deletion and are reattached when pods are recreated.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]  # One node at a time — appropriate for databases
        storageClassName: io-thecodeforge-ssd
        resources:
          requests:
            storage: 50Gi
Output
service/io-thecodeforge-headless created
statefulset.apps/io-thecodeforge-db created
# Pods created sequentially:
# io-thecodeforge-db-0 Running
# io-thecodeforge-db-1 Running (created after -0 passed readiness)
# io-thecodeforge-db-2 Running (created after -1 passed readiness)
The StatefulSet Identity Model
  • Pod identity = ordinal index: web-0, web-1, web-2 — deterministic, survives restarts, rescheduling, and node failures
  • DNS identity = pod-name.service-name.namespace.svc.cluster.local — per-pod A record created by the headless Service
  • Storage identity = volumeClaimTemplate name + ordinal: data-web-0, data-web-1 — PVCs are not shared and are not deleted when the pod is
  • Ordering guarantee: pods are created sequentially (0 before 1 before 2) and deleted in reverse (2 before 1 before 0) — enforced by the controller, not advisory
  • Rolling update order: pod N-1 updates first, pod 0 updates last — preserving the stability of lower-ordinal pods that often hold more critical cluster roles
Production Insight
The serviceName field must reference a headless Service. A regular ClusterIP Service will not create per-pod DNS records, and Kubernetes does not reject a StatefulSet whose serviceName points at a missing Service, so the mistake goes unnoticed at apply time.
Pods without a functioning headless Service have stable names but no resolvable DNS addresses, so peer discovery fails quietly: pods simply cannot find each other by name.
Rule: always create the headless Service before the StatefulSet, and verify DNS resolution from inside the cluster immediately after creation.
Key Takeaway
StatefulSets give pods three things Deployments fundamentally cannot: a name that survives restarts, storage that follows that name, and a creation order that respects distributed system dependencies.
These guarantees are not API conveniences — they are load-bearing properties that consensus protocols like Raft and ZAB depend on to reason about cluster membership correctly.
Rule: if your system cares about which pod it is talking to, you need a StatefulSet.
StatefulSet vs Deployment Decision Tree
If: Application is stateless, meaning any pod can handle any request with no awareness of which pod it is
Use: Deployment. Simpler, parallel scaling, no identity overhead, easier rolling updates
If: Application needs stable network identity for peer discovery or leader election
Use: StatefulSet. Deterministic pod names with headless Service DNS that survives restarts
If: Application needs persistent storage tied to a specific pod instance rather than a specific node
Use: StatefulSet with volumeClaimTemplates. PVCs follow pod identity across node rescheduling
If: Application requires ordered startup or shutdown to respect cluster quorum or replication dependencies
Use: StatefulSet. Sequential creation in ordinal order and reverse-order deletion are enforced at the controller level
If: Application is a single-instance database with no peer discovery requirements
Use: Consider a Deployment with a manually created PVC. StatefulSet ordering adds no value for a single replica and the simpler mental model is worth it

Ordered Operations: Creation, Deletion, and Rolling Updates

StatefulSet operations are strictly ordered in ways that feel unusual if you are used to Deployment behaviour. Understanding this ordering is essential for both operating StatefulSets correctly and for debugging when something gets stuck.

Pod creation is strictly sequential: the controller creates pod 0, then waits until it is Running and Ready before creating pod 1. Pod 1 must be Running and Ready before pod 2 is created. This is not just about startup — it reflects the dependency structure of many distributed systems where node 0 bootstraps the cluster and node 1 joins as a follower. If you scale a StatefulSet from 3 to 5 replicas, pods 3 and 4 are created sequentially in that order.

Pod deletion is strictly reverse-sequential: the controller deletes pod N-1, waits for it to fully terminate, then deletes pod N-2, and so on to pod 0. This preserves quorum during scale-down — a ZooKeeper ensemble being scaled from 5 to 3 nodes loses its two highest-ordinal nodes first, maintaining the 3-node quorum throughout the process rather than potentially losing the primary.
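
The creation and deletion order described above can be sketched as a small pure function. This is an illustrative model only, assuming the default ordered pod management behaviour; `scale_plan` is a hypothetical helper name, not part of any Kubernetes client library.

```python
def scale_plan(current: int, desired: int) -> list[tuple[str, int]]:
    """Return the ordered (action, ordinal) steps for a scale operation.

    Models the ordering only: the real controller also waits for each
    pod to be Running and Ready (or fully terminated) between steps.
    """
    if current < 0 or desired < 0:
        raise ValueError("replica counts must be non-negative")
    if desired > current:
        # Scale-up: create the lowest missing ordinal first.
        return [("create", i) for i in range(current, desired)]
    # Scale-down: delete the highest ordinal first.
    return [("delete", i) for i in range(current - 1, desired - 1, -1)]


print(scale_plan(3, 5))  # [('create', 3), ('create', 4)]
print(scale_plan(5, 3))  # [('delete', 4), ('delete', 3)]
```

Scaling from 5 to 3 removes ordinals 4 and 3 in that order, which is why the lowest ordinals are the longest-lived members of the set.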

Rolling updates follow a special reverse-ordinal pattern that is different from what most engineers expect. The controller updates pod N-1 first, waits for it to become Ready, then proceeds to pod N-2, down to pod 0 last. Pod 0, which often holds the most critical role in distributed systems (initial voter in ZooKeeper, partition leader for critical topics in Kafka), is updated last — preserving cluster stability for as long as possible during the rollout.

The partition field in the rolling update strategy is one of the most useful and most forgotten StatefulSet features. Setting partition: 3 in a 5-pod StatefulSet means only pods 3 and 4 update when you apply a new template. Pods 0, 1, and 2 remain on the old version. This is a native canary mechanism built into the StatefulSet API: you validate pods 3 and 4 under production traffic, then lower partition to 0 to complete the rollout. The failure mode to remember: if you set partition for a canary and forget to reset it to 0, your cluster keeps running in a split-version state indefinitely.
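
To make the partition rule concrete, here is a minimal sketch of which ordinals receive a new template and in what order. The function names are hypothetical and the model ignores readiness gating; it only captures the "ordinal >= partition, highest first" rule stated above.

```python
def rolling_update_order(replicas: int, partition: int = 0) -> list[int]:
    """Ordinals that are updated, in the order the controller visits them."""
    if replicas < 0 or partition < 0:
        raise ValueError("replicas and partition must be non-negative")
    # Highest ordinal first; anything below the partition is skipped.
    return [i for i in range(replicas - 1, -1, -1) if i >= partition]


def pinned_ordinals(replicas: int, partition: int = 0) -> list[int]:
    """Ordinals that stay on the old revision because they are below partition."""
    return [i for i in range(replicas) if i < partition]


print(rolling_update_order(5, partition=3))  # [4, 3]
print(pinned_ordinals(5, partition=3))       # [0, 1, 2]
print(rolling_update_order(5))               # [4, 3, 2, 1, 0]
```

A partition greater than or equal to the replica count yields an empty update list, which is one way a forgotten partition value makes a rollout appear to do nothing.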

io_thecodeforge/statefulset-update.yaml (YAML)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
spec:
  serviceName: io-thecodeforge-headless
  replicas: 5
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Only pods with ordinal >= 3 will be updated when the template changes.
      # Pods 0, 1, and 2 remain on the current version.
      # This is the canary pattern: validate pods 3 and 4 first.
      # To complete the rollout: kubectl patch statefulset io-thecodeforge-db
      #   --type=merge -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
      partition: 3
  selector:
    matchLabels:
      app: io-thecodeforge-db
  template:
    metadata:
      labels:
        app: io-thecodeforge-db
    spec:
      # 90 seconds gives PostgreSQL enough time to:
      # 1. Finish in-progress transactions
      # 2. Flush dirty buffers to disk
      # 3. Write a clean checkpoint to WAL
      # 4. Cleanly release file locks
      # Without sufficient grace period, the next startup triggers crash recovery.
      terminationGracePeriodSeconds: 90
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 10
            periodSeconds: 5
            # failureThreshold * periodSeconds = how long before the rollout is blocked
            # 3 * 5 = 15 seconds of failed readiness before the controller stops advancing
            failureThreshold: 3
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: io-thecodeforge-ssd
        resources:
          requests:
            storage: 50Gi

# Monitor rollout progress:
# kubectl rollout status statefulset/io-thecodeforge-db
# kubectl get statefulset io-thecodeforge-db -o jsonpath=\
#   '{.status.currentRevision} {.status.updateRevision}'
Output
statefulset.apps/io-thecodeforge-db configured
# Rolling update sequence (partition: 3):
# io-thecodeforge-db-4 -> updated (ordinal 4, highest)
# io-thecodeforge-db-3 -> updated (ordinal 3, next)
# io-thecodeforge-db-2 -> SKIPPED (ordinal 2 < partition 3)
# io-thecodeforge-db-1 -> SKIPPED
# io-thecodeforge-db-0 -> SKIPPED
Rolling Update Gotchas That Block Rollouts
  • A single failing readiness probe on any pod blocks the entire rollout — the controller will not advance to the next ordinal until the current pod passes
  • terminationGracePeriodSeconds must be long enough for clean shutdown — databases need 60-120 seconds minimum for WAL flush and checkpoint, not the Kubernetes default of 30
  • If you manually delete a pod during a rolling update, the controller recreates it from the new template when its ordinal is at or above partition, and from the old template when its ordinal is below partition. This is by design but can be confusing
  • OnDelete strategy means pods only update when you manually delete them — useful for workloads that need a human gate between each pod update, but easy to forget and end up with a stalled fleet
  • Rule: always define a readiness probe on StatefulSet pods — without one, a CrashLooping pod is considered Ready immediately after its container starts and the rollout advances to a broken state
Production Insight
A failing readiness probe on pod N-1 blocks the rolling update indefinitely — the controller will sit waiting with no timeout and no automatic escalation.
Set terminationGracePeriodSeconds substantially higher than your database's measured shutdown time — measure it in staging under realistic load, not under an idle test instance.
Rule: test your readiness probe independently before relying on it to gate a StatefulSet rolling update — a probe that always passes immediately is worse than no probe.
Key Takeaway
StatefulSet operations have a strict and enforced ordering: create forward (0, 1, 2), delete backward (2, 1, 0), update in reverse ordinal (N-1, N-2, 0).
The partition field enables native canary deployments — update a subset, validate under production traffic, then lower partition to complete the rollout.
Rule: a single failing readiness probe blocks the entire StatefulSet rollout indefinitely — probe design and terminationGracePeriodSeconds are not afterthoughts, they are critical to rollout reliability.
Update Strategy Selection
If: Standard rolling update where all pods should update automatically
Use: RollingUpdate strategy with partition: 0. Pods update in reverse ordinal order, and the controller waits for each one to become Ready before moving on
If: Need to validate a subset of pods under production traffic before full rollout
Use: Set partition to the ordinal boundary. Pods with ordinal >= partition update while lower-ordinal pods stay on the old version
If: Need human approval between each pod update, for change control or a staged rollout with validation steps
Use: OnDelete strategy. Pods only update when you manually delete them, giving full control over pace and sequencing
If: Need to pause a rollout mid-way after discovering an issue in the first updated pods
Use: Set partition to the ordinal of the last successfully updated pod. Pods below that partition stay on the known-good version while you investigate

PVC Lifecycle: Storage That Follows Identity

StatefulSets use volumeClaimTemplates to provision one PersistentVolumeClaim per pod. The naming convention is deterministic: template-name-statefulset-name-ordinal. For a StatefulSet named io-thecodeforge-db with a template named data, the PVCs are data-io-thecodeforge-db-0, data-io-thecodeforge-db-1, and data-io-thecodeforge-db-2. This naming is not configurable — it is generated by the StatefulSet controller.
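
The naming convention is simple enough to express in a few lines. The sketch below reproduces the template-name-statefulset-name-ordinal pattern from this section; `pvc_names` is a hypothetical helper used only for illustration.

```python
def pvc_names(template: str, statefulset: str, replicas: int) -> list[str]:
    """Expected PVC names for a StatefulSet with one volumeClaimTemplate."""
    return [f"{template}-{statefulset}-{ordinal}" for ordinal in range(replicas)]


for name in pvc_names("data", "io-thecodeforge-db", 3):
    print(name)
# data-io-thecodeforge-db-0
# data-io-thecodeforge-db-1
# data-io-thecodeforge-db-2
```

Because the names are fully determined by three inputs, the expected set of PVCs can be computed offline and compared against what the cluster actually holds.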

The storage lifecycle has two properties that every engineer running StatefulSets in production must understand deeply.

First, PVCs survive pod deletion. When a pod is deleted — whether by a rolling update, a manual kubectl delete pod, a node failure, or a scale-down — its PVC is not deleted. When the pod is recreated with the same ordinal, it reattaches to the same PVC. This is the 'sticky storage' guarantee. It is what makes your PostgreSQL data directory survive node failures without data loss.

Second, PVCs are NOT deleted when the StatefulSet is deleted or scaled down. This is a deliberate safety mechanism: accidentally deleting a StatefulSet should not destroy production databases. But it means that scaling down a StatefulSet from 5 to 3 replicas leaves two orphaned PVCs (data-name-3 and data-name-4) that consume cloud storage indefinitely. At $0.08-0.15 per GB per month on most cloud providers, ten orphaned 1 TB PVCs (roughly 10,000 GB in total) accumulate $800-1500 per month in silent storage waste. This is one of the most common cost anomalies in Kubernetes clusters and one of the least visible.

The persistentVolumeClaimRetentionPolicy field on StatefulSets, enabled by default as a beta feature from Kubernetes 1.27 and stable since 1.32, allows you to configure automatic PVC deletion on scale-down (whenScaled) or StatefulSet deletion (whenDeleted). For most production stateful workloads, setting whenScaled to Delete is appropriate, because the PVC for a scaled-down pod can reasonably be considered ephemeral. Setting whenDeleted to Delete is more dangerous and should only be used when you have confirmed out-of-cluster backups.

io_thecodeforge/pvc-lifecycle.yaml (YAML)
# persistentVolumeClaimRetentionPolicy is enabled by default from Kubernetes 1.27.
# It controls what happens to PVCs when the StatefulSet is scaled down or deleted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
spec:
  serviceName: io-thecodeforge-headless
  replicas: 3
  # Enabled by default (beta) from Kubernetes 1.27; stable in 1.32
  persistentVolumeClaimRetentionPolicy:
    # whenScaled: what happens to PVCs when replicas is reduced
    #   Delete  - PVC is deleted when its pod is scaled away (use with confirmed backups)
    #   Retain  - PVC survives scale-down (default, safe but causes orphaned volume drift)
    whenScaled: Retain
    # whenDeleted: what happens to PVCs when the StatefulSet is deleted
    #   Delete  - PVCs are deleted with the StatefulSet (DANGEROUS — only with backups)
    #   Retain  - PVCs survive StatefulSet deletion (default, safe)
    whenDeleted: Retain
  selector:
    matchLabels:
      app: io-thecodeforge-db
  template:
    metadata:
      labels:
        app: io-thecodeforge-db
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: io-thecodeforge-ssd
        resources:
          requests:
            storage: 50Gi

---
# Manual PVC lifecycle management for clusters on Kubernetes < 1.27:

# After scaling down from 5 to 3 replicas, list orphaned PVCs:
# kubectl get pvc -l app=io-thecodeforge-db
# NAME                           STATUS  CAPACITY
# data-io-thecodeforge-db-0      Bound   50Gi   <- still in use
# data-io-thecodeforge-db-1      Bound   50Gi   <- still in use
# data-io-thecodeforge-db-2      Bound   50Gi   <- still in use
# data-io-thecodeforge-db-3      Bound   50Gi   <- ORPHANED (pod deleted)
# data-io-thecodeforge-db-4      Bound   50Gi   <- ORPHANED (pod deleted)

# Clean up orphaned PVCs after confirming data is not needed:
# kubectl delete pvc data-io-thecodeforge-db-3 data-io-thecodeforge-db-4

# WARNING: PVC deletion is IRREVERSIBLE
# The underlying PersistentVolume and its data are gone permanently.
# Always confirm you have a recent backup before deleting any database PVC.
Output
statefulset.apps/io-thecodeforge-db configured
# PVC status after scaling from 5 to 3 replicas (Kubernetes < 1.27 or whenScaled: Retain):
NAME                        STATUS   VOLUME       CAPACITY   STORAGECLASS
data-io-thecodeforge-db-0   Bound    pvc-abc123   50Gi       io-thecodeforge-ssd
data-io-thecodeforge-db-1   Bound    pvc-def456   50Gi       io-thecodeforge-ssd
data-io-thecodeforge-db-2   Bound    pvc-ghi789   50Gi       io-thecodeforge-ssd
data-io-thecodeforge-db-3   Bound    pvc-jkl012   50Gi       io-thecodeforge-ssd   # orphaned
data-io-thecodeforge-db-4   Bound    pvc-mno345   50Gi       io-thecodeforge-ssd   # orphaned
PVC Orphan Leak — The Silent Storage Cost Nobody Monitors
  • Scaling down a StatefulSet leaves PVCs orphaned — they are not deleted automatically unless persistentVolumeClaimRetentionPolicy is configured
  • Orphaned PVCs consume cloud storage at full price indefinitely — at $0.10/GB/month, a 50GB PVC costs $5/month sitting unused, and teams typically have dozens
  • If the StorageClass has a volume count limit or quota, orphaned PVCs can block new pod scheduling silently
  • Deleting a StatefulSet does NOT delete its PVCs by default — you must clean up manually or configure whenDeleted: Delete with confirmed backup coverage
  • Rule: after every scale-down or StatefulSet deletion, verify PVC state: kubectl get pvc -l app=<name> and reconcile against expected counts
Production Insight
Orphaned PVCs from scale-downs accumulate silently for months in environments without explicit PVC lifecycle management — the first signal is usually a cloud cost anomaly review or a scheduling failure when storage quotas are hit.
PVC deletion is effectively irreversible: with the Delete reclaim policy that dynamically provisioned volumes use by default, the underlying PersistentVolume and every byte of data it contains are gone permanently. Never delete a database PVC without confirming a recent backup.
Rule: use persistentVolumeClaimRetentionPolicy in Kubernetes 1.27+ for automatic cleanup, or automate manual cleanup with a CronJob that detects PVCs whose owning pod ordinal no longer exists.
Key Takeaway
StatefulSet PVCs survive pod deletion — this is the 'sticky storage' guarantee that makes databases viable on Kubernetes.
Scaling down orphans PVCs; deleting the StatefulSet orphans them too — neither triggers automatic cleanup by default.
Rule: treat PVC lifecycle as a first-class operational concern with explicit monitoring and cleanup automation — orphaned volumes are a slow financial bleed that compounds over time.
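
The core check inside any orphan-detection automation is a comparison between PVC ordinals and the current replica count. The following is a minimal sketch of that check, assuming the PVC names have already been fetched (for example from kubectl get pvc output); `orphaned_pvcs` is a hypothetical helper, and it deliberately only reports candidates rather than deleting anything.

```python
import re


def orphaned_pvcs(
    pvc_names: list[str], template: str, statefulset: str, replicas: int
) -> list[str]:
    """Return PVCs whose ordinal is >= the current replica count.

    Names that do not match the <template>-<statefulset>-<ordinal>
    pattern are ignored, so unrelated PVCs are never reported.
    """
    pattern = re.compile(
        rf"^{re.escape(template)}-{re.escape(statefulset)}-(\d+)$"
    )
    found = []
    for name in pvc_names:
        match = pattern.match(name)
        if match and int(match.group(1)) >= replicas:
            found.append((int(match.group(1)), name))
    # Sort numerically by ordinal so db-10 does not sort before db-3.
    return [name for _, name in sorted(found)]


existing = [f"data-io-thecodeforge-db-{i}" for i in range(5)] + ["unrelated-pvc"]
print(orphaned_pvcs(existing, "data", "io-thecodeforge-db", replicas=3))
# ['data-io-thecodeforge-db-3', 'data-io-thecodeforge-db-4']
```

Reporting instead of deleting is a deliberate choice here: a human or a backup check should confirm the data is disposable before any PVC is removed.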

Headless Services and DNS: How Pods Find Each Other

A headless Service is the DNS backbone of a StatefulSet. Without it, StatefulSet pods have stable names but no way for other pods to resolve those names to IP addresses. With it, every pod in the StatefulSet gets an individual DNS A record that points directly to that pod's IP — bypassing the load-balancing layer that regular Services add.

The distinction between a regular Service and a headless Service is important to internalise. A regular Service (one with an allocated cluster IP) creates one DNS name that resolves to a virtual IP, and kube-proxy load-balances traffic from that virtual IP to any matching pod. You cannot target a specific pod by DNS with a regular Service. A headless Service (clusterIP: None) creates no virtual IP and no load balancing. Instead, CoreDNS creates one A record per pod, each pointing to that pod's actual IP address. This is how pod-to-pod targeting works.

The DNS naming convention for StatefulSet pods is: pod-name.service-name.namespace.svc.cluster.local. For a StatefulSet named db with a headless Service named db-headless in namespace prod, pod 0's DNS entry is db-0.db-headless.prod.svc.cluster.local. This entry is stable — it always points to the pod with that identity, regardless of which node it runs on. When the pod is rescheduled to a different node with a different IP, CoreDNS updates the A record to reflect the new IP. The DNS name remains constant; only the IP it resolves to changes.
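
The naming pattern can be captured in a small helper, which is handy when generating peer lists for configuration files. This is a sketch under the assumption that the cluster domain is the common default cluster.local; `pod_fqdn` and `peer_fqdns` are hypothetical names.

```python
def pod_fqdn(
    statefulset: str,
    ordinal: int,
    service: str,
    namespace: str,
    cluster_domain: str = "cluster.local",  # assumed default; clusters may differ
) -> str:
    """Stable DNS name of one StatefulSet pod behind a headless Service."""
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"


def peer_fqdns(statefulset: str, replicas: int, service: str, namespace: str) -> list[str]:
    """DNS names of every pod in the set, in ordinal order."""
    return [pod_fqdn(statefulset, i, service, namespace) for i in range(replicas)]


print(pod_fqdn("db", 0, "db-headless", "prod"))
# db-0.db-headless.prod.svc.cluster.local
```

Since the names depend only on static inputs, a peer list built this way stays valid across pod restarts and rescheduling.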

The headless Service also creates SRV records for its named ports: _port-name._protocol.service-name.namespace.svc.cluster.local. SRV records carry both the hostname and port of each backing pod, so a client or bootstrap script can query a single name to discover all current pod hostnames without knowing the replica count in advance. etcd, for example, supports SRV-based discovery natively, while Kafka and ZooKeeper deployments more commonly list the per-pod DNS names explicitly in their configuration.

One practical detail: CoreDNS caches DNS responses with a TTL. For headless Services the default TTL is 5 seconds. If a pod is rescheduled quickly, there is a brief window where other pods may try to connect to the old IP before the cache expires. Design your application's connection retry logic to tolerate this — most database connection pools handle it correctly if configured with appropriate connection timeouts.

io_thecodeforge/headless-service.yaml (YAML)
# The headless Service — created before the StatefulSet, referenced by serviceName field.
apiVersion: v1
kind: Service
metadata:
  name: io-thecodeforge-headless
  namespace: prod
  labels:
    app: io-thecodeforge-db
spec:
  clusterIP: None  # No virtual IP — each pod gets its own DNS A record
  publishNotReadyAddresses: false  # Only Ready pods get DNS records (default: false)
  # Set to true for stateful systems that need to discover all members
  # including those still initializing — useful for ZooKeeper ensemble bootstrap
  selector:
    app: io-thecodeforge-db  # Must match StatefulSet pod labels EXACTLY
  ports:
    - port: 5432
      targetPort: 5432
      name: postgres

# --- DNS Records Created by CoreDNS ---
#
# Per-pod A records (stable, direct — the primary use case):
#   io-thecodeforge-db-0.io-thecodeforge-headless.prod.svc.cluster.local -> 10.244.1.5
#   io-thecodeforge-db-1.io-thecodeforge-headless.prod.svc.cluster.local -> 10.244.2.8
#   io-thecodeforge-db-2.io-thecodeforge-headless.prod.svc.cluster.local -> 10.244.3.12
#
# SRV records (used for automatic cluster member discovery):
#   _postgres._tcp.io-thecodeforge-headless.prod.svc.cluster.local
#   -> SRV: 0 50 5432 io-thecodeforge-db-0.io-thecodeforge-headless.prod.svc.cluster.local
#   -> SRV: 0 50 5432 io-thecodeforge-db-1.io-thecodeforge-headless.prod.svc.cluster.local
#   -> SRV: 0 50 5432 io-thecodeforge-db-2.io-thecodeforge-headless.prod.svc.cluster.local
#
# Service-level A record (returns all pod IPs — round-robin, not load-balanced):
#   io-thecodeforge-headless.prod.svc.cluster.local -> 10.244.1.5, 10.244.2.8, 10.244.3.12

# --- Verify DNS resolution from inside the cluster ---
# kubectl exec -n prod io-thecodeforge-db-1 -- \
#   nslookup io-thecodeforge-db-0.io-thecodeforge-headless.prod.svc.cluster.local
# Expected output: Address: 10.244.1.5
Output
service/io-thecodeforge-headless created
# DNS verification from inside the cluster:
# Server: 10.96.0.10
# Address: 10.96.0.10#53
# Name: io-thecodeforge-db-0.io-thecodeforge-headless.prod.svc.cluster.local
# Address: 10.244.1.5
DNS Resolution for StatefulSets
  • Regular Service: one DNS name, one virtual IP, load-balanced across pods — you cannot target a specific pod
  • Headless Service: individual DNS A records per pod — you CAN target a specific pod by its stable name
  • Pod DNS pattern: pod-name.headless-svc.namespace.svc.cluster.local — stable across restarts and rescheduling
  • SRV records allow applications to discover all pods dynamically without knowing the replica count. etcd supports this natively, while ZooKeeper and Kafka are usually configured with explicit per-pod DNS names
  • CoreDNS TTL for headless Services defaults to 5 seconds — design connection retry logic to tolerate this brief staleness window after pod rescheduling
Production Insight
The headless Service selector must match the StatefulSet pod labels exactly — a label mismatch creates a headless Service with no endpoints, which means DNS A records for zero pods, which means peer discovery silently fails.
CoreDNS cache means there is a 5-second window after pod rescheduling where DNS resolves to the old IP — application connection retry logic must handle this rather than assuming immediate DNS convergence.
Rule: test DNS resolution with nslookup from inside the cluster immediately after StatefulSet creation, and again after simulating a pod reschedule in staging.
Key Takeaway
A headless Service is not optional for StatefulSets that need peer-to-peer communication — it is the mechanism that creates per-pod DNS identity.
Without a headless Service, pods have stable names but no resolvable DNS addresses — peer discovery fails silently with NXDOMAIN errors that are easy to misattribute.
Rule: if your StatefulSet pods need to discover and communicate with each other by name, the headless Service must exist before the StatefulSet and its selector must be verified.

PodDisruptionBudget: Protecting Quorum During Voluntary Disruptions

PodDisruptionBudgets (PDBs) are among the most important and most skipped Kubernetes objects for StatefulSets in production. A PDB is a policy object that limits how many pods in a set can be simultaneously unavailable during voluntary disruptions — node drains, cluster upgrades, autoscaler scale-downs, and manual evictions.

For a quorum-based system, the requirement is concrete: a 3-node ZooKeeper cluster needs at least 2 nodes to maintain quorum. A 5-node etcd cluster needs at least 3 nodes. A PDB with minAvailable: 2 tells Kubernetes — during any voluntary disruption, you must ensure at least 2 of my pods are running and Ready before proceeding with the eviction. If a node drain would cause a second pod to become unavailable while the first is still evicting, the drain blocks until the first pod is rescheduled and Ready elsewhere.

This blocking behaviour is the point. Without a PDB, kubectl drain proceeds freely, evicting pods without regard for quorum. A node drain during a cluster upgrade with three StatefulSet pods all on the same node (a common situation if pod anti-affinity is not configured) will evict all three simultaneously, destroying quorum completely.

The combination of PDB plus pod anti-affinity is the production-grade pattern. The PDB protects against voluntary disruptions. Pod anti-affinity with requiredDuringSchedulingIgnoredDuringExecution and topologyKey: kubernetes.io/hostname protects against involuntary disruptions by ensuring no two pods land on the same node; switching the topologyKey to topology.kubernetes.io/zone, as the manifest below does, extends the same guarantee to whole availability zones. Together, they mean: node drains cannot break quorum, and a single node failure cannot take out more than one pod.

One operational detail: kubectl drain respects PDBs. kubectl delete pod does not. If your operational runbook uses kubectl delete pod to perform maintenance, PDBs provide no protection. Use kubectl drain or the Kubernetes Eviction API for any maintenance that should respect disruption budgets.

io_thecodeforge/pdb-and-affinity.yaml (YAML)
# PodDisruptionBudget — protects quorum during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: io-thecodeforge-db-pdb
  namespace: prod
spec:
  # For a 3-node cluster: minAvailable: 2 (majority = quorum)
  # For a 5-node cluster: minAvailable: 3 (majority = quorum)
  # For a 7-node cluster: minAvailable: 4 (majority = quorum)
  # Rule: minAvailable = floor(replicas / 2) + 1
  minAvailable: 2
  selector:
    matchLabels:
      app: io-thecodeforge-db

# Alternative using maxUnavailable (equivalent for a 3-pod set):
# spec:
#   maxUnavailable: 1
#   selector:
#     matchLabels:
#       app: io-thecodeforge-db

---
# Pod anti-affinity — spreads pods across nodes to protect against involuntary failures
# PDB protects against voluntary disruptions (drain, upgrade)
# Anti-affinity protects against involuntary disruptions (node crash)
# You need BOTH for production-grade StatefulSet resilience.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
  namespace: prod
spec:
  serviceName: io-thecodeforge-headless
  replicas: 3
  selector:
    matchLabels:
      app: io-thecodeforge-db
  template:
    metadata:
      labels:
        app: io-thecodeforge-db
    spec:
      affinity:
        podAntiAffinity:
          # required = hard constraint: no two matching pods share a topology domain
          # (a node for kubernetes.io/hostname, a zone for topology.kubernetes.io/zone)
          # preferred = soft preference: Kubernetes tries but can violate if necessary
          # For quorum-sensitive systems, use required.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - io-thecodeforge-db
              # kubernetes.io/hostname = spread across nodes
              # topology.kubernetes.io/zone = spread across AZs (stronger guarantee);
              # as a hard constraint it needs at least as many zones as replicas,
              # otherwise the extra pods stay Pending
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: postgres
          image: postgres:16-alpine
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: io-thecodeforge-ssd
        resources:
          requests:
            storage: 50Gi
Output
poddisruptionbudget.policy/io-thecodeforge-db-pdb created
statefulset.apps/io-thecodeforge-db configured
# Verify PDB status:
# kubectl get pdb io-thecodeforge-db-pdb
# NAME                     MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# io-thecodeforge-db-pdb   2               N/A               1                     5m
# ALLOWED DISRUPTIONS: 1 means one pod can be evicted at a time (3 - 2 = 1)
PDB Does NOT Protect Against Node Crashes — You Need Both
  • PDBs only apply to voluntary disruptions: node drains initiated by kubectl drain, cluster upgrades, autoscaler scale-downs, and Kubernetes Eviction API calls
  • Node crashes, kernel panics, OOM kills, and hardware failures are involuntary — PDBs do not block or delay them in any way
  • For protection against involuntary disruptions, use pod anti-affinity with topologyKey: topology.kubernetes.io/zone to spread pods across availability zones
  • A PDB with minAvailable: 1 on a 3-pod StatefulSet allows two simultaneous evictions — that is two out of three nodes gone, which breaks quorum for any majority-voting system
  • Rule: set minAvailable to exactly your quorum threshold (floor(n/2) + 1), not to 1 unless you have explicitly accepted the quorum implications
Production Insight
kubectl drain respects PodDisruptionBudgets — kubectl delete pod does not. If your team's operational runbooks use kubectl delete pod for any maintenance task that requires respecting quorum, your PDB provides no protection for those operations.
A PDB with ALLOWED DISRUPTIONS showing 0 means a node drain will block at this StatefulSet until quorum is restored — this is correct behaviour but will surprise on-call engineers who expect drains to complete quickly.
Rule: set minAvailable to your quorum threshold, use requiredDuringSchedulingIgnoredDuringExecution anti-affinity across zones, and update your operational runbooks to use kubectl drain instead of kubectl delete pod.
Key Takeaway
PodDisruptionBudgets are non-negotiable for production StatefulSets — without one, a single node drain during a cluster upgrade can evict all pods simultaneously and destroy quorum.
PDBs protect against voluntary disruptions only — node crashes require pod anti-affinity spread across nodes or zones.
Rule: minAvailable = floor(replicas / 2) + 1 for any majority-quorum system, and combine with zone-level anti-affinity for complete resilience.
● Production incident · POST-MORTEM · severity: high

The Kafka Split-Brain: How a Deployment Killed 14 Hours of Messages

Symptom
After a routine node drain, Kafka consumers across multiple services began reporting OffsetOutOfRange errors within minutes. Producer requests configured with acks=all started timing out. The Kafka controller election loop entered a crash cycle — two of the three brokers simultaneously believed they were the controller and began issuing conflicting partition reassignment commands. Fourteen hours of uncommitted messages in the __consumer_offsets topic were lost. The blast radius extended to every downstream service that consumed from the cluster.
Assumption
The team assumed Kubernetes would reschedule pods one at a time, maintaining their identity throughout the drain. They expected the deployment's rolling update configuration to apply to node drain evictions. They also assumed the terminationGracePeriodSeconds was sufficient to allow orderly shutdown — it was not, because the drain evicted all three pods in parallel before any of them finished their clean shutdown sequence.
Root cause
The Kafka cluster was deployed as a Deployment, not a StatefulSet. When the node was drained, Kubernetes terminated all three pods and created three new pods with randomly generated name suffixes — kafka-7b4f9-xk2mn, kafka-7b4f9-r9pqw, kafka-7b4f9-hjk34 — instead of the stable kafka-0, kafka-1, kafka-2 that the cluster configuration expected. The new pods had no knowledge of the old broker IDs stored in ZooKeeper. ZooKeeper saw three unknown brokers registering while the ephemeral nodes for the old broker IDs had not yet expired. The cluster entered a split-brain state where partition leader metadata in ZooKeeper pointed to broker IDs that no longer existed. Consumers could not fetch from those partitions and producers could not confirm acknowledgements. There was no PodDisruptionBudget, so Kubernetes had no constraint preventing it from evicting all three pods simultaneously.
Fix
Migrated the Kafka cluster from Deployment to StatefulSet with a headless Service, giving each broker a stable identity (kafka-0, kafka-1, kafka-2) that ZooKeeper and the broker configuration could rely on. Each broker now derives its broker.id from its pod ordinal, parsed from the pod name that the downward API exposes as an environment variable. Added a PodDisruptionBudget with minAvailable: 2 to ensure at least two brokers survive any voluntary disruption. Set terminationGracePeriodSeconds: 120 and added a preStop hook that triggers a controlled leader election handoff before the broker process terminates. Added pod anti-affinity rules to spread brokers across availability zones so a single AZ failure cannot take out quorum.
Key lesson
  • Never run stateful distributed systems — Kafka, ZooKeeper, Elasticsearch, etcd, Redis Cluster — as Deployments; pod identity loss is catastrophic at the protocol level and the failure mode is silent until it is not
  • PodDisruptionBudget is mandatory for StatefulSets in production — without one, a node drain can evict every pod simultaneously and break quorum with no warning
  • Node drains respect PodDisruptionBudgets, but kubectl delete pod does not — operational runbooks must distinguish between the two
  • Test failure modes in staging by simulating node drains before they happen in production: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data and verify quorum is maintained throughout
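The identity part of the fix can be sketched as a StatefulSet fragment. This is an illustration under stated assumptions, not the team's actual manifest: the image reference, the /opt/kafka paths, the kafka-headless Service name, the prod namespace, and the POD_NAME variable are all hypothetical. The pod name (kafka-0, kafka-1, kafka-2) is exposed through the downward API, and the shell strips everything up to the last hyphen so the trailing ordinal becomes broker.id. The preStop hook, PodDisruptionBudget, anti-affinity rules, and volumeClaimTemplates from the fix are left out to keep the sketch short.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: prod                          # assumed namespace
spec:
  serviceName: kafka-headless              # hypothetical headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      terminationGracePeriodSeconds: 120
      containers:
        - name: kafka
          image: registry.example.com/kafka:placeholder   # hypothetical image
          env:
            - name: POD_NAME                # hypothetical variable name
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name  # kafka-0, kafka-1, kafka-2
          command:
            - /bin/sh
            - -c
            - |
              # Strip everything up to the last "-": kafka-2 -> 2
              BROKER_ID="${POD_NAME##*-}"
              # Paths are assumptions; adjust to the image layout
              exec /opt/kafka/bin/kafka-server-start.sh \
                /opt/kafka/config/server.properties \
                --override broker.id="${BROKER_ID}"
```

Because the StatefulSet controller recreates kafka-2 as kafka-2, a rescheduled broker comes back with the same broker.id instead of registering as an unknown member.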
Production debug guide
Symptom-driven diagnostics for Kubernetes StatefulSet issues, organised by what you see, not by what you think caused it
Symptom · 01
StatefulSet pod stuck in Pending with FailedScheduling event
Fix
Check node resources and PVC binding status: kubectl describe pod <pod-name> and look at the Events section. The two most common causes are insufficient node CPU or memory for the requested resources, and a PVC in Pending state because the StorageClass cannot provision a volume. Check StorageClass exists and its provisioner is healthy. Verify pod anti-affinity rules are not preventing scheduling by checking if enough eligible nodes exist.
Symptom · 02
StatefulSet pod stuck in Terminating and will not complete deletion
Fix
Check for finalizers on the pod and its PVC: kubectl get pvc <pvc-name> -o yaml and look for metadata.finalizers. The kubernetes.io/pvc-protection finalizer is the most common blocker — it prevents deletion while a pod is using the PVC. Check if a validating webhook or operator is intercepting deletion. If the pod is truly stuck and the data is confirmed safe, force delete with: kubectl delete pod <pod-name> --grace-period=0 --force.
Symptom · 03
Rolling update stuck — only some pods updated, others not progressing
Fix
Check the update strategy partition value: kubectl get statefulset <name> -o jsonpath='{.spec.updateStrategy.rollingUpdate.partition}'. If partition is set to a non-zero value, only pods with ordinal >= partition get updated. Verify the last-updated pod passes its readiness probe — a failing readiness probe blocks the entire rollout. Check kubectl rollout status statefulset/<name> for a human-readable status. Compare currentRevision and updateRevision to confirm the rollout is actually in progress.
Symptom · 04
Pod rescheduled to different node but PVC will not attach — pod stuck in ContainerCreating
Fix
Check PVC access mode — ReadWriteOnce volumes can only attach to one node at a time. If the old node has not released the volume attachment, the new pod cannot claim it. Check volume attachment objects: kubectl get volumeattachment and look for one referencing the PV. The node controller's eviction timeout (default 5-6 minutes) must pass before Kubernetes forcibly detaches the volume. Check kubectl describe pv <pv-name> for the claim reference and release status.
Symptom · 05
DNS resolution failing for StatefulSet pods — peers cannot reach each other by name
Fix
Verify the headless Service exists and has clusterIP: None: kubectl get svc <service-name> -o jsonpath='{.spec.clusterIP}'. Check that the Service selector matches the StatefulSet pod labels exactly — a label mismatch creates a headless Service with no endpoints. Verify endpoints are populated: kubectl get endpoints <service-name>. Test DNS resolution from inside the cluster: kubectl exec <any-pod> -- nslookup <pod-name>.<service-name>.<namespace>.svc.cluster.local.
Symptom · 06
Pod crash-looping after PVC reattachment — database process will not start
Fix
Check mount point permissions: kubectl exec <pod> -- ls -la /data. Some databases (PostgreSQL, MySQL) require the data directory to be owned by a specific UID. Check if the volume was unmounted uncleanly during a crash — some databases (PostgreSQL) enter recovery mode and replay WAL on next start, which is expected but can fail if WAL files are corrupt. Check application logs carefully for recovery errors vs startup errors — they require different responses.
★ StatefulSet Quick Debug Cheat Sheet
Rapid diagnostics for common StatefulSet production issues. These are the first commands I reach for when a StatefulSet starts misbehaving.
Pod stuck in Pending — not scheduling
Immediate action
Check scheduling failure reason and PVC binding status simultaneously — both are common causes and both produce the same Pending state
Commands
kubectl describe pod <pod-name> | grep -A15 Events
kubectl get pvc -o wide --selector app=<statefulset-label>
Fix now
Verify StorageClass exists and provisioner is healthy. Confirm node has sufficient CPU and memory for the pod's resource requests. Check pod anti-affinity rules are not creating an impossible scheduling constraint.
Rolling update not progressing past a specific ordinal
Immediate action
Check rollout status and partition configuration — partition is the most commonly forgotten setting
Commands
kubectl rollout status statefulset/<name> --timeout=30s
kubectl get statefulset <name> -o jsonpath='{.spec.updateStrategy.rollingUpdate.partition}'
Fix now
If partition is non-zero, pods below that ordinal will not update. Set partition to 0 for a full rollout. If partition is 0 and the update is still blocked, check whether the highest-ordinal updated pod passes its readiness probe — a failing probe blocks all subsequent updates.
PVC stuck in Terminating — volume not releasing
Immediate action
Check finalizers blocking deletion and volume attachment state on the node
Commands
kubectl get pvc <pvc-name> -o jsonpath='{.metadata.finalizers}'
kubectl get volumeattachment -o wide | grep <pv-name>
Fix now
If kubernetes.io/pvc-protection is listed in finalizers and the PVC is safe to delete, patch the finalizers to empty with a JSON merge patch: kubectl patch pvc <pvc-name> --type=merge -p '{"metadata":{"finalizers":[]}}'. Verify no pod is currently using the PVC before doing this.
StatefulSet pods cannot reach peers by DNS name
Immediate action
Verify headless Service configuration and CoreDNS health
Commands
kubectl get svc <headless-svc> -o jsonpath='{.spec.clusterIP}'
kubectl exec <pod> -- nslookup <pod-name>.<headless-svc>.<namespace>.svc.cluster.local
Fix now
Headless Service must show 'None' for clusterIP. If nslookup fails, verify the Service selector matches the StatefulSet pod labels exactly and that the pod is Running and Ready: by default, DNS records are only published for Ready pods unless the Service sets publishNotReadyAddresses: true.
StatefulSet vs Deployment vs DaemonSet
| Dimension | StatefulSet | Deployment | DaemonSet |
| --- | --- | --- | --- |
| Pod identity | Stable, deterministic name (ordinal index: web-0, web-1); survives restarts and node rescheduling | Random hash suffix regenerated on every reschedule; no identity continuity between pod instances | Random hash suffix, one pod per matching node; tied to node lifecycle rather than an independent identity |
| Network identity | Stable per-pod DNS via headless Service: pod-name.service-name.namespace.svc.cluster.local | Load-balanced via ClusterIP Service; cannot target a specific pod by DNS | Node-local networking, accessed via node IP; no stable cross-node DNS identity |
| Storage | Per-pod PVC via volumeClaimTemplates; PVC survives pod deletion and reattaches on recreation | Shared volumes (all pods see the same data) or ephemeral volumes (lost on pod deletion) | Host volumes for node-local data, or node-specific PVCs; storage is tied to the node, not a portable identity |
| Scaling order | Sequential: pod 0 created first, N-1 deleted first; enforced by the StatefulSet controller | Parallel: pods created or deleted simultaneously with no ordering (maxSurge and maxUnavailable bound rollouts, not scaling) | One pod per matching node; scaling is driven by node count, not a replica field |
| Rolling update | Reverse ordinal (N-1 updates first, pod 0 last); each pod must be Ready before the next updates | Parallel within maxUnavailable and maxSurge constraints; no ordering guarantees between pods | Parallel across nodes; similar to Deployment, but the one-per-node constraint limits concurrency naturally |
| Use case | Databases, message brokers (Kafka), distributed coordination (ZooKeeper, etcd), search engines (Elasticsearch) | Stateless APIs, web servers, microservices; anything where pod identity is irrelevant | Log collectors (Fluentd), monitoring agents (node-exporter), CNI plugins, security agents; node-level infrastructure |
| Pod replacement | Same name, same PVC reattachment; the replacement is the same identity in a new container | New name, new pod, no identity continuity; the replacement is a different entity from the old pod | Pod is recreated on the same node when the node recovers; tied to node lifecycle |
| Disruption budget | Critical: PDB protects quorum; without one a single drain can take the entire cluster down | Recommended: PDB protects against availability loss during updates and drains | Rarely needed: one pod per node means voluntary disruptions are node-scoped by definition |

Key takeaways

1. StatefulSets provide stable identity, sticky storage, and ordered operations: the three load-bearing guarantees that distributed stateful systems like ZooKeeper, Kafka, Elasticsearch, and etcd depend on at the protocol level, not just as conveniences.
2. A headless Service is mandatory for StatefulSets that need peer-to-peer communication: it creates the per-pod DNS A records that enable stable name resolution. Without it, pods have stable names but no way for other pods to resolve those names to IP addresses.
3. PVCs survive pod deletion and, by default, are not deleted on scale-down or StatefulSet deletion: orphaned PVCs consume cloud storage indefinitely. Treat PVC lifecycle as a first-class operational concern with explicit monitoring and cleanup automation.
4. PodDisruptionBudgets are non-negotiable for production StatefulSets: without one, a node drain can evict all pods simultaneously and destroy quorum. Set minAvailable to floor(replicas / 2) + 1, and combine with zone-level pod anti-affinity for complete resilience.
5. Rolling updates proceed in reverse ordinal order and are blocked by a failing readiness probe on any pod: probe design and terminationGracePeriodSeconds are critical to rollout reliability, not optional fields.
6. The partition field enables native canary deployments: update a subset of pods, validate under production traffic, then lower partition to 0 to complete the rollout. Forgetting to reset partition leaves the cluster in a permanently split-version state.
7. kubectl drain respects PodDisruptionBudgets; kubectl delete pod does not: operational runbooks for StatefulSet maintenance must use drain, not direct pod deletion, to get the safety guarantees you configured.
8. Never run quorum-based distributed systems as Deployments: pod identity loss is catastrophic at the protocol level and the failure mode is a split-brain or data loss incident, not a graceful degradation.

Common mistakes to avoid


Running stateful workloads (Kafka, ZooKeeper, Elasticsearch, Redis Cluster) as Deployments

Symptom
Pod rescheduling causes identity loss — the broker or database node rejoins its cluster as an unknown new member instead of its previous identity. This triggers full data resync in some systems, split-brain in others, and cluster membership corruption in quorum-based systems. In Kafka specifically, partition leaders in ZooKeeper point to broker IDs that no longer exist, causing OffsetOutOfRange errors for consumers and producer ack timeouts.
Fix
Migrate to StatefulSet with a headless Service. Derive each instance's cluster identity from its pod ordinal: for Kafka this means setting broker.id to the ordinal parsed from the pod name, which the downward API can expose as an environment variable. Use volumeClaimTemplates to ensure each broker's data directory follows its identity. Add a PodDisruptionBudget to prevent simultaneous eviction during future node maintenance.

Not setting a PodDisruptionBudget on production StatefulSets

Symptom
A node drain or cluster upgrade evicts all StatefulSet pods simultaneously because Kubernetes has no constraint preventing it. For quorum-based systems — ZooKeeper, etcd, Kafka controller — losing majority quorum makes the entire cluster unavailable. Recovery requires manual intervention: typically forcing a new leader election or restoring from backup, neither of which is quick.
Fix
Create a PDB with minAvailable set to the quorum threshold: floor(replicas / 2) + 1. For a 3-node cluster this is 2. For a 5-node cluster this is 3. Combine with pod anti-affinity using topologyKey: topology.kubernetes.io/zone to spread pods across failure domains and protect against involuntary node failures.

Using a regular ClusterIP Service instead of a headless Service for StatefulSet DNS

Symptom
Pods cannot reach each other by their stable names. DNS lookups for pod-specific names (web-0.my-service) return NXDOMAIN. Applications relying on peer discovery — Kafka broker registration, ZooKeeper ensemble formation, Elasticsearch cluster joining — fail to form a cluster. The failure is often logged as a connection timeout rather than a DNS error, which misleads diagnosis.
Fix
Create a Service with clusterIP: None. Reference it in the StatefulSet's serviceName field. Verify DNS works from inside the cluster: kubectl exec <any-pod> -- nslookup <pod-name>.<service-name>.<namespace>.svc.cluster.local. The response must show the pod's actual IP address, not a virtual IP.

Forgetting to clean up orphaned PVCs after scaling down or deleting a StatefulSet

Symptom
Cloud storage costs increase over months with no corresponding running workloads. StorageClass volume count quotas are hit unexpectedly, preventing new pod scheduling. Finance or FinOps teams flag unexpected storage line items that engineering cannot initially explain. In environments with strict storage quotas, orphaned PVCs can block a future scale-up of the same StatefulSet.
Fix
After every scale-down, run kubectl get pvc -l app=<name> and reconcile against the expected count. In Kubernetes 1.27+, configure persistentVolumeClaimRetentionPolicy: whenScaled: Delete for environments with confirmed backup coverage. For older clusters, automate cleanup with a CronJob that identifies PVCs whose pod ordinal no longer exists in the StatefulSet.
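The retention policy mentioned above can be sketched as a fragment of the StatefulSet from this article. It is a partial manifest, shown only to make the field placement concrete; the choice of Retain for whenDeleted is an assumption that favours keeping data when the whole StatefulSet is removed.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
  namespace: prod
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete    # PVCs of pods removed by a scale-down are deleted
    whenDeleted: Retain   # PVCs are kept if the StatefulSet itself is deleted (the default)
  # serviceName, replicas, selector, template, and volumeClaimTemplates
  # stay exactly as in the full manifest
```

With whenScaled: Delete, scaling from 5 to 3 replicas removes the PVCs for ordinals 3 and 4, so this setting only makes sense where backups are confirmed.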

Not setting terminationGracePeriodSeconds high enough for database pods

Symptom
Database pods are killed before completing clean shutdown. PostgreSQL does not finish writing its checkpoint. MySQL does not flush its InnoDB buffer pool. On the next startup, the database enters crash recovery mode, replaying WAL or redo logs, which can take minutes to hours depending on the volume of dirty pages. The StatefulSet rolling update is blocked for the entire recovery duration, alarming on-call engineers who see a pod that is Running but never becomes Ready.
Fix
Measure actual clean shutdown time under realistic load in staging before setting terminationGracePeriodSeconds. For most PostgreSQL configurations, 90-120 seconds is appropriate. Add a preStop lifecycle hook that sends the database a clean shutdown signal (pg_ctl stop -m fast for PostgreSQL) and waits for the process to exit, so that shutdown begins before SIGTERM is sent. The preStop hook runs inside the grace period, not before it, so terminationGracePeriodSeconds must cover the hook and the shutdown together.
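A hedged sketch of that hook for the postgres:16-alpine container used in this article follows. It assumes the hook executes as root (the image default) and therefore steps down to the postgres user, since pg_ctl refuses to run as root. It also assumes PGDATA is set by the image so no -D flag is needed. The 120 and 100 second values are placeholders to be replaced with measured numbers.

```yaml
# Fragment of the StatefulSet pod template (spec.template.spec)
terminationGracePeriodSeconds: 120   # must cover the preStop hook plus the shutdown itself
containers:
  - name: postgres
    image: postgres:16-alpine
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            # -m fast: roll back open transactions and shut down cleanly
            # -w: wait for the server to exit; -t 100: give up after 100 seconds,
            # which stays under the 120-second grace period
            - su postgres -c 'pg_ctl stop -m fast -w -t 100'
```

Keeping the pg_ctl timeout below terminationGracePeriodSeconds leaves headroom for the kubelet to finish the termination without resorting to SIGKILL.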

Setting partition for a canary update and forgetting to reset it to 0

Symptom
After a canary deployment, only a subset of pods runs the new image version. The remaining pods stay on the old version indefinitely. Different pods in the same StatefulSet exhibit different behaviour — bugs that were supposed to be fixed still appear from some pods, and new features are only available from the canary pods. The inconsistency is usually discovered days or weeks later during debugging, not at deployment time.
Fix
After validating canary pods, explicitly reset partition to 0: kubectl patch statefulset <name> --type=merge -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'. Add a CI/CD pipeline step that resets partition to 0 after successful canary validation — make it part of the deployment automation rather than a manual operational step. Monitor currentRevision versus updateRevision on the StatefulSet object to detect partial rollouts in your observability stack.
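For reference, the canary state that this mistake leaves behind looks like the fragment below. It is a partial sketch of the 3-replica StatefulSet from this article, with partition: 2 chosen as an example value.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
  namespace: prod
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Only pods with ordinal >= partition receive the new revision.
      # partition: 2 -> io-thecodeforge-db-2 is the canary;
      # io-thecodeforge-db-0 and io-thecodeforge-db-1 stay on the old revision.
      # Set back to 0 after validation so the remaining pods update.
      partition: 2
  # serviceName, selector, template, and volumeClaimTemplates are unchanged
```

Keeping the partition value in version control alongside the image tag makes a forgotten reset visible in code review rather than weeks later in production.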
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): What are the three guarantees that a StatefulSet provides that a Deploym...
Q02 (Senior): Explain the role of the headless Service in a StatefulSet. What happens ...
Q03 (Senior): A 3-node etcd cluster deployed as a StatefulSet lost quorum during a clu...
Q04 (Senior): What happens to PVCs when you scale down a StatefulSet from 5 to 3 repli...
Q05 (Senior): How does the StatefulSet rolling update work, and how is it different fr...
Q01 of 05 (Senior)

What are the three guarantees that a StatefulSet provides that a Deployment does not?

ANSWER
StatefulSets provide three guarantees that Deployments fundamentally cannot. First, stable network identity: each pod receives a deterministic name based on its ordinal index — web-0, web-1, web-2 — that survives restarts, rescheduling, and node failures. Combined with a headless Service, this creates a stable per-pod DNS record at pod-name.service-name.namespace.svc.cluster.local. Deployments assign a random hash suffix to pod names, which changes every time a pod is replaced. Second, stable persistent storage: volumeClaimTemplates provision a dedicated PVC per pod using a naming convention tied to the pod's ordinal. When a pod is deleted and recreated with the same ordinal, it reattaches to the same PVC on any node. The storage follows the identity, not the node. Third, ordered deployment and scaling: pods are created sequentially — pod 0 must be Running and Ready before pod 1 is created. Pods are deleted in reverse order — pod N-1 before pod 0. Rolling updates proceed in reverse ordinal order. Deployments create and delete pods in parallel with no ordering guarantees. These are not conveniences — distributed consensus protocols like Raft and ZAB depend on them at the wire level.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What are Kubernetes StatefulSets in simple terms?
02. When should I use a StatefulSet instead of a Deployment?
03. Can I use a StatefulSet with a single replica?
04. How do I perform a blue-green deployment with a StatefulSet?
05. What happens if a node hosting a StatefulSet pod crashes?