Advanced 9 min · March 06, 2026

Kubernetes StatefulSets — Kafka's 14-Hour Split-Brain

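Q: What is a Kubernetes StatefulSet in simple terms?

A StatefulSet is a Kubernetes workload object for applications that keep state. Each pod gets a stable name, its own persistent volume, and a fixed place in the startup and shutdown order, so a database or message broker can be restarted or rescheduled without losing its identity or its data.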

Q: When should I use a StatefulSet instead of a Deployment?

Use a StatefulSet when your application requires any of these: stable network identity (pods need to find each other by name), persistent storage tied to specific pod instances (databases, message brokers), or ordered deployment/scaling (cluster quorum, leader election). Use a Deployment for stateless applications where any pod can handle any request and pod identity is irrelevant.

Q: Can I use a StatefulSet with a single replica?

Yes, but evaluate whether you need it. A single-replica StatefulSet still provides stable identity and PVC lifecycle management. However, the ordering guarantees add no value for a single pod. If you only need a PVC attached to a pod, a Deployment with a manually created PVC may be simpler. Use a single-replica StatefulSet if you might scale up later and want consistent naming from the start.

Q: How do I perform a blue-green deployment with a StatefulSet?

StatefulSets don't natively support blue-green deployments. One approximation is to create a second StatefulSet under a different name, verify that its pods are healthy, then switch the client-facing Service selector to the new pods' labels. Because the new StatefulSet gets its own pod names and PVCs, its data must be replicated or migrated before the switch. Alternatively, use the partition field to update a subset of pods as a canary, validate, then lower the partition to complete the rollout.

Q: What happens if a node hosting a StatefulSet pod crashes?

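The pod is reported as Unknown and, once the default 5-minute toleration for an unreachable node expires, is marked Terminating. The StatefulSet controller does not create a replacement while the old pod object still exists: it cannot confirm that the old container has stopped, and it guarantees at most one pod per identity. A new pod with the same name is scheduled on another node only after the failed node rejoins and confirms the termination, the Node object is deleted, or an operator force-deletes the pod. The new pod then reattaches to the same PVC once the volume has been detached from the old node (or immediately if the storage backend supports multi-attach). During this time the pod is unavailable, and the StatefulSet never creates a replacement under a different name. This is the key difference from Deployments, which start a replacement pod with a new name as soon as the old one is marked for deletion.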

Kafka split-brain from Deployment: 14 hours lost.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Last updated: July 19, 2026

Before you start (⏱ 30 min)

✓ Production DevOps experience
✓ Deep understanding of the tool's internals
✓ Experience debugging distributed systems

⚡Quick Answer

StatefulSets provide stable network identity, persistent storage, and ordered deployment/scaling for stateful workloads
Each pod gets a deterministic name (ordinal index) and a dedicated PVC that survives rescheduling
A headless Service (clusterIP: None) creates DNS A records per pod: pod-name.service-name.namespace.svc.cluster.local
Ordered rollouts: pods are created sequentially (0, 1, 2...) and deleted in reverse order, not in parallel like Deployments
Rolling updates follow reverse ordinal order (highest to lowest) — pod N-1 updates first, pod 0 last
The biggest mistake: running databases as Deployments — pod identity loss causes split-brain, data corruption, or cluster rejoin failures

✦ Definition~90s read

What is Kubernetes StatefulSets?

By default, Kubernetes treats pods as interchangeable, ephemeral, and replaceable. StatefulSets break that model intentionally and provide three guarantees that distributed stateful systems depend on at the protocol level.

First: stable, unique network identity. Each pod in a StatefulSet receives a deterministic name based on its ordinal index — web-0, web-1, web-2. This name is not a random hash suffix. It survives pod restarts, node failures, and rescheduling. Combined with a headless Service, this creates a stable DNS entry that other pods can use to discover and connect to a specific instance — which is critical for systems like ZooKeeper, etcd, and Kafka where cluster membership is a first-class concept baked into the protocol.

Second: stable, persistent storage. volumeClaimTemplates provision a dedicated PersistentVolumeClaim per pod using a naming convention tied to the pod's ordinal. When a pod is deleted and recreated, it reattaches to the same PVC. The storage follows the identity, not the node.

Third: ordered deployment and scaling. Pods are created sequentially: pod 0 must be Running and Ready before pod 1 is created. Pods are deleted in reverse order: pod N-1 before pod N-2, down to pod 0. Under the default OrderedReady pod management policy this ordering is not a soft preference; it is enforced by the StatefulSet controller.

Understanding which of these three guarantees your workload actually needs is more important than knowing the YAML syntax. A single-instance PostgreSQL database needs storage persistence but does not need the ordering guarantees. A ZooKeeper ensemble needs all three.

Plain-English First

Imagine a hotel where every guest always gets the same room number, the same locker, and is always checked in and out in the exact same order. That is a StatefulSet. Unlike a regular Deployment — where pods are interchangeable guests who can sleep in any available room — a StatefulSet guarantees each pod has a permanent name, its own private storage that follows it everywhere, and a predictable position in line. Think of it as the difference between a row of numbered safety deposit boxes (StatefulSet) versus a pile of shopping carts where you grab whichever one is closest (Deployment). The box number matters. The cart number does not.

Stateful applications in Kubernetes demand persistent identity, stable network names, and an ordered lifecycle, none of which plain Deployments provide. Run a clustered database as a Deployment and its pods can contend for the same volume, restart in arbitrary order, and break quorum. StatefulSets enforce pod ordering and bind storage to identity, which lets you run Postgres, Cassandra, or Kafka on the same cluster as your stateless microservices with far less risk of data corruption or split-brain.

What is a Kubernetes StatefulSet?

A StatefulSet is a Kubernetes workload API object designed specifically for managing stateful applications. The name is deliberately chosen to contrast with the default Kubernetes mental model — stateless pods that are interchangeable, ephemeral, and replaceable. StatefulSets break that model intentionally and provide three guarantees that distributed stateful systems depend on at the protocol level.

io_thecodeforge/statefulset-basic.yamlYAML

# Create the headless Service before (or together with) the StatefulSet.
# The API server does not check that serviceName points to an existing Service.
# Without the headless Service the pods still start, but no per-pod DNS records exist.
apiVersion: v1
kind: Service
metadata:
  name: io-thecodeforge-headless
  labels:
    app: io-thecodeforge-db
spec:
  clusterIP: None  # This is what makes it headless — no virtual IP, direct pod DNS records
  selector:
    app: io-thecodeforge-db  # Must match the StatefulSet pod labels exactly
  ports:
    - port: 5432
      name: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
spec:
  serviceName: io-thecodeforge-headless  # References the headless Service above
  replicas: 3
  selector:
    matchLabels:
      app: io-thecodeforge-db
  template:
    metadata:
      labels:
        app: io-thecodeforge-db
    spec:
      terminationGracePeriodSeconds: 120  # Give PostgreSQL time to flush WAL before SIGKILL
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: io-thecodeforge-db-secret
                  key: password
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  # volumeClaimTemplates creates one PVC per pod:
  # data-io-thecodeforge-db-0, data-io-thecodeforge-db-1, data-io-thecodeforge-db-2
  # These PVCs survive pod deletion and are reattached when pods are recreated.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]  # One node at a time — appropriate for databases
        storageClassName: io-thecodeforge-ssd
        resources:
          requests:
            storage: 50Gi

Output

service/io-thecodeforge-headless created

statefulset.apps/io-thecodeforge-db created

# Pods created sequentially:

# io-thecodeforge-db-0 Running

# io-thecodeforge-db-1 Running (created after -0 passed readiness)

# io-thecodeforge-db-2 Running (created after -1 passed readiness)

Mental Model

The StatefulSet Identity Model

Think of StatefulSet pods as numbered soldiers in a formation — each has a rank (ordinal), a personal locker (PVC), and always stands in the same position relative to the others. You can replace a soldier, but the replacement steps into the same rank, opens the same locker, and takes the same position.

Pod identity = ordinal index: web-0, web-1, web-2 — deterministic, survives restarts, rescheduling, and node failures
DNS identity = pod-name.service-name.namespace.svc.cluster.local — per-pod A record created by the headless Service
Storage identity = volumeClaimTemplate name + ordinal: data-web-0, data-web-1 — PVCs are not shared and are not deleted when the pod is
Ordering guarantee: pods are created sequentially (0 before 1 before 2) and deleted in reverse (2 before 1 before 0) — enforced by the controller, not advisory
Rolling update order: pod N-1 updates first, pod 0 updates last — preserving the stability of lower-ordinal pods that often hold more critical cluster roles
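The naming rules in the list above can be sketched as a small pure function. This is only a model of the conventions described here, not a call into any Kubernetes API; the function name `identities`, the `PodIdentity` type, and the example values (`web`, `web-headless`, `default`) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodIdentity:
    pod_name: str
    dns_name: str
    pvc_name: str


def identities(statefulset, service, namespace, claim_template, replicas,
               cluster_domain="cluster.local"):
    """Derive the stable names for every ordinal of a StatefulSet."""
    result = []
    for ordinal in range(replicas):
        pod = f"{statefulset}-{ordinal}"
        result.append(PodIdentity(
            pod_name=pod,
            # Per-pod A record published through the headless Service
            dns_name=f"{pod}.{service}.{namespace}.svc.{cluster_domain}",
            # PVC name: <claim template>-<statefulset>-<ordinal>
            pvc_name=f"{claim_template}-{pod}",
        ))
    return result


if __name__ == "__main__":
    for identity in identities("web", "web-headless", "default", "data", 3):
        print(identity)
```

Because every name is a function of the ordinal alone, a replacement pod for ordinal 1 always resolves to the same DNS name and the same PVC, whichever node it lands on.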

📊 Production Insight

The serviceName field must reference a headless Service that you create yourself. A regular ClusterIP Service does not create per-pod DNS records, and the API server does not reject a StatefulSet whose Service is missing, so the mistake only surfaces at runtime.

Pods without a functioning headless Service have stable names but no resolvable DNS addresses — peer discovery fails silently because pods simply cannot find each other by name.

Rule: always create the headless Service before the StatefulSet, and verify DNS resolution from inside the cluster immediately after creation.

🎯 Key Takeaway

StatefulSets give pods three things Deployments fundamentally cannot: a name that survives restarts, storage that follows that name, and a creation order that respects distributed system dependencies.

These guarantees are not API conveniences — they are load-bearing properties that consensus protocols like Raft and ZAB depend on to reason about cluster membership correctly.

Rule: if your system cares about which pod it is talking to, you need a StatefulSet.

StatefulSet vs Deployment Decision Tree

If: Application is stateless (any pod can handle any request with no awareness of which pod it is)
→ Use: Deployment. Simpler, parallel scaling, no identity overhead, easier rolling updates.

If: Application needs stable network identity for peer discovery or leader election
→ Use: StatefulSet. Deterministic pod names with headless Service DNS that survives restarts.

If: Application needs persistent storage tied to a specific pod instance rather than a specific node
→ Use: StatefulSet with volumeClaimTemplates. PVCs follow pod identity across node rescheduling.

If: Application requires ordered startup or shutdown to respect cluster quorum or replication dependencies
→ Use: StatefulSet. Sequential creation in ordinal order and reverse-order deletion are enforced at the controller level.

If: Application is a single-instance database with no peer discovery requirements
→ Consider: Deployment with a manually created PVC. StatefulSet ordering adds no value for a single replica, and the simpler mental model is worth it.

Ordered Operations: Creation, Deletion, and Rolling Updates

StatefulSet operations are strictly ordered in ways that feel unusual if you are used to Deployment behaviour. Understanding this ordering is essential for both operating StatefulSets correctly and for debugging when something gets stuck.

Pod creation is strictly sequential: the controller creates pod 0, then waits until it is Running and Ready before creating pod 1. Pod 1 must be Running and Ready before pod 2 is created. This is not just about startup — it reflects the dependency structure of many distributed systems where node 0 bootstraps the cluster and node 1 joins as a follower. If you scale a StatefulSet from 3 to 5 replicas, pods 3 and 4 are created sequentially in that order.

Pod deletion is strictly reverse-sequential: the controller deletes pod N-1, waits for it to fully terminate, then deletes pod N-2, and so on to pod 0. This preserves quorum during scale-down — a ZooKeeper ensemble being scaled from 5 to 3 nodes loses its two highest-ordinal nodes first, maintaining the 3-node quorum throughout the process rather than potentially losing the primary.
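The creation and deletion order described above can be modelled in a few lines. This is a sketch of the ordering rule only, assuming the default ordered behaviour; `scale_plan` is a hypothetical helper and does not talk to a cluster.

```python
def scale_plan(current, desired):
    """Return the ordered (action, ordinal) steps for a replica count change.

    Scale-up creates ordinals in ascending order; scale-down deletes them
    in descending order. Each step waits for the previous one to finish.
    """
    if desired > current:
        return [("create", ordinal) for ordinal in range(current, desired)]
    return [("delete", ordinal) for ordinal in range(current - 1, desired - 1, -1)]


if __name__ == "__main__":
    print(scale_plan(3, 5))  # scale up: pod 3 first, then pod 4
    print(scale_plan(5, 3))  # scale down: pod 4 first, then pod 3
```

A scale-down never touches ordinals below the new replica count, which is why the lowest-numbered members of an ensemble stay up throughout.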

Rolling updates run in reverse ordinal order, which is not what most engineers expect. The controller updates pod N-1 first, waits for it to become Ready, then proceeds to pod N-2, and finishes with pod 0. Pod 0 is often the pod that bootstrapped the cluster, so updating it last lets the new version prove itself on the higher ordinals first. Roles such as ZooKeeper leader or Kafka partition leader are elected at runtime and are not pinned to ordinal 0, so check where leadership actually sits before starting a rollout.

The partition field in the rolling update strategy is one of the most useful and most forgotten StatefulSet features. Setting partition: 3 in a 5-pod StatefulSet means only pods 3 and 4 update when you apply a new template. Pods 0, 1, and 2 remain on the old version. This is a native canary mechanism built into the StatefulSet API: you validate pods 3 and 4 under production traffic, then lower partition to 0 to complete the rollout. The failure mode to remember: if you set partition for a canary and forget to reset it to 0, the cluster keeps running two versions side by side until someone notices.
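As a worked example of the partition rule, the sketch below computes which ordinals receive a new template and in what order. It models the rule as described in this section; `rollout_order` and `held_back` are hypothetical names, not part of any Kubernetes client.

```python
def rollout_order(replicas, partition=0):
    """Ordinals that get the new template, in the order they are updated.

    Updates run from the highest ordinal down, and only pods whose
    ordinal is >= partition are touched.
    """
    return [o for o in range(replicas - 1, -1, -1) if o >= partition]


def held_back(replicas, partition=0):
    """Ordinals that stay on the old template while the partition is set."""
    return [o for o in range(replicas) if o < partition]


if __name__ == "__main__":
    print(rollout_order(5, partition=3))  # [4, 3]
    print(held_back(5, partition=3))      # [0, 1, 2]
    print(rollout_order(5))               # [4, 3, 2, 1, 0]
```

Lowering the partition from 3 to 0 simply extends the first list to cover the pods in the second, which is why completing a canary is a single field change.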

io_thecodeforge/statefulset-update.yamlYAML

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
spec:
  serviceName: io-thecodeforge-headless
  replicas: 5
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Only pods with ordinal >= 3 will be updated when the template changes.
      # Pods 0, 1, and 2 remain on the current version.
      # This is the canary pattern: validate pods 3 and 4 first.
      # To complete the rollout: kubectl patch statefulset io-thecodeforge-db
      #   --type=merge -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
      partition: 3
  selector:
    matchLabels:
      app: io-thecodeforge-db
  template:
    metadata:
      labels:
        app: io-thecodeforge-db
    spec:
      # 90 seconds gives PostgreSQL enough time to:
      # 1. Finish in-progress transactions
      # 2. Flush dirty buffers to disk
      # 3. Write a clean checkpoint to WAL
      # 4. Cleanly release file locks
      # Without sufficient grace period, the next startup triggers crash recovery.
      terminationGracePeriodSeconds: 90
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "postgres"]
            initialDelaySeconds: 10
            periodSeconds: 5
            # failureThreshold * periodSeconds = how long a running pod can fail the probe
            # before it is marked NotReady (3 * 5 = 15 seconds); the rollout waits until it is Ready again
            failureThreshold: 3
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: io-thecodeforge-ssd
        resources:
          requests:
            storage: 50Gi

# Monitor rollout progress:
# kubectl rollout status statefulset/io-thecodeforge-db
# kubectl get statefulset io-thecodeforge-db -o jsonpath=\
#   '{.status.currentRevision} {.status.updateRevision}'

Output

statefulset.apps/io-thecodeforge-db configured

# Rolling update sequence (partition: 3):

# io-thecodeforge-db-4 -> updated (ordinal 4, highest)

# io-thecodeforge-db-3 -> updated (ordinal 3, next)

# io-thecodeforge-db-2 -> SKIPPED (ordinal 2 < partition 3)

# io-thecodeforge-db-1 -> SKIPPED

# io-thecodeforge-db-0 -> SKIPPED

⚠ Rolling Update Gotchas That Block Rollouts

These are the most common reasons a StatefulSet rolling update gets stuck and stays stuck, often without an obvious error message.

📊 Production Insight

A failing readiness probe on pod N-1 blocks the rolling update indefinitely — the controller will sit waiting with no timeout and no automatic escalation.

Set terminationGracePeriodSeconds substantially higher than your database's measured shutdown time — measure it in staging under realistic load, not under an idle test instance.

Rule: test your readiness probe independently before relying on it to gate a StatefulSet rolling update — a probe that always passes immediately is worse than no probe.

🎯 Key Takeaway

StatefulSet operations have a strict and enforced ordering: create forward (0, 1, 2), delete backward (2, 1, 0), update in reverse ordinal (N-1, N-2, ..., 0).

The partition field enables native canary deployments — update a subset, validate under production traffic, then lower partition to complete the rollout.

Rule: a single failing readiness probe blocks the entire StatefulSet rollout indefinitely — probe design and terminationGracePeriodSeconds are not afterthoughts, they are critical to rollout reliability.

Update Strategy Selection

If: Standard rolling update where all pods should update automatically in reverse ordinal order
→ Use: RollingUpdate strategy with partition: 0. Pods update from the highest ordinal down, and the controller waits for each to become Ready.

If: Need to validate a subset of pods under production traffic before full rollout
→ Use: partition set to the ordinal boundary. Pods with ordinal >= partition update while lower-ordinal pods stay on the old version.

If: Need human approval between each pod update (change control or staged rollout with validation steps)
→ Use: OnDelete strategy. Pods only update when you manually delete them, giving full control over pace and sequencing.

If: Need to pause a rollout mid-way after discovering an issue in the first updated pods
→ Use: partition set to the ordinal of the last successfully updated pod. Pods below that partition stay on the known-good version while you investigate.

PVC Lifecycle: Storage That Follows Identity

StatefulSets use volumeClaimTemplates to provision one PersistentVolumeClaim per pod. The naming convention is deterministic: template-name-statefulset-name-ordinal. For a StatefulSet named io-thecodeforge-db with a template named data, the PVCs are data-io-thecodeforge-db-0, data-io-thecodeforge-db-1, and data-io-thecodeforge-db-2. This naming is not configurable — it is generated by the StatefulSet controller.

The storage lifecycle has two properties that every engineer running StatefulSets in production must understand deeply.

First, PVCs survive pod deletion. When a pod is deleted — whether by a rolling update, a manual kubectl delete pod, a node failure, or a scale-down — its PVC is not deleted. When the pod is recreated with the same ordinal, it reattaches to the same PVC. This is the 'sticky storage' guarantee. It is what makes your PostgreSQL data directory survive node failures without data loss.

Second, PVCs are NOT deleted when the StatefulSet is deleted or scaled down. This is a deliberate safety mechanism: accidentally deleting a StatefulSet should not destroy production databases. But it also means that scaling a StatefulSet down from 5 to 3 replicas leaves two orphaned PVCs (ordinals 3 and 4) that consume cloud storage indefinitely. At $0.08-0.15 per GB per month on most cloud providers, ten orphaned 1 TB PVCs accumulate $800-1500 per month in silent storage waste. This is one of the most common cost anomalies in Kubernetes clusters and one of the least visible.
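The cost figure above is straightforward multiplication, sketched here so you can plug in your own numbers. The assumptions are that each orphaned PVC is 1 TB, that 1 TB is counted as 1000 GB, and that the per-GB prices are the ones quoted in the text; `orphan_cost_per_month` is a hypothetical helper.

```python
def orphan_cost_per_month(pvc_count, size_gb, price_per_gb_month):
    """Monthly cost of orphaned PVCs: count x size x unit price."""
    return pvc_count * size_gb * price_per_gb_month


if __name__ == "__main__":
    low = orphan_cost_per_month(10, 1000, 0.08)
    high = orphan_cost_per_month(10, 1000, 0.15)
    print(f"${low:.0f} to ${high:.0f} per month")
```

With the 50Gi volumes used in the manifests in this article, the same two orphans from a 5-to-3 scale-down cost far less, but the charge recurs every month until someone deletes them.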

The persistentVolumeClaimRetentionPolicy field on StatefulSets (beta and enabled by default since Kubernetes 1.27, stable since 1.32) lets you configure automatic PVC deletion on scale-down (whenScaled) or on StatefulSet deletion (whenDeleted). For many production stateful workloads, setting whenScaled to Delete is appropriate, because the data on a scaled-down replica can usually be rebuilt from its peers; confirm that this holds for your system first. Setting whenDeleted to Delete is more dangerous and should only be used when you have confirmed out-of-cluster backups.

io_thecodeforge/pvc-lifecycle.yamlYAML

# persistentVolumeClaimRetentionPolicy is enabled by default from Kubernetes 1.27 (beta).
# It controls what happens to PVCs when the StatefulSet is deleted or scaled down.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
spec:
  serviceName: io-thecodeforge-headless
  replicas: 3
  # Enabled by default from Kubernetes 1.27 (beta); stable since 1.32
  persistentVolumeClaimRetentionPolicy:
    # whenScaled: what happens to PVCs when replicas is reduced
    #   Delete  - PVC is deleted when its pod is scaled away (use with confirmed backups)
    #   Retain  - PVC survives scale-down (default, safe but causes orphaned volume drift)
    whenScaled: Retain
    # whenDeleted: what happens to PVCs when the StatefulSet is deleted
    #   Delete  - PVCs are deleted with the StatefulSet (DANGEROUS — only with backups)
    #   Retain  - PVCs survive StatefulSet deletion (default, safe)
    whenDeleted: Retain
  selector:
    matchLabels:
      app: io-thecodeforge-db
  template:
    metadata:
      labels:
        app: io-thecodeforge-db
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: io-thecodeforge-ssd
        resources:
          requests:
            storage: 50Gi

---
# Manual PVC lifecycle management for clusters on Kubernetes < 1.27:

# After scaling down from 5 to 3 replicas, list orphaned PVCs:
# kubectl get pvc -l app=io-thecodeforge-db
# NAME                           STATUS  CAPACITY
# data-io-thecodeforge-db-0      Bound   50Gi   <- still in use
# data-io-thecodeforge-db-1      Bound   50Gi   <- still in use
# data-io-thecodeforge-db-2      Bound   50Gi   <- still in use
# data-io-thecodeforge-db-3      Bound   50Gi   <- ORPHANED (pod deleted)
# data-io-thecodeforge-db-4      Bound   50Gi   <- ORPHANED (pod deleted)

# Clean up orphaned PVCs after confirming data is not needed:
# kubectl delete pvc data-io-thecodeforge-db-3 data-io-thecodeforge-db-4

# WARNING: PVC deletion is effectively IRREVERSIBLE
# With the default Delete reclaim policy, the underlying PersistentVolume and its data are removed too.
# Always confirm you have a recent backup before deleting any database PVC.

Output

statefulset.apps/io-thecodeforge-db configured

# PVC status after scaling from 5 to 3 replicas (Kubernetes < 1.27 or whenScaled: Retain):

NAME STATUS VOLUME CAPACITY STORAGECLASS

data-io-thecodeforge-db-0 Bound pvc-abc123 50Gi io-thecodeforge-ssd

data-io-thecodeforge-db-1 Bound pvc-def456 50Gi io-thecodeforge-ssd

data-io-thecodeforge-db-2 Bound pvc-ghi789 50Gi io-thecodeforge-ssd

data-io-thecodeforge-db-3 Bound pvc-jkl012 50Gi io-thecodeforge-ssd # orphaned

data-io-thecodeforge-db-4 Bound pvc-mno345 50Gi io-thecodeforge-ssd # orphaned

⚠ PVC Orphan Leak — The Silent Storage Cost Nobody Monitors

This is one of the most consistent sources of unexpected cloud storage costs in mature Kubernetes environments. It is invisible until someone looks at the bill.

📊 Production Insight

Orphaned PVCs from scale-downs accumulate silently for months in environments without explicit PVC lifecycle management — the first signal is usually a cloud cost anomaly review or a scheduling failure when storage quotas are hit.

PVC deletion is effectively irreversible: with the default Delete reclaim policy, the underlying PersistentVolume and every byte of data it contains are removed with it. Never delete a database PVC without confirming a recent backup.

Rule: use persistentVolumeClaimRetentionPolicy in Kubernetes 1.27+ for automatic cleanup, or automate manual cleanup with a CronJob that detects PVCs whose owning pod ordinal no longer exists.
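The detection step of such a cleanup job can be sketched as a pure function over PVC names. This is a minimal sketch that assumes the PVC names have already been listed (for example with kubectl get pvc) and that they follow the template-statefulset-ordinal convention; `orphaned_pvcs` is a hypothetical helper that only reports candidates and deletes nothing.

```python
import re


def orphaned_pvcs(pvc_names, statefulset, claim_template, replicas):
    """Return PVC names whose ordinal is >= replicas, so no pod claims them."""
    pattern = re.compile(
        rf"^{re.escape(claim_template)}-{re.escape(statefulset)}-(\d+)$"
    )
    orphans = []
    for name in pvc_names:
        match = pattern.match(name)
        # PVCs that do not belong to this StatefulSet are ignored
        if match and int(match.group(1)) >= replicas:
            orphans.append(name)
    return orphans


if __name__ == "__main__":
    names = [f"data-io-thecodeforge-db-{i}" for i in range(5)]
    print(orphaned_pvcs(names, "io-thecodeforge-db", "data", replicas=3))
```

Keeping detection separate from deletion leaves room for a backup check or a human approval step before anything irreversible happens.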

🎯 Key Takeaway

StatefulSet PVCs survive pod deletion — this is the 'sticky storage' guarantee that makes databases viable on Kubernetes.

Scaling down orphans PVCs; deleting the StatefulSet orphans them too — neither triggers automatic cleanup by default.

Rule: treat PVC lifecycle as a first-class operational concern with explicit monitoring and cleanup automation — orphaned volumes are a slow financial bleed that compounds over time.

Headless Services and DNS: How Pods Find Each Other

A headless Service is the DNS backbone of a StatefulSet. Without it, StatefulSet pods have stable names but no way for other pods to resolve those names to IP addresses. With it, every pod in the StatefulSet gets an individual DNS A record that points directly to that pod's IP — bypassing the load-balancing layer that regular Services add.

The distinction between a regular Service and a headless Service is important to internalise. A regular Service (one with a cluster IP assigned) creates one DNS name that resolves to a virtual IP, and kube-proxy load-balances traffic from that virtual IP to any matching pod. You cannot target a specific pod by DNS with a regular Service. A headless Service (clusterIP: None) creates no virtual IP and no load balancing. Instead, CoreDNS serves one A record per pod, each pointing to that pod's actual IP address. This is how pod-to-pod targeting works.

The DNS naming convention for StatefulSet pods is: pod-name.service-name.namespace.svc.cluster.local. For a StatefulSet named db with a headless Service named db-headless in namespace prod, pod 0's DNS entry is db-0.db-headless.prod.svc.cluster.local. This entry is stable — it always points to the pod with that identity, regardless of which node it runs on. When the pod is rescheduled to a different node with a different IP, CoreDNS updates the A record to reflect the new IP. The DNS name remains constant; only the IP it resolves to changes.

The headless Service also publishes SRV records under _port-name._protocol.service-name.namespace.svc.cluster.local, one per pod. SRV records carry both the hostname and the port, so an application that supports SRV-based discovery (etcd's DNS discovery mode is one example) can enumerate all current pod hostnames without knowing the replica count in advance. Kafka and ZooKeeper are more commonly configured with an explicit list of the per-pod DNS names.

One practical detail: CoreDNS caches DNS responses with a TTL. For headless Services the default TTL is 5 seconds. If a pod is rescheduled quickly, there is a brief window where other pods may try to connect to the old IP before the cache expires. Design your application's connection retry logic to tolerate this — most database connection pools handle it correctly if configured with appropriate connection timeouts.
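One way to tolerate the stale-IP window is to resolve the pod's DNS name again on every connection attempt rather than caching the first answer. The sketch below uses only the Python standard library; `connect_with_retry` and its parameters are illustrative assumptions, not part of any database driver, and it assumes the local resolver does not add a longer cache of its own.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=1.0, timeout=2.0,
                       resolve=socket.getaddrinfo,
                       connect=socket.create_connection,
                       sleep=time.sleep):
    """Connect to host:port, looking the name up again before each attempt."""
    last_error = None
    for attempt in range(attempts):
        try:
            # Fresh lookup so a rescheduled pod's new IP is picked up
            infos = resolve(host, port, type=socket.SOCK_STREAM)
            address = infos[0][4][:2]  # (ip, port) for IPv4 and IPv6 results
            return connect(address, timeout=timeout)
        except OSError as exc:  # covers DNS failures, refusals, and timeouts
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)
    raise ConnectionError(f"could not reach {host}:{port}") from last_error
```

The resolver, connector, and sleep function are injectable so the retry behaviour can be exercised without a cluster. In practice you would tune attempts and delay so that the total retry budget comfortably exceeds the DNS TTL.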

io_thecodeforge/headless-service.yamlYAML

# The headless Service — created before the StatefulSet, referenced by serviceName field.
apiVersion: v1
kind: Service
metadata:
  name: io-thecodeforge-headless
  namespace: prod
  labels:
    app: io-thecodeforge-db
spec:
  clusterIP: None  # No virtual IP — each pod gets its own DNS A record
  publishNotReadyAddresses: false  # Only Ready pods get DNS records (default: false)
  # Set to true for stateful systems that need to discover all members
  # including those still initializing — useful for ZooKeeper ensemble bootstrap
  selector:
    app: io-thecodeforge-db  # Must match StatefulSet pod labels EXACTLY
  ports:
    - port: 5432
      targetPort: 5432
      name: postgres

# --- DNS Records Created by CoreDNS ---
#
# Per-pod A records (stable, direct — the primary use case):
#   io-thecodeforge-db-0.io-thecodeforge-headless.prod.svc.cluster.local -> 10.244.1.5
#   io-thecodeforge-db-1.io-thecodeforge-headless.prod.svc.cluster.local -> 10.244.2.8
#   io-thecodeforge-db-2.io-thecodeforge-headless.prod.svc.cluster.local -> 10.244.3.12
#
# SRV records (used for automatic cluster member discovery):
#   _postgres._tcp.io-thecodeforge-headless.prod.svc.cluster.local
#   -> SRV: 0 50 5432 io-thecodeforge-db-0.io-thecodeforge-headless.prod.svc.cluster.local
#   -> SRV: 0 50 5432 io-thecodeforge-db-1.io-thecodeforge-headless.prod.svc.cluster.local
#   -> SRV: 0 50 5432 io-thecodeforge-db-2.io-thecodeforge-headless.prod.svc.cluster.local
#
# Service-level A record (returns all pod IPs — round-robin, not load-balanced):
#   io-thecodeforge-headless.prod.svc.cluster.local -> 10.244.1.5, 10.244.2.8, 10.244.3.12

# --- Verify DNS resolution from inside the cluster ---
# kubectl exec -n prod io-thecodeforge-db-1 -- \
#   nslookup io-thecodeforge-db-0.io-thecodeforge-headless.prod.svc.cluster.local
# Expected output: Address: 10.244.1.5

Output

service/io-thecodeforge-headless created

# DNS verification from inside the cluster:

# Server: 10.96.0.10

# Address: 10.96.0.10#53

# Name: io-thecodeforge-db-0.io-thecodeforge-headless.prod.svc.cluster.local

# Address: 10.244.1.5

Mental Model

DNS Resolution for StatefulSets

A headless Service turns CoreDNS into a pod directory — each pod gets a permanent address book entry that survives rescheduling. When the pod moves to a new node, the address book updates the IP but the name stays the same.

Regular Service: one DNS name, one virtual IP, load-balanced across pods — you cannot target a specific pod
Headless Service: individual DNS A records per pod — you CAN target a specific pod by its stable name
Pod DNS pattern: pod-name.headless-svc.namespace.svc.cluster.local — stable across restarts and rescheduling
SRV records allow applications that support SRV-based discovery to find all pods dynamically without knowing the replica count
CoreDNS TTL for headless Services defaults to 5 seconds — design connection retry logic to tolerate this brief staleness window after pod rescheduling

📊 Production Insight

The headless Service selector must match the StatefulSet pod labels exactly — a label mismatch creates a headless Service with no endpoints, which means DNS A records for zero pods, which means peer discovery silently fails.

CoreDNS caching means there is a window of up to about 5 seconds after pod rescheduling where DNS can still resolve to the old IP. Application connection retry logic must handle this rather than assuming immediate DNS convergence.

Rule: test DNS resolution with nslookup from inside the cluster immediately after StatefulSet creation, and again after simulating a pod reschedule in staging.

🎯 Key Takeaway

A headless Service is not optional for StatefulSets that need peer-to-peer communication — it is the mechanism that creates per-pod DNS identity.

Without a headless Service, pods have stable names but no resolvable DNS addresses — peer discovery fails silently with NXDOMAIN errors that are easy to misattribute.

Rule: if your StatefulSet pods need to discover and communicate with each other by name, the headless Service must exist before the StatefulSet and its selector must be verified.

PodDisruptionBudget: Protecting Quorum During Voluntary Disruptions

PodDisruptionBudgets (PDBs) are among the most important and most skipped Kubernetes objects for StatefulSets in production. A PDB is a policy object that limits how many pods in a set can be simultaneously unavailable during voluntary disruptions — node drains, cluster upgrades, autoscaler scale-downs, and manual evictions.

For a quorum-based system, the requirement is concrete: a 3-node ZooKeeper cluster needs at least 2 nodes to maintain quorum. A 5-node etcd cluster needs at least 3 nodes. A PDB with minAvailable: 2 tells Kubernetes — during any voluntary disruption, you must ensure at least 2 of my pods are running and Ready before proceeding with the eviction. If a node drain would cause a second pod to become unavailable while the first is still evicting, the drain blocks until the first pod is rescheduled and Ready elsewhere.

This blocking behaviour is the point. Without a PDB, kubectl drain proceeds freely, evicting pods without regard for quorum. A node drain during a cluster upgrade with three StatefulSet pods all on the same node (a common situation if pod anti-affinity is not configured) will evict all three simultaneously, destroying quorum completely.

The combination of PDB plus pod anti-affinity is the production-grade pattern. The PDB protects against voluntary disruptions. Pod anti-affinity with requiredDuringSchedulingIgnoredDuringExecution and topologyKey: kubernetes.io/hostname protects against involuntary disruptions by ensuring no two pods land on the same node. Together, they mean: node drains cannot break quorum, and a single node failure cannot take out more than one pod.

One operational detail: kubectl drain respects PDBs. kubectl delete pod does not. If your operational runbook uses kubectl delete pod to perform maintenance, PDBs provide no protection. Use kubectl drain or the Kubernetes Eviction API for any maintenance that should respect disruption budgets.

io_thecodeforge/pdb-and-affinity.yaml (YAML)

# PodDisruptionBudget — protects quorum during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: io-thecodeforge-db-pdb
  namespace: prod
spec:
  # For a 3-node cluster: minAvailable: 2 (majority = quorum)
  # For a 5-node cluster: minAvailable: 3 (majority = quorum)
  # For a 7-node cluster: minAvailable: 4 (majority = quorum)
  # Rule: minAvailable = floor(replicas / 2) + 1
  minAvailable: 2
  selector:
    matchLabels:
      app: io-thecodeforge-db

# Alternative using maxUnavailable (equivalent for a 3-pod set):
# spec:
#   maxUnavailable: 1
#   selector:
#     matchLabels:
#       app: io-thecodeforge-db

---
# Pod anti-affinity — spreads pods across nodes to protect against involuntary failures
# PDB protects against voluntary disruptions (drain, upgrade)
# Anti-affinity protects against involuntary disruptions (node crash)
# You need BOTH for production-grade StatefulSet resilience.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: io-thecodeforge-db
  namespace: prod
spec:
  serviceName: io-thecodeforge-headless
  replicas: 3
  selector:
    matchLabels:
      app: io-thecodeforge-db
  template:
    metadata:
      labels:
        app: io-thecodeforge-db
    spec:
      affinity:
        podAntiAffinity:
          # required = hard constraint: pods WILL NOT be scheduled on the same node
          # preferred = soft preference: Kubernetes tries but can violate if necessary
          # For quorum-sensitive systems, use required.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - io-thecodeforge-db
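              # kubernetes.io/hostname = spread across nodes
              # topology.kubernetes.io/zone = spread across AZs (stronger guarantee)
              # With a required rule on zone, 3 replicas need at least 3 zones with
              # schedulable nodes; otherwise the extra pod stays Pending.
              topologyKey: topology.kubernetes.io/zone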
      containers:
        - name: postgres
          image: postgres:16-alpine
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: io-thecodeforge-ssd
        resources:
          requests:
            storage: 50Gi

Output

poddisruptionbudget.policy/io-thecodeforge-db-pdb created

statefulset.apps/io-thecodeforge-db configured

# Verify PDB status:
# kubectl get pdb io-thecodeforge-db-pdb -n prod
# NAME                     MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# io-thecodeforge-db-pdb   2               N/A               1                     5m
# ALLOWED DISRUPTIONS: 1 means one pod can be evicted at a time (3 - 2 = 1)

⚠ PDB Does NOT Protect Against Node Crashes — You Need Both

This is the most common misunderstanding about PodDisruptionBudgets. Teams add a PDB, feel protected, and then lose quorum during a node hardware failure that the PDB never saw.

📊 Production Insight

kubectl drain respects PodDisruptionBudgets — kubectl delete pod does not. If your team's operational runbooks use kubectl delete pod for any maintenance task that requires respecting quorum, your PDB provides no protection for those operations.

A PDB with ALLOWED DISRUPTIONS showing 0 means a node drain will block at this StatefulSet until quorum is restored — this is correct behaviour but will surprise on-call engineers who expect drains to complete quickly.

Rule: set minAvailable to your quorum threshold, use requiredDuringSchedulingIgnoredDuringExecution anti-affinity across zones, and update your operational runbooks to use kubectl drain instead of kubectl delete pod.

🎯 Key Takeaway

PodDisruptionBudgets are non-negotiable for production StatefulSets — without one, a single node drain during a cluster upgrade can evict all pods simultaneously and destroy quorum.

PDBs protect against voluntary disruptions only — node crashes require pod anti-affinity spread across nodes or zones.

Rule: minAvailable = floor(replicas / 2) + 1 for any majority-quorum system, and combine with zone-level anti-affinity for complete resilience.

Before You Begin: The Prerequisites Nobody Tells You About

StatefulSets look simple until they break your database at 3 AM. Before you touch one, you need to understand three things that most tutorials gloss over.

First, your cluster needs a dynamic PersistentVolume provisioner with a StorageClass that supports ReadWriteOnce. If you're on a bare-metal cluster without a proper CSI driver, your PVCs will stay Pending forever. Check with kubectl get storageclass and verify your provisioner supports volumeBindingMode: WaitForFirstConsumer for StatefulSets with topology constraints.

Second, you need a headless Service. This isn't optional — StatefulSets rely on DNS records to assign stable hostnames like pod-name-0.service-name.namespace.svc.cluster.local. Without a headless Service (spec.clusterIP: None), your pods won't get predictable network identities.

Third, understand that StatefulSets require ordered pod management by default. This means scaling down waits for pod-2 to terminate before touching pod-1. If your app doesn't need ordering, you're paying a latency tax for nothing — consider setting spec.podManagementPolicy: Parallel.

VerifyStorageClass.yml (YAML)

# io.thecodeforge - devops tutorial

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
# In-tree provisioner name; clusters on the GCE PD CSI driver typically use pd.csi.storage.gke.io
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

Output

NAME       PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION
fast-ssd   kubernetes.io/gce-pd   Retain          WaitForFirstConsumer   true

⚠ Production Trap:

Never use the default StorageClass without checking its reclaimPolicy. 'Delete' means your data vanishes the moment you delete the PVC: great for test, catastrophic for production databases.

🎯 Key Takeaway

Always validate your StorageClass supports dynamic provisioning and matches your access mode before writing a single StatefulSet manifest.

Creating a StatefulSet: The Manifest That Actually Works

Stop copy-pasting random YAML from blogs. Here's how a real production StatefulSet looks for a PostgreSQL cluster — and why every field exists.

The critical parts: serviceName must match your headless Service's name exactly. Kubernetes uses this to generate DNS records. spec.selector.matchLabels must match the pod template labels; on a mismatch the API server rejects the StatefulSet with a validation error.

Notice volumeClaimTemplates instead of volumes. This is the magic that gives each pod its own PVC. Each template creates one PVC per replica, named {volume-name}-{statefulset-name}-{ordinal}. The PVC survives pod deletion, so when the pod comes back (maybe on a different node), it reattaches to the same storage.

The podManagementPolicy defaults to OrderedReady, which waits for pod-0 to be Running and Ready before creating pod-1. This is required for quorum-based apps like databases. If you're running a queue worker or cache, switch to Parallel to avoid the serial bottleneck.

PostgresStatefulSet.yml (YAML)

# io.thecodeforge - devops tutorial

apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
spec:
  clusterIP: None
  selector:
    app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-cluster
spec:
  serviceName: postgres-headless
  replicas: 3
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
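      containers:
      - name: postgres
        image: postgres:15
        env:
        # The official postgres image refuses to start without a superuser password.
        # The Secret name and key below are placeholders; create your own Secret.
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-credentials
              key: password
        # Use a subdirectory so initdb does not trip over lost+found on a fresh volume.
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: pgdata
          mountPath: /var/lib/postgresql/data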
  volumeClaimTemplates:
  - metadata:
      name: pgdata
    spec:
      storageClassName: fast-ssd
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi

Output

service/postgres-headless created

statefulset.apps/postgres-cluster created

💡Senior Shortcut:

Set 'spec.volumeClaimTemplates[*].spec.storageClassName' explicitly. If omitted, Kubernetes uses the default StorageClass, which might have a 'Delete' reclaim policy. Your production data then disappears as soon as the PVC is deleted.

🎯 Key Takeaway

volumeClaimTemplates are not optional — they're the only way to give each pod its own persistent disk that survives pod deletion.

Scaling a StatefulSet: Why Your Database Will Thank You for Ordered Pod Termination

Scaling a StatefulSet is not like scaling a Deployment. Deployment scaling is stupid-fast: it kills or spawns pods in parallel. StatefulSet scaling is surgical and ordered, and that matters when your app has a quorum.

When you scale down from 3 to 1 replica, Kubernetes terminates the highest ordinal pod first: postgres-cluster-2, then postgres-cluster-1. Under the default OrderedReady policy it waits for each pod to shut down completely and be removed before touching the next. No race conditions. No two databases trying to write to the same volume.

There's a hidden gotcha: kubectl scale statefulset postgres-cluster --replicas=0 doesn't delete the PVCs. The volumes hang around with their data intact. This is by design — you can scale back up and the pods reattach to the same disks. But if you manually delete PVCs while the pods are gone, you lose data permanently.

For applications that don't care about ordering (like a distributed cache), set spec.podManagementPolicy: Parallel — scaling goes from O(n) to O(1) time. But for databases, leave it as OrderedReady and let the controller handle the serial dance.
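
A minimal sketch of that setting for a cache-style workload follows. The names cache and cache-headless, the replica count, and the image are assumptions made for illustration, and the headless Service is assumed to exist already.

```yaml
# Sketch only: names, replica count, and image are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache
spec:
  serviceName: cache-headless     # assumed headless Service (clusterIP: None)
  replicas: 5
  # Parallel: pods are created and deleted without waiting for their
  # neighbours to become Ready or to finish terminating.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
        - name: cache
          image: redis:7
```

Two details worth keeping in mind: podManagementPolicy has to be chosen when the StatefulSet is created, because the API rejects later changes to it, and it only affects scaling. Rolling updates still replace pods one at a time.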

ScaleStatefulSet.yml (shell session)

# io.thecodeforge - devops tutorial

# Scale down in action
$ kubectl scale statefulset postgres-cluster --replicas=2

# Kubernetes terminates postgres-cluster-2 first
# Watches pod status until it's gone
$ kubectl get pods -l app=postgres -w
NAME                  READY   STATUS        RESTARTS   AGE
postgres-cluster-0    1/1     Running       0          15m
postgres-cluster-1    1/1     Running       0          14m
postgres-cluster-2    1/1     Terminating   0          13m
postgres-cluster-2    0/1     Terminating   0          13m
postgres-cluster-2    0/1     Terminating   0          13m

Output

statefulset.apps/postgres-cluster scaled

# PVCs remain intact after scale-down

$ kubectl get pvc
NAME                        STATUS   VOLUME    CAPACITY   ACCESS MODES
pgdata-postgres-cluster-0   Bound    pvc-abc   100Gi      RWO
pgdata-postgres-cluster-1   Bound    pvc-def   100Gi      RWO
pgdata-postgres-cluster-2   Bound    pvc-ghi   100Gi      RWO

⚠ Production Trap:

Never combine 'Parallel' pod management with a StatefulSet running a quorum-based database. You'll get split-brain when multiple pods initialize simultaneously without sequential leader election.

🎯 Key Takeaway

StatefulSet scaling is ordered by ordinal index: highest to lowest for termination, lowest to highest for creation. Under the default OrderedReady policy, that's the contract.

OnDelete: Manual Pod Replacement When You Control the Timing

The OnDelete update strategy exists because automatic rolling updates can be risky for stateful workloads. Unlike RollingUpdate, which replaces pods automatically in reverse ordinal order, OnDelete recreates a pod with the new spec only after you delete it yourself.

Why use this? When upgrading a database cluster, you may need to drain connections, wait for replication lag to subside, or run pre-upgrade health checks on each node. OnDelete gives you precise per-pod control over when updates happen.

Set updateStrategy.type: OnDelete in your StatefulSet spec. After you update the pod template, no pods change until you run kubectl delete pod <pod-name>. The controller then rebuilds that pod from the new template, preserving its identity and storage. You set the pace of the production upgrade, not the controller.

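statefulset-ondelete.yml (YAML)

# io.thecodeforge - devops tutorial

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-cluster
spec:
  serviceName: postgres-hl
  replicas: 3
  updateStrategy:
    type: OnDelete
  # selector is required and must match the pod template labels
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        ports:
        - containerPort: 5432
        # mount the per-pod volume declared in volumeClaimTemplates
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi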

⚠ Production Trap:

OnDelete does not block traffic. If you delete pod-2 but pod-0 and pod-1 are still running the old version, you'll have a mixed-version cluster until you manually delete all pods. Always verify backward compatibility before adopting this strategy.

🎯 Key Takeaway

OnDelete hands upgrade control to you: delete pods only when your application is ready for the change, not when the StatefulSet controller decides.

Cleaning Up: Why StatefulSets Refuse to Die Cleanly

Deleting a StatefulSet does not automatically delete its PersistentVolumeClaims. By design, PVCs survive deletion to prevent accidental data loss. Run kubectl delete statefulset <name> and the pods are terminated, but the PVCs remain. Deleting the StatefulSet gives no ordering guarantee for pod termination; for an ordered shutdown (pod-2, pod-1, pod-0), scale to 0 replicas first and then delete. If you recreate the StatefulSet with the same name and volumeClaimTemplates, it will reclaim those exact PVCs.

To fully clean up, you must delete the PVCs manually after the StatefulSet is gone: kubectl delete pvc -l app=<label>. For cascading cleanup, set persistentVolumeClaimRetentionPolicy in the StatefulSet spec. This policy defines what happens to PVCs when the StatefulSet is scaled down or deleted. With whenDeleted: Delete, PVCs are removed automatically. Without this, orphaned PVCs can rack up cloud storage costs indefinitely. Always audit orphaned resources after cleanup.

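statefulset-retention-policy.yml (YAML)

# io.thecodeforge - devops tutorial

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-hl
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove PVCs when the StatefulSet itself is deleted
    whenScaled: Retain    # keep PVCs of pods removed by a scale-down
  # selector is required and must match the pod template labels
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7
        ports:
        - containerPort: 6379
        # mount the per-pod volume declared in volumeClaimTemplates
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 5Gi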

⚠ Production Trap:

Setting whenDeleted: Delete does not protect against accidental deletion. If someone deletes the StatefulSet, PVCs are destroyed instantly — no confirmation. For critical data, use Retain and build an external backup pipeline.

🎯 Key Takeaway

StatefulSets preserve PVCs on deletion by default — always set a retention policy or script manual cleanup to avoid zombie volumes and cloud bills.

● Production incident · POST-MORTEM · severity: high

The Kafka Split-Brain: How a Deployment Killed 14 Hours of Messages

Symptom

After a routine node drain, Kafka consumers across multiple services began reporting OffsetOutOfRange errors within minutes. Producer requests configured with acks=all started timing out. The Kafka controller election loop entered a crash cycle — two of the three brokers simultaneously believed they were the controller and began issuing conflicting partition reassignment commands. Fourteen hours of uncommitted messages in the __consumer_offsets topic were lost. The blast radius extended to every downstream service that consumed from the cluster.

Assumption

The team assumed Kubernetes would reschedule pods one at a time, maintaining their identity throughout the drain. They expected the deployment's rolling update configuration to apply to node drain evictions. They also assumed the terminationGracePeriodSeconds was sufficient to allow orderly shutdown — it was not, because the drain evicted all three pods in parallel before any of them finished their clean shutdown sequence.

Root cause

The Kafka cluster was deployed as a Deployment, not a StatefulSet. When the node was drained, Kubernetes terminated all three pods and created three new pods with randomly generated name suffixes — kafka-7b4f9-xk2mn, kafka-7b4f9-r9pqw, kafka-7b4f9-hjk34 — instead of the stable kafka-0, kafka-1, kafka-2 that the cluster configuration expected. The new pods had no knowledge of the old broker IDs stored in ZooKeeper. ZooKeeper saw three unknown brokers registering while the ephemeral nodes for the old broker IDs had not yet expired. The cluster entered a split-brain state where partition leader metadata in ZooKeeper pointed to broker IDs that no longer existed. Consumers could not fetch from those partitions and producers could not confirm acknowledgements. There was no PodDisruptionBudget, so Kubernetes had no constraint preventing it from evicting all three pods simultaneously.

Fix

Migrated the Kafka cluster from Deployment to StatefulSet with a headless Service, giving each broker a stable identity (kafka-0, kafka-1, kafka-2) that ZooKeeper and the broker configuration could rely on. Each broker now derives its broker.id from its pod ordinal via a downward API environment variable. Added a PodDisruptionBudget with minAvailable: 2 to ensure at least two brokers survive any voluntary disruption. Set terminationGracePeriodSeconds: 120 and added a preStop hook that triggers a controlled leader election handoff before the broker process terminates. Added pod anti-affinity rules to spread brokers across availability zones so a single AZ failure cannot take out quorum.
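
The pieces of that fix can be sketched in one manifest. This is an illustration under stated assumptions, not the configuration used in the incident: the image reference, the /opt/kafka paths, the kafka-headless Service name, the mount path, and the storage size are all placeholders, and broker settings such as zookeeper.connect and listeners are left out. The broker ID is taken from the pod name (exposed through the downward API) by stripping everything up to the last hyphen.

```yaml
# Sketch only: image, paths, names, and sizes are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  minAvailable: 2                      # two of three brokers survive any drain
  selector:
    matchLabels:
      app: kafka
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless          # assumed headless Service (clusterIP: None)
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      terminationGracePeriodSeconds: 120
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: kafka
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: kafka
          image: registry.example.com/kafka:placeholder   # hypothetical image
          env:
            - name: POD_NAME           # downward API: kafka-0, kafka-1, kafka-2
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          command:
            - sh
            - -c
            - |
              # kafka-2 -> 2: the pod ordinal becomes the broker ID
              exec /opt/kafka/bin/kafka-server-start.sh \
                /opt/kafka/config/server.properties \
                --override broker.id="${POD_NAME##*-}" \
                --override log.dirs=/var/lib/kafka/data
          lifecycle:
            preStop:
              exec:
                # Placeholder for the controlled handoff step described above
                command: ["sh", "-c", "/opt/kafka/bin/kafka-server-stop.sh"]
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Because the ordinal never changes for a given pod name, a rescheduled kafka-1 comes back with broker.id 1 and the same volume, which is the property the Deployment could not provide.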

Key lesson

Never run stateful distributed systems — Kafka, ZooKeeper, Elasticsearch, etcd, Redis Cluster — as Deployments; pod identity loss is catastrophic at the protocol level and the failure mode is silent until it is not
PodDisruptionBudget is mandatory for StatefulSets in production — without one, a node drain can evict every pod simultaneously and break quorum with no warning
Node drains respect PodDisruptionBudgets, but kubectl delete pod does not — operational runbooks must distinguish between the two
Test failure modes in staging by simulating node drains before they happen in production: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data and verify quorum is maintained throughout

Production debug guide

Symptom-driven diagnostics for Kubernetes StatefulSet issues, organised by what you see, not by what you think caused it (6 entries)

Symptom · 01

StatefulSet pod stuck in Pending with FailedScheduling event

→

Fix

Check node resources and PVC binding status: kubectl describe pod <pod-name> and look at the Events section. The two most common causes are insufficient node CPU or memory for the requested resources, and a PVC in Pending state because the StorageClass cannot provision a volume. Check StorageClass exists and its provisioner is healthy. Verify pod anti-affinity rules are not preventing scheduling by checking if enough eligible nodes exist.

Symptom · 02

StatefulSet pod stuck in Terminating and will not complete deletion

→

Fix

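Check for finalizers on the pod and its PVC: kubectl get pod <pod-name> -o jsonpath='{.metadata.finalizers}' and kubectl get pvc <pvc-name> -o yaml, then look at metadata.finalizers. On a PVC, the kubernetes.io/pvc-protection finalizer is the most common blocker: it prevents the PVC from being deleted while a pod is still using it. Also check whether the node is NotReady (the kubelet cannot confirm termination) and whether a validating webhook or operator is intercepting deletion. If the pod is truly stuck and the data is confirmed safe, force delete with: kubectl delete pod <pod-name> --grace-period=0 --force. Force deletion only removes the API object; if the old container is still running on an unreachable node, the replacement pod can start with the same identity alongside it, so confirm the old process is dead first.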

Symptom · 03

Rolling update stuck — only some pods updated, others not progressing

→

Fix

Check the update strategy partition value: kubectl get statefulset <name> -o jsonpath='{.spec.updateStrategy.rollingUpdate.partition}'. If partition is set to a non-zero value, only pods with ordinal >= partition get updated. Verify the last-updated pod passes its readiness probe — a failing readiness probe blocks the entire rollout. Check kubectl rollout status statefulset/<name> for a human-readable status. Compare currentRevision and updateRevision to confirm the rollout is actually in progress.
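
For reference, this is roughly what a partition looks like in the spec. The fragment assumes a 3-replica StatefulSet (ordinals 0, 1, 2); the value 2 is only an example.

```yaml
# Fragment of a StatefulSet spec, assuming replicas: 3
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only ordinal 2 gets the new template; 0 and 1 stay on the old revision
# After validating pod 2, lower the value step by step:
#   partition: 1  -> ordinals 1 and 2 are updated
#   partition: 0  -> all pods are updated and the rollout completes
```

A rollout that looks stuck with a non-zero partition is often just doing what the spec says.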

Symptom · 04

Pod rescheduled to different node but PVC will not attach — pod stuck in ContainerCreating

→

Fix

Check PVC access mode — ReadWriteOnce volumes can only attach to one node at a time. If the old node has not released the volume attachment, the new pod cannot claim it. Check volume attachment objects: kubectl get volumeattachment and look for one referencing the PV. The node controller's eviction timeout (default 5-6 minutes) must pass before Kubernetes forcibly detaches the volume. Check kubectl describe pv <pv-name> for the claim reference and release status.

Symptom · 05

DNS resolution failing for StatefulSet pods — peers cannot reach each other by name

→

Fix

Verify the headless Service exists and has clusterIP: None: kubectl get svc <service-name> -o jsonpath='{.spec.clusterIP}'. Check that the Service selector matches the StatefulSet pod labels exactly — a label mismatch creates a headless Service with no endpoints. Verify endpoints are populated: kubectl get endpoints <service-name>. Test DNS resolution from inside the cluster: kubectl exec <any-pod> -- nslookup <pod-name>.<service-name>.<namespace>.svc.cluster.local.

Symptom · 06

Pod crash-looping after PVC reattachment — database process will not start

→

Fix

Check mount point permissions: kubectl exec <pod> -- ls -la /data. Some databases (PostgreSQL, MySQL) require the data directory to be owned by a specific UID. Check if the volume was unmounted uncleanly during a crash — some databases (PostgreSQL) enter recovery mode and replay WAL on next start, which is expected but can fail if WAL files are corrupt. Check application logs carefully for recovery errors vs startup errors — they require different responses.
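
If ownership turns out to be the problem, one common approach is a pod-level securityContext. The fragment below is a sketch: the numeric ID 999 is an assumption (it matches the Debian-based official postgres image at the time of writing, while other images and variants use different IDs), so check the UID your image actually runs as before copying it.

```yaml
# Fragment of a StatefulSet pod template; the numeric IDs are assumptions.
spec:
  template:
    spec:
      securityContext:
        runAsUser: 999
        runAsGroup: 999
        fsGroup: 999                          # volume files are made group-owned by this GID
        fsGroupChangePolicy: OnRootMismatch   # skip the recursive chown when ownership already matches
```

OnRootMismatch matters on large volumes, where a full recursive ownership change on every mount can delay pod startup noticeably.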

★ StatefulSet Quick Debug Cheat Sheet

Rapid diagnostics for common StatefulSet production issues. These are the first commands I reach for when a StatefulSet starts misbehaving.

Pod stuck in Pending: not scheduling

Immediate action

Check scheduling failure reason and PVC binding status simultaneously — both are common causes and both produce the same Pending state

Commands

kubectl describe pod <pod-name> | grep -A15 Events

kubectl get pvc -o wide --selector app=<statefulset-label>

Fix now

Verify StorageClass exists and provisioner is healthy. Confirm node has sufficient CPU and memory for the pod's resource requests. Check pod anti-affinity rules are not creating an impossible scheduling constraint.

StatefulSet vs Deployment vs DaemonSet

Dimension	StatefulSet	Deployment	DaemonSet
Pod identity	Stable, deterministic name (ordinal index: web-0, web-1) — survives restarts and node rescheduling	Random hash suffix regenerated on every reschedule — no identity continuity between pod instances	Random hash suffix, one pod per matching node — tied to node lifecycle rather than an independent identity
Network identity	Stable per-pod DNS via headless Service: pod-name.service-name.namespace.svc.cluster.local	Load-balanced via ClusterIP Service; cannot target a specific pod by DNS	Node-local networking, accessed via node IP; no stable cross-node DNS identity
Storage	Per-pod PVC via volumeClaimTemplates — PVC survives pod deletion and reattaches on recreation	Shared volumes (all pods see the same data) or ephemeral volumes (lost on pod deletion)	Host volumes for node-local data, or node-specific PVCs — storage is tied to the node, not a portable identity
Scaling order	Sequential: pod 0 created first, N-1 deleted first — enforced by the controller at the API level	Parallel: all pods created or deleted simultaneously, controlled by maxSurge and maxUnavailable	One pod per matching node — scaling is driven by node count, not a replica field
Rolling update	Reverse ordinal (N-1 updates first, pod 0 last) — each pod must be Ready before the next updates	Parallel within maxUnavailable and maxSurge constraints — no ordering guarantees between pods	Parallel across nodes — similar to Deployment but one-per-node constraint limits concurrency naturally
Use case	Databases, message brokers (Kafka), distributed coordination (ZooKeeper, etcd), search engines (Elasticsearch)	Stateless APIs, web servers, microservices — anything where pod identity is irrelevant	Log collectors (Fluentd), monitoring agents (node-exporter), CNI plugins, security agents — node-level infrastructure
Pod replacement	Same name, same PVC reattachment — the replacement is the same identity in a new container	New name, new pod, no identity continuity — the replacement is a different entity from the old pod	Pod is recreated on the same node when the node recovers — tied to node lifecycle
Disruption budget	Critical — PDB protects quorum; without one a single drain can take the entire cluster down	Recommended — PDB protects against availability loss during updates and drains	Rarely needed — one pod per node means voluntary disruptions are node-scoped by definition

⚙ Quick Reference

10 commands from this guide

File	Command / Code	Purpose
io_thecodeforge/statefulset-basic.yaml	apiVersion: v1	What is a Kubernetes StatefulSet?
io_thecodeforge/statefulset-update.yaml	apiVersion: apps/v1	Ordered Operations
io_thecodeforge/pvc-lifecycle.yaml	apiVersion: apps/v1	PVC Lifecycle
io_thecodeforge/headless-service.yaml	apiVersion: v1	Headless Services and DNS
io_thecodeforge/pdb-and-affinity.yaml	apiVersion: policy/v1	PodDisruptionBudget
VerifyStorageClass.yml	apiVersion: storage.k8s.io/v1	Before You Begin
PostgresStatefulSet.yml	apiVersion: v1	Creating a StatefulSet
ScaleStatefulSet.yml	$ kubectl scale statefulset postgres-cluster --replicas=2	Scaling a StatefulSet
statefulset-ondelete.yml	apiVersion: apps/v1	OnDelete
statefulset-retention-policy.yml	apiVersion: apps/v1	Cleaning Up

Key takeaways

StatefulSets provide stable identity, sticky storage, and ordered operations: the three load-bearing guarantees that distributed stateful systems like ZooKeeper, Kafka, Elasticsearch, and etcd depend on at the protocol level, not just as conveniences.

A headless Service is mandatory for StatefulSets that need peer-to-peer communication: it creates the per-pod DNS A records that enable stable name resolution. Without it, pods have stable names but no way for other pods to resolve those names to IP addresses.

PVCs survive pod deletion and, by default, are not deleted on scale-down or StatefulSet deletion: orphaned PVCs consume cloud storage indefinitely. Treat PVC lifecycle as a first-class operational concern with explicit monitoring and cleanup automation.

PodDisruptionBudgets are non-negotiable for production StatefulSets: without one, a node drain can evict all pods simultaneously and destroy quorum. Set minAvailable to floor(replicas / 2) + 1, and combine with zone-level pod anti-affinity for complete resilience.

Rolling updates proceed in reverse ordinal order and are blocked by a failing readiness probe on any pod: probe design and terminationGracePeriodSeconds are critical to rollout reliability, not optional fields.

The partition field enables native canary deployments: update a subset of pods, validate under production traffic, then lower partition to 0 to complete the rollout. Forgetting to reset partition leaves the cluster in a permanently split-version state.

kubectl drain respects PodDisruptionBudgets; kubectl delete pod does not: operational runbooks for StatefulSet maintenance must use drain, not direct pod deletion, to get the safety guarantees you configured.

Never run quorum-based distributed systems as Deployments: pod identity loss is catastrophic at the protocol level and the failure mode is a split-brain or data loss incident, not a graceful degradation.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

What are the three guarantees that a StatefulSet provides that a Deploym...

Q02 · SENIOR

Explain the role of the headless Service in a StatefulSet. What happens ...

Q03 · SENIOR

A 3-node etcd cluster deployed as a StatefulSet lost quorum during a clu...

Q04 · SENIOR

What happens to PVCs when you scale down a StatefulSet from 5 to 3 repli...

Q05 · SENIOR

How does the StatefulSet rolling update work, and how is it different fr...

Q01 of 05 · SENIOR

What are the three guarantees that a StatefulSet provides that a Deployment does not?

ANSWER

StatefulSets provide three guarantees that Deployments fundamentally cannot.

First, stable network identity: each pod receives a deterministic name based on its ordinal index (web-0, web-1, web-2) that survives restarts, rescheduling, and node failures. Combined with a headless Service, this creates a stable per-pod DNS record at pod-name.service-name.namespace.svc.cluster.local. Deployments assign a random hash suffix to pod names, which changes every time a pod is replaced.

Second, stable persistent storage: volumeClaimTemplates provision a dedicated PVC per pod using a naming convention tied to the pod's ordinal. When a pod is deleted and recreated with the same ordinal, it reattaches to the same PVC on any node. The storage follows the identity, not the node.

Third, ordered deployment and scaling: pods are created sequentially, so pod 0 must be Running and Ready before pod 1 is created. Pods are deleted in reverse order, pod N-1 before pod 0. Rolling updates proceed in reverse ordinal order. Deployments create and delete pods in parallel with no ordering guarantees.

These are not conveniences: distributed consensus protocols like Raft and ZAB depend on them at the wire level.

FAQ · 5 QUESTIONS

Frequently Asked Questions

What is Kubernetes StatefulSets in simple terms?

When should I use a StatefulSet instead of a Deployment?

Can I use a StatefulSet with a single replica?

How do I perform a blue-green deployment with a StatefulSet?

What happens if a node hosting a StatefulSet pod crashes?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Last updated: July 19, 2026
