Kubernetes StatefulSets Explained — Internals, Ordering, and Production Gotchas
- Stable identity: Each Pod gets a persistent name (pod-0, pod-1) and a stable DNS entry via a Headless Service.
- Stable storage: Each Pod gets its own PersistentVolumeClaim that follows it across restarts and reschedules.
- Ordered operations: Pods are created sequentially (0, 1, 2) and deleted in reverse (2, 1, 0). Rolling updates follow the same order.
- Headless Service: clusterIP: None. DNS returns Pod IPs directly. Each Pod is reachable as pod-0.service.ns.svc.cluster.local.
- Ordered operations are slow. A 10-replica StatefulSet can take roughly 10x longer to roll out than an equivalent Deployment, because each Pod must be Running and Ready before the next starts.
- Parallel mode (podManagementPolicy: Parallel) is faster but breaks cluster bootstrap for systems that need quorum.
- Deleting a StatefulSet without deleting its PVCs. The PVCs persist indefinitely, consuming storage and blocking re-creation of the StatefulSet with different storage config.
- Pod stuck in Pending:
  - `kubectl describe pvc data-<sts-name>-<ordinal> -n <namespace>`
  - `kubectl get storageclass`
- Pod stuck in ContainerCreating with a volume attach error:
  - `kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events`
  - `kubectl get volumeattachment | grep <pv-name>`
- Rolling update stuck:
  - `kubectl rollout status statefulset/<name> -n <namespace> --timeout=30s`
  - `kubectl get pods -n <namespace> -l app=<label> --sort-by=.metadata.name -o wide | grep -v Running`
- PVCs consuming unexpected storage after StatefulSet deletion:
  - `kubectl get pvc -n <namespace> | grep <sts-name>`
  - `kubectl describe pvc data-<sts-name>-0 -n <namespace> | grep -A 5 Status`
- DNS resolution failing between StatefulSet Pods:
  - `kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.clusterIP}'`
  - `kubectl get statefulset <sts-name> -n <namespace> -o jsonpath='{.spec.serviceName}'`

Production Incident
Remediation steps:

1. Deleted the orphaned PVCs: `kubectl delete pvc data-postgres-0 data-postgres-1 data-postgres-2 data-postgres-3 data-postgres-4 -n production`.
2. Verified the underlying PersistentVolumes were released and deleted (or set `reclaimPolicy: Delete` on the StorageClass).
3. Re-created the StatefulSet with the new storage class in `volumeClaimTemplates`.
4. Added a cleanup script to the team's runbook that explicitly deletes PVCs after StatefulSet deletion.
5. Set up monitoring for unbound PVCs: alert when a PVC exists without a bound Pod for more than 1 hour.

Production Debug Guide

A symptom-first investigation path for StatefulSet failures.
- Pod stuck in Pending: run `kubectl describe pvc data-<sts-name>-<ordinal>`. If the PVC is Pending, the StorageClass may not exist or the provisioner may be down. Check node affinity: the PV may be bound to a specific node that is unavailable.
- Pod stuck in ContainerCreating: run `kubectl get pvc -n <ns>`. If the PV is still attached to the old node, force-detach it or wait for the attach-detach controller to time out (6 minutes by default).
- Rolling update stuck: run `kubectl rollout status statefulset/<name>` to see which ordinal is blocking.
- StatefulSet Pods cannot resolve each other by DNS: confirm the Service is Headless (`clusterIP: None`) and that `spec.serviceName` in the StatefulSet matches the Headless Service name. Test resolution: `kubectl exec pod-0 -- nslookup pod-1.<service>.<namespace>.svc.cluster.local`.
- Orphaned PVCs after decommissioning: delete them explicitly, e.g. `kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 ...` (kubectl does not expand `*` wildcards).

Stateless apps are easy. You spin up ten identical pods, kill any three, Kubernetes replaces them; nobody cares. But the real world is full of systems that refuse to be stateless: databases, message brokers, distributed caches, search engines. These systems have opinions. Elasticsearch node 2 needs to rejoin the cluster as Elasticsearch node 2, not as some random newcomer. Kafka broker 0 owns specific partitions and cannot pretend to be a fresh broker without corrupting data.
StatefulSets exist precisely to give Kubernetes the vocabulary to reason about identity, ordering, and sticky storage. They provide three guarantees that Deployments fundamentally cannot: a stable, unique network identity that survives pod restarts; stable, persistent storage that follows the pod around regardless of which node it lands on; and ordered, graceful deployment and scaling.
This is not a getting-started guide. It covers the controller loop internals, PVC ownership tracking, the role of the Headless Service, why pod ordinals matter for rolling updates, and the exact failure modes that bite teams in production.
Stable Identity: Network Names and Pod Ordinals
The defining feature of a StatefulSet is stable identity. Each Pod receives a unique, predictable name based on the StatefulSet name and an ordinal index: <statefulset-name>-0, <statefulset-name>-1, <statefulset-name>-2. This identity persists across restarts, reschedules, and even node failures. If pod-2 is rescheduled to a different node, it is still pod-2.
This identity extends to DNS. When a StatefulSet specifies a Headless Service via spec.serviceName, Kubernetes creates DNS A records for each Pod: pod-0.service.namespace.svc.cluster.local. These DNS names resolve directly to the Pod's IP address. When the Pod restarts with a new IP, the DNS record is updated automatically.
This is fundamentally different from Deployments, where Pods are interchangeable and have random names. Stateful systems rely on this identity for peer discovery, leader election, and data partitioning. A Kafka broker must rejoin the cluster with the same identity to resume ownership of its partitions.
```yaml
# StatefulSet with stable identity and Headless Service.
# Each Pod is reachable as: postgres-0.postgres.production.svc.cluster.local
# Headless Service: DNS returns Pod IPs directly.
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  clusterIP: None  # Headless: no virtual IP
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
---
# StatefulSet: stable identity, ordered operations, sticky storage.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres  # Must match the Headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 5
            periodSeconds: 5
  # volumeClaimTemplates: creates a PVC for each Pod.
  # PVC name format: <template-name>-<statefulset-name>-<ordinal>
  # e.g., data-postgres-0, data-postgres-1, data-postgres-2
  # These PVCs persist even after the StatefulSet is deleted.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi

# Result: 3 Pods created in order (postgres-0, postgres-1, postgres-2),
# each with its own PVC: data-postgres-0, data-postgres-1, data-postgres-2
# DNS entries:
#   postgres-0.postgres.production.svc.cluster.local -> <pod-0-ip>
#   postgres-1.postgres.production.svc.cluster.local -> <pod-1-ip>
#   postgres-2.postgres.production.svc.cluster.local -> <pod-2-ip>
```
- Pod name: stable across restarts. postgres-2 is always postgres-2.
- DNS: pod-0.service.ns.svc.cluster.local. Updated on Pod IP change.
- Storage: PVC follows the Pod. Same PVC is re-attached on reschedule.
- Ordinal: determines creation order (0, 1, 2) and deletion order (2, 1, 0).
- Headless Service is required. Without it, DNS records are not created.
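Every name in the bullets above is derivable from the spec alone, which is what makes client-side peer discovery possible. A minimal sketch in plain Python (the helper and the sample values are illustrative, not a Kubernetes API):

```python
# Derive the stable identities a StatefulSet assigns to each Pod.
# The naming rules are deterministic: Pod name, PVC name, and DNS
# record can all be computed from the spec alone.

def statefulset_identities(sts_name, service_name, namespace,
                           replicas, volume_template="data"):
    """Return the per-ordinal identity records for each Pod."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{sts_name}-{ordinal}"
        identities.append({
            "pod": pod,
            # PVC format: <template-name>-<statefulset-name>-<ordinal>
            "pvc": f"{volume_template}-{sts_name}-{ordinal}",
            # Per-Pod DNS requires the Headless Service (clusterIP: None)
            "dns": f"{pod}.{service_name}.{namespace}.svc.cluster.local",
        })
    return identities

for ident in statefulset_identities("postgres", "postgres", "production", 3):
    print(ident["pod"], ident["pvc"], ident["dns"])
# postgres-0 data-postgres-0 postgres-0.postgres.production.svc.cluster.local
# postgres-1 data-postgres-1 postgres-1.postgres.production.svc.cluster.local
# postgres-2 data-postgres-2 postgres-2.postgres.production.svc.cluster.local
```

Because these names survive restarts and reschedules, peers can be listed statically in configuration, as the etcd example later in this article does.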
When a Pod is rescheduled to another node, its volume must first be detached from the old node; the attach-detach controller only force-detaches after a timeout (about 6 minutes by default; its reconcile interval is tunable via `--attach-detach-reconcile-sync-period`). During this window, the Pod cannot start and the StatefulSet cannot proceed to the next ordinal. For critical databases, this delay can cause quorum loss in a 3-node cluster.

Ordered Operations: Creation, Deletion, and Rolling Updates
StatefulSets enforce strict ordering on all lifecycle operations. Pods are created sequentially from ordinal 0 to N-1; Pod N is not created until Pod N-1 is Running and Ready. Pods are deleted in reverse order: N-1 first, then N-2, down to 0. Rolling updates proceed in the same order as deletion: the highest ordinal is replaced first.
This ordering is critical for systems that need quorum during bootstrap. A 3-node etcd cluster needs at least 2 nodes to form quorum. If all 3 Pods start simultaneously, none can form quorum because they all try to discover peers that do not exist yet. Ordered creation ensures pod-0 starts first, pod-1 joins pod-0, and pod-2 joins the existing 2-node cluster.
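The quorum argument can be made concrete with a toy timeline (illustrative Python, not the real controller). Assume each Pod needs 30 seconds to pass its readiness probe, and that a quorum-based node looks for Ready peers at the moment it boots:

```python
# Toy timeline comparing OrderedReady and Parallel pod management.
# Assumption: each Pod becomes Ready 30s after it starts (illustrative).
READY_DELAY = 30  # seconds

def timeline(replicas, ordered):
    """Return (pod, start_time, ready_time, peers_ready_at_start) tuples."""
    events = []
    for i in range(replicas):
        # OrderedReady: pod i starts only once pod i-1 is Ready.
        # Parallel: every pod starts at t=0.
        start = i * READY_DELAY if ordered else 0
        # Peers that are already Ready when this pod boots.
        peers = [f"pod-{j}" for j in range(i)] if ordered else []
        events.append((f"pod-{i}", start, start + READY_DELAY, peers))
    return events

for pod, start, ready, peers in timeline(3, ordered=True):
    print(f"{pod}: starts t={start}s, Ready t={ready}s, peers at boot: {peers}")
# OrderedReady: pod-2 boots into an existing 2-node cluster and can join quorum.

for pod, start, ready, peers in timeline(3, ordered=False):
    print(f"{pod}: starts t={start}s, peers at boot: {peers}")
# Parallel: every pod boots at t=0 into an empty cluster; there is no quorum to join.
```

Total deploy time is replicas × 30s under OrderedReady versus a flat 30s under Parallel, which is exactly the trade-off: speed against a sane bootstrap order.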
The podManagementPolicy field controls this behavior. The default is OrderedReady: Pods are created and deleted one at a time in ordinal order. The alternative is Parallel: Pods are created and deleted simultaneously, like a Deployment. Parallel is faster but breaks systems that need an ordered bootstrap.
```yaml
# StatefulSet with OrderedReady (default) and Parallel comparison.
# OrderedReady: Pods created one at a time. Slow but safe for quorum-based systems.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd  # Pod names (etcd-0, ...) must match the names in ETCD_INITIAL_CLUSTER
  namespace: production
spec:
  serviceName: etcd
  replicas: 3
  podManagementPolicy: OrderedReady  # Default. Sequential creation/deletion.
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.12
          ports:
            - containerPort: 2379
              name: client
            - containerPort: 2380
              name: peer
          env:
            - name: ETCD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ETCD_INITIAL_CLUSTER
              value: "etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380"
            - name: ETCD_INITIAL_CLUSTER_STATE
              value: "new"
            - name: ETCD_INITIAL_CLUSTER_TOKEN
              value: "etcd-cluster-token"
            - name: ETCD_DATA_DIR
              value: "/var/lib/etcd/data"
          volumeMounts:
            - name: data
              mountPath: /var/lib/etcd
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
# Parallel: All Pods created simultaneously. Fast but unsafe for quorum bootstrap.
# Use only for systems that do not require ordered startup.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
  namespace: production
spec:
  serviceName: redis
  replicas: 3
  podManagementPolicy: Parallel  # All Pods created at once.
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

# etcd: Pods created sequentially (etcd-0, then etcd-1, then etcd-2).
# redis-cache: All 3 Pods created simultaneously.
```
- OrderedReady: sequential creation. Safe for quorum. Slow at scale.
- Parallel: simultaneous creation. Fast. Unsafe for quorum bootstrap.
- Rolling updates always proceed in reverse ordinal order (highest first), regardless of podManagementPolicy.
- Scale-down always follows reverse ordinal order regardless of podManagementPolicy.
- OnDelete update strategy: Pods are not updated until manually deleted. Gives full control.
- Monitor rollouts with `kubectl rollout status statefulset/<name>`: if a single ordinal is stuck, the entire rollout blocks.

PersistentVolumeClaim Lifecycle: Ownership, Orphans, and Reclaim
The volumeClaimTemplates field in a StatefulSet is a template for creating PVCs. When a StatefulSet Pod is created, Kubernetes creates a PVC from the template with a deterministic name: <template-name>-<statefulset-name>-<ordinal>. For example, a StatefulSet named postgres with a volumeClaimTemplate named data creates PVCs: data-postgres-0, data-postgres-1, data-postgres-2.
These PVCs are owned by the StatefulSet but are NOT deleted when the StatefulSet is deleted. This is by design — the data must persist so it can be re-attached if the StatefulSet is re-created. However, this creates a common production pitfall: orphaned PVCs that consume storage indefinitely.
The reclaim policy on the underlying StorageClass determines what happens to the PersistentVolume when the PVC is finally deleted. Retain keeps the PV and its data; Delete removes both. For dynamically provisioned volumes the default is Delete unless the StorageClass specifies otherwise.
```yaml
# PVC lifecycle management for StatefulSets.
# StorageClass with Delete reclaim policy.
# When the PVC is deleted, the underlying PV and its data are also deleted.
# Use for ephemeral or reproducible data.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-deletable
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
reclaimPolicy: Delete  # PV is deleted when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StorageClass with Retain reclaim policy.
# When the PVC is deleted, the PV is kept (but unbound).
# Use for critical data that must survive PVC deletion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
reclaimPolicy: Retain  # PV is kept when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StatefulSet using the deletable storage class.
# Note: the PVCs still outlive the StatefulSet; the Delete reclaim
# policy only takes effect once you delete the PVCs explicitly.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: production
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          ports:
            - containerPort: 9092
          env:
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  # Pod-index label requires Kubernetes 1.28+
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
            - name: KAFKA_LOG_DIRS
              value: /var/lib/kafka/data
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd-deletable
        resources:
          requests:
            storage: 500Gi

# PVCs created: data-kafka-0, data-kafka-1, data-kafka-2 (500Gi each).
# Total storage: 1.5TB. These persist even after StatefulSet deletion.
```
- PVC name is deterministic: <template>-<sts-name>-<ordinal>.
- Kubernetes matches by name. Existing PVCs are re-attached, new specs are ignored.
- The PVC's storage class is immutable after creation; the requested size can only be increased (via volume expansion), never decreased.
- To change storage class: delete StatefulSet, delete PVCs, re-create StatefulSet.
- To resize PVC: set allowVolumeExpansion: true on StorageClass, then edit PVC spec.resources.requests.storage.
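Because the PVC name format is deterministic, orphan detection is mechanical. A sketch of the matching logic in plain Python (the helper is hypothetical; in practice you would feed it names from `kubectl get pvc`):

```python
# Flag orphaned StatefulSet PVCs using the deterministic name format
# <template>-<sts-name>-<ordinal>. A PVC counts as "orphaned" here if
# its ordinal is >= the current replica count (scale-down leftover) or
# the StatefulSet no longer exists (decommission leftover).
import re

def orphaned_pvcs(pvc_names, template, sts_name, replicas):
    """replicas=None means the StatefulSet itself was deleted."""
    pattern = re.compile(rf"^{re.escape(template)}-{re.escape(sts_name)}-(\d+)$")
    orphans = []
    for name in pvc_names:
        m = pattern.match(name)
        if not m:
            continue  # belongs to some other workload
        ordinal = int(m.group(1))
        if replicas is None or ordinal >= replicas:
            orphans.append(name)
    return orphans

pvcs = ["data-postgres-0", "data-postgres-1", "data-postgres-2", "data-postgres-3"]
print(orphaned_pvcs(pvcs, "data", "postgres", replicas=2))
# ['data-postgres-2', 'data-postgres-3']  (left behind by a scale-down)
print(orphaned_pvcs(pvcs, "data", "postgres", replicas=None))
# all four: the StatefulSet itself is gone
```

The same rule is what makes re-creation safe: a re-created StatefulSet re-adopts PVCs purely by this name match, whatever their current spec says.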
Update Strategies: RollingUpdate vs OnDelete
StatefulSets support two update strategies: RollingUpdate (default) and OnDelete. The choice determines how Pod template changes (image update, env var change) are propagated to existing Pods.
RollingUpdate updates Pods one at a time in reverse ordinal order (highest first), waiting for each Pod to be Ready before proceeding to the next. This is the safe default but slow. The maxUnavailable field (alpha since Kubernetes 1.24, behind the MaxUnavailableStatefulSet feature gate) controls how many Pods can be unavailable during the update.
OnDelete does not automatically update Pods. When the Pod template is changed, existing Pods continue running the old spec. The update is applied only when a Pod is manually deleted. Kubernetes recreates the Pod with the new spec. This gives full control over update timing but requires manual intervention.
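The scheduling behind RollingUpdate is easy to state precisely. A sketch of its batch order in plain Python (hypothetical helper; the real controller additionally waits for each batch to pass readiness before starting the next):

```python
# Sketch of StatefulSet RollingUpdate scheduling: Pods are replaced from
# the highest ordinal down to 0, with at most `max_unavailable` Pods
# down at once.

def rolling_update_batches(replicas, max_unavailable=1):
    """Return the ordinals updated in each step, highest first."""
    ordinals = list(range(replicas - 1, -1, -1))  # N-1, ..., 1, 0
    return [ordinals[i:i + max_unavailable]
            for i in range(0, len(ordinals), max_unavailable)]

print(rolling_update_batches(5, max_unavailable=1))
# [[4], [3], [2], [1], [0]]  strict one-at-a-time, highest ordinal first
print(rolling_update_batches(5, max_unavailable=2))
# [[4, 3], [2, 1], [0]]  faster, but two Pods down per step
```

OnDelete, by contrast, has no schedule at all: the operator chooses when each ordinal restarts, which is the whole point for systems where a restart is expensive.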
```yaml
# StatefulSet update strategies.
# RollingUpdate (default): automatic, ordered updates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
  namespace: production
spec:
  serviceName: zookeeper
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Allow 1 Pod to be unavailable during the update
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.8
          ports:
            - containerPort: 2181
          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - "echo ruok | nc localhost 2181 | grep imok"
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: data
              mountPath: /data
            - name: datalog
              mountPath: /datalog
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
    - metadata:
        name: datalog
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
# OnDelete: manual update control. No automatic Pod recreation on template change.
# (Abridged: remaining fields mirror the RollingUpdate example above.)
apiVersion: apps/v1
kind: StatefulSet
spec:
  updateStrategy:
    type: OnDelete
```

Use OnDelete for systems where each restart has significant operational impact (rebalancing, reindexing, replication catch-up).
PodDisruptionBudgets and StatefulSet Availability
PodDisruptionBudgets (PDBs) are critical for StatefulSets. They prevent voluntary disruptions such as node drains, cluster upgrades, and preemptions from evicting too many Pods simultaneously: the eviction API refuses any eviction that would take the group below its budget.
For a 3-node etcd cluster, set minAvailable: 2. This ensures that a node drain cannot break quorum: with one Pod already evicted, evicting a second would leave fewer than 2 Ready Pods, so the drain blocks until the first Pod is rescheduled and Ready.
PDBs only block voluntary disruption (drain, upgrade, preemption). They do NOT protect against involuntary disruption (node crash, OOMKill, kernel panic). This distinction is critical: PDBs are a guardrail for planned maintenance, not a safety net for unplanned failures.
```yaml
# PodDisruptionBudget for a 3-node etcd cluster.
# Ensures at least 2 Pods are always available (quorum).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: etcd
---
# PodDisruptionBudget using maxUnavailable for a 5-node Kafka cluster.
# Allows at most 1 Pod to be down at any time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: kafka
```

```
$ kubectl get pdb -n production
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
etcd-pdb    2               N/A               1                     5d
kafka-pdb   N/A             1                 0                     5d
```

ALLOWED DISRUPTIONS = 0 means the PDB is currently blocking all voluntary evictions.
- minAvailable: 2 on a 3-replica cluster = 1 Pod can be evicted.
- maxUnavailable: 1 on a 3-replica cluster = same result, different expression.
- PDBs block voluntary disruption only (drain, upgrade, preemption).
- They do NOT protect against involuntary disruption (crash, OOMKill).
- Setting minAvailable equal to replica count blocks all maintenance. Use quorum size instead.
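The ALLOWED DISRUPTIONS column in `kubectl get pdb` is simple arithmetic over the bullets above. A sketch in plain Python (hypothetical helper; a simplification of what the disruption controller computes from healthy Pod counts):

```python
# Compute a PDB's allowed disruptions from the current number of
# healthy Pods and the budget, mirroring the two spec styles.

def allowed_disruptions(healthy, *, min_available=None,
                        max_unavailable=None, replicas=None):
    if min_available is not None:
        # minAvailable: can evict down to the floor, never below it.
        return max(0, healthy - min_available)
    if max_unavailable is not None and replicas is not None:
        # maxUnavailable: Pods already down eat into the budget.
        already_down = replicas - healthy
        return max(0, max_unavailable - already_down)
    raise ValueError("specify min_available, or max_unavailable with replicas")

# 3-node etcd, all healthy, minAvailable: 2 -> one Pod may be evicted.
print(allowed_disruptions(3, min_available=2))                # 1
# 5-node Kafka with one Pod already down, maxUnavailable: 1 ->
# zero: the PDB now blocks all voluntary evictions.
print(allowed_disruptions(4, max_unavailable=1, replicas=5))  # 0
```

This also shows why minAvailable equal to the replica count is a trap: with every Pod healthy, allowed disruptions is already zero, so no node can ever be drained.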
| Aspect | Deployment | StatefulSet | DaemonSet |
|---|---|---|---|
| Pod identity | Random, interchangeable | Stable, ordinal (pod-0, pod-1) | One per node (or subset) |
| Pod naming | random-hash | sts-name-ordinal | daemon-hash |
| Storage | Ephemeral or shared PV | Per-Pod PVC (sticky) | HostPath or shared PV |
| Scaling | Horizontal (free) | Ordered (sequential) | Automatic (per-node) |
| Rolling update | maxSurge + maxUnavailable | RollingUpdate (reverse ordinal) or OnDelete | RollingUpdate or OnDelete |
| DNS identity | Service VIP (load-balanced) | Per-Pod DNS via Headless Service | Service VIP (load-balanced) |
| Self-healing | Yes (replace any Pod) | Yes (replace with same identity) | Yes (replace on same node) |
| Creation order | Parallel | Sequential (0, 1, 2) | Parallel (one per node) |
| Deletion order | Parallel | Reverse (N, N-1, ..., 0) | Parallel |
| Use case | Stateless APIs, web servers | Databases, Kafka, ZooKeeper, etcd | Log agents, node-exporter, CNI agents |
🎯 Key Takeaways
- StatefulSets provide three guarantees: stable identity (name + DNS), sticky storage (per-Pod PVC), and ordered operations (sequential creation/deletion).
- StatefulSet PVCs persist after StatefulSet deletion. Always delete PVCs explicitly when decommissioning. Monitor for orphaned PVCs.
- OrderedReady is the safe default for quorum-based systems. Parallel breaks consensus bootstrap. Rolling updates always follow reverse ordinal order regardless of podManagementPolicy.
- OnDelete is the production standard for large data systems where each restart triggers hours of rebalancing.
- PDBs are mandatory for quorum-based StatefulSets. Set minAvailable to quorum size, not replica count.
- PVC name matching is by name, not spec. To change storage class, delete PVCs first, then re-create the StatefulSet.
- The Headless Service is mandatory for per-Pod DNS. Without it, peer discovery fails and the cluster cannot bootstrap.
⚠ Common Mistakes to Avoid
- Deleting a StatefulSet without deleting its PVCs: they persist indefinitely, consuming storage and blocking re-creation with a different storage config.
- Using Parallel podManagementPolicy for quorum-based systems, which breaks cluster bootstrap.
- Setting a PDB's minAvailable equal to the replica count, which blocks all node maintenance.
- Forgetting the Headless Service (or mismatching spec.serviceName), so per-Pod DNS records are never created and peer discovery fails.
Interview Questions on This Topic
- Q: Explain the three guarantees a StatefulSet provides that a Deployment cannot. Why are these important for stateful workloads?
- Q: What happens when you delete a StatefulSet? What happens to its PVCs? How do you properly decommission a StatefulSet?
- Q: Explain the difference between OrderedReady and Parallel podManagementPolicy. When would you use each?
- Q: A 3-node etcd cluster is running as a StatefulSet. A node fails and one Pod is rescheduled. The new Pod stays Pending for 5 minutes. What is happening and how do you fix it?
- Q: How does Kubernetes match PVCs when a StatefulSet is re-created? What happens if you change the storage class in the template?
- Q: Explain the RollingUpdate vs OnDelete update strategies. When would you choose OnDelete?
- Q: What is a PodDisruptionBudget and why is it critical for StatefulSets? What is the most common PDB misconfiguration?
- Q: How does a StatefulSet Pod discover its peers? What role does the Headless Service play?
- Q: Describe the PVC lifecycle for a StatefulSet. What is the reclaim policy and how does it affect storage costs?
- Q: How would you design a zero-downtime upgrade strategy for a 5-node Elasticsearch cluster running as a StatefulSet?
Frequently Asked Questions
What is the difference between a Deployment and a StatefulSet?
A Deployment manages interchangeable Pods with no stable identity. Pods get random names, ephemeral storage, and are created/deleted in parallel. A StatefulSet manages Pods with stable identity (ordinal names), sticky storage (per-Pod PVCs), and ordered operations (sequential creation/deletion). Use Deployments for stateless apps. Use StatefulSets for databases, message brokers, and distributed systems that need peer discovery.
Why do StatefulSet Pods need a Headless Service?
The Headless Service (clusterIP: None) creates DNS A records for each Pod individually: pod-0.service.ns.svc.cluster.local. Without it, DNS returns only the ClusterIP (if using a regular Service), and you cannot address specific Pods. Peer discovery in systems like Kafka, ZooKeeper, and etcd requires individual Pod DNS names.
What happens to PVCs when a StatefulSet is deleted?
PVCs are NOT deleted when the StatefulSet is deleted. They persist indefinitely, consuming storage. If you re-create the StatefulSet with the same name, Kubernetes re-attaches the existing PVCs by name. To start fresh, you must explicitly delete the PVCs, e.g. `kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 data-<sts-name>-2` (kubectl does not expand `*` wildcards; list each PVC or use a label selector).
When should I use OnDelete instead of RollingUpdate?
Use OnDelete for large data systems (Elasticsearch, Cassandra, CockroachDB) where each Pod restart triggers significant operational overhead like shard rebalancing or replication catch-up. OnDelete gives you manual control: update the spec, then delete Pods one at a time during maintenance windows, waiting for cluster health to recover between restarts.
How do I resize a StatefulSet PVC?
Set allowVolumeExpansion: true on the StorageClass. Then edit the PVC's spec.resources.requests.storage directly. Kubernetes will expand the underlying volume. Note: you cannot change the storage class — only the size. Some volumes support online expansion (no Pod restart required). Others require the Pod to be restarted.
What is the most common StatefulSet production mistake?
Deleting a StatefulSet without deleting its PVCs. The PVCs persist, consume storage, and block re-creation with different storage configuration. Always delete PVCs explicitly when decommissioning a StatefulSet, and monitor for orphaned PVCs to prevent silent storage leaks.