
Kubernetes StatefulSets Explained — Internals, Ordering, and Production Gotchas

Kubernetes StatefulSets deep dive: stable identity, ordered rollouts, PVC lifecycle, headless services, and production gotchas every senior engineer must know.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • StatefulSets provide three guarantees: stable identity (name + DNS), sticky storage (per-Pod PVC), and ordered operations (sequential creation/deletion).
  • StatefulSet PVCs persist after StatefulSet deletion. Always delete PVCs explicitly when decommissioning. Monitor for orphaned PVCs.
  • OrderedReady is the safe default for quorum-based systems. Parallel breaks consensus bootstrap. Rolling updates always proceed in reverse ordinal order.
Quick Answer
  • Stable identity: Each Pod gets a persistent name (pod-0, pod-1) and a stable DNS entry via a Headless Service.
  • Stable storage: Each Pod gets its own PersistentVolumeClaim that follows it across restarts and reschedules.
  • Ordered operations: Pods are created sequentially (0, 1, 2) and deleted in reverse (2, 1, 0). Rolling updates proceed in reverse ordinal order, like deletion.
  • Headless Service: clusterIP: None. DNS returns Pod IPs directly. Each Pod is reachable as pod-0.service.ns.svc.cluster.local.
  • Ordered operations are slow. A 10-replica StatefulSet can take roughly 10x longer to deploy than an equivalent Deployment, because each Pod must be Ready before the next starts.
  • Parallel mode (podManagementPolicy: Parallel) is faster but breaks cluster bootstrap for systems that need quorum.
  • Deleting a StatefulSet without deleting its PVCs. The PVCs persist indefinitely, consuming storage and blocking re-creation of the StatefulSet with different storage config.
🚨 START HERE
StatefulSet Triage Commands
Rapid commands to isolate StatefulSet lifecycle and storage issues.
🟡 Pod stuck in Pending.
Immediate Action: Check PVC binding status and StorageClass.
Commands
kubectl describe pvc data-<sts-name>-<ordinal> -n <namespace>
kubectl get storageclass
Fix Now: If the PVC is Pending, the StorageClass provisioner may be down. Check PVC events: `kubectl get events -n <ns> --field-selector involvedObject.name=data-<sts-name>-<ordinal>`.
🟡 Pod stuck in ContainerCreating with volume attach error.
Immediate Action: Check if the PV is still attached to a previous node.
Commands
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
kubectl get volumeattachment | grep <pv-name>
Fix Now: If the VolumeAttachment shows the PV attached to a dead node, delete the VolumeAttachment object. The attach-detach controller will retry on the new node.
🟡 Rolling update stuck.
Immediate Action: Identify the blocking ordinal.
Commands
kubectl rollout status statefulset/<name> -n <namespace> --timeout=30s
kubectl get pods -n <namespace> -l app=<label> --sort-by=.metadata.name -o wide | grep -v Running
Fix Now: If a specific ordinal is not Ready, check its logs and readiness probe. Fix the issue and the rollout will automatically proceed to the next ordinal.
🟡 PVCs consuming unexpected storage after StatefulSet deletion.
Immediate Action: List orphaned PVCs.
Commands
kubectl get pvc -n <namespace> | grep <sts-name>
kubectl describe pvc data-<sts-name>-0 -n <namespace> | grep -A 5 Status
Fix Now: If PVCs exist without a bound Pod, they are orphaned. Delete them explicitly by name (kubectl does not expand wildcards): `kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 -n <namespace>`. Check the reclaim policy on the StorageClass.
🟡 DNS resolution failing between StatefulSet Pods.
Immediate Action: Verify the Headless Service and serviceName match.
Commands
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.clusterIP}'
kubectl get statefulset <sts-name> -n <namespace> -o jsonpath='{.spec.serviceName}'
Fix Now: If clusterIP is not 'None', the Service is not Headless. If serviceName does not match the Headless Service name, DNS records are not created. Fix the mismatch.
Production Incident: StatefulSet PVC Orphan Caused 2TB Storage Leak and Blocked Cluster Migration
A team deleted a 5-replica PostgreSQL StatefulSet to migrate to a new storage class. The Pods were deleted but the 5 PVCs persisted, consuming 2TB of premium SSD storage. When they tried to re-create the StatefulSet with the new storage class, the old PVCs were re-attached, ignoring the new storage configuration.
Symptom: After deleting the StatefulSet, the cloud bill showed 2TB of unattached persistent disks. When the new StatefulSet was created, Pods attached to the old PVCs with the old storage class instead of the new ones. The team could not understand why the new storage configuration was not being applied.
Assumption: Deleting the StatefulSet would clean up all associated resources, including PVCs.
Root cause: StatefulSets use a PersistentVolumeClaim template (volumeClaimTemplates) that creates a PVC for each Pod. These PVCs are created by the StatefulSet but are retained by default (the persistentVolumeClaimRetentionPolicy whenDeleted behavior defaults to Retain). When the StatefulSet is deleted, the PVCs are NOT deleted — they persist in the namespace, retaining their data and their storage class. When a new StatefulSet with the same name and PVC template names is created, Kubernetes matches the existing PVCs by name and re-attaches them, completely ignoring any changes to the storage class or size in the new template.
Fix:
1. Manually deleted the orphaned PVCs: kubectl delete pvc data-postgres-0 data-postgres-1 data-postgres-2 data-postgres-3 data-postgres-4 -n production.
2. Verified the underlying PersistentVolumes were released and deleted (or set reclaimPolicy: Delete on the StorageClass).
3. Re-created the StatefulSet with the new storage class in volumeClaimTemplates.
4. Added a cleanup script to the team's runbook that explicitly deletes PVCs after StatefulSet deletion.
5. Set up monitoring for unbound PVCs: alert when PVCs exist without a bound Pod for more than 1 hour.
Key Lesson
  • StatefulSet PVCs are NOT deleted when the StatefulSet is deleted. They persist indefinitely unless explicitly removed.
  • PVC names are deterministic: data-<statefulset-name>-<ordinal>. Kubernetes matches by name, not by spec. Changing the storage class in the template has no effect on existing PVCs.
  • Always delete PVCs explicitly when decommissioning a StatefulSet. Add this to your runbook.
  • Monitor for unbound PVCs. They consume storage and can cause billing surprises.
  • Before migrating storage classes, back up data, delete the StatefulSet, delete the PVCs, then re-create with the new storage class.
Production Debug Guide
Symptom-first investigation path for StatefulSet failures.
StatefulSet Pod stuck in Pending.
Check if the PVC is bound. Run kubectl describe pvc data-<sts-name>-<ordinal>. If the PVC is Pending, the StorageClass may not exist or the provisioner may be down. Check node affinity — the PV may be bound to a specific node that is unavailable.
StatefulSet Pod stuck in CrashLoopBackOff after node failure.
The Pod is likely trying to re-attach a PVC that is still attached to the old (failed) node. Check PVC status: kubectl get pvc -n <ns>. If the PV is still attached to the old node, force-detach it or wait for the attach-detach controller to time out (6 minutes by default).
Rolling update stuck on a specific ordinal (e.g., pod-3).
StatefulSets update Pods in reverse ordinal order, highest first. If pod-3 is not Ready, the controller will not move on to pod-2. Check pod-3's readiness probe, logs, and events. Use kubectl rollout status statefulset/<name> to see which ordinal is blocking.
StatefulSet scale-down stuck. Pods not being deleted.
StatefulSets delete Pods in reverse ordinal order (highest first). If pod-2 is not terminating, pod-1 and pod-0 will not be deleted. Check for finalizers on the Pod, PVC detach issues, or PodDisruptionBudget conflicts.
StatefulSet Pods cannot resolve each other by DNS name.
Verify the Headless Service exists and has clusterIP: None. Check that spec.serviceName in the StatefulSet matches the Headless Service name. Test DNS resolution: kubectl exec pod-0 -- nslookup pod-1.<service>.<namespace>.svc.cluster.local.
New StatefulSet Pod attached to old PVC with wrong data.
This is expected behavior. Kubernetes matches PVCs by name. If you re-create a StatefulSet with the same name, it re-attaches the existing PVCs. To start fresh, delete the PVCs first by name, e.g. kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 -n <namespace>.
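The orphaned-PVC symptom above lends itself to a scripted check. A minimal Python sketch of the matching logic; in practice you would populate the inputs from `kubectl get pvc -o json` and `kubectl get pods -o json`, and all names here are illustrative:

```python
def find_orphaned_pvcs(pvc_names, pod_claims):
    """Return PVC names not referenced by any running Pod.

    pvc_names: iterable of PVC names in the namespace.
    pod_claims: dict mapping Pod name to the PVC names it mounts.
    """
    claimed = {pvc for claims in pod_claims.values() for pvc in claims}
    return sorted(set(pvc_names) - claimed)

# After scaling postgres from 5 to 3 replicas, two PVCs are left behind.
pvcs = [f"data-postgres-{i}" for i in range(5)]
pods = {f"postgres-{i}": [f"data-postgres-{i}"] for i in range(3)}
print(find_orphaned_pvcs(pvcs, pods))
# ['data-postgres-3', 'data-postgres-4']
```

A cron job running this comparison weekly catches the silent storage leaks described in the incident above.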

Stateless apps are easy. You spin up ten identical pods, kill any three, Kubernetes replaces them — nobody cares. But the real world is full of systems that refuse to be stateless: databases, message brokers, distributed caches, search engines. These systems have opinions. Elasticsearch node 2 needs to rejoin the cluster as Elasticsearch node 2, not as some random newcomer. Kafka broker 0 owns specific partitions and cannot pretend to be a fresh broker without corrupting data.

StatefulSets exist precisely to give Kubernetes the vocabulary to reason about identity, ordering, and sticky storage. They provide three guarantees that Deployments fundamentally cannot: a stable, unique network identity that survives pod restarts; stable, persistent storage that follows the pod around regardless of which node it lands on; and ordered, graceful deployment and scaling.

This is not a getting-started guide. It covers the controller loop internals, PVC ownership tracking, the role of the Headless Service, why pod ordinals matter for rolling updates, and the exact failure modes that bite teams in production.

Stable Identity: Network Names and Pod Ordinals

The defining feature of a StatefulSet is stable identity. Each Pod receives a unique, predictable name based on the StatefulSet name and an ordinal index: <statefulset-name>-0, <statefulset-name>-1, <statefulset-name>-2. This identity persists across restarts, reschedules, and even node failures. If pod-2 is rescheduled to a different node, it is still pod-2.

This identity extends to DNS. When a StatefulSet specifies a Headless Service via spec.serviceName, Kubernetes creates DNS A records for each Pod: pod-0.service.namespace.svc.cluster.local. These DNS names resolve directly to the Pod's IP address. When the Pod restarts with a new IP, the DNS record is updated automatically.

This is fundamentally different from Deployments, where Pods are interchangeable and have random names. Stateful systems rely on this identity for peer discovery, leader election, and data partitioning. A Kafka broker must rejoin the cluster with the same identity to resume ownership of its partitions.
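The DNS naming scheme above is mechanical enough to write down as a helper. A minimal Python sketch (the function name is illustrative, not a Kubernetes API):

```python
def pod_dns_name(sts_name: str, ordinal: int, service: str, namespace: str) -> str:
    """Stable per-Pod DNS name published via the Headless Service:
    <sts-name>-<ordinal>.<service>.<namespace>.svc.cluster.local
    """
    return f"{sts_name}-{ordinal}.{service}.{namespace}.svc.cluster.local"

print(pod_dns_name("postgres", 2, "postgres", "production"))
# postgres-2.postgres.production.svc.cluster.local
```

Because the name depends only on the StatefulSet name, ordinal, Service, and namespace, peers can compute each other's addresses before any Pod exists — which is exactly what cluster bootstrap configuration relies on.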

io/thecodeforge/k8s/statefulset-identity.yaml · YAML
# StatefulSet with stable identity and Headless Service.
# Each Pod is reachable as: postgres-0.postgres.production.svc.cluster.local
# Package: io.thecodeforge.k8s

# Headless Service: DNS returns Pod IPs directly.
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  clusterIP: None              # Headless: no virtual IP
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
---
# StatefulSet: stable identity, ordered operations, sticky storage.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres        # Must match the Headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 5
            periodSeconds: 5
  # volumeClaimTemplates: creates a PVC for each Pod.
  # PVC name format: <template-name>-<statefulset-name>-<ordinal>
  # e.g., data-postgres-0, data-postgres-1, data-postgres-2
  # These PVCs persist even after the StatefulSet is deleted.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
▶ Output
# Pods created in order: postgres-0, postgres-1, postgres-2
# Each with its own PVC: data-postgres-0, data-postgres-1, data-postgres-2
# DNS entries:
# postgres-0.postgres.production.svc.cluster.local -> <pod-0-ip>
# postgres-1.postgres.production.svc.cluster.local -> <pod-1-ip>
# postgres-2.postgres.production.svc.cluster.local -> <pod-2-ip>
Mental Model
Identity Is Not Just a Name
This is why you cannot put a StatefulSet behind a regular ClusterIP Service for peer discovery. A ClusterIP Service load-balances across all Pods — you cannot address postgres-2 specifically. You need the Headless Service for individual Pod addressing.
  • Pod name: stable across restarts. postgres-2 is always postgres-2.
  • DNS: pod-0.service.ns.svc.cluster.local. Updated on Pod IP change.
  • Storage: PVC follows the Pod. Same PVC is re-attached on reschedule.
  • Ordinal: determines creation order (0, 1, 2) and deletion order (2, 1, 0).
  • Headless Service is required. Without it, DNS records are not created.
📊 Production Insight
When a StatefulSet Pod is rescheduled to a different node, there is a window where the Pod is Pending because the PVC is still attached to the old node. The attach-detach controller must detach the PV from the old node before it can be attached to the new node. This takes up to 6 minutes by default (controlled by --attach-detach-reconcile-sync-period). During this window, the Pod cannot start and the StatefulSet cannot proceed to the next ordinal. For critical databases, this delay can cause quorum loss in a 3-node cluster.
🎯 Key Takeaway
StatefulSet identity is three guarantees: stable name, stable DNS, and sticky storage. All three must persist across restarts. The Headless Service is mandatory for DNS-based peer discovery. PVC attachment delays during node failover can block the entire StatefulSet.

Ordered Operations: Creation, Deletion, and Rolling Updates

StatefulSets enforce strict ordering on all lifecycle operations. Pods are created sequentially from ordinal 0 to N-1. Pod N is not created until Pod N-1 is Running and Ready. Pods are deleted in reverse order: N-1 first, then N-2, down to 0. Rolling updates follow the same reverse order, updating the highest ordinal first.

This ordering is critical for systems that need quorum during bootstrap. A 3-node etcd cluster needs at least 2 nodes to form quorum. If all 3 Pods start simultaneously, none can form quorum because they all try to discover peers that do not exist yet. Ordered creation ensures pod-0 starts first, pod-1 joins pod-0, and pod-2 joins the existing 2-node cluster.
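The quorum arithmetic behind this is worth making explicit. A small Python sketch of majority-quorum math (helper names are illustrative):

```python
def quorum_size(replicas: int) -> int:
    """Minimum members needed for a majority quorum (Raft, Paxos, etcd)."""
    return replicas // 2 + 1

def tolerable_failures(replicas: int) -> int:
    """How many members can be lost while quorum still holds."""
    return replicas - quorum_size(replicas)

# A 3-node etcd cluster needs 2 members for quorum and tolerates 1 failure.
print(quorum_size(3), tolerable_failures(3))  # 2 1
print(quorum_size(5), tolerable_failures(5))  # 3 2
```

Note that going from 3 to 4 replicas raises the quorum to 3 without improving fault tolerance, which is why consensus clusters are sized with odd replica counts.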

The podManagementPolicy field controls this behavior. The default is OrderedReady: Pods are created and deleted one at a time in ordinal order. The alternative is Parallel: Pods are created and deleted simultaneously, like a Deployment. Parallel is faster but breaks systems that need ordered bootstrap.

io/thecodeforge/k8s/statefulset-ordering.yaml · YAML
# StatefulSet with OrderedReady (default) and Parallel comparison.
# Package: io.thecodeforge.k8s

# OrderedReady: Pods created one at a time. Slow but safe for quorum-based systems.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-cluster
  namespace: production
spec:
  serviceName: etcd
  replicas: 3
  podManagementPolicy: OrderedReady  # Default. Sequential creation/deletion.
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.12
          ports:
            - containerPort: 2379
              name: client
            - containerPort: 2380
              name: peer
          env:
            - name: ETCD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ETCD_INITIAL_CLUSTER
              value: "etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380"
            - name: ETCD_INITIAL_CLUSTER_STATE
              value: "new"
            - name: ETCD_INITIAL_CLUSTER_TOKEN
              value: "etcd-cluster-token"
            - name: ETCD_DATA_DIR
              value: "/var/lib/etcd/data"
          volumeMounts:
            - name: data
              mountPath: /var/lib/etcd
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
# Parallel: All Pods created simultaneously. Fast but unsafe for quorum bootstrap.
# Use only for systems that do not require ordered startup.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
  namespace: production
spec:
  serviceName: redis
  replicas: 3
  podManagementPolicy: Parallel    # All Pods created at once.
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
▶ Output
# etcd-cluster: Pods created in order etcd-0, etcd-1, etcd-2. Each waits for the previous to be Ready.
# redis-cache: All 3 Pods created simultaneously.
Mental Model
OrderedReady vs Parallel: The Quorum Problem
Rule of thumb: use OrderedReady for systems that use consensus protocols (Raft, Paxos). Use Parallel for systems that are independent per Pod (Redis standalone, independent workers).
  • OrderedReady: sequential creation. Safe for quorum. Slow at scale.
  • Parallel: simultaneous creation. Fast. Unsafe for quorum bootstrap.
  • Rolling updates always proceed in reverse ordinal order regardless of podManagementPolicy.
  • Scale-down always follows reverse ordinal order regardless of podManagementPolicy.
  • OnDelete update strategy: Pods are not updated until manually deleted. Gives full control.
📊 Production Insight
Rolling updates on StatefulSets are slow by design. If a StatefulSet has 10 replicas and each Pod takes 60 seconds to become Ready, a rolling update takes at least 10 minutes. For large StatefulSets, consider using the OnDelete update strategy: update the spec, then manually delete Pods one at a time during maintenance windows. This gives you full control over timing and prevents unexpected updates during peak traffic. Monitor kubectl rollout status statefulset/<name> — if a single ordinal is stuck, the entire rollout blocks.
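The back-of-the-envelope math in the insight above can be sketched directly (assuming strictly one Pod restarting at a time, the OrderedReady behavior; the function name is illustrative):

```python
def min_rollout_seconds(replicas: int, seconds_until_ready: int) -> int:
    """Lower bound for a StatefulSet rolling update: each Pod is deleted,
    recreated, and must pass its readiness probe before the next ordinal
    is touched. Real rollouts add detach/attach and scheduling delays."""
    return replicas * seconds_until_ready

# 10 replicas x 60s readiness = at least 10 minutes, excluding volume delays.
print(min_rollout_seconds(10, 60))  # 600
```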
🎯 Key Takeaway
OrderedReady is the safe default for quorum-based systems. Parallel is faster but breaks consensus bootstrap. Rolling updates always proceed in reverse ordinal order. Use OnDelete for manual control over update timing in production.

PersistentVolumeClaim Lifecycle: Ownership, Orphans, and Reclaim

The volumeClaimTemplates field in a StatefulSet is a template for creating PVCs. When a StatefulSet Pod is created, Kubernetes creates a PVC from the template with a deterministic name: <template-name>-<statefulset-name>-<ordinal>. For example, a StatefulSet named postgres with a volumeClaimTemplate named data creates PVCs: data-postgres-0, data-postgres-1, data-postgres-2.
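Since the naming rule is deterministic, it can be written down directly. A minimal sketch (an illustrative helper, not a client-go API):

```python
def pvc_name(template: str, sts_name: str, ordinal: int) -> str:
    """PVC created from volumeClaimTemplates: <template>-<sts-name>-<ordinal>."""
    return f"{template}-{sts_name}-{ordinal}"

print([pvc_name("data", "postgres", i) for i in range(3)])
# ['data-postgres-0', 'data-postgres-1', 'data-postgres-2']
```

This determinism is exactly why re-creating a StatefulSet with the same name silently re-attaches old PVCs: the names collide by construction.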

These PVCs are owned by the StatefulSet but are NOT deleted when the StatefulSet is deleted. This is by design — the data must persist so it can be re-attached if the StatefulSet is re-created. However, this creates a common production pitfall: orphaned PVCs that consume storage indefinitely.

The reclaim policy on the underlying StorageClass determines what happens to the PersistentVolume when the PVC is finally deleted. Retain keeps the PV and its data. Delete removes the PV and its data. The default varies by cloud provider.

io/thecodeforge/k8s/statefulset-pvc-lifecycle.yaml · YAML
# PVC lifecycle management for StatefulSets.
# Package: io.thecodeforge.k8s

# StorageClass with Delete reclaim policy.
# When PVC is deleted, the underlying PV and data are also deleted.
# Use for ephemeral or reproducible data.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-deletable
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
reclaimPolicy: Delete          # PV is deleted when PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StorageClass with Retain reclaim policy.
# When PVC is deleted, the PV is kept (but unbound).
# Use for critical data that must survive PVC deletion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
reclaimPolicy: Retain          # PV is kept when PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StatefulSet using the deletable storage class.
# PVCs are NOT auto-deleted when the StatefulSet is deleted; delete them explicitly.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: production
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          ports:
            - containerPort: 9092
          env:
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
            - name: KAFKA_LOG_DIRS
              value: /var/lib/kafka/data
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd-deletable
        resources:
          requests:
            storage: 500Gi
▶ Output
# PVCs created: data-kafka-0 (500Gi), data-kafka-1 (500Gi), data-kafka-2 (500Gi)
# Total storage: 1.5TB. These persist even after StatefulSet deletion.
Mental Model
PVC Name Matching: Why Storage Class Changes Are Ignored
This is the most common source of confusion during StatefulSet migrations. The new storage class in the YAML is silently ignored because the old PVCs still exist.
  • PVC name is deterministic: <template>-<sts-name>-<ordinal>.
  • Kubernetes matches by name. Existing PVCs are re-attached, new specs are ignored.
  • PVC spec (storage class, size) is immutable after creation.
  • To change storage class: delete StatefulSet, delete PVCs, re-create StatefulSet.
  • To resize PVC: set allowVolumeExpansion: true on StorageClass, then edit PVC spec.resources.requests.storage.
📊 Production Insight
Orphaned PVCs are the silent storage leak in Kubernetes. Every deleted StatefulSet leaves behind PVCs that consume cloud storage indefinitely. At scale, this can cost thousands of dollars per month. Set up monitoring for unbound PVCs (status.phase: Pending or status.phase: Bound with no Pod). Alert when PVCs exist without a corresponding Pod for more than 1 hour. Consider a cron job that identifies and reports orphaned PVCs weekly.
🎯 Key Takeaway
StatefulSet PVCs persist after StatefulSet deletion. Kubernetes matches PVCs by name, ignoring new template specs. To change storage class, delete PVCs first. Monitor for orphaned PVCs to prevent silent storage leaks.

Update Strategies: RollingUpdate vs OnDelete

StatefulSets support two update strategies: RollingUpdate (default) and OnDelete. The choice determines how Pod template changes (image update, env var change) are propagated to existing Pods.

RollingUpdate updates Pods one at a time in reverse ordinal order (highest first), waiting for each Pod to be Ready before proceeding to the next. This is the safe default but slow. The maxUnavailable field (alpha since Kubernetes 1.24, behind the MaxUnavailableStatefulSet feature gate) controls how many Pods can be unavailable during the update.

OnDelete does not automatically update Pods. When the Pod template is changed, existing Pods continue running the old spec. The update is applied only when a Pod is manually deleted. Kubernetes recreates the Pod with the new spec. This gives full control over update timing but requires manual intervention.
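When driving an OnDelete rollout by hand, it is conventional to mirror the controller's own update order and delete the highest ordinal first. A sketch of that ordering (helper and Pod names are illustrative):

```python
def ondelete_restart_order(pod_names):
    """Sort StatefulSet Pods highest-ordinal-first, the same order the
    controller itself applies rolling updates."""
    def ordinal(name: str) -> int:
        # StatefulSet Pod names always end in "-<ordinal>".
        return int(name.rsplit("-", 1)[1])
    return sorted(pod_names, key=ordinal, reverse=True)

print(ondelete_restart_order(["es-0", "es-2", "es-1"]))
# ['es-2', 'es-1', 'es-0']
```

In practice you would delete each Pod in this order with kubectl, waiting for it to return to Ready (and for cluster health to recover) before deleting the next.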

io/thecodeforge/k8s/statefulset-update-strategy.yaml · YAML
# StatefulSet update strategies.
# Package: io.thecodeforge.k8s

# RollingUpdate (default): Automatic, ordered updates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
  namespace: production
spec:
  serviceName: zookeeper
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # Allow 1 Pod to be unavailable during update
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.8
          ports:
            - containerPort: 2181
          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - "echo ruok | nc localhost 2181 | grep imok"
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: data
              mountPath: /data
            - name: datalog
              mountPath: /datalog
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
    - metadata:
        name: datalog
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
# OnDelete: Manual update control. No automatic Pod recreation on template change.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: production
spec:
  serviceName: elasticsearch
  replicas: 5
  updateStrategy:
    type: OnDelete           # Template changes apply only when Pods are manually deleted
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: elasticsearch:8.13.0   # illustrative image tag
          ports:
            - containerPort: 9200
  # volumeClaimTemplates omitted for brevity
▶ Output
# Applying a new image does NOT restart any Pods. Delete Pods one at a time
# (highest ordinal first); each is recreated with the new template.
OnDelete is the production standard for large data systems where each restart has significant operational impact (rebalancing, reindexing, replication catch-up).
📊 Production Insight
For large Elasticsearch or Cassandra clusters, RollingUpdate causes hours of unnecessary rebalancing. Each Pod restart triggers shard redistribution, which competes with application traffic for I/O and network bandwidth. Use OnDelete and restart Pods one at a time during maintenance windows, waiting for cluster health to return to green before proceeding to the next Pod. Monitor cluster health metrics (Elasticsearch: _cluster/health, Cassandra: nodetool status) between each restart.
🎯 Key Takeaway
RollingUpdate is the safe default for systems with fast startup. OnDelete is the production standard for large data systems where each restart has significant operational impact. Use maxUnavailable to control parallelism during RollingUpdate.

PodDisruptionBudgets and StatefulSet Availability

PodDisruptionBudgets (PDBs) are critical for StatefulSets. They prevent voluntary disruptions — node drains, cluster upgrades, preemptions — from evicting too many Pods simultaneously. The eviction API will not intentionally evict more Pods than the budget allows.

For a 3-node etcd cluster, set minAvailable: 2. This ensures that a node drain cannot break quorum. If the drain would evict a third Pod, it blocks until one of the evicted Pods is rescheduled and Ready.

PDBs only block voluntary disruption (drain, upgrade, preemption). They do NOT protect against involuntary disruption (node crash, OOMKill, kernel panic). This distinction is critical: PDBs are a guardrail for planned maintenance, not a safety net for unplanned failures.

io/thecodeforge/k8s/statefulset-pdb.yaml · YAML
# PodDisruptionBudget for a 3-node etcd cluster.
# Ensures at least 2 Pods are always available (quorum).
# Package: io.thecodeforge.k8s
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: etcd
---
# PodDisruptionBudget using maxUnavailable for a 5-node Kafka cluster.
# Allows at most 1 Pod to be down at any time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: kafka
▶ Output
# Verify PDB status:
# kubectl get pdb -n production
# NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# etcd-pdb    2               N/A               1                     5d
# kafka-pdb   N/A             1                 0                     5d
# ALLOWED DISRUPTIONS = 0 means the PDB is currently blocking all voluntary evictions.
Mental Model
minAvailable vs maxUnavailable
For quorum-based systems, always use minAvailable set to the quorum size. This is more intuitive than calculating maxUnavailable.
  • minAvailable: 2 on a 3-replica cluster = 1 Pod can be evicted.
  • maxUnavailable: 1 on a 3-replica cluster = same result, different expression.
  • PDBs block voluntary disruption only (drain, upgrade, preemption).
  • They do NOT protect against involuntary disruption (crash, OOMKill).
  • Setting minAvailable equal to replica count blocks all maintenance. Use quorum size instead.
📊 Production Insight
The most common PDB misconfiguration is setting minAvailable equal to the replica count. If you have 3 replicas and set minAvailable: 3, the PDB blocks all voluntary disruptions — including necessary node drains during maintenance. This forces operators to delete the PDB before draining nodes, which defeats the purpose. Set minAvailable to the quorum size (2 for 3 replicas) or use maxUnavailable: 1. During cluster upgrades, the upgrade controller respects PDBs and waits for Pods to be rescheduled before proceeding to the next node.
🎯 Key Takeaway
PDBs are mandatory for StatefulSets running quorum-based systems. Set minAvailable to the quorum size, not the replica count. PDBs only block voluntary disruption — they do not protect against node crashes. Always pair PDBs with anti-affinity rules to spread Pods across nodes.
PDB Configuration Decision Tree
If: Quorum-based system (etcd, ZooKeeper, CockroachDB)
Use: Set minAvailable to the quorum size. For 3 replicas: minAvailable: 2. For 5 replicas: minAvailable: 3.
If: Independent replicas (Redis standalone, stateless workers)
Use: Set maxUnavailable: 1 or minAvailable: N-1. Allows rolling maintenance without service degradation.
If: Single-replica StatefulSet (single PostgreSQL instance)
Use: A PDB with minAvailable: 1 blocks all voluntary disruption. Use cautiously — maintenance requires manual PDB deletion.
If: Large cluster (10+ replicas) with no quorum requirement
Use: Set maxUnavailable: 20-25% to allow parallel node drains during upgrades.
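The decision tree collapses to a few lines of logic. A hedged sketch: the thresholds follow the rules of thumb above, and the function is illustrative, not a Kubernetes API:

```python
def suggested_pdb(replicas: int, quorum_based: bool) -> dict:
    """Suggested PDB spec fields, following the decision tree above."""
    if quorum_based:
        # Protect quorum: a majority must always stay available.
        return {"minAvailable": replicas // 2 + 1}
    if replicas >= 10:
        # Large non-quorum fleet: allow roughly 25% parallel disruption.
        return {"maxUnavailable": max(1, replicas // 4)}
    # Small set of independent replicas: disrupt one at a time.
    return {"maxUnavailable": 1}

print(suggested_pdb(3, True))    # {'minAvailable': 2}
print(suggested_pdb(12, False))  # {'maxUnavailable': 3}
```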
🗂 Deployment vs StatefulSet vs DaemonSet: When to Use Each
Understanding the workload controller trade-offs for different application types.
| Aspect | Deployment | StatefulSet | DaemonSet |
| --- | --- | --- | --- |
| Pod identity | Random, interchangeable | Stable, ordinal (pod-0, pod-1) | One per node (or subset) |
| Pod naming | random-hash | sts-name-ordinal | daemon-hash |
| Storage | Ephemeral or shared PV | Per-Pod PVC (sticky) | HostPath or shared PV |
| Scaling | Horizontal (free) | Ordered (sequential) | Automatic (per-node) |
| Rolling update | maxSurge + maxUnavailable | RollingUpdate (ordinal order) or OnDelete | RollingUpdate or OnDelete |
| DNS identity | Service VIP (load-balanced) | Per-Pod DNS via Headless Service | Service VIP (load-balanced) |
| Self-healing | Yes (replace any Pod) | Yes (replace with same identity) | Yes (replace on same node) |
| Creation order | Parallel | Sequential (0, 1, 2) | Parallel (one per node) |
| Deletion order | Parallel | Reverse (N, N-1, ..., 0) | Parallel |
| Use case | Stateless APIs, web servers | Databases, Kafka, ZooKeeper, etcd | Log agents, node-exporter, CNI agents |
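To make the StatefulSet column concrete, here is a minimal skeleton tying together the pieces discussed in this article: the headless Service reference, ordered Pod management, and per-Pod storage. All names, the image, and the storage size are placeholders:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db-headless           # must reference a headless Service
  replicas: 3
  podManagementPolicy: OrderedReady  # sequential creation: 0, then 1, then 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: example/db:1.0      # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
  volumeClaimTemplates:              # one PVC per Pod: data-db-0, data-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```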

🎯 Key Takeaways

  • StatefulSets provide three guarantees: stable identity (name + DNS), sticky storage (per-Pod PVC), and ordered operations (sequential creation/deletion).
  • StatefulSet PVCs persist after StatefulSet deletion. Always delete PVCs explicitly when decommissioning. Monitor for orphaned PVCs.
  • OrderedReady is the safe default for quorum-based systems; Parallel breaks consensus bootstrap. Note that podManagementPolicy affects only creation and scaling: rolling updates always follow ordinal order.
  • OnDelete is the production standard for large data systems where each restart triggers hours of rebalancing.
  • PDBs are mandatory for quorum-based StatefulSets. Set minAvailable to quorum size, not replica count.
  • PVCs are matched by name, not by spec. To change storage class, delete the PVCs first, then re-create the StatefulSet.
  • The Headless Service is mandatory for per-Pod DNS. Without it, peer discovery fails and the cluster cannot bootstrap.

⚠ Common Mistakes to Avoid

    Deleting a StatefulSet without deleting its PVCs. The PVCs persist indefinitely, consuming storage and blocking re-creation with different storage config. Always delete PVCs explicitly.
    Using Parallel podManagementPolicy for quorum-based systems (etcd, ZooKeeper). All Pods start simultaneously and cannot discover peers, causing bootstrap failure. Use OrderedReady.
    Re-creating a StatefulSet expecting new PVCs. Kubernetes matches PVCs by name and re-attaches old PVCs, ignoring new storage class or size. Delete PVCs first.
    Not setting PodDisruptionBudgets. Node drains can terminate multiple Pods simultaneously, losing quorum. Set minAvailable to the quorum size.
    Using RollingUpdate for large data systems (Elasticsearch, Cassandra). Each restart triggers hours of rebalancing. Use OnDelete and restart during maintenance windows.
    Not configuring readiness probes. Without readiness probes, the StatefulSet considers each Pod Ready immediately, proceeding to the next ordinal even if the application is not fully started.
    Setting minAvailable equal to replica count in PDBs. This blocks all voluntary disruptions including necessary maintenance. Set minAvailable to quorum size.
    Ignoring PVC attachment delays during node failover. The attach-detach controller takes up to 6 minutes to detach PVs from failed nodes. During this time, the Pod cannot start.
    Not using volumeClaimTemplates. If you use a regular PV, it is not tied to the Pod's identity. On reschedule, the Pod may get a different PV with different data.
    Not monitoring for orphaned PVCs. Deleted StatefulSets leave behind PVCs that consume cloud storage indefinitely. Set up alerts for unbound PVCs.
    Using reclaimPolicy: Retain for ephemeral or reproducible data. Retain (the default for statically provisioned PVs) keeps PVs after PVC deletion, consuming storage. Use Delete (the default for dynamically provisioned StorageClasses) for data you can re-create.
    Not setting anti-affinity rules. StatefulSet Pods may be scheduled on the same node, creating a single point of failure. Use podAntiAffinity to spread across nodes.
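The anti-affinity rule from the last point can be sketched like this inside the Pod template (the label value is illustrative):

```yaml
# Goes under spec.template.spec of the StatefulSet
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: db                            # match this StatefulSet's Pods
        topologyKey: kubernetes.io/hostname    # no two Pods on the same node
```

With required anti-affinity, scheduling fails if there are fewer nodes than replicas; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative.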

Interview Questions on This Topic

  • Q: Explain the three guarantees a StatefulSet provides that a Deployment cannot. Why are these important for stateful workloads?
  • Q: What happens when you delete a StatefulSet? What happens to its PVCs? How do you properly decommission a StatefulSet?
  • Q: Explain the difference between OrderedReady and Parallel podManagementPolicy. When would you use each?
  • Q: A 3-node etcd cluster is running as a StatefulSet. A node fails and one Pod is rescheduled. The new Pod stays Pending for 5 minutes. What is happening and how do you fix it?
  • Q: How does Kubernetes match PVCs when a StatefulSet is re-created? What happens if you change the storage class in the template?
  • Q: Explain the RollingUpdate vs OnDelete update strategies. When would you choose OnDelete?
  • Q: What is a PodDisruptionBudget and why is it critical for StatefulSets? What is the most common PDB misconfiguration?
  • Q: How does a StatefulSet Pod discover its peers? What role does the Headless Service play?
  • Q: Describe the PVC lifecycle for a StatefulSet. What is the reclaim policy and how does it affect storage costs?
  • Q: How would you design a zero-downtime upgrade strategy for a 5-node Elasticsearch cluster running as a StatefulSet?

Frequently Asked Questions

What is the difference between a Deployment and a StatefulSet?

A Deployment manages interchangeable Pods with no stable identity. Pods get random names, ephemeral storage, and are created/deleted in parallel. A StatefulSet manages Pods with stable identity (ordinal names), sticky storage (per-Pod PVCs), and ordered operations (sequential creation/deletion). Use Deployments for stateless apps. Use StatefulSets for databases, message brokers, and distributed systems that need peer discovery.

Why do StatefulSet Pods need a Headless Service?

The Headless Service (clusterIP: None) creates DNS A records for each Pod individually: pod-0.service.ns.svc.cluster.local. Without it, DNS returns only the ClusterIP (if using a regular Service), and you cannot address specific Pods. Peer discovery in systems like Kafka, ZooKeeper, and etcd requires individual Pod DNS names.
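A minimal headless Service definition looks like this (the name, labels, and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None      # headless: DNS returns individual Pod IPs
  selector:
    app: db            # must match the StatefulSet's Pod labels
  ports:
    - port: 5432       # illustrative port
```

Each Pod of a StatefulSet that references this Service then resolves as db-0.db-headless.&lt;namespace&gt;.svc.cluster.local, db-1.db-headless..., and so on.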

What happens to PVCs when a StatefulSet is deleted?

PVCs are NOT deleted when the StatefulSet is deleted. They persist indefinitely, consuming storage. If you re-create the StatefulSet with the same name, Kubernetes re-attaches the existing PVCs by name. To start fresh, you must explicitly delete the PVCs, either by listing them by name (kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 ...) or with a label selector (kubectl delete pvc -l app=<label>); kubectl does not expand wildcards in resource names.

When should I use OnDelete instead of RollingUpdate?

Use OnDelete for large data systems (Elasticsearch, Cassandra, CockroachDB) where each Pod restart triggers significant operational overhead like shard rebalancing or replication catch-up. OnDelete gives you manual control: update the spec, then delete Pods one at a time during maintenance windows, waiting for cluster health to recover between restarts.
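Switching a StatefulSet to OnDelete is a one-line change in its spec:

```yaml
# In the StatefulSet spec
updateStrategy:
  type: OnDelete   # Pods pick up the new template only when deleted manually
```

After applying the updated template, delete Pods one ordinal at a time during the maintenance window, waiting for cluster health to recover between each restart.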

How do I resize a StatefulSet PVC?

Set allowVolumeExpansion: true on the StorageClass. Then edit the PVC's spec.resources.requests.storage directly. Kubernetes will expand the underlying volume. Note: you cannot change the storage class — only the size. Some volumes support online expansion (no Pod restart required). Others require the Pod to be restarted.
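An expandable StorageClass might look like the following; the AWS EBS CSI driver is used here purely as an example provisioner:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-expandable       # illustrative name
provisioner: ebs.csi.aws.com # example provisioner; substitute your CSI driver
allowVolumeExpansion: true   # required for PVC resize
parameters:
  type: gp3
reclaimPolicy: Delete
```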

What is the most common StatefulSet production mistake?

Deleting a StatefulSet without deleting its PVCs. The PVCs persist, consume storage, and block re-creation with different storage configuration. Always delete PVCs explicitly when decommissioning a StatefulSet, and monitor for orphaned PVCs to prevent silent storage leaks.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Kubernetes Pods and Deployments · Next: Kubernetes ConfigMaps and Secrets →
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged