Advanced 8 min · March 06, 2026

Kubernetes StatefulSets: PVC Orphan Caused 2TB Leak

After a StatefulSet was deleted, 2TB of unattached disks appeared; new Pods then reattached the old PVCs, ignoring storage class changes.

Naren · Founder
Quick Answer
  • Stable identity: Each Pod gets a persistent name (pod-0, pod-1) and a stable DNS entry via a Headless Service.
  • Stable storage: Each Pod gets its own PersistentVolumeClaim that follows it across restarts and reschedules.
  • Ordered operations: Pods are created sequentially (0, 1, 2) and deleted in reverse (2, 1, 0). Rolling updates also proceed in reverse ordinal order (2, 1, 0).
  • Headless Service: clusterIP: None. DNS returns Pod IPs directly. Each Pod is reachable as pod-0.service.ns.svc.cluster.local.
  • Ordered operations are slow. With the default policy, a 10-replica StatefulSet can take roughly 10x longer to deploy than a Deployment, because each Pod must be Ready before the next one starts.
  • Parallel mode (podManagementPolicy: Parallel) is faster but breaks cluster bootstrap for systems that need quorum.
  • Common mistake: deleting a StatefulSet without deleting its PVCs. The PVCs persist indefinitely, consume storage, and are silently reused if the StatefulSet is re-created, so a changed storage config is ignored.
Plain-English First

Imagine a hotel where every guest always gets the same room number, the same locker, and is always checked in and out in the exact same order. That's a StatefulSet. Unlike a regular Deployment — where pods are interchangeable guests who can sleep in any room — a StatefulSet guarantees each pod has a permanent name, its own private storage, and a predictable position in line. Think of it as the difference between a row of numbered safety deposit boxes (StatefulSet) versus a pile of shopping carts you just grab any one of (Deployment).

Stateless apps are easy. You spin up ten identical pods, kill any three, Kubernetes replaces them — nobody cares. But the real world is full of systems that refuse to be stateless: databases, message brokers, distributed caches, search engines. These systems have opinions. Elasticsearch node 2 needs to rejoin the cluster as Elasticsearch node 2, not as some random newcomer. Kafka broker 0 owns specific partitions and cannot pretend to be a fresh broker without corrupting data.

StatefulSets exist precisely to give Kubernetes the vocabulary to reason about identity, ordering, and sticky storage. They provide three guarantees that Deployments fundamentally cannot: a stable, unique network identity that survives pod restarts; stable, persistent storage that follows the pod around regardless of which node it lands on; and ordered, graceful deployment and scaling.

This is not a getting-started guide. It covers the controller loop internals, PVC ownership tracking, the role of the Headless Service, why pod ordinals matter for rolling updates, and the exact failure modes that bite teams in production.

StatefulSet vs Deployment: When to Use Each

The most common decision Kubernetes operators face is choosing between a Deployment and a StatefulSet. Both manage replica Pods, but they differ fundamentally in how they treat identity, storage, and order.

A Deployment assumes all Pods are interchangeable. Each Pod gets a random name (e.g., myapp-68dcf7d8b4-abc123), can be replaced with any other Pod, and uses ephemeral or shared storage. Deployments scale quickly in parallel and are perfect for stateless services: web servers, REST APIs, worker queues.

A StatefulSet assumes each Pod is unique. Each Pod gets a stable ordinal name (pod-0, pod-1), a persistent DNS entry via a Headless Service, and its own PersistentVolumeClaim that follows it across reschedules. StatefulSets scale one Pod at a time (ordered) and are necessary for stateful workloads: databases (PostgreSQL, Cassandra), message brokers (Kafka, RabbitMQ), distributed consensus systems (etcd, ZooKeeper).

Use a Deployment when
  • Pods do not need stable network identities.
  • Storage can be ephemeral or shared (e.g., a stateless API reading from a central database).
  • You need fast parallel scaling and rolling updates.
  • Pods can be killed and recreated anywhere without impact.
Use a StatefulSet when
  • Each Pod must be addressable by a unique, stable name (e.g., kafka-0, kafka-1).
  • Each Pod requires its own persistent storage that must survive restarts and rescheduling.
  • Pods need ordered startup and shutdown (e.g., quorum-based systems).
  • The application performs leader election or data partitioning that depends on stable identities.

If you are unsure, start with a Deployment. Stateless applications are simpler to scale, debug, and upgrade. Only move to a StatefulSet when you encounter a concrete requirement for stable identity or per-Pod storage. Overusing StatefulSets adds unnecessary complexity and cost.

When neither fits, consider a DaemonSet for running exactly one Pod per node (e.g., log collectors, node monitoring agents) or a Job/CronJob for batch workloads.

io/thecodeforge/k8s/selection-deployment-vs-statefulset.yaml (YAML)
# Deployment for stateless API
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myapp:1.0
          ports:
            - containerPort: 8080
---
# StatefulSet for Kafka broker
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-hs
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
Output
# Deployment: Pods named api-server-xxx-yyy (random). All identical. No per-Pod storage.
# StatefulSet: Pods named kafka-0, kafka-1, kafka-2. Each with its own PVC.
Pets vs Cattle
  • Deployment: Pods are cattle — replaceable, no identity, fast scaling.
  • StatefulSet: Pods are pets — named, sticky storage, ordered operations.
  • DaemonSet: One pet per node — for node-level agents.
  • Job: Single-run pet — for batch tasks.
  • Rule of thumb: Start with Deployment, escalate to StatefulSet only when required.
Production Insight
StatefulSets carry an operational cost beyond cloud storage. Ordered creation slows deployments and scaling events: a 10-replica StatefulSet takes at least 10 minutes to roll out if each Pod takes 60 seconds to become Ready. During that time the workload runs at partial capacity, and a single Pod that never becomes Ready blocks every ordinal behind it. In large production clusters, consider whether the benefits of stable identity outweigh the slower operations.
Key Takeaway
Use Deployments for stateless, interchangeable Pods. Use StatefulSets only when Pods need stable identities, persistent per-Pod storage, or ordered operations. Overusing StatefulSets adds unnecessary complexity and cost.

Kubernetes Service Types: ClusterIP, NodePort, LoadBalancer, and Headless

Services abstract Pod-to-Pod communication and external access. Each Service type serves a different purpose and carries different trade-offs. Understanding them is essential for exposing StatefulSet Pods correctly.

ClusterIP (default): Exposes the Service on an internal cluster IP. Only reachable from within the cluster. Use for internal microservice communication. ClusterIP is the most efficient because it does not require external load balancers.

NodePort: Exposes the Service on each Node's IP at a static port (30000-32767). Reachable from outside the cluster by hitting <NodeIP>:<NodePort>. Use for development, debugging, or when you need direct node access. Not recommended for production due to security and port collision issues.

LoadBalancer: Exposes the Service externally via a cloud provider's load balancer (e.g., an AWS ELB or a GCP network load balancer). By default it also allocates a NodePort and a ClusterIP behind the scenes. Use for exposing a single Service to the internet. Each Service gets its own load balancer, which incurs hourly cost.

Headless (clusterIP: None): Does not allocate a cluster IP. DNS returns the IPs of all healthy Pods directly. Used primarily with StatefulSets for per-Pod DNS records (pod-0.service.ns.svc.cluster.local). Clients decide which Pod to contact. Also used for custom service discovery.

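| Type | Cluster IP | External Access | Use Case | Cost |
|------|------------|-----------------|----------|------|
| ClusterIP | Yes | No | Internal microservices | Free |
| NodePort | Yes | NodeIP:Port | Dev/Test, bare-metal | Free |
| LoadBalancer | Yes | Cloud LB | Single-service exposure | Per LB/hour |
| Headless | No (None) | DNS-based Pod IPs | StatefulSet peer discovery | Free |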

When exposing a StatefulSet externally, you typically use a LoadBalancer or Ingress for the entire cluster (all Pods), not per-Pod. For inter-Pod communication within the StatefulSet, you always use a Headless Service.
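
Some clients do need to reach one specific Pod from outside the cluster (a Kafka client that must talk to a particular broker, for example). The following is a minimal sketch of a per-Pod Service, assuming the kafka StatefulSet from the earlier example. It selects on the statefulset.kubernetes.io/pod-name label that the StatefulSet controller adds to each Pod. The Service name kafka-0-external is hypothetical.

```yaml
# Hypothetical per-Pod Service: routes only to the Pod named kafka-0.
apiVersion: v1
kind: Service
metadata:
  name: kafka-0-external          # assumed name, not from the original setup
spec:
  type: LoadBalancer
  selector:
    # Label set on every StatefulSet Pod by the StatefulSet controller.
    statefulset.kubernetes.io/pod-name: kafka-0
  ports:
    - port: 9092
      targetPort: 9092
```

Each Service of this kind provisions its own load balancer, so the per-LB cost applies once per Pod. That is why the whole-set exposure described above is the usual choice.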

io/thecodeforge/k8s/service-types-examples.yaml (YAML)
# ClusterIP: internal only
apiVersion: v1
kind: Service
metadata:
  name: my-internal
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
---
# NodePort: external via node IP
apiVersion: v1
kind: Service
metadata:
  name: my-nodeport
spec:
  type: NodePort
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080
---
# LoadBalancer: external via cloud LB
apiVersion: v1
kind: Service
metadata:
  name: my-lb
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
---
# Headless: no cluster IP, used with StatefulSets
apiVersion: v1
kind: Service
metadata:
  name: my-headless
spec:
  clusterIP: None
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
Output
# ClusterIP: service is reachable as my-internal.ns.svc.cluster.local
# NodePort: external curl http://node-ip:30080
# LoadBalancer: external curl http://<lb-dns-name>
# Headless: no cluster IP; DNS returns Pod IPs directly
Headless Is Required for StatefulSet DNS
  • Headless Service: clusterIP: None — creates per-Pod A/AAAA records.
  • Regular Service: clusterIP: set — creates a single virtual IP and load-balances.
  • StatefulSet spec.serviceName must match the Headless Service metadata.name.
  • DNS name format: <pod-name>.<service-name>.<namespace>.svc.cluster.local.
  • Used by Cassandra, Kafka, ZooKeeper, etc. for seed discovery.
Production Insight
LoadBalancer services are expensive at scale. Each Service creates a separate cloud load balancer with hourly charges ($15-25/month each). With hundreds of microservices, this cost adds up quickly. Instead, use an Ingress controller (e.g., NGINX Ingress, AWS ALB Ingress) to route multiple services through a single LoadBalancer. Ingress also provides SSL termination, path-based routing, and rate limiting at no extra LB cost. Reserve LoadBalancer for non-HTTP services or when you need direct TCP/UDP load balancing.
Key Takeaway
Choose Service type based on access requirements: ClusterIP for internal, NodePort for dev, LoadBalancer for single external service, Headless for StatefulSet DNS. Consolidate external HTTP traffic with Ingress to save costs.

Ingress vs LoadBalancer: Choosing the Right External Exposure Mechanism

When you need to expose a StatefulSet (or any service) to the internet, you have two primary options: a LoadBalancer Service or an Ingress resource. The choice depends on protocol, routing requirements, and cost.

LoadBalancer Service: Creates a cloud load balancer (e.g., AWS ELB, GCP TCP LB) that forwards traffic directly to your Service's Pods. Operates at Layer 4 (TCP/UDP). Each LoadBalancer Service gets its own external IP or DNS name. Simple to set up, but each one is a separate billable resource. Best for non-HTTP protocols (database connections, custom TCP or UDP protocols) or when you need a single service exposed with minimal configuration.

Ingress: A cluster-level resource that provides HTTP(S) routing rules. Requires an Ingress controller (e.g., NGINX Ingress, Istio Gateway, AWS Load Balancer Controller). The controller typically runs as a DaemonSet or Deployment and is itself exposed via a LoadBalancer Service. Ingress operates at Layer 7, allowing path-based routing (e.g., /api -> service-a, /web -> service-b), host-based routing (api.example.com -> service-a), SSL termination, and rate limiting. Multiple Ingress rules can share the same underlying LoadBalancer, saving money.

Decision Matrix:

| Criteria | LoadBalancer Service | Ingress |
|----------|---------------------|---------|
| Protocol | TCP, UDP, HTTP | HTTP, HTTPS, gRPC (with controller) |
| Routing | No (single target) | Path, host, headers |
| SSL termination | Manual (annotation) | Built-in (cert-manager) |
| Cost per service | One LB per Service | One LB for many Ingresses |
| Setup complexity | Low | Medium (controller required) |
| Use case | Database, non-HTTP, simple apps | HTTP APIs, web apps, microservices |

For HTTP workloads with multiple services, use Ingress. For non-HTTP workloads or when you need absolute simplicity, use LoadBalancer. You can also combine both: an Ingress controller exposed via a LoadBalancer Service.

io/thecodeforge/k8s/ingress-vs-lb.yaml (YAML)
# LoadBalancer Service for a database (non-HTTP)
apiVersion: v1
kind: Service
metadata:
  name: postgres-lb
spec:
  type: LoadBalancer
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
---
# Ingress for HTTP API (multiple services behind one LB)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  ingressClassName: nginx      # Replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /users
            pathType: Prefix
            backend:
              service:
                name: user-service
                port:
                  number: 8080
          - path: /orders
            pathType: Prefix
            backend:
              service:
                name: order-service
                port:
                  number: 8080
Output
# LoadBalancer: single cloud ELB created for postgres-lb. Database clients connect directly.
# Ingress: one cloud ELB for the Ingress controller, routing /users and /orders to different services.
Ingress Controller Is Mandatory
  • Ingress controllers: NGINX, HAProxy, Traefik, AWS LB Controller, Istio Gateway.
  • Ingress resources define routing rules; the controller implements them.
  • The controller itself is often exposed via a LoadBalancer Service.
  • Cert-manager can automate SSL certificate provisioning for Ingresses.
  • Many controllers support sticky sessions, rate limiting, and canary releases via controller-specific annotations.
Production Insight
In production, never expose your StatefulSet Pods directly via a LoadBalancer Service unless you have a specific non-HTTP protocol. Use an Ingress controller with proper SSL termination and path-based routing. This approach can cut cloud load balancer costs substantially, since many Services share one load balancer, and it centralizes traffic management. For internal cluster traffic, keep the Headless Service for StatefulSet DNS and use ClusterIP Services for application discovery. The LoadBalancer should only be the single entry point for external traffic.
Key Takeaway
Choose LoadBalancer for simple non-HTTP services or when you need minimal configuration. Choose Ingress for HTTP-based multi-service exposure to consolidate LBs and gain Layer 7 routing. Always install an Ingress controller first.

Stable Identity: Network Names and Pod Ordinals

The defining feature of a StatefulSet is stable identity. Each Pod receives a unique, predictable name based on the StatefulSet name and an ordinal index: <statefulset-name>-0, <statefulset-name>-1, <statefulset-name>-2. This identity persists across restarts, reschedules, and even node failures. If pod-2 is rescheduled to a different node, it is still pod-2.

This identity extends to DNS. When a StatefulSet specifies a Headless Service via spec.serviceName, Kubernetes creates DNS A records for each Pod: pod-0.service.namespace.svc.cluster.local. These DNS names resolve directly to the Pod's IP address. When the Pod restarts with a new IP, the DNS record is updated automatically.
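
One way to confirm that the per-Pod records exist is a throwaway Pod that resolves one of them. This is a minimal sketch, assuming a StatefulSet and Headless Service both named postgres in the production namespace. The Pod name dns-check and the busybox:1.28 image are illustrative choices, not part of the original setup.

```yaml
# Hypothetical one-shot Pod that resolves the stable DNS name of postgres-0.
apiVersion: v1
kind: Pod
metadata:
  name: dns-check                 # assumed name
  namespace: production
spec:
  restartPolicy: Never            # run the lookup once, then stop
  containers:
    - name: dns-check
      image: busybox:1.28
      command:
        - nslookup
        - postgres-0.postgres.production.svc.cluster.local
```

The result is visible with kubectl logs dns-check -n production. If the lookup fails, the first thing to check is that spec.serviceName on the StatefulSet matches the Headless Service name.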

This is fundamentally different from Deployments, where Pods are interchangeable and have random names. Stateful systems rely on this identity for peer discovery, leader election, and data partitioning. A Kafka broker must rejoin the cluster with the same identity to resume ownership of its partitions.

io/thecodeforge/k8s/statefulset-identity.yaml (YAML)
# StatefulSet with stable identity and Headless Service.
# Each Pod is reachable as: postgres-0.postgres.production.svc.cluster.local
# Package: io.thecodeforge.k8s

# Headless Service: DNS returns Pod IPs directly.
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  clusterIP: None              # Headless: no virtual IP
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
---
# StatefulSet: stable identity, ordered operations, sticky storage.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres        # Must match the Headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 5
            periodSeconds: 5
  # volumeClaimTemplates: creates a PVC for each Pod.
  # PVC name format: <template-name>-<statefulset-name>-<ordinal>
  # e.g., data-postgres-0, data-postgres-1, data-postgres-2
  # These PVCs persist even after the StatefulSet is deleted.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
Output
# Pods created in order: postgres-0, postgres-1, postgres-2
# Each with its own PVC: data-postgres-0, data-postgres-1, data-postgres-2
# DNS entries:
# postgres-0.postgres.production.svc.cluster.local -> <pod-0-ip>
# postgres-1.postgres.production.svc.cluster.local -> <pod-1-ip>
# postgres-2.postgres.production.svc.cluster.local -> <pod-2-ip>
Identity Is Not Just a Name
  • Pod name: stable across restarts. postgres-2 is always postgres-2.
  • DNS: pod-0.service.ns.svc.cluster.local. Updated on Pod IP change.
  • Storage: PVC follows the Pod. Same PVC is re-attached on reschedule.
  • Ordinal: determines creation order (0, 1, 2) and deletion order (2, 1, 0).
  • Headless Service is required. Without it, DNS records are not created.
Production Insight
When a StatefulSet Pod is rescheduled to a different node, there is a window where the Pod is Pending because its volume is still attached to the old node. The attach-detach controller must detach the PV from the old node before it can be attached to the new node. If the old node is unreachable, this can take up to about 6 minutes, because the controller waits before force-detaching the volume (its reconciliation interval is set by --attach-detach-reconcile-sync-period). During this window, the Pod cannot start and the StatefulSet cannot proceed to the next ordinal. For critical databases, this delay can cause quorum loss in a 3-node cluster.
Key Takeaway
StatefulSet identity is three guarantees: stable name, stable DNS, and sticky storage. All three must persist across restarts. The Headless Service is mandatory for DNS-based peer discovery. PVC attachment delays during node failover can block the entire StatefulSet.

Ordered Operations: Creation, Deletion, and Rolling Updates

StatefulSets enforce strict ordering on all lifecycle operations. Pods are created sequentially from ordinal 0 to N-1. Pod N is not created until Pod N-1 is Running and Ready. Pods are deleted in reverse order: N-1 first, then N-2, down to 0. Rolling updates also proceed in reverse ordinal order, from the highest ordinal down to 0, one Pod at a time.

This ordering is critical for systems that need quorum during bootstrap. A 3-node etcd cluster needs at least 2 nodes to form quorum. If all 3 Pods start simultaneously, none can form quorum because they all try to discover peers that do not exist yet. Ordered creation ensures pod-0 starts first, pod-1 joins pod-0, and pod-2 joins the existing 2-node cluster.

The podManagementPolicy field controls this behavior. The default is OrderedReady: Pods are created and deleted one at a time in ordinal order. The alternative is Parallel: Pods are created and deleted simultaneously, like a Deployment. Parallel is faster but breaks systems that need ordered bootstrap.

io/thecodeforge/k8s/statefulset-ordering.yaml (YAML)
# StatefulSet with OrderedReady (default) and Parallel comparison.
# Package: io.thecodeforge.k8s

# OrderedReady: Pods created one at a time. Slow but safe for quorum-based systems.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-cluster
  namespace: production
spec:
  serviceName: etcd
  replicas: 3
  podManagementPolicy: OrderedReady  # Default. Sequential creation/deletion.
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.12
          ports:
            - containerPort: 2379
              name: client
            - containerPort: 2380
              name: peer
          env:
            - name: ETCD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ETCD_INITIAL_CLUSTER
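              # Member names must match the Pod names (<statefulset-name>-<ordinal>), since ETCD_NAME comes from metadata.name.
              value: "etcd-cluster-0=http://etcd-cluster-0.etcd:2380,etcd-cluster-1=http://etcd-cluster-1.etcd:2380,etcd-cluster-2=http://etcd-cluster-2.etcd:2380"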
            - name: ETCD_INITIAL_CLUSTER_STATE
              value: "new"
            - name: ETCD_INITIAL_CLUSTER_TOKEN
              value: "etcd-cluster-token"
            - name: ETCD_DATA_DIR
              value: "/var/lib/etcd/data"
          volumeMounts:
            - name: data
              mountPath: /var/lib/etcd
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
# Parallel: All Pods created simultaneously. Fast but unsafe for quorum bootstrap.
# Use only for systems that do not require ordered startup.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
  namespace: production
spec:
  serviceName: redis
  replicas: 3
  podManagementPolicy: Parallel    # All Pods created at once.
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
Output
# etcd-cluster: Pods created in order etcd-cluster-0, etcd-cluster-1, etcd-cluster-2. Each waits for the previous to be Running and Ready.
# redis-cache: All 3 Pods created simultaneously.
OrderedReady vs Parallel: The Quorum Problem
  • OrderedReady: sequential creation. Safe for quorum. Slow at scale.
  • Parallel: simultaneous creation. Fast. Unsafe for quorum bootstrap.
  • Rolling updates proceed in reverse ordinal order (highest ordinal first) regardless of podManagementPolicy.
  • podManagementPolicy affects scaling only: with Parallel, scale-up and scale-down act on all Pods at once instead of in ordinal order.
  • OnDelete update strategy: Pods are not updated until manually deleted. Gives full control.
Production Insight
Rolling updates on StatefulSets are slow by design. If a StatefulSet has 10 replicas and each Pod takes 60 seconds to become Ready, a rolling update takes at least 10 minutes. For large StatefulSets, consider using the OnDelete update strategy: update the spec, then manually delete Pods one at a time during maintenance windows. This gives you full control over timing and prevents unexpected updates during peak traffic. Monitor kubectl rollout status statefulset/<name> — if a single ordinal is stuck, the entire rollout blocks.
Key Takeaway
OrderedReady is the safe default for quorum-based systems. Parallel is faster but breaks consensus bootstrap. Rolling updates always proceed in reverse ordinal order. Use OnDelete for manual control over update timing in production.

PersistentVolumeClaim Lifecycle: Ownership, Orphans, and Reclaim

The volumeClaimTemplates field in a StatefulSet is a template for creating PVCs. When a StatefulSet Pod is created, Kubernetes creates a PVC from the template with a deterministic name: <template-name>-<statefulset-name>-<ordinal>. For example, a StatefulSet named postgres with a volumeClaimTemplate named data creates PVCs: data-postgres-0, data-postgres-1, data-postgres-2.

These PVCs are created by the StatefulSet controller but, by default, are NOT deleted when the StatefulSet is deleted or scaled down. This is by design: the data must persist so it can be re-attached if the StatefulSet is re-created. However, it creates a common production pitfall: orphaned PVCs that consume storage indefinitely.
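
Newer Kubernetes releases let the StatefulSet itself declare what should happen to its PVCs, through spec.persistentVolumeClaimRetentionPolicy. The following is a minimal sketch, assuming a cluster where this field is available (it first appeared as an alpha feature in Kubernetes 1.23). The kafka names and sizes are placeholders for illustration.

```yaml
# Sketch: opt in to automatic PVC cleanup on StatefulSet deletion.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: production
spec:
  serviceName: kafka
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete           # remove PVCs when the StatefulSet is deleted
    whenScaled: Retain            # keep PVCs of Pods removed by a scale-down
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 500Gi
```

Setting whenDeleted: Delete trades the orphan problem for a data-loss risk. Combined with a Delete reclaim policy, removing the StatefulSet also removes the disks, so this tends to suit reproducible data better than primary databases. Both fields default to Retain, which is the behavior described above.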

The reclaim policy of the PersistentVolume, which dynamically provisioned volumes inherit from their StorageClass, determines what happens to the volume when the PVC is finally deleted. Retain keeps the PV and its data. Delete removes the PV and its data. A StorageClass that does not set reclaimPolicy defaults to Delete.

io/thecodeforge/k8s/statefulset-pvc-lifecycle.yaml (YAML)
# PVC lifecycle management for StatefulSets.
# Package: io.thecodeforge.k8s

# StorageClass with Delete reclaim policy.
# When PVC is deleted, the underlying PV and data are also deleted.
# Use for ephemeral or reproducible data.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-deletable
provisioner: ebs.csi.aws.com   # AWS EBS CSI driver; the in-tree kubernetes.io/aws-ebs provisioner is deprecated
parameters:
  type: gp3
reclaimPolicy: Delete          # PV is deleted when PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StorageClass with Retain reclaim policy.
# When PVC is deleted, the PV is kept (but unbound).
# Use for critical data that must survive PVC deletion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain
provisioner: ebs.csi.aws.com   # AWS EBS CSI driver; the in-tree kubernetes.io/aws-ebs provisioner is deprecated
parameters:
  type: gp3
reclaimPolicy: Retain          # PV is kept when PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StatefulSet using the deletable storage class.
# PVCs are NOT deleted with the StatefulSet. Once a PVC is deleted, its PV and disk are deleted too.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: production
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          ports:
            - containerPort: 9092
          env:
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
            - name: KAFKA_LOG_DIRS
              value: /var/lib/kafka/data
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd-deletable
        resources:
          requests:
            storage: 500Gi
Output
# PVCs created: data-kafka-0 (500Gi), data-kafka-1 (500Gi), data-kafka-2 (500Gi)
# Total storage: 1.5TB. These persist even after StatefulSet deletion.
PVC Name Matching: Why Storage Class Changes Are Ignored
  • PVC name is deterministic: <template>-<sts-name>-<ordinal>.
  • Kubernetes matches by name. Existing PVCs are re-attached, new specs are ignored.
  • A PVC's storage class is immutable after creation; its size can only be increased, never decreased.
  • To change storage class: back up the data, delete the StatefulSet, delete the PVCs, then re-create the StatefulSet.
  • To resize PVC: set allowVolumeExpansion: true on StorageClass, then edit PVC spec.resources.requests.storage.
Production Insight
Orphaned PVCs are the silent storage leak in Kubernetes. By default, every deleted StatefulSet leaves behind PVCs that consume cloud storage indefinitely. At scale, this can cost thousands of dollars per month. Set up monitoring for unused PVCs (status.phase: Pending, or status.phase: Bound but mounted by no Pod). Alert when a PVC exists without a corresponding Pod for more than 1 hour. Consider a cron job that identifies and reports orphaned PVCs weekly.
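
The one-hour alert can be sketched as a Prometheus rule. This example rests on several assumptions that are not part of the original setup: the cluster runs the Prometheus Operator (which provides the PrometheusRule resource) and kube-state-metrics, and the two metric names below match the installed kube-state-metrics version. The rule name, group name, and monitoring namespace are hypothetical.

```yaml
# Sketch: fire when a PVC has not been referenced by any Pod for 1 hour.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: orphaned-pvc-alerts       # assumed name
  namespace: monitoring           # assumed namespace
spec:
  groups:
    - name: storage.orphans
      rules:
        - alert: PersistentVolumeClaimNotMounted
          # Every known PVC, minus the PVCs that some Pod spec references.
          expr: |
            kube_persistentvolumeclaim_info
              unless on (namespace, persistentvolumeclaim)
            kube_pod_spec_volumes_persistentvolumeclaims_info
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} has not been mounted by any Pod for 1 hour"
```

The unless operator keeps only the PVC series that have no matching Pod volume series, which is the orphan condition. It is worth verifying the metric and label names against the deployed exporter before relying on the alert.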
Key Takeaway
StatefulSet PVCs persist after StatefulSet deletion. Kubernetes matches PVCs by name, ignoring new template specs. To change storage class, delete PVCs first. Monitor for orphaned PVCs to prevent silent storage leaks.

Update Strategies: RollingUpdate vs OnDelete

StatefulSets support two update strategies: RollingUpdate (default) and OnDelete. The choice determines how Pod template changes (image update, env var change) are propagated to existing Pods.

RollingUpdate updates Pods one at a time in reverse ordinal order (highest ordinal first), waiting for each Pod to be Running and Ready before proceeding to the next. This is the safe default, but it is slow. The maxUnavailable field (introduced as an alpha feature in Kubernetes 1.24, behind the MaxUnavailableStatefulSet feature gate) controls how many Pods can be unavailable during the update.

OnDelete does not automatically update Pods. When the Pod template is changed, existing Pods continue running the old spec. The update is applied only when a Pod is manually deleted. Kubernetes recreates the Pod with the new spec. This gives full control over update timing but requires manual intervention.
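
Between fully automatic and fully manual there is a middle option: the partition field of RollingUpdate, which limits an update to the highest ordinals. The fragment below is a sketch meant to be merged into an existing manifest, assuming a 3-replica StatefulSet. The value 2 is an illustrative choice.

```yaml
# Fragment of a StatefulSet spec: staged (canary-style) rolling update.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Only Pods with ordinal >= partition receive the new template.
      # With 3 replicas and partition: 2, only the Pod with ordinal 2 is updated.
      partition: 2
```

Once the updated Pod looks healthy, lowering partition step by step (and finally to 0) rolls the change out to the remaining ordinals. Pods below the partition keep the old template even if they are deleted and recreated.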

io/thecodeforge/k8s/statefulset-update-strategy.yaml (YAML)
# StatefulSet update strategies.
# Package: io.thecodeforge.k8s

# RollingUpdate (default): Automatic, ordered updates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
  namespace: production
spec:
  serviceName: zookeeper
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # Allow 1 Pod to be unavailable during update
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
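        - name: zookeeper
          image: zookeeper:3.8
          env:
            # ZooKeeper 3.5+ only answers whitelisted four-letter commands;
            # without "ruok" here the readiness probe below never passes.
            - name: ZOO_4LW_COMMANDS_WHITELIST
              value: "ruok,srvr"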
          ports:
            - containerPort: 2181
          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - "echo ruok | nc localhost 2181 | grep imok"
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: data
              mountPath: /data
            - name: datalog
              mountPath: /datalog
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
    - metadata:
        name: datalog
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
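# OnDelete: Manual update control. No automatic Pod recreation on template change.
apiVersion: apps/v1
kind: StatefulSet
spec:
  updateStrategy:
    type: OnDelete           # Pods change only when you delete them; other fields as in the manifest above
OnDelete is best suited to large data systems where each restart has significant operational impact (rebalancing, reindexing, replication catch-up).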
Production Insight
For large Elasticsearch or Cassandra clusters, RollingUpdate causes hours of unnecessary rebalancing. Each Pod restart triggers shard redistribution, which competes with application traffic for I/O and network bandwidth. Use OnDelete and restart Pods one at a time during maintenance windows, waiting for cluster health to return to green before proceeding to the next Pod. Monitor cluster health metrics (Elasticsearch: _cluster/health, Cassandra: nodetool status) between each restart.
Key Takeaway
RollingUpdate is the safe default for systems with fast startup. OnDelete is the production standard for large data systems where each restart has significant operational impact. Use maxUnavailable to control parallelism during RollingUpdate.
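The manual OnDelete procedure can be scripted as a loop that deletes one Pod, waits for its replacement to become Ready, and only then moves on. This is a minimal sketch, not a hardened tool: restart_ondelete and health_gate are hypothetical names, the highest-ordinal-first order simply mirrors what the controller does for RollingUpdate, and the health gate is a stub you would replace with a real cluster check.

```shell
#!/bin/sh
# Sketch: restart an OnDelete StatefulSet one Pod at a time, highest ordinal first.
# restart_ondelete and health_gate are hypothetical names for this example.

# Placeholder health check. Replace with a real gate, such as polling
# Elasticsearch _cluster/health or parsing `nodetool status`.
health_gate() {
  return 0
}

# Args: <statefulset-name> <namespace> <replica-count>
restart_ondelete() {
  sts="$1"; ns="$2"; replicas="$3"
  i=$((replicas - 1))
  while [ "$i" -ge 0 ]; do
    pod="${sts}-${i}"
    # Under OnDelete, deleting the Pod is what applies the new template.
    kubectl delete pod "$pod" -n "$ns" || return 1
    # The controller recreates the Pod under the same name; wait for it to appear.
    until kubectl get pod "$pod" -n "$ns" >/dev/null 2>&1; do sleep 1; done
    kubectl wait --for=condition=Ready "pod/${pod}" -n "$ns" --timeout=600s || return 1
    # Do not continue until the data system itself reports healthy.
    health_gate "$pod" "$ns" || return 1
    i=$((i - 1))
  done
}

# Usage: restart_ondelete zookeeper production 3
```

Stopping on the first failure is deliberate: if one replacement Pod never becomes Ready, the remaining Pods keep running the old spec and the cluster stays serviceable while you investigate.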

PodDisruptionBudgets and StatefulSet Availability

PodDisruptionBudgets (PDBs) are critical for StatefulSets. They cap how many Pods can be evicted at once through the Eviction API during node drains and cluster upgrades. An eviction that would push the workload below the budget is refused until enough Pods are Ready again.

For a 3-node etcd cluster, set minAvailable: 2. This ensures that a node drain cannot break quorum. With one Pod already evicted, an attempt to evict a second Pod blocks until the first one is rescheduled and Ready.

PDBs only block voluntary disruption (node drains, upgrades, and other API-initiated evictions). Scheduler preemption honors PDBs only on a best-effort basis. They do NOT protect against involuntary disruption (node crash, OOMKill, kernel panic). This distinction is critical: PDBs are a guardrail for planned maintenance, not a safety net for unplanned failures.

io/thecodeforge/k8s/statefulset-pdb.yamlYAML
# PodDisruptionBudget for a 3-node etcd cluster.
# Ensures at least 2 Pods are always available (quorum).
# Package: io.thecodeforge.k8s
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: etcd
---
# PodDisruptionBudget using maxUnavailable for a 5-node Kafka cluster.
# Allows at most 1 Pod to be down at any time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: kafka
Output
# Verify PDB status:
# kubectl get pdb -n production
# NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# etcd-pdb    2               N/A               1                     5d
# kafka-pdb   N/A             1                 0                     5d
# ALLOWED DISRUPTIONS = 0 means the PDB is currently blocking all voluntary evictions.
minAvailable vs maxUnavailable
  • minAvailable: 2 on a 3-replica cluster = 1 Pod can be evicted.
  • maxUnavailable: 1 on a 3-replica cluster = same result, different expression.
  • PDBs block voluntary disruption only (drain, upgrade). Scheduler preemption honors them on a best-effort basis.
  • They do NOT protect against involuntary disruption (crash, OOMKill).
  • Setting minAvailable equal to replica count blocks all maintenance. Use quorum size instead.
Production Insight
The most common PDB misconfiguration is setting minAvailable equal to the replica count. If you have 3 replicas and set minAvailable: 3, the PDB blocks all voluntary disruptions — including necessary node drains during maintenance. This forces operators to delete the PDB before draining nodes, which defeats the purpose. Set minAvailable to the quorum size (2 for 3 replicas) or use maxUnavailable: 1. During cluster upgrades, the upgrade controller respects PDBs and waits for Pods to be rescheduled before proceeding to the next node.
Key Takeaway
PDBs are mandatory for StatefulSets running quorum-based systems. Set minAvailable to the quorum size, not the replica count. PDBs only block voluntary disruption — they do not protect against node crashes. Always pair PDBs with anti-affinity rules to spread Pods across nodes.
PDB Configuration Decision Tree
If: Quorum-based system (etcd, ZooKeeper, CockroachDB)
Use: Set minAvailable to quorum size. For 3 replicas: minAvailable: 2. For 5 replicas: minAvailable: 3.
If: Independent replicas (Redis standalone, stateless workers)
Use: Set maxUnavailable: 1 or minAvailable: N-1. Allows rolling maintenance without service degradation.
If: Single-replica StatefulSet (single PostgreSQL instance)
Use: A PDB with minAvailable: 1 blocks all voluntary disruption. Use cautiously, because maintenance then requires manual PDB deletion.
If: Large cluster (10+ replicas) with no quorum requirement
Use: Set maxUnavailable: 20-25% to allow parallel node drains during upgrades.
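The quorum rule in the first branch is plain majority arithmetic: floor(N / 2) + 1. A small sketch that derives minAvailable from the replica count and prints a matching PDB manifest follows; quorum_size and render_quorum_pdb are hypothetical helper names, and the app label key is assumed to match the one your StatefulSet uses.

```shell
#!/bin/sh
# Sketch: derive minAvailable from the replica count and print a matching PDB.
# quorum_size and render_quorum_pdb are hypothetical helper names.

# Majority quorum for N voting members: floor(N / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Args: <pdb-name> <namespace> <app-label> <replicas>
render_quorum_pdb() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1
  namespace: $2
spec:
  minAvailable: $(quorum_size "$4")
  selector:
    matchLabels:
      app: $3
EOF
}

# Usage: render_quorum_pdb etcd-pdb production etcd 3 | kubectl apply -f -
```

Computing the value instead of hard-coding it keeps the budget correct when the replica count changes, for example when a 3-member ensemble grows to 5.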
● Production incident · POST-MORTEM · severity: high

StatefulSet PVC Orphan Caused 2TB Storage Leak and Blocked Cluster Migration

Symptom
After deleting the StatefulSet, the cloud bill showed 2TB of unattached persistent disks. When the new StatefulSet was created, Pods attached to the old PVCs with the old storage class instead of the new ones. The team could not understand why the new storage configuration was not being applied.
Assumption
Deleting the StatefulSet would clean up all associated resources including PVCs.
Root cause
StatefulSets use a PersistentVolumeClaim template (volumeClaimTemplates) that creates a PVC for each Pod. By default these PVCs carry no owner reference to the StatefulSet, and the StatefulSet's persistentVolumeClaimRetentionPolicy defaults to Retain. When the StatefulSet is deleted, the PVCs are NOT deleted. They persist in the namespace, retaining their data and their storage class. When a new StatefulSet with the same name and PVC template names is created, Kubernetes matches the existing PVCs by name and re-attaches them, completely ignoring any changes to the storage class or size in the new template.
Fix
1. Manually deleted the orphaned PVCs: kubectl delete pvc data-postgres-0 data-postgres-1 data-postgres-2 data-postgres-3 data-postgres-4 -n production.
2. Verified the underlying PersistentVolumes were released and deleted (or set reclaimPolicy: Delete for the StorageClass).
3. Re-created the StatefulSet with the new storage class in volumeClaimTemplates.
4. Added a cleanup script to the team's runbook that explicitly deletes PVCs after StatefulSet deletion.
5. Set up monitoring for unused PVCs: alert when a PVC has existed without a Pod mounting it for more than 1 hour.
Key lesson
  • StatefulSet PVCs are NOT deleted when the StatefulSet is deleted. They persist indefinitely unless explicitly removed.
  • PVC names are deterministic: <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>, for example data-postgres-0. Kubernetes matches by name, not by spec. Changing the storage class in the template has no effect on existing PVCs.
  • Always delete PVCs explicitly when decommissioning a StatefulSet. Add this to your runbook.
  • Monitor for PVCs that no Pod mounts. They consume storage and can cause billing surprises.
  • Before migrating storage classes, back up data, delete StatefulSet, delete PVCs, then re-create with new storage class.
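Because the PVC names are deterministic, the decommissioning step can be scripted instead of typed by hand. The following is a sketch under stated assumptions: sts_pvc_names and decommission_sts are hypothetical names, the replica count is passed in explicitly, and the data is assumed to be backed up before the function is called.

```shell
#!/bin/sh
# Sketch: delete a StatefulSet together with the PVCs its template created.
# sts_pvc_names and decommission_sts are hypothetical names.

# Print the PVC names for one volumeClaimTemplate, one per line.
# Args: <template-name> <statefulset-name> <replicas>
sts_pvc_names() {
  i=0
  while [ "$i" -lt "$3" ]; do
    echo "$1-$2-$i"
    i=$((i + 1))
  done
}

# Args: <template-name> <statefulset-name> <replicas> <namespace>
decommission_sts() {
  # Back up first: depending on the reclaim policy, deleting a PVC
  # can destroy the underlying volume.
  kubectl delete statefulset "$2" -n "$4" || return 1
  for pvc in $(sts_pvc_names "$1" "$2" "$3"); do
    kubectl delete pvc "$pvc" -n "$4" --ignore-not-found || return 1
  done
}

# Usage: decommission_sts data postgres 5 production
```

If the StatefulSet has more than one volumeClaimTemplate, call decommission_sts once and then loop sts_pvc_names over the remaining template names.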
Production debug guide
Symptom-first investigation path for StatefulSet failures.
Symptom · 01
StatefulSet Pod stuck in Pending.
Fix
Check if the PVC is bound. Run kubectl describe pvc data-<sts-name>-<ordinal>. If the PVC is Pending, the StorageClass may not exist or the provisioner may be down. Check node affinity — the PV may be bound to a specific node that is unavailable.
Symptom · 02
StatefulSet Pod stuck in ContainerCreating (Multi-Attach error) after node failure.
Fix
The Pod is likely trying to re-attach a PVC that is still attached to the old (failed) node. Check PVC status: kubectl get pvc -n <ns>. If the PV is still attached to the old node, force-detach it or wait for the attach-detach controller to time out (6 minutes default).
Symptom · 03
Rolling update stuck on a specific ordinal (e.g., pod-3).
Fix
StatefulSets update Pods in reverse ordinal order (highest first). If pod-3 is not Ready, pod-2 will not be updated. Check pod-3's readiness probe, logs, and events. Use kubectl rollout status statefulset/<name> to see which ordinal is blocking.
Symptom · 04
StatefulSet scale-down stuck. Pods not being deleted.
Fix
StatefulSets delete Pods in reverse ordinal order (highest first). If pod-2 is not terminating, pod-1 and pod-0 will not be deleted. Check for finalizers on the Pod, PVC detach issues, or PodDisruptionBudget conflicts.
Symptom · 05
StatefulSet Pods cannot resolve each other by DNS name.
Fix
Verify the Headless Service exists and has clusterIP: None. Check that spec.serviceName in the StatefulSet matches the Headless Service name. Test DNS resolution: kubectl exec pod-0 -- nslookup pod-1.<service>.<namespace>.svc.cluster.local.
Symptom · 06
New StatefulSet Pod attached to old PVC with wrong data.
Fix
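This is expected behavior. Kubernetes matches PVCs by name. If you re-create a StatefulSet with the same name, it re-attaches existing PVCs. To start fresh, delete the PVCs first. kubectl does not expand wildcards in resource names, so either name each PVC (kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 -n <namespace>) or, if the PVCs carry the StatefulSet's selector labels, select them by label (kubectl delete pvc -l app=<label> -n <namespace>).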
★ StatefulSet Triage Commands
Rapid commands to isolate StatefulSet lifecycle and storage issues.
Pod stuck in Pending.
Immediate action
Check PVC binding status and StorageClass.
Commands
kubectl describe pvc data-<sts-name>-<ordinal> -n <namespace>
kubectl get storageclass
Fix now
If the PVC is Pending, the StorageClass provisioner may be down. Check the PVC's events: kubectl get events -n <ns> --field-selector involvedObject.name=data-<sts-name>-<ordinal>.
Pod stuck in ContainerCreating with volume attach error.
Immediate action
Check if the PV is still attached to a previous node.
Commands
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
kubectl get volumeattachment | grep <pv-name>
Fix now
If VolumeAttachment shows the PV attached to a dead node, delete the VolumeAttachment object. The attach-detach controller will retry on the new node.
Rolling update stuck.
Immediate action
Identify the blocking ordinal.
Commands
kubectl rollout status statefulset/<name> -n <namespace> --timeout=30s
kubectl get pods -n <namespace> -l app=<label> --sort-by=.metadata.name -o wide | grep -v Running
Fix now
If a specific ordinal is not Ready, check its logs and readiness probe. Fix the issue and the rollout will automatically proceed to the next ordinal.
PVCs consuming unexpected storage after StatefulSet deletion.
Immediate action
List orphaned PVCs.
Commands
kubectl get pvc -n <namespace> | grep <sts-name>
kubectl describe pvc data-<sts-name>-0 -n <namespace> | grep -A 5 Status
Fix now
If PVCs exist without a Pod mounting them, they are orphaned. Delete them by name (kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 -n <namespace>); kubectl does not expand a trailing * in resource names. Check the reclaim policy on the StorageClass.
DNS resolution failing between StatefulSet Pods.
Immediate action
Verify Headless Service and serviceName match.
Commands
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.clusterIP}'
kubectl get statefulset <sts-name> -n <namespace> -o jsonpath='{.spec.serviceName}'
Fix now
If clusterIP is not 'None', the Service is not Headless. If serviceName does not match the Headless Service name, DNS records are not created. Fix the mismatch.
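The two DNS checks above can be combined into one function that reads spec.serviceName from the StatefulSet and then confirms the referenced Service is headless. This is a sketch: check_headless is a hypothetical name, and the printed hostname assumes the default cluster.local cluster domain.

```shell
#!/bin/sh
# Sketch: confirm a StatefulSet points at a Service that is actually headless.
# check_headless is a hypothetical name.

# Args: <statefulset-name> <namespace>
check_headless() {
  svc="$(kubectl get statefulset "$1" -n "$2" -o jsonpath='{.spec.serviceName}')"
  if [ -z "$svc" ]; then
    echo "StatefulSet $1 has no spec.serviceName"
    return 1
  fi
  # A failing lookup here usually means the Service name does not exist.
  ip="$(kubectl get svc "$svc" -n "$2" -o jsonpath='{.spec.clusterIP}')" || {
    echo "Service $svc not found in $2"
    return 1
  }
  if [ "$ip" != "None" ]; then
    echo "Service $svc is not headless (clusterIP=$ip)"
    return 1
  fi
  # Assumes the default cluster domain, cluster.local.
  echo "ok: $1-0.$svc.$2.svc.cluster.local should resolve"
}

# Usage: check_headless zookeeper production
```

A passing check only proves the configuration is consistent. Resolution can still fail if the Pod is not Ready, so follow up with the nslookup test from inside a Pod.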
Deployment vs StatefulSet vs DaemonSet: When to Use Each
Aspect | Deployment | StatefulSet | DaemonSet
Pod identity | Random, interchangeable | Stable, ordinal (pod-0, pod-1) | One per node (or subset)
Pod naming | random-hash | sts-name-ordinal | daemon-hash
Storage | Ephemeral or shared PV | Per-Pod PVC (sticky) | HostPath or shared PV
Scaling | Horizontal (free) | Ordered (sequential) | Automatic (per-node)
Rolling update | maxSurge + maxUnavailable | RollingUpdate (reverse ordinal) or OnDelete | RollingUpdate or OnDelete
DNS identity | Service VIP (load-balanced) | Per-Pod DNS via Headless Service | Service VIP (load-balanced)
Self-healing | Yes (replace any Pod) | Yes (replace with same identity) | Yes (replace on same node)
Creation order | Parallel | Sequential (0, 1, 2) | Parallel (one per node)
Deletion order | Parallel | Reverse (N-1, ..., 1, 0) | Parallel
Use case | Stateless APIs, web servers | Databases, Kafka, ZooKeeper, etcd | Log agents, node-exporter, CNI agents

Key takeaways

1. StatefulSets provide three guarantees: stable identity (name + DNS), sticky storage (per-Pod PVC), and ordered operations (sequential creation/deletion).
2. StatefulSet PVCs persist after StatefulSet deletion. Always delete PVCs explicitly when decommissioning. Monitor for orphaned PVCs.
3. OrderedReady is the safe default for quorum-based systems. Parallel breaks consensus bootstrap. Rolling updates always proceed in reverse ordinal order, regardless of podManagementPolicy.
4. OnDelete is the production standard for large data systems where each restart triggers hours of rebalancing.
5. PDBs are mandatory for quorum-based StatefulSets. Set minAvailable to quorum size, not replica count.
6. PVCs are matched by name, not by spec. To change storage class, delete PVCs first, then re-create the StatefulSet.
7. The Headless Service is mandatory for per-Pod DNS. Without it, peer discovery fails and the cluster cannot bootstrap.
Frequently Asked Questions

01. What is the difference between a Deployment and a StatefulSet?
02. Why do StatefulSet Pods need a Headless Service?
03. What happens to PVCs when a StatefulSet is deleted?
04. When should I use OnDelete instead of RollingUpdate?
05. How do I resize a StatefulSet PVC?
06. What is the most common StatefulSet production mistake?