Kubernetes StatefulSets: PVC Orphan Caused 2TB Leak
After deleting a StatefulSet, 2TB unattached disks appeared; new Pods reattached old PVCs ignoring storage class changes.
- Stable identity: Each Pod gets a persistent name (pod-0, pod-1) and a stable DNS entry via a Headless Service.
- Stable storage: Each Pod gets its own PersistentVolumeClaim that follows it across restarts and reschedules.
- Ordered operations: Pods are created sequentially (0, 1, 2) and deleted in reverse (2, 1, 0). Rolling updates also proceed in reverse ordinal order, from the highest ordinal down to 0.
- Headless Service: ClusterIP: None. DNS returns Pod IPs directly. Each Pod is reachable as pod-0.service.ns.svc.cluster.local.
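- Ordered operations are slow. With the default OrderedReady policy, each Pod must be Running and Ready before the next one starts, so a 10-replica StatefulSet can take roughly 10 times as long to deploy as a Deployment that starts its Pods in parallel.
- Parallel mode (podManagementPolicy: Parallel) is faster but can break cluster bootstrap for systems that need quorum.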
- Deleting a StatefulSet without deleting its PVCs. The PVCs persist indefinitely, consuming storage and blocking re-creation of the StatefulSet with different storage config.
Imagine a hotel where every guest always gets the same room number, the same locker, and is always checked in and out in the exact same order. That's a StatefulSet. Unlike a regular Deployment — where pods are interchangeable guests who can sleep in any room — a StatefulSet guarantees each pod has a permanent name, its own private storage, and a predictable position in line. Think of it as the difference between a row of numbered safety deposit boxes (StatefulSet) versus a pile of shopping carts you just grab any one of (Deployment).
Stateless apps are easy. You spin up ten identical pods, kill any three, Kubernetes replaces them — nobody cares. But the real world is full of systems that refuse to be stateless: databases, message brokers, distributed caches, search engines. These systems have opinions. Elasticsearch node 2 needs to rejoin the cluster as Elasticsearch node 2, not as some random newcomer. Kafka broker 0 owns specific partitions and cannot pretend to be a fresh broker without corrupting data.
StatefulSets exist precisely to give Kubernetes the vocabulary to reason about identity, ordering, and sticky storage. They provide three guarantees that Deployments fundamentally cannot: a stable, unique network identity that survives pod restarts; stable, persistent storage that follows the pod around regardless of which node it lands on; and ordered, graceful deployment and scaling.
This is not a getting-started guide. It covers the controller loop internals, PVC ownership tracking, the role of the Headless Service, why pod ordinals matter for rolling updates, and the exact failure modes that bite teams in production.
StatefulSet vs Deployment: When to Use Each
The most common decision Kubernetes operators face is choosing between a Deployment and a StatefulSet. Both manage replica Pods, but they differ fundamentally in how they treat identity, storage, and order.
A Deployment assumes all Pods are interchangeable. Each Pod gets a random name (e.g., myapp-68dcf7d8b4-abc123), can be replaced with any other Pod, and uses ephemeral or shared storage. Deployments scale quickly in parallel and are perfect for stateless services: web servers, REST APIs, worker queues.
A StatefulSet assumes each Pod is unique. Each Pod gets a stable ordinal name (pod-0, pod-1), a persistent DNS entry via a Headless Service, and its own PersistentVolumeClaim that follows it across reschedules. StatefulSets scale one Pod at a time (ordered) and are necessary for stateful workloads: databases (PostgreSQL, Cassandra), message brokers (Kafka, RabbitMQ), distributed consensus systems (etcd, ZooKeeper).
Use a Deployment when:
- Pods do not need stable network identities.
- Storage can be ephemeral or shared (e.g., a stateless API reading from a central database).
- You need fast parallel scaling and rolling updates.
- Pods can be killed and recreated anywhere without impact.
Use a StatefulSet when:
- Each Pod must be addressable by a unique, stable name (e.g., kafka-0, kafka-1).
- Each Pod requires its own persistent storage that must survive restarts and rescheduling.
- Pods need ordered startup and shutdown (e.g., quorum-based systems).
- The application performs leader election or data partitioning that depends on stable identities.
If you are unsure, start with a Deployment. Stateless applications are simpler to scale, debug, and upgrade. Only move to a StatefulSet when you encounter a concrete requirement for stable identity or per-Pod storage. Overusing StatefulSets adds unnecessary complexity and cost.
When neither fits, consider a DaemonSet for running exactly one Pod per node (e.g., log collectors, node monitoring agents) or a Job/CronJob for batch workloads.
- Deployment: Pods are cattle — replaceable, no identity, fast scaling.
- StatefulSet: Pods are pets — named, sticky storage, ordered operations.
- DaemonSet: One pet per node — for node-level agents.
- Job: Single-run pet — for batch tasks.
- Rule of thumb: Start with Deployment, escalate to StatefulSet only when required.
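The rule of thumb above can be written down as a small decision helper. This is only a sketch of the checklist in this section: the `WorkloadNeeds` class, its field names, and `suggest_controller` are hypothetical names introduced here for illustration, not part of any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class WorkloadNeeds:
    """Hypothetical checklist mirroring the criteria listed in this section."""
    stable_network_identity: bool = False
    per_pod_persistent_storage: bool = False
    ordered_startup_shutdown: bool = False
    one_pod_per_node: bool = False
    runs_to_completion: bool = False


def suggest_controller(needs: WorkloadNeeds) -> str:
    """Return the controller kind that the rule of thumb points to."""
    if needs.runs_to_completion:
        return "Job"
    if needs.one_pod_per_node:
        return "DaemonSet"
    if (needs.stable_network_identity
            or needs.per_pod_persistent_storage
            or needs.ordered_startup_shutdown):
        return "StatefulSet"
    # No concrete requirement for identity or per-Pod storage:
    # start with the simpler, stateless option.
    return "Deployment"


if __name__ == "__main__":
    print(suggest_controller(WorkloadNeeds()))
    print(suggest_controller(WorkloadNeeds(per_pod_persistent_storage=True)))
```

The default branch returns a Deployment on purpose: a StatefulSet is suggested only when at least one concrete requirement is set.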
Kubernetes Service Types: ClusterIP, NodePort, LoadBalancer, and Headless
Services abstract Pod-to-Pod communication and external access. Each Service type serves a different purpose and carries different trade-offs. Understanding them is essential for exposing StatefulSet Pods correctly.
ClusterIP (default): Exposes the Service on an internal cluster IP. Only reachable from within the cluster. Use for internal microservice communication. ClusterIP is the most efficient because it does not require external load balancers.
NodePort: Exposes the Service on each Node's IP at a static port (30000-32767). Reachable from outside the cluster by hitting <NodeIP>:<NodePort>. Use for development, debugging, or when you need direct node access. Not recommended for production due to security and port collision issues.
LoadBalancer: Exposes the Service externally via a cloud provider's load balancer (e.g., AWS ELB, GCP Network Load Balancer). Automatically creates a NodePort and ClusterIP behind the scenes. Use for exposing a single Service to the internet. Each Service gets its own load balancer, which incurs hourly cost.
Headless (clusterIP: None): Does not allocate a cluster IP. DNS returns the IPs of all healthy Pods directly. Used primarily with StatefulSets for per-Pod DNS records (pod-0.service.ns.svc.cluster.local). Clients decide which Pod to contact. Also used for custom service discovery.
| Type | Cluster IP | External Access | Use Case | Cost |
|---|---|---|---|---|
| ClusterIP | Yes | No | Internal microservices | Free |
| NodePort | Yes | NodeIP:Port | Dev/Test, bare-metal | Free |
| LoadBalancer | Yes | Cloud LB | Single-service exposure | Per LB/hour |
| Headless | No (None) | DNS-based Pod IPs | StatefulSet peer discovery | Free |
When exposing a StatefulSet externally, you typically use a LoadBalancer or Ingress for the entire cluster (all Pods), not per-Pod. For inter-Pod communication within the StatefulSet, you always use a Headless Service.
- Headless Service: clusterIP: None — creates per-Pod A/AAAA records.
- Regular Service: clusterIP: set — creates a single virtual IP and load-balances.
- StatefulSet spec.serviceName must match the Headless Service metadata.name.
- DNS name format: <pod-name>.<service-name>.<namespace>.svc.cluster.local.
- Used by Cassandra, Kafka, ZooKeeper, etc. for seed discovery.
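To make the DNS name format concrete, here is a minimal sketch that builds the per-Pod names a Headless Service would expose. The function name `pod_dns_names` and the sample values (`kafka`, `kafka-headless`, `production`) are illustrative assumptions, and `cluster.local` is assumed as the cluster domain because it is the one used in the examples in this article.

```python
def pod_dns_names(sts_name, service_name, namespace, replicas,
                  cluster_domain="cluster.local"):
    """Build <pod-name>.<service-name>.<namespace>.svc.<cluster-domain>
    for every ordinal from 0 to replicas - 1."""
    return [
        f"{sts_name}-{ordinal}.{service_name}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


if __name__ == "__main__":
    # Hypothetical 3-broker StatefulSet named "kafka" behind a Headless
    # Service named "kafka-headless" in the "production" namespace.
    for name in pod_dns_names("kafka", "kafka-headless", "production", 3):
        print(name)
    # kafka-0.kafka-headless.production.svc.cluster.local
    # kafka-1.kafka-headless.production.svc.cluster.local
    # kafka-2.kafka-headless.production.svc.cluster.local
```

A list like this is what a seed or peer configuration for such a system would typically contain, since the names stay valid even when Pod IPs change.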
Ingress vs LoadBalancer: Choosing the Right External Exposure Mechanism
When you need to expose a StatefulSet (or any service) to the internet, you have two primary options: a LoadBalancer Service or an Ingress resource. The choice depends on protocol, routing requirements, and cost.
LoadBalancer Service: Creates a cloud load balancer (e.g., AWS ELB, GCP TCP LB) that forwards traffic directly to your Service's Pods. Operates at Layer 4 (TCP/UDP). Each LoadBalancer Service gets its own static IP or DNS name. Simple to set up, but each one is a separate billable resource. Best for non-HTTP traffic such as database connections, for long-lived gRPC or WebSocket connections that do not need Layer 7 routing, or when you need a single service exposed with minimal configuration.
Ingress: A cluster-level resource that provides HTTP(S) routing rules. Requires an Ingress controller (e.g., NGINX Ingress, Istio Gateway, AWS Load Balancer Controller). The controller typically runs as a DaemonSet or Deployment and is itself exposed via a LoadBalancer Service. Ingress operates at Layer 7, allowing path-based routing (e.g., /api -> service-a, /web -> service-b), host-based routing (api.example.com -> service-a), SSL termination, and rate limiting. Multiple Ingress rules can share the same underlying LoadBalancer, saving money.
Decision Matrix:
| Criteria | LoadBalancer Service | Ingress |
|---|---|---|
| Protocol | TCP, UDP, HTTP | HTTP, HTTPS, gRPC (with controller) |
| Routing | No (single target) | Path, host, headers |
| SSL termination | Manual (annotation) | Built-in (cert-manager) |
| Cost per service | One LB per Service | One LB for many Ingresses |
| Setup complexity | Low | Medium (controller required) |
| Use case | Database, non-HTTP, simple apps | HTTP APIs, web apps, microservices |
For HTTP workloads with multiple services, use Ingress. For non-HTTP workloads or when you need absolute simplicity, use LoadBalancer. You can also combine both: an Ingress controller exposed via a LoadBalancer Service.
- Ingress controllers: NGINX, HAProxy, Traefik, AWS LB Controller, Istio Gateway.
- Ingress resources define routing rules; the controller implements them.
- The controller itself is often exposed via a LoadBalancer Service.
- Cert-manager can automate SSL certificate provisioning for Ingresses.
- Ingress supports sticky sessions, rate limiting, and canary releases via annotations.
Stable Identity: Network Names and Pod Ordinals
The defining feature of a StatefulSet is stable identity. Each Pod receives a unique, predictable name based on the StatefulSet name and an ordinal index: <statefulset-name>-0, <statefulset-name>-1, <statefulset-name>-2. This identity persists across restarts, reschedules, and even node failures. If pod-2 is rescheduled to a different node, it is still pod-2.
This identity extends to DNS. When a StatefulSet specifies a Headless Service via spec.serviceName, Kubernetes creates DNS A records for each Pod: pod-0.service.namespace.svc.cluster.local. These DNS names resolve directly to the Pod's IP address. When the Pod restarts with a new IP, the DNS record is updated automatically.
This is fundamentally different from Deployments, where Pods are interchangeable and have random names. Stateful systems rely on this identity for peer discovery, leader election, and data partitioning. A Kafka broker must rejoin the cluster with the same identity to resume ownership of its partitions.
- Pod name: stable across restarts. postgres-2 is always postgres-2.
- DNS: pod-0.service.ns.svc.cluster.local. Updated on Pod IP change.
- Storage: PVC follows the Pod. Same PVC is re-attached on reschedule.
- Ordinal: determines creation order (0, 1, 2) and deletion order (2, 1, 0).
- Headless Service is required. Without it, DNS records are not created.
--attach-detach-reconcile-sync-period). During this window, the Pod cannot start and the StatefulSet cannot proceed to the next ordinal. For critical databases, this delay can cause quorum loss in a 3-node cluster.
Ordered Operations: Creation, Deletion, and Rolling Updates
StatefulSets enforce strict ordering on all lifecycle operations. Pods are created sequentially from ordinal 0 to N-1. Pod N is not created until Pod N-1 is Running and Ready. Pods are deleted in reverse order: N-1 first, then N-2, down to 0. Rolling updates follow the deletion order, replacing the highest ordinal first and working down to 0.
This ordering is critical for systems that need quorum during bootstrap. A 3-node etcd cluster needs at least 2 nodes to form quorum. If all 3 Pods start simultaneously, none can form quorum because they all try to discover peers that do not exist yet. Ordered creation ensures pod-0 starts first, pod-1 joins pod-0, and pod-2 joins the existing 2-node cluster.
The podManagementPolicy field controls this behavior. The default is OrderedReady: Pods are created and deleted one at a time in ordinal order. The alternative is Parallel: Pods are created and deleted simultaneously, like a Deployment. Parallel is faster but unsafe for quorum bootstrap.
- OrderedReady: sequential creation. Safe for quorum. Slow at scale.
- Parallel: simultaneous creation. Fast. Unsafe for quorum bootstrap.
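- Rolling updates always proceed in reverse ordinal order (highest first) regardless of podManagementPolicy, which only affects scaling.
- Scale-down always removes the highest ordinals first. Under Parallel, the controller does not wait for one Pod to terminate before deleting the next.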
- OnDelete update strategy: Pods are not updated until manually deleted. Gives full control.
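The OrderedReady behavior for creation and scale-down can be illustrated with a toy simulation. This is not the StatefulSet controller's code, only a sketch of the ordering rules described in this section; the function names and the `is_ready` callback are hypothetical.

```python
def ordered_ready_creation(sts_name, replicas, is_ready):
    """Simulate OrderedReady creation: Pods are created from ordinal 0
    upward, and creation stops at the first Pod that does not become Ready."""
    created = []
    for ordinal in range(replicas):
        pod = f"{sts_name}-{ordinal}"
        created.append(pod)
        if not is_ready(pod):
            # Later ordinals are not created until this Pod is Ready.
            break
    return created


def scale_down_order(sts_name, current_replicas, target_replicas):
    """Return the Pods removed on scale-down, highest ordinal first."""
    return [
        f"{sts_name}-{ordinal}"
        for ordinal in range(current_replicas - 1, target_replicas - 1, -1)
    ]


if __name__ == "__main__":
    # All Pods become Ready: every ordinal is created in order.
    print(ordered_ready_creation("etcd", 3, lambda pod: True))
    # ['etcd-0', 'etcd-1', 'etcd-2']

    # etcd-1 never becomes Ready: etcd-2 is never created.
    print(ordered_ready_creation("etcd", 3, lambda pod: pod != "etcd-1"))
    # ['etcd-0', 'etcd-1']

    # Scaling from 5 replicas down to 3 removes etcd-4, then etcd-3.
    print(scale_down_order("etcd", 5, 3))
    # ['etcd-4', 'etcd-3']
```

The second call shows why one unhealthy ordinal stalls everything behind it: the loop never reaches the Pods with higher ordinals.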
kubectl rollout status statefulset/<name>: if a single ordinal is stuck, the entire rollout blocks.
PersistentVolumeClaim Lifecycle: Ownership, Orphans, and Reclaim
The volumeClaimTemplates field in a StatefulSet is a template for creating PVCs. When a StatefulSet Pod is created, Kubernetes creates a PVC from the template with a deterministic name: <template-name>-<statefulset-name>-<ordinal>. For example, a StatefulSet named postgres with a volumeClaimTemplate named data creates PVCs: data-postgres-0, data-postgres-1, data-postgres-2.
These PVCs are created on behalf of the StatefulSet but, by default, are NOT deleted when the StatefulSet is deleted. This is by design: the data must persist so it can be re-attached if the StatefulSet is re-created. However, this creates a common production pitfall: orphaned PVCs that consume storage indefinitely.
The reclaim policy on the underlying StorageClass determines what happens to the PersistentVolume when the PVC is finally deleted. Retain keeps the PV and its data. Delete removes the PV and its data. A StorageClass that does not set reclaimPolicy defaults to Delete, so check the policy on the classes your cloud provider ships.
- PVC name is deterministic: <template>-<sts-name>-<ordinal>.
- Kubernetes matches by name. Existing PVCs are re-attached, new specs are ignored.
- The PVC storage class is immutable after creation. The requested size can only be increased, and only when the StorageClass allows expansion.
- To change storage class: delete StatefulSet, delete PVCs, re-create StatefulSet.
- To resize PVC: set allowVolumeExpansion: true on StorageClass, then edit PVC spec.resources.requests.storage.
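The name-matching rule is easy to see in a small model. The sketch below is an illustration of the behavior described above, not Kubernetes code; `pvc_names` and `resolve_claims` are hypothetical helpers, and the storage class names `gp2` and `gp3` are placeholder values.

```python
def pvc_names(template_name, sts_name, replicas):
    """Deterministic PVC names: <template>-<sts-name>-<ordinal>."""
    return [f"{template_name}-{sts_name}-{i}" for i in range(replicas)]


def resolve_claims(template_name, sts_name, replicas,
                   template_storage_class, existing_claims):
    """Model which storage class each Pod ends up with.

    existing_claims maps PVC name -> storage class of PVCs already in the
    namespace. A PVC that already exists under the expected name is reused
    as-is, and the storage class in the template is ignored for it.
    """
    result = {}
    for name in pvc_names(template_name, sts_name, replicas):
        if name in existing_claims:
            result[name] = ("reused", existing_claims[name])
        else:
            result[name] = ("created", template_storage_class)
    return result


if __name__ == "__main__":
    # Two PVCs were left behind by a deleted StatefulSet on class "gp2".
    leftovers = {"data-postgres-0": "gp2", "data-postgres-1": "gp2"}
    # The StatefulSet is re-created with "gp3" in volumeClaimTemplates.
    plan = resolve_claims("data", "postgres", 3, "gp3", leftovers)
    for name, (action, storage_class) in plan.items():
        print(f"{name}: {action} ({storage_class})")
    # data-postgres-0: reused (gp2)
    # data-postgres-1: reused (gp2)
    # data-postgres-2: created (gp3)
```

Only the ordinal without a leftover PVC picks up the new class, which is why a storage class change appears to be ignored after re-creating a StatefulSet.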
Update Strategies: RollingUpdate vs OnDelete
StatefulSets support two update strategies: RollingUpdate (default) and OnDelete. The choice determines how Pod template changes (image update, env var change) are propagated to existing Pods.
RollingUpdate updates Pods one at a time in reverse ordinal order (highest ordinal first), waiting for each updated Pod to be Ready before proceeding to the next. This is the safe default but slow. The maxUnavailable field (introduced as an alpha feature in Kubernetes 1.24, behind the MaxUnavailableStatefulSet feature gate) controls how many Pods can be unavailable during the update.
OnDelete does not automatically update Pods. When the Pod template is changed, existing Pods continue running the old spec. The update is applied only when a Pod is manually deleted. Kubernetes recreates the Pod with the new spec. This gives full control over update timing but requires manual intervention.
PodDisruptionBudgets and StatefulSet Availability
PodDisruptionBudgets (PDBs) are critical for StatefulSets. They prevent the voluntary disruption controller from evicting too many Pods simultaneously during node drains, cluster upgrades, or preemptions. The controller does not intentionally evict more Pods than the budget allows.
For a 3-node etcd cluster, set minAvailable: 2. This ensures that a node drain cannot break quorum. If the drain would evict a second Pod, the eviction is refused until the first evicted Pod is rescheduled and Ready.
PDBs only block voluntary disruption (drain, upgrade, preemption). They do NOT protect against involuntary disruption (node crash, OOMKill, kernel panic). This distinction is critical: PDBs are a guardrail for planned maintenance, not a safety net for unplanned failures.
- minAvailable: 2 on a 3-replica cluster = 1 Pod can be evicted.
- maxUnavailable: 1 on a 3-replica cluster = same result, different expression.
- PDBs block voluntary disruption only (drain, upgrade, preemption).
- They do NOT protect against involuntary disruption (crash, OOMKill).
- Setting minAvailable equal to replica count blocks all maintenance. Use quorum size instead.
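The arithmetic behind these bullets is simple enough to check by hand. The following sketch assumes integer values for minAvailable and maxUnavailable (percentages are not modeled), and the function names are hypothetical helpers rather than a Kubernetes API.

```python
def quorum_size(members):
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def allowed_with_min_available(healthy_pods, min_available):
    """Evictions a PDB with minAvailable would still permit."""
    return max(0, healthy_pods - min_available)


def allowed_with_max_unavailable(replicas, healthy_pods, max_unavailable):
    """Evictions a PDB with maxUnavailable would still permit."""
    desired_healthy = replicas - max_unavailable
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    print(quorum_size(3))                          # 2
    print(allowed_with_min_available(3, 2))        # 1
    print(allowed_with_max_unavailable(3, 3, 1))   # 1
    # One Pod is already down: no further eviction is permitted.
    print(allowed_with_min_available(2, 2))        # 0
    # minAvailable equal to the replica count blocks every drain.
    print(allowed_with_min_available(3, 3))        # 0
```

Setting minAvailable to `quorum_size(replicas)` keeps a majority available during drains while still leaving room for maintenance.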
StatefulSet PVC Orphan Caused 2TB Storage Leak and Blocked Cluster Migration
1. Deleted the orphaned PVCs: kubectl delete pvc data-postgres-0 data-postgres-1 data-postgres-2 data-postgres-3 data-postgres-4 -n production
2. Verified the underlying PersistentVolumes were released and deleted (or set reclaimPolicy: Delete for the StorageClass).
3. Re-created the StatefulSet with the new storage class in volumeClaimTemplates.
4. Added a cleanup script to the team's runbook that explicitly deletes PVCs after StatefulSet deletion.
5. Set up monitoring for unbound PVCs: alert when PVCs exist without a bound Pod for more than 1 hour.
- StatefulSet PVCs are NOT deleted when the StatefulSet is deleted. They persist indefinitely unless explicitly removed.
- PVC names are deterministic: data-<statefulset-name>-<ordinal>. Kubernetes matches by name, not by spec. Changing the storage class in the template has no effect on existing PVCs.
- Always delete PVCs explicitly when decommissioning a StatefulSet. Add this to your runbook.
- Monitor for unbound PVCs. They consume storage and can cause billing surprises.
- Before migrating storage classes, back up data, delete StatefulSet, delete PVCs, then re-create with new storage class.
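The monitoring rule from the incident (alert on PVCs that no Pod has used for more than an hour) could be sketched as below. This is one possible approach under stated assumptions: the function `find_orphaned_pvcs` is hypothetical, the inputs are plain Python values that would in practice be gathered from the cluster (for example, claim names taken from each Pod's `spec.volumes[].persistentVolumeClaim.claimName`), and PVC creation time is used as a rough stand-in for "time since last mounted".

```python
from datetime import datetime, timedelta, timezone


def find_orphaned_pvcs(pvc_created_at, mounted_claims, now,
                       grace=timedelta(hours=1)):
    """Return PVC names that no Pod references and that are older than
    the grace period.

    pvc_created_at: dict mapping PVC name -> creation time (aware datetime)
    mounted_claims: set of PVC names referenced by Pods in the namespace
    """
    return sorted(
        name
        for name, created in pvc_created_at.items()
        if name not in mounted_claims and now - created > grace
    )


if __name__ == "__main__":
    # Sample data for illustration only.
    now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
    pvcs = {
        "data-postgres-0": now - timedelta(days=30),    # leftover, unused
        "data-postgres-1": now - timedelta(days=30),    # still mounted
        "data-reports-0": now - timedelta(minutes=10),  # new, Pod still starting
    }
    mounted = {"data-postgres-1"}
    print(find_orphaned_pvcs(pvcs, mounted, now))
    # ['data-postgres-0']
```

The grace period keeps freshly created PVCs from triggering an alert while their Pod is still being scheduled.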
- Run kubectl describe pvc data-<sts-name>-<ordinal>. If the PVC is Pending, the StorageClass may not exist or the provisioner may be down. Check node affinity: the PV may be bound to a specific node that is unavailable.
- Run kubectl get pvc -n <ns>. If the PV is still attached to the old node, force-detach it or wait for the attach-detach controller to time out (6 minutes by default).
- Run kubectl rollout status statefulset/<name> to see which ordinal is blocking.
- If StatefulSet Pods cannot resolve each other by DNS name, confirm that the Headless Service sets clusterIP: None. Check that spec.serviceName in the StatefulSet matches the Headless Service name. Test DNS resolution: kubectl exec pod-0 -- nslookup pod-1.<service>.<namespace>.svc.cluster.local.
- Delete orphaned PVCs by name, for example kubectl delete pvc data-<sts-name>-0 -n <ns>. kubectl does not expand a * wildcard in resource names, so list each PVC explicitly.
- Run kubectl get events -n <ns> --field-selector involvedObject.name=data-<sts-name>-<ordinal>.