
Kubernetes StatefulSets Explained — Internals, Ordering, and Production Gotchas

Kubernetes StatefulSets deep dive: stable identity, ordered rollouts, PVC lifecycle, headless services, and production gotchas every senior engineer must know.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • StatefulSets provide three guarantees: stable identity (name + DNS), sticky storage (per-Pod PVC), and ordered operations (sequential creation/deletion).
  • StatefulSet PVCs persist after StatefulSet deletion. Always delete PVCs explicitly when decommissioning. Monitor for orphaned PVCs.
  • OrderedReady is the safe default for quorum-based systems. Parallel breaks consensus bootstrap. Rolling updates always proceed in reverse ordinal order.
Quick Answer
  • Stable identity: Each Pod gets a persistent name (pod-0, pod-1) and a stable DNS entry via a Headless Service.
  • Stable storage: Each Pod gets its own PersistentVolumeClaim that follows it across restarts and reschedules.
  • Ordered operations: Pods are created sequentially (0, 1, 2) and deleted in reverse (2, 1, 0). Rolling updates proceed in reverse ordinal order, like deletion.
  • Headless Service: clusterIP: None. DNS returns Pod IPs directly. Each Pod is reachable as pod-0.service.ns.svc.cluster.local.
  • Ordered operations are slow. A 10-replica StatefulSet can take roughly 10x longer to deploy than an equivalent Deployment, because each Pod must be Ready before the next starts.
  • Parallel mode (podManagementPolicy: Parallel) is faster but breaks cluster bootstrap for systems that need quorum.
  • Deleting a StatefulSet without deleting its PVCs. The PVCs persist indefinitely, consuming storage and blocking re-creation of the StatefulSet with different storage config.
🚨 START HERE
StatefulSet Triage Commands
Rapid commands to isolate StatefulSet lifecycle and storage issues.
🟡 Pod stuck in Pending.
Immediate Action: Check PVC binding status and StorageClass.
Commands
kubectl describe pvc data-<sts-name>-<ordinal> -n <namespace>
kubectl get storageclass
Fix Now: If the PVC is Pending, the StorageClass provisioner may be down. Check PVC events: `kubectl get events -n <ns> --field-selector involvedObject.name=data-<sts-name>-<ordinal>`.
🟡 Pod stuck in ContainerCreating with volume attach error.
Immediate Action: Check if the PV is still attached to a previous node.
Commands
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
kubectl get volumeattachment | grep <pv-name>
Fix Now: If the VolumeAttachment shows the PV attached to a dead node, delete the VolumeAttachment object. The attach-detach controller will retry on the new node.
🟡 Rolling update stuck.
Immediate Action: Identify the blocking ordinal.
Commands
kubectl rollout status statefulset/<name> -n <namespace> --timeout=30s
kubectl get pods -n <namespace> -l app=<label> --sort-by=.metadata.name -o wide | grep -v Running
Fix Now: If a specific ordinal is not Ready, check its logs and readiness probe. Fix the issue and the rollout will automatically proceed to the next ordinal.
🟡 PVCs consuming unexpected storage after StatefulSet deletion.
Immediate Action: List orphaned PVCs.
Commands
kubectl get pvc -n <namespace> | grep <sts-name>
kubectl describe pvc data-<sts-name>-0 -n <namespace> | grep -A 5 Status
Fix Now: If PVCs exist without a bound Pod, they are orphaned. Delete them explicitly by name (kubectl does not expand wildcards): `kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 -n <namespace>`. Check the reclaim policy on the StorageClass.
🟡 DNS resolution failing between StatefulSet Pods.
Immediate Action: Verify the Headless Service and serviceName match.
Commands
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.clusterIP}'
kubectl get statefulset <sts-name> -n <namespace> -o jsonpath='{.spec.serviceName}'
Fix Now: If clusterIP is not 'None', the Service is not Headless. If serviceName does not match the Headless Service name, DNS records are not created. Fix the mismatch.
Production Incident: StatefulSet PVC Orphan Caused 2TB Storage Leak and Blocked Cluster Migration
A team deleted a 5-replica PostgreSQL StatefulSet to migrate to a new storage class. The Pods were deleted but the 5 PVCs persisted, consuming 2TB of premium SSD storage. When they tried to re-create the StatefulSet with the new storage class, the old PVCs were re-attached, ignoring the new storage configuration.
Symptom: After deleting the StatefulSet, the cloud bill showed 2TB of unattached persistent disks. When the new StatefulSet was created, Pods attached to the old PVCs with the old storage class instead of the new ones. The team could not understand why the new storage configuration was not being applied.
Assumption: Deleting the StatefulSet would clean up all associated resources, including PVCs.
Root cause: StatefulSets use a PersistentVolumeClaim template (volumeClaimTemplates) that creates a PVC for each Pod. These PVCs are created by the StatefulSet but are retained by default (the persistentVolumeClaimRetentionPolicy whenDeleted behavior defaults to Retain). When the StatefulSet is deleted, the PVCs are NOT deleted — they persist in the namespace, retaining their data and their storage class. When a new StatefulSet with the same name and PVC template names is created, Kubernetes matches the existing PVCs by name and re-attaches them, completely ignoring any changes to the storage class or size in the new template.
Fix:
1. Manually deleted the orphaned PVCs: kubectl delete pvc data-postgres-0 data-postgres-1 data-postgres-2 data-postgres-3 data-postgres-4 -n production.
2. Verified the underlying PersistentVolumes were released and deleted (or set reclaimPolicy: Delete on the StorageClass).
3. Re-created the StatefulSet with the new storage class in volumeClaimTemplates.
4. Added a cleanup script to the team's runbook that explicitly deletes PVCs after StatefulSet deletion.
5. Set up monitoring for unbound PVCs: alert when PVCs exist without a bound Pod for more than 1 hour.
Key Lesson
  • StatefulSet PVCs are NOT deleted when the StatefulSet is deleted. They persist indefinitely unless explicitly removed.
  • PVC names are deterministic: data-<statefulset-name>-<ordinal>. Kubernetes matches by name, not by spec. Changing the storage class in the template has no effect on existing PVCs.
  • Always delete PVCs explicitly when decommissioning a StatefulSet. Add this to your runbook.
  • Monitor for unbound PVCs. They consume storage and can cause billing surprises.
  • Before migrating storage classes, back up data, delete the StatefulSet, delete the PVCs, then re-create with the new storage class.
Production Debug Guide
Symptom-first investigation path for StatefulSet failures.
StatefulSet Pod stuck in Pending.
Check if the PVC is bound. Run kubectl describe pvc data-<sts-name>-<ordinal>. If the PVC is Pending, the StorageClass may not exist or the provisioner may be down. Check node affinity — the PV may be bound to a specific node that is unavailable.
StatefulSet Pod stuck in CrashLoopBackOff after node failure.
The Pod is likely trying to re-attach a PVC that is still attached to the old (failed) node. Check PVC status: kubectl get pvc -n <ns>. If the PV is still attached to the old node, force-detach it or wait for the attach-detach controller to time out (6 minutes by default).
Rolling update stuck on a specific ordinal (e.g., pod-3).
StatefulSets update Pods in reverse ordinal order, highest first. If pod-3 is not Ready, the controller will not move on to pod-2. Check pod-3's readiness probe, logs, and events. Use kubectl rollout status statefulset/<name> to see which ordinal is blocking.
StatefulSet scale-down stuck. Pods not being deleted.
StatefulSets delete Pods in reverse ordinal order (highest first). If pod-2 is not terminating, pod-1 and pod-0 will not be deleted. Check for finalizers on the Pod, PVC detach issues, or PodDisruptionBudget conflicts.
StatefulSet Pods cannot resolve each other by DNS name.
Verify the Headless Service exists and has clusterIP: None. Check that spec.serviceName in the StatefulSet matches the Headless Service name. Test DNS resolution: kubectl exec pod-0 -- nslookup pod-1.<service>.<namespace>.svc.cluster.local.
New StatefulSet Pod attached to old PVC with wrong data.
This is expected behavior. Kubernetes matches PVCs by name. If you re-create a StatefulSet with the same name, it re-attaches the existing PVCs. To start fresh, delete the PVCs first by name, e.g. kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 -n <namespace>.
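The orphaned-PVC symptom above lends itself to a scripted check. A minimal Python sketch of the matching logic; in practice you would populate the inputs from `kubectl get pvc -o json` and `kubectl get pods -o json`, and all names here are illustrative:

```python
def find_orphaned_pvcs(pvc_names, pod_claims):
    """Return PVC names not referenced by any running Pod.

    pvc_names: iterable of PVC names in the namespace.
    pod_claims: dict mapping Pod name to the PVC names it mounts.
    """
    claimed = {pvc for claims in pod_claims.values() for pvc in claims}
    return sorted(set(pvc_names) - claimed)

# After scaling postgres from 5 to 3 replicas, two PVCs are left behind.
pvcs = [f"data-postgres-{i}" for i in range(5)]
pods = {f"postgres-{i}": [f"data-postgres-{i}"] for i in range(3)}
print(find_orphaned_pvcs(pvcs, pods))
# ['data-postgres-3', 'data-postgres-4']
```

A cron job running this comparison weekly catches the silent storage leaks described in the incident above.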

Stateless apps are easy. You spin up ten identical pods, kill any three, Kubernetes replaces them — nobody cares. But the real world is full of systems that refuse to be stateless: databases, message brokers, distributed caches, search engines. These systems have opinions. Elasticsearch node 2 needs to rejoin the cluster as Elasticsearch node 2, not as some random newcomer. Kafka broker 0 owns specific partitions and cannot pretend to be a fresh broker without corrupting data.

StatefulSets exist precisely to give Kubernetes the vocabulary to reason about identity, ordering, and sticky storage. They provide three guarantees that Deployments fundamentally cannot: a stable, unique network identity that survives pod restarts; stable, persistent storage that follows the pod around regardless of which node it lands on; and ordered, graceful deployment and scaling.

This is not a getting-started guide. It covers the controller loop internals, PVC ownership tracking, the role of the Headless Service, why pod ordinals matter for rolling updates, and the exact failure modes that bite teams in production.

Stable Identity: Network Names and Pod Ordinals

The defining feature of a StatefulSet is stable identity. Each Pod receives a unique, predictable name based on the StatefulSet name and an ordinal index: <statefulset-name>-0, <statefulset-name>-1, <statefulset-name>-2. This identity persists across restarts, reschedules, and even node failures. If pod-2 is rescheduled to a different node, it is still pod-2.

This identity extends to DNS. When a StatefulSet specifies a Headless Service via spec.serviceName, Kubernetes creates DNS A records for each Pod: pod-0.service.namespace.svc.cluster.local. These DNS names resolve directly to the Pod's IP address. When the Pod restarts with a new IP, the DNS record is updated automatically.

This is fundamentally different from Deployments, where Pods are interchangeable and have random names. Stateful systems rely on this identity for peer discovery, leader election, and data partitioning. A Kafka broker must rejoin the cluster with the same identity to resume ownership of its partitions.
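The DNS naming scheme above is mechanical enough to write down as a helper. A minimal Python sketch (the function name is illustrative, not a Kubernetes API):

```python
def pod_dns_name(sts_name: str, ordinal: int, service: str, namespace: str) -> str:
    """Stable per-Pod DNS name published via the Headless Service:
    <sts-name>-<ordinal>.<service>.<namespace>.svc.cluster.local
    """
    return f"{sts_name}-{ordinal}.{service}.{namespace}.svc.cluster.local"

print(pod_dns_name("postgres", 2, "postgres", "production"))
# postgres-2.postgres.production.svc.cluster.local
```

Because the name depends only on the StatefulSet name, ordinal, Service, and namespace, peers can compute each other's addresses before any Pod exists — which is exactly what cluster bootstrap configuration relies on.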

io/thecodeforge/k8s/statefulset-identity.yaml · YAML
# StatefulSet with stable identity and Headless Service.
# Each Pod is reachable as: postgres-0.postgres.production.svc.cluster.local
# Package: io.thecodeforge.k8s

# Headless Service: DNS returns Pod IPs directly.
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: production
spec:
  clusterIP: None              # Headless: no virtual IP
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
---
# StatefulSet: stable identity, ordered operations, sticky storage.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres        # Must match the Headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          readinessProbe:
            exec:
              command:
                - pg_isready
                - -U
                - postgres
            initialDelaySeconds: 5
            periodSeconds: 5
  # volumeClaimTemplates: creates a PVC for each Pod.
  # PVC name format: <template-name>-<statefulset-name>-<ordinal>
  # e.g., data-postgres-0, data-postgres-1, data-postgres-2
  # These PVCs persist even after the StatefulSet is deleted.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
▶ Output
# Pods created in order: postgres-0, postgres-1, postgres-2
# Each with its own PVC: data-postgres-0, data-postgres-1, data-postgres-2
# DNS entries:
# postgres-0.postgres.production.svc.cluster.local -> <pod-0-ip>
# postgres-1.postgres.production.svc.cluster.local -> <pod-1-ip>
# postgres-2.postgres.production.svc.cluster.local -> <pod-2-ip>
Mental Model
Identity Is Not Just a Name
This is why you cannot put a StatefulSet behind a regular ClusterIP Service for peer discovery. A ClusterIP Service load-balances across all Pods — you cannot address postgres-2 specifically. You need the Headless Service for individual Pod addressing.
  • Pod name: stable across restarts. postgres-2 is always postgres-2.
  • DNS: pod-0.service.ns.svc.cluster.local. Updated on Pod IP change.
  • Storage: PVC follows the Pod. Same PVC is re-attached on reschedule.
  • Ordinal: determines creation order (0, 1, 2) and deletion order (2, 1, 0).
  • Headless Service is required. Without it, DNS records are not created.
📊 Production Insight
When a StatefulSet Pod is rescheduled to a different node, there is a window where the Pod is Pending because the PVC is still attached to the old node. The attach-detach controller must detach the PV from the old node before it can be attached to the new node. This takes up to 6 minutes by default (controlled by --attach-detach-reconcile-sync-period). During this window, the Pod cannot start and the StatefulSet cannot proceed to the next ordinal. For critical databases, this delay can cause quorum loss in a 3-node cluster.
🎯 Key Takeaway
StatefulSet identity is three guarantees: stable name, stable DNS, and sticky storage. All three must persist across restarts. The Headless Service is mandatory for DNS-based peer discovery. PVC attachment delays during node failover can block the entire StatefulSet.

Ordered Operations: Creation, Deletion, and Rolling Updates

StatefulSets enforce strict ordering on all lifecycle operations. Pods are created sequentially from ordinal 0 to N-1. Pod N is not created until Pod N-1 is Running and Ready. Pods are deleted in reverse order: N-1 first, then N-2, down to 0. Rolling updates follow the same reverse order, updating the highest ordinal first.

This ordering is critical for systems that need quorum during bootstrap. A 3-node etcd cluster needs at least 2 nodes to form quorum. If all 3 Pods start simultaneously, none can form quorum because they all try to discover peers that do not exist yet. Ordered creation ensures pod-0 starts first, pod-1 joins pod-0, and pod-2 joins the existing 2-node cluster.
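The quorum arithmetic behind this is worth making explicit. A small Python sketch of majority-quorum math (helper names are illustrative):

```python
def quorum_size(replicas: int) -> int:
    """Minimum members needed for a majority quorum (Raft, Paxos, etcd)."""
    return replicas // 2 + 1

def tolerable_failures(replicas: int) -> int:
    """How many members can be lost while quorum still holds."""
    return replicas - quorum_size(replicas)

# A 3-node etcd cluster needs 2 members for quorum and tolerates 1 failure.
print(quorum_size(3), tolerable_failures(3))  # 2 1
print(quorum_size(5), tolerable_failures(5))  # 3 2
```

Note that going from 3 to 4 replicas raises the quorum to 3 without improving fault tolerance, which is why consensus clusters are sized with odd replica counts.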

The podManagementPolicy field controls this behavior. The default is OrderedReady: Pods are created and deleted one at a time in ordinal order. The alternative is Parallel: Pods are created and deleted simultaneously, like a Deployment. Parallel is faster but breaks systems that need ordered bootstrap.

io/thecodeforge/k8s/statefulset-ordering.yaml · YAML
# StatefulSet with OrderedReady (default) and Parallel comparison.
# Package: io.thecodeforge.k8s

# OrderedReady: Pods created one at a time. Slow but safe for quorum-based systems.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd-cluster
  namespace: production
spec:
  serviceName: etcd
  replicas: 3
  podManagementPolicy: OrderedReady  # Default. Sequential creation/deletion.
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.5.12
          ports:
            - containerPort: 2379
              name: client
            - containerPort: 2380
              name: peer
          env:
            - name: ETCD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: ETCD_INITIAL_CLUSTER
              value: "etcd-0=http://etcd-0.etcd:2380,etcd-1=http://etcd-1.etcd:2380,etcd-2=http://etcd-2.etcd:2380"
            - name: ETCD_INITIAL_CLUSTER_STATE
              value: "new"
            - name: ETCD_INITIAL_CLUSTER_TOKEN
              value: "etcd-cluster-token"
            - name: ETCD_DATA_DIR
              value: "/var/lib/etcd/data"
          volumeMounts:
            - name: data
              mountPath: /var/lib/etcd
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
# Parallel: All Pods created simultaneously. Fast but unsafe for quorum bootstrap.
# Use only for systems that do not require ordered startup.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
  namespace: production
spec:
  serviceName: redis
  replicas: 3
  podManagementPolicy: Parallel    # All Pods created at once.
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
▶ Output
# etcd-cluster: Pods created in order etcd-0, etcd-1, etcd-2. Each waits for the previous to be Ready.
# redis-cache: All 3 Pods created simultaneously.
Mental Model
OrderedReady vs Parallel: The Quorum Problem
Rule of thumb: use OrderedReady for systems that use consensus protocols (Raft, Paxos). Use Parallel for systems that are independent per Pod (Redis standalone, independent workers).
  • OrderedReady: sequential creation. Safe for quorum. Slow at scale.
  • Parallel: simultaneous creation. Fast. Unsafe for quorum bootstrap.
  • Rolling updates always proceed in reverse ordinal order regardless of podManagementPolicy.
  • Scale-down always follows reverse ordinal order regardless of podManagementPolicy.
  • OnDelete update strategy: Pods are not updated until manually deleted. Gives full control.
📊 Production Insight
Rolling updates on StatefulSets are slow by design. If a StatefulSet has 10 replicas and each Pod takes 60 seconds to become Ready, a rolling update takes at least 10 minutes. For large StatefulSets, consider using the OnDelete update strategy: update the spec, then manually delete Pods one at a time during maintenance windows. This gives you full control over timing and prevents unexpected updates during peak traffic. Monitor kubectl rollout status statefulset/<name> — if a single ordinal is stuck, the entire rollout blocks.
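The back-of-the-envelope math in the insight above can be sketched directly (assuming strictly one Pod restarting at a time, the OrderedReady behavior; the function name is illustrative):

```python
def min_rollout_seconds(replicas: int, seconds_until_ready: int) -> int:
    """Lower bound for a StatefulSet rolling update: each Pod is deleted,
    recreated, and must pass its readiness probe before the next ordinal
    is touched. Real rollouts add detach/attach and scheduling delays."""
    return replicas * seconds_until_ready

# 10 replicas x 60s readiness = at least 10 minutes, excluding volume delays.
print(min_rollout_seconds(10, 60))  # 600
```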
🎯 Key Takeaway
OrderedReady is the safe default for quorum-based systems. Parallel is faster but breaks consensus bootstrap. Rolling updates always proceed in reverse ordinal order. Use OnDelete for manual control over update timing in production.

PersistentVolumeClaim Lifecycle: Ownership, Orphans, and Reclaim

The volumeClaimTemplates field in a StatefulSet is a template for creating PVCs. When a StatefulSet Pod is created, Kubernetes creates a PVC from the template with a deterministic name: <template-name>-<statefulset-name>-<ordinal>. For example, a StatefulSet named postgres with a volumeClaimTemplate named data creates PVCs: data-postgres-0, data-postgres-1, data-postgres-2.
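Since the naming rule is deterministic, it can be written down directly. A minimal sketch (an illustrative helper, not a client-go API):

```python
def pvc_name(template: str, sts_name: str, ordinal: int) -> str:
    """PVC created from volumeClaimTemplates: <template>-<sts-name>-<ordinal>."""
    return f"{template}-{sts_name}-{ordinal}"

print([pvc_name("data", "postgres", i) for i in range(3)])
# ['data-postgres-0', 'data-postgres-1', 'data-postgres-2']
```

This determinism is exactly why re-creating a StatefulSet with the same name silently re-attaches old PVCs: the names collide by construction.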

These PVCs are owned by the StatefulSet but are NOT deleted when the StatefulSet is deleted. This is by design — the data must persist so it can be re-attached if the StatefulSet is re-created. However, this creates a common production pitfall: orphaned PVCs that consume storage indefinitely.

The reclaim policy on the underlying StorageClass determines what happens to the PersistentVolume when the PVC is finally deleted. Retain keeps the PV and its data. Delete removes the PV and its data. The default varies by cloud provider.

io/thecodeforge/k8s/statefulset-pvc-lifecycle.yaml · YAML
# PVC lifecycle management for StatefulSets.
# Package: io.thecodeforge.k8s

# StorageClass with Delete reclaim policy.
# When PVC is deleted, the underlying PV and data are also deleted.
# Use for ephemeral or reproducible data.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-deletable
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
reclaimPolicy: Delete          # PV is deleted when PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StorageClass with Retain reclaim policy.
# When PVC is deleted, the PV is kept (but unbound).
# Use for critical data that must survive PVC deletion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
reclaimPolicy: Retain          # PV is kept when PVC is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# StatefulSet using the deletable storage class.
# PVCs are NOT auto-deleted when the StatefulSet is deleted; delete them explicitly.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: production
spec:
  serviceName: kafka
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.5.0
          ports:
            - containerPort: 9092
          env:
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
            - name: KAFKA_LOG_DIRS
              value: /var/lib/kafka/data
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd-deletable
        resources:
          requests:
            storage: 500Gi
▶ Output
# PVCs created: data-kafka-0 (500Gi), data-kafka-1 (500Gi), data-kafka-2 (500Gi)
# Total storage: 1.5TB. These persist even after StatefulSet deletion.
Mental Model
PVC Name Matching: Why Storage Class Changes Are Ignored
This is the most common source of confusion during StatefulSet migrations. The new storage class in the YAML is silently ignored because the old PVCs still exist.
  • PVC name is deterministic: <template>-<sts-name>-<ordinal>.
  • Kubernetes matches by name. Existing PVCs are re-attached, new specs are ignored.
  • PVC spec (storage class, size) is immutable after creation.
  • To change storage class: delete StatefulSet, delete PVCs, re-create StatefulSet.
  • To resize PVC: set allowVolumeExpansion: true on StorageClass, then edit PVC spec.resources.requests.storage.
📊 Production Insight
Orphaned PVCs are the silent storage leak in Kubernetes. Every deleted StatefulSet leaves behind PVCs that consume cloud storage indefinitely. At scale, this can cost thousands of dollars per month. Set up monitoring for unbound PVCs (status.phase: Pending or status.phase: Bound with no Pod). Alert when PVCs exist without a corresponding Pod for more than 1 hour. Consider a cron job that identifies and reports orphaned PVCs weekly.
🎯 Key Takeaway
StatefulSet PVCs persist after StatefulSet deletion. Kubernetes matches PVCs by name, ignoring new template specs. To change storage class, delete PVCs first. Monitor for orphaned PVCs to prevent silent storage leaks.

Update Strategies: RollingUpdate vs OnDelete

StatefulSets support two update strategies: RollingUpdate (default) and OnDelete. The choice determines how Pod template changes (image update, env var change) are propagated to existing Pods.

RollingUpdate updates Pods one at a time in reverse ordinal order (highest first), waiting for each Pod to be Ready before proceeding to the next. This is the safe default but slow. The maxUnavailable field (alpha since Kubernetes 1.24, behind the MaxUnavailableStatefulSet feature gate) controls how many Pods can be unavailable during the update.

OnDelete does not automatically update Pods. When the Pod template is changed, existing Pods continue running the old spec. The update is applied only when a Pod is manually deleted. Kubernetes recreates the Pod with the new spec. This gives full control over update timing but requires manual intervention.
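When driving an OnDelete rollout by hand, it is conventional to mirror the controller's own update order and delete the highest ordinal first. A sketch of that ordering (helper and Pod names are illustrative):

```python
def ondelete_restart_order(pod_names):
    """Sort StatefulSet Pods highest-ordinal-first, the same order the
    controller itself applies rolling updates."""
    def ordinal(name: str) -> int:
        # StatefulSet Pod names always end in "-<ordinal>".
        return int(name.rsplit("-", 1)[1])
    return sorted(pod_names, key=ordinal, reverse=True)

print(ondelete_restart_order(["es-0", "es-2", "es-1"]))
# ['es-2', 'es-1', 'es-0']
```

In practice you would delete each Pod in this order with kubectl, waiting for it to return to Ready (and for cluster health to recover) before deleting the next.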

io/thecodeforge/k8s/statefulset-update-strategy.yaml · YAML
# StatefulSet update strategies.
# Package: io.thecodeforge.k8s

# RollingUpdate (default): Automatic, ordered updates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
  namespace: production
spec:
  serviceName: zookeeper
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # Allow 1 Pod to be unavailable during update
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.8
          ports:
            - containerPort: 2181
          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - "echo ruok | nc localhost 2181 | grep imok"
            initialDelaySeconds: 10
            periodSeconds: 5
          volumeMounts:
            - name: data
              mountPath: /data
            - name: datalog
              mountPath: /datalog
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
    - metadata:
        name: datalog
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
---
# OnDelete: Manual update control. No automatic Pod recreation on template change.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: production
spec:
  serviceName: elasticsearch
  replicas: 5
  updateStrategy:
    type: OnDelete           # Template changes apply only when Pods are manually deleted
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: elasticsearch:8.13.0   # illustrative image tag
          ports:
            - containerPort: 9200
  # volumeClaimTemplates omitted for brevity
▶ Output
# Applying a new image does NOT restart any Pods. Delete Pods one at a time
# (highest ordinal first); each is recreated with the new template.
OnDelete is the production standard for large data systems where each restart has significant operational impact (rebalancing, reindexing, replication catch-up).
📊 Production Insight
For large Elasticsearch or Cassandra clusters, RollingUpdate causes hours of unnecessary rebalancing. Each Pod restart triggers shard redistribution, which competes with application traffic for I/O and network bandwidth. Use OnDelete and restart Pods one at a time during maintenance windows, waiting for cluster health to return to green before proceeding to the next Pod. Monitor cluster health metrics (Elasticsearch: _cluster/health, Cassandra: nodetool status) between each restart.
🎯 Key Takeaway
RollingUpdate is the safe default for systems with fast startup. OnDelete is the production standard for large data systems where each restart has significant operational impact. Use maxUnavailable to control parallelism during RollingUpdate.

PodDisruptionBudgets and StatefulSet Availability

PodDisruptionBudgets (PDBs) are critical for StatefulSets. They prevent voluntary disruptions — node drains, cluster upgrades, preemptions — from evicting too many Pods simultaneously. The eviction API will not intentionally evict more Pods than the budget allows.

For a 3-node etcd cluster, set minAvailable: 2. This ensures that a node drain cannot break quorum. If the drain would evict a third Pod, it blocks until one of the evicted Pods is rescheduled and Ready.

PDBs only block voluntary disruption (drain, upgrade, preemption). They do NOT protect against involuntary disruption (node crash, OOMKill, kernel panic). This distinction is critical: PDBs are a guardrail for planned maintenance, not a safety net for unplanned failures.

io/thecodeforge/k8s/statefulset-pdb.yaml · YAML
# PodDisruptionBudget for a 3-node etcd cluster.
# Ensures at least 2 Pods are always available (quorum).
# Package: io.thecodeforge.k8s
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: etcd
---
# PodDisruptionBudget using maxUnavailable for a 5-node Kafka cluster.
# Allows at most 1 Pod to be down at any time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: kafka
▶ Output
# Verify PDB status:
# kubectl get pdb -n production
# NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# etcd-pdb    2               N/A               1                     5d
# kafka-pdb   N/A             1                 0                     5d
# ALLOWED DISRUPTIONS = 0 means the PDB is currently blocking all voluntary evictions.
Mental Model
minAvailable vs maxUnavailable
For quorum-based systems, always use minAvailable set to the quorum size. This is more intuitive than calculating maxUnavailable.
  • minAvailable: 2 on a 3-replica cluster = 1 Pod can be evicted.
  • maxUnavailable: 1 on a 3-replica cluster = same result, different expression.
  • PDBs block voluntary disruption only (drain, upgrade, preemption).
  • They do NOT protect against involuntary disruption (crash, OOMKill).
  • Setting minAvailable equal to replica count blocks all maintenance. Use quorum size instead.
📊 Production Insight
The most common PDB misconfiguration is setting minAvailable equal to the replica count. If you have 3 replicas and set minAvailable: 3, the PDB blocks all voluntary disruptions — including necessary node drains during maintenance. This forces operators to delete the PDB before draining nodes, which defeats the purpose. Set minAvailable to the quorum size (2 for 3 replicas) or use maxUnavailable: 1. During cluster upgrades, the upgrade controller respects PDBs and waits for Pods to be rescheduled before proceeding to the next node.
🎯 Key Takeaway
PDBs are mandatory for StatefulSets running quorum-based systems. Set minAvailable to the quorum size, not the replica count. PDBs only block voluntary disruption — they do not protect against node crashes. Always pair PDBs with anti-affinity rules to spread Pods across nodes.
PDB Configuration Decision Tree
If: Quorum-based system (etcd, ZooKeeper, CockroachDB)
Use: Set minAvailable to the quorum size. For 3 replicas: minAvailable: 2. For 5 replicas: minAvailable: 3.
If: Independent replicas (Redis standalone, stateless workers)
Use: Set maxUnavailable: 1 or minAvailable: N-1. Allows rolling maintenance without service degradation.
If: Single-replica StatefulSet (single PostgreSQL instance)
Use: A PDB with minAvailable: 1 blocks all voluntary disruption. Use cautiously — maintenance requires manual PDB deletion.
If: Large cluster (10+ replicas) with no quorum requirement
Use: Set maxUnavailable: 20-25% to allow parallel node drains during upgrades.
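The decision tree collapses to a few lines of logic. A hedged sketch: the thresholds follow the rules of thumb above, and the function is illustrative, not a Kubernetes API:

```python
def suggested_pdb(replicas: int, quorum_based: bool) -> dict:
    """Suggested PDB spec fields, following the decision tree above."""
    if quorum_based:
        # Protect quorum: a majority must always stay available.
        return {"minAvailable": replicas // 2 + 1}
    if replicas >= 10:
        # Large non-quorum fleet: allow roughly 25% parallel disruption.
        return {"maxUnavailable": max(1, replicas // 4)}
    # Small set of independent replicas: disrupt one at a time.
    return {"maxUnavailable": 1}

print(suggested_pdb(3, True))    # {'minAvailable': 2}
print(suggested_pdb(12, False))  # {'maxUnavailable': 3}
```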
🗂 Deployment vs StatefulSet vs DaemonSet: When to Use Each
Understanding the workload controller trade-offs for different application types.
| Aspect | Deployment | StatefulSet | DaemonSet |
| --- | --- | --- | --- |
| Pod identity | Random, interchangeable | Stable, ordinal (pod-0, pod-1) | One per node (or subset) |
| Pod naming | random-hash | sts-name-ordinal | daemon-hash |
| Storage | Ephemeral or shared PV | Per-Pod PVC (sticky) | HostPath or shared PV |
| Scaling | Horizontal (free) | Ordered (sequential) | Automatic (per-node) |
| Rolling update | maxSurge + maxUnavailable | RollingUpdate (ordinal order) or OnDelete | RollingUpdate or OnDelete |
| DNS identity | Service VIP (load-balanced) | Per-Pod DNS via Headless Service | Service VIP (load-balanced) |
| Self-healing | Yes (replace any Pod) | Yes (replace with same identity) | Yes (replace on same node) |
| Creation order | Parallel | Sequential (0, 1, 2) | Parallel (one per node) |
| Deletion order | Parallel | Reverse (N, N-1, ..., 0) | Parallel |
| Use case | Stateless APIs, web servers | Databases, Kafka, ZooKeeper, etcd | Log agents, node-exporter, CNI agents |
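To make the StatefulSet column concrete, here is a minimal skeleton tying together the pieces discussed in this article: the headless Service reference, ordered Pod management, and per-Pod storage. All names, the image, and the storage size are placeholders:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db-headless           # must reference a headless Service
  replicas: 3
  podManagementPolicy: OrderedReady  # sequential creation: 0, then 1, then 2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: example/db:1.0      # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
  volumeClaimTemplates:              # one PVC per Pod: data-db-0, data-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```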

🎯 Key Takeaways

  • StatefulSets provide three guarantees: stable identity (name + DNS), sticky storage (per-Pod PVC), and ordered operations (sequential creation/deletion).
  • StatefulSet PVCs persist after StatefulSet deletion. Always delete PVCs explicitly when decommissioning. Monitor for orphaned PVCs.
  • OrderedReady is the safe default for quorum-based systems; Parallel breaks consensus bootstrap. Note that podManagementPolicy affects only creation and scaling: rolling updates always follow ordinal order.
  • OnDelete is the production standard for large data systems where each restart triggers hours of rebalancing.
  • PDBs are mandatory for quorum-based StatefulSets. Set minAvailable to quorum size, not replica count.
  • PVCs are matched by name, not by spec. To change storage class, delete the PVCs first, then re-create the StatefulSet.
  • The Headless Service is mandatory for per-Pod DNS. Without it, peer discovery fails and the cluster cannot bootstrap.

⚠ Common Mistakes to Avoid

    Deleting a StatefulSet without deleting its PVCs. The PVCs persist indefinitely, consuming storage and blocking re-creation with different storage config. Always delete PVCs explicitly.
    Using Parallel podManagementPolicy for quorum-based systems (etcd, ZooKeeper). All Pods start simultaneously and cannot discover peers, causing bootstrap failure. Use OrderedReady.
    Re-creating a StatefulSet expecting new PVCs. Kubernetes matches PVCs by name and re-attaches old PVCs, ignoring new storage class or size. Delete PVCs first.
    Not setting PodDisruptionBudgets. Node drains can terminate multiple Pods simultaneously, losing quorum. Set minAvailable to the quorum size.
    Using RollingUpdate for large data systems (Elasticsearch, Cassandra). Each restart triggers hours of rebalancing. Use OnDelete and restart during maintenance windows.
    Not configuring readiness probes. Without readiness probes, the StatefulSet considers each Pod Ready immediately, proceeding to the next ordinal even if the application is not fully started.
    Setting minAvailable equal to replica count in PDBs. This blocks all voluntary disruptions including necessary maintenance. Set minAvailable to quorum size.
    Ignoring PVC attachment delays during node failover. The attach-detach controller takes up to 6 minutes to detach PVs from failed nodes. During this time, the Pod cannot start.
    Not using volumeClaimTemplates. If you use a regular PV, it is not tied to the Pod's identity. On reschedule, the Pod may get a different PV with different data.
    Not monitoring for orphaned PVCs. Deleted StatefulSets leave behind PVCs that consume cloud storage indefinitely. Set up alerts for unbound PVCs.
    Using reclaimPolicy: Retain for ephemeral or reproducible data. Retain (the default for statically provisioned PVs) keeps PVs after PVC deletion, consuming storage. Use Delete (the default for dynamically provisioned StorageClasses) for data you can re-create.
    Not setting anti-affinity rules. StatefulSet Pods may be scheduled on the same node, creating a single point of failure. Use podAntiAffinity to spread across nodes.
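The anti-affinity rule from the last point can be sketched like this inside the Pod template (the label value is illustrative):

```yaml
# Goes under spec.template.spec of the StatefulSet
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: db                            # match this StatefulSet's Pods
        topologyKey: kubernetes.io/hostname    # no two Pods on the same node
```

With required anti-affinity, scheduling fails if there are fewer nodes than replicas; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative.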

Interview Questions on This Topic

  • Q: Explain the three guarantees a StatefulSet provides that a Deployment cannot. Why are these important for stateful workloads?
  • Q: What happens when you delete a StatefulSet? What happens to its PVCs? How do you properly decommission a StatefulSet?
  • Q: Explain the difference between OrderedReady and Parallel podManagementPolicy. When would you use each?
  • Q: A 3-node etcd cluster is running as a StatefulSet. A node fails and one Pod is rescheduled. The new Pod stays Pending for 5 minutes. What is happening and how do you fix it?
  • Q: How does Kubernetes match PVCs when a StatefulSet is re-created? What happens if you change the storage class in the template?
  • Q: Explain the RollingUpdate vs OnDelete update strategies. When would you choose OnDelete?
  • Q: What is a PodDisruptionBudget and why is it critical for StatefulSets? What is the most common PDB misconfiguration?
  • Q: How does a StatefulSet Pod discover its peers? What role does the Headless Service play?
  • Q: Describe the PVC lifecycle for a StatefulSet. What is the reclaim policy and how does it affect storage costs?
  • Q: How would you design a zero-downtime upgrade strategy for a 5-node Elasticsearch cluster running as a StatefulSet?

Frequently Asked Questions

What is the difference between a Deployment and a StatefulSet?

A Deployment manages interchangeable Pods with no stable identity. Pods get random names, ephemeral storage, and are created/deleted in parallel. A StatefulSet manages Pods with stable identity (ordinal names), sticky storage (per-Pod PVCs), and ordered operations (sequential creation/deletion). Use Deployments for stateless apps. Use StatefulSets for databases, message brokers, and distributed systems that need peer discovery.

Why do StatefulSet Pods need a Headless Service?

The Headless Service (clusterIP: None) creates DNS A records for each Pod individually: pod-0.service.ns.svc.cluster.local. Without it, DNS returns only the ClusterIP (if using a regular Service), and you cannot address specific Pods. Peer discovery in systems like Kafka, ZooKeeper, and etcd requires individual Pod DNS names.
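A minimal headless Service definition looks like this (the name, labels, and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None      # headless: DNS returns individual Pod IPs
  selector:
    app: db            # must match the StatefulSet's Pod labels
  ports:
    - port: 5432       # illustrative port
```

Each Pod of a StatefulSet that references this Service then resolves as db-0.db-headless.&lt;namespace&gt;.svc.cluster.local, db-1.db-headless..., and so on.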

What happens to PVCs when a StatefulSet is deleted?

PVCs are NOT deleted when the StatefulSet is deleted. They persist indefinitely, consuming storage. If you re-create the StatefulSet with the same name, Kubernetes re-attaches the existing PVCs by name. To start fresh, you must explicitly delete the PVCs, either by listing them by name (kubectl delete pvc data-<sts-name>-0 data-<sts-name>-1 ...) or with a label selector (kubectl delete pvc -l app=<label>); kubectl does not expand wildcards in resource names.

When should I use OnDelete instead of RollingUpdate?

Use OnDelete for large data systems (Elasticsearch, Cassandra, CockroachDB) where each Pod restart triggers significant operational overhead like shard rebalancing or replication catch-up. OnDelete gives you manual control: update the spec, then delete Pods one at a time during maintenance windows, waiting for cluster health to recover between restarts.
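Switching a StatefulSet to OnDelete is a one-line change in its spec:

```yaml
# In the StatefulSet spec
updateStrategy:
  type: OnDelete   # Pods pick up the new template only when deleted manually
```

After applying the updated template, delete Pods one ordinal at a time during the maintenance window, waiting for cluster health to recover between each restart.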

How do I resize a StatefulSet PVC?

Set allowVolumeExpansion: true on the StorageClass. Then edit the PVC's spec.resources.requests.storage directly. Kubernetes will expand the underlying volume. Note: you cannot change the storage class — only the size. Some volumes support online expansion (no Pod restart required). Others require the Pod to be restarted.
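An expandable StorageClass might look like the following; the AWS EBS CSI driver is used here purely as an example provisioner:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-expandable       # illustrative name
provisioner: ebs.csi.aws.com # example provisioner; substitute your CSI driver
allowVolumeExpansion: true   # required for PVC resize
parameters:
  type: gp3
reclaimPolicy: Delete
```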

What is the most common StatefulSet production mistake?

Deleting a StatefulSet without deleting its PVCs. The PVCs persist, consume storage, and block re-creation with different storage configuration. Always delete PVCs explicitly when decommissioning a StatefulSet, and monitor for orphaned PVCs to prevent silent storage leaks.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Kubernetes Pods and Deployments · Next: Kubernetes ConfigMaps and Secrets →
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged