
etcd Disk Latency — How 800ms Killed the Kubernetes Cluster

📍 Part of: Kubernetes → Topic 1 of 12
The degraded etcd EBS volume caused 800ms write latency, stalling Raft consensus and freezing the entire cluster.
🔥 Advanced — solid DevOps foundation required
Quick Answer
  • Kubernetes is a declarative container orchestration platform that continuously reconciles observed state with desired state.
  • Control plane: kube-apiserver, etcd, kube-scheduler, kube-controller-manager — each has a distinct role and failure mode.
  • etcd is the single source of truth — its disk latency is the cluster's performance ceiling.
  • The scheduler filters then scores nodes; it does NOT rebalance or predict load.
  • kubelet on each node runs the actual containers and reports status back to the API server.
  • Many control plane outages trace back to etcd provisioning or configuration, not application code.
🚨 START HERE

Kubernetes Triage Cheat Sheet

First-response commands for common K8s production incidents.
🟡

Pod not starting — no events visible.

Immediate Action: Check if the scheduler is running and if nodes have capacity.
Commands
kubectl get pods -n kube-system | grep scheduler
kubectl describe nodes | grep -A 5 'Allocated resources'
Fix Now: If the scheduler is down, check the kube-scheduler logs in kube-system. If there is no capacity, scale the cluster or evict low-priority Pods.
🟡

Service returns 502/503 intermittently.

Immediate Action: Check if endpoints exist and Pods are passing readiness probes.
Commands
kubectl get endpoints <service-name>
kubectl get pods -l app=<selector> -o wide
Fix Now: If endpoints are empty, Pods are failing readiness probes. Check probe configuration and Pod logs. If endpoints exist but 503 persists, check kube-proxy iptables rules with `iptables-save | grep <service-cluster-ip>`.
🟡

Node marked NotReady — Pods being evicted.

Immediate Action: SSH to the node and check kubelet status.
Commands
kubectl describe node <node-name> | grep -A 10 Conditions
systemctl status kubelet
Fix Now: If kubelet is down, run `systemctl restart kubelet`. If disk pressure, clean up unused images with `crictl rmi --prune`. If memory pressure, identify and kill the offending process.
🟡

PersistentVolumeClaim stuck in Pending.

Immediate Action: Check if a PersistentVolume exists that matches the claim's requirements.
Commands
kubectl get pv
kubectl describe pvc <pvc-name>
Fix Now: If no PV is available, provision one manually or ensure the StorageClass has a provisioner. If a PV exists but is not binding, check that accessModes and storageClassName match.
🟡

Pod evicted due to node pressure.

Immediate Action: Identify the type of pressure and the root cause.
Commands
kubectl describe node <node-name> | grep -i pressure
kubectl top node <node-name>
Fix Now: DiskPressure: clean up old logs and images, increase disk size. MemoryPressure: find the Pods using more memory than they request, set realistic requests and limits, or add more nodes. PIDPressure: reduce the number of processes per Pod.
Production Incident

The etcd Disk That Killed the Entire Cluster

A production cluster with 200 nodes stopped scheduling new Pods. Existing Pods continued running, but all deployments, scaling operations, and config updates hung indefinitely. The cluster appeared healthy from node metrics but was functionally frozen.
Symptom: kubectl commands hang or time out. New Pods stuck in Pending. Deployment rollouts never complete. API server logs show 'etcdserver: request timed out' errors. Controller-manager logs show leader election failures.
Assumption: The API server is overloaded, or the scheduler has crashed.
Root cause: etcd's data directory was on a network-attached EBS volume whose p99 write latency had degraded to 800ms (normal: 2ms). etcd needs p99 disk write latency below 10ms for stable operation. The degraded disk caused the Raft consensus protocol to stall, so the cluster could not commit new state changes. The API server, which depends on etcd for every operation, began queuing requests until it exhausted its connection pool. The scheduler and controller-manager, which watch cluster state via the API server, received no updates and effectively froze.
Fix:
1. Immediately migrate etcd to low-latency SSD storage: instance-local NVMe where possible, otherwise provisioned IOPS EBS.
2. Set etcd disk latency alerts at p99 > 10ms as critical.
3. Run etcd defragmentation on a schedule (etcdctl defrag), one member at a time.
4. Configure etcd auto-compaction (--auto-compaction-retention=8, which means 8 hours in the default periodic mode) to prevent unbounded data growth.
5. Monitor etcd member health with etcdctl endpoint health and etcdctl endpoint status.
Key Lesson
  • etcd is the single point of failure for the entire cluster. Its disk performance is the cluster's ceiling.
  • Do not run etcd on general-purpose network-attached storage in production. Local SSDs are the safest choice.
  • API server timeouts are often etcd problems, not API server problems. Trace downward, not upward.
  • etcd requires periodic defragmentation. Compaction frees space inside the database file but does not return it to the filesystem, so without defragmentation the file keeps growing and leads to disk pressure.
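
The alert rule from the fix list ("p99 > 10ms is critical") can be sketched in a few lines. This is a minimal illustration, not the monitoring setup used in the incident: the sample lists are invented, and the function names `p99` and `fsync_alert` are hypothetical. In a real cluster the samples would typically come from etcd's `etcd_disk_wal_fsync_duration_seconds` histogram rather than a Python list.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) computed with integers to avoid float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def fsync_alert(samples_ms, threshold_ms=10.0):
    """Return 'critical' when p99 exceeds the threshold, else 'ok'."""
    return "critical" if p99(samples_ms) > threshold_ms else "ok"


# Invented samples: a healthy disk and one where 5% of writes take 800ms.
healthy = [2.0] * 99 + [6.0]
degraded = [2.0] * 95 + [800.0] * 5

print(fsync_alert(healthy))
print(fsync_alert(degraded))
```

Note that the average of the degraded samples is about 42ms, which already looks bad, but a smaller tail would hide in an average. That is why the alert is defined on p99.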
Production Debug Guide

Symptom-driven investigation paths for the most common failure modes.

Pod stuck in Pending state.
1. Run kubectl describe pod <name> and read the Events section.
2. Common causes: insufficient CPU/memory on every node (check kubectl describe nodes for Allocatable vs Allocated), PersistentVolumeClaim not bound, node affinity/taint mismatches, resource quotas exceeded.
3. If no events appear, the scheduler may be down: check kubectl get pods -n kube-system for kube-scheduler.

Pod stuck in CrashLoopBackOff.
1. Run kubectl logs <pod> --previous to see the logs from the crashed container (current logs may be empty).
2. Common causes: missing environment variables, failed health checks, OOMKill (check kubectl describe pod for Last State), misconfigured entrypoint.
3. If OOMKill, increase memory limits or fix the memory leak. Check kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState}'.

Pods cannot reach each other across nodes.
1. Verify the CNI plugin is healthy: kubectl get pods -n kube-system | grep calico (or flannel/weave).
2. Check if Pod CIDR ranges overlap between nodes: kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'.
3. Verify kube-proxy is running: kubectl get pods -n kube-system | grep kube-proxy.
4. Test from within a Pod: kubectl exec -it <pod> -- curl <service-ip>:<port>.

Deployment rollout hangs: new ReplicaSet never becomes ready.
1. Check the new ReplicaSet: kubectl describe rs <new-rs-name>.
2. Look for Pods that are Pending or in CrashLoopBackOff.
3. Check if the new image exists in the registry and if imagePullSecrets are configured.
4. If using rolling update with maxUnavailable=0 and the cluster has no spare capacity, new Pods cannot be scheduled.
5. Rollback: kubectl rollout undo deployment/<name>.

etcd high latency alerts firing, API server slow.
1. Check etcd latency: etcdctl endpoint health --write-out=table.
2. Check disk I/O on etcd nodes: iostat -x 1.
3. Check etcd database size: etcdctl endpoint status --write-out=table.
4. If disk is the bottleneck, migrate to local SSDs.
5. If the database is large, run defragmentation: etcdctl defrag.

PersistentVolumeClaim stuck in Pending.
1. Check if any PersistentVolume matches: kubectl get pv.
2. Describe the PVC: kubectl describe pvc <name>. Common causes: no PV available with matching accessModes and storageClassName, or the StorageClass has no provisioner.
3. If using dynamic provisioning, verify the storage provisioner Pod is running and hasn't hit a quota or permission error.

Kubernetes is not a deployment tool. It is a distributed state reconciliation engine. Every component, from the scheduler to the kubelet, operates on the same principle: watch the desired state through the API server (which persists it in etcd), compare it with observed state, and act to close the gap. This is the mental model that unlocks real debugging capability.

The control plane is the brain. etcd is the memory. The kubelet is the muscle on each node. The scheduler decides placement. When any of these components degrades, the symptoms are often misleading — a Pod stuck in Pending looks like a scheduling problem but is frequently an etcd latency issue or a resource quota misconfiguration.

The common misconception is that Kubernetes 'runs containers.' It does not. Kubernetes manages the desired state of workloads. The container runtime (containerd, CRI-O) runs containers. Kubernetes tells the runtime what to run, monitors whether it is running, and corrects deviations. This distinction matters when debugging crashes, image pull failures, and networking issues.

Control Plane Architecture: The Brain of the Cluster

The Kubernetes control plane consists of four components that work together to maintain cluster state. Understanding each component's role — and its failure modes — is essential for production operations.

kube-apiserver is the front door. Every kubectl command, every controller reconciliation, every kubelet status report goes through the API server. It validates requests, persists state to etcd, and serves as the watch endpoint for all controllers. It is stateless — you can run multiple replicas behind a load balancer for HA.

etcd is the single source of truth. It is a distributed, consistent key-value store built on the Raft consensus protocol. All cluster state — Pod definitions, ConfigMaps, Secrets, node registrations — lives in etcd. If etcd loses quorum, the cluster cannot make any state changes. etcd is the most critical component and the most commonly under-provisioned.

kube-scheduler watches for unscheduled Pods and assigns them to nodes. It does not run Pods — it only writes the nodeName field. The kubelet on the assigned node then pulls the image and starts the container. The scheduler uses a two-phase process: filtering (eliminate infeasible nodes) and scoring (rank feasible nodes, pick the highest score).

kube-controller-manager runs the control loops. Each controller watches a specific resource type and reconciles actual state with desired state. The Deployment controller ensures the right number of replicas exist. The Node controller detects when nodes go unhealthy. The Endpoint controller updates Service endpoints as Pods come and go.

control-plane-architecture.yaml · YAML
# Control Plane Health Check
# Run this to verify all components are healthy
# Save as check-control-plane.sh

# 1. API Server health (returns 200 if healthy)
curl -k https://localhost:6443/healthz
# Expected: "ok"

# 2. etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Expected: "is healthy"

# 3. etcd cluster member status
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --write-out=table
# Shows: ID, Status, Version, DB Size, Raft Term, Raft Index

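# 4. Scheduler and Controller-Manager leader election
# Clusters on Kubernetes 1.20+ record the leader in a Lease by default; there, use:
#   kubectl get lease kube-scheduler -n kube-system -o jsonpath='{.spec.holderIdentity}'
# The Endpoints annotation queries below apply to older clusters.
kubectl get endpoints kube-scheduler -n kube-system -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
kubectl get endpoints kube-controller-manager -n kube-system -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'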

# 5. All control plane components running
kubectl get pods -n kube-system -o wide
▶ Output
ok

127.0.0.1:2379 is healthy: successfully committed proposal: took = 2.145ms

+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://127.0.0.1:2379 | 8e9e05c52164694d | 3.5.9 | 25 MB | true | false | 4 | 18234 | 18234 | |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

{"holderIdentity":"master-1_xxxxx","leaseDurationSeconds":15,"acquireTime":"2026-03-01T10:00:00Z","renewTime":"2026-04-07T14:30:00Z","leaderTransitions":3}

NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE
kube-system   coredns-5d78c9869d-abc12           1/1     Running   0          30d   10.244.0.5     master-1
kube-system   etcd-master-1                      1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-apiserver-master-1            1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-controller-manager-master-1   1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-proxy-xyz78                   1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-scheduler-master-1            1/1     Running   0          30d   192.168.1.10   master-1
Mental Model
The Reconciliation Loop — The Heartbeat of Kubernetes
Understanding this loop is the single most important concept in Kubernetes.
  • The API server is the only component that talks to etcd. All other components go through the API server.
  • Controllers are level-triggered, not edge-triggered. They care about the current state, not the event that caused it.
  • This is why Kubernetes is self-healing. It does not remember what happened — it only checks what is true right now.
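
A level-triggered loop can be sketched as a pure function of desired and observed state. This is a toy model, not controller code from Kubernetes: the function name `reconcile`, the action tuples, and the Pod names are all hypothetical. The point it illustrates is that the loop never needs to know which event changed the state.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to make observed state match desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few Pods: create the missing ones.
        return [("create", None)] * diff
    if diff < 0:
        # Too many Pods: delete the surplus (here, the last ones in the list).
        return [("delete", name) for name in running_pods[diff:]]
    return []  # Observed state already matches desired state.


observed = ["web-0", "web-1", "web-2"]
print(reconcile(3, observed))   # nothing to do

# A Pod dies. The loop does not need the crash event itself;
# the next pass simply sees two Pods where three are wanted.
observed.remove("web-1")
print(reconcile(3, observed))

# Scaling down is the same comparison with a different desired value.
print(reconcile(1, observed))
```

Because the function depends only on current state, running it twice in a row, or after a missed event, gives the same answer. That property is what makes the system self-healing.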
📊 Production Insight
Control plane HA requires at least 3 etcd members and 2+ API server replicas.
A single-node control plane is a single point of failure — if the master node dies, existing Pods keep running (kubelet is independent), but you cannot deploy, scale, or modify anything until the control plane recovers.
etcd quorum requires floor(n/2)+1 members alive. With 3 members, you can tolerate 1 failure.
Prefer an odd number of etcd members: an even count raises the quorum size without adding fault tolerance (4 members still tolerate only 1 failure).
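
The quorum arithmetic is easy to check directly. The sketch below is plain majority math for a Raft-style cluster; the function names are illustrative only.

```python
def quorum(members):
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def fault_tolerance(members):
    """How many members can fail while a majority still remains."""
    return members - quorum(members)


print("members  quorum  tolerated failures")
for n in range(1, 8):
    print(f"{n:>7}  {quorum(n):>6}  {fault_tolerance(n):>18}")
```

The table shows why clusters are sized at 3, 5, or 7 members: going from 3 to 4 raises the quorum from 2 to 3 but the number of tolerated failures stays at 1.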
🎯 Key Takeaway
The control plane is a distributed system with etcd as its consensus backbone.
Every API request, every scheduler decision, every controller reconciliation depends on etcd's health.
Production clusters need 3+ etcd members on local SSDs — and etcd latency monitoring is not optional, it's the earliest warning of cluster degradation.
Control Plane Sizing for Production
If: Dev/test cluster, non-critical workloads.
Use: Single control plane node is acceptable. Accept the risk of API unavailability during maintenance.
If: Production cluster, < 100 nodes.
Use: 3 control plane nodes with stacked etcd. etcd runs on the same nodes as the API server, which gives cost-effective HA.
If: Production cluster, > 100 nodes or strict SLA requirements.
Use: 3–5 dedicated etcd nodes + 2+ API server nodes (external etcd). Isolates etcd disk I/O from API load.
If: Multi-region cluster.
Use: Stretched etcd across regions needs < 10ms latency between members. If higher, use separate clusters per region.

The Scheduler: How Kubernetes Decides Where Pods Run

The kube-scheduler is the component that assigns Pods to nodes. It does not run Pods; it only writes the spec.nodeName field on the Pod object. The kubelet on the assigned node then pulls the image and starts the container. Scheduling happens in two phases:

Filtering (Feasibility): Eliminate nodes that cannot run the Pod. Filter reasons include: insufficient CPU/memory, node taints the Pod cannot tolerate, node affinity mismatches, volume zone constraints, and Pod topology spread constraints. After filtering, if zero nodes remain, the Pod stays in Pending.

Scoring (Ranking): Rank the feasible nodes by a set of scoring plugins. Default scoring includes: NodeResourcesBalancedAllocation (prefer nodes with balanced CPU/memory usage), ImageLocality (prefer nodes that already have the container image), InterPodAffinity (prefer nodes where affinity rules are satisfied), and TaintToleration (prefer nodes with fewer PreferNoSchedule taints that the Pod does not tolerate). The node with the highest weighted score wins.

The scheduler makes decisions based on the state of the cluster at scheduling time. It does not predict future load. It does not rebalance existing Pods. Once a Pod is scheduled, only explicit actions (eviction, deletion, preemption) can move it.
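
The filter-then-score pipeline can be sketched as two small functions. This is a deliberately simplified model, not the kube-scheduler's implementation: the `Node` and `Pod` classes, the scoring weights, and the node data are all invented for illustration, and real scoring plugins are normalized and weighted differently.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int          # unreserved CPU in millicores
    free_mem_mi: int         # unreserved memory in MiB
    labels: dict = field(default_factory=dict)
    images: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_m: int               # CPU request in millicores
    mem_mi: int              # memory request in MiB
    image: str
    node_selector: dict = field(default_factory=dict)


def feasible(pod, node):
    """Filtering: the node must fit the requests and match the nodeSelector."""
    fits = node.free_cpu_m >= pod.cpu_m and node.free_mem_mi >= pod.mem_mi
    selected = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    return fits and selected


def score(pod, node):
    """Scoring (toy weights): reward a cached image and leftover CPU headroom."""
    image_bonus = 50 if pod.image in node.images else 0
    headroom = (node.free_cpu_m - pod.cpu_m) // 100
    return image_bonus + headroom


def schedule(pod, nodes):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name


nodes = [
    Node("node-1", 4000, 8192, {"disktype": "hdd"}),
    Node("node-2", 1000, 2048, {"disktype": "ssd"}, {"payment:v2"}),
    Node("node-3", 3000, 4096, {"disktype": "ssd"}),
]
pod = Pod(500, 512, "payment:v2", {"disktype": "ssd"})
print(schedule(pod, nodes))                              # image bonus outweighs headroom
print(schedule(Pod(8000, 512, "payment:v2"), nodes))     # fits nowhere
```

Note that the function only looks at the snapshot it is given. Nothing here revisits a placement later, which mirrors the point that the scheduler does not rebalance.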

scheduler-configuration.yaml · YAML
# Example: Pod with scheduling constraints
# This Pod will ONLY be scheduled on nodes with the label 'disktype=ssd'
# and will prefer nodes in zone 'us-east-1a'
apiVersion: v1
kind: Pod
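metadata:
  name: io-thecodeforge-payment-service
  namespace: production
  # Label must match the labelSelector used by topologySpreadConstraints below
  labels:
    app: payment-service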
spec:
  # Hard requirement: node MUST have this label
  nodeSelector:
    disktype: ssd

  # Soft preference: scheduler tries to place here, but can choose elsewhere
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
    # Pod affinity: prefer to run near other payment-service Pods
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: payment-service
            topologyKey: kubernetes.io/hostname

  # Tolerations: allow scheduling on nodes with the 'dedicated=high-cpu' taint
  tolerations:
    - key: dedicated
      operator: Equal
      value: high-cpu
      effect: NoSchedule

  # Topology spread: distribute replicas evenly across zones
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: payment-service

  containers:
    - name: payment-service
      image: registry.thecodeforge.io/payment-service:v2.4.1
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1000m"
          memory: "1Gi"
▶ Output
pod/io-thecodeforge-payment-service created

# Verify scheduling decision
kubectl describe pod io-thecodeforge-payment-service -n production | grep -A 10 Events

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5s default-scheduler Successfully assigned production/io-thecodeforge-payment-service to node-3
Mental Model
Requests vs Limits — The Scheduler Only Sees Requests
The scheduler places Pods using requests only; limits are enforced later, on the node. This is why setting requests=limits (Guaranteed QoS) gives the most predictable performance.
  • Guaranteed QoS (requests=limits): Pod is last to be evicted under resource pressure.
  • Burstable QoS (requests < limits): Pod can burst but is evicted before Guaranteed Pods.
  • BestEffort QoS (no requests, no limits): First to be evicted. Never use in production.
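
The three QoS classes follow mechanically from how requests and limits are set. The sketch below is a simplified version of that rule, with one loudly stated assumption: it compares quantity strings literally, so "500m" and "0.5" would count as different even though Kubernetes treats them as equal. The function name and input shape are hypothetical.

```python
def qos_class(containers):
    """Classify a Pod from its containers' resource settings.

    containers: list of dicts with optional 'requests' and 'limits' maps.
    Simplification: quantities are compared as plain strings.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A request that is left unset defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The payment-service container from the example above: requests below limits.
payment = [{"requests": {"cpu": "500m", "memory": "512Mi"},
            "limits": {"cpu": "1000m", "memory": "1Gi"}}]
pinned = [{"requests": {"cpu": "1", "memory": "1Gi"},
           "limits": {"cpu": "1", "memory": "1Gi"}}]

print(qos_class(payment))
print(qos_class(pinned))
print(qos_class([{}]))
```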
📊 Production Insight
Scheduler performance degrades with cluster size and Pod count.
In very large clusters (thousands of Pods and nodes), scheduling latency can grow enough to slow deployment rollouts noticeably.
Mitigate with: Pod topology spread constraints instead of pod anti-affinity (more efficient) and multiple scheduler profiles for different workload classes. Treat webhook-based scheduler extenders as a way to add custom logic, not speed: each one adds a network round trip per scheduling cycle.
The scheduler's scoring algorithm is pluggable — you can weight or disable scoring plugins via a KubeSchedulerConfiguration.
🎯 Key Takeaway
The scheduler is a scoring engine, not a bin-packer.
It ranks feasible nodes and picks the best match at scheduling time.
It does not rebalance, predict load, or consider limits.
Understanding the filter-then-score pipeline — and how nodeSelector, affinity, taints, and topology spread interact within it — is essential for controlling Pod placement at scale.
Scheduling Constraint Selection
If: Pod MUST run on a specific type of node (e.g., GPU, SSD).
Use: nodeSelector or nodeAffinity required mode. Hard constraint: the Pod stays Pending if no node matches.
If: Pod PREFERS a specific node type but can run elsewhere.
Use: nodeAffinity preferred mode. Soft constraint: the scheduler tries to match but places elsewhere if needed.
If: Replicas must be spread across failure domains (zones, nodes).
Use: topologySpreadConstraints. More flexible and performant than pod anti-affinity.
If: Pod should run near (or away from) other specific Pods.
Use: podAffinity (co-locate) or podAntiAffinity (spread). At scale prefer topologySpreadConstraints.
If: Node has taints (dedicated nodes, spot instances).
Use: Add tolerations to the Pod spec. Without a matching toleration, the Pod won't schedule on the tainted node.

Pod Networking: How Containers Talk to Each Other

Kubernetes networking has three fundamental requirements, enforced by the CNI (Container Network Interface) plugin:

  1. Every Pod gets its own IP address, unique across the cluster.
  2. Pods on any node can communicate with Pods on any other node without NAT.
  3. Agents on a node (kubelet, system daemons) can communicate with all Pods on that node.

These requirements are simple to state but complex to implement. The CNI plugin (Calico, Cilium, Flannel, AWS VPC CNI) is responsible for wiring this up. It allocates IP addresses from the node's Pod CIDR range, sets up network interfaces inside the Pod's network namespace, and configures routing rules so Pods can reach each other across nodes.
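
Because each node hands out Pod IPs from its own CIDR, overlapping ranges break cross-node routing. A quick check with Python's standard `ipaddress` module looks like the sketch below. The node names and ranges are example values, and `overlapping_pod_cidrs` is a hypothetical helper, not part of any Kubernetes tooling.

```python
import ipaddress
from itertools import combinations


def overlapping_pod_cidrs(node_cidrs):
    """Return pairs of node names whose Pod CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in node_cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(sorted(nets.items()), 2)
        if net_a.overlaps(net_b)
    ]


good = {"node-1": "10.244.0.0/24", "node-2": "10.244.1.0/24", "node-3": "10.244.2.0/24"}
bad = {"node-1": "10.244.0.0/24", "node-2": "10.244.1.0/24", "node-3": "10.244.1.128/25"}

print(overlapping_pod_cidrs(good))
print(overlapping_pod_cidrs(bad))

# A Pod IP should fall inside the CIDR of the node it runs on.
print(ipaddress.ip_address("10.244.1.45") in ipaddress.ip_network("10.244.1.0/24"))
```

The input map can be built from the output of `kubectl get nodes` with the podCIDR jsonpath query used elsewhere in this guide.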

kube-proxy handles Service networking. It watches the API server for Service and Endpoint objects, then programs iptables rules (or IPVS rules) on each node. When a Pod connects to a Service's ClusterIP, the kernel's iptables rules intercept the connection and DNAT it to one of the backend Pod IPs. This is why Service IPs are virtual — they do not exist on any network interface.
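
In iptables mode, the backend choice is usually implemented as a chain of rules evaluated in order, where rule i matches with probability 1/(N-i) and the last rule always matches. The sketch below only reproduces that arithmetic to show why the result is an even split; it does not read or generate real iptables rules, and the function names are illustrative.

```python
from fractions import Fraction


def rule_probabilities(backend_count):
    """Per-rule match probability for N backends evaluated in order."""
    return [Fraction(1, backend_count - i) for i in range(backend_count)]


def effective_share(backend_count):
    """Overall share of new connections each backend ends up receiving."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probabilities(backend_count):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


print([str(p) for p in rule_probabilities(3)])
print([str(s) for s in effective_share(3)])
```

The per-rule probabilities look uneven (1/3, 1/2, 1), but each backend still receives one third of new connections. The selection is per connection and random, so it balances connections, not request load.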

networking-debug.yaml · YAML
# Debugging Pod networking step by step

# 1. Verify Pod has an IP address
kubectl get pods -n production -o wide
# If Pod IP is <none>, the CNI plugin failed to assign an address

# 2. Check if the CNI plugin is healthy
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|aws-node'

# 3. Verify Pod CIDR allocation per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Each node must have a unique, non-overlapping CIDR

# 4. Test Pod-to-Pod connectivity across nodes
kubectl exec -it pod-on-node-a -- ping <pod-ip-on-node-b>
# If this fails but intra-node works, the CNI cross-node routing is broken

# 5. Check Service endpoints
kubectl get endpoints payment-service -n production
# If endpoints are empty, no Pods match the Service's selector

# 6. Test Service DNS resolution
kubectl exec -it <pod> -- nslookup payment-service.production.svc.cluster.local
# If DNS fails, check CoreDNS pods: kubectl get pods -n kube-system | grep coredns

# 7. Inspect iptables rules for a Service
# (run on the node where your Pod is running)
iptables-save | grep <service-cluster-ip>
▶ Output
NAME READY STATUS IP NODE
payment-service-7d8f9-abc12 1/1 Running 10.244.1.45 node-2
payment-service-7d8f9-def34 1/1 Running 10.244.2.78 node-3

NAME READY STATUS RESTARTS AGE
calico-node-abc12 1/1 Running 0 30d
calico-kube-controllers-5d78-def34 1/1 Running 0 30d

node-1 10.244.0.0/24
node-2 10.244.1.0/24
node-3 10.244.2.0/24

PING 10.244.2.78 (10.244.2.78): 56 data bytes
64 bytes from 10.244.2.78: seq=0 ttl=62 time=0.456 ms

NAME ENDPOINTS AGE
payment-service 10.244.1.45:8080,10.244.2.78:8080 15d

Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: payment-service.production.svc.cluster.local
Address 1: 10.96.45.12 payment-service.production.svc.cluster.local
Mental Model
The Three Layers of K8s Networking
Most networking bugs are CNI or DNS issues, not application code issues.
  • Pod IP works but Service IP fails: kube-proxy or iptables issue.
  • Service IP works but DNS fails: CoreDNS issue.
  • DNS works but external access fails: Ingress controller or cloud LB issue.
📊 Production Insight
CNI plugin selection has massive performance and operational implications.
Calico (BGP mode) scales well but requires BGP peering knowledge.
Cilium (eBPF) bypasses iptables entirely, offering better performance at scale (>1000 Services) but is more complex to debug.
AWS VPC CNI assigns real VPC IP addresses to Pods, simplifying security group integration but consuming VPC IP space rapidly.
Evaluate CNI based on: Service count, NetworkPolicy requirements, observability needs, and team expertise.
🎯 Key Takeaway
Kubernetes networking is a layered system: CNI for Pod connectivity, kube-proxy for Service load balancing, Ingress for external access.
Debug from the bottom up — Pod IP, then ClusterIP, then DNS, then Ingress.
CNI plugin choice is a long-term architectural decision with performance, security, and operational trade-offs.

Kubernetes Storage: PersistentVolumes, Claims, and StorageClasses

Kubernetes storage decouples the Pod lifecycle from the data lifecycle. A Pod can be deleted and recreated, but its data persists if it uses a PersistentVolume (PV) and PersistentVolumeClaim (PVC). This is critical for stateful workloads like databases.

PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically by a StorageClass. It is a cluster resource, like a node. PVs have a capacity and access mode (ReadWriteOnce, ReadOnlyMany, ReadWriteMany).

PersistentVolumeClaim (PVC) is a request for storage by a user. It specifies size and access mode. Kubernetes binds a PVC to a PV that meets the requirements. If no matching PV exists, the PVC remains Pending — unless a StorageClass with a dynamic provisioner is referenced.

StorageClass defines a class of storage. It specifies the provisioner (e.g., ebs.csi.aws.com), parameters (type, IOPS), and reclaim policy. When a PVC references a StorageClass, the provisioner automatically creates a PV that satisfies the claim.
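
The binding rule for statically provisioned volumes can be sketched as a matching function: same class, sufficient capacity, and access modes that cover the claim. This is a simplified model under stated assumptions: sizes are plain GiB integers, the `PV` and `PVC` classes are hypothetical, and preferring the smallest volume that fits is an assumption of this sketch, not a documented guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PV:
    name: str
    capacity_gi: int
    access_modes: frozenset
    storage_class: str


@dataclass(frozen=True)
class PVC:
    request_gi: int
    access_modes: frozenset
    storage_class: str


def find_match(pvc, volumes):
    """Return the smallest PV that satisfies the claim, or None (PVC stays Pending)."""
    candidates = [
        pv for pv in volumes
        if pv.storage_class == pvc.storage_class
        and pvc.access_modes <= pv.access_modes   # claim's modes must all be offered
        and pv.capacity_gi >= pvc.request_gi
    ]
    return min(candidates, key=lambda pv: pv.capacity_gi, default=None)


rwo = frozenset({"ReadWriteOnce"})
volumes = [
    PV("pv-a", 500, rwo, "fast"),
    PV("pv-b", 100, rwo, "fast"),
    PV("pv-c", 200, rwo, "slow"),
]

match = find_match(PVC(100, rwo, "fast"), volumes)
print(match.name if match else "Pending")

match = find_match(PVC(100, frozenset({"ReadWriteMany"}), "fast"), volumes)
print(match.name if match else "Pending")
```

With dynamic provisioning the "no match" branch does not end in Pending: the provisioner named by the StorageClass creates a new PV sized to the claim.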

storage-example.yaml · YAML
# StorageClass for AWS gp3 volumes with 3000 IOPS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io-thecodeforge-fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# PVC that uses the StorageClass above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: io-thecodeforge-payment-db-pvc
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: io-thecodeforge-fast
---
# Pod using the PVC
apiVersion: v1
kind: Pod
metadata:
  name: io-thecodeforge-payment-db
  namespace: production
spec:
  containers:
    - name: postgres
      image: postgres:16
      env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: io-thecodeforge-payment-db-pvc
▶ Output
# After creation
kubectl get sc
# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
# io-thecodeforge-fast ebs.csi.aws.com Delete WaitForFirstConsumer

kubectl get pvc -n production
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# io-thecodeforge-payment-db-pvc Bound pvc-abc123 100Gi RWO io-thecodeforge-fast

kubectl get pv
# NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM
# pvc-abc123 100Gi RWO Delete Bound production/io-thecodeforge-payment-db-pvc
⚠ PV Reclaim Policy: The Silent Data Loss Trap
  • When a PVC is deleted, what happens to the underlying PV depends on its persistentVolumeReclaimPolicy:
  • Retain: PV remains but is in Released state; you must manually reclaim it.
  • Delete: PV and underlying storage are deleted. This is the default for dynamic provisioners.
  • Recycle: Deprecated. Attempts to scrub and re-use.
  • Production gotcha: If you delete a PVC with a Delete reclaim policy without first taking a snapshot, you lose all data. Always set Retain for critical databases, or use a backup solution.
📊 Production Insight
Dynamic provisioning with a default StorageClass needs care: a PVC that omits storageClassName is silently given the default class, while a misspelled class name leaves the PVC Pending.
If you delete a Namespace, all PVCs in it are deleted, and with ReclaimPolicy=Delete, all data is gone.
Monitor PVC usage and set ResourceQuotas on storage requests to prevent runaway claims from exhausting your cloud budget.
For stateful workloads, use StatefulSets with volumeClaimTemplates to automatically generate unique PVCs per replica.
🎯 Key Takeaway
Kubernetes storage is about lifecycle decoupling: PV is a resource, PVC is a claim, StorageClass enables dynamic provisioning.
The reclaim policy determines whether data survives PVC deletion — set to Retain for anything irreplaceable.
StatefulSets with volumeClaimTemplates are the correct pattern for database-like workloads.
Choosing a Storage Approach
If: Ephemeral data such as logs, scratch space, caches.
Use: emptyDir volume. Data is lost when the Pod is deleted, which is expected.
If: Persistent data that must survive Pod restarts (single replica).
Use: A PVC with a StorageClass that provides a block store (EBS, Persistent Disk). Use ReadWriteOnce access mode.
If: Multi-replica app that needs shared read-write access (e.g., NFS, shared config).
Use: A PVC with ReadWriteMany access mode. Not all provisioners support it; consider NFS, EFS, or GlusterFS.
If: Database with strict consistency requirements (Postgres, MySQL).
Use: A single PVC per replica (StatefulSet + volumeClaimTemplates). Prefer local SSDs or dedicated EBS volumes. Never use ReadWriteMany for databases.

Namespaces, Resource Quotas, and Multi-Tenancy

Namespaces are virtual clusters within a physical cluster. They provide isolation boundaries for resources, RBAC, and network policies. Every resource lives in a namespace — except cluster-scoped resources like Nodes and PersistentVolumes.

ResourceQuota limits aggregate resource consumption within a namespace. You can set quotas on CPU, memory, Pod count, PVC storage, and even the number of Services. Without quotas, a single misconfigured application can consume all cluster resources and starve others.
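
Quota enforcement is an admission-time comparison: current usage plus the new object's requests must stay within the hard limit for every tracked resource. The sketch below models that check with plain integers (millicores and MiB) to avoid quantity parsing. The function name `admit` and the numbers are illustrative, not taken from a real cluster.

```python
def admit(hard, used, pod_requests):
    """Return (allowed, violated_resources) for a Pod against a namespace quota."""
    violations = sorted(
        resource
        for resource, amount in pod_requests.items()
        if resource in hard and used.get(resource, 0) + amount > hard[resource]
    )
    return (not violations, violations)


# CPU in millicores, memory in MiB.
hard = {"requests.cpu": 10_000, "requests.memory": 20_480, "pods": 50}
used = {"requests.cpu": 3_500, "requests.memory": 7_168, "pods": 12}

small_pod = {"requests.cpu": 500, "requests.memory": 512, "pods": 1}
huge_pod = {"requests.cpu": 7_000, "requests.memory": 512, "pods": 1}

print(admit(hard, used, small_pod))
print(admit(hard, used, huge_pod))
```

The check uses requests, not actual usage. A namespace can therefore hit its quota while its Pods sit mostly idle.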

LimitRange sets default requests/limits and min/max constraints for Pods in a namespace. This prevents a Pod from requesting an absurd amount of resources or running without any limits.
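
LimitRange defaulting and validation can be sketched as a function that fills in missing values and then checks the bounds. This is a simplified model with assumptions labeled in the comments: values are plain integers (millicores and MiB), only the min-request and max-limit checks are modeled, and the function and constant names are hypothetical.

```python
LIMIT_RANGE = {
    "default": {"cpu": 500, "memory": 512},         # default limits
    "defaultRequest": {"cpu": 100, "memory": 128},  # default requests
    "max": {"cpu": 2000, "memory": 4096},
    "min": {"cpu": 50, "memory": 64},
}


def apply_limit_range(container, limit_range):
    """Fill in missing requests/limits, then validate against min and max.

    Simplification: only min-request and max-limit checks are modeled.
    """
    requests = dict(container.get("requests", {}))
    limits = dict(container.get("limits", {}))
    for resource in ("cpu", "memory"):
        user_set_limit = resource in limits
        limits.setdefault(resource, limit_range["default"][resource])
        if resource not in requests:
            # An unset request follows an explicit limit, else the defaultRequest.
            requests[resource] = (
                limits[resource] if user_set_limit
                else limit_range["defaultRequest"][resource]
            )
        if requests[resource] < limit_range["min"][resource]:
            raise ValueError(f"{resource} request is below the namespace minimum")
        if limits[resource] > limit_range["max"][resource]:
            raise ValueError(f"{resource} limit is above the namespace maximum")
    return {"requests": requests, "limits": limits}


# A container with no resources set gets the namespace defaults.
print(apply_limit_range({}, LIMIT_RANGE))

# A container asking for 4 CPUs exceeds the 2-CPU maximum and is rejected.
try:
    apply_limit_range({"limits": {"cpu": 4000}}, LIMIT_RANGE)
except ValueError as err:
    print("rejected:", err)
```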

Multi-tenancy with Namespaces is common: each team gets its own namespace, with RBAC restricting cross-namespace access. But true multi-tenancy (running untrusted workloads) requires additional isolation — consider virtual clusters (vClusters) or sandbox containers (gVisor, Kata Containers).

quota-and-limitrange.yaml · YAML
# ResourceQuota for a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: io-thecodeforge-team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "10"
    requests.storage: 500Gi
    pods: "50"
    services: "10"
---
# LimitRange to enforce default resource boundaries
apiVersion: v1
kind: LimitRange
metadata:
  name: io-thecodeforge-default-limits
  namespace: team-a
spec:
  limits:
    - default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "100m"
        memory: 128Mi
      max:
        cpu: "2"
        memory: 4Gi
      min:
        cpu: "50m"
        memory: 64Mi
      type: Container
▶ Output
# After applying
kubectl describe resourcequota -n team-a
# Name: io-thecodeforge-team-quota
# Namespace: team-a
# Resource Used Hard
# -------- --- ---
# pods 12 50
# requests.cpu 3.5 10
# requests.memory 7Gi 20Gi
# limits.cpu 8 20
# limits.memory 18Gi 40Gi
# persistentvolumeclaims 2 10
# requests.storage 120Gi 500Gi
# services 4 10

kubectl describe limitrange -n team-a
# Limits:
# Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
# ---- -------- --- --- --------------- ------------- -----------------------
# Container cpu 50m 2 100m 500m -
# Container memory 64Mi 4Gi 128Mi 512Mi -
🔥Multi-Tenancy Warning
Using Namespaces alone for security isolation between untrusted tenants is insufficient. A Pod in one namespace can still connect to a Service in another namespace unless NetworkPolicies block it. Also, a Pod can access the API server with its ServiceAccount token — RBAC must be scoped per namespace. For hard multi-tenancy, consider dedicated clusters, virtual clusters (vCluster), or sandbox runtimes.
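The cross-namespace exposure described in this warning can be closed with a NetworkPolicy. The following is a minimal sketch, assuming the team-a namespace from the quota example and a CNI plugin that actually enforces NetworkPolicy (not all of them do). The policy name is illustrative.

```yaml
# Hypothetical policy name; the namespace team-a is reused from the quota example.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: team-a
spec:
  # An empty podSelector selects every Pod in the namespace.
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    # Allow ingress only from Pods in this same namespace.
    # Any other ingress, including from other namespaces, is denied.
    - from:
        - podSelector: {}
```

This sketch only restricts ingress. Egress stays open unless you add Egress to policyTypes with its own rules.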
📊 Production Insight
Without ResourceQuotas, a single team's application can silently consume all cluster resources and block other teams.
Quota enforcement is immediate — if a team tries to create a Pod that would exceed its quota, the API server rejects it.
LimitRange is essential for environments where teams might forget to set requests and limits — it prevents BestEffort Pods by default.
Monitor namespace usage with kubectl top and Prometheus alerts to catch quota exhaustion before it causes deployment failures.
🎯 Key Takeaway
Namespaces provide lightweight isolation — RBAC, ResourceQuotas, and NetworkPolicies make them safe for trusted tenants.
Always set ResourceQuotas and LimitRange per namespace in shared clusters — without them, one team can starve everyone.
For truly untrusted workloads, Namespaces alone are not enough: reach for dedicated clusters or sandbox containers.
Namespace Isolation Strategy
IfSame team, different environments (dev, staging).
UseSeparate namespaces per environment. RBAC restricts team members to their environment. No ResourceQuota needed for dev, but set for staging to match production.
IfMultiple teams on a shared cluster.
UseOne namespace per team with ResourceQuota, LimitRange, and RBAC. Apply a default-deny NetworkPolicy to block cross-namespace traffic, then allow specific flows explicitly.
IfRunning untrusted code (CI/CD agents, third-party apps).
UseUse a separate cluster or sandbox runtime. Namespace-level isolation is insufficient. Consider virtual clusters with vCluster or run in a separate pool of nodes with taints and tolerations.
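The per-team RBAC from the shared-cluster case can be sketched as a namespaced Role plus a RoleBinding. This is an illustrative example, not a recommended permission set: the Role name, the group name team-a-devs, and the resource list are all assumptions you would adapt to your identity provider and workloads.

```yaml
# Role is namespaced: these permissions apply only inside team-a.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-developer        # hypothetical name
  namespace: team-a
rules:
  - apiGroups: ["", "apps"]     # "" is the core API group
    resources: ["pods", "services", "configmaps", "deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# RoleBinding grants the Role to a group, again only inside team-a.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-devs           # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-developer
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, members of the group get no access to other teams' namespaces unless another binding grants it.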
🗂 Kubernetes Component Comparison
Role, scope, and failure impact of each control plane and node component.
Component | Role | Failure Impact | Recovery
kube-apiserver | Validates and serves all API requests. Gateway to etcd. | No new deployments, scaling, or config changes. Existing Pods continue running. | Restart the process. If HA, load balancer routes to healthy replica.
etcd | Distributed key-value store. Single source of truth for all cluster state. | Cluster freezes — no state changes possible. If quorum lost, cluster is partitioned. | Restore from snapshot or replace failed member. Requires etcdctl expertise.
kube-scheduler | Assigns unscheduled Pods to nodes based on resource availability and constraints. | New Pods stuck in Pending. Existing Pods unaffected. | Restart the process. If leader election fails, check lease in etcd.
kube-controller-manager | Runs reconciliation loops for Deployments, ReplicaSets, Nodes, Endpoints, etc. | No self-healing. Crashed Pods not restarted. Scaling stops. Node failures not detected. | Restart the process. Controllers resume reconciliation from current state.
kubelet | Node agent. Pulls images, starts containers, reports node status to API server. | Pods on that node stop being managed. Node marked NotReady after 40s (default). Pods evicted after 5 minutes. | Restart kubelet. If node is unhealthy, cordoning and replacing the node may be necessary.
kube-proxy | Programs iptables/IPVS rules for Service load balancing on each node. | Services unreachable from Pods on that node. Cross-node Service access still works from other nodes. | Restart the process. Rules are rebuilt from current Service/Endpoint state.
CoreDNS | Cluster DNS. Resolves Service names to ClusterIPs. | Service DNS resolution fails. Pods can still reach other Pods by direct IP. | Restart CoreDNS Pods. Check ConfigMap for misconfiguration.

🎯 Key Takeaways

  • The reconciliation loop is the fundamental operating principle of every Kubernetes controller. Understanding it transforms debugging from trial-and-error to systematic investigation.
  • etcd is the single point of truth and the most common root cause of cluster-wide issues. Its disk latency is the cluster's ceiling.
  • The scheduler scores nodes — it does not bin-pack, predict load, or rebalance. Scheduling decisions are permanent until the Pod is explicitly moved.
  • Kubernetes networking is layered (CNI, kube-proxy, Ingress). Debug from the bottom up: Pod IP, ClusterIP, DNS, Ingress.
  • Resource requests drive scheduling; resource limits drive runtime enforcement. Setting requests=limits (Guaranteed QoS) gives the most predictable behavior.
  • Storage is decoupled from Pod lifecycle via PersistentVolumes and PersistentVolumeClaims. The reclaim policy determines whether data survives PVC deletion: set it to Retain for data you cannot recreate.
  • Namespaces provide isolated environments, but true security requires RBAC, NetworkPolicies, and ResourceQuotas. Without quotas, a single app can starve the cluster.

⚠ Common Mistakes to Avoid

    Running etcd on network-attached storage
    Symptom

    API server timeouts, scheduler freezes, cluster becomes unresponsive during high write load.

    Fix

    etcd requires local SSDs with <10ms p99 write latency. Use provisioned IOPS EBS at minimum, instance-local NVMe ideally. Monitor etcd_disk_wal_fsync_duration_seconds as a critical metric.
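One way to watch the metric named above is a Prometheus alerting rule. This is a sketch in the plain Prometheus rules-file format, assuming Prometheus already scrapes etcd. The 0.01 threshold is the 10ms figure from the fix; the alert name, the 10m hold time, and the severity label are assumptions to tune for your environment.

```yaml
groups:
  - name: etcd-disk-latency          # hypothetical group name
    rules:
      - alert: EtcdWalFsyncSlow      # hypothetical alert name
        # p99 of WAL fsync duration per etcd instance over the last 5 minutes.
        expr: |
          histogram_quantile(0.99,
            sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
          ) > 0.01
        for: 10m                     # assumption: ignore short spikes
        labels:
          severity: warning
        annotations:
          summary: "etcd p99 WAL fsync latency above 10ms on {{ $labels.instance }}"
```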

    Setting resource limits without requests (or vice versa)
    Symptom

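    Pods with neither requests nor limits get BestEffort QoS and are first to be evicted under resource pressure. Pods with requests but no limits can consume unbounded node resources, and limits without requests make Kubernetes default the requests to the limits, which may reserve more capacity than intended.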

    Fix

    Always set both requests and limits. For predictable performance, set requests=limits (Guaranteed QoS). Use Vertical Pod Autoscaler (VPA) in 'off' mode to get right-sizing recommendations.
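The fix above can be sketched as a Deployment whose container has requests equal to limits, plus a VerticalPodAutoscaler in "Off" mode that only produces recommendations. Assumptions: the VPA add-on and its CRDs are installed in the cluster (they are not part of core Kubernetes), and the names, image, and sizes are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.4.2   # hypothetical image
          resources:
            # requests == limits for both CPU and memory gives Guaranteed QoS.
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "500m"
              memory: 512Mi
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-recommender                      # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Off"                        # recommend only, never evict or resize
```

With the VPA installed, kubectl describe vpa app-recommender should show the recommended requests once enough usage data has been collected.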

    Using `latest` tag for container images
    Symptom

    Different nodes run different versions of the same image because latest is mutable. Rollbacks are impossible because you cannot determine which latest was running at a given time.

    Fix

    Always use immutable, versioned tags (git SHA or semantic version). Never use latest in production. Use image digests (image: repo@sha256:abc123...) for maximum determinism.

    No PodDisruptionBudgets on critical services
    Symptom

    Node maintenance or cluster upgrade drains all Pods of a service simultaneously, causing complete outage.

    Fix

    Define PDBs with minAvailable: 1 (or percentage) for all production services. This ensures voluntary disruptions (drains) respect availability constraints.
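A minimal PodDisruptionBudget along these lines might look as follows. The name and the app: web label are assumptions; the selector must match the labels on your own Pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb              # hypothetical name
spec:
  # At least one matching Pod must stay available during voluntary disruptions.
  minAvailable: 1
  selector:
    matchLabels:
      app: web               # assumption: Pods of the service carry this label
```

Note that with a single replica, minAvailable: 1 blocks node drains entirely, so pair it with at least two replicas.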

    Ignoring liveness probes that restart Pods unnecessarily
    Symptom

    Pods in CrashLoopBackOff because liveness probe fails during slow startup. Each restart makes startup slower (cold cache), creating a death spiral.

    Fix

    Use startupProbe for slow-starting containers. The liveness probe only activates after the startup probe succeeds. Set appropriate initialDelaySeconds and failureThreshold.
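The probe setup described above can be sketched like this. The image, port, and /healthz path are assumptions; the numbers give the container up to 300 seconds (30 failures at a 10 second period) to start before the liveness probe takes over.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-starter                        # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2 # hypothetical image
      ports:
        - containerPort: 8080
      # Liveness and readiness checks are held back until this probe succeeds.
      startupProbe:
        httpGet:
          path: /healthz                    # assumption: app exposes this endpoint
          port: 8080
        periodSeconds: 10
        failureThreshold: 30                # 30 x 10s = up to 300s to start
      # After startup, three consecutive failures trigger a restart.
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
```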

    No RBAC restrictions
    Symptom

    A compromised Pod with a mounted ServiceAccount token can read all Secrets in the cluster, escalate privileges, and pivot to other namespaces.

    Fix

    Create dedicated ServiceAccounts per workload. Bind minimal RBAC roles. Set automountServiceAccountToken: false on Pods that don't need API access. Use NetworkPolicies to restrict Pod-to-Pod traffic.
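For a workload that never talks to the API server, the fix can be sketched as a dedicated ServiceAccount with token mounting disabled. Names and image are placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-sa                              # hypothetical name
# Default for every Pod that uses this ServiceAccount.
automountServiceAccountToken: false
---
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  serviceAccountName: web-sa
  # The Pod-level setting takes precedence over the ServiceAccount's;
  # stating it here makes the intent explicit in the workload manifest.
  automountServiceAccountToken: false
  containers:
    - name: web
      image: registry.example.com/web:1.4.2 # hypothetical image
```

No RoleBinding references web-sa here, so even if a token were obtained another way, it would carry no extra permissions.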

    Not setting StorageClass reclaim policy for critical data
    Symptom

    Deleting a PVC permanently deletes the underlying PV and all data if reclaimPolicy is Delete.

    Fix

    For stateful workloads (databases, queues), create a custom StorageClass with reclaimPolicy: Retain. Set a backup policy and take regular snapshots.
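A StorageClass with the Retain policy might look like the sketch below. It assumes the AWS EBS CSI driver (provisioner ebs.csi.aws.com) with gp3 volumes, since this article's incident involved EBS; substitute the provisioner and parameters of your own storage backend.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retain-ssd                  # hypothetical name
provisioner: ebs.csi.aws.com        # assumption: AWS EBS CSI driver is installed
parameters:
  type: gp3
# Keep the PersistentVolume and its data when the PVC is deleted.
reclaimPolicy: Retain
# Delay provisioning until a Pod is scheduled, so the volume lands in the right zone.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

With Retain, released volumes are not cleaned up automatically, so you need a process for reclaiming or deleting them by hand.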

Interview Questions on This Topic

  • Q: Explain the Kubernetes reconciliation loop. How does it apply to a Deployment managing a ReplicaSet managing Pods? (Mid-level)
    The reconciliation loop is a continuous observe-diff-act cycle. The Deployment controller watches the desired state (replicas, Pod template) through the API server, which persists it in etcd, and creates or updates ReplicaSets to match. The ReplicaSet controller compares the number of running Pods with its desired replica count. If a Pod is deleted, the ReplicaSet controller sees the count is below desired and creates a replacement Pod. The scheduler assigns it to a node, and the kubelet starts it. Controllers react to watch events and also resync periodically, which makes Kubernetes self-healing without human intervention.
  • Q: What happens when you delete a Pod that belongs to a Deployment? Trace the full sequence of events through every controller involved. (Senior)
    When you kubectl delete pod on a Pod owned by a Deployment, the deletion is processed by the API server, which notifies all watchers. The ReplicaSet controller, which watches for Pod changes, sees that the Pod count has dropped below the ReplicaSet's desired replicas. It then creates a new Pod object from the same template. The scheduler picks up the newly created unscheduled Pod (spec.nodeName empty) and runs its filter/score phases to assign it to a node, then binds the Pod by setting its nodeName. The kubelet on the target node sees a Pod bound to it, pulls the image, starts the container, and reports status back. Once the Pod is ready, the endpoints controller adds it to the endpoints of matching Services. The Deployment controller does not create Pods itself; it monitors the ReplicaSet to ensure it has the correct Pod template and count. If minReadySeconds is set, the replacement Pod counts as available only after it has been ready for that many seconds.
  • Q: How does the kube-scheduler decide which node to place a Pod on? What are the two phases, and what plugins participate in each? (Senior)
    The scheduler has two phases: filtering and scoring. In filtering, predicate plugins eliminate nodes that cannot run the Pod. Default filters include: NodeResourcesFit (checks requested resources), NodeUnschedulable, NodeName, NodePorts, NodeAffinity, TaintToleration, VolumeBinding (for PVCs), and InterPodAffinity (hard rules). If zero nodes remain, Pod stays Pending. In scoring, priority plugins assign a score (0-100) to each feasible node. Default scoring plugins: NodeResourcesBalancedAllocation (prefers balanced CPU/memory), NodeResourcesLeastAllocated (prefers nodes with more free resources), ImageLocality (prefers nodes with the container image already cached), InterPodAffinity (soft rules), and TaintToleration (prefers nodes with fewer taints). The node with the highest weighted sum wins. Ties are broken randomly.
  • Q: What is the difference between a Service's ClusterIP and the Pod IPs it routes to? How does kube-proxy implement this? (Mid-level)
    ClusterIP is a virtual IP assigned to a Service, reachable only from within the cluster. It does not correspond to any network interface. Pod IPs are real IPs assigned by the CNI plugin to each Pod, routable across nodes. kube-proxy watches the API server for Service and Endpoint changes. It programs iptables (or IPVS) rules on each node that intercept traffic to the ClusterIP and DNAT it to one of the backend Pod IPs (chosen randomly in iptables mode). This happens in the kernel, with conntrack keeping each connection pinned to the backend that was chosen. If the Service has no ready Pods, the Endpoints list is empty, and traffic to the ClusterIP is rejected (connection refused).
  • Q: A Pod is stuck in Pending. Walk me through your debugging process, from the first command you would run to identifying the root cause. (Mid-level)
    First, run kubectl describe pod <name> and read the Events section. Common reasons: insufficient CPU/memory on nodes, PVC not bound, node affinity/taint mismatch, resource quota exceeded. If no events, the scheduler may be down: kubectl get pods -n kube-system | grep scheduler. If the scheduler is running but no events, check if any node has enough allocatable resources: kubectl describe nodes | grep -A5 Allocated. Also check for taints: kubectl describe nodes | grep Taints. If the Pod has a PVC, verify it is Bound: kubectl get pvc. If using a StorageClass, ensure the provisioner is running. If no root cause found, check cluster-wide resource quotas: kubectl describe quota -n <namespace>. Finally, check if there are too many Pods on a single node (node capacity).
  • Q: Explain etcd's role in the cluster. What happens if etcd loses quorum? How would you recover? (Senior)
    etcd is the distributed key-value store that holds all Kubernetes state: Pod definitions, ConfigMaps, Secrets, nodes, etc. It uses the Raft consensus protocol, which requires a majority of members to agree on any state change. If etcd loses quorum (e.g., 2 out of 3 members fail), no more writes or reads from etcd are possible — the cluster is effectively frozen. New Pods cannot be scheduled, existing Pods continue running but cannot be updated. To recover, you must restore from a backup snapshot. On a surviving etcd member, stop etcd, delete the data directory, restore from the latest snapshot using etcdctl snapshot restore, and then start etcd. Then join the other restored members. This is a high-risk operation — always practice it in a non-production environment first.
  • Q: What is the difference between requests and limits, and how do they affect scheduling vs runtime behavior? (Mid-level)
    Requests are used by the scheduler for node capacity decisions. The scheduler sums requests across all Pods on a node and ensures the total does not exceed the node's allocatable capacity. Limits are enforced at runtime by the kernel (cgroups). If a container exceeds its CPU limit, it gets throttled; if it exceeds its memory limit, it gets OOMKilled. Setting only one of the two leads to surprises: without limits, a Pod can consume all node resources and starve others; without requests or limits, the scheduler reserves nothing for the Pod and may overcommit the node. Best practice: set requests=limits (Guaranteed QoS) for production workloads. Guaranteed Pods get the most predictable performance and are the last to be evicted under node pressure.
  • Q: How would you design a zero-downtime deployment strategy using Kubernetes primitives (Deployments, PDBs, health checks)? (Senior)
    Use a Deployment with a rolling update strategy (maxSurge: 25%, maxUnavailable: 0 for zero downtime during rollout; or maxUnavailable: 25% to allow some tolerance). Set liveness and readiness probes on all Pods. Use minReadySeconds to give the new Pod time to stabilise before it counts as available. Define a PodDisruptionBudget with minAvailable: 1 (or a percentage) to prevent voluntary disruptions (node drains) from taking down all replicas. For StatefulSets, set podManagementPolicy: Parallel and use readiness gates. Additionally, use preStop hooks to gracefully drain connections before terminating. Test the rollout with kubectl rollout status and have a rollback plan (kubectl rollout undo). For critical services, consider a canary or blue-green deployment using a second Deployment and a Service selector swap.

Frequently Asked Questions

What is Kubernetes in simple terms?

Kubernetes is a declarative container orchestration platform. You describe the state you want, and it continuously reconciles what is actually running with that desired state.

What is the difference between a Deployment, a ReplicaSet, and a Pod?

A Pod is the smallest unit — one or more containers sharing a network namespace. A ReplicaSet ensures a specified number of Pod replicas are running at all times. A Deployment manages ReplicaSets and provides declarative updates (rolling updates, rollbacks). The hierarchy is: Deployment -> ReplicaSet -> Pod. You almost never create ReplicaSets or Pods directly — you create Deployments, and the Deployment controller creates the ReplicaSet, which creates the Pods.

What happens if the control plane node goes down?

Existing Pods on worker nodes continue running — the kubelet on each node operates independently of the control plane for running workloads. However, you cannot deploy new workloads, scale existing workloads, update configurations, or modify any cluster state until the control plane recovers. This is why production clusters need at least 3 control plane nodes for high availability.

How does Kubernetes handle node failures?

The Node controller in kube-controller-manager monitors node heartbeats. If a node stops sending heartbeats (default: every 10s), the node is marked NotReady after 40 seconds. After 5 minutes (the pod-eviction-timeout), the control plane evicts Pods from the unreachable node and reschedules them on healthy nodes. During this 5-minute window, the Pods are running but unreachable if the node is truly down. You can tune this timeout, but setting it too low causes unnecessary evictions during temporary network blips.
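In clusters that use taint-based eviction, the five-minute wait can also be tuned per workload through tolerations rather than cluster-wide. This is a sketch under that assumption. The 60 second value is illustrative, and as noted above, going too low causes evictions during brief network blips.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover                         # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2   # hypothetical image
  tolerations:
    # Stay bound to a NotReady node for 60s, then get evicted.
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60
    # Same window when the node controller cannot reach the node at all.
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60
```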

What is the difference between a ConfigMap and a Secret?

Functionally, they are identical — both inject configuration data into Pods as environment variables or mounted files. The difference is intent and handling: Secrets are base64-encoded (not encrypted by default), stored separately in etcd, and can be encrypted at rest with an EncryptionConfiguration. ConfigMaps are for non-sensitive configuration. In production, use an external secrets manager (Vault, AWS Secrets Manager) with the Secrets Store CSI Driver instead of Kubernetes Secrets for sensitive data.
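To make the similarity concrete, here is a small sketch in which one Pod consumes both objects as environment variables. All names and values are placeholders, and the password is deliberately a dummy.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config            # hypothetical name
data:
  LOG_LEVEL: info
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secret            # hypothetical name
type: Opaque
# stringData accepts plain text; the API server stores it base64-encoded under data.
stringData:
  DB_PASSWORD: change-me      # placeholder, never commit real credentials
---
apiVersion: v1
kind: Pod
metadata:
  name: config-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2   # hypothetical image
      # Both sources are injected the same way; only the kind differs.
      envFrom:
        - configMapRef:
            name: app-config
        - secretRef:
            name: app-secret
```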

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
