etcd Disk Latency — How 800ms Killed the Kubernetes Cluster
The degraded etcd EBS volume caused 800ms write latency, stalling Raft consensus and freezing the entire cluster.
20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.
- Kubernetes is a declarative container orchestration platform that continuously reconciles observed state with desired state.
- Control plane: kube-apiserver, etcd, kube-scheduler, kube-controller-manager — each has a distinct role and failure mode.
- etcd is the single source of truth — its disk latency is the cluster's performance ceiling.
- The scheduler filters then scores nodes; it does NOT rebalance or predict load.
- kubelet on each node runs the actual containers and reports status back to the API server.
- Many control-plane outages trace back to etcd provisioning or misconfiguration rather than application code.
Imagine you own a giant warehouse with hundreds of workers. Instead of telling each worker exactly what to do every minute, you hire a smart manager who reads a wish list ('I need 5 boxes packed, always'), watches the floor, and reassigns workers automatically when someone calls in sick. Kubernetes is that manager — you describe what your software should look like, and Kubernetes keeps reality matching the wish list, forever, across thousands of machines.
Kubernetes is not a deployment tool. It is a distributed state reconciliation engine. Every component — from the scheduler to the kubelet — operates on the same principle: watch the desired state in etcd, compare it with observed state, and act to close the gap. This is the mental model that unlocks real debugging capability.
The control plane is the brain. etcd is the memory. The kubelet is the muscle on each node. The scheduler decides placement. When any of these components degrades, the symptoms are often misleading — a Pod stuck in Pending looks like a scheduling problem but is frequently an etcd latency issue or a resource quota misconfiguration.
The common misconception is that Kubernetes 'runs containers.' It does not. Kubernetes manages the desired state of workloads. The container runtime (containerd, CRI-O) runs containers. Kubernetes tells the runtime what to run, monitors whether it is running, and corrects deviations. This distinction matters when debugging crashes, image pull failures, and networking issues.
What etcd Latency Actually Does to Kubernetes
etcd is the distributed key-value store that backs Kubernetes, holding all cluster state: Pods, Services, ConfigMaps, Secrets. The core mechanic: every write goes through the Raft leader and must be persisted by a majority of members (quorum) before it is considered committed. A slow disk on a single follower can be outvoted by the healthy majority, but a slow disk on the leader, or on enough members to break quorum, stalls every write in the cluster. In practice, etcd's performance is measured by fsync latency: the time to flush a write-ahead-log entry to disk. The kube-apiserver depends on etcd's linearizable reads and writes, and the scheduler and controller-manager depend on the API server, so the whole control plane inherits etcd's latency. With etcd's default 100ms heartbeat interval, fsync latency above roughly 100ms starts to cause missed heartbeats, watch timeouts, and leader elections. At 800ms, close to the default 1000ms election timeout, the cluster can enter a death spiral: heartbeats fail, leaders step down, and new writes stop succeeding. Every Kubernetes cluster uses etcd, but its sensitivity to disk I/O is often underestimated. That matters because a slow disk, rather than CPU or memory, is a common cause of control-plane outages in production.
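The quorum rule above can be sketched as a toy latency model. This is a simplification under stated assumptions, not how etcd is implemented: network time is treated as zero, the leader is assumed to persist every entry itself, and the function names are hypothetical.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def commit_latency_ms(leader_fsync_ms, follower_fsync_ms):
    """Toy model: a write commits once the leader and enough followers have fsynced.

    Assumptions: zero network latency, and the leader always persists the entry.
    """
    members = 1 + len(follower_fsync_ms)
    followers_needed = quorum_size(members) - 1
    if followers_needed == 0:
        return leader_fsync_ms
    # Only the fastest followers needed for a majority matter.
    fastest = sorted(follower_fsync_ms)[:followers_needed]
    return max(leader_fsync_ms, fastest[-1])


print(commit_latency_ms(5, [5, 800]))   # one slow follower: 5
print(commit_latency_ms(800, [5, 5]))   # slow leader: 800
```

In this model a three-member cluster tolerates one slow follower, but a slow leader sets the latency of every write.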
The 4 Essential Objects: Pod, Service, Deployment, Namespace
Before diving into the architecture, you need a concrete mental model of the four objects you'll use every day. Kubernetes exposes hundreds of resource types, but 80% of your interactions will involve these four.
Pod is the smallest deployable unit. A Pod wraps one or more containers, gives them a shared network namespace (one IP per Pod), and optionally shared storage volumes. Containers in the same Pod can communicate via localhost. Pods are ephemeral — they can be killed and rescheduled at any time. Never run a single Pod without a controller (Deployment, StatefulSet, DaemonSet).
Service provides a stable network endpoint for a set of Pods. Because Pods can die and be replaced with new IPs, a Service gives a fixed IP (ClusterIP) and DNS name that load-balances across the healthy Pods. The Service uses label selectors to determine which Pods belong to it.
Deployment is the most common controller. It declares the desired state for your stateless applications: how many replicas, which container image, resource limits, health checks, update strategy. The Deployment controller creates a ReplicaSet, which creates the Pods. When you update the Pod template, the Deployment creates a new ReplicaSet and gradually scales it up and the old one down (rolling update).
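The rolling update described above is bounded by maxSurge and maxUnavailable. A minimal sketch of that arithmetic follows, assuming the documented defaults of 25% for both, with surge percentages rounded up and unavailable percentages rounded down. The helper names are hypothetical.

```python
import math


def resolve(value, replicas, round_up):
    """Turn an absolute number or a percentage string like '25%' into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the Pod-count envelope a rolling update must stay inside."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))  # {'max_total_pods': 13, 'min_available_pods': 8}
```

With maxUnavailable set to 0, every new Pod needs spare capacity before an old one can be removed, which is worth checking before a rollout on a full cluster.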
Namespace is a virtual cluster boundary. It isolates resources, RBAC, and network policies. Every resource lives in a namespace — except cluster-scoped resources like Nodes and PersistentVolumes. Use namespaces to separate environments (dev, staging, prod) or teams.
Together, these objects form the foundation: you define a Deployment that creates Pods, expose them via a Service, and organize everything in a Namespace.
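How a Service picks its Pods can be illustrated with a small sketch of equality-based label matching. The dictionaries below are hypothetical stand-ins for the real API objects, and the sketch assumes only ready Pods receive traffic.

```python
def selects(selector, labels):
    """True when every key/value pair in the selector appears in the Pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(service, pods):
    """IPs of ready Pods matched by the Service's selector."""
    selector = service.get("selector") or {}
    if not selector:
        # A Service without a selector does not pick Pods automatically.
        return []
    return [p["ip"] for p in pods if p["ready"] and selects(selector, p["labels"])]


pods = [
    {"ip": "10.0.1.4", "ready": True, "labels": {"app": "payment-api", "tier": "backend"}},
    {"ip": "10.0.2.7", "ready": False, "labels": {"app": "payment-api"}},
    {"ip": "10.0.3.9", "ready": True, "labels": {"app": "frontend"}},
]
print(endpoints({"selector": {"app": "payment-api"}}, pods))  # ['10.0.1.4']
```

Extra labels on a Pod do not prevent a match; a missing or different value for any selector key does.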
Control Plane Architecture: The Brain of the Cluster
The Kubernetes control plane consists of four components that work together to maintain cluster state. Understanding each component's role — and its failure modes — is essential for production operations.
kube-apiserver is the front door. Every kubectl command, every controller reconciliation, every kubelet status report goes through the API server. It validates requests, persists state to etcd, and serves as the watch endpoint for all controllers. It is stateless — you can run multiple replicas behind a load balancer for HA.
etcd is the single source of truth. It is a distributed, consistent key-value store built on the Raft consensus protocol. All cluster state — Pod definitions, ConfigMaps, Secrets, node registrations — lives in etcd. If etcd loses quorum, the cluster cannot make any state changes. etcd is the most critical component and the most commonly under-provisioned.
kube-scheduler watches for unscheduled Pods and assigns them to nodes. It does not run Pods — it only writes the nodeName field. The kubelet on the assigned node then pulls the image and starts the container. The scheduler uses a two-phase process: filtering (eliminate infeasible nodes) and scoring (rank feasible nodes, pick the highest score).
kube-controller-manager runs the control loops. Each controller watches a specific resource type and reconciles actual state with desired state. The Deployment controller ensures the right number of replicas exist. The Node controller detects when nodes go unhealthy. The Endpoint controller updates Service endpoints as Pods come and go.
- The API server is the only component that talks to etcd. All other components go through the API server.
- Controllers are level-triggered, not edge-triggered. They care about the current state, not the event that caused it.
- This is why Kubernetes is self-healing. It does not remember what happened — it only checks what is true right now.
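A level-triggered controller can be sketched as a pure function of current state. The example below is illustrative only; the function name and the action strings are hypothetical, not part of any Kubernetes API.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions that close the gap between desired and observed state.

    The function receives no event history: only what is true right now.
    """
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return ["create"] * diff
    if diff < 0:
        return [f"delete {name}" for name in running_pods[:-diff]]
    return []


print(reconcile(3, ["pod-a", "pod-b"]))           # ['create']
print(reconcile(1, ["pod-a", "pod-b", "pod-c"]))  # ['delete pod-a', 'delete pod-b']
```

Because the output depends only on the two inputs, a missed or duplicated event cannot leave the system in a wrong state: the next pass computes the same correction.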
Control Plane vs Data Plane Architecture
Kubernetes is divided into two logical planes: the control plane (brain) and the data plane (muscle). The control plane makes decisions about the cluster state — what should run, where it should run, and whether the current state matches the desired state. The data plane executes those decisions — it runs the actual containers, provides the network connectivity, and reports back the observed state.
The control plane components (kube-apiserver, etcd, scheduler, controller-manager) typically run on dedicated master nodes, though in smaller clusters they may be colocated. The data plane consists of the worker nodes, each running kubelet, kube-proxy, the container runtime, and the CNI plugin.
The key architectural insight: control plane components coordinate through the API server, which is the only component that reads from and writes to etcd, and they never interact directly with user containers. The kubelet on each worker node watches the API server for Pods assigned to its node, then instructs the container runtime to pull images and start containers. kube-proxy watches the API server for Service changes and programs iptables/IPVS rules accordingly.
This separation means that if the control plane fails, existing containers continue running (the kubelet is autonomous for running workloads) but you cannot make any changes. Conversely, if a worker node fails, the control plane detects it (via the Node controller) and reschedules the Pods on healthy nodes after a timeout.
The Scheduler: How Kubernetes Decides Where Pods Run
The kube-scheduler is the component that assigns Pods to nodes. It does not run Pods — it only writes the spec.nodeName field on the Pod object. The kubelet on the assigned node then pulls the image and starts the container.
The scheduler uses a two-phase process:
Filtering (Feasibility): Eliminate nodes that cannot run the Pod. Filter reasons include: insufficient CPU/memory, node taints the Pod cannot tolerate, node affinity mismatches, volume zone constraints, and Pod topology spread constraints. After filtering, if zero nodes remain, the Pod stays in Pending.
Scoring (Ranking): Rank the feasible nodes by a set of scoring plugins. Default scoring includes: NodeResourcesBalancedAllocation (prefer nodes with balanced CPU/memory usage), ImageLocality (prefer nodes that already have the container image), InterPodAffinity (prefer nodes where affinity rules are satisfied), and TaintToleration (prefer nodes with fewer taints). The node with the highest weighted score wins.
The scheduler makes decisions based on the state of the cluster at scheduling time. It does not predict future load. It does not rebalance existing Pods. Once a Pod is scheduled, only explicit actions (eviction, deletion, preemption) can move it.
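The filter-then-score flow can be sketched in a few lines. This is a toy model: the `Node` and `Pod` classes, the scoring weights, and the image-cache bonus are all assumptions made for illustration and do not reproduce the real scheduler plugins.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int
    free_mem_mi: int
    taints: set = field(default_factory=set)
    cached_images: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_m: int
    mem_mi: int
    image: str
    tolerations: set = field(default_factory=set)


def feasible(node, pod):
    """Filtering phase: enough resources and every taint tolerated."""
    return (node.free_cpu_m >= pod.cpu_m
            and node.free_mem_mi >= pod.mem_mi
            and node.taints <= pod.tolerations)


def score(node, pod):
    """Scoring phase (hypothetical weights): spare capacity plus an image-cache bonus."""
    spare = (node.free_cpu_m - pod.cpu_m) + (node.free_mem_mi - pod.mem_mi)
    return spare + (500 if pod.image in node.cached_images else 0)


def schedule(pod, nodes):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(n, pod)).name


nodes = [
    Node("a", 500, 512),
    Node("b", 2000, 4096, taints={"gpu"}),
    Node("c", 1000, 1024, cached_images={"api:1"}),
]
print(schedule(Pod(250, 256, "api:1"), nodes))  # c
```

Node `b` has the most free capacity but is filtered out by its taint before scoring ever runs, which mirrors why a Pod can stay Pending on a cluster that looks half empty.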
- Guaranteed QoS (requests=limits): Pod is last to be evicted under resource pressure.
- Burstable QoS (requests < limits): Pod can burst but is evicted before Guaranteed Pods.
- BestEffort QoS (no requests, no limits): First to be evicted. Never use in production.
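The three QoS classes above can be derived from a Pod's containers. The sketch below assumes quantities are compared as plain strings (so "1Gi" and "1024Mi" would not match) and that a container with a limit but no request has the request defaulted to the limit. The input shape is a hypothetical simplification of the Pod spec.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' mappings."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits") or {}
        # Unset requests default to the limit for the same resource.
        requests = {**limits, **(c.get("requests") or {})}
        for resource in RESOURCES:
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"


print(qos_class([{"requests": {"cpu": "250m"}, "limits": {"cpu": "500m"}}]))  # Burstable
```

Note that Guaranteed requires both CPU and memory to be set on every container, so one sidecar without limits downgrades the whole Pod.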
- nodeSelector or nodeAffinity required mode: hard constraint. The Pod stays Pending if no node matches.
- nodeAffinity preferred mode: soft constraint. The scheduler tries to match but places the Pod elsewhere if needed.
- topologySpreadConstraints: more flexible and performant than pod anti-affinity.
- podAffinity (co-locate) or podAntiAffinity (spread): at scale, prefer topologySpreadConstraints.
- tolerations in the Pod spec: without a matching toleration, the Pod won't schedule on the tainted node.
Pod Networking: How Containers Talk to Each Other
Kubernetes networking has three fundamental requirements, enforced by the CNI (Container Network Interface) plugin:
- Every Pod gets its own IP address, unique across the cluster.
- Pods on any node can communicate with Pods on any other node without NAT.
- Agents on a node (kubelet, system daemons) can communicate with all Pods on that node.
These requirements are simple to state but complex to implement. The CNI plugin (Calico, Cilium, Flannel, AWS VPC CNI) is responsible for wiring this up. It allocates IP addresses from the node's Pod CIDR range, sets up network interfaces inside the Pod's network namespace, and configures routing rules so Pods can reach each other across nodes.
kube-proxy handles Service networking. It watches the API server for Service and Endpoint objects, then programs iptables rules (or IPVS rules) on each node. When a Pod connects to a Service's ClusterIP, the kernel's iptables rules intercept the connection and DNAT it to one of the backend Pod IPs. This is why Service IPs are virtual — they do not exist on any network interface.
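One way to picture the DNAT step is as a chain of rules evaluated in order, each matching with some probability. The sketch below assumes that model (kube-proxy's iptables mode is commonly described this way) and shows why the per-rule probabilities are 1/n, 1/(n-1), and so on down to 1. The function names are hypothetical.

```python
from fractions import Fraction


def rule_probabilities(backends):
    """Match probability for each rule in the chain, evaluated top to bottom."""
    return [Fraction(1, backends - i) for i in range(backends)]


def effective_share(probabilities):
    """Fraction of all connections that each rule ends up handling."""
    shares, remaining = [], Fraction(1)
    for p in probabilities:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


probs = rule_probabilities(3)
print([str(p) for p in probs])                   # ['1/3', '1/2', '1']
print([str(s) for s in effective_share(probs)])  # ['1/3', '1/3', '1/3']
```

The last rule always matches, so no connection to the ClusterIP falls through as long as at least one backend exists.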
- Pod IP works but Service IP fails: kube-proxy or iptables issue.
- Service IP works but DNS fails: CoreDNS issue.
- DNS works but external access fails: Ingress controller or cloud LB issue.
Kubernetes Storage: PersistentVolumes, Claims, and StorageClasses
Kubernetes storage decouples the Pod lifecycle from the data lifecycle. A Pod can be deleted and recreated, but its data persists if it uses a PersistentVolume (PV) and PersistentVolumeClaim (PVC). This is critical for stateful workloads like databases.
PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically by a StorageClass. It is a cluster resource, like a node. PVs have a capacity and access mode (ReadWriteOnce, ReadOnlyMany, ReadWriteMany).
PersistentVolumeClaim (PVC) is a request for storage by a user. It specifies size and access mode. Kubernetes binds a PVC to a PV that meets the requirements. If no matching PV exists, the PVC remains Pending — unless a StorageClass with a dynamic provisioner is referenced.
StorageClass defines a class of storage. It specifies the provisioner (e.g., kubernetes.io/aws-ebs), parameters (type, IOPS), and reclaim policy. When a PVC requests a StorageClass, the provisioner automatically creates a PV that satisfies the claim.
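The matching between a claim and a volume can be sketched as a filter over available PVs. The dictionary fields below are hypothetical simplifications of the real objects, and choosing the smallest volume that fits is an assumption made for this example.

```python
def find_pv(claim, volumes):
    """Return the name of a PV that satisfies the claim, or None (PVC stays Pending)."""
    candidates = [
        pv for pv in volumes
        if pv["phase"] == "Available"
        and pv["storage_class"] == claim["storage_class"]
        and pv["capacity_gi"] >= claim["request_gi"]
        and set(claim["access_modes"]) <= set(pv["access_modes"])
    ]
    if not candidates:
        return None
    # Assumption: prefer the smallest volume that fits, to limit wasted capacity.
    return min(candidates, key=lambda pv: pv["capacity_gi"])["name"]


volumes = [
    {"name": "pv-small", "phase": "Available", "storage_class": "fast",
     "capacity_gi": 10, "access_modes": ["ReadWriteOnce"]},
    {"name": "pv-large", "phase": "Available", "storage_class": "fast",
     "capacity_gi": 100, "access_modes": ["ReadWriteOnce"]},
    {"name": "pv-bound", "phase": "Bound", "storage_class": "fast",
     "capacity_gi": 20, "access_modes": ["ReadWriteOnce"]},
]
claim = {"storage_class": "fast", "request_gi": 15, "access_modes": ["ReadWriteOnce"]}
print(find_pv(claim, volumes))  # pv-large
```

A `None` result corresponds to the Pending case described above, where only a dynamic provisioner could still satisfy the claim.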
persistentVolumeReclaimPolicy:
- Retain: PV remains but is in Released state — you must manually reclaim it.
- Delete: PV and underlying storage are deleted. This is the default for dynamic provisioners.
- Recycle: Deprecated. Attempts to scrub and re-use.
Production gotcha: If you delete a PVC with a Delete reclaim policy without first taking a snapshot, you lose all data. Always set Retain for critical databases, or use a backup solution.
- volumeClaimTemplates to automatically generate unique PVCs per replica.
- emptyDir volume. Data is lost when the Pod is deleted, which is expected.
- ReadWriteOnce access mode.
- ReadWriteMany access mode. Not all provisioners support it; consider NFS, EFS, or GlusterFS.
- ReadWriteMany for databases.
Namespaces, Resource Quotas, and Multi-Tenancy
Namespaces are virtual clusters within a physical cluster. They provide isolation boundaries for resources, RBAC, and network policies. Every resource lives in a namespace — except cluster-scoped resources like Nodes and PersistentVolumes.
ResourceQuota limits aggregate resource consumption within a namespace. You can set quotas on CPU, memory, Pod count, PVC storage, and even the number of Services. Without quotas, a single misconfigured application can consume all cluster resources and starve others.
LimitRange sets default requests/limits and min/max constraints for Pods in a namespace. This prevents a Pod from requesting an absurd amount of resources or running without any limits.
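The quota check can be pictured as simple addition at admission time. This is a minimal sketch with hypothetical resource keys (`cpu_m`, `memory_mi`, `pods`), not the real admission controller.

```python
def quota_violations(pod_requests, used, hard):
    """Return the resources whose hard limit the new Pod would exceed."""
    violations = []
    for resource, limit in hard.items():
        total = used.get(resource, 0) + pod_requests.get(resource, 0)
        if total > limit:
            violations.append(resource)
    return violations  # an empty list means the Pod is admitted


hard = {"cpu_m": 4000, "memory_mi": 8192, "pods": 10}
used = {"cpu_m": 3500, "memory_mi": 4096, "pods": 9}
print(quota_violations({"cpu_m": 1000, "memory_mi": 512, "pods": 1}, used, hard))  # ['cpu_m']
```

A rejection like this surfaces as a failed create on the ReplicaSet rather than as a Pending Pod, which is why quota exhaustion is easy to misread as a scheduling problem.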
Multi-tenancy with Namespaces is common: each team gets its own namespace, with RBAC restricting cross-namespace access. But true multi-tenancy (running untrusted workloads) requires additional isolation — consider virtual clusters (vClusters) or sandbox containers (gVisor, Kata Containers).
kubectl top and Prometheus alerts to catch quota exhaustion before it causes deployment failures.
Kubernetes vs Docker Compose: When to Use Each
Docker Compose and Kubernetes both orchestrate containers, but they serve fundamentally different use cases. Docker Compose is a single-host orchestration tool designed for development environments and small deployments. Kubernetes is a multi-host, production-grade orchestration system with automated healing, scaling, and rolling updates.
| Feature | Docker Compose | Kubernetes |
|---|---|---|
| Scope | Single host | Multi-node cluster |
| Scaling | Manual (docker-compose up --scale) | Automatic (Horizontal Pod Autoscaler, Cluster Autoscaler) |
| Self-healing | Limited (restart policies on a single host; no rescheduling) | Automatic (controllers restart/recreate Pods) |
| Rolling updates | Basic (stop all, start new) | Configurable (maxSurge, maxUnavailable, Canary, Blue-Green) |
| Networking | Flat network with links | Service abstraction with DNS, kube-proxy, CNI |
| Storage | Named volumes on single host | PV/PVC with dynamic provisioning across nodes |
| Secrets management | Plain text env files | Secrets (base64, encryption at rest), external CSI drivers |
| Configuration | Individual YAML per service | Declarative API with multiple resource types |
| Learning curve | Low | High |
Choose Docker Compose when: you are developing locally, running CI integration tests, or deploying a simple application on a single VM where orchestration overhead is not justified.
Choose Kubernetes when: you need high availability, rolling updates, auto-scaling, multi-node deployment, or run multiple microservices that need advanced networking (service discovery, load balancing, network policies). Many teams start with Docker Compose for development and then write Kubernetes manifests for production — maintaining both can be an overhead, but tools like Kompose can convert Compose files to Kubernetes YAML.
kubectl apply --dry-run=server.
The Evolution of Deployment: Why We Stopped Trusting Bare Metal
You don't understand Kubernetes until you understand the deployment hell it replaced. Before containers, you had two choices: dump a JAR on a physical server and pray nothing else touched the port, or waste 40% of your budget on VM overhead because each app needed its own OS instance.
Virtualization fixed the hardware waste but introduced its own cancer — golden images that rotted over time, configuration drift that turned production into a snowflake zoo, and boot times measured in coffee breaks. Then came containers. Docker gave you repeatable build artifacts and second-level startup. But now you had 50 containers on a single VM and no sane way to manage them.
That's the gap Kubernetes fills. Not as a container manager — as a control system. It takes your container images and applies a desired state loop. You say "I want 3 replicas of payment-api behind a stable DNS name." Kubernetes makes it true, then keeps it true. No SSH, no manual restart, no "it works on my machine." The whole industry pivoted from pet servers to cattle because manual operations don't scale past 5 microservices.
Why Kubernetes Stands Out — The Desired State Loop
Every other orchestrator tells you how to start processes. Kubernetes tells you how the system should look and makes reality match the spec. This is the single most important concept to internalize.
When you write a Deployment, you declare: "3 replicas, port 8080, liveness probe hitting /healthz." The API server stores that intent in etcd. Controllers and the kubelet on each node then watch the API server and keep asking: "Does what is running match the spec? No? Fix it." This isn't a one-time deploy. It's a reconciliation loop that keeps running until you delete the resource.
Why does this matter? Because production never stays still. A node crashes: the controller sees 2 replicas instead of 3 and spawns a replacement. A container exceeds its memory limit: the kernel OOM-kills it and the kubelet restarts it according to the Pod's restartPolicy. Traffic spikes: your HorizontalPodAutoscaler reads the metrics and tells the Deployment to scale to 10. No human touching a terminal at 3 AM.
The magic isn't the containers. It's the control theory applied to distributed systems. You describe the steady state. Kubernetes enforces it. Period.
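The autoscaling step mentioned above follows a simple ratio. The sketch below uses the documented HorizontalPodAutoscaler formula, desired = ceil(current replicas × current metric / target metric), and deliberately ignores the tolerance band and stabilization window. The function name and default bounds are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale proportionally to how far the metric is from its target, then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(3, 90, 50))   # 6
print(desired_replicas(3, 200, 50))  # 10 (clamped to max_replicas)
```

The autoscaler only changes the replica count on the Deployment; the Deployment and ReplicaSet controllers then reconcile the Pods as usual.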
Real-World Kubernetes: Where the Theory Dies
You've read the docs. You've deployed a pod. Now what? The real value of Kubernetes isn't container orchestration — it's the patterns that survive production. Three use cases define modern k8s: stateless web backends, event-driven batch jobs, and stateful data pipelines.
Stateless apps are the entry drug. Horizontal Pod Autoscaler + Deployment + Service = you can absorb traffic spikes without waking up at 3 AM. Batch jobs use Jobs and CronJobs to replace cron on bare metal — way easier to restart failed pods than re-ssh into a dead VM. Stateful stuff uses StatefulSets with PersistentVolumeClaims for databases like PostgreSQL or Kafka. You don't need to manage the database lifecycle in k8s — just give it stable storage and a stable network identity.
The trap? Thinking every app belongs in k8s. Latency-sensitive workloads, GPU training jobs that don't scale horizontally, or anything that needs raw hardware access — leave those on bare metal or spot instances. Kubernetes is not a Swiss Army knife. It's a hammer. Use it on nails.
Kubernetes Projects That Actually Teach You Something
Stop running nginx in a playground. Build something that breaks, then fix it. Three projects will teach you more than any certification: a multi-service web app with zero-downtime deploys, a CI/CD pipeline that runs entirely inside the cluster, and a GitOps setup that auto-remediates drift.
First project: Deploy a frontend + API + database. Use rolling updates with readiness probes. Simulate a bad deploy and watch the failing readiness probe stall the rollout while the old Pods keep serving, then roll back with kubectl rollout undo. You'll understand why livenessProbe and readinessProbe are not optional. Second: Run a Jenkins or Argo Workflows executor inside k8s with dynamic pod-per-build. Learn how PersistentVolumeClaims hold workspace data and how cluster autoscaling handles build spikes. Third: Set up Argo CD or Flux with a Git repo. Break the cluster state manually and watch it self-heal. That's the Desired State Loop in action, not theory.
These projects force you to hit real problems: pod eviction, OOMKilled containers, RBAC misconfigurations, and etcd latency when the control plane gets hammered. You'll stop treating k8s like magic and start treating it like a distributed system that demands respect.
What Is Kubernetes? The Bare Minimum You Need to Know
Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications. At its core, you manage a cluster: a set of machines called nodes. One or more nodes run the control plane; the rest are workers (data plane). You define your app's desired state (how many replicas, which image, what ports) in a YAML manifest. Kubernetes then keeps the cluster matching that state: healing failures, scaling with load, and rolling out updates. A Pod is the smallest unit: one or more containers sharing networking and storage. Deployments manage ReplicaSets. Services provide stable network endpoints. This declarative approach means you tell Kubernetes what you want, not how to achieve it. The system handles the rest, watching for drift and correcting it automatically.
Key Primitives You Must Understand
Beyond Pods, a handful of objects form Kubernetes' backbone. A Deployment manages Pod lifecycles: rolling updates, rollbacks, replica scaling. It creates a ReplicaSet that watches pod counts. A Service provides stable networking: Pods get ephemeral IPs, but a Service gives a fixed ClusterIP or LoadBalancer. Ingress routes external HTTP/S traffic to Services. ConfigMaps and Secrets decouple configuration from images. Volumes (PersistentVolumeClaims) persist data beyond Pod restarts. Namespaces isolate resources within a cluster. RBAC (Role-Based Access Control) locks down who can do what. These primitives layer on each other: Deployment → ReplicaSet → Pod → Container. Knowing which to use and when separates beginners from pros. Start with a Deployment + Service pair; that covers 80% of use cases.
The etcd Disk That Killed the Entire Cluster
etcdctl defrag).
4. Configure etcd auto-compaction (--auto-compaction-retention=8) to prevent unbounded data growth.
5. Monitor etcd member health with etcdctl endpoint health and etcdctl endpoint status.
- etcd is the single point of failure for the entire cluster. Its disk performance is the cluster's ceiling.
- Never run etcd on network-attached storage in production. Local SSDs are mandatory.
- API server timeouts are often etcd problems, not API server problems. Trace downward, not upward.
- etcd requires periodic defragmentation. Without it, space is freed but not reclaimed, leading to disk pressure.
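The disk-latency lesson above can be turned into a simple check on WAL fsync samples. This is a sketch under assumptions: the 10ms warning level follows etcd's commonly cited guidance for 99th-percentile fsync duration, the 100ms critical level mirrors the threshold this article uses, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fsync_verdict(samples_ms, warn_ms=10.0, critical_ms=100.0):
    """Classify disk health from fsync latency samples in milliseconds."""
    p99 = percentile(samples_ms, 99)
    if p99 >= critical_ms:
        return "critical"
    if p99 >= warn_ms:
        return "warning"
    return "ok"


print(fsync_verdict([2] * 98 + [800] * 2))  # critical
```

In practice the samples would come from a metric such as etcd_disk_wal_fsync_duration_seconds rather than a hand-built list.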
1. kubectl describe pod <name> and read the Events section.
2. Common causes: insufficient CPU/memory on any node (check kubectl describe nodes for Allocatable vs Allocated), PersistentVolumeClaim not bound, node affinity/taint mismatches, resource quotas exceeded.
3. If no events appear, the scheduler may be down: check kubectl get pods -n kube-system for kube-scheduler.
1. kubectl logs <pod> --previous to see the logs from the crashed container (current logs may be empty).
2. Common causes: missing environment variables, failed health checks, OOMKill (check kubectl describe pod for Last State), misconfigured entrypoint.
3. If OOMKill, increase memory limits or fix the memory leak. Check kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState}'.
1. kubectl get pods -n kube-system | grep calico (or flannel/weave).
2. Check if Pod CIDR ranges overlap between nodes: kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'.
3. Verify kube-proxy is running: kubectl get pods -n kube-system | grep kube-proxy.
4. Test from within a Pod: kubectl exec -it <pod> -- curl <service-ip>:<port>.
1. kubectl describe rs <new-rs-name>.
2. Look for Pods that are Pending or CrashLoopBackOff.
3. Check if the new image exists in the registry and if imagePullSecrets are configured.
4. If using rolling update with maxUnavailable=0 and the cluster has no spare capacity, new Pods cannot be scheduled.
5. Rollback: kubectl rollout undo deployment/<name>.
1. etcdctl endpoint health --write-out=table.
2. Check disk I/O on etcd nodes: iostat -x 1.
3. Check etcd database size: etcdctl endpoint status --write-out=table.
4. If disk is the bottleneck, migrate to local SSDs.
5. If database is large, run defragmentation: etcdctl defrag.
1. kubectl get pv.
2. Describe the PVC: kubectl describe pvc <name>. Common causes: no PV available with matching accessModes and storageClassName, or the StorageClass has no provisioner.
3. If using dynamic provisioning, verify the storage provisioner pod is running and hasn't hit a quota or permission error.
kubectl get pods -n kube-system | grep scheduler
kubectl describe nodes | grep -A 5 'Allocated resources'
Key takeaways
Common mistakes to avoid
7 patterns
Running etcd on network-attached storage
etcd_disk_wal_fsync_duration_seconds as a critical metric.
Setting resource limits without requests (or vice versa)
Using `latest` tag for container images
latest is mutable. Rollbacks are impossible because you cannot determine which latest was running at a given time.
latest in production. Use image digests (image: repo@sha256:abc123...) for maximum determinism.
No PodDisruptionBudgets on critical services
minAvailable: 1 (or percentage) for all production services. This ensures voluntary disruptions (drains) respect availability constraints.
Ignoring liveness probes that restart Pods unnecessarily
startupProbe for slow-starting containers. The liveness probe only activates after the startup probe succeeds. Set appropriate initialDelaySeconds and failureThreshold.
No RBAC restrictions
automountServiceAccountToken: false on Pods that don't need API access. Use NetworkPolicies to restrict Pod-to-Pod traffic.
Not setting StorageClass reclaim policy for critical data
reclaimPolicy: Retain. Set a backup policy and take regular snapshots.
Interview Questions on This Topic
Explain the Kubernetes reconciliation loop. How does it apply to a Deployment managing a ReplicaSet managing Pods?
Frequently Asked Questions