etcd Disk Latency — Kubernetes Architecture Failure
etcd disk latency from co-located workloads caused Raft leader election failures, crashing the API server.
20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.
- Kubernetes is a declarative control loop: you define desired state, the system reconciles toward it.
- Control Plane components: API Server, Scheduler, Controller Manager, and etcd — the single source of truth using Raft.
- Nodes run the kubelet, kube-proxy, and container runtime; they execute pod specs and report back.
- Performance insight: etcd's fsync latency dictates cluster responsiveness — keep it under 10ms or expect leader elections.
- Production insight: API Server is the only component that talks directly to etcd; all other components go through the API Server's watch API. If etcd is slow, the entire control plane slows.
- Biggest mistake: treating etcd as a generic KV store — it's a replicated log, not an OLTP database.
- Scaling insight: watch cache exhaustion causes re-list storms; increase --watch-cache-sizes for resource-heavy clusters.
- Scheduler nuance: the default --percentage-of-nodes-to-score can skip optimal nodes at scale; set it to 100% for latency-sensitive workloads.
Imagine a massive Amazon warehouse. There's a central manager's office (the Control Plane) that decides which workers (Nodes) pick which packages (containers), tracks every shelf location (etcd), and reschedules sick workers automatically. The workers don't think — they just follow orders from the office, report their status, and run their assigned tasks. Kubernetes is exactly that warehouse management system, but for software running on servers.
Kubernetes replaces bespoke deployment scripts and manual server management with a declarative control loop. You describe the desired state, and the cluster continuously reconciles reality toward it. The architecture is not a black box; it's a set of coordinated components with specific failure domains and performance characteristics. Understanding these internals is what separates engineers who debug Kubernetes from those who are confused by it.
This is not a getting-started guide. It is for engineers already running Kubernetes who need to understand the 'why' behind scheduler decisions, etcd consistency guarantees, and kubelet behavior under pressure. We will trace what happens, component by component, when you run kubectl apply -f deployment.yaml, and identify the production decisions that bite teams hardest. The single most overlooked truth: the API Server is the bottleneck, but etcd is the clock that drives it.
One more thing: don't assume HA means safe. Running three API Server replicas without understanding leader election or etcd quorum is a false sense of security. We'll cover exactly what breaks and how to catch it before your on-call phone rings.
What Kubernetes Architecture Actually Is
Kubernetes architecture is a distributed system designed to manage containerized workloads across a cluster of machines. The core mechanic is a declarative control loop: you specify the desired state (e.g., 3 replicas of a web server), and the control plane continuously reconciles the actual state to match it. This loop runs on a set of master components — API server, etcd, scheduler, controller manager — while worker nodes run kubelet, kube-proxy, and a container runtime.
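To make the control loop concrete, here is a minimal sketch of one reconciliation pass. It is a toy model, not Kubernetes code: the `Cluster` class, the `reconcile` function, and the pod naming scheme are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for actual cluster state (hypothetical, not a real API)."""
    running: list = field(default_factory=list)
    next_id: int = 0


def reconcile(desired_replicas: int, cluster: Cluster, name: str = "web") -> list:
    """Run one pass of the loop and return the actions it took."""
    actions = []
    # Too few replicas: create pods until actual state matches desired state.
    while len(cluster.running) < desired_replicas:
        pod = f"{name}-{cluster.next_id}"
        cluster.next_id += 1
        cluster.running.append(pod)
        actions.append(f"create {pod}")
    # Too many replicas: delete the newest pods first.
    while len(cluster.running) > desired_replicas:
        pod = cluster.running.pop()
        actions.append(f"delete {pod}")
    return actions


cluster = Cluster()
print(reconcile(3, cluster))      # initial rollout
cluster.running.remove("web-1")   # simulate a crashed pod
print(reconcile(3, cluster))      # the loop heals the gap
```

The point of the sketch is that the loop never records "what changed"; it only compares desired and actual state on every pass, which is why a crashed pod is replaced without anyone asking for it.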
In practice, etcd is the single source of truth for all cluster state. Every API request, every pod scheduling decision, every config map update goes through etcd. If etcd slows down, the entire control plane stalls. The scheduler can't assign pods, the API server times out, and node heartbeats fail. A 100ms write latency in etcd can cascade into minutes of cluster unavailability.
You use Kubernetes when you need automated deployment, scaling, and healing of applications across multiple hosts. It matters because it abstracts away individual machines and provides a uniform API for operations. But the architecture's central dependency on etcd means that disk I/O performance on the etcd nodes directly determines cluster reliability — a fact many teams discover only after a production outage.
Control Plane: The Cluster's Brain
The Control Plane makes global decisions (e.g., scheduling) and detects and responds to cluster events. It consists of the API Server, etcd, Scheduler, and Controller Manager. In production, it is almost always replicated across multiple nodes for high availability.
The API Server (kube-apiserver) is the front end—all communication, whether from kubectl, controllers, or nodes, goes through it. It validates and persists resources to etcd, and exposes a watch API that components use to detect changes. The Scheduler (kube-scheduler) watches for unscheduled Pods and assigns them to nodes. The Controller Manager (kube-controller-manager) bundles multiple controllers: Node Controller, ReplicaSet Controller, Endpoint Controller, etc. Each runs as a separate loop but shares the same binary.
One detail that bites teams: the API Server's --max-mutating-requests-inflight and --max-requests-inflight defaults are 200 and 400 respectively. If you run a large cluster with aggressive automation, you'll hit this limit and calls start queuing. Monitor apiserver_request_total (the replacement for the deprecated apiserver_request_count) and apiserver_current_inflight_requests early, before your CI/CD pipeline starts timing out.
Another subtlety: leader election among controller manager replicas is handled via a Lease object in kube-system (older releases used an Endpoints or ConfigMap lock). If the leader dies unexpectedly, it takes about 15 seconds for a new leader to take over (governed by lease duration and renew deadline). During that window, the corresponding controller loop stops. For example, the node controller stops evicting pods from unreachable nodes, which can cause service disruption. Know your lease parameters.
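A rough way to reason about that failover window is sketched below. It assumes a 15-second lease duration and a 2-second retry period as defaults, and it ignores the jitter a real leader-election client adds; `takeover_window` is a hypothetical helper, not part of any Kubernetes library.

```python
def takeover_window(lease_duration: float = 15.0,
                    retry_period: float = 2.0,
                    since_last_renew: float = 0.0) -> tuple:
    """Return (earliest, latest) seconds after a leader crash until a standby takes over.

    since_last_renew is how long before the crash the leader last renewed its lease.
    """
    # The lease only expires lease_duration after the last successful renewal.
    remaining = max(lease_duration - since_last_renew, 0.0)
    # A standby polls every retry_period, so it may notice the expiry one period late.
    return (remaining, remaining + retry_period)


print(takeover_window())                      # crash right after a renewal
print(takeover_window(since_last_renew=2.0))  # crash just before the next renewal
```

Shortening the lease narrows the gap but makes the leader more likely to lose its lease during a brief API Server stall, so treat the numbers as a trade-off rather than a target.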
Another often overlooked component: the cloud-controller-manager. If you're on AWS, GCP, or Azure, this controller interacts with the cloud provider's API to manage load balancers, routes, and nodes. A misconfigured cloud-controller-manager can prevent nodes from joining the cluster even though the API Server is healthy. Always check its logs when nodes fail to register.
One more: the API Server's etcd client uses a watch cache. If writes exceed the cache capacity, watch requests get 'too old resource version' errors and clients must re-list. This can cascade into a thundering herd problem. Mitigate by increasing --watch-cache-sizes or reducing watch concurrency. In one incident, a misconfigured monitoring system created too many watches, causing all controllers to re-sync every few minutes, spiking API Server CPU to 100%.
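The 'too old resource version' failure is easier to picture with a toy model of a bounded event window. This is only a sketch of the idea, not the API Server's implementation; `WatchCache`, `record`, and `watch_from` are hypothetical names.

```python
from collections import deque


class TooOldResourceVersion(Exception):
    """Raised when a watch asks for history the cache no longer holds."""


class WatchCache:
    def __init__(self, capacity: int):
        # A bounded window: once full, the oldest event is dropped on every write.
        self.events = deque(maxlen=capacity)
        self.resource_version = 0

    def record(self, obj: str) -> int:
        self.resource_version += 1
        self.events.append((self.resource_version, obj))
        return self.resource_version

    def watch_from(self, since_rv: int) -> list:
        """Return events newer than since_rv, or fail if that history was dropped."""
        oldest = self.events[0][0] if self.events else self.resource_version + 1
        # The client needs event since_rv + 1; if it has been evicted, it must re-list.
        if since_rv + 1 < oldest:
            raise TooOldResourceVersion(f"need {since_rv + 1}, oldest is {oldest}")
        return [(rv, obj) for rv, obj in self.events if rv > since_rv]


cache = WatchCache(capacity=3)
for i in range(5):
    cache.record(f"pod-{i}")
print(cache.watch_from(2))  # still inside the window
```

A client that fell behind by more than the window size gets the exception and has to re-list everything, which is the re-list storm described above when many clients hit it at once.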
And here's a trap with admission webhooks: they run before the request reaches etcd. A slow webhook blocks the entire request pipeline. We've seen a single webhook that took 5 seconds to respond because it called an external service that was throttled. That 5-second delay was added to every write to that resource type. The fix was to add a circuit breaker and a timeout at the webhook level. Monitor apiserver_admission_webhook_admission_duration_seconds — if the 99th percentile exceeds 1 second, you have a problem.
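The circuit breaker mentioned above could look something like the following sketch. The class, its thresholds, and the fallback decision are assumptions for illustration; the per-call timeout itself is expected to live inside the function you pass in (for example, an HTTP client timeout).

```python
import time


class CircuitBreaker:
    """Stop calling a slow dependency after repeated failures (illustrative only)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open, skip the dependency entirely until the cool-down has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Half-open: allow one trial call; a single failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

What the fallback returns is a policy decision: rejecting keeps the guarantee the webhook was enforcing, allowing keeps writes flowing. Either way, the external service stops adding seconds to every request while it is throttled.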
Component Interaction Flow: From kubectl to Running Pod
When you run kubectl apply -f deployment.yaml, a chain of events propagates through the control plane to the target node. Understanding this flow is essential for debugging latency and identifying where failures occur.
The sequence starts with kubectl sending a REST POST request to the API Server's /apis/apps/v1/namespaces/default/deployments endpoint. The API Server authenticates the request (via TLS certificates or bearer tokens), authorizes it against RBAC, then passes the object through a set of admission controllers (mutating and validating webhooks). If all passes, the API Server persists the Deployment object into etcd using a Raft write.
Once the write is committed, the API Server's watch mechanism notifies the Deployment controller (part of kube-controller-manager). The Deployment controller sees the new object and creates a ReplicaSet. This in turn triggers the ReplicaSet controller to create a Pod object. The Scheduler, watching for unscheduled Pods, picks a suitable node and updates the Pod with the node binding.
The API Server persists the binding and notifies the target node's kubelet via its watch. The kubelet receives the Pod spec and begins execution: it pulls the container image (if not cached), starts the container via the CRI, mounts volumes, configures networking via CNI, and runs startup/liveness probes. Once the pod reports Ready and appears in the Service's EndpointSlices, kube-proxy updates iptables rules to route Service traffic to the new pod.
The entire process typically takes 2–10 seconds for a small deployment, but can stretch to minutes with large images or slow webhooks. Each step introduces latency variables: admission webhook latency, etcd write latency, scheduler queue delay, image pull time, and CNI configuration time.
A common production pitfall: a slow admission webhook (e.g., 2 seconds per request) adds 2 seconds to every resource creation. If you create 100 pods during a rollout, that's 200 seconds of cumulative webhook time, and close to 200 seconds of added wall-clock delay when the creations are serialized. Monitor apiserver_admission_webhook_admission_duration_seconds to catch this.
The diagram below visualizes the interaction sequence between components using a simplified mermaid sequence diagram.
kubectl proxy and inspect the stage timestamps. You can correlate admission webhook durations, etcd round trips, and response times to pinpoint the slowest step in the pipeline.
Nodes: The Worker Machines
A Node is a worker machine (VM or physical) where containers are run. Each node contains the services necessary to run Pods: the kubelet, the container runtime (e.g., containerd), and the kube-proxy.
The kubelet is the node's primary agent. It receives PodSpecs from the API Server (for pods assigned to its node) and ensures the described containers are running. It does this by talking to the container runtime via the Container Runtime Interface (CRI). The kubelet also runs liveness and readiness probes, mounts volumes, and reports node conditions like DiskPressure, MemoryPressure, and PIDPressure to the API Server.
Kube-proxy (runs as a DaemonSet) maintains network rules on the node. It watches Services and EndpointSlices and updates iptables or IPVS rules so traffic to a Service's ClusterIP is load‑balanced to the actual pods.
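Here's what most people miss: image pulls have a progress deadline. On Docker-based nodes, the kubelet's --image-pull-progress-deadline defaulted to 1 minute; that flag was specific to dockershim and went away with it, and on containerd the equivalent timeout lives in the runtime's CRI configuration. If your image is large or the registry is slow, a pull that makes no progress within the deadline is cancelled and retried, creating a cycle that leaves pods in ImagePullBackOff. Raise the deadline or use image streaming in production.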
Also, the kubelet's eviction logic uses soft and hard eviction thresholds. Hard eviction triggers immediate pod killing when exceeded, while soft eviction has a grace period. By default, evictionHard.memory.available is 100MiB — that's practically zero. Set it to 10% of node memory for predictable behavior.
One more thing: the kubelet's node status updates are sent to the API Server periodically (default 10 seconds). If the API Server is under high load or network is congested, the node may appear NotReady even though it's healthy. This is called 'node flapping' and is often a symptom of control plane load rather than node failure. Tune the node-status-update-frequency and node-monitor-grace-period accordingly.
Another detail: the kubelet's --max-pods default is 110, but that's a hard count, not a resource limit. A node may have free CPU/memory but hit this limit. In clusters running many sidecars, you can exhaust the pod slot quickly. Use --max-pods or the scheduling plugin to enforce a more appropriate cap based on your workload density.
And let's talk about system reserved resources. If you don't configure --system-reserved and --kube-reserved, the kubelet assumes all node resources are available for pods. But the operating system and the kubelet itself consume some. Without proper reservations, pods can steal resources from system daemons, leading to SSH timeouts or node instability. Always set --system-reserved=cpu=500m,memory=1Gi (adjust per node size) and enable eviction thresholds.
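The effect of reservations and eviction thresholds on what pods can actually use is simple arithmetic: allocatable equals capacity minus the reservations minus the hard eviction threshold. The sketch below works in MiB; the function name and the example node size are assumptions for illustration.

```python
def allocatable_memory_mib(capacity_mib: int,
                           system_reserved_mib: int = 0,
                           kube_reserved_mib: int = 0,
                           eviction_hard_mib: int = 100) -> int:
    """Memory left for pods after reservations and the hard eviction threshold."""
    allocatable = capacity_mib - system_reserved_mib - kube_reserved_mib - eviction_hard_mib
    return max(allocatable, 0)


node_mib = 16 * 1024  # a hypothetical 16 GiB node

# 1 GiB system reservation with the default 100 MiB hard eviction threshold.
print(allocatable_memory_mib(node_mib, system_reserved_mib=1024))

# Same node with the hard eviction threshold raised to 10% of node memory.
print(allocatable_memory_mib(node_mib, system_reserved_mib=1024,
                             eviction_hard_mib=int(node_mib * 0.10)))
```

Running the numbers before you change kubelet flags shows how much schedulable memory each reservation costs, which matters most on small nodes.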
- Runs liveness and readiness probes.
- Mounts volumes specified in the PodSpec.
- Reports node conditions (MemoryPressure, DiskPressure).
- Manages cgroups for resource isolation.
--max-pods (default 110) is a hard cap: new pods won't schedule even if CPU/RAM is free.
journalctl and crictl, not just kubectl.
Worker Node Component Reference Table
The following table provides a quick reference for the core components running on every worker node. This is useful when triaging node-level issues or validating node configurations during cluster upgrades.
| Component | Description | Default Port | Log Location | Common Failure Modes |
|---|---|---|---|---|
| kubelet | Primary node agent; manages pods and reports node status | 10250 (kubelet API), 10255 (read-only) | journalctl -u kubelet | OOM due to missing limits, disk pressure, stalled Docker socket (legacy) |
| kube-proxy | Network proxy; maintains iptables/IPVS rules for Services | 10249 (metrics) | journalctl -u kube-proxy | iptables corruption (large clusters), IPVS mode fallback |
| containerd | Container runtime (default) | 10010 (CRI) | journalctl -u containerd, crictl logs | Image pull timeout, storage driver issues (overlay2), dead containerd socket |
| CRI-O | Alternative container runtime (Red Hat) | 10010 (CRI) | journalctl -u crio | Image pull timeout, conmon OOM, conmon vs runc mismatch |
| Calico (cni-calico) | CNI plugin providing network policies and BGP routing | 9099 (Felix health), 9091 (Felix metrics, when enabled) | /var/log/calico/cni/ | IPAM exhaustion, BGP peer failure, policy misconfiguration |
| Flannel (cni-flannel) | CNI plugin for simple overlay networking | – | journalctl -u flanneld | VXLAN MTU mismatch, subnet lease conflicts, no network policy support |
| Cilium (cni-cilium) | eBPF-based CNI with advanced observability | 9090 (cilium-agent metrics), 9961 (hubble) | cilium status or hubble observe | Kernel version < 5.10, eBPF feature gaps, conflicting NetworkPolicies |
Key insight: keep the container runtime the same across all nodes. Mixed runtimes (containerd on some, CRI-O on others) work but introduce subtle differences in CRI implementation, so always test in a staging cluster before rolling out.
Another common pitfall: the kubelet's --container-runtime-endpoint defaults to /run/containerd/containerd.sock. If containerd is restarted and the socket disappears temporarily, kubelet will fail to start new pods. Use a systemd socket activation or a health check that waits for the socket to appear before starting the kubelet.
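Also note that kube-proxy in IPVS mode (set --proxy-mode=ipvs) scales better for large clusters but requires the IPVS kernel modules (ip_vs and its schedulers, such as ip_vs_rr) to be loaded on the node; ipvsadm is only the userspace tool for inspecting the resulting rules. Without the modules, kube-proxy has historically fallen back to iptables mode, which can be a surprise if you've tuned for IPVS performance.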
- Pin kubelet, kube-proxy, and container runtime versions to cluster release.
- Pre-pull images used by core components (CNI, kube-proxy) during node bootstrapping.
- Validate node components weekly against a baseline – any mismatch should trigger an alert.
etcd: The Cluster's Source of Truth
etcd is a distributed, consistent key-value store that powers Kubernetes. It uses the Raft consensus algorithm to achieve strong consistency across a cluster of members (typically 3 or 5). All cluster state—Pods, ConfigMaps, Secrets, Deployments, RBAC policy—is stored in etcd. The API Server is the only component that writes to etcd; all other components watch the API Server for changes.
The key insight: etcd is a replicated log, not a traditional database. Every write is appended to a log and only committed when a majority of members (quorum) acknowledge it. This design ensures strong consistency but makes performance heavily dependent on disk I/O latency for fsync operations. If one etcd member's disk is slow, the entire cluster's write throughput suffers.
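The quorum rule is worth working through once, because it explains why even member counts buy nothing. A minimal sketch (the function names are mine, not etcd's):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster can still commit writes."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(n, quorum(n), fault_tolerance(n))
```

Going from 3 to 4 members raises the quorum without raising the number of failures you can survive, while every write now waits on a larger majority of disks.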
Another thing: etcd's default --snapshot-count is 100,000. After that many changes, etcd takes a snapshot and compacts the log. On slow disks, this snapshot can spike latency and cause temporary leader election issues. Tune this value down (e.g., 50000) on clusters with frequent writes.
Less obvious: etcd's database file grows even after compaction because old data is freed but not returned to the OS. You must run etcdctl defrag periodically to reclaim space. Skipping this leads to quota limit errors (mvcc: database space exceeded) that lock the cluster.
Also, consider the impact of network latency between etcd members. Raft heartbeats are sent every 100ms by default. If round-trip time exceeds 50ms, you risk false leader elections. In multi-datacenter setups, place etcd members close together or raise --heartbeat-interval and --election-timeout to account for latency.
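As a starting point for that tuning, the sketch below derives both values from a measured round-trip time. The multipliers are my reading of common etcd tuning guidance (heartbeat near the upper end of 0.5 to 1.5 times RTT, election timeout ten times the heartbeat, capped at 50 seconds); treat them as assumptions and validate against your own latency measurements.

```python
def suggest_etcd_timing(rtt_ms: float) -> dict:
    """Suggest --heartbeat-interval and --election-timeout (both in milliseconds)."""
    # Never go below the 100 ms default heartbeat; scale up for slow links.
    heartbeat = max(100, round(rtt_ms * 1.5))
    # Keep the election timeout proportional to the heartbeat, within etcd's upper bound.
    election = min(max(1000, 10 * heartbeat), 50000)
    return {"heartbeat_interval_ms": heartbeat, "election_timeout_ms": election}


print(suggest_etcd_timing(10))   # same-datacenter members keep the defaults
print(suggest_etcd_timing(100))  # cross-region members need longer timers
```

Longer timers make the cluster slower to notice a genuinely dead leader, so raise them only as far as the measured latency requires.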
A production scenario: during a large-scale cluster upgrade, etcd writes spike as many resources are updated simultaneously. If you haven't tuned --snapshot-count or --quota-backend-bytes, the WAL compaction can cause fsync storms. One team saw a 5-second write latency every 2 minutes during an upgrade, causing repeated leader elections. The fix: increase --quota-backend-bytes to 16GB and set --auto-compaction-retention=1h to avoid bursts.
And a quieter risk: clock problems. Raft election timeouts are measured on each member's local monotonic clock, so a small wall-clock offset does not by itself trigger elections. A clock that runs fast, or a VM pause that makes time jump, can still lead a follower to conclude the leader has timed out and start an election, and repeated elections can cost you quorum. etcd also logs a warning when the clock difference against a peer is too high. Use chrony or ntpd with reliable upstream time sources and monitor clock offset across etcd members.
iostat and network latency between etcd members. Also check clock skew. Increase heartbeat-interval and election-timeout proportionally.
The Scheduler: How Pods Are Assigned to Nodes
The Kubernetes scheduler (kube-scheduler) is responsible for assigning unscheduled Pods to appropriate Nodes. It does this via a two-phase pipeline: Filtering (Predicates) and Scoring (Priorities).
In the filtering phase, the scheduler selects nodes that meet the pod's hard constraints: resource requests, nodeSelector, node affinity, taints/tolerations, topology constraints (e.g., pod anti-affinity). Node conditions like DiskPressure or MemoryPressure also cause the node to be filtered out.
In the scoring phase, the scheduler ranks feasible nodes based on priority functions: spread pods across zones, minimize resource fragmentation (balanced allocation), cluster autoscaler preferences, and user-defined custom scores. The node with the highest score gets the pod.
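The two phases can be sketched as a filter followed by a ranking. This is a deliberately simplified model: the `Node` fields, the scoring formula, and the zone-spread bonus are assumptions for illustration and do not correspond to the real scheduler plugins.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_m: int      # free CPU in millicores
    free_mem_mib: int    # free memory in MiB
    zone: str
    tainted: bool = False


def schedule(pod_cpu_m: int, pod_mem_mib: int, nodes: list, zones_in_use=()):
    """Return the name of the chosen node, or None if the pod must stay Pending."""
    # Phase 1, filtering: drop nodes that violate hard constraints.
    feasible = [n for n in nodes
                if not n.tainted
                and n.free_cpu_m >= pod_cpu_m
                and n.free_mem_mib >= pod_mem_mib]
    if not feasible:
        return None

    # Phase 2, scoring: prefer headroom after placement and zones not used yet.
    def score(n: Node) -> float:
        cpu_left = (n.free_cpu_m - pod_cpu_m) / max(n.free_cpu_m, 1)
        mem_left = (n.free_mem_mib - pod_mem_mib) / max(n.free_mem_mib, 1)
        spread_bonus = 0.0 if n.zone in zones_in_use else 1.0
        return cpu_left + mem_left + spread_bonus

    return max(feasible, key=score).name


nodes = [
    Node("a", 500, 1024, "zone-a"),
    Node("b", 2000, 4096, "zone-b"),
    Node("c", 4000, 8192, "zone-a", tainted=True),
]
print(schedule(1000, 1024, nodes))
```

Note that a node with plenty of free capacity never gets scored if it fails a filter, which is why a Pending pod is almost always a filtering problem rather than a scoring one.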
If no node passes the filtering phase, the pod remains Pending. The scheduler emits events that can be inspected via kubectl describe pod or scheduler logs.
A lesser-known nuance: the scheduler does not re-evaluate decisions for already scheduled pods. If a node becomes overcommitted after scheduling, the pod stays there—it won't be rescheduled. That's the kubelet's job (eviction), not the scheduler's.
Another performance detail: the scheduler's --kube-api-qps defaults to 50. If you have hundreds of nodes and thousands of pods being created rapidly (e.g., during a scaling event), the scheduler may fall behind. Increase this value but watch API Server load.
Another subtlety: the scheduler uses a scheduler cache of node information to avoid hitting the API Server for every pod. But this cache can become stale if nodes update frequently. In extreme cases, the scheduler may try to place a pod on a node that no longer exists or has changed. This manifests as a pod stuck in 'PodScheduled' condition with the event 'node does not exist'. The scheduler eventually retries and the cache refreshes, but you can force this by restarting the scheduler pod.
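One more: the scheduler's --percentage-of-nodes-to-score (percentageOfNodesToScore in the scheduler configuration) is adaptive when left unset: roughly 50% for a 100-node cluster, shrinking as the cluster grows to about 10% at 5,000 nodes, with a floor of 5%. Clusters under 100 nodes always evaluate every node. This is a performance optimization: the scheduler stops looking once it has found enough feasible nodes and scores only that subset. But it can lead to suboptimal placements if that subset doesn't include the best node. In latency-sensitive deployments, consider setting it to 100% to always get the best score at the cost of scheduling latency.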
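To see how many nodes actually get considered at a given cluster size, here is a sketch of the adaptive calculation. The constants reflect my reading of the upstream scheduler logic (a 50% base that drops by one point per 125 nodes, a 5% floor, and a minimum of 100 feasible nodes); treat them as assumptions and check them against your Kubernetes version.

```python
MIN_FEASIBLE_NODES = 100   # below this cluster size, every node is considered
MIN_PERCENTAGE = 5         # floor for the adaptive percentage
BASE_PERCENTAGE = 50       # starting point for the adaptive percentage


def nodes_to_score(total_nodes: int, percentage: int = 0) -> int:
    """How many feasible nodes the scheduler looks for before it starts scoring.

    percentage=0 means "unset", which selects the adaptive default.
    """
    if total_nodes < MIN_FEASIBLE_NODES or percentage >= 100:
        return total_nodes
    if percentage <= 0:
        percentage = max(BASE_PERCENTAGE - total_nodes // 125, MIN_PERCENTAGE)
    count = total_nodes * percentage // 100
    return max(count, MIN_FEASIBLE_NODES)


for size in (50, 100, 1000, 5000):
    print(size, nodes_to_score(size))
```

On a 5,000-node cluster the adaptive default stops at a tenth of the fleet, so a pod with a strong preference for a particular node may never see it scored.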
And let's talk about pod priority and preemption. If you have pods with different priority classes, the scheduler can preempt lower-priority pods to make room for higher-priority ones. But preemption is not instant — it can take up to 30 seconds because the scheduler must gracefully evict the lower-priority pods and wait for the kubelet to terminate them. During that window, the high-priority pod remains Pending. If your application requires rapid recovery, design your priority classes with realistic timeout expectations.
scheduler_queue_incoming_pods metrics and adjust --kube-api-qps or tune scoring algorithms.
--percentage-of-nodes-to-score default can skip optimal nodes; set to 100% for latency-sensitive workloads.
kubectl describe pod for the reason.
--percentage-of-nodes-to-score to reduce scheduling latency.
kubectl describe pod for insufficient CPU/memory errors. Either increase cluster capacity or reduce requests.
Kubernetes Networking: CNI, Services, and kube-proxy
Kubernetes assumes a flat network where every Pod can communicate with every other Pod without NAT, across nodes. This is achieved through the Container Network Interface (CNI), a plugin-based layer that configures network interfaces and routes on each node.
Each pod gets its own IP address (IP-per-pod model). CNI plugins like Calico, Flannel, Weave, or Cilium set up virtual interfaces and routing rules to enable cross-node pod-to-pod communication. Services abstract pod IPs and provide stable virtual IPs (ClusterIP) for pod discovery. kube-proxy watches Services and EndpointSlices and programs iptables or IPVS rules to forward traffic to the correct pods.
DNS resolution: Kubernetes DNS (CoreDNS) serves A/AAAA records for Services. Pods can resolve service names to ClusterIPs, enabling simple service discovery.
One common misconfiguration: kube-proxy --cluster-cidr must match your pod CIDR range. If they differ, kube-proxy may program incorrect routing. Also, when using Calico with NetworkPolicy, remember how the defaults work: a pod that no policy selects allows all traffic, but as soon as any policy selects it, traffic of that policy's type (ingress or egress) that is not explicitly allowed is denied. The first policy you apply to a namespace can therefore cut off traffic you never meant to block.
Choosing the right CNI matters: Calico offers rich network policies and BGP-based routing, Flannel is simpler but lacks policy support, Cilium uses eBPF for high performance. Evaluate based on your scale and security requirements.
Also, when using Cilium with eBPF, the kube-proxy can be completely removed (kube-proxy replacement). This reduces iptables overhead and improves performance at scale. But it requires careful validation of network policies and service routing, as Cilium's implementation may differ from standard kube-proxy in edge cases like externalTrafficPolicy.
A common production pitfall: IP address management (IPAM) exhaustion. If your CNI allocates pod IPs from a fixed CIDR and you have many pods terminating and starting, the IP pool can fragment. Calico uses a block-based approach that mitigates this, but Flannel's default allocation can lead to rapid exhaustion. Monitor IP utilization metrics from your CNI plugin.
And don't forget about MTU issues. If your CNI's overlay network uses encapsulation (e.g., VXLAN with 50 bytes overhead), and your underlying network has a standard 1500 MTU, the effective MTU for pods is 1450. If your application sends large packets that require fragmentation, you may see degraded performance or timeouts. Set the MTU on your CNI config to account for encapsulation overhead, or use direct-routing mode (e.g., Calico with BGP) to avoid encapsulation.
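The MTU arithmetic is small enough to encode and check during node bootstrap. In the sketch below, the 50-byte VXLAN figure comes from the text above; the 20-byte IP-in-IP figure and the helper name are my additions, and both figures assume an IPv4 underlay.

```python
# Per-packet encapsulation overhead in bytes (IPv4 underlay assumed).
ENCAP_OVERHEAD_BYTES = {"none": 0, "ipip": 20, "vxlan": 50}


def pod_mtu(underlay_mtu: int, encapsulation: str) -> int:
    """Largest MTU pods can use without fragmenting on the underlay network."""
    return underlay_mtu - ENCAP_OVERHEAD_BYTES[encapsulation]


print(pod_mtu(1500, "vxlan"))  # standard Ethernet with a VXLAN overlay
print(pod_mtu(1500, "none"))   # direct routing keeps the full MTU
```

If the value your CNI is configured with is larger than what this returns for your underlay, large packets will fragment or be dropped, and the symptom usually looks like intermittent timeouts rather than a clean failure.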
CNI Plugins: Calico and Flannel in Kubernetes Architecture
The Container Network Interface (CNI) plugin is a critical architectural component that determines how pods communicate within and across nodes. Two of the most widely used CNI plugins are Calico and Flannel. Each implements the Kubernetes networking model differently, with distinct trade-offs in complexity, security, performance, and scale.
Calico uses a pure Layer 3 approach by default, routing pod traffic using BGP (Border Gateway Protocol) without needing overlays. This eliminates encapsulation overhead (no VXLAN/IPSEC) and allows pod-to-pod packets to be forwarded at near wire speed. Calico also implements rich network policies using iptables or eBPF, supporting granular ingress/egress rules, namespace isolation, and dynamic policy enforcement. It includes its own IPAM (IP Address Management) using block-based allocation, which reduces fragmentation.
Flannel uses an overlay network (most commonly VXLAN) to encapsulate pod traffic. It is simpler to deploy and requires no BGP infrastructure or complex routing configuration. Flannel provides a flat network where every pod gets a unique IP, but it does not support Kubernetes NetworkPolicy natively—meaning no firewall rules between pods unless you combine it with a separate policy engine like Calico or Cilium. Flannel's IPAM is simpler and can exhaust IPs faster under high pod churn.
When to choose which? For production clusters requiring network policies, multi-tenancy, and high throughput (e.g., financial services, large e-commerce), Calico is the recommended choice. For small dev/test clusters, or teams new to Kubernetes where simplicity is paramount, Flannel suffices. Many production deployments run Calico in BGP mode with eBPF acceleration for best performance.
Common issues with Calico: BGP peer misconfiguration causes routing failures; calico-node pod crash due to missing kernel modules (e.g., ip_tables, nf_conntrack); IPAM pool exhaustion leading to pods stuck in ContainerCreating. For Flannel: VXLAN MTU mismatch (set to 1450 if underlying network MTU is 1500); subnet lease conflicts when nodes are recreated quickly.
Performance consideration: Calico's iptables-based policy enforcement can become a bottleneck at scale (thousands of policies). Switch to eBPF mode (requires Linux kernel >= 5.10) for better throughput. Flannel's VXLAN incurs a 50-byte overhead per packet—increase MTU on the host interface to compensate.
The diagram below illustrates how a packet travels from one pod to another across nodes using a Calico BGP route versus a Flannel VXLAN tunnel.
API Server: The Gateway to the Cluster
The API Server is the frontend of the control plane and the only component that directly interacts with etcd. Every kubectl command, controller watch, and component communication goes through it. Understanding its request flow is critical for debugging and performance tuning.
The API Server authenticates the request (via client certificates, bearer tokens, or OIDC), authorizes it against RBAC policies, and then passes it through admission controllers (mutating and validating) before persisting to etcd. Admission controllers can modify or reject resources — this is where PodSecurity admission, resource quota enforcement, and custom webhooks run.
What most engineers miss: admission webhooks add latency to every API request. A slow webhook can increase API Server response time from milliseconds to seconds, impacting the entire cluster. Monitor apiserver_admission_webhook_admission_duration_seconds for outliers. Also, the API Server caches responses for watch requests, but the cache size is limited. Large clusters with many watches can cause cache thrashing and increased etcd read load.
Another overlooked metric: apiserver_request_duration_seconds with a high 99th percentile indicates either webhook latency or etcd slowness. Correlate with etcd metrics to pinpoint the bottleneck.
Another scenario: the API Server's watch cache can fall behind on very large clusters with frequent resource updates. When a watch request fails with 'too old resource version', the client must re-list all resources. This can cause cascading failures as controllers re-sync and generate additional load. Mitigate by increasing the kube-apiserver's --watch-cache-sizes for the affected resource types, and check that --watch-cache (a boolean, enabled by default) has not been switched off.
One more: the API Server's --max-requests-inflight and --max-mutating-requests-inflight defaults can be too low for clusters with many controllers or automation. If you see 429 Too Many Requests errors in logs, tune these up gradually while monitoring memory usage. Each inflight request consumes memory for the request context, so increasing them too aggressively can cause OOM.
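The in-flight limit behaves like a fixed pool of slots: a request either takes a slot immediately or is turned away. The following sketch models that with a semaphore. It is an illustration of the concept only, and the class and method names are hypothetical.

```python
import threading


class InflightLimiter:
    """Reject work with 429 once a fixed number of requests are in flight."""

    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def handle(self, fn):
        # Do not wait for a slot: a full server answers 429 right away.
        if not self._slots.acquire(blocking=False):
            return 429, None
        try:
            return 200, fn()
        finally:
            # Always free the slot, even if the handler raised.
            self._slots.release()


limiter = InflightLimiter(limit=1)
# The outer request holds the only slot, so the nested request is rejected.
print(limiter.handle(lambda: limiter.handle(lambda: "inner")))
print(limiter.handle(lambda: "ok"))
```

Raising the limit adds slots, but each occupied slot holds request state in memory, which is the trade-off described above.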
And here's something about watch timeouts: watch connections are long-lived by design. The API Server closes each watch after a randomized interval above --min-request-timeout (default 1800 seconds), and if a client (e.g., a controller) disappears without closing the connection, the server can keep the watch goroutine alive until that timeout or until it detects the dead connection. In clusters with many controllers, these orphaned watches can accumulate and consume significant memory. Lowering --min-request-timeout cleans up stale watches faster, at the cost of more frequent re-watches from healthy clients.
webhook.Server with timeouts and always set failurePolicy: Fail in production to avoid silent bypass.
--max-requests-inflight while monitoring memory.
--min-request-timeout to a reasonable value.
--max-mutating-requests-inflight and --max-requests-inflight or reduce concurrency of automated kubectl usage.
--watch-cache-sizes for the affected resource types. Consider using a higher resource version tolerance in client code.
CoreDNS and Service Discovery: The Cluster's Internal DNS
CoreDNS is the default DNS resolver for Kubernetes clusters. It runs as a deployment in the kube-system namespace and watches Services and EndpointSlices to provide name resolution for service names. Every pod is configured to use CoreDNS via /etc/resolv.conf — typically at the cluster IP of the CoreDNS service (like 10.96.0.10).
CoreDNS uses plugins to achieve its functionality. The kubernetes plugin handles Kubernetes service records. The forward plugin forwards external DNS queries to upstream resolvers. The loop plugin detects forwarding loops. The log plugin enables query logging for debugging.
A common misconfiguration: not setting resource limits for CoreDNS. Under heavy query load, CoreDNS can become a bottleneck. We've seen clusters where a single CoreDNS pod crashed due to memory limits, causing intermittent DNS failures across the cluster. Monitor CoreDNS's memory and CPU — set requests and limits based on cluster size.
Another subtlety: the ndots setting in the pod's DNS configuration. By default, /etc/resolv.conf sets ndots:5. This means that if a name has fewer than 5 dots, the resolver tries each cluster search domain before making the absolute query. A short name like my-service resolves on the first attempt (my-service.default.svc.cluster.local), which is good. An external name like api.example.com, however, is first tried as api.example.com.default.svc.cluster.local, api.example.com.svc.cluster.local, and api.example.com.cluster.local, and only then as itself, so every external lookup costs several wasted queries. For high-traffic applications that call external names, lower ndots (for example to 1) through dnsConfig in the pod spec, or use fully qualified names with a trailing dot.
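The resolver's search-list behavior can be sketched in a few lines, which makes it easy to count the queries a given name will generate. This models the usual glibc-style rules as I understand them; the function name and the example search domains are assumptions.

```python
def candidate_queries(name: str, search_domains: list, ndots: int = 5) -> list:
    """Return the fully qualified names a resolver would try, in order."""
    # A trailing dot marks the name as absolute: no search domains are applied.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    # With enough dots the name is tried as-is first; otherwise the search list goes first.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for query in candidate_queries("api.example.com", search):
    print(query)
```

With the default of 5, an external name with two dots is tried against all three search domains before the real lookup, so lowering ndots or adding the trailing dot removes three failed queries per resolution.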
Also, watch the forward plugin's upstream selection policy. The default policy is random, which spreads queries across upstreams. With policy sequential, queries always go to the first upstream and the others are tried only on failure, so if the first upstream is slow, all queries wait. Prefer random or round_robin for better load distribution.
In large clusters, consider deploying NodeLocal DNSCache. It runs a DaemonSet that caches DNS queries per node, reducing load on CoreDNS and improving resolution latency. This is almost essential for clusters with thousands of pods.
log plugin to the CoreDNS ConfigMap. The logs will show each query and response. Use with caution in production: log volume can be high. The command kubectl logs -n kube-system deployment/coredns --tail=10 shows recent queries.
coredns_dns_request_duration_seconds to catch degradation early.
kube-dns has endpoints. Check NetworkPolicy blocking DNS.
Kubelet Probes and Pod Lifecycle: What Happens When a Probe Fails
The kubelet executes three types of probes on containers: liveness, readiness, and startup. Each probe is a periodic check (HTTP GET, TCP socket, or command execution) that determines whether the container is alive, ready to serve traffic, or still starting up. The probe results directly affect the pod's lifecycle and the cluster's behavior.
Liveness probes determine if the container is running. If it fails, the kubelet restarts the container per the pod's restartPolicy. Readiness probes determine if the container is ready to serve traffic. If it fails, the pod is removed from Service endpoints. Startup probes delay the start of liveness and readiness probes until the container finishes initialization — critical for slow-starting applications like Java apps or legacy monoliths.
Here's the nuance most teams get wrong: liveness probe thresholds often cause cascading restarts during rolling updates. The default failureThreshold is 3 and the default periodSeconds is 10, so a container is restarted after roughly 30 seconds of consecutive failures. If the new version needs 40 seconds before it can respond, the kubelet restarts it before it ever becomes ready, and every restart hits the same wall. The fix: add a startup probe with a generous failureThreshold, or set initialDelaySeconds on the liveness probe.
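The probe arithmetic is worth checking before a rollout. This is a rough Python sketch under simplifying assumptions: it ignores timeoutSeconds and probe jitter, and the function names are made up for illustration. The parameter defaults mirror the Kubernetes probe defaults.

```python
def restart_deadline(initial_delay=0, period=10, failure_threshold=3):
    """Approximate seconds after container start before a never-passing
    probe exhausts its failure budget (upper-bound estimate)."""
    return initial_delay + period * failure_threshold


def survives_startup(startup_seconds, **probe):
    """True if the app can finish starting before the probe budget runs out."""
    return startup_seconds <= restart_deadline(**probe)


# Default liveness settings vs. an app that needs 40 seconds to respond.
print(restart_deadline())                            # budget with defaults
print(survives_startup(40))                          # defaults are too tight
print(survives_startup(40, failure_threshold=30))    # startup-probe style budget
```

A startup probe with failureThreshold 30 and periodSeconds 10 gives the container up to about 300 seconds to start, while the liveness probe can stay tight for the steady state.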
Another subtlety: when a readiness probe fails, the pod remains running but is removed from Service endpoints. This means traffic stops flowing, but the pod continues consuming CPU and memory. If many pods become unready, the remaining pods may be overloaded, causing a cascade of readiness failures. Always set resource limits to prevent unready pods from starving healthy ones.
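The readiness cascade is easy to underestimate, so here is a toy Python model of it. It assumes traffic is spread evenly across ready pods and that a pod fails readiness once its share exceeds a fixed capacity; the numbers and the function name are hypothetical.

```python
def ready_after_cascade(total_rps, pods, capacity_rps, initially_unready):
    """Count the pods still ready once overloaded pods stop passing readiness."""
    ready = pods - initially_unready
    while ready > 0 and total_rps / ready > capacity_rps:
        # One more overloaded pod fails readiness; its load shifts to the rest.
        ready -= 1
    return ready


# 10 pods, 1000 req/s total, each pod handles up to 120 req/s.
print(ready_after_cascade(1000, 10, 120, initially_unready=1))
print(ready_after_cascade(1000, 10, 120, initially_unready=2))
```

In this model, losing one pod is absorbed (nine pods at about 111 req/s each), but losing two pushes the rest over capacity and the whole Service drains. Headroom matters more than the average load suggests.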
Also, the kubelet doesn't kill containers for failing readiness probes. That's intentional: a readiness probe only controls traffic routing. If you treat readiness as a health signal in monitoring, you'll get false positives, because pods go unready for benign reasons such as startup or a brief dependency outage. Alert on readiness failures only when they persist past a high threshold.
One more: the kubelet records probe failures as events on the pod. You can see them with kubectl describe pod. But events are short-lived: the API server retains them for one hour by default (--event-ttl), and repeated events are aggregated and rate-limited, so when probes are failing rapidly the original cause may be gone by the time you look. For continuous monitoring, use kubelet metrics such as prober_probe_total and kubelet_pod_start_duration_seconds.
Without a startup probe, the usual workaround is a long initialDelaySeconds on the liveness probe. That's fragile. Always verify your cluster version before relying on startup probes.
Why You Can't Afford to Ignore Docker in Kubernetes
Docker isn't just a container runtime. It's the image format and build tooling behind most of what Kubernetes orchestrates. Every pod you run on Kubernetes is a group of containers started from OCI-compatible images, most often built with Docker. If you don't understand how Docker builds, layers, and caches images, you'll ship bloated containers that slow image pulls and fill node disks in production.
Docker images are read-only templates. Containers are runtime instances of those templates. The container runtime on each node pulls images from registries, mounts layers via a union filesystem, and isolates processes using cgroups and namespaces. The kubelet talks to that runtime through the Container Runtime Interface (CRI) to start, stop, and monitor containers. The runtime is typically containerd or CRI-O; since Kubernetes 1.24 removed dockershim, Docker Engine works only through the cri-dockerd adapter.
Here's the trap: Docker caches layers locally. If your image has a massive base OS layer that never changes, it sits on every node. Multiply that by 50 nodes. You're burning disk space for no reason. The fix? Use distroless or Alpine-based images. And always pin versions – latest is a production fire waiting to happen.
Don't run docker build --no-cache in CI unless you want every build to download all layers fresh. Cache busting is fine, but full rebuilds are slow and waste bandwidth. Use docker build --cache-from to reuse cached layers from a registry.
Docker vs Virtual Machines: The Performance Reality
Virtual machines emulate hardware. Each VM has its own kernel, runs a full OS, and uses hypervisors like KVM or Hyper-V to allocate resources. That overhead is massive – boot time in seconds, memory footprint in gigabytes, and CPU cycles lost to virtualization layers.
Docker containers share the host kernel. They start in milliseconds, consume megabytes of memory overhead, and don't emulate hardware. The tradeoff? Isolation. Containers use cgroups to limit CPU/memory and namespaces to isolate processes, but they still share the kernel. A kernel panic on the host takes down every container. With VMs, you get full isolation at a huge performance cost.
Here's the math for a typical microservice: a VM with 256MB RAM, 1 vCPU, and 10GB disk runs ~$30/month on a cloud provider. A container with the same resources adds negligible cost inside a Kubernetes cluster that already has spare capacity. But if you need hard isolation for untrusted workloads, stick with VMs. Containers aren't sandboxes: they're lightweight processes sharing one kernel.
How etcd Fails (and What You Must Do About It)
etcd is the single source of truth for your entire Kubernetes cluster. If it goes down, you can't schedule pods, update deployments, or even view cluster state. It's a distributed key-value store that uses the Raft consensus protocol to maintain consistency across nodes.
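Here's what happens when etcd fails silently: a member gets partitioned from the rest. Raft requires a majority of members to elect a leader and commit writes, so a single partitioned member is harmless as long as the others still form a majority. But if a partition or a run of node failures leaves no group with a majority (a 3-member cluster with two members down, or a 4-member cluster split 2-2), no leader can be elected and the cluster stops accepting writes. Your kubectl apply commands hang until they time out.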
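The majority rule is simple enough to compute by hand, but a few lines of Python make the consequences for cluster sizing obvious. This is a sketch of the arithmetic only, not of etcd's implementation; the function names are illustrative.

```python
def quorum(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def can_elect_leader(members, reachable):
    """True if a group of `reachable` members can still elect a leader."""
    return reachable >= quorum(members)


def failures_tolerated(members):
    """How many members can be lost while the cluster keeps a quorum."""
    return members - quorum(members)


for size in (3, 4, 5):
    print(size, quorum(size), failures_tolerated(size))

print(can_elect_leader(4, 2))  # a 2-2 split of a 4-member cluster
```

A 4-member cluster tolerates the same single failure as a 3-member one, which is why etcd clusters are sized with odd member counts.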
The fix? Run etcd on dedicated nodes with dedicated disks; SSDs are mandatory. Disable swap on those nodes. etcd is latency-sensitive: one slow disk can wreck cluster stability for everyone. Set --quota-backend-bytes to something reasonable for your scale: the default is 2GB, but bump it to 8GB, the suggested maximum, for production clusters. And always take etcd snapshots and ship them to object storage every hour. Always. Test those restores. I've seen teams lose entire clusters because they thought file-level backups of etcd were enough.
.dockerignore: The One Line That Shrinks Image Builds
Every Docker build sends the entire context directory to the daemon. Without a .dockerignore, that context includes node_modules, .git, and build artifacts, which slows CI/CD and, with a broad COPY . instruction, bloats the image. This is especially brutal in Kubernetes because every image pull wastes bandwidth and increases pod startup latency. The fix: a single file. Add patterns for node_modules, .env, .git, dist, and any local cache. The Docker client filters the build context before sending it, so only essential files reach the Docker engine. Result: faster builds, smaller images, lower registry costs. Teams that skip .dockerignore routinely ship 500MB+ images that are 90% unnecessary. In Kubernetes, where pods restart and reschedule frequently, this drag shows up as slower rollouts and scale-ups. The .dockerignore file is not optional: it is the cheapest optimization you will ever make.
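To see what the ignore patterns buy you, here is a small Python approximation of the context filtering. It is not Docker's matcher: it uses fnmatch, whose wildcards also cross directory separators, and it skips features such as ! exceptions and ** patterns. The file list and function names are hypothetical.

```python
from fnmatch import fnmatch


def is_ignored(path, patterns):
    """True if the path, or any parent directory of it, matches a pattern."""
    parts = path.split("/")
    prefixes = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any(fnmatch(prefix, pattern)
               for prefix in prefixes
               for pattern in patterns)


def filter_context(paths, patterns):
    """Keep only the paths that would still be sent as build context."""
    return [p for p in paths if not is_ignored(p, patterns)]


PATTERNS = ["node_modules", ".env", ".git", "dist"]
FILES = [
    "src/app.js",
    "package.json",
    "node_modules/left-pad/index.js",
    ".git/HEAD",
    ".env",
    "dist/bundle.js",
]

print(filter_context(FILES, PATTERNS))
```

Only src/app.js and package.json survive in this example. Running the same kind of check against a real repository listing is a quick way to spot large directories that still reach the build context.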
docker-compose: Local Workflow That Mirrors Kubernetes Semantics
Kubernetes is not a local development tool. docker-compose fills that gap by declaring multi-container applications in a single YAML file. Why it matters to Kubernetes architects: compose models services, networks, volumes, and environment variables—the same primitives you later map to Deployments, Services, and ConfigMaps. Use compose for iterative local testing before writing a Helm chart. The 'docker-compose up' command starts your entire stack with consistent networking and dependency order. When containers fail, you can inspect logs without kubectl. The trap: treating compose as a production orchestrator. It lacks self-healing, rolling updates, and cluster-wide scheduling. Use it to validate image tags, environment injection, and volume mounts before pushing to a registry. In practice, every serious Kubernetes project I've seen uses a compose file for the developer loop. It catches config drift early, saving hours of cluster debugging.
Objects: The Declarative Building Blocks of Kubernetes
Every resource in Kubernetes is an Object—a persistent entity that represents the desired state of your cluster. Pods, Deployments, Services, ConfigMaps, and Secrets are all Objects defined via YAML or JSON manifests. The declarative model means you describe what you want, and the control plane (API Server, Controller Manager, etc.) converges the actual state to match. For example, a Deployment Object specifies the number of replicas, the container image, and update strategy. The Deployment Controller then creates ReplicaSets and Pods automatically. Understanding Objects is fundamental because they enforce idempotency, auditability, and self-healing. Without mastering Objects, you'll treat Kubernetes like a scripting language—prone to drift and manual fixes. Every kubectl apply communicates with the API Server to store your Object definition in etcd, triggering reconciliation loops.
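The declarative model boils down to comparing desired state with observed state and emitting the difference. Here is a toy Python sketch of that idea; the real Deployment controller works through ReplicaSets and rolling updates, and the DeploymentSpec class and reconcile function are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class DeploymentSpec:
    image: str
    replicas: int


def reconcile(desired, running_images):
    """Return the actions that move the running pods toward the desired spec."""
    current = [img for img in running_images if img == desired.image]
    stale = [img for img in running_images if img != desired.image]

    # Pods on the wrong image are removed first.
    actions = [("delete", img) for img in stale]

    diff = desired.replicas - len(current)
    if diff > 0:
        actions += [("create", desired.image)] * diff
    elif diff < 0:
        actions += [("delete", desired.image)] * (-diff)
    return actions


spec = DeploymentSpec(image="web:1.2", replicas=3)
print(reconcile(spec, ["web:1.2", "web:1.1"]))   # drift: one stale pod, two missing
print(reconcile(spec, ["web:1.2"] * 3))          # already converged: no actions
```

The second call returns an empty list, which is the idempotency property in miniature: applying the same spec to a converged system changes nothing.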
Real-World Use Cases & Projects: Where Kubernetes Shines
Kubernetes excels in microservices orchestration, multi-cloud deployments, and machine learning pipelines. A real-world project is deploying a stateless web app with auto-scaling: combine a Deployment, HorizontalPodAutoscaler (HPA), and Service. In production, companies like Spotify use Kubernetes to run thousands of services with canary deployments and traffic splitting. Another use case is stateful workloads (databases, message queues) via StatefulSets and PersistentVolumeClaims, which is critical for e-commerce order systems. Edge computing projects run lightweight K3s on Raspberry Pi clusters for IoT data ingestion. For CI/CD, teams build GitOps pipelines with ArgoCD: every merge to main triggers a sync that updates the Deployment Object, and the pipeline rolls back if health checks fail. These patterns reduce downtime and deployment friction. Start with a simple microservices stack (Frontend + API + DB) to grasp the lifecycle before scaling to multi-cluster federation.
Cascading API Server Failure Due to etcd Disk Latency
kubectl operations returned timeout errors. Controller-manager logs showed failed lease renewals. New pods stuck in Pending.
fsync duration and disk IOPS.
4. Added disk latency alerts with p99 > 15ms triggering immediate escalation.
- etcd is the cluster's central nervous system. Its performance is non-negotiable.
- Disk latency, not network, is the most common cause of etcd instability.
- etcd must be isolated and its hardware provisioned for predictable, low-latency I/O.
- Always test your etcd restore procedure quarterly — a backup you've never restored is no backup at all.
kubectl get --raw=/readyz?verbose and etcdctl endpoint health --cluster.
kubectl describe node). Check for resource fragmentation or taints/tolerations mismatches. Also verify scheduler pod health.
kubectl describe pod). The container runtime (e.g., containerd) logs are critical here. Check journalctl -u kubelet and crictl ps -a.
systemctl status kubelet). Check disk pressure, memory pressure, and PID pressure using kubectl describe node. Look for eviction thresholds being hit.
nslookup kubernetes.default.svc.cluster.local from inside a pod.
etcdctl endpoint health --cluster
kubectl get --raw='/readyz?verbose'
iostat -x 1). Isolate etcd immediately. If API Server is slow, check apiserver_current_inflight_requests metric.
Key takeaways
Common mistakes to avoid
5 patterns
Using default eviction thresholds without customization
evictionHard.memory.available defaults to 100Mi, which is too low for predictable behavior.
Set evictionHard.memory.available: "10%" or a value like 500Mi in the kubelet configuration. Test with a load generator to verify thresholds trigger before system OOM.
Running etcd on shared nodes with other workloads
Tune --heartbeat-interval and --election-timeout for your network latency.
Not setting resource limits for CoreDNS
forward plugin's max_concurrent.
Applying a NetworkPolicy that denies all ingress without allowing DNS
Ignoring scheduler cache staleness
--node-informer-resync-period or using a more recent scheduler version that handles this better.
Interview Questions on This Topic
Explain how the Kubernetes scheduler assigns a pod to a node. What happens if no node passes filtering?
kubectl describe pod. The scheduler then scores the remaining nodes with priority functions (e.g., spread, balanced resource allocation) and assigns the pod to the highest-scored node. The scheduler does not re-evaluate once the pod is bound; the kubelet handles eviction if the node becomes overcommitted.
Frequently Asked Questions