etcd Disk Latency — Kubernetes Architecture Failure
etcd disk latency from co-located workloads caused Raft leader election failures, crashing the API server.
- Kubernetes is a declarative control loop: you define desired state, the system reconciles toward it.
- Control Plane components: API Server, Scheduler, Controller Manager, and etcd — the single source of truth using Raft.
- Nodes run the kubelet, kube-proxy, and container runtime; they execute pod specs and report back.
- Performance insight: etcd's fsync latency dictates cluster responsiveness — keep it under 10ms or expect leader elections.
- Production insight: API Server is the only component that talks directly to etcd; all others watch it. If etcd is slow, the entire control plane slows.
- Biggest mistake: treating etcd as a generic KV store — it's a replicated log, not an OLTP database.
- Scaling insight: watch cache exhaustion causes re-list storms; increase --watch-cache-sizes for resource-heavy clusters.
- Scheduler nuance: default --percentage-of-nodes-to-score can skip optimal nodes at scale; set to 100% for latency-sensitive workloads.
Imagine a massive Amazon warehouse. There's a central manager's office (the Control Plane) that decides which workers (Nodes) pick which packages (containers), tracks every shelf location (etcd), and reschedules sick workers automatically. The workers don't think — they just follow orders from the office, report their status, and run their assigned tasks. Kubernetes is exactly that warehouse management system, but for software running on servers.
Kubernetes replaces bespoke deployment scripts and manual server management with a declarative control loop. You describe the desired state, and the cluster continuously reconciles reality toward it. The architecture is not a black box; it's a set of coordinated components with specific failure domains and performance characteristics. Understanding these internals is what separates engineers who debug Kubernetes from those who are confused by it.
This is not a getting-started guide. It is for engineers already running Kubernetes who need to understand the 'why' behind scheduler decisions, etcd consistency guarantees, and kubelet behavior under pressure. We will trace what happens, component by component, when you run kubectl apply -f deployment.yaml, and identify the production decisions that bite teams hardest. The single most overlooked truth: the API Server is the bottleneck, but etcd is the clock that drives it.
One more thing: don't assume HA means safe. Running three API Server replicas without understanding leader election or etcd quorum is a false sense of security. We'll cover exactly what breaks and how to catch it before your on-call phone rings.
What Is Kubernetes Architecture?
Kubernetes architecture is the set of control plane and node components that cooperate to run workloads, and it is a core concept in DevOps. Rather than starting with a dry definition, let's see it in action and understand why it exists. The architecture's design directly enables its core promise: a self-healing, declarative system for running distributed applications.
At its heart, Kubernetes is a set of independent control loops running across a cluster of machines. Each controller watches the API Server for its specific resource type, compares the current state to the desired state, and takes action to reconcile them. This decoupling is what makes the system resilient—when the scheduler goes down, controllers still keep running replicas healthy. But it also introduces eventual consistency: after you apply a manifest, the cluster converges toward the desired state, not to a point-in-time snapshot.
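To make the control loop idea concrete, here is a minimal sketch of one reconcile pass for a replica count. The `ReplicaState` class and the `create-pod` / `delete-pod` action names are hypothetical and only illustrate the compare-then-act pattern; a real controller works against the API Server, not an in-memory object.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    desired: int  # replicas requested in the spec
    actual: int   # replicas currently observed


def reconcile(state: ReplicaState) -> list:
    """Return the actions one reconcile pass would take (hypothetical names)."""
    diff = state.desired - state.actual
    if diff > 0:
        return ["create-pod"] * diff
    if diff < 0:
        return ["delete-pod"] * (-diff)
    return []  # already converged, nothing to do


def run_until_converged(state: ReplicaState, max_passes: int = 10) -> int:
    """Apply reconcile passes until no action is needed; return passes used."""
    for passes in range(max_passes):
        actions = reconcile(state)
        if not actions:
            return passes
        # Assumption for the sketch: every action succeeds immediately.
        state.actual += actions.count("create-pod") - actions.count("delete-pod")
    return max_passes
```

The loop never stores a plan. It recomputes the difference on every pass, which is why a controller can crash and restart without losing track of what is left to do.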
Here's the part that catches teams off guard: controllers react to watch events rather than polling on a timer, so the delays you notice usually come from somewhere else. The periodic resync is only a safety net (kube-controller-manager's --min-resync-period defaults to 12 hours). When you fix a problem, say by deleting a stuck pod, the pod can sit in Terminating for up to its terminationGracePeriodSeconds (30 seconds by default) while its containers shut down. In production, these waits accumulate during rolling updates and can make deployments feel sluggish. Grace periods, readiness delays, and controller concurrency are levers many engineers ignore until they cost them a slow canary.
Custom controllers can override the default informer resync period. Reducing it pushes every cached object back through the reconcile handlers more often, which raises controller CPU and any API Server calls those handlers make. Always benchmark resync periods with real workloads.
One more nuance: the control loop model means you can't trust a single kubectl get call to show the truth. The cluster is always converging. When debugging, watch the events with kubectl get events --watch and inspect controller logs to see what's happening in real time.
Another subtlety: when you delete a namespace, the namespace controller doesn't delete resources synchronously — it issues deletion requests and waits. If a webhook is slow, namespace deletion can hang for minutes.
Here's a production reality: if you're running a controller manager with leader election (you are), the leader's sync loop is the only one doing work. If the leader pod gets OOMKilled, it takes ~15s for a new leader to take over. During that window, no controller reconciles. In a chaos experiment, we saw a 30-second gap where replicas dropped to zero because the replication controller didn't notice a node failure during that handoff. Lesson: monitor leader election metrics.
And let's talk about informer caches — every controller uses a local cache of objects from the API Server. If the cache becomes stale because the API Server dropped a watch (e.g., due to too old resource version), the controller must re-list all objects. That's expensive. In large clusters with thousands of Deployments, a re-list can spike API Server CPU and cause cascading timeouts. Mitigate by increasing the watch cache sizes or setting appropriate informer resync periods.
- You write a Deployment YAML (desired state).
- The Deployment controller creates ReplicaSets.
- The ReplicaSet controller ensures the right number of Pods exist.
- The Scheduler assigns Pods to Nodes.
- The Kubelet on each Node runs the containers.
Control Plane: The Cluster's Brain
The Control Plane makes global decisions (e.g., scheduling) and detects and responds to cluster events. It consists of the API Server, etcd, Scheduler, and Controller Manager. In production, it is almost always replicated across multiple nodes for high availability.
The API Server (kube-apiserver) is the front end—all communication, whether from kubectl, controllers, or nodes, goes through it. It validates and persists resources to etcd, and exposes a watch API that components use to detect changes. The Scheduler (kube-scheduler) watches for unscheduled Pods and assigns them to nodes. The Controller Manager (kube-controller-manager) bundles multiple controllers: Node Controller, ReplicaSet Controller, Endpoint Controller, etc. Each runs as a separate loop but shares the same binary.
One detail that bites teams: the API Server's --max-mutating-requests-inflight and --max-requests-inflight defaults are 200 and 400 respectively. If you run a large cluster with aggressive automation, you'll hit this limit and calls start queuing or failing with 429 responses. Monitor apiserver_request_total (apiserver_request_count on older releases) and apiserver_current_inflight_requests early, before your CI/CD pipeline starts timing out.
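The effect of an inflight cap can be sketched with a non-blocking semaphore. This is a simplified illustration, not how kube-apiserver is implemented: the `InflightLimiter` class is hypothetical, and it rejects immediately with 429 instead of queuing.

```python
import threading


class InflightLimiter:
    """Reject work once a fixed number of requests are already in flight."""

    def __init__(self, max_inflight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_inflight)

    def try_handle(self, handler) -> int:
        """Run handler and return 200, or return 429 if no slot is free."""
        if not self._slots.acquire(blocking=False):
            return 429
        try:
            handler()
            return 200
        finally:
            # Always free the slot, even if the handler raised.
            self._slots.release()
```

With a cap of 1, a request that arrives while another is still running gets 429; once the first finishes, the slot is free again. Clients that retry without backoff turn this into a feedback loop, which is why automation should honor 429 responses.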
Another subtlety: leader election among controller manager replicas is handled via a Lease object in kube-system (older releases used an Endpoints or ConfigMap lock). If the leader dies unexpectedly, it takes about 15 seconds for a new leader to take over (governed by lease duration and renew deadline). During that window, the corresponding controller loop stops. For example, the node controller stops evicting pods from unreachable nodes, which can cause service disruption. Know your lease parameters.
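A rough way to reason about the handoff gap, assuming the leader crashes right after renewing its lease. The function names are hypothetical, and the values in the usage note (15s lease duration, 10s renew deadline, 2s retry period) are assumed to match the kube-controller-manager defaults; confirm them against your own flags.

```python
def failover_window_seconds(lease_duration: float, retry_period: float) -> float:
    """Upper bound on the gap after a leader crashes right after renewing."""
    # A standby can only take over once the lease has expired, and it only
    # checks the lease every retry_period seconds.
    return lease_duration + retry_period


def validate_timings(lease_duration: float, renew_deadline: float, retry_period: float) -> None:
    """Simplified ordering check: retry_period < renew_deadline < lease_duration."""
    if not (retry_period < renew_deadline < lease_duration):
        raise ValueError("expected retry_period < renew_deadline < lease_duration")
```

With the assumed defaults, `failover_window_seconds(15, 2)` gives a worst case of 17 seconds, which is in line with the roughly 15-second gap described above.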
Another often overlooked component: the cloud-controller-manager. If you're on AWS, GCP, or Azure, this controller interacts with the cloud provider's API to manage load balancers, routes, and nodes. A misconfigured cloud-controller-manager can prevent nodes from joining the cluster even though the API Server is healthy. Always check its logs when nodes fail to register.
One more: the API Server's etcd client uses a watch cache. If writes exceed the cache capacity, watch requests get 'too old resource version' errors and clients must re-list. This can cascade into a thundering herd problem. Mitigate by increasing --watch-cache-sizes or reducing watch concurrency. In one incident, a misconfigured monitoring system created too many watches, causing all controllers to re-sync every few minutes, spiking API Server CPU to 100%.
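The "too old resource version" behavior falls out of any bounded event buffer. The sketch below is a toy model: `WatchCache` and `TooOldResourceVersion` are hypothetical names, and resource versions are modelled as consecutive integers, which real clusters do not guarantee.

```python
from collections import deque


class TooOldResourceVersion(Exception):
    """Raised when a watcher asks for events the cache has already dropped."""


class WatchCache:
    def __init__(self, capacity: int) -> None:
        # Oldest events fall off the left once capacity is reached.
        self._events = deque(maxlen=capacity)

    def record(self, resource_version: int, obj: str) -> None:
        self._events.append((resource_version, obj))

    def events_since(self, resource_version: int) -> list:
        """Return events newer than resource_version, or signal a re-list."""
        if self._events and resource_version < self._events[0][0] - 1:
            # The caller missed events that are no longer buffered.
            raise TooOldResourceVersion(resource_version)
        return [event for event in self._events if event[0] > resource_version]
```

A client that falls further behind than the buffer can hold has no way to catch up incrementally, so it has to list everything again. When many clients fall behind at the same moment, those re-lists arrive together.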
And here's a trap with admission webhooks: they run before the request reaches etcd. A slow webhook blocks the entire request pipeline. We've seen a single webhook that took 5 seconds to respond because it called an external service that was throttled. That 5-second delay was added to every write to that resource type. The fix was to add a circuit breaker and a timeout at the webhook level. Monitor apiserver_admission_webhook_admission_duration_seconds — if the 99th percentile exceeds 1 second, you have a problem.
Component Interaction Flow: From kubectl to Running Pod
When you run kubectl apply -f deployment.yaml, a chain of events propagates through the control plane to the target node. Understanding this flow is essential for debugging latency and identifying where failures occur.
The sequence starts with kubectl sending a REST POST request to the API Server's /apis/apps/v1/namespaces/default/deployments endpoint. The API Server authenticates the request (via TLS certificates or bearer tokens), authorizes it against RBAC, then passes the object through a set of admission controllers (mutating and validating webhooks). If all passes, the API Server persists the Deployment object into etcd using a Raft write.
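The ordering of those stages can be sketched as a chain of checks in front of a store. Everything here is hypothetical (the request dictionary shape, the stage functions, the admission rule); the point is only that nothing is persisted unless every earlier stage passes.

```python
from typing import Optional


def authenticate(req: dict) -> Optional[str]:
    """Return an error string, or None to continue."""
    return None if req.get("user") else "401 Unauthorized"


def authorize(req: dict) -> Optional[str]:
    allowed = req.get("allowed_verbs", set())
    return None if req.get("verb") in allowed else "403 Forbidden"


def admit(req: dict) -> Optional[str]:
    # Hypothetical validating rule: every object must carry a name.
    return None if req.get("object", {}).get("name") else "400 admission denied"


def handle(req: dict, store: dict) -> str:
    """Run the stages in order and persist only if all of them pass."""
    for stage in (authenticate, authorize, admit):
        error = stage(req)
        if error:
            return error  # nothing is written when any stage fails
    store[req["object"]["name"]] = req["object"]
    return "201 Created"
```

Because admission sits in front of the write, any latency added there is paid before etcd ever sees the request.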
Once the write is committed, the API Server's watch mechanism notifies the Deployment controller (part of kube-controller-manager). The Deployment controller sees the new object and creates a ReplicaSet. This in turn triggers the ReplicaSet controller to create a Pod object. The Scheduler, watching for unscheduled Pods, picks a suitable node and updates the Pod with the node binding.
The API Server persists the binding and notifies the target node's kubelet via its watch. The kubelet receives the Pod spec and begins execution: it pulls the container image (if not cached), starts the container via the CRI, mounts volumes, configures networking via CNI, and runs startup/liveness probes. At the same time, kube-proxy updates iptables rules to route Service traffic to the new pod.
The entire process typically takes 2–10 seconds for a small deployment, but can stretch to minutes with large images or slow webhooks. Each step introduces latency variables: admission webhook latency, etcd write latency, scheduler queue delay, image pull time, and CNI configuration time.
A common production pitfall: a slow admission webhook (e.g., 2 seconds per request) adds 2 seconds to every resource creation. If you create 100 pods one after another during a rollout, that's up to 200 seconds of additional delay; parallel creation shortens the wall-clock time, but every request still pays the 2 seconds. Monitor apiserver_admission_webhook_admission_duration_seconds to catch this.
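The arithmetic is simple enough to keep in a helper when estimating rollout times. The function name and the `parallelism` parameter are assumptions for illustration; the model treats pods as created in equal-sized batches.

```python
import math


def added_rollout_delay(pods: int, webhook_seconds: float, parallelism: int = 1) -> float:
    """Extra wall-clock seconds a webhook adds when pods are created in batches."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    batches = math.ceil(pods / parallelism)
    return batches * webhook_seconds
```

For 100 pods and a 2-second webhook, serial creation adds 200 seconds, while ten creations in flight at a time add about 20 seconds under this model.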
The diagram below visualizes the interaction sequence between components using a simplified mermaid sequence diagram.
kubectl proxy and inspect the stage timestamps. You can correlate admission webhook durations, etcd round trips, and response times to pinpoint the slowest step in the pipeline.
Nodes: The Worker Machines
A Node is a worker machine (VM or physical) where containers are run. Each node contains the services necessary to run Pods: the kubelet, the container runtime (e.g., containerd), and the kube-proxy.
The kubelet is the node's primary agent. It receives PodSpecs from the API Server (for pods assigned to its node) and ensures the described containers are running. It does this by talking to the container runtime via the Container Runtime Interface (CRI). The kubelet also runs liveness and readiness probes, mounts volumes, and reports node conditions like DiskPressure, MemoryPressure, and PIDPressure to the API Server.
Kube-proxy (runs as a DaemonSet) maintains network rules on the node. It watches Services and EndpointSlices and updates iptables or IPVS rules so traffic to a Service's ClusterIP is load‑balanced to the actual pods.
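Here's what most people miss: on older clusters that still use dockershim, the kubelet's --image-pull-progress-deadline defaults to 1 minute. If your image is large or the registry is slow, the kubelet cancels the pull and retries, creating a cycle that leaves pods in ImagePullBackOff. The flag was Docker-specific and went away with dockershim in Kubernetes 1.24, so on containerd or CRI-O check whether the runtime exposes a comparable pull timeout in its own configuration. Either way, raise the deadline or use image streaming in production.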
Also, the kubelet's eviction logic uses soft and hard eviction thresholds. Hard eviction triggers immediate pod killing when exceeded, while soft eviction has a grace period. By default, evictionHard.memory.available is 100MiB — that's practically zero. Set it to 10% of node memory for predictable behavior.
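To see how small the default is relative to node size, compare the fixed 100 MiB threshold with a 10% threshold. The helper names are hypothetical; only the 100 MiB default and the 10% suggestion come from the text above.

```python
from typing import Optional

MIB = 1024 * 1024


def hard_threshold_bytes(node_memory_bytes: int, fraction: Optional[float] = None) -> int:
    """Threshold as a fraction of node memory, or the 100 MiB default if None."""
    if fraction is None:
        return 100 * MIB
    return int(node_memory_bytes * fraction)


def should_evict(available_bytes: int, threshold_bytes: int) -> bool:
    """Hard eviction triggers when available memory drops below the threshold."""
    return available_bytes < threshold_bytes
```

On a 16 GiB node with 500 MiB still available, the default threshold does not trigger eviction, while the 10% threshold (about 1.6 GiB) already does.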
One more thing: the kubelet's node status updates are sent to the API Server periodically (default 10 seconds). If the API Server is under high load or network is congested, the node may appear NotReady even though it's healthy. This is called 'node flapping' and is often a symptom of control plane load rather than node failure. Tune the node-status-update-frequency and node-monitor-grace-period accordingly.
Another detail: the kubelet's --max-pods default is 110, but that's a hard count, not a resource limit. A node may have free CPU/memory but hit this limit. In clusters running many sidecars, you can exhaust the pod slot quickly. Use --max-pods or the scheduling plugin to enforce a more appropriate cap based on your workload density.
And let's talk about system reserved resources. If you don't configure --system-reserved and --kube-reserved, the kubelet assumes all node resources are available for pods. But the operating system and the kubelet itself consume some. Without proper reservations, pods can steal resources from system daemons, leading to SSH timeouts or node instability. Always set --system-reserved=cpu=500m,memory=1Gi (adjust per node size) and enable eviction thresholds.
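What the scheduler can actually hand out on a node is capacity minus the reservations and the hard eviction threshold. A small sketch of that subtraction, with a hypothetical function name and all values in MiB:

```python
def allocatable_mib(capacity: int, system_reserved: int, kube_reserved: int, eviction_hard: int) -> int:
    """Memory (MiB) left for pods after reservations and the eviction threshold."""
    allocatable = capacity - system_reserved - kube_reserved - eviction_hard
    if allocatable < 0:
        raise ValueError("reservations exceed node capacity")
    return allocatable
```

For a 16 GiB node (16384 MiB) with 1 GiB system-reserved, an assumed 512 MiB kube-reserved, and the 100 MiB eviction threshold, pods get 14748 MiB. Leaving the reservations at zero makes the same node advertise 16284 MiB, and the difference is memory the OS and kubelet will take anyway.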
- Runs liveness and readiness probes.
- Mounts volumes specified in the PodSpec.
- Reports node conditions (MemoryPressure, DiskPressure).
- Manages cgroups for resource isolation.
--max-pods (default 110) is a hard cap: new pods won't schedule even if CPU/RAM is free.
journalctl and crictl, not just kubectl.
Worker Node Component Reference Table
The following table provides a quick reference for the core components running on every worker node. This is useful when triaging node-level issues or validating node configurations during cluster upgrades.
| Component | Description | Default Port | Log Location | Common Failure Modes |
|---|---|---|---|---|
| kubelet | Primary node agent; manages pods and reports node status | 10250 (kubelet API), 10255 (read-only) | journalctl -u kubelet | OOM due to missing limits, disk pressure, stalled Docker socket (legacy) |
| kube-proxy | Network proxy; maintains iptables/IPVS rules for Services | 10249 (metrics) | journalctl -u kube-proxy | iptables corruption (large clusters), IPVS mode fallback |
| containerd | Container runtime (default) | 10010 (CRI) | journalctl -u containerd, crictl logs | Image pull timeout, storage driver issues (overlay2), dead containerd socket |
| CRI-O | Alternative container runtime (Red Hat) | 10010 (CRI) | journalctl -u crio | Image pull timeout, conmon OOM, conmon vs runc mismatch |
| Calico (cni-calico) | CNI plugin providing network policies and BGP routing | 9099 (felix metrics) | /var/log/calico/cni/ | IPAM exhaustion, BGP peer failure, policy misconfiguration |
| Flannel (cni-flannel) | CNI plugin for simple overlay networking | – | journalctl -u flanneld | VXLAN MTU mismatch, subnet lease conflicts, no network policy support |
| Cilium (cni-cilium) | eBPF-based CNI with advanced observability | 9090 (cilium-agent metrics), 9961 (hubble) | cilium status or hubble observe | Kernel version < 5.10, eBPF feature gaps, conflicting NetworkPolicies |
Key insight: keep the container runtime the same across all nodes. Mixed runtimes (containerd on some, CRI-O on others) work but introduce subtle differences in CRI implementation, so always test in a staging cluster before rolling out.
Another common pitfall: the kubelet's --container-runtime-endpoint defaults to /run/containerd/containerd.sock. If containerd is restarted and the socket disappears temporarily, kubelet will fail to start new pods. Use a systemd socket activation or a health check that waits for the socket to appear before starting the kubelet.
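Also note that kube-proxy in IPVS mode (set --proxy-mode=ipvs) scales better for large clusters but requires the IPVS kernel modules (ip_vs and its scheduler modules such as ip_vs_rr) to be loaded on the node; ipvsadm is only the userspace tool for inspecting the rules. Without the modules, kube-proxy falls back to iptables mode, which can be a surprise if you've tuned for IPVS performance.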
- Pin kubelet, kube-proxy, and container runtime versions to cluster release.
- Pre-pull images used by core components (CNI, kube-proxy) during node bootstrapping.
- Validate node components weekly against a baseline – any mismatch should trigger an alert.
etcd: The Cluster's Source of Truth
etcd is a distributed, consistent key-value store that powers Kubernetes. It uses the Raft consensus algorithm to achieve strong consistency across a cluster of members (typically 3 or 5). All cluster state—Pods, ConfigMaps, Secrets, Deployments, RBAC policy—is stored in etcd. The API Server is the only component that writes to etcd; all other components watch the API Server for changes.
The key insight: etcd is a replicated log, not a traditional database. Every write is appended to a log and only committed when a majority of members (quorum) acknowledge it. This design ensures strong consistency but makes performance heavily dependent on disk I/O latency for fsync operations. If one etcd member's disk is slow, the entire cluster's write throughput suffers.
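Quorum size explains why etcd clusters use odd member counts. A minimal sketch of the majority rule (function names are ours):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - quorum(members)
```

Three members tolerate one failure and five tolerate two. Four members still tolerate only one, so the extra member adds replication cost without adding fault tolerance.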
Another thing: etcd's default --snapshot-count is 100,000. After that many changes, etcd takes a snapshot and compacts the log. On slow disks, this snapshot can spike latency and cause temporary leader election issues. Tune this value down (e.g., 50000) on clusters with frequent writes.
Less obvious: etcd's database file grows even after compaction because old data is freed but not returned to the OS. You must run etcdctl defrag periodically to reclaim space. Skipping this leads to quota limit errors (mvcc: database space exceeded) that lock the cluster.
Also, consider the impact of network latency between etcd members. Raft heartbeats are sent every 100ms by default. If round-trip time exceeds 50ms, you risk false leader elections. In multi-datacenter setups, place etcd members close together or tune heartbeat-interval to account for latency.
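One way to turn a measured round-trip time into starting values for the two timers. This is a rule-of-thumb sketch under stated assumptions: the heartbeat is kept at or above the peer RTT and never below the 100 ms default, and the election timeout is kept at ten times the heartbeat (the ratio of the 100 ms and 1000 ms defaults). Treat the output as a first guess to benchmark, not a recommendation.

```python
import math


def suggest_raft_timings(peer_rtt_ms: float) -> tuple:
    """Return (heartbeat_ms, election_timeout_ms) for a measured peer RTT."""
    heartbeat = max(100, math.ceil(peer_rtt_ms))  # never below the default
    election = heartbeat * 10                     # keep the default 1:10 ratio
    return heartbeat, election


def as_flags(peer_rtt_ms: float) -> str:
    """Format the suggestion as etcd command-line flags."""
    heartbeat, election = suggest_raft_timings(peer_rtt_ms)
    return f"--heartbeat-interval={heartbeat} --election-timeout={election}"
```

For members 20 ms apart the defaults are kept. At 150 ms the sketch suggests `--heartbeat-interval=150 --election-timeout=1500`, which trades slower failure detection for fewer spurious elections.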
A production scenario: during a large-scale cluster upgrade, etcd writes spike as many resources are updated simultaneously. If you haven't tuned --snapshot-count or --quota-backend-bytes, the WAL compaction can cause fsync storms. One team saw a 5-second write latency every 2 minutes during an upgrade, causing repeated leader elections. The fix: increase --quota-backend-bytes to 16GB and set --auto-compaction-retention=1h to avoid bursts.
And the silent killer: clock skew. Raft relies on election timeouts that are based on monotonic clocks. If two etcd nodes have clock drift greater than the election timeout, they may both think the leader has timed out and start new elections, causing a split that can lead to quorum loss. Use chrony or ntpd with reliable upstream time sources and monitor clock offset across etcd members.
iostat and network latency between etcd members. Also check clock skew. Increase heartbeat-interval and election-timeout proportionally.
The Scheduler: How Pods Are Assigned to Nodes
The Kubernetes scheduler (kube-scheduler) is responsible for assigning unscheduled Pods to appropriate Nodes. It does this via a two-phase pipeline: Filtering (Predicates) and Scoring (Priorities).
In the filtering phase, the scheduler selects nodes that meet the pod's hard constraints: resource requests, nodeSelector, node affinity, taints/tolerations, topology constraints (e.g., pod anti-affinity). Node conditions like DiskPressure or MemoryPressure also cause the node to be filtered out.
In the scoring phase, the scheduler ranks feasible nodes based on priority functions: spread pods across zones, minimize resource fragmentation (balanced allocation), cluster autoscaler preferences, and user-defined custom scores. The node with the highest score gets the pod.
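The two phases can be sketched in a few lines. The `Node` fields and the scoring rule (prefer the node left with the most free CPU) are hypothetical simplifications; the real scheduler combines many weighted plugins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu_m: int    # free CPU in millicores
    free_mem_mi: int   # free memory in MiB
    under_pressure: bool = False


def feasible(node: Node, cpu_m: int, mem_mi: int) -> bool:
    """Filtering: hard constraints only."""
    return (
        not node.under_pressure
        and node.free_cpu_m >= cpu_m
        and node.free_mem_mi >= mem_mi
    )


def score(node: Node, cpu_m: int) -> int:
    """Scoring (hypothetical rule): prefer the most CPU left after placement."""
    return node.free_cpu_m - cpu_m


def schedule(nodes: list, cpu_m: int, mem_mi: int) -> Optional[str]:
    """Return the chosen node name, or None if the pod would stay Pending."""
    candidates = [n for n in nodes if feasible(n, cpu_m, mem_mi)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(n, cpu_m)).name
```

A node under pressure is never scored, however much capacity it has. Filtering decides whether a pod can run somewhere at all; scoring only chooses among the survivors.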
If no node passes the filtering phase, the pod remains Pending. The scheduler emits events that can be inspected via kubectl describe pod or scheduler logs.
A lesser-known nuance: the scheduler does not re-evaluate decisions for already scheduled pods. If a node becomes overcommitted after scheduling, the pod stays there—it won't be rescheduled. That's the kubelet's job (eviction), not the scheduler's.
Another performance detail: the scheduler's --kube-api-qps defaults to 50. If you have hundreds of nodes and thousands of pods being created rapidly (e.g., during a scaling event), the scheduler may fall behind. Increase this value but watch API Server load.
Another subtlety: the scheduler uses a scheduler cache of node information to avoid hitting the API Server for every pod. But this cache can become stale if nodes update frequently. In extreme cases, the scheduler may try to place a pod on a node that no longer exists or has changed. This manifests as a pod stuck in 'PodScheduled' condition with the event 'node does not exist'. The scheduler eventually retries and the cache refreshes, but you can force this by restarting the scheduler pod.
One more: the scheduler's percentageOfNodesToScore setting (the older --percentage-of-nodes-to-score flag) defaults to an adaptive value in clusters of 100 nodes or more: about 50% at 100 nodes, shrinking to roughly 10% at 5,000 nodes, with a floor of 5%. This is a performance optimization: it scores only a subset of feasible nodes. But it can lead to suboptimal placements if that subset doesn't include the best node. In latency-sensitive deployments, consider setting it to 100% to always get the best score at the cost of scheduling latency.
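The sketch below paraphrases how the number of scored nodes might be derived. It assumes the adaptive default described in the kube-scheduler documentation (about 50% at 100 nodes, shrinking toward 10% at 5,000, never below 5%, and never fewer than 100 nodes). It is not copied from the scheduler source, so verify it against your version.

```python
MIN_NODES_TO_FIND = 100  # assumed lower bound on nodes considered


def nodes_to_score(total_nodes: int, percentage: int = 0) -> int:
    """How many feasible nodes to find before scoring (0 means adaptive)."""
    if total_nodes < MIN_NODES_TO_FIND or percentage >= 100:
        return total_nodes
    if percentage <= 0:
        # Adaptive default: shrink the percentage as the cluster grows.
        percentage = max(5, 50 - total_nodes // 125)
    return max(MIN_NODES_TO_FIND, total_nodes * percentage // 100)
```

Under these assumptions a 1,000-node cluster scores 420 nodes and a 5,000-node cluster scores 500. Setting the percentage to 100 makes the scheduler consider every node.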
And let's talk about pod priority and preemption. If you have pods with different priority classes, the scheduler can preempt lower-priority pods to make room for higher-priority ones. But preemption is not instant — it can take up to 30 seconds because the scheduler must gracefully evict the lower-priority pods and wait for the kubelet to terminate them. During that window, the high-priority pod remains Pending. If your application requires rapid recovery, design your priority classes with realistic timeout expectations.
scheduler_queue_incoming_pods metrics and adjust --kube-api-qps or tune scoring algorithms.
--percentage-of-nodes-to-score default can skip optimal nodes; set to 100% for latency-sensitive workloads.
kubectl describe pod for the reason.
--percentage-of-nodes-to-score to reduce scheduling latency.
kubectl describe pod for insufficient CPU/memory errors. Either increase cluster capacity or reduce requests.
Kubernetes Networking: CNI, Services, and kube-proxy
Kubernetes assumes a flat network where every Pod can communicate with every other Pod without NAT, across nodes. This is achieved through the Container Network Interface (CNI), a plugin-based layer that configures network interfaces and routes on each node.
Each pod gets its own IP address (IP-per-pod model). CNI plugins like Calico, Flannel, Weave, or Cilium set up virtual interfaces and routing rules to enable cross-node pod-to-pod communication. Services abstract pod IPs and provide stable virtual IPs (ClusterIP) for pod discovery. kube-proxy watches Services and EndpointSlices and programs iptables or IPVS rules to forward traffic to the correct pods.
DNS resolution: Kubernetes DNS (CoreDNS) serves A/AAAA records for Services. Pods can resolve service names to ClusterIPs, enabling simple service discovery.
One common misconfiguration: kube-proxy --cluster-cidr must match your pod CIDR range. If they differ, kube-proxy may program incorrect routing. Also, when using Calico with NetworkPolicy, remember the default: a pod that no policy selects accepts all traffic, and as soon as any policy selects it, traffic that is not explicitly allowed is denied. Kubernetes NetworkPolicy behaves this way, and Calico enforces it the same way, so a single narrowly scoped policy can cut off traffic you did not mean to block.
Choosing the right CNI matters: Calico offers rich network policies and BGP-based routing, Flannel is simpler but lacks policy support, Cilium uses eBPF for high performance. Evaluate based on your scale and security requirements.
Also, when using Cilium with eBPF, the kube-proxy can be completely removed (kube-proxy replacement). This reduces iptables overhead and improves performance at scale. But it requires careful validation of network policies and service routing, as Cilium's implementation may differ from standard kube-proxy in edge cases like externalTrafficPolicy.
A common production pitfall: IP address management (IPAM) exhaustion. If your CNI allocates pod IPs from a fixed CIDR and you have many pods terminating and starting, the IP pool can fragment. Calico uses a block-based approach that mitigates this, but Flannel's default allocation can lead to rapid exhaustion. Monitor IP utilization metrics from your CNI plugin.
And don't forget about MTU issues. If your CNI's overlay network uses encapsulation (e.g., VXLAN with 50 bytes overhead), and your underlying network has a standard 1500 MTU, the effective MTU for pods is 1450. If your application sends large packets that require fragmentation, you may see degraded performance or timeouts. Set the MTU on your CNI config to account for encapsulation overhead, or use direct-routing mode (e.g., Calico with BGP) to avoid encapsulation.
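The MTU arithmetic as a small helper, using the 50-byte VXLAN overhead from the paragraph above. The function names are ours.

```python
def pod_mtu(underlay_mtu: int, encapsulation_overhead: int) -> int:
    """Largest packet a pod can send without fragmentation on the underlay."""
    return underlay_mtu - encapsulation_overhead


def needs_fragmentation(packet_bytes: int, mtu: int) -> bool:
    """True when a packet is larger than the effective MTU."""
    return packet_bytes > mtu
```

With a 1500-byte underlay and VXLAN, `pod_mtu(1500, 50)` is 1450, so a full 1500-byte packet from a pod has to be fragmented or dropped. With direct routing the overhead is 0 and the full 1500 bytes are available.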
CNI Plugins: Calico and Flannel in Kubernetes Architecture
The Container Network Interface (CNI) plugin is a critical architectural component that determines how pods communicate within and across nodes. Two of the most widely used CNI plugins are Calico and Flannel. Each implements the Kubernetes networking model differently, with distinct trade-offs in complexity, security, performance, and scale.
Calico uses a pure Layer 3 approach by default, routing pod traffic using BGP (Border Gateway Protocol) without needing overlays. This eliminates encapsulation overhead (no VXLAN/IPSEC) and allows pod-to-pod packets to be forwarded at near wire speed. Calico also implements rich network policies using iptables or eBPF, supporting granular ingress/egress rules, namespace isolation, and dynamic policy enforcement. It includes its own IPAM (IP Address Management) using block-based allocation, which reduces fragmentation.
Flannel uses an overlay network (most commonly VXLAN) to encapsulate pod traffic. It is simpler to deploy and requires no BGP infrastructure or complex routing configuration. Flannel provides a flat network where every pod gets a unique IP, but it does not support Kubernetes NetworkPolicy natively—meaning no firewall rules between pods unless you combine it with a separate policy engine like Calico or Cilium. Flannel's IPAM is simpler and can exhaust IPs faster under high pod churn.
When to choose which? For production clusters requiring network policies, multi-tenancy, and high throughput (e.g., financial services, large e-commerce), Calico is the recommended choice. For small dev/test clusters, or teams new to Kubernetes where simplicity is paramount, Flannel suffices. Many production deployments run Calico in BGP mode with eBPF acceleration for best performance.
Common issues with Calico: BGP peer misconfiguration causes routing failures; calico-node pod crash due to missing kernel modules (e.g., ip_tables, nf_conntrack); IPAM pool exhaustion leading to pods stuck in ContainerCreating. For Flannel: VXLAN MTU mismatch (set to 1450 if underlying network MTU is 1500); subnet lease conflicts when nodes are recreated quickly.
Performance consideration: Calico's iptables-based policy enforcement can become a bottleneck at scale (thousands of policies). Switch to eBPF mode (requires Linux kernel >= 5.10) for better throughput. Flannel's VXLAN incurs a 50-byte overhead per packet—increase MTU on the host interface to compensate.
The diagram below illustrates how a packet travels from one pod to another across nodes using a Calico BGP route versus a Flannel VXLAN tunnel.
API Server: The Gateway to the Cluster
The API Server is the frontend of the control plane and the only component that directly interacts with etcd. Every kubectl command, controller watch, and component communication goes through it. Understanding its request flow is critical for debugging and performance tuning.
The API Server authenticates the request (via client certificates, bearer tokens, or OIDC), authorizes it against RBAC policies, and then passes it through admission controllers (mutating and validating) before persisting to etcd. Admission controllers can modify or reject resources — this is where PodSecurity admission, resource quota enforcement, and custom webhooks run.
What most engineers miss: admission webhooks add latency to every API request. A slow webhook can increase API Server response time from milliseconds to seconds, impacting the entire cluster. Monitor apiserver_admission_webhook_admission_duration_seconds for outliers. Also, the API Server caches responses for watch requests, but the cache size is limited. Large clusters with many watches can cause cache thrashing and increased etcd read load.
Another overlooked metric: apiserver_request_duration_seconds with a high 99th percentile indicates either webhook latency or etcd slowness. Correlate with etcd metrics to pinpoint the bottleneck.
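Another scenario: the API Server's watch cache can become inconsistent on very large clusters with frequent resource updates. When a watch request fails with 'too old resource version', the client must re-list all resources. This can cause cascading failures as controllers re-sync and generate additional load. Mitigate by increasing the kube-apiserver's --watch-cache-sizes for the affected resources and confirming that --watch-cache (a boolean, enabled by default) has not been turned off. On recent releases the watch cache is sized dynamically and non-zero --watch-cache-sizes values are treated alike, so check the flag reference for your version.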
One more: the API Server's --max-requests-inflight and --max-mutating-requests-inflight defaults can be too low for clusters with many controllers or automation. If you see 429 Too Many Requests errors in logs, tune these up gradually while monitoring memory usage. Each inflight request consumes memory for the request context, so increasing them too aggressively can cause OOM.
And here's something about watch timeouts: by default, watch connections are long-lived. If a client (e.g., a controller) disconnects unexpectedly, the API Server can keep the watch goroutine until the connection times out. In clusters with many controllers, these orphaned watches can accumulate and consume significant memory. The documented control for watch lifetime is --min-request-timeout (default 1800 seconds, with the actual timeout randomized above that value); lowering it cleans up stale watches faster at the cost of more frequent client reconnects.
webhook.Server with timeouts and always set failurePolicy: Fail in production to avoid silent bypass.
--max-requests-inflight while monitoring memory.
--watch-termination-timeout to a reasonable value.
--max-mutating-requests-inflight and --max-requests-inflight or reduce concurrency of automated kubectl usage.
--watch-cache-sizes for the affected resource types. Consider using a higher resource version tolerance in client code.
CoreDNS and Service Discovery: The Cluster's Internal DNS
CoreDNS is the default DNS resolver for Kubernetes clusters. It runs as a deployment in the kube-system namespace and watches Services and EndpointSlices to provide name resolution for service names. Every pod is configured to use CoreDNS via /etc/resolv.conf — typically at the cluster IP of the CoreDNS service (like 10.96.0.10).
CoreDNS uses plugins to achieve its functionality. The kubernetes plugin handles Kubernetes service records. The forward plugin forwards external DNS queries to upstream resolvers. The loop plugin detects forwarding loops. The log plugin enables query logging for debugging.
A common misconfiguration: not setting resource limits for CoreDNS. Under heavy query load, CoreDNS can become a bottleneck. We've seen clusters where a single CoreDNS pod crashed due to memory limits, causing intermittent DNS failures across the cluster. Monitor CoreDNS's memory and CPU — set requests and limits based on cluster size.
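Another subtlety: the ndots option in the pod's resolver configuration. By default, /etc/resolv.conf sets ndots:5, so any name with fewer than 5 dots is tried against each cluster search domain before it is queried as an absolute name. A short service name such as my-service resolves on the first attempt (my-service.default.svc.cluster.local), which is the intended behavior. The cost falls on external names: api.example.com is first tried as api.example.com.default.svc.cluster.local, api.example.com.svc.cluster.local, and api.example.com.cluster.local, and only then as api.example.com. For high-traffic applications that call external hosts, lower ndots through dnsConfig.options in the pod spec (for example to 1 or 2), or write external names with a trailing dot so they are treated as fully qualified.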
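To make the search-list behavior concrete, here is a small sketch of the order in which a glibc-style resolver tries names. The function name and the SEARCH list are illustrative assumptions; a real pod's search list depends on its namespace and on the node's own resolver settings, and other resolver implementations (musl, for example) differ in detail.

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, absolute name last.
    return expanded + [absolute]


# Assumed search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("api.example.com", SEARCH):
    print(query)
```

With ndots=5, the absolute name api.example.com. is the fourth and last candidate; with ndots=1 it moves to the front, and with a trailing dot it is the only candidate.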
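Also, check the forward plugin's upstream selection policy. CoreDNS defaults to policy random, which spreads queries across healthy upstreams. With policy sequential, every query goes to the first upstream and the others are used only when it fails, so a slow first upstream delays all external lookups. If your Corefile sets sequential, switch to random or round_robin for better load distribution.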
In large clusters, consider deploying NodeLocal DNSCache. It runs a DaemonSet that caches DNS queries per node, reducing load on CoreDNS and improving resolution latency. This is almost essential for clusters with thousands of pods.
log plugin to the CoreDNS ConfigMap. The logs will show each query and response. Use with caution in production: log volume can be high. The command kubectl logs -n kube-system deployment/coredns --tail=10 shows recent queries.
coredns_dns_request_duration_seconds to catch degradation early.
kube-dns has endpoints. Check NetworkPolicy blocking DNS.
Kubelet Probes and Pod Lifecycle: What Happens When a Probe Fails
The kubelet executes three types of probes on containers: liveness, readiness, and startup. Each probe is a periodic check (HTTP GET, TCP socket, or command execution) that determines whether the container is alive, ready to serve traffic, or still starting up. The probe results directly affect the pod's lifecycle and the cluster's behavior.
Liveness probes determine if the container is running. If it fails, the kubelet restarts the container per the pod's restartPolicy. Readiness probes determine if the container is ready to serve traffic. If it fails, the pod is removed from Service endpoints. Startup probes delay the start of liveness and readiness probes until the container finishes initialization — critical for slow-starting applications like Java apps or legacy monoliths.
Here's the nuance most teams get wrong: liveness probe thresholds often cause cascading restarts during rolling updates. The default failureThreshold is 3 and the default periodSeconds is 10, so a container is killed after roughly 30 seconds of consecutive failures. If the new version needs 40 seconds before it can respond, the kubelet restarts it before it ever becomes ready, and the cycle repeats. The fix: add a startup probe whose failureThreshold × periodSeconds covers the worst-case start time, or, as a cruder fallback, set initialDelaySeconds on the liveness probe.
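The arithmetic behind these thresholds is easy to check before a rollout. The helpers below are a rough sketch: the function names and the one-period headroom are assumptions, and real probe timing also depends on timeoutSeconds and on when the kubelet schedules the first probe.

```python
import math


def seconds_until_restart(period_seconds=10, failure_threshold=3,
                          initial_delay_seconds=0):
    """Approximate time a continuously failing container survives a liveness probe."""
    return initial_delay_seconds + period_seconds * failure_threshold


def startup_failure_threshold(worst_case_start_seconds, period_seconds=10,
                              headroom_periods=1):
    """Startup-probe failureThreshold covering the worst-case start time.

    headroom_periods adds extra probe periods as a safety margin; the
    default of 1 is an illustrative choice, not a Kubernetes default.
    """
    needed = math.ceil(worst_case_start_seconds / period_seconds)
    return needed + headroom_periods


print(seconds_until_restart())        # liveness budget with the defaults
print(startup_failure_threshold(40))  # threshold for a 40-second start
```

With the defaults, the liveness budget is 30 seconds, which is why a 40-second start never survives. A startup probe with periodSeconds 10 and the computed failureThreshold of 5 gives the container 50 seconds before liveness checks begin.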
Another subtlety: when a readiness probe fails, the pod remains running but is removed from Service endpoints. This means traffic stops flowing, but the pod continues consuming CPU and memory. If many pods become unready, the remaining pods may be overloaded, causing a cascade of readiness failures. Always set resource limits to prevent unready pods from starving healthy ones.
Also, the kubelet doesn't kill containers for failing readiness probes. That's intentional: a readiness probe only controls traffic routing. If you treat every readiness failure as a health alert in monitoring, you'll get false positives. Alert on readiness failures only when they persist or affect a large share of replicas.
One more: the kubelet records probe failures as events on the pod, visible with kubectl describe pod. Events are not a durable record, though: the API Server retains them for a limited time (one hour by default, via --event-ttl), and repeated events are aggregated. If probes have been failing for a while, the events that pointed to the root cause may already be gone. For continuous monitoring, use metrics from the kubelet (kubelet_pod_start_duration_seconds, kubelet_pod_lifecycle_event_gauge).
initialDelaySeconds on the liveness probe. That's fragile. Always verify your cluster version before relying on startup probes.
Cascading API Server Failure Due to etcd Disk Latency
kubectl operations returned timeout errors. Controller-manager logs showed failed lease renewals. New pods stuck in Pending.
fsync duration and disk IOPS.
4. Added disk latency alerts with p99 > 15ms triggering immediate escalation.
- etcd is the cluster's central nervous system. Its performance is non-negotiable.
- Disk latency, not network, is the most common cause of etcd instability.
- etcd must be isolated and its hardware provisioned for predictable, low-latency I/O.
- Always test your etcd restore procedure quarterly — a backup you've never restored is no backup at all.
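A p99 alert like the one in this incident can be sanity-checked offline against raw latency samples. The sketch below uses a nearest-rank percentile; the function names are hypothetical, and in practice the samples would usually come from etcd's etcd_disk_wal_fsync_duration_seconds histogram rather than a Python list.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_escalate(fsync_ms, threshold_ms=15.0):
    """True when the p99 fsync latency exceeds the alert threshold."""
    return percentile(fsync_ms, 99) > threshold_ms


# 98 fast fsyncs and 2 slow ones: the mean stays under 3 ms,
# but the tail is what destabilizes Raft heartbeats.
samples = [2.0] * 98 + [40.0, 60.0]
print(percentile(samples, 99), should_escalate(samples))
```

The example shows why the alert is defined on p99 rather than the average: two slow writes in a hundred barely move the mean, yet they push the tail well past the 15ms threshold.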
kubectl get --raw=/readyz?verbose and etcdctl endpoint health --cluster.
kubectl describe node). Check for resource fragmentation or taints/tolerations mismatches. Also verify scheduler pod health.
kubectl describe pod). The container runtime (e.g., containerd) logs are critical here. Check journalctl -u kubelet and crictl ps -a.
systemctl status kubelet). Check disk pressure, memory pressure, and PID pressure using kubectl describe node. Look for eviction thresholds being hit.
nslookup kubernetes.default.svc.cluster.local from inside a pod.
iostat -x 1). Isolate etcd immediately. If API Server is slow, check apiserver_current_inflight_requests metric.
Common mistakes to avoid
5 patterns
Using default eviction thresholds without customization
evictionHard.memory.available defaults to 100MiB, which is too low for predictable behavior.
evictionHard.memory.available: "10%" or a value like 500Mi in the kubelet configuration. Test with a load generator to verify thresholds trigger before system OOM.
Running etcd on shared nodes with other workloads
--heartbeat-interval and --election-timeout for your network latency.
Not setting resource limits for CoreDNS
forward plugin's max_concurrent.
Applying a NetworkPolicy that denies all ingress without allowing DNS
Ignoring scheduler cache staleness
--node-informer-resync-period or using a more recent scheduler version that handles this better.
Interview Questions on This Topic
Explain how the Kubernetes scheduler assigns a pod to a node. What happens if no node passes filtering?
kubectl describe pod. The scheduler then scores the remaining nodes with priority functions (e.g., spread, balanced resource allocation) and assigns the pod to the highest-scored node. The scheduler does not re-evaluate once the pod is bound; the kubelet handles eviction if the node becomes overcommitted.