Senior 23 min · March 06, 2026

etcd Disk Latency — Kubernetes Architecture Failure

etcd disk latency from co-located workloads caused Raft leader election failures, crashing the API server.

Naren · Founder
Quick Answer
  • Kubernetes is a declarative control loop: you define desired state, the system reconciles toward it.
  • Control Plane components: API Server, Scheduler, Controller Manager, and etcd — the single source of truth using Raft.
  • Nodes run the kubelet, kube-proxy, and container runtime; they execute pod specs and report back.
  • Performance insight: etcd's fsync latency dictates cluster responsiveness — keep it under 10ms or expect leader elections.
  • Production insight: API Server is the only component that talks directly to etcd; all others go through the API Server and watch it. If etcd is slow, the entire control plane slows.
  • Biggest mistake: treating etcd as a generic KV store — it's a replicated log, not an OLTP database.
  • Scaling insight: watch cache exhaustion causes re-list storms; increase --watch-cache-sizes for resource-heavy clusters.
  • Scheduler nuance: default --percentage-of-nodes-to-score can skip optimal nodes at scale — set to 100% for latency-sensitive workloads.
Plain-English First

Imagine a massive Amazon warehouse. There's a central manager's office (the Control Plane) that decides which workers (Nodes) pick which packages (containers), tracks every shelf location (etcd), and reschedules sick workers automatically. The workers don't think — they just follow orders from the office, report their status, and run their assigned tasks. Kubernetes is exactly that warehouse management system, but for software running on servers.

Kubernetes replaces bespoke deployment scripts and manual server management with a declarative control loop. You describe the desired state, and the cluster continuously reconciles reality toward it. The architecture is not a black box; it's a set of coordinated components with specific failure domains and performance characteristics. Understanding these internals is what separates engineers who debug Kubernetes from those who are confused by it.

This is not a getting-started guide. It is for engineers already running Kubernetes who need to understand the 'why' behind scheduler decisions, etcd consistency guarantees, and kubelet behavior under pressure. We will trace what happens, component by component, when you run kubectl apply -f deployment.yaml, and identify the production decisions that bite teams hardest. The single most overlooked truth: the API Server is the bottleneck, but etcd is the clock that drives it.

One more thing: don't assume HA means safe. Running three API Server replicas without understanding leader election or etcd quorum is a false sense of security. We'll cover exactly what breaks and how to catch it before your on-call phone rings.

What Is Kubernetes Architecture?

Kubernetes architecture is a core concept in DevOps. Rather than starting with a dry definition, let's see it in action and understand why it exists. The architecture's design directly enables its core promise: a self-healing, declarative system for running distributed applications.

At its heart, Kubernetes is a set of independent control loops running across a cluster of machines. Each controller watches the API Server for its specific resource type, compares the current state to the desired state, and takes action to reconcile them. This decoupling is what makes the system resilient—when the scheduler goes down, controllers still keep running replicas healthy. But it also introduces eventual consistency: after you apply a manifest, the cluster converges toward the desired state, not to a point-in-time snapshot.
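To make the compare-and-act step concrete, here is a toy sketch of a single reconcile pass in Bash. The function name and the "create/delete/noop" output are illustrative assumptions; real controllers do this against API objects, not integers.

```shell
#!/bin/bash
# Toy reconcile step (hypothetical helper, not Kubernetes code):
# compare desired vs. observed replica counts and print the action
# a controller would take to close the gap.
reconcile() {
  local desired=$1 observed=$2
  if (( observed < desired )); then
    echo "create $(( desired - observed ))"
  elif (( observed > desired )); then
    echo "delete $(( observed - desired ))"
  else
    echo "noop"
  fi
}
```

Calling `reconcile 3 1` prints `create 2`. A controller repeats this comparison every time a watch event arrives, which is why the cluster converges rather than jumping to a final state.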

Here's the part that catches teams off guard: control loop timing matters. Controllers react to watch events, so most changes are picked up almost immediately; the periodic resync is only a safety net (kube-controller-manager's --min-resync-period defaults to 12 hours). What you actually wait on after a fix, say deleting a stuck pod, is graceful termination: the pod sits in Terminating for up to its terminationGracePeriodSeconds (30 seconds by default) before the kubelet force-kills it. In production, this delay accumulates during rolling updates and can make deployments feel sluggish. Grace periods, probe timings, and rollout parameters are levers many engineers ignore until they cost them a slow canary.

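Custom controllers can override the default informer resync period. Shortening it makes the informer replay every cached object to its handlers more often, which multiplies reconcile work and, when those reconciles call the API, API Server load. Always benchmark resync periods with real workloads.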

One more nuance: the control loop model means you can't trust a single kubectl get call to show the truth. The cluster is always converging. When debugging, watch the events with kubectl get events --watch and inspect controller logs to see what's happening in real time.

Another subtlety: when you delete a namespace, the namespace controller doesn't delete resources synchronously — it issues deletion requests and waits. If a webhook is slow, namespace deletion can hang for minutes.

Here's a production reality: if you're running a controller manager with leader election (you are), the leader's sync loop is the only one doing work. If the leader pod gets OOMKilled, it takes ~15s for a new leader to take over. During that window, no controller reconciles. In a chaos experiment, we saw a 30-second gap where replicas dropped to zero because the replication controller didn't notice a node failure during that handoff. Lesson: monitor leader election metrics.

And let's talk about informer caches — every controller uses a local cache of objects from the API Server. If the cache becomes stale because the API Server dropped a watch (e.g., due to too old resource version), the controller must re-list all objects. That's expensive. In large clusters with thousands of Deployments, a re-list can spike API Server CPU and cause cascading timeouts. Mitigate by increasing the watch cache sizes or setting appropriate informer resync periods.

io/thecodeforge/kubernetes/basic-deployment.yaml (YAML)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
The Declarative Control Loop
  • You write a Deployment YAML (desired state).
  • The Deployment controller creates ReplicaSets.
  • The ReplicaSet controller ensures the right number of Pods exist.
  • The Scheduler assigns Pods to Nodes.
  • The Kubelet on each Node runs the containers.
Production Insight
Control loop model gives resilience but introduces eventual consistency.
Drift occurs when external tools modify resources outside controllers.
Race condition: applying ConfigMap + Deployment quickly may use old ConfigMap — use rollout status, not sleep.
Leader election handoffs create reconciliation gaps — monitor controller manager's leader change metrics.
Informer cache re-lists are expensive — set appropriate resync periods and watch cache sizes.
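The "rollout status, not sleep" advice can be sketched as a small wrapper. The function name, argument order, and default timeout are assumptions chosen for illustration; only `kubectl apply` and `kubectl rollout status` are real commands.

```shell
#!/bin/bash
# Hypothetical helper: apply a manifest, then block until the Deployment
# has converged instead of sleeping for a guessed number of seconds.
apply_and_wait() {
  local manifest=$1 deployment=$2 namespace=${3:-default} timeout=${4:-120s}
  kubectl apply -f "$manifest" -n "$namespace" || return 1
  # rollout status exits non-zero if the rollout fails or the timeout expires
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
}
```

For example, `apply_and_wait deployment.yaml web-app` returns only once the new ReplicaSet is fully rolled out, so a pipeline step that follows it sees the converged state.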
Key Takeaway
Kubernetes is a set of independent control loops, not a monolithic program.
This decoupling makes it resilient but introduces eventual consistency.
Shift your debugging mindset: not 'what command failed?' but 'what state is the system converging toward, and what is blocking it?'
The informer cache is the Achilles' heel — watch for re-list storms.
Understanding Control Loop Behavior
If: Resource created but no controller reacts
Use: Check if the corresponding controller is running and has leader election enabled. Look for controller logs.
If: Resource stuck in intermediate state (e.g., Terminating)
Use: Check finalizers. The controller handling that finalizer may be down. Remove the finalizer manually only after verifying.
If: Namespace deletion hangs
Use: Check for slow or failing admission webhooks. Inspect the namespace's finalizers and force remove if safe.

Control Plane: The Cluster's Brain

The Control Plane makes global decisions (e.g., scheduling) and detects and responds to cluster events. It consists of the API Server, etcd, Scheduler, and Controller Manager. In production, it is almost always replicated across multiple nodes for high availability.

The API Server (kube-apiserver) is the front end—all communication, whether from kubectl, controllers, or nodes, goes through it. It validates and persists resources to etcd, and exposes a watch API that components use to detect changes. The Scheduler (kube-scheduler) watches for unscheduled Pods and assigns them to nodes. The Controller Manager (kube-controller-manager) bundles multiple controllers: Node Controller, ReplicaSet Controller, Endpoint Controller, etc. Each runs as a separate loop but shares the same binary.

One detail that bites teams: the API Server's --max-mutating-requests-inflight and --max-requests-inflight defaults are 200 and 400 respectively. If you run a large cluster with aggressive automation, you'll hit these limits and calls start queuing or getting rejected with HTTP 429. Monitor apiserver_request_total and apiserver_current_inflight_requests early, before your CI/CD pipeline starts timing out.

Another subtlety: leader election among controller manager replicas is coordinated through a Lease object in kube-system (older releases used annotations on Endpoints or ConfigMaps). If the leader dies unexpectedly, it takes about 15 seconds for a new leader to take over (governed by lease duration and renew deadline). During that window, every controller in that binary stops reconciling. For example, the node controller stops evicting pods from unreachable nodes, which can cause service disruption. Know your lease parameters.
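As a rough back-of-the-envelope aid, the handoff window can be estimated from the leader election parameters. This sketch assumes the common defaults of a 15-second lease duration and a 2-second retry period, and assumes the worst case is roughly "lease expires, then the standby's next retry acquires it"; treat the result as an estimate, not a guarantee.

```shell
#!/bin/bash
# Rough worst-case leader handoff estimate (illustrative assumption):
# a standby can only take over after the old lease has gone unrenewed for
# lease_duration, and it checks once every retry_period.
failover_window_s() {
  local lease_duration_s=${1:-15} retry_period_s=${2:-2}
  echo $(( lease_duration_s + retry_period_s ))
}
```

With the assumed defaults, `failover_window_s` prints `17`; pass your own values, for example `failover_window_s 30 5`, if you have changed the lease settings.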

Another often overlooked component: the cloud-controller-manager. If you're on AWS, GCP, or Azure, this controller interacts with the cloud provider's API to manage load balancers, routes, and nodes. A misconfigured cloud-controller-manager can prevent nodes from joining the cluster even though the API Server is healthy. Always check its logs when nodes fail to register.

One more: the API Server's etcd client uses a watch cache. If writes exceed the cache capacity, watch requests get 'too old resource version' errors and clients must re-list. This can cascade into a thundering herd problem. Mitigate by increasing --watch-cache-sizes or reducing watch concurrency. In one incident, a misconfigured monitoring system created too many watches, causing all controllers to re-sync every few minutes, spiking API Server CPU to 100%.

And here's a trap with admission webhooks: they run before the request reaches etcd. A slow webhook blocks the entire request pipeline. We've seen a single webhook that took 5 seconds to respond because it called an external service that was throttled. That 5-second delay was added to every write to that resource type. The fix was to add a circuit breaker and a timeout at the webhook level. Monitor apiserver_admission_webhook_admission_duration_seconds — if the 99th percentile exceeds 1 second, you have a problem.

io/thecodeforge/kubernetes/check_control_plane.sh (Bash)
#!/bin/bash
# Check health of core control plane components

# 1. API Server health (verbose)
kubectl get --raw='/healthz?verbose'

# 2. etcd member list (run on a control plane node)
etcdctl member list -w table

# 3. Leaders for scheduler and controller manager
# (current releases record the leader in a coordination.k8s.io Lease, not an Endpoints annotation)
kubectl get lease kube-scheduler -n kube-system -o jsonpath='{.spec.holderIdentity}'
echo
kubectl get lease kube-controller-manager -n kube-system -o jsonpath='{.spec.holderIdentity}'
echo

# 4. Check pod health of control plane components
kubectl get pods -n kube-system | grep -E 'etcd|apiserver|scheduler|controller'
Control Plane Redundancy Trap
Running multiple replicas of the API Server and controller manager is not enough if etcd is a single point of failure. etcd must have its own high-availability setup (typically 3 or 5 members on separate nodes). Without it, the whole control plane collapses if the single etcd instance fails.
Production Insight
etcd is the single point of failure. Monitor its fsync latency; if it drifts above 10ms, expect leader elections.
Admission webhooks silently modify resources — audit in staging, not in production.
Cloud-controller-manager failure prevents node registration — check its logs.
Watch cache exhaustion causes cascading re-lists — monitor apiserver_watch_cache_size.
Admission webhook latency above 1s at p99 will stall the control plane — instrument and timeout.
Key Takeaway
The Control Plane's reliability is the cluster's reliability.
etcd is the foundation; without it performing well, no amount of API Server replicas will save you.
Always monitor etcd fsync latency and have a tested backup restore procedure.
Admission webhooks are a common bottleneck — profile them before they become a crisis.
Control Plane Health Triage
If: API Server responds but etcd is unreachable
Use: Check etcd nodes' connectivity and disk latency. Restart etcd if a leader election is stuck.
If: Scheduler logs show no events but pods are pending
Use: Check if scheduler leader is healthy. Restart kube-scheduler pod and verify it can connect to API Server.
If: Controller manager fails to reconcile resources like ReplicaSets
Use: Check the leader election Lease (kubectl get lease -n kube-system). A holderIdentity that keeps flipping between replicas is a sign of leader contention.
If: Nodes fail to join the cluster
Use: Check cloud-controller-manager logs for API errors. Verify cloud provider credentials and permissions.

Component Interaction Flow: From kubectl to Running Pod

When you run kubectl apply -f deployment.yaml, a chain of events propagates through the control plane to the target node. Understanding this flow is essential for debugging latency and identifying where failures occur.

The sequence starts with kubectl sending a REST POST request to the API Server's /apis/apps/v1/namespaces/default/deployments endpoint. The API Server authenticates the request (via TLS certificates or bearer tokens), authorizes it against RBAC, then passes the object through a set of admission controllers (mutating and validating webhooks). If all passes, the API Server persists the Deployment object into etcd using a Raft write.

Once the write is committed, the API Server's watch mechanism notifies the Deployment controller (part of kube-controller-manager). The Deployment controller sees the new object and creates a ReplicaSet. This in turn triggers the ReplicaSet controller to create a Pod object. The Scheduler, watching for unscheduled Pods, picks a suitable node and updates the Pod with the node binding.

The API Server persists the binding and notifies the target node's kubelet via its watch. The kubelet receives the Pod spec and begins execution: it pulls the container image (if not cached), starts the container via the CRI, mounts volumes, configures networking via CNI, and runs startup/liveness probes. At the same time, kube-proxy updates iptables rules to route Service traffic to the new pod.

The entire process typically takes 2–10 seconds for a small deployment, but can stretch to minutes with large images or slow webhooks. Each step introduces latency variables: admission webhook latency, etcd write latency, scheduler queue delay, image pull time, and CNI configuration time.

A common production pitfall: a slow admission webhook (e.g., 2 seconds per request) adds 2 seconds to every resource creation. If 100 pod creations during a rollout pass through it one after another, that's 200 seconds of additional delay; concurrent requests shrink the wall-clock total but still tie up API Server capacity. Monitor apiserver_admission_webhook_admission_duration_seconds to catch this.
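The webhook arithmetic is easy to sanity-check with a small helper. This is a simplified model with hypothetical names: it assumes every request pays the full webhook latency and that requests are processed in equal-sized concurrent batches.

```shell
#!/bin/bash
# Simplified model (assumption): requests go through the webhook in batches
# of `concurrency`, and each batch costs one full webhook latency.
webhook_delay_s() {
  local requests=$1 latency_s=$2 concurrency=${3:-1}
  local batches=$(( (requests + concurrency - 1) / concurrency ))  # ceiling division
  echo $(( batches * latency_s ))
}
```

`webhook_delay_s 100 2` prints `200`, the fully serialized case from the text, while `webhook_delay_s 100 2 10` prints `20` for ten creations in flight at a time.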

The diagram below visualizes the interaction sequence between components using a simplified mermaid sequence diagram.

Trace the Full Flow with Audit Logs
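Enable API Server audit logging (set --audit-policy-file and --audit-log-path on kube-apiserver) and inspect the stage timestamps. The gap between requestReceivedTimestamp and stageTimestamp gives per-request latency, which you can correlate with admission webhook and etcd metrics to pinpoint the slowest step in the pipeline.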
Production Insight
Every component in the flow is a potential bottleneck. The most common are admission webhook latency (which slows the API Server write path) and image pull time (on the kubelet). Use startup probes to decouple readiness from slow starts. Monitor scheduler queue depth and etcd fsync latency as leading indicators of slowdowns.
Key Takeaway
The kubectl apply flow is a multi-hop chain through control plane components. The slowest link determines the end-to-end latency. Profile each step in production using metrics and audit logs to identify bottlenecks.

Nodes: The Worker Machines

A Node is a worker machine (VM or physical) where containers are run. Each node contains the services necessary to run Pods: the kubelet, the container runtime (e.g., containerd), and the kube-proxy.

The kubelet is the node's primary agent. It receives PodSpecs from the API Server (for pods assigned to its node) and ensures the described containers are running. It does this by talking to the container runtime via the Container Runtime Interface (CRI). The kubelet also runs liveness and readiness probes, mounts volumes, and reports node conditions like DiskPressure, MemoryPressure, and PIDPressure to the API Server.

Kube-proxy (runs as a DaemonSet) maintains network rules on the node. It watches Services and EndpointSlices and updates iptables or IPVS rules so traffic to a Service's ClusterIP is load‑balanced to the actual pods.

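Here's what most people miss: image pulls have a progress deadline. The kubelet's --image-pull-progress-deadline (default 1 minute) only ever applied to the Docker runtime and was removed together with dockershim in Kubernetes 1.24; with containerd or CRI-O the equivalent timeout is configured in the runtime itself. If your image is large or the registry is slow, the pull is cancelled and retried, creating a cycle that leaves pods in ImagePullBackOff. Raise the timeout or use image streaming in production.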

Also, the kubelet's eviction logic uses soft and hard eviction thresholds. Hard eviction triggers immediate pod killing when exceeded, while soft eviction has a grace period. By default, evictionHard.memory.available is 100MiB — that's practically zero. Set it to 10% of node memory for predictable behavior.

One more thing: the kubelet's node status updates are sent to the API Server periodically (default 10 seconds). If the API Server is under high load or network is congested, the node may appear NotReady even though it's healthy. This is called 'node flapping' and is often a symptom of control plane load rather than node failure. Tune the node-status-update-frequency and node-monitor-grace-period accordingly.

Another detail: the kubelet's --max-pods default is 110, but that's a hard count, not a resource limit. A node may have free CPU/memory but hit this limit. In clusters running many sidecars, you can exhaust the pod slot quickly. Use --max-pods or the scheduling plugin to enforce a more appropriate cap based on your workload density.

And let's talk about system reserved resources. If you don't configure --system-reserved and --kube-reserved, the kubelet assumes all node resources are available for pods. But the operating system and the kubelet itself consume some. Without proper reservations, pods can steal resources from system daemons, leading to SSH timeouts or node instability. Always set --system-reserved=cpu=500m,memory=1Gi (adjust per node size) and enable eviction thresholds.
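The effect of reservations on schedulable memory follows the kubelet's documented relationship: allocatable = capacity minus system-reserved, kube-reserved, and the hard eviction threshold. The sketch below is a hypothetical helper that works in MiB; the example node size and reservation values are assumptions.

```shell
#!/bin/bash
# Allocatable memory in MiB (illustrative helper):
# capacity - system-reserved - kube-reserved - hard eviction threshold.
allocatable_mi() {
  local capacity_mi=$1 system_reserved_mi=${2:-0} kube_reserved_mi=${3:-0} eviction_hard_mi=${4:-100}
  echo $(( capacity_mi - system_reserved_mi - kube_reserved_mi - eviction_hard_mi ))
}
```

For a 16 GiB node, `allocatable_mi 16384 1024 512 100` prints `14748`. Raising the eviction threshold to about 10% of node memory, `allocatable_mi 16384 1024 512 1638`, prints `13210`, which shows how much schedulable memory the safer threshold costs.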

io/thecodeforge/kubernetes/node_debug.sh (Bash)
#!/bin/bash
# Deep inspection of a Node's status

NODE_NAME="worker-node-1"

# 1. Node conditions and allocatable resources
kubectl describe node $NODE_NAME | grep -A 15 Conditions
echo "---"
kubectl describe node $NODE_NAME | grep -A 5 Allocatable

# 2. Check kubelet logs (run on the node)
journalctl -u kubelet --since "1 hour ago" --no-pager | grep -i "error\|fail"

# 3. Check container runtime (containerd example)
sudo crictl info | jq '.config.systemdCgroup'
sudo crictl ps

# 4. Check kube-proxy iptables rules (run on node; Service chains live in the nat table)
sudo iptables -t nat -L -n | grep -E "KUBE-SERVICES|KUBE-SVC"
The kubelet: Node-Level Controller
  • Runs liveness and readiness probes.
  • Mounts volumes specified in the PodSpec.
  • Reports node conditions (MemoryPressure, DiskPressure).
  • Manages cgroups for resource isolation.
Production Insight
The kubelet's --max-pods (default 110) is a hard cap — new pods won't schedule even if CPU/RAM is free.
Resource reservations subtract from node capacity — misconfiguring causes overcommit.
Set evictionHard.memory.available explicitly; default 100MiB leads to silent OOM kills.
Node flapping is often API Server load, not node health — check control plane first.
System reserved resources protect node stability — define them or risk SSH failures.
Key Takeaway
Nodes are intelligent agents enforcing local state — not dumb workers.
Node failures are often local (disk pressure, kubelet crash, runtime hang) — debug with journalctl and crictl, not just kubectl.
Understand the kubelet's eviction thresholds and resource reservations to avoid pod evictions at scale.
System reserved resources are not optional — configure them early.
Node Not Ready Diagnosis
If: Node condition shows DiskPressure
Use: Free up disk space: remove unused images, increase node disk size, or implement image garbage collection thresholds.
If: Node condition shows MemoryPressure
Use: Evict low-priority pods or reduce memory limits. Check node allocatable memory.
If: Node condition shows PIDPressure
Use: Increase pids limit on the node or reduce the number of running containers.
If: Node condition shows NetworkUnavailable
Use: Check CNI plugin status. Restart CNI daemon and verify network interface configuration.
If: Node flapping between Ready and NotReady
Use: Tune node-status-update-frequency and raise node-monitor-grace-period so a few missed updates do not mark the node NotReady. Check API Server load.

Worker Node Component Reference Table

The following table provides a quick reference for the core components running on every worker node. This is useful when triaging node-level issues or validating node configurations during cluster upgrades.

| Component | Description | Default Port | Log Location | Common Failure Modes |
|---|---|---|---|---|
| kubelet | Primary node agent; manages pods and reports node status | 10250 (kubelet API), 10255 (read-only) | journalctl -u kubelet | OOM due to missing limits, disk pressure, stalled Docker socket (legacy) |
| kube-proxy | Network proxy; maintains iptables/IPVS rules for Services | 10249 (metrics) | journalctl -u kube-proxy | iptables corruption (large clusters), IPVS mode fallback |
| containerd | Container runtime (default) | 10010 (CRI) | journalctl -u containerd, crictl logs | Image pull timeout, storage driver issues (overlay2), dead containerd socket |
| CRI-O | Alternative container runtime (Red Hat) | 10010 (CRI) | journalctl -u crio | Image pull timeout, conmon OOM, conmon vs runc mismatch |
| Calico (cni-calico) | CNI plugin providing network policies and BGP routing | 9099 (felix metrics) | /var/log/calico/cni/ | IPAM exhaustion, BGP peer failure, policy misconfiguration |
| Flannel (cni-flannel) | CNI plugin for simple overlay networking | | journalctl -u flanneld | VXLAN MTU mismatch, subnet lease conflicts, no network policy support |
| Cilium (cni-cilium) | eBPF-based CNI with advanced observability | 9090 (cilium-agent metrics), 9961 (hubble) | cilium status or hubble observe | Kernel version < 5.10, eBPF feature gaps, conflicting NetworkPolicies |

Key insight: keep the container runtime the same across all nodes. Mixed runtimes (containerd on some, CRI-O on others) work but introduce subtle differences in CRI implementation, so always test in a staging cluster before rolling out.

Another common pitfall: the kubelet's --container-runtime-endpoint defaults to /run/containerd/containerd.sock. If containerd is restarted and the socket disappears temporarily, kubelet will fail to start new pods. Use a systemd socket activation or a health check that waits for the socket to appear before starting the kubelet.

Also note that kube-proxy in IPVS mode (set --proxy-mode=ipvs) scales better for large clusters but requires the IPVS kernel modules (ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh) to be loaded on the node; ipvsadm is only the userspace tool for inspecting them. Without the modules, kube-proxy falls back to iptables mode, which can be a surprise if you've tuned for IPVS performance.
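A quick way to catch the silent fallback is to check which IPVS modules are missing before kube-proxy starts. This is a minimal sketch: the function name is hypothetical, and the module list (ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh) is an assumption that may need extending for your kernel or scheduler choice. Modules compiled into the kernel do not show up in lsmod, so a "missing" result is a hint, not proof.

```shell
#!/bin/bash
# Reads `lsmod` output on stdin and prints each expected IPVS module
# that is not listed. Prints nothing when all are present.
missing_ipvs_modules() {
  local loaded mod
  local required="ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh"   # assumed module set
  loaded=$(awk '{print $1}')        # first column of lsmod is the module name
  for mod in $required; do
    grep -qx "$mod" <<<"$loaded" || echo "$mod"
  done
}
```

On a node, run `lsmod | missing_ipvs_modules`; any output names a module to load (for example with modprobe) before relying on IPVS mode.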

io/thecodeforge/kubernetes/check_node_components.sh (Bash)
#!/bin/bash
# Verify node component versions

# kubelet version
kubelet --version 2>/dev/null || journalctl -u kubelet | grep 'Version'

# container runtime version
crictl version

# kube-proxy version (from pod)
kubectl get pod -n kube-system -l k8s-app=kube-proxy -o jsonpath='{.items[0].spec.containers[0].image}'

# CNI version (calico example)
kubectl get daemonset -n kube-system calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'

# Loaded kernel modules for IPVS
lsmod | grep -E 'ip_vs|ip_vs_rr|ip_vs_wrr|ip_vs_sh'
Standardize Node Images
  • Pin kubelet, kube-proxy, and container runtime versions to cluster release.
  • Pre-pull images used by core components (CNI, kube-proxy) during node bootstrapping.
  • Validate node components weekly against a baseline – any mismatch should trigger an alert.
Production Insight
Standardize node images with pinned component versions to avoid silent drift-related failures.
Container runtime socket availability is a startup dependency – implement wait-for-socket logic in systemd.
kube-proxy in IPVS mode requires kernel modules – verify on each node after kernel upgrades.
CNI plugin IPAM exhaustion is a leading cause of new pods stuck in ContainerCreating – monitor IP pool usage per node.
Mixed container runtimes between nodes work but create subtle CRI differences – test combinations before production rollout.
Key Takeaway
Worker nodes host multiple cooperating system components: kubelet, kube-proxy, container runtime, and CNI plugin. Each has specific port, log, and failure mode characteristics. Use this reference table to accelerate node-level troubleshooting and ensure consistent configurations across the cluster.

etcd: The Cluster's Source of Truth

etcd is a distributed, consistent key-value store that powers Kubernetes. It uses the Raft consensus algorithm to achieve strong consistency across a cluster of members (typically 3 or 5). All cluster state—Pods, ConfigMaps, Secrets, Deployments, RBAC policy—is stored in etcd. The API Server is the only component that writes to etcd; all other components watch the API Server for changes.

The key insight: etcd is a replicated log, not a traditional database. Every write is appended to a log and only committed when a majority of members (quorum) acknowledge it. This design ensures strong consistency but makes performance heavily dependent on disk I/O latency for fsync operations. If the leader's disk, or the disks of enough followers to block a quorum, are slow, the entire cluster's write throughput suffers.
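The majority rule is simple enough to compute directly. The two helper names below are illustrative; the arithmetic is the standard Raft majority: quorum is floor(n/2) + 1, and the number of members you can lose is n minus quorum.

```shell
#!/bin/bash
# Quorum size for an n-member cluster: a strict majority.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# How many members can fail while the cluster still commits writes.
failure_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }
```

`failure_tolerance 3` prints `1` and `failure_tolerance 5` prints `2`, while `failure_tolerance 4` still prints `1`: an even member count adds fsync and network cost without adding fault tolerance, which is why 3 or 5 members is the usual choice.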

Another thing: etcd's default --snapshot-count is 100,000. After that many changes, etcd takes a snapshot and compacts the log. On slow disks, this snapshot can spike latency and cause temporary leader election issues. Tune this value down (e.g., 50000) on clusters with frequent writes.

Less obvious: etcd's database file grows even after compaction because old data is freed but not returned to the OS. You must run etcdctl defrag periodically to reclaim space. Skipping this leads to quota limit errors (mvcc: database space exceeded) that lock the cluster.
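To decide whether a defrag is worth scheduling, compare the on-disk database size with the space actually in use. The helper below is a sketch with a hypothetical name; it only does the arithmetic and expects the two byte counts as arguments. We assume they come from the dbSize and dbSizeInUse fields that recent etcdctl versions report in `etcdctl endpoint status -w json`; verify the field names against your version.

```shell
#!/bin/bash
# Percentage of the etcd database file that defragmentation could reclaim.
# Arguments: total on-disk size in bytes, size in use in bytes.
reclaimable_pct() {
  local db_size=$1 in_use=$2
  if (( db_size <= 0 )); then
    echo 0
    return
  fi
  echo $(( (db_size - in_use) * 100 / db_size ))
}
```

For example, `reclaimable_pct 8000000000 2000000000` prints `75`, meaning three quarters of the file is free pages that a defrag would return to the OS.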

Also, consider the impact of network latency between etcd members. Raft heartbeats are sent every 100ms by default. If round-trip time exceeds 50ms, you risk false leader elections. In multi-datacenter setups, place etcd members close together or tune heartbeat-interval to account for latency.
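A starting point for the two timers can be derived from measured round-trip time. This sketch follows the commonly cited etcd tuning guidance, stated here as an assumption: set the heartbeat interval near the peak RTT between members (never below the 100ms default) and the election timeout to about ten times the heartbeat. The function name is hypothetical; --heartbeat-interval and --election-timeout are the etcd flags, both in milliseconds.

```shell
#!/bin/bash
# Suggest etcd Raft timing flags from the peak member-to-member RTT (ms).
suggest_raft_timing() {
  local rtt_ms=$1
  local heartbeat_ms=$(( rtt_ms > 100 ? rtt_ms : 100 ))  # keep the 100ms default as a floor
  local election_ms=$(( heartbeat_ms * 10 ))             # about 10x heartbeat
  echo "--heartbeat-interval=${heartbeat_ms} --election-timeout=${election_ms}"
}
```

`suggest_raft_timing 20` prints the defaults, `--heartbeat-interval=100 --election-timeout=1000`, while `suggest_raft_timing 150` prints `--heartbeat-interval=150 --election-timeout=1500`. Treat the output as a first guess to validate under load, and apply the same values on every member.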

A production scenario: during a large-scale cluster upgrade, etcd writes spike as many resources are updated simultaneously. If you haven't tuned --snapshot-count or --quota-backend-bytes, snapshotting and compaction can cause fsync storms. One team saw a 5-second write latency every 2 minutes during an upgrade, causing repeated leader elections. The fix: increase --quota-backend-bytes to 16GB and set --auto-compaction-retention=1h to avoid bursts. Note that etcd's own documentation suggests 8GB as the practical maximum for the backend quota, so test anything larger carefully.

And the silent killer: clock skew. Raft relies on election timeouts that are based on monotonic clocks. If two etcd nodes have clock drift greater than the election timeout, they may both think the leader has timed out and start new elections, causing a split that can lead to quorum loss. Use chrony or ntpd with reliable upstream time sources and monitor clock offset across etcd members.

io/thecodeforge/kubernetes/etcd_health.sh (Bash)
#!/bin/bash
# etcd health and performance checks

# 1. Endpoint health
etcdctl endpoint health --cluster -w table

# 2. Check member list and leader
etcdctl endpoint status --cluster -w table

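# 3. Measure fsync duration (run on etcd node)
# Using etcd's built-in metrics endpoint. TLS-secured clusters need --cacert/--cert/--key
# on curl, or the plain-HTTP metrics listener if one is configured.
# Grep the whole histogram (buckets, sum, count), not just the running sum.
curl -s http://localhost:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds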

# 4. Database size and quota
sudo du -sh /var/lib/etcd/member/snap/db
echo "quota-backend-bytes: $(ps aux | grep etcd | grep -oP 'quota-backend-bytes=\K[^ ]+')"

# 5. Backup snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db
Quorum Loss is Catastrophic
If you lose a majority of etcd members (e.g., 2 out of 3), the cluster cannot commit new writes. The API Server can no longer persist changes, so no pods can be created, updated, or deleted. If the lost members cannot be brought back, the only recovery is to restore from a snapshot, which may lose minutes of state. Always run at least 3 etcd members on separate hosts and monitor clock skew and disk latency.
Production Insight
Tune heartbeat-interval and election-timeout based on network RTT; defaults cause false leader elections in high-latency environments.
etcd defragmentation is necessary but blocking — schedule it during maintenance windows.
Never deploy etcd on shared nodes; a burst in I/O from other workloads can cause fsync latency spikes that trigger leader elections.
Network latency between etcd members must be under 50ms to avoid false elections.
Snapshot compaction during upgrades can cause fsync storms — pre-tune quota and compaction retention.
Clock skew between etcd members can destabilize leader elections; monitor with NTP and keep drift under 10ms.
Key Takeaway
etcd's disk I/O, not CPU or memory, determines cluster performance and stability.
Run it on dedicated SSDs, monitor fsync latency, backup snapshots regularly.
Learn to restore from backup before you need it — quorum loss is catastrophic.
Clock skew is an overlooked source of election storms — enforce NTP.
etcd Troubleshooting
If: etcd member unreachable
Use: Check network connectivity, process status, and disk latency. If member is dead and cannot be recovered, remove it and re-add a new member.
If: etcd database size > 8GB or approaching quota-backend-bytes
Use: Run defragmentation. If the database is growing abnormally, check for excessive event history or leaked resources (e.g. too many ConfigMaps).
If: Leader election storms (frequent leader changes)
Use: Check disk latency with iostat and network latency between etcd members. Also check clock skew. Increase heartbeat-interval and election-timeout proportionally.

The Scheduler: How Pods Are Assigned to Nodes

The Kubernetes scheduler (kube-scheduler) is responsible for assigning unscheduled Pods to appropriate Nodes. It does this via a two-phase pipeline: Filtering (Predicates) and Scoring (Priorities).

In the filtering phase, the scheduler selects nodes that meet the pod's hard constraints: resource requests, nodeSelector, node affinity, taints/tolerations, topology constraints (e.g., pod anti-affinity). Node conditions like DiskPressure or MemoryPressure also cause the node to be filtered out.

In the scoring phase, the scheduler ranks feasible nodes based on priority functions: spread pods across zones, minimize resource fragmentation (balanced allocation), cluster autoscaler preferences, and user-defined custom scores. The node with the highest score gets the pod.

If no node passes the filtering phase, the pod remains Pending. The scheduler emits events that can be inspected via kubectl describe pod or scheduler logs.
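The two phases can be illustrated with a deliberately tiny model. Everything here is a toy assumption: nodes are described by free CPU (millicores) and free memory (MiB), the filter is a plain resource fit, and the score is simply the headroom left after placement. The real scheduler runs many filter and score plugins with weighted, normalized scores.

```shell
#!/bin/bash
# Toy filter-then-score pass (hypothetical helper).
# stdin:  one node per line: "<name> <free_cpu_millicores> <free_mem_mi>"
# args:   the pod's CPU request (millicores) and memory request (MiB)
# stdout: the chosen node, or "Pending" if no node passes the filter
pick_node() {
  local req_cpu=$1 req_mem=$2
  awk -v c="$req_cpu" -v m="$req_mem" '
    $2 >= c && $3 >= m {                  # filter: node must fit the request
      score = ($2 - c) + ($3 - m)         # score: leftover headroom (toy metric)
      if (best == "" || score > best_score) { best = $1; best_score = score }
    }
    END { print (best == "" ? "Pending" : best) }
  '
}
```

Feeding it three nodes, `printf 'a 500 1024\nb 2000 4096\nc 100 8192\n' | pick_node 250 512`, prints `b`: node c is filtered out for lack of CPU, and b outscores a. Asking for 4000 millicores prints `Pending`, mirroring a pod that no node can host.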

A lesser-known nuance: the scheduler does not re-evaluate decisions for already scheduled pods. If a node becomes overcommitted after scheduling, the pod stays there—it won't be rescheduled. That's the kubelet's job (eviction), not the scheduler's.

Another performance detail: the scheduler's --kube-api-qps defaults to 50. If you have hundreds of nodes and thousands of pods being created rapidly (e.g., during a scaling event), the scheduler may fall behind. Increase this value but watch API Server load.

Another subtlety: the scheduler uses a scheduler cache of node information to avoid hitting the API Server for every pod. But this cache can become stale if nodes update frequently. In extreme cases, the scheduler may try to place a pod on a node that no longer exists or has changed. This manifests as a pod stuck in 'PodScheduled' condition with the event 'node does not exist'. The scheduler eventually retries and the cache refreshes, but you can force this by restarting the scheduler pod.

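One more: the scheduler's --percentage-of-nodes-to-score (percentageOfNodesToScore in the scheduler configuration) is adaptive when left unset. Clusters with fewer than 100 nodes are always evaluated in full. Above that, the scheduler stops searching once it has found enough feasible nodes: roughly 50% of a 100-node cluster, falling linearly to 10% at 5,000 nodes, with a floor of 5%. This is a performance optimization, but it can lead to suboptimal placements if the sampled subset doesn't include the best node. In latency-sensitive deployments, consider setting it to 100 so every feasible node is scored, at the cost of scheduling latency.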
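
In current Kubernetes releases the default for this setting is adaptive rather than a fixed number. The sketch below approximates how many feasible nodes the scheduler looks for before it starts scoring. It follows the formula as I understand it from the upstream scheduler documentation, so treat the constants (100-node minimum, 50 minus nodes/125, 5% floor) as assumptions to verify against your version; `nodes_to_score` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Approximate the number of feasible nodes kube-scheduler searches for.
# Usage: nodes_to_score TOTAL_NODES [PERCENTAGE]
#   PERCENTAGE 0 (or omitted) means "use the adaptive default".
nodes_to_score() {
  total=$1
  pct=${2:-0}
  # Small clusters are always evaluated in full.
  if [ "$total" -lt 100 ]; then
    echo "$total"
    return
  fi
  # Adaptive default: 50% at 100 nodes, shrinking linearly, floor of 5%.
  if [ "$pct" -eq 0 ]; then
    pct=$((50 - total / 125))
    if [ "$pct" -lt 5 ]; then
      pct=5
    fi
  fi
  n=$((total * pct / 100))
  # Never search for fewer than 100 feasible nodes.
  if [ "$n" -lt 100 ]; then
    n=100
  fi
  echo "$n"
}

nodes_to_score 1000       # adaptive default
nodes_to_score 1000 100   # score every feasible node
```

On a 1,000-node cluster the adaptive default stops at 420 feasible nodes, which is why a "best" node outside that sample can be missed.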

And let's talk about pod priority and preemption. If you have pods with different priority classes, the scheduler can preempt lower-priority pods to make room for higher-priority ones. But preemption is not instant: victims are terminated gracefully, so it can take as long as their terminationGracePeriodSeconds (30 seconds by default) before the kubelet frees the resources. During that window, the high-priority pod remains Pending. If your application requires rapid recovery, design your priority classes and grace periods with realistic timeout expectations.

io/thecodeforge/kubernetes/scheduler_debug.sh · BASH
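#!/bin/bash
# Debugging scheduler behaviour

# 1. Check scheduler logs.
#    kubeadm runs kube-scheduler as a static pod labelled component=kube-scheduler;
#    adjust the selector if your distribution deploys it differently.
kubectl logs -n kube-system -l component=kube-scheduler --tail=50

# 2. Check a pending pod's scheduling events
POD_NAME="my-pod"
kubectl describe pod "$POD_NAME" | grep -A 10 Events

# 3. Show nodes with allocatable resources and taints
kubectl get nodes -o custom-columns='NAME:.metadata.name,ALLOCATABLE_CPU:.status.allocatable.cpu,ALLOCATABLE_MEM:.status.allocatable.memory,TAINTS:.spec.taints[*].key'

# 4. Check scheduler queue metrics.
#    These are served by kube-scheduler itself (secure port 10259), not by the
#    API Server's /metrics endpoint. Run this on a control-plane node; TOKEN must
#    hold a bearer token that is authorized to read /metrics.
curl -sk -H "Authorization: Bearer ${TOKEN}" https://127.0.0.1:10259/metrics \
  | grep -E 'scheduler_pending_pods|scheduler_queue_incoming_pods_total'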
Scheduler queue depth is a leading indicator
If the scheduler queue backs up, it often means the scheduler cannot keep up with pod creation rates, or filtering/scoring is slow due to high node counts. Monitor scheduler_pending_pods and scheduler_queue_incoming_pods_total, and adjust --kube-api-qps or tune the scoring plugins.
Production Insight
Default zone spreading is only a soft scoring preference, so in multi-zone clusters pods can still pile into one zone unless you add explicit spread constraints.
Resource fragmentation occurs when requests don't match limits — overcommitted nodes cause CPU throttling.
podAntiAffinity with requiredDuringScheduling can render a cluster unschedulable — prefer preferredDuringScheduling.
Scheduler cache can become stale — restart scheduler pod if pods are stuck on non-existent nodes.
--percentage-of-nodes-to-score default can skip optimal nodes — set to 100% for latency-sensitive workloads.
Pod preemption can take 30+ seconds — design priority classes with realistic expectations.
Key Takeaway
The scheduler filters then scores — if pods stay Pending, the filter phase is failing. Check kubectl describe pod for the reason.
In large clusters, tune --percentage-of-nodes-to-score to reduce scheduling latency.
Always use topologySpreadConstraints to optimize cost and resilience across zones.
Pod preemption is not instant — factor in eviction grace periods.
Pod Stuck in Pending
If: No nodes match resource requests
Use: Check kubectl describe pod for insufficient CPU/memory errors. Either increase cluster capacity or reduce requests.
If: Pod has node selector or affinity that no node satisfies
Use: Verify node labels. If using availability zones, check zonal capacity.
If: Taints and tolerations mismatch
Use: Check node taints and pod tolerations. Adjust tolerations or remove taint if appropriate.
If: All nodes have pod anti-affinity conflicts
Use: Review anti-affinity rules. Reduce requiredDuringScheduling to preferredDuringScheduling if possible.
If: Pod is stuck with 'node does not exist' event
Use: Restart the scheduler pod to refresh its cache. Check for node lifecycle events that may have caused stale data.

Kubernetes Networking: CNI, Services, and kube-proxy

Kubernetes assumes a flat network where every Pod can communicate with every other Pod without NAT, across nodes. This is achieved through the Container Network Interface (CNI), a plugin-based layer that configures network interfaces and routes on each node.

Each pod gets its own IP address (IP-per-pod model). CNI plugins like Calico, Flannel, Weave, or Cilium set up virtual interfaces and routing rules to enable cross-node pod-to-pod communication. Services abstract pod IPs and provide stable virtual IPs (ClusterIP) for pod discovery. kube-proxy watches Services and EndpointSlices and programs iptables or IPVS rules to forward traffic to the correct pods.

DNS resolution: Kubernetes DNS (CoreDNS) serves A/AAAA records for Services. Pods can resolve service names to ClusterIPs, enabling simple service discovery.

One common misconfiguration: kube-proxy --cluster-cidr must match your pod CIDR range. If they differ, kube-proxy may misclassify pod traffic as external and apply the wrong NAT rules. Also, when using Calico with NetworkPolicy, remember the default-allow baseline: a pod accepts all traffic until some policy selects it, and from then on only explicitly allowed traffic in that direction gets through. Calico follows the same semantics as Kubernetes NetworkPolicy here, so a single narrowly scoped policy can unexpectedly cut off everything else.

Choosing the right CNI matters: Calico offers rich network policies and BGP-based routing, Flannel is simpler but lacks policy support, Cilium uses eBPF for high performance. Evaluate based on your scale and security requirements.

Also, when using Cilium with eBPF, the kube-proxy can be completely removed (kube-proxy replacement). This reduces iptables overhead and improves performance at scale. But it requires careful validation of network policies and service routing, as Cilium's implementation may differ from standard kube-proxy in edge cases like externalTrafficPolicy.

A common production pitfall: IP address management (IPAM) exhaustion. If your CNI allocates pod IPs from a fixed CIDR and you have many pods terminating and starting, the IP pool can fragment. Calico uses a block-based approach that mitigates this, but Flannel's default allocation can lead to rapid exhaustion. Monitor IP utilization metrics from your CNI plugin.
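
A quick capacity check makes the exhaustion risk tangible. The sketch below assumes a CNI that carves one fixed-size subnet per node out of the cluster pod CIDR (as with a /24 node lease inside a /16 pod network, a common Flannel setup); both helper names are hypothetical and the numbers are raw address counts, before any reserved addresses are subtracted.

```shell
#!/usr/bin/env bash
# Capacity of a pod CIDR that is split into fixed-size per-node subnets.
# Assumption: one subnet per node, IPv4 only.

# How many node subnets fit in the cluster CIDR.
# Usage: subnet_count CLUSTER_PREFIX NODE_PREFIX
subnet_count() {
  cluster_prefix=$1
  node_prefix=$2
  echo $((1 << (node_prefix - cluster_prefix)))
}

# How many raw addresses one node subnet holds.
# Usage: addresses_per_subnet NODE_PREFIX
addresses_per_subnet() {
  node_prefix=$1
  echo $((1 << (32 - node_prefix)))
}

echo "nodes supported:   $(subnet_count 16 24)"
echo "addresses per node: $(addresses_per_subnet 24)"
```

With a /16 pod network and /24 node subnets, the 257th node gets no subnet at all, regardless of how few pods the other nodes run.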

And don't forget about MTU issues. If your CNI's overlay network uses encapsulation (e.g., VXLAN with 50 bytes overhead), and your underlying network has a standard 1500 MTU, the effective MTU for pods is 1450. If your application sends large packets that require fragmentation, you may see degraded performance or timeouts. Set the MTU on your CNI config to account for encapsulation overhead, or use direct-routing mode (e.g., Calico with BGP) to avoid encapsulation.
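
The MTU arithmetic above generalizes to other encapsulations. This is a small sketch, not a CNI tool: `pod_mtu` is a hypothetical helper, and the overhead values are the commonly documented IPv4 figures, which you should confirm against your CNI's documentation.

```shell
#!/usr/bin/env bash
# Pod MTU = underlay MTU - encapsulation overhead.
# Usage: pod_mtu UNDERLAY_MTU ENCAPSULATION
pod_mtu() {
  underlay=$1
  case "$2" in
    none)      overhead=0  ;;  # direct routing, e.g. BGP without an overlay
    ipip)      overhead=20 ;;  # IP-in-IP
    vxlan)     overhead=50 ;;  # VXLAN over IPv4
    wireguard) overhead=60 ;;  # WireGuard over IPv4
    *) echo "unknown encapsulation: $2" >&2; return 1 ;;
  esac
  echo $((underlay - overhead))
}

pod_mtu 1500 vxlan
pod_mtu 9001 ipip
```

To confirm the result on a live path, send a non-fragmentable ping sized to the pod MTU minus 28 bytes of IP and ICMP headers, for example a 1422-byte payload for a 1450 MTU.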

io/thecodeforge/kubernetes/network_debug.sh · BASH
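#!/bin/bash
# Debugging pod networking
# Replace these placeholder values with names from your cluster.
POD_A="pod-a"
POD_B_IP="10.244.1.23"
SERVICE="my-service"

# 1. Check pod IP and node
kubectl get pod "$POD_A" -o wide

# 2. Verify connectivity between pods (the container image must include curl)
kubectl exec "$POD_A" -- curl -m 2 "$POD_B_IP"
# If it fails, check CNI health and NetworkPolicy

# 3. Check service endpoints
kubectl describe service "$SERVICE"
kubectl get endpointslices -l "kubernetes.io/service-name=$SERVICE"

# 4. Check kube-proxy rules on the node (iptables mode)
sudo iptables-save | grep -i "KUBE-SVC" | head -20

# 5. Check DNS resolution inside a pod (the container image must include nslookup)
kubectl exec "$POD_A" -- nslookup kubernetes.default.svc.cluster.local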
NetworkPolicy Gotchas
NetworkPolicies are enforced by the CNI plugin. If you apply a default-deny policy that covers egress (or that restricts ingress to the CoreDNS pods) without allowing DNS, CoreDNS becomes unreachable and resolution of Service names fails. Always allow traffic to CoreDNS (port 53 UDP/TCP) in your default-deny policies.
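
One way to keep DNS working under a default-deny policy is to generate the policy with the DNS exception built in. The sketch below prints such a manifest for a given namespace. It assumes CoreDNS runs in kube-system with the conventional k8s-app=kube-dns label and that namespaces carry the automatic kubernetes.io/metadata.name label; the function name and policy name are made up for illustration.

```shell
#!/usr/bin/env bash
# Print a default-deny NetworkPolicy for one namespace that still allows
# DNS egress (UDP and TCP port 53) to the CoreDNS pods.
# Usage: default_deny_with_dns NAMESPACE
default_deny_with_dns() {
  ns=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-allow-dns
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
}

default_deny_with_dns payments
```

Review the output, then pipe it to kubectl apply -f - once it matches your cluster. If you run NodeLocal DNSCache, the DNS destination differs and the egress rule needs adjusting.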
Production Insight
Most common networking failure is CNI IPAM exhaustion — pods stay in ContainerCreating. Monitor IP pool usage.
iptables-based kube-proxy degrades with thousands of services — switch to IPVS mode for better scalability.
NetworkPolicy misconfigurations often block DNS traffic — always allow CoreDNS (port 53) in deny policies.
Cilium eBPF mode removes kube-proxy but requires validation of external traffic policies.
IPAM fragmentation can exhaust pod CIDR even when total utilization is low — use block-based CNI like Calico.
MTU mismatches in overlay networks cause packet fragmentation — set CNI MTU to account for encapsulation overhead.
Key Takeaway
Kubernetes networking is CNI-driven — the IP-per-pod model simplifies routing but makes troubleshooting complex.
Always check CoreDNS first for DNS issues; it's the most common point of failure.
Never apply a default-deny egress NetworkPolicy without allowing DNS on port 53; every service lookup from the affected pods will fail.
MTU matters — test with large packets to catch fragmentation issues early.
Pod Connectivity Issues
If: Pod cannot reach another pod's IP
Use: Check if they are on the same node: if yes, check CNI bridge/overlay. If different nodes, check node-to-node routing and firewall rules.
If: Pod cannot reach a Service DNS name
Use: Check CoreDNS pod status and logs. Verify Service exists and EndpointSlices are populated. Check NetworkPolicy allowing DNS traffic.
If: External traffic not reaching service
Use: Verify Service type LoadBalancer/NodePort, check cloud load balancer health, and ensure node security groups allow traffic.

CNI Plugins: Calico and Flannel in Kubernetes Architecture

The Container Network Interface (CNI) plugin is a critical architectural component that determines how pods communicate within and across nodes. Two of the most widely used CNI plugins are Calico and Flannel. Each implements the Kubernetes networking model differently, with distinct trade-offs in complexity, security, performance, and scale.

Calico supports a pure Layer 3 approach, routing pod traffic using BGP (Border Gateway Protocol) without needing overlays; where BGP peering is not possible it can fall back to VXLAN or IP-in-IP. In BGP mode there is no encapsulation overhead, and pod-to-pod packets are forwarded at near wire speed. Calico also implements rich network policies using iptables or eBPF, supporting granular ingress/egress rules, namespace isolation, and dynamic policy enforcement. It includes its own IPAM (IP Address Management) using block-based allocation, which reduces fragmentation.

Flannel uses an overlay network (most commonly VXLAN) to encapsulate pod traffic. It is simpler to deploy and requires no BGP infrastructure or complex routing configuration. Flannel provides a flat network where every pod gets a unique IP, but it does not support Kubernetes NetworkPolicy natively—meaning no firewall rules between pods unless you combine it with a separate policy engine like Calico or Cilium. Flannel's IPAM is simpler and can exhaust IPs faster under high pod churn.

When to choose which? For production clusters requiring network policies, multi-tenancy, and high throughput (e.g., financial services, large e-commerce), Calico is the recommended choice. For small dev/test clusters, or teams new to Kubernetes where simplicity is paramount, Flannel suffices. Many production deployments run Calico in BGP mode with eBPF acceleration for best performance.

Common issues with Calico: BGP peer misconfiguration causes routing failures; calico-node pod crash due to missing kernel modules (e.g., ip_tables, nf_conntrack); IPAM pool exhaustion leading to pods stuck in ContainerCreating. For Flannel: VXLAN MTU mismatch (set to 1450 if underlying network MTU is 1500); subnet lease conflicts when nodes are recreated quickly.

Performance consideration: Calico's iptables-based policy enforcement can become a bottleneck at scale (thousands of policies). Switch to eBPF mode (requires Linux kernel >= 5.10) for better throughput. Flannel's VXLAN incurs a 50-byte overhead per packet: either lower the pod MTU to match or, where the network supports jumbo frames, increase the MTU on the host interface to compensate.

The diagram below illustrates how a packet travels from one pod to another across nodes using a Calico BGP route versus a Flannel VXLAN tunnel.

Combine Calico for Policies with Flannel for Simplicity?
It is possible to use Flannel for the overlay network and Calico for network policies (Calico's 'policy-only' mode). This gives you the simplicity of Flannel's IPAM and the security of Calico policies. However, you lose Calico's BGP routing and eBPF performance features. Evaluate whether the complexity trade-off is worth it.
Production Insight
Choose Calico for any cluster requiring network policies, multi-tenancy, or high throughput. Flannel is suitable for small dev/test clusters. If using Calico, monitor IP pool usage and BGP peer health. For Flannel, validate MTU settings and subnet lease timeouts. Always test CNI plugin upgrades in a staging environment before production rollout.
Key Takeaway
Calico and Flannel represent two ends of the CNI spectrum: Calico offers rich policies and high performance via BGP/eBPF, while Flannel prioritizes simplicity via VXLAN overlays. Choose based on your security and scaling requirements, not just familiarity. IPAM exhaustion is a common failure mode for both—monitor early.

API Server: The Gateway to the Cluster

The API Server is the frontend of the control plane and the only component that directly interacts with etcd. Every kubectl command, controller watch, and component communication goes through it. Understanding its request flow is critical for debugging and performance tuning.

The API Server authenticates the request (via client certificates, bearer tokens, or OIDC), authorizes it against RBAC policies, and then passes it through admission controllers (mutating and validating) before persisting to etcd. Admission controllers can modify or reject resources — this is where PodSecurity admission, resource quota enforcement, and custom webhooks run.

What most engineers miss: admission webhooks add latency to every API request. A slow webhook can increase API Server response time from milliseconds to seconds, impacting the entire cluster. Monitor apiserver_admission_webhook_admission_duration_seconds for outliers. Also, the API Server caches responses for watch requests, but the cache size is limited. Large clusters with many watches can cause cache thrashing and increased etcd read load.
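
The webhook duration metric is a histogram, so a quick per-webhook average can be derived by dividing its _sum series by its _count series. The sketch below does that on Prometheus text read from stdin. `webhook_avg_latency` is a hypothetical helper, and it assumes the metric carries a name label identifying the webhook, as it does in the releases I have seen.

```shell
#!/usr/bin/env bash
# Average admission latency (seconds) per webhook.
# Reads Prometheus text format on stdin and prints "WEBHOOK AVERAGE" lines.
webhook_avg_latency() {
  awk '
    /^apiserver_admission_webhook_admission_duration_seconds_(sum|count)[{]/ {
      # Extract the value of the name="..." label.
      if (!match($0, /name="[^"]*"/)) next
      name = substr($0, RSTART + 6, RLENGTH - 7)
      # The sample value is the last field on the line.
      if ($0 ~ /_seconds_sum[{]/) sum[name] += $NF
      else                        cnt[name] += $NF
    }
    END {
      for (n in cnt) if (cnt[n] > 0) printf "%s %.3f\n", n, sum[n] / cnt[n]
    }
  ' | sort
}
```

A typical invocation is kubectl get --raw /metrics | webhook_avg_latency. An average hides tail latency, so use the histogram buckets (for example histogram_quantile in Prometheus) when you need the p99.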

Another overlooked metric: apiserver_request_duration_seconds with a high 99th percentile indicates either webhook latency or etcd slowness. Correlate with etcd metrics to pinpoint the bottleneck.

Another scenario: on very large clusters with frequent resource updates, a client's resourceVersion can fall outside the window held in the API Server's watch cache. The watch then fails with 410 Gone ('too old resource version') and the client must re-list all resources of that type. This can cause cascading failures as controllers re-sync and generate additional load. Mitigate by raising --watch-cache-sizes for the affected resources where your version still honors it; note that --watch-cache itself is only an on/off switch and is enabled by default.

One more: the API Server's --max-requests-inflight and --max-mutating-requests-inflight defaults (400 and 200) can be too low for clusters with many controllers or automation. If you see 429 Too Many Requests errors in logs, tune these up gradually while monitoring memory usage. Each inflight request consumes memory for the request context, so increasing them too aggressively can cause OOM.

And here's something about watch timeouts: watch connections are long-lived, and the API Server closes each one only after a randomized timeout derived from --min-request-timeout (default 1800 seconds). If a client (e.g., a controller) disappears without closing its connection, the server side of the watch can linger until the dead connection is detected or that timeout fires. In clusters with many controllers, these orphaned watches can accumulate and consume significant memory. Lowering --min-request-timeout recycles watches sooner, at the cost of more frequent reconnects.

io/thecodeforge/kubernetes/apiserver_metrics.sh · BASH
#!/bin/bash
# Check API Server metrics

# Fetch the metrics once and reuse them; /metrics is a large response.
METRICS="$(kubectl get --raw=/metrics)"

# 1. Request counts by verb
echo "$METRICS" | grep '^apiserver_request_total'

# 2. Admission webhook latencies (average = _sum / _count per webhook)
echo "$METRICS" | grep -E '^apiserver_admission_webhook_admission_duration_seconds_(sum|count)'

# 3. Inflight requests
echo "$METRICS" | grep '^apiserver_current_inflight_requests'

# 4. Watch cache metrics (exact names vary by version, so match broadly)
echo "$METRICS" | grep 'watch_cache'
Admission Webhook Latency
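If you have custom admission webhooks, validate that they respond well within their timeoutSeconds (10 seconds by default and 30 at most in admissionregistration.k8s.io/v1). A slow webhook delays every API request it matches. Use webhook.Server with timeouts, and prefer failurePolicy: Fail for security-critical webhooks in production to avoid silent bypass, accepting that an unavailable webhook then blocks the matching requests.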
Production Insight
Admission webhooks can become the single bottleneck if not tuned — monitor latencies and set timeouts.
The API Server's watch cache can fall behind under write-heavy loads; expect 410 Gone errors with 'too old resource version'.
Never run admission webhooks without Circuit Breaker patterns — a slow webhook can bring down the entire API Server.
Watch cache inconsistencies on large clusters cause cascading re-list storms — tune watch cache sizes.
Inflight request limits cause 429 errors — gradually increase --max-requests-inflight while monitoring memory.
Orphaned watch connections consume memory; watch lifetime is governed by --min-request-timeout, so lower it with care.
Key Takeaway
The API Server is the heartbeat of the cluster.
Admission webhooks and etcd latency are the two most common performance killers.
Always monitor API Server metrics before blaming network issues.
Watch timeout and orphaned connections can silently degrade performance — clean them up.
API Server Response Issues
If: kubectl commands return 'Connection refused'
Use: Check API Server pod status and load balancer health. Ensure the API Server process is running and reachable.
If: kubectl commands return 'timeout' after 30s
Use: Check etcd health and admission webhook latencies. Use the metrics endpoint to identify whether it's an etcd issue or webhook slowdown.
If: API Server returns 429 Too Many Requests
Use: Increase --max-mutating-requests-inflight and --max-requests-inflight or reduce concurrency of automated kubectl usage.
If: Watch request fails with 'too old resource version'
Use: Increase --watch-cache-sizes for the affected resource types. Make sure clients handle the error by re-listing with backoff rather than retrying the stale watch.

CoreDNS and Service Discovery: The Cluster's Internal DNS

CoreDNS is the default DNS resolver for Kubernetes clusters. It runs as a deployment in the kube-system namespace and watches Services and EndpointSlices to provide name resolution for service names. Every pod is configured to use CoreDNS via /etc/resolv.conf — typically at the cluster IP of the CoreDNS service (like 10.96.0.10).

CoreDNS uses plugins to achieve its functionality. The kubernetes plugin handles Kubernetes service records. The forward plugin forwards external DNS queries to upstream resolvers. The loop plugin detects forwarding loops. The log plugin enables query logging for debugging.

A common misconfiguration: not setting resource limits for CoreDNS. Under heavy query load, CoreDNS can become a bottleneck. We've seen clusters where a single CoreDNS pod crashed due to memory limits, causing intermittent DNS failures across the cluster. Monitor CoreDNS's memory and CPU — set requests and limits based on cluster size.

Another subtlety: the ndots option in the pod's /etc/resolv.conf. By default it is set to ndots:5, which means any name with fewer than 5 dots is first tried against each cluster search domain before being queried as an absolute name. Short in-cluster names are cheap: for a pod in the default namespace, my-service resolves on the first attempt as my-service.default.svc.cluster.local. External names pay the tax: api.example.com is tried as api.example.com.default.svc.cluster.local, api.example.com.svc.cluster.local, and api.example.com.cluster.local, each returning NXDOMAIN, before the real query goes out. For high-traffic applications that call external hosts, lower ndots (for example to 1) through dnsConfig in the pod spec, or use fully qualified names with a trailing dot.
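
The lookup order is easier to reason about when you can print it. The sketch below lists, in order, the names a glibc-style resolver would try for a given ndots value and search list; resolution stops at the first name that resolves. `dns_candidates` is a hypothetical helper, and it models only the ordering rule, not timeouts or retries.

```shell
#!/usr/bin/env bash
# Print, in order, the names a glibc-style resolver would try.
# Usage: dns_candidates NAME NDOTS SEARCH_DOMAIN...
dns_candidates() {
  name=$1
  ndots=$2
  shift 2
  # A trailing dot marks the name as absolute: the search list is skipped.
  case "$name" in
    *.) echo "${name%.}"; return ;;
  esac
  # Count the dots in the name.
  dots_only=$(printf '%s' "$name" | tr -cd '.')
  dots=${#dots_only}
  # Enough dots: try the name as-is before the search list.
  if [ "$dots" -ge "$ndots" ]; then echo "$name"; fi
  for domain in "$@"; do
    echo "$name.$domain"
  done
  # Too few dots: the name as-is is tried last.
  if [ "$dots" -lt "$ndots" ]; then echo "$name"; fi
}

dns_candidates api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local
```

With ndots:5 the external name is queried fourth; with ndots:1, or with a trailing dot, it is queried first.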

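Also, check the forward plugin's upstream selection policy. The default policy is random, which spreads queries across all healthy upstreams. policy sequential always tries upstreams in the order listed, so a slow first upstream delays every query until it is marked unhealthy. Use sequential only when you deliberately want a primary and a fallback; keep the default or use round_robin when you want load distribution.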

In large clusters, consider deploying NodeLocal DNSCache. It runs a DaemonSet that caches DNS queries per node, reducing load on CoreDNS and improving resolution latency. This is almost essential for clusters with thousands of pods.

io/thecodeforge/kubernetes/coredns-config.yaml · YAML
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           policy sequential
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }
Enabling CoreDNS Query Logging
To debug DNS failures, enable query logging by adding the log plugin to the CoreDNS ConfigMap. The logs will show each query and response. Use with caution in production — log volume can be high. The command kubectl logs -n kube-system deployment/coredns --tail=10 shows recent queries.
Production Insight
CoreDNS without resource limits is a common cause of intermittent DNS timeouts — set requests and limits per pod.
ndots:5 causes unnecessary search domain queries — tune to 1 for latency-sensitive apps.
Forward policy 'sequential' can amplify latency when the first upstream is slow; prefer the default 'random' or 'round_robin' unless you need a strict primary and fallback.
NodeLocal DNSCache can substantially cut DNS latency and CoreDNS load in large clusters; deploy it.
Monitor coredns_dns_request_duration_seconds for latency and coredns_dns_responses_total by rcode for failure rates to catch degradation early.
Key Takeaway
CoreDNS is the single point of failure for service discovery — protect it with resource limits and health checks.
ndots:5 is a hidden latency tax — tune it down for performance-critical paths.
NodeLocal DNSCache is not optional for clusters over 50 nodes — deploy it proactively.
DNS Resolution Failures
If: Pod cannot resolve any service names (nslookup times out)
Use: Check CoreDNS pods are running and not being evicted. Verify service kube-dns has endpoints. Check NetworkPolicy blocking DNS.

Kubelet Probes and Pod Lifecycle: What Happens When a Probe Fails

The kubelet executes three types of probes on containers: liveness, readiness, and startup. Each probe is a periodic check (HTTP GET, TCP socket, or command execution) that determines whether the container is alive, ready to serve traffic, or still starting up. The probe results directly affect the pod's lifecycle and the cluster's behavior.

Liveness probes determine if the container is running. If it fails, the kubelet restarts the container per the pod's restartPolicy. Readiness probes determine if the container is ready to serve traffic. If it fails, the pod is removed from Service endpoints. Startup probes delay the start of liveness and readiness probes until the container finishes initialization — critical for slow-starting applications like Java apps or legacy monoliths.

Here's the nuance most teams get wrong: the failure threshold for liveness probes often causes cascading restarts in rolling updates. Default failureThreshold is 3, and the default periodSeconds is 10. That means a pod will be killed after 30 seconds of failure. But during a deployment, if the new version takes 40 seconds to respond, the kubelet restarts it before it ever becomes ready. The fix: increase the startup probe threshold or use an initial delay.
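
The timing arithmetic is worth checking before every rollout. This is a rough sketch: `probe_budget` is a hypothetical helper, and it ignores timeoutSeconds and scheduling jitter, so treat the result as an approximation of how long a container has before the kubelet acts on a failing probe.

```shell
#!/usr/bin/env bash
# Approximate seconds before the kubelet acts on a failing probe:
#   initialDelaySeconds + periodSeconds * failureThreshold
# Usage: probe_budget INITIAL_DELAY PERIOD FAILURE_THRESHOLD
probe_budget() {
  initial_delay=$1
  period=$2
  failure_threshold=$3
  echo $((initial_delay + period * failure_threshold))
}

echo "default liveness budget: $(probe_budget 0 10 3)s"
echo "startup probe (delay 10, period 5, threshold 30): $(probe_budget 10 5 30)s"
```

If the application's slowest observed start exceeds the liveness budget, it needs a startup probe whose budget comfortably covers that start time.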

Another subtlety: when a readiness probe fails, the pod remains running but is removed from Service endpoints. This means traffic stops flowing, but the pod continues consuming CPU and memory. If many pods become unready, the remaining pods may be overloaded, causing a cascade of readiness failures. Always set resource limits to prevent unready pods from starving healthy ones.

Also, the kubelet doesn't kill containers for failing readiness probes. That's intentional — the probe is just traffic routing. But if you rely on readiness probe for health checking in monitoring, you'll get false positives. Readiness probe failures are not alerts unless they exceed a high threshold.

One more: the kubelet records probe failures as events on the pod. You can see them with kubectl describe pod. But events are short-lived (the API Server keeps them for one hour by default, via --event-ttl) and repeated events are aggregated, so by the time you investigate, the original failure may be gone and you lose the root cause. For continuous monitoring, use kubelet metrics such as prober_probe_total and kubelet_pod_start_duration_seconds.

io/thecodeforge/kubernetes/probes-deployment.yaml · YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: slow
  template:
    metadata:
      labels:
        app: slow
    spec:
      containers:
      - name: app
        image: myregistry/slow-app:1.0
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 30  # up to 10s + 30 x 5s = 160s to start
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2
Startup Probe Ignored in Older Clusters
Startup probes arrived as alpha in Kubernetes 1.16, were enabled by default from 1.18, and went GA in 1.20. If you're running an older cluster (unlikely but possible in enterprise), you must rely on a large initialDelaySeconds on the liveness probe. That's fragile. Always verify your cluster version before relying on startup probes.
Production Insight
Liveness probe failure thresholds that are too aggressive cause restarts during slow rolling updates — tune them with startup probes.
Readiness probe failures do not kill containers — they just remove traffic. Monitor readiness metrics separately.
Startup probes are essential for containers that take >30s to start — set them with high failureThreshold.
Event pruning can hide probe failures — use kubelet metrics for continuous visibility.
Cascading readiness failures happen when unready pods still consume resources — set CPU/memory limits.
Key Takeaway
Probes are the kubelet's decision mechanism for pod health — get them wrong and your deployments become unreliable.
Always use startup probes for slow-starting containers.
Readiness failures are not pod failures — they are traffic routing signals.
Monitor probe metrics separately from pod events to catch cascading failures.
Probe Failure Diagnosis
If: Pod shows 'CrashLoopBackOff' after deployment
Use: Check liveness probe configuration. Increase initialDelaySeconds or add a startup probe if the application is slow to start.
● Production incident · POST-MORTEM · severity: high

Cascading API Server Failure Due to etcd Disk Latency

Symptom
All kubectl operations returned timeout errors. Controller-manager logs showed failed lease renewals. New pods stuck in Pending.
Assumption
Network partition or API Server OOM.
Root cause
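etcd members were deployed on the same nodes as other workloads. A batch job caused high disk I/O on those nodes. etcd's consensus protocol (Raft) must fsync each log entry to disk before acknowledging it, and the leader must keep sending heartbeats within the election timeout. High disk latency stalled the leader, followers timed out and triggered repeated leader elections, and the API Server could not write new state while no stable leader existed.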
Fix
1. Isolated etcd onto dedicated nodes with local SSDs and no other workloads.
2. Configured etcd heartbeat-interval and election-timeout appropriately for the network.
3. Set up monitoring for etcd fsync duration and disk IOPS.
4. Added disk latency alerts with p99 > 15ms triggering immediate escalation.
Key lesson
  • etcd is the cluster's central nervous system. Its performance is non-negotiable.
  • Disk latency, not network, is the most common cause of etcd instability.
  • etcd must be isolated and its hardware provisioned for predictable, low-latency I/O.
  • Always test your etcd restore procedure quarterly — a backup you've never restored is no backup at all.
Production debug guide
A symptom-first investigation path for control plane and node issues.
5 entries
Symptom · 01
kubectl commands timeout or fail with 'server error'.
Fix
Check API Server and etcd health first. The API Server is the gateway; if it's down, nothing else works. Use kubectl get --raw='/readyz?verbose' and etcdctl endpoint health --cluster.
Symptom · 02
Pods stuck in Pending state.
Fix
Investigate scheduler logs and node resource availability (kubectl describe node). Check for resource fragmentation or taints/tolerations mismatches. Also verify scheduler pod health.
Symptom · 03
Pods in CrashLoopBackOff.
Fix
Inspect kubelet logs on the node and the pod's events (kubectl describe pod). The container runtime (e.g., containerd) logs are critical here. Check journalctl -u kubelet and crictl ps -a.
Symptom · 04
Node marked NotReady.
Fix
SSH to the node. Check kubelet and container runtime status (systemctl status kubelet). Check disk pressure, memory pressure, and PID pressure using kubectl describe node. Look for eviction thresholds being hit.
Symptom · 05
Pod cannot resolve service DNS name.
Fix
Check CoreDNS pods and logs. Verify NetworkPolicy allows DNS traffic on port 53. Ensure Service exists and EndpointSlices are populated. Try nslookup kubernetes.default.svc.cluster.local from inside a pod.
★ Kubernetes Control Plane & Node Triage
Rapid commands to isolate cluster issues.
Cluster-wide unresponsiveness.
Immediate action
Check etcd health and API Server logs.
Commands
etcdctl endpoint health --cluster
kubectl get --raw='/readyz?verbose'
Fix now
If etcd is unhealthy, check disk latency on etcd nodes (iostat -x 1). Isolate etcd immediately. If API Server is slow, check apiserver_current_inflight_requests metric.
Specific node not scheduling new pods.
Immediate action
Check node conditions and resource allocation.
Commands
kubectl describe node <node-name> | grep -A 10 Conditions
kubectl top node <node-name>
Fix now
If under DiskPressure, clean up unused images/containers. If under MemoryPressure, identify memory-hungry pods and consider increasing eviction thresholds.
Pods failing to start on a node.
Immediate action
Check kubelet and container runtime logs.
Commands
journalctl -u kubelet -n 50 --no-pager
crictl ps -a # or docker ps -a
Fix now
If kubelet cannot talk to the runtime, restart the runtime service. Check for CNI plugin errors. On clusters still using dockershim (removed in Kubernetes 1.24), also verify the kubelet's --image-pull-progress-deadline.
API Server response times high.
Immediate action
Check admission webhook latencies and etcd health.
Commands
kubectl get --raw=/metrics | grep apiserver_admission_webhook_admission_duration_seconds_sum
kubectl get --raw='/readyz?verbose'
Fix now
Identify the slow webhook and add a timeout or circuit breaker. If etcd is slow, move it to faster dedicated disks or isolate it from other workloads. Increase --max-requests-inflight if 429 errors appear.
Scheduler queue growing (pods stuck Pending with no event).
Immediate action
Check scheduler logs and metrics.
Commands
kubectl logs -n kube-system deployment/kube-scheduler --tail 50
kubectl get --raw=/metrics | grep scheduler_queue_incoming_pods
Fix now
If the queue is backing up, increase --kube-api-qps on the scheduler or reduce the pod creation rate. Extra scheduler replicas only add failover, because leader election keeps a single instance active at a time.
DNS lookup fails for service names but works for external names.
Immediate action
Check CoreDNS pod status and logs.
Commands
kubectl logs -n kube-system deployment/coredns --tail 30
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default.svc.cluster.local # the CoreDNS image ships no shell or nslookup, so test from a throwaway pod
Fix now
If CoreDNS is restarting, check its resource limits. If CoreDNS is healthy, verify that the Service exists and its EndpointSlices are populated, then check NetworkPolicy and the pod's /etc/resolv.conf.

Common mistakes to avoid

5 patterns

Using default eviction thresholds without customization

Symptom
Pods are evicted unpredictably under memory pressure because evictionHard.memory.available defaults to 100Mi, which leaves little headroom before the kernel OOM killer steps in.
Fix
Set evictionHard.memory.available: "10%" or a value like 500Mi in the kubelet configuration. Test with a load generator to verify thresholds trigger before system OOM.
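A percentage threshold scales with node size, while an absolute quantity does not. As a rough illustration, the sketch below converts a threshold string into bytes for a given node capacity. The helper name and the 16 GiB node are assumptions for the example; the parsing covers only the percentage and binary-suffix forms used here, not every quantity format the kubelet accepts.

```python
# Binary suffixes used by Kubernetes resource quantities (subset).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def threshold_bytes(threshold, capacity_bytes):
    """Convert an eviction threshold such as '10%' or '500Mi' into bytes."""
    value = threshold.strip()
    if value.endswith("%"):
        return int(capacity_bytes * float(value[:-1]) / 100)
    for suffix, factor in _UNITS.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)  # plain byte count


# Hypothetical node with 16 GiB of memory.
node_capacity = 16 * 2**30

for setting in ("100Mi", "500Mi", "10%"):
    mib = threshold_bytes(setting, node_capacity) / 2**20
    print(f"memory.available < {setting:>5} -> evict below {mib:.0f} MiB free")
```

On this example node the default 100Mi leaves well under 1% of memory as headroom, whereas 10% reserves roughly 1.6 GiB, which is the reason to tune the value per node size.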

Running etcd on shared nodes with other workloads

Symptom
etcd leader elections become frequent during batch jobs or peak load, causing API Server timeouts and pod scheduling delays.
Fix
Isolate etcd on dedicated nodes with local SSDs. Set resource reservations for etcd pods and configure --heartbeat-interval and --election-timeout for your network latency.
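For picking `--heartbeat-interval` and `--election-timeout`, the etcd tuning guidance is roughly: set the heartbeat near the maximum round-trip time between members, and the election timeout to at least ten times that. A minimal sketch of the arithmetic follows. The function name is hypothetical, and using etcd's defaults (100 ms and 1000 ms) as a floor is an assumption of this example, not an etcd rule.

```python
# etcd defaults and documented upper bound for the election timeout (milliseconds).
DEFAULT_HEARTBEAT_MS = 100
DEFAULT_ELECTION_MS = 1000
MAX_ELECTION_MS = 50000


def suggest_etcd_timings(max_rtt_ms):
    """Suggest heartbeat and election timeout from the worst peer round-trip time.

    Assumption: never go below etcd's defaults, which suit a low-latency LAN.
    """
    heartbeat = max(DEFAULT_HEARTBEAT_MS, round(max_rtt_ms))
    election = max(DEFAULT_ELECTION_MS, 10 * heartbeat)
    election = min(election, MAX_ELECTION_MS)
    return {"heartbeat-interval": heartbeat, "election-timeout": election}


# Same-rack members (about 2 ms RTT) versus cross-region members (about 150 ms RTT).
for rtt in (2, 150):
    timings = suggest_etcd_timings(rtt)
    flags = " ".join(f"--{name}={value}" for name, value in timings.items())
    print(f"max RTT {rtt} ms: {flags}")
```

Measure the RTT between etcd members (for example with ping) under load before applying values like these, and use the same settings on every member.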

Not setting resource limits for CoreDNS

Symptom
CoreDNS pod crashes under high query load, causing intermittent DNS resolution failures across the cluster.
Fix
Set CPU and memory requests/limits for CoreDNS. For clusters with >100 pods, deploy NodeLocal DNSCache and tune forward plugin's max_concurrent.

Applying a NetworkPolicy that denies all ingress without allowing DNS

Symptom
Pods cannot resolve service names; DNS queries to CoreDNS are blocked by the policy.
Fix
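NetworkPolicies select pods, not Services. If the deny-all policy applies in kube-system, add an ingress rule on the CoreDNS pods allowing UDP/TCP port 53 from all namespaces. If your default-deny also covers egress, add an egress rule in each application namespace allowing UDP/TCP port 53 to the CoreDNS pods (or use a prepared NetworkPolicy that includes DNS).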
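When a default-deny policy also restricts egress, application pods need an explicit rule to reach CoreDNS. The sketch below builds one such NetworkPolicy as a plain dictionary and prints it as JSON, which `kubectl apply -f -` accepts. It assumes CoreDNS pods carry the label `k8s-app: kube-dns` in `kube-system` (the kubeadm default) and that namespaces have the automatic `kubernetes.io/metadata.name` label; the policy name, function name, and `payments` namespace are placeholders.

```python
import json


def allow_dns_egress(namespace):
    """Build a NetworkPolicy allowing every pod in `namespace` to query CoreDNS."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector: all pods in the namespace
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [
                        {
                            # Both selectors in one entry: pods matching the label
                            # inside the matching namespace only.
                            "namespaceSelector": {
                                "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}
                            },
                            "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
                        }
                    ],
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ],
                }
            ],
        },
    }


# "payments" is a placeholder namespace for this example.
print(json.dumps(allow_dns_egress("payments"), indent=2))
```

Both UDP and TCP are listed because DNS falls back to TCP for large responses. If NodeLocal DNSCache is in use, the destination differs and this rule would need adjusting.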

Ignoring scheduler cache staleness

Symptom
Pods get stuck in 'PodScheduled' with event 'node does not exist' after node churn, even though the node is alive.
Fix
Restart the kube-scheduler pod to rebuild its cache. For production, upgrade to a more recent scheduler version that handles node churn better.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain how the Kubernetes scheduler assigns a pod to a node. What happe...
Q02 · SENIOR
What happens when etcd loses quorum? How do you recover?
Q01 of 02 · SENIOR

Explain how the Kubernetes scheduler assigns a pod to a node. What happens if no node passes filtering?

ANSWER
The scheduler works in two phases: filtering (formerly called predicates) and scoring (formerly called priorities). In the filtering phase it applies hard constraints such as resource requests, node selectors, taints/tolerations, and required affinity rules. It then scores the nodes that passed using scoring functions (e.g., spread, balanced resource allocation) and binds the pod to the highest-scoring node. If no node passes filtering, the pod stays Pending and the scheduler emits a FailedScheduling event, which you can inspect with kubectl describe pod. Once a pod is bound, the scheduler does not re-evaluate it; the kubelet handles eviction if the node comes under resource pressure.
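The two phases can be sketched in a few lines. This is a toy model for interview practice, not the kube-scheduler's actual code: the `Node` and `Pod` classes, the exact-match taint handling, and the "most free resources" score are all simplifying assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int          # free CPU in millicores
    free_mem_mi: int         # free memory in MiB
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_m: int
    mem_mi: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)


def feasible(pod, node):
    """Filtering: hard constraints. Any failure removes the node."""
    fits = pod.cpu_m <= node.free_cpu_m and pod.mem_mi <= node.free_mem_mi
    selected = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    tolerated = node.taints <= pod.tolerations  # every taint must be tolerated
    return fits and selected and tolerated


def score(pod, node):
    """Scoring: toy 'least allocated' rule, preferring the most leftover capacity."""
    return (node.free_cpu_m - pod.cpu_m) + (node.free_mem_mi - pod.mem_mi)


def schedule(pod, nodes):
    """Return the chosen node name, or None if the pod would stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name


nodes = [
    Node("node-a", free_cpu_m=500, free_mem_mi=1024),
    Node("node-b", free_cpu_m=2000, free_mem_mi=4096, taints={"gpu"}),
    Node("node-c", free_cpu_m=1500, free_mem_mi=2048),
]

print(schedule(Pod(cpu_m=1000, mem_mi=512), nodes))
print(schedule(Pod(cpu_m=1000, mem_mi=512, tolerations={"gpu"}), nodes))
print(schedule(Pod(cpu_m=4000, mem_mi=512), nodes))
```

The first pod skips node-a (too little CPU) and node-b (untolerated taint); the second tolerates the taint and lands on the roomier node-b; the third fits nowhere, which corresponds to a Pending pod with a FailedScheduling event.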