Senior 27 min · March 06, 2026

etcd Disk Latency — Kubernetes Architecture Failure

Q: What are the default API Server request limits and how do they affect large clusters?

The defaults are `--max-mutating-requests-inflight=200` and `--max-requests-inflight=400`. In large clusters with aggressive automation or CI/CD pipelines, these limits are easily hit, causing new requests to queue and eventually time out. Monitor `apiserver_current_inflight_requests` and `apiserver_request_total` (older releases exposed the latter as `apiserver_request_count`) to detect throttling before it impacts production.

Q: How does leader election in the controller manager affect failover time?

Controller manager replicas use leader election via a Lease object in kube-system (older releases used an annotation on an Endpoints object). If the leader dies, a new leader takes over after about 15 seconds, governed by lease duration and renew deadline. During that window, the corresponding controller loop (e.g., node controller) stops processing, which can delay pod evictions or other reconciliation tasks.

Q: What happens when the API Server's watch cache is too small?

If writes exceed the watch cache capacity, watch requests get 'too old resource version' errors and clients must re-list all objects. This triggers a thundering herd of full list calls, spiking API Server CPU and etcd load. Mitigate by increasing `--watch-cache-sizes` or reducing the number of concurrent watches.

etcd disk latency from co-located workloads caused Raft leader election failures, crashing the API server.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

Production tested · Last updated June 10, 2026


⚡Quick Answer

Kubernetes is a declarative control loop: you define desired state, the system reconciles toward it.
Control Plane components: API Server, Scheduler, Controller Manager, and etcd — the single source of truth using Raft.
Nodes run the kubelet, kube-proxy, and container runtime; they execute pod specs and report back.
Performance insight: etcd's fsync latency dictates cluster responsiveness — keep it under 10ms or expect leader elections.
Production insight: API Server is the only component that talks directly to etcd; every other component reads and watches state through the API Server. If etcd is slow, the entire control plane slows.
Biggest mistake: treating etcd as a generic KV store — it's a replicated log, not an OLTP database.
Scaling insight: watch cache exhaustion causes re-list storms; increase --watch-cache-sizes for resource-heavy clusters.
Scheduler nuance: default --percentage-of-nodes-to-score can skip optimal nodes at scale — set to 100% for latency-sensitive workloads.

✦ Definition · ~90s read

What is Kubernetes Architecture?

Kubernetes architecture is a distributed system design where a control plane manages a fleet of worker nodes that run containerized applications. The control plane — typically 3 or 5 machines — hosts critical components like the API server, scheduler, controller manager, and etcd.


These components coordinate to maintain the cluster's desired state, which you declare via YAML manifests. The worker nodes run a kubelet agent, a container runtime (containerd or CRI-O; Docker Engine needs the cri-dockerd adapter since dockershim was removed in v1.24), and kube-proxy for networking. This separation of concerns means the control plane makes all scheduling and reconciliation decisions, while nodes are relatively dumb executors.

When etcd disk latency spikes, the entire control plane stalls — the API server can't read or write state, the scheduler can't bind pods, and controllers can't reconcile. This is why etcd performance is the single most critical factor in cluster reliability; a 99th percentile disk fsync latency above 10ms will cause cascading failures across every component that depends on cluster state.

Plain-English First

Imagine a massive Amazon warehouse. There's a central manager's office (the Control Plane) that decides which workers (Nodes) pick which packages (containers), tracks every shelf location (etcd), and reschedules sick workers automatically. The workers don't think — they just follow orders from the office, report their status, and run their assigned tasks. Kubernetes is exactly that warehouse management system, but for software running on servers.

Kubernetes replaces bespoke deployment scripts and manual server management with a declarative control loop. You describe the desired state, and the cluster continuously reconciles reality toward it. The architecture is not a black box; it's a set of coordinated components with specific failure domains and performance characteristics. Understanding these internals is what separates engineers who debug Kubernetes from those who are confused by it.

This is not a getting-started guide. It is for engineers already running Kubernetes who need to understand the 'why' behind scheduler decisions, etcd consistency guarantees, and kubelet behavior under pressure. We will trace what happens, component by component, when you run kubectl apply -f deployment.yaml, and identify the production decisions that bite teams hardest. The single most overlooked truth: the API Server is the bottleneck, but etcd is the clock that drives it.

One more thing: don't assume HA means safe. Running three API Server replicas without understanding leader election or etcd quorum is a false sense of security. We'll cover exactly what breaks and how to catch it before your on-call phone rings.

What Kubernetes Architecture Actually Is

Kubernetes architecture is a distributed system designed to manage containerized workloads across a cluster of machines. The core mechanic is a declarative control loop: you specify the desired state (e.g., 3 replicas of a web server), and the control plane continuously reconciles the actual state to match it. This loop runs on a set of master components — API server, etcd, scheduler, controller manager — while worker nodes run kubelet, kube-proxy, and a container runtime.

In practice, etcd is the single source of truth for all cluster state. Every API request, every pod scheduling decision, every config map update goes through etcd. If etcd slows down, the entire control plane stalls. The scheduler can't assign pods, the API server times out, and node heartbeats fail. A 100ms write latency in etcd can cascade into minutes of cluster unavailability.

You use Kubernetes when you need automated deployment, scaling, and healing of applications across multiple hosts. It matters because it abstracts away individual machines and provides a uniform API for operations. But the architecture's central dependency on etcd means that disk I/O performance on the etcd nodes directly determines cluster reliability — a fact many teams discover only after a production outage.

etcd Is Not a General-Purpose Database

etcd is optimized for consistency and low latency, not throughput. Treating it like a regular key-value store with heavy write patterns will cause cluster-wide instability.

Production Insight

A large e-commerce platform saw their Kubernetes API server become unresponsive during a Black Friday traffic spike. The root cause was etcd disk write latency spiking to 500ms due to shared SSDs with a logging agent. The rule of thumb: dedicate fast SSDs (NVMe) to etcd, monitor disk latency with a 10ms alert threshold, and never co-locate etcd with write-heavy workloads.

Key Takeaway

etcd is the single point of failure in Kubernetes architecture — protect its disk I/O above all else.

The control plane is only as fast as its slowest component, and etcd is usually the bottleneck.

Always run etcd on dedicated, low-latency storage with strict resource isolation.


Control Plane: The Cluster's Brain

The Control Plane makes global decisions (e.g., scheduling) and detects and responds to cluster events. It consists of the API Server, etcd, Scheduler, and Controller Manager. In production, it is almost always replicated across multiple nodes for high availability.

The API Server (kube-apiserver) is the front end—all communication, whether from kubectl, controllers, or nodes, goes through it. It validates and persists resources to etcd, and exposes a watch API that components use to detect changes. The Scheduler (kube-scheduler) watches for unscheduled Pods and assigns them to nodes. The Controller Manager (kube-controller-manager) bundles multiple controllers: Node Controller, ReplicaSet Controller, Endpoint Controller, etc. Each runs as a separate loop but shares the same binary.

One detail that bites teams: the API Server's --max-mutating-requests-inflight and --max-requests-inflight defaults are 200 and 400 respectively. If you run a large cluster with aggressive automation, you'll hit these limits and calls start queuing. Monitor apiserver_request_total (apiserver_request_count on older releases) and apiserver_current_inflight_requests early, before your CI/CD pipeline starts timing out.
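To make the in-flight limit concrete, here is a minimal sketch that turns a sampled value of apiserver_current_inflight_requests into a percentage of the configured limit. The helper name inflight_utilization and the 80% alert line are illustrative assumptions, not part of any Kubernetes tooling; the commented kubectl command is one way to read the raw gauge.

```bash
#!/bin/bash
# Sketch: how close is the API Server to an in-flight request limit?
# inflight_utilization is a hypothetical helper, not a Kubernetes tool.

# Usage: inflight_utilization <current_inflight> <limit>
# Prints the integer percentage of the limit currently in use.
inflight_utilization() {
  local current="$1" limit="$2"
  echo $(( current * 100 / limit ))
}

# Reading the raw gauge (requires permission to GET /metrics):
#   kubectl get --raw /metrics | grep '^apiserver_current_inflight_requests'
# The gauge carries a request_kind label (mutating or readOnly); compare each
# series against its own limit (200 and 400 by default).

# Example: 170 mutating requests against the default limit of 200.
pct=$(inflight_utilization 170 200)
if [ "$pct" -ge 80 ]; then   # 80% is an arbitrary illustrative threshold
  echo "mutating in-flight at ${pct}% of limit"
fi
```

In practice you would alert on the sustained ratio rather than a single sample, since short bursts near the limit are normal during rollouts.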

Another subtlety: leader election among controller manager replicas is handled via a Lease object in kube-system (older releases used an annotation on an Endpoints object). If the leader dies unexpectedly, it takes about 15 seconds for a new leader to take over (governed by lease duration and renew deadline). During that window, the corresponding controller loop stops. For example, the node controller stops evicting pods from unreachable nodes, which can cause service disruption. Know your lease parameters.
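A small sketch of how you might inspect the leader lock and estimate the leaderless window. It assumes a release where the lock is a Lease object, and it assumes a 15-second lease duration and a 2-second retry period (the --leader-elect-lease-duration and --leader-elect-retry-period flags); the function names are hypothetical, and the sum is a rough upper bound rather than a guarantee.

```bash
#!/bin/bash
# Sketch: inspect a control plane leader lock and estimate failover time.
# Assumes the lock is a Lease object in kube-system.

# Prints the identity of the current leader for a component.
# Usage: current_leader kube-controller-manager
current_leader() {
  kubectl get lease "$1" -n kube-system -o jsonpath='{.spec.holderIdentity}'
}

# Rough upper bound on the leaderless window after a hard leader crash:
# standbys wait for the lease to expire, then acquire it on their next retry.
# Usage: max_failover_seconds <lease_duration_s> <retry_period_s>
max_failover_seconds() {
  echo $(( $1 + $2 ))
}

# With the assumed 15s lease duration and 2s retry period:
max_failover_seconds 15 2   # prints 17
```

Shortening the lease duration narrows the window but makes the leader more likely to lose the lock during brief API Server slowdowns, so change it together with the renew deadline and test under load.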

Another often overlooked component: the cloud-controller-manager. If you're on AWS, GCP, or Azure, this controller interacts with the cloud provider's API to manage load balancers, routes, and nodes. A misconfigured cloud-controller-manager can prevent nodes from joining the cluster even though the API Server is healthy. Always check its logs when nodes fail to register.

One more: the API Server's etcd client uses a watch cache. If writes exceed the cache capacity, watch requests get 'too old resource version' errors and clients must re-list. This can cascade into a thundering herd problem. Mitigate by increasing --watch-cache-sizes or reducing watch concurrency. In one incident, a misconfigured monitoring system created too many watches, causing all controllers to re-sync every few minutes, spiking API Server CPU to 100%.

And here's a trap with admission webhooks: they run before the request reaches etcd. A slow webhook blocks the entire request pipeline. We've seen a single webhook that took 5 seconds to respond because it called an external service that was throttled. That 5-second delay was added to every write to that resource type. The fix was to add a circuit breaker and a timeout at the webhook level. Monitor apiserver_admission_webhook_admission_duration_seconds — if the 99th percentile exceeds 1 second, you have a problem.

io/thecodeforge/kubernetes/check_control_plane.sh (Bash)

#!/bin/bash
# Check health of core control plane components

# 1. API Server health (verbose)
kubectl get --raw='/healthz?verbose'

# 2. etcd member list (run on a control plane node; etcdctl usually needs
#    --cacert/--cert/--key or the matching ETCDCTL_* environment variables)
etcdctl member list -w table

# 3. Leaders for scheduler and controller manager
#    (recent releases store the lock in a Lease; older ones used an Endpoints annotation)
kubectl get lease kube-scheduler -n kube-system -o jsonpath='{.spec.holderIdentity}'
echo
kubectl get lease kube-controller-manager -n kube-system -o jsonpath='{.spec.holderIdentity}'
echo

# 4. Check pod health of control plane components
kubectl get pods -n kube-system | grep -E 'etcd|apiserver|scheduler|controller'

Control Plane Redundancy Trap

Running multiple replicas of the API Server and controller manager is not enough if etcd is a single point of failure. etcd must have its own high-availability setup (typically 3 or 5 members on separate nodes). Without it, the whole control plane collapses if the single etcd instance fails.

Production Insight

etcd is the single point of failure. Monitor its fsync latency; if it drifts above 10ms, expect leader elections.

Admission webhooks silently modify resources — audit in staging, not in production.

Cloud-controller-manager failure prevents node registration — check its logs.

Watch cache exhaustion causes cascading re-lists: monitor the API Server's watch cache metrics (for example apiserver_watch_cache_capacity on recent releases; names vary by version).

Admission webhook latency above 1s at p99 will stall the control plane — instrument and timeout.

Key Takeaway

The Control Plane's reliability is the cluster's reliability.

etcd is the foundation; without it performing well, no amount of API Server replicas will save you.

Always monitor etcd fsync latency and have a tested backup restore procedure.

Admission webhooks are a common bottleneck — profile them before they become a crisis.

Control Plane Health Triage

If: API Server responds but etcd is unreachable
→ Use: Check etcd nodes' connectivity and disk latency. Restart etcd if a leader election is stuck.

If: Scheduler logs show no events but pods are pending
→ Use: Check if scheduler leader is healthy. Restart kube-scheduler pod and verify it can connect to API Server.

If: Controller manager fails to reconcile resources like ReplicaSets
→ Use: Check the leader election lock (Lease or, on older releases, Endpoints annotation). Look for a holder identity that keeps changing or conflicts between replicas, a sign of split-brain.

If: Nodes fail to join the cluster
→ Use: Check cloud-controller-manager logs for API errors. Verify cloud provider credentials and permissions.

Component Interaction Flow: From kubectl to Running Pod

When you run kubectl apply -f deployment.yaml, a chain of events propagates through the control plane to the target node. Understanding this flow is essential for debugging latency and identifying where failures occur.

The sequence starts with kubectl sending a REST POST request to the API Server's /apis/apps/v1/namespaces/default/deployments endpoint. The API Server authenticates the request (via TLS certificates or bearer tokens), authorizes it against RBAC, then passes the object through a set of admission controllers (mutating and validating webhooks). If all passes, the API Server persists the Deployment object into etcd using a Raft write.

Once the write is committed, the API Server's watch mechanism notifies the Deployment controller (part of kube-controller-manager). The Deployment controller sees the new object and creates a ReplicaSet. This in turn triggers the ReplicaSet controller to create a Pod object. The Scheduler, watching for unscheduled Pods, picks a suitable node and updates the Pod with the node binding.

The API Server persists the binding and notifies the target node's kubelet via its watch. The kubelet receives the Pod spec and begins execution: it mounts volumes, asks the runtime via the CRI to create the pod sandbox (which configures networking via CNI), pulls the container image if it is not cached, starts the containers, and runs startup, liveness, and readiness probes. Once the pod is Ready and appears in the Service's EndpointSlices, kube-proxy updates iptables rules to route Service traffic to it.

The entire process typically takes 2–10 seconds for a small deployment, but can stretch to minutes with large images or slow webhooks. Each step introduces latency variables: admission webhook latency, etcd write latency, scheduler queue delay, image pull time, and CNI configuration time.

A common production pitfall: a slow admission webhook (e.g., 2 seconds per request) adds 2 seconds to every resource creation. If you create 100 pods during a rollout and the creations are serialized, that's up to 200 seconds of additional delay; concurrent creations shorten the wall-clock hit, but each request still pays the 2 seconds. Monitor apiserver_admission_webhook_admission_duration_seconds to catch this.

The diagram below visualizes the interaction sequence between components using a simplified mermaid sequence diagram.

Trace the Full Flow with Audit Logs

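Enable API Server audit logging by starting kube-apiserver with --audit-policy-file and --audit-log-path, then inspect the requestReceivedTimestamp and stageTimestamp fields of each event. You can correlate them with admission webhook durations, etcd round trips, and response times to pinpoint the slowest step in the pipeline.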

Production Insight

Every component in the flow is a potential bottleneck. The most common are admission webhook latency (which slows the API Server write path) and image pull time (on the kubelet). Use startup probes to decouple readiness from slow starts. Monitor scheduler queue depth and etcd fsync latency as leading indicators of slowdowns.

Key Takeaway

The kubectl apply flow is a multi-hop chain through control plane components. The slowest link determines the end-to-end latency. Profile each step in production using metrics and audit logs to identify bottlenecks.

[Figure: kubectl apply flow (sequence diagram)]

Nodes: The Worker Machines

A Node is a worker machine (VM or physical) where containers are run. Each node contains the services necessary to run Pods: the kubelet, the container runtime (e.g., containerd), and the kube-proxy.

The kubelet is the node's primary agent. It receives PodSpecs from the API Server (for pods assigned to its node) and ensures the described containers are running. It does this by talking to the container runtime via the Container Runtime Interface (CRI). The kubelet also runs liveness and readiness probes, mounts volumes, and reports node conditions like DiskPressure, MemoryPressure, and PIDPressure to the API Server.

Kube-proxy (runs as a DaemonSet) maintains network rules on the node. It watches Services and EndpointSlices and updates iptables or IPVS rules so traffic to a Service's ClusterIP is load‑balanced to the actual pods.

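Here's what most people miss: image pulls have a progress deadline. The kubelet's --image-pull-progress-deadline flag (default 1 minute) applied only to the Docker runtime and was removed together with dockershim in v1.24; on containerd the comparable timeout is set in the runtime's own configuration. If your image is large or the registry is slow, a pull that makes no progress within the deadline is cancelled and retried, creating a cycle that leaves pods in ImagePullBackOff. Set the deadline higher or use image streaming in production.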

Also, the kubelet's eviction logic uses soft and hard eviction thresholds. Hard eviction triggers immediate pod killing when exceeded, while soft eviction has a grace period. By default, evictionHard.memory.available is 100MiB — that's practically zero. Set it to 10% of node memory for predictable behavior.

One more thing: the kubelet's node status updates are sent to the API Server periodically (default 10 seconds). If the API Server is under high load or network is congested, the node may appear NotReady even though it's healthy. This is called 'node flapping' and is often a symptom of control plane load rather than node failure. Tune the node-status-update-frequency and node-monitor-grace-period accordingly.

Another detail: the kubelet's --max-pods default is 110, but that's a hard count, not a resource limit. A node may have free CPU/memory but hit this limit. In clusters running many sidecars, you can exhaust the pod slot quickly. Use --max-pods or the scheduling plugin to enforce a more appropriate cap based on your workload density.

And let's talk about system reserved resources. If you don't configure --system-reserved and --kube-reserved, the kubelet assumes all node resources are available for pods. But the operating system and the kubelet itself consume some. Without proper reservations, pods can steal resources from system daemons, leading to SSH timeouts or node instability. Always set --system-reserved=cpu=500m,memory=1Gi (adjust per node size) and enable eviction thresholds.
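The effect of reservations and eviction thresholds on schedulable memory can be worked through with simple arithmetic: allocatable is capacity minus kube-reserved, minus system-reserved, minus the hard eviction threshold. The sketch below uses a hypothetical helper name and example figures for a 16 GiB node; substitute your own values.

```bash
#!/bin/bash
# Sketch: memory the scheduler treats as allocatable on a node.
# allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold
# All values are in MiB; allocatable_mib is a hypothetical helper name.

# Usage: allocatable_mib <capacity> <kube_reserved> <system_reserved> <eviction_hard>
allocatable_mib() {
  echo $(( $1 - $2 - $3 - $4 ))
}

# Example node: 16 GiB capacity, 512 MiB kube-reserved, 1 GiB system-reserved,
# and the default 100 MiB hard eviction threshold.
allocatable_mib 16384 512 1024 100                 # prints 14748

# Same node with the eviction threshold raised to 10% of capacity.
allocatable_mib 16384 512 1024 $(( 16384 / 10 ))   # prints 13210

# Compare against what the node actually reports:
#   kubectl get node <node-name> -o jsonpath='{.status.allocatable.memory}'
```

The second result shows the trade-off: a safer eviction margin costs roughly 1.5 GiB of schedulable memory on this example node, which is usually cheaper than an unresponsive kubelet.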

io/thecodeforge/kubernetes/node_debug.sh (Bash)

#!/bin/bash
# Deep inspection of a Node's status

NODE_NAME="worker-node-1"

# 1. Node conditions and allocatable resources
kubectl describe node "$NODE_NAME" | grep -A 15 Conditions
echo "---"
kubectl describe node "$NODE_NAME" | grep -A 5 Allocatable

# 2. Check kubelet logs (run on the node)
journalctl -u kubelet --since "1 hour ago" --no-pager | grep -i "error\|fail"

# 3. Check container runtime (containerd example)
# The runc cgroup driver sits under the runtime options; the exact JSON path
# can differ between containerd versions, so inspect `crictl info` if this is null.
sudo crictl info | jq '.config.containerd.runtimes.runc.options.SystemdCgroup'
sudo crictl ps

# 4. Check kube-proxy iptables rules (run on node; Service rules live in the nat table)
sudo iptables -t nat -L -n | grep -E "KUBE-SERVICES|KUBE-SVC"

The kubelet: Node-Level Controller

Runs liveness and readiness probes.
Mounts volumes specified in the PodSpec.
Reports node conditions (MemoryPressure, DiskPressure).
Manages cgroups for resource isolation.

Production Insight

The kubelet's --max-pods (default 110) is a hard cap — new pods won't schedule even if CPU/RAM is free.

Resource reservations subtract from node capacity — misconfiguring causes overcommit.

Set evictionHard.memory.available explicitly; default 100MiB leads to silent OOM kills.

Node flapping is often API Server load, not node health — check control plane first.

System reserved resources protect node stability — define them or risk SSH failures.

Key Takeaway

Nodes are intelligent agents enforcing local state — not dumb workers.

Node failures are often local (disk pressure, kubelet crash, runtime hang) — debug with journalctl and crictl, not just kubectl.

Understand the kubelet's eviction thresholds and resource reservations to avoid pod evictions at scale.

System reserved resources are not optional — configure them early.

Node Not Ready Diagnosis

If: Node condition shows DiskPressure
→ Use: Free up disk space: remove unused images, increase node disk size, or implement image garbage collection thresholds.

If: Node condition shows MemoryPressure
→ Use: Evict low-priority pods or reduce memory limits. Check node allocatable memory.

If: Node condition shows PIDPressure
→ Use: Increase pids limit on the node or reduce the number of running containers.

If: Node condition shows NetworkUnavailable
→ Use: Check CNI plugin status. Restart CNI daemon and verify network interface configuration.

If: Node flapping between Ready and NotReady
→ Use: Raise node-monitor-grace-period and keep node-status-update-frequency well below it. Check API Server load.

Worker Node Component Reference Table

The following table provides a quick reference for the core components running on every worker node. This is useful when triaging node-level issues or validating node configurations during cluster upgrades.

Component	Description	Default Port	Log Location	Common Failure Modes
kubelet	Primary node agent; manages pods and reports node status	10250 (kubelet API), 10255 (read-only)	`journalctl -u kubelet`	OOM due to missing limits, disk pressure, stalled Docker socket (legacy)
kube-proxy	Network proxy; maintains iptables/IPVS rules for Services	10249 (metrics)	`journalctl -u kube-proxy`	iptables corruption (large clusters), IPVS mode fallback
containerd	Container runtime (default)	10010 (CRI streaming server; the CRI API itself uses a Unix socket)	`journalctl -u containerd`, `crictl logs`	Image pull timeout, storage driver issues (overlay2), dead containerd socket
CRI-O	Alternative container runtime (Red Hat)	10010 (CRI)	`journalctl -u crio`	Image pull timeout, conmon OOM, conmon vs runc mismatch
Calico (cni-calico)	CNI plugin providing network policies and BGP routing	9099 (felix metrics)	`/var/log/calico/cni/`	IPAM exhaustion, BGP peer failure, policy misconfiguration
Flannel (cni-flannel)	CNI plugin for simple overlay networking	–	`journalctl -u flanneld`	VXLAN MTU mismatch, subnet lease conflicts, no network policy support
Cilium (cni-cilium)	eBPF-based CNI with advanced observability	9090 (cilium-agent metrics), 9961 (hubble)	`cilium status` or `hubble observe`	Kernel version < 5.10, eBPF feature gaps, conflicting NetworkPolicies

Key insight: keep the container runtime the same across all nodes. Mixed runtimes (containerd on some, CRI-O on others) work but introduce subtle differences in CRI implementation, so always test in a staging cluster before rolling out.

Another common pitfall: the kubelet's --container-runtime-endpoint defaults to /run/containerd/containerd.sock. If containerd is restarted and the socket disappears temporarily, kubelet will fail to start new pods. Use a systemd socket activation or a health check that waits for the socket to appear before starting the kubelet.

Also note that kube-proxy in IPVS mode (set --proxy-mode=ipvs) scales better for large clusters but requires the IPVS kernel modules (ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh) to be loaded on the node; ipvsadm is only the userspace tool for inspecting the rules. Without the modules, kube-proxy falls back to iptables mode, which can be a surprise if you've tuned for IPVS performance.
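One way to catch the silent fallback is to check for the modules explicitly after every kernel upgrade. The sketch below reads lsmod output on stdin and prints any of the four modules that are missing; the function name is hypothetical, and it assumes the modules are loadable rather than compiled into the kernel (built-in modules do not appear in lsmod).

```bash
#!/bin/bash
# Sketch: report which IPVS kernel modules are not loaded on a node.
# missing_ipvs_modules is a hypothetical helper; it reads `lsmod` output on stdin.

missing_ipvs_modules() {
  local loaded required mod
  # The first column of lsmod output is the module name.
  loaded=$(awk '{print $1}')
  required="ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh"
  for mod in $required; do
    # -x matches the whole line, so ip_vs does not match ip_vs_rr.
    echo "$loaded" | grep -qx "$mod" || echo "$mod"
  done
}

# On a node:
#   lsmod | missing_ipvs_modules
# Load anything it prints with:
#   sudo modprobe <module>
```

An empty result means all four modules are present; wiring this into node bootstrap lets you fail fast instead of discovering iptables mode under load.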

io/thecodeforge/kubernetes/check_node_components.sh (Bash)

#!/bin/bash
# Verify node component versions

# kubelet version
kubelet --version 2>/dev/null || journalctl -u kubelet | grep 'Version'

# container runtime version
crictl version

# kube-proxy version (from pod)
kubectl get pod -n kube-system -l k8s-app=kube-proxy -o jsonpath='{.items[0].spec.containers[0].image}'

# CNI version (calico example)
kubectl get daemonset -n kube-system calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'

# Loaded kernel modules for IPVS
lsmod | grep -E 'ip_vs|ip_vs_rr|ip_vs_wrr|ip_vs_sh'

Standardize Node Images

Pin kubelet, kube-proxy, and container runtime versions to cluster release.
Pre-pull images used by core components (CNI, kube-proxy) during node bootstrapping.
Validate node components weekly against a baseline – any mismatch should trigger an alert.

Production Insight

Standardize node images with pinned component versions to avoid silent drift-related failures.

Container runtime socket availability is a startup dependency – implement wait-for-socket logic in systemd.

kube-proxy in IPVS mode requires kernel modules – verify on each node after kernel upgrades.

CNI plugin IPAM exhaustion is a leading cause of new pods stuck in ContainerCreating – monitor IP pool usage per node.

Mixed container runtimes between nodes work but create subtle CRI differences – test combinations before production rollout.

Key Takeaway

Worker nodes host multiple cooperating system components: kubelet, kube-proxy, container runtime, and CNI plugin. Each has specific port, log, and failure mode characteristics. Use this reference table to accelerate node-level troubleshooting and ensure consistent configurations across the cluster.

etcd: The Cluster's Source of Truth

etcd is a distributed, consistent key-value store that powers Kubernetes. It uses the Raft consensus algorithm to achieve strong consistency across a cluster of members (typically 3 or 5). All cluster state—Pods, ConfigMaps, Secrets, Deployments, RBAC policy—is stored in etcd. The API Server is the only component that writes to etcd; all other components watch the API Server for changes.

The key insight: etcd is a replicated log, not a traditional database. Every write is appended to a log and only committed when a majority of members (quorum) acknowledge it. This design ensures strong consistency but makes performance heavily dependent on disk I/O latency for fsync operations. If one etcd member's disk is slow, the entire cluster's write throughput suffers.
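The majority rule can be made concrete with a few lines of arithmetic. This sketch (helper names are illustrative) prints the quorum size and the number of member failures each cluster size tolerates.

```bash
#!/bin/bash
# Sketch: Raft quorum arithmetic for an etcd cluster of n members.
# quorum_size and fault_tolerance are hypothetical helper names.

# Smallest majority of n members.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority remains.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
# members=1 quorum=1 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=4 quorum=3 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

Note that four members tolerate no more failures than three while adding one more disk that every write may wait on, which is why odd cluster sizes are the norm.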

Another thing: etcd's default --snapshot-count is 100,000. After that many changes, etcd takes a snapshot and compacts the log. On slow disks, this snapshot can spike latency and cause temporary leader election issues. Tune this value down (e.g., 50000) on clusters with frequent writes.

Less obvious: etcd's database file grows even after compaction because old data is freed but not returned to the OS. You must run etcdctl defrag periodically to reclaim space. Skipping this leads to quota limit errors (mvcc: database space exceeded) that lock the cluster.
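A sketch of what a periodic reclaim job might look like. The function names and endpoint addresses are hypothetical, and it assumes etcdctl v3 with TLS settings supplied through the ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY environment variables. The usage figure assumes the 2 GiB default backend quota.

```bash
#!/bin/bash
# Sketch: track etcd quota usage and reclaim space after compaction.
# Function names are hypothetical; TLS settings are assumed to come from
# ETCDCTL_CACERT / ETCDCTL_CERT / ETCDCTL_KEY.

# Percent of the backend quota in use.
# Usage: db_usage_percent <db_bytes> <quota_bytes>
db_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

# Defragment one member at a time; defrag blocks the member while it runs.
# Usage: defrag_members https://10.0.0.1:2379 https://10.0.0.2:2379 ...
defrag_members() {
  local ep
  for ep in "$@"; do
    etcdctl --endpoints="$ep" defrag || return 1
  done
  # A NOSPACE alarm raised earlier stays active until cleared explicitly:
  etcdctl --endpoints="$1" alarm list
  #   etcdctl --endpoints="$1" alarm disarm
}

# Example: a 1.5 GiB database against a 2 GiB quota.
db_usage_percent 1610612736 2147483648   # prints 75
```

Running members sequentially keeps quorum intact while each one is briefly unavailable; defragmenting the leader last avoids an unnecessary leadership change early in the run.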

Also, consider the impact of network latency between etcd members. Raft heartbeats are sent every 100ms by default. If round-trip time exceeds 50ms, you risk false leader elections. In multi-datacenter setups, place etcd members close together or tune heartbeat-interval to account for latency.
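As a rough starting point for tuning, the sketch below derives the two etcd flags from a measured peer round-trip time. It encodes a rule of thumb, assumed here rather than taken from this article: keep the heartbeat interval near the peer RTT (never below the 100ms default) and set the election timeout to about ten times the heartbeat. The function name is hypothetical.

```bash
#!/bin/bash
# Sketch: suggest etcd Raft timing flags from peer round-trip time.
# Rule of thumb (an assumption): heartbeat ~ peer RTT, floor of 100ms;
# election timeout ~ 10x heartbeat. Values are in milliseconds.

# Usage: suggest_raft_timing <peer_rtt_ms>
suggest_raft_timing() {
  local rtt="$1" heartbeat election
  heartbeat=$(( rtt > 100 ? rtt : 100 ))
  election=$(( heartbeat * 10 ))
  echo "--heartbeat-interval=${heartbeat} --election-timeout=${election}"
}

suggest_raft_timing 20    # prints --heartbeat-interval=100 --election-timeout=1000
suggest_raft_timing 150   # prints --heartbeat-interval=150 --election-timeout=1500
```

Longer timeouts reduce false elections but also lengthen the time it takes to detect a genuinely dead leader, so measure RTT between members before changing either value, and set both flags identically on every member.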

A production scenario: during a large-scale cluster upgrade, etcd writes spike as many resources are updated simultaneously. If you haven't tuned --snapshot-count or --quota-backend-bytes, the WAL compaction can cause fsync storms. One team saw a 5-second write latency every 2 minutes during an upgrade, causing repeated leader elections. The fix: increase --quota-backend-bytes to 16GB and set --auto-compaction-retention=1h to avoid bursts.

And the silent killer: clock skew. Raft relies on election timeouts that are based on monotonic clocks. If two etcd nodes have clock drift greater than the election timeout, they may both think the leader has timed out and start new elections, causing a split that can lead to quorum loss. Use chrony or ntpd with reliable upstream time sources and monitor clock offset across etcd members.

io/thecodeforge/kubernetes/etcd_health.sh (Bash)

#!/bin/bash
# etcd health and performance checks

# 1. Endpoint health
etcdctl endpoint health --cluster -w table

# 2. Check member list and leader
etcdctl endpoint status --cluster -w table

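# 3. Measure WAL fsync latency (run on etcd node)
# Using etcd's built-in metrics. The histogram buckets show the latency
# distribution; the _sum alone does not. Adjust the URL to your setup:
# kubeadm serves plain-HTTP metrics on 127.0.0.1:2381, and the client
# port 2379 normally requires TLS client certificates.
curl -s http://localhost:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds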

# 4. Database size and quota
sudo du -sh /var/lib/etcd/member/snap/db
echo "quota-backend-bytes: $(ps aux | grep etcd | grep -oP 'quota-backend-bytes=\K[^ ]+')"

# 5. Backup snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db

Quorum Loss is Catastrophic

If you lose a majority of etcd members (e.g., 2 out of 3), the cluster cannot commit new writes. The API Server is effectively read-only at best: no pods can be created, updated, or deleted. If the lost members cannot be brought back, the only recovery is to restore from a snapshot, which may lose minutes of state. Always run at least 3 etcd members on separate hosts and monitor clock skew and disk latency.

Production Insight

Tune heartbeat-interval and election-timeout based on network RTT; defaults cause false leader elections in high-latency environments.

etcd defragmentation is necessary but blocking — schedule it during maintenance windows.

Never deploy etcd on shared nodes; a burst in I/O from other workloads can cause fsync latency spikes that trigger leader elections.

Network latency between etcd members must be under 50ms to avoid false elections.

Snapshot compaction during upgrades can cause fsync storms — pre-tune quota and compaction retention.

Clock skew between etcd members can cause split-brain — monitor with NTP and keep drift under 10ms.

Key Takeaway

etcd's disk I/O, not CPU or memory, determines cluster performance and stability.

Run it on dedicated SSDs, monitor fsync latency, backup snapshots regularly.

Learn to restore from backup before you need it — quorum loss is catastrophic.

Clock skew is an overlooked source of election storms — enforce NTP.

etcd Troubleshooting

If: etcd member unreachable
→ Use: Check network connectivity, process status, and disk latency. If member is dead and cannot be recovered, remove it and re-add a new member.

If: etcd database size > 8GB or approaching quota-backend-bytes
→ Use: Run defragmentation. If the database is growing abnormally, check for excessive event history or leaked resources (e.g. too many ConfigMaps).

If: Leader election storms (frequent leader changes)
→ Use: Check disk latency with iostat and network latency between etcd members. Also check clock skew. Increase heartbeat-interval and election-timeout proportionally.

The Scheduler: How Pods Are Assigned to Nodes

The Kubernetes scheduler (kube-scheduler) is responsible for assigning unscheduled Pods to appropriate Nodes. It does this via a two-phase pipeline: Filtering (Predicates) and Scoring (Priorities).

In the filtering phase, the scheduler selects nodes that meet the pod's hard constraints: resource requests, nodeSelector, node affinity, taints/tolerations, topology constraints (e.g., pod anti-affinity). Node conditions like DiskPressure or MemoryPressure also cause the node to be filtered out.

In the scoring phase, the scheduler ranks feasible nodes based on priority functions: spread pods across zones, minimize resource fragmentation (balanced allocation), cluster autoscaler preferences, and user-defined custom scores. The node with the highest score gets the pod.

If no node passes the filtering phase, the pod remains Pending. The scheduler emits events that can be inspected via kubectl describe pod or scheduler logs.

A lesser-known nuance: the scheduler does not re-evaluate decisions for already scheduled pods. If a node becomes overcommitted after scheduling, the pod stays there—it won't be rescheduled. That's the kubelet's job (eviction), not the scheduler's.

Another performance detail: the scheduler's --kube-api-qps defaults to 50. If you have hundreds of nodes and thousands of pods being created rapidly (e.g., during a scaling event), the scheduler may fall behind. Increase this value but watch API Server load.

Another subtlety: the scheduler uses a scheduler cache of node information to avoid hitting the API Server for every pod. But this cache can become stale if nodes update frequently. In extreme cases, the scheduler may try to place a pod on a node that no longer exists or has changed. This manifests as a pod stuck in 'PodScheduled' condition with the event 'node does not exist'. The scheduler eventually retries and the cache refreshes, but you can force this by restarting the scheduler pod.

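One more: the scheduler's --percentage-of-nodes-to-score (percentageOfNodesToScore in KubeSchedulerConfiguration) is adaptive by default. When left unset (0), the scheduler uses roughly 50% for a 100-node cluster and scales down linearly to 10% at 5,000 nodes, with a floor of 5%. This is a performance optimization: the scheduler stops searching once it has found enough feasible nodes and scores only those. But it can lead to suboptimal placements if that subset doesn't include the best node. In latency-sensitive deployments, consider setting it to 100 so every feasible node is scored, at the cost of scheduling latency.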
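To get a feel for how many nodes the scheduler actually considers, here is a rough sketch of the adaptive calculation. It assumes the linear formula from the upstream scheduler tuning documentation (50% at 100 nodes, 10% at 5,000); the 100-node minimum and 5% floor are assumptions based on current upstream behaviour and may differ in your version. The function name `nodes_to_find` is hypothetical.

```shell
#!/bin/sh
# Estimate how many feasible nodes the scheduler looks for before scoring.
# usage: nodes_to_find <total_nodes> [configured_percentage]
nodes_to_find() (
  total=$1
  pct=${2:-0}      # 0 means "use the adaptive default"
  min_nodes=100    # assumed minimum number of nodes to consider
  min_pct=5        # assumed floor for the adaptive percentage
  if [ "$total" -lt "$min_nodes" ] || [ "$pct" -ge 100 ]; then
    echo "$total"  # small cluster or 100%: every node is considered
    exit 0
  fi
  if [ "$pct" -le 0 ]; then
    pct=$((50 - total / 125))
    if [ "$pct" -lt "$min_pct" ]; then pct=$min_pct; fi
  fi
  found=$((total * pct / 100))
  if [ "$found" -lt "$min_nodes" ]; then found=$min_nodes; fi
  echo "$found"
)

nodes_to_find 1000       # adaptive default on a 1,000-node cluster
nodes_to_find 1000 100   # forcing 100%
```

On a 1,000-node cluster the adaptive default considers well under half the nodes, which is why the "best" node can be missed.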

And let's talk about pod priority and preemption. If you have pods with different priority classes, the scheduler can preempt lower-priority pods to make room for higher-priority ones. But preemption is not instant: the scheduler must evict the lower-priority pods and wait for the kubelet to terminate them, which can take as long as the victims' terminationGracePeriodSeconds (30 seconds by default). During that window, the high-priority pod remains Pending. If your application requires rapid recovery, design your priority classes with realistic timeout expectations.

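io/thecodeforge/kubernetes/scheduler_debug.sh (BASH)

#!/bin/bash
# Debugging scheduler behaviour

# 1. Check scheduler logs. kube-scheduler usually runs as a static pod, not a
#    Deployment; the component=kube-scheduler label is the kubeadm convention.
kubectl logs -n kube-system -l component=kube-scheduler --tail 50

# 2. Check pending pod's scheduler events
POD_NAME="my-pod"
kubectl describe pod "$POD_NAME" | grep -A 10 Events

# 3. Show nodes with taints and allocatable resources
kubectl get nodes -o custom-columns='NAME:.metadata.name,ALLOCATABLE_CPU:.status.allocatable.cpu,ALLOCATABLE_MEM:.status.allocatable.memory,TAINTS:.spec.taints[*].key'

# 4. Check scheduler metrics for queue depth. The scheduler serves its own
#    /metrics on port 10259; kubectl get --raw /metrics returns API Server metrics.
#    Run on a control plane node; TOKEN must hold a bearer token allowed to GET /metrics.
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10259/metrics \
  | grep -E 'scheduler_pending_pods|scheduler_queue_incoming_pods_total'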

Scheduler queue depth is a leading indicator

If the scheduler queue backs up, it often means the scheduler cannot keep up with pod creation rates, or filtering/scoring is slow due to high node counts. Monitor scheduler_pending_pods (current queue depth) alongside scheduler_queue_incoming_pods_total (arrival rate), then adjust --kube-api-qps or reduce the number of nodes scored per pod.

Production Insight

The scheduler's default topology spreading is only a soft preference (whenUnsatisfiable: ScheduleAnyway), so in multi-zone clusters pods can still pack into one zone unless you set explicit constraints.

Resource fragmentation occurs when requests don't match limits — overcommitted nodes cause CPU throttling.

podAntiAffinity with requiredDuringScheduling can render a cluster unschedulable — prefer preferredDuringScheduling.

Scheduler cache can become stale — restart scheduler pod if pods are stuck on non-existent nodes.

--percentage-of-nodes-to-score default can skip optimal nodes — set to 100% for latency-sensitive workloads.

Pod preemption can take 30+ seconds — design priority classes with realistic expectations.

Key Takeaway

The scheduler filters then scores — if pods stay Pending, the filter phase is failing. Check kubectl describe pod for the reason.

In large clusters, tune --percentage-of-nodes-to-score to reduce scheduling latency.

Always use topologySpreadConstraints in the pod spec to optimize cost and resilience across zones.

Pod preemption is not instant — factor in eviction grace periods.

Pod Stuck in Pending

If: No nodes match resource requests
→ Use: Check kubectl describe pod for insufficient CPU/memory errors. Either increase cluster capacity or reduce requests.

If: Pod has node selector or affinity that no node satisfies
→ Use: Verify node labels. If using availability zones, check zonal capacity.

If: Taints and tolerations mismatch
→ Use: Check node taints and pod tolerations. Adjust tolerations or remove taint if appropriate.

If: All nodes have pod anti-affinity conflicts
→ Use: Review anti-affinity rules. Reduce requiredDuringScheduling to preferredDuringScheduling if possible.

If: Pod is stuck with 'node does not exist' event
→ Use: Restart the scheduler pod to refresh its cache. Check for node lifecycle events that may have caused stale data.
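The first four rules can be turned into a quick triage helper. This is a minimal sketch that pattern-matches the FailedScheduling message shown by kubectl describe pod; the function name `pending_hint` is hypothetical, and the message fragments are assumptions based on recent Kubernetes releases, so the wording may differ on your cluster.

```shell
#!/bin/sh
# Map a FailedScheduling event message to the first matching triage hint.
# A message can list several reasons; only the first match is reported.
# usage: pending_hint "<event message>"
pending_hint() (
  case $1 in
    *Insufficient*)             echo "capacity: add nodes or lower requests" ;;
    *"untolerated taint"*)      echo "taints: add a toleration or remove the taint" ;;
    *"node affinity/selector"*) echo "labels: fix nodeSelector/affinity or label nodes" ;;
    *"anti-affinity"*)          echo "anti-affinity: relax required rules to preferred" ;;
    *)                          echo "unknown: read the full event text" ;;
  esac
)

pending_hint "0/3 nodes are available: 3 Insufficient cpu."
```

In practice you would feed it the message column from `kubectl get events --field-selector reason=FailedScheduling`.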

Kubernetes Networking: CNI, Services, and kube-proxy

Kubernetes assumes a flat network where every Pod can communicate with every other Pod without NAT, across nodes. This is achieved through the Container Network Interface (CNI), a plugin-based layer that configures network interfaces and routes on each node.

Each pod gets its own IP address (IP-per-pod model). CNI plugins like Calico, Flannel, Weave, or Cilium set up virtual interfaces and routing rules to enable cross-node pod-to-pod communication. Services abstract pod IPs and provide stable virtual IPs (ClusterIP) for pod discovery. kube-proxy watches Services and EndpointSlices and programs iptables or IPVS rules to forward traffic to the correct pods.

DNS resolution: Kubernetes DNS (CoreDNS) serves A/AAAA records for Services. Pods can resolve service names to ClusterIPs, enabling simple service discovery.

One common misconfiguration: kube-proxy --cluster-cidr must match your pod CIDR range. kube-proxy uses it to tell in-cluster traffic from external traffic, so a mismatch can cause traffic to be masqueraded incorrectly. Also remember how NetworkPolicy defaults work: a pod that no policy selects accepts all traffic, but as soon as any policy selects it for a direction (ingress or egress), everything not explicitly allowed in that direction is denied. Calico enforces Kubernetes NetworkPolicy with these same semantics and adds its own policy resources on top.

Choosing the right CNI matters: Calico offers rich network policies and BGP-based routing, Flannel is simpler but lacks policy support, Cilium uses eBPF for high performance. Evaluate based on your scale and security requirements.

Also, when using Cilium with eBPF, the kube-proxy can be completely removed (kube-proxy replacement). This reduces iptables overhead and improves performance at scale. But it requires careful validation of network policies and service routing, as Cilium's implementation may differ from standard kube-proxy in edge cases like externalTrafficPolicy.

A common production pitfall: IP address management (IPAM) exhaustion. If your CNI allocates pod IPs from a fixed CIDR and you have many pods terminating and starting, the IP pool can fragment. Calico uses a block-based approach that mitigates this, but Flannel's default allocation can lead to rapid exhaustion. Monitor IP utilization metrics from your CNI plugin.

And don't forget about MTU issues. If your CNI's overlay network uses encapsulation (e.g., VXLAN with 50 bytes overhead), and your underlying network has a standard 1500 MTU, the effective MTU for pods is 1450. If your application sends large packets that require fragmentation, you may see degraded performance or timeouts. Set the MTU on your CNI config to account for encapsulation overhead, or use direct-routing mode (e.g., Calico with BGP) to avoid encapsulation.
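The MTU arithmetic above is easy to script. This sketch computes the pod-side MTU and the largest ICMP payload that should pass without fragmentation (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header). The function names are hypothetical, and the 50-byte figure is the VXLAN overhead quoted in the text; other encapsulations differ.

```shell
#!/bin/sh
# usage: pod_mtu <underlay_mtu> <encap_overhead_bytes>
pod_mtu() ( echo $(( $1 - $2 )) )

# Largest ICMP payload that fits in one IPv4 packet of the given MTU.
# usage: ping_payload <mtu>
ping_payload() ( echo $(( $1 - 28 )) )

mtu=$(pod_mtu 1500 50)
echo "pod MTU: $mtu"
echo "test payload: $(ping_payload "$mtu")"
```

From inside a pod, a ping with fragmentation prohibited (for example `ping -M do -s <payload> <other-pod-ip>` with Linux iputils) should succeed at that payload size and fail one byte above it if the MTU is configured correctly.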

io/thecodeforge/kubernetes/network_debug.sh (BASH)

#!/bin/bash
# Debugging pod networking
# Set these first (example values); bare <placeholders> would be parsed by bash as redirections.
POD_A="pod-a"; POD_B_IP="10.244.1.23"; SERVICE="my-service"

# 1. Check pod IP and node
kubectl get pod "$POD_A" -o wide

# 2. Verify connectivity between pods (requires curl in the container image)
kubectl exec "$POD_A" -- curl -m 2 "$POD_B_IP"
# If it fails, check CNI policies and network policy

# 3. Check service endpoints
kubectl describe service "$SERVICE"

# 4. Check kube-proxy rules on the node
sudo iptables-save | grep -i "KUBE-SVC" | head -20

# 5. Check DNS resolution inside a pod
kubectl exec "$POD_A" -- nslookup kubernetes.default.svc.cluster.local

NetworkPolicy Gotchas

NetworkPolicies are enforced by the CNI plugin. If you apply a default-deny policy that blocks egress but do not allow DNS, pods can no longer reach CoreDNS and service name resolution fails. Always allow traffic to CoreDNS (port 53 UDP/TCP) in your default deny policies.
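A DNS allowance to pair with a default-deny policy might look like the sketch below, which prints a manifest for a given namespace. It assumes CoreDNS runs in kube-system with the conventional k8s-app: kube-dns label; the function name `dns_egress_policy` and the policy name allow-dns-egress are hypothetical. Clusters that use NodeLocal DNSCache need a different allowance.

```shell
#!/bin/sh
# Print a NetworkPolicy that lets every pod in a namespace reach CoreDNS.
# usage: dns_egress_policy <namespace>
dns_egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $1
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - { protocol: UDP, port: 53 }
    - { protocol: TCP, port: 53 }
EOF
}

dns_egress_policy demo
```

Pipe the output to `kubectl apply -f -` once you have reviewed it. Because NetworkPolicies are additive, this can sit alongside a deny-all policy in the same namespace.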

Production Insight

Most common networking failure is CNI IPAM exhaustion — pods stay in ContainerCreating. Monitor IP pool usage.

iptables-based kube-proxy degrades with thousands of services — switch to IPVS mode for better scalability.

NetworkPolicy misconfigurations often block DNS traffic — always allow CoreDNS (port 53) in deny policies.

Cilium eBPF mode removes kube-proxy but requires validation of external traffic policies.

IPAM fragmentation can exhaust pod CIDR even when total utilization is low — use block-based CNI like Calico.

MTU mismatches in overlay networks cause packet fragmentation — set CNI MTU to account for encapsulation overhead.

Key Takeaway

Kubernetes networking is CNI-driven — the IP-per-pod model simplifies routing but makes troubleshooting complex.

Always check CoreDNS first for DNS issues; it's the most common point of failure.

Never apply a blanket deny NetworkPolicy without allowing DNS on port 53: every pod the policy selects loses name resolution.

MTU matters — test with large packets to catch fragmentation issues early.

Pod Connectivity Issues

If: Pod cannot reach another pod's IP
→ Use: Check if they are on the same node: if yes, check CNI bridge/overlay. If different nodes, check node-to-node routing and firewall rules.

If: Pod cannot reach a Service DNS name
→ Use: Check CoreDNS pod status and logs. Verify Service exists and EndpointSlices are populated. Check NetworkPolicy allowing DNS traffic.

If: External traffic not reaching service
→ Use: Verify Service type LoadBalancer/NodePort, check cloud load balancer health, and ensure node security groups allow traffic.

CNI Plugins: Calico and Flannel in Kubernetes Architecture

The Container Network Interface (CNI) plugin is a critical architectural component that determines how pods communicate within and across nodes. Two of the most widely used CNI plugins are Calico and Flannel. Each implements the Kubernetes networking model differently, with distinct trade-offs in complexity, security, performance, and scale.

Calico uses a pure Layer 3 approach by default, routing pod traffic using BGP (Border Gateway Protocol) without needing overlays. This eliminates encapsulation overhead (no VXLAN/IPSEC) and allows pod-to-pod packets to be forwarded at near wire speed. Calico also implements rich network policies using iptables or eBPF, supporting granular ingress/egress rules, namespace isolation, and dynamic policy enforcement. It includes its own IPAM (IP Address Management) using block-based allocation, which reduces fragmentation.

Flannel uses an overlay network (most commonly VXLAN) to encapsulate pod traffic. It is simpler to deploy and requires no BGP infrastructure or complex routing configuration. Flannel provides a flat network where every pod gets a unique IP, but it does not support Kubernetes NetworkPolicy natively—meaning no firewall rules between pods unless you combine it with a separate policy engine like Calico or Cilium. Flannel's IPAM is simpler and can exhaust IPs faster under high pod churn.

When to choose which? For production clusters requiring network policies, multi-tenancy, and high throughput (e.g., financial services, large e-commerce), Calico is the recommended choice. For small dev/test clusters, or teams new to Kubernetes where simplicity is paramount, Flannel suffices. Many production deployments run Calico in BGP mode with eBPF acceleration for best performance.

Common issues with Calico: BGP peer misconfiguration causes routing failures; calico-node pod crash due to missing kernel modules (e.g., ip_tables, nf_conntrack); IPAM pool exhaustion leading to pods stuck in ContainerCreating. For Flannel: VXLAN MTU mismatch (set to 1450 if underlying network MTU is 1500); subnet lease conflicts when nodes are recreated quickly.

Performance consideration: Calico's iptables-based policy enforcement can become a bottleneck at scale (thousands of policies). Switch to eBPF mode (requires Linux kernel >= 5.10) for better throughput. Flannel's VXLAN incurs a 50-byte overhead per packet—increase MTU on the host interface to compensate.

The diagram below illustrates how a packet travels from one pod to another across nodes using a Calico BGP route versus a Flannel VXLAN tunnel.

Combine Calico for Policies with Flannel for Simplicity?

It is possible to use Flannel for the overlay network and Calico for network policies (Calico's 'policy-only' mode). This gives you the simplicity of Flannel's IPAM and the security of Calico policies. However, you lose Calico's BGP routing and eBPF performance features. Evaluate whether the complexity trade-off is worth it.

Production Insight

Choose Calico for any cluster requiring network policies, multi-tenancy, or high throughput. Flannel is suitable for small dev/test clusters. If using Calico, monitor IP pool usage and BGP peer health. For Flannel, validate MTU settings and subnet lease timeouts. Always test CNI plugin upgrades in a staging environment before production rollout.

Key Takeaway

Calico and Flannel represent two ends of the CNI spectrum: Calico offers rich policies and high performance via BGP/eBPF, while Flannel prioritizes simplicity via VXLAN overlays. Choose based on your security and scaling requirements, not just familiarity. IPAM exhaustion is a common failure mode for both—monitor early.

Figure: CNI packet flow comparison (Calico BGP route vs. Flannel VXLAN tunnel)

API Server: The Gateway to the Cluster

The API Server is the frontend of the control plane and the only component that directly interacts with etcd. Every kubectl command, controller watch, and component communication goes through it. Understanding its request flow is critical for debugging and performance tuning.

The API Server authenticates the request (via client certificates, bearer tokens, or OIDC), authorizes it against RBAC policies, and then passes it through admission controllers (mutating and validating) before persisting to etcd. Admission controllers can modify or reject resources — this is where PodSecurity admission, resource quota enforcement, and custom webhooks run.

What most engineers miss: admission webhooks add latency to every API request. A slow webhook can increase API Server response time from milliseconds to seconds, impacting the entire cluster. Monitor apiserver_admission_webhook_admission_duration_seconds for outliers. Also, the API Server caches responses for watch requests, but the cache size is limited. Large clusters with many watches can cause cache thrashing and increased etcd read load.

Another overlooked metric: apiserver_request_duration_seconds with a high 99th percentile indicates either webhook latency or etcd slowness. Correlate with etcd metrics to pinpoint the bottleneck.

Another scenario: on very large clusters with frequent resource updates, the API Server's watch cache can fall behind. When a watch request fails with 'too old resource version' (HTTP 410 Gone), the client must re-list all resources. This can cause cascading failures as controllers re-sync and generate additional load. Mitigate by increasing the kube-apiserver's --watch-cache-sizes for the affected resources (--watch-cache itself is only a boolean that turns the cache on or off).

One more: the API Server's --max-requests-inflight and --max-mutating-requests-inflight defaults can be too low for clusters with many controllers or automation. If you see 429 Too Many Requests errors in logs, tune these up gradually while monitoring memory usage. Each inflight request consumes memory for the request context, so increasing them too aggressively can cause OOM.
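Before raising the limits, it helps to know how close you already are. The sketch below reads Prometheus text from stdin and reports inflight usage as a percentage of a limit you pass in. It assumes the apiserver_current_inflight_requests gauge carries a request_kind label with values mutating and readOnly; the function name `inflight_pct` is hypothetical.

```shell
#!/bin/sh
# Report current inflight requests as a percentage of the configured limit.
# usage: <metrics text> | inflight_pct <mutating|readOnly> <limit>
inflight_pct() {
  awk -v kind="$1" -v limit="$2" '
    $1 ~ /^apiserver_current_inflight_requests/ &&
    index($0, "request_kind=\"" kind "\"") {
      printf "%d\n", ($NF * 100) / limit
    }'
}

# Example with a captured sample line instead of a live cluster
printf '%s\n' 'apiserver_current_inflight_requests{request_kind="mutating"} 180' \
  | inflight_pct mutating 200
```

Against a live cluster you would run `kubectl get --raw /metrics | inflight_pct mutating 200`, substituting your actual --max-mutating-requests-inflight value. Sustained readings near 100 are the signal to tune, not a single spike.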

And here's something about watch timeouts: watch connections are long-lived. The API Server gives each watch a randomized timeout above --min-request-timeout (default 1800 seconds), so a watch whose client (e.g., a controller) vanished without the server noticing can hold its goroutine and buffers until that timeout fires. In clusters with many controllers, these orphaned watches can accumulate and consume significant memory. Lowering --min-request-timeout recycles watches sooner, at the cost of more frequent reconnects from healthy clients.

io/thecodeforge/kubernetes/apiserver_metrics.sh (BASH)

#!/bin/bash
# Check API Server metrics

# 1. Request counts by verb
kubectl get --raw=/metrics | grep apiserver_request_total | grep -v '#'

# 2. Admission webhook latencies
kubectl get --raw=/metrics | grep apiserver_admission_webhook_admission_duration_seconds_sum

# 3. Inflight requests
kubectl get --raw=/metrics | grep apiserver_current_inflight_requests

# 4. Watch cache performance (metric names vary by version, so match the whole family)
kubectl get --raw=/metrics | grep watch_cache | grep -v '#'

Admission Webhook Latency

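If you have custom admission webhooks, validate that they complete well within their timeoutSeconds (default 10 seconds, maximum 30). A slow webhook delays every API request it matches. Use webhook.Server with timeouts. Setting failurePolicy: Fail prevents silent bypass in production, but it also means an unavailable webhook rejects every matching request, so scope it with a namespaceSelector and keep the webhook highly available.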

Production Insight

Admission webhooks can become the single bottleneck if not tuned — monitor latencies and set timeouts.

The API Server's watch cache can go stale quickly under write-heavy loads: expect 410 Gone errors with 'too old resource version'.

Never run admission webhooks without Circuit Breaker patterns — a slow webhook can bring down the entire API Server.

Watch cache inconsistencies on large clusters cause cascading re-list storms — tune watch cache sizes.

Inflight request limits cause 429 errors — gradually increase --max-requests-inflight while monitoring memory.

Orphaned watch connections consume memory: watch lifetimes are bounded by --min-request-timeout, so tune it deliberately.

Key Takeaway

The API Server is the heartbeat of the cluster.

Admission webhooks and etcd latency are the two most common performance killers.

Always monitor API Server metrics before blaming network issues.

Watch timeout and orphaned connections can silently degrade performance — clean them up.

API Server Response Issues

If: kubectl commands return 'Connection refused'
→ Use: Check API Server pod status and load balancer health. Ensure the API Server process is running and reachable.

If: kubectl commands return 'timeout' after 30s
→ Use: Check etcd health and admission webhook latencies. Use the metrics endpoint to identify whether it's an etcd issue or webhook slowdown.

If: API Server returns 429 Too Many Requests
→ Use: Increase --max-mutating-requests-inflight and --max-requests-inflight or reduce concurrency of automated kubectl usage.

If: Watch request fails with 'too old resource version'
→ Use: Increase --watch-cache-sizes for the affected resource types. Consider using a higher resource version tolerance in client code.

CoreDNS and Service Discovery: The Cluster's Internal DNS

CoreDNS is the default DNS resolver for Kubernetes clusters. It runs as a deployment in the kube-system namespace and watches Services and EndpointSlices to provide name resolution for service names. Every pod is configured to use CoreDNS via /etc/resolv.conf — typically at the cluster IP of the CoreDNS service (like 10.96.0.10).

CoreDNS uses plugins to achieve its functionality. The kubernetes plugin handles Kubernetes service records. The forward plugin forwards external DNS queries to upstream resolvers. The loop plugin detects forwarding loops. The log plugin enables query logging for debugging.

A common misconfiguration: not setting resource limits for CoreDNS. Under heavy query load, CoreDNS can become a bottleneck. We've seen clusters where a single CoreDNS pod crashed due to memory limits, causing intermittent DNS failures across the cluster. Monitor CoreDNS's memory and CPU — set requests and limits based on cluster size.

Another subtlety: the ndots configuration in pod DNS policy. By default, /etc/resolv.conf sets ndots:5. This means if a name has fewer than 5 dots, the resolver first tries appending each cluster search domain before making the absolute query. Short service names benefit from this (my-service resolves on the first try as my-service.default.svc.cluster.local), but external names pay for it: api.example.com is first tried as api.example.com.default.svc.cluster.local, api.example.com.svc.cluster.local, and api.example.com.cluster.local, each returning NXDOMAIN, before the real query goes out. For high-traffic applications that call external hosts, lower ndots (for example to 1) via dnsConfig.options in the pod spec, or use fully qualified names with a trailing dot.
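To see the query order for a given name, the sketch below mimics the search-list logic. It assumes glibc-style resolver behaviour (names with at least ndots dots are tried as absolute first, others last); other resolver libraries such as musl can behave differently. The function name `dns_candidates` is hypothetical.

```shell
#!/bin/sh
# Print the names a resolver would try, in order, one per line.
# usage: dns_candidates <name> <ndots> <search-domain>...
dns_candidates() (
  name=$1; ndots=$2; shift 2
  # A trailing dot marks the name as absolute: no search list is applied.
  case $name in *.) echo "$name"; exit 0 ;; esac
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}
  [ "$dots" -ge "$ndots" ] && echo "$name."
  for domain in "$@"; do echo "$name.$domain."; done
  [ "$dots" -lt "$ndots" ] && echo "$name."
  exit 0
)

# Default pod settings: three wasted lookups before the real one
dns_candidates api.example.com 5 \
  default.svc.cluster.local svc.cluster.local cluster.local
```

Run it again with ndots set to 1 and the absolute name moves to the front of the list, so the search domains are only tried if that first query fails.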

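Also, CoreDNS's forward plugin retries on failure and supports several upstream selection policies. The default, random, spreads queries across the listed upstreams. policy sequential instead always tries upstreams in the order listed and moves on only when one is unhealthy: that suits a strict primary/backup setup, but if the first upstream is slow, all queries wait on it. Use random or round_robin when you want load distribution.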

In large clusters, consider deploying NodeLocal DNSCache. It runs a DaemonSet that caches DNS queries per node, reducing load on CoreDNS and improving resolution latency. This is almost essential for clusters with thousands of pods.

io/thecodeforge/kubernetes/coredns-config.yaml (YAML)

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           policy sequential
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

Enabling CoreDNS Query Logging

To debug DNS failures, enable query logging by adding the log plugin to the CoreDNS ConfigMap. The logs will show each query and response. Use with caution in production — log volume can be high. The command kubectl logs -n kube-system deployment/coredns --tail=10 shows recent queries.

Production Insight

CoreDNS without resource limits is a common cause of intermittent DNS timeouts — set requests and limits per pod.

ndots:5 causes unnecessary search domain queries — tune to 1 for latency-sensitive apps.

Forward policy 'sequential' sends every query to the first upstream, which amplifies latency when that upstream is slow: prefer 'random' or 'round_robin' unless you need a strict primary.

NodeLocal DNSCache can substantially cut DNS latency and CoreDNS load in large clusters, so deploy it.

Monitor coredns_dns_responses_total by rcode for failure rate and coredns_dns_request_duration_seconds for latency to catch degradation early.

Key Takeaway

CoreDNS is the single point of failure for service discovery — protect it with resource limits and health checks.

ndots:5 is a hidden latency tax — tune it down for performance-critical paths.

NodeLocal DNSCache is not optional for clusters over 50 nodes — deploy it proactively.

DNS Resolution Failures

If: Pod cannot resolve any service names (nslookup times out)
→ Use: Check CoreDNS pods are running and not being evicted. Verify service kube-dns has endpoints. Check NetworkPolicy blocking DNS.

Kubelet Probes and Pod Lifecycle: What Happens When a Probe Fails

The kubelet executes three types of probes on containers: liveness, readiness, and startup. Each probe is a periodic check (HTTP GET, TCP socket, or command execution) that determines whether the container is alive, ready to serve traffic, or still starting up. The probe results directly affect the pod's lifecycle and the cluster's behavior.

Liveness probes determine if the container is running. If it fails, the kubelet restarts the container per the pod's restartPolicy. Readiness probes determine if the container is ready to serve traffic. If it fails, the pod is removed from Service endpoints. Startup probes delay the start of liveness and readiness probes until the container finishes initialization — critical for slow-starting applications like Java apps or legacy monoliths.

Here's the nuance most teams get wrong: the failure threshold for liveness probes often causes cascading restarts in rolling updates. Default failureThreshold is 3, and the default periodSeconds is 10. That means a pod will be killed after 30 seconds of failure. But during a deployment, if the new version takes 40 seconds to respond, the kubelet restarts it before it ever becomes ready. The fix: increase the startup probe threshold or use an initial delay.
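The timing budget is simple arithmetic, sketched below. It treats the time before the kubelet acts as roughly initialDelaySeconds plus periodSeconds times failureThreshold; actual timing also depends on timeoutSeconds and probe scheduling jitter, so read the result as an approximation. The function name `probe_budget` is hypothetical.

```shell
#!/bin/sh
# Approximate seconds a container gets before a failing probe triggers action.
# usage: probe_budget <initialDelaySeconds> <periodSeconds> <failureThreshold>
probe_budget() ( echo $(( $1 + $2 * $3 )) )

echo "default liveness probe: $(probe_budget 0 10 3)s"
echo "startup probe (10s delay, 5s period, 30 failures): $(probe_budget 10 5 30)s"
```

If your application's worst-case start time is above the first number, the liveness probe alone will kill it mid-startup; a startup probe sized like the second call gives it room.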

Another subtlety: when a readiness probe fails, the pod remains running but is removed from Service endpoints. This means traffic stops flowing, but the pod continues consuming CPU and memory. If many pods become unready, the remaining pods may be overloaded, causing a cascade of readiness failures. Always set resource limits to prevent unready pods from starving healthy ones.

Also, the kubelet doesn't kill containers for failing readiness probes. That's intentional — the probe is just traffic routing. But if you rely on readiness probe for health checking in monitoring, you'll get false positives. Readiness probe failures are not alerts unless they exceed a high threshold.

One more: the kubelet records probe failures as events on the pod. You can see them with kubectl describe pod. But events are short-lived: the API Server keeps them for one hour by default (--event-ttl), and repeated events are aggregated, so by the time you investigate you may have lost the root cause. For continuous monitoring, use kubelet metrics such as prober_probe_total (labelled by probe type and result) and kubelet_pod_start_duration_seconds.

io/thecodeforge/kubernetes/probes-deployment.yaml (YAML)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: slow
  template:
    metadata:
      labels:
        app: slow
    spec:
      containers:
      - name: app
        image: myregistry/slow-app:1.0
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 30  # 2.5 minutes to start
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2

Startup Probe Ignored in Older Clusters

Startup probes were introduced as alpha in Kubernetes 1.16, enabled by default from 1.18, and became stable in 1.20. If you're running an older cluster (unlikely but possible in enterprise), you must rely on a large initialDelaySeconds on the liveness probe. That's fragile. Always verify your cluster version before relying on startup probes.

Production Insight

Liveness probe failure thresholds that are too aggressive cause restarts during slow rolling updates — tune them with startup probes.

Readiness probe failures do not kill containers — they just remove traffic. Monitor readiness metrics separately.

Startup probes are essential for containers that take >30s to start — set them with high failureThreshold.

Event pruning can hide probe failures — use kubelet metrics for continuous visibility.

Cascading readiness failures happen when unready pods still consume resources — set CPU/memory limits.

Key Takeaway

Probes are the kubelet's decision mechanism for pod health — get them wrong and your deployments become unreliable.

Always use startup probes for slow-starting containers.

Readiness failures are not pod failures — they are traffic routing signals.

Monitor probe metrics separately from pod events to catch cascading failures.

Probe Failure Diagnosis

If: Pod shows 'CrashLoopBackOff' after deployment
→ Use: Check liveness probe configuration. Increase initialDelaySeconds or add a startup probe if the application is slow to start.

Why You Can't Afford to Ignore Docker in Kubernetes

Docker isn't just a container runtime. It's the foundation that Kubernetes orchestrates. Every pod you run on Kubernetes is just a collection of Docker containers (or OCI-compatible images). If you don't understand how Docker builds, layers, and caches images, you'll ship bloated containers that crash your cluster in production.

Docker images are read-only templates. Containers are runtime instances of those templates. The container runtime on each node pulls images from registries, mounts layers via union filesystems, and isolates processes using cgroups and namespaces. Kubernetes trusts the kubelet to talk to that runtime through the CRI (typically containerd or CRI-O; Docker Engine needs the cri-dockerd adapter since dockershim was removed in Kubernetes 1.24) to start, stop, and monitor those containers.

Here's the trap: Docker caches layers locally. If your image has a massive base OS layer that never changes, it sits on every node. Multiply that by 50 nodes. You're burning disk space for no reason. The fix? Use distroless or Alpine-based images. And always pin versions – latest is a production fire waiting to happen.
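Pinning is easy to check mechanically. The sketch below flags image references that resolve to latest, either explicitly or because no tag was given. The function name `is_pinned` is hypothetical, and the check is deliberately simple: it only looks at the reference string, not at what the registry serves.

```shell
#!/bin/sh
# Exit 0 if the image reference is pinned to a digest or a tag other than latest.
# usage: is_pinned <image-ref>
is_pinned() (
  ref=$1
  case $ref in *@sha256:*) exit 0 ;; esac   # digest pins are always fine
  last=${ref##*/}    # drop registry host:port and repository path
  case $last in
    *:latest) exit 1 ;;
    *:*)      exit 0 ;;
    *)        exit 1 ;;   # no tag means :latest implicitly
  esac
)

for image in alpine:3.18 nginx localhost:5000/app myregistry/slow-app:1.0; do
  if is_pinned "$image"; then echo "ok      $image"; else echo "UNPINNED $image"; fi
done
```

A check like this fits naturally in CI, run over the image fields extracted from your manifests before they are applied.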

DockerImageOptimization.yml (Dockerfile)

# io.thecodeforge: devops tutorial

# Example: Multi-stage Dockerfile for a Go app
FROM golang:1.21 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /app/server .

FROM alpine:3.18
RUN apk --no-cache add ca-certificates
COPY --from=builder /app/server /server
EXPOSE 8080
CMD ["/server"]

Output

REPOSITORY TAG IMAGE ID CREATED SIZE

my-app sha256:f2a9d1b6 1.8MB

alpine 3.18 9c6f07244728 2 weeks ago 7.05MB

<-- Total image size: 8.85MB (vs ~800MB if using full Go base)

Production Trap:

Never use docker build --no-cache in CI unless you want every build to download all layers fresh. Cache busting is fine, but full rebuilds are slow and waste bandwidth. Use docker build --cache-from to reuse cached layers from a registry.

Key Takeaway

Docker images are layered. Keep layers small, base images minimal, and always multi-stage build in production.

Docker vs Virtual Machines: The Performance Reality

Virtual machines emulate hardware. Each VM has its own kernel, runs a full OS, and uses hypervisors like KVM or Hyper-V to allocate resources. That overhead is massive – boot time in seconds, memory footprint in gigabytes, and CPU cycles lost to virtualization layers.

Docker containers share the host kernel. They start in milliseconds, consume megabytes of memory overhead, and don't emulate hardware. The tradeoff? Isolation. Containers use cgroups to limit CPU/memory and namespaces to isolate processes, but they still share the kernel. A kernel panic on the host takes down every container. With VMs, you get full isolation at a huge performance cost.

Here's the math for a typical microservice: VM with 256MB RAM, 1 vCPU, 10GB disk = ~$30/month on a cloud provider. Docker container with the same resources = negligible cost inside a Kubernetes cluster. But if you need hard isolation for untrusted workloads, stick with VMs. Containers aren't sandboxes – they're lightweight processes.

ResourceComparison.yml (YAML)

# io.thecodeforge: devops tutorial

# Quick reference: VM vs Container overhead
# Virtual Machine
#  - Hypervisor: ESXi, Hyper-V, KVM
#  - OS: Full guest kernel (Ubuntu, Debian, etc.)
#  - Boot time: 30-90 seconds
#  - Memory overhead: ~500MB to 2GB+ 
#  - Disk: Clone whole VMDK or QCOW2 files

# Docker Container
#  - Runtime: containerd, Docker
#  - OS: Shared host kernel + minimal rootfs
#  - Boot time: <1 second
#  - Memory overhead: ~5-20MB
#  - Disk: Needs only difference layers

Output

METRIC | VM | Docker Container
Boot time | 60s | 0.3s
Memory overhead | 512MB | 15MB
Isolation | Full kernel separation | Namespace/cgroup
Performance overhead | 5-15% CPU | <1% CPU
Density per host | 5-10 VMs | Hundreds

Senior Shortcut:

Use VMs for legacy monoliths that need OS-level access or run untrusted third-party code. Use containers for everything else – especially stateless microservices. The cost savings on compute alone will fund your coffee budget for a year.

Key Takeaway

VMs provide stronger isolation at 10x the cost. Containers share the kernel and are fast, cheap, and ephemeral. Choose based on trust boundaries, not hype.

How etcd Fails (and What You Must Do About It)

etcd is the single source of truth for your entire Kubernetes cluster. If it goes down, you can't schedule pods, update deployments, or even view cluster state. It's a distributed key-value store that uses the Raft consensus protocol to maintain consistency across nodes.

Here's what happens when etcd fails silently: a node gets partitioned from the rest. Raft requires a majority (quorum) to elect a leader and commit writes. If a partition or member loss leaves no group with a majority, for example when two members of a three-member cluster are unreachable, no leader can be elected and the cluster rejects writes. Your kubectl apply commands hang until they time out.

The fix? Run etcd on dedicated nodes with dedicated disks – SSDs are mandatory. Disable swapping on those nodes. etcd is latency-sensitive: one slow disk can wreck cluster stability for everyone. Set --quota-backend-bytes to something reasonable for your scale – the default is 2GB, and 8GB is the suggested maximum, so bump toward 8GB for large production clusters. And always back up etcd snapshots to object storage every hour. Always. Test those restores. I've seen teams lose entire clusters because they thought file-level backups of etcd were enough.
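As a rough sketch of the storage settings above, an etcd configuration file fragment might look like the following. It assumes etcd is started with --config-file; the two directory paths are hypothetical and stand in for wherever your dedicated SSDs are mounted.

```yaml
# etcd configuration file fragment (illustrative values, not a full config)
# Paths are hypothetical: point them at your dedicated SSD mounts.
data-dir: /var/lib/etcd
# Keeping the write-ahead log on its own disk isolates fsync latency from other I/O
wal-dir: /var/lib/etcd-wal
# 8 GiB backend quota, expressed in bytes (8 * 1024^3)
quota-backend-bytes: 8589934592
```

If etcd runs as a static pod with command-line flags instead, the same settings map to --data-dir, --wal-dir, and --quota-backend-bytes.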

EtcdSnapshotBackup.ymlYAML

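# io.thecodeforge - devops tutorial

# Automated etcd snapshot with retention
# Assumes a kubeadm-style control plane: etcd listens on 127.0.0.1:2379 and
# its certificates live under /etc/kubernetes/pki/etcd on the host.
# Also assumes the image ships a shell, gzip, and find; verify for your etcd version.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          # Share the node's network namespace so 127.0.0.1 reaches the local etcd member
          hostNetwork: true
          # Run only on control plane nodes, where etcd and its certs live
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: etcdctl
            image: gcr.io/etcd-development/etcd:v3.5.9
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                snapshot save /backups/etcd-snapshot-$(date +%Y%m%d%H%M).db && \
              gzip /backups/etcd-snapshot-*.db && \
              find /backups -name "*.gz" -mtime +7 -delete
            volumeMounts:
            - name: backup-vol
              mountPath: /backups
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
          restartPolicy: OnFailure
          volumes:
          # Local copy only; ship these files to object storage separately
          - name: backup-vol
            hostPath:
              path: /var/lib/etcd-backups
              type: DirectoryOrCreate
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: Directory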

Output

CronJob created. Next run in 45 minutes.

-> Snapshot saved to /var/lib/etcd-backups/etcd-snapshot-202503191500.db.gz

-> 7-day retention enabled.

Production Trap:

Don't co-locate etcd with other workloads. Etcd nodes are not cattle – they're pets. If you run etcd on the same nodes as your application pods, one noisy neighbor can cause latency spikes that trigger leader elections and degrade the entire cluster.

Key Takeaway

etcd is the cluster's brain. Back up snapshots hourly, run on dedicated nodes, use SSDs, and test restores quarterly. Treat etcd failures as fire drills.

.dockerignore: The One Line That Shrinks Image Builds

Every Docker build sends the build context directory to the builder. Without a .dockerignore, that context includes node_modules, .git, and build artifacts, which slows CI/CD and, as soon as the Dockerfile copies the whole directory, bloats the image too. This is especially brutal in Kubernetes because every image pull wastes bandwidth and increases pod startup latency. The fix: a single file. Add patterns for node_modules, .env, .git, dist, and any local cache. The Docker client applies these patterns before the context is sent, so excluded files never reach the build engine. Result: faster builds, smaller images, lower registry costs. Teams that skip .dockerignore routinely ship 500MB+ images that are 90% unnecessary. In Kubernetes, where pods are rescheduled frequently, this drag shows up as slow pulls and slow startup on every node that has not cached the image. The .dockerignore file is not optional; it is the cheapest optimization you will ever make.

.dockerignore

# io.thecodeforge - devops tutorial

# .dockerignore: reduce build context by 80%
node_modules
.git
.env
*.log
dist
.cache
.DS_Store
coverage
.terraform


Production Trap:

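A missing .dockerignore can cause secret leakage if .env files or SSH keys sit in the build context and the Dockerfile copies the whole directory. Before pushing, run 'docker build -t test .' and check the context size it reports; an unexpectedly large context usually means files you did not intend to ship.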

Key Takeaway

Add a .dockerignore before you write the first line of the Dockerfile: it prevents bloat, speeds builds, and protects secrets.

docker-compose: Local Workflow That Mirrors Kubernetes Semantics

Kubernetes is not a local development tool. docker-compose fills that gap by declaring multi-container applications in a single YAML file. Why it matters to Kubernetes architects: compose models services, networks, volumes, and environment variables—the same primitives you later map to Deployments, Services, and ConfigMaps. Use compose for iterative local testing before writing a Helm chart. The 'docker-compose up' command starts your entire stack with consistent networking and dependency order. When containers fail, you can inspect logs without kubectl. The trap: treating compose as a production orchestrator. It lacks self-healing, rolling updates, and cluster-wide scheduling. Use it to validate image tags, environment injection, and volume mounts before pushing to a registry. In practice, every serious Kubernetes project I've seen uses a compose file for the developer loop. It catches config drift early, saving hours of cluster debugging.

Example.ymlYAML

# io.thecodeforge - devops tutorial

version: "3.9"
services:
  web:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DB_HOST=db
  db:
    image: postgres:15
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:


Production Trap:

Never use 'depends_on' alone—containers start but may not be ready. Add healthcheck blocks or use a wait script to mirror Kubernetes init containers.
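One way to apply this, sketched under the assumption that you run Docker Compose v2 (Compose Specification syntax), is to give the database a healthcheck and make the web service wait for it. The password is a placeholder for local development only.

```yaml
# Sketch: start web only after the database reports healthy
services:
  web:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DB_HOST=db
    depends_on:
      db:
        condition: service_healthy   # wait for the healthcheck below to pass
  db:
    image: postgres:15
    environment:
      - POSTGRES_PASSWORD=example    # placeholder, local development only
    healthcheck:
      # pg_isready ships with the official postgres image
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```

This is roughly the local equivalent of a readiness probe on the database plus an init container on the web pod.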

Key Takeaway

docker-compose is the fastest way to validate multi-service configs locally before translating to Kubernetes manifests.

Objects: The Declarative Building Blocks of Kubernetes

Every resource in Kubernetes is an Object—a persistent entity that represents the desired state of your cluster. Pods, Deployments, Services, ConfigMaps, and Secrets are all Objects defined via YAML or JSON manifests. The declarative model means you describe what you want, and the control plane (API Server, Controller Manager, etc.) converges the actual state to match. For example, a Deployment Object specifies the number of replicas, the container image, and update strategy. The Deployment Controller then creates ReplicaSets and Pods automatically. Understanding Objects is fundamental because they enforce idempotency, auditability, and self-healing. Without mastering Objects, you'll treat Kubernetes like a scripting language—prone to drift and manual fixes. Every kubectl apply communicates with the API Server to store your Object definition in etcd, triggering reconciliation loops.

deployment-object.ymlYAML

# io.thecodeforge - devops tutorial
# Minimal Deployment Object defining desired state
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80

Production Trap:

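Never omit resource requests and limits in your Pod Object. Without them, the Pod lands in the BestEffort QoS class: the scheduler places it without accounting for what it will actually use, and it is among the first candidates for eviction or OOM kills when the node runs short of memory during peaks.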
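A minimal sketch of what that looks like for the nginx container in the Deployment above. The numbers are illustrative assumptions, not recommendations; size them from observed usage.

```yaml
# Fragment of the Pod template's container list (spec.template.spec.containers)
containers:
- name: nginx
  image: nginx:1.25
  resources:
    requests:          # what the scheduler reserves on the node
      cpu: 100m
      memory: 128Mi
    limits:            # hard ceiling enforced at runtime
      cpu: 500m
      memory: 256Mi
```

With requests lower than limits the Pod is classed as Burstable; setting them equal for every container makes it Guaranteed.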

Key Takeaway

Kubernetes Objects are the single source of truth: always write declarative manifests, not imperative commands.

Real-World Use Cases & Projects: Where Kubernetes Shines

Kubernetes excels in microservices orchestration, multi-cloud deployments, and machine learning pipelines. A real-world project is deploying a stateless web app with auto-scaling: combine a Deployment, HorizontalPodAutoscaler (HPA), and Service. In production, companies like Spotify use Kubernetes to run thousands of services with canary deployments and traffic splitting. Another use case is stateful workloads (databases, message queues) via StatefulSets and PersistentVolumeClaims, which is critical for e-commerce order systems. Edge computing projects run lightweight K3s on Raspberry Pi clusters for IoT data ingestion. For CI/CD, teams build GitOps pipelines with ArgoCD: every merge to main triggers a sync that updates the Deployment Object. If readiness probes fail, the rolling update stalls instead of replacing healthy pods; rolling back automatically takes extra tooling such as Argo Rollouts. These patterns reduce downtime and deployment friction. Start with a simple microservices stack (Frontend + API + DB) to grasp the lifecycle before scaling to multi-cluster federation.

hpa-autoscale.ymlYAML

# io.thecodeforge - devops tutorial
# HorizontalPodAutoscaler for CPU-based scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deploy
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Architecture Decision:

Avoid HPA based only on CPU for bursty apps—use custom metrics (e.g., request latency via Prometheus) for accurate scaling.
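As a hedged sketch of the custom-metric approach: the manifest below scales on a per-pod request rate rather than CPU. It uses request rate instead of latency because a rate divides more naturally across replicas. It assumes a custom metrics adapter (for example prometheus-adapter) is installed and already exposes a metric named http_requests_per_second; that metric name and the target value are hypothetical.

```yaml
# Sketch: HPA driven by a per-pod custom metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deploy
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical; must exist in the custom metrics API
      target:
        type: AverageValue
        averageValue: "100"              # target average per pod, illustrative
```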

Key Takeaway

Start small: deploy a three-tier app with HPA and GitOps before tackling clusters with thousands of nodes.

● Production incident · POST-MORTEM · severity: high

Cascading API Server Failure Due to etcd Disk Latency

Symptom

All kubectl operations returned timeout errors. Controller-manager logs showed failed lease renewals. New pods stuck in Pending.

Assumption

Network partition or API Server OOM.

Root cause

etcd members were deployed on the same nodes as other workloads. A batch job caused high disk I/O on those nodes. etcd's consensus protocol (Raft) must fsync each log entry to disk before acknowledging it. With fsyncs stalled, the leader's heartbeats arrived later than the election timeout, followers kept starting new leader elections, and the cluster never held a stable leader, which made the API Server unable to write new state.

Fix

1. Isolated etcd onto dedicated nodes with local SSDs and no other workloads.
2. Configured etcd heartbeat-interval and election-timeout appropriately for the network.
3. Set up monitoring for etcd fsync duration and disk IOPS.
4. Added disk latency alerts with p99 > 15ms triggering immediate escalation.
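The alert from the fix list could be expressed roughly as the Prometheus rule below. This is a sketch that assumes Prometheus already scrapes etcd's metrics endpoint; the group name, alert name, and labels are hypothetical, while the metric is etcd's WAL fsync histogram.

```yaml
# Sketch: alert when p99 WAL fsync latency exceeds 15ms
groups:
- name: etcd-disk-latency
  rules:
  - alert: EtcdWalFsyncSlow
    # p99 over 5 minutes, per etcd member
    expr: |
      histogram_quantile(0.99,
        sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
      ) > 0.015
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "etcd WAL fsync p99 above 15ms on {{ $labels.instance }}"
```

The for: 5m clause trades a little detection delay for fewer pages on short spikes; shorten it if you want the immediate escalation described above.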

Key lesson

etcd is the cluster's central nervous system. Its performance is non-negotiable.
Disk latency, not network, is the most common cause of etcd instability.
etcd must be isolated and its hardware provisioned for predictable, low-latency I/O.
Always test your etcd restore procedure quarterly — a backup you've never restored is no backup at all.

Production debug guide

A symptom-first investigation path for control plane and node issues. 5 entries

Symptom · 01

kubectl commands timeout or fail with 'server error'.

→

Fix

Check API Server and etcd health first. The API Server is the gateway; if it's down, nothing else works. Use kubectl get --raw='/readyz?verbose' and etcdctl endpoint health --cluster. Quote the path so the shell does not treat the ? as a wildcard.

Symptom · 02

Pods stuck in Pending state.

→

Fix

Investigate scheduler logs and node resource availability (kubectl describe node). Check for resource fragmentation or taints/tolerations mismatches. Also verify scheduler pod health.

Symptom · 03

Pods in CrashLoopBackOff.

→

Fix

Start with the crashed container's own output (kubectl logs <pod> --previous) and the pod's events (kubectl describe pod). If those are inconclusive, inspect kubelet and container runtime (e.g., containerd) logs on the node. Check journalctl -u kubelet and crictl ps -a.

Symptom · 04

Node marked NotReady.

→

Fix

SSH to the node. Check kubelet and container runtime status (systemctl status kubelet). Check disk pressure, memory pressure, and PID pressure using kubectl describe node. Look for eviction thresholds being hit.

Symptom · 05

Pod cannot resolve service DNS name.

→

Fix

Check CoreDNS pods and logs. Verify NetworkPolicy allows DNS traffic on port 53. Ensure Service exists and EndpointSlices are populated. Try nslookup kubernetes.default.svc.cluster.local from inside a pod.

★ Kubernetes Control Plane & Node Triage

Rapid commands to isolate cluster issues.

Cluster-wide unresponsiveness.

Immediate action

Check etcd health and API Server logs.

Commands

etcdctl endpoint health --cluster

kubectl get --raw='/readyz?verbose'

Fix now

If etcd is unhealthy, check disk latency on etcd nodes (iostat -x 1). Isolate etcd immediately. If API Server is slow, check apiserver_current_inflight_requests metric.

Specific node not scheduling new pods.

Pods failing to start on a node.

API Server response times high.

Scheduler queue growing (pods stuck Pending with no event).

DNS lookup fails for service names but works for external names.

Key takeaways

etcd disk latency above 10ms at the 99th percentile will cascade into control plane unavailability; isolate etcd on dedicated hardware with fast SSDs.

API Server default inflight request limits (200 mutating, 400 read-only) are easily hit in large clusters; monitor and adjust early.

Controller manager leader election takes ~15 seconds to failover; during that window the corresponding controller loop stops processing.

A slow admission webhook (99th percentile >1s) blocks all writes to the affected resource type; add timeouts and circuit breakers.

Watch cache exhaustion causes 'too old resource version' errors and thundering herd re-lists; increase cache sizes or limit watch concurrency.
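For the admission webhook takeaway, a hedged sketch of a bounded, fail-open webhook registration follows. Kubernetes has no built-in circuit breaker for webhooks, so failurePolicy: Ignore is used here as the closest native equivalent. Every name in the manifest is hypothetical, and the caBundle is omitted on the assumption that certificate tooling injects it.

```yaml
# Sketch: keep a slow webhook from blocking writes indefinitely
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook
webhooks:
- name: validate.example.com
  clientConfig:
    service:
      name: example-webhook
      namespace: example-system
      path: /validate
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments"]
  timeoutSeconds: 2        # default is 10; allowed range is 1-30
  failurePolicy: Ignore    # fail open on timeout or error; use Fail only if rejecting is safer
  sideEffects: None
  admissionReviewVersions: ["v1"]
```

Scoping rules to the narrowest set of resources limits how much of the API a slow webhook can affect.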

Common mistakes to avoid

5 patterns

Using default eviction thresholds without customization

Symptom

Pods are evicted silently under memory pressure because evictionHard.memory.available defaults to 100MiB, which is too low for predictable behavior.

Fix

Set evictionHard.memory.available: "10%" or a value like 500Mi in the kubelet configuration. Test with a load generator to verify thresholds trigger before system OOM.
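A minimal KubeletConfiguration sketch for this fix is shown below. The memory values are illustrative assumptions. The disk thresholds restate the usual Linux defaults explicitly, since setting evictionHard can replace the default map rather than merge with it.

```yaml
# Sketch: explicit eviction thresholds in the kubelet config file
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"     # raised from the 100Mi default
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "1Gi"       # earlier, graceful warning threshold
evictionSoftGracePeriod:
  memory.available: "1m30s"     # required for every soft threshold
```

The soft threshold gives pods a grace period to terminate cleanly before the hard threshold forces immediate eviction.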

Running etcd on shared nodes with other workloads

Symptom

etcd leader elections become frequent during batch jobs or peak load, causing API Server timeouts and pod scheduling delays.

Fix

Isolate etcd on dedicated nodes with local SSDs. Set resource reservations for etcd pods and configure --heartbeat-interval and --election-timeout for your network latency.
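As a sketch of the timing settings, assuming etcd reads a configuration file: both values are in milliseconds, and the defaults are 100 and 1000. The numbers below are illustrative for members with noticeably higher round-trip times than a single rack; measure your own latency before changing them.

```yaml
# etcd configuration file fragment (illustrative, not a recommendation)
# Size the heartbeat around the round-trip time between members
heartbeat-interval: 250
# Keep the election timeout several times the heartbeat (10x is a common starting ratio)
election-timeout: 2500
```

All members should use the same values. With flag-based static pods the equivalents are --heartbeat-interval and --election-timeout.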

Not setting resource limits for CoreDNS

Symptom

CoreDNS pod crashes under high query load, causing intermittent DNS resolution failures across the cluster.

Fix

Set CPU and memory requests/limits for CoreDNS. For clusters with >100 pods, deploy NodeLocal DNSCache and tune forward plugin's max_concurrent.

Applying a NetworkPolicy that denies all ingress without allowing DNS

Symptom

Pods cannot resolve service names; DNS queries to CoreDNS are blocked by the policy.

Fix

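Always pair a default-deny policy with rules that allow UDP and TCP port 53: ingress to the CoreDNS pods from all namespaces and, if egress is also denied, egress from your workloads to CoreDNS (or use a prepared NetworkPolicy that includes DNS). NetworkPolicies select pods, not Services, so target the CoreDNS pod labels.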
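The ingress rule can be sketched as follows. It assumes CoreDNS runs in kube-system with the label k8s-app: kube-dns (the kubeadm convention) and listens on port 53; check both in your cluster. The policy name is hypothetical.

```yaml
# Sketch: allow DNS queries to CoreDNS from every namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-ingress
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns        # assumed CoreDNS pod label
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector: {}    # empty selector matches all namespaces
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```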

Ignoring scheduler cache staleness

Symptom

Pods get stuck in 'PodScheduled' with event 'node does not exist' after node churn, even though the node is alive.

Fix

Restart the kube-scheduler pod so it rebuilds its cache from the API Server. For production, treat restarts as a stopgap and upgrade to a more recent scheduler version that handles node churn better.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain how the Kubernetes scheduler assigns a pod to a node. What happens if no node passes filtering?

ANSWER

The scheduler uses a two-phase approach: Filtering (Predicates) and Scoring (Priorities). In the filtering phase, it applies hard constraints like resource requests, node selector, taints/tolerations, and affinity rules. It then scores the nodes that passed with priority functions (e.g., spread, balanced resource allocation) and binds the pod to the highest-scored node. If no node passes filtering, the pod remains unscheduled with a 'Pending' status, and the scheduler emits an event that can be inspected via kubectl describe pod. The scheduler does not re-evaluate once the pod is bound; the kubelet handles eviction if the node comes under resource pressure.

Q02 · SENIOR

What happens when etcd loses quorum? How do you recover?

FAQ · 4 QUESTIONS

Frequently Asked Questions

Why does etcd disk latency cause the entire control plane to stall?

What are the default API Server request limits and how do they affect large clusters?

How does leader election in the controller manager affect failover time?

What happens when the API Server's watch cache is too small?
