Senior 14 min · March 06, 2026

etcd Disk Latency — How 800ms Killed the Kubernetes Cluster

The degraded etcd EBS volume caused 800ms write latency, stalling Raft consensus and freezing the entire cluster.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Last updated: May 23, 2026
Quick Answer
  • Kubernetes is a declarative container orchestration platform that continuously reconciles observed state with desired state.
  • Control plane: kube-apiserver, etcd, kube-scheduler, kube-controller-manager — each has a distinct role and failure mode.
  • etcd is the single source of truth — its disk latency is the cluster's performance ceiling.
  • The scheduler filters then scores nodes; it does NOT rebalance or predict load.
  • kubelet on each node runs the actual containers and reports status back to the API server.
  • Most production outages trace back to etcd misconfiguration, not application code.
Definition
What Is Kubernetes?

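Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications. At its core, you manage a cluster: a set of machines called nodes. Control plane nodes (historically called masters) run the components that manage the cluster, and worker nodes (the data plane) run your workloads. A production control plane usually spans several nodes rather than one.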

You define your app's desired state — how many replicas, which image, what ports — in a YAML manifest. Kubernetes then ensures the cluster matches that state, healing failures, scaling load, and rolling updates. A Pod is the smallest unit: one or more containers sharing networking and storage.

Deployments manage replica sets. Services provide stable network endpoints. This declarative approach means you tell Kubernetes what you want, not how to achieve it. The system handles the rest, watching for drift and correcting it automatically.

Plain-English First

Imagine you own a giant warehouse with hundreds of workers. Instead of telling each worker exactly what to do every minute, you hire a smart manager who reads a wish list ('I need 5 boxes packed, always'), watches the floor, and reassigns workers automatically when someone calls in sick. Kubernetes is that manager — you describe what your software should look like, and Kubernetes keeps reality matching the wish list, forever, across thousands of machines.

Kubernetes is not a deployment tool. It is a distributed state reconciliation engine. Every component — from the scheduler to the kubelet — operates on the same principle: watch the desired state in etcd, compare it with observed state, and act to close the gap. This is the mental model that unlocks real debugging capability.

The control plane is the brain. etcd is the memory. The kubelet is the muscle on each node. The scheduler decides placement. When any of these components degrades, the symptoms are often misleading — a Pod stuck in Pending looks like a scheduling problem but is frequently an etcd latency issue or a resource quota misconfiguration.

The common misconception is that Kubernetes 'runs containers.' It does not. Kubernetes manages the desired state of workloads. The container runtime (containerd, CRI-O) runs containers. Kubernetes tells the runtime what to run, monitors whether it is running, and corrects deviations. This distinction matters when debugging crashes, image pull failures, and networking issues.

What etcd Latency Actually Does to Kubernetes

etcd is the distributed key-value store that backs Kubernetes. It holds all cluster state: Pods, Services, ConfigMaps, Secrets. The core mechanic: every write must be persisted by a majority of members (quorum) before it is considered committed. A slow disk on the leader, or on enough members that no fast majority remains, therefore stalls writes for the entire cluster.

In practice, etcd's performance is bounded by fsync latency: the time to flush a write to disk. The control plane components (kube-apiserver, kube-scheduler, kube-controller-manager) all depend on etcd's linearizable reads and writes, which they reach through the API server. etcd's default heartbeat interval is 100ms and its default election timeout is 1000ms, so once fsync latency exceeds 100ms the leader can fall behind on heartbeats, and watch timeouts and leader elections start to cascade. At 800ms, the cluster enters a death spiral: heartbeats fail, leaders step down, and no new writes succeed.

Every Kubernetes cluster runs etcd, yet its sensitivity to disk I/O is often underestimated. This matters because a slow disk, rather than CPU or memory, is the most common cause of control-plane outages in production.

Disk Speed Is Not CPU Speed
A fast CPU cannot compensate for a slow disk. etcd's fsync latency is the bottleneck — provision dedicated SSDs with guaranteed IOPS, not shared cloud volumes.
Production Insight
A team ran etcd on a shared EBS gp2 volume with burst credits exhausted. The symptom: intermittent API server timeouts and leader election storms every 90 seconds. The rule: provision etcd on dedicated NVMe SSDs or local SSDs with at least 5000 IOPS and monitor fsync latency — alert if p99 exceeds 50ms.
Key Takeaway
etcd is the single source of truth for cluster state — its latency is your cluster's latency.
Disk fsync latency is the critical metric; anything above 100ms p99 will cause control-plane instability.
Always run etcd on dedicated, low-latency storage — never share a disk with other workloads or use network-attached volumes with burst limits.
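
To make the thresholds above concrete, here is a minimal Python sketch that computes a p99 over a window of fsync latencies and maps it to a coarse health label. It assumes the samples were already collected in milliseconds (for example from etcd's `etcd_disk_wal_fsync_duration_seconds` metric). The function names and sample data are hypothetical, and the 50ms and 100ms cut-offs are simply the figures quoted in this article.

```python
ALERT_MS = 50.0      # alert threshold quoted in this article
UNSTABLE_MS = 100.0  # instability threshold quoted in this article


def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def classify_fsync(samples_ms):
    """Map a window of fsync latencies to a coarse health label."""
    value = p99(samples_ms)
    if value > UNSTABLE_MS:
        return "unstable"
    if value > ALERT_MS:
        return "alert"
    return "ok"


# Hypothetical windows of fsync latency samples, in milliseconds.
healthy = [2.0] * 99 + [8.0]
degraded = [5.0] * 90 + [800.0] * 10

print(classify_fsync(healthy))
print(classify_fsync(degraded))
```

A percentile is used rather than a mean because a mostly fast disk with occasional long stalls is exactly the pattern that triggers missed heartbeats.
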
[Figure: etcd Disk Latency Impact on Kubernetes. How 800ms latency disrupts control plane and cluster operations. Stages: etcd disk latency (800ms write delay causes leader election), control plane failure (API server, scheduler, controller manager stall), scheduler unavailability (no pod placement decisions possible), pod networking disruption (CNI plugins fail to assign IPs), cluster degradation (Pods stuck, services unreachable), recovery via etcd tuning (reduce disk I/O, use SSD, tune heartbeat). Note: etcd disk latency > 100ms triggers instability; use dedicated SSD storage and monitor fsync latency.]

The 4 Essential Objects: Pod, Service, Deployment, Namespace

Before diving into the architecture, you need a concrete mental model of the four objects you'll use every day. Kubernetes exposes hundreds of resource types, but 80% of your interactions will involve these four.

Pod is the smallest deployable unit. A Pod wraps one or more containers, gives them a shared network namespace (one IP per Pod), and optionally shared storage volumes. Containers in the same Pod can communicate via localhost. Pods are ephemeral — they can be killed and rescheduled at any time. Never run a single Pod without a controller (Deployment, StatefulSet, DaemonSet).

Service provides a stable network endpoint for a set of Pods. Because Pods can die and be replaced with new IPs, a Service gives a fixed IP (ClusterIP) and DNS name that load-balances across the healthy Pods. The Service uses label selectors to determine which Pods belong to it.

Deployment is the most common controller. It declares the desired state for your stateless applications: how many replicas, which container image, resource limits, health checks, update strategy. The Deployment controller creates a ReplicaSet, which creates the Pods. When you update the Pod template, the Deployment creates a new ReplicaSet and gradually scales it up and the old one down (rolling update).

Namespace is a virtual cluster boundary. It isolates resources, RBAC, and network policies. Every resource lives in a namespace — except cluster-scoped resources like Nodes and PersistentVolumes. Use namespaces to separate environments (dev, staging, prod) or teams.

Together, these objects form the foundation: you define a Deployment that creates Pods, expose them via a Service, and organize everything in a Namespace.
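
The link between a Service and its Pods is nothing more than label matching. The following Python sketch illustrates the idea under simplifying assumptions: it handles only equality-based selectors, and the `selects` and `endpoints` helpers and the Pod records are hypothetical stand-ins, not Kubernetes API objects.

```python
def selects(selector, labels):
    """True when every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector, pods):
    """IPs of ready Pods whose labels satisfy the Service selector."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"] and selects(selector, pod["labels"])
    ]


# Hypothetical Pods: extra labels are fine, missing or different ones are not.
pods = [
    {"ip": "10.244.1.10", "labels": {"app": "nginx", "tier": "web"}, "ready": True},
    {"ip": "10.244.1.11", "labels": {"app": "nginx"}, "ready": False},
    {"ip": "10.244.1.12", "labels": {"app": "nginx"}, "ready": True},
    {"ip": "10.244.2.20", "labels": {"app": "redis"}, "ready": True},
]

print(endpoints({"app": "nginx"}, pods))
```

The unready Pod is left out of the endpoint list, which is why a failing readiness probe removes a Pod from load balancing without deleting it.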

four-essential-objects.yaml (YAML)
# 1. Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
---
# 2. Deployment creating 3 replicas of a web app
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        ports:
        - containerPort: 80
---
# 3. Service exposing the Deployment internally
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
  namespace: production
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
Output
# After applying the above three objects:
kubectl get ns
NAME STATUS AGE
production Active 10s
kubectl get deploy -n production
NAME READY UP-TO-DATE AVAILABLE AGE
nginx-deploy 3/3 3 3 10s
kubectl get pods -n production -o wide
NAME READY STATUS RESTARTS AGE IP
nginx-deploy-7b5c6f8d9-abc12 1/1 Running 0 8s 10.244.1.10
nginx-deploy-7b5c6f8d9-def34 1/1 Running 0 8s 10.244.1.11
nginx-deploy-7b5c6f8d9-ghi56 1/1 Running 0 8s 10.244.1.12
kubectl get svc -n production
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-service ClusterIP 10.96.123.45 <none> 80/TCP 10s
# Test internal connectivity from a temporary Pod:
kubectl run -it --rm test --image=curlimages/curl -- sh
/ $ curl nginx-service.production.svc.cluster.local
<!DOCTYPE html> (nginx default page)
Namespaces Are Not Optional in Production
Always create your objects inside a specific namespace. The default namespace is a shared space where conflicts happen. Set a resource quota and limit range on every namespace to prevent teams from exhausting cluster resources.
Production Insight
The most common production mistake is creating a Deployment without a Service, then trying to reach Pods by IP. Always pair every Deployment with a Service (even if you only need internal access). For external access, use an Ingress or LoadBalancer Service. Also, remember that Deployments are for stateless workloads — for stateful workloads (databases), use a StatefulSet.
Key Takeaway
Pod, Service, Deployment, and Namespace are the four pillars of everyday Kubernetes. Understand their lifecycle and how they interact: Deployments create Pods, Services provide stable network endpoints, and Namespaces keep everything organized and isolated.

Control Plane Architecture: The Brain of the Cluster

The Kubernetes control plane consists of four components that work together to maintain cluster state. Understanding each component's role — and its failure modes — is essential for production operations.

kube-apiserver is the front door. Every kubectl command, every controller reconciliation, every kubelet status report goes through the API server. It validates requests, persists state to etcd, and serves as the watch endpoint for all controllers. It is stateless — you can run multiple replicas behind a load balancer for HA.

etcd is the single source of truth. It is a distributed, consistent key-value store built on the Raft consensus protocol. All cluster state — Pod definitions, ConfigMaps, Secrets, node registrations — lives in etcd. If etcd loses quorum, the cluster cannot make any state changes. etcd is the most critical component and the most commonly under-provisioned.

kube-scheduler watches for unscheduled Pods and assigns them to nodes. It does not run Pods — it only writes the nodeName field. The kubelet on the assigned node then pulls the image and starts the container. The scheduler uses a two-phase process: filtering (eliminate infeasible nodes) and scoring (rank feasible nodes, pick the highest score).

kube-controller-manager runs the control loops. Each controller watches a specific resource type and reconciles actual state with desired state. The Deployment controller ensures the right number of replicas exist. The Node controller detects when nodes go unhealthy. The Endpoint controller updates Service endpoints as Pods come and go.

check-control-plane.sh (Bash)
# Control Plane Health Check
# Run this to verify all components are healthy
# Save as check-control-plane.sh

# 1. API Server health (returns 200 if healthy)
curl -k https://localhost:6443/healthz
# Expected: "ok"

# 2. etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Expected: "is healthy"

# 3. etcd cluster member status
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --write-out=table
# Shows: ID, Status, Version, DB Size, Raft Term, Raft Index

# 4. Scheduler and Controller-Manager leader election
# (Endpoints annotation shown here. Newer clusters record the leader in a Lease
#  instead: kubectl get lease kube-scheduler -n kube-system)
kubectl get endpoints kube-scheduler -n kube-system -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
kubectl get endpoints kube-controller-manager -n kube-system -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# 5. All control plane components running
kubectl get pods -n kube-system -o wide
Output
ok
127.0.0.1:2379 is healthy: successfully committed proposal: took = 2.145ms
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://127.0.0.1:2379 | 8e9e05c52164694d | 3.5.9 | 25 MB | true | false | 4 | 18234 | 18234 | |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
{"holderIdentity":"master-1_xxxxx","leaseDurationSeconds":15,"acquireTime":"2026-03-01T10:00:00Z","renewTime":"2026-04-07T14:30:00Z","leaderTransitions":3}
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system coredns-5d78c9869d-abc12 1/1 Running 0 30d 10.244.0.5 master-1
kube-system etcd-master-1 1/1 Running 0 30d 192.168.1.10 master-1
kube-system kube-apiserver-master-1 1/1 Running 0 30d 192.168.1.10 master-1
kube-system kube-controller-manager-master-1 1/1 Running 0 30d 192.168.1.10 master-1
kube-system kube-proxy-xyz78 1/1 Running 0 30d 192.168.1.10 master-1
kube-system kube-scheduler-master-1 1/1 Running 0 30d 192.168.1.10 master-1
The Reconciliation Loop — The Heartbeat of Kubernetes
  • The API server is the only component that talks to etcd. All other components go through the API server.
  • Controllers are level-triggered, not edge-triggered. They care about the current state, not the event that caused it.
  • This is why Kubernetes is self-healing. It does not remember what happened — it only checks what is true right now.
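
The level-triggered idea can be sketched in a few lines of Python. This is a toy model, not the Deployment controller: `reconcile` and `apply` are hypothetical helpers, and the set of Pod names stands in for observed cluster state.

```python
import itertools

_pod_ids = itertools.count(1)


def reconcile(desired_replicas, running):
    """Compare desired and observed state; return the actions that close the gap."""
    gap = desired_replicas - len(running)
    if gap > 0:
        return [("create", None)] * gap
    if gap < 0:
        surplus = sorted(running)[:-gap]
        return [("delete", name) for name in surplus]
    return []


def apply(actions, running):
    """Stand-in for the kubelet and runtime carrying out the decisions."""
    for verb, name in actions:
        if verb == "create":
            running.add(f"pod-{next(_pod_ids)}")
        else:
            running.discard(name)


running = set()
apply(reconcile(3, running), running)  # initial rollout: three creates
running.discard("pod-2")               # a Pod dies; no event is recorded
apply(reconcile(3, running), running)  # the loop only sees 2 < 3 and creates one
print(sorted(running))
```

Note that `reconcile` never receives the "pod-2 died" event. It only compares counts, which is what makes a missed event harmless.
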
Production Insight
Control plane HA requires at least 3 etcd members and 2+ API server replicas.
A single-node control plane is a single point of failure — if the master node dies, existing Pods keep running (kubelet is independent), but you cannot deploy, scale, or modify anything until the control plane recovers.
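etcd quorum requires floor(n/2)+1 members alive. With 3 members, you can tolerate 1 failure.
Avoid an even number of etcd members: a fourth member raises quorum from 2 to 3 without tolerating any additional failure, so it adds cost and one more thing to break but no resilience.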
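
The quorum arithmetic is easy to check directly. A minimal Python sketch (the function names are illustrative):

```python
def quorum(members):
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority is still alive."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerates={tolerated_failures(n)}")
```

Going from 3 to 4 members, or from 5 to 6, leaves the tolerated failure count unchanged, which is why odd member counts are the usual choice.
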
Key Takeaway
The control plane is a distributed system with etcd as its consensus backbone.
Every API request, every scheduler decision, every controller reconciliation depends on etcd's health.
Production clusters need 3+ etcd members on local SSDs — and etcd latency monitoring is not optional, it's the earliest warning of cluster degradation.
Control Plane Sizing for Production
If: Dev/test cluster, non-critical workloads.
Use: Single control plane node is acceptable. Accept the risk of API unavailability during maintenance.
If: Production cluster, < 100 nodes.
Use: 3 control plane nodes with stacked etcd. etcd runs on the same nodes as the API server, which gives cost-effective HA.
If: Production cluster, > 100 nodes or strict SLA requirements.
Use: 3–5 dedicated etcd nodes + 2+ API server nodes (external etcd). Isolates etcd disk I/O from API load.
If: Multi-region cluster.
Use: Stretched etcd across regions needs < 10ms latency. If higher, use separate clusters per region.

Control Plane vs Data Plane Architecture

Kubernetes is divided into two logical planes: the control plane (brain) and the data plane (muscle). The control plane makes decisions about the cluster state — what should run, where it should run, and whether the current state matches the desired state. The data plane executes those decisions — it runs the actual containers, provides the network connectivity, and reports back the observed state.

The control plane components (kube-apiserver, etcd, scheduler, controller-manager) typically run on dedicated master nodes, though in smaller clusters they may be colocated. The data plane consists of the worker nodes, each running kubelet, kube-proxy, the container runtime, and the CNI plugin.

The key architectural insight: control plane components never interact directly with user containers, and only the API server talks to etcd. Every other interaction goes through the API server. The kubelet on each worker node watches the API server for Pods assigned to its node, then instructs the container runtime to pull images and start containers. kube-proxy watches the API server for Service and endpoint changes and programs iptables/IPVS rules accordingly.

This separation means that if the control plane fails, existing containers continue running (the kubelet is autonomous for running workloads) but you cannot make any changes. Conversely, if a worker node fails, the control plane detects it (via the Node controller) and reschedules the Pods on healthy nodes after a timeout.

check-data-plane.sh (Bash)
# Check kubelet and kube-proxy status on a worker node
# Run each step on the worker node itself unless the step says otherwise

# 1. Check kubelet is running
systemctl status kubelet

# 2. Check kube-proxy (runs as a DaemonSet, check from master)
kubectl get pods -n kube-system | grep kube-proxy

# 3. Check container runtime is responsive
crictl pods

# 4. Verify that the node can reach the API server
curl -k https://<api-server-ip>:6443/healthz

# 5. Test that a Pod can be scheduled and started
# (the scheduler picks the node; -o wide shows which one)
kubectl run test --image=busybox --restart=Never -- sleep 30
kubectl get pods -n default -o wide
kubectl delete pod test
Output
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded /lib/systemd/system/kubelet.service; enabled
Active: active (running)
kube-proxy-abc12 1/1 Running 0 10d
kube-proxy-def34 1/1 Running 0 10d
CONTAINER ID IMAGE CREATED STATE NAME
c1d2e3f4 <image> 2 hours ago Ready <pod>
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized"} # (because we didn't auth, but reachability is confirmed)
pod/test created
Why This Separation Matters
When debugging, ask: is this a control plane issue or a data plane issue? Symptoms like kubectl hanging point to control plane (API server or etcd). Symptoms like a Pod running but unreachable point to data plane (kube-proxy, CNI). This mental shortcut saves hours.
Production Insight
In production, always run the control plane on separate nodes from the data plane. Colocating control plane components with user workloads on the same nodes can lead to noisy neighbor problems — a CPU-hungry Pod can starve the API server or etcd, causing cluster-wide instability. For large clusters (>500 nodes), consider using dedicated etcd nodes for further isolation.
Key Takeaway
The control plane decides, the data plane executes. This separation is the foundation of Kubernetes scalability and resilience. Understanding which plane a symptom belongs to is the first step in systematic debugging.
[Figure: Kubernetes Control Plane and Data Plane. The control plane (kube-apiserver, etcd, kube-scheduler, kube-controller-manager) watches and assigns work to two worker nodes. Each worker node (kubelet, container runtime, kube-proxy with iptables/IPVS) sends status reports back to the API server.]

The Scheduler: How Kubernetes Decides Where Pods Run

The kube-scheduler is the component that assigns Pods to nodes. It does not run Pods; it only writes the spec.nodeName field on the Pod object. The kubelet on the assigned node then pulls the image and starts the container. Each scheduling decision runs in two phases.

Filtering (Feasibility): Eliminate nodes that cannot run the Pod. Filter reasons include: insufficient CPU/memory, node taints the Pod cannot tolerate, node affinity mismatches, volume zone constraints, and Pod topology spread constraints. After filtering, if zero nodes remain, the Pod stays in Pending.

Scoring (Ranking): Rank the feasible nodes by a set of scoring plugins. Default scoring includes: NodeResourcesBalancedAllocation (prefer nodes with balanced CPU/memory usage), ImageLocality (prefer nodes that already have the container image), InterPodAffinity (prefer nodes where affinity rules are satisfied), and TaintToleration (prefer nodes with fewer taints). The node with the highest weighted score wins.

The scheduler makes decisions based on the state of the cluster at scheduling time. It does not predict future load. It does not rebalance existing Pods. Once a Pod is scheduled, only explicit actions (eviction, deletion, preemption) can move it.
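
The filter-then-score pipeline can be sketched as follows. This is a deliberately simplified Python model: the scoring formula is a toy stand-in for plugins such as ImageLocality and the resource-based scorers, not their real arithmetic, and the node records, field names, and `schedule` helper are hypothetical.

```python
def feasible(node, pod):
    """Filtering phase: resource fit on requests plus nodeSelector match."""
    fits = (
        node["free_cpu_m"] >= pod["cpu_m"]
        and node["free_mem_mi"] >= pod["mem_mi"]
    )
    selector = pod.get("node_selector", {})
    labels_ok = all(node["labels"].get(k) == v for k, v in selector.items())
    return fits and labels_ok


def score(node, pod):
    """Scoring phase (toy): image already present, then CPU headroom."""
    points = 10 if pod["image"] in node["images"] else 0
    points += (node["free_cpu_m"] - pod["cpu_m"]) // 100
    return points


def schedule(nodes, pod):
    """Return the winning node name, or None if the Pod would stay Pending."""
    candidates = [node for node in nodes if feasible(node, pod)]
    if not candidates:
        return None
    return max(candidates, key=lambda node: score(node, pod))["name"]


IMAGE = "registry.thecodeforge.io/payment-service:v2.4.1"
nodes = [
    {"name": "node-1", "free_cpu_m": 200, "free_mem_mi": 4096,
     "labels": {"disktype": "ssd"}, "images": set()},
    {"name": "node-2", "free_cpu_m": 2000, "free_mem_mi": 2048,
     "labels": {"disktype": "ssd"}, "images": set()},
    {"name": "node-3", "free_cpu_m": 1500, "free_mem_mi": 4096,
     "labels": {"disktype": "ssd"}, "images": {IMAGE}},
    {"name": "node-4", "free_cpu_m": 4000, "free_mem_mi": 8192,
     "labels": {"disktype": "hdd"}, "images": set()},
]
pod = {"cpu_m": 500, "mem_mi": 512, "image": IMAGE,
       "node_selector": {"disktype": "ssd"}}

print(schedule(nodes, pod))
```

node-1 is filtered out for insufficient CPU and node-4 for its label, so only node-2 and node-3 are scored. node-4 has the most free capacity, yet it never reaches the scoring phase: filters are absolute, scores are only preferences.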

scheduler-configuration.yaml (YAML)
# Example: Pod with scheduling constraints
# This Pod will ONLY be scheduled on nodes with the label 'disktype=ssd'
# and will prefer nodes in zone 'us-east-1a'
apiVersion: v1
kind: Pod
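metadata:
  name: io-thecodeforge-payment-service
  namespace: production
  # Label matched by the podAffinity and topologySpreadConstraints selectors below
  labels:
    app: payment-service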
spec:
  # Hard requirement: node MUST have this label
  nodeSelector:
    disktype: ssd

  # Soft preference: scheduler tries to place here, but can choose elsewhere
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
    # Pod affinity: prefer to run near other payment-service Pods
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: payment-service
            topologyKey: kubernetes.io/hostname

  # Tolerations: allow scheduling on nodes with the 'dedicated=high-cpu' taint
  tolerations:
    - key: dedicated
      operator: Equal
      value: high-cpu
      effect: NoSchedule

  # Topology spread: distribute replicas evenly across zones
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: payment-service

  containers:
    - name: payment-service
      image: registry.thecodeforge.io/payment-service:v2.4.1
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1000m"
          memory: "1Gi"
Output
pod/io-thecodeforge-payment-service created
# Verify scheduling decision
kubectl describe pod io-thecodeforge-payment-service -n production | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5s default-scheduler Successfully assigned production/io-thecodeforge-payment-service to node-3
Requests vs Limits — The Scheduler Only Sees Requests
  • Guaranteed QoS (requests=limits): Pod is last to be evicted under resource pressure.
  • Burstable QoS (requests < limits): Pod can burst but is evicted before Guaranteed Pods.
  • BestEffort QoS (no requests, no limits): First to be evicted. Never use in production.
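
The three QoS classes follow mechanically from a Pod's requests and limits. A minimal Python sketch of that classification, assuming quantities are compared as plain strings (so "1000m" and "1" would wrongly count as different here) and using a hypothetical container representation:

```python
def _container_is_guaranteed(container):
    """CPU and memory limits are set, and each request equals its limit."""
    limits = container.get("limits", {})
    requests = container.get("requests", {})
    for resource in ("cpu", "memory"):
        if resource not in limits:
            return False
        # An omitted request defaults to the limit.
        if requests.get(resource, limits[resource]) != limits[resource]:
            return False
    return True


def qos_class(containers):
    """Classify a Pod from the resource settings of its containers."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"
    if all(_container_is_guaranteed(c) for c in containers):
        return "Guaranteed"
    return "Burstable"


# Resource settings of the payment-service container from the manifest above.
payment = {
    "requests": {"cpu": "500m", "memory": "512Mi"},
    "limits": {"cpu": "1000m", "memory": "1Gi"},
}

print(qos_class([payment]))
```

The payment-service Pod lands in Burstable because its requests are lower than its limits. Setting both to the same values would move it to Guaranteed.
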
Production Insight
Scheduler performance degrades with cluster size and Pod count.
At > 5000 Pods, scheduling latency can exceed 1 second, causing deployment rollouts to slow dramatically.
Mitigate with: scheduler extenders for custom logic, Pod topology spread constraints instead of pod anti-affinity (more efficient), and multiple scheduler profiles for different workload classes.
The scheduler's scoring algorithm is pluggable — you can weight or disable scoring plugins via a KubeSchedulerConfiguration.
Key Takeaway
The scheduler is a scoring engine, not a bin-packer.
It ranks feasible nodes and picks the best match at scheduling time.
It does not rebalance, predict load, or consider limits.
Understanding the filter-then-score pipeline — and how nodeSelector, affinity, taints, and topology spread interact within it — is essential for controlling Pod placement at scale.
Scheduling Constraint Selection
If: Pod MUST run on a specific type of node (e.g., GPU, SSD).
Use: nodeSelector or nodeAffinity in required mode. Hard constraint: the Pod stays Pending if no node matches.
If: Pod PREFERS a specific node type but can run elsewhere.
Use: nodeAffinity in preferred mode. Soft constraint: the scheduler tries to match but places the Pod elsewhere if needed.
If: Replicas must be spread across failure domains (zones, nodes).
Use: topologySpreadConstraints. More flexible and performant than pod anti-affinity.
If: Pod should run near (or away from) other specific Pods.
Use: podAffinity (co-locate) or podAntiAffinity (spread). At scale prefer topologySpreadConstraints.
If: Node has taints (dedicated nodes, spot instances).
Use: Add tolerations to the Pod spec. Without a matching toleration, the Pod won't schedule on the tainted node.
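
For topology spread, the maxSkew rule is worth seeing as arithmetic: a zone is eligible only if placing the Pod there keeps its count within maxSkew of the least-loaded zone. A minimal Python sketch, assuming every zone is an eligible domain and using a hypothetical `allowed_zones` helper:

```python
def allowed_zones(pods_per_zone, max_skew=1):
    """Zones where one more matching Pod keeps skew within max_skew."""
    least_loaded = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if (count + 1) - least_loaded <= max_skew
    )


# Hypothetical current spread of payment-service Pods.
current = {"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}

print(allowed_zones(current))
```

With whenUnsatisfiable: DoNotSchedule this acts as a filter, so a Pod that also strongly prefers us-east-1a would still be placed in one of the other two zones.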

Pod Networking: How Containers Talk to Each Other

Kubernetes networking has three fundamental requirements, enforced by the CNI (Container Network Interface) plugin:

  1. Every Pod gets its own IP address, unique across the cluster.
  2. Pods on any node can communicate with Pods on any other node without NAT.
  3. Agents on a node (kubelet, system daemons) can communicate with all Pods on that node.

These requirements are simple to state but complex to implement. The CNI plugin (Calico, Cilium, Flannel, AWS VPC CNI) is responsible for wiring this up. It allocates IP addresses from the node's Pod CIDR range, sets up network interfaces inside the Pod's network namespace, and configures routing rules so Pods can reach each other across nodes.
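
As an illustration of per-node Pod CIDR allocation, the Python sketch below carves a cluster-wide range into one /24 per node using the standard `ipaddress` module. The cluster CIDR `10.244.0.0/16`, the node names, and the `allocate_pod_cidrs` helper are assumptions for the example; some CNI plugins run their own IPAM and do not use per-node ranges at all.

```python
import ipaddress


def allocate_pod_cidrs(cluster_cidr, node_names, node_prefix=24):
    """Assign each node the next unused subnet of the cluster CIDR."""
    subnets = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix)
    return {name: str(subnet) for name, subnet in zip(node_names, subnets)}


def any_overlap(cidrs):
    """True if any two ranges overlap, which would break Pod routing."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return any(
        a.overlaps(b)
        for index, a in enumerate(networks)
        for b in networks[index + 1:]
    )


allocation = allocate_pod_cidrs("10.244.0.0/16", ["node-1", "node-2", "node-3"])
for node, cidr in allocation.items():
    print(node, cidr)
print("overlap:", any_overlap(allocation.values()))
```

A /16 split into /24 ranges yields 256 node ranges of 256 addresses each, which is one way the Pod CIDR size caps both node count and Pods per node.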

kube-proxy handles Service networking. It watches the API server for Service and Endpoint objects, then programs iptables rules (or IPVS rules) on each node. When a Pod connects to a Service's ClusterIP, the kernel's iptables rules intercept the connection and DNAT it to one of the backend Pod IPs. This is why Service IPs are virtual — they do not exist on any network interface.
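
In iptables mode, backend selection is commonly described as a chain of rules where rule i matches with probability 1/(n - i), so the last rule always matches. The Python sketch below checks that this chain gives every backend an equal share. It models only the probability arithmetic and is an illustration under that assumption, not a dump of real kube-proxy rules.

```python
from fractions import Fraction


def rule_probabilities(backends):
    """Per-rule match probabilities: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, backends - index) for index in range(backends)]


def effective_shares(backends):
    """Overall share of traffic each backend receives from the rule chain."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for probability in rule_probabilities(backends):
        shares.append(remaining * probability)
        remaining *= 1 - probability
    return shares


print(rule_probabilities(3))
print(effective_shares(3))
```

The increasing per-rule probabilities look uneven at first glance, but each rule only sees the traffic that earlier rules passed over, so the effective split is uniform.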

networking-debug.sh (Bash)
# Debugging Pod networking step by step

# 1. Verify Pod has an IP address
kubectl get pods -n production -o wide
# If Pod IP is <none>, the CNI plugin failed to assign an address

# 2. Check if the CNI plugin is healthy
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|aws-node'

# 3. Verify Pod CIDR allocation per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Each node must have a unique, non-overlapping CIDR

# 4. Test Pod-to-Pod connectivity across nodes
kubectl exec -it pod-on-node-a -- ping <pod-ip-on-node-b>
# If this fails but intra-node works, the CNI cross-node routing is broken

# 5. Check Service endpoints
kubectl get endpoints payment-service -n production
# If endpoints are empty, no Pods match the Service's selector

# 6. Test Service DNS resolution
kubectl exec -it <pod> -- nslookup payment-service.production.svc.cluster.local
# If DNS fails, check CoreDNS pods: kubectl get pods -n kube-system | grep coredns

# 7. Inspect iptables rules for a Service
# (run on the node where your Pod is running)
iptables-save | grep <service-cluster-ip>
Output
NAME READY STATUS IP NODE
payment-service-7d8f9-abc12 1/1 Running 10.244.1.45 node-2
payment-service-7d8f9-def34 1/1 Running 10.244.2.78 node-3
NAME READY STATUS RESTARTS AGE
calico-node-abc12 1/1 Running 0 30d
calico-kube-controllers-5d78-def34 1/1 Running 0 30d
node-1 10.244.0.0/24
node-2 10.244.1.0/24
node-3 10.244.2.0/24
PING 10.244.2.78 (10.244.2.78): 56 data bytes
64 bytes from 10.244.2.78: seq=0 ttl=62 time=0.456 ms
NAME ENDPOINTS AGE
payment-service 10.244.1.45:8080,10.244.2.78:8080 15d
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: payment-service.production.svc.cluster.local
Address 1: 10.96.45.12 payment-service.production.svc.cluster.local
The Three Layers of K8s Networking
  • Pod IP works but Service IP fails: kube-proxy or iptables issue.
  • Service IP works but DNS fails: CoreDNS issue.
  • DNS works but external access fails: Ingress controller or cloud LB issue.
Production Insight
CNI plugin selection has massive performance and operational implications.
Calico (BGP mode) scales well but requires BGP peering knowledge.
Cilium (eBPF) bypasses iptables entirely, offering better performance at scale (>1000 Services) but is more complex to debug.
AWS VPC CNI assigns real VPC IP addresses to Pods, simplifying security group integration but consuming VPC IP space rapidly.
Evaluate CNI based on: Service count, NetworkPolicy requirements, observability needs, and team expertise.
Key Takeaway
Kubernetes networking is a layered system: CNI for Pod connectivity, kube-proxy for Service load balancing, Ingress for external access.
Debug from the bottom up — Pod IP, then ClusterIP, then DNS, then Ingress.
CNI plugin choice is a long-term architectural decision with performance, security, and operational trade-offs.

Kubernetes Storage: PersistentVolumes, Claims, and StorageClasses

Kubernetes storage decouples the Pod lifecycle from the data lifecycle. A Pod can be deleted and recreated, but its data persists if it uses a PersistentVolume (PV) through a PersistentVolumeClaim (PVC). This is critical for stateful workloads like databases.

PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically by a StorageClass. It is a cluster resource, like a node. PVs have a capacity and access mode (ReadWriteOnce, ReadOnlyMany, ReadWriteMany).

PersistentVolumeClaim (PVC) is a request for storage by a user. It specifies size and access mode. Kubernetes binds a PVC to a PV that meets the requirements. If no matching PV exists, the PVC remains Pending — unless a StorageClass with a dynamic provisioner is referenced.

StorageClass defines a class of storage. It specifies the provisioner (e.g., ebs.csi.aws.com, the AWS EBS CSI driver used in the example below), parameters (type, IOPS), and reclaim policy. When a PVC requests a StorageClass, the provisioner automatically creates a PV that satisfies the claim.
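
The static binding rule (a claim binds to a volume that meets its requirements, otherwise it stays Pending) can be sketched in Python. The PV and PVC records and the `bind` helper are hypothetical, and choosing the smallest volume that fits is an assumption of this sketch rather than a documented guarantee.

```python
def matches(pv, pvc):
    """A PV satisfies a PVC on class, capacity, and access modes."""
    return (
        pv["status"] == "Available"
        and pv["storage_class"] == pvc["storage_class"]
        and pv["capacity_gi"] >= pvc["request_gi"]
        and set(pvc["access_modes"]) <= set(pv["access_modes"])
    )


def bind(pvs, pvc):
    """Return the name of the PV to bind, or None if the PVC stays Pending."""
    candidates = [pv for pv in pvs if matches(pv, pvc)]
    if not candidates:
        return None
    return min(candidates, key=lambda pv: pv["capacity_gi"])["name"]


pvs = [
    {"name": "pv-small", "capacity_gi": 50, "access_modes": ["ReadWriteOnce"],
     "storage_class": "io-thecodeforge-fast", "status": "Available"},
    {"name": "pv-large", "capacity_gi": 500, "access_modes": ["ReadWriteOnce"],
     "storage_class": "io-thecodeforge-fast", "status": "Available"},
    {"name": "pv-fit", "capacity_gi": 100, "access_modes": ["ReadWriteOnce"],
     "storage_class": "io-thecodeforge-fast", "status": "Available"},
]
pvc = {"request_gi": 100, "access_modes": ["ReadWriteOnce"],
       "storage_class": "io-thecodeforge-fast"}

print(bind(pvs, pvc))
```

When `bind` returns None and the class has a dynamic provisioner, that provisioner creates a new volume for the claim instead of leaving it Pending.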

storage-example.yaml (YAML)
# StorageClass for AWS gp3 volumes with 3000 IOPS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io-thecodeforge-fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# PVC that uses the StorageClass above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: io-thecodeforge-payment-db-pvc
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: io-thecodeforge-fast
---
# Pod using the PVC
apiVersion: v1
kind: Pod
metadata:
  name: io-thecodeforge-payment-db
  namespace: production
spec:
  containers:
    - name: postgres
      image: postgres:16
      env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: io-thecodeforge-payment-db-pvc
Output
# After creation
kubectl get sc
# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
# io-thecodeforge-fast ebs.csi.aws.com Delete WaitForFirstConsumer
kubectl get pvc -n production
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
# io-thecodeforge-payment-db-pvc Bound pvc-abc123 100Gi RWO io-thecodeforge-fast
kubectl get pv
# NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM
# pvc-abc123 100Gi RWO Delete Bound production/io-thecodeforge-payment-db-pvc
PV Reclaim Policy: The Silent Data Loss Trap
If a PVC is deleted, what happens to the underlying PV depends on its persistentVolumeReclaimPolicy:
- Retain: the PV remains but moves to the Released state; you must reclaim it manually.
- Delete: the PV and the underlying storage are deleted. This is the default for dynamically provisioned volumes.
- Recycle: deprecated. Attempts to scrub and re-use the volume.
Production gotcha: if you delete a PVC whose PV has a Delete reclaim policy without first taking a snapshot, you lose all data. Always set Retain for critical databases, or use a backup solution.
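One way to follow the Retain advice is a dedicated StorageClass for irreplaceable data. The sketch below copies the gp3 parameters from the storage example above; the class name io-thecodeforge-fast-retain is a hypothetical name chosen for illustration.

```yaml
# Hypothetical class name; parameters mirror io-thecodeforge-fast from the example above.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io-thecodeforge-fast-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
# Retain keeps the PV and the backing volume after the PVC is deleted.
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```

A PV that already exists keeps the policy it was created with. To protect one in place, patch it directly: kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'.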
Production Insight
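Dynamic provisioning with a default StorageClass is dangerous: a PVC that omits storageClassName silently gets the default class, which may have the wrong performance tier or reclaim policy. A misspelled class name fails loudly instead, leaving the PVC Pending.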
If you delete a Namespace, all PVCs in it are deleted, and with ReclaimPolicy=Delete, all data is gone.
Monitor PVC usage and set ResourceQuotas on storage requests to prevent runaway claims from exhausting your cloud budget.
For stateful workloads, use StatefulSets with volumeClaimTemplates to automatically generate unique PVCs per replica.
Key Takeaway
Kubernetes storage is about lifecycle decoupling: PV is a resource, PVC is a claim, StorageClass enables dynamic provisioning.
The reclaim policy determines whether data survives PVC deletion — set to Retain for anything irreplaceable.
StatefulSets with volumeClaimTemplates are the correct pattern for database-like workloads.
Choosing a Storage Approach
If: Ephemeral data such as logs, scratch space, or caches.
Use: An emptyDir volume. Data is lost when the Pod is deleted, which is expected.
If: Persistent data that must survive Pod restarts (single replica).
Use: A PVC with a StorageClass that provides a block store (EBS, Persistent Disk). Use ReadWriteOnce access mode.
If: Multi-replica app that needs shared read-write access (e.g., shared uploads or shared config).
Use: A PVC with ReadWriteMany access mode. Not all provisioners support it; consider NFS, Amazon EFS, or CephFS.
If: Database with strict consistency requirements (Postgres, MySQL).
Use: A single PVC per replica (StatefulSet + volumeClaimTemplates). Prefer local SSDs or dedicated EBS volumes. Never use ReadWriteMany for databases.
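For the ephemeral case in the list above, a minimal sketch of an emptyDir volume could look like this. The Pod name, image, and command are illustrative assumptions, not part of the original examples.

```yaml
# Hypothetical Pod showing scratch space that lives and dies with the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: io-thecodeforge-scratch-demo
spec:
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "echo cached > /scratch/data && sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 1Gi   # the Pod is evicted if usage exceeds this limit
```

The data survives container restarts inside the same Pod but is removed when the Pod leaves the node, so nothing here needs a PVC or a StorageClass.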

Namespaces, Resource Quotas, and Multi-Tenancy

Namespaces are virtual clusters within a physical cluster. They provide isolation boundaries for resources, RBAC, and network policies. Every resource lives in a namespace — except cluster-scoped resources like Nodes and PersistentVolumes.

ResourceQuota limits aggregate resource consumption within a namespace. You can set quotas on CPU, memory, Pod count, PVC storage, and even the number of Services. Without quotas, a single misconfigured application can consume all cluster resources and starve others.

LimitRange sets default requests/limits and min/max constraints for Pods in a namespace. This prevents a Pod from requesting an absurd amount of resources or running without any limits.

Multi-tenancy with Namespaces is common: each team gets its own namespace, with RBAC restricting cross-namespace access. But true multi-tenancy (running untrusted workloads) requires additional isolation — consider virtual clusters (vClusters) or sandbox containers (gVisor, Kata Containers).

quota-and-limitrange.yaml (YAML)
# ResourceQuota for a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: io-thecodeforge-team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "10"
    requests.storage: 500Gi
    pods: "50"
    services: "10"
---
# LimitRange to enforce default resource boundaries
apiVersion: v1
kind: LimitRange
metadata:
  name: io-thecodeforge-default-limits
  namespace: team-a
spec:
  limits:
    - default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "100m"
        memory: 128Mi
      max:
        cpu: "2"
        memory: 4Gi
      min:
        cpu: "50m"
        memory: 64Mi
      type: Container
Output
# After applying
kubectl describe resourcequota -n team-a
# Name: io-thecodeforge-team-quota
# Namespace: team-a
# Resource Used Hard
# -------- --- ---
# pods 12 50
# requests.cpu 3.5 10
# requests.memory 7Gi 20Gi
# limits.cpu 8 20
# limits.memory 18Gi 40Gi
# persistentvolumeclaims 2 10
# requests.storage 120Gi 500Gi
# services 4 10
kubectl describe limitrange -n team-a
# Limits:
# Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
# ---- -------- --- --- --------------- ------------- -----------------------
# Container cpu 50m 2 100m 500m -
# Container memory 64Mi 4Gi 128Mi 512Mi -
Multi-Tenancy Warning
Using Namespaces alone for security isolation between untrusted tenants is insufficient. A Pod in one namespace can still connect to a Service in another namespace unless NetworkPolicies block it. Also, a Pod can access the API server with its ServiceAccount token — RBAC must be scoped per namespace. For hard multi-tenancy, consider dedicated clusters, virtual clusters (vCluster), or sandbox runtimes.
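To make the NetworkPolicy point concrete, here is a minimal sketch that limits ingress in team-a to traffic from Pods in the same namespace. The policy name is an assumption, and the policy only takes effect if the cluster's CNI plugin enforces NetworkPolicy.

```yaml
# Hypothetical policy: Pods in team-a accept ingress only from Pods in team-a.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: team-a
spec:
  podSelector: {}          # applies to every Pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # no namespaceSelector, so this matches Pods in team-a only
```

Traffic from other namespaces is dropped because it matches no allow rule. Specific cross-namespace flows can be opened later with additional policies that use a namespaceSelector.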
Production Insight
Without ResourceQuotas, a single team's application can silently consume all cluster resources and block other teams.
Quota enforcement is immediate — if a team tries to create a Pod that would exceed its quota, the API server rejects it.
LimitRange is essential for environments where teams might forget to set requests and limits — it prevents BestEffort Pods by default.
Monitor namespace usage with kubectl top and Prometheus alerts to catch quota exhaustion before it causes deployment failures.
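A quota alert along these lines could be sketched as a Prometheus rule file. It assumes kube-state-metrics is installed and exporting the kube_resourcequota metric; the alert name, threshold, and duration are illustrative choices.

```yaml
# Assumes kube-state-metrics is scraped by Prometheus; names and thresholds are illustrative.
groups:
  - name: namespace-quota
    rules:
      - alert: NamespaceQuotaAlmostFull
        # Ratio of used to hard quota per namespace and resource.
        expr: |
          kube_resourcequota{type="used"}
            / ignoring(type)
          (kube_resourcequota{type="hard"} > 0)
            > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Quota {{ $labels.resource }} in {{ $labels.namespace }} is above 90%"
```

Firing at 90% leaves time to raise the quota or clean up before the API server starts rejecting new Pods.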
Key Takeaway
Namespaces provide lightweight isolation — RBAC, ResourceQuotas, and NetworkPolicies make them safe for trusted tenants.
Always set ResourceQuotas and LimitRange per namespace in shared clusters — without them, one team can starve everyone.
For truly untrusted workloads, Namespaces alone are not enough: reach for dedicated clusters or sandbox containers.
Namespace Isolation Strategy
If: Same team, different environments (dev, staging).
Use: Separate namespaces per environment. RBAC restricts team members to their environment. A ResourceQuota is optional for dev, but set one for staging to match production.
If: Multiple teams on a shared cluster.
Use: One namespace per team with ResourceQuota, LimitRange, and RBAC. NetworkPolicy denies all cross-namespace traffic by default; allow specific flows explicitly.
If: Running untrusted code (CI/CD agents, third-party apps).
Use: A separate cluster or sandbox runtime. Namespace-level isolation is insufficient. Consider virtual clusters with vCluster or run in a separate pool of nodes with taints and tolerations.
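The per-team RBAC in the list above might be sketched as a namespaced Role plus RoleBinding. The Role name, the group name team-a-devs, and the exact resource list are assumptions for illustration; how groups map to users depends on the cluster's authentication setup.

```yaml
# Hypothetical Role: full control over common workload objects, only inside team-a.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-developer
  namespace: team-a
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "configmaps", "deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Binds the Role to a group; the group name is an assumption.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-devs
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-developer
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, members of the group get no access to other teams' namespaces unless a separate binding grants it.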

Kubernetes vs Docker Compose: When to Use Each

Docker Compose and Kubernetes both orchestrate containers, but they serve fundamentally different use cases. Docker Compose is a single-host orchestration tool designed for development environments and small deployments. Kubernetes is a multi-host, production-grade orchestration system with automated healing, scaling, and rolling updates.

Feature | Docker Compose | Kubernetes
Scope | Single host | Multi-node cluster
Scaling | Manual (docker-compose up --scale) | Automatic (Horizontal Pod Autoscaler, Cluster Autoscaler)
Self-healing | Limited (restart policies restart crashed containers on the same host; nothing is rescheduled if the host dies) | Automatic (controllers restart or recreate Pods, including on other nodes)
Rolling updates | Basic (containers are recreated in place) | Configurable (maxSurge, maxUnavailable; canary and blue-green with additional manifests or tooling)
Networking | Per-project bridge network with DNS by service name | Service abstraction with DNS, kube-proxy, CNI
Storage | Named volumes on a single host | PV/PVC with dynamic provisioning across nodes
Secrets management | Env files and file-based secrets mounted under /run/secrets | Secret objects (base64-encoded, optional encryption at rest), external providers via CSI drivers
Configuration | One Compose file describing all services | Declarative API with multiple resource types
Learning curve | Low | High

Choose Docker Compose when: you are developing locally, running CI integration tests, or deploying a simple application on a single VM where orchestration overhead is not justified.

Choose Kubernetes when: you need high availability, rolling updates, auto-scaling, multi-node deployment, or run multiple microservices that need advanced networking (service discovery, load balancing, network policies). Many teams start with Docker Compose for development and then write Kubernetes manifests for production — maintaining both can be an overhead, but tools like Kompose can convert Compose files to Kubernetes YAML.

docker-compose-vs-k8s.yaml (YAML)
# Docker Compose version (docker-compose.yml)
version: '3.8'
services:
  web:
    image: nginx:1.27
    ports:
      - "80:80"
  app:
    build: .
    environment:
      - DATABASE_URL=postgres://db:5432/mydb
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    volumes:
      - pgdata:/var/lib/postgresql/data

secrets:
  db_password:
    file: ./secrets/db_password.txt

volumes:
  pgdata:

---
# Equivalent Kubernetes manifest (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
  - port: 80
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: app
        image: myapp:latest
        env:
        - name: DATABASE_URL
          value: postgres://db:5432/mydb
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-password
              key: password
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-password
              key: password
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: db-password
type: Opaque
data:
  password: <base64-encoded-password>
---
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  selector:
    app: db
  ports:
  - port: 5432
  clusterIP: None  # Headless service for StatefulSet
Output
# Output not applicable; provided for comparison visibility.
Don't Over-Engineer: Start Simple
If your team is small, your application fits on one machine, and you don't need zero-downtime deployments, Docker Compose is perfectly adequate. Adopting Kubernetes too early adds operational overhead that can slow development. Start with Compose, and migrate to Kubernetes when you hit scaling or availability requirements.
Production Insight
When migrating from Docker Compose to Kubernetes, expect the following pain points:
- Networking: Compose's simple links become Service/Endpoint/Ingress objects.
- Storage: Named volumes become PV/PVC with persistent lifecycle.
- Secrets: File-based secrets need to be base64 encoded and managed via Secret resources or external providers.
- Scaling: Kubernetes replica management and rolling updates require understanding of PodDisruptionBudget and health probes.
Use Kompose or manual translation; always validate with kubectl apply --dry-run=server.
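As one example of the scaling pain point, a PodDisruptionBudget for the web Deployment from the comparison above could be sketched like this. The name web-pdb and the minAvailable value are illustrative assumptions.

```yaml
# Hypothetical PDB: voluntary disruptions (node drains, upgrades) must leave 2 web Pods running.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```

With 3 replicas this allows one Pod at a time to be evicted during a drain. Compose has no equivalent because it never moves containers between hosts.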
Key Takeaway
Docker Compose is for single-host, small-scale, development use. Kubernetes is for multi-host, production-grade orchestration. Choose based on your current need, not the industry hype.

The Evolution of Deployment — Why We Stopped Trusting Bare Metal

You don't understand Kubernetes until you understand the deployment hell it replaced. Before containers, you had two choices: dump a JAR on a physical server and pray nothing else touched the port, or waste 40% of your budget on VM overhead because each app needed its own OS instance.

Virtualization fixed the hardware waste but introduced its own cancer — golden images that rotted over time, configuration drift that turned production into a snowflake zoo, and boot times measured in coffee breaks. Then came containers. Docker gave you repeatable build artifacts and second-level startup. But now you had 50 containers on a single VM and no sane way to manage them.

That's the gap Kubernetes fills. Not as a container manager — as a control system. It takes your container images and applies a desired state loop. You say "I want 3 replicas of payment-api behind a stable DNS name." Kubernetes makes it true, then keeps it true. No SSH, no manual restart, no "it works on my machine." The whole industry pivoted from pet servers to cattle because manual operations don't scale past 5 microservices.

BareMetalVsK8s.yml (YAML)
# io.thecodeforge - devops tutorial

# What deployment looked like in 2014 (illustrative pseudo-config, not a real schema)
# One physical server. One app. Hours to reprovision.
server:
  hostname: prod-payments-01
  os: CentOS 7.2
  app: payments.jar v1.3.2
  ports:
    - 8080
  dependencies:
    - postgresql-9.6
  scaling: buy another server, wait 3 days
---
# What Kubernetes does instead
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments   # must match spec.selector or the API server rejects the Deployment
    spec:
      containers:
        - name: payments
          image: payments:v1.3.2
          ports:
            - containerPort: 8080
Output
Kubernetes reconciles desired state automatically.
Manual reprovisioning is dead.
Production Trap:
Don't treat Kubernetes as 'Docker Compose for production'. Compose is linear, single-host, and assumes you can docker-compose down. Kubernetes assumes nodes die hourly. Your app must survive that.
Key Takeaway
Kubernetes solves the orchestration problem containers created — not the containers themselves.

Why Kubernetes Stands Out — The Desired State Loop

Every other orchestrator tells you how to start processes. Kubernetes tells you how the system should look and makes reality match the spec. This is the single most important concept to internalize.

When you write a Deployment, you declare: "3 replicas, port 8080, liveness probe hitting /healthz." The API server stores that intent in etcd. Controllers in kube-controller-manager watch the API server and compare the desired replica count with what actually exists, and the kubelet on each node does the same for the Pods assigned to it: "Does what is running match the spec? No? Fix it." This isn't a one-time deploy. It's a reconciliation loop, driven by watch events and periodic resyncs, that keeps running until you delete the resource.

Why does this matter? Because production never stays still. A node crashes: the ReplicaSet controller sees 2 replicas instead of 3 and creates a replacement. A container exceeds its memory limit: it is OOM-killed and the kubelet restarts it according to the Pod's restartPolicy. Traffic spikes: your HorizontalPodAutoscaler reads the metrics and tells the Deployment to scale to 10. No human touching a terminal at 3 AM.

The magic isn't the containers. It's the control theory applied to distributed systems. You describe the steady state. Kubernetes enforces it. Period.

DesiredStateLoop.yml (YAML)
# io.thecodeforge - devops tutorial

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway   # must match spec.selector.matchLabels
    spec:
      containers:
        - name: gateway
          image: api-gateway:2.4.1
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 500m
              memory: 256Mi
Output
$ kubectl get pods -l app=gateway
NAME READY STATUS RESTARTS AGE
api-gateway-7d4b9f8c4f-2xk9q 1/1 Running 0 12m
api-gateway-7d4b9f8c4f-5hbtp 1/1 Running 0 12m
api-gateway-7d4b9f8c4f-9n3m1 1/1 Running 0 12m
# Kill one pod — see reconciliation in action
$ kubectl delete pod api-gateway-7d4b9f8c4f-2xk9q
pod "api-gateway-7d4b9f8c4f-2xk9q" deleted
$ kubectl get pods -l app=gateway
NAME READY STATUS RESTARTS AGE
api-gateway-7d4b9f8c4f-5hbtp 1/1 Running 0 12m
api-gateway-7d4b9f8c4f-9n3m1 1/1 Running 0 12m
api-gateway-7d4b9f8c4f-df6kj 1/1 Running 0 5s
Senior Shortcut:
When debugging state drift, never exec into a pod to 'fix' files. Anything you change inside a running container is lost as soon as the container restarts or the Pod is replaced. Change the spec or the image. Everything else is futile.
Key Takeaway
Kubernetes is a control system, not a process manager. Declare the end state, and it handles the infinite loop to keep you there.

Real-World Kubernetes: Where the Theory Dies

You've read the docs. You've deployed a pod. Now what? The real value of Kubernetes isn't container orchestration — it's the patterns that survive production. Three use cases define modern k8s: stateless web backends, event-driven batch jobs, and stateful data pipelines.

Stateless apps are the entry drug. Horizontal Pod Autoscaler + Deployment + Service = you can absorb traffic spikes without waking up at 3 AM. Batch jobs use Jobs and CronJobs to replace cron on bare metal — way easier to restart failed pods than re-ssh into a dead VM. Stateful stuff uses StatefulSets with PersistentVolumeClaims for databases like PostgreSQL or Kafka. You don't need to manage the database lifecycle in k8s — just give it stable storage and a stable network identity.

The trap? Thinking every app belongs in k8s. Latency-sensitive workloads, GPU training jobs that don't scale horizontally, or anything that needs raw hardware access — leave those on bare metal or spot instances. Kubernetes is not a Swiss Army knife. It's a hammer. Use it on nails.

production-use-case.yml (YAML)
# io.thecodeforge - devops tutorial

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: reporting:2.1.0
            command: ["/bin/report"]
          restartPolicy: OnFailure
Output
horizontalpodautoscaler.autoscaling/web-backend-hpa created
cronjob.batch/nightly-report created
Production Trap:
Don't put your Postgres primary in k8s unless you have dedicated node pools with reserved resources. Network-attached volumes provisioned through a CSI driver add latency that can hurt write-heavy databases. If you can't provide fast local storage, use a managed DB service outside the cluster.
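If the database does live outside the cluster, an ExternalName Service gives in-cluster clients a stable name for it. This is a minimal sketch; the Service name and the hostname payments-db.example.com are placeholders, not real endpoints.

```yaml
# Hypothetical names: in-cluster clients connect to payments-db.production.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: payments-db
  namespace: production
spec:
  type: ExternalName
  externalName: payments-db.example.com   # placeholder for the managed database hostname
```

Cluster DNS answers with a CNAME to the external host, so no traffic is proxied. Swapping database endpoints later means changing one field instead of every application's configuration.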
Key Takeaway
Pick the workload pattern (stateless, batch, stateful) before you pick the Kubernetes resource.

Kubernetes Projects That Actually Teach You Something

Stop running nginx in a playground. Build something that breaks, then fix it. Three projects will teach you more than any certification: a multi-service web app with zero-downtime deploys, a CI/CD pipeline that runs entirely inside the cluster, and a GitOps setup that auto-remediates drift.

First project: Deploy a frontend + API + database. Use rolling updates with readiness probes. Simulate a bad deploy and watch the readiness probe keep the broken Pods out of rotation while the rollout stalls, then roll back with kubectl rollout undo (a Deployment does not roll back on its own). You'll understand why livenessProbe and readinessProbe are not optional. Second: Run a Jenkins or Argo Workflows executor inside k8s with dynamic pod-per-build. Learn how PersistentVolumeClaims hold workspace data and how cluster autoscaling handles build spikes. Third: Set up Argo CD or Flux with a Git repo. Break the cluster state manually and watch it self-heal. That's the Desired State Loop in action, not theory.

These projects force you to hit real problems: pod eviction, OOMKilled containers, RBAC misconfigurations, and etcd latency when the control plane gets hammered. You'll stop treating k8s like magic and start treating it like a distributed system that demands respect.

project-stack.yml (YAML)
# io.thecodeforge - devops tutorial

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
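  selector:
    matchLabels:
      app: gateway
  template:
    metadata:
      labels:
        app: gateway   # required: must match spec.selector.matchLabels
    spec: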
      containers:
      - name: gateway
        image: myorg/api-gateway:2.4.0
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
Output
deployment.apps/api-gateway created
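# With maxUnavailable: 1 and maxSurge: 1, at most one Pod is unavailable and at most one extra Pod exists during a rollout
# The readiness probe waits 5s after container start before its first check of /health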
Senior Shortcut:
If your readiness probe fails but liveness passes, the pod stays alive but stops receiving traffic. That is what makes rolling updates and canary deploys safe: the new version has to validate itself before it takes live traffic.
Key Takeaway
Build something that breaks. Debugging a broken rollout teaches you more than ten perfect tutorials.
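For the third project, an Argo CD Application with automated sync is one way to get the self-healing behavior described above. This is a sketch under assumptions: Argo CD is installed in the argocd namespace, and the repository URL, path, and target namespace are placeholders.

```yaml
# Assumes Argo CD is installed; repoURL, path, and namespaces are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-gateway
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git
    targetRevision: main
    path: apps/api-gateway
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
```

With selfHeal enabled, a manual kubectl edit of a managed resource is reverted to the state in Git, which is the drift experiment the project asks for.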

What Is Kubernetes? The Bare Minimum You Need to Know

Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications. At its core, you manage a cluster: a set of machines called nodes. One or more nodes run the control plane (historically called the master); the rest are workers that run your workloads. You define your app's desired state (how many replicas, which image, what ports) in a YAML manifest. Kubernetes then ensures the cluster matches that state by healing failures, scaling with load, and rolling out updates. A Pod is the smallest deployable unit: one or more containers sharing networking and storage. Deployments manage ReplicaSets. Services provide stable network endpoints. This declarative approach means you tell Kubernetes what you want, not how to achieve it. The system handles the rest, watching for drift and correcting it automatically.

basic-pod.yml (YAML)
# io.thecodeforge - devops tutorial
# thecodeforge: basics/kubernetes
apiVersion: v1
kind: Pod
metadata:
  name: nginx-basic
  labels:
    app: webserver
spec:
  containers:
    - name: nginx
      image: nginx:1.25
      ports:
        - containerPort: 80
      resources:
        requests:
          memory: "128Mi"
          cpu: "250m"
        limits:
          memory: "256Mi"
          cpu: "500m"
Output
kubectl apply -f basic-pod.yml
# pod/nginx-basic created
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# nginx-basic 1/1 Running 0 5s
Production Trap:
Don't run bare Pods in production. If the node fails, nothing reschedules them onto another node. Always use a Deployment, or a Job for one-shot work.
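For the one-shot case, a Job is the controller-backed alternative to a bare Pod. The sketch below is illustrative: the Job name, image, and command are hypothetical.

```yaml
# Hypothetical one-shot task; image and command are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  backoffLimit: 3                  # retry a failed Pod up to 3 times
  ttlSecondsAfterFinished: 3600    # clean up the finished Job after an hour
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myorg/db-migrate:1.0.0
          command: ["/bin/migrate", "up"]
```

Unlike a bare Pod, the Job controller creates a replacement Pod if the first one fails or its node disappears before the task completes.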
Key Takeaway
Kubernetes is a declarative system: you define the desired state, and the control loop drives reality toward it.

Key Primitives You Must Understand

Beyond Pods, a handful of objects form Kubernetes' backbone. A Deployment manages Pod lifecycles: rolling updates, rollbacks, replica scaling. It creates a ReplicaSet that keeps the Pod count at the desired number. A Service provides stable networking: Pods get ephemeral IPs, but a Service gives a fixed ClusterIP or LoadBalancer address. Ingress routes external HTTP/S traffic to Services. ConfigMaps and Secrets decouple configuration from images. PersistentVolumeClaims persist data beyond Pod restarts. Namespaces isolate resources within a cluster. RBAC (Role-Based Access Control) locks down who can do what. These primitives layer on each other: Deployment → ReplicaSet → Pod → Container. Knowing which to use and when separates beginners from pros. Start with a Deployment + Service pair; that covers 80% of use cases.

deployment-and-service.yml (YAML)
# io.thecodeforge - devops tutorial
# thecodeforge: basics/primitives
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
  type: ClusterIP
Output
kubectl apply -f deployment-and-service.yml
kubectl get deploy,svc,pods
# Shows 3 replicas, service endpoint, and running pods
Key Pattern:
Always pair a Deployment with a Service. Pod IPs change whenever Pods are replaced or rescheduled, so clients that address Pods directly break on every rollout or scale event. Use named ports for easier Ingress routing.
Key Takeaway
Deployment + Service is the minimum viable pattern: Deployments ensure availability, Services ensure discoverability.
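To expose the nginx-svc Service from the example above over HTTP, an Ingress could be sketched as follows. It assumes an ingress controller is installed and registered under the class name nginx; the hostname demo.example.com is a placeholder.

```yaml
# Assumes an ingress controller with class "nginx" exists; the host is a placeholder.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx-svc   # the ClusterIP Service defined above
                port:
                  number: 80
```

The Ingress routes to the Service rather than to Pods, so the Deployment can scale or roll without any change to the routing rule.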
● Production incident · POST-MORTEM · severity: high

The etcd Disk That Killed the Entire Cluster

Symptom
kubectl commands hang or timeout. New Pods stuck in Pending. Deployment rollouts never complete. API server logs show 'etcdserver: request timed out' errors. Controller-manager logs show leader election failures.
Assumption
The API server is overloaded, or the scheduler has crashed.
Root cause
etcd's data directory was on a network-attached EBS volume that had degraded to p99 write latency of 800ms (normal: 2ms). etcd requires sub-10ms disk writes for stable operation. The degraded disk caused the Raft consensus protocol to stall — the cluster could not commit new state changes. The API server, which depends on etcd for every operation, began queuing requests until it exhausted its connection pool. The scheduler and controller-manager, which watch etcd via the API server, received no updates and effectively froze.
Fix
1. Immediately migrate etcd to low-latency storage: instance-local NVMe SSDs, or provisioned-IOPS EBS if local disks are not an option.
2. Set etcd disk latency alerts at p99 > 10ms as critical.
3. Implement etcd defragmentation on a schedule (etcdctl defrag).
4. Configure etcd auto-compaction (--auto-compaction-retention=8) to prevent unbounded data growth.
5. Monitor etcd member health with etcdctl endpoint health and etcdctl endpoint status.
Key lesson
  • etcd is the single point of failure for the entire cluster. Its disk performance is the cluster's ceiling.
  • Never run etcd on network-attached storage in production. Local SSDs are mandatory.
  • API server timeouts are often etcd problems, not API server problems. Trace downward, not upward.
  • etcd requires periodic defragmentation. Without it, compaction frees space inside the database file but the file itself never shrinks, leading to disk pressure.
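The p99 > 10ms disk latency alert from the fix list could be sketched as a Prometheus rule on etcd's WAL fsync histogram. This assumes Prometheus scrapes the etcd metrics endpoint; the alert name and the 5-minute windows are illustrative choices.

```yaml
# Assumes Prometheus scrapes etcd metrics; alert name and durations are illustrative.
groups:
  - name: etcd-disk
    rules:
      - alert: EtcdWalFsyncSlow
        # p99 of WAL fsync duration per etcd member, compared against 10ms.
        expr: |
          histogram_quantile(0.99,
            sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "etcd p99 WAL fsync on {{ $labels.instance }} is above 10ms"
```

An alert like this would have fired long before latency reached 800ms, while the API server was still responsive enough to act on it.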
Production debug guide
Symptom-driven investigation paths for the most common failure modes.
Symptom · 01
Pod stuck in Pending state.
Fix
1. Run kubectl describe pod <name> and read the Events section.
2. Common causes: insufficient CPU/memory on any node (check kubectl describe nodes for Allocatable vs Allocated), PersistentVolumeClaim not bound, node affinity/taint mismatches, resource quotas exceeded.
3. If no events appear, the scheduler may be down. Check kubectl get pods -n kube-system for kube-scheduler.
Symptom · 02
Pod stuck in CrashLoopBackOff.
Fix
1. Run kubectl logs <pod> --previous to see the logs from the crashed container (current logs may be empty).
2. Common causes: missing environment variables, failed health checks, OOMKill (check kubectl describe pod for Last State), misconfigured entrypoint.
3. If OOMKill, increase memory limits or fix the memory leak. Check kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState}'.
Symptom · 03
Pods cannot reach each other across nodes.
Fix
1. Verify the CNI plugin is healthy: kubectl get pods -n kube-system | grep calico (or flannel/weave).
2. Check if Pod CIDR ranges overlap between nodes: kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'.
3. Verify kube-proxy is running: kubectl get pods -n kube-system | grep kube-proxy.
4. Test from within a Pod: kubectl exec -it <pod> -- curl <service-ip>:<port>.
Symptom · 04
Deployment rollout hangs — new ReplicaSet never becomes ready.
Fix
1. Check the new ReplicaSet: kubectl describe rs <new-rs-name>.
2. Look for Pods that are Pending or CrashLoopBackOff.
3. Check if the new image exists in the registry and if imagePullSecrets are configured.
4. If using rolling update with maxUnavailable=0 and the cluster has no spare capacity, new Pods cannot be scheduled.
5. Rollback: kubectl rollout undo deployment/<name>.
Symptom · 05
etcd high latency alerts firing — API server slow.
Fix
1. Check etcd latency: etcdctl endpoint health --write-out=table.
2. Check disk I/O on etcd nodes: iostat -x 1.
3. Check etcd database size: etcdctl endpoint status --write-out=table.
4. If disk is the bottleneck, migrate to local SSDs.
5. If database is large, run defragmentation: etcdctl defrag.
Symptom · 06
PersistentVolumeClaim stuck in Pending.
Fix
1. Check if any PersistentVolume matches: kubectl get pv.
2. Describe the PVC: kubectl describe pvc <name>. Common causes: no PV available with matching accessModes and storageClassName, or the StorageClass has no provisioner.
3. If using dynamic provisioning, verify the storage provisioner pod is running and hasn't hit a quota or permission error.
★ Kubernetes Triage Cheat Sheet
First-response commands for common K8s production incidents.
Pod not starting — no events visible.
Immediate action
Check if the scheduler is running and if nodes have capacity.
Commands
kubectl get pods -n kube-system | grep scheduler
kubectl describe nodes | grep -A 5 'Allocated resources'
Fix now
If scheduler is down: check its logs with kubectl logs -n kube-system <scheduler-pod-name>. If no capacity: scale the cluster or evict low-priority Pods.
Service returns 502/503 intermittently.
Immediate action
Check if endpoints exist and Pods are passing readiness probes.
Commands
kubectl get endpoints <service-name>
kubectl get pods -l app=<selector> -o wide
Fix now
If endpoints are empty: Pods are failing readiness probes. Check probe configuration and Pod logs. If endpoints exist but 503 persists: check kube-proxy iptables rules with iptables-save | grep <service-cluster-ip>.
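The empty-endpoints check can be scripted so it gives a one-line verdict. A minimal sketch: the function names are hypothetical, and the jsonpath expression reads the ready addresses from the Endpoints object that kubectl get endpoints returns.

```shell
# Print the ready Pod IPs behind a Service (empty output means none are ready).
service_endpoint_ips() {
  kubectl get endpoints "$1" -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Summarize whether a Service has any ready backends.
check_service() {
  ips=$(service_endpoint_ips "$1")
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $1: check readiness probes"
  else
    echo "ready endpoints: $ips"
  fi
}
```

Running check_service <service-name> a few times in a row helps with intermittent errors: a list that keeps changing points at Pods flapping in and out of readiness.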
Node marked NotReady: Pods being evicted.
Immediate action
SSH to the node and check kubelet status.
Commands
kubectl describe node <node-name> | grep -A 10 Conditions
systemctl status kubelet
Fix now
If kubelet is down: systemctl restart kubelet. If disk pressure: clean up unused images with crictl rmi --prune. If memory pressure: identify and kill the offending process.
PersistentVolumeClaim stuck in Pending.
Immediate action
Check if a PersistentVolume exists that matches the claim's requirements.
Commands
kubectl get pv
kubectl describe pvc <pvc-name>
Fix now
If no PV available: provision one manually or ensure the StorageClass has a provisioner. If PV exists but not binding: check accessModes and storageClassName match.
Pod evicted due to node pressure.
Immediate action
Identify the type of pressure and the root cause.
Commands
kubectl describe node <node-name> | grep -i pressure
kubectl top node <node-name>
Fix now
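DiskPressure: clean up old logs and unused images, or increase disk size. MemoryPressure: set memory requests that reflect real usage so the scheduler stops overpacking the node, cap runaway Pods with limits, or add more nodes. PIDPressure: reduce the number of processes per Pod.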
Kubernetes Component Comparison
Component | Role | Failure Impact | Recovery
kube-apiserver | Validates and serves all API requests. Gateway to etcd. | No new deployments, scaling, or config changes. Existing Pods continue running. | Restart the process. If HA, load balancer routes to healthy replica.
etcd | Distributed key-value store. Single source of truth for all cluster state. | Cluster freezes: no state changes possible. If quorum is lost, etcd rejects all writes until a majority of members is restored. | Restore from snapshot or replace failed member. Requires etcdctl expertise.
kube-scheduler | Assigns unscheduled Pods to nodes based on resource availability and constraints. | New Pods stuck in Pending. Existing Pods unaffected. | Restart the process. If leader election fails, check lease in etcd.
kube-controller-manager | Runs reconciliation loops for Deployments, ReplicaSets, Nodes, Endpoints, etc. | No self-healing. Deleted or failed Pods are not replaced. Scaling stops. Node failures not detected. | Restart the process. Controllers resume reconciliation from current state.
kubelet | Node agent. Pulls images, starts containers, reports node status to API server. | Pods on that node stop being managed. Node marked NotReady after 40s (default). Pods evicted after 5 minutes. | Restart kubelet. If node is unhealthy, cordoning and replacing the node may be necessary.
kube-proxy | Programs iptables/IPVS rules for Service load balancing on each node. | Services unreachable from Pods on that node. Cross-node Service access still works from other nodes. | Restart the process. Rules are rebuilt from current Service/Endpoint state.
CoreDNS | Cluster DNS. Resolves Service names to ClusterIPs. | Service DNS resolution fails. Pods can still reach other Pods by direct IP. | Restart CoreDNS Pods. Check ConfigMap for misconfiguration.

Key takeaways

1
You now understand what Kubernetes is and why it exists
2
You've seen it working in a real runnable example
3
Practice daily: the forge only works when it's hot
4
The reconciliation loop is the fundamental operating principle of every Kubernetes controller. Understanding it transforms debugging from trial-and-error to systematic investigation.
5
etcd is the single point of truth and the most common root cause of cluster-wide issues. Its disk latency is the cluster's ceiling.
6
The scheduler filters and scores nodes: it does not bin-pack, predict load, or rebalance. A scheduling decision is permanent until the Pod is deleted and a replacement is scheduled.
7
Kubernetes networking is layered (CNI, kube-proxy, Ingress). Debug from the bottom up: Pod IP, ClusterIP, DNS, Ingress.
8
Resource requests drive scheduling; resource limits drive runtime enforcement. Setting requests=limits (Guaranteed QoS) gives the most predictable behavior.
9
Storage is decoupled from Pod lifecycle via PersistentVolumes and PersistentVolumeClaims. The reclaim policy determines whether data survives PVC deletion: set it to Retain for irreplaceable data.
10
Namespaces provide isolated environments, but true security requires RBAC, NetworkPolicies, and ResourceQuotas. Without quotas, a single app can starve the cluster.

Common mistakes to avoid

7 patterns
×

Running etcd on network-attached storage

Symptom
API server timeouts, scheduler freezes, cluster becomes unresponsive during high write load.
Fix
etcd needs storage with a p99 WAL fsync latency below 10ms. Prefer instance-local NVMe SSDs; if network-attached storage is unavoidable, treat provisioned-IOPS EBS volumes as the minimum. Monitor etcd_disk_wal_fsync_duration_seconds as a critical metric.
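A quick way to apply the 10ms threshold is to compare a measured p99 against it. The sketch below assumes you already have the p99 value in seconds, for example from a Prometheus query such as histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])). The function name fsync_verdict is hypothetical.

```shell
# Compare a p99 WAL fsync latency (in seconds) with the 10ms guideline.
# Usage: fsync_verdict <p99-seconds>
fsync_verdict() {
  awk -v v="$1" 'BEGIN {
    if (v + 0 < 0.010) print "ok"
    else               print "too slow"
  }'
}

fsync_verdict 0.004   # a healthy local SSD
fsync_verdict 0.8     # the 800ms latency from the incident
```

The second call uses the 800ms figure from the incident, which sits eighty times above the guideline.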
×

Setting resource limits without requests (or vice versa)

Symptom
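Pods with neither requests nor limits get BestEffort QoS and are first to be evicted under resource pressure. Pods with requests but no limits can grow well past what the scheduler accounted for and starve their neighbors. When only limits are set, Kubernetes copies them into the requests, which can reserve more node capacity than intended.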
Fix
Always set both requests and limits. For predictable performance, set requests=limits (Guaranteed QoS). Use Vertical Pod Autoscaler (VPA) in 'off' mode to get right-sizing recommendations.
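The QoS rules can be written down as a small classifier. This is a simplified sketch for a single-container Pod: the function name qos_class is hypothetical, an empty string means "not set", and values are compared as plain strings (so 500m and 0.5 would count as different here even though Kubernetes treats them as equal).

```shell
# Classify the QoS class of a single-container Pod.
# Usage: qos_class <cpu-request> <cpu-limit> <mem-request> <mem-limit>
qos_class() {
  cpu_lim=$2
  mem_lim=$4
  # An unset request defaults to the limit when a limit is present.
  cpu_req=${1:-$cpu_lim}
  mem_req=${3:-$mem_lim}

  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo "BestEffort"
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}
```

To see what the cluster actually assigned, kubectl get pod <pod> -o jsonpath='{.status.qosClass}' prints the class recorded on the Pod.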
×

Using `latest` tag for container images

Symptom
Different nodes run different versions of the same image because latest is mutable. Rollbacks are impossible because you cannot determine which latest was running at a given time.
Fix
Always use immutable, versioned tags (git SHA or semantic version). Never use latest in production. Use image digests (image: repo@sha256:abc123...) for maximum determinism.
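A simple audit can flag image references that are not pinned. The sketch below classifies a single reference; the function name image_pin_status and its three labels are assumptions for illustration, not a standard tool.

```shell
# Classify an image reference as "digest", "tag", or "mutable".
# Usage: image_pin_status <image-reference>
image_pin_status() {
  ref=${1##*/}   # drop registry host, port, and repository path
  case "$ref" in
    *@sha256:*) echo "digest" ;;    # immutable content address
    *:latest)   echo "mutable" ;;
    *:*)        echo "tag" ;;       # versioned tag, still movable in theory
    *)          echo "mutable" ;;   # no tag means :latest is implied
  esac
}

# Example inputs: every image referenced by running Pods could be listed with
#   kubectl get pods --all-namespaces -o jsonpath='{.items[*].spec.containers[*].image}'
```

Stripping everything up to the last slash first keeps a registry port such as localhost:5000 from being mistaken for a tag.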
×

No PodDisruptionBudgets on critical services

Symptom
Node maintenance or cluster upgrade drains all Pods of a service simultaneously, causing complete outage.
Fix
Define PDBs with minAvailable: 1 (or percentage) for all production services. This ensures voluntary disruptions (drains) respect availability constraints.
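A minimal PodDisruptionBudget for this fix might look like the following. The name web-pdb and the app: web label are placeholders; the selector must match the labels on your own Pods.

```shell
# Write a minimal PodDisruptionBudget manifest.
# "web-pdb" and the "app: web" label are placeholder values.
cat > pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web
EOF

# Apply it with: kubectl apply -f pdb.yaml
```

Note that minAvailable: 1 on a single-replica workload blocks node drains entirely, so pair the budget with at least two replicas.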
×

Ignoring liveness probes that restart Pods unnecessarily

Symptom
Pods in CrashLoopBackOff because liveness probe fails during slow startup. Each restart makes startup slower (cold cache), creating a death spiral.
Fix
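Use startupProbe for slow-starting containers. The liveness probe only activates after the startup probe succeeds. Size the startup probe so that failureThreshold × periodSeconds covers the worst-case startup time, and keep the liveness probe's own failureThreshold small.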
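As an illustration, the manifest below gives a container up to 300 seconds to start before the liveness probe takes over. The Pod name, image, port, and /healthz path are all placeholder assumptions.

```shell
# Write an example Pod with a startup probe guarding a liveness probe.
# Name, image, port, and probe path are placeholder values.
cat > pod-startup-probe.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.4.2
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30   # up to 30 x 10s = 300s to finish starting
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3    # restart after 30s of failures once started
EOF

# Apply it with: kubectl apply -f pod-startup-probe.yaml
```

Splitting the two probes lets startup be slow without making failure detection slow once the container is running.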
×

No RBAC restrictions

Symptom
A compromised Pod whose mounted ServiceAccount token is bound to an over-permissive role (for example cluster-admin) can read all Secrets in the cluster, escalate privileges, and pivot to other namespaces.
Fix
Create dedicated ServiceAccounts per workload. Bind minimal RBAC roles. Set automountServiceAccountToken: false on Pods that don't need API access. Use NetworkPolicies to restrict Pod-to-Pod traffic.
×

Not setting StorageClass reclaim policy for critical data

Symptom
Deleting a PVC permanently deletes the underlying PV and all data if reclaimPolicy is Delete.
Fix
For stateful workloads (databases, queues), create a custom StorageClass with reclaimPolicy: Retain. Set a backup policy and take regular snapshots.
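A StorageClass with the Retain policy could be sketched as follows. The class name retain-ssd is a placeholder, and the provisioner shown assumes the AWS EBS CSI driver; substitute the provisioner your cluster actually runs.

```shell
# Write a StorageClass whose volumes survive PVC deletion.
# "retain-ssd" is a placeholder name; the provisioner assumes the AWS EBS CSI driver.
cat > storageclass-retain.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retain-ssd
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF

# Apply it with: kubectl apply -f storageclass-retain.yaml
# An existing PV can be switched in place with:
#   kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```

With Retain, a released volume is not cleaned up or reused automatically, so reclaiming it becomes a manual step that belongs in your runbook.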
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the Kubernetes reconciliation loop. How does it apply to a Deplo...
Q02 · SENIOR
What happens when you delete a Pod that belongs to a Deployment? Trace t...
Q03 · SENIOR
How does the kube-scheduler decide which node to place a Pod on? What ar...
Q04 · SENIOR
What is the difference between a Service's ClusterIP and the Pod IPs it ...
Q05 · SENIOR
A Pod is stuck in Pending. Walk me through your debugging process, from ...
Q06 · SENIOR
Explain etcd's role in the cluster. What happens if etcd loses quorum? H...
Q07 · SENIOR
What is the difference between requests and limits, and how do they affe...
Q08 · SENIOR
How would you design a zero-downtime deployment strategy using Kubernete...
Q01 of 08 · SENIOR

Explain the Kubernetes reconciliation loop. How does it apply to a Deployment managing a ReplicaSet managing Pods?

ANSWER
The reconciliation loop is a continuous observe-diff-act cycle. Each controller watches its objects through the API server (which persists them in etcd) and compares desired state with actual state. The Deployment controller owns ReplicaSets: when the Pod template changes, it creates a new ReplicaSet and scales the old one down. The ReplicaSet controller owns Pods: if a Pod is deleted, it sees that the count is below spec.replicas and creates a replacement, without the Deployment controller being involved. The scheduler assigns the new Pod to a node, and the kubelet starts it. Controllers react to watch events and also resync periodically, which makes Kubernetes self-healing without human intervention.
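The observe-diff-act cycle in this answer can be reduced to a toy function. It is only an illustration of the idea, not how any real controller is implemented: the function name reconcile and its output strings are made up for this sketch.

```shell
# Toy reconcile step: compare desired and actual replica counts and
# print the action a ReplicaSet-style controller would take.
# Usage: reconcile <desired> <actual>
reconcile() {
  desired=$1
  actual=$2
  if [ "$actual" -lt "$desired" ]; then
    echo "create $((desired - actual))"
  elif [ "$actual" -gt "$desired" ]; then
    echo "delete $((actual - desired))"
  else
    echo "noop"
  fi
}

reconcile 3 2   # one Pod was deleted, so one replacement is needed
```

To watch the real thing, delete a Pod owned by a Deployment while kubectl get pods -w is running in another terminal and observe the replacement appear.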
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is Kubernetes in simple terms?
02
What is the difference between a Deployment, a ReplicaSet, and a Pod?
03
What happens if the control plane node goes down?
04
How does Kubernetes handle node failures?
05
What is the difference between a ConfigMap and a Secret?