
Kubernetes Internals Explained — Architecture, Scheduling, and Production Gotchas

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Kubernetes → Topic 1 of 12
Kubernetes internals demystified: control plane, etcd, kube-scheduler, kubelet, and real production gotchas.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • How the control plane components (kube-apiserver, etcd, kube-scheduler, kube-controller-manager) interact
  • How the scheduler's filter-then-score pipeline decides Pod placement
  • How to debug real production failures using the reconciliation-loop mental model
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • You declare desired state via YAML manifests (Deployments, Services, ConfigMaps).
  • The control plane continuously watches actual state and drives it toward desired state.
  • This reconciliation loop is the fundamental operating principle of every K8s component.
  • Control plane: kube-apiserver, etcd, kube-scheduler, kube-controller-manager.
  • Node agents: kubelet (manages pods), kube-proxy (networking), container runtime.
  • etcd: the single source of truth — a distributed key-value store holding all cluster state.
  • etcd latency directly impacts API server throughput — p99 writes above 10ms cause cascading scheduler failures.
  • The scheduler is not a bin-packer — it scores nodes and picks the best fit, but it cannot move running Pods without explicit eviction.
  • Understanding the reconciliation loop is what separates debugging YAML from debugging the system.
🚨 START HERE
Kubernetes Triage Cheat Sheet
First-response commands for common K8s production incidents.
🟡 Pod not starting — no events visible.
Immediate Action: Check if the scheduler is running and if nodes have capacity.
Commands
kubectl get pods -n kube-system | grep scheduler
kubectl describe nodes | grep -A 5 'Allocated resources'
Fix Now: If the scheduler is down, check kube-system logs. If there is no capacity, scale the cluster or evict low-priority Pods.
🟡 Service returns 502/503 intermittently.
Immediate Action: Check if endpoints exist and Pods are passing readiness probes.
Commands
kubectl get endpoints <service-name>
kubectl get pods -l app=<selector> -o wide
Fix Now: If endpoints are empty, Pods are failing readiness probes; check probe configuration and Pod logs. If endpoints exist but 503 persists, check kube-proxy iptables rules with `iptables-save | grep <service-cluster-ip>`.
🟡 Node marked NotReady — Pods being evicted.
Immediate Action: SSH to the node and check kubelet status.
Commands
kubectl describe node <node-name> | grep -A 10 Conditions
systemctl status kubelet
Fix Now: If kubelet is down: `systemctl restart kubelet`. If disk pressure: clean up unused images with `crictl rmi --prune`. If memory pressure: identify and kill the offending process.
🟡 PersistentVolumeClaim stuck in Pending.
Immediate Action: Check if a PersistentVolume exists that matches the claim's requirements.
Commands
kubectl get pv
kubectl describe pvc <pvc-name>
Fix Now: If no PV is available, provision one manually or ensure the StorageClass has a provisioner. If a PV exists but is not binding, check that accessModes and storageClassName match.
Production Incident: The etcd Disk That Killed the Entire Cluster
A production cluster with 200 nodes stopped scheduling new Pods. Existing Pods continued running, but all deployments, scaling operations, and config updates hung indefinitely. The cluster appeared healthy from node metrics but was functionally frozen.
Symptom: kubectl commands hang or time out. New Pods stuck in Pending. Deployment rollouts never complete. API server logs show 'etcdserver: request timed out' errors. Controller-manager logs show leader election failures.
Assumption: The API server is overloaded, or the scheduler has crashed.
Root cause: etcd's data directory was on a network-attached EBS volume that had degraded to p99 write latency of 800ms (normal: 2ms). etcd requires sub-10ms disk writes for stable operation. The degraded disk caused the Raft consensus protocol to stall — the cluster could not commit new state changes. The API server, which depends on etcd for every operation, began queuing requests until it exhausted its connection pool. The scheduler and controller-manager, which watch etcd via the API server, received no updates and effectively froze.
Fix:
1. Immediately migrate etcd to local NVMe SSDs (provisioned IOPS EBS or instance-local storage).
2. Set etcd disk latency alerts at p99 > 10ms as critical.
3. Implement etcd defragmentation on a schedule (etcdctl defrag).
4. Configure etcd auto-compaction (--auto-compaction-retention=8) to prevent unbounded data growth.
5. Monitor etcd member health with etcdctl endpoint health and etcdctl endpoint status.
Key Lesson
  • etcd is the single point of failure for the entire cluster. Its disk performance is the cluster's ceiling.
  • Never run etcd on network-attached storage in production. Local SSDs are mandatory.
  • API server timeouts are often etcd problems, not API server problems. Trace downward, not upward.
  • etcd requires periodic defragmentation. Without it, space is freed but not reclaimed, leading to disk pressure.
Production Debug Guide
Symptom-driven investigation paths for the most common failure modes.
Pod stuck in Pending state.
1. Run kubectl describe pod <name> and read the Events section. 2. Common causes: insufficient CPU/memory on any node (check kubectl describe nodes for Allocatable vs Allocated), PersistentVolumeClaim not bound, node affinity/taint mismatches, resource quotas exceeded. 3. If no events appear, the scheduler may be down — check kubectl get pods -n kube-system for kube-scheduler.
Pod stuck in CrashLoopBackOff.
1. Run kubectl logs <pod> --previous to see the logs from the crashed container (current logs may be empty). 2. Common causes: missing environment variables, failed health checks, OOMKill (check kubectl describe pod for Last State), misconfigured entrypoint. 3. If OOMKilled, increase memory limits or fix the memory leak. Check kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState}'.
Pods cannot reach each other across nodes.
1. Verify the CNI plugin is healthy: kubectl get pods -n kube-system | grep calico (or flannel/weave). 2. Check if Pod CIDR ranges overlap between nodes: kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'. 3. Verify kube-proxy is running: kubectl get pods -n kube-system | grep kube-proxy. 4. Test from within a Pod: kubectl exec -it <pod> -- curl <service-ip>:<port>.
Deployment rollout hangs — new ReplicaSet never becomes ready.
1. Check the new ReplicaSet: kubectl describe rs <new-rs-name>. 2. Look for Pods that are Pending or CrashLoopBackOff. 3. Check if the new image exists in the registry and if imagePullSecrets are configured. 4. If using rolling update with maxUnavailable=0 and the cluster has no spare capacity, new Pods cannot be scheduled. 5. Rollback: kubectl rollout undo deployment/<name>.
etcd high latency alerts firing — API server slow.
1. Check etcd latency: etcdctl endpoint health --write-out=table. 2. Check disk I/O on etcd nodes: iostat -x 1. 3. Check etcd database size: etcdctl endpoint status --write-out=table. 4. If disk is the bottleneck, migrate to local SSDs. 5. If the database is large, run defragmentation: etcdctl defrag.

Kubernetes is not a deployment tool. It is a distributed state reconciliation engine. Every component — from the scheduler to the kubelet — operates on the same principle: watch the desired state in etcd, compare it with observed state, and act to close the gap. This is the mental model that unlocks real debugging capability.
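The loop is simple enough to sketch. The following Python toy (illustrative only: real controllers watch the API server rather than holding dicts in memory, and the workload names here are invented) shows the level-triggered compare-and-correct pattern:

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Diff desired vs observed replica counts and emit corrective actions.

    Level-triggered: only the current states matter, never the event history.
    """
    actions = []
    for name, want in desired.items():
        have = observed.get(name, 0)
        if have < want:
            actions.append(f"create {want - have} pod(s) for {name}")
        elif have > want:
            actions.append(f"delete {have - want} pod(s) for {name}")
    for name in observed:
        if name not in desired:
            actions.append(f"delete all pods for {name} (no longer desired)")
    return actions

# Desired state (what etcd holds) vs observed state (what kubelets report).
print(reconcile({"payment-service": 3, "checkout": 2},
                {"payment-service": 2, "legacy-worker": 1}))
```

A real controller runs this comparison on every resync tick, which is why a crashed controller simply resumes from current state when restarted.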

The control plane is the brain. etcd is the memory. The kubelet is the muscle on each node. The scheduler decides placement. When any of these components degrades, the symptoms are often misleading — a Pod stuck in Pending looks like a scheduling problem but is frequently an etcd latency issue or a resource quota misconfiguration.

The common misconception is that Kubernetes 'runs containers.' It does not. Kubernetes manages the desired state of workloads. The container runtime (containerd, CRI-O) runs containers. Kubernetes tells the runtime what to run, monitors whether it is running, and corrects deviations. This distinction matters when debugging crashes, image pull failures, and networking issues.

Control Plane Architecture: The Brain of the Cluster

The Kubernetes control plane consists of four components that work together to maintain cluster state. Understanding each component's role — and its failure modes — is essential for production operations.

kube-apiserver is the front door. Every kubectl command, every controller reconciliation, every kubelet status report goes through the API server. It validates requests, persists state to etcd, and serves as the watch endpoint for all controllers. It is stateless — you can run multiple replicas behind a load balancer for HA.

etcd is the single source of truth. It is a distributed, consistent key-value store built on the Raft consensus protocol. All cluster state — Pod definitions, ConfigMaps, Secrets, node registrations — lives in etcd. If etcd loses quorum, the cluster cannot make any state changes. etcd is the most critical component and the most commonly under-provisioned.

kube-scheduler watches for unscheduled Pods and assigns them to nodes. It does not run Pods — it only writes the nodeName field. The kubelet on the assigned node then pulls the image and starts the container. The scheduler uses a two-phase process: filtering (eliminate infeasible nodes) and scoring (rank feasible nodes, pick the highest score).

kube-controller-manager runs the control loops. Each controller watches a specific resource type and reconciles actual state with desired state. The Deployment controller ensures the right number of replicas exist. The Node controller detects when nodes go unhealthy. The Endpoint controller updates Service endpoints as Pods come and go.

check-control-plane.sh · SHELL
# Control Plane Health Check: run this to verify all components are healthy
# Save as check-control-plane.sh

# 1. API Server health (returns 200 if healthy)
curl -k https://localhost:6443/healthz
# Expected: "ok"

# 2. etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Expected: "is healthy"

# 3. etcd cluster member status
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --write-out=table
# Shows: ID, Status, Version, DB Size, Raft Term, Raft Index

# 4. Scheduler and Controller-Manager leader election
# (newer clusters record the leader in a Lease object instead:
#  kubectl get lease kube-scheduler -n kube-system)
kubectl get endpoints kube-scheduler -n kube-system -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
kubectl get endpoints kube-controller-manager -n kube-system -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# 5. All control plane components running
kubectl get pods -n kube-system -o wide
▶ Output
ok

127.0.0.1:2379 is healthy: successfully committed proposal: took = 2.145ms

+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://127.0.0.1:2379 | 8e9e05c52164694d | 3.5.9 | 25 MB | true | false | 4 | 18234 | 18234 | |
+----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

{"holderIdentity":"master-1_xxxxx","leaseDurationSeconds":15,"acquireTime":"2026-03-01T10:00:00Z","renewTime":"2026-04-07T14:30:00Z","leaderTransitions":3}

NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system coredns-5d78c9869d-abc12 1/1 Running 0 30d 10.244.0.5 master-1
kube-system etcd-master-1 1/1 Running 0 30d 192.168.1.10 master-1
kube-system kube-apiserver-master-1 1/1 Running 0 30d 192.168.1.10 master-1
kube-system kube-controller-manager-master-1 1/1 Running 0 30d 192.168.1.10 master-1
kube-system kube-proxy-xyz78 1/1 Running 0 30d 192.168.1.10 master-1
kube-system kube-scheduler-master-1 1/1 Running 0 30d 192.168.1.10 master-1
Mental Model
The Reconciliation Loop — The Heartbeat of Kubernetes
Understanding this loop is the single most important concept in Kubernetes.
  • The API server is the only component that talks to etcd. All other components go through the API server.
  • Controllers are level-triggered, not edge-triggered. They care about the current state, not the event that caused it.
  • This is why Kubernetes is self-healing. It does not remember what happened — it only checks what is true right now.
📊 Production Insight
Control plane HA requires at least 3 etcd members and 2+ API server replicas. A single-node control plane is a single point of failure — if the master node dies, existing Pods keep running (kubelet is independent), but you cannot deploy, scale, or modify anything until the control plane recovers. etcd quorum requires (n/2)+1 members alive. With 3 members, you can tolerate 1 failure. With 5, you can tolerate 2. Never run an even number of etcd members: it adds cost and a larger quorum without improving fault tolerance (4 members still tolerate only 1 failure).
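The quorum arithmetic behind that sizing rule is worth internalizing; a quick sketch:

```python
def quorum(members: int) -> int:
    # Raft commits a write only when a strict majority acknowledges it.
    return members // 2 + 1

def tolerable_failures(members: int) -> int:
    return members - quorum(members)

for n in (1, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerable_failures(n)} failure(s)")
# Note that 4 members tolerate no more failures than 3: even counts add cost, not safety.
```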
🎯 Key Takeaway
The control plane is a distributed system with etcd as its consensus backbone. Every API request, every scheduler decision, every controller reconciliation depends on etcd's health. Production clusters need 3+ etcd members on local SSDs, and etcd latency monitoring is not optional — it is the earliest warning of cluster degradation.
Control Plane Sizing for Production
If: Dev/test cluster, non-critical workloads.
Use: Single control plane node is acceptable. Accept the risk of API unavailability during maintenance.
If: Production cluster, < 100 nodes.
Use: 3 control plane nodes with stacked etcd. etcd runs on the same nodes as the API server — cost-effective HA.
If: Production cluster, > 100 nodes or strict SLA requirements.
Use: 3–5 dedicated etcd nodes + 2+ API server nodes (external etcd). Isolates etcd disk I/O from API load.
If: Multi-region cluster.
Use: Stretched etcd across regions needs < 10ms latency. If higher, use separate clusters per region.

The Scheduler: How Kubernetes Decides Where Pods Run

The kube-scheduler is the component that assigns Pods to nodes. It does not run Pods — it only writes the spec.nodeName field on the Pod object. The kubelet on the assigned node then pulls the image and starts the container.

Filtering (Feasibility): Eliminate nodes that cannot run the Pod. Filter reasons include: insufficient CPU/memory, node taints the Pod cannot tolerate, node affinity mismatches, volume zone constraints, and Pod topology spread constraints. After filtering, if zero nodes remain, the Pod stays in Pending.

Scoring (Ranking): Rank the feasible nodes by a set of scoring plugins. Default scoring includes: NodeResourcesBalancedAllocation (prefer nodes with balanced CPU/memory usage), ImageLocality (prefer nodes that already have the container image), InterPodAffinity (prefer nodes where affinity rules are satisfied), and TaintToleration (prefer nodes with fewer taints). The node with the highest weighted score wins.

The scheduler makes decisions based on the state of the cluster at scheduling time. It does not predict future load. It does not rebalance existing Pods. Once a Pod is scheduled, only explicit actions (eviction, deletion, preemption) can move it.
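The filter-then-score pipeline can be illustrated with a toy model. Node names, capacities, and the scoring formula below are invented for illustration; real scoring plugins are far more elaborate and weighted via KubeSchedulerConfiguration:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpu_m: int   # free millicores
    free_mem_mi: int  # free MiB
    has_image: bool   # stand-in for the ImageLocality signal

def filter_nodes(nodes, cpu_req, mem_req):
    # Filtering phase: drop nodes that cannot possibly run the Pod.
    return [n for n in nodes if n.free_cpu_m >= cpu_req and n.free_mem_mi >= mem_req]

def score(node, cpu_req, mem_req):
    # Scoring phase (toy weighted sum): favor headroom and image locality.
    headroom = min(node.free_cpu_m - cpu_req, node.free_mem_mi - mem_req)
    locality = 100 if node.has_image else 0
    return headroom + locality

def schedule(nodes, cpu_req, mem_req):
    feasible = filter_nodes(nodes, cpu_req, mem_req)
    if not feasible:
        return None  # zero feasible nodes: the Pod stays Pending
    return max(feasible, key=lambda n: score(n, cpu_req, mem_req)).name

nodes = [
    Node("node-1", free_cpu_m=200,  free_mem_mi=256,  has_image=True),   # filtered out
    Node("node-2", free_cpu_m=2000, free_mem_mi=4096, has_image=False),
    Node("node-3", free_cpu_m=1500, free_mem_mi=2048, has_image=True),
]
print(schedule(nodes, cpu_req=500, mem_req=512))  # node-2: headroom outweighs locality here
```

Note that the decision is a point-in-time snapshot, matching the paragraph above: nothing in this pipeline revisits a Pod once nodeName is written.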

pod-scheduling-constraints.yaml · YAML
# Example: Pod with scheduling constraints
# This Pod will ONLY be scheduled on nodes with the label 'disktype=ssd'
# and will prefer nodes in zone 'us-east-1a'
apiVersion: v1
kind: Pod
metadata:
  name: io-thecodeforge-payment-service
  namespace: production
spec:
  # Hard requirement: node MUST have this label
  nodeSelector:
    disktype: ssd

  # Soft preference: scheduler tries to place here, but can choose elsewhere
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
    # Pod affinity: prefer to run near other payment-service Pods
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: payment-service
            topologyKey: kubernetes.io/hostname

  # Tolerations: allow scheduling on nodes with the 'dedicated=high-cpu' taint
  tolerations:
    - key: dedicated
      operator: Equal
      value: high-cpu
      effect: NoSchedule

  # Topology spread: distribute replicas evenly across zones
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: payment-service

  containers:
    - name: payment-service
      image: registry.thecodeforge.io/payment-service:v2.4.1
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1000m"
          memory: "1Gi"
▶ Output
pod/io-thecodeforge-payment-service created

# Verify scheduling decision
kubectl describe pod io-thecodeforge-payment-service -n production | grep -A 10 Events

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5s default-scheduler Successfully assigned production/io-thecodeforge-payment-service to node-3
Mental Model
Requests vs Limits — The Scheduler Only Sees Requests
The scheduler places Pods based on resource requests alone; limits are enforced at runtime by the kubelet and the kernel (cgroups). This is why setting requests=limits (Guaranteed QoS) gives the most predictable performance.
  • Guaranteed QoS (requests=limits): Pod is last to be evicted under resource pressure.
  • Burstable QoS (requests < limits): Pod can burst but is evicted before Guaranteed Pods.
  • BestEffort QoS (no requests, no limits): First to be evicted. Never use in production.
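The three classes above follow mechanically from requests and limits. A simplified classifier (the real kubelet rule also evaluates CPU and memory for every container in the Pod individually, with request defaulting):

```python
def qos_class(requests, limits):
    """Simplified QoS classification from a Pod's requests/limits dicts."""
    if not requests and not limits:
        return "BestEffort"   # first to be evicted under pressure
    if requests and limits and requests == limits:
        return "Guaranteed"   # last to be evicted
    return "Burstable"        # can burst, evicted before Guaranteed

print(qos_class({"cpu": "500m", "memory": "512Mi"},
                {"cpu": "500m", "memory": "512Mi"}))  # Guaranteed
print(qos_class({"cpu": "500m"}, {"cpu": "1000m"}))   # Burstable
print(qos_class(None, None))                          # BestEffort
```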
📊 Production Insight
Scheduler performance degrades with cluster size and Pod count. At > 5000 Pods, scheduling latency can exceed 1 second, causing deployment rollouts to slow dramatically. Mitigate with: scheduler extenders for custom logic (avoid modifying the scheduler binary), Pod topology spread constraints instead of pod anti-affinity (more efficient), and multiple scheduler profiles for different workload classes. The scheduler's scoring algorithm is pluggable — you can weight or disable scoring plugins via a KubeSchedulerConfiguration.
🎯 Key Takeaway
The scheduler is a scoring engine, not a bin-packer. It ranks feasible nodes and picks the best match at scheduling time. It does not rebalance, predict load, or consider limits. Understanding the filter-then-score pipeline — and how nodeSelector, affinity, taints, and topology spread interact within it — is essential for controlling Pod placement at scale.
Scheduling Constraint Selection
If: Pod MUST run on a specific type of node (e.g., GPU, SSD).
Use: nodeSelector or nodeAffinity required mode. Hard constraint — Pod stays Pending if no node matches.
If: Pod PREFERS a specific node type but can run elsewhere.
Use: nodeAffinity preferred mode. Soft constraint — the scheduler tries to match but places elsewhere if needed.
If: Replicas must be spread across failure domains (zones, nodes).
Use: topologySpreadConstraints. More flexible and performant than pod anti-affinity.
If: Pod should run near (or away from) other specific Pods.
Use: podAffinity (co-locate) or podAntiAffinity (spread). At scale prefer topologySpreadConstraints.
If: Node has taints (dedicated nodes, spot instances).
Use: Add tolerations to the Pod spec. Without a matching toleration, the Pod won't schedule on the tainted node.
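The maxSkew check in topologySpreadConstraints reduces to simple arithmetic. A toy illustration (simplified: the real scheduler counts only Pods matching the label selector, per domain of the topology key):

```python
def skew(counts: dict) -> int:
    # skew = (max matching Pods in any domain) - (min in any domain)
    return max(counts.values()) - min(counts.values())

def can_place(counts: dict, zone: str, max_skew: int) -> bool:
    # With whenUnsatisfiable: DoNotSchedule, a placement is allowed only
    # if adding one replica to `zone` keeps skew within maxSkew.
    trial = dict(counts)
    trial[zone] = trial.get(zone, 0) + 1
    return skew(trial) <= max_skew

replicas_per_zone = {"us-east-1a": 2, "us-east-1b": 2, "us-east-1c": 1}
print(can_place(replicas_per_zone, "us-east-1a", max_skew=1))  # False: 3 vs 1 breaks maxSkew
print(can_place(replicas_per_zone, "us-east-1c", max_skew=1))  # True: evens out at 2/2/2
```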

Pod Networking: How Containers Talk to Each Other

Kubernetes networking has three fundamental requirements, enforced by the CNI (Container Network Interface) plugin:

  1. Every Pod gets its own IP address, unique across the cluster.
  2. Pods on any node can communicate with Pods on any other node without NAT.
  3. Agents on a node (kubelet, system daemons) can communicate with all Pods on that node.

These requirements are simple to state but complex to implement. The CNI plugin (Calico, Cilium, Flannel, AWS VPC CNI) is responsible for wiring this up. It allocates IP addresses from the node's Pod CIDR range, sets up network interfaces inside the Pod's network namespace, and configures routing rules so Pods can reach each other across nodes.
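The per-node CIDR carving can be sketched with the standard ipaddress module (a toy IPAM; real CNI plugins persist allocations and handle release and reuse of addresses):

```python
import ipaddress

# Each node receives a /24 slice of the cluster Pod CIDR and hands out
# Pod IPs from it, which is what keeps Pod IPs cluster-unique.
cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
node_cidrs = list(cluster_cidr.subnets(new_prefix=24))  # one /24 per node

class NodeIPAM:
    def __init__(self, cidr):
        self.hosts = cidr.hosts()  # iterator over usable addresses

    def allocate(self) -> str:
        return str(next(self.hosts))

node_2 = NodeIPAM(node_cidrs[2])   # 10.244.2.0/24
print(node_2.allocate())           # 10.244.2.1
print(node_2.allocate())           # 10.244.2.2
```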

kube-proxy handles Service networking. It watches the API server for Service and Endpoint objects, then programs iptables rules (or IPVS rules) on each node. When a Pod connects to a Service's ClusterIP, the kernel's iptables rules intercept the connection and DNAT it to one of the backend Pod IPs. This is why Service IPs are virtual — they do not exist on any network interface.
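That DNAT step can be modeled in a few lines (a toy; all IPs are illustrative, and iptables' actual backend selection uses probability-weighted statistic rules rather than a library call):

```python
import random

# Toy model of kube-proxy Service routing: a connection to a virtual
# ClusterIP is DNAT-ed to one of the backend Pod endpoints.
endpoints = {
    "10.96.45.12:80": ["10.244.1.45:8080", "10.244.2.78:8080"],
}

def dnat(cluster_ip_port: str) -> str:
    backends = endpoints.get(cluster_ip_port, [])
    if not backends:
        # An empty Endpoints object is the intermittent 502/503 symptom.
        raise ConnectionError(f"no endpoints behind {cluster_ip_port}")
    return random.choice(backends)

print(dnat("10.96.45.12:80"))
```

Because the ClusterIP exists only as a rewrite rule, pinging it proves nothing; always test Services with a real TCP connection.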

networking-debug.sh · SHELL
# Debugging Pod networking step by step

# 1. Verify Pod has an IP address
kubectl get pods -n production -o wide
# If Pod IP is <none>, the CNI plugin failed to assign an address

# 2. Check if the CNI plugin is healthy
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|aws-node'

# 3. Verify Pod CIDR allocation per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Each node must have a unique, non-overlapping CIDR

# 4. Test Pod-to-Pod connectivity across nodes
kubectl exec -it pod-on-node-a -- ping <pod-ip-on-node-b>
# If this fails but intra-node works, the CNI cross-node routing is broken

# 5. Check Service endpoints
kubectl get endpoints payment-service -n production
# If endpoints are empty, no Pods match the Service's selector

# 6. Test Service DNS resolution
kubectl exec -it <pod> -- nslookup payment-service.production.svc.cluster.local
# If DNS fails, check CoreDNS pods: kubectl get pods -n kube-system | grep coredns

# 7. Inspect iptables rules for a Service
# (run on the node where your Pod is running)
iptables-save | grep <service-cluster-ip>
▶ Output
NAME READY STATUS IP NODE
payment-service-7d8f9-abc12 1/1 Running 10.244.1.45 node-2
payment-service-7d8f9-def34 1/1 Running 10.244.2.78 node-3

NAME READY STATUS RESTARTS AGE
calico-node-abc12 1/1 Running 0 30d
calico-kube-controllers-5d78-def34 1/1 Running 0 30d

node-1 10.244.0.0/24
node-2 10.244.1.0/24
node-3 10.244.2.0/24

PING 10.244.2.78 (10.244.2.78): 56 data bytes
64 bytes from 10.244.2.78: seq=0 ttl=62 time=0.456 ms

NAME ENDPOINTS AGE
payment-service 10.244.1.45:8080,10.244.2.78:8080 15d

Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: payment-service.production.svc.cluster.local
Address 1: 10.96.45.12 payment-service.production.svc.cluster.local
Mental Model
The Three Layers of K8s Networking
Most networking bugs are CNI or DNS issues, not application code issues.
  • Pod IP works but Service IP fails: kube-proxy or iptables issue.
  • Service IP works but DNS fails: CoreDNS issue.
  • DNS works but external access fails: Ingress controller or cloud LB issue.
📊 Production Insight
CNI plugin selection has massive performance and operational implications. Calico (BGP mode) scales well but requires BGP peering knowledge. Cilium (eBPF) bypasses iptables entirely, offering better performance at scale (>1000 Services) but is more complex to debug. AWS VPC CNI assigns real VPC IP addresses to Pods, simplifying security group integration but consuming VPC IP space rapidly. For production, evaluate CNI based on: Service count, NetworkPolicy requirements, observability needs, and team expertise.
🎯 Key Takeaway
Kubernetes networking is a layered system: CNI for Pod connectivity, kube-proxy for Service load balancing, Ingress for external access. Debug from the bottom up — Pod IP, then ClusterIP, then DNS, then Ingress. CNI plugin choice is a long-term architectural decision with performance, security, and operational trade-offs.

The Declarative Model

Everything above rests on one paradigm: you declare the state you want, and controllers make it so. You never tell Kubernetes how to reach that state — the reconciliation loop works that out continuously, which is why the same manifest produces the same cluster whether applied once or a hundred times.
📊 Production Insight
The declarative model underpins every section above. In production, internalizing it — you specify what you want, not how to achieve it — is the paradigm shift that Kubernetes enforces. Teams that fight this model by scripting imperative commands in CI/CD pipelines instead of using declarative manifests create fragile, unreproducible deployments. Always commit manifests to version control. Treat the Git repository as the source of truth, not the live cluster.
🎯 Key Takeaway
Kubernetes is a declarative system. You describe the desired state; controllers reconcile reality toward it. This is not a convenience — it is the architectural foundation. Every operational practice, from GitOps to disaster recovery, flows from this principle.
🗂 Kubernetes Component Comparison
Role, scope, and failure impact of each control plane and node component.
Component | Role | Failure Impact | Recovery
kube-apiserver | Validates and serves all API requests. Gateway to etcd. | No new deployments, scaling, or config changes. Existing Pods continue running. | Restart the process. If HA, load balancer routes to healthy replica.
etcd | Distributed key-value store. Single source of truth for all cluster state. | Cluster freezes — no state changes possible. If quorum lost, cluster is partitioned. | Restore from snapshot or replace failed member. Requires etcdctl expertise.
kube-scheduler | Assigns unscheduled Pods to nodes based on resource availability and constraints. | New Pods stuck in Pending. Existing Pods unaffected. | Restart the process. If leader election fails, check the lease in etcd.
kube-controller-manager | Runs reconciliation loops for Deployments, ReplicaSets, Nodes, Endpoints, etc. | No self-healing. Crashed Pods not restarted. Scaling stops. Node failures not detected. | Restart the process. Controllers resume reconciliation from current state.
kubelet | Node agent. Pulls images, starts containers, reports node status to API server. | Pods on that node stop being managed. Node marked NotReady after 40s (default). Pods evicted after 5 minutes. | Restart kubelet. If the node is unhealthy, cordoning and replacing it may be necessary.
kube-proxy | Programs iptables/IPVS rules for Service load balancing on each node. | Services unreachable from Pods on that node. Cross-node Service access still works from other nodes. | Restart the process. Rules are rebuilt from current Service/Endpoint state.
CoreDNS | Cluster DNS. Resolves Service names to ClusterIPs. | Service DNS resolution fails. Pods can still reach other Pods by direct IP. | Restart CoreDNS Pods. Check the ConfigMap for misconfiguration.

🎯 Key Takeaways

  • The reconciliation loop is the fundamental operating principle of every Kubernetes controller. Understanding it transforms debugging from trial-and-error to systematic investigation.
  • etcd is the single point of truth and the most common root cause of cluster-wide issues. Its disk latency is the cluster's ceiling.
  • The scheduler scores nodes — it does not bin-pack, predict load, or rebalance. Scheduling decisions are permanent until the Pod is explicitly moved.
  • Kubernetes networking is layered (CNI, kube-proxy, Ingress). Debug from the bottom up: Pod IP, ClusterIP, DNS, Ingress.
  • Resource requests drive scheduling; resource limits drive runtime enforcement. Setting requests=limits (Guaranteed QoS) gives the most predictable behavior.

⚠ Common Mistakes to Avoid

    Running etcd on network-attached storage
    Symptom

    API server timeouts, scheduler freezes, cluster becomes unresponsive during high write load.

    Fix

    etcd requires local SSDs with <10ms p99 write latency. Use provisioned IOPS EBS at minimum, instance-local NVMe ideally. Monitor etcd_disk_wal_fsync_duration_seconds as a critical metric.

    Setting resource limits without requests (or vice versa)
    Symptom

    Pods get BestEffort QoS and are first to be evicted under resource pressure, or the scheduler places Pods on nodes without actual capacity.

    Fix

    Always set both requests and limits. For predictable performance, set requests=limits (Guaranteed QoS). Use Vertical Pod Autoscaler (VPA) in 'off' mode to get right-sizing recommendations.

    Using `latest` tag for container images
    Symptom

    Different nodes run different versions of the same image because latest is mutable. Rollbacks are impossible because you cannot determine which latest was running at a given time.

    Fix

    Always use immutable, versioned tags (git SHA or semantic version). Never use latest in production. Use image digests (image: repo@sha256:abc123...) for maximum determinism.

    No PodDisruptionBudgets on critical services
    Symptom

    Node maintenance or cluster upgrade drains all Pods of a service simultaneously, causing a complete outage.

    Fix

    Define PDBs with minAvailable: 1 (or percentage) for all production services. This ensures voluntary disruptions (drains) respect availability constraints.

    Ignoring liveness probes that restart Pods unnecessarily
    Symptom

    Pods in CrashLoopBackOff because the liveness probe fails during slow startup. Each restart makes startup slower (cold cache), creating a death spiral.

    Fix

    Use startupProbe for slow-starting containers. The liveness probe only activates after the startup probe succeeds. Set appropriate initialDelaySeconds and failureThreshold.

    No RBAC restrictions
    Symptom

    A compromised Pod with a mounted ServiceAccount token can read all Secrets in the cluster, escalate privileges, and pivot to other namespaces.

    Fix

    Create dedicated ServiceAccounts per workload. Bind minimal RBAC roles. Set automountServiceAccountToken: false on Pods that don't need API access. Use NetworkPolicies to restrict Pod-to-Pod traffic.

Interview Questions on This Topic

  • Q: Explain the Kubernetes reconciliation loop. How does it apply to a Deployment managing a ReplicaSet managing Pods?
  • Q: What happens when you delete a Pod that belongs to a Deployment? Trace the full sequence of events through every controller involved.
  • Q: How does the kube-scheduler decide which node to place a Pod on? What are the two phases, and what plugins participate in each?
  • Q: What is the difference between a Service's ClusterIP and the Pod IPs it routes to? How does kube-proxy implement this?
  • Q: A Pod is stuck in Pending. Walk me through your debugging process, from the first command you would run to identifying the root cause.
  • Q: Explain etcd's role in the cluster. What happens if etcd loses quorum? How would you recover?
  • Q: What is the difference between requests and limits, and how do they affect scheduling vs runtime behavior?
  • Q: How would you design a zero-downtime deployment strategy using Kubernetes primitives (Deployments, PDBs, health checks)?

Frequently Asked Questions

What is Kubernetes in simple terms?

Kubernetes is a distributed state reconciliation engine. You declare the workloads you want in manifests, and its controllers continuously drive the actual cluster state toward that declared state — restarting, rescheduling, and scaling as needed.

What is the difference between a Deployment, a ReplicaSet, and a Pod?

A Pod is the smallest unit — one or more containers sharing a network namespace. A ReplicaSet ensures a specified number of Pod replicas are running at all times. A Deployment manages ReplicaSets and provides declarative updates (rolling updates, rollbacks). The hierarchy is: Deployment -> ReplicaSet -> Pod. You almost never create ReplicaSets or Pods directly — you create Deployments, and the Deployment controller creates the ReplicaSet, which creates the Pods.

What happens if the control plane node goes down?

Existing Pods on worker nodes continue running — the kubelet on each node operates independently of the control plane for running workloads. However, you cannot deploy new workloads, scale existing workloads, update configurations, or modify any cluster state until the control plane recovers. This is why production clusters need at least 3 control plane nodes for high availability.

How does Kubernetes handle node failures?

The Node controller in kube-controller-manager monitors node heartbeats. If a node stops sending heartbeats (default: every 10s), the node is marked NotReady after 40 seconds. After 5 minutes (the pod-eviction-timeout), the control plane evicts Pods from the unreachable node and reschedules them on healthy nodes. During this 5-minute window, the Pods are running but unreachable if the node is truly down. You can tune this timeout, but setting it too low causes unnecessary evictions during temporary network blips.

What is the difference between a ConfigMap and a Secret?

Functionally, they are identical — both inject configuration data into Pods as environment variables or mounted files. The difference is intent and handling: Secrets are base64-encoded (not encrypted by default), stored separately in etcd, and can be encrypted at rest with an EncryptionConfiguration. ConfigMaps are for non-sensitive configuration. In production, use an external secrets manager (Vault, AWS Secrets Manager) with the Secrets Store CSI Driver instead of Kubernetes Secrets for sensitive data.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Next → Kubernetes Pods and Deployments
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged