Kubernetes Internals Explained — Architecture, Scheduling, and Production Gotchas
- You declare desired state via YAML manifests (Deployments, Services, ConfigMaps).
- The control plane continuously watches actual state and drives it toward desired state.
- This reconciliation loop is the fundamental operating principle of every K8s component.
- Control plane: kube-apiserver, etcd, kube-scheduler, kube-controller-manager.
- Node agents: kubelet (manages pods), kube-proxy (networking), container runtime.
- etcd: the single source of truth — a distributed key-value store holding all cluster state.
- etcd latency directly impacts API server throughput — sustained p99 write latency above roughly 10ms can cascade into scheduler and controller failures.
- The scheduler is not a bin-packer — it scores nodes and picks the best fit, but it cannot move running Pods without explicit eviction.
- Understanding the reconciliation loop is what separates debugging YAML from debugging the system.
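The reconciliation loop described above can be sketched in a few lines. This is an illustrative Python sketch, not real Kubernetes code: the function and Pod names are invented.

```python
# Illustrative sketch (not real Kubernetes code): the reconcile pattern
# every controller follows: observe actual state, diff against desired
# state, act to close the gap.

def reconcile(desired_replicas: int, running_pods: list) -> list:
    """One pass of a ReplicaSet-style control loop."""
    actual = len(running_pods)
    if actual < desired_replicas:
        # Too few replicas: create the missing Pods.
        for i in range(desired_replicas - actual):
            running_pods.append(f"pod-{actual + i}")
    elif actual > desired_replicas:
        # Too many replicas: delete the surplus.
        del running_pods[desired_replicas:]
    return running_pods

pods = ["pod-0"]            # observed state: 1 replica
pods = reconcile(3, pods)   # desired state: 3 replicas
print(pods)                 # → ['pod-0', 'pod-1', 'pod-2']
```

The loop is idempotent: running it again with the same desired state is a no-op, which is exactly why controllers can crash and restart safely.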
Quick reference — symptom to first commands:

- Pod not starting — no events visible:
  - `kubectl get pods -n kube-system | grep scheduler`
  - `kubectl describe nodes | grep -A 5 'Allocated resources'`
- Service returns 502/503 intermittently:
  - `kubectl get endpoints <service-name>`
  - `kubectl get pods -l app=<selector> -o wide`
- Node marked NotReady — Pods being evicted:
  - `kubectl describe node <node-name> | grep -A 10 Conditions`
  - `systemctl status kubelet`
- PersistentVolumeClaim stuck in Pending:
  - `kubectl get pv`
  - `kubectl describe pvc <pvc-name>`

Production Incident
… (`etcdctl defrag`).
4. Configure etcd auto-compaction (`--auto-compaction-retention=8`) to prevent unbounded data growth.
5. Monitor etcd member health with `etcdctl endpoint health` and `etcdctl endpoint status`.

Production Debug Guide
Symptom-driven investigation paths for the most common failure modes.
Pod stuck in Pending
1. Run `kubectl describe pod <name>` and read the Events section.
2. Common causes: insufficient CPU/memory on any node (check `kubectl describe nodes` for Allocatable vs Allocated), PersistentVolumeClaim not bound, node affinity/taint mismatches, resource quotas exceeded.
3. If no events appear, the scheduler may be down — check `kubectl get pods -n kube-system` for kube-scheduler.

Pod in CrashLoopBackOff
1. Run `kubectl logs <pod> --previous` to see the logs from the crashed container (current logs may be empty).
2. Common causes: missing environment variables, failed health checks, OOMKill (check `kubectl describe pod` for Last State), misconfigured entrypoint.
3. If OOMKill, increase memory limits or fix the memory leak. Check `kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState}'`.

Pods cannot reach each other
1. Check the CNI plugin Pods: `kubectl get pods -n kube-system | grep calico` (or flannel/weave).
2. Check if Pod CIDR ranges overlap between nodes: `kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'`.
3. Verify kube-proxy is running: `kubectl get pods -n kube-system | grep kube-proxy`.
4. Test from within a Pod: `kubectl exec -it <pod> -- curl <service-ip>:<port>`.

Deployment rollout stuck
1. Inspect the new ReplicaSet: `kubectl describe rs <new-rs-name>`.
2. Look for Pods that are Pending or CrashLoopBackOff.
3. Check if the new image exists in the registry and if imagePullSecrets are configured.
4. If using rolling update with maxUnavailable=0 and the cluster has no spare capacity, new Pods cannot be scheduled.
5. Rollback: `kubectl rollout undo deployment/<name>`.

API server slow — suspect etcd
1. Check etcd health: `etcdctl endpoint health --write-out=table`.
2. Check disk I/O on etcd nodes: `iostat -x 1`.
3. Check etcd database size: `etcdctl endpoint status --write-out=table`.
4. If disk is the bottleneck, migrate to local SSDs.
5. If database is large, run defragmentation: `etcdctl defrag`.

Kubernetes is not a deployment tool. It is a distributed state reconciliation engine. Every component — from the scheduler to the kubelet — operates on the same principle: watch the desired state in etcd, compare it with observed state, and act to close the gap.
This is the mental model that unlocks real debugging capability.
The control plane is the brain. etcd is the memory. The kubelet is the muscle on each node. The scheduler decides placement. When any of these components degrades, the symptoms are often misleading — a Pod stuck in Pending looks like a scheduling problem but is frequently an etcd latency issue or a resource quota misconfiguration.
The common misconception is that Kubernetes 'runs containers.' It does not. Kubernetes manages the desired state of workloads. The container runtime (containerd, CRI-O) runs containers. Kubernetes tells the runtime what to run, monitors whether it is running, and corrects deviations. This distinction matters when debugging crashes, image pull failures, and networking issues.
Control Plane Architecture: The Brain of the Cluster
The Kubernetes control plane consists of four components that work together to maintain cluster state. Understanding each component's role — and its failure modes — is essential for production operations.
kube-apiserver is the front door. Every kubectl command, every controller reconciliation, every kubelet status report goes through the API server. It validates requests, persists state to etcd, and serves as the watch endpoint for all controllers. It is stateless — you can run multiple replicas behind a load balancer for HA.
etcd is the single source of truth. It is a distributed, consistent key-value store built on the Raft consensus protocol. All cluster state — Pod definitions, ConfigMaps, Secrets, node registrations — lives in etcd. If etcd loses quorum, the cluster cannot make any state changes. etcd is the most critical component and the most commonly under-provisioned.
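The quorum math behind that statement can be sketched directly (this assumes standard Raft majority quorum, which is what etcd uses):

```python
# Sketch: Raft quorum math for etcd cluster sizing.

def quorum(members: int) -> int:
    """Votes needed to commit a write."""
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    """Members that can fail while the cluster keeps accepting writes."""
    return members - quorum(members)

for n in (1, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# A 3-member cluster tolerates 1 failure. Adding a 4th member does NOT
# help (quorum rises to 3), which is why etcd clusters use odd sizes.
```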
kube-scheduler watches for unscheduled Pods and assigns them to nodes. It does not run Pods — it only writes the nodeName field. The kubelet on the assigned node then pulls the image and starts the container. The scheduler uses a two-phase process: filtering (eliminate infeasible nodes) and scoring (rank feasible nodes, pick the highest score).
kube-controller-manager runs the control loops. Each controller watches a specific resource type and reconciles actual state with desired state. The Deployment controller ensures the right number of replicas exist. The Node controller detects when nodes go unhealthy. The Endpoint controller updates Service endpoints as Pods come and go.
```bash
# Control Plane Health Check — run this to verify all components are healthy
# Save as check-control-plane.sh

# 1. API server health (returns 200 if healthy)
curl -k https://localhost:6443/healthz
# Expected: "ok"

# 2. etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Expected: "is healthy"

# 3. etcd cluster member status
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --write-out=table
# Shows: ID, Status, Version, DB Size, Raft Term, Raft Index

# 4. Scheduler and controller-manager leader election
kubectl get endpoints kube-scheduler -n kube-system \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
kubectl get endpoints kube-controller-manager -n kube-system \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# 5. All control plane components running
kubectl get pods -n kube-system -o wide
```
```
127.0.0.1:2379 is healthy: successfully committed proposal: took = 2.145ms

+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://127.0.0.1:2379 | 8e9e05c52164694d | 3.5.9   | 25 MB   | true      | false      |         4 |      18234 |              18234 |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

{"holderIdentity":"master-1_xxxxx","leaseDurationSeconds":15,"acquireTime":"2026-03-01T10:00:00Z","renewTime":"2026-04-07T14:30:00Z","leaderTransitions":3}

NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE
kube-system   coredns-5d78c9869d-abc12           1/1     Running   0          30d   10.244.0.5     master-1
kube-system   etcd-master-1                      1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-apiserver-master-1            1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-controller-manager-master-1   1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-proxy-xyz78                   1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-scheduler-master-1            1/1     Running   0          30d   192.168.1.10   master-1
```
- The API server is the only component that talks to etcd. All other components go through the API server.
- Controllers are level-triggered, not edge-triggered. They care about the current state, not the event that caused it.
- This is why Kubernetes is self-healing. It does not remember what happened — it only checks what is true right now.
The Scheduler: How Kubernetes Decides Where Pods Run
The kube-scheduler is the component that assigns Pods to nodes. It does not run Pods — it only writes the spec.nodeName field on the Pod object. The kubelet on the assigned node then pulls the image and starts the container.
The scheduler uses a two-phase process:
Filtering (Feasibility): Eliminate nodes that cannot run the Pod. Filter reasons include: insufficient CPU/memory, node taints the Pod cannot tolerate, node affinity mismatches, volume zone constraints, and Pod topology spread constraints. After filtering, if zero nodes remain, the Pod stays in Pending.
Scoring (Ranking): Rank the feasible nodes by a set of scoring plugins. Default scoring includes: NodeResourcesBalancedAllocation (prefer nodes with balanced CPU/memory usage), ImageLocality (prefer nodes that already have the container image), InterPodAffinity (prefer nodes where affinity rules are satisfied), and TaintToleration (prefer nodes with fewer taints). The node with the highest weighted score wins.
The scheduler makes decisions based on the state of the cluster at scheduling time. It does not predict future load. It does not rebalance existing Pods. Once a Pod is scheduled, only explicit actions (eviction, deletion, preemption) can move it.
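The two phases can be sketched as follows. The node data, resource numbers, and scoring weights here are invented for illustration; real scheduler plugins are far more involved.

```python
# Simplified sketch of the scheduler's two phases. Node data and the
# scoring function are illustrative, not the real plugin implementations.

nodes = [
    {"name": "node-1", "free_cpu_m": 200,  "free_mem_mi": 2048, "has_image": True},
    {"name": "node-2", "free_cpu_m": 1500, "free_mem_mi": 4096, "has_image": False},
    {"name": "node-3", "free_cpu_m": 800,  "free_mem_mi": 1024, "has_image": True},
]
pod_request = {"cpu_m": 500, "mem_mi": 512}

# Phase 1, Filtering: drop nodes that cannot run the Pod at all.
feasible = [n for n in nodes
            if n["free_cpu_m"] >= pod_request["cpu_m"]
            and n["free_mem_mi"] >= pod_request["mem_mi"]]

# Phase 2, Scoring: rank the survivors; highest weighted score wins.
def score(node):
    free_after = node["free_cpu_m"] - pod_request["cpu_m"]
    # Invented weights: prefer headroom, with an ImageLocality-style bonus.
    return free_after / 10 + (50 if node["has_image"] else 0)

best = max(feasible, key=score)
print(best["name"])   # the scheduler would write this into spec.nodeName
```

Here node-1 is filtered out (insufficient CPU), and node-2's headroom outweighs node-3's image-locality bonus, so node-2 wins.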
```yaml
# Example: Pod with scheduling constraints
# This Pod will ONLY be scheduled on nodes with the label 'disktype=ssd'
# and will prefer nodes in zone 'us-east-1a'
apiVersion: v1
kind: Pod
metadata:
  name: io-thecodeforge-payment-service
  namespace: production
spec:
  # Hard requirement: node MUST have this label
  nodeSelector:
    disktype: ssd
  affinity:
    # Soft preference: scheduler tries to place here, but can choose elsewhere
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
    # Pod affinity: prefer to run near other payment-service Pods
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: payment-service
          topologyKey: kubernetes.io/hostname
  # Tolerations: allow scheduling on nodes with the 'dedicated=high-cpu' taint
  tolerations:
  - key: dedicated
    operator: Equal
    value: high-cpu
    effect: NoSchedule
  # Topology spread: distribute replicas evenly across zones
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: payment-service
  containers:
  - name: payment-service
    image: registry.thecodeforge.io/payment-service:v2.4.1
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "1000m"
        memory: "1Gi"
```
```bash
# Verify scheduling decision
kubectl describe pod io-thecodeforge-payment-service -n production | grep -A 10 Events
```

```
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  5s    default-scheduler  Successfully assigned production/io-thecodeforge-payment-service to node-3
```
- Guaranteed QoS (requests=limits): Pod is last to be evicted under resource pressure.
- Burstable QoS (requests < limits): Pod can burst but is evicted before Guaranteed Pods.
- BestEffort QoS (no requests, no limits): First to be evicted. Never use in production.
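The QoS class follows mechanically from requests and limits. A simplified sketch (single container; the real rules also cover multi-container Pods and per-resource edge cases):

```python
# Simplified sketch of QoS class derivation from requests/limits.

def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    # Kubernetes defaults requests to limits when only limits are set.
    requests = requests or limits
    if limits and requests == limits:
        return "Guaranteed"
    return "Burstable"

print(qos_class({"cpu": "500m", "memory": "512Mi"},
                {"cpu": "500m", "memory": "512Mi"}))   # Guaranteed
print(qos_class({"cpu": "500m"}, {"cpu": "1000m"}))    # Burstable
print(qos_class({}, {}))                               # BestEffort
```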
Scheduling controls at a glance:

- Pin a Pod to specific nodes: `nodeSelector` or `nodeAffinity` required mode. Hard constraint — Pod stays Pending if no node matches.
- Prefer certain nodes: `nodeAffinity` preferred mode. Soft constraint — scheduler tries to match but places elsewhere if needed.
- Spread replicas across failure domains: `topologySpreadConstraints`. More flexible and performant than pod anti-affinity.
- Place Pods relative to other Pods: `podAffinity` (co-locate) or `podAntiAffinity` (spread). At scale prefer `topologySpreadConstraints`.
- Run on tainted nodes: add `tolerations` to the Pod spec. Without a matching toleration, the Pod won't schedule on the tainted node.

Pod Networking: How Containers Talk to Each Other
Kubernetes networking has three fundamental requirements, enforced by the CNI (Container Network Interface) plugin:
- Every Pod gets its own IP address, unique across the cluster.
- Pods on any node can communicate with Pods on any other node without NAT.
- Agents on a node (kubelet, system daemons) can communicate with all Pods on that node.
These requirements are simple to state but complex to implement. The CNI plugin (Calico, Cilium, Flannel, AWS VPC CNI) is responsible for wiring this up. It allocates IP addresses from the node's Pod CIDR range, sets up network interfaces inside the Pod's network namespace, and configures routing rules so Pods can reach each other across nodes.
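The cluster-wide uniqueness guarantee falls out of the per-node CIDR split. A sketch using Python's `ipaddress` module, assuming the common kubeadm default cluster CIDR 10.244.0.0/16 for illustration:

```python
# Sketch: per-node Pod CIDRs keep Pod IPs unique across the cluster.
# The CNI plugin hands each Pod a free address from its node's range.
import ipaddress

# Carve one /24 per node out of the cluster CIDR (10.244.0.0/16 is the
# common kubeadm default; purely illustrative here).
cluster = ipaddress.ip_network("10.244.0.0/16")
node_cidrs = list(cluster.subnets(new_prefix=24))[:3]

def allocate(node_index: int, pod_index: int) -> str:
    """Allocate Pod IPs sequentially within a node's range."""
    hosts = list(node_cidrs[node_index].hosts())
    return str(hosts[pod_index])

print(node_cidrs[0])   # node-1's range: 10.244.0.0/24
print(allocate(1, 0))  # first Pod on node-2: 10.244.1.1
print(allocate(2, 0))  # first Pod on node-3: 10.244.2.1, no collision possible
```

Because the ranges never overlap, no coordination between nodes is needed at Pod-creation time; only the CIDR assignment itself is centralized.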
kube-proxy handles Service networking. It watches the API server for Service and Endpoint objects, then programs iptables rules (or IPVS rules) on each node. When a Pod connects to a Service's ClusterIP, the kernel's iptables rules intercept the connection and DNAT it to one of the backend Pod IPs. This is why Service IPs are virtual — they do not exist on any network interface.
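What those iptables rules effectively do can be sketched like this (the Service and Pod IPs are illustrative; iptables uses random/statistic match rules rather than Python, of course):

```python
# Sketch of the effect of kube-proxy's iptables rules: a ClusterIP is
# virtual, and connections to it are DNAT'ed to a randomly chosen
# backend Pod. IPs and ports are illustrative.
import random

endpoints = {"10.96.45.12:8080": ["10.244.1.45:8080", "10.244.2.78:8080"]}

def dnat(cluster_ip_port: str, rng: random.Random) -> str:
    """Rewrite a Service destination to one backend, as the kernel does."""
    backends = endpoints[cluster_ip_port]
    return rng.choice(backends)

rng = random.Random(0)
picks = {dnat("10.96.45.12:8080", rng) for _ in range(100)}
print(sorted(picks))  # over many connections, both backends are hit
```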
```bash
# Debugging Pod networking step by step

# 1. Verify the Pod has an IP address
kubectl get pods -n production -o wide
# If Pod IP is <none>, the CNI plugin failed to assign an address

# 2. Check if the CNI plugin is healthy
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|aws-node'

# 3. Verify Pod CIDR allocation per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Each node must have a unique, non-overlapping CIDR

# 4. Test Pod-to-Pod connectivity across nodes
kubectl exec -it pod-on-node-a -- ping <pod-ip-on-node-b>
# If this fails but intra-node works, the CNI cross-node routing is broken

# 5. Check Service endpoints
kubectl get endpoints payment-service -n production
# If endpoints are empty, no Pods match the Service's selector

# 6. Test Service DNS resolution
kubectl exec -it <pod> -- nslookup payment-service.production.svc.cluster.local
# If DNS fails, check CoreDNS pods: kubectl get pods -n kube-system | grep coredns

# 7. Inspect iptables rules for a Service (run on the node hosting the Pod)
iptables-save | grep <service-cluster-ip>
```
```
payment-service-7d8f9-abc12   1/1   Running   10.244.1.45   node-2
payment-service-7d8f9-def34   1/1   Running   10.244.2.78   node-3

NAME                                 READY   STATUS    RESTARTS   AGE
calico-node-abc12                    1/1     Running   0          30d
calico-kube-controllers-5d78-def34   1/1     Running   0          30d

node-1  10.244.0.0/24
node-2  10.244.1.0/24
node-3  10.244.2.0/24

PING 10.244.2.78 (10.244.2.78): 56 data bytes
64 bytes from 10.244.2.78: seq=0 ttl=62 time=0.456 ms

NAME              ENDPOINTS                           AGE
payment-service   10.244.1.45:8080,10.244.2.78:8080   15d

Server:     10.96.0.10
Address 1:  10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:       payment-service.production.svc.cluster.local
Address 1:  10.96.45.12 payment-service.production.svc.cluster.local
```
- Pod IP works but Service IP fails: kube-proxy or iptables issue.
- Service IP works but DNS fails: CoreDNS issue.
- DNS works but external access fails: Ingress controller or cloud LB issue.
| Component | Role | Failure Impact | Recovery |
|---|---|---|---|
| kube-apiserver | Validates and serves all API requests. Gateway to etcd. | No new deployments, scaling, or config changes. Existing Pods continue running. | Restart the process. If HA, load balancer routes to healthy replica. |
| etcd | Distributed key-value store. Single source of truth for all cluster state. | Cluster freezes — no state changes possible. If quorum lost, cluster is partitioned. | Restore from snapshot or replace failed member. Requires etcdctl expertise. |
| kube-scheduler | Assigns unscheduled Pods to nodes based on resource availability and constraints. | New Pods stuck in Pending. Existing Pods unaffected. | Restart the process. If leader election fails, check lease in etcd. |
| kube-controller-manager | Runs reconciliation loops for Deployments, ReplicaSets, Nodes, Endpoints, etc. | No self-healing. Crashed Pods not restarted. Scaling stops. Node failures not detected. | Restart the process. Controllers resume reconciliation from current state. |
| kubelet | Node agent. Pulls images, starts containers, reports node status to API server. | Pods on that node stop being managed. Node marked NotReady after 40s (default). Pods evicted after 5 minutes. | Restart kubelet. If node is unhealthy, cordoning and replacing the node may be necessary. |
| kube-proxy | Programs iptables/IPVS rules for Service load balancing on each node. | Services unreachable from Pods on that node. Cross-node Service access still works from other nodes. | Restart the process. Rules are rebuilt from current Service/Endpoint state. |
| CoreDNS | Cluster DNS. Resolves Service names to ClusterIPs. | Service DNS resolution fails. Pods can still reach other Pods by direct IP. | Restart CoreDNS Pods. Check ConfigMap for misconfiguration. |
🎯 Key Takeaways
- The reconciliation loop is the fundamental operating principle of every Kubernetes controller. Understanding it transforms debugging from trial-and-error to systematic investigation.
- etcd is the single point of truth and the most common root cause of cluster-wide issues. Its disk latency is the cluster's ceiling.
- The scheduler scores nodes — it does not bin-pack, predict load, or rebalance. Scheduling decisions are permanent until the Pod is explicitly moved.
- Kubernetes networking is layered (CNI, kube-proxy, Ingress). Debug from the bottom up: Pod IP, ClusterIP, DNS, Ingress.
- Resource requests drive scheduling; resource limits drive runtime enforcement. Setting requests=limits (Guaranteed QoS) gives the most predictable behavior.
Interview Questions on This Topic
- Q: Explain the Kubernetes reconciliation loop. How does it apply to a Deployment managing a ReplicaSet managing Pods?
- Q: What happens when you delete a Pod that belongs to a Deployment? Trace the full sequence of events through every controller involved.
- Q: How does the kube-scheduler decide which node to place a Pod on? What are the two phases, and what plugins participate in each?
- Q: What is the difference between a Service's ClusterIP and the Pod IPs it routes to? How does kube-proxy implement this?
- Q: A Pod is stuck in Pending. Walk me through your debugging process, from the first command you would run to identifying the root cause.
- Q: Explain etcd's role in the cluster. What happens if etcd loses quorum? How would you recover?
- Q: What is the difference between requests and limits, and how do they affect scheduling vs runtime behavior?
- Q: How would you design a zero-downtime deployment strategy using Kubernetes primitives (Deployments, PDBs, health checks)?
Frequently Asked Questions
What is the difference between a Deployment, a ReplicaSet, and a Pod?
A Pod is the smallest unit — one or more containers sharing a network namespace. A ReplicaSet ensures a specified number of Pod replicas are running at all times. A Deployment manages ReplicaSets and provides declarative updates (rolling updates, rollbacks). The hierarchy is: Deployment -> ReplicaSet -> Pod. You almost never create ReplicaSets or Pods directly — you create Deployments, and the Deployment controller creates the ReplicaSet, which creates the Pods.
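The ownership chain can be sketched with plain dicts standing in for ownerReferences (illustrative names, not the real API objects):

```python
# Sketch of the Deployment -> ReplicaSet -> Pod ownership chain using
# ownerReference-style links. Names are invented for illustration.

deployment = {"kind": "Deployment", "name": "web", "replicas": 2}
replicaset = {"kind": "ReplicaSet", "name": "web-7d8f9",
              "ownerRef": deployment["name"],
              "replicas": deployment["replicas"]}
pods = [{"kind": "Pod", "name": f"web-7d8f9-{i}", "ownerRef": replicaset["name"]}
        for i in range(replicaset["replicas"])]

def owned_by(objects, owner_name):
    """Walk one level of the ownership chain."""
    return [o["name"] for o in objects if o.get("ownerRef") == owner_name]

print(owned_by([replicaset], "web"))   # ['web-7d8f9']
print(owned_by(pods, "web-7d8f9"))     # ['web-7d8f9-0', 'web-7d8f9-1']
```

This chain is also what garbage collection follows: delete the Deployment and its ReplicaSet and Pods are cascaded away through the same ownership links.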
What happens if the control plane node goes down?
Existing Pods on worker nodes continue running — the kubelet on each node operates independently of the control plane for running workloads. However, you cannot deploy new workloads, scale existing workloads, update configurations, or modify any cluster state until the control plane recovers. This is why production clusters need at least 3 control plane nodes for high availability.
How does Kubernetes handle node failures?
The Node controller in kube-controller-manager monitors node heartbeats. If a node stops sending heartbeats (default: every 10s), the node is marked NotReady after 40 seconds. After 5 minutes (the pod-eviction-timeout), the control plane evicts Pods from the unreachable node and reschedules them on healthy nodes. During this 5-minute window, the Pods are running but unreachable if the node is truly down. You can tune this timeout, but setting it too low causes unnecessary evictions during temporary network blips.
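The timeline above, using the default values mentioned, can be sketched as a simple state function:

```python
# Sketch of the node-failure timeline (defaults as described above:
# 10s heartbeat interval, 40s NotReady grace, 5-minute eviction timeout).

HEARTBEAT_INTERVAL = 10       # kubelet posts status every 10s
NOT_READY_GRACE = 40          # node marked NotReady after 40s of silence
POD_EVICTION_TIMEOUT = 300    # Pods evicted 5 minutes after NotReady

def node_state(seconds_since_last_heartbeat: int) -> str:
    if seconds_since_last_heartbeat <= NOT_READY_GRACE:
        return "Ready"
    if seconds_since_last_heartbeat <= NOT_READY_GRACE + POD_EVICTION_TIMEOUT:
        return "NotReady (pods still bound)"
    return "NotReady (pods being evicted)"

for t in (15, 60, 400):
    print(t, node_state(t))
```

The middle window is the dangerous one: Pods are still bound to a node that may be dead, so traffic routed to them fails until eviction and rescheduling occur.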
What is the difference between a ConfigMap and a Secret?
Functionally, they are identical — both inject configuration data into Pods as environment variables or mounted files. The difference is intent and handling: Secrets are base64-encoded (not encrypted by default), stored separately in etcd, and can be encrypted at rest with an EncryptionConfiguration. ConfigMaps are for non-sensitive configuration. In production, use an external secrets manager (Vault, AWS Secrets Manager) with the Secrets Store CSI Driver instead of Kubernetes Secrets for sensitive data.
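A two-line demonstration that base64 is encoding, not encryption: anyone who can read the Secret can recover the value. The secret string here is invented.

```python
# Secret values are merely base64-encoded: trivially reversible.
import base64

secret_value = "s3cr3t-db-password"   # illustrative value
encoded = base64.b64encode(secret_value.encode()).decode()
print(encoded)                             # what you'd see in the manifest
print(base64.b64decode(encoded).decode())  # original value, recovered
```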
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.