Beginner 6 min · March 06, 2026

Introduction to Kubernetes — Complete Guide

Q: What is Kubernetes in simple terms?

Kubernetes is a container orchestration platform: you declare how your application should run (which image, how many replicas, which ports), and Kubernetes continuously works to keep the cluster in that state.

Q: What is the difference between a Deployment, a ReplicaSet, and a Pod?

A Pod is the smallest unit — one or more containers sharing a network namespace. A ReplicaSet ensures a specified number of Pod replicas are running at all times. A Deployment manages ReplicaSets and provides declarative updates (rolling updates, rollbacks). The hierarchy is: Deployment -> ReplicaSet -> Pod. You almost never create ReplicaSets or Pods directly — you create Deployments, and the Deployment controller creates the ReplicaSet, which creates the Pods.

Q: What happens if the control plane node goes down?

Existing Pods on worker nodes continue running — the kubelet on each node operates independently of the control plane for running workloads. However, you cannot deploy new workloads, scale existing workloads, update configurations, or modify any cluster state until the control plane recovers. This is why production clusters need at least 3 control plane nodes for high availability.

Q: How does Kubernetes handle node failures?

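The Node controller in kube-controller-manager monitors node heartbeats. Each kubelet sends a heartbeat roughly every 10 seconds; if none arrives within the `node-monitor-grace-period` (40 seconds by default in most releases), the node is marked NotReady and tainted with `node.kubernetes.io/not-ready` or `node.kubernetes.io/unreachable`. Pods are evicted once their toleration for that taint expires, which is 5 minutes by default (`tolerationSeconds: 300`); older releases exposed this delay as the `pod-eviction-timeout` flag. Their controllers then recreate the Pods on healthy nodes. During this 5-minute window, the Pods are still listed as running but are unreachable if the node is truly down. You can shorten the delay, but setting it too low causes unnecessary evictions during temporary network blips.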
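As a rough sketch of what that tuning can look like on clusters that use taint-based eviction: a workload can shorten its own eviction delay by tolerating the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints for a fixed number of seconds. The Deployment name, image, and the 60-second value are illustrative assumptions, not recommendations.

```yaml
# Hypothetical Deployment: name, image, and the 60s value are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fast-failover-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: fast-failover-demo
  template:
    metadata:
      labels:
        app: fast-failover-demo
    spec:
      tolerations:
        # Evict these Pods 60s after the node is tainted, instead of the default 300s.
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 60
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 60
      containers:
        - name: web
          image: nginx:1.27
```

Setting the value per workload keeps the cluster-wide default intact, so only services that genuinely need fast failover pay the cost of more aggressive eviction.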

Q: What is the difference between a ConfigMap and a Secret?

Functionally, they are nearly identical: both inject configuration data into Pods as environment variables or mounted files. The difference is intent and handling. Secret values are base64-encoded (an encoding, not encryption), are stored unencrypted in etcd unless you enable encryption at rest with an EncryptionConfiguration, and can be restricted separately through RBAC. ConfigMaps are for non-sensitive configuration. In production, consider an external secrets manager (Vault, AWS Secrets Manager) with the Secrets Store CSI Driver instead of plain Kubernetes Secrets for sensitive data.
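A minimal sketch of how the two objects are consumed side by side. All names and values here are placeholders chosen for illustration; the point is that the Pod spec references a ConfigMap and a Secret in almost the same way.

```yaml
# All names and values are placeholders for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: info
---
apiVersion: v1
kind: Secret
metadata:
  name: app-credentials
type: Opaque
stringData:              # written as plain text; the API server stores it base64-encoded
  DB_PASSWORD: change-me
---
apiVersion: v1
kind: Pod
metadata:
  name: config-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "echo LOG_LEVEL=$LOG_LEVEL; sleep 3600"]
      envFrom:
        - configMapRef:
            name: app-config          # every key becomes an environment variable
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:             # a single key pulled from the Secret
              name: app-credentials
              key: DB_PASSWORD
```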

The degraded etcd EBS volume caused 800ms write latency, stalling Raft consensus and freezing the entire cluster.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Production tested · Last updated: July 19, 2026 · 2,466 articles, all by Naren

Before you start ⏱ 30 min

✓ Production DevOps experience
✓ Deep understanding of the tool's internals
✓ Experience debugging distributed systems


⚡Quick Answer

Kubernetes is a declarative container orchestration platform that continuously reconciles observed state with desired state.
Control plane: kube-apiserver, etcd, kube-scheduler, kube-controller-manager — each has a distinct role and failure mode.
etcd is the single source of truth — its disk latency is the cluster's performance ceiling.
The scheduler filters then scores nodes; it does NOT rebalance or predict load.
kubelet on each node runs the actual containers and reports status back to the API server.
Many control-plane outages trace back to etcd (slow disks, misconfiguration), not application code.

✦ Definition ~90s read

What is Kubernetes?

Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications. At its core, you manage a cluster: a set of machines called nodes. One or more nodes run the control plane (historically called masters); the rest are workers (the data plane).


You define your app's desired state — how many replicas, which image, what ports — in a YAML manifest. Kubernetes then ensures the cluster matches that state, healing failures, scaling load, and rolling updates. A Pod is the smallest unit: one or more containers sharing networking and storage.

Deployments manage replica sets. Services provide stable network endpoints. This declarative approach means you tell Kubernetes what you want, not how to achieve it. The system handles the rest, watching for drift and correcting it automatically.

Plain-English First

Imagine you own a giant warehouse with hundreds of workers. Instead of telling each worker exactly what to do every minute, you hire a smart manager who reads a wish list ('I need 5 boxes packed, always'), watches the floor, and reassigns workers automatically when someone calls in sick. Kubernetes is that manager — you describe what your software should look like, and Kubernetes keeps reality matching the wish list, forever, across thousands of machines.

Kubernetes is not a deployment tool. It is a distributed state reconciliation engine. Every component — from the scheduler to the kubelet — operates on the same principle: watch the desired state in etcd, compare it with observed state, and act to close the gap. This is the mental model that unlocks real debugging capability.

The control plane is the brain. etcd is the memory. The kubelet is the muscle on each node. The scheduler decides placement. When any of these components degrades, the symptoms are often misleading — a Pod stuck in Pending looks like a scheduling problem but is frequently an etcd latency issue or a resource quota misconfiguration.

The common misconception is that Kubernetes 'runs containers.' It does not. Kubernetes manages the desired state of workloads. The container runtime (containerd, CRI-O) runs containers. Kubernetes tells the runtime what to run, monitors whether it is running, and corrects deviations. This distinction matters when debugging crashes, image pull failures, and networking issues.

What etcd Latency Actually Does to Kubernetes

etcd is the distributed key-value store that backs Kubernetes, holding all cluster state: Pods, Services, ConfigMaps, Secrets. The core mechanic: every write goes through the Raft leader and must be persisted by a majority of members (quorum) before it is considered committed. A slow disk on the leader, or on enough members that a fast majority no longer exists, therefore stalls writes for the entire cluster.

In practice, etcd's performance is measured by fsync latency: the time to flush a write to disk. The control-plane components (kube-apiserver, scheduler, controller-manager) all depend on etcd's linearizable reads and writes. When fsync latency exceeds 100ms, watch timeouts and leader elections cascade. At 800ms, the cluster enters a death spiral: heartbeats fail, leaders step down, and no new writes succeed.

Every Kubernetes cluster runs etcd, but its sensitivity to disk I/O is often underestimated. This matters because a single slow disk, not CPU or memory, is the most common cause of control-plane outages in production.

⚠ Disk Speed Is Not CPU Speed

A fast CPU cannot compensate for a slow disk. etcd's fsync latency is the bottleneck — provision dedicated SSDs with guaranteed IOPS, not shared cloud volumes.

📊 Production Insight

A team ran etcd on a shared EBS gp2 volume with burst credits exhausted. The symptom: intermittent API server timeouts and leader election storms every 90 seconds. The rule: provision etcd on dedicated NVMe SSDs or local SSDs with at least 5000 IOPS and monitor fsync latency — alert if p99 exceeds 50ms.

🎯 Key Takeaway

etcd is the single source of truth for cluster state — its latency is your cluster's latency.

Disk fsync latency is the critical metric; anything above 100ms p99 will cause control-plane instability.

Always run etcd on dedicated, low-latency storage — never share a disk with other workloads or use network-attached volumes with burst limits.
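One way to act on the fsync guidance is an alert on etcd's WAL fsync histogram. The sketch below is a Prometheus rule file, assuming Prometheus already scrapes etcd's metrics endpoint; the group and alert names are hypothetical, and the 50ms threshold is the one quoted in the Production Insight above.

```yaml
groups:
  - name: etcd-disk-latency          # hypothetical group name
    rules:
      - alert: EtcdWalFsyncSlow      # hypothetical alert name
        # p99 WAL fsync latency per etcd member over the last 5 minutes
        expr: |
          histogram_quantile(
            0.99,
            sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
          ) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "etcd WAL fsync p99 above 50ms on {{ $labels.instance }}"
```

The `for: 10m` clause is a judgment call: it suppresses alerts on short spikes, at the cost of reacting more slowly to a volume that degrades suddenly.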

The 4 Essential Objects: Pod, Service, Deployment, Namespace

Before diving into the architecture, you need a concrete mental model of the four objects you'll use every day. Kubernetes exposes hundreds of resource types, but 80% of your interactions will involve these four.

Pod is the smallest deployable unit. A Pod wraps one or more containers, gives them a shared network namespace (one IP per Pod), and optionally shared storage volumes. Containers in the same Pod can communicate via localhost. Pods are ephemeral — they can be killed and rescheduled at any time. Never run a single Pod without a controller (Deployment, StatefulSet, DaemonSet).

Service provides a stable network endpoint for a set of Pods. Because Pods can die and be replaced with new IPs, a Service gives a fixed IP (ClusterIP) and DNS name that load-balances across the healthy Pods. The Service uses label selectors to determine which Pods belong to it.

Deployment is the most common controller. It declares the desired state for your stateless applications: how many replicas, which container image, resource limits, health checks, update strategy. The Deployment controller creates a ReplicaSet, which creates the Pods. When you update the Pod template, the Deployment creates a new ReplicaSet and gradually scales it up and the old one down (rolling update).

Namespace is a virtual cluster boundary. It scopes resource names, RBAC rules, quotas, and network policies (it does not block traffic between namespaces on its own). Every resource lives in a namespace, except cluster-scoped resources like Nodes and PersistentVolumes. Use namespaces to separate environments (dev, staging, prod) or teams.

Together, these objects form the foundation: you define a Deployment that creates Pods, expose them via a Service, and organize everything in a Namespace.

four-essential-objects.yamlYAML

# 1. Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
---
# 2. Deployment creating 3 replicas of a web app
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        ports:
        - containerPort: 80
---
# 3. Service exposing the Deployment internally
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
  namespace: production
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP

Output

# After applying the above three objects:

kubectl get ns
NAME         STATUS   AGE
production   Active   10s

kubectl get deploy -n production
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deploy   3/3     3            3           10s

kubectl get pods -n production -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP
nginx-deploy-7b5c6f8d9-abc12   1/1     Running   0          8s    10.244.1.10
nginx-deploy-7b5c6f8d9-def34   1/1     Running   0          8s    10.244.1.11
nginx-deploy-7b5c6f8d9-ghi56   1/1     Running   0          8s    10.244.1.12

kubectl get svc -n production
NAME            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
nginx-service   ClusterIP   10.96.123.45   <none>        80/TCP    10s

# Test internal connectivity from a temporary Pod:
kubectl run -it --rm test --image=curlimages/curl -- sh
/ $ curl nginx-service.production.svc.cluster.local
<!DOCTYPE html> (nginx default page)

🔥Namespaces Are Not Optional in Production

Always create your objects inside a specific namespace. The default namespace is a shared space where conflicts happen. Set a resource quota and limit range on every namespace to prevent teams from exhausting cluster resources.
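A sketch of what a quota and a limit range for the `production` namespace could look like. The object names and every number below are illustrative assumptions; size them from your own capacity planning.

```yaml
# Caps the total resources all Pods in the namespace may request or use.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota          # hypothetical name
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
# Supplies per-container defaults so Pods without explicit values still count against the quota.
apiVersion: v1
kind: LimitRange
metadata:
  name: production-defaults       # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:             # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                    # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
```

The two objects work together: once a quota on CPU or memory exists, Pods that specify no requests or limits are rejected unless a LimitRange fills in defaults.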

📊 Production Insight

The most common production mistake is creating a Deployment without a Service, then trying to reach Pods by IP. Always pair every Deployment with a Service (even if you only need internal access). For external access, use an Ingress or LoadBalancer Service. Also, remember that Deployments are for stateless workloads — for stateful workloads (databases), use a StatefulSet.

🎯 Key Takeaway

Pod, Service, Deployment, and Namespace are the four pillars of everyday Kubernetes. Understand their lifecycle and how they interact: Deployments create Pods, Services provide stable network endpoints, and Namespaces keep everything organized and isolated.

thecodeforge.io

Introduction Kubernetes

Control Plane Architecture: The Brain of the Cluster

The Kubernetes control plane consists of four components that work together to maintain cluster state. Understanding each component's role — and its failure modes — is essential for production operations.

kube-apiserver is the front door. Every kubectl command, every controller reconciliation, every kubelet status report goes through the API server. It validates requests, persists state to etcd, and serves as the watch endpoint for all controllers. It is stateless — you can run multiple replicas behind a load balancer for HA.

etcd is the single source of truth. It is a distributed, consistent key-value store built on the Raft consensus protocol. All cluster state — Pod definitions, ConfigMaps, Secrets, node registrations — lives in etcd. If etcd loses quorum, the cluster cannot make any state changes. etcd is the most critical component and the most commonly under-provisioned.

kube-scheduler watches for unscheduled Pods and assigns them to nodes. It does not run Pods — it only writes the nodeName field. The kubelet on the assigned node then pulls the image and starts the container. The scheduler uses a two-phase process: filtering (eliminate infeasible nodes) and scoring (rank feasible nodes, pick the highest score).

kube-controller-manager runs the control loops. Each controller watches a specific resource type and reconciles actual state with desired state. The Deployment controller ensures the right number of replicas exist. The Node controller detects when nodes go unhealthy. The Endpoint controller updates Service endpoints as Pods come and go.

check-control-plane.shBASH

# Control Plane Health Check — Run this to verify all components are healthy
# Save as check-control-plane.sh

# 1. API Server health (returns 200 if healthy)
curl -k https://localhost:6443/healthz
# Expected: "ok"

# 2. etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Expected: "is healthy"

# 3. etcd cluster member status
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  --write-out=table
# Shows: ID, Status, Version, DB Size, Raft Term, Raft Index

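# 4. Scheduler and Controller-Manager leader election
# (reads the legacy Endpoints annotation; on clusters that use Lease-based election, run
#  `kubectl get lease kube-scheduler kube-controller-manager -n kube-system` instead)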
kubectl get endpoints kube-scheduler -n kube-system -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
kubectl get endpoints kube-controller-manager -n kube-system -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# 5. All control plane components running
kubectl get pods -n kube-system -o wide

Output

127.0.0.1:2379 is healthy: successfully committed proposal: took = 2.145ms

+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://127.0.0.1:2379 | 8e9e05c52164694d |   3.5.9 |   25 MB |      true |      false |         4 |      18234 |              18234 |        |
+------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

{"holderIdentity":"master-1_xxxxx","leaseDurationSeconds":15,"acquireTime":"2026-03-01T10:00:00Z","renewTime":"2026-04-07T14:30:00Z","leaderTransitions":3}

NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE   IP             NODE
kube-system   coredns-5d78c9869d-abc12           1/1     Running   0          30d   10.244.0.5     master-1
kube-system   etcd-master-1                      1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-apiserver-master-1            1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-controller-manager-master-1   1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-proxy-xyz78                   1/1     Running   0          30d   192.168.1.10   master-1
kube-system   kube-scheduler-master-1            1/1     Running   0          30d   192.168.1.10   master-1

Mental Model

The Reconciliation Loop — The Heartbeat of Kubernetes

Understanding this loop is the single most important concept in Kubernetes.

The API server is the only component that talks to etcd. All other components go through the API server.
Controllers are level-triggered, not edge-triggered. They care about the current state, not the event that caused it.
This is why Kubernetes is self-healing. It does not remember what happened — it only checks what is true right now.
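The level-triggered idea can be sketched in a few lines of POSIX shell. This is a toy model, not how any real controller is implemented: `reconcile` is a hypothetical helper that looks only at the desired and observed replica counts and prints the corrective action, with no memory of past events.

```shell
#!/bin/sh
# reconcile WANT HAVE
# Prints the action that closes the gap between desired and observed replicas.
# It receives only the current state (level-triggered); no event history is needed.
reconcile() {
  want=$1
  have=$2
  if [ "$have" -lt "$want" ]; then
    echo "create $((want - have))"
  elif [ "$have" -gt "$want" ]; then
    echo "delete $((have - want))"
  else
    echo "noop"
  fi
}

# Simulated control loop: 3 replicas desired, a failure has left 1 running.
target=3
running=1
while :; do
  action=$(reconcile "$target" "$running")
  echo "observed=$running desired=$target -> $action"
  case $action in
    create*) running=$((running + 1)) ;;   # act one step, then observe again
    delete*) running=$((running - 1)) ;;
    noop) break ;;
  esac
done
```

Because each pass starts from the observed count, the loop converges from any starting point, whether replicas were lost to a crash or added by mistake.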

📊 Production Insight

Control plane HA requires at least 3 etcd members and 2+ API server replicas.

A single-node control plane is a single point of failure — if the master node dies, existing Pods keep running (kubelet is independent), but you cannot deploy, scale, or modify anything until the control plane recovers.

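etcd quorum requires floor(n/2)+1 members alive. With 3 members, you can tolerate 1 failure; with 5, you can tolerate 2.

Avoid an even number of etcd members. A 4-member cluster needs 3 members for quorum, so it still tolerates only 1 failure, the same as a 3-member cluster, while adding one more node that can fail.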
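The quorum arithmetic is easy to check with a short shell sketch. The function names `quorum` and `tolerated` are made up for this example; shell integer division supplies the floor in floor(n/2)+1.

```shell
#!/bin/sh
# quorum N: members that must be alive for an N-member etcd cluster to accept writes
quorum() {
  echo $(( $1 / 2 + 1 ))          # integer division, i.e. floor(N/2) + 1
}

# tolerated N: members that can fail while the cluster still has quorum
tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5 6 7; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated "$n")"
done
```

Running it shows that 3 and 4 members both tolerate one failure, and 5 and 6 both tolerate two: an even member count raises the quorum size without adding any failure tolerance.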

🎯 Key Takeaway

The control plane is a distributed system with etcd as its consensus backbone.

Every API request, every scheduler decision, every controller reconciliation depends on etcd's health.

Production clusters need 3+ etcd members on local SSDs — and etcd latency monitoring is not optional, it's the earliest warning of cluster degradation.

Control Plane Sizing for Production

If: Dev/test cluster, non-critical workloads.
→ Use: A single control plane node is acceptable. Accept the risk of API unavailability during maintenance.

If: Production cluster, < 100 nodes.
→ Use: 3 control plane nodes with stacked etcd. etcd runs on the same nodes as the API server, which gives cost-effective HA.

If: Production cluster, > 100 nodes or strict SLA requirements.
→ Use: 3–5 dedicated etcd nodes + 2+ API server nodes (external etcd). Isolates etcd disk I/O from API load.

If: Multi-region cluster.
→ Use: Stretched etcd across regions needs < 10ms latency. If higher, use separate clusters per region.

Control Plane vs Data Plane Architecture

Kubernetes is divided into two logical planes: the control plane (brain) and the data plane (muscle). The control plane makes decisions about the cluster state — what should run, where it should run, and whether the current state matches the desired state. The data plane executes those decisions — it runs the actual containers, provides the network connectivity, and reports back the observed state.

The control plane components (kube-apiserver, etcd, scheduler, controller-manager) typically run on dedicated master nodes, though in smaller clusters they may be colocated. The data plane consists of the worker nodes, each running kubelet, kube-proxy, the container runtime, and the CNI plugin.

The key architectural insight: control plane components communicate with each other and with etcd, but they never directly interact with the user containers. All interactions go through the API server. The kubelet on each worker node watches the API server for Pods assigned to its node, then instructs the container runtime to pull images and start containers. kube-proxy watches the API server for Service changes and programs iptables/IPVS rules accordingly.

This separation means that if the control plane fails, existing containers continue running (the kubelet is autonomous for running workloads) but you cannot make any changes. Conversely, if a worker node fails, the control plane detects it (via the Node controller) and reschedules the Pods on healthy nodes after a timeout.

check-data-plane.shBASH

# Check kubelet and kube-proxy status on a worker node
# Run this on the worker node itself

# 1. Check kubelet is running
systemctl status kubelet

# 2. Check kube-proxy (runs as a DaemonSet, check from master)
kubectl get pods -n kube-system | grep kube-proxy

# 3. Check container runtime is responsive
crictl pods

# 4. Verify that the node can reach the API server
curl -k https://<api-server-ip>:6443/healthz

# 5. Smoke test: confirm the cluster can schedule and start a Pod (it may land on any node)
kubectl run test --image=busybox --restart=Never -- sleep 30
kubectl get pods -n default
kubectl delete pod test

Output

● kubelet.service - kubelet: The Kubernetes Node Agent

Loaded: loaded /lib/systemd/system/kubelet.service; enabled

Active: active (running)

kube-proxy-abc12 1/1 Running 0 10d

kube-proxy-def34 1/1 Running 0 10d

CONTAINER ID IMAGE CREATED STATE NAME

c1d2e3f4 <image> 2 hours ago Ready <pod>

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized"} # (because we didn't auth, but reachability is confirmed)

pod/test created

🔥Why This Separation Matters

When debugging, ask: is this a control plane issue or a data plane issue? Symptoms like kubectl hanging point to control plane (API server or etcd). Symptoms like a Pod running but unreachable point to data plane (kube-proxy, CNI). This mental shortcut saves hours.

📊 Production Insight

In production, always run the control plane on separate nodes from the data plane. Colocating control plane components with user workloads on the same nodes can lead to noisy neighbor problems — a CPU-hungry Pod can starve the API server or etcd, causing cluster-wide instability. For large clusters (>500 nodes), consider using dedicated etcd nodes for further isolation.

🎯 Key Takeaway

The control plane decides, the data plane executes. This separation is the foundation of Kubernetes scalability and resilience. Understanding which plane a symptom belongs to is the first step in systematic debugging.

Kubernetes Control Plane and Data Plane

The Scheduler: How Kubernetes Decides Where Pods Run

The kube-scheduler is the component that assigns Pods to nodes. It does not run Pods — it only writes the spec.nodeName field on the Pod object. The kubelet on the assigned node then pulls the image and starts the container.

The scheduler uses a two-phase process:

Filtering (Feasibility): Eliminate nodes that cannot run the Pod. Filter reasons include: insufficient CPU/memory, node taints the Pod cannot tolerate, node affinity mismatches, volume zone constraints, and Pod topology spread constraints. After filtering, if zero nodes remain, the Pod stays in Pending.

Scoring (Ranking): Rank the feasible nodes by a set of scoring plugins. Default scoring includes: NodeResourcesBalancedAllocation (prefer nodes with balanced CPU/memory usage), ImageLocality (prefer nodes that already have the container image), InterPodAffinity (prefer nodes where affinity rules are satisfied), and TaintToleration (prefer nodes with fewer taints). The node with the highest weighted score wins.

The scheduler makes decisions based on the state of the cluster at scheduling time. It does not predict future load. It does not rebalance existing Pods. Once a Pod is scheduled, only explicit actions (eviction, deletion, preemption) can move it.

scheduler-configuration.yamlYAML

# Example: Pod with scheduling constraints
# This Pod will ONLY be scheduled on nodes with the label 'disktype=ssd'
# and will prefer nodes in zone 'us-east-1a'
apiVersion: v1
kind: Pod
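metadata:
  name: io-thecodeforge-payment-service
  namespace: production
  labels:
    app: payment-service   # matched by the podAffinity and topologySpreadConstraints selectors below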
spec:
  # Hard requirement: node MUST have this label
  nodeSelector:
    disktype: ssd

  # Soft preference: scheduler tries to place here, but can choose elsewhere
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
    # Pod affinity: prefer to run near other payment-service Pods
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: payment-service
            topologyKey: kubernetes.io/hostname

  # Tolerations: allow scheduling on nodes with the 'dedicated=high-cpu' taint
  tolerations:
    - key: dedicated
      operator: Equal
      value: high-cpu
      effect: NoSchedule

  # Topology spread: distribute replicas evenly across zones
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: payment-service

  containers:
    - name: payment-service
      image: registry.thecodeforge.io/payment-service:v2.4.1
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1000m"
          memory: "1Gi"

Output

pod/io-thecodeforge-payment-service created

# Verify scheduling decision
kubectl describe pod io-thecodeforge-payment-service -n production | grep -A 10 Events
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  5s    default-scheduler  Successfully assigned production/io-thecodeforge-payment-service to node-3

Mental Model

Requests vs Limits — The Scheduler Only Sees Requests

This is why setting requests=limits (Guaranteed QoS) gives the most predictable performance.

Guaranteed QoS (requests=limits): Pod is last to be evicted under resource pressure.
Burstable QoS (requests < limits): Pod can burst but is evicted before Guaranteed Pods.
BestEffort QoS (no requests, no limits): First to be evicted. Never use in production.
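For illustration, a Pod that should land in the Guaranteed class might look like the sketch below. The Pod name and image are placeholders; what matters is that CPU and memory requests equal their limits for every container.

```yaml
# Hypothetical Pod: requests equal limits for every container,
# so Kubernetes assigns it the Guaranteed QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"       # equal to the request
          memory: "512Mi"   # equal to the request
```

After the Pod is created, `kubectl get pod qos-guaranteed-demo -o jsonpath='{.status.qosClass}'` reports the class Kubernetes assigned, which is a quick way to confirm the manifest does what you intended.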

📊 Production Insight

Scheduler performance degrades with cluster size and Pod count.

At > 5000 Pods, scheduling latency can exceed 1 second, causing deployment rollouts to slow dramatically.

Mitigate with: scheduler extenders for custom logic, Pod topology spread constraints instead of pod anti-affinity (more efficient), and multiple scheduler profiles for different workload classes.

The scheduler's scoring algorithm is pluggable — you can weight or disable scoring plugins via a KubeSchedulerConfiguration.
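A sketch of what such a configuration could look like, assuming a Kubernetes version that serves the `kubescheduler.config.k8s.io/v1` API. The choice of plugins and the weight are illustrative only; check the defaults for your release before changing them.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: ImageLocality                     # stop preferring nodes that already hold the image
          - name: NodeResourcesBalancedAllocation   # disabled here, re-enabled below with a new weight
        enabled:
          - name: NodeResourcesBalancedAllocation
            weight: 2                               # illustrative weight
```

The file is handed to kube-scheduler through its `--config` flag. Changing weights shifts placement for every Pod the profile schedules, so it is worth trying on a non-production cluster first.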

🎯 Key Takeaway

The scheduler is a scoring engine, not a bin-packer.

It ranks feasible nodes and picks the best match at scheduling time.

It does not rebalance, predict load, or consider limits.

Understanding the filter-then-score pipeline — and how nodeSelector, affinity, taints, and topology spread interact within it — is essential for controlling Pod placement at scale.

Scheduling Constraint Selection

If: Pod MUST run on a specific type of node (e.g., GPU, SSD).
→ Use: nodeSelector or nodeAffinity in required mode. Hard constraint: the Pod stays Pending if no node matches.

If: Pod PREFERS a specific node type but can run elsewhere.
→ Use: nodeAffinity in preferred mode. Soft constraint: the scheduler tries to match but places the Pod elsewhere if needed.

If: Replicas must be spread across failure domains (zones, nodes).
→ Use: topologySpreadConstraints. More flexible and performant than pod anti-affinity.

If: Pod should run near (or away from) other specific Pods.
→ Use: podAffinity (co-locate) or podAntiAffinity (spread). At scale, prefer topologySpreadConstraints.

If: Node has taints (dedicated nodes, spot instances).
→ Use: tolerations in the Pod spec. Without a matching toleration, the Pod won't schedule on the tainted node.


Pod Networking: How Containers Talk to Each Other

Kubernetes networking has three fundamental requirements, enforced by the CNI (Container Network Interface) plugin:

Every Pod gets its own IP address, unique across the cluster.
Pods on any node can communicate with Pods on any other node without NAT.
Agents on a node (kubelet, system daemons) can communicate with all Pods on that node.

These requirements are simple to state but complex to implement. The CNI plugin (Calico, Cilium, Flannel, AWS VPC CNI) is responsible for wiring this up. It allocates IP addresses from the node's Pod CIDR range, sets up network interfaces inside the Pod's network namespace, and configures routing rules so Pods can reach each other across nodes.

kube-proxy handles Service networking. It watches the API server for Service and Endpoint objects, then programs iptables rules (or IPVS rules) on each node. When a Pod connects to a Service's ClusterIP, the kernel's iptables rules intercept the connection and DNAT it to one of the backend Pod IPs. This is why Service IPs are virtual — they do not exist on any network interface.

networking-debug.shBASH

# Debugging Pod networking step by step

# 1. Verify Pod has an IP address
kubectl get pods -n production -o wide
# If Pod IP is <none>, the CNI plugin failed to assign an address

# 2. Check if the CNI plugin is healthy
kubectl get pods -n kube-system | grep -E 'calico|cilium|flannel|aws-node'

# 3. Verify Pod CIDR allocation per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Each node must have a unique, non-overlapping CIDR

# 4. Test Pod-to-Pod connectivity across nodes
kubectl exec -it pod-on-node-a -- ping <pod-ip-on-node-b>
# If this fails but intra-node works, the CNI cross-node routing is broken

# 5. Check Service endpoints
kubectl get endpoints payment-service -n production
# If endpoints are empty, no Pods match the Service's selector

# 6. Test Service DNS resolution
kubectl exec -it <pod> -- nslookup payment-service.production.svc.cluster.local
# If DNS fails, check CoreDNS pods: kubectl get pods -n kube-system | grep coredns

# 7. Inspect iptables rules for a Service
# (run on the node where your Pod is running)
iptables-save | grep <service-cluster-ip>

Output

NAME                          READY   STATUS    IP            NODE
payment-service-7d8f9-abc12   1/1     Running   10.244.1.45   node-2
payment-service-7d8f9-def34   1/1     Running   10.244.2.78   node-3

NAME                                 READY   STATUS    RESTARTS   AGE
calico-node-abc12                    1/1     Running   0          30d
calico-kube-controllers-5d78-def34   1/1     Running   0          30d

node-1   10.244.0.0/24
node-2   10.244.1.0/24
node-3   10.244.2.0/24

PING 10.244.2.78 (10.244.2.78): 56 data bytes
64 bytes from 10.244.2.78: seq=0 ttl=62 time=0.456 ms

NAME              ENDPOINTS                           AGE
payment-service   10.244.1.45:8080,10.244.2.78:8080   15d

Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      payment-service.production.svc.cluster.local
Address 1: 10.96.45.12 payment-service.production.svc.cluster.local

Mental Model

The Three Layers of K8s Networking

Most networking bugs are CNI or DNS issues, not application code issues.

Pod IP works but Service IP fails: kube-proxy or iptables issue.
Service IP works but DNS fails: CoreDNS issue.
DNS works but external access fails: Ingress controller or cloud LB issue.

📊 Production Insight

CNI plugin selection has massive performance and operational implications.

Calico (BGP mode) scales well but requires BGP peering knowledge.

Cilium (eBPF) bypasses iptables entirely, offering better performance at scale (>1000 Services) but is more complex to debug.

AWS VPC CNI assigns real VPC IP addresses to Pods, simplifying security group integration but consuming VPC IP space rapidly.

Evaluate CNI based on: Service count, NetworkPolicy requirements, observability needs, and team expertise.

🎯 Key Takeaway

Kubernetes networking is a layered system: CNI for Pod connectivity, kube-proxy for Service load balancing, Ingress for external access.

Debug from the bottom up — Pod IP, then ClusterIP, then DNS, then Ingress.

CNI plugin choice is a long-term architectural decision with performance, security, and operational trade-offs.

Kubernetes Storage: PersistentVolumes, Claims, and StorageClasses

Kubernetes storage decouples the Pod lifecycle from the data lifecycle. A Pod can be deleted and recreated, but its data persists if it mounts a PersistentVolume (PV) through a PersistentVolumeClaim (PVC). This is critical for stateful workloads like databases.

PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically by a StorageClass. It is a cluster resource, like a node. PVs have a capacity and access mode (ReadWriteOnce, ReadOnlyMany, ReadWriteMany).

PersistentVolumeClaim (PVC) is a request for storage by a user. It specifies size and access mode. Kubernetes binds a PVC to a PV that meets the requirements. If no matching PV exists, the PVC remains Pending — unless a StorageClass with a dynamic provisioner is referenced.

StorageClass defines a class of storage. It specifies the provisioner (e.g., the CSI driver ebs.csi.aws.com), parameters (type, IOPS), and reclaim policy. When a PVC requests a StorageClass, the provisioner automatically creates a PV that satisfies the claim.

storage-example.yaml · YAML

# StorageClass for AWS gp3 volumes with 3000 IOPS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io-thecodeforge-fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# PVC that uses the StorageClass above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: io-thecodeforge-payment-db-pvc
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: io-thecodeforge-fast
---
# Pod using the PVC
apiVersion: v1
kind: Pod
metadata:
  name: io-thecodeforge-payment-db
  namespace: production
spec:
  containers:
    - name: postgres
      image: postgres:16
      env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: io-thecodeforge-payment-db-pvc

Output

# After creation

kubectl get sc

# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE

# io-thecodeforge-fast ebs.csi.aws.com Delete WaitForFirstConsumer

kubectl get pvc -n production

# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS

# io-thecodeforge-payment-db-pvc Bound pvc-abc123 100Gi RWO io-thecodeforge-fast

kubectl get pv

# NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM

# pvc-abc123 100Gi RWO Delete Bound production/io-thecodeforge-payment-db-pvc

⚠ PV Reclaim Policy: The Silent Data Loss Trap

If a PVC is deleted, what happens to the underlying PV depends on its persistentVolumeReclaimPolicy:
Retain: The PV remains in the Released state with its data intact; you must reclaim it manually.
Delete: The PV and the underlying storage are deleted. This is the default for dynamically provisioned volumes.
Recycle: Deprecated. Attempts to scrub the volume and reuse it.
Production gotcha: If you delete a PVC whose PV has the Delete reclaim policy without first taking a snapshot, you lose all data. Set Retain for critical databases, or use a backup solution.
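
One way to apply the Retain advice is a dedicated StorageClass for critical data. The sketch below reuses the provisioner and volume type from storage-example.yaml; the class name io-thecodeforge-fast-retain is an assumption made for illustration, not something defined elsewhere in this guide.

```yaml
# Sketch only: a Retain variant of the StorageClass shown earlier.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io-thecodeforge-fast-retain   # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain                  # PV and backing volume survive PVC deletion
volumeBindingMode: WaitForFirstConsumer
```

PVs created through this class move to Released, not deleted, when their PVC goes away. For a PV that already exists, the policy can be changed in place with kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'.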

📊 Production Insight

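Dynamic provisioning with a default StorageClass has a quiet failure mode: a PVC that omits storageClassName is provisioned on the default class without any warning, which may not be the tier you intended. A typo in storageClassName does not fall back to the default; the PVC simply stays Pending.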

If you delete a Namespace, all PVCs in it are deleted, and with ReclaimPolicy=Delete, all data is gone.

Monitor PVC usage and set ResourceQuotas on storage requests to prevent runaway claims from exhausting your cloud budget.

For stateful workloads, use StatefulSets with volumeClaimTemplates to automatically generate unique PVCs per replica.
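
As a rough illustration of the volumeClaimTemplates pattern, the sketch below adapts the Pod from storage-example.yaml into a StatefulSet. The names payment-db and the app label are assumptions, and the manifest covers storage only: Postgres still needs credentials (for example POSTGRES_PASSWORD from a Secret) and its own replication setup, neither of which is shown.

```yaml
# Headless Service required by the StatefulSet (hypothetical name).
apiVersion: v1
kind: Service
metadata:
  name: payment-db
  namespace: production
spec:
  clusterIP: None
  selector:
    app: payment-db
  ports:
    - name: postgres
      port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: payment-db
  namespace: production
spec:
  serviceName: payment-db
  replicas: 3
  selector:
    matchLabels:
      app: payment-db
  template:
    metadata:
      labels:
        app: payment-db
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  # One PVC is generated per replica from this template.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: io-thecodeforge-fast
        resources:
          requests:
            storage: 100Gi
```

The generated claims are named after the template and the Pod (data-payment-db-0, data-payment-db-1, data-payment-db-2), so each replica reattaches to its own volume after a restart.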

🎯 Key Takeaway

Kubernetes storage is about lifecycle decoupling: PV is a resource, PVC is a claim, StorageClass enables dynamic provisioning.

The reclaim policy determines whether data survives PVC deletion — set to Retain for anything irreplaceable.

StatefulSets with volumeClaimTemplates are the correct pattern for database-like workloads.

Choosing a Storage Approach

If: Ephemeral data (logs, scratch space, caches).
→
Use: An emptyDir volume. Data is lost when the Pod is deleted, which is expected.

If: Persistent data that must survive Pod restarts (single replica).
→
Use: A PVC with a StorageClass that provides a block store (EBS, Persistent Disk). Use ReadWriteOnce access mode.

If: Multi-replica app that needs shared read-write access (e.g., NFS, shared config).
→
Use: A PVC with ReadWriteMany access mode. Not all provisioners support it; consider NFS, EFS, or GlusterFS.

If: Database with strict consistency requirements (Postgres, MySQL).
→
Use: A single PVC per replica (StatefulSet + volumeClaimTemplates). Prefer local SSDs or dedicated EBS volumes. Never use ReadWriteMany for databases.
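
For the ephemeral case in the first row, a minimal sketch of an emptyDir volume might look like this. The Pod name, image, and command are placeholders chosen for illustration.

```yaml
# Sketch: scratch space that lives and dies with the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: scratch-worker            # hypothetical name
spec:
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "echo warm > /cache/state && sleep 3600"]
      volumeMounts:
        - name: cache
          mountPath: /cache
  volumes:
    - name: cache
      emptyDir:
        sizeLimit: 1Gi            # exceeding this can get the Pod evicted
```

The contents of /cache survive container restarts inside the same Pod but are removed when the Pod itself is deleted.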


Namespaces, Resource Quotas, and Multi-Tenancy

Namespaces are virtual clusters within a physical cluster. They provide isolation boundaries for resources, RBAC, and network policies. Every resource lives in a namespace — except cluster-scoped resources like Nodes and PersistentVolumes.

ResourceQuota limits aggregate resource consumption within a namespace. You can set quotas on CPU, memory, Pod count, PVC storage, and even the number of Services. Without quotas, a single misconfigured application can consume all cluster resources and starve others.

LimitRange sets default requests/limits and min/max constraints for Pods in a namespace. This prevents a Pod from requesting an absurd amount of resources or running without any limits.

Multi-tenancy with Namespaces is common: each team gets its own namespace, with RBAC restricting cross-namespace access. But true multi-tenancy (running untrusted workloads) requires additional isolation — consider virtual clusters (vClusters) or sandbox containers (gVisor, Kata Containers).

quota-and-limitrange.yaml · YAML

# ResourceQuota for a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: io-thecodeforge-team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "10"
    requests.storage: 500Gi
    pods: "50"
    services: "10"
---
# LimitRange to enforce default resource boundaries
apiVersion: v1
kind: LimitRange
metadata:
  name: io-thecodeforge-default-limits
  namespace: team-a
spec:
  limits:
    - default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "100m"
        memory: 128Mi
      max:
        cpu: "2"
        memory: 4Gi
      min:
        cpu: "50m"
        memory: 64Mi
      type: Container

Output

# After applying

kubectl describe resourcequota -n team-a

# Name: io-thecodeforge-team-quota

# Namespace: team-a

# Resource Used Hard

# -------- --- ---

# pods 12 50

# requests.cpu 3.5 10

# requests.memory 7Gi 20Gi

# limits.cpu 8 20

# limits.memory 18Gi 40Gi

# persistentvolumeclaims 2 10

# requests.storage 120Gi 500Gi

# services 4 10

kubectl describe limitrange -n team-a

# Limits:

# Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio

# ---- -------- --- --- --------------- ------------- -----------------------

# Container cpu 50m 2 100m 500m -

# Container memory 64Mi 4Gi 128Mi 512Mi -

🔥 Multi-Tenancy Warning

Using Namespaces alone for security isolation between untrusted tenants is insufficient. A Pod in one namespace can still connect to a Service in another namespace unless NetworkPolicies block it. Also, a Pod can access the API server with its ServiceAccount token — RBAC must be scoped per namespace. For hard multi-tenancy, consider dedicated clusters, virtual clusters (vCluster), or sandbox runtimes.
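
A common starting point for blocking cross-namespace traffic is a default-deny policy plus an explicit allow for traffic inside the namespace. The sketch below assumes the team-a namespace from the quota example and a CNI plugin that enforces NetworkPolicy (Calico and Cilium do; some plugins silently ignore these objects). The policy names are illustrative.

```yaml
# Deny all ingress to every Pod in team-a.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress      # hypothetical name
  namespace: team-a
spec:
  podSelector: {}                 # empty selector matches all Pods in the namespace
  policyTypes:
    - Ingress
---
# Re-allow ingress from Pods in the same namespace only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace      # hypothetical name
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}         # no namespaceSelector, so this means team-a only
```

Policies are additive, so further allow rules (for example from an ingress controller namespace) can be layered on top without touching the default-deny object.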

📊 Production Insight

Without ResourceQuotas, a single team's application can silently consume all cluster resources and block other teams.

Quota enforcement is immediate — if a team tries to create a Pod that would exceed its quota, the API server rejects it.

LimitRange is essential for environments where teams might forget to set requests and limits — it prevents BestEffort Pods by default.

Monitor quota headroom with kubectl describe resourcequota, live usage with kubectl top, and Prometheus alerts to catch quota exhaustion before it causes deployment failures.

🎯 Key Takeaway

Namespaces provide lightweight isolation — RBAC, ResourceQuotas, and NetworkPolicies make them safe for trusted tenants.

Always set ResourceQuotas and LimitRange per namespace in shared clusters — without them, one team can starve everyone.

For truly untrusted workloads, Namespaces alone are not enough: reach for dedicated clusters or sandbox containers.
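
To make the RBAC part concrete, here is a minimal sketch that scopes a team to its own namespace by binding the built-in edit ClusterRole inside team-a. The group name team-a-devs is an assumption; it would have to match whatever your identity provider puts in the user's group claims.

```yaml
# Sketch: grant a team read/write access inside team-a only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-edit               # hypothetical name
  namespace: team-a               # the binding applies to this namespace only
subjects:
  - kind: Group
    name: team-a-devs             # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                      # built-in role; cannot modify Roles or RoleBindings
  apiGroup: rbac.authorization.k8s.io
```

Because this is a RoleBinding and not a ClusterRoleBinding, the permissions of the ClusterRole are granted only within team-a.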

Namespace Isolation Strategy

If: Same team, different environments (dev, staging).
→
Use: Separate namespaces per environment. RBAC restricts team members to their environment. No ResourceQuota needed for dev, but set one for staging to match production.

If: Multiple teams on a shared cluster.
→
Use: One namespace per team with ResourceQuota, LimitRange, and RBAC. Kubernetes allows all Pod-to-Pod traffic by default, so apply a default-deny NetworkPolicy in each namespace and allow specific cross-namespace flows explicitly.

If: Running untrusted code (CI/CD agents, third-party apps).
→
Use: A separate cluster or sandbox runtime. Namespace-level isolation is insufficient. Consider virtual clusters with vCluster, or run the workloads on a separate pool of nodes with taints and tolerations.
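
The sandbox-runtime option can be sketched with a RuntimeClass. This assumes gVisor is already installed on the target nodes and registered with the container runtime under the handler name runsc; the namespace, Pod name, node label, and taint key are all hypothetical.

```yaml
# Sketch: route untrusted Pods to a sandboxed runtime on a dedicated node pool.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                    # must match a handler configured on the nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-job             # hypothetical name
  namespace: ci                   # hypothetical namespace
spec:
  runtimeClassName: gvisor
  automountServiceAccountToken: false   # no API credentials inside the sandbox
  nodeSelector:
    workload: untrusted           # hypothetical node label
  tolerations:
    - key: workload               # hypothetical taint on the isolated pool
      operator: Equal
      value: untrusted
      effect: NoSchedule
  containers:
    - name: job
      image: busybox:1.36
      command: ["sh", "-c", "echo sandboxed && sleep 60"]
```

If no node provides the runsc handler, the Pod will not start, which is usually the safer failure mode for untrusted code.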

● Production incident · POST-MORTEM · severity: high

The etcd Disk That Killed the Entire Cluster

Symptom

kubectl commands hang or timeout. New Pods stuck in Pending. Deployment rollouts never complete. API server logs show 'etcdserver: request timed out' errors. Controller-manager logs show leader election failures.

Assumption

The API server is overloaded, or the scheduler has crashed.

Root cause

etcd's data directory was on a network-attached EBS volume that had degraded to p99 write latency of 800ms (normal: 2ms). etcd requires sub-10ms disk writes for stable operation. The degraded disk caused the Raft consensus protocol to stall — the cluster could not commit new state changes. The API server, which depends on etcd for every operation, began queuing requests until it exhausted its connection pool. The scheduler and controller-manager, which watch etcd via the API server, received no updates and effectively froze.

Fix

1. Immediately migrate etcd to low-latency SSD storage: instance-local NVMe where possible, otherwise a dedicated provisioned-IOPS EBS volume. 2. Set etcd disk latency alerts at p99 > 10ms as critical. 3. Run etcd defragmentation on a schedule (etcdctl defrag), one member at a time. 4. Configure etcd auto-compaction (--auto-compaction-retention=8, which means 8 hours in the default periodic mode) to prevent unbounded data growth. 5. Monitor etcd member health with etcdctl endpoint health and etcdctl endpoint status.

Key lesson

etcd is the single point of failure for the entire cluster. Its disk performance is the cluster's ceiling.
Avoid general-purpose network-attached storage for etcd in production. Use local SSDs, or dedicated provisioned-IOPS volumes with latency alerting.
API server timeouts are often etcd problems, not API server problems. Trace downward, not upward.
etcd requires periodic defragmentation. Compaction frees space inside the database file, but the file does not shrink until it is defragmented, so it keeps growing toward the storage quota.
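
The p99 > 10ms alert from the fix list could be expressed as a Prometheus alerting rule along these lines. This is a sketch that assumes Prometheus already scrapes etcd's metrics endpoint; the group and alert names are illustrative, and the thresholds should be tuned to your hardware.

```yaml
# Sketch: alert when etcd's WAL fsync p99 stays above 10ms.
groups:
  - name: etcd-disk                       # hypothetical group name
    rules:
      - alert: EtcdWalFsyncSlow           # hypothetical alert name
        expr: |
          histogram_quantile(0.99,
            sum by (instance, le) (
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
            )
          ) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "etcd WAL fsync p99 above 10ms on {{ $labels.instance }}"
```

A second rule on etcd_disk_backend_commit_duration_seconds_bucket covers the backend commit path, which degrades under the same kind of disk latency.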

Production debug guide: symptom-driven investigation paths for the most common failure modes (6 entries).

Symptom · 01

Pod stuck in Pending state.

→

Fix

1. Run kubectl describe pod <name> and read the Events section. 2. Common causes: insufficient CPU/memory on any node (check kubectl describe nodes for Allocatable vs Allocated), PersistentVolumeClaim not bound, node affinity/taint mismatches, resource quotas exceeded. 3. If no events appear, the scheduler may be down — check kubectl get pods -n kube-system for kube-scheduler.

Symptom · 02

Pod stuck in CrashLoopBackOff.

→

Fix

1. Run kubectl logs <pod> --previous to see the logs from the crashed container (current logs may be empty). 2. Common causes: missing environment variables, failed health checks, OOMKill (check kubectl describe pod for Last State), misconfigured entrypoint. 3. If OOMKill, increase memory limits or fix the memory leak. Check kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState}'.

Symptom · 03

Pods cannot reach each other across nodes.

→

Fix

1. Verify the CNI plugin is healthy: kubectl get pods -n kube-system | grep calico (or flannel/weave). 2. Check if Pod CIDR ranges overlap between nodes: kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'. 3. Test Pod-to-Pod connectivity directly, bypassing Services: kubectl exec -it <pod> -- ping <pod-ip-on-another-node>. 4. If Pod IPs work but Services fail, verify kube-proxy is running (kubectl get pods -n kube-system | grep kube-proxy) and test kubectl exec -it <pod> -- curl <service-ip>:<port>.

Symptom · 04

Deployment rollout hangs — new ReplicaSet never becomes ready.

→

Fix

1. Check the new ReplicaSet: kubectl describe rs <new-rs-name>. 2. Look for Pods that are Pending or CrashLoopBackOff. 3. Check if the new image exists in the registry and if imagePullSecrets are configured. 4. If using rolling update with maxUnavailable=0 and the cluster has no spare capacity, new Pods cannot be scheduled. 5. Rollback: kubectl rollout undo deployment/<name>.
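
To make a stuck rollout fail visibly and not hang silently, the rollout settings can be made explicit. The sketch below is one possible configuration; the image reference, probe path, and replica count are placeholders, and only the payment-service name and port 8080 come from earlier examples in this guide.

```yaml
# Sketch: rolling update that never drops capacity and reports a stall after 5 minutes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 4
  progressDeadlineSeconds: 300    # mark the rollout as failed after 5 minutes without progress
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                 # needs spare capacity for one extra Pod
      maxUnavailable: 0           # old Pods stay until new ones are Ready
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: app
          image: registry.example.com/payment-service:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz      # assumed health endpoint
              port: 8080
            periodSeconds: 5
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
```

With progressDeadlineSeconds set, kubectl rollout status deployment/payment-service -n production exits with an error once the deadline passes, which makes the hang detectable in CI before anyone has to run a manual rollback.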

Symptom · 05

etcd high latency alerts firing — API server slow.

→

Fix

1. Check etcd latency: etcdctl endpoint health --write-out=table. 2. Check disk I/O on etcd nodes: iostat -x 1. 3. Check etcd database size: etcdctl endpoint status --write-out=table. 4. If disk is the bottleneck, migrate to local SSDs. 5. If the database is large, run defragmentation with etcdctl defrag, one member at a time, because defragmentation blocks that member while it runs.

Symptom · 06

PersistentVolumeClaim stuck in Pending.

→

Fix

1. Check if any PersistentVolume matches: kubectl get pv. 2. Describe the PVC: kubectl describe pvc <name>. Common causes: no PV available with matching accessModes and storageClassName, a storageClassName that does not exist, or a StorageClass with volumeBindingMode: WaitForFirstConsumer, where the PVC stays Pending by design until a Pod that uses it is scheduled. 3. If using dynamic provisioning, verify the storage provisioner pod is running and hasn't hit a quota or permission error.

★ Kubernetes Triage Cheat Sheet

First-response commands for common K8s production incidents.

Pod not starting, no events visible.

Immediate action

Check if the scheduler is running and if nodes have capacity.

Commands

kubectl get pods -n kube-system | grep scheduler

kubectl describe nodes | grep -A 5 'Allocated resources'

Fix now

If scheduler is down: check kube-system logs. If no capacity: scale the cluster or evict low-priority Pods.


Kubernetes Component Comparison

Component	Role	Failure Impact	Recovery
kube-apiserver	Validates and serves all API requests. Gateway to etcd.	No new deployments, scaling, or config changes. Existing Pods continue running.	Restart the process. If HA, load balancer routes to healthy replica.
etcd	Distributed key-value store. Single source of truth for all cluster state.	Cluster freezes: no state changes possible. If quorum is lost, etcd rejects writes until quorum is restored.	Restore from snapshot or replace failed member. Requires etcdctl expertise.
kube-scheduler	Assigns unscheduled Pods to nodes based on resource availability and constraints.	New Pods stuck in Pending. Existing Pods unaffected.	Restart the process. If leader election fails, check lease in etcd.
kube-controller-manager	Runs reconciliation loops for Deployments, ReplicaSets, Nodes, Endpoints, etc.	No cluster-level self-healing. Deleted or failed Pods are not replaced (the kubelet still restarts crashed containers on its own node). Scaling stops. Node failures not detected.	Restart the process. Controllers resume reconciliation from current state.
kubelet	Node agent. Pulls images, starts containers, reports node status to API server.	Pods on that node stop being managed. Node marked NotReady after 40s (default). Pods evicted after 5 minutes.	Restart kubelet. If node is unhealthy, cordoning and replacing the node may be necessary.
kube-proxy	Programs iptables/IPVS rules for Service load balancing on each node.	Existing rules stay in place but go stale: Service and Endpoint changes are no longer applied on that node, so traffic from its Pods can be sent to backends that no longer exist. Other nodes are unaffected.	Restart the process. Rules are rebuilt from current Service/Endpoint state.
CoreDNS	Cluster DNS. Resolves Service names to ClusterIPs.	Service DNS resolution fails. Pods can still reach other Pods by direct IP.	Restart CoreDNS Pods. Check ConfigMap for misconfiguration.

⚙ Quick Reference

7 commands from this guide

File	Command / Code	Purpose
four-essential-objects.yaml	apiVersion: v1	The 4 Essential Objects
control-plane-architecture.yaml	curl -k https://localhost:6443/healthz	Control Plane Architecture
check-data-plane.sh	systemctl status kubelet	Control Plane vs Data Plane Architecture
scheduler-configuration.yaml	apiVersion: v1	The Scheduler
networking-debug.yaml	kubectl get pods -n production -o wide	Pod Networking
storage-example.yaml	apiVersion: storage.k8s.io/v1	Kubernetes Storage
quota-and-limitrange.yaml	apiVersion: v1	Namespaces, Resource Quotas, and Multi-Tenancy

Key takeaways

You now understand what Kubernetes is and why it exists

You've seen it working in a real runnable example

Practice daily: the forge only works when it's hot

The reconciliation loop is the fundamental operating principle of every Kubernetes controller. Understanding it transforms debugging from trial-and-error to systematic investigation.

etcd is the single point of truth and the most common root cause of cluster-wide issues. Its disk latency is the cluster's ceiling.

The scheduler scores nodes: it does not bin-pack, predict load, or rebalance. A scheduling decision is permanent for the life of the Pod; placement only changes when the Pod is deleted and a replacement is scheduled.

Kubernetes networking is layered (CNI, kube-proxy, Ingress). Debug from the bottom up: Pod IP, ClusterIP, DNS, Ingress.

Resource requests drive scheduling; resource limits drive runtime enforcement. Setting requests=limits (Guaranteed QoS) gives the most predictable behavior.

Storage is decoupled from Pod lifecycle via PVs and PVCs. The reclaim policy determines whether data survives PVC deletion: set it to Retain for irreplaceable data.

Namespaces provide isolated environments, but true security requires RBAC, NetworkPolicies, and ResourceQuotas. Without quotas, a single app can starve the cluster.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain the Kubernetes reconciliation loop. How does it apply to a Deplo...

Q02 · SENIOR

What happens when you delete a Pod that belongs to a Deployment? Trace t...

Q03 · SENIOR

How does the kube-scheduler decide which node to place a Pod on? What ar...

Q04 · SENIOR

What is the difference between a Service's ClusterIP and the Pod IPs it ...

Q05 · SENIOR

A Pod is stuck in Pending. Walk me through your debugging process, from ...

Q06 · SENIOR

Explain etcd's role in the cluster. What happens if etcd loses quorum? H...

Q07 · SENIOR

What is the difference between requests and limits, and how do they affe...

Q08 · SENIOR

How would you design a zero-downtime deployment strategy using Kubernete...

Q01 of 08 · SENIOR

Explain the Kubernetes reconciliation loop. How does it apply to a Deployment managing a ReplicaSet managing Pods?

ANSWER

The reconciliation loop is a continuous observe-diff-act cycle. Each controller watches the API server (which persists state in etcd) and compares desired state with actual state. The Deployment controller reconciles the Deployment's replica count and Pod template against its ReplicaSets: when the template changes, it creates a new ReplicaSet and scales the old one down. The ReplicaSet controller reconciles the Pod count: if a Pod is deleted, it sees the count is below desired and creates a replacement. The scheduler assigns the new Pod to a node, and the kubelet starts it. Controllers react to watch events and also resync periodically, which makes Kubernetes self-healing without human intervention.


Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

✓ Verified · production tested

Last updated: July 19, 2026

2,466 articles · all by Naren
