Advanced Kubernetes Interview Questions — Internals, Edge Cases & Production Gotchas
- Control Plane request lifecycle: Auth -> Mutating Webhook -> Validation -> etcd -> Controllers -> Scheduler -> Kubelet
- etcd: Raft consensus, split-brain scenarios, compaction, and why disk latency kills clusters
- Networking: CNI overlay vs flat networking, kube-proxy iptables vs IPVS, NetworkPolicy enforcement
- Resource Management: Requests vs Limits, QoS classes, OOMKill behavior, CPU throttling
- Autoscaling: HPA algorithm, stabilization windows, KEDA, HPA/VPA conflict
- RBAC and Admission: Webhook chains, OPA/Gatekeeper, service account token risks
Pod stuck in Pending.

```shell
kubectl describe pod <pod> | grep -A 20 Events
kubectl describe nodes | grep -A 5 Allocatable -B 2
```

Namespace stuck in Terminating.

```shell
kubectl get namespace <ns> -o json | jq .spec.finalizers
kubectl api-resources --verbs=list -o name | xargs -I{} kubectl get {} -n <ns> --ignore-not-found -o json 2>/dev/null | jq '.items[] | select(.metadata.finalizers) | {kind, name, finalizers}'
```

Service returning 503.

```shell
kubectl get endpoints <service-name> -n <ns>
kubectl get pods -n <ns> -l app=<label> -o wide | grep -v Running
```

etcd cluster degraded.

```shell
etcdctl endpoint health --cluster --write-out=table
etcdctl endpoint status --write-out=table
```

RBAC permission denied errors in application logs.

```shell
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa-name> -n <ns>
kubectl get clusterrolebinding,rolebinding -A -o json | jq '.items[] | select(.subjects[]?.name=="<sa-name>") | .metadata.name'
```

Production Incident
A production namespace hung in Terminating for hours. A Service in it carried the cloud load balancer finalizer (service.kubernetes.io/load-balancer-cleanup). The cloud controller manager (CCM) was responsible for removing this finalizer after deleting the cloud load balancer. However, the CCM had been redeployed with a new service account that lacked IAM permissions to delete load balancers. The CCM silently failed to remove the finalizer, and Kubernetes refused to complete namespace deletion because finalizers were still present on resources within the namespace.
Resolution:
1. Removed the stuck finalizer manually: kubectl patch service <name> -p '{"metadata":{"finalizers":null}}'.
2. Verified the cloud load balancer was already deleted (no orphaned resources).
3. Namespace deletion completed immediately after finalizer removal.
4. Added monitoring for namespaces stuck in Terminating for more than 5 minutes.
Lesson: run kubectl get <resource> -o json | jq .metadata.finalizers to identify which controller is blocking deletion.

Production Debug Guide
Symptom-first investigation paths for senior-level Kubernetes failures.
- Namespace stuck in Terminating: run kubectl api-resources --verbs=list -o name | xargs -n 1 kubectl get -n <ns> --ignore-not-found -o json | jq '.items[] | select(.metadata.finalizers) | {kind: .kind, name: .metadata.name, finalizers: .metadata.finalizers}'. Patch or investigate each blocking resource.
- Service not receiving traffic on some nodes: check externalTrafficPolicy. If set to Local, traffic only routes to nodes with local pods. Check kube-proxy mode and logs.
- Deployment rollout stuck: run kubectl rollout status deployment/<name>. If maxUnavailable is 0 and a new pod cannot be scheduled, the rollout blocks forever. Check for resource quota limits, PDB conflicts, and node capacity.
- etcd degraded: run etcdctl endpoint health --cluster. Check disk latency on etcd nodes (iostat -x 1). High fsync latency causes Raft timeouts. Check network connectivity between etcd members.

Kubernetes has become the de facto operating system for cloud-native infrastructure. At senior and staff-level interviews, nobody is going to ask you what a Pod is. They want to know what happens inside the API server when you run kubectl apply, why your HPA isn't scaling when CPU is clearly spiking, or how etcd consistency guarantees affect your cluster's behaviour under partition.
The gap between 'I know Kubernetes' and 'I understand Kubernetes' comes down to internals. When something breaks at 3am — a node drains but Pods stay Pending, a Deployment rolls out but traffic never shifts, a namespace hangs in Terminating forever — the engineers who can diagnose and fix fast are the ones who understand the watch-loop reconciliation model, the scheduler predicates and priorities, and how the CNI interacts with kube-proxy.
This guide covers the failure modes, edge cases, and architectural decisions that surface in real senior/staff-level interviews at companies running Kubernetes at scale. Every question maps to a production incident you will eventually encounter.
The Anatomy of a Request: What Happens When You Run 'kubectl apply'?
A senior candidate must articulate the journey of a manifest from the CLI to the Kubelet. It isn't just 'the API server saves it.' The lifecycle involves Authentication/Authorization, Mutating Admission Webhooks (which might inject sidecars like Istio or Linkerd), Schema Validation, and finally, Validating Admission Webhooks (like OPA/Gatekeeper).
Once persisted in etcd, the Control Plane controllers see the state change via a watch event. The Deployment controller creates a ReplicaSet, which creates Pod objects. These Pods remain in a 'Pending' state with an empty nodeName until the Kube-Scheduler performs its two-step dance: Filtering (Predicates) to find capable nodes, and Scoring (Priorities) to find the best node. Only then does the Kubelet on the target node see the Pod and instruct the Container Runtime (CRI) to pull images and start containers.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: forge-app
  namespace: production
  labels:
    app: forge-api
    tier: backend
spec:
  containers:
    - name: forge-container
      image: io.thecodeforge/api:v1.2.0
      resources:
        requests:
          memory: "256Mi"
          cpu: "500m"
        limits:
          memory: "512Mi"
          cpu: "1"
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: forge-api
```
- Authentication: Service account tokens, OIDC, certificates.
- Authorization: RBAC, ABAC, Webhook authorizers.
- Mutating Webhooks: Istio sidecar injection, default resource limits, label injection.
- Validating Webhooks: OPA/Gatekeeper policies, image signature verification, namespace quotas.
- etcd: Only persisted after all gates pass. The API Server is the only component that writes to etcd.
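The gate sequence above can be sketched as a tiny pipeline. The webhook functions here (inject_sidecar, require_resource_limits) are hypothetical stand-ins for illustration, not real admission webhooks:

```python
# Sketch of the admission gate sequence; webhook functions are hypothetical.

def inject_sidecar(obj):
    # Mutating webhook: e.g. Istio-style sidecar injection
    return dict(obj, containers=obj["containers"] + ["istio-proxy"])

def require_resource_limits(obj):
    # Validating webhook: e.g. an OPA/Gatekeeper-style policy
    return "limits" in obj

def admit(obj, mutating, validating):
    for webhook in mutating:        # 1. mutating webhooks run first, in sequence
        obj = webhook(obj)
    # 2. (schema validation would happen here)
    for webhook in validating:      # 3. validating webhooks see the *mutated* object
        if not webhook(obj):
            raise ValueError("admission denied")
    return obj                      # 4. only now does the API server persist to etcd

persisted = admit({"containers": ["app"], "limits": {"cpu": "1"}},
                  [inject_sidecar], [require_resource_limits])

denied = False
try:
    admit({"containers": ["app"]}, [inject_sidecar], [require_resource_limits])
except ValueError:
    denied = True
```

Note the ordering consequence: validating webhooks judge the object *after* mutation, which is why a policy can reject a sidecar that a mutating webhook injected.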
Production tip: use failurePolicy: Ignore for non-critical webhooks, run webhook deployments with multiple replicas for HA, and monitor webhook latency. Never set failurePolicy: Fail on a webhook that is not absolutely critical.

Networking Internals: Services, Kube-Proxy, and the CNI
A Service in Kubernetes is not a process; it's a virtual IP (VIP) managed by kube-proxy. You should be prepared to explain the difference between the legacy iptables mode and the modern IPVS mode. While iptables uses sequential rule checking (O(n) complexity), IPVS uses hash tables (O(1) complexity), making it significantly more performant for clusters with thousands of services.
Furthermore, the CNI (Container Network Interface) is responsible for the 'plumbing' — assigning IPs to Pods and ensuring they can talk across nodes. If an interviewer asks why a Pod can't reach another Pod, your answer should start with the CNI overlay (Calico/Cilium) and move to NetworkPolicies, rather than just 'checking the app logs.'
```shell
# TheCodeForge Network Debugging Toolkit
# Package: io.thecodeforge.k8s

# 1. Check if the Service IP is active in iptables
iptables -L -t nat | grep FORGE-SERVICE-NAME

# 2. Inspect the CNI logs on the specific node
journalctl -u kubelet | grep cni

# 3. Test Pod-to-Pod connectivity bypassing the Service VIP
kubectl exec -it debug-pod -- curl <target-pod-ip>:8080/healthz

# 4. Check kube-proxy mode and health
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
kubectl logs -n kube-system -l k8s-app=kube-proxy | tail -50

# 5. Verify NetworkPolicy is not blocking traffic
kubectl get networkpolicy -n <namespace> -o yaml
# If policies exist, check ingress/egress rules against the Pod's labels
```
Example DNAT rule for a Service VIP (kube-dns here):

```
DNAT  tcp  --  0.0.0.0/0  10.96.0.10  tcp dpt:53
```
- iptables: Simple, well-understood, but O(n) rule matching. No native load balancing algorithms.
- IPVS: O(1) hash matching, native LB algorithms (rr, lc, sh), but more complex debugging.
- eBPF (Cilium): Bypasses both iptables and IPVS entirely. Kernel-level packet processing. The future.
- kube-proxy is being replaced by eBPF-based CNIs in high-performance clusters.
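The complexity difference is easy to see with toy data: the iptables path scans an ordered rule list, while IPVS resolves a VIP with a single hash lookup. This is a sketch with made-up IPs, not real kube-proxy internals:

```python
# Toy comparison of kube-proxy lookup models (made-up IPs, not real rules).

rules = [(f"10.96.0.{i}", f"pod-{i}") for i in range(1, 1001)]  # iptables: ordered chain

def iptables_lookup(vip):
    for ip, backend in rules:      # O(n): worst case scans every rule per connection
        if ip == vip:
            return backend
    return None

ipvs_table = dict(rules)           # IPVS: kernel hash table keyed by VIP

def ipvs_lookup(vip):
    return ipvs_table.get(vip)     # O(1): single hash probe regardless of service count

assert iptables_lookup("10.96.0.999") == ipvs_lookup("10.96.0.999") == "pod-999"
```

With thousands of Services, the scan cost is paid per connection setup in iptables mode, which is exactly why large clusters move to IPVS or eBPF.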
externalTrafficPolicy: Cluster (default) distributes traffic evenly across all nodes, then to pods. This loses the client source IP. externalTrafficPolicy: Local only routes traffic to nodes that have local pods, preserving the source IP but risking uneven load distribution if pods are not evenly spread. This is a common interview question and a common production misconfiguration.

etcd Internals: Raft, Consistency, and Failure Modes
etcd is the single source of truth for all Kubernetes cluster state. It uses the Raft consensus algorithm to replicate data across an odd number of members (typically 3 or 5). Understanding Raft is essential for diagnosing cluster-wide failures.
```shell
# etcd Diagnostic Commands
# Package: io.thecodeforge.k8s

# 1. Check cluster member health
etcdctl endpoint health --cluster --write-out=table

# 2. Check member status (leader, DB size, Raft index)
etcdctl endpoint status --write-out=table

# 3. Check for alarm conditions (e.g., NOSPACE)
etcdctl alarm list

# 4. Defragment a member (reclaims space after compaction)
etcdctl defrag --endpoints=<endpoint>

# 5. Compact old revisions (prevents unbounded DB growth)
etcdctl compact $(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')

# 6. Snapshot backup
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db
```
- Raft leader: Elected by members. All writes go through the leader.
- Heartbeat interval: The leader sends heartbeats (default 100ms). If a follower hears nothing for the election timeout (default 1000ms), it starts a new election.
- Disk latency: etcd requires fsync on every write. Slow disks cause leader elections and cluster instability.
- Compaction: Old revisions accumulate. Periodic compaction and defragmentation are required to prevent unbounded growth.
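The quorum arithmetic behind these failure modes is standard majority math and worth having at your fingertips:

```python
# Majority-quorum arithmetic used by Raft / etcd cluster sizing.

def quorum(members: int) -> int:
    return members // 2 + 1                 # writes commit only with a majority

def tolerated_failures(members: int) -> int:
    return members - quorum(members)        # members that can die while staying writable

assert quorum(3) == 2 and tolerated_failures(3) == 1
assert quorum(5) == 3 and tolerated_failures(5) == 2
# An even member count buys nothing: 4 members tolerate the same 1 failure as 3,
# which is why etcd clusters use odd sizes.
assert tolerated_failures(4) == tolerated_failures(3) == 1
```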
--quota-backend-bytes (default 2GB, recommended maximum 8GB) is the hard limit on the database size. If exceeded, etcd raises a NOSPACE alarm and rejects all writes, effectively halting the cluster. Monitor etcd_mvcc_db_total_size_in_bytes and alert at 75%. Run compaction and defragmentation regularly. In large clusters with many ConfigMaps/Secrets, etcd can grow quickly. Consider externalizing large data (e.g., Helm charts) to object storage.

Resource Management: Requests, Limits, and QoS Classes
Resource requests and limits are not just about preventing OOMKills. They define the contract between the application and the scheduler. Requests are used for scheduling decisions (can this Pod fit on this node?). Limits are enforced by the kernel cgroup (can this Pod use more than allocated?).
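A minimal sketch of that contract: the scheduler sums declared *requests* (not live usage) against node allocatable, so a nearly idle node can still reject a Pod. Toy millicore/MiB numbers, not the scheduler's real data structures:

```python
# Toy model: scheduling is driven by declared requests, not observed usage.

def fits(node_allocatable, scheduled_requests, pod_request):
    used = {k: sum(r[k] for r in scheduled_requests) for k in node_allocatable}
    return all(used[k] + pod_request[k] <= node_allocatable[k] for k in node_allocatable)

node = {"cpu_m": 4000, "mem_mi": 8192}        # 4 cores, 8Gi allocatable
running = [{"cpu_m": 3500, "mem_mi": 2048}]   # requests of Pods already bound here
# Rejected even if those Pods are idle: requests are the reservation.
assert not fits(node, running, {"cpu_m": 600, "mem_mi": 256})
assert fits(node, running, {"cpu_m": 500, "mem_mi": 256})
```

This is why chronic over-requesting starves a cluster long before any real CPU pressure appears.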
```yaml
# QoS Class: Guaranteed (requests == limits for all containers)
# Protected longest during eviction. Never OOMKilled unless node is under extreme pressure.
apiVersion: v1
kind: Pod
metadata:
  name: critical-service
  namespace: production
spec:
  containers:
    - name: app
      image: io.thecodeforge/api:stable
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"      # Equal to request = Guaranteed QoS
          memory: "512Mi"  # Equal to request = Guaranteed QoS
---
# QoS Class: Burstable (requests < limits)
# Medium priority. Can burst above request but may be evicted under pressure.
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend
  namespace: production
spec:
  containers:
    - name: app
      image: io.thecodeforge/frontend:latest
      resources:
        requests:
          cpu: "200m"
          memory: "256Mi"
        limits:
          cpu: "1"       # Can use up to 1 CPU core
          memory: "1Gi"  # Can use up to 1Gi RAM
---
# QoS Class: BestEffort (no requests or limits set)
# Lowest priority. First to be evicted. Not recommended for production.
apiVersion: v1
kind: Pod
metadata:
  name: debug-tool
  namespace: development
spec:
  containers:
    - name: debug
      image: io.thecodeforge/debug:latest
      # No resources defined = BestEffort QoS
```
- Guaranteed: requests == limits for all containers. Evicted last under node pressure.
- Burstable: requests < limits (or only requests set). Medium priority.
- BestEffort: No requests or limits. Lowest priority. First to be evicted.
- CPU throttling: If CPU limit is set, the container is throttled when it exceeds the limit. This is NOT an eviction — it is a performance penalty.
- Memory OOMKill: If memory usage exceeds the limit, the kernel kills the container (OOMKill, exit code 137).
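The classification rules above can be sketched as a small function. This is a simplified container shape for illustration, not the kubelet's actual implementation:

```python
# Simplified QoS classification (one requests/limits dict per container;
# not the kubelet's exact code).

def qos_class(containers):
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    if all(
        c.get("limits", {}).get(res) is not None
        and c.get("requests", {}).get(res, c["limits"][res]) == c["limits"][res]
        for c in containers
        for res in ("cpu", "memory")
    ):
        return "Guaranteed"
    return "Burstable"

assert qos_class([{"requests": {"cpu": "500m", "memory": "512Mi"},
                   "limits":   {"cpu": "500m", "memory": "512Mi"}}]) == "Guaranteed"
assert qos_class([{"requests": {"cpu": "200m"}, "limits": {"cpu": "1"}}]) == "Burstable"
assert qos_class([{}]) == "BestEffort"
```

Note the subtlety encoded in the Guaranteed branch: *every* container needs both CPU and memory limits, with requests equal to limits (requests default to limits when omitted). One sidecar without limits silently demotes the whole Pod to Burstable.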
Monitor CPU throttling via container_cpu_cfs_throttled_periods_total in cAdvisor metrics.

RBAC, Service Accounts, and Admission Control
RBAC (Role-Based Access Control) is the primary authorization mechanism in Kubernetes. It defines who (Subject) can do what (Verb) on which resources (Resource) in which scope (Namespace or Cluster). Understanding RBAC is critical for security and for debugging 'access denied' errors.
```yaml
# Least-privilege RBAC for a microservice
# Package: io.thecodeforge.k8s

# 1. Dedicated ServiceAccount (not default)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service
  namespace: production
automountServiceAccountToken: false  # Disable unless API access needed
---
# 2. Namespace-scoped Role with minimal permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: order-service-role
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["order-service-config"]  # Only specific configmap
    verbs: ["get", "watch"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["db-credentials"]  # Only specific secret
    verbs: ["get"]
---
# 3. Bind Role to ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: order-service-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: order-service
    namespace: production
roleRef:
  kind: Role
  name: order-service-role
  apiGroup: rbac.authorization.k8s.io
```
- Role: Namespace-scoped. RoleBinding binds it to subjects within the namespace.
- ClusterRole: Cluster-scoped. ClusterRoleBinding binds it to subjects across all namespaces.
- ServiceAccount: The identity for a Pod. The default SA's token is mounted into every Pod unless automountServiceAccountToken: false is set.
- Aggregated ClusterRoles: Combine multiple ClusterRoles using label selectors. Used by operators to extend permissions dynamically.
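Because RBAC is purely additive, evaluation is just a union over allow rules. A hypothetical evaluator (can_i is an illustrative helper, not the API server's code), using rules shaped like the Role manifest above:

```python
# Hypothetical RBAC evaluator: access is the union of allow rules; nothing subtracts.

def can_i(rules, verb, resource, name=None):
    return any(
        verb in rule["verbs"]
        and resource in rule["resources"]
        and (not rule.get("resourceNames") or name in rule["resourceNames"])
        for rule in rules
    )

rules = [  # mirrors the Role manifest above
    {"verbs": ["get", "watch"], "resources": ["configmaps"],
     "resourceNames": ["order-service-config"]},
    {"verbs": ["get"], "resources": ["secrets"], "resourceNames": ["db-credentials"]},
]

assert can_i(rules, "get", "configmaps", "order-service-config")
assert not can_i(rules, "get", "configmaps", "other-config")    # never granted
assert not can_i(rules, "delete", "secrets", "db-credentials")  # "delete" was simply never allowed; no rule can revoke it either
```

The last line is the interview point: there is no deny rule, so "can you deny a granted permission?" has the answer "only by removing the grant or rejecting the request in an admission webhook."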
Production hardening: set automountServiceAccountToken: false as the namespace default, create dedicated ServiceAccounts per workload, and audit ClusterRoleBindings regularly with kubectl auth can-i --list --as=system:serviceaccount:<ns>:<sa>.

Scheduler Internals: Filtering, Scoring, and Custom Schedulers
The Kubernetes scheduler is a control loop that watches for Pods with an empty nodeName and assigns them to nodes. It does not actually run Pods — it only sets the nodeName field, and the kubelet on that node picks up the Pod. The scheduler's decision process has two phases: Filtering (formerly Predicates) and Scoring (formerly Priorities).
```java
// Simplified scheduler decision model
// Package: io.thecodeforge.k8s.scheduling
package io.thecodeforge.k8s.scheduling;

import java.util.List;
import java.util.Map;

public class SchedulerDecisionModel {

    /** Minimal stand-in for the Pod API object so the sketch compiles. */
    public static class Pod {}

    /**
     * Phase 1: Filtering — Eliminate nodes that cannot run the Pod.
     * Filters are applied in order. If no nodes pass, the Pod stays Pending.
     */
    public List<String> filterNodes(List<String> allNodes, Pod pod) {
        return allNodes.stream()
            .filter(node -> hasEnoughResources(node, pod))   // NodeResourcesFit
            .filter(node -> matchesNodeAffinity(node, pod))  // NodeAffinity
            .filter(node -> toleratesTaints(node, pod))      // TaintToleration
            .filter(node -> matchesPodTopology(node, pod))   // PodTopologySpread
            .filter(node -> hasVolumeCapacity(node, pod))    // VolumeBinding
            .toList();
    }

    /**
     * Phase 2: Scoring — Rank feasible nodes by desirability.
     * Each scoring plugin assigns 0-100 points. Scores are summed.
     * The node with the highest total score wins.
     */
    public Map<String, Integer> scoreNodes(List<String> feasibleNodes, Pod pod) {
        // Simplified: In reality, each plugin scores independently
        return feasibleNodes.stream()
            .collect(java.util.stream.Collectors.toMap(
                node -> node,
                node -> scoreResourceBalancing(node, pod)  // NodeResourcesBalancedAllocation
                      + scorePodSpread(node, pod)          // PodTopologySpread
                      + scoreInterPodAffinity(node, pod)   // InterPodAffinity
                      + scoreImageLocality(node, pod)      // ImageLocality
            ));
    }

    private boolean hasEnoughResources(String node, Pod pod) { return true; }
    private boolean matchesNodeAffinity(String node, Pod pod) { return true; }
    private boolean toleratesTaints(String node, Pod pod) { return true; }
    private boolean matchesPodTopology(String node, Pod pod) { return true; }
    private boolean hasVolumeCapacity(String node, Pod pod) { return true; }
    private int scoreResourceBalancing(String node, Pod pod) { return 50; }
    private int scorePodSpread(String node, Pod pod) { return 50; }
    private int scoreInterPodAffinity(String node, Pod pod) { return 50; }
    private int scoreImageLocality(String node, Pod pod) { return 50; }
}
```
When a Pod will not schedule, run kubectl describe pod <name> and read the scheduling failure event. It tells you exactly which filter failed.
- NodeResourcesFit: Checks if the node has enough CPU/memory for the Pod's requests.
- NodeAffinity: Matches nodeSelector and nodeAffinity rules.
- TaintToleration: Ensures the Pod tolerates all taints on the node.
- PodTopologySpread: Enforces topology spread constraints (zone, hostname).
- VolumeBinding: Ensures required PVs can be bound on the node.
- ImageLocality: Prefers nodes that already have the container image cached.
Use topologySpreadConstraints or podAntiAffinity to force spread. Also, the scheduler's --percentage-of-nodes-to-score flag (adaptive by default, around 50% for small clusters) limits scoring to a subset of feasible nodes for performance. In small clusters, set this to 100% to ensure optimal placement.

Probes Deep Dive: Liveness, Readiness, and Startup
Probes are the kubelet's mechanism for monitoring container health. Misconfigured probes are one of the most common causes of production incidents: liveness probes that kill healthy-but-slow containers, readiness probes that flap during cache warm-up, and missing startup probes that cause crash loops on legacy applications.
```yaml
# Production-grade probe configuration
# Package: io.thecodeforge.k8s
apiVersion: v1
kind: Pod
metadata:
  name: api-server
  namespace: production
spec:
  containers:
    - name: api
      image: io.thecodeforge/api:3.0.0
      # Startup probe: Gates liveness/readiness until app boots
      # Critical for apps with slow startup (>30s)
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 30  # 30 * 5s = 150s max startup time
        successThreshold: 1
      # Liveness probe: Detects deadlocks and hung processes
      # Only active after startup probe succeeds
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3  # 3 failures = restart after 30s
        successThreshold: 1
        timeoutSeconds: 5
      # Readiness probe: Controls traffic routing
      # Failing = removed from Service endpoints
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        failureThreshold: 2  # 2 failures = remove from endpoints after 10s
        successThreshold: 1  # 1 success = add back to endpoints
        timeoutSeconds: 3
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
```
- Startup probe: Only runs during boot. Gates liveness/readiness.
- Liveness probe: Runs continuously. Failure = container restart.
- Readiness probe: Runs continuously. Failure = remove from Service endpoints.
- Probe types: httpGet, tcpSocket, exec (command).
- timeoutSeconds: Keep it below periodSeconds; any probe attempt that runs longer than timeoutSeconds counts as a failure.
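The timing behavior above reduces to one product. A quick sketch of worst-case reaction times, using the same settings as the manifest earlier in this section:

```python
# Worst-case probe reaction time is roughly periodSeconds * failureThreshold.

def reaction_seconds(period_seconds, failure_threshold):
    """Approximate time from first failure until the probe's action fires."""
    return period_seconds * failure_threshold

assert reaction_seconds(5, 30) == 150  # startup: up to 150s boot budget
assert reaction_seconds(10, 3) == 30   # liveness: restart ~30s after a hang begins
assert reaction_seconds(5, 2) == 10    # readiness: out of endpoints after ~10s
```

Tuning is a trade: shrink the product for faster detection, grow it for tolerance of slow responses; getting it wrong in either direction is how probe-induced outages happen.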
| Aspect | Liveness Probe | Readiness Probe | Startup Probe |
|---|---|---|---|
| Primary Goal | Detect deadlocks and hung processes | Control traffic routing to the Pod | Gate liveness/readiness until boot completes |
| Failure Action | Kubelet kills the container; triggers restart | Pod removed from Service endpoints; no traffic | If it fails, container is restarted like liveness |
| Success Action | Container continues running | Pod added to Service endpoints; receives traffic | Liveness and readiness probes are activated |
| Runs When | After startup probe succeeds (or immediately if no startup probe) | After startup probe succeeds (or immediately if no startup probe) | Immediately when container starts |
| Typical Use Case | Catching deadlocks, memory leaks, infinite loops | Waiting for cache warm-up, DB connection pool init | Legacy apps with 2+ minute startup times |
| Failure Threshold | 3 (default) — restart after 3 failures | 3 (default) — remove from endpoints after 3 failures | 30 (recommended) — allows up to 150s startup with 5s period |
| Common Mistake | Checking downstream dependencies (DB, cache) — causes cascading restarts | Too aggressive — causes endpoint flapping during transient load | Missing entirely — causes CrashLoopBackOff for slow-starting apps |
🎯 Key Takeaways
- Kubernetes is a state-reconciliation engine; controllers constantly work to drive 'Current State' toward the 'Desired State' stored in etcd.
- The Control Plane request flow (Auth -> Mutating -> Validating -> Etcd) is the gatekeeper of cluster stability.
- Resource management isn't just about avoiding OOMKills; it's about defining 'Requests' accurately so the Scheduler can make intelligent placement decisions.
- Networking in K8s relies on a combination of the CNI (Pod-to-Pod) and kube-proxy (Service abstraction) to handle the ephemeral nature of IPs.
- etcd is the single point of failure. Raft consensus prevents split-brain but requires quorum. Disk latency is the most common cause of instability.
- RBAC is additive with no deny mechanism. Least-privilege design, dedicated ServiceAccounts, and admission webhooks for policy enforcement are the production standard.
- Probes are the difference between a resilient service and a cascading failure. Liveness checks internal health only. Readiness can check dependencies. Startup gates both.
- Every production Kubernetes failure has a root cause in one of: etcd, scheduler, kubelet, CNI, or admission webhooks. Knowing which component to check first is the skill.
Interview Questions on This Topic
- Explain the 'Etcd Split Brain' scenario. How does the Raft consensus algorithm handle a network partition between three master nodes?
- A Pod is stuck in 'ImagePullBackOff', but you've verified the image exists in the registry. What are the next three things you check? (Hint: Node IAM roles, ImagePullSecrets, and Disk Pressure.)
- Describe the 'Thundering Herd' problem in the context of Horizontal Pod Autoscaling (HPA) and how cool-down periods or scale-down stabilization windows mitigate it.
- What is the difference between an Ingress Controller and a Service of type LoadBalancer? When would you choose one over the other for a multi-tenant cluster?
- Trace the full lifecycle of a kubectl apply command from the CLI to a running container. What admission webhooks are involved?
- Explain the difference between iptables and IPVS kube-proxy modes. When would you switch to IPVS?
- A namespace is stuck in Terminating. Walk me through your debugging process and how you would resolve it.
- How does the Kubernetes scheduler decide where to place a Pod? What happens if no nodes pass the filtering phase?
- What is the difference between a liveness probe and a readiness probe? What happens if you use a liveness probe to check database connectivity?
- Explain QoS classes in Kubernetes. How does the kubelet decide which Pods to evict when a node is under resource pressure?
- How does RBAC evaluation work? Can you deny a permission that has been granted by another role?
- What is envelope encryption for Kubernetes Secrets? Why is base64 encoding not sufficient?
- Describe the interaction between HPA, VPA, and Cluster Autoscaler. What happens if HPA and VPA both operate on CPU?
- What is a PodDisruptionBudget and why is it critical during cluster upgrades?
- How would you design a zero-downtime deployment strategy in Kubernetes? What configuration parameters matter?
- Explain the admission webhook chain. What happens if a mutating webhook times out?
- What is the 'thundering herd' problem during HPA scale-up and how do stabilization windows solve it?
- How do you debug a Pod that is stuck in Pending with no scheduling events?
- What is the difference between a Deployment, a StatefulSet, and a DaemonSet? When would you use each?
- Explain how NetworkPolicies work. What happens if you create a deny-all policy without allowing DNS?
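Several of the questions above (HPA scaling, thundering herd, stabilization windows) hinge on the documented HPA core formula, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). A minimal sketch of the formula and the scale-down stabilization behavior:

```python
import math

# Documented HPA core formula plus scale-down stabilization behavior.

def desired_replicas(current, current_metric, target_metric):
    # desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
    return math.ceil(current * current_metric / target_metric)

def stabilized_scale_down(window_recommendations):
    # Scale-down applies the *highest* recommendation seen during the
    # stabilization window, so a transient dip cannot trigger a deep scale-in.
    return max(window_recommendations)

assert desired_replicas(4, 90, 60) == 6         # CPU at 90% vs 60% target: scale out
assert desired_replicas(10, 30, 60) == 5        # raw recommendation halves the fleet...
assert stabilized_scale_down([10, 8, 5]) == 10  # ...but the window holds it at 10
```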
Frequently Asked Questions
Why is my Pod OOMKilled even if the node has plenty of free RAM?
OOMKill (Exit Code 137) is enforced at the container level by the Cgroup, not the node level. If your container's memory usage exceeds its defined 'Limit' in the YAML, the kernel will kill the process to protect the rest of the node, regardless of how much 'free' RAM the physical machine has.
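A one-line model makes the point: the node's free memory never enters the cgroup's decision (toy MiB numbers for illustration):

```python
# The OOM decision is made against the container's cgroup limit;
# node-level free memory is irrelevant (toy MiB numbers).

def oom_killed(container_usage_mi, container_limit_mi, node_free_mi):
    # node_free_mi deliberately unused: the cgroup does not consult it
    return container_usage_mi > container_limit_mi

assert oom_killed(600, 512, node_free_mi=30000)    # killed despite ~30Gi free on the node
assert not oom_killed(400, 512, node_free_mi=100)  # survives even on a packed node
```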
What is the difference between a 'Taint' and a 'NodeSelector'?
A NodeSelector (or NodeAffinity) is a preference or requirement for a Pod to go to a specific node (the Pod wants the Node). A Taint is the opposite: it allows a Node to repel a set of Pods (the Node rejects the Pod) unless those Pods have a specific 'Toleration'.
What happens if etcd goes down?
If etcd is unavailable, the cluster becomes 'read-only.' Existing workloads will continue to run, but no new Pods can be scheduled, no Deployments can be updated, and the API server will return 500 errors for any write operations. High availability for etcd (3 or 5 nodes) is critical for production clusters.
How does the scheduler decide where to place a Pod?
The scheduler uses a two-phase process: Filtering (eliminates nodes that cannot run the Pod based on resource availability, taints, affinity, topology constraints) and Scoring (ranks feasible nodes by desirability using resource balance, image locality, pod spread). The highest-scoring node wins.
What is the admission webhook chain and why does it matter?
Every Kubernetes write request passes through: Authentication -> Authorization -> Mutating Admission Webhooks -> Schema Validation -> Validating Admission Webhooks -> etcd. Mutating webhooks can modify objects (e.g., inject sidecars). Validating webhooks can reject objects (e.g., OPA policies). If a webhook is unavailable and has failurePolicy: Fail, the entire operation is rejected.
How do I debug a namespace stuck in Terminating?
List all resources in the namespace and check for finalizers: kubectl api-resources --verbs=list -o name | xargs -n 1 kubectl get -n <ns> --ignore-not-found -o json | jq '.items[] | select(.metadata.finalizers)'. Each finalizer blocks deletion until the responsible controller acknowledges cleanup. If the controller is broken, you may need to patch the finalizer to null manually.
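The jq filter's logic is easy to mirror in a few lines over a sample payload (the JSON below is made-up sample data, not live cluster output):

```python
import json

# Sample payload standing in for `kubectl get ... -o json` output (made up).
payload = json.loads("""
{"items": [
  {"kind": "Service",
   "metadata": {"name": "web",
                "finalizers": ["service.kubernetes.io/load-balancer-cleanup"]}},
  {"kind": "ConfigMap", "metadata": {"name": "app-config"}}
]}
""")

# Same selection the jq expression performs: keep only items with finalizers.
blocking = [
    {"kind": item["kind"],
     "name": item["metadata"]["name"],
     "finalizers": item["metadata"]["finalizers"]}
    for item in payload["items"]
    if item["metadata"].get("finalizers")
]

assert blocking == [{"kind": "Service", "name": "web",
                     "finalizers": ["service.kubernetes.io/load-balancer-cleanup"]}]
```

Each entry in the result names the controller-owned finalizer that is blocking deletion, which tells you which controller to investigate before resorting to a manual patch.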
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.