Kubernetes HPA Deep Dive: Autoscaling Internals, Gotchas & Production Tuning
- HPA is a proportional control loop, not a binary threshold trigger. It scales based on the ratio of current to target metric.
- Stabilization windows are the dampening mechanism. Asymmetric windows (fast up, slow down) are the production standard.
- Custom metrics make HPA application-aware but add a dependency chain. Always include CPU/memory as fallback.
- HPA runs a control loop every 15 seconds (default) that reads metrics, computes desired replicas, and scales.
- Algorithm: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
- Supports CPU, memory, custom metrics (Prometheus), and external metrics (cloud provider queues).
- Scaling behavior is configurable via the behavior field: separate policies for scale-up and scale-down.
- Faster polling = more responsive but higher API server and metrics-server load.
- Aggressive scale-up = risk of over-provisioning and cluster resource exhaustion.
- Conservative scale-down = cost waste but stability during traffic dips.
- Setting target CPU to 80% without understanding that CPU requests must be set on the container. Without requests, HPA has no denominator and will not function.
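The "no denominator" point can be made concrete with a short sketch. This is illustrative code, not the HPA controller source; the class and method names are mine. Utilization is total usage divided by total requests, so with no requests set the percentage is undefined:

```java
// Illustrative sketch of how HPA derives average CPU utilization.
public class UtilizationSketch {

    /**
     * Average utilization = (sum of pod CPU usage) / (sum of pod CPU requests) * 100.
     * If requests are unset, the denominator is zero and HPA reports
     * the metric as <unknown> instead of a percentage.
     */
    public static int averageUtilizationPercent(long totalUsageMillicores,
                                                long totalRequestsMillicores) {
        if (totalRequestsMillicores == 0) {
            throw new IllegalStateException(
                "CPU requests not set: HPA cannot compute utilization");
        }
        return (int) Math.round(100.0 * totalUsageMillicores / totalRequestsMillicores);
    }

    public static void main(String[] args) {
        // 3 pods using 400m each (1200m total) against 500m requests each (1500m total)
        System.out.println(averageUtilizationPercent(1200, 1500)); // 80
    }
}
```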
- Symptom: HPA showing unknown or missing metrics.
  - `kubectl top pods -n <namespace>`
  - `kubectl describe hpa <hpa-name> -n <namespace>`
- Symptom: HPA not scaling despite load.
  - `kubectl get hpa <hpa-name> -n <namespace> -o yaml | grep -A 20 status`
  - `kubectl get events -n <namespace> --field-selector involvedObject.name=<hpa-name>`
- Symptom: HPA flapping (rapid scale up/down).
  - `kubectl get hpa <hpa-name> -o jsonpath='{.spec.behavior}'`
  - `kubectl get hpa <hpa-name> -o jsonpath='{.status.lastScaleTime}'`
- Symptom: Custom metrics not feeding HPA.
  - `kubectl get apiservice v1beta1.custom.metrics.k8s.io -o yaml`
  - `kubectl logs -n custom-metrics deploy/prometheus-adapter`
Production Incident
The scale-up policy was pods: 4 with periodSeconds: 15, meaning HPA could add 4 pods every 15 seconds. The new pods started with low CPU (cold start), which pulled the average below the 50% threshold, triggering scale-down. Once pods terminated, CPU spiked again, triggering scale-up. The cycle repeated indefinitely. The core issue: the scale-up stabilization window was 0 seconds (the default), so HPA reacted immediately to every metric change without dampening.
The fix:
1. Set behavior.scaleUp.stabilizationWindowSeconds: 120 to prevent rapid scale-up during pod initialization.
2. Changed behavior.scaleUp.policies from pods: 4 to percent: 50 for proportional scaling.
3. Set behavior.scaleDown.stabilizationWindowSeconds: 300 to be conservative on scale-down.
4. Added a readiness probe with initialDelaySeconds: 30 so pods were not counted in metrics until fully warm.
Key lessons: configure stabilizationWindowSeconds for both scale-up and scale-down, and remember that readiness probes gate when a pod is counted in the HPA metric calculation; delay readiness until warm-up completes.
Production Debug Guide
A symptom-first investigation path for HPA misbehavior.
- HPA reports <unknown> for metric values. → Check that metrics-server is running and healthy. Verify that the target deployment has CPU/memory requests set. Without requests, HPA cannot calculate utilization percentages.
- HPA not scaling up despite load. → Check the HPA conditions (kubectl describe hpa). Look for 'failed to get cpu utilization' or 'unable to fetch metrics'. Verify metrics-server is collecting data. Check if maxReplicas has been reached.
- HPA not scaling down. → Check the behavior.scaleDown policy. Default scale-down stabilization window is 300 seconds (5 minutes). Verify the metric is actually below the target after stabilization. Check if minReplicas has been reached.
- HPA flapping. → Tune stabilization windows and policy periodSeconds to reduce polling frequency. Check if cold-start pods are skewing metrics.
- Custom metrics not feeding HPA. → Verify the metrics adapter APIService is available (kubectl get apiservice). Check adapter configuration for the metric query. Ensure the metric label selectors match the deployment's pods.

Every production system eventually hits the same wall: traffic is unpredictable, and over-provisioning is expensive while under-provisioning is catastrophic. A Black Friday spike, a viral tweet, a nightly batch job — any of these can kneecap a statically-sized deployment in minutes.
Kubernetes Horizontal Pod Autoscaler (HPA) solves the reactive scaling problem by continuously watching resource metrics and adjusting pod replica counts to match demand. But the naive 'just set CPU threshold to 80%' approach breaks in subtle and painful ways in production — flapping deployments, ignored metrics, race conditions with the Cluster Autoscaler, and custom metrics that silently stop working.
This is not a getting-started guide. It is for engineers who need to understand the HPA algorithm at the formula level, how stabilization windows prevent flapping, how to wire up custom metrics via Prometheus and KEDA, how HPA interacts with VPA and Cluster Autoscaler, and the production mistakes that wake senior engineers at 3am.
What is Kubernetes HPA — Autoscaling?
Kubernetes HPA (Horizontal Pod Autoscaler) is a control loop that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. It is the primary mechanism for reactive horizontal scaling in Kubernetes. HPA does not add nodes — it adds pods. If pods cannot be scheduled due to insufficient node capacity, the Cluster Autoscaler is responsible for adding nodes.
```yaml
# Basic HPA with CPU utilization target
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 120
```
HPA writes the computed value to the replicas field on the Deployment, which triggers the ReplicaSet controller to reconcile.
- Loop interval: 15s default, configurable via controller flag.
- Metric source: metrics-server for CPU/memory, custom.metrics.k8s.io for Prometheus, external.metrics.k8s.io for cloud metrics.
- Stabilization: HPA keeps a rolling history of desired-replica recommendations and applies the most conservative one within the window: the minimum for scale-up, the maximum for scale-down.
- Cooldown: There is no explicit cooldown. Stabilization windows serve as the dampening mechanism.
desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]. If a pod has no metric (e.g., not yet ready), HPA handles it conservatively: the pod is assumed to use 0% of the target for scale-up calculations and 100% for scale-down. This dampens scaling in both directions and prevents premature scale-down during deployments. However, it also means that during a rolling update, HPA may over-provision because old terminating pods and new unready pods both count as 'missing' metrics.

The HPA Algorithm: How Desired Replicas Are Calculated
The HPA algorithm is proportional, not binary. It does not simply 'add one pod' when a threshold is breached. Instead, it calculates the ratio of current metric to target metric and scales proportionally. This means high load causes rapid scale-up (doubling or more), while moderate load causes gradual adjustments.
```java
// Simplified HPA algorithm implementation
// Package: io.thecodeforge.kubernetes.autoscaling
package io.thecodeforge.kubernetes.autoscaling;

public class HpaAlgorithm {

    /**
     * Core HPA formula:
     * desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
     *
     * For multiple metrics, HPA computes the max across all metric sources.
     */
    public static int calculateDesiredReplicas(
            int currentReplicas,
            long currentMetricValue,
            long targetMetricValue,
            int minReplicas,
            int maxReplicas) {
        if (targetMetricValue == 0) {
            throw new IllegalArgumentException("Target metric value cannot be zero");
        }
        double ratio = (double) currentMetricValue / targetMetricValue;
        int desired = (int) Math.ceil(currentReplicas * ratio);
        // Clamp to min/max bounds
        return Math.max(minReplicas, Math.min(desired, maxReplicas));
    }

    public static void main(String[] args) {
        // Example: 10 replicas, CPU at 1500m, target 1000m
        // desired = ceil(10 * (1500/1000)) = ceil(15) = 15
        int desired = calculateDesiredReplicas(10, 1500, 1000, 3, 50);
        System.out.println("Desired replicas: " + desired); // 15

        // Example: 10 replicas, CPU at 500m, target 1000m
        // desired = ceil(10 * (500/1000)) = ceil(5) = 5
        int scaledDown = calculateDesiredReplicas(10, 500, 1000, 3, 50);
        System.out.println("Scaled down replicas: " + scaledDown); // 5
    }
}
```
Output:
Desired replicas: 15
Scaled down replicas: 5
- pods: 4 policy: can add at most 4 pods per period.
- percent: 50 policy: can add at most 50% of current replicas per period.
- Multiple policies: HPA uses the policy that allows the most scaling (max).
- For scale-down: same logic applies but in reverse. HPA uses the min across policies for safety.
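The policy-selection rules above can be sketched in a few lines. This is an illustrative model (the helper names are mine, not the controller's), assuming each policy produces a per-period cap on the replica count and selectPolicy picks among those caps:

```java
// Illustrative sketch of HPA scaling-policy selection; not the controller source.
public class PolicySelectionSketch {

    // Cap from a "Pods" policy: at most N pods added per period.
    public static int podsPolicyLimit(int currentReplicas, int pods) {
        return currentReplicas + pods;
    }

    // Cap from a "Percent" policy: at most N% of current replicas added per period.
    public static int percentPolicyLimit(int currentReplicas, int percent) {
        return currentReplicas + (int) Math.ceil(currentReplicas * percent / 100.0);
    }

    public static void main(String[] args) {
        int current = 10;
        int byPods = podsPolicyLimit(current, 4);        // 14
        int byPercent = percentPolicyLimit(current, 50); // 15
        // selectPolicy: Max -> the policy allowing the most scaling wins
        System.out.println(Math.max(byPods, byPercent)); // 15
        // selectPolicy: Min -> the most conservative policy wins
        System.out.println(Math.min(byPods, byPercent)); // 14
    }
}
```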
Use behavior to cap aggressiveness, and remember: when multiple metrics are configured, the MAX across metrics wins.

Stabilization Windows: Preventing Flapping
Stabilization windows are the primary mechanism for preventing HPA flapping: the rapid oscillation between scale-up and scale-down. During the stabilization window, HPA keeps the desired-replica recommendation from each recent control-loop run and acts only on the most conservative one: the minimum recommendation when scaling up, the maximum when scaling down.
```yaml
# HPA with tuned stabilization windows
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
  behavior:
    # Scale UP: Aggressive but dampened
    scaleUp:
      stabilizationWindowSeconds: 60   # Use the lowest recommendation from the last 60s
      policies:
      - type: Percent
        value: 100                     # Can double replicas per period
        periodSeconds: 60
      - type: Pods
        value: 10                      # Or add 10 pods, whichever is more
        periodSeconds: 60
      selectPolicy: Max                # Use the policy allowing most scaling
    # Scale DOWN: Conservative and slow
    scaleDown:
      stabilizationWindowSeconds: 600  # Use the highest recommendation from the last 10 minutes
      policies:
      - type: Percent
        value: 10                      # Remove at most 10% per period
        periodSeconds: 120
      selectPolicy: Min                # Use the most conservative policy
```
- Scale-up window: 0s default. Set to 60-120s to dampen during deployments and cold starts.
- Scale-down window: 300s default. Set to 300-600s for production stability.
- Window only applies to the stabilization decision, not the scaling policy rate.
- Pods in CrashLoopBackOff or not yet ready are handled conservatively: treated as using 0% of target for scale-up calculations and 100% for scale-down.
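The Kubernetes docs describe stabilization as acting on the most conservative recent recommendation: the lowest for scale-up, the highest for scale-down. A minimal sketch of that idea, with illustrative names (not the controller source):

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch of stabilization-window dampening.
// Assumption: HPA keeps the desired-replica recommendation from each control
// loop inside the window and applies the most conservative one.
public class StabilizationSketch {

    // Scale-up: act on the LOWEST recommendation in the window,
    // so one noisy spike cannot trigger an immediate scale-up.
    public static int stabilizedScaleUp(List<Integer> recommendationsInWindow) {
        return Collections.min(recommendationsInWindow);
    }

    // Scale-down: act on the HIGHEST recommendation in the window,
    // so a brief dip cannot trigger an immediate scale-down.
    public static int stabilizedScaleDown(List<Integer> recommendationsInWindow) {
        return Collections.max(recommendationsInWindow);
    }

    public static void main(String[] args) {
        // Recommendations from the last few 15s loops: a spike to 12, then back down.
        List<Integer> window = List.of(5, 12, 6, 5);
        System.out.println(stabilizedScaleUp(window));   // 5  (spike ignored)
        System.out.println(stabilizedScaleDown(window)); // 12 (dip ignored)
    }
}
```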
Set scaleUp.stabilizationWindowSeconds to 60-120 to let new pods warm up before HPA reacts.

Custom and External Metrics: Beyond CPU and Memory
CPU and memory are often poor proxies for actual application load. A web server might be CPU-bound during image processing but network-bound during API calls. HPA supports custom metrics (per-pod metrics from Prometheus) and external metrics (cluster-external signals like SQS queue depth or Pub/Sub backlog) through the Kubernetes API aggregation layer.
```yaml
# HPA with custom metrics from Prometheus Adapter
# Requires: prometheus-adapter installed and configured
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 2
  maxReplicas: 40
  metrics:
  # Custom metric: requests per second from Prometheus
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"   # Scale when avg RPS per pod exceeds 100
  # External metric: SQS queue depth
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: "order-processing"
      target:
        type: AverageValue
        averageValue: "50"    # Scale when queue depth per pod exceeds 50
  # Keep CPU as fallback
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
```
- Custom metrics: Prometheus Adapter, Datadog, or Stackdriver adapter.
- External metrics: Cloud-provider specific (AWS CloudWatch, GCP Stackdriver, Azure Monitor).
- Label selectors must match the metric's labels to the target pods.
- If the metric API is unavailable, HPA falls back to CPU/memory (if configured).
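For a Pods-type metric with an AverageValue target (like http_requests_per_second above), the standard formula reduces neatly: since the average is the sum across pods divided by the replica count, the desired count is just the total divided by the target. A sketch under that assumption (class and method names are mine):

```java
// Illustrative sketch of the AverageValue calculation for a Pods-type metric.
public class AverageValueSketch {

    /**
     * desired = ceil(currentReplicas * (average / target))
     *         = ceil(sum(metric) / target)
     * because average = sum(metric) / currentReplicas.
     */
    public static int desiredReplicas(long totalMetricAcrossPods, long targetAveragePerPod) {
        return (int) Math.ceil((double) totalMetricAcrossPods / targetAveragePerPod);
    }

    public static void main(String[] args) {
        // 4 pods serving 1000 RPS total, target 100 RPS per pod -> 10 replicas
        System.out.println(desiredReplicas(1000, 100)); // 10
    }
}
```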
If the metrics adapter fails, HPA reports <unknown> for custom metrics and stops scaling on those metrics. Always include a CPU or memory metric as a fallback. Monitor the adapter's availability and latency: a slow adapter causes delayed scaling decisions. The adapter's --metrics-relist-interval (default 1m) controls how often it re-reads available Prometheus metrics. Set it lower if you add new metrics frequently.

HPA, VPA, and Cluster Autoscaler: The Scaling Stack
HPA, VPA (Vertical Pod Autoscaler), and Cluster Autoscaler are complementary but interact in non-obvious ways. HPA scales horizontally (more pods). VPA scales vertically (bigger pods). Cluster Autoscaler scales infrastructure (more nodes). Using them together requires careful configuration to avoid conflicts.
```yaml
# VPA in 'Off' mode: Recommends but does not auto-apply resource requests
# This avoids conflict with HPA which also modifies replica counts
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"   # CRITICAL: Do not use 'Auto' with HPA on CPU/memory
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
---
# Cluster Autoscaler: Scales nodes based on pending pods
# No YAML needed — it watches for unschedulable pods automatically
# Key flags:
#   --scale-down-delay-after-add=10m
#   --scale-down-unneeded-time=10m
#   --max-node-provision-time=15m
```
- HPA on CPU + VPA on memory: Safe. They operate on different metrics.
- HPA on custom metrics + VPA on CPU/memory: Safe. HPA ignores resource metrics.
- HPA on CPU + VPA on CPU (Auto mode): Dangerous. Feedback loop.
- Best practice: HPA for scaling, VPA in 'Off' mode for right-sizing recommendations.
The Cluster Autoscaler has a --scale-down-delay-after-add flag (default 10 minutes) that prevents it from removing nodes immediately after HPA scales up. Without this, the Cluster Autoscaler could remove nodes while HPA is still creating pods, causing scheduling failures. Conversely, if the Cluster Autoscaler is too slow to add nodes, HPA's desired replicas may exceed available capacity, leaving pods in Pending state. Monitor for pods stuck in Pending with reason: Unschedulable; this indicates the Cluster Autoscaler cannot provision nodes fast enough or has hit its --max-nodes limit.

KEDA: Event-Driven Autoscaling Beyond HPA
KEDA (Kubernetes Event-Driven Autoscaler) extends HPA by enabling scale-to-zero and supporting a wider range of event sources (message queues, databases, cron schedules). KEDA acts as an HPA adapter — it creates and manages HPA resources internally, but provides a simpler API and more scaler options.
```yaml
# KEDA ScaledObject: Scale based on SQS queue depth with scale-to-zero
# Package: io.thecodeforge.kubernetes
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0    # Scale to zero when no messages
  maxReplicaCount: 50
  pollingInterval: 15   # Check metrics every 15 seconds
  cooldownPeriod: 300   # Wait 5 minutes before scaling to zero
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: aws-credentials
    metadata:
      queueURL: "https://sqs.us-east-1.amazonaws.com/123456789/orders"
      queueLength: "5"          # 1 pod per 5 messages
      awsRegion: "us-east-1"
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * *"        # Pre-scale at 8 AM
      end: "0 20 * * *"         # End pre-scale at 8 PM
      desiredReplicas: "10"
```
- Scale-to-zero: KEDA sets minReplicas to 0 and manages the 0->1 transition.
- 50+ scalers: SQS, Kafka, RabbitMQ, Prometheus, PostgreSQL, cron, and more.
- HPA under the hood: KEDA creates an HPA with custom metrics for each ScaledObject.
- Cooldown period: Prevents rapid scale-to-zero when a batch temporarily drains the queue.
Check the ScaledObject status for Active: false; this means KEDA has scaled to zero and is waiting for events. If events arrive but pods are slow to start, increase cooldownPeriod to prevent premature scale-to-zero between message batches.

| Component | Scales | Trigger | Scale to Zero | Best For |
|---|---|---|---|---|
| HPA | Pod replicas (horizontal) | CPU, memory, custom metrics | No (minReplicas >= 1) | Stateless web services, APIs with variable traffic |
| VPA | Pod resource requests (vertical) | Historical resource usage | No | Stateful workloads, databases, single-instance services |
| Cluster Autoscaler | Nodes (infrastructure) | Pending unschedulable pods | Yes (scale down empty nodes) | Cost optimization, burst capacity |
| KEDA | Pod replicas (horizontal) | Event sources (queues, DBs, cron, etc.) | Yes | Queue consumers, batch jobs, event-driven architectures |
🎯 Key Takeaways
- HPA is a proportional control loop, not a binary threshold trigger. It scales based on the ratio of current to target metric.
- Stabilization windows are the dampening mechanism. Asymmetric windows (fast up, slow down) are the production standard.
- Custom metrics make HPA application-aware but add a dependency chain. Always include CPU/memory as fallback.
- Never let HPA and VPA auto-scale on the same metric. Use VPA in 'Off' mode for right-sizing recommendations.
- KEDA extends HPA with event-driven scaling and scale-to-zero. Use it for queue-based workloads.
- Test HPA behavior under load before production. The scaling algorithm, stabilization windows, and Cluster Autoscaler interactions all need validation.
Interview Questions on This Topic
- Q: Explain the HPA algorithm. How does it compute the desired replica count?
- Q: What are stabilization windows and why are they important for preventing flapping?
- Q: How does HPA handle multiple metrics? What happens if CPU says 'scale up' but memory says 'scale down'?
- Q: Describe the conflict between HPA and VPA when both operate on CPU. How do you resolve it?
- Q: How do you wire up custom metrics from Prometheus to HPA? What happens if the metrics adapter fails?
- Q: What is KEDA and how does it differ from native HPA? When would you use it?
- Q: How does HPA interact with the Cluster Autoscaler? What are the race conditions?
- Q: A deployment is flapping between 3 and 15 replicas every minute. Walk me through your debugging process.
- Q: How do you design an HPA configuration for a service with predictable daily traffic patterns?
- Q: What is the selectPolicy field in HPA behavior and how does it affect scaling decisions?
Frequently Asked Questions
How does HPA calculate the desired number of replicas?
HPA uses the formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]. For example, if you have 10 replicas with CPU at 150m and a target of 100m, HPA calculates ceil(10 * 150/100) = ceil(15) = 15 replicas. When multiple metrics are configured, HPA computes the desired replicas for each and uses the maximum.
Why is my HPA showing '<unknown>' for metrics?
This typically means either: (1) metrics-server is not running or not collecting data, (2) the target deployment does not have CPU/memory requests set (HPA needs requests to compute utilization), or (3) for custom metrics, the metrics adapter (e.g., Prometheus Adapter) is down or misconfigured. Run kubectl top pods to verify metrics-server is working.
Can HPA and VPA work together?
Yes, but they must not operate on the same metric. The safe patterns are: (1) HPA on CPU + VPA on memory, (2) HPA on custom metrics + VPA on CPU/memory, or (3) HPA on CPU/memory + VPA in 'Off' mode (recommendations only, no auto-apply). Never use HPA on CPU with VPA in 'Auto' mode on CPU — this creates a feedback loop.
What is KEDA and when should I use it instead of HPA?
KEDA (Kubernetes Event-Driven Autoscaler) extends HPA with scale-to-zero support and 50+ event-based scalers (message queues, databases, cron schedules). Use KEDA for: queue consumers that should scale to zero when idle, event-driven workloads triggered by external systems, and batch jobs with predictable scheduling. KEDA creates HPA resources internally, so it is complementary, not a replacement.
How do I prevent HPA flapping?
Set stabilization windows for both scale-up and scale-down. Recommended: scaleUp.stabilizationWindowSeconds: 60-120, scaleDown.stabilizationWindowSeconds: 300-600. Also ensure pods have readiness probes with appropriate initialDelaySeconds so new pods are not counted in metrics until they are warm. Use behavior policies to cap the rate of scaling changes.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.