HPA Flapping — Cold-Start Pods Trigger 15-Second Cycles
HPA flapping occurs when scale-up has no stabilization window; cold-start pods drop CPU average, triggering rapid scale-down.
- HPA runs a control loop every 15 seconds (default) that reads metrics, computes desired replicas, and scales.
- Algorithm: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
- Supports CPU, memory, custom metrics (Prometheus), and external metrics (cloud provider queues).
- Scaling behavior is configurable via the behavior field: separate policies for scale-up and scale-down.
- Faster polling = more responsive but higher API server and metrics-server load.
- Aggressive scale-up = risk of over-provisioning and cluster resource exhaustion.
- Conservative scale-down = cost waste but stability during traffic dips.
- Common mistake: setting target CPU to 80% without setting CPU requests on the container. Utilization is measured as a percentage of the request, so without requests HPA has no denominator and will not function.
Imagine a burger restaurant that only opens new cash registers when the queue gets too long, and closes them when it empties out. You don't pay 10 cashiers to stand around at 6am — you scale up at noon rush and scale back down by 3pm. Kubernetes HPA is exactly that manager watching the queue (CPU, memory, or custom metrics) and telling the kitchen (your cluster) to add or remove servers automatically. You set the rules once, and it handles the rest.
Every production system eventually hits the same wall: traffic is unpredictable, and over-provisioning is expensive while under-provisioning is catastrophic. A Black Friday spike, a viral tweet, a nightly batch job — any of these can kneecap a statically-sized deployment in minutes.
Kubernetes Horizontal Pod Autoscaler (HPA) solves the reactive scaling problem by continuously watching resource metrics and adjusting pod replica counts to match demand. But the naive 'just set CPU threshold to 80%' approach breaks in subtle and painful ways in production — flapping deployments, ignored metrics, race conditions with the Cluster Autoscaler, and custom metrics that silently stop working.
This is not a getting-started guide. It is for engineers who need to understand the HPA algorithm at the formula level, how stabilization windows prevent flapping, how to wire up custom metrics via Prometheus and KEDA, how HPA interacts with VPA and Cluster Autoscaler, and the production mistakes that wake senior engineers at 3am.
What is Kubernetes HPA — Autoscaling?
Kubernetes HPA (Horizontal Pod Autoscaler) is a control loop that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. It is the primary mechanism for reactive horizontal scaling in Kubernetes. HPA does not add nodes — it adds pods. If pods cannot be scheduled due to insufficient node capacity, the Cluster Autoscaler is responsible for adding nodes.
- Loop interval: 15s default, configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period.
- Metric source: metrics-server for CPU/memory, custom.metrics.k8s.io for Prometheus, external.metrics.k8s.io for cloud metrics.
- Stabilization: HPA keeps a history of desired-replica recommendations and, within the stabilization window, acts on the highest recommendation for scale-down and the lowest for scale-up.
- Cooldown: There is no explicit cooldown. Stabilization windows serve as the dampening mechanism.
desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]. If a pod has no metric yet, HPA recomputes the average conservatively: it assumes the pod uses 100% of the target when the result would be a scale-down and 0% when the result would be a scale-up. Pods that are not yet ready are likewise treated as using 0% in scale-up calculations. This dampens the size of any scaling step and prevents premature scale-down during deployments. It also means that during a rolling update HPA may scale up more slowly than the raw metrics suggest, because new unready pods are counted at 0%.
Metrics Server Install & Troubleshoot Guide
The Kubernetes Metrics Server is the backbone for HPA CPU/memory scaling. It collects resource metrics (CPU and memory usage) from Kubelets and exposes them through the metrics.k8s.io API. Without it, HPA cannot compute utilization percentages and will show <unknown> for resource metrics.
Installation (standard method):
```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
This installs the Metrics Server in the kube-system namespace. For production clusters, customize the deployment with resource limits and readiness probes.
Cloud-specific variations:
- Amazon EKS: Set --kubelet-insecure-tls if using a self-signed CA (common with EKS managed node groups).
- Azure AKS: Enable the addon via CLI: az aks enable-addons --addons monitoring --name <cluster> --resource-group <rg>; Metrics Server is included with Azure Monitor.
- Google GKE: GKE provides a managed Metrics Server automatically. No manual install needed.
Troubleshooting checklist:
1. Verify the Metrics Server deployment is running: kubectl get deployment metrics-server -n kube-system
2. Check pod logs: kubectl logs deployment/metrics-server -n kube-system and look for TLS or authentication errors.
3. Test metric collection: kubectl top pods and kubectl top nodes. If empty, Metrics Server is not collecting.
4. Confirm API service is healthy: kubectl get apiservice v1beta1.metrics.k8s.io (status should be True).
5. If using a custom CA, pass --kubelet-preferred-address-types=InternalIP and --kubelet-insecure-tls flags.
6. Ensure each node has at least 1 vCPU and 1 GB memory; Metrics Server can be resource-hungry on large clusters.
7. For clusters with hundreds of nodes, increase the Metrics Server resource requests and limits:
```yaml
resources:
  requests:
    cpu: 100m
    memory: 200Mi
  limits:
    cpu: 200m
    memory: 500Mi
```
Use kubectl top pods --containers to check per-container metrics, which helps diagnose missing requests. In production, set up alerts for the metrics.k8s.io API service status: if it becomes unavailable, HPA will show <unknown> and stop scaling.
The HPA Algorithm: How Desired Replicas Are Calculated
The HPA algorithm is proportional, not binary. It does not simply 'add one pod' when a threshold is breached. Instead, it calculates the ratio of current metric to target metric and scales proportionally. This means high load causes rapid scale-up (doubling or more), while moderate load causes gradual adjustments.
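The proportional rule can be sketched in a few lines of Python. This is a minimal illustration, not the controller's code: the function name is hypothetical, and the 10% tolerance band is the upstream default for skipping small changes, which the text above does not mention.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of ceil[currentReplicas * (currentMetricValue / targetMetricValue)].

    `tolerance` is an assumption for this sketch: ratios within 10% of 1.0
    are treated as "close enough" and leave the replica count unchanged.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Heavy load: 200% average CPU against a 50% target quadruples the replicas.
print(desired_replicas(4, 200, 50))  # 16
# Moderate load: a ratio of 1.2 adds a single pod.
print(desired_replicas(4, 60, 50))   # 5
# A ratio of 1.04 falls inside the tolerance band, so nothing changes.
print(desired_replicas(4, 52, 50))   # 4
```

The first call shows why a sharp spike can more than double a deployment in one loop iteration, while the second shows the gradual adjustment under moderate load.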
- pods: 4 policy: can add at most 4 pods per period.
- percent: 50 policy: can add at most 50% of current replicas per period.
- Multiple policies: by default HPA uses the policy that allows the most change (selectPolicy: Max).
- For scale-down: the same logic applies in reverse, and the default is still the policy that allows the most change. Set selectPolicy: Min to apply the most restrictive policy for safety.
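How a pods policy and a percent policy combine into a single scale-up cap can be sketched as follows. The function names and the tuple representation of a policy are hypothetical, and the sketch assumes the replica count did not change earlier in the same period.

```python
import math


def scale_up_limit(current_replicas, policies, select_policy="Max"):
    """Return the highest replica count the policies allow in one period.

    `policies` is a list of (kind, value) tuples, e.g. ("Pods", 4) or
    ("Percent", 50). This layout is an assumption made for the sketch.
    """
    limits = []
    for kind, value in policies:
        if kind == "Pods":
            limits.append(current_replicas + value)
        elif kind == "Percent":
            limits.append(math.ceil(current_replicas * (100 + value) / 100))
        else:
            raise ValueError(f"unknown policy kind: {kind}")
    # "Max" picks the policy allowing the most change, "Min" the least.
    return max(limits) if select_policy == "Max" else min(limits)


def capped_scale_up(current_replicas, desired, policies, select_policy="Max"):
    """Clamp the algorithm's desired replica count to the policy limit."""
    return min(desired, scale_up_limit(current_replicas, policies, select_policy))


policies = [("Pods", 4), ("Percent", 50)]
print(scale_up_limit(10, policies))          # 15: percent allows more than pods
print(scale_up_limit(10, policies, "Min"))   # 14: the more restrictive policy
print(capped_scale_up(4, 16, policies))      # 8: desired 16 is capped at 4 + 4
```

With small deployments the pods policy tends to dominate, and with large ones the percent policy does, which is why the two are often configured together.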
Use behavior to cap aggressiveness, and remember: the MAX across metrics wins.
Stabilization Windows: Preventing Flapping
Stabilization windows are the primary mechanism for preventing HPA flapping, the rapid oscillation between scale-up and scale-down. During the stabilization window, HPA acts only on the most conservative replica recommendation from its history of recent calculations: the lowest for scale-up and the highest for scale-down.
- Scale-up window: 0s default. Set to 60-120s to dampen during deployments and cold starts.
- Scale-down window: 300s default. Set to 300-600s for production stability.
- Window only applies to the stabilization decision, not the scaling policy rate.
- Pods in CrashLoopBackOff or not yet ready are treated as using 0% of target in scale-up calculations, which dampens the scale-up.
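The windowing logic can be sketched as a filter over recent recommendations. This is a simplified model under stated assumptions: the function name and the (timestamp, replicas) history format are hypothetical, and the upstream controller additionally clamps the result against the current replica count and the scaling policies.

```python
def stabilized_recommendation(history, now, window_seconds, direction):
    """Pick the most conservative recent recommendation.

    `history` is a list of (timestamp_seconds, desired_replicas) tuples.
    For "down" the highest recent value wins; for "up" the lowest does.
    """
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    if not recent:
        raise ValueError("no recommendations inside the window")
    return max(recent) if direction == "down" else min(recent)


# Scale-down: a dip to 5 is ignored while 10 is still inside the 300s window.
down_history = [(0, 10), (15, 6), (30, 7), (45, 5)]
print(stabilized_recommendation(down_history, 45, 300, "down"))  # 10

# Scale-up with a 60s window: a brief spike to 9 does not win immediately.
up_history = [(0, 4), (15, 8), (30, 5), (45, 9)]
print(stabilized_recommendation(up_history, 45, 60, "up"))       # 4

# Scale-up with a 0s window (the default): only the latest value counts.
print(stabilized_recommendation(up_history, 45, 0, "up"))        # 9
```

The last two calls show the difference a non-zero scale-up window makes: with 0 seconds every spike is acted on, while a 60 second window waits until the higher value has persisted.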
Set scaleUp.stabilizationWindowSeconds to 60-120 to let new pods warm up before HPA reacts.
Custom and External Metrics: Beyond CPU and Memory
CPU and memory are often poor proxies for actual application load. A web server might be CPU-bound during image processing but network-bound during API calls. HPA supports custom metrics (per-pod metrics from Prometheus) and external metrics (cluster-external signals like SQS queue depth or Pub/Sub backlog) through the Kubernetes API aggregation layer.
- Custom metrics: Prometheus Adapter, Datadog, or Stackdriver adapter.
- External metrics: Cloud-provider specific (AWS CloudWatch, GCP Stackdriver, Azure Monitor).
- Label selectors must match the metric's labels to the target pods.
- If a metric API is unavailable, HPA can still scale up on the metrics it can fetch (such as CPU/memory, if configured), but it skips scale-down.
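When several metrics are configured, each one yields its own replica proposal and the largest wins. The sketch below models that, along with the documented upstream behavior of skipping a scale-down when any metric could not be fetched. The function name and the use of None for an unavailable metric are assumptions made for illustration.

```python
from typing import List, Optional


def combine_metrics(current_replicas: int,
                    proposals: List[Optional[int]]) -> int:
    """Combine per-metric replica proposals into one decision.

    Each entry is the replica count one metric asks for, or None when
    that metric could not be fetched (hypothetical representation).
    """
    available = [p for p in proposals if p is not None]
    if not available:
        # Nothing usable: hold the current size.
        return current_replicas
    desired = max(available)
    some_missing = len(available) < len(proposals)
    if some_missing and desired < current_replicas:
        # A scale-down is not trusted while a metric is missing.
        return current_replicas
    return desired


print(combine_metrics(5, [7, 4]))     # 7: the largest proposal wins
print(combine_metrics(5, [8, None]))  # 8: scale-up still works
print(combine_metrics(5, [3, None]))  # 5: scale-down is skipped
```

This is why a CPU or memory metric alongside a custom metric keeps scale-up working during an adapter outage, even though scale-down pauses.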
When the adapter is unavailable, HPA shows <unknown> for custom metrics and stops scaling on those metrics. Always include a CPU or memory metric as a fallback. Monitor the adapter's availability and latency: a slow adapter causes delayed scaling decisions. The adapter's --metrics-relist-interval (default 1m) controls how often it re-reads available Prometheus metrics. Set it lower if you add new metrics frequently.
Custom Metrics (Prometheus Adapter) Guide
The Prometheus Adapter is the most common way to expose application-level metrics to HPA. It implements the custom.metrics.k8s.io API and translates Prometheus queries into metric values that HPA can consume. Below is a complete guide to installing and configuring the adapter for production use.
Installation using Helm (recommended):
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set rules.default=true
```
Configuration via ConfigMap: The adapter uses a series of rules that define which Prometheus series become custom metrics. Each rule specifies a seriesQuery (PromQL to find matching time series) and template transformations for pods and nodes.
```yaml
# ConfigMap for prometheus-adapter rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_total'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "^(.*)_total$"
          as: "${1}_per_second"
        # Sum the per-second rate over the labels the adapter groups by
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```
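The name rule in the config above is a regular-expression rewrite, and its effect can be checked offline. The Python sketch below mirrors it; the function name is hypothetical, and Python writes the capture group as \g<1> where the adapter's Go-style template uses ${1}.

```python
import re


def rename_series(series_name: str,
                  matches: str = r"^(.*)_total$",
                  as_template: str = r"\g<1>_per_second") -> str:
    """Apply a matches/as style rename to a Prometheus series name.

    Names that do not match the pattern are returned unchanged here;
    that is a property of re.sub, not a claim about the adapter.
    """
    return re.sub(matches, as_template, series_name)


print(rename_series("http_requests_total"))  # http_requests_per_second
print(rename_series("queue_depth"))          # queue_depth
```

The renamed value, http_requests_per_second, is the metric name an HPA would reference, and it matches the name queried in the testing step.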
Testing custom metrics: Once the adapter is running, check the available metrics:
```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
```
This should return a list of metric names. Then verify a specific metric for a pod (the path is quoted so the shell does not expand the *):
```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second" | jq .
```
Common issues:
- No metrics appear: The adapter cannot connect to Prometheus. Check prometheus.url and network policies.
- <unknown> in HPA: The metric query returns no data. Verify the label selectors in HPA match the pod labels.
- High latency: The PromQL query is too expensive. Use aggregation (<<.GroupBy>>) and limit time range.
- Adapter crashes: Memory or CPU limit too low. Increase resources and add --logtostderr for debug logs.
The HPA controller must be allowed to read the custom.metrics.k8s.io API. In most clusters, the default ClusterRole system:controller:horizontal-pod-autoscaler already includes this, but if you are using a custom RBAC setup, ensure the binding exists. Without it, HPA will show 'unauthorized' errors.
Applications should expose a Prometheus endpoint (/metrics) for request latency and error rates. A common mistake is to expose raw metrics without proper aggregation, causing the adapter to return large data sets that overwhelm HPA. Use sum and rate in the metricsQuery to reduce cardinality. Also, run the adapter with replicas: 2 for high availability, since its failure silently breaks all custom metrics scaling.
Verify custom metrics with kubectl get --raw and include fallback resource metrics in HPA.
HPA, VPA, and Cluster Autoscaler: The Scaling Stack
HPA, VPA (Vertical Pod Autoscaler), and Cluster Autoscaler are complementary but interact in non-obvious ways. HPA scales horizontally (more pods). VPA scales vertically (bigger pods). Cluster Autoscaler scales infrastructure (more nodes). Using them together requires careful configuration to avoid conflicts.
- HPA on CPU + VPA on memory: Safe. They operate on different metrics.
- HPA on custom metrics + VPA on CPU/memory: Safe. HPA ignores resource metrics.
- HPA on CPU + VPA on CPU (Auto mode): Dangerous. Feedback loop.
- Best practice: HPA for scaling, VPA in 'Off' mode for right-sizing recommendations.
The Cluster Autoscaler has a --scale-down-delay-after-add flag (default 10 minutes) that prevents it from removing nodes immediately after HPA scales up. Without this, the Cluster Autoscaler could remove nodes while HPA is still creating pods, causing scheduling failures. Conversely, if the Cluster Autoscaler is too slow to add nodes, HPA's desired replicas may exceed available capacity, leaving pods in Pending state. Monitor for pods stuck in Pending with reason: Unschedulable, which indicates the Cluster Autoscaler cannot provision nodes fast enough or has hit its node limit (for example --max-nodes-total).
VPA vs HPA vs Cluster Autoscaler Comparison Table
Understanding the differences and interactions between the three Kubernetes scaling components is critical for designing a robust autoscaling strategy. Below is a detailed comparison table.
| Feature | HPA (Horizontal Pod Autoscaler) | VPA (Vertical Pod Autoscaler) | Cluster Autoscaler |
|---|---|---|---|
| What it scales | Number of pod replicas | CPU/memory requests per pod | Number of cluster nodes |
| Metric source | CPU, memory, custom (Prometheus), external (cloud) | Historical resource usage (recommendations) | Pending pods (unschedulable) |
| Scaling direction | Scale out (increase replicas) / Scale in (decrease replicas) | Scale up (increase requests) / Scale down (decrease requests) | Scale up (add nodes) / Scale down (remove empty nodes) |
| Scale to zero | No (minReplicas >= 1) | No | Yes (empty nodes removed) |
| Conflict with HPA? | N/A | Yes, if both operate on CPU/memory. Use VPA in 'Off' mode. | No, complementary. |
| Best use case | Stateless microservices with variable traffic | Stateful applications, databases, batch jobs | Cost optimization for fluctuating cluster demand |
| Latency impact | Fast (within 15s loop) | Slow (minutes to hours for recommendations) | Slow (node provisioning takes 2-10 min) |
| Configuration complexity | Low (basic CPU) to medium (custom metrics) | Medium (need historical data) | Medium (cloud provider integration) |
When to combine:
- HPA + Cluster Autoscaler: The most common pair. HPA adds pods; Cluster Autoscaler adds nodes when pods can't be scheduled.
- VPA (Off mode) + HPA: Safe. VPA provides recommendations for right-sizing requests; HPA handles actual scaling.
- VPA (Auto mode) alone: Works for single-instance workloads but cannot horizontally scale.
Production pitfalls:
- Running VPA in Auto mode with HPA on the same metrics causes oscillation (see callout in the scaling stack section).
- Cluster Autoscaler may conflict with HPA if scale-down is too aggressive: pods terminated by HPA trigger node removal, causing new pods to be pending.
- Total scaling delay = HPA reaction time + Cluster Autoscaler provisioning time. For bursty traffic, pre-provision nodes or use KEDA with scale-to-zero.
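Align HPA scale-down settings with the Cluster Autoscaler's scale-down delays, such as --scale-down-delay-after-add (default 10 minutes) and --scale-down-delay-after-delete. For example, set HPA scale-down stabilization window to at least 600s and periodSeconds to 120s to prevent rapid pod churn.
KEDA: Event-Driven Autoscaling Beyond HPA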
KEDA (Kubernetes Event-Driven Autoscaler) extends HPA by enabling scale-to-zero and supporting a wider range of event sources (message queues, databases, cron schedules). KEDA acts as an HPA adapter — it creates and manages HPA resources internally, but provides a simpler API and more scaler options.
- Scale-to-zero: KEDA sets minReplicas to 0 and manages the 0->1 transition.
- 50+ scalers: SQS, Kafka, RabbitMQ, Prometheus, PostgreSQL, cron, and more.
- HPA under the hood: KEDA creates an HPA with custom metrics for each ScaledObject.
- Cooldown period: Prevents rapid scale-to-zero when a batch temporarily drains the queue.
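The cooldown rule amounts to a timestamp comparison. The sketch below is a simplified model with a hypothetical function name, not KEDA's implementation; the 300 second default mirrors KEDA's documented cooldownPeriod default.

```python
def may_scale_to_zero(last_active_at: float,
                      now: float,
                      cooldown_seconds: float = 300) -> bool:
    """Return True once the trigger has been inactive for the full cooldown.

    `last_active_at` and `now` are timestamps in seconds; the names are
    assumptions made for this sketch.
    """
    return (now - last_active_at) >= cooldown_seconds


# A queue that drained 2 minutes ago keeps its pods.
print(may_scale_to_zero(last_active_at=0, now=120))  # False
# After 5 quiet minutes the workload may drop to zero.
print(may_scale_to_zero(last_active_at=0, now=300))  # True
```

If message batches arrive less than one cooldown apart, the workload never reaches zero, which is the behavior a longer cooldown is meant to buy.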
Check the ScaledObject status for Active: false, which means KEDA has scaled to zero and is waiting for events. If events arrive but pods are slow to start, increase cooldownPeriod to prevent premature scale-to-zero between message batches.
HPA Flapping: Pods Scaling Up and Down Every 15 Seconds
The HPA scale-up policy was pods: 4 with periodSeconds: 15, meaning HPA could add 4 pods every 15 seconds. The new pods started with low CPU (cold start), which pulled the average below the 50% threshold, triggering scale-down. Once pods terminated, CPU spiked again, triggering scale-up. The cycle repeated indefinitely. The core issue: the scale-up stabilization window was 0 seconds (default), so HPA reacted immediately to every metric change without dampening.
1. Set behavior.scaleUp.stabilizationWindowSeconds: 120 to prevent rapid scale-up during pod initialization.
2. Changed behavior.scaleUp.policies from pods: 4 to percent: 50 for proportional scaling.
3. Set behavior.scaleDown.stabilizationWindowSeconds: 300 to be conservative on scale-down.
4. Added a readiness probe with initialDelaySeconds: 30 so pods were not counted in metrics until fully warm.
- HPA flapping is caused by asymmetric scale-up and scale-down policies. Both need stabilization windows.
- Cold-start pods with low CPU skew the average metric and cause premature scale-down.
- Always set stabilizationWindowSeconds for both scale-up and scale-down.
- Readiness probes gate when a pod is counted in the HPA metric calculation. Delay readiness until warm-up completes.
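The arithmetic behind the cold-start skew can be sketched with made-up numbers. The figures below (four warm pods at 70% CPU, four cold pods at 5%, a 50% target) are hypothetical, the function names are illustrative, and the sketch ignores the tolerance band and the way load redistributes once new pods take traffic.

```python
import math


def average_utilization(pod_cpu_percent):
    """Mean CPU utilization across the pods that are counted."""
    return sum(pod_cpu_percent) / len(pod_cpu_percent)


def proposed_replicas(current_replicas, average, target):
    """ceil[currentReplicas * (currentMetricValue / targetMetricValue)]."""
    return math.ceil(current_replicas * average / target)


warm = [70, 70, 70, 70]  # pods already serving traffic
cold = [5, 5, 5, 5]      # freshly started pods, still idle
target = 50

# Counting the cold pods drags the average under the target.
avg_all = average_utilization(warm + cold)
print(avg_all)                                # 37.5
print(proposed_replicas(8, avg_all, target))  # 6: scale down from 8

# Counting only the warm pods keeps the pressure visible.
avg_warm = average_utilization(warm)
print(avg_warm)                               # 70.0
print(proposed_replicas(4, avg_warm, target)) # 6: scale up from 4
```

Both calculations land on 6 replicas, but the first reads as a scale-down from 8 and the second as a scale-up from 4, and alternating between the two views is the oscillation described in the incident.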
- HPA shows <unknown> for metric values.
- HPA is not scaling up: check events (kubectl describe hpa). Look for 'failed to get cpu utilization' or 'unable to fetch metrics'. Verify metrics-server is collecting data. Check if maxReplicas has been reached.
- HPA is not scaling down: check the behavior.scaleDown policy. Default scale-down stabilization window is 300 seconds (5 minutes). Verify the metric is actually below the target after stabilization. Check if minReplicas has been reached.
- HPA is flapping: increase periodSeconds to reduce how often scaling changes are applied. Check if cold-start pods are skewing metrics.
- Custom metrics are unavailable: check the API service (kubectl get apiservice). Check adapter configuration for the metric query. Ensure the metric label selectors match the deployment's pods.
- If kubectl top fails, restart metrics-server. If HPA shows 'missing request for cpu', add resource requests to the deployment.