Advanced 6 min · March 06, 2026

HPA Flapping — Cold-Start Pods Trigger 15-Second Cycles

HPA flapping occurs when scale-up has no stabilization window; cold-start pods drop CPU average, triggering rapid scale-down.

Naren · Founder
Quick Answer
  • HPA runs a control loop every 15 seconds (default) that reads metrics, computes desired replicas, and scales.
  • Algorithm: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
  • Supports CPU, memory, custom metrics (Prometheus), and external metrics (cloud provider queues).
  • Scaling behavior is configurable via behavior field: separate policies for scale-up and scale-down.
  • Faster polling = more responsive but higher API Server and metrics-server load.
  • Aggressive scale-up = risk of over-provisioning and cluster resource exhaustion.
  • Conservative scale-down = cost waste but stability during traffic dips.
  • Common mistake: setting a CPU utilization target such as 80% without setting CPU requests on the container. Utilization is measured as a percentage of the request, so without requests HPA has no denominator and cannot compute it.
Plain-English First

Imagine a burger restaurant that only opens new cash registers when the queue gets too long, and closes them when it empties out. You don't pay 10 cashiers to stand around at 6am — you scale up at noon rush and scale back down by 3pm. Kubernetes HPA is exactly that manager watching the queue (CPU, memory, or custom metrics) and telling the kitchen (your cluster) to add or remove servers automatically. You set the rules once, and it handles the rest.

Every production system eventually hits the same wall: traffic is unpredictable, and over-provisioning is expensive while under-provisioning is catastrophic. A Black Friday spike, a viral tweet, a nightly batch job — any of these can kneecap a statically-sized deployment in minutes.

Kubernetes Horizontal Pod Autoscaler (HPA) solves the reactive scaling problem by continuously watching resource metrics and adjusting pod replica counts to match demand. But the naive 'just set CPU threshold to 80%' approach breaks in subtle and painful ways in production — flapping deployments, ignored metrics, race conditions with the Cluster Autoscaler, and custom metrics that silently stop working.

This is not a getting-started guide. It is for engineers who need to understand the HPA algorithm at the formula level, how stabilization windows prevent flapping, how to wire up custom metrics via Prometheus and KEDA, how HPA interacts with VPA and Cluster Autoscaler, and the production mistakes that wake senior engineers at 3am.

What is Kubernetes HPA — Autoscaling?

Kubernetes HPA (Horizontal Pod Autoscaler) is a control loop that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. It is the primary mechanism for reactive horizontal scaling in Kubernetes. HPA does not add nodes — it adds pods. If pods cannot be scheduled due to insufficient node capacity, the Cluster Autoscaler is responsible for adding nodes.

io/thecodeforge/kubernetes/hpa-basic.yaml (YAML)
# Basic HPA with CPU utilization target
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
Output
HPA configured with CPU/memory targets and scaling behavior policies.
The HPA Control Loop
  • Loop interval: 15s default, configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period.
  • Metric source: metrics-server for CPU/memory, custom.metrics.k8s.io for Prometheus, external.metrics.k8s.io for cloud metrics.
  • Stabilization: HPA keeps a history of its desired-replica recommendations. Within the scale-down window it acts on the highest recommendation; within the scale-up window it acts on the lowest.
  • Cooldown: There is no explicit cooldown. Stabilization windows serve as the dampening mechanism.
Production Insight
The HPA algorithm uses the formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)], and it skips scaling when the ratio is within a tolerance of 1.0 (10% by default). If a pod has no metric, HPA recomputes the average conservatively: it assumes the pod uses 0% of the target when the result would be a scale-up, and 100% when the result would be a scale-down. Not-yet-ready pods are likewise counted at 0% when scaling up. Both rules dampen the size of a scaling step. Pods that are being deleted (deletion timestamp set) and failed pods are ignored entirely. During a rolling update, new unready pods therefore hold scale-up back rather than inflate it.
Key Takeaway
HPA is a control loop, not a threshold trigger. It computes proportional scaling, not binary on/off. Understanding the algorithm and stabilization windows is essential to preventing flapping and over-provisioning.
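The conservative handling of pods without metrics can be sketched in a few lines. This is a simplified illustration based on the upstream Kubernetes documentation, not the controller's actual code: the class name, method name, and the choice to express everything as utilization percentages are assumptions made for the example.

```java
// Simplified sketch of how HPA dampens scaling when some pods have no metrics.
// Names are illustrative; this is not the controller's API.
class MissingMetricsSketch {

    static final double TOLERANCE = 0.10; // default tolerance around a ratio of 1.0

    /**
     * @param measured utilization (% of request) for pods that reported metrics
     * @param missing  number of pods with no metric
     * @param target   target average utilization (%)
     */
    static int desiredReplicas(double[] measured, int missing, double target) {
        int current = measured.length + missing;
        if (measured.length == 0) {
            return current; // nothing to base a decision on
        }
        double sum = 0;
        for (double u : measured) {
            sum += u;
        }
        double ratio = (sum / measured.length) / target;
        if (Math.abs(1.0 - ratio) <= TOLERANCE) {
            return current; // close enough to target: no change
        }
        if (missing == 0) {
            return (int) Math.ceil(ratio * current);
        }

        // Conservative fill: 0% when heading up, 100% of target when heading down.
        double fill = ratio < 1.0 ? target : 0.0;
        double newRatio = ((sum + fill * missing) / current) / target;

        boolean directionFlipped = (ratio < 1.0 && newRatio > 1.0)
                || (ratio > 1.0 && newRatio < 1.0);
        if (directionFlipped || Math.abs(1.0 - newRatio) <= TOLERANCE) {
            return current;
        }
        return (int) Math.ceil(newRatio * current);
    }

    public static void main(String[] args) {
        // 4 hot pods at 90%, 2 pods without metrics, target 50%
        System.out.println(desiredReplicas(new double[]{90, 90, 90, 90}, 2, 50)); // 8
        // 4 idle pods at 20%, 2 pods without metrics, target 50%
        System.out.println(desiredReplicas(new double[]{20, 20, 20, 20}, 2, 50)); // 4
    }
}
```

In the first call a naive ceil(6 * 1.8) would ask for 11 replicas; the conservative fill reduces that to 8. In the second, the naive result of 3 becomes 4.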

Metrics Server Install & Troubleshoot Guide

The Kubernetes Metrics Server is the backbone for HPA CPU/memory scaling. It collects resource metrics (CPU and memory usage) from Kubelets and exposes them through the metrics.k8s.io API. Without it, HPA cannot compute utilization percentages and will show <unknown> for resource metrics.

Installation (standard method):

```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

This installs the Metrics Server in the kube-system namespace. For production clusters, customize the deployment with resource limits and readiness probes.

Cloud-specific variations:
  • Amazon EKS: Set --kubelet-insecure-tls if using a self-signed CA (common with EKS managed node groups).
  • Azure AKS: Metrics Server ships with the cluster by default. The monitoring add-on (az aks enable-addons --addons monitoring --name <cluster> --resource-group <rg>) adds Azure Monitor on top.
  • Google GKE: GKE provides a managed Metrics Server automatically. No manual install needed.

Troubleshooting checklist:
  1. Verify the Metrics Server deployment is running: kubectl get deployment metrics-server -n kube-system
  2. Check pod logs for TLS or authentication errors: kubectl logs deployment/metrics-server -n kube-system
  3. Test metric collection: kubectl top pods and kubectl top nodes. If these return errors or no data, Metrics Server is not collecting.
  4. Confirm the API service is healthy: kubectl get apiservice v1beta1.metrics.k8s.io. The AVAILABLE column should be True.
  5. If using a custom CA, pass the --kubelet-preferred-address-types=InternalIP and --kubelet-insecure-tls flags. The second flag disables kubelet certificate verification, so treat it as a workaround.
  6. Ensure each node has at least 1 vCPU and 1 GB memory; Metrics Server can be resource-hungry on large clusters.
  7. For clusters with hundreds of nodes, set explicit Metrics Server requests and limits and raise them as the cluster grows, for example:

```yaml
resources:
  requests:
    cpu: 100m
    memory: 200Mi
  limits:
    cpu: 200m
    memory: 500Mi
```

io/thecodeforge/kubernetes/metrics-server-custom-flags.yaml (YAML)
# Custom Metrics Server deployment with TLS flags
# Package: io.thecodeforge.kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
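      - args:
        - --cert-dir=/tmp                 # Writable location for the self-generated serving certificate
        - --secure-port=4443              # Must match the containerPort below
        - --kubelet-insecure-tls          # Skips kubelet certificate verification; avoid where kubelets have CA-signed certs
        - --kubelet-preferred-address-types=InternalIP
        - --metric-resolution=15s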
        name: metrics-server
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
        ports:
        - containerPort: 4443
          name: https
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
Output
Deployment configured with TLS insecure flag and custom metric resolution.
Metrics Server Resource Requirements
Metrics Server resource usage grows with the number of nodes and pods. The upstream manifest requests 100m CPU and 200Mi memory, which the project documents as sufficient for clusters of up to about 100 nodes. For larger clusters, monitor memory usage and increase requests. A common failure mode is an OOMKill on large clusters, which causes intermittent metric collection.
Production Insight
Metrics Server is a low-priority component in many clusters, but its failure silently breaks HPA for CPU and memory scaling. Always include a Prometheus or KEDA-based fallback for critical services. Use kubectl top pods --containers to check per-container metrics, which helps diagnose missing requests. In production, set up alerts for the metrics.k8s.io API service status — if it becomes unavailable, HPA will show <unknown> and stop scaling.
Key Takeaway
Metrics Server is mandatory for HPA CPU/memory scaling. Install it, verify it, and monitor its health. Cloud-managed clusters often have it pre-installed, but self-managed clusters require explicit setup.

The HPA Algorithm: How Desired Replicas Are Calculated

The HPA algorithm is proportional, not binary. It does not simply 'add one pod' when a threshold is breached. Instead, it calculates the ratio of current metric to target metric and scales proportionally. This means high load causes rapid scale-up (doubling or more), while moderate load causes gradual adjustments.

io/thecodeforge/kubernetes/autoscaling/HpaAlgorithm.java (Java)
// Simplified HPA algorithm implementation
// Package: io.thecodeforge.kubernetes.autoscaling
package io.thecodeforge.kubernetes.autoscaling;

public class HpaAlgorithm {

    /**
     * Core HPA formula:
     * desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
     *
     * For multiple metrics, HPA computes the max across all metric sources.
     */
    public static int calculateDesiredReplicas(
            int currentReplicas,
            long currentMetricValue,
            long targetMetricValue,
            int minReplicas,
            int maxReplicas
    ) {
        if (targetMetricValue == 0) {
            throw new IllegalArgumentException("Target metric value cannot be zero");
        }

        double ratio = (double) currentMetricValue / targetMetricValue;
        int desired = (int) Math.ceil(currentReplicas * ratio);

        // Clamp to min/max bounds
        return Math.max(minReplicas, Math.min(desired, maxReplicas));
    }

    public static void main(String[] args) {
        // Example: 10 replicas, CPU at 1500m, target 1000m
        // desired = ceil(10 * (1500/1000)) = ceil(15) = 15
        int desired = calculateDesiredReplicas(10, 1500, 1000, 3, 50);
        System.out.println("Desired replicas: " + desired);  // 15

        // Example: 10 replicas, CPU at 500m, target 1000m
        // desired = ceil(10 * (500/1000)) = ceil(5) = 5
        int scaledDown = calculateDesiredReplicas(10, 500, 1000, 3, 50);
        System.out.println("Scaled down replicas: " + scaledDown);  // 5
    }
}
Output
Desired replicas: 15
Scaled down replicas: 5
Proportional Scaling vs Binary Thresholds
  • type: Pods, value: 4: can add at most 4 pods per period.
  • type: Percent, value: 50: can add at most 50% of current replicas per period.
  • Multiple policies: by default (selectPolicy: Max) HPA uses the policy that allows the largest change.
  • For scale-down: the default is also Max, so the policy that removes the most pods wins. Set selectPolicy: Min to apply the most conservative policy, or Disabled to block scaling in that direction.
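To make the policy arithmetic concrete, here is a minimal sketch of how a Pods policy and a Percent policy cap a single scale-up step. It is a simplification under stated assumptions: the class and method names are hypothetical, and the real controller measures the allowed change against the replica count at the start of each period rather than the instantaneous count used here.

```java
// Sketch: cap a scale-up step using a Pods policy and a Percent policy.
// Names are illustrative; period bookkeeping is deliberately omitted.
class ScalePolicySketch {

    /** Highest replica count the two policies allow after one period. */
    static int scaleUpCeiling(int current, int podsValue, int percentValue, boolean selectMax) {
        int byPods = current + podsValue;
        int byPercent = (int) Math.ceil(current * (1 + percentValue / 100.0));
        return selectMax ? Math.max(byPods, byPercent) : Math.min(byPods, byPercent);
    }

    /** The proportional formula's answer, limited by the policy ceiling. */
    static int limitedScaleUp(int current, int desired, int podsValue, int percentValue, boolean selectMax) {
        return Math.min(desired, scaleUpCeiling(current, podsValue, percentValue, selectMax));
    }

    public static void main(String[] args) {
        // 10 replicas, formula asks for 30, policies: Pods=4, Percent=50
        System.out.println(limitedScaleUp(10, 30, 4, 50, true));  // 15 (selectPolicy: Max)
        System.out.println(limitedScaleUp(10, 30, 4, 50, false)); // 14 (selectPolicy: Min)
    }
}
```

At small replica counts the Pods policy usually dominates (4 replicas: 8 by Pods versus 6 by Percent), while at large counts the Percent policy does, which is why the two are often combined.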
Production Insight
When multiple metrics are defined (e.g., CPU and memory), HPA computes the desired replicas for each metric independently and uses the MAX. This means if CPU says 'scale to 5' and memory says 'scale to 10', HPA scales to 10. This is the correct behavior for availability — the most constrained metric wins. However, it can cause unexpected over-provisioning if one metric is misconfigured or noisy. Always validate all metric sources independently.
Key Takeaway
HPA scales proportionally, not incrementally. Without behavior policies, a single metric spike can cause massive scale-up. Use behavior to cap aggressiveness, and remember: the MAX across metrics wins.
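The "MAX across metrics" rule can be sketched as a small extension of the proportional formula. This is an illustrative example only; the class name and the two-element array layout (current value, target value) are assumptions made for brevity.

```java
// Sketch: each metric proposes a replica count; the largest proposal wins.
class MultiMetricSketch {

    static int proposal(int currentReplicas, double currentValue, double targetValue) {
        return (int) Math.ceil(currentReplicas * (currentValue / targetValue));
    }

    /** metrics[i] = {currentValue, targetValue} for metric i. */
    static int desiredReplicas(int currentReplicas, double[][] metrics, int minReplicas, int maxReplicas) {
        int best = 0;
        for (double[] m : metrics) {
            best = Math.max(best, proposal(currentReplicas, m[0], m[1]));
        }
        return Math.max(minReplicas, Math.min(best, maxReplicas));
    }

    public static void main(String[] args) {
        double[][] metrics = {
            {70, 70},  // CPU: on target, proposes 5
            {160, 80}, // memory: twice the target, proposes 10
        };
        System.out.println(desiredReplicas(5, metrics, 3, 50)); // 10
    }
}
```

A single noisy metric is enough to drive the result, which is the over-provisioning risk described above.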

Stabilization Windows: Preventing Flapping

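Stabilization windows are the primary mechanism for preventing HPA flapping, the rapid oscillation between scale-up and scale-down. HPA records the desired replica count it computes on each loop. When scaling down, it acts on the highest recommendation seen within the scale-down window; when scaling up, it acts on the lowest recommendation seen within the scale-up window. A dip or spike shorter than the window therefore does not change the replica count.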

io/thecodeforge/kubernetes/hpa-stabilization.yaml (YAML)
# HPA with tuned stabilization windows
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    # Scale UP: Aggressive but dampened
    scaleUp:
      stabilizationWindowSeconds: 60    # Act on the lowest recommendation from the last 60s
      policies:
        - type: Percent
          value: 100                      # Can double replicas per period
          periodSeconds: 60
        - type: Pods
          value: 10                       # Or add 10 pods, whichever is more
          periodSeconds: 60
      selectPolicy: Max                   # Use the policy allowing most scaling
    # Scale DOWN: Conservative and slow
    scaleDown:
      stabilizationWindowSeconds: 600    # Act on the highest recommendation from the last 10 minutes
      policies:
        - type: Percent
          value: 10                       # Remove at most 10% per period
          periodSeconds: 120
      selectPolicy: Min                   # Use the most conservative policy
Output
HPA configured with asymmetric stabilization: fast scale-up, slow scale-down.
How Stabilization Windows Work
  • Scale-up window: 0s default. Set to 60-120s to dampen during deployments and cold starts.
  • Scale-down window: 300s default. Set to 300-600s for production stability.
  • Window only applies to the stabilization decision, not the scaling policy rate.
  • Not-yet-ready pods and pods with missing metrics are treated as using 0% of the target when scaling up, which dampens the step size.
Production Insight
The default scale-up stabilization window of 0 seconds is a common cause of HPA flapping in production. During a rolling update, new pods start with low CPU (cold start), which pulls the average down and triggers scale-down. Once old pods terminate, CPU spikes and triggers scale-up. The usual fix is to set scaleUp.stabilizationWindowSeconds to 60-120 so new pods can warm up before HPA reacts.
Key Takeaway
Stabilization windows are the dampening mechanism for HPA. Asymmetric windows (fast up, slow down) are the production standard. Never leave scale-up stabilization at 0 seconds in production.
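One way to picture the stabilization step is as a filter over recent recommendations. The sketch below assumes the behavior documented upstream (scale-down bounded by the highest recent recommendation, scale-up bounded by the lowest); the class, its fields, and the use of plain seconds as timestamps are hypothetical choices for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: stabilize a stream of replica recommendations with separate
// scale-up and scale-down windows. Timestamps are plain seconds.
class StabilizationSketch {

    private final Deque<long[]> history = new ArrayDeque<>(); // {timeSec, recommendedReplicas}
    private final long upWindowSec;
    private final long downWindowSec;

    StabilizationSketch(long upWindowSec, long downWindowSec) {
        this.upWindowSec = upWindowSec;
        this.downWindowSec = downWindowSec;
    }

    int stabilize(long nowSec, int currentReplicas, int recommendation) {
        history.addLast(new long[]{nowSec, recommendation});
        long longest = Math.max(upWindowSec, downWindowSec);
        while (!history.isEmpty() && history.peekFirst()[0] < nowSec - longest) {
            history.removeFirst(); // drop samples older than both windows
        }

        int upBound = recommendation;   // lowest recommendation in the scale-up window
        int downBound = recommendation; // highest recommendation in the scale-down window
        for (long[] h : history) {
            if (h[0] >= nowSec - upWindowSec) {
                upBound = Math.min(upBound, (int) h[1]);
            }
            if (h[0] >= nowSec - downWindowSec) {
                downBound = Math.max(downBound, (int) h[1]);
            }
        }

        int result = currentReplicas;
        if (result < upBound) result = upBound;     // scale up only as far as every recent sample agrees
        if (result > downBound) result = downBound; // scale down only as far as every recent sample agrees
        return result;
    }

    public static void main(String[] args) {
        StabilizationSketch hpa = new StabilizationSketch(0, 300);
        System.out.println(hpa.stabilize(0, 10, 10));   // 10
        System.out.println(hpa.stabilize(15, 10, 6));   // 10: the 10 from t=0 is still in the window
        System.out.println(hpa.stabilize(330, 10, 6));  // 6: older samples have expired
        System.out.println(hpa.stabilize(345, 6, 12));  // 12: scale-up window is 0s, so it applies at once
    }
}
```

Setting the first constructor argument to 60 or 120 makes the last call hold at the current count until the higher recommendation has persisted for the whole window.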

Custom and External Metrics: Beyond CPU and Memory

CPU and memory are often poor proxies for actual application load. A web server might be CPU-bound during image processing but network-bound during API calls. HPA supports custom metrics (per-pod metrics from Prometheus) and external metrics (cluster-external signals like SQS queue depth or Pub/Sub backlog) through the Kubernetes API aggregation layer.

io/thecodeforge/kubernetes/hpa-custom-metrics.yaml (YAML)
# HPA with custom metrics from Prometheus Adapter
# Requires: prometheus-adapter installed and configured
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 2
  maxReplicas: 40
  metrics:
    # Custom metric: requests per second from Prometheus
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"    # Scale when avg RPS per pod exceeds 100
    # External metric: SQS queue depth
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: "order-processing"
        target:
          type: AverageValue
          averageValue: "50"     # Scale when queue depth per pod exceeds 50
    # Keep CPU as fallback
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
Output
HPA configured with custom (RPS) and external (SQS) metrics plus CPU fallback.
Custom vs External Metrics
  • Custom metrics: Prometheus Adapter, Datadog, or Stackdriver adapter.
  • External metrics: Cloud-provider specific (AWS CloudWatch, GCP Stackdriver, Azure Monitor).
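  • Label selectors in the HPA metric spec must match the labels on the metric series exposed by the adapter.
  • If a metric API is unavailable, HPA can still scale up on the remaining metrics (such as CPU/memory), but it skips scale-down while any configured metric cannot be read.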
Production Insight
Custom metrics adapters are a single point of failure for HPA. If the Prometheus Adapter crashes or the Prometheus server is unreachable, HPA shows <unknown> for custom metrics and stops scaling on those metrics. Always include a CPU or memory metric as a fallback. Monitor the adapter's availability and latency — a slow adapter causes delayed scaling decisions. The adapter's --metrics-relist-interval (default 1m) controls how often it re-reads available Prometheus metrics. Set it lower if you add new metrics frequently.
Key Takeaway
Custom metrics make HPA application-aware, but they add a dependency chain: Prometheus -> Adapter -> HPA. Always include CPU/memory as a fallback metric. Monitor the adapter as critical infrastructure.
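What happens when one metric source is down can be sketched as a small decision function. This follows the behavior described in the upstream HPA documentation (scale-up still allowed, scale-down skipped when any metric is unreadable); the class name and the use of OptionalInt to model a failed metric are assumptions for the example.

```java
import java.util.List;
import java.util.OptionalInt;

// Sketch: combine per-metric proposals when some metrics could not be fetched.
// An empty OptionalInt stands for a metric whose API call failed.
class MetricFailureSketch {

    static int decide(int currentReplicas, List<OptionalInt> proposals) {
        boolean anyFailed = false;
        int best = -1;
        for (OptionalInt p : proposals) {
            if (p.isPresent()) {
                best = Math.max(best, p.getAsInt());
            } else {
                anyFailed = true;
            }
        }
        if (best < 0) {
            return currentReplicas; // every metric failed: do nothing
        }
        if (anyFailed && best < currentReplicas) {
            return currentReplicas; // never scale down on partial information
        }
        return best;
    }

    public static void main(String[] args) {
        // Custom metric unavailable, CPU proposes 14: scale-up still happens.
        System.out.println(decide(10, List.of(OptionalInt.empty(), OptionalInt.of(14)))); // 14
        // Custom metric unavailable, CPU proposes 6: scale-down is skipped.
        System.out.println(decide(10, List.of(OptionalInt.empty(), OptionalInt.of(6))));  // 10
    }
}
```

This is why a CPU or memory metric alongside custom metrics keeps scale-up working during an adapter outage, while replica counts will not shrink until the adapter recovers.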

Custom Metrics (Prometheus Adapter) Guide

The Prometheus Adapter is the most common way to expose application-level metrics to HPA. It implements the custom.metrics.k8s.io API and translates Prometheus queries into metric values that HPA can consume. Below is a complete guide to installing and configuring the adapter for production use.

Installation using Helm (recommended):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set rules.default=true
```

Configuration via ConfigMap: The adapter uses a series of rules that define which Prometheus series become custom metrics. Each rule specifies a seriesQuery (PromQL to find matching time series) and template transformations for pods and nodes.

```yaml
# ConfigMap for prometheus-adapter rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      # Per-second rate over 2 minutes, summed per resource (pod/namespace)
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

Testing custom metrics: Once the adapter is running, check the available metrics:

```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
```

This should return a list of metric names. Then verify a specific metric for a pod (the path is quoted so the shell does not expand the *):

```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second" | jq .
```

Common issues:
  • No metrics appear: The adapter cannot connect to Prometheus. Check prometheus.url and network policies.
  • <unknown> in HPA: The metric query returns no data. Verify the label selectors in HPA match the pod labels.
  • High latency: The PromQL query is too expensive. Use aggregation (<<.GroupBy>>) and limit time range.
  • Adapter crashes: Memory or CPU limit too low. Increase resources and add --logtostderr for debug logs.

io/thecodeforge/kubernetes/prometheus-adapter-values.yaml (YAML)
# Helm values for prometheus-adapter production setup
# Package: io.thecodeforge.kubernetes
prometheus:
  url: http://prometheus.monitoring.svc
  port: 9090
rules:
  default: false  # Disable default rules to avoid namespace/pod conflicts
  custom:
  - seriesQuery: 'http_requests_total'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
  - seriesQuery: 'nginx_ingress_controller_requests_total'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        service: {resource: "service"}
        ingress: {resource: "ingress"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
resources:
  requests:
    cpu: 200m
    memory: 300Mi
  limits:
    cpu: 500m
    memory: 500Mi
Output
Prometheus Adapter configured with custom HTTP request metric and Nginx ingress metric.
RBAC for Custom Metrics
The HPA controller must have permissions to access the custom.metrics.k8s.io API. In most clusters, the default ClusterRole system:controller:horizontal-pod-autoscaler already includes this, but if you are using a custom RBAC setup, ensure the binding exists. Without it, HPA will show 'unauthorized' errors.
Production Insight
Prometheus Adapter is a critical link in the custom metrics chain. Monitor its endpoint (/metrics) for request latency and error rates. A common mistake is to expose raw metrics without proper aggregation, causing the adapter to return large data sets that overwhelm HPA. Use sum and rate in the metricsQuery to reduce cardinality. Also run the adapter with at least 2 replicas for high availability, because its failure silently breaks all custom metrics scaling.
Key Takeaway
Prometheus Adapter translates Prometheus metrics into the Kubernetes custom metrics API. Proper configuration of seriesQuery and metricsQuery is essential. Always test with kubectl get --raw and include fallback resource metrics in HPA.
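The name rule and the rate calculation are easy to check by hand. The sketch below mirrors the rule's regex in Java and computes a plain two-sample rate; it is not the adapter's implementation. Java writes the capture group as $1 where the adapter config writes ${1}, and Prometheus's rate() additionally handles counter resets and extrapolation, which this example ignores. The class and method names are hypothetical.

```java
// Sketch: what the rule 'matches: "^(.*)_total$" / as: "${1}_per_second"' does to a
// series name, plus a naive per-second rate from two counter samples.
class AdapterRuleSketch {

    /** Rename a counter series the way the rule above would. */
    static String exposedName(String seriesName) {
        return seriesName.replaceAll("^(.*)_total$", "$1_per_second");
    }

    /** Naive rate: counter increase divided by elapsed seconds. */
    static double ratePerSecond(double firstSample, double lastSample, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return (lastSample - firstSample) / elapsedSeconds;
    }

    public static void main(String[] args) {
        System.out.println(exposedName("http_requests_total")); // http_requests_per_second
        // Counter went from 1000 to 13000 over a 2-minute window
        System.out.println(ratePerSecond(1000, 13000, 120));    // 100.0
    }
}
```

The exposed name is what the HPA references under metric.name, so a typo in the regex shows up as an <unknown> metric rather than an adapter error.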

HPA, VPA, and Cluster Autoscaler: The Scaling Stack

HPA, VPA (Vertical Pod Autoscaler), and Cluster Autoscaler are complementary but interact in non-obvious ways. HPA scales horizontally (more pods). VPA scales vertically (bigger pods). Cluster Autoscaler scales infrastructure (more nodes). Using them together requires careful configuration to avoid conflicts.

io/thecodeforge/kubernetes/hpa-vpa-coexistence.yaml (YAML)
# VPA in 'Off' mode: Recommends resource requests but does not apply them
# This avoids conflict with an HPA that scales on CPU/memory utilization, which is computed from those requests
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"      # CRITICAL: Do not use 'Auto' with HPA on CPU/memory
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 4Gi
        controlledResources: ["cpu", "memory"]
---
# Cluster Autoscaler: Scales nodes based on pending pods
# No YAML needed — it watches for unschedulable pods automatically
# Key flags:
# --scale-down-delay-after-add=10m
# --scale-down-unneeded-time=10m
# --max-node-provision-time=15m
Output
VPA configured in recommendation-only mode. Cluster Autoscaler watches for unschedulable pods.
The Conflict: HPA vs VPA on CPU/Memory
  • HPA on CPU + VPA on memory: Safe. They operate on different metrics.
  • HPA on custom metrics + VPA on CPU/memory: Safe. HPA ignores resource metrics.
  • HPA on CPU + VPA on CPU (Auto mode): Dangerous. Feedback loop.
  • Best practice: HPA for scaling, VPA in 'Off' mode for right-sizing recommendations.
Production Insight
The Cluster Autoscaler has a --scale-down-delay-after-add flag (default 10 minutes) that pauses scale-down evaluation after it adds a node. Without this delay, it could remove nodes while HPA is still creating pods, causing scheduling failures. Conversely, if the Cluster Autoscaler is too slow to add nodes, HPA's desired replicas may exceed available capacity, leaving pods in Pending state. Monitor for pods stuck in Pending with reason: Unschedulable. This indicates the Cluster Autoscaler cannot provision nodes fast enough or has hit a node-group maximum or its --max-nodes-total limit.
Key Takeaway
HPA, VPA, and Cluster Autoscaler form a three-layer scaling stack. The key rule: never let HPA and VPA auto-scale on the same metric. Use VPA in 'Off' mode for right-sizing, HPA for horizontal scaling, and Cluster Autoscaler for node provisioning.
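The HPA/VPA feedback loop comes down to one division: utilization is usage over request, and VPA changes the request. A minimal sketch of the arithmetic, with hypothetical names and made-up numbers chosen only to show the effect:

```java
// Sketch: the same CPU usage yields different HPA decisions once VPA
// changes the request that utilization is measured against.
class HpaVpaInteractionSketch {

    static double utilizationPercent(double usageMilli, double requestMilli) {
        return 100.0 * usageMilli / requestMilli;
    }

    static int desiredReplicas(int currentReplicas, double utilization, double targetUtilization) {
        return (int) Math.ceil(currentReplicas * utilization / targetUtilization);
    }

    public static void main(String[] args) {
        int replicas = 10;
        double usage = 400;  // millicores actually used per pod
        double target = 70;  // HPA averageUtilization

        double before = utilizationPercent(usage, 500); // request 500m -> 80%
        System.out.println(desiredReplicas(replicas, before, target)); // 12

        double after = utilizationPercent(usage, 800);  // VPA raises request to 800m -> 50%
        System.out.println(desiredReplicas(replicas, after, target));  // 8
    }
}
```

Load did not change between the two calls, yet HPA moves from wanting 12 replicas to wanting 8. With fewer pods, per-pod usage rises, VPA raises requests again, and the loop continues.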

VPA vs HPA vs Cluster Autoscaler Comparison Table

Understanding the differences and interactions between the three Kubernetes scaling components is critical for designing a robust autoscaling strategy. Below is a detailed comparison table.

| Feature | HPA (Horizontal Pod Autoscaler) | VPA (Vertical Pod Autoscaler) | Cluster Autoscaler |
| --- | --- | --- | --- |
| What it scales | Number of pod replicas | CPU/memory requests per pod | Number of cluster nodes |
| Metric source | CPU, memory, custom (Prometheus), external (cloud) | Historical resource usage (recommendations) | Pending pods (unschedulable) |
| Scaling direction | Scale out (increase replicas) / Scale in (decrease replicas) | Scale up (increase requests) / Scale down (decrease requests) | Scale up (add nodes) / Scale down (remove empty nodes) |
| Scale to zero | No (minReplicas >= 1) | No | Yes (empty nodes removed) |
| Conflict with HPA? | N/A | Yes, if both operate on CPU/memory. Use VPA in 'Off' mode. | No, complementary. |
| Best use case | Stateless microservices with variable traffic | Stateful applications, databases, batch jobs | Cost optimization for fluctuating cluster demand |
| Latency impact | Fast (within 15s loop) | Slow (minutes to hours for recommendations) | Slow (node provisioning takes 2-10 min) |
| Configuration complexity | Low (basic CPU) to medium (custom metrics) | Medium (need historical data) | Medium (cloud provider integration) |

When to combine:
  • HPA + Cluster Autoscaler: The most common pair. HPA adds pods; Cluster Autoscaler adds nodes when pods can't be scheduled.
  • VPA (Off mode) + HPA: Safe. VPA provides recommendations for right-sizing requests; HPA handles actual scaling.
  • VPA (Auto mode) alone: Works for single-instance workloads but cannot horizontally scale.

Production pitfalls:
  • Running VPA in Auto mode with HPA on the same metrics causes oscillation (see callout in the scaling stack section).
  • Cluster Autoscaler may conflict with HPA if scale-down is too aggressive: pods removed by HPA leave nodes underutilized, those nodes are removed, and the next scale-up leaves new pods Pending until capacity returns.
  • Total scaling delay = HPA reaction time + Cluster Autoscaler provisioning time. For bursty traffic, pre-provision nodes or use KEDA with scale-to-zero.

The Three-Layer Scaling Model
Think of scaling as a three-layer stack: HPA handles pod count, VPA handles pod size, and Cluster Autoscaler handles cluster capacity. Each layer operates on a different resource (replicas, requests, nodes) and has different latency profiles. Understanding this stack prevents conflicts and ensures predictable scaling behavior.
Production Insight
A common mistake is to set aggressive HPA scale-down policies that remove pods faster than the Cluster Autoscaler can react. Nodes then sit underutilized until the Cluster Autoscaler's scale-down timers expire, increasing costs. Align HPA scale-down periodSeconds with Cluster Autoscaler's --scale-down-unneeded-time (default 10 minutes). For example, set HPA scale-down stabilization window to at least 600s and periodSeconds to 120s to prevent rapid pod churn.
Key Takeaway
HPA, VPA, and Cluster Autoscaler are complementary but must be configured with awareness of each other's behavior. Use the comparison table to decide which combination fits your workload.

KEDA: Event-Driven Autoscaling Beyond HPA

KEDA (Kubernetes Event-driven Autoscaling) extends HPA by enabling scale-to-zero and supporting a wider range of event sources (message queues, databases, cron schedules). KEDA works alongside HPA: it creates and manages HPA resources internally and serves their metrics, while providing a simpler API and more scaler options.

io/thecodeforge/kubernetes/keda-scaledobject.yaml (YAML)
# KEDA ScaledObject: Scale based on SQS queue depth with scale-to-zero
# Package: io.thecodeforge.kubernetes
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0          # Scale to zero when no messages
  maxReplicaCount: 50
  pollingInterval: 15         # Check metrics every 15 seconds
  cooldownPeriod: 300         # Wait 5 minutes before scaling to zero
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: aws-credentials
      metadata:
        queueURL: "https://sqs.us-east-1.amazonaws.com/123456789/orders"
        queueLength: "5"        # 1 pod per 5 messages
        awsRegion: "us-east-1"
    - type: cron
      metadata:
        timezone: America/New_York
        start: "0 8 * * *"      # Pre-scale at 8 AM
        end: "0 20 * * *"       # End pre-scale at 8 PM
        desiredReplicas: "10"
Output
KEDA configured for SQS-based scaling with scale-to-zero and cron pre-scaling.
Why KEDA Instead of Native HPA?
  • Scale-to-zero: KEDA sets minReplicas to 0 and manages the 0->1 transition.
  • 50+ scalers: SQS, Kafka, RabbitMQ, Prometheus, PostgreSQL, cron, and more.
  • HPA under the hood: KEDA creates an HPA for each ScaledObject and feeds it through the external metrics API.
  • Cooldown period: Prevents rapid scale-to-zero when a batch temporarily drains the queue.
Production Insight
KEDA's scale-to-zero introduces a cold-start latency problem: the first pod must start from zero, which can take 30-120 seconds depending on the container image and initialization. For latency-sensitive services, use KEDA's cron scaler to pre-scale during known traffic windows. Monitor the ScaledObject status for Active: false — this means KEDA has scaled to zero and is waiting for events. If events arrive but pods are slow to start, increase cooldownPeriod to prevent premature scale-to-zero between message batches.
Key Takeaway
KEDA extends HPA with event-driven scaling and scale-to-zero. Use it for queue-based workloads and batch processing. The trade-off is cold-start latency when scaling from zero. Pre-scale with cron triggers for latency-sensitive paths.
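For a queue trigger such as queueLength: "5", the replica math is a ceiling division clamped to the configured bounds. The sketch below shows that arithmetic only. It assumes the queue depth is the sole trigger and ignores cooldownPeriod and the separate 0-to-1 activation step that KEDA manages; the class and method names are hypothetical.

```java
// Sketch: replicas needed so that each pod handles at most `perPodTarget` messages.
class QueueScalingSketch {

    static int desiredReplicas(long queuedMessages, long perPodTarget, int minReplicas, int maxReplicas) {
        if (perPodTarget <= 0) {
            throw new IllegalArgumentException("perPodTarget must be positive");
        }
        long raw = (queuedMessages + perPodTarget - 1) / perPodTarget; // integer ceiling
        return (int) Math.max(minReplicas, Math.min(raw, maxReplicas));
    }

    public static void main(String[] args) {
        System.out.println(desiredReplicas(23, 5, 0, 50));   // 5
        System.out.println(desiredReplicas(0, 5, 0, 50));    // 0: empty queue, scale to zero
        System.out.println(desiredReplicas(1000, 5, 0, 50)); // 50: capped at maxReplicaCount
    }
}
```

A small queueLength makes the workload very sensitive to bursts: 1,000 queued messages already saturates maxReplicaCount in this configuration.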
Production incident · Post-mortem · Severity: high

HPA Flapping: Pods Scaling Up and Down Every 15 Seconds

Symptom
Deployment replica count oscillating rapidly. Pods in Terminating and ContainerCreating states simultaneously. Customer-facing latency spikes correlated with scaling events. HPA events showed alternating 'scaled up' and 'scaled down' messages every 30-60 seconds.
Assumption
Traffic was genuinely spiky, causing real demand fluctuations.
Root cause
The HPA was configured with a 50% CPU target and a 60-second stabilization window for scale-down, well below the 300-second default. The scale-up policy was set to pods: 4 with periodSeconds: 15, meaning HPA could add 4 pods every 15 seconds. The new pods started with low CPU (cold start), which pulled the average below the 50% target and triggered scale-down as soon as the short window elapsed. Once pods terminated, per-pod CPU spiked again, triggering scale-up. The cycle repeated indefinitely. The core issue: the scale-up stabilization window was 0 seconds (the default), so HPA reacted immediately to every metric change without dampening.
Fix
1. Set behavior.scaleUp.stabilizationWindowSeconds: 120 to prevent rapid scale-up during pod initialization.
2. Changed behavior.scaleUp.policies from pods: 4 to percent: 50 for proportional scaling.
3. Set behavior.scaleDown.stabilizationWindowSeconds: 300 to be conservative on scale-down.
4. Added a readiness probe with initialDelaySeconds: 30 so pods were not counted in metrics until fully warm.
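The first three changes in the fix can be sketched as a single autoscaling/v2 manifest. This is an illustration under stated assumptions, not the manifest from the incident: the name orders-api, the replica bounds, and the 60-second policy period are hypothetical, since the post-mortem does not state them.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api                 # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api               # hypothetical Deployment
  minReplicas: 2                   # assumption: not given in the post-mortem
  maxReplicas: 20                  # assumption: not given in the post-mortem
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # the 50% CPU target from the incident
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # dampen reactions while new pods warm up
      policies:
        - type: Percent
          value: 50                # grow by at most 50% of current replicas...
          periodSeconds: 60        # ...per 60 seconds (period is an assumption)
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before shrinking
```

A percentage policy scales the step size with the deployment: at 4 replicas it adds at most 2 pods per period, where the old pods: 4 policy could double the deployment every 15 seconds.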
Key lesson
  • In this incident, flapping came from an aggressive scale-up policy with no stabilization window, paired with a short scale-down window. Both directions need deliberate dampening.
  • Cold-start pods with low CPU skew the average metric and cause premature scale-down.
  • Always set stabilizationWindowSeconds for both scale-up and scale-down.
  • Readiness probes gate when a pod is counted in the HPA metric calculation. Delay readiness until the pod has finished warming up.
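The readiness gating described in the last lesson lives on the Deployment, not on the HPA. A minimal sketch, assuming an HTTP health endpoint; the image, the /healthz path, port 8080, and the request sizes are all hypothetical placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api                 # hypothetical name
spec:
  # replicas is intentionally omitted so the HPA owns the replica count
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: app
          image: registry.example.com/orders-api:1.0.0   # hypothetical image
          resources:
            requests:
              cpu: 250m            # required: CPU utilization is measured against this
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /healthz       # hypothetical health endpoint
              port: 8080           # hypothetical port
            initialDelaySeconds: 30   # hold readiness back during warm-up
            periodSeconds: 5
```

The CPU request matters as much as the probe: utilization targets are a percentage of the request, so a container without one gives HPA nothing to divide by.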
Production debug guide · Symptom-first investigation path for HPA misbehavior · 5 entries
Symptom · 01
HPA shows <unknown> for metric values.
Fix
Check that metrics-server is running and healthy. Verify that the target deployment has CPU/memory requests set. Without requests, HPA cannot calculate utilization percentages.
Symptom · 02
HPA is not scaling up despite high load.
Fix
Check HPA events (kubectl describe hpa). Look for 'failed to get cpu utilization' or 'unable to fetch metrics'. Verify metrics-server is collecting data. Check if maxReplicas has been reached.
Symptom · 03
HPA scales up but never scales down.
Fix
Check behavior.scaleDown policy. Default scale-down stabilization window is 300 seconds (5 minutes). Verify the metric is actually below the target after stabilization. Check if minReplicas has been reached.
Symptom · 04
HPA flapping: rapid scale up and down oscillations.
Fix
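Add stabilization windows to both scale-up and scale-down. Increase periodSeconds in the scaling policies so each policy permits fewer replica changes per unit of time (this does not change how often HPA polls metrics). Check if cold-start pods are skewing metrics.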
Symptom · 05
Custom metrics not appearing in HPA.
Fix
Verify the Prometheus Adapter or custom.metrics.k8s.io API is registered (kubectl get apiservice). Check adapter configuration for the metric query. Ensure the metric label selectors match the deployment's pods.
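For Symptom 05, the adapter configuration is usually where the metric query goes wrong. A sketch of a single Prometheus Adapter rule, assuming a hypothetical counter named http_requests_total that carries namespace and pod labels; the metric name and the 2-minute rate window are illustrative choices.

```yaml
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'   # hypothetical metric
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map the Prometheus label to the Kubernetes resource
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"                  # exposed as http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```

If the series has no pod or namespace label, the adapter cannot associate it with the deployment's pods and the HPA will not see the metric.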
★ HPA Triage Commands · Rapid commands to isolate HPA scaling issues.
HPA showing unknown or missing metrics.
Immediate action
Check metrics-server and container resource requests.
Commands
kubectl top pods -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>
Fix now
If kubectl top fails, restart metrics-server. If HPA shows 'missing request for cpu', add resource requests to the deployment.
HPA not scaling despite load.
Immediate action
Check HPA status, events, and maxReplicas.
Commands
kubectl get hpa <hpa-name> -n <namespace> -o yaml | grep -A 20 status
kubectl get events -n <namespace> --field-selector involvedObject.name=<hpa-name>
Fix now
If currentReplicas equals maxReplicas, increase maxReplicas. If events show metric errors, check metrics-server.
HPA flapping (rapid scale up/down).
Immediate action
Check stabilization windows and scaling policies.
Commands
kubectl get hpa <hpa-name> -n <namespace> -o jsonpath='{.spec.behavior}'
kubectl get hpa <hpa-name> -n <namespace> -o jsonpath='{.status.lastScaleTime}'
Fix now
Add stabilizationWindowSeconds: 120 to scaleUp and 300 to scaleDown. Reduce scale-up aggressiveness.
Custom metrics not feeding HPA.
Immediate action
Check custom metrics API registration and adapter logs.
Commands
kubectl get apiservice v1beta1.custom.metrics.k8s.io -o yaml
kubectl logs -n custom-metrics deploy/prometheus-adapter
Fix now
If API service shows 'MissingEndpoints', restart the adapter. If adapter logs show query errors, fix the Prometheus query in the adapter config.
HPA vs VPA vs Cluster Autoscaler vs KEDA
Component | Scales | Trigger | Scale to Zero | Best For
HPA | Pod replicas (horizontal) | CPU, memory, custom metrics | No (minReplicas >= 1) | Stateless web services, APIs with variable traffic
VPA | Pod resource requests (vertical) | Historical resource usage | No | Stateful workloads, databases, single-instance services
Cluster Autoscaler | Nodes (infrastructure) | Pending unschedulable pods | Yes (scale down empty nodes) | Cost optimization, burst capacity
KEDA | Pod replicas (horizontal) | Event sources (queues, DBs, cron, etc.) | Yes | Queue consumers, batch jobs, event-driven architectures

Key takeaways

1
HPA is a proportional control loop, not a binary threshold trigger. It scales based on the ratio of current to target metric: 4 replicas at 90% CPU against a 50% target yields ceil(4 * 90 / 50) = 8 replicas.
2
Stabilization windows are the dampening mechanism. Asymmetric windows (fast up, slow down) are the production standard.
3
Custom metrics make HPA application-aware but add a dependency chain. Always include CPU/memory as fallback.
4
Never let HPA and VPA auto-scale on the same metric. Use VPA in 'Off' mode for right-sizing recommendations.
5
KEDA extends HPA with event-driven scaling and scale-to-zero. Use it for queue-based workloads.
6
Test HPA behavior under load before production. The scaling algorithm, stabilization windows, and Cluster Autoscaler interactions all need validation.
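Takeaway 4 can be sketched as a VPA object in recommendation-only mode. This assumes the VPA components are installed in the cluster; the names are hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: orders-api-vpa             # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api               # hypothetical Deployment, also targeted by the HPA
  updatePolicy:
    updateMode: "Off"              # compute recommendations only; never evict or resize pods
```

With updateMode set to "Off", the recommendations appear in the object's status (for example via kubectl describe vpa) and can be applied to the Deployment's requests by hand, so the VPA never competes with the HPA.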
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How does HPA calculate the desired number of replicas?
02
Why is my HPA showing '<unknown>' for metrics?
03
Can HPA and VPA work together?
04
What is KEDA and when should I use it instead of HPA?
05
How do I prevent HPA flapping?