Kubernetes HPA Deep Dive: Autoscaling Internals, Gotchas & Production Tuning
- HPA is a proportional control loop, not a binary threshold trigger. It scales based on the ratio of current to target metric.
- Stabilization windows are the dampening mechanism. Asymmetric windows (fast up, slow down) are the production standard.
- Custom metrics make HPA application-aware but add a dependency chain. Always include CPU/memory as fallback.
- HPA runs a control loop every 15 seconds (default) that reads metrics, computes desired replicas, and scales.
- Algorithm: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
- Supports CPU, memory, custom metrics (Prometheus), and external metrics (cloud provider queues).
- Scaling behavior is configurable via the behavior field: separate policies for scale-up and scale-down.
- Faster polling = more responsive but higher API server and metrics-server load.
- Aggressive scale-up = risk of over-provisioning and cluster resource exhaustion.
- Conservative scale-down = cost waste but stability during traffic dips.
- Setting target CPU to 80% without understanding that CPU requests must be set on the container. Without requests, HPA has no denominator and will not function.
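The "no denominator" point can be made concrete with a short sketch. This is illustrative code, not the HPA controller source; the class and method names are mine. Utilization is total usage divided by total requests, so with no requests set the percentage is undefined:

```java
// Illustrative sketch of how HPA derives average CPU utilization.
public class UtilizationSketch {

    /**
     * Average utilization = (sum of pod CPU usage) / (sum of pod CPU requests) * 100.
     * If requests are unset, the denominator is zero and HPA reports
     * the metric as <unknown> instead of a percentage.
     */
    public static int averageUtilizationPercent(long totalUsageMillicores,
                                                long totalRequestsMillicores) {
        if (totalRequestsMillicores == 0) {
            throw new IllegalStateException(
                "CPU requests not set: HPA cannot compute utilization");
        }
        return (int) Math.round(100.0 * totalUsageMillicores / totalRequestsMillicores);
    }

    public static void main(String[] args) {
        // 3 pods using 400m each (1200m total) against 500m requests each (1500m total)
        System.out.println(averageUtilizationPercent(1200, 1500)); // 80
    }
}
```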
- Symptom: HPA showing unknown or missing metrics.
  - `kubectl top pods -n <namespace>`
  - `kubectl describe hpa <hpa-name> -n <namespace>`
- Symptom: HPA not scaling despite load.
  - `kubectl get hpa <hpa-name> -n <namespace> -o yaml | grep -A 20 status`
  - `kubectl get events -n <namespace> --field-selector involvedObject.name=<hpa-name>`
- Symptom: HPA flapping (rapid scale up/down).
  - `kubectl get hpa <hpa-name> -o jsonpath='{.spec.behavior}'`
  - `kubectl get hpa <hpa-name> -o jsonpath='{.status.lastScaleTime}'`
- Symptom: Custom metrics not feeding HPA.
  - `kubectl get apiservice v1beta1.custom.metrics.k8s.io -o yaml`
  - `kubectl logs -n custom-metrics deploy/prometheus-adapter`
Production Incident
The scale-up policy was pods: 4 with periodSeconds: 15, meaning HPA could add 4 pods every 15 seconds. The new pods started with low CPU (cold start), which pulled the average below the 50% threshold, triggering scale-down. Once pods terminated, CPU spiked again, triggering scale-up. The cycle repeated indefinitely. The core issue: the scale-up stabilization window was 0 seconds (the default), so HPA reacted immediately to every metric change without dampening.
The fix:
1. Set behavior.scaleUp.stabilizationWindowSeconds: 120 to prevent rapid scale-up during pod initialization.
2. Changed behavior.scaleUp.policies from pods: 4 to percent: 50 for proportional scaling.
3. Set behavior.scaleDown.stabilizationWindowSeconds: 300 to be conservative on scale-down.
4. Added a readiness probe with initialDelaySeconds: 30 so pods were not counted in metrics until fully warm.
Key lessons: configure stabilizationWindowSeconds for both scale-up and scale-down, and remember that readiness probes gate when a pod is counted in the HPA metric calculation; delay readiness until warm-up completes.
Production Debug Guide
A symptom-first investigation path for HPA misbehavior.
- HPA reports <unknown> for metric values. → Check that metrics-server is running and healthy. Verify that the target deployment has CPU/memory requests set. Without requests, HPA cannot calculate utilization percentages.
- HPA not scaling up despite load. → Check the HPA conditions (kubectl describe hpa). Look for 'failed to get cpu utilization' or 'unable to fetch metrics'. Verify metrics-server is collecting data. Check if maxReplicas has been reached.
- HPA not scaling down. → Check the behavior.scaleDown policy. Default scale-down stabilization window is 300 seconds (5 minutes). Verify the metric is actually below the target after stabilization. Check if minReplicas has been reached.
- HPA flapping. → Tune stabilization windows and policy periodSeconds to reduce polling frequency. Check if cold-start pods are skewing metrics.
- Custom metrics not feeding HPA. → Verify the metrics adapter APIService is available (kubectl get apiservice). Check adapter configuration for the metric query. Ensure the metric label selectors match the deployment's pods.

Every production system eventually hits the same wall: traffic is unpredictable, and over-provisioning is expensive while under-provisioning is catastrophic. A Black Friday spike, a viral tweet, a nightly batch job — any of these can kneecap a statically-sized deployment in minutes.
Kubernetes Horizontal Pod Autoscaler (HPA) solves the reactive scaling problem by continuously watching resource metrics and adjusting pod replica counts to match demand. But the naive 'just set CPU threshold to 80%' approach breaks in subtle and painful ways in production — flapping deployments, ignored metrics, race conditions with the Cluster Autoscaler, and custom metrics that silently stop working.
This is not a getting-started guide. It is for engineers who need to understand the HPA algorithm at the formula level, how stabilization windows prevent flapping, how to wire up custom metrics via Prometheus and KEDA, how HPA interacts with VPA and Cluster Autoscaler, and the production mistakes that wake senior engineers at 3am.
What is Kubernetes HPA — Autoscaling?
Kubernetes HPA (Horizontal Pod Autoscaler) is a control loop that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. It is the primary mechanism for reactive horizontal scaling in Kubernetes. HPA does not add nodes — it adds pods. If pods cannot be scheduled due to insufficient node capacity, the Cluster Autoscaler is responsible for adding nodes.
```yaml
# Basic HPA with CPU utilization target
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 120
```
HPA writes the computed value to the replicas field on the Deployment, which triggers the ReplicaSet controller to reconcile.
- Loop interval: 15s default, configurable via controller flag.
- Metric source: metrics-server for CPU/memory, custom.metrics.k8s.io for Prometheus, external.metrics.k8s.io for cloud metrics.
- Stabilization: HPA keeps a rolling history of desired-replica recommendations and applies the most conservative one within the window: the minimum for scale-up, the maximum for scale-down.
- Cooldown: There is no explicit cooldown. Stabilization windows serve as the dampening mechanism.
desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]. If a pod has no metric (e.g., not yet ready), HPA handles it conservatively: the pod is assumed to use 0% of the target for scale-up calculations and 100% for scale-down. This dampens scaling in both directions and prevents premature scale-down during deployments. However, it also means that during a rolling update, HPA may over-provision because old terminating pods and new unready pods both count as 'missing' metrics.

The HPA Algorithm: How Desired Replicas Are Calculated
The HPA algorithm is proportional, not binary. It does not simply 'add one pod' when a threshold is breached. Instead, it calculates the ratio of current metric to target metric and scales proportionally. This means high load causes rapid scale-up (doubling or more), while moderate load causes gradual adjustments.
```java
// Simplified HPA algorithm implementation
// Package: io.thecodeforge.kubernetes.autoscaling
package io.thecodeforge.kubernetes.autoscaling;

public class HpaAlgorithm {

    /**
     * Core HPA formula:
     * desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
     *
     * For multiple metrics, HPA computes the max across all metric sources.
     */
    public static int calculateDesiredReplicas(
            int currentReplicas,
            long currentMetricValue,
            long targetMetricValue,
            int minReplicas,
            int maxReplicas) {
        if (targetMetricValue == 0) {
            throw new IllegalArgumentException("Target metric value cannot be zero");
        }
        double ratio = (double) currentMetricValue / targetMetricValue;
        int desired = (int) Math.ceil(currentReplicas * ratio);
        // Clamp to min/max bounds
        return Math.max(minReplicas, Math.min(desired, maxReplicas));
    }

    public static void main(String[] args) {
        // Example: 10 replicas, CPU at 1500m, target 1000m
        // desired = ceil(10 * (1500/1000)) = ceil(15) = 15
        int desired = calculateDesiredReplicas(10, 1500, 1000, 3, 50);
        System.out.println("Desired replicas: " + desired); // 15

        // Example: 10 replicas, CPU at 500m, target 1000m
        // desired = ceil(10 * (500/1000)) = ceil(5) = 5
        int scaledDown = calculateDesiredReplicas(10, 500, 1000, 3, 50);
        System.out.println("Scaled down replicas: " + scaledDown); // 5
    }
}
```
Output:
Desired replicas: 15
Scaled down replicas: 5
- pods: 4 policy: can add at most 4 pods per period.
- percent: 50 policy: can add at most 50% of current replicas per period.
- Multiple policies: HPA uses the policy that allows the most scaling (max).
- For scale-down: same logic applies but in reverse. HPA uses the min across policies for safety.
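The policy-selection rules above can be sketched in a few lines. This is an illustrative model (the helper names are mine, not the controller's), assuming each policy produces a per-period cap on the replica count and selectPolicy picks among those caps:

```java
// Illustrative sketch of HPA scaling-policy selection; not the controller source.
public class PolicySelectionSketch {

    // Cap from a "Pods" policy: at most N pods added per period.
    public static int podsPolicyLimit(int currentReplicas, int pods) {
        return currentReplicas + pods;
    }

    // Cap from a "Percent" policy: at most N% of current replicas added per period.
    public static int percentPolicyLimit(int currentReplicas, int percent) {
        return currentReplicas + (int) Math.ceil(currentReplicas * percent / 100.0);
    }

    public static void main(String[] args) {
        int current = 10;
        int byPods = podsPolicyLimit(current, 4);        // 14
        int byPercent = percentPolicyLimit(current, 50); // 15
        // selectPolicy: Max -> the policy allowing the most scaling wins
        System.out.println(Math.max(byPods, byPercent)); // 15
        // selectPolicy: Min -> the most conservative policy wins
        System.out.println(Math.min(byPods, byPercent)); // 14
    }
}
```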
Use behavior to cap aggressiveness, and remember: when multiple metrics are configured, the MAX across metrics wins.

Stabilization Windows: Preventing Flapping
Stabilization windows are the primary mechanism for preventing HPA flapping: the rapid oscillation between scale-up and scale-down. During the stabilization window, HPA keeps the desired-replica recommendation from each recent control-loop run and acts only on the most conservative one: the minimum recommendation when scaling up, the maximum when scaling down.
```yaml
# HPA with tuned stabilization windows
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
  behavior:
    # Scale UP: Aggressive but dampened
    scaleUp:
      stabilizationWindowSeconds: 60   # Use the lowest recommendation from the last 60s
      policies:
      - type: Percent
        value: 100                     # Can double replicas per period
        periodSeconds: 60
      - type: Pods
        value: 10                      # Or add 10 pods, whichever is more
        periodSeconds: 60
      selectPolicy: Max                # Use the policy allowing most scaling
    # Scale DOWN: Conservative and slow
    scaleDown:
      stabilizationWindowSeconds: 600  # Use the highest recommendation from the last 10 minutes
      policies:
      - type: Percent
        value: 10                      # Remove at most 10% per period
        periodSeconds: 120
      selectPolicy: Min                # Use the most conservative policy
```
- Scale-up window: 0s default. Set to 60-120s to dampen during deployments and cold starts.
- Scale-down window: 300s default. Set to 300-600s for production stability.
- Window only applies to the stabilization decision, not the scaling policy rate.
- Pods in CrashLoopBackOff or not yet ready are handled conservatively: treated as using 0% of target for scale-up calculations and 100% for scale-down.
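The Kubernetes docs describe stabilization as acting on the most conservative recent recommendation: the lowest for scale-up, the highest for scale-down. A minimal sketch of that idea, with illustrative names (not the controller source):

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch of stabilization-window dampening.
// Assumption: HPA keeps the desired-replica recommendation from each control
// loop inside the window and applies the most conservative one.
public class StabilizationSketch {

    // Scale-up: act on the LOWEST recommendation in the window,
    // so one noisy spike cannot trigger an immediate scale-up.
    public static int stabilizedScaleUp(List<Integer> recommendationsInWindow) {
        return Collections.min(recommendationsInWindow);
    }

    // Scale-down: act on the HIGHEST recommendation in the window,
    // so a brief dip cannot trigger an immediate scale-down.
    public static int stabilizedScaleDown(List<Integer> recommendationsInWindow) {
        return Collections.max(recommendationsInWindow);
    }

    public static void main(String[] args) {
        // Recommendations from the last few 15s loops: a spike to 12, then back down.
        List<Integer> window = List.of(5, 12, 6, 5);
        System.out.println(stabilizedScaleUp(window));   // 5  (spike ignored)
        System.out.println(stabilizedScaleDown(window)); // 12 (dip ignored)
    }
}
```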
Set scaleUp.stabilizationWindowSeconds to 60-120 to let new pods warm up before HPA reacts.

Custom and External Metrics: Beyond CPU and Memory
CPU and memory are often poor proxies for actual application load. A web server might be CPU-bound during image processing but network-bound during API calls. HPA supports custom metrics (per-pod metrics from Prometheus) and external metrics (cluster-external signals like SQS queue depth or Pub/Sub backlog) through the Kubernetes API aggregation layer.
```yaml
# HPA with custom metrics from Prometheus Adapter
# Requires: prometheus-adapter installed and configured
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 2
  maxReplicas: 40
  metrics:
  # Custom metric: requests per second from Prometheus
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"   # Scale when avg RPS per pod exceeds 100
  # External metric: SQS queue depth
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: "order-processing"
      target:
        type: AverageValue
        averageValue: "50"    # Scale when queue depth per pod exceeds 50
  # Keep CPU as fallback
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
```
- Custom metrics: Prometheus Adapter, Datadog, or Stackdriver adapter.
- External metrics: Cloud-provider specific (AWS CloudWatch, GCP Stackdriver, Azure Monitor).
- Label selectors must match the metric's labels to the target pods.
- If the metric API is unavailable, HPA falls back to CPU/memory (if configured).
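For a Pods-type metric with an AverageValue target (like http_requests_per_second above), the standard formula reduces neatly: since the average is the sum across pods divided by the replica count, the desired count is just the total divided by the target. A sketch under that assumption (class and method names are mine):

```java
// Illustrative sketch of the AverageValue calculation for a Pods-type metric.
public class AverageValueSketch {

    /**
     * desired = ceil(currentReplicas * (average / target))
     *         = ceil(sum(metric) / target)
     * because average = sum(metric) / currentReplicas.
     */
    public static int desiredReplicas(long totalMetricAcrossPods, long targetAveragePerPod) {
        return (int) Math.ceil((double) totalMetricAcrossPods / targetAveragePerPod);
    }

    public static void main(String[] args) {
        // 4 pods serving 1000 RPS total, target 100 RPS per pod -> 10 replicas
        System.out.println(desiredReplicas(1000, 100)); // 10
    }
}
```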
If the metrics adapter fails, HPA reports <unknown> for custom metrics and stops scaling on those metrics. Always include a CPU or memory metric as a fallback. Monitor the adapter's availability and latency: a slow adapter causes delayed scaling decisions. The adapter's --metrics-relist-interval (default 1m) controls how often it re-reads available Prometheus metrics. Set it lower if you add new metrics frequently.

HPA, VPA, and Cluster Autoscaler: The Scaling Stack
HPA, VPA (Vertical Pod Autoscaler), and Cluster Autoscaler are complementary but interact in non-obvious ways. HPA scales horizontally (more pods). VPA scales vertically (bigger pods). Cluster Autoscaler scales infrastructure (more nodes). Using them together requires careful configuration to avoid conflicts.
```yaml
# VPA in 'Off' mode: Recommends but does not auto-apply resource requests
# This avoids conflict with HPA which also modifies replica counts
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"   # CRITICAL: Do not use 'Auto' with HPA on CPU/memory
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 4Gi
      controlledResources: ["cpu", "memory"]
---
# Cluster Autoscaler: Scales nodes based on pending pods
# No YAML needed — it watches for unschedulable pods automatically
# Key flags:
#   --scale-down-delay-after-add=10m
#   --scale-down-unneeded-time=10m
#   --max-node-provision-time=15m
```
- HPA on CPU + VPA on memory: Safe. They operate on different metrics.
- HPA on custom metrics + VPA on CPU/memory: Safe. HPA ignores resource metrics.
- HPA on CPU + VPA on CPU (Auto mode): Dangerous. Feedback loop.
- Best practice: HPA for scaling, VPA in 'Off' mode for right-sizing recommendations.
The Cluster Autoscaler has a --scale-down-delay-after-add flag (default 10 minutes) that prevents it from removing nodes immediately after HPA scales up. Without this, the Cluster Autoscaler could remove nodes while HPA is still creating pods, causing scheduling failures. Conversely, if the Cluster Autoscaler is too slow to add nodes, HPA's desired replicas may exceed available capacity, leaving pods in Pending state. Monitor for pods stuck in Pending with reason: Unschedulable; this indicates the Cluster Autoscaler cannot provision nodes fast enough or has hit its --max-nodes limit.

KEDA: Event-Driven Autoscaling Beyond HPA
KEDA (Kubernetes Event-Driven Autoscaler) extends HPA by enabling scale-to-zero and supporting a wider range of event sources (message queues, databases, cron schedules). KEDA acts as an HPA adapter — it creates and manages HPA resources internally, but provides a simpler API and more scaler options.
```yaml
# KEDA ScaledObject: Scale based on SQS queue depth with scale-to-zero
# Package: io.thecodeforge.kubernetes
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0    # Scale to zero when no messages
  maxReplicaCount: 50
  pollingInterval: 15   # Check metrics every 15 seconds
  cooldownPeriod: 300   # Wait 5 minutes before scaling to zero
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: aws-credentials
    metadata:
      queueURL: "https://sqs.us-east-1.amazonaws.com/123456789/orders"
      queueLength: "5"          # 1 pod per 5 messages
      awsRegion: "us-east-1"
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * *"        # Pre-scale at 8 AM
      end: "0 20 * * *"         # End pre-scale at 8 PM
      desiredReplicas: "10"
```
- Scale-to-zero: KEDA sets minReplicas to 0 and manages the 0->1 transition.
- 50+ scalers: SQS, Kafka, RabbitMQ, Prometheus, PostgreSQL, cron, and more.
- HPA under the hood: KEDA creates an HPA with custom metrics for each ScaledObject.
- Cooldown period: Prevents rapid scale-to-zero when a batch temporarily drains the queue.
Check the ScaledObject status for Active: false; this means KEDA has scaled to zero and is waiting for events. If events arrive but pods are slow to start, increase cooldownPeriod to prevent premature scale-to-zero between message batches.

| Component | Scales | Trigger | Scale to Zero | Best For |
|---|---|---|---|---|
| HPA | Pod replicas (horizontal) | CPU, memory, custom metrics | No (minReplicas >= 1) | Stateless web services, APIs with variable traffic |
| VPA | Pod resource requests (vertical) | Historical resource usage | No | Stateful workloads, databases, single-instance services |
| Cluster Autoscaler | Nodes (infrastructure) | Pending unschedulable pods | Yes (scale down empty nodes) | Cost optimization, burst capacity |
| KEDA | Pod replicas (horizontal) | Event sources (queues, DBs, cron, etc.) | Yes | Queue consumers, batch jobs, event-driven architectures |
🎯 Key Takeaways
- HPA is a proportional control loop, not a binary threshold trigger. It scales based on the ratio of current to target metric.
- Stabilization windows are the dampening mechanism. Asymmetric windows (fast up, slow down) are the production standard.
- Custom metrics make HPA application-aware but add a dependency chain. Always include CPU/memory as fallback.
- Never let HPA and VPA auto-scale on the same metric. Use VPA in 'Off' mode for right-sizing recommendations.
- KEDA extends HPA with event-driven scaling and scale-to-zero. Use it for queue-based workloads.
- Test HPA behavior under load before production. The scaling algorithm, stabilization windows, and Cluster Autoscaler interactions all need validation.
Interview Questions on This Topic
- Q: Explain the HPA algorithm. How does it compute the desired replica count?
- Q: What are stabilization windows and why are they important for preventing flapping?
- Q: How does HPA handle multiple metrics? What happens if CPU says 'scale up' but memory says 'scale down'?
- Q: Describe the conflict between HPA and VPA when both operate on CPU. How do you resolve it?
- Q: How do you wire up custom metrics from Prometheus to HPA? What happens if the metrics adapter fails?
- Q: What is KEDA and how does it differ from native HPA? When would you use it?
- Q: How does HPA interact with the Cluster Autoscaler? What are the race conditions?
- Q: A deployment is flapping between 3 and 15 replicas every minute. Walk me through your debugging process.
- Q: How do you design an HPA configuration for a service with predictable daily traffic patterns?
- Q: What is the selectPolicy field in HPA behavior and how does it affect scaling decisions?
Frequently Asked Questions
How does HPA calculate the desired number of replicas?
HPA uses the formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]. For example, if you have 10 replicas with CPU at 150m and a target of 100m, HPA calculates ceil(10 * 150/100) = ceil(15) = 15 replicas. When multiple metrics are configured, HPA computes the desired replicas for each and uses the maximum.
Why is my HPA showing '<unknown>' for metrics?
This typically means either: (1) metrics-server is not running or not collecting data, (2) the target deployment does not have CPU/memory requests set (HPA needs requests to compute utilization), or (3) for custom metrics, the metrics adapter (e.g., Prometheus Adapter) is down or misconfigured. Run kubectl top pods to verify metrics-server is working.
Can HPA and VPA work together?
Yes, but they must not operate on the same metric. The safe patterns are: (1) HPA on CPU + VPA on memory, (2) HPA on custom metrics + VPA on CPU/memory, or (3) HPA on CPU/memory + VPA in 'Off' mode (recommendations only, no auto-apply). Never use HPA on CPU with VPA in 'Auto' mode on CPU — this creates a feedback loop.
What is KEDA and when should I use it instead of HPA?
KEDA (Kubernetes Event-Driven Autoscaler) extends HPA with scale-to-zero support and 50+ event-based scalers (message queues, databases, cron schedules). Use KEDA for: queue consumers that should scale to zero when idle, event-driven workloads triggered by external systems, and batch jobs with predictable scheduling. KEDA creates HPA resources internally, so it is complementary, not a replacement.
How do I prevent HPA flapping?
Set stabilization windows for both scale-up and scale-down. Recommended: scaleUp.stabilizationWindowSeconds: 60-120, scaleDown.stabilizationWindowSeconds: 300-600. Also ensure pods have readiness probes with appropriate initialDelaySeconds so new pods are not counted in metrics until they are warm. Use behavior policies to cap the rate of scaling changes.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.