
Kubernetes HPA Deep Dive: Autoscaling Internals, Gotchas & Production Tuning

Kubernetes HPA autoscaling explained deeply — control loop internals, custom metrics, stabilization windows, KEDA, and real production gotchas you won't find in the docs.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • HPA is a proportional control loop, not a binary threshold trigger. It scales based on the ratio of current to target metric.
  • Stabilization windows are the dampening mechanism. Asymmetric windows (fast up, slow down) are the production standard.
  • Custom metrics make HPA application-aware but add a dependency chain. Always include CPU/memory as fallback.
Quick Answer
  • HPA runs a control loop every 15 seconds (default) that reads metrics, computes desired replicas, and scales.
  • Algorithm: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
  • Supports CPU, memory, custom metrics (Prometheus), and external metrics (cloud provider queues).
  • Scaling behavior is configurable via behavior field: separate policies for scale-up and scale-down.
  • Faster polling = more responsive but higher API Server and metrics-server load.
  • Aggressive scale-up = risk of over-provisioning and cluster resource exhaustion.
  • Conservative scale-down = cost waste but stability during traffic dips.
  • Most common pitfall: setting a target CPU of 80% without setting CPU requests on the container. Without requests, HPA has no denominator and will not function.
🚨 START HERE
HPA Triage Commands
Rapid commands to isolate HPA scaling issues.
🟡 HPA showing <unknown> or missing metrics.
Immediate Action: Check metrics-server and container resource requests.
Commands
kubectl top pods -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>
Fix Now: If `kubectl top` fails, restart metrics-server. If HPA shows 'missing request for cpu', add resource requests to the deployment.
🟡 HPA not scaling despite load.
Immediate Action: Check HPA status, events, and maxReplicas.
Commands
kubectl get hpa <hpa-name> -n <namespace> -o yaml | grep -A 20 status
kubectl get events -n <namespace> --field-selector involvedObject.name=<hpa-name>
Fix Now: If `currentReplicas` equals `maxReplicas`, increase maxReplicas. If events show metric errors, check metrics-server.
🟡 HPA flapping (rapid scale up/down).
Immediate Action: Check stabilization windows and scaling policies.
Commands
kubectl get hpa <hpa-name> -o jsonpath='{.spec.behavior}'
kubectl get hpa <hpa-name> -o jsonpath='{.status.lastScaleTime}'
Fix Now: Add `stabilizationWindowSeconds: 120` to scaleUp and `300` to scaleDown. Reduce scale-up aggressiveness.
🟡 Custom metrics not feeding HPA.
Immediate Action: Check custom metrics API registration and adapter logs.
Commands
kubectl get apiservice v1beta1.custom.metrics.k8s.io -o yaml
kubectl logs -n custom-metrics deploy/prometheus-adapter
Fix Now: If the API service shows 'MissingEndpoints', restart the adapter. If adapter logs show query errors, fix the Prometheus query in the adapter config.
Production Incident
HPA Flapping: Pods Scaling Up and Down Every 15 Seconds
A production deployment scaled from 3 to 12 pods, then back to 3, then to 15, in a continuous oscillation loop. Pods were constantly being created and terminated, causing connection resets and 503 errors.
Symptom: Deployment replica count oscillating rapidly. Pods in Terminating and ContainerCreating states simultaneously. Customer-facing latency spikes correlated with scaling events. HPA events showed alternating 'scaled up' and 'scaled down' messages every 30-60 seconds.
Assumption: Traffic was genuinely spiky, causing real demand fluctuations.
Root cause: The HPA was configured with a 50% CPU target and a 60-second stabilization window for scale-down (far shorter than the 300-second default). The scale-up policy was set to pods: 4 with periodSeconds: 15, meaning HPA could add 4 pods every 15 seconds. The new pods started with low CPU (cold start), which pulled the average below the 50% threshold, triggering scale-down. Once pods terminated, CPU spiked again, triggering scale-up. The cycle repeated indefinitely. The core issue: the scale-up stabilization window was 0 seconds (the default), so HPA reacted immediately to every metric change without dampening.
Fix:
1. Set behavior.scaleUp.stabilizationWindowSeconds: 120 to prevent rapid scale-up during pod initialization.
2. Changed behavior.scaleUp.policies from pods: 4 to percent: 50 for proportional scaling.
3. Set behavior.scaleDown.stabilizationWindowSeconds: 300 to be conservative on scale-down.
4. Added a readiness probe with initialDelaySeconds: 30 so pods were not counted in metrics until fully warm.
Key Lesson
  • HPA flapping is caused by asymmetric scale-up and scale-down policies. Both need stabilization windows.
  • Cold-start pods with low CPU skew the average metric and cause premature scale-down.
  • Always set stabilizationWindowSeconds for both scale-up and scale-down.
  • Readiness probes gate when a pod is counted in the HPA metric calculation. Delay them to allow warm-up.
Production Debug Guide
Symptom-first investigation path for HPA misbehavior.
HPA shows <unknown> for metric values.
Check that metrics-server is running and healthy. Verify that the target deployment has CPU/memory requests set. Without requests, HPA cannot calculate utilization percentages.
HPA is not scaling up despite high load.
Check HPA events (kubectl describe hpa). Look for 'failed to get cpu utilization' or 'unable to fetch metrics'. Verify metrics-server is collecting data. Check if maxReplicas has been reached.
HPA scales up but never scales down.
Check behavior.scaleDown policy. Default scale-down stabilization window is 300 seconds (5 minutes). Verify the metric is actually below the target after stabilization. Check if minReplicas has been reached.
HPA flapping: rapid scale up and down oscillations.
Add stabilization windows to both scale-up and scale-down. Increase periodSeconds to reduce polling frequency. Check if cold-start pods are skewing metrics.
Custom metrics not appearing in HPA.
Verify the Prometheus Adapter or custom.metrics.k8s.io API is registered (kubectl get apiservice). Check adapter configuration for the metric query. Ensure the metric label selectors match the deployment's pods.

Every production system eventually hits the same wall: traffic is unpredictable, and over-provisioning is expensive while under-provisioning is catastrophic. A Black Friday spike, a viral tweet, a nightly batch job — any of these can kneecap a statically-sized deployment in minutes.

Kubernetes Horizontal Pod Autoscaler (HPA) solves the reactive scaling problem by continuously watching resource metrics and adjusting pod replica counts to match demand. But the naive 'just set CPU threshold to 80%' approach breaks in subtle and painful ways in production — flapping deployments, ignored metrics, race conditions with the Cluster Autoscaler, and custom metrics that silently stop working.

This is not a getting-started guide. It is for engineers who need to understand the HPA algorithm at the formula level, how stabilization windows prevent flapping, how to wire up custom metrics via Prometheus and KEDA, how HPA interacts with VPA and Cluster Autoscaler, and the production mistakes that wake senior engineers at 3am.

What is Kubernetes HPA — Autoscaling?

Kubernetes HPA (Horizontal Pod Autoscaler) is a control loop that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics. It is the primary mechanism for reactive horizontal scaling in Kubernetes. HPA does not add nodes — it adds pods. If pods cannot be scheduled due to insufficient node capacity, the Cluster Autoscaler is responsible for adding nodes.

io/thecodeforge/kubernetes/hpa-basic.yaml · YAML
# Basic HPA with CPU utilization target
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
▶ Output
HPA configured with CPU/memory targets and scaling behavior policies.
Mental Model
The HPA Control Loop
HPA does not directly create or delete pods. It updates the replicas field on the Deployment, which triggers the ReplicaSet controller to reconcile.
  • Loop interval: 15s default, configurable via controller flag.
  • Metric source: metrics-server for CPU/memory, custom.metrics.k8s.io for Prometheus, external.metrics.k8s.io for cloud metrics.
  • Stabilization: HPA keeps a history of computed replica recommendations and uses the most conservative value within the window (the lowest for scale-up, the highest for scale-down).
  • Cooldown: There is no explicit cooldown. Stabilization windows serve as the dampening mechanism.
📊 Production Insight
The HPA algorithm uses the formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]. If a pod has no metric (e.g., not yet ready), HPA recomputes the average conservatively: the pod is assumed to use 0% of the target for scale-up calculations and 100% for scale-down. This dampens scaling in both directions and prevents premature scale-down during deployments. It also means that scaling decisions during a rolling update are muted, because old terminating pods and new unready pods both count as 'missing' metrics.
🎯 Key Takeaway
HPA is a control loop, not a threshold trigger. It computes proportional scaling, not binary on/off. Understanding the algorithm and stabilization windows is essential to preventing flapping and over-provisioning.
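The conservative treatment of pods with missing metrics can be made concrete with a small numeric sketch. This is our own illustration of the behavior documented for the HPA algorithm, not controller source; the class and method names are invented:

```java
// Sketch of how HPA handles pods with missing metrics: missing pods are
// assumed at 0% of target for scale-UP decisions and 100% of target for
// scale-DOWN decisions, dampening scaling in both directions.
public class MissingMetricsSketch {

    /** Average-usage ratio used for a scale-UP decision:
     *  pods without metrics are counted as consuming 0. */
    public static double scaleUpRatio(long[] knownUsage, int missingPods, long target) {
        long sum = 0;
        for (long u : knownUsage) sum += u;
        int totalPods = knownUsage.length + missingPods;
        return ((double) sum / totalPods) / target;
    }

    /** Average-usage ratio used for a scale-DOWN decision:
     *  pods without metrics are counted as consuming 100% of target. */
    public static double scaleDownRatio(long[] knownUsage, int missingPods, long target) {
        long sum = 0;
        for (long u : knownUsage) sum += u;
        sum += (long) missingPods * target;  // missing pods assumed at target
        int totalPods = knownUsage.length + missingPods;
        return ((double) sum / totalPods) / target;
    }

    public static void main(String[] args) {
        // 3 ready pods at 900m each, 1 unready pod, target 500m.
        long[] usage = {900, 900, 900};
        System.out.println(scaleUpRatio(usage, 1, 500));   // 1.35 (dampened vs 1.8 without the unready pod)
        System.out.println(scaleDownRatio(usage, 1, 500)); // 1.6
    }
}
```

Note how the unready pod pulls the scale-up ratio down from 1.8 to 1.35: HPA still scales up, but less aggressively, until the pod reports real metrics.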

The HPA Algorithm: How Desired Replicas Are Calculated

The HPA algorithm is proportional, not binary. It does not simply 'add one pod' when a threshold is breached. Instead, it calculates the ratio of current metric to target metric and scales proportionally. This means high load causes rapid scale-up (doubling or more), while moderate load causes gradual adjustments.

io/thecodeforge/kubernetes/autoscaling/HpaAlgorithm.java · JAVA
// Simplified HPA algorithm implementation
// Package: io.thecodeforge.kubernetes.autoscaling
package io.thecodeforge.kubernetes.autoscaling;

public class HpaAlgorithm {

    /**
     * Core HPA formula:
     * desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
     *
     * For multiple metrics, HPA computes the max across all metric sources.
     */
    public static int calculateDesiredReplicas(
            int currentReplicas,
            long currentMetricValue,
            long targetMetricValue,
            int minReplicas,
            int maxReplicas
    ) {
        if (targetMetricValue == 0) {
            throw new IllegalArgumentException("Target metric value cannot be zero");
        }

        double ratio = (double) currentMetricValue / targetMetricValue;
        int desired = (int) Math.ceil(currentReplicas * ratio);

        // Clamp to min/max bounds
        return Math.max(minReplicas, Math.min(desired, maxReplicas));
    }

    public static void main(String[] args) {
        // Example: 10 replicas, CPU at 1500m, target 1000m
        // desired = ceil(10 * (1500/1000)) = ceil(15) = 15
        int desired = calculateDesiredReplicas(10, 1500, 1000, 3, 50);
        System.out.println("Desired replicas: " + desired);  // 15

        // Example: 10 replicas, CPU at 500m, target 1000m
        // desired = ceil(10 * (500/1000)) = ceil(5) = 5
        int scaledDown = calculateDesiredReplicas(10, 500, 1000, 3, 50);
        System.out.println("Scaled down replicas: " + scaledDown);  // 5
    }
}
▶ Output
Desired replicas: 15
Scaled down replicas: 5
Mental Model
Proportional Scaling vs Binary Thresholds
Without behavior policies to constrain it, a sudden 10x spike in CPU would make the raw formula jump from 3 pods straight to 30 in a single loop iteration. (In practice, the autoscaling/v2 default scale-up policy caps each step at doubling or adding 4 pods, whichever is greater, per 15-second period.)
  • pods: 4 policy: can add at most 4 pods per period.
  • percent: 50 policy: can add at most 50% of current replicas per period.
  • Multiple policies: HPA uses the policy that allows the most scaling (max).
  • For scale-down: same logic applies but in reverse. HPA uses the min across policies for safety.
📊 Production Insight
When multiple metrics are defined (e.g., CPU and memory), HPA computes the desired replicas for each metric independently and uses the MAX. This means if CPU says 'scale to 5' and memory says 'scale to 10', HPA scales to 10. This is the correct behavior for availability — the most constrained metric wins. However, it can cause unexpected over-provisioning if one metric is misconfigured or noisy. Always validate all metric sources independently.
🎯 Key Takeaway
HPA scales proportionally, not incrementally. Without behavior policies, a single metric spike can cause massive scale-up. Use behavior to cap aggressiveness, and remember: the MAX across metrics wins.
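Two details the formula omits are worth sketching: the controller's tolerance band (10% by default, set via the --horizontal-pod-autoscaler-tolerance controller flag), within which no scaling happens at all, and the max-across-metrics rule described above. A minimal sketch with invented names, not the real controller code:

```java
// Sketch of (1) the tolerance check HPA applies before scaling and
// (2) combining multiple metrics by taking the maximum desired count.
public class HpaToleranceSketch {

    static final double TOLERANCE = 0.1;  // controller default (10%)

    /** If the current/target ratio is within tolerance of 1.0, keep the
     *  current replica count; otherwise apply the proportional formula. */
    public static int desiredWithTolerance(int currentReplicas, double ratio) {
        if (Math.abs(ratio - 1.0) <= TOLERANCE) {
            return currentReplicas;  // within tolerance: no scaling
        }
        return (int) Math.ceil(currentReplicas * ratio);
    }

    /** With multiple metrics, the most constrained (largest) result wins. */
    public static int combineMetrics(int[] perMetricDesired) {
        int max = Integer.MIN_VALUE;
        for (int d : perMetricDesired) max = Math.max(max, d);
        return max;
    }

    public static void main(String[] args) {
        System.out.println(desiredWithTolerance(10, 1.08));       // 10: within the 10% band
        System.out.println(desiredWithTolerance(10, 1.5));        // 15: ceil(10 * 1.5)
        System.out.println(combineMetrics(new int[]{5, 10, 7}));  // 10: max wins
    }
}
```

The tolerance band explains a common head-scratcher: a pod sitting at 76% CPU against a 70% target (ratio 1.086) will not trigger any scaling.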

Stabilization Windows: Preventing Flapping

Stabilization windows are the primary mechanism for preventing HPA flapping — the rapid oscillation between scale-up and scale-down. During the stabilization window, HPA considers the replica recommendations it computed recently and applies the most conservative one: the lowest recommendation for scale-up, the highest for scale-down.

io/thecodeforge/kubernetes/hpa-stabilization.yaml · YAML
# HPA with tuned stabilization windows
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 5
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    # Scale UP: Aggressive but dampened
    scaleUp:
      stabilizationWindowSeconds: 60    # Consider max metric over last 60s
      policies:
        - type: Percent
          value: 100                      # Can double replicas per period
          periodSeconds: 60
        - type: Pods
          value: 10                       # Or add 10 pods, whichever is more
          periodSeconds: 60
      selectPolicy: Max                   # Use the policy allowing most scaling
    # Scale DOWN: Conservative and slow
    scaleDown:
      stabilizationWindowSeconds: 600    # Consider min metric over last 10 minutes
      policies:
        - type: Percent
          value: 10                       # Remove at most 10% per period
          periodSeconds: 120
      selectPolicy: Min                   # Use the most conservative policy
▶ Output
HPA configured with asymmetric stabilization: fast scale-up, slow scale-down.
Mental Model
How Stabilization Windows Work
The asymmetry is intentional. Scale-up should be fast (availability). Scale-down should be slow (cost savings without sacrificing stability).
  • Scale-up window: 0s default. Set to 60-120s to dampen during deployments and cold starts.
  • Scale-down window: 300s default. Set to 300-600s for production stability.
  • Window only applies to the stabilization decision, not the scaling policy rate.
  • Pods in CrashLoopBackOff or not yet ready are treated as using 0% of the target for scale-up calculations (and 100% for scale-down).
📊 Production Insight
The default scale-up stabilization window of 0 seconds is the single most common cause of HPA flapping in production. During a rolling update, new pods start with low CPU (cold start), which pulls the average down, triggering scale-down. Once old pods terminate, CPU spikes, triggering scale-up. The fix is always: set scaleUp.stabilizationWindowSeconds: 60-120 to let new pods warm up before HPA reacts.
🎯 Key Takeaway
Stabilization windows are the dampening mechanism for HPA. Asymmetric windows (fast up, slow down) are the production standard. Never leave scale-up stabilization at 0 seconds in production.
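The stabilization decision itself can be sketched as a rolling history of recommendations. This is a simplification of the real controller (the class and method names are ours), but the lowest-for-scale-up / highest-for-scale-down selection matches the documented behavior:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of recommendation stabilization: HPA records each computed
// recommendation and, within the window, applies the most conservative one.
public class StabilizationSketch {

    record Recommendation(long timestampSeconds, int replicas) {}

    private final Deque<Recommendation> history = new ArrayDeque<>();

    public int stabilize(long nowSeconds, int newRecommendation,
                         int currentReplicas, long windowSeconds) {
        history.addLast(new Recommendation(nowSeconds, newRecommendation));
        // Drop recommendations that fell out of the window.
        while (!history.isEmpty()
                && history.peekFirst().timestampSeconds() < nowSeconds - windowSeconds) {
            history.removeFirst();
        }
        int result = newRecommendation;
        for (Recommendation r : history) {
            if (newRecommendation < currentReplicas) {
                // Scale-down: use the HIGHEST recent recommendation (don't drop too fast).
                result = Math.max(result, r.replicas());
            } else if (newRecommendation > currentReplicas) {
                // Scale-up: use the LOWEST recent recommendation (don't spike too fast).
                result = Math.min(result, r.replicas());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        StabilizationSketch s = new StabilizationSketch();
        System.out.println(s.stabilize(0, 10, 10, 300));   // 10: no change requested
        System.out.println(s.stabilize(60, 4, 10, 300));   // 10: scale-down held by the window
        System.out.println(s.stabilize(400, 4, 10, 300));  // 4: the old high recommendation expired
    }
}
```

The second call illustrates why flapping stops: even though the metric momentarily says "4 replicas", the 10-replica recommendation from 60 seconds earlier is still inside the window, so no scale-down happens.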

Custom and External Metrics: Beyond CPU and Memory

CPU and memory are often poor proxies for actual application load. A web server might be CPU-bound during image processing but network-bound during API calls. HPA supports custom metrics (per-pod metrics from Prometheus) and external metrics (cluster-external signals like SQS queue depth or Pub/Sub backlog) through the Kubernetes API aggregation layer.

io/thecodeforge/kubernetes/hpa-custom-metrics.yaml · YAML
# HPA with custom metrics from Prometheus Adapter
# Requires: prometheus-adapter installed and configured
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 2
  maxReplicas: 40
  metrics:
    # Custom metric: requests per second from Prometheus
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"    # Scale when avg RPS per pod exceeds 100
    # External metric: SQS queue depth
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: "order-processing"
        target:
          type: AverageValue
          averageValue: "50"     # Scale when queue depth per pod exceeds 50
    # Keep CPU as fallback
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
▶ Output
HPA configured with custom (RPS) and external (SQS) metrics plus CPU fallback.
Mental Model
Custom vs External Metrics
Custom metrics answer: 'How busy is this pod?' External metrics answer: 'How much work is waiting for the cluster?'
  • Custom metrics: Prometheus Adapter, Datadog, or Stackdriver adapter.
  • External metrics: Cloud-provider specific (AWS CloudWatch, GCP Stackdriver, Azure Monitor).
  • Label selectors must match the metric's labels to the target pods.
  • If the metric API is unavailable, HPA falls back to CPU/memory (if configured).
📊 Production Insight
Custom metrics adapters are a single point of failure for HPA. If the Prometheus Adapter crashes or the Prometheus server is unreachable, HPA shows <unknown> for custom metrics and stops scaling on those metrics. Always include a CPU or memory metric as a fallback. Monitor the adapter's availability and latency — a slow adapter causes delayed scaling decisions. The adapter's --metrics-relist-interval (default 1m) controls how often it re-reads available Prometheus metrics. Set it lower if you add new metrics frequently.
🎯 Key Takeaway
Custom metrics make HPA application-aware, but they add a dependency chain: Prometheus -> Adapter -> HPA. Always include CPU/memory as a fallback metric. Monitor the adapter as critical infrastructure.
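For a Pods-type metric with an AverageValue target, like the http_requests_per_second example above, the replica math reduces to the total metric divided by the per-pod target. A minimal sketch (names are ours, clamping added for completeness):

```java
// Sketch of the AverageValue calculation: desiredReplicas =
// ceil( sum(metric over pods) / targetAverageValue ), clamped to bounds.
public class AverageValueSketch {

    public static int desiredReplicas(double[] perPodValues, double targetAverage,
                                      int minReplicas, int maxReplicas) {
        double sum = 0;
        for (double v : perPodValues) sum += v;
        int desired = (int) Math.ceil(sum / targetAverage);
        return Math.max(minReplicas, Math.min(desired, maxReplicas));
    }

    public static void main(String[] args) {
        // 3 pods serving 120/130/110 RPS with a 100 RPS-per-pod target:
        // total 360 RPS -> ceil(3.6) = 4 replicas.
        System.out.println(desiredReplicas(new double[]{120, 130, 110}, 100, 2, 40));
    }
}
```

This is why AverageValue targets behave intuitively under load: total demand, not per-pod percentages, drives the replica count.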

HPA, VPA, and Cluster Autoscaler: The Scaling Stack

HPA, VPA (Vertical Pod Autoscaler), and Cluster Autoscaler are complementary but interact in non-obvious ways. HPA scales horizontally (more pods). VPA scales vertically (bigger pods). Cluster Autoscaler scales infrastructure (more nodes). Using them together requires careful configuration to avoid conflicts.

io/thecodeforge/kubernetes/hpa-vpa-coexistence.yaml · YAML
# VPA in 'Off' mode: Recommends but does not auto-apply resource requests
# This avoids conflict with HPA which also modifies replica counts
# Package: io.thecodeforge.kubernetes
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"      # CRITICAL: Do not use 'Auto' with HPA on CPU/memory
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 4Gi
        controlledResources: ["cpu", "memory"]
---
# Cluster Autoscaler: Scales nodes based on pending pods
# No YAML needed — it watches for unschedulable pods automatically
# Key flags:
# --scale-down-delay-after-add=10m
# --scale-down-unneeded-time=10m
# --max-node-provision-time=15m
▶ Output
VPA configured in recommendation-only mode. Cluster Autoscaler watches for unschedulable pods.
Mental Model
The Conflict: HPA vs VPA on CPU/Memory
The rule: never let HPA and VPA both operate on the same resource metric. If HPA uses CPU, VPA should be in 'Off' mode (recommendations only).
  • HPA on CPU + VPA on memory: Safe. They operate on different metrics.
  • HPA on custom metrics + VPA on CPU/memory: Safe. HPA ignores resource metrics.
  • HPA on CPU + VPA on CPU (Auto mode): Dangerous. Feedback loop.
  • Best practice: HPA for scaling, VPA in 'Off' mode for right-sizing recommendations.
📊 Production Insight
The Cluster Autoscaler has a --scale-down-delay-after-add flag (default 10 minutes) that prevents it from removing nodes immediately after HPA scales up. Without this, the Cluster Autoscaler could remove nodes while HPA is still creating pods, causing scheduling failures. Conversely, if the Cluster Autoscaler is too slow to add nodes, HPA's desired replicas may exceed available capacity, leaving pods in Pending state. Monitor for pods stuck in Pending with reason: Unschedulable — this indicates the Cluster Autoscaler cannot provision nodes fast enough or has hit its --max-nodes limit.
🎯 Key Takeaway
HPA, VPA, and Cluster Autoscaler form a three-layer scaling stack. The key rule: never let HPA and VPA auto-scale on the same metric. Use VPA in 'Off' mode for right-sizing, HPA for horizontal scaling, and Cluster Autoscaler for node provisioning.

KEDA: Event-Driven Autoscaling Beyond HPA

KEDA (Kubernetes Event-Driven Autoscaler) extends HPA by enabling scale-to-zero and supporting a wider range of event sources (message queues, databases, cron schedules). KEDA acts as an HPA adapter — it creates and manages HPA resources internally, but provides a simpler API and more scaler options.

io/thecodeforge/kubernetes/keda-scaledobject.yaml · YAML
# KEDA ScaledObject: Scale based on SQS queue depth with scale-to-zero
# Package: io.thecodeforge.kubernetes
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0          # Scale to zero when no messages
  maxReplicaCount: 50
  pollingInterval: 15         # Check metrics every 15 seconds
  cooldownPeriod: 300         # Wait 5 minutes before scaling to zero
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: aws-credentials
      metadata:
        queueURL: "https://sqs.us-east-1.amazonaws.com/123456789/orders"
        queueLength: "5"        # 1 pod per 5 messages
        awsRegion: "us-east-1"
    - type: cron
      metadata:
        timezone: America/New_York
        start: "0 8 * * *"      # Pre-scale at 8 AM
        end: "0 20 * * *"       # End pre-scale at 8 PM
        desiredReplicas: "10"
▶ Output
KEDA configured for SQS-based scaling with scale-to-zero and cron pre-scaling.
Mental Model
Why KEDA Instead of Native HPA?
KEDA is not a replacement for HPA — it is an extension. It uses HPA internally for the actual scaling.
  • Scale-to-zero: KEDA sets minReplicas to 0 and manages the 0->1 transition.
  • 50+ scalers: SQS, Kafka, RabbitMQ, Prometheus, PostgreSQL, cron, and more.
  • HPA under the hood: KEDA creates an HPA with custom metrics for each ScaledObject.
  • Cooldown period: Prevents rapid scale-to-zero when a batch temporarily drains the queue.
📊 Production Insight
KEDA's scale-to-zero introduces a cold-start latency problem: the first pod must start from zero, which can take 30-120 seconds depending on the container image and initialization. For latency-sensitive services, use KEDA's cron scaler to pre-scale during known traffic windows. Monitor the ScaledObject status for Active: false — this means KEDA has scaled to zero and is waiting for events. If events arrive but pods are slow to start, increase cooldownPeriod to prevent premature scale-to-zero between message batches.
🎯 Key Takeaway
KEDA extends HPA with event-driven scaling and scale-to-zero. Use it for queue-based workloads and batch processing. The trade-off is cold-start latency when scaling from zero. Pre-scale with cron triggers for latency-sensitive paths.
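The ScaledObject above implies a simple scaling rule that can be sketched directly. This is our own simplification, not KEDA source; the names are invented, and real KEDA also handles activation thresholds and hands the 1..max range off to an internally managed HPA:

```java
// Sketch of the queue-depth scaling decision a ScaledObject with
// queueLength: N and cooldownPeriod implies: one pod per N messages,
// scale to zero only after the cooldown elapses with an empty queue.
public class KedaQueueSketch {

    public static int desiredReplicas(long queueDepth, long messagesPerPod,
                                      int maxReplicas, long idleSeconds,
                                      long cooldownSeconds) {
        if (queueDepth == 0) {
            // Empty queue: hold one pod until the cooldown expires, then scale to zero.
            return idleSeconds >= cooldownSeconds ? 0 : 1;
        }
        int desired = (int) Math.ceil((double) queueDepth / messagesPerPod);
        return Math.min(desired, maxReplicas);
    }

    public static void main(String[] args) {
        System.out.println(desiredReplicas(23, 5, 50, 0, 300));   // 5 pods for 23 messages
        System.out.println(desiredReplicas(0, 5, 50, 60, 300));   // 1: cooldown not yet elapsed
        System.out.println(desiredReplicas(0, 5, 50, 300, 300));  // 0: scale to zero
    }
}
```

The cooldown branch is the anti-flapping guard for scale-to-zero: a queue that briefly drains between batches keeps one warm pod instead of bouncing between 0 and 1.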
🗂 HPA vs VPA vs Cluster Autoscaler vs KEDA
Understanding the scaling stack and when to use each component.
Component | Scales | Trigger | Scale to Zero | Best For
HPA | Pod replicas (horizontal) | CPU, memory, custom metrics | No (minReplicas >= 1) | Stateless web services, APIs with variable traffic
VPA | Pod resource requests (vertical) | Historical resource usage | No | Stateful workloads, databases, single-instance services
Cluster Autoscaler | Nodes (infrastructure) | Pending unschedulable pods | Yes (scale down empty nodes) | Cost optimization, burst capacity
KEDA | Pod replicas (horizontal) | Event sources (queues, DBs, cron, etc.) | Yes | Queue consumers, batch jobs, event-driven architectures

🎯 Key Takeaways

  • HPA is a proportional control loop, not a binary threshold trigger. It scales based on the ratio of current to target metric.
  • Stabilization windows are the dampening mechanism. Asymmetric windows (fast up, slow down) are the production standard.
  • Custom metrics make HPA application-aware but add a dependency chain. Always include CPU/memory as fallback.
  • Never let HPA and VPA auto-scale on the same metric. Use VPA in 'Off' mode for right-sizing recommendations.
  • KEDA extends HPA with event-driven scaling and scale-to-zero. Use it for queue-based workloads.
  • Test HPA behavior under load before production. The scaling algorithm, stabilization windows, and Cluster Autoscaler interactions all need validation.

⚠ Common Mistakes to Avoid

    Not setting CPU/memory requests on the target deployment. HPA cannot compute utilization without requests.
    Using HPA and VPA in 'Auto' mode on the same metric. This creates a feedback loop.
    Leaving scale-up stabilization window at 0 seconds. This causes flapping during deployments and cold starts.
    Not setting `maxReplicas` to a realistic value. HPA will scale until cluster resources are exhausted.
    Relying solely on CPU for scaling. CPU is often a poor proxy for actual application load.
    Not including a fallback metric (CPU/memory) when using custom metrics. If the adapter fails, HPA stops scaling.
    Ignoring the Cluster Autoscaler's capacity. HPA may desire more replicas than the cluster can schedule.
    Setting aggressive scale-down policies. This causes thrashing during traffic dips and wastes resources on constant pod churn.
    Not testing HPA behavior during load tests. The first time you see HPA in action should not be during an incident.
    Forgetting that HPA counts pods based on readiness. Pods without readiness probes are immediately counted, even if not ready to serve traffic.

Interview Questions on This Topic

  • Q: Explain the HPA algorithm. How does it compute the desired replica count?
  • Q: What are stabilization windows and why are they important for preventing flapping?
  • Q: How does HPA handle multiple metrics? What happens if CPU says 'scale up' but memory says 'scale down'?
  • Q: Describe the conflict between HPA and VPA when both operate on CPU. How do you resolve it?
  • Q: How do you wire up custom metrics from Prometheus to HPA? What happens if the metrics adapter fails?
  • Q: What is KEDA and how does it differ from native HPA? When would you use it?
  • Q: How does HPA interact with the Cluster Autoscaler? What are the race conditions?
  • Q: A deployment is flapping between 3 and 15 replicas every minute. Walk me through your debugging process.
  • Q: How do you design an HPA configuration for a service with predictable daily traffic patterns?
  • Q: What is the selectPolicy field in HPA behavior and how does it affect scaling decisions?

Frequently Asked Questions

How does HPA calculate the desired number of replicas?

HPA uses the formula: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]. For example, if you have 10 replicas with CPU at 150m and a target of 100m, HPA calculates ceil(10 * 150/100) = ceil(15) = 15 replicas. When multiple metrics are configured, HPA computes the desired replicas for each and uses the maximum.

Why is my HPA showing <unknown> for metrics?

This typically means either: (1) metrics-server is not running or not collecting data, (2) the target deployment does not have CPU/memory requests set (HPA needs requests to compute utilization), or (3) for custom metrics, the metrics adapter (e.g., Prometheus Adapter) is down or misconfigured. Run kubectl top pods to verify metrics-server is working.

Can HPA and VPA work together?

Yes, but they must not operate on the same metric. The safe patterns are: (1) HPA on CPU + VPA on memory, (2) HPA on custom metrics + VPA on CPU/memory, or (3) HPA on CPU/memory + VPA in 'Off' mode (recommendations only, no auto-apply). Never use HPA on CPU with VPA in 'Auto' mode on CPU — this creates a feedback loop.

What is KEDA and when should I use it instead of HPA?

KEDA (Kubernetes Event-Driven Autoscaler) extends HPA with scale-to-zero support and 50+ event-based scalers (message queues, databases, cron schedules). Use KEDA for: queue consumers that should scale to zero when idle, event-driven workloads triggered by external systems, and batch jobs with predictable scheduling. KEDA creates HPA resources internally, so it is complementary, not a replacement.

How do I prevent HPA flapping?

Set stabilization windows for both scale-up and scale-down. Recommended: scaleUp.stabilizationWindowSeconds: 60-120, scaleDown.stabilizationWindowSeconds: 300-600. Also ensure pods have readiness probes with appropriate initialDelaySeconds so new pods are not counted in metrics until they are warm. Use behavior policies to cap the rate of scaling changes.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
