Advanced 14 min · March 06, 2026

Prometheus OOMKill: High-Cardinality Labels in Kubernetes

user_id label explosion: 250k series/hour, 24GB RAM, OOMKill on 16GB pod.

Naren · Founder
Quick Answer
  • Pull model: Prometheus scrapes /metrics endpoints on targets at configured intervals.
  • Service Discovery: Queries Kubernetes API to find pods, services, nodes dynamically — no static IPs.
  • Four metric types: Counter (only goes up), Gauge (up/down), Histogram (bucketed observations), Summary (client-side quantiles).
  • Recording rules: Pre-compute expensive PromQL into cheap time series.
  • Alerting: Prometheus evaluates rules, Alertmanager routes to PagerDuty/Slack.
  • Pull model means Prometheus must reach every target. NetworkPolicy misconfigs silently break scraping.
  • High-cardinality labels (user_id, request_id) will OOM Prometheus.
  • Avoid Summary in multi-replica deployments: summaries cannot be aggregated across instances, so use Histogram instead.
Plain-English First

Imagine your Kubernetes cluster is a busy hospital. Dozens of doctors (pods), nurses (services), and wards (namespaces) are running simultaneously. Prometheus is like the hospital's central monitoring board — it walks around every few seconds, checks each room's vitals (CPU, memory, request rates), writes them down in a giant logbook, and sounds an alarm if a patient's heart rate spikes. You don't wait for something to go wrong — the board tells you before it becomes a crisis.

Running Kubernetes in production without monitoring is like flying a commercial aircraft with the instrument panel blacked out. Everything might feel fine until it catastrophically isn't. Prometheus is used by over 84% of Kubernetes production environments — not because it's the easiest tool, but because it's the most powerful pull-based metrics system that was purpose-built for dynamic, containerized infrastructure.

The real problem Prometheus solves is the ephemeral nature of Kubernetes workloads. Traditional monitoring tools expect your target IPs to stay fixed. In Kubernetes, a pod gets a new IP whenever it is recreated or rescheduled. Prometheus solves this with Kubernetes-native service discovery: it queries the Kubernetes API server directly to find what's alive right now, not what was alive when you wrote the config.

This is not a getting-started guide. It covers scrape configurations with relabeling, custom application metrics using client libraries, recording rules to avoid query-time explosions, Alertmanager integration, and the five most expensive mistakes teams make in production.

Prometheus Stack (Operator) Component Visual

The Prometheus Operator for Kubernetes introduces a set of Custom Resource Definitions (CRDs) that declaratively define the Prometheus monitoring stack. Understanding how these components fit together is essential before diving into scrape configuration.

The stack consists of five core CRDs:
  • Prometheus: Defines a Prometheus statefulset, including retention, storage, and resource limits. The operator manages the Prometheus pods, config reloading, and target reconciliation.
  • Alertmanager: Defines an Alertmanager cluster with config secret, routing, and receivers. The operator creates the Alertmanager pods and manages the config.
  • ServiceMonitor: Declares how to scrape a Kubernetes service. The operator converts ServiceMonitor selectors into Prometheus scrape targets. It is the most common way to configure scraping in operator-managed setups.
  • PodMonitor: Like ServiceMonitor but scrapes individual pods directly, without an intermediate service. Useful for scraping metrics from pods that are not behind a service.
  • PrometheusRule: Contains recording and alerting rules. The operator loads them into Prometheus.

Additionally, the ScrapeConfig CRD (introduced in Prometheus Operator v0.65) provides a lower-level way to define scrape targets that maps closely onto Prometheus scrape_config semantics, bypassing the service and pod selection logic of ServiceMonitor/PodMonitor.

A typical production deployment creates a Prometheus CR, an Alertmanager CR, and one or more ServiceMonitor or ScrapeConfig CRs, plus PrometheusRule CRs for alerts. The operator watches these CRDs and reconciles the state of Prometheus and Alertmanager instances.

The diagram below shows the relationships: the Operator watches CRDs, creates Prometheus/Alertmanager pods, and translates ServiceMonitors into scrape configs.

prometheus-operator-components.yamlYAML
# Minimal Prometheus CR: the operator creates a StatefulSet from it
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      app: myapp
  resources:
    requests:
      memory: 16Gi
---
# ServiceMonitor — operator translates to scrape targets
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
Output
The Prometheus Operator creates a Prometheus StatefulSet with the given resource requests, discovers ServiceMonitors with matching labels, and automatically adds scrape targets to the Prometheus configuration.
Label Matching is Everything
  • Prometheus CR specifies selectors for ServiceMonitors and PrometheusRules.
  • Each ServiceMonitor must have labels matching serviceMonitorSelector.
  • Each PrometheusRule must have labels matching ruleSelector.
  • Use --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false in Helm to avoid default label restrictions.
Production Insight
When deploying the Prometheus Operator via Helm (kube-prometheus-stack), the default values set serviceMonitorSelectorNilUsesHelmValues: true. This means the Operator only picks up ServiceMonitors with the Helm release label. If you create a ServiceMonitor manually using kubectl apply, it won't be scraped unless you also add the label release: <name>. This catches many teams off guard. Set serviceMonitorSelectorNilUsesHelmValues: false to allow unlabeled ServiceMonitors, or always include the release label.
Key Takeaway
The Prometheus Operator stack: Prometheus CR (instance), ServiceMonitor (scrape target definition), PrometheusRule (alert/recording rules), Alertmanager (notification routing). Label matching between CRDs is the most frequent misconfiguration. Understand the component relationships before writing any YAML.
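
The Helm setting discussed above can also live in a values file instead of a --set flag. The following is a minimal sketch, assuming the kube-prometheus-stack chart; the file name is hypothetical, and relaxing the PodMonitor and rule selectors as well is an optional choice made here for illustration. Check the chart's values for your version before relying on it.

```yaml
# values-monitoring.yaml (hypothetical file name)
prometheus:
  prometheusSpec:
    # Pick up ServiceMonitors even when they lack the Helm release label.
    serviceMonitorSelectorNilUsesHelmValues: false
    # Optional: apply the same relaxation to PodMonitors and PrometheusRules.
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
```

With these set to false and no explicit selector configured, the chart leaves the selectors empty, so objects are matched regardless of their labels. If you prefer tighter control, keep the defaults and add the release label to every ServiceMonitor instead.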

How Prometheus Service Discovery Works Inside Kubernetes

Prometheus uses a pull model — it reaches out to targets and scrapes metrics endpoints, typically on path /metrics, at a configured interval. In a static world you'd list IPs. In Kubernetes, Prometheus uses kubernetes_sd_configs to query the Kubernetes API and discover pods, services, endpoints, nodes, and ingresses dynamically.

When Prometheus starts, it authenticates to the API server using a ServiceAccount token mounted in its pod. It then watches specific resource types. For the endpoints role, Prometheus discovers every Endpoints object across the cluster. For each endpoint address it finds, it creates a scrape target. The magic happens during relabeling — a pipeline that runs before the scrape and lets you filter, rename, and attach labels using values pulled directly from Kubernetes metadata (pod annotations, namespace labels, service names).

The annotation prometheus.io/scrape: 'true' is a community convention that Prometheus relabeling configs check. If the annotation exists and is true, the pod is scraped. This means enabling monitoring for a new application is as simple as adding three annotations to its pod template, with no Prometheus config reload needed. Prometheus learns about new targets automatically through its watch on the Kubernetes API.
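
To make the annotation convention concrete, here is a minimal sketch of a Deployment whose pod template opts in to scraping. The Deployment name, image, and port number are assumptions for illustration; the annotations only take effect if your scrape configuration contains relabeling rules that read them.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service            # hypothetical application name
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
      annotations:
        # Annotation values must be strings, hence the quotes.
        prometheus.io/scrape: "true"
        prometheus.io/port: "9091"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:1.0.0   # hypothetical image
          ports:
            - name: metrics
              containerPort: 9091
```

The annotations go on the pod template, not on the Deployment's own metadata, because Prometheus reads them from the discovered pods.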

Understanding the target lifecycle is critical for production. The /targets page shows each target as up, down, or unknown. A target is unknown from the moment it is discovered until its first scrape completes. It is down when a scrape fails: the endpoint is unreachable (network issue or pod not started), the request times out, or the HTTP response is not a 2xx. Staleness markers are written when a target disappears or stops exposing a series, which prevents old time series from polluting later queries.
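
Because a failed scrape sets the synthetic up series to 0, the target lifecycle can be watched with an ordinary alert rule. A minimal sketch follows, assuming a job named kubernetes-pods; the group name, alert name, and 5m hold time are illustrative choices, not recommendations from the Prometheus project.

```yaml
groups:
  - name: scrape_health_sketch          # hypothetical group name
    rules:
      - alert: ScrapeTargetDown         # hypothetical alert name
        # up is 1 after a successful scrape and 0 after a failed one.
        expr: up{job="kubernetes-pods"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus cannot scrape {{ $labels.instance }}"
```

Note that this only covers targets Prometheus has discovered. A pod that was never discovered (for example, because an annotation is missing) produces no up series at all.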

prometheus-kubernetes-sd-config.yamlYAML
# This is the core Prometheus scrape configuration for Kubernetes pod discovery.
# It lives inside your prometheus.yml (or, with the operator, in an
# additionalScrapeConfigs Secret or a ScrapeConfig CR).

scrape_configs:
  - job_name: 'kubernetes-pods'
    # Prometheus will query the Kubernetes API server to find all pods.
    kubernetes_sd_configs:
      - role: pod
        # Restrict discovery to a specific namespace for security isolation.
        namespaces:
          names:
            - production
            - staging

    # relabel_configs runs BEFORE each scrape — it filters and transforms targets.
    relabel_configs:
      # STEP 1: Only scrape pods that explicitly opt in via annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

      # STEP 2: Allow pods to declare a custom metrics path.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # STEP 3: Allow pods to declare a custom port for scraping.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__

      # STEP 4: Carry namespace as a label.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace

      # STEP 5: Carry the pod name as a label.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

      # STEP 6: Carry the app label from the pod.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

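      # STEP 7: Drop targets running on hostNetwork (pod IP equals host IP).
      # Prometheus regexes use RE2, which has no backreferences, so the two
      # labels are compared with dropequal (requires Prometheus 2.41 or newer).
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __meta_kubernetes_pod_host_ip
        action: dropequal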
Output
# Prometheus /targets page shows discovered targets with labels applied.
The __address__ Rewrite Trap
  • Multi-container pods expose multiple ports. With the pod role, Prometheus creates one target per declared container port, and not every port serves metrics.
  • The __address__ label determines where Prometheus connects. Relabeling rewrites it.
  • Without prometheus.io/port, every declared container port remains a scrape target.
  • Sidecar containers (Istio, Envoy) often expose ports that are not your metrics port.
Production Insight
NetworkPolicies are the silent killer of Prometheus scraping. If you deploy a NetworkPolicy that restricts ingress to your pod, and Prometheus is not in the allowed source namespace/IP range, scraping fails without any hint at the cause. The target shows as down, typically with nothing more specific than a timeout error (context deadline exceeded). Always include Prometheus's namespace in your NetworkPolicy ingress rules. Use kubectl exec from the Prometheus pod to test connectivity to the target before blaming the scrape config.
Key Takeaway
Prometheus service discovery is annotation-driven and relabeling-configured. The most common production failures are: missing annotations, NetworkPolicy blocking scrapes, and multi-container port confusion. Always set prometheus.io/scrape, prometheus.io/port, and prometheus.io/path explicitly.
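
As a sketch of the NetworkPolicy advice above, the policy below admits scrape traffic from the Prometheus namespace. It assumes Prometheus runs in a namespace called monitoring, that the application pods carry the label app: myapp, and that metrics are served on port 9091; adjust all three to your cluster.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape     # hypothetical policy name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
    - Ingress
  ingress:
    - from:
        # kubernetes.io/metadata.name is set automatically on namespaces
        # in current Kubernetes versions.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9091
```

NetworkPolicies are additive, so this rule can sit alongside stricter policies without widening access to the application's main port.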

ServiceMonitor vs ScrapeConfig Decision Guide

When using the Prometheus Operator, you have two primary ways to define scrape targets: ServiceMonitor (and its cousin PodMonitor) and the newer ScrapeConfig CRD (available since Prometheus Operator v0.65). Understanding which to use is critical for production architecture.

ServiceMonitor is the original CRD. It is designed to scrape a Kubernetes Service. You specify a service selector (by labels), and the Operator automatically discovers all endpoints behind that service. It adds semantic constraints: the service must expose one or more ports, and you can optionally specify a path, interval, and metric relabeling. ServiceMonitor is high-level: the Operator handles converting service endpoints to individual pod IP addresses.

ScrapeConfig is a lower-level CRD. It maps closely onto the Prometheus scrape_config block, using the operator's camelCase field names: kubernetesSDConfigs, relabelings, and metricRelabelings correspond to kubernetes_sd_configs, relabel_configs, and metric_relabel_configs in a raw Prometheus YAML file. There are no implicit service-based semantics. This gives you full control but requires more expertise.

The decision tree below helps pick the right one.

Use ServiceMonitor when
  • You want to scrape a standard Kubernetes Service (preferred for 90% of cases).
  • You need simplicity: just select the service by label and define a port.
  • You want the Operator to dynamically follow pods as they scale or roll.
  • You are monitoring a typical application pod behind a Service.
Use ScrapeConfig when
  • You need full control over relabeling, including non-standard discovery (e.g., scraping a non-Kubernetes endpoint, or a static target).
  • You need to scrape a target that is not behind a Service (e.g., a DaemonSet pod that you want to scrape by node).
  • You want to use the full power of kubernetes_sd_configs roles beyond endpoints (e.g., node, ingress).
  • You are migrating from a raw Prometheus configuration and want to keep the same scrape config syntax.

A common production pattern: use ServiceMonitor for all user-facing services (HTTP APIs, gRPC), and ScrapeConfig for infrastructure components (node-exporter, kube-state-metrics, or custom exporters that expose metrics on non-standard ports).

servicemonitor-vs-scrapeconfig.yamlYAML
# ServiceMonitor — simple, operator-managed
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-service-monitor
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    metricRelabelings:
    - sourceLabels: [__name__]
      regex: '.*_total'
      action: keep
---
# ScrapeConfig: lower-level control over discovery and relabeling.
# The CRD is served under v1alpha1 and uses camelCase field names.
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: myapp-scrape-config
  labels:
    app: myapp
spec:
  jobName: myapp-scrape
  kubernetesSDConfigs:
  - role: Pod
    namespaces:
      names: [production]
  relabelings:
  - sourceLabels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: 'true'
  - sourceLabels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: '([^:]+)(?::\d+)?;(\d+)'
    replacement: '$1:$2'
    targetLabel: __address__
  metricRelabelings:
  - sourceLabels: [__name__]
    regex: '.*_total'
    action: keep
Output
ServiceMonitor targets the Service. ScrapeConfig uses pod-level discovery with custom relabeling.
ScrapeConfig replaces the need for raw prometheus.yaml
  • ServiceMonitor: high-level, service-oriented, Operator abstracts pod IPs.
  • ScrapeConfig: low-level, maps onto the fields of a Prometheus scrape_config, full control over discovery.
  • Use ServiceMonitor for typical HTTP services on standard ports.
  • Use ScrapeConfig for node-level scraping, non-Kubernetes targets, or complex relabeling.
  • ScrapeConfig supports all kubernetes_sd_config roles: pod, service, endpoints, node, ingress.
Production Insight
In large clusters with hundreds of services, ServiceMonitor label selection should be precise to avoid scraping unintended endpoints. Always use matchLabels with a specific app label, not broad selectors that could match system services. ScrapeConfig on the other hand lets you define exactly which kubernetes_sd_config role to use and filter with relabel_configs. A common pattern is to use ServiceMonitor for business applications and ScrapeConfig for infrastructure components like node-exporter, cAdvisor, or custom DaemonSet-based exporters.
Key Takeaway
ServiceMonitor is the default choice for scraping services; ScrapeConfig gives full Prometheus config semantics. Use ScrapeConfig when you need non-service-based discovery or fine-grained metric relabeling that ServiceMonitor cannot express. Both CRDs are reconciled by the Prometheus Operator.
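
One of the ScrapeConfig use cases listed above is a target outside Kubernetes. The following is a minimal sketch under several assumptions: the host name is hypothetical, the ScrapeConfig CRD is installed under the v1alpha1 API version, and the Prometheus CR has a scrapeConfigSelector that matches the label shown.

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: legacy-db-exporter            # hypothetical name
  namespace: monitoring
  labels:
    app: myapp                        # must match the Prometheus CR's scrapeConfigSelector
spec:
  # Static targets need no service discovery at all.
  staticConfigs:
    - targets:
        - legacy-db.example.internal:9100   # hypothetical host outside the cluster
      labels:
        environment: production
  metricsPath: /metrics
  scrapeInterval: 30s
```

A ServiceMonitor cannot express this directly because there is no Kubernetes Service or pod to select.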

Exposing Custom Application Metrics with the Prometheus Client Libraries

Kubernetes infrastructure metrics come from three places: container CPU, memory, and network usage from cAdvisor (built into the kubelet), node-level metrics from node-exporter, and object state from kube-state-metrics. But the metrics that make or break your SLOs are application-level: request latency, error rates, queue depth, cache hit ratio. These come from instrumenting your own code.

Prometheus has four core metric types you need to understand at the semantic level, not just the API level:

Counter — a value that only goes up (resets to zero on restart). Use it for total requests, total errors, total bytes sent. Never use a counter for something that can decrease. PromQL's rate() and increase() functions unwrap counters properly, handling resets.

Gauge — a value that can go up or down. Use it for current queue depth, active connections, temperature, memory usage. Don't use rate() on a gauge — it's meaningless.

Histogram — pre-aggregated bucketed observations. Use it for latency and request size. It exposes three time series: _bucket, _sum, and _count. The bucket boundaries you choose at instrumentation time are permanent — you can't change them without restarting the process.

Summary — client-side computed quantiles. Use it only when you need accurate quantiles and can't aggregate across instances (summaries can't be aggregated in PromQL). In Kubernetes with multiple replicas, histograms are almost always the right choice over summaries.

instrumented_http_server.goGO
package main

import (
	"math/rand"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "payment_service_http_requests_total",
			Help: "Total number of HTTP requests processed by the payment service.",
		},
		[]string{"method", "endpoint", "status_code"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "payment_service_http_request_duration_seconds",
			Help:    "HTTP request duration in seconds, bucketed by endpoint.",
			Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5},
		},
		[]string{"method", "endpoint"},
	)

	inFlightRequests = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_in_flight_requests",
			Help: "Number of HTTP requests currently being processed.",
		},
	)

	paymentQueueDepth = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_queue_depth",
			Help: "Current number of payments waiting in the processing queue.",
		},
	)
)

// instrumentedHandler wraps a handler and records in-flight requests,
// request duration, and the request count by status code.
func instrumentedHandler(endpoint string, handlerFunc http.HandlerFunc) http.HandlerFunc {
	return func(responseWriter http.ResponseWriter, request *http.Request) {
		inFlightRequests.Inc()
		defer inFlightRequests.Dec()
		startTime := time.Now()
		wrappedWriter := &statusCapturingWriter{ResponseWriter: responseWriter, statusCode: http.StatusOK}
		handlerFunc(wrappedWriter, request)
		durationSeconds := time.Since(startTime).Seconds()
		httpRequestDuration.WithLabelValues(request.Method, endpoint).Observe(durationSeconds)
		httpRequestsTotal.WithLabelValues(
			request.Method,
			endpoint,
			strconv.Itoa(wrappedWriter.statusCode),
		).Inc()
	}
}

// statusCapturingWriter remembers the status code written by the wrapped handler.
type statusCapturingWriter struct {
	http.ResponseWriter
	statusCode int
}

func (scw *statusCapturingWriter) WriteHeader(code int) {
	scw.statusCode = code
	scw.ResponseWriter.WriteHeader(code)
}

// processPayment simulates 5-299ms of work and fails about 2% of requests.
func processPayment(responseWriter http.ResponseWriter, request *http.Request) {
	processingTime := time.Duration(5+rand.Intn(295)) * time.Millisecond
	time.Sleep(processingTime)
	if rand.Float64() < 0.02 {
		http.Error(responseWriter, "upstream payment gateway timeout", http.StatusGatewayTimeout)
		return
	}
	responseWriter.WriteHeader(http.StatusOK)
	responseWriter.Write([]byte(`{"status":"processed"}`))
}

func main() {
	// Simulate a queue whose depth changes every 5 seconds.
	go func() {
		for {
			paymentQueueDepth.Set(float64(rand.Intn(500)))
			time.Sleep(5 * time.Second)
		}
	}()
	http.HandleFunc("/api/v1/payments", instrumentedHandler("/api/v1/payments", processPayment))
	http.Handle("/metrics", promhttp.Handler())
	// Both listeners share the default mux, so each port serves both paths.
	go http.ListenAndServe(":9091", nil)
	http.ListenAndServe(":8080", nil)
}
Output
# Prometheus scrapes http://payment-service-pod:9091/metrics and receives the counter, histogram, and gauge series defined above.
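
To connect the instrumented service to an operator-managed Prometheus, a Service and a ServiceMonitor along the following lines could be used. This is a sketch: the object names, the production namespace, and the app: payment-service label are assumptions, while the ports 8080 and 9091 come from the Go program above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service
spec:
  selector:
    app: payment-service
  ports:
    - name: http
      port: 8080
      targetPort: 8080
    - name: metrics              # referenced by name in the ServiceMonitor
      port: 9091
      targetPort: 9091
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
  namespace: production
  labels:
    app: payment-service         # must match the Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payment-service       # selects the Service above by its labels
  endpoints:
    - port: metrics
      interval: 30s
```

The ServiceMonitor refers to the port by name, so renaming the Service port without updating the ServiceMonitor silently removes the target. The Prometheus CR's serviceMonitorNamespaceSelector also has to admit the production namespace.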

Production-Grade Recording Rules and Alerting That Won't Page You at 3am

Raw PromQL queries against high-cardinality data are expensive. A query like rate(http_requests_total[5m]) across 200 pods runs every time a dashboard loads. In large clusters, this causes Prometheus to churn through millions of samples per query, leading to query timeouts and the dreaded 'query timed out in expression evaluation' error.

Recording rules solve this by pre-computing expensive expressions and storing the result as a new time series. Prometheus evaluates recording rules on its evaluation interval (typically 1m), writes the result into its TSDB, and future queries read that cheap pre-computed series instead of re-scanning the raw data.

Naming matters. The Prometheus community convention for recording rule names is level:metric:operations. For example job:http_requests_total:rate5m means: aggregated at the job level, derived from http_requests_total, computed as a 5-minute rate. Sticking to this convention makes rules self-documenting and searchable.

Alerts in Prometheus are defined in the same YAML format as recording rules. The critical production insight is that alerts should express SLO burn rates, not raw thresholds. An alert that fires when error rate > 1% will fire constantly during minor blips. An alert based on a multi-window burn rate (Google's SRE model) only fires when you're burning through your error budget fast enough to exhaust it within a prediction window — dramatically reducing noise.
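
The multi-window idea can be sketched as a pair of recording rules and one alert. The example assumes a 99.9% availability SLO over 30 days (so the error budget is 0.001) and uses the 1h/5m window pair with a 14.4x burn-rate factor from Google's SRE guidance; the rule and alert names are hypothetical. At 14.4x, one hour of burning consumes 14.4 / 720 = 2% of a 30-day budget.

```yaml
groups:
  - name: payment_service_burn_rate_sketch      # hypothetical group name
    rules:
      # Long window: proves the problem is sustained.
      - record: job:payment_service_error_ratio:rate1h
        expr: |
          sum by (job) (rate(payment_service_http_requests_total{status_code=~"5.."}[1h]))
          /
          sum by (job) (rate(payment_service_http_requests_total[1h]))

      # Short window: proves the problem is still happening right now.
      - record: job:payment_service_error_ratio:rate5m
        expr: |
          sum by (job) (rate(payment_service_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(payment_service_http_requests_total[5m]))

      - alert: PaymentServiceErrorBudgetFastBurn  # hypothetical alert name
        # Both windows must exceed 14.4 x the 0.1% error budget.
        expr: |
          job:payment_service_error_ratio:rate1h > (14.4 * 0.001)
          and
          job:payment_service_error_ratio:rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
```

Requiring both windows is what cuts the noise: a short blip trips only the 5m series, and a resolved incident stops tripping it long before the 1h series recovers.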

payment-service-rules.yamlYAML
# PrometheusRule custom resource — picked up automatically by the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-slo-rules
  namespace: production
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: payment_service_recording_rules
      interval: 1m
      rules:
        # Aggregated to (job, endpoint) so the series matches its name.
        - record: job_endpoint:payment_service_http_requests_total:rate5m
          expr: |
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total[5m])
            )

        - record: job_endpoint:payment_service_error_ratio:rate5m
          expr: |
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total{status_code=~"5.."}[5m])
            )
            /
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total[5m])
            )

        - record: job_endpoint:payment_service_latency_p99:rate5m
          expr: |
            histogram_quantile(
              0.99,
              sum by (job, endpoint, le) (
                rate(payment_service_http_request_duration_seconds_bucket[5m])
              )
            )

    - name: payment_service_alerts
      rules:
        - alert: PaymentServiceHighErrorBurnRate
          expr: |
            job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment service burning error budget at critical rate"
            # Kept as an annotation so Alertmanager templates can read it
            # via .CommonAnnotations.runbook_url.
            runbook_url: https://wiki.company.com/runbooks/payment-service-errors
            description: |
              Endpoint {{ $labels.endpoint }} error ratio is {{ $value | humanizePercentage }}.

        - alert: PaymentServiceHighLatency
          expr: |
            job_endpoint:payment_service_latency_p99:rate5m > 0.5
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment service p99 latency exceeds SLO threshold"

        - alert: PaymentQueueConsumerDead
          # on() is required: the left side has labels and the sum() on the
          # right has none, so a plain "and" would never match.
          expr: |
            payment_service_queue_depth == 0
            and on()
            sum(job_endpoint:payment_service_http_requests_total:rate5m) > 10
          for: 3m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment queue depth is zero but traffic is flowing"
Output
# PrometheusRule applied. Operator picks it up via label matching.

Alertmanager: Routing, Silencing, and Deduplication

Prometheus evaluates alert rules and sends firing alerts to Alertmanager. Alertmanager is a separate component responsible for deduplicating, grouping, routing, and silencing alerts before sending them to notification channels (PagerDuty, Slack, email, Opsgenie).

Alertmanager's routing tree is a hierarchical matching system. An alert's labels are matched against the route configuration. The first matching route determines the receiver, unless that route sets continue: true, in which case the alert is also checked against the sibling routes after it. This means your alert labels (severity, team, service) must be carefully designed to match your routing tree.
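
The effect of route ordering is easiest to see in a stripped-down tree. The sketch below uses hypothetical receiver names with no notification integrations configured, and the matchers syntax, which needs Alertmanager 0.22 or newer.

```yaml
route:
  receiver: default                 # fallback when no child route matches
  routes:
    # Evaluated first. continue: true lets the alert go on to later siblings.
    - matchers:
        - team="payments"
      receiver: payments-slack
      continue: true
    # Without continue on the route above, a critical payments alert
    # would stop there and never reach the pager.
    - matchers:
        - severity="critical"
      receiver: on-call-pager

receivers:
  - name: default
  - name: payments-slack
  - name: on-call-pager
```

An alert labelled team=payments and severity=critical is delivered to both payments-slack and on-call-pager here. Remove the continue line and only payments-slack receives it.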

alertmanager-config.yamlYAML
# Alertmanager configuration — deployed as a Secret in the monitoring namespace.
# The Prometheus Operator picks this up from the alertmanager.yaml key.
global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxx'

route:
  # Default receiver for alerts that don't match any sub-route.
  receiver: 'slack-catchall'
  group_by: ['alertname', 'namespace', 'endpoint']
  group_wait: 30s         # Wait 30s to group similar alerts.
  group_interval: 5m      # Send grouped updates every 5m.
  repeat_interval: 4h     # Re-send unresolved alerts every 4h.

  routes:
    # Critical alerts -> PagerDuty (page the on-call engineer).
    # continue: true lets the alert keep matching the sibling routes below;
    # without it the first match wins and team routing is never reached.
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s       # Page faster for critical alerts.
      repeat_interval: 1h   # Re-page every hour if unresolved.
      continue: true

    # Warning alerts -> Slack channel (no page, just notification).
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 1m
      repeat_interval: 4h
      continue: true

    # Team-specific routing.
    - match:
        team: payments
      receiver: 'slack-payments-team'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
        severity: 'critical'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
        send_resolved: true

  - name: 'slack-payments-team'
    slack_configs:
      - channel: '#payments-alerts'
        title: '[Payments] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'

  - name: 'slack-catchall'
    slack_configs:
      - channel: '#alerts-catchall'
        title: 'Unrouted Alert: {{ .GroupLabels.alertname }}'

# Inhibition: suppress warning alerts when a critical alert is already firing
# for the same service. Prevents alert storms.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'namespace']
Output
Alertmanager configured with PagerDuty for critical, Slack for warnings, and inhibition rules.
Alertmanager's Five Core Functions
  • Deduplication: Same alert fingerprint = one notification.
  • Grouping: group_by determines which alerts are batched together.
  • Inhibition: Higher-severity alerts suppress lower-severity alerts for the same context.
  • Silences: Temporary muting of alerts during maintenance windows.
  • Routing: Label matching determines which receiver (PagerDuty, Slack, email) gets the alert.
Production Insight
The group_by field in Alertmanager controls notification batching. If you group by alertname only, all pods with the same alert are grouped into one notification, which is good for reducing noise. If you group by alertname and pod, each pod gets its own notification, which is bad during a cluster-wide outage where 200 pods trigger the same alert. The production default should be group_by: ['alertname', 'namespace'] to batch by alert name and namespace. Use group_wait: 30s to allow grouping before sending.
Key Takeaway
Alertmanager is the routing and deduplication layer between Prometheus and your notification channels. Inhibition rules prevent alert storms, and group_by controls notification batching. Always configure inhibition rules to suppress lower-severity alerts when higher-severity alerts are already firing.

PromQL Common Query Cheat Sheet

PromQL is the query language for Prometheus. Whether you're building Grafana dashboards, writing alerting rules, or debugging a production issue, having a mental library of common PromQL patterns saves hours. Below is a cheat sheet of the most useful queries for Kubernetes monitoring.

Infrastructure Metrics - CPU usage per pod: rate(container_cpu_usage_seconds_total{container!=\"POD\", image!=\"\"}[5m]) - Memory usage per pod: container_memory_working_set_bytes{container!=\"POD\", image!=\"\"} - Network receive bytes per pod: rate(container_network_receive_bytes_total[5m]) - Disk reads per pod: rate(container_fs_reads_bytes_total[5m])

Application Metrics (assuming custom metrics like payment_service_http_requests_total) - Request rate per second: rate(payment_service_http_requests_total[5m]) - Error rate per second: rate(payment_service_http_requests_total{status_code=~\"5..\"}[5m]) - Error ratio (percentage): sum(rate(payment_service_http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(payment_service_http_requests_total[5m])) - p99 latency: histogram_quantile(0.99, sum(rate(payment_service_http_request_duration_seconds_bucket[5m])) by (le, endpoint)) - Average latency: rate(payment_service_http_request_duration_seconds_sum[5m]) / rate(payment_service_http_request_duration_seconds_count[5m])

Prometheus Self-Monitoring - Active time series: prometheus_tsdb_head_series - Memory usage: process_resident_memory_bytes - Scrape duration per job: scrape_duration_seconds - Targets up per job: up (returns 1 if target is up, 0 if down)

Alerting Patterns - Pager-worthy: job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001) (14.4 x 0.1% = 1.44% error rate over 5m, burning through monthly SLO in hours) - No data: absent(prometheus_tsdb_head_series) — fires when Prometheus itself is down - Pod restart detection: changes(process_start_time_seconds[1h]) > 0

Cardinality Detection
  • Top 10 metric names by series count: topk(10, count by (__name__)({__name__=~".+"}))
  • Series count for a specific metric: count(payment_service_http_requests_total)

promql-cheat-sheet.promql

# ---- Infrastructure Metrics ----

# CPU usage rate (5m window) per pod
rate(container_cpu_usage_seconds_total{container!="POD", image!=""}[5m])

# Memory working set per pod (Gauge, no rate)
container_memory_working_set_bytes{container!="POD", image!=""}

# Network bytes received per second per pod
rate(container_network_receive_bytes_total[5m])

# ---- Application Request Metrics ----

# Total request rate
sum(rate(payment_service_http_requests_total[5m]))

# Error ratio (fraction of requests returning 5xx)
sum(rate(payment_service_http_requests_total{status_code=~"5.."}[5m])) / sum(rate(payment_service_http_requests_total[5m]))

# p99 latency (uses histogram_quantile)
histogram_quantile(0.99, sum(rate(payment_service_http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# ---- Prometheus Self-Monitoring ----

# Active time series count
prometheus_tsdb_head_series

# Prometheus process memory
process_resident_memory_bytes

# ---- Cardinality Analysis ----

# Top 10 metric names by number of time series
topk(10, count by (__name__)({__name__=~".+"}))

# ---- Alerting Patterns ----

# High error burn rate (SLO-based)
job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001)

# Detect pod restart in last hour (process start time changed)
changes(process_start_time_seconds[1h]) > 0

# Absence of data (evaluate from another Prometheus; a down instance cannot alert on itself)
absent(prometheus_tsdb_head_series)

These PromQL queries can be run directly in the Prometheus UI expression browser, in Grafana panels, or in alert rules.

Tip: Always Use Recording Rules for Repeated Queries

The error ratio query above is expensive: it scans all error series and total series, then divides. If you use this query in five dashboards and one alert, you're executing it six times every evaluation cycle. Create a recording rule job_endpoint:payment_service_error_ratio:rate5m and query that instead.

Every PromQL query you write should be a candidate for a recording rule if it appears in more than one place.

  • Recording rules pre-compute expensive queries into cheap time series.
  • Use rate() for counters, not raw values (counter resets break averages).
  • Use histogram_quantile() with sum by (le, ...) for aggregated percentiles.
  • Avoid match-all selectors such as {__name__=~".+"} in dashboards and alerts. Always filter by at least one label.
  • Check query performance with the Prometheus UI's query analysis (explain button).

Production insight
The most common PromQL mistake is using rate() on a gauge: the result is meaningless because gauges can go down. Another frequent error is forgetting to sum by the correct labels when using histogram_quantile(): you must include le (bucket upper bound) in the by clause, otherwise the function has no buckets to work with and returns no usable result. When debugging high memory, the topk(10, count by (__name__)({__name__=~".+"})) query quickly identifies which metric name has the most time series. Often the culprit is a high-cardinality label on a single metric.

Key takeaway
Master these core PromQL patterns: rate (counters), gauge (raw values), histogram_quantile (latency percentiles), topk (cardinality detection). Use recording rules to optimize expensive queries. Always filter by at least one label to avoid scanning the entire TSDB.
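To see why histogram_quantile() needs the le label, it helps to look at what the function does with the buckets. The sketch below is a simplified Python model of the classic-histogram calculation (cumulative bucket counts plus linear interpolation inside the matching bucket). It is an illustration under stated assumptions, not Prometheus's source code: it ignores native histograms and several edge cases, and the sample bucket values are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (le_upper_bound, cumulative_count) pairs.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    if q < 0:
        return -math.inf
    if q > 1:
        return math.inf
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        return math.nan  # without le buckets there is nothing to interpolate
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds: 50 requests <= 0.1s, 90 <= 0.5s, and so on.
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

The result is only as precise as the bucket layout: any quantile that falls into the +Inf bucket is reported as the highest finite bound, which is why bucket boundaries should bracket the latencies you care about.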
Prometheus Storage: TSDB Internals, Retention, and Thanos/Cortex for Long-Term

Prometheus stores metrics in its own time-series database (TSDB). Understanding TSDB internals is critical for capacity planning, retention tuning, and deciding when to add long-term storage.

TSDB stores data in blocks. Each block covers a 2-hour time range and contains a chunks directory (compressed metric samples) and an index. The head block is the in-memory block that receives all new samples; a write-ahead log (WAL) on disk protects it against crashes. Every 2 hours, the head block is compacted into a persistent block and flushed to disk. Old blocks are compacted into larger blocks (e.g., 2h blocks into 1-day blocks) to reduce the number of files.

prometheus-retention-config.yaml

# Prometheus custom resource with retention and storage configuration
# (the Prometheus Operator turns it into a StatefulSet).
# For the kube-prometheus-stack Helm chart, these go in prometheus.prometheusSpec.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  # Retention: how long to keep data locally.
  # 15d is typical. Longer retention = more disk and memory.
  retention: 15d

  # Retention by size: delete oldest blocks when storage exceeds this.
  # Use this as a safety net alongside time-based retention.
  retentionSize: 50GB

  # Storage: PVC for persistent TSDB blocks.
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi

  # Resources: Prometheus is memory-hungry.
  # Rule of thumb: ~16KB per active time series.
  # 1M series = 16GB RAM. Plan accordingly.
  resources:
    requests:
      cpu: '1'
      memory: 16Gi
    limits:
      cpu: '4'
      memory: 32Gi

  # External labels: applied to all metrics when using Thanos/Cortex.
  # Identifies which Prometheus instance scraped the data.
  externalLabels:
    cluster: production-us-east-1
    environment: production

  # Thanos sidecar: uploads blocks to object storage for long-term retention.
  thanos:
    objectStorageConfig:
      name: thanos-objstore-config
      key: objstore.yml

  # NOTE: the two keys below do not set a sample limit. They are
  # kube-prometheus-stack Helm values (not fields of the Prometheus CR) that
  # control which ServiceMonitors and PodMonitors are selected.
  # A per-target sample limit is set with sampleLimit on a ServiceMonitor or PodMonitor.
  serviceMonitorSelectorNilUsesHelmValues: false
  podMonitorSelectorNilUsesHelmValues: false

Prometheus configured with 15-day retention, 50GB size limit, fast SSD storage, and Thanos sidecar for long-term block upload to object storage.

Mental model: When to Add Thanos or Cortex

Prometheus is designed for short-term, per-cluster monitoring. It has no global query view across clusters, cannot store data longer than a few weeks efficiently, and is a single point of failure. Thanos and Cortex solve these problems.

Add Thanos when you need: cross-cluster query federation, retention beyond 30 days, or HA for Prometheus.

  • Prometheus: Single cluster, short-term (days to weeks). No native HA or global query view.
  • Thanos: Sidecar uploads blocks to S3/GCS. Querier federates across Prometheus instances. Compactor reduces storage costs. Best for multi-cluster with object storage.
  • Cortex: Horizontally scalable multi-tenant Prometheus backend. Used by Grafana Cloud. Best for multi-tenant SaaS platforms.
  • VictoriaMetrics: Drop-in Prometheus replacement with better compression and lower resource usage. Best for single-cluster with high cardinality.
  • Decision: Use Thanos for multi-cluster with object storage. Use Cortex for multi-tenant SaaS. Use VictoriaMetrics for single-cluster resource optimization.

Production insight
Prometheus's memory usage is directly proportional to the number of active time series in the TSDB head block. Each active series consumes approximately 16KB of memory. If you have 2 million active series, you need approximately 32GB of RAM for the head block alone, plus overhead for queries and compaction. Monitor prometheus_tsdb_head_series and process_resident_memory_bytes. Set retention based on disk capacity. The sizing formula in the Prometheus storage documentation is retention seconds x ingested samples per second x 1 to 2 bytes per sample, so 15 days at 1M series with a 15s scrape interval comes to roughly 86 to 173GB of disk. With the 50GB retentionSize shown above, size-based retention would trim data before the 15 days are up.
Use fast SSDs for TSDB: network-attached storage introduces latency that slows compaction and can cause WAL corruption during power loss.

Prometheus Storage Decision Tree
  • Single cluster, retention under 15 days, under 5M active series: Standalone Prometheus with local TSDB and PVC on fast SSD. No additional components needed.
  • Multiple clusters, need cross-cluster query federation: Deploy Thanos Sidecar on each Prometheus. Add Thanos Querier for global view. Add Thanos Store Gateway for historical queries.
  • Retention beyond 30 days or more than 5M active series: Add Thanos Sidecar with object storage (S3/GCS). Add Thanos Compactor to reduce storage costs. Keep local retention short (7d) and rely on object storage for long-term.
  • Multi-tenant SaaS platform with per-customer isolation: Use Cortex or Grafana Mimir for horizontal scalability and per-tenant resource limits.
  • Single cluster with extreme cardinality (10M+ series): Consider VictoriaMetrics as a drop-in replacement. It offers better compression (10x) and lower memory usage than Prometheus TSDB.

Key takeaway
Prometheus TSDB is a block-based storage engine with an in-memory head block. Memory is proportional to active series count. Use fast SSDs for TSDB. Add Thanos for cross-cluster federation and long-term retention beyond local disk capacity. Plan capacity based on series count: 1M series equals approximately 16GB RAM and, at a 15s scrape interval, roughly 86 to 173GB of disk for 15 days.
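Capacity figures like these are easy to sanity-check with a few lines of arithmetic. The sketch below is a rough estimate only: the 16KB per series value is this article's rule of thumb, the 1 to 2 bytes per sample range is taken from the sizing formula in the Prometheus storage documentation, and the function names are illustrative. Real usage depends on label sizes, churn, and query load.

```python
BYTES_PER_SERIES = 16_000  # assumption: ~16KB per active series (rule of thumb)


def head_memory_gb(active_series: int) -> float:
    """Approximate RAM needed for the TSDB head block, in GB."""
    return active_series * BYTES_PER_SERIES / 1e9


def disk_gb(active_series: int, scrape_interval_s: float,
            retention_days: float, bytes_per_sample: float) -> float:
    """Approximate disk: retention seconds x samples per second x bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample / 1e9


if __name__ == "__main__":
    series = 1_000_000
    print(f"head memory: ~{head_memory_gb(series):.0f} GB")
    low = disk_gb(series, 15, 15, 1.0)
    high = disk_gb(series, 15, 15, 2.0)
    print(f"disk for 15d at 15s scrape: ~{low:.0f} to {high:.0f} GB")
```

Running the same functions with your own series count and scrape interval gives a first-pass figure for PVC size and memory requests before measuring a live instance.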

● Production incident · POST-MORTEM · severity: high

Prometheus OOMKill from High-Cardinality Label Explosion

Symptom
Prometheus pod restarted with OOMKill (exit code 137). Memory usage grew steeply in the 6 hours before the crash. The prometheus_tsdb_head_series metric showed millions of active series. The /targets page showed all targets as UP, so scraping itself was healthy.
Assumption
Prometheus needed more memory. The cluster had grown and was generating more metrics.
Root cause
A developer instrumented a counter with a user_id label to track per-user request counts. With 50,000 unique users per hour and about 5 combinations of the other labels (method, endpoint, status_code) per user, the metric generated 50,000 * 5 = 250,000 new time series per hour. Each time series consumes memory in Prometheus's TSDB head block. After 6 hours, the head block contained over 1.5 million active series for a single metric, which at roughly 16KB per series needs about 24GB of RAM. The Prometheus pod was configured with a 16GB memory limit and was OOMKilled.
Fix
1. Removed the user_id label from the counter immediately and redeployed the application.
2. Added a Prometheus recording rule to aggregate by user tier (free, premium, enterprise) instead of individual user_id.
3. Added sample_limit: 1000 to the scrape config to prevent future label explosions from a single target.
4. Deployed a cardinality-linter CI check that rejects metrics with more than 3 labels in code review.
5. Added a Prometheus alert on prometheus_tsdb_head_series > 1000000 to catch future explosions early.
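The post-mortem does not show the cardinality linter itself, so the following is only a guess at what such a CI check might look like. It is a minimal sketch that scans Prometheus text exposition output (for example, captured from a service's /metrics endpoint in a test) and flags metrics with more than 3 labels or with a label from a denylist. The names MAX_LABELS, DENYLIST, and lint_exposition are hypothetical, and the regex parsing is deliberately simple: it does not handle every escaping rule of the exposition format, and it counts le on histogram buckets as a label.

```python
import re

MAX_LABELS = 3  # threshold taken from the fix described above
DENYLIST = {"user_id", "request_id", "trace_id"}  # labels with unbounded values

# metric_name, optional {label="value",...}, then whitespace before the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="')


def lint_exposition(text: str) -> list[str]:
    """Return a sorted list of cardinality problems found in exposition text."""
    problems = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, label_blob = match.group(1), match.group(2) or ""
        labels = set(LABEL_RE.findall(label_blob))
        if len(labels) > MAX_LABELS:
            problems.add(f"{name}: {len(labels)} labels (max {MAX_LABELS})")
        for bad in sorted(labels & DENYLIST):
            problems.add(f"{name}: unbounded label '{bad}'")
    return sorted(problems)


if __name__ == "__main__":
    sample = (
        'http_requests_total{method="GET",endpoint="/pay",'
        'status_code="200",user_id="42"} 7\n'
        "up 1\n"
    )
    for problem in lint_exposition(sample):
        print(problem)
```

In CI, a non-empty result would fail the build. A denylist catches the known offenders, while the label-count limit is a blunt but cheap guard against new ones.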
Key lesson
  • High-cardinality labels (user_id, request_id, trace_id) will destroy Prometheus. Never add unbounded values as label values.
  • Each unique combination of label values creates a new time series. 5 labels with 10 values each = 100,000 series per metric name.
  • Set sample_limit on scrape configs as a safety net. It drops scrapes from targets that expose too many series.
  • Monitor Prometheus's own metrics: prometheus_tsdb_head_series, prometheus_tsdb_head_chunks, and memory usage. Alert before OOMKill.
  • Enforce cardinality limits in CI/CD. A single bad label can take down monitoring for the entire cluster.
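The arithmetic behind these lessons is simple enough to check by hand, and a short script makes the growth concrete. This is a sketch under the incident's stated numbers (50,000 users per hour, 5 combinations of the other labels per user, about 16KB per series); the function names are illustrative.

```python
from math import prod

BYTES_PER_SERIES = 16_000  # assumption: ~16KB per active series (rule of thumb)


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_value_counts.values())


def incident_series(users_per_hour: int, combos_per_user: int, hours: int) -> int:
    """Series created when every new user adds combos_per_user series."""
    return users_per_hour * combos_per_user * hours


if __name__ == "__main__":
    # Five labels with ten values each.
    print(worst_case_series({f"label_{i}": 10 for i in range(5)}))

    # The incident: new user_id values keep arriving every hour.
    series = incident_series(users_per_hour=50_000, combos_per_user=5, hours=6)
    print(series, "series,", series * BYTES_PER_SERIES / 1e9, "GB")
```

The product is a worst case, since not every label combination occurs in practice, but an unbounded label such as user_id grows with traffic and never levels off.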
Production debug guide
Symptom-first investigation path for Prometheus failures in Kubernetes.
Symptom · 01
Target showing as DOWN on Prometheus /targets page.
Fix
Check if the pod is running and the metrics endpoint returns 200. Verify the prometheus.io/scrape annotation is set. Check NetworkPolicy: Prometheus must be able to reach the target pod's IP.
Symptom · 02
Target showing as UNKNOWN: Prometheus has not completed a scrape of it yet.
Fix
UNKNOWN is the state of a target before its first scrape finishes. If it persists, treat it as a network issue. Check if the pod exists, has an IP, and Prometheus can reach it. Common cause: pod restarted and Prometheus has a stale target. Wait for the next service discovery refresh.
Symptom · 03
Query returns 'query timed out in expression evaluation'.
Fix
The query is too expensive. Check for high-cardinality selectors. Add recording rules to pre-compute expensive expressions. Check Prometheus CPU/memory usage — it may be under-provisioned.
Symptom · 04
Grafana dashboards show gaps in metrics.
Fix
Check Prometheus /targets for flapping targets (alternating UP/DOWN). Check scrape duration — if it exceeds scrape_interval, samples are missed. Check for Prometheus restarts (TSDB WAL replay takes time).
Symptom · 05
Alerts not firing when they should.
Fix
Check Prometheus /rules page — is the rule group evaluating? Check the for duration — the alert may be in PENDING state. Check Alertmanager routing — the alert may be firing but silenced or routed to the wrong receiver.
Symptom · 06
Prometheus consuming excessive memory.
Fix
Check prometheus_tsdb_head_series: if it is over 5M, you have a cardinality problem. Run promtool tsdb analyze against the TSDB data directory to find the highest-cardinality metric names. Look for labels with unbounded values (user_id, request_id).
★ Prometheus Triage Commands
Rapid commands to isolate Prometheus monitoring issues.
Target showing DOWN or UNKNOWN.
Immediate action
Check target health and scrape annotations.
Commands
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.annotations.prometheus\.io/scrape}{"\n"}{end}'
kubectl exec -n monitoring deploy/prometheus -- wget -qO- http://<target-ip>:<port>/metrics | head -5
Fix now
If annotation is missing, add it. If wget fails, check NetworkPolicy and pod readiness.
Prometheus memory growing rapidly.
Immediate action
Check TSDB head series count and find high-cardinality metrics.
Commands
curl -s 'http://prometheus:9090/api/v1/query?query=prometheus_tsdb_head_series' | jq '.data.result[0].value[1]'
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | sort_by(.value) | reverse | .[0:10]'
Fix now
If head series > 5M, identify the top metric and reduce its cardinality. Add sample_limit to scrape configs.
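The same check can be scripted when jq is not available or when the result should feed another tool. The sketch below reads the TSDB status endpoint used in the curl command above and ranks metric names by series count. It assumes the response shape documented for /api/v1/status/tsdb (a data.seriesCountByMetricName list of name/value entries) and reuses the http://prometheus:9090 address from the commands; the function names are illustrative.

```python
import json
import urllib.request


def fetch_tsdb_status(base_url: str) -> dict:
    """Fetch TSDB statistics from a Prometheus server (performs an HTTP request)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb", timeout=10) as resp:
        return json.load(resp)


def top_metrics(status: dict, n: int = 10) -> list[tuple[str, int]]:
    """Return the n metric names with the most series, highest first."""
    entries = status["data"]["seriesCountByMetricName"]
    ranked = sorted(entries, key=lambda e: int(e["value"]), reverse=True)
    return [(e["name"], int(e["value"])) for e in ranked[:n]]
```

A typical use would be top_metrics(fetch_tsdb_status("http://prometheus:9090")), run from a pod that can reach the Prometheus service. Keeping the ranking separate from the HTTP call makes it easy to test against a saved response.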
Recording rules not evaluating.
Immediate action
Check PrometheusRule CR label matching and operator logs.
Commands
kubectl get prometheusrule -A -o json | jq '.items[] | select(.metadata.labels.prometheus=="kube-prometheus") | .metadata.name'
kubectl logs -n monitoring deploy/prometheus-operator | grep -i 'rule\|error'
Fix now
If the rule is missing from the list, the label does not match ruleSelector. Fix the label on the PrometheusRule CR.
Alerts firing but not reaching PagerDuty/Slack.
Immediate action
Check Alertmanager routing and receiver configuration.
Commands
kubectl exec -n monitoring deploy/alertmanager -- amtool config show
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active") | .labels'
Fix now
If alerts are in Alertmanager but not routed, check the route matching (team label, severity label). Test with amtool.
Scrape duration exceeds scrape_interval.
Immediate action
Check which targets have slow scrapes.
Commands
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.lastScrapeDuration > 10) | {job: .labels.job, instance: .labels.instance, duration: .lastScrapeDuration}'
curl -s 'http://prometheus:9090/api/v1/query?query=scrape_duration_seconds' | jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[0:5]'
Fix now
If a single target is slow, it may be exposing too many metrics. Reduce cardinality or increase scrape_timeout for that job.