Advanced 21 min · March 06, 2026

Prometheus OOMKill: High-Cardinality Labels in Kubernetes

user_id label explosion: 250k series/hour, 24GB RAM, OOMKill on 16GB pod.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Last updated: May 24, 2026
Quick Answer
  • Pull model: Prometheus scrapes /metrics endpoints on targets at configured intervals.
  • Service Discovery: Queries Kubernetes API to find pods, services, nodes dynamically — no static IPs.
  • Four metric types: Counter (only goes up), Gauge (up/down), Histogram (bucketed observations), Summary (client-side quantiles).
  • Recording rules: Pre-compute expensive PromQL into cheap time series.
  • Alerting: Prometheus evaluates rules, Alertmanager routes to PagerDuty/Slack.
  • Pull model means Prometheus must reach every target. NetworkPolicy misconfigs silently break scraping.
  • High-cardinality labels (user_id, request_id) will OOM Prometheus.
  • Prefer Histogram over Summary in multi-replica deployments: Summary quantiles cannot be aggregated across instances.
✦ Definition~90s read
What is Kubernetes Monitoring with Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud, now a CNCF graduated project. It works by scraping metrics from HTTP endpoints at regular intervals, storing them in a time-series database, and evaluating rules against that data.

In Kubernetes, Prometheus is typically deployed via the Prometheus Operator, which manages the lifecycle of Prometheus instances, ServiceMonitors, and other CRDs. The core problem this article addresses is that Prometheus's in-memory data model is fundamentally limited by available RAM — when you scrape too many unique time series (high cardinality), the process runs out of memory and gets killed by the Linux OOM killer.

This is not a bug; it's a design constraint. Prometheus is optimized for reliability and simplicity, not for handling unbounded label combinations. If you need per-user or per-request detail with thousands of distinct label values, that data usually belongs in logs or traces rather than in metric labels. Horizontally scalable backends such as Thanos, Cortex, Mimir, or VictoriaMetrics raise the ceiling, but every series still costs memory somewhere, so they do not make unbounded labels free.

The article focuses on the specific Kubernetes failure mode where service discovery and label explosion from Pods, Deployments, and custom application metrics cause OOM kills, and how to diagnose and prevent them with proper label design, recording rules, and scrape configuration.

Plain-English First

Imagine your Kubernetes cluster is a busy hospital. Dozens of doctors (pods), nurses (services), and wards (namespaces) are running simultaneously. Prometheus is like the hospital's central monitoring board — it walks around every few seconds, checks each room's vitals (CPU, memory, request rates), writes them down in a giant logbook, and sounds an alarm if a patient's heart rate spikes. You don't wait for something to go wrong — the board tells you before it becomes a crisis.

Running Kubernetes in production without monitoring is like flying a commercial aircraft with the instrument panel blacked out. Everything might feel fine until it catastrophically isn't. Prometheus is used by over 84% of Kubernetes production environments — not because it's the easiest tool, but because it's the most powerful pull-based metrics system that was purpose-built for dynamic, containerized infrastructure.

The real problem Prometheus solves is the ephemeral nature of Kubernetes workloads. Traditional monitoring tools expect your target IPs to stay fixed. In Kubernetes, a pod's IP changes every restart. Prometheus solves this with Kubernetes-native service discovery — it queries the Kubernetes API server directly to find what's alive right now, not what was alive when you wrote the config.

This is not a getting-started guide. It covers scrape configurations with relabeling, custom application metrics using client libraries, recording rules to avoid query-time explosions, Alertmanager integration, and the five most expensive mistakes teams make in production.

Why Prometheus OOMKills in Kubernetes Are a Label Problem

Prometheus monitoring in Kubernetes means scraping metrics from pods, nodes, and services via a pull model. The core mechanic: Prometheus stores every unique combination of metric name and label set as a separate time series. High-cardinality labels (request IDs, user IDs, pod IPs) explode the number of series, and memory grows until the process hits its limit and gets OOMKilled. The worst-case series count for one metric is the product of its labels' distinct values: a counter with 5 methods, 20 endpoints, and 6 status codes tops out at 600 series, and adding a label with 10,000 unique values raises that ceiling to 6,000,000. The key property: Prometheus memory scales roughly linearly with the number of active series, not with the number of metric names. Prometheus is a good fit when you need reliable monitoring of Kubernetes workloads with bounded label sets. It is the wrong tool for per-request tracking or unbounded label values, which is where OOMKills happen. Real systems fail because teams treat labels as arbitrary metadata, not cardinality-sensitive keys.

Cardinality Is the Silent Killer
A single high-cardinality label (e.g., 'user_id') can cause a Prometheus OOMKill even if total metric count is low—memory grows with unique label combinations, not metric names.
Production Insight
A team added 'pod_ip' as a label to track per-pod latency, generating 500 series per deployment. With 200 pods rolling every hour, series count hit 100k and Prometheus OOMKilled within 15 minutes.
Symptom: Prometheus pod restarts repeatedly, 'kubectl top pod' shows memory climbing to limit, then crash. Grafana dashboards go blank.
Rule of thumb: Never use labels with unbounded values (IDs, IPs, timestamps). If a label has >1000 unique values per metric, redesign.
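One way to act on that rule of thumb before the application itself is fixed is to drop or cap the offending series at scrape time. The following is a minimal sketch, assuming a ServiceMonitor named myapp-cardinality-guard, a hypothetical metric myapp_user_requests_total that carries a user_id label, and an arbitrary limit of 50,000 samples per scrape; none of these values come from the incident described above.

```yaml
# Sketch only: names, the metric, and the limit are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-cardinality-guard   # hypothetical name
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  # Fail the whole scrape if a target returns more than this many samples
  # after metric relabeling. Tune it to your own baseline.
  sampleLimit: 50000
  endpoints:
    - port: metrics
      interval: 30s
      metricRelabelings:
        # Drop the entire offending metric (hypothetical name) before ingestion.
        - sourceLabels: [__name__]
          regex: myapp_user_requests_total
          action: drop
        # Alternative: remove only the label. This is safe only if the
        # remaining labels still identify every series uniquely in a scrape.
        # - regex: user_id|request_id
        #   action: labeldrop
```

Dropping at scrape time protects Prometheus, but the application still pays to generate the series. Removing the label in the instrumentation is the durable fix.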
Key Takeaway
Prometheus memory usage is proportional to the number of unique label-value combinations, not the number of metrics.
High-cardinality labels (e.g., user_id, request_id, pod_ip) are the #1 cause of OOMKills in Kubernetes.
Always cap samples per scrape (sample_limit in a scrape config, or sampleLimit on a ServiceMonitor) and monitor the active series count (prometheus_tsdb_head_series) before memory usage.
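Watching series count ahead of memory can be expressed as alert rules on Prometheus's own metrics. The sketch below assumes a budget of 2 million active series and 100,000 new series per job per hour; both thresholds, the rule names, and the selector labels are assumptions to be sized from your own baseline.

```yaml
# Sketch only: thresholds and names are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-cardinality-guard   # hypothetical name
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: prometheus_cardinality
      rules:
        # Active series in the TSDB head block; memory tracks this closely.
        - alert: PrometheusActiveSeriesHigh
          expr: prometheus_tsdb_head_series > 2000000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Prometheus active series above the assumed 2M budget"
        # Series churn: new series created by each job over the last hour.
        - alert: PrometheusSeriesChurnHigh
          expr: sum by (job) (sum_over_time(scrape_series_added[1h])) > 100000
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Job {{ $labels.job }} added over 100k new series in the last hour"
```

When one of these fires, an ad-hoc query such as topk(10, count by (__name__) ({__name__=~".+"})) usually points at the metric responsible, though it is itself expensive on a large instance.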
Figure: Flow from service discovery to alerting, with the cardinality trap. Service Discovery (Kube API discovers pods/services) → ScrapeConfig/ServiceMonitor (defines targets and label relabeling) → High-Cardinality Labels (e.g., request_id, pod IP cause explosion) → Prometheus Memory Spikes (TSDB memory grows with unique label combos) → OOMKill by Kernel (process killed, metrics lost) → Recording Rules & Alerting (aggregate before alert to reduce cardinality). Warning: high-cardinality labels cause OOMKill; avoid labels like request_id and use relabeling to drop them.

Prometheus Stack (Operator) Component Visual

The Prometheus Operator for Kubernetes introduces a set of Custom Resource Definitions (CRDs) that declaratively define the Prometheus monitoring stack. Understanding how these components fit together is essential before diving into scrape configuration.

The stack consists of five core CRDs
  • Prometheus: Defines a Prometheus statefulset, including retention, storage, and resource limits. The operator manages the Prometheus pods, config reloading, and target reconciliation.
  • Alertmanager: Defines an Alertmanager cluster with config secret, routing, and receivers. The operator creates the Alertmanager pods and manages the config.
  • ServiceMonitor: Declares how to scrape a Kubernetes service. The operator converts ServiceMonitor selectors into Prometheus scrape targets. It is the most common way to configure scraping in operator-managed setups.
  • PodMonitor: Like ServiceMonitor but scrapes individual pods directly, without an intermediate service. Useful for scraping metrics from pods that are not behind a service.
  • PrometheusRule: Contains recording and alerting rules. The operator loads them into Prometheus.

Additionally, the ScrapeConfig CRD (introduced in Prometheus Operator v0.65 under the monitoring.coreos.com/v1alpha1 API) provides a lower-level way to define scrape targets. It exposes most Prometheus scrape_config options as CRD fields, without the Service- or Pod-selector semantics that ServiceMonitor and PodMonitor impose.

A typical production deployment creates a Prometheus CR, an Alertmanager CR, and one or more ServiceMonitor or ScrapeConfig CRs, plus PrometheusRule CRs for alerts. The operator watches these CRDs and reconciles the state of Prometheus and Alertmanager instances.
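Of the CRDs listed above, PodMonitor is the one not shown elsewhere in this article. A minimal sketch, assuming a DaemonSet whose pods carry the label app: log-agent and expose a container port named metrics (both hypothetical):

```yaml
# Sketch only: the pod label, namespace, and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: log-agent-pods          # hypothetical name
  namespace: monitoring
  labels:
    app: myapp                  # must match the Prometheus CR's podMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: log-agent            # hypothetical pod label
  podMetricsEndpoints:
    - port: metrics             # name of the container port, not a number
      interval: 30s
      path: /metrics
```

The shape mirrors ServiceMonitor, except that the selector matches pods directly and the endpoint list is called podMetricsEndpoints.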

The diagram below shows the relationships: the Operator watches CRDs, creates Prometheus/Alertmanager pods, and translates ServiceMonitors into scrape configs.

prometheus-operator-components.yaml (YAML)
# Minimal Prometheus CR: the operator creates a StatefulSet from it
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      app: myapp
  resources:
    requests:
      memory: 16Gi
---
# ServiceMonitor — operator translates to scrape targets
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
Output
The Prometheus Operator creates a Prometheus StatefulSet with the given resource requests, discovers ServiceMonitors with matching labels, and automatically adds their scrape targets to the generated Prometheus configuration.
Label Matching is Everything
  • Prometheus CR specifies selectors for ServiceMonitors and PrometheusRules.
  • Each ServiceMonitor must have labels matching serviceMonitorSelector.
  • Each PrometheusRule must have labels matching ruleSelector.
  • Use --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false in Helm to avoid default label restrictions.
Production Insight
When deploying the Prometheus Operator via Helm (kube-prometheus-stack), the default values set serviceMonitorSelectorNilUsesHelmValues: true. This means the Operator only picks up ServiceMonitors with the Helm release label. If you create a ServiceMonitor manually using kubectl apply, it won't be scraped unless you also add the label release: <name>. This catches many teams off guard. Set serviceMonitorSelectorNilUsesHelmValues: false to allow unlabeled ServiceMonitors, or always include the release label.
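In a values file, the setting described above might look like the following sketch. It assumes the kube-prometheus-stack chart and a release named kube-prometheus-stack; the sibling flags for PodMonitors and rules are included on the assumption that you want the same behavior for them.

```yaml
# values.yaml for the kube-prometheus-stack chart (sketch)
prometheus:
  prometheusSpec:
    # false = do not restrict discovery to objects carrying the Helm release label
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false

# Alternative: keep the chart defaults and add this label to every
# ServiceMonitor you create by hand (release name is an assumption):
#   metadata:
#     labels:
#       release: kube-prometheus-stack
```

Either approach works; the values-file route avoids having to remember the label on every manually applied object.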
Key Takeaway
The Prometheus Operator stack: Prometheus CR (instance), ServiceMonitor (scrape target definition), PrometheusRule (alert/recording rules), Alertmanager (notification routing). Label matching between CRDs is the most frequent misconfiguration. Understand the component relationships before writing any YAML.
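Figure: Prometheus Operator stack components. The Prometheus Operator creates the Prometheus StatefulSet and the Alertmanager StatefulSet, and watches the ServiceMonitor, PodMonitor, and PrometheusRule CRDs. A ServiceMonitor selects a Service, which routes to the pod target that Prometheus scrapes. Prometheus loads rules from PrometheusRule objects and fires alerts to Alertmanager, which routes them to Slack, PagerDuty, or email.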

How Prometheus Service Discovery Works Inside Kubernetes

Prometheus uses a pull model — it reaches out to targets and scrapes metrics endpoints, typically on path /metrics, at a configured interval. In a static world you'd list IPs. In Kubernetes, Prometheus uses kubernetes_sd_configs to query the Kubernetes API and discover pods, services, endpoints, nodes, and ingresses dynamically.

When Prometheus starts, it authenticates to the API server using a ServiceAccount token mounted in its pod. It then watches specific resource types. For the endpoints role, Prometheus discovers every Endpoints object across the cluster. For each endpoint address it finds, it creates a scrape target. The magic happens during relabeling — a pipeline that runs before the scrape and lets you filter, rename, and attach labels using values pulled directly from Kubernetes metadata (pod annotations, namespace labels, service names).

The annotation prometheus.io/scrape: 'true' is a community convention that relabeling configs check; Prometheus itself attaches no meaning to it. If the annotation exists and is true, the pod is kept as a target. This means enabling monitoring for a new application is as simple as adding three lines to its pod spec, with no Prometheus config reload needed. Service discovery watches the Kubernetes API, so new pods appear as targets shortly after they are created, independent of scrape_interval.
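The three lines in question are annotations on the pod template. A minimal sketch, assuming a Deployment named payment-service that serves metrics on port 9091; the image reference is a placeholder:

```yaml
# Sketch only: the Deployment name, image, and port are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
      annotations:
        # Annotation values must be strings, so quote booleans and numbers.
        prometheus.io/scrape: "true"
        prometheus.io/port: "9091"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:1.0.0   # placeholder image
          ports:
            - name: metrics
              containerPort: 9091
```

The annotations go on the pod template, not on the Deployment's own metadata, because pod discovery reads them from the pods.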

Understanding the target lifecycle is critical for production. Targets move through three health states: unknown, up, and down. A target is unknown from the moment it is discovered until its first scrape completes. It is up when the most recent scrape succeeded, and down when the scrape failed for any reason: connection refused, timeout, or a non-2xx HTTP status. When a target disappears or a scrape fails, Prometheus writes staleness markers for its series, which prevents old time series from polluting range queries.
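The target states above map onto the built-in up series, which makes scrape health alertable. The rules below are a sketch that assumes the kubernetes-pods job and the kubernetes_pod_name and kubernetes_namespace labels from the scrape configuration in this section; durations and severities are placeholders.

```yaml
# Sketch only: group name, durations, and severities are assumptions.
groups:
  - name: scrape_health
    rules:
      # A discovered target whose scrapes keep failing.
      - alert: TargetDown
        expr: up{job="kubernetes-pods"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scrape failing for {{ $labels.kubernetes_pod_name }} in {{ $labels.kubernetes_namespace }}"
      # No targets discovered at all: up == 0 cannot fire because no series exists.
      - alert: NoTargetsDiscovered
        expr: absent(up{job="kubernetes-pods"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Job kubernetes-pods has no targets"
```

The second rule covers the case the first cannot: a broken selector or missing annotation leaves no up series to compare against.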

prometheus-kubernetes-sd-config.yaml (YAML)
# This is the core Prometheus scrape configuration for Kubernetes pod discovery.
# It lives inside your prometheus.yml (with the operator, supply it via additionalScrapeConfigs or a ScrapeConfig CR).

scrape_configs:
  - job_name: 'kubernetes-pods'
    # Prometheus will query the Kubernetes API server to find all pods.
    kubernetes_sd_configs:
      - role: pod
        # Restrict discovery to a specific namespace for security isolation.
        namespaces:
          names:
            - production
            - staging

    # relabel_configs runs BEFORE each scrape — it filters and transforms targets.
    relabel_configs:
      # STEP 1: Only scrape pods that explicitly opt in via annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

      # STEP 2: Allow pods to declare a custom metrics path.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # STEP 3: Allow pods to declare a custom port for scraping.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__

      # STEP 4: Carry namespace as a label.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace

      # STEP 5: Carry the pod name as a label.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

      # STEP 6: Carry the app label from the pod.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

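      # STEP 7: Dropping hostNetwork pods by comparing the host IP with the
      # target address would need a regex backreference (\1). Prometheus uses
      # RE2, which has no backreferences, so such a rule fails config validation.
      # Exclude those pods by namespace or pod label instead.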
Output
# Prometheus /targets page shows discovered targets with labels applied.
The __address__ Rewrite Trap
  • Multi-container pods expose multiple ports. With role: pod, Prometheus creates one target per declared container port, so a single pod can appear several times.
  • The __address__ label determines where Prometheus connects. Relabeling rewrites it.
  • Without prometheus.io/port (or a filter on the port name), every declared port is scraped, including ports that serve no metrics.
  • Sidecar containers (Istio, Envoy) often expose ports that are not your metrics port.
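A filter on the port name is one way to avoid the multi-port trap without relying on annotations. The sketch below assumes your containers name their metrics port metrics; the job name is hypothetical.

```yaml
# Sketch only: assumes containers name their metrics port "metrics".
scrape_configs:
  - job_name: 'kubernetes-pods-named-port'   # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Opt-in annotation, as in the main config.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # role: pod yields one target per declared container port.
      # Keep only the target whose port is named "metrics".
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: 'metrics'
      # Record which container the metrics came from (useful with sidecars).
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
```

Sidecar ports are then discarded at discovery time rather than showing up as permanently failing targets.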
Production Insight
NetworkPolicies are the silent killer of Prometheus scraping. If you deploy a NetworkPolicy that restricts ingress to your pod, and Prometheus is not in the allowed source namespace/IP range, scraping fails without any application-side signal. The target shows as down on the /targets page, usually with nothing more specific than a timeout error such as context deadline exceeded. Always include Prometheus's namespace in your NetworkPolicy ingress rules. Use kubectl exec from the Prometheus pod to test connectivity to the target before blaming the scrape config.
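An ingress rule that admits Prometheus might look like the sketch below. It assumes Prometheus runs in a namespace called monitoring, that its pods carry the label app.kubernetes.io/name: prometheus, and that the application serves metrics on port 9091; check all three against your cluster.

```yaml
# Sketch only: namespace names, pod labels, and the port are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service          # pods this policy applies to
  policyTypes:
    - Ingress
  ingress:
    - from:
        # kubernetes.io/metadata.name is set automatically on namespaces
        # by recent Kubernetes versions.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
          # Same list item as namespaceSelector, so both must match (AND).
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - protocol: TCP
          port: 9091                # metrics port
```

NetworkPolicies are additive, so this policy sits alongside whatever policy admits normal application traffic; on its own it allows only the scrape.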
Key Takeaway
Prometheus service discovery is annotation-driven and relabeling-configured. The most common production failures are: missing annotations, NetworkPolicy blocking scrapes, and multi-container port confusion. Always set prometheus.io/scrape, prometheus.io/port, and prometheus.io/path explicitly.

ServiceMonitor vs ScrapeConfig Decision Guide

When using the Prometheus Operator, you have two primary ways to define scrape targets: ServiceMonitor (and its cousin PodMonitor) and the newer ScrapeConfig CRD (available since Prometheus Operator v0.65). Understanding which to use is critical for production architecture.

ServiceMonitor is the original CRD. It is designed to scrape a Kubernetes Service. You specify a service selector (by labels), and the Operator automatically discovers all endpoints behind that service. It adds semantic constraints: the service must expose one or more ports, and you can optionally specify a path, interval, and metric relabeling. ServiceMonitor is high-level: the Operator handles converting service endpoints to individual pod IP addresses.

ScrapeConfig is a lower-level CRD (API version monitoring.coreos.com/v1alpha1). It maps closely onto the Prometheus scrape_config block, but as typed, camelCase CRD fields (kubernetesSDConfigs, staticConfigs, relabelings, metricRelabelings, and so on) rather than raw Prometheus YAML. There are no implicit service-based semantics. This gives you more control but requires more expertise.

The decision tree below helps pick the right one.

Use ServiceMonitor when
  • You want to scrape a standard Kubernetes Service (preferred for 90% of cases).
  • You need simplicity: just select the service by label and define a port.
  • You want the Operator to dynamically follow pods as they scale or roll.
  • You are monitoring a typical application pod behind a Service.
Use ScrapeConfig when
  • You need full control over relabeling, including non-standard discovery (e.g., scraping a non-Kubernetes endpoint, or a static target).
  • You need to scrape a target that is not behind a Service (e.g., a DaemonSet pod that you want to scrape by node).
  • You want to use the full power of kubernetes_sd_configs roles beyond endpoints (e.g., node, ingress).
  • You are migrating from a raw Prometheus configuration and want a CRD that maps almost one-to-one onto your existing scrape configs.

A common production pattern: use ServiceMonitor for all user-facing services (HTTP APIs, gRPC), and ScrapeConfig for infrastructure components (node-exporter, kube-state-metrics, or custom exporters that expose metrics on non-standard ports).

servicemonitor-vs-scrapeconfig.yaml (YAML)
# ServiceMonitor — simple, operator-managed
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-service-monitor
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    metricRelabelings:
    - sourceLabels: [__name__]
      regex: '.*_total'
      action: keep
---
# ScrapeConfig: lower-level, close to raw Prometheus config.
# Field names follow the v1alpha1 CRD; check them against your operator version.
# The Prometheus CR must select this object via scrapeConfigSelector.
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: myapp-scrape-config
  labels:
    app: myapp
spec:
  jobName: myapp-scrape
  kubernetesSDConfigs:
  - role: Pod
    namespaces:
      names: [production]
  relabelings:
  - sourceLabels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: 'true'
  - sourceLabels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: '([^:]+)(?::\d+)?;(\d+)'
    replacement: '$1:$2'
    targetLabel: __address__
  metricRelabelings:
  - sourceLabels: [__name__]
    regex: '.*_total'
    action: keep
Output
ServiceMonitor targets the Service. ScrapeConfig uses pod-level discovery with custom relabeling.
ScrapeConfig replaces the need for raw prometheus.yaml
  • ServiceMonitor: high-level, service-oriented, Operator abstracts pod IPs.
  • ScrapeConfig: low-level, exposes most scrape_config options as CRD fields, full control over discovery.
  • Use ServiceMonitor for typical HTTP services on standard ports.
  • Use ScrapeConfig for node-level scraping, non-Kubernetes targets, or complex relabeling.
  • ScrapeConfig supports all kubernetes_sd_config roles: pod, service, endpoints, node, ingress.
Production Insight
In large clusters with hundreds of services, ServiceMonitor label selection should be precise to avoid scraping unintended endpoints. Always use matchLabels with a specific app label, not broad selectors that could match system services. ScrapeConfig on the other hand lets you define exactly which kubernetes_sd_config role to use and filter with relabel_configs. A common pattern is to use ServiceMonitor for business applications and ScrapeConfig for infrastructure components like node-exporter, cAdvisor, or custom DaemonSet-based exporters.
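A precise selector combines a namespace restriction with more than one label. The sketch below assumes the Service carries the common app.kubernetes.io/name and app.kubernetes.io/component labels with these hypothetical values:

```yaml
# Sketch only: label keys, values, and namespace names are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-service
  namespace: monitoring
  labels:
    app: myapp                      # matched by the Prometheus CR's serviceMonitorSelector
spec:
  # Only look for Services in these namespaces instead of the whole cluster.
  namespaceSelector:
    matchNames:
      - production
  # Both labels must match, which keeps system Services out.
  selector:
    matchLabels:
      app.kubernetes.io/name: payment-service
      app.kubernetes.io/component: api
  endpoints:
    - port: metrics
      interval: 30s
```

Narrow selectors also bound cardinality indirectly: every unintended endpoint scraped is a full set of extra series.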
Key Takeaway
ServiceMonitor is the default choice for scraping services; ScrapeConfig exposes Prometheus scrape_config options more directly. Use ScrapeConfig when you need non-service-based discovery or target definitions that ServiceMonitor cannot express. Both CRDs are reconciled by the Prometheus Operator.
Figure: ServiceMonitor vs ScrapeConfig decision flow. Decision nodes: "Need to scrape a target in Kubernetes?", "Is the target behind a Kubernetes Service?", and "Need full control over relabeling (e.g., scrape endpoints by node, or non-standard discovery)?". Outcomes: use ServiceMonitor when it is sufficient; use ScrapeConfig when you need full control.

Exposing Custom Application Metrics with the Prometheus Client Libraries

Kubernetes infrastructure metrics come from three places: container CPU, memory, and network usage from cAdvisor (exposed by the kubelet), node-level metrics from node-exporter, and object state from kube-state-metrics. But the metrics that make or break your SLOs are application-level: request latency, error rates, queue depth, cache hit ratio. These come from instrumenting your own code.

Prometheus has four core metric types you need to understand at the semantic level, not just the API level:

Counter — a value that only goes up (resets to zero on restart). Use it for total requests, total errors, total bytes sent. Never use a counter for something that can decrease. PromQL's rate() and increase() functions unwrap counters properly, handling resets.

Gauge — a value that can go up or down. Use it for current queue depth, active connections, temperature, memory usage. Don't use rate() on a gauge — it's meaningless.

Histogram: pre-aggregated bucketed observations. Use it for latency and request size. It exposes one _bucket series per bucket boundary (plus +Inf), along with _sum and _count, so every label combination costs (number of buckets + 3) series. The bucket boundaries are fixed in code at instrumentation time: changing them requires a redeploy and makes quantiles computed across the change inconsistent.

Summary: client-side computed quantiles. Use it only when you need accurate quantiles for a single instance and never need to combine instances, because pre-computed quantiles cannot be meaningfully aggregated in PromQL. In Kubernetes with multiple replicas, histograms are almost always the right choice over summaries.

instrumented_http_server.go (Go)
package main

import (
	"math/rand"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "payment_service_http_requests_total",
			Help: "Total number of HTTP requests processed by the payment service.",
		},
		[]string{"method", "endpoint", "status_code"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "payment_service_http_request_duration_seconds",
			Help:    "HTTP request duration in seconds, bucketed by endpoint.",
			Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5},
		},
		[]string{"method", "endpoint"},
	)

	inFlightRequests = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_in_flight_requests",
			Help: "Number of HTTP requests currently being processed.",
		},
	)

	paymentQueueDepth = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_queue_depth",
			Help: "Current number of payments waiting in the processing queue.",
		},
	)
)

// instrumentedHandler wraps a handler and records in-flight count, latency,
// and a request counter. The endpoint label is a fixed route string, never
// the raw URL path, so its cardinality stays bounded.
func instrumentedHandler(endpoint string, handlerFunc http.HandlerFunc) http.HandlerFunc {
	return func(responseWriter http.ResponseWriter, request *http.Request) {
		inFlightRequests.Inc()
		defer inFlightRequests.Dec()
		startTime := time.Now()
		wrappedWriter := &statusCapturingWriter{ResponseWriter: responseWriter, statusCode: http.StatusOK}
		handlerFunc(wrappedWriter, request)
		durationSeconds := time.Since(startTime).Seconds()
		httpRequestDuration.WithLabelValues(request.Method, endpoint).Observe(durationSeconds)
		httpRequestsTotal.WithLabelValues(
			request.Method,
			endpoint,
			strconv.Itoa(wrappedWriter.statusCode),
		).Inc()
	}
}

// statusCapturingWriter remembers the status code written by the handler.
type statusCapturingWriter struct {
	http.ResponseWriter
	statusCode int
}

func (scw *statusCapturingWriter) WriteHeader(code int) {
	scw.statusCode = code
	scw.ResponseWriter.WriteHeader(code)
}

// processPayment simulates 5-299ms of work and a 2% upstream failure rate.
func processPayment(responseWriter http.ResponseWriter, request *http.Request) {
	processingTime := time.Duration(5+rand.Intn(295)) * time.Millisecond
	time.Sleep(processingTime)
	if rand.Float64() < 0.02 {
		http.Error(responseWriter, "upstream payment gateway timeout", http.StatusGatewayTimeout)
		return
	}
	responseWriter.WriteHeader(http.StatusOK)
	responseWriter.Write([]byte(`{"status":"processed"}`))
}

func main() {
	// Simulate a fluctuating queue depth for the gauge.
	go func() {
		for {
			paymentQueueDepth.Set(float64(rand.Intn(500)))
			time.Sleep(5 * time.Second)
		}
	}()
	http.HandleFunc("/api/v1/payments", instrumentedHandler("/api/v1/payments", processPayment))
	http.Handle("/metrics", promhttp.Handler())
	// Both listeners share the default mux, so /metrics is reachable on 9091 and 8080.
	go http.ListenAndServe(":9091", nil)
	http.ListenAndServe(":8080", nil)
}
Output
# Prometheus scrapes http://payment-service-pod:9091/metrics and receives three of the four metric types (Counter, Histogram, Gauge).

Production-Grade Recording Rules and Alerting That Won't Page You at 3am

Raw PromQL queries against high-cardinality data are expensive. A query like rate(http_requests_total[5m]) across 200 pods runs every time a dashboard loads. In large clusters, this causes Prometheus to churn through millions of samples per query, leading to query timeouts and the dreaded 'query timed out in expression evaluation' error.

Recording rules solve this by pre-computing expensive expressions and storing the result as a new time series. Prometheus evaluates recording rules on its evaluation interval (typically 1m), writes the result into its TSDB, and future queries read that cheap pre-computed series instead of re-scanning the raw data.

Naming matters. The Prometheus community convention for recording rule names is level:metric:operations. For example job:http_requests_total:rate5m means: aggregated at the job level, derived from http_requests_total, computed as a 5-minute rate. Sticking to this convention makes rules self-documenting and searchable.

Alerts in Prometheus are defined in the same YAML format as recording rules. The critical production insight is that alerts should express SLO burn rates, not raw thresholds. An alert that fires when error rate > 1% will fire constantly during minor blips. An alert based on a multi-window burn rate (Google's SRE model) only fires when you're burning through your error budget fast enough to exhaust it within a prediction window — dramatically reducing noise.
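A multi-window burn-rate alert pairs a long window with a short one. The sketch below assumes a 99.9% availability SLO over 30 days (error budget 0.001) and the payment_service_http_requests_total counter used in this article; the rule and alert names are hypothetical.

```yaml
# Sketch only: assumes a 99.9% SLO and the counter defined earlier in the article.
groups:
  - name: payment_service_burn_rate
    rules:
      - record: job:payment_service_error_ratio:rate5m
        expr: |
          sum by (job) (rate(payment_service_http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(payment_service_http_requests_total[5m]))
      - record: job:payment_service_error_ratio:rate1h
        expr: |
          sum by (job) (rate(payment_service_http_requests_total{status_code=~"5.."}[1h]))
          /
          sum by (job) (rate(payment_service_http_requests_total[1h]))
      # Page only when BOTH windows burn at 14.4x the budget.
      # The 1h window shows the burn is sustained; the 5m window shows it is
      # still happening, so the alert clears quickly after recovery.
      - alert: PaymentServiceErrorBudgetFastBurn
        expr: |
          job:payment_service_error_ratio:rate1h > (14.4 * 0.001)
          and
          job:payment_service_error_ratio:rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payment service is burning its error budget at 14.4x"
```

The factor 14.4 is not arbitrary: sustained for one hour it consumes 14.4 × 1h / 720h = 2% of a 30-day budget, which is the usual paging threshold in the SRE burn-rate model.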

payment-service-rules.yaml (YAML)
# PrometheusRule custom resource — picked up automatically by the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-slo-rules
  namespace: production
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: payment_service_recording_rules
      interval: 1m
      rules:
        - record: job_endpoint:payment_service_http_requests_total:rate5m
          expr: |
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total[5m])
            )

        - record: job_endpoint:payment_service_error_ratio:rate5m
          expr: |
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total{status_code=~"5.."}[5m])
            )
            /
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total[5m])
            )

        - record: job_endpoint:payment_service_latency_p99:rate5m
          expr: |
            histogram_quantile(
              0.99,
              sum by (job, endpoint, le) (
                rate(payment_service_http_request_duration_seconds_bucket[5m])
              )
            )

    - name: payment_service_alerts
      rules:
        - alert: PaymentServiceHighErrorBurnRate
          # 14.4x burn rate against a 0.1% error budget (99.9% SLO).
          expr: |
            job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment service burning error budget at critical rate"
            runbook_url: https://wiki.company.com/runbooks/payment-service-errors
            description: |
              Endpoint {{ $labels.endpoint }} error ratio is {{ $value | humanizePercentage }}.

        - alert: PaymentServiceHighLatency
          expr: |
            job_endpoint:payment_service_latency_p99:rate5m > 0.5
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment service p99 latency exceeds SLO threshold"

        - alert: PaymentQueueConsumerDead
          # on() is required: the right-hand side has no labels after sum().
          expr: |
            payment_service_queue_depth == 0
            and on()
            sum(job_endpoint:payment_service_http_requests_total:rate5m) > 10
          for: 3m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment queue depth is zero but traffic is flowing"
Output
# PrometheusRule applied. Operator picks it up via label matching.

Alertmanager: Routing, Silencing, and Deduplication

Prometheus evaluates alert rules and sends firing alerts to Alertmanager. Alertmanager is a separate component responsible for deduplicating, grouping, routing, and silencing alerts before sending them to notification channels (PagerDuty, Slack, email, Opsgenie).

Alertmanager's routing tree is a hierarchical matching system. An alert's labels are matched against the route configuration top-down, and the first matching sibling route determines the receiver unless that route sets continue: true. This means your alert labels (severity, team, service) and the order of your routes must be designed together.
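The effect of ordering and continue is easiest to see in a small route tree. This sketch reuses the receiver names from the configuration in this section and assumes an Alertmanager version that supports the matchers syntax; the receivers themselves are omitted.

```yaml
# Sketch only: a route fragment; receivers must be defined elsewhere.
route:
  receiver: 'slack-catchall'
  routes:
    # Team route first, with continue so evaluation carries on to the
    # severity routes below and the alert is delivered to both.
    - matchers:
        - team="payments"
      receiver: 'slack-payments-team'
      continue: true
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'
```

Without continue: true, a critical payments alert would stop at the first route and never reach PagerDuty.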

alertmanager-config.yaml (YAML)
# Alertmanager configuration — deployed as a Secret in the monitoring namespace.
# The Prometheus Operator picks this up from the alertmanager.yaml key.
global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxx'

route:
  # Default receiver for alerts that don't match any sub-route.
  receiver: 'slack-catchall'
  group_by: ['alertname', 'namespace', 'endpoint']
  group_wait: 30s         # Wait 30s to group similar alerts.
  group_interval: 5m      # Send grouped updates every 5m.
  repeat_interval: 4h     # Re-send unresolved alerts every 4h.

  routes:
    # Critical alerts -> PagerDuty (page the on-call engineer).
    # continue: true keeps evaluating later sibling routes; without it the
    # first match wins and the team route below is never reached.
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s       # Page faster for critical alerts.
      repeat_interval: 1h   # Re-page every hour if unresolved.
      continue: true

    # Warning alerts -> Slack channel (no page, just notification).
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 1m
      repeat_interval: 4h
      continue: true

    # Team-specific routing.
    - match:
        team: payments
      receiver: 'slack-payments-team'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
        severity: 'critical'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
        send_resolved: true

  - name: 'slack-payments-team'
    slack_configs:
      - channel: '#payments-alerts'
        title: '[Payments] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'

  - name: 'slack-catchall'
    slack_configs:
      - channel: '#alerts-catchall'
        title: 'Unrouted Alert: {{ .GroupLabels.alertname }}'

# Inhibition: suppress warning alerts when a critical alert is already firing
# for the same service. Prevents alert storms.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'namespace']
Output
Alertmanager configured with PagerDuty for critical, Slack for warnings, and inhibition rules.
Alertmanager's Five Core Functions
  • Deduplication: Same alert fingerprint = one notification.
  • Grouping: group_by determines which alerts are batched together.
  • Inhibition: Higher-severity alerts suppress lower-severity alerts for the same context.
  • Silences: Temporary muting of alerts during maintenance windows.
  • Routing: Label matching determines which receiver (PagerDuty, Slack, email) gets the alert.
Production Insight
The group_by field in Alertmanager controls notification batching. If you group by alertname only, all pods with the same alert are grouped into one notification, which is good for reducing noise. If you group by alertname and pod, each pod gets its own notification, which is bad during a cluster-wide outage where 200 pods trigger the same alert. A sensible production default is group_by: ['alertname', 'namespace'], which batches by alert and namespace. Use group_wait: 30s to allow grouping before sending.
Key Takeaway
Alertmanager is the routing and deduplication layer between Prometheus and your notification channels. Inhibition rules prevent alert storms, and group_by controls notification batching. Always configure inhibition rules to suppress lower-severity alerts when higher-severity alerts are already firing.
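Alertmanager 0.22 and later deprecate match, source_match, and target_match in favour of matchers, source_matchers, and target_matchers. As a sketch only, the critical route and the inhibition rule from the configuration above could be expressed in the newer syntax like this. The receiver names are carried over from that configuration, and the rest of the file is assumed to be unchanged.

```yaml
route:
  receiver: 'slack-catchall'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      # Optional: keep evaluating later sibling routes after this match.
      continue: true

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # Inhibit only when both alerts share these label values.
    equal: ['alertname', 'namespace']
```

The matcher strings also accept !=, =~, and !~, which the older match form could only express through the separate match_re field.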

PromQL Common Query Cheat Sheet

PromQL is the query language for Prometheus. Whether you're building Grafana dashboards, writing alerting rules, or debugging a production issue, having a mental library of common PromQL patterns saves hours. Below is a cheat sheet of the most useful queries for Kubernetes monitoring.

Infrastructure Metrics
  • CPU usage per pod: rate(container_cpu_usage_seconds_total{container!="POD", image!=""}[5m])
  • Memory usage per pod: container_memory_working_set_bytes{container!="POD", image!=""}
  • Network receive bytes per pod: rate(container_network_receive_bytes_total[5m])
  • Disk reads per pod: rate(container_fs_reads_bytes_total[5m])

Application Metrics (assuming custom metrics like payment_service_http_requests_total)
  • Request rate per second: rate(payment_service_http_requests_total[5m])
  • Error rate per second: rate(payment_service_http_requests_total{status_code=~"5.."}[5m])
  • Error ratio (a fraction between 0 and 1): sum(rate(payment_service_http_requests_total{status_code=~"5.."}[5m])) / sum(rate(payment_service_http_requests_total[5m]))
  • p99 latency: histogram_quantile(0.99, sum(rate(payment_service_http_request_duration_seconds_bucket[5m])) by (le, endpoint))
  • Average latency: rate(payment_service_http_request_duration_seconds_sum[5m]) / rate(payment_service_http_request_duration_seconds_count[5m])

Prometheus Self-Monitoring
  • Active time series: prometheus_tsdb_head_series
  • Memory usage: process_resident_memory_bytes
  • Scrape duration per target: scrape_duration_seconds
  • Target health: up (1 if the last scrape succeeded, 0 if it failed)

Alerting Patterns
  • Pager-worthy: job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001) (14.4 x 0.1% = 1.44% error ratio over 5m; at that burn rate a 30-day error budget is gone in about 50 hours)
  • No data: absent(prometheus_tsdb_head_series) fires when the series is missing. Evaluate it from a second Prometheus or an external watchdog, because a Prometheus that is down cannot evaluate its own rules.
  • Process restart detection: changes(process_start_time_seconds[1h]) > 0

Cardinality Detection
  • Top 10 metric names by series count: topk(10, count by (__name__)({__name__=~".+"}))
  • Series count for a specific metric: count(payment_service_http_requests_total)

promql-cheat-sheet.promqlPROMQL
# ---- Infrastructure Metrics ----

# CPU usage rate (5m window) per pod
rate(container_cpu_usage_seconds_total{container!="POD", image!=""}[5m])

# Memory working set per pod (Gauge, no rate)
container_memory_working_set_bytes{container!="POD", image!=""}

# Network bytes received per second per pod
rate(container_network_receive_bytes_total[5m])

# ---- Application Request Metrics ----

# Total request rate
sum(rate(payment_service_http_requests_total[5m]))

# Error ratio (fraction of requests returning 5xx)
sum(rate(payment_service_http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(payment_service_http_requests_total[5m]))

# p99 latency (uses histogram_quantile)
histogram_quantile(0.99, sum(rate(payment_service_http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# ---- Prometheus Self-Monitoring ----

# Active time series count
prometheus_tsdb_head_series

# Prometheus process memory
process_resident_memory_bytes

# ---- Cardinality Analysis ----

# Top 10 metric names by number of time series
topk(10, count by (__name__)({__name__=~".+"}))

# ---- Alerting Patterns ----

# High error burn rate (SLO-based)
job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001)

# Detect a process restart in the last hour (start time changed)
changes(process_start_time_seconds[1h]) > 0

# Absence of data (evaluate from a second Prometheus; a down instance cannot run its own rules)
absent(prometheus_tsdb_head_series)
Output
These PromQL queries can be executed directly in the Prometheus UI expression browser, in Grafana panels, or in alert rules.
Always Use Recording Rules for Repeated Queries
The error ratio query above is expensive: it scans all error series and total series, then divides. If you use this query in five dashboards and one alert, you're executing it six times every evaluation cycle. Create a recording rule job_endpoint:payment_service_error_ratio:rate5m and query that instead.
Every PromQL query you write should be a candidate for a recording rule if it appears in more than one place.
  • Recording rules pre-compute expensive queries into cheap time series.
  • Use rate() for counters, not raw values (counter resets break averages).
  • Use histogram_quantile() with sum by (le, ...) for aggregated percentiles.
  • Avoid unfiltered selectors in production; always filter by at least one label.
  • Check query performance with the Prometheus UI's query analysis (explain button).
Production Insight
The most common PromQL mistake is using rate() on a gauge: it produces meaningless results because gauges can go down. Another frequent error is forgetting to sum by the correct labels when using histogram_quantile(). You must include le (bucket upper bound) in the by clause, otherwise the bucket boundaries are aggregated away and the query returns no usable result. When debugging high memory, the topk(10, count by (__name__)({__name__=~".+"})) query quickly identifies which metric name has the most time series; often the culprit is a high-cardinality label on a single metric.
Key Takeaway
Master these core PromQL patterns: rate (counters), gauge (raw values), histogram_quantile (latency percentiles), topk (cardinality detection). Use recording rules to optimize expensive queries. Always filter by at least one label to avoid scanning the entire TSDB.

Prometheus Storage: TSDB Internals, Retention, and Thanos/Cortex for Long-Term

Prometheus stores metrics in its own time-series database (TSDB). Understanding TSDB internals is critical for capacity planning, retention tuning, and deciding when to add long-term storage.

TSDB stores data in blocks. Each block covers a 2-hour time range and contains a chunks directory (compressed metric samples) and an index. The head block holds the most recent samples in memory and is backed by a write-ahead log (WAL) on disk. Every 2 hours, the head block is compacted into a persistent block and flushed to disk. Old blocks are compacted into larger blocks (e.g., 2h blocks into 1-day blocks) to reduce the number of files.

prometheus-retention-config.yamlYAML
# Prometheus custom resource with retention and storage configuration.
# For the kube-prometheus-stack Helm chart, these go in prometheus.prometheusSpec.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  # Retention: how long to keep data locally.
  # 15d is typical. Longer retention = more disk and memory.
  retention: 15d

  # Retention by size: delete oldest blocks when storage exceeds this.
  # Use this as a safety net alongside time-based retention.
  retentionSize: 50GB

  # Storage: PVC for persistent TSDB blocks.
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi

  # Resources: Prometheus is memory-hungry.
  # Rule of thumb: ~16KB per active time series.
  # 1M series = 16GB RAM. Plan accordingly.
  resources:
    requests:
      cpu: '1'
      memory: 16Gi
    limits:
      cpu: '4'
      memory: 32Gi

  # External labels: applied to all metrics when using Thanos/Cortex.
  # Identifies which Prometheus instance scraped the data.
  externalLabels:
    cluster: production-us-east-1
    environment: production

  # Thanos sidecar: uploads blocks to object storage for long-term retention.
  thanos:
    objectStorageConfig:
      name: thanos-objstore-config
      key: objstore.yml

  # Sample limit: cap samples per scrape as a safety net against
  # high-cardinality targets (example value; tune for your workloads).
  enforcedSampleLimit: 50000

  # Helm-only values (kube-prometheus-stack prometheus.prometheusSpec), not
  # fields of the Prometheus CRD. Set them in values.yaml, not here:
  # serviceMonitorSelectorNilUsesHelmValues: false
  # podMonitorSelectorNilUsesHelmValues: false
Output
Prometheus configured with 15-day retention, 50GB size limit, fast SSD storage, and Thanos sidecar for long-term block upload to object storage.
When to Add Thanos or Cortex
Prometheus is designed for short-term, per-cluster monitoring. It has no native global query view across clusters, cannot store data longer than a few weeks efficiently, and is a single point of failure. Thanos and Cortex solve these problems.
Add Thanos when you need: cross-cluster query federation, retention beyond 30 days, or HA for Prometheus.
  • Prometheus: Single cluster, short-term (days to weeks). No native HA or global query view.
  • Thanos: Sidecar uploads blocks to S3/GCS. Querier federates across Prometheus instances. Compactor reduces storage costs. Best for multi-cluster with object storage.
  • Cortex: Horizontally scalable multi-tenant Prometheus backend. Used by Grafana Cloud. Best for multi-tenant SaaS platforms.
  • VictoriaMetrics: Drop-in Prometheus replacement with better compression and lower resource usage. Best for single-cluster with high cardinality.
  • Decision: Use Thanos for multi-cluster with object storage. Use Cortex for multi-tenant SaaS. Use VictoriaMetrics for single-cluster resource optimization.
Production Insight
Prometheus's memory usage is directly proportional to the number of active time series in the TSDB head block. Each active series consumes approximately 16KB of memory. If you have 2 million active series, you need approximately 32GB of RAM for the head block alone, plus overhead for queries and compaction. Monitor prometheus_tsdb_head_series and process_resident_memory_bytes. Set retention based on disk capacity. The Prometheus documentation estimates disk usage as retention time in seconds x ingested samples per second x 1 to 2 bytes per sample, so 15 days at 1M series with a 15s scrape interval comes to roughly 90 to 170GB. Use fast SSDs for TSDB; network-attached storage introduces latency that slows compaction and can cause WAL corruption during power loss.
Prometheus Storage Decision Tree
  • Single cluster, retention under 15 days, under 5M active series: Standalone Prometheus with local TSDB and PVC on fast SSD. No additional components needed.
  • Multiple clusters, need cross-cluster query federation: Deploy Thanos Sidecar on each Prometheus. Add Thanos Querier for global view. Add Thanos Store Gateway for historical queries.
  • Retention beyond 30 days or more than 5M active series: Add Thanos Sidecar with object storage (S3/GCS). Add Thanos Compactor to reduce storage costs. Keep local retention short (7d) and rely on object storage for long-term.
  • Multi-tenant SaaS platform with per-customer isolation: Use Cortex or Grafana Mimir for horizontal scalability and per-tenant resource limits.
  • Single cluster with extreme cardinality (10M+ series): Consider VictoriaMetrics as a drop-in replacement. It offers better compression (10x) and lower memory usage than Prometheus TSDB.
Key Takeaway
Prometheus TSDB is a block-based storage engine with an in-memory head block. Memory is proportional to active series count. Use fast SSDs for TSDB. Add Thanos for cross-cluster federation and long-term retention beyond local disk capacity. Plan capacity based on series count: 1M series equals approximately 16GB RAM and roughly 90 to 170GB of disk for 15 days at a 15s scrape interval.
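
The Thanos sidecar reads its bucket settings from the objstore.yml key of the thanos-objstore-config Secret. A minimal sketch of that file for S3 might look like the following. The bucket name, endpoint, and region are placeholders, and credentials are left out on the assumption that the pod gets them from a cloud IAM role.

```yaml
# objstore.yml: stored under the key objstore.yml in the
# thanos-objstore-config Secret referenced by the Prometheus resource.
type: S3
config:
  bucket: "example-thanos-metrics"          # hypothetical bucket name
  endpoint: "s3.us-east-1.amazonaws.com"    # placeholder endpoint
  region: "us-east-1"                       # placeholder region
  # access_key and secret_key are omitted here on the assumption that the
  # pod obtains credentials from its IAM role.
```

Every Thanos component that touches the bucket (Sidecar, Store Gateway, Compactor) needs the same object storage configuration, so keeping it in one Secret avoids drift.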

What Prometheus Actually Is (and Isn’t) for Kubernetes

Prometheus is not a full platform. It’s a pull-based metrics system with a time-series database and a query language. That’s it. No dashboards. No built-in long-term storage. No synthetic monitoring. What it does well: scrape HTTP endpoints, store labeled metrics, fire alerts on expressions.

Inside Kubernetes, this matters because pods die, reschedule, and scale. Prometheus discovers targets via the Kubernetes API — no static config files. Each service, pod, or endpoint gets labels. Those labels are your only join key between metrics. Mess them up and your alerts break silently.

Prometheus uses a single binary. No external dependencies. The operator wraps it in CRDs like ServiceMonitor so you don’t write raw scrape configs. But understand this: the operator is just a config generator. If you don’t know what a scrape interval or relabel config does, you’re cargo-culting YAML, not monitoring.

Senior take: Prometheus assumes you understand your architecture. If your metrics don’t map to your deployment topology, you’re blind.

PrometheusSelfDescription.ymlYAML
# io.thecodeforge - devops tutorial

# What Prometheus says about itself in its own config
# This is the closest to a truth-of-record you'll get

global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s   # How often to evaluate rules

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod            # Discover targets from pod metadata
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
Output
$ kubectl get prometheus default -o yaml
# Returns the full custom resource — 300+ lines.
# The above snippet is the functional core.
Cargo Cult Alert:
Deploying the Prometheus Operator without understanding relabeling is like giving a toddler a chainsaw. The operator hides complexity, not architecture. Read the docs on relabel_configs before you touch a production cluster.
Key Takeaway
Prometheus is a pull-based metrics scraper with a TSDB and alert engine. Everything else — dashboards, storage, discovery — is bolted on. Know the core first.
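
Because labels are the only join key between metrics, it helps to see how discovery metadata becomes labels. The sketch below extends the kubernetes-pods job with a few common relabel rules. The __meta_kubernetes_* source labels come from Prometheus's pod service discovery; the choice of target label names (namespace, pod) is an assumption that matches common convention rather than anything the article prescribes.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Copy the pod's namespace into a "namespace" label.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Copy the pod name into a "pod" label.
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Turn every Kubernetes pod label into a Prometheus label of the
      # same name (for example, app=payments).
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```

Note that labelmap copies every pod label, so a pod label with unbounded values becomes a cardinality problem in Prometheus too.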

Why Prometheus Owns Kubernetes Monitoring (and Where It Bleeds)

Prometheus won the container monitoring war for one reason: service discovery. In a static VM world, you list IPs in a config file. In Kubernetes, pods come and go every deploy. Prometheus talks to the Kubernetes API, discovers new pods in seconds, and starts scraping. No manual sync. No config reload.

But that strength is also its weakness. Service discovery means Prometheus keeps a watch open on the API server and tracks every target it returns. If your cluster has 10,000 pods and each pod exposes 1,000 series, that is 10 million active series, far more than a typical single Prometheus can hold in memory. Cardinality is multiplicative: the series count for a metric is the product of the distinct values of each label. A label like request_id with 10 million unique values creates 10 million series on its own and will crash the TSDB.

The big bleed: long-term retention. Prometheus isn't built for storing months of data. The default retention is 15 days, and you need Thanos or Cortex to go beyond that. A single Prometheus also hits performance walls as the number of targets and active series grows. When that happens, shard by namespace or use federation.

Real talk: Prometheus works great for 99% of clusters. The 1% are hyperscale edge cases. Don’t over-engineer until you hit those limits.

CardinalityBomb.ymlYAML
# This metric will kill Prometheus in production
# Avoid high-cardinality labels at ALL COSTS

# BAD - do NOT deploy this
http_requests_total{endpoint="/api/v1/users", method="GET", status="200", request_id="abc-123-def"} 1
http_requests_total{endpoint="/api/v1/users", method="GET", status="200", request_id="abc-124-def"} 1
# ... 10 million unique request_ids
Output
WAL corruption detected. Prometheus will restart in 30 seconds.
Check logs: level=error msg="'/app/main': out of memory: killed process 1024 (prometheus) total-vm: 8388608kB"
Senior Shortcut:
Aggregate high-cardinality metrics at the application level before exposing them. If you need per-request tracking, use a log aggregator. Prometheus is not your observability dumpster.
Key Takeaway
Prometheus wins on Kubernetes-native service discovery but bleeds on cardinality and retention. Know your metric count per target, set cardinality limits, and plan for Thanos before you need it.
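
One way to notice a cardinality bomb before it becomes an OOMKill is to alert on Prometheus's own series counts. The rule group below is a sketch under assumed thresholds: the 1,000,000 and 10,000 limits and the alert names are illustrative and need tuning per cluster. Both metrics are exposed by Prometheus itself.

```yaml
groups:
  - name: cardinality-guardrails
    rules:
      # Total active series in the head block is approaching capacity.
      - alert: PrometheusHeadSeriesHigh
        expr: prometheus_tsdb_head_series > 1000000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active series count is above the planned capacity"

      # A single target exposes an unusually large number of samples.
      - alert: TargetExposesTooManySamples
        expr: scrape_samples_post_metric_relabeling > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }}/{{ $labels.instance }} exposes too many samples per scrape"
```

The second alert points at the offending target directly, which is usually faster than searching for the metric name after memory has already climbed.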

Prometheus Architecture: The Minimal Mental Model You Need

You don’t need to memorize every component in the architecture diagram. You need the three things Prometheus actually does: scrape, store, alert.

First, the scraper. Prometheus runs an HTTP client that hits /metrics endpoints on your targets. It uses service discovery to know where to hit. Each scrape produces samples, each one a timestamp, a value, and a set of labels. The scraper doesn't push; the target must expose an endpoint.

Second, the TSDB (time-series database). Samples get stored on local disk in two-hour blocks, and the default retention is 15 days. The TSDB is append-only with periodic compaction. It's fast but not distributed. If the pod dies without a persistent volume, you lose data, and even with one you have no redundancy. That's why you add Thanos or Cortex for HA.

Third, the alerting engine. Prometheus evaluates recording and alerting rules every evaluation_interval. If a rule matches, it fires an alert to Alertmanager. Alertmanager handles deduplication, grouping, and routing to Slack, PagerDuty, email.

That’s it. Three components. Everything else — exporters, operators, dashboards — is auxiliary. Master this model and you can debug any Prometheus issue in production.

MinimalPrometheusConfig.ymlYAML
# Absolute minimal Prometheus config to monitor itself
# This is the skeleton. Everything else is customization.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert.rules"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Output
$ prometheus --config.file=prometheus.yml
level=info ts=2025-02-20T10:00:00Z caller=main.go:383 msg="Starting Prometheus" version="(version=2.54.0)"
level=info ts=2025-02-20T10:00:01Z caller=web.go:518 msg="Listening on :9090"
Debug Shortcut:
When Prometheus isn't scraping, check the /targets endpoint. Every target shows its last scrape state and error. That’s your first debugging step, not the logs.
Key Takeaway
Prometheus does three things: scrape HTTP endpoints, store time-series data locally, and evaluate rules for alerts. Everything else is bolt-on. Master the trinity.
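
The minimal config above references a rule file named alert.rules without showing it. A sketch of what such a file could contain is below. The group name, alert name, and severity are illustrative assumptions; the expression uses the built-in up metric for the prometheus job defined in that config.

```yaml
# alert.rules: loaded through rule_files in prometheus.yml
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusSelfScrapeFailing
        # up is 0 when the last scrape of the target failed.
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus cannot scrape its own /metrics endpoint"
```

Running promtool check rules alert.rules before reloading catches syntax errors that would otherwise stop the rule file from loading.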

Key Terminologies: Stop Pretending You Know What These Mean

Before you touch Prometheus in Kubernetes, you need the vocabulary straight. Not the marketing fluff — the actual implementation details that break your cluster at 2am.

A Target is anything Prometheus scrapes: a pod, a node, a service endpoint. Metadata from the target becomes labels. A Metric is a time series — a name plus a set of labels with a timestamped float value. That's it. No hidden complexity.

Labels are key-value pairs that identify the metric's dimension. They're not tags — they ARE the identity. Change a label and you create a new time series. Cardinality is the number of unique label combinations. High cardinality kills Prometheus. Period.

An Alert fires when a PromQL expression keeps returning results for longer than its for duration. Silences mute alerts that match a set of label matchers for a fixed time window. Recording rules precompute expensive queries so your dashboards don't fall over. ServiceMonitor is a CRD that tells the Operator which services to scrape. Raw scrape_configs in prometheus.yml are the older, more manual way.

Know the difference between counter (only increases), gauge (goes up and down), and histogram (counts observations into buckets, typically durations). Mix them up and your queries lie.

terminology-example.ymlYAML
# A minimal scrape config showing key terms in the wild
scrape_configs:
  - job_name: 'node-exporter'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__meta_kubernetes_node_name]
        target_label: node
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_load1'
        action: keep
    # node_load1 is a gauge, not a counter
    # Each node becomes a different time series (target)
Output
// No output — this config just defines the scrape.
// But the Prometheus web UI will show:
// Target "node-exporter/192.168.1.10:9100"
// Metric "node_load1{node="k8s-node-1"}" value=0.45
Cardinality Trap:
A label whose value is unique per pod, such as instance_id, creates a separate time series per pod for every metric that carries it. If you have 10,000 pods with unique IDs, you have 10,000 time series for one metric, and every restart that mints a new ID adds more. Your Prometheus OOMs. Never use high-cardinality labels you don't control.
Key Takeaway
Cardinality kills clusters. Every unique label combination is a new time series, so design labels around small, bounded value sets rather than descriptiveness.

Key Features: Why Prometheus Wins for Kubernetes Monitoring

Prometheus dominates Kubernetes monitoring not because it's easy, but because its architecture matches how Kubernetes works. Pull-based scraping means no agent configuration inside containers: Prometheus discovers targets via the Kubernetes API and adapts to pod churn automatically. The multi-dimensional data model pairs metric names with arbitrary key-value labels, enabling slicing by deployment, namespace, or pod. PromQL's functions and aggregation operators (rate, sum, histogram_quantile) allow real-time SLO calculations without stored aggregations. Alertmanager handles deduplication, silencing, and grouping before paging you. Service discovery happens natively through pod annotations or ServiceMonitors, eliminating static configs. The operator pattern lets you define scraping rules as CRDs, versioned and GitOps-friendly. Each Prometheus server runs standalone with no clustering dependencies, so scrape failures don't cascade, but its data lives only on local disk unless you configure remote storage. Where it bleeds is long-term storage and cardinality explosion; choose Thanos or Cortex for retention beyond a month. Key decision: every new metric label adds memory pressure, so enforce team-wide label conventions.

prometheus-features.ymlYAML
# Key features: pull model, label-based data, PromQL
# This sample shows a ServiceMonitor using labels
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: http
    interval: 15s
  namespaceSelector:
    any: true
# Labels enable queries like: rate(http_requests_total{app="myapp"}[5m])
Output
ServiceMonitor created in cluster
Production Trap:
High-cardinality labels (user IDs, request paths) explode memory. Set scrape limits per target.
Key Takeaway
Labels are power AND poison—design them sparingly.
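
The scrape limit mentioned in the Production Trap can be set on the ServiceMonitor itself. The sketch below reuses the app-monitor example and adds a per-scrape sample cap plus a rule that drops any series carrying a user_id label. The limit of 5000 and the user_id label name are assumptions for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  labels:
    release: prometheus
spec:
  # Fail the scrape if a target exposes more than this many samples
  # after metric relabeling (illustrative value).
  sampleLimit: 5000
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: http
      interval: 15s
      metricRelabelings:
        # Drop every series that has a non-empty user_id label.
        - sourceLabels: [user_id]
          regex: '.+'
          action: drop
```

Dropping the whole series is used here instead of labeldrop because removing only the label would leave many samples with identical label sets in a single scrape.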

Demo: Monitor Any Third-Party App With Prometheus in Kubernetes

Third-party apps rarely expose Prometheus metrics natively. The fix: deploy a sidecar exporter or use the kube-state-metrics pattern. Steps: First, identify what metrics you need: CPU, request latency, HTTP status codes. Second, find or build a metrics exporter. For Redis, use the redis-exporter. For Nginx, the nginx-prometheus-exporter. For custom apps, write a lightweight exporter in Go or Python that serves /metrics on a dedicated port such as 2112. Third, deploy the exporter as a sidecar container in the same pod as your third-party app; this guarantees they share the network namespace, so the exporter can reach the app on localhost. Fourth, annotate the pod with prometheus.io/scrape: "true" and set prometheus.io/port to the exporter's port (9121 for redis_exporter, 2112 for the custom exporter above). These annotations are a convention, not a built-in feature: they only work if your scrape config has relabel rules that read them. Fifth, verify Prometheus discovers the target via its Targets page. Sixth, write recording rules for common aggregations (e.g., avg 5m CPU) to avoid expensive live PromQL. Seventh, set up Alertmanager for critical thresholds. This pattern works for legacy databases, message queues, or any black-box service. Don't modify the third-party app; wrap it.

sidecar-redis-exporter.ymlYAML
# Sidecar pattern: exporter + third-party app in one pod
apiVersion: v1
kind: Pod
metadata:
  name: redis-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9121"
spec:
  containers:
  - name: redis
    image: redis:7
  - name: exporter
    image: oliver006/redis_exporter:latest
    args: ["--redis.addr=redis://localhost:6379"]
    ports:
    - containerPort: 9121
Output
Pod running with sidecar exporter; Prometheus scrapes :9121/metrics
Production Trap:
Ensure the exporter container has resource limits—runaway exporters consume pod CPU and risk OOM.
Key Takeaway
Wrap, don't rewrite—sidecar exporters keep third-party apps untouched.
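
The prometheus.io/scrape and prometheus.io/port annotations only take effect when a scrape job reads them. A commonly used sketch of such a job is shown below. The job name is an assumption, and the relabel rules follow the widely copied pattern from the Prometheus Kubernetes examples rather than anything specific to this article.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods-annotated'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to use the port from prometheus.io/port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: $1:$2
        target_label: __address__
```

With the Prometheus Operator, the usual alternative is a PodMonitor or ServiceMonitor that selects the pod by label, in which case the annotations are not needed.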

Prerequisites Before Installing Prometheus on Kubernetes

Before deploying the kube-prometheus-stack, you need a running Kubernetes cluster (v1.19+) with sufficient resources—at least 4 CPU cores and 8GB RAM across nodes for a basic setup. Install kubectl and Helm v3, and ensure your cluster has PersistentVolume support for Prometheus and Alertmanager data durability. You must also have cluster-admin RBAC permissions. For Loki integration, prepare an object storage backend like S3 or MinIO, as Loki stores logs externally. Understanding basic PromQL and Kubernetes namespaces is assumed. Without these, the stack will fail silently, leaving you blind to metrics gaps. This setup is not plug-and-play; it demands a solid infrastructure foundation to avoid data loss or performance degradation in production.

prerequisites-check.yamlYAML
# Verify cluster readiness before proceeding.
# The bitnami/kubectl image ships kubectl only, so this Pod checks the kubectl
# client; run `helm version` from your workstation instead.
apiVersion: v1
kind: Pod
metadata:
  name: readiness-check
spec:
  containers:
  - name: checker
    image: bitnami/kubectl:latest
    # The --short flag was removed in kubectl 1.28; --client avoids needing
    # API access from inside the Pod.
    command: ["sh", "-c", "kubectl version --client"]
  restartPolicy: Never
---
# Expected output starts with (your version will differ):
# Client Version: v1.28.0
Output
Client Version: v1.28.0
Production Trap:
Skipping resource checks leads to OOM kills on Prometheus under load. Always set resource limits and requests—default Helm values are too generous for small clusters.
Key Takeaway
Validate cluster resources, Helm, and storage before installing kube-prometheus-stack to prevent silent failures.

Installing kube-prometheus-stack and Loki for Unified Monitoring

Install the kube-prometheus-stack via Helm to get Prometheus, Alertmanager, and Grafana in one chart. Add the Prometheus community repo and run: helm install monitoring prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace. This deploys default scraping for Kubernetes components and node metrics. For log aggregation, install Loki using the Grafana Helm chart: helm install loki grafana/loki-stack --namespace monitoring --set grafana.enabled=false,prometheus.enabled=false. Loki pairs with Promtail (deployed automatically) to push container logs to the Loki backend. You can then query logs in Grafana alongside metrics. This combination gives you a single pane of glass for metrics and logs, reducing context switching during incidents. Adjust storage retention via values.yaml to avoid unbounded disk usage.

install-stack.yamlYAML
# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=7d

# Install Loki stack (logs)
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false,prometheus.enabled=false
Output
Release "monitoring" installed
Release "loki" installed
Production Trap:
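Do not rely on Loki's default retention. Set it explicitly, for example to 7d via --set loki.config.limits_config.retention_period=168h, and confirm that retention enforcement is enabled in your chart version, to avoid losing historical logs during audits.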
Key Takeaway
Helm installs both stacks in minutes; pair Prometheus metrics with Loki logs for full observability.
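
Retention and resources are easier to review in a values file than in a chain of --set flags. The sketch below shows one possible values.yaml for kube-prometheus-stack. The file name, sizes, and memory figures are assumptions to adapt, not recommendations from the chart.

```yaml
# values.yaml (hypothetical file name), passed with:
#   helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
#     --namespace monitoring --create-namespace -f values.yaml
prometheus:
  prometheusSpec:
    retention: 7d            # time-based retention, as in the --set example
    retentionSize: 40GB      # size-based safety net (illustrative)
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
```

Keeping retentionSize below the PVC size leaves headroom for the WAL and for blocks that are mid-compaction.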

Advantages of Prometheus and Loki for Kubernetes Monitoring

Prometheus offers a pull-based model that fits Kubernetes' dynamic pod lifecycle: auto-discovery via ServiceMonitors means no manual target configuration. Its powerful PromQL allows slicing metrics by labels, enabling precise alerting and ad-hoc queries. The kube-prometheus-stack bundles dashboards, recording rules, and alerting out of the box, slashing initial setup time. Loki complements this with log aggregation that reuses Kubernetes labels, letting you jump from a metric spike to the relevant log lines instantly. Both are cloud-native, scale horizontally (via Thanos for Prometheus, and Loki's microservices mode), and integrate with Grafana for unified visualization. This stack reduces MTTR by correlating metrics and logs without leaving your dashboard. Plus, the ecosystem is vast: community exporters cover databases, caches, and HTTP services.

advantages-demo.yamlYAML
# Example: auto-discovery via ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: http
    interval: 15s
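# Prometheus discovers Services labeled app: my-app and scrapes the pods behind them.
# Assumption: with kube-prometheus-stack defaults, the ServiceMonitor may also need a
# release: <helm-release-name> label before the operator selects it.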
Output
ServiceMonitor created
Prometheus scrapes metrics every 15s
Production Trap:
Beware of label cardinality explosion—avoid high-cardinality labels like request IDs in Prometheus metrics to prevent memory exhaustion.
Key Takeaway
Prometheus and Loki together provide label-based correlation between metrics and logs, reducing incident resolution time.

Limitations of Prometheus and Loki in Kubernetes Environments

Prometheus is not designed for long-term storage: its local TSDB keeps 15 days by default and is typically run with weeks of retention at most, so months of history call for Thanos or Cortex. It struggles with high-cardinality metrics; labels such as user IDs or raw request paths can push memory usage past the pod's limit. The pull model breaks down when targets are too short-lived to scrape (for example serverless functions) or sit behind network segmentation. Loki is efficient for log storage, but it indexes only labels, so queries that filter on log content slow down on large volumes unless label selectors narrow the streams first; ingestion-time parsing is best kept to simple label extraction. Prometheus has no built-in multi-tenancy, so team isolation needs separate instances or a proxy. Loki supports tenants through a tenant ID header, but something in front of it still has to authenticate requests and set that header. Finally, the operational overhead is non-trivial: upgrading Helm charts, tuning scrape intervals, and managing retention policies all take dedicated DevOps effort. For very small clusters, a lighter alternative such as VictoriaMetrics might suffice.

limitations-mitigation.yaml · YAML
# io.thecodeforge - devops tutorial
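# Mitigate cardinality: drop high-cardinality labels at scrape time
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    # scrape_configs is a top-level key, not a child of global
    scrape_configs:
    - job_name: 'safe-app'
      metrics_path: /metrics
      # metric_relabel_configs acts on scraped samples;
      # relabel_configs only sees target labels before the scrape
      metric_relabel_configs:
      - action: labeldrop
        regex: 'user_id|request_id'  # Drop high-cardinality labels
      # Caveat: series that differed only by a dropped label now collide,
      # so removing the label in the application is the safer fix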
Output
ConfigMap created
Prometheus reloaded with label drop rule
Production Trap:
Running Prometheus without rate-limiting scrape targets can overload the cluster network—always set scrape_interval to 30s or higher for non-critical services.
Key Takeaway
Prometheus and Loki require careful cardinality control and external storage for long-term retention; they are not turnkey solutions.
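One common way to get the external long-term storage mentioned above is remote write. The fragment below is a sketch for kube-prometheus-stack values. The URL is a hypothetical in-cluster endpoint, shown with the port and path that Thanos Receive conventionally uses, so replace it with whatever backend you actually run.

```yaml
# values.yaml fragment for kube-prometheus-stack (endpoint is a hypothetical example)
prometheus:
  prometheusSpec:
    retention: 7d   # keep the local TSDB short; long-term history lives in the remote store
    remoteWrite:
      - url: http://long-term-store.monitoring.svc:19291/api/v1/receive  # assumption: Thanos Receive style endpoint
```

Remote write moves history out of the pod, but it does not reduce head-block memory. Cardinality control is still needed on the Prometheus side.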
● Production incident · POST-MORTEM · severity: high

Prometheus OOMKill from High-Cardinality Label Explosion

Symptom
Prometheus pod restarted with OOMKill (exit code 137). Memory usage showed exponential growth in the 6 hours before the crash. TSDB head chunks metric showed millions of active series. The /targets page showed all targets as UP — scraping was healthy.
Assumption
Prometheus needed more memory. The cluster had grown and was generating more metrics.
Root cause
A developer instrumented a counter with a user_id label to track per-user request counts. With 50,000 unique users per hour and 5 combinations of the other labels (method, endpoint, status_code) per user, the metric generated 50,000 * 5 = 250,000 new time series per hour. Each time series consumes memory in Prometheus's TSDB head block. After 6 hours, the head block held roughly 1.5 million active series (250,000 * 6) for a single metric, consuming 24GB of RAM. The Prometheus pod was configured with a 16GB memory limit and was OOMKilled.
Fix
1. Removed the user_id label from the counter immediately and redeployed the application.
2. Added a Prometheus recording rule to aggregate by user tier (free, premium, enterprise) instead of individual user_id.
3. Added sample_limit: 1000 to the scrape config to prevent future label explosions from a single target.
4. Deployed a cardinality-linter CI check that rejects metrics with more than 3 labels in code review.
5. Added a Prometheus alert on prometheus_tsdb_head_series > 1000000 to catch future explosions early.
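Fixes 2 and 5 can live in one PrometheusRule. The sketch below is not the incident's actual rule file. It assumes the application exposes a counter named http_requests_total with a bounded tier label, and that the Prometheus ruleSelector matches release: monitoring. The metric name, the label, and the rule names are all hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cardinality-guardrails        # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring               # assumption: must match the Prometheus ruleSelector
spec:
  groups:
    - name: requests-by-tier
      rules:
        # Aggregate over a bounded label (tier) instead of keeping per-user series.
        - record: tier:http_requests:rate5m
          expr: sum by (tier, method, status_code) (rate(http_requests_total[5m]))
    - name: prometheus-self-monitoring
      rules:
        # Fire well before the head block approaches the pod's memory limit.
        - alert: PrometheusHighSeriesCount
          expr: prometheus_tsdb_head_series > 1000000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Prometheus head series above 1M; check for a label explosion"
```

The for: 10m clause is a judgment call: it filters out brief spikes during rollouts, at the cost of a slower page during a real explosion.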
Key lesson
  • High-cardinality labels (user_id, request_id, trace_id) will destroy Prometheus. Never add unbounded values as label values.
  • Each unique combination of label values creates a new time series. 5 labels with 10 values each = 100,000 series per metric name.
  • Set sample_limit on scrape configs as a safety net. It rejects scrapes from targets that expose too many series.
  • Monitor Prometheus's own metrics: prometheus_tsdb_head_series, prometheus_tsdb_head_chunks, and memory usage. Alert before OOMKill.
  • Enforce cardinality limits in CI/CD. A single bad label can take down monitoring for the entire cluster.
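With the Prometheus Operator, the sample_limit lesson above maps to the sampleLimit field on a ServiceMonitor. This is a minimal sketch: the app: my-app selector, the http port name, and the release label are assumptions carried over from the earlier example, not values from the incident.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    release: monitoring     # assumption: matches the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app
  sampleLimit: 1000         # the whole scrape fails if the target exposes more samples than this
  endpoints:
    - port: http
      interval: 15s
```

When the limit is exceeded, the target shows as DOWN rather than silently losing part of its data, so the failure is visible. Prometheus also counts these events in prometheus_target_scrapes_exceeded_sample_limit_total, which is worth alerting on.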
Production debug guide
Symptom-first investigation path for Prometheus failures in Kubernetes.
Symptom · 01
Target showing as DOWN on Prometheus /targets page.
Fix
Check that the pod is running and the metrics endpoint returns 200. Verify the prometheus.io/scrape annotation is set (or, with the Prometheus Operator, that a ServiceMonitor or PodMonitor selects the workload). Check NetworkPolicy: Prometheus must be able to reach the target pod's IP.
Symptom · 02
Target showing as UNKNOWN: Prometheus has discovered it but has not recorded a scrape result yet.
Fix
UNKNOWN normally clears after the first scrape interval. If it persists, check that the pod exists and has an IP, and that Prometheus can reach it. A common cause is a restarted pod leaving Prometheus with a stale target; the next service discovery refresh replaces it.
Symptom · 03
Query returns 'query timed out in expression evaluation'.
Fix
The query is too expensive. Check for high-cardinality selectors. Add recording rules to pre-compute expensive expressions. Check Prometheus CPU/memory usage — it may be under-provisioned.
Symptom · 04
Grafana dashboards show gaps in metrics.
Fix
Check Prometheus /targets for flapping targets (alternating UP/DOWN). Check scrape duration — if it exceeds scrape_interval, samples are missed. Check for Prometheus restarts (TSDB WAL replay takes time).
Symptom · 05
Alerts not firing when they should.
Fix
Check Prometheus /rules page — is the rule group evaluating? Check the for duration — the alert may be in PENDING state. Check Alertmanager routing — the alert may be firing but silenced or routed to the wrong receiver.
Symptom · 06
Prometheus consuming excessive memory.
Fix
Check prometheus_tsdb_head_series — if over 5M, you have a cardinality problem. Run promtool tsdb analyze to find the highest-cardinality metric names. Look for labels with unbounded values (user_id, request_id).
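Symptom 01 names NetworkPolicy as a common cause of DOWN targets. The sketch below shows a policy that admits scrapes from Prometheus. It assumes Prometheus runs in the monitoring namespace with the app.kubernetes.io/name: prometheus pod label, and that the target serves metrics on TCP 8080. The policy name, namespace, selector, and port are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape       # hypothetical name
  namespace: my-app-namespace         # hypothetical: namespace of the scraped workload
spec:
  podSelector:
    matchLabels:
      app: my-app                     # pods that expose /metrics
  policyTypes: ["Ingress"]
  ingress:
    - from:
        # namespaceSelector and podSelector in the same entry are ANDed:
        # only Prometheus pods in the monitoring namespace are admitted.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - protocol: TCP
          port: 8080                  # assumption: the metrics port
```

Check the real pod labels with kubectl get pods -n monitoring --show-labels before relying on the selector, because a mismatch fails silently.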
★ Prometheus Triage Commands
Rapid commands to isolate Prometheus monitoring issues.
Target showing DOWN or UNKNOWN.
Immediate action
Check target health and scrape annotations.
Commands
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.annotations.prometheus\.io/scrape}{"\n"}{end}'
kubectl exec -n monitoring deploy/prometheus -- wget -qO- http://<target-ip>:<port>/metrics | head -5
Fix now
If annotation is missing, add it. If wget fails, check NetworkPolicy and pod readiness.
Prometheus memory growing rapidly.
Immediate action
Check TSDB head series count and find high-cardinality metrics.
Commands
curl -s 'http://prometheus:9090/api/v1/query?query=prometheus_tsdb_head_series' | jq -r '.data.result[0].value[1]'
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | sort_by(.value) | reverse | .[0:10]'
Fix now
If head series > 5M, identify the top metric and reduce its cardinality. Add sample_limit to scrape configs.
Recording rules not evaluating.
Immediate action
Check PrometheusRule CR label matching and operator logs.
Commands
kubectl get prometheusrule -A -o json | jq '.items[] | select(.metadata.labels.prometheus=="kube-prometheus") | .metadata.name'
kubectl logs -n monitoring deploy/prometheus-operator | grep -i 'rule\|error'
Fix now
If the rule is missing from the list, the label does not match ruleSelector. Fix the label on the PrometheusRule CR.
Alerts firing but not reaching PagerDuty/Slack.
Immediate action
Check Alertmanager routing and receiver configuration.
Commands
kubectl exec -n monitoring deploy/alertmanager -- amtool config show
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="active") | .labels'
Fix now
If alerts are in Alertmanager but not routed, check the route matching (team label, severity label). Test with amtool.
Scrape duration exceeds scrape_interval.
Immediate action
Check which targets have slow scrapes.
Commands
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapeDurationSeconds > 10) | {job: .labels.job, instance: .labels.instance, duration: .scrapeDurationSeconds}'
curl -s 'http://prometheus:9090/api/v1/query?query=scrape_duration_seconds' | jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[0:5]'
Fix now
If a single target is slow, it may be exposing too many metrics. Reduce cardinality or increase scrape_timeout for that job.
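Raising the timeout for a single slow job can be done on its ServiceMonitor endpoint. This is a minimal sketch in which the slow-exporter name and its labels are hypothetical. Note that scrapeTimeout must not exceed interval.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: slow-exporter           # hypothetical target
  namespace: monitoring
  labels:
    release: monitoring         # assumption: matches the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: slow-exporter
  endpoints:
    - port: http
      interval: 60s             # scrape less often so a slow scrape cannot overlap the next one
      scrapeTimeout: 30s        # must be less than or equal to interval
```

Treat a longer timeout as a stopgap. If the scrape is slow because the target exposes too many series, reducing cardinality is the durable fix.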