
Kubernetes Monitoring with Prometheus — Deep Dive for Production

📍 Part of: Kubernetes → Topic 9 of 12
Kubernetes monitoring with Prometheus explained deeply — scrape configs, service discovery, custom metrics, alerting rules, and production gotchas covered end-to-end.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • Prometheus is pull-based with Kubernetes-native service discovery. Annotation-driven scraping means enabling monitoring requires no config changes on the Prometheus side.
  • Instrument with Counters (totals), Gauges (current state), and Histograms (latency). Never use Summary in multi-replica deployments.
  • Recording rules pre-compute expensive PromQL into cheap time series. They are not optional in production — they are the performance optimization.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Pull model: Prometheus scrapes /metrics endpoints on targets at configured intervals.
  • Service Discovery: Queries Kubernetes API to find pods, services, nodes dynamically — no static IPs.
  • Four metric types: Counter (only goes up), Gauge (up/down), Histogram (bucketed observations), Summary (client-side quantiles).
  • Recording rules: Pre-compute expensive PromQL into cheap time series.
  • Alerting: Prometheus evaluates rules, Alertmanager routes to PagerDuty/Slack.
  • Pull model means Prometheus must reach every target. NetworkPolicy misconfigs silently break scraping.
  • High-cardinality labels (user_id, request_id) will OOM Prometheus.
  • Using Summary instead of Histogram in multi-replica deployments. Summaries cannot be aggregated across instances.
🚨 START HERE
Prometheus Triage Commands
Rapid commands to isolate Prometheus monitoring issues.
🟡 Target showing DOWN or UNKNOWN.
Immediate Action: Check target health and scrape annotations.
Commands
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.annotations.prometheus\.io/scrape}{"\n"}{end}'
kubectl exec -n monitoring deploy/prometheus -- wget -qO- http://<target-ip>:<port>/metrics | head -5
Fix Now: If the annotation is missing, add it. If wget fails, check NetworkPolicy and pod readiness.
🟡 Prometheus memory growing rapidly.
Immediate Action: Check TSDB head series count and find high-cardinality metrics.
Commands
curl -s http://prometheus:9090/api/v1/query?query=prometheus_tsdb_head_series | jq '.data.result[0].value[1]'
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | sort_by(.value) | reverse | .[0:10]'
Fix Now: If head series > 5M, identify the top metric and reduce its cardinality. Add sample_limit to scrape configs.
🟡 Recording rules not evaluating.
Immediate Action: Check PrometheusRule CR label matching and operator logs.
Commands
kubectl get prometheusrule -A -o json | jq '.items[] | select(.metadata.labels.prometheus=="kube-prometheus") | .metadata.name'
kubectl logs -n monitoring deploy/prometheus-operator | grep -i 'rule\|error'
Fix Now: If the rule is missing from the list, the label does not match ruleSelector. Fix the label on the PrometheusRule CR.
🟡 Alerts firing but not reaching PagerDuty/Slack.
Immediate Action: Check Alertmanager routing and receiver configuration.
Commands
kubectl exec -n monitoring deploy/alertmanager -- amtool config show
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="firing") | .labels'
Fix Now: If alerts are in Alertmanager but not routed, check the route matching (team label, severity label). Test with amtool.
🟡 Scrape duration exceeds scrape_interval.
Immediate Action: Check which targets have slow scrapes.
Commands
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.lastScrapeDuration > 10) | {job: .labels.job, instance: .labels.instance, duration: .lastScrapeDuration}'
curl -s http://prometheus:9090/api/v1/query?query=scrape_duration_seconds | jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[0:5]'
Fix Now: If a single target is slow, it may be exposing too many metrics. Reduce cardinality or increase scrape_timeout for that job.
Production Incident: Prometheus OOMKill from High-Cardinality Label Explosion
A developer added a `user_id` label to a counter metric. Within 6 hours, Prometheus memory usage grew from 4GB to 28GB and was OOMKilled by the kernel, causing a 45-minute monitoring blackout during a production incident.
Symptom: Prometheus pod restarted with OOMKill (exit code 137). Memory usage showed exponential growth in the 6 hours before the crash. TSDB head chunks metric showed millions of active series. The /targets page showed all targets as UP — scraping was healthy.
Assumption: Prometheus needed more memory. The cluster had grown and was generating more metrics.
Root cause: A developer instrumented a counter with a user_id label to track per-user request counts. With 50,000 unique user_id values per hour and roughly 5 combinations of the other labels (method, endpoint, status_code), the metric generated 50,000 * 5 = 250,000 new time series per hour. Each time series consumes memory in Prometheus's TSDB head block. After 6 hours, the head block contained over 1.5 million active series for a single metric, consuming 24GB of RAM. The Prometheus pod was configured with a 16GB memory limit and was OOMKilled.
Fix:
1. Removed the user_id label from the counter immediately and redeployed the application.
2. Added a Prometheus recording rule to aggregate by user tier (free, premium, enterprise) instead of individual user_id.
3. Added sample_limit: 1000 to the scrape config to prevent future label explosions from a single target.
4. Deployed a cardinality-linter CI check that rejects metrics with more than 3 labels in code review.
5. Added a Prometheus alert on prometheus_tsdb_head_series > 1000000 to catch future explosions early.
Key Lesson
  • High-cardinality labels (user_id, request_id, trace_id) will destroy Prometheus. Never use unbounded values as label values.
  • Each unique combination of label values creates a new time series. 5 labels with 10 values each = 100,000 series per metric name.
  • Set sample_limit on scrape configs as a safety net against targets that expose too many series.
  • Monitor Prometheus's own metrics: prometheus_tsdb_head_series, prometheus_tsdb_head_chunks, and memory usage. Alert before OOMKill.
  • Enforce cardinality limits in CI/CD. A single bad label can take down monitoring for the entire cluster.
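The arithmetic behind these numbers is worth internalizing. Here is a minimal sketch of a cardinality and memory estimator; the ~16KB-per-series figure is the rule of thumb quoted in this article, not an exact constant:

```go
package main

import "fmt"

// estimateSeries returns the worst-case series count for one metric name:
// the product of the number of distinct values each label can take.
func estimateSeries(labelCardinalities []int) int {
	total := 1
	for _, c := range labelCardinalities {
		total *= c
	}
	return total
}

// estimateHeadRAMBytes applies the ~16KB-per-active-series rule of thumb
// for Prometheus's TSDB head block (an approximation, not a guarantee).
func estimateHeadRAMBytes(series int) int {
	return series * 16 * 1024
}

func main() {
	// 5 labels with 10 values each, as in the Key Lesson above.
	series := estimateSeries([]int{10, 10, 10, 10, 10})
	fmt.Println(series) // 100000
	// At ~16KB per series that is roughly 1.5 GiB of head-block memory.
	fmt.Printf("%.1f GiB\n", float64(estimateHeadRAMBytes(series))/(1<<30))
}
```

Run the same estimate before merging any new label: one unbounded label turns this product into an unbounded number.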
Production Debug Guide
Symptom-first investigation path for Prometheus failures in Kubernetes.
Target showing as DOWN on Prometheus /targets page.
Check if the pod is running and the metrics endpoint returns 200. Verify the prometheus.io/scrape annotation is set. Check NetworkPolicy — Prometheus must be able to reach the target pod's IP; a restrictive ingress policy silently drops the scrape traffic.
Target showing as UNKNOWN — Prometheus cannot reach it at all.
This is a network issue. Check if the pod exists, has an IP, and Prometheus can reach it. Common cause: pod restarted and Prometheus has a stale target. Wait for the next service discovery refresh.
Query returns 'query timed out in expression evaluation'.
The query is too expensive. Check for high-cardinality selectors. Add recording rules to pre-compute expensive expressions. Check Prometheus CPU/memory usage — it may be under-provisioned.
Grafana dashboards show gaps in metrics.
Check Prometheus /targets for flapping targets (alternating UP/DOWN). Check scrape duration — if it exceeds scrape_interval, samples are missed. Check for Prometheus restarts (TSDB WAL replay takes time).
Alerts not firing when they should.
Check the Prometheus /rules page — is the rule group evaluating? Check the for duration — the alert may be in PENDING state. Check Alertmanager routing — the alert may be firing but silenced or routed to the wrong receiver.
Prometheus consuming excessive memory.
Check prometheus_tsdb_head_series — if over 5M, you have a cardinality problem. Run promtool tsdb analyze to find the highest-cardinality metric names. Look for labels with unbounded values (user_id, request_id).

Running Kubernetes in production without monitoring is like flying a commercial aircraft with the instrument panel blacked out. Everything might feel fine until it catastrophically isn't. Prometheus is used by over 84% of Kubernetes production environments — not because it's the easiest tool, but because it's the most powerful pull-based metrics system that was purpose-built for dynamic, containerized infrastructure.

The real problem Prometheus solves is the ephemeral nature of Kubernetes workloads. Traditional monitoring tools expect your target IPs to stay fixed. In Kubernetes, a pod's IP changes every restart. Prometheus solves this with Kubernetes-native service discovery — it queries the Kubernetes API server directly to find what's alive right now, not what was alive when you wrote the config.

This is not a getting-started guide. It covers scrape configurations with relabeling, custom application metrics using client libraries, recording rules to avoid query-time explosions, Alertmanager integration, and the five most expensive mistakes teams make in production.

How Prometheus Service Discovery Works Inside Kubernetes

Prometheus uses a pull model — it reaches out to targets and scrapes metrics endpoints, typically on path /metrics, at a configured interval. In a static world you'd list IPs. In Kubernetes, Prometheus uses kubernetes_sd_configs to query the Kubernetes API and discover pods, services, endpoints, nodes, and ingresses dynamically.

When Prometheus starts, it authenticates to the API server using a ServiceAccount token mounted in its pod. It then watches specific resource types. For the endpoints role, Prometheus discovers every Endpoints object across the cluster. For each endpoint address it finds, it creates a scrape target. The magic happens during relabeling — a pipeline that runs before the scrape and lets you filter, rename, and attach labels using values pulled directly from Kubernetes metadata (pod annotations, namespace labels, service names).

The annotation `prometheus.io/scrape: 'true'` is a community convention that Prometheus relabeling configs check. If the annotation exists and is true, the pod is scraped. This means enabling monitoring for a new application is as simple as adding three lines to its pod spec — no Prometheus config reload needed. Prometheus picks up new pods automatically on its next service-discovery refresh.

Understanding the target lifecycle is critical for production. Targets move through states: up, down, and unknown. A target goes unknown when Prometheus can't reach the endpoint at all (network issue or pod not started). It goes down when the HTTP scrape returns a non-200 status or times out. Staleness markers are injected after a target disappears — this prevents old time series from polluting range queries.
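These states are visible programmatically via Prometheus's /api/v1/targets endpoint, which is often faster to script against than the UI. A minimal sketch in Go — the struct mirrors only the response fields used here (labels, health, lastError), and the sample payload is invented:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetsResponse models the slice of the /api/v1/targets response we need.
type targetsResponse struct {
	Data struct {
		ActiveTargets []struct {
			Labels    map[string]string `json:"labels"`
			Health    string            `json:"health"` // "up", "down", or "unknown"
			LastError string            `json:"lastError"`
		} `json:"activeTargets"`
	} `json:"data"`
}

// unhealthy lists every target whose health is not "up", with its last error.
func unhealthy(raw []byte) ([]string, error) {
	var resp targetsResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, err
	}
	var out []string
	for _, t := range resp.Data.ActiveTargets {
		if t.Health != "up" {
			out = append(out, fmt.Sprintf("%s (%s): %s", t.Labels["instance"], t.Health, t.LastError))
		}
	}
	return out, nil
}

func main() {
	// Invented sample payload in the shape Prometheus returns.
	sample := []byte(`{"data":{"activeTargets":[
		{"labels":{"instance":"10.0.1.5:9091","job":"kubernetes-pods"},"health":"up","lastError":""},
		{"labels":{"instance":"10.0.2.7:9091","job":"kubernetes-pods"},"health":"down","lastError":"context deadline exceeded"}]}}`)
	bad, err := unhealthy(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(bad) // only the down target is reported
}
```

Feed it the output of `curl -s http://prometheus:9090/api/v1/targets` to triage a cluster without clicking through the /targets page.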

prometheus-kubernetes-sd-config.yaml · YAML
# This is the core Prometheus scrape configuration for Kubernetes pod discovery.
# It lives inside your prometheus.yml (or a PrometheusRule CR if using the operator).

scrape_configs:
  - job_name: 'kubernetes-pods'
    # Prometheus will query the Kubernetes API server to find all pods.
    kubernetes_sd_configs:
      - role: pod
        # Restrict discovery to a specific namespace for security isolation.
        namespaces:
          names:
            - production
            - staging

    # relabel_configs runs BEFORE each scrape — it filters and transforms targets.
    relabel_configs:
      # STEP 1: Only scrape pods that explicitly opt in via annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'

      # STEP 2: Allow pods to declare a custom metrics path.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # STEP 3: Allow pods to declare a custom port for scraping.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__

      # STEP 4: Carry namespace as a label.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace

      # STEP 5: Carry the pod name as a label.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

      # STEP 6: Carry the app label from the pod.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app

      # STEP 7: Drop pods that are not running. Completed Job pods
      # (Succeeded/Failed) linger in the API and would show up as
      # permanently-down targets.
      # Note: Prometheus regexes use RE2, which has no backreferences, so a
      # relabel rule cannot compare the pod IP to the host IP to skip
      # hostNetwork pods.
      - source_labels: [__meta_kubernetes_pod_phase]
        regex: Succeeded|Failed
        action: drop
▶ Output
# Prometheus /targets page shows discovered targets with labels applied.
Mental Model
The __address__ Rewrite Trap
Always set prometheus.io/port explicitly in multi-container pods. This is the most common scrape misconfiguration in production.
  • Multi-container pods expose multiple ports. Prometheus picks one — not always the right one.
  • The __address__ label determines where Prometheus connects. Relabeling rewrites it.
  • Without prometheus.io/port, Prometheus uses the first container port in the pod spec.
  • Sidecar containers (Istio, Envoy) often expose ports that are not your metrics port.
📊 Production Insight
NetworkPolicies are the silent killer of Prometheus scraping. If you deploy a NetworkPolicy that restricts ingress to your pod, and Prometheus is not in the allowed source namespace/IP range, scraping silently fails. The target shows as down or unknown with no useful error. Always include Prometheus's namespace in your NetworkPolicy ingress rules. Use kubectl exec from the Prometheus pod to test connectivity to the target before blaming the scrape config.
🎯 Key Takeaway
Prometheus service discovery is annotation-driven and relabeling-configured. The most common production failures are: missing annotations, NetworkPolicy blocking scrapes, and multi-container port confusion. Always set prometheus.io/scrape, prometheus.io/port, and prometheus.io/path explicitly.

Exposing Custom Application Metrics with the Prometheus Client Libraries

Kubernetes infrastructure metrics (CPU, memory, network) come from kube-state-metrics and node-exporter. But the metrics that make or break your SLOs are application-level: request latency, error rates, queue depth, cache hit ratio. These come from instrumenting your own code.

Prometheus has four core metric types you need to understand at the semantic level, not just the API level:

Counter — a value that only goes up (resets to zero on restart). Use it for total requests, total errors, total bytes sent. Never use a counter for something that can decrease. PromQL's rate() and increase() functions unwrap counters properly, handling resets.

Gauge — a value that can go up or down. Use it for current queue depth, active connections, temperature, memory usage. Don't use rate() on a gauge — it's meaningless.

Histogram — pre-aggregated bucketed observations. Use it for latency and request size. It exposes three time series: _bucket, _sum, and _count. The bucket boundaries you choose at instrumentation time are permanent — you can't change them without restarting the process.

Summary — client-side computed quantiles. Use it only when you need accurate quantiles and can't aggregate across instances (summaries can't be aggregated in PromQL). In Kubernetes with multiple replicas, histograms are almost always the right choice over summaries.

instrumented_http_server.go · GO
package main

import (
	"math/rand"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "payment_service_http_requests_total",
			Help: "Total number of HTTP requests processed by the payment service.",
		},
		[]string{"method", "endpoint", "status_code"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "payment_service_http_request_duration_seconds",
			Help: "HTTP request duration in seconds, bucketed by endpoint.",
			Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5},
		},
		[]string{"method", "endpoint"},
	)

	inFlightRequests = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_in_flight_requests",
			Help: "Number of HTTP requests currently being processed.",
		},
	)

	paymentQueueDepth = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_queue_depth",
			Help: "Current number of payments waiting in the processing queue.",
		},
	)
)

func instrumentedHandler(endpoint string, handlerFunc http.HandlerFunc) http.HandlerFunc {
	return func(responseWriter http.ResponseWriter, request *http.Request) {
		inFlightRequests.Inc()
		defer inFlightRequests.Dec()
		startTime := time.Now()
		wrappedWriter := &statusCapturingWriter{ResponseWriter: responseWriter, statusCode: http.StatusOK}
		handlerFunc(wrappedWriter, request)
		durationSeconds := time.Since(startTime).Seconds()
		httpRequestDuration.WithLabelValues(request.Method, endpoint).Observe(durationSeconds)
		httpRequestsTotal.WithLabelValues(
			request.Method,
			endpoint,
			strconv.Itoa(wrappedWriter.statusCode),
		).Inc()
	}
}

type statusCapturingWriter struct {
	http.ResponseWriter
	statusCode int
}

func (scw *statusCapturingWriter) WriteHeader(code int) {
	scw.statusCode = code
	scw.ResponseWriter.WriteHeader(code)
}

func processPayment(responseWriter http.ResponseWriter, request *http.Request) {
	processingTime := time.Duration(5+rand.Intn(295)) * time.Millisecond
	time.Sleep(processingTime)
	if rand.Float64() < 0.02 {
		http.Error(responseWriter, "upstream payment gateway timeout", http.StatusGatewayTimeout)
		return
	}
	responseWriter.WriteHeader(http.StatusOK)
	responseWriter.Write([]byte(`{"status":"processed"}`))
}

func main() {
	// Simulate queue-depth updates from a background worker.
	go func() {
		for {
			paymentQueueDepth.Set(float64(rand.Intn(500)))
			time.Sleep(5 * time.Second)
		}
	}()

	// Use separate muxes so the metrics listener (:9091) does not also expose
	// application routes, and vice versa. (With nil handlers, both servers
	// would share http.DefaultServeMux and serve every route on both ports.)
	metricsMux := http.NewServeMux()
	metricsMux.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":9091", metricsMux)

	appMux := http.NewServeMux()
	appMux.HandleFunc("/api/v1/payments", instrumentedHandler("/api/v1/payments", processPayment))
	http.ListenAndServe(":8080", appMux)
}
▶ Output
# Prometheus scrapes http://payment-service-pod:9091/metrics and receives all four metric types.
Mental Model
Histogram Bucket Design is a One-Way Door
Sketch your bucket boundaries on a napkin against your SLO thresholds before writing a single line of code.
  • Bucket boundaries are permanent until the process restarts.
  • histogram_quantile() interpolates between buckets — imprecise if boundaries don't align with SLO.
  • Default buckets (DefBuckets) go up to 10s — too coarse for most APIs.
  • Custom buckets aligned to SLO thresholds (e.g., 0.2s for p99 < 200ms) give accurate SLO tracking.
  • Histograms expose _bucket, _sum, _count — three time series per label combination.
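The interpolation point is easier to see with numbers. Below is a simplified sketch of the linear interpolation histogram_quantile() performs inside the bucket containing the target rank; the real implementation handles more edge cases (the +Inf bucket, counter resets, native histograms):

```go
package main

import "fmt"

type bucket struct {
	le    float64 // upper bound (the "le" label)
	count float64 // cumulative count of observations <= le
}

// quantile linearly interpolates within the bucket that contains the
// q-th rank, mimicking histogram_quantile()'s core behavior.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevBound, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return prevBound + (b.le-prevBound)*(rank-prevCount)/(b.count-prevCount)
		}
		prevBound, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// 100 observations: 50 under 0.1s, 90 under 0.2s, all under 1.0s.
	buckets := []bucket{{0.1, 50}, {0.2, 90}, {1.0, 100}}
	// p99 falls in the wide 0.2..1.0 bucket, so the answer is coarse:
	// the true p99 could be anywhere in that range.
	fmt.Printf("%.3f\n", quantile(0.99, buckets)) // 0.920
}
```

With a bucket boundary at your 0.2s SLO threshold you can at least answer "above or below SLO" exactly; without one, the interpolated value is a guess across the whole bucket width.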
📊 Production Insight
High-cardinality labels on metrics are the most common cause of Prometheus OOMKill in production. A label like user_id with 50,000 unique values creates 50,000 time series for every combination of the other label values. With 5 such combinations, that is 250,000 series per hour. Prometheus stores all active series in memory (TSDB head block). At ~16KB per series, 1 million series = 16GB RAM. Never add unbounded values as label values. Use bounded labels like user_tier (free/premium/enterprise) instead.
🎯 Key Takeaway
Instrument with Counters for totals, Gauges for current state, and Histograms for latency distributions. Never use Summary in multi-replica Kubernetes deployments — it cannot be aggregated. Design histogram buckets against SLO thresholds. Never add unbounded label values.

Production-Grade Recording Rules and Alerting That Won't Page You at 3am

Raw PromQL queries against high-cardinality data are expensive. A query like rate(http_requests_total[5m]) across 200 pods runs every time a dashboard loads. In large clusters, this causes Prometheus to churn through millions of samples per query, leading to query timeouts and the dreaded 'query timed out in expression evaluation' error.

Recording rules solve this by pre-computing expensive expressions and storing the result as a new time series. Prometheus evaluates recording rules on its evaluation interval (typically 1m), writes the result into its TSDB, and future queries read that cheap pre-computed series instead of re-scanning the raw data.

Naming matters. The Prometheus community convention for recording rule names is level:metric:operations. For example job:http_requests_total:rate5m means: aggregated at the job level, derived from http_requests_total, computed as a 5-minute rate. Sticking to this convention makes rules self-documenting and searchable.
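The convention is informal, so teams often enforce it with a lightweight lint in CI. A sketch of such a check — the regex below is an illustrative approximation I wrote for this article, not an official linter:

```go
package main

import (
	"fmt"
	"regexp"
)

// recordingRuleName approximates the level:metric:operations convention:
// three colon-separated identifier segments.
var recordingRuleName = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*:[a-zA-Z_][a-zA-Z0-9_]*:[a-zA-Z0-9_]+$`)

func main() {
	for _, name := range []string{
		"job:http_requests_total:rate5m",                  // follows the convention
		"job_endpoint:payment_service_error_ratio:rate5m", // follows the convention
		"http_requests_rate5m",                            // missing level and operation segments
	} {
		fmt.Println(name, recordingRuleName.MatchString(name))
	}
}
```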

Alerts in Prometheus are defined in the same YAML format as recording rules. The critical production insight is that alerts should express SLO burn rates, not raw thresholds. An alert that fires when error rate > 1% will fire constantly during minor blips. An alert based on a multi-window burn rate (Google's SRE model) only fires when you're burning through your error budget fast enough to exhaust it within a prediction window — dramatically reducing noise.
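The burn-rate factor is simple arithmetic. A sketch of where a number like 14.4 comes from, following the multi-window model in the Google SRE Workbook:

```go
package main

import "fmt"

// burnRate returns the multiple of steady-state budget burn at which
// budgetFraction of an sloWindowDays error budget would be consumed
// within alertWindowHours.
func burnRate(budgetFraction, sloWindowDays, alertWindowHours float64) float64 {
	return budgetFraction * (sloWindowDays * 24 / alertWindowHours)
}

func main() {
	// Consuming 2% of a 30-day budget within 1 hour gives the 14.4x factor.
	rate := burnRate(0.02, 30, 1)
	fmt.Println(rate) // 14.4
	// For a 99.9% SLO the alert threshold is rate * (1 - SLO):
	fmt.Printf("%.4f\n", rate*0.001) // 0.0144
}
```

That final product, 14.4 * 0.001, is exactly the threshold used in the error-budget alert expression in the rules file in this section.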

payment-service-rules.yaml · YAML
# PrometheusRule custom resource — picked up automatically by the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-slo-rules
  namespace: production
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: payment_service_recording_rules
      interval: 1m
      rules:
        - record: job_endpoint:payment_service_http_requests_total:rate5m
          expr: |
            rate(payment_service_http_requests_total[5m])

        - record: job_endpoint:payment_service_error_ratio:rate5m
          expr: |
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total{status_code=~"5.."}[5m])
            )
            /
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total[5m])
            )

        - record: job_endpoint:payment_service_latency_p99:rate5m
          expr: |
            histogram_quantile(
              0.99,
              sum by (job, endpoint, le) (
                rate(payment_service_http_request_duration_seconds_bucket[5m])
              )
            )

    - name: payment_service_alerts
      rules:
        - alert: PaymentServiceHighErrorBurnRate
          expr: |
            job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            team: payments
            runbook_url: https://wiki.company.com/runbooks/payment-service-errors
          annotations:
            summary: "Payment service burning error budget at critical rate"
            description: |
              Endpoint {{ $labels.endpoint }} error ratio is {{ $value | humanizePercentage }}.

        - alert: PaymentServiceHighLatency
          expr: |
            job_endpoint:payment_service_latency_p99:rate5m > 0.5
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment service p99 latency exceeds SLO threshold"

        - alert: PaymentQueueConsumerDead
          expr: |
            payment_service_queue_depth == 0
            and
            sum(job_endpoint:payment_service_http_requests_total:rate5m) > 10
          for: 3m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment queue depth is zero but traffic is flowing"
▶ Output
# PrometheusRule applied. Operator picks it up via label matching.
Mental Model
Why 'for: 2m' on Alerts Matters
Set for to at least 2-3x your scrape_interval. Critical page alerts: for: 2m. Ticket-level alerts: for: 15m.
  • Without for: alert fires on first bad scrape. Noisy.
  • With for: 2m: alert must be consistently bad for 2 minutes before firing.
  • PENDING state: visible in Prometheus UI but does not send to Alertmanager.
  • FIRING state: sent to Alertmanager for routing to PagerDuty/Slack.
  • The for duration should be >= 2x scrape_interval to avoid single-scrape flukes.
📊 Production Insight
Recording rules are not optional in production. Without them, every Grafana dashboard load re-executes expensive PromQL against raw data. With 200 pods and a 5-minute rate window, a single rate(http_requests_total[5m]) query scans millions of samples. Recording rules pre-compute this once per evaluation interval (1m) and store a cheap-to-read time series. The performance difference is 100x or more. Always create recording rules for any query used in more than one dashboard or alert.
🎯 Key Takeaway
Recording rules are the performance optimization for Prometheus. Pre-compute expensive queries. Use SLO burn rate alerts instead of raw thresholds. The for field prevents noise. Label matching on PrometheusRule CRs must exactly match the Prometheus CR's ruleSelector.

Alertmanager: Routing, Silencing, and Deduplication

Prometheus evaluates alert rules and sends firing alerts to Alertmanager. Alertmanager is a separate component responsible for deduplicating, grouping, routing, and silencing alerts before sending them to notification channels (PagerDuty, Slack, email, Opsgenie).

Alertmanager's routing tree is a hierarchical matching system. An alert's labels are matched against the route configuration. The first matching route determines the receiver. This means your alert labels (severity, team, service) must be carefully designed to match your routing tree.
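The first-match semantics can be sketched in a few lines of Go. Real Alertmanager routes form a tree and support a `continue` flag for multi-receiver delivery; this sketch flattens the model to a single level of exact-match routes for illustration:

```go
package main

import "fmt"

type route struct {
	match    map[string]string // label matchers that must all be equal
	receiver string
}

// firstMatch walks the routing table top-down and returns the receiver of
// the first route whose matchers are all satisfied, else the fallback.
func firstMatch(routes []route, labels map[string]string, fallback string) string {
	for _, r := range routes {
		ok := true
		for k, v := range r.match {
			if labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			return r.receiver
		}
	}
	return fallback
}

func main() {
	routes := []route{
		{map[string]string{"severity": "critical"}, "pagerduty-critical"},
		{map[string]string{"severity": "warning"}, "slack-warnings"},
		{map[string]string{"team": "payments"}, "slack-payments-team"},
	}
	// A critical payments alert hits the FIRST matching route: it pages
	// PagerDuty and never reaches the payments Slack route.
	alert := map[string]string{"severity": "critical", "team": "payments"}
	fmt.Println(firstMatch(routes, alert, "slack-catchall")) // pagerduty-critical
}
```

This is why route order matters: put the paging routes before the team routes, or team-matched alerts will never page.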

alertmanager-config.yaml · YAML
# Alertmanager configuration — deployed as a Secret in the monitoring namespace.
# The Prometheus Operator picks this up from the alertmanager.yaml key.
global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxx'

route:
  # Default receiver for alerts that don't match any sub-route.
  receiver: 'slack-catchall'
  group_by: ['alertname', 'namespace', 'endpoint']
  group_wait: 30s         # Wait 30s to group similar alerts.
  group_interval: 5m      # Send grouped updates every 5m.
  repeat_interval: 4h     # Re-send unresolved alerts every 4h.

  routes:
    # Critical alerts -> PagerDuty (page the on-call engineer).
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s       # Page faster for critical alerts.
      repeat_interval: 1h   # Re-page every hour if unresolved.

    # Warning alerts -> Slack channel (no page, just notification).
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 1m
      repeat_interval: 4h

    # Team-specific routing.
    - match:
        team: payments
      receiver: 'slack-payments-team'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
        severity: 'critical'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
        send_resolved: true

  - name: 'slack-payments-team'
    slack_configs:
      - channel: '#payments-alerts'
        title: '[Payments] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'

  - name: 'slack-catchall'
    slack_configs:
      - channel: '#alerts-catchall'
        title: 'Unrouted Alert: {{ .GroupLabels.alertname }}'

# Inhibition: suppress warning alerts when a critical alert is already firing
# for the same service. Prevents alert storms.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'namespace']
▶ Output
Alertmanager configured with PagerDuty for critical, Slack for warnings, and inhibition rules.
Mental Model
Alertmanager's Three Core Functions
Without inhibition rules, a cascading failure generates 50 alerts when 1 would suffice. Inhibition is the most underused Alertmanager feature.
  • Deduplication: Same alert fingerprint = one notification.
  • Grouping: group_by determines which alerts are batched together.
  • Inhibition: Higher-severity alerts suppress lower-severity alerts for the same context.
  • Silences: Temporary muting of alerts during maintenance windows.
  • Routing: Label matching determines which receiver (PagerDuty, Slack, email) gets the alert.
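The inhibition check itself is small. Here is a sketch reduced to the critical-suppresses-warning case from the config above; real Alertmanager accepts arbitrary source and target matchers, not just severity:

```go
package main

import "fmt"

type alert map[string]string

// inhibited reports whether a warning-severity target alert should be
// suppressed: some critical alert is firing and agrees with the target
// on every label listed in equal.
func inhibited(target alert, firing []alert, equal []string) bool {
	if target["severity"] != "warning" {
		return false
	}
	for _, src := range firing {
		if src["severity"] != "critical" {
			continue
		}
		match := true
		for _, l := range equal {
			if src[l] != target[l] {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	firing := []alert{{"alertname": "PaymentServiceHighErrorBurnRate", "namespace": "production", "severity": "critical"}}
	warning := alert{"alertname": "PaymentServiceHighErrorBurnRate", "namespace": "production", "severity": "warning"}
	// Same alertname and namespace as a firing critical: suppressed.
	fmt.Println(inhibited(warning, firing, []string{"alertname", "namespace"})) // true
}
```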
📊 Production Insight
The group_by field in Alertmanager controls notification batching. If you group by alertname only, all pods with the same alert are grouped into one notification — good for reducing noise. If you group by alertname, pod, each pod gets its own notification — bad during a cluster-wide outage where 200 pods trigger the same alert. The production default should be group_by: ['alertname', 'namespace'] to batch by service and namespace. Use group_wait: 30s to allow grouping before sending.
🎯 Key Takeaway
Alertmanager is the routing and deduplication layer between Prometheus and your notification channels. Inhibition rules prevent alert storms. The group_by field controls notification batching. Always configure inhibition rules to suppress lower-severity alerts when higher-severity alerts are already firing.

Prometheus Storage: TSDB Internals, Retention, and Thanos/Cortex for Long-Term

Prometheus stores metrics in its own time-series database (TSDB). Understanding TSDB internals is critical for capacity planning, retention tuning, and deciding when to add long-term storage.

TSDB stores data in blocks. Each block covers a 2-hour time range and contains a chunks directory (compressed metric samples) and an index. The head block is the in-memory write-ahead log (WAL) that receives all new samples. Every 2 hours, the head block is compacted into a persistent block and flushed to disk. Old blocks are compacted into larger blocks (e.g., 2h blocks into 1-day blocks) to reduce the number of files.

prometheus-retention-config.yaml · YAML
# Prometheus StatefulSet with retention and storage configuration.
# For the kube-prometheus-stack Helm chart, these go in prometheus.prometheusSpec.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  # Retention: how long to keep data locally.
  # 15d is typical. Longer retention = more disk and memory.
  retention: 15d

  # Retention by size: delete oldest blocks when storage exceeds this.
  # Use this as a safety net alongside time-based retention.
  retentionSize: 50GB

  # Storage: PVC for persistent TSDB blocks.
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi

  # Resources: Prometheus is memory-hungry.
  # Rule of thumb: ~16KB per active time series.
  # 1M series = 16GB RAM. Plan accordingly.
  resources:
    requests:
      cpu: '1'
      memory: 16Gi
    limits:
      cpu: '4'
      memory: 32Gi

  # External labels: applied to all metrics when using Thanos/Cortex.
  # Identifies which Prometheus instance scraped the data.
  externalLabels:
    cluster: production-us-east-1
    environment: production

  # Thanos sidecar: uploads blocks to object storage for long-term retention.
  thanos:
    objectStorageConfig:
      name: thanos-objstore-config
      key: objstore.yml

  # Sample limit: max series per scrape target, enforced by the operator
  # across all ServiceMonitors/PodMonitors.
  # Safety net against high-cardinality targets.
  enforcedSampleLimit: 100000

  # Note: serviceMonitorSelectorNilUsesHelmValues: false and
  # podMonitorSelectorNilUsesHelmValues: false are kube-prometheus-stack
  # Helm values (not CRD fields); set them in values.yaml so monitors
  # created outside the chart's release are also selected.
▶ Output
Prometheus configured with 15-day retention, 50GB size limit, fast SSD storage, and Thanos sidecar for long-term block upload to object storage.
Mental Model
When to Add Thanos or Cortex
Add Thanos when you need: cross-cluster query federation, retention beyond 30 days, or HA for Prometheus.
  • Prometheus: Single cluster, short-term (days to weeks). No built-in HA or cross-cluster query view.
  • Thanos: Sidecar uploads blocks to S3/GCS. Querier federates across Prometheus instances. Compactor reduces storage costs. Best for multi-cluster with object storage.
  • Cortex: Horizontally scalable multi-tenant Prometheus backend. Used by Grafana Cloud. Best for multi-tenant SaaS platforms.
  • VictoriaMetrics: Drop-in Prometheus replacement with better compression and lower resource usage. Best for single-cluster with high cardinality.
  • Decision: Use Thanos for multi-cluster with object storage. Use Cortex for multi-tenant SaaS. Use VictoriaMetrics for single-cluster resource optimization.
📊 Production Insight
Prometheus's memory usage is directly proportional to the number of active time series in the TSDB head block. Each active series consumes approximately 16KB of memory. If you have 2 million active series, you need approximately 32GB of RAM for the head block alone, plus overhead for queries and compaction. Monitor prometheus_tsdb_head_series and process_resident_memory_bytes. Set retention based on disk capacity: 15 days at 1M series with 15s scrape interval equals approximately 50GB of disk. Use fast SSDs for TSDB — network-attached storage introduces latency that slows compaction and can cause WAL corruption during power loss.
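The self-monitoring this implies can be sketched as a PrometheusRule (thresholds are illustrative; derive yours from the ~16KB-per-series rule of thumb and your memory limits):

```yaml
# PrometheusRule sketch: watch Prometheus's own cardinality and memory.
# Thresholds are illustrative — adjust to your capacity plan.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-self-monitoring
  namespace: monitoring
spec:
  groups:
    - name: prometheus-capacity
      rules:
        - alert: PrometheusHighSeriesCount
          expr: prometheus_tsdb_head_series > 1500000
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: 'Active series well above the 1M capacity plan.'
        - alert: PrometheusHighMemory
          expr: process_resident_memory_bytes{job="prometheus"} > 28e9
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: 'Prometheus RSS above 28GB, approaching the 32GB limit.'
```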
🎯 Key Takeaway
Prometheus TSDB is a block-based storage engine with an in-memory head block. Memory is proportional to active series count. Use fast SSDs for TSDB. Add Thanos for cross-cluster federation and long-term retention beyond local disk capacity. Plan capacity based on series count: 1M series equals approximately 16GB RAM and approximately 50GB disk for 15 days.
Prometheus Storage Decision Tree
  • If: Single cluster, retention under 15 days, under 5M active series → Use: Standalone Prometheus with local TSDB and PVC on fast SSD. No additional components needed.
  • If: Multiple clusters, need cross-cluster query federation → Use: Deploy Thanos Sidecar on each Prometheus. Add Thanos Querier for global view. Add Thanos Store Gateway for historical queries.
  • If: Retention beyond 30 days or more than 5M active series → Use: Add Thanos Sidecar with object storage (S3/GCS). Add Thanos Compactor to reduce storage costs. Keep local retention short (7d) and rely on object storage for long-term.
  • If: Multi-tenant SaaS platform with per-customer isolation → Use: Cortex or Grafana Mimir for horizontal scalability and per-tenant resource limits.
  • If: Single cluster with extreme cardinality (10M+ series) → Use: Consider VictoriaMetrics as a drop-in replacement. It offers better compression (10x) and lower memory usage than Prometheus TSDB.
🗂 Prometheus Metric Types and Storage Options
Understanding metric types, aggregation capabilities, and long-term storage trade-offs.
Aspect | Histogram | Summary | Thanos | Cortex | VictoriaMetrics
Purpose | Bucketed latency/size observations | Client-side quantile computation | Cross-cluster federation + long-term storage | Multi-tenant Prometheus backend | Drop-in Prometheus replacement
Quantile calculation | Server-side (histogram_quantile()) | Client-side (at instrumentation time) | N/A | N/A | N/A
Aggregatable across replicas | Yes — sum buckets before quantile | No — pre-computed quantiles cannot be averaged | Yes — via Thanos Querier global view | Yes — native horizontal aggregation | Yes — native horizontal aggregation
Best for Kubernetes | Yes — multi-replica needs server-side aggregation | Only single-instance with exact quantiles | Multi-cluster with S3/GCS object storage | Multi-tenant SaaS platforms | Single-cluster with extreme cardinality
Retention model | N/A | N/A | Unlimited (blocks in S3/GCS) | Unlimited (horizontally scalable) | Unlimited (highly compressed local TSDB)
HA support | N/A | N/A | Yes — Querier deduplicates across replicas | Yes — native replication | Yes — cluster mode
Compression ratio | N/A | N/A | Same as Prometheus (inherits TSDB blocks) | Same as Prometheus | Up to 10x better than Prometheus
Query compatibility | N/A | N/A | 100% PromQL compatible | 100% PromQL compatible | 100% PromQL compatible
Operational complexity | N/A | N/A | Medium — Sidecar + Querier + Store Gateway + Compactor | High — Ingester + Distributor + Querier + Compactor | Low — single binary or cluster mode
SLO burn rate support | Excellent — rate() on _count and _bucket | Difficult — quantile series not rate()-able | N/A | N/A | N/A
Memory overhead (client) | Low — O(number of buckets) | Higher — sliding window per quantile | N/A | N/A | N/A
Bucket boundary changes | Requires app restart | Requires app restart (objective changes) | N/A | N/A | N/A

🎯 Key Takeaways

  • Prometheus is pull-based with Kubernetes-native service discovery. Annotation-driven scraping means enabling monitoring requires no config changes on the Prometheus side.
  • Instrument with Counters (totals), Gauges (current state), and Histograms (latency). Never use Summary in multi-replica deployments.
  • Recording rules pre-compute expensive PromQL into cheap time series. They are not optional in production — they are the performance optimization.
  • Alertmanager handles deduplication, grouping, routing, and inhibition. Inhibition rules are the most underused feature and the key to preventing alert storms.
  • Prometheus TSDB memory is proportional to active series count. High-cardinality labels will OOM Prometheus. Monitor and limit cardinality aggressively.
  • Prometheus is designed for short-term, per-cluster monitoring. Add Thanos for cross-cluster federation and long-term retention. Add Cortex for multi-tenant SaaS.

⚠ Common Mistakes to Avoid

    Adding high-cardinality labels (user_id, request_id, trace_id) to metrics. Each unique label value creates a new time series. Unbounded values will OOM Prometheus.
    Using Summary instead of Histogram in multi-replica deployments. Summaries compute quantiles client-side and cannot be aggregated across instances in PromQL.
    Not setting `prometheus.io/port` on multi-container pods. Prometheus scrapes the first container port it finds, which may not be the metrics port.
    Writing recording rules without the `level:metric:operations` naming convention. Inconsistent names make rules hard to discover and maintain.
    Setting alert `for` duration too short (or zero). Single-scrape blips trigger false pages. Set `for` to at least 2-3x scrape_interval.
    Not configuring Alertmanager inhibition rules. Without inhibition, a cascading failure generates 50 alerts when 1 would suffice.
    Not monitoring Prometheus itself. Prometheus is infrastructure. Monitor `prometheus_tsdb_head_series`, memory usage, scrape duration, and WAL size.
    Using default histogram buckets (`DefBuckets`) for latency metrics. Default buckets go up to 10s — too coarse for APIs with SLOs under 500ms.
    Not pre-computing expensive queries with recording rules. Every dashboard load re-executes raw PromQL, causing query timeouts in large clusters.
    Ignoring NetworkPolicies for Prometheus scraping. A deny-all NetworkPolicy silently breaks Prometheus with no useful error message.
    Not setting `sample_limit` on scrape configs. A single misbehaving target can expose millions of series and crash Prometheus.
    Running Prometheus with default retention (15d) without capacity planning. TSDB grows linearly with active series count. Monitor disk usage.
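Several of the mistakes above have one-line fixes in the scrape config; as a sketch (job name and limit are illustrative):

```yaml
# Scrape-config sketch: cap series per target so one misbehaving pod
# cannot flood the TSDB. When the limit is exceeded, the scrape fails
# and the target shows as DOWN — loud, but survivable.
scrape_configs:
  - job_name: 'kubernetes-pods'
    sample_limit: 10000
    kubernetes_sd_configs:
      - role: pod
```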

Interview Questions on This Topic

  • QExplain how Prometheus discovers targets in Kubernetes. What is the role of relabel_configs?
  • QWhat is the difference between a Counter and a Gauge? When would you use each?
  • QWhy are Histograms preferred over Summaries in Kubernetes deployments with multiple replicas?
  • QExplain histogram bucket design. Why is it a 'one-way door' and how does it affect SLO tracking?
  • QWhat are recording rules and why are they important for production Prometheus deployments?
  • QHow does the for field in alert rules prevent false positives? What is the recommended value?
  • QExplain Alertmanager's inhibition rules. How do they prevent alert storms during cascading failures?
  • QWhat is the relationship between Prometheus TSDB head block size and memory usage? How do you plan capacity?
  • QWhen would you add Thanos or Cortex to your monitoring stack? What problem does each solve?
  • QHow do you debug a target showing as DOWN on the Prometheus /targets page?
  • QWhat is the __address__ rewrite trap in multi-container pods? How do you avoid it?
  • QExplain SLO burn rate alerts. How do they differ from raw threshold alerts?
  • QHow do NetworkPolicies affect Prometheus scraping? How do you debug silent scrape failures?
  • QWhat is the sample_limit field in scrape configs? When is it useful?
  • QDescribe the Prometheus TSDB block lifecycle: WAL, head block, compaction, and retention.

Frequently Asked Questions

How does Prometheus discover pods in Kubernetes?

Prometheus uses kubernetes_sd_configs to query the Kubernetes API server for pods, services, endpoints, nodes, and ingresses. It authenticates using a ServiceAccount token. Relabeling configs filter and transform discovered targets before scraping. The convention prometheus.io/scrape: 'true' annotation opts pods into scraping.
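A condensed sketch of that annotation convention (these relabel rules are the widely used community pattern, not the only valid one):

```yaml
# Annotation-driven pod discovery sketch.
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod          # discover every pod via the Kubernetes API
    relabel_configs:
      # Keep only pods that opt in with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Rewrite the scrape address to the annotated metrics port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
```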

What is the difference between a Counter and a Gauge?

A Counter only goes up (resets to zero on restart). Use it for totals like request count or error count. Use rate() or increase() to compute per-second rates. A Gauge can go up or down. Use it for current values like queue depth or active connections. Never use rate() on a Gauge.
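Sketched as recording rules (metric names are illustrative, not from any specific exporter):

```yaml
# Recording-rule sketch contrasting the two types.
groups:
  - name: counter-vs-gauge
    rules:
      # Counter: always derive a rate; the raw total is rarely useful.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Gauge: aggregate the current value directly; never rate() it.
      - record: job:queue_depth:sum
        expr: sum by (job) (queue_depth)
```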

Why should I use Histograms instead of Summaries in Kubernetes?

Histograms pre-aggregate observations into buckets on the client, then histogram_quantile() computes quantiles on the server at query time. This allows aggregation across multiple pod replicas. Summaries compute quantiles on the client and cannot be meaningfully aggregated across instances. In Kubernetes with multiple replicas, Histograms are almost always the correct choice.
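The server-side aggregation can be sketched as a recording rule (metric name is illustrative): sum the bucket rates across replicas by le, then apply histogram_quantile():

```yaml
# Recording-rule sketch: p99 latency aggregated across all replicas.
# This only works with Histograms — Summary quantiles cannot be summed.
groups:
  - name: latency
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: >
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```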

How do recording rules improve Prometheus performance?

Recording rules pre-compute expensive PromQL expressions (like rate() over 5 minutes) and store the result as a new time series. Dashboards and alerts query this pre-computed series instead of re-scanning raw data. The performance improvement is 100x or more for queries across high-cardinality data.

What is the `for` field in alert rules?

The for field makes an alert go through a PENDING state before reaching FIRING. The alert must be consistently firing for the for duration before it is sent to Alertmanager. This prevents single-scrape blips from triggering pages. Recommended: for: 2m for critical alerts, for: 15m for warning alerts.
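As a sketch (metric name and threshold are illustrative):

```yaml
# Alert-rule sketch: `for: 2m` keeps the alert in PENDING until the
# condition has held for two minutes, filtering single-scrape blips.
groups:
  - name: error-rate
    rules:
      - alert: HighErrorRate
        expr: >
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Error rate above 5% for 2 minutes.'
```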

When do I need Thanos or Cortex?

Add Thanos when you need cross-cluster query federation, retention beyond local disk capacity (weeks to years), or Prometheus HA. Add Cortex when you need a horizontally scalable multi-tenant Prometheus backend. For single-cluster with high cardinality, consider VictoriaMetrics as a drop-in replacement.

How do I prevent Prometheus from being OOMKilled?

Monitor prometheus_tsdb_head_series — each active series consumes ~16KB of RAM. Never add unbounded values (user_id, request_id) as label values. Set sample_limit on scrape configs to cap series per target. Set memory requests/limits based on expected series count (1M series = ~16GB). Alert on memory usage before OOMKill.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
