Kubernetes Monitoring with Prometheus — Deep Dive for Production
Running Kubernetes in production without monitoring is like flying a commercial aircraft with the instrument panel blacked out. Everything might feel fine until it catastrophically isn't. CNCF surveys consistently rank Prometheus as the most widely adopted metrics system in Kubernetes production environments — not because it's the easiest tool, but because it's the most capable pull-based metrics system purpose-built for dynamic, containerized infrastructure. When pods scale up and down every minute, static monitoring configs break. Prometheus doesn't just survive this chaos — it thrives in it.
The real problem Prometheus solves is the ephemeral nature of Kubernetes workloads. Traditional monitoring tools expect your target IPs to stay fixed. In Kubernetes, a pod's IP changes every restart. Prometheus solves this with Kubernetes-native service discovery — it queries the Kubernetes API server directly to find what's alive right now, not what was alive when you wrote the config. Layer in custom application metrics, recording rules, and Alertmanager integration, and you have a complete observability stack that reacts to your cluster's actual state.
By the end of this article you'll be able to write production-grade scrape configurations with Kubernetes service discovery and relabeling, expose custom application metrics using the Prometheus Go client library, design efficient recording rules to avoid query-time explosions, and define SLO burn-rate alerts that the Prometheus Operator loads automatically via PrometheusRule resources. Let's build this system from the ground up.
How Prometheus Service Discovery Works Inside Kubernetes
Prometheus uses a pull model — it reaches out to targets and scrapes metrics endpoints, typically on path /metrics, at a configured interval. In a static world you'd list IPs. In Kubernetes, Prometheus uses kubernetes_sd_configs to query the Kubernetes API and discover pods, services, endpoints, nodes, and ingresses dynamically.
When Prometheus starts, it authenticates to the API server using a ServiceAccount token mounted in its pod. It then watches specific resource types. For the endpoints role, Prometheus discovers every Endpoints object across the cluster. For each endpoint address it finds, it creates a scrape target. The magic happens during relabeling — a pipeline that runs before the scrape and lets you filter, rename, and attach labels using values pulled directly from Kubernetes metadata (pod annotations, namespace labels, service names).
The annotation prometheus.io/scrape: 'true' is a community convention that Prometheus relabeling configs check. If the annotation exists and is true, the pod is scraped. This means enabling monitoring for a new application is as simple as adding three lines to its pod spec — no Prometheus config reload needed. Prometheus reconciles new targets automatically every scrape_interval.
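Opting an application in is then purely declarative. A minimal pod-template fragment showing the three community-convention annotations (the port and path values here are illustrative):

```yaml
# Pod template fragment: the three annotations the relabeling pipeline reads.
metadata:
  annotations:
    prometheus.io/scrape: "true"    # opt this pod in to discovery
    prometheus.io/port: "8080"      # which container port serves metrics
    prometheus.io/path: "/metrics"  # optional; /metrics is the default
```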
Understanding the target lifecycle is critical for production. Targets move through three states: unknown, up, and down. A newly discovered target is unknown until its first scrape attempt completes. It goes down when a scrape fails: connection refused, a timeout, a non-2xx HTTP response, or an unparsable metrics body. It is up when the scrape succeeds. When a target disappears from service discovery, Prometheus injects staleness markers for its series — this prevents old time series from polluting range queries.
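Target health is itself queryable: every scrape attempt records the synthetic up series (1 on success, 0 on failure), so the lifecycle above can drive dashboards and alerts directly. Two illustrative queries, assuming the kubernetes-pods job defined in the scrape config that follows:

```promql
# 1 if the last scrape of each target succeeded, 0 if it failed
up{job="kubernetes-pods"}

# Fraction of discovered targets currently failing their scrapes
1 - avg(up{job="kubernetes-pods"})
```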
```yaml
# Core Prometheus scrape configuration for Kubernetes pod discovery.
# It lives inside your prometheus.yml. (With the Prometheus Operator, scrape
# config is usually generated from ServiceMonitor/PodMonitor resources instead;
# PrometheusRule CRs hold recording and alerting rules, not scrape config.)
scrape_configs:
  - job_name: 'kubernetes-pods'
    # Prometheus will query the Kubernetes API server to find all pods.
    kubernetes_sd_configs:
      - role: pod
        # Restrict discovery to specific namespaces for security isolation.
        # Remove this block to discover pods cluster-wide.
        namespaces:
          names:
            - production
            - staging
    # relabel_configs runs BEFORE each scrape — it filters and transforms targets.
    relabel_configs:
      # STEP 1: Only scrape pods that explicitly opt in via annotation.
      # If prometheus.io/scrape is not 'true', drop this target entirely.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # STEP 2: Allow pods to declare a custom metrics path (default is /metrics).
      # e.g. annotation: prometheus.io/path: '/actuator/prometheus'
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # STEP 3: Allow pods to declare a custom port for scraping.
      # e.g. annotation: prometheus.io/port: '8080'
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        # Regex captures the IP from __address__ and combines it with the annotation port.
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # STEP 4: Carry the namespace as a label so dashboards can filter by it.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # STEP 5: Carry the pod name as a label for drill-down in Grafana.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # STEP 6: Carry the app label from the pod so we can group by application.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
      # STEP 7: Never create targets for init containers — their ports are
      # short-lived and would show up as permanently-down targets. (Relabeling
      # regexes are RE2, which has no backreferences, so you cannot compare two
      # labels for equality here, e.g. host IP vs. pod IP for hostNetwork pods.)
      - source_labels: [__meta_kubernetes_pod_container_init]
        action: drop
        regex: 'true'
```
kubernetes-pods (42 / 42 up)
Target State Labels
http://10.0.1.15:8080/metrics UP app="payment-service", kubernetes_namespace="production"
http://10.0.1.22:8080/metrics UP app="user-service", kubernetes_namespace="production"
http://10.0.2.10:9090/actuator/prometheus UP app="order-service", kubernetes_namespace="staging"
...
# Targets without the annotation prometheus.io/scrape='true' are silently dropped
# and never appear in this list — exactly what we want.
Exposing Custom Application Metrics with the Prometheus Client Libraries
Kubernetes infrastructure metrics (CPU, memory, network) come from kube-state-metrics and node-exporter. But the metrics that make or break your SLOs are application-level: request latency, error rates, queue depth, cache hit ratio. These come from instrumenting your own code.
Prometheus has four core metric types you need to understand at the semantic level, not just the API level:
Counter — a value that only goes up (resets to zero on restart). Use it for total requests, total errors, total bytes sent. Never use a counter for something that can decrease. PromQL's rate() and increase() functions unwrap counters properly, handling resets.
Gauge — a value that can go up or down. Use it for current queue depth, active connections, temperature, memory usage. Don't use rate() on a gauge — it's meaningless for values that can decrease; use delta() or deriv() to measure how a gauge changes over time.
Histogram — pre-aggregated bucketed observations. Use it for latency and request size. It exposes one _bucket time series per boundary, plus _sum and _count. The bucket boundaries you choose at instrumentation time are effectively permanent — changing them requires a code change and redeploy, and breaks continuity for queries that span the change.
Summary — client-side computed quantiles. Use it only when you need accurate quantiles and can't aggregate across instances (summaries can't be aggregated in PromQL). In Kubernetes with multiple replicas, histograms are almost always the right choice over summaries.
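The type you pick dictates the PromQL you can meaningfully run later. A few illustrative expressions (the metric names here are hypothetical placeholders):

```promql
# Counter: per-second request rate over the last 5 minutes (rate() handles resets)
rate(http_requests_total[5m])

# Gauge: read the current value directly, or smooth it over time
avg_over_time(queue_depth[10m])

# Histogram: p99 latency from the _bucket series, aggregated across replicas
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```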
The example below shows a production-grade Go HTTP service instrumented with counters, gauges, and a histogram, serving them on a dedicated /metrics endpoint. A summary is deliberately omitted; the comparison table later in the article explains why histograms are the better fit here.
```go
package main

import (
	"math/rand"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// --- Metric Declarations ---
// promauto.New* registers metrics automatically with the default registry.
// Always declare metrics at package level — not inside handlers — to avoid
// re-registering on every request (which panics in production).
var (
	// Counter: tracks total HTTP requests. Labels let us slice by method and status code.
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "payment_service_http_requests_total",
			Help: "Total number of HTTP requests processed by the payment service.",
		},
		[]string{"method", "endpoint", "status_code"},
	)

	// Histogram: tracks request duration. Bucket boundaries are chosen to match
	// our SLO thresholds: 95th percentile < 200ms, 99th percentile < 500ms.
	// DefBuckets (the default) go up to 10s, which is often too coarse for APIs.
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "payment_service_http_request_duration_seconds",
			Help: "HTTP request duration in seconds, bucketed by endpoint.",
			// Custom buckets aligned to SLO boundaries — critical in production.
			Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5},
		},
		[]string{"method", "endpoint"},
	)

	// Gauge: tracks the number of in-flight (currently processing) requests.
	// This helps detect thread pool saturation before latency spikes show up.
	inFlightRequests = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_in_flight_requests",
			Help: "Number of HTTP requests currently being processed.",
		},
	)

	// Gauge: tracks the depth of the async payment processing queue.
	paymentQueueDepth = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_queue_depth",
			Help: "Current number of payments waiting in the processing queue.",
		},
	)
)

// instrumentedHandler wraps an HTTP handler with Prometheus instrumentation.
// It measures latency, increments the request counter, and tracks in-flight requests.
func instrumentedHandler(endpoint string, handlerFunc http.HandlerFunc) http.HandlerFunc {
	return func(responseWriter http.ResponseWriter, request *http.Request) {
		// Record that one more request is in flight.
		inFlightRequests.Inc()
		// Use defer so this always runs even if the handler panics.
		defer inFlightRequests.Dec()

		// Start the latency timer.
		startTime := time.Now()

		// Wrap the ResponseWriter to capture the status code.
		wrappedWriter := &statusCapturingWriter{ResponseWriter: responseWriter, statusCode: http.StatusOK}

		// Call the actual business logic handler.
		handlerFunc(wrappedWriter, request)

		// Record latency in the histogram — this updates _bucket, _sum, and _count.
		durationSeconds := time.Since(startTime).Seconds()
		httpRequestDuration.WithLabelValues(request.Method, endpoint).Observe(durationSeconds)

		// Increment the total requests counter with the final status code.
		httpRequestsTotal.WithLabelValues(
			request.Method, endpoint, strconv.Itoa(wrappedWriter.statusCode),
		).Inc()
	}
}

// statusCapturingWriter is a thin wrapper around http.ResponseWriter that
// intercepts WriteHeader so we can record the HTTP status code as a label.
type statusCapturingWriter struct {
	http.ResponseWriter
	statusCode int
}

func (scw *statusCapturingWriter) WriteHeader(code int) {
	scw.statusCode = code
	scw.ResponseWriter.WriteHeader(code)
}

// processPayment simulates payment processing with realistic latency variance.
func processPayment(responseWriter http.ResponseWriter, request *http.Request) {
	// Simulate variable processing time between 5ms and 300ms.
	processingTime := time.Duration(5+rand.Intn(295)) * time.Millisecond
	time.Sleep(processingTime)

	// Simulate a 2% error rate — realistic for downstream dependency failures.
	if rand.Float64() < 0.02 {
		http.Error(responseWriter, "upstream payment gateway timeout", http.StatusGatewayTimeout)
		return
	}

	responseWriter.WriteHeader(http.StatusOK)
	responseWriter.Write([]byte(`{"status":"processed"}`))
}

func main() {
	// Simulate queue depth fluctuations in a background goroutine.
	// In production this would read from your actual queue (Kafka, SQS, etc.).
	go func() {
		for {
			// Randomly vary queue depth between 0 and 500 items.
			paymentQueueDepth.Set(float64(rand.Intn(500)))
			time.Sleep(5 * time.Second)
		}
	}()

	// Serve /metrics on its own mux and port so network policies can restrict
	// scrape access without blocking API traffic. (A separate mux is required:
	// registering everything on the DefaultServeMux would expose /metrics and
	// the API on both ports.)
	// NEVER put /metrics behind authentication that Prometheus can't pass.
	metricsMux := http.NewServeMux()
	metricsMux.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":9091", metricsMux) // metrics port

	// Serve the instrumented business API on its own mux.
	apiMux := http.NewServeMux()
	apiMux.HandleFunc("/api/v1/payments", instrumentedHandler("/api/v1/payments", processPayment))
	http.ListenAndServe(":8080", apiMux) // API port
}
```
# HELP payment_service_http_requests_total Total number of HTTP requests processed by the payment service.
# TYPE payment_service_http_requests_total counter
payment_service_http_requests_total{endpoint="/api/v1/payments",method="POST",status_code="200"} 4821
payment_service_http_requests_total{endpoint="/api/v1/payments",method="POST",status_code="504"} 97
# HELP payment_service_http_request_duration_seconds HTTP request duration in seconds, bucketed by endpoint.
# TYPE payment_service_http_request_duration_seconds histogram
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.005"} 0
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.01"} 142
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.1"} 2103
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.2"} 3890
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.5"} 4750
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="+Inf"} 4918
payment_service_http_request_duration_seconds_sum{endpoint="/api/v1/payments",method="POST"} 743.21
payment_service_http_request_duration_seconds_count{endpoint="/api/v1/payments",method="POST"} 4918
# HELP payment_service_in_flight_requests Number of HTTP requests currently being processed.
# TYPE payment_service_in_flight_requests gauge
payment_service_in_flight_requests 12
# HELP payment_service_queue_depth Current number of payments waiting in the processing queue.
# TYPE payment_service_queue_depth gauge
payment_service_queue_depth 247
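Wiring this service into the discovery config from earlier is purely a matter of pod-template annotations pointing at the dedicated metrics port. A sketch of the Deployment (the name, image, and replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service           # becomes the `app` target label via STEP 6
      annotations:
        prometheus.io/scrape: "true"   # opt in to the kubernetes-pods job
        prometheus.io/port: "9091"     # scrape the dedicated metrics port
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:1.4.2  # illustrative
          ports:
            - containerPort: 8080  # API traffic
            - containerPort: 9091  # /metrics only
```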
Production-Grade Recording Rules and Alerting That Won't Page You at 3am
Raw PromQL queries against high-cardinality data are expensive. A query like rate(http_requests_total[5m]) across 200 pods runs every time a dashboard loads. In large clusters, this causes Prometheus to churn through millions of samples per query, leading to query timeouts and the dreaded 'query timed out in expression evaluation' error.
Recording rules solve this by pre-computing expensive expressions and storing the result as a new time series. Prometheus evaluates recording rules on its evaluation interval (typically 1m), writes the result into its TSDB, and future queries read that cheap pre-computed series instead of re-scanning the raw data.
Naming matters. The Prometheus community convention for recording rule names is level:metric:operations. For example job:http_requests_total:rate5m means: aggregated at the job level, derived from http_requests_total, computed as a 5-minute rate. Sticking to this convention makes rules self-documenting and searchable.
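Put together, the naming convention plus a recording rule turns an expensive dashboard query into a cheap series lookup. An illustrative before/after, using the payment-service histogram:

```promql
# Before: every dashboard refresh re-aggregates raw bucket series from all pods
histogram_quantile(0.99,
  sum by (endpoint, le) (rate(payment_service_http_request_duration_seconds_bucket[5m])))

# After: one pre-computed series per endpoint, named per the convention
job_endpoint:payment_service_latency_p99:rate5m
```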
Alerts in Prometheus are defined in the same YAML format as recording rules. The critical production insight is that alerts should express SLO burn rates, not raw thresholds. An alert that fires when error rate > 1% will fire constantly during minor blips. An alert based on a multi-window burn rate (Google's SRE model) only fires when you're burning through your error budget fast enough to exhaust it within a prediction window — dramatically reducing noise.
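Google's multi-window form combines a long window (proof the burn is sustained) with a short window (proof it is still happening) and fires only when both exceed the burn-rate threshold. A sketch for a 30-day 99.9% SLO, where a 14.4x burn rate exhausts the budget in roughly two days:

```promql
# Fast-burn page: error ratio >= 14.4x the 0.1% budget in BOTH windows.
(
  sum(rate(payment_service_http_requests_total{status_code=~"5.."}[1h]))
    / sum(rate(payment_service_http_requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(payment_service_http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(payment_service_http_requests_total[5m]))
) > (14.4 * 0.001)
```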
```yaml
# This PrometheusRule custom resource is picked up automatically by the
# Prometheus Operator. No Prometheus restart needed — the operator reconciles it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-slo-rules
  namespace: production
  labels:
    # These labels must match the ruleSelector in your Prometheus CR.
    # If they are missing, the Prometheus Operator silently ignores the resource.
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    # ─────────────────────────────────────────────
    # GROUP 1: Recording rules — pre-compute expensive queries
    # ─────────────────────────────────────────────
    - name: payment_service_recording_rules
      # Evaluate every 1 minute — should be >= scrape_interval.
      interval: 1m
      rules:
        # Pre-compute the 5-minute request rate per job and endpoint.
        # Queries against this are far cheaper than querying the raw counter.
        - record: job_endpoint:payment_service_http_requests_total:rate5m
          expr: |
            rate(payment_service_http_requests_total[5m])

        # Pre-compute the 5xx error ratio per endpoint.
        # This is the core signal for SLO tracking.
        - record: job_endpoint:payment_service_error_ratio:rate5m
          expr: |
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total{status_code=~"5.."}[5m])
            )
            /
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total[5m])
            )

        # Pre-compute p99 latency from the histogram.
        # histogram_quantile needs the _bucket series — this is why we use histograms.
        - record: job_endpoint:payment_service_latency_p99:rate5m
          expr: |
            histogram_quantile(
              0.99,
              sum by (job, endpoint, le) (
                rate(payment_service_http_request_duration_seconds_bucket[5m])
              )
            )

    # ─────────────────────────────────────────────
    # GROUP 2: Alerts — fire on SLO burn rate, not raw thresholds
    # ─────────────────────────────────────────────
    - name: payment_service_alerts
      rules:
        # ALERT 1: High-burn-rate alert (fast burn — page immediately).
        # Fires when the 5-minute error ratio is 14.4x the SLO budget.
        # At that rate a 30-day 99.9% SLO budget is exhausted in roughly
        # two days (30d / 14.4 ≈ 50h).
        # 'for: 2m' prevents single-scrape blips from firing pages.
        - alert: PaymentServiceHighErrorBurnRate
          expr: |
            job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            team: payments
          annotations:
            # runbook_url is conventionally an annotation (not a label) —
            # Alertmanager can attach it to the PagerDuty payload.
            runbook_url: https://wiki.company.com/runbooks/payment-service-errors
            summary: "Payment service burning error budget at critical rate"
            description: |
              Endpoint {{ $labels.endpoint }} error ratio is {{ $value | humanizePercentage }},
              14.4x above the 0.1% SLO target. At this rate the monthly error budget will be
              exhausted within about two days. Check downstream payment gateway connectivity
              and database connection pool saturation.

        # ALERT 2: Latency SLO breach.
        # Fires if p99 latency exceeds 500ms for 5 consecutive minutes.
        - alert: PaymentServiceHighLatency
          expr: |
            job_endpoint:payment_service_latency_p99:rate5m > 0.5
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment service p99 latency exceeds SLO threshold"
            description: |
              p99 latency for {{ $labels.endpoint }} is {{ $value | humanizeDuration }},
              exceeding the 500ms SLO threshold. Current queue depth:
              {{ with query "payment_service_queue_depth" }}{{ . | first | value | humanize }}{{ end }} items.

        # ALERT 3: Dead queue — queue depth at zero while requests are incoming.
        # This catches silent consumer crashes before users notice.
        # 'and on()' is required: the right-hand sum() has no labels, so a plain
        # 'and' would never match the labeled queue-depth series.
        - alert: PaymentQueueConsumerDead
          expr: |
            payment_service_queue_depth == 0
            and on()
            sum(job_endpoint:payment_service_http_requests_total:rate5m) > 10
          for: 3m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment queue depth is zero but traffic is flowing — consumer likely crashed"
```
prometheusrule.monitoring.coreos.com/payment-service-slo-rules created
# Verify the Prometheus Operator picked up the rules (check operator logs):
# kubectl logs -n monitoring deploy/prometheus-operator | grep payment-service
level=info msg="found PrometheusRule" name=payment-service-slo-rules namespace=production
level=info msg="rule group loaded" group=payment_service_recording_rules rules=3
level=info msg="rule group loaded" group=payment_service_alerts rules=3
# In the Prometheus UI at /rules you'll now see:
Group: payment_service_recording_rules [3 rules, last evaluated 0.84s ago]
job_endpoint:payment_service_http_requests_total:rate5m OK
job_endpoint:payment_service_error_ratio:rate5m OK
job_endpoint:payment_service_latency_p99:rate5m OK
Group: payment_service_alerts [3 rules]
PaymentServiceHighErrorBurnRate INACTIVE
PaymentServiceHighLatency INACTIVE
PaymentQueueConsumerDead INACTIVE
# When an alert fires (e.g. during a payment gateway outage):
# At /alerts in Prometheus UI:
PaymentServiceHighErrorBurnRate FIRING
Labels: endpoint="/api/v1/payments", severity="critical", team="payments"
Value: 0.0187 (18.7x the budget — critical burn rate)
Active: 2m 15s
| Aspect | Prometheus Histogram | Prometheus Summary |
|---|---|---|
| Quantile calculation location | Prometheus server (at query time via histogram_quantile()) | Client library (at instrumentation time) |
| Aggregatable across replicas? | Yes — sum buckets across instances before calling histogram_quantile() | No — pre-computed quantiles cannot be meaningfully averaged |
| Bucket boundary changes | Requires app restart — boundaries are fixed at init | Requires app restart — quantile objectives are fixed at init |
| Accuracy | Approximate — depends on bucket granularity | Configurable accuracy with error bounds (e.g. 0.01 error on 0.99 quantile) |
| Query cost | Higher — scans all bucket time series | Lower — quantile already computed client-side |
| Best for Kubernetes? | Yes — multiple pod replicas need server-side aggregation | Only if single-instance and exact quantiles are mandatory |
| Memory overhead (client) | Low — O(number of buckets) | Higher — sliding time window maintained per quantile |
| SLO burn rate calculations | Excellent — rate() on _count and _bucket work perfectly | Difficult — _count works but quantile series isn't rate()-able |
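The second row of the table is the one that decides most Kubernetes deployments, and it is worth seeing concretely. Below is a toy sketch (not Prometheus's actual implementation, and the counts are made up) showing why histogram buckets aggregate cleanly across replicas: cumulative bucket counts from two pods can be summed element-wise and then interpolated, which is exactly the idea behind server-side histogram_quantile(). Two pre-computed summary quantiles, by contrast, cannot be meaningfully combined.

```go
package main

import "fmt"

// quantileFromBuckets linearly interpolates a quantile value from cumulative
// bucket counts, the same basic idea histogram_quantile() uses server-side.
// bounds are the bucket upper limits ("le"); cumulative has one extra entry
// for the implicit +Inf bucket.
func quantileFromBuckets(bounds, cumulative []float64, rank float64) float64 {
	prevBound, prevCount := 0.0, 0.0
	for i, bound := range bounds {
		if cumulative[i] >= rank {
			// Interpolate between the bucket's lower and upper bound.
			inBucket := cumulative[i] - prevCount
			return prevBound + (bound-prevBound)*((rank-prevCount)/inBucket)
		}
		prevBound, prevCount = bound, cumulative[i]
	}
	return bounds[len(bounds)-1] // rank falls in the +Inf bucket: clamp
}

func main() {
	// Shared bucket boundaries (seconds), fixed at instrumentation time.
	bounds := []float64{0.05, 0.1, 0.2, 0.5}
	// Cumulative observation counts from two replicas, including +Inf.
	replicaA := []float64{40, 70, 95, 100, 100}
	replicaB := []float64{10, 30, 60, 90, 100}

	// Step 1: sum buckets element-wise. Valid because each bucket is itself
	// a cumulative counter; summary quantiles have no such property.
	merged := make([]float64, len(replicaA))
	for i := range replicaA {
		merged[i] = replicaA[i] + replicaB[i]
	}

	// Step 2: interpolate the cluster-wide p90 from the merged buckets.
	total := merged[len(merged)-1]
	rank := 0.9 * total
	fmt.Printf("approximate cluster-wide p90: %.3fs\n",
		quantileFromBuckets(bounds, merged, rank))
	// prints: approximate cluster-wide p90: 0.414s
}
```

Note that the result is approximate: accuracy depends entirely on how fine the bucket boundaries are around the quantile of interest, which is why the article aligns buckets with SLO thresholds.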
🎯 Key Takeaways
Use kubernetes_sd_configs plus relabeling so your targets track the cluster's actual state, and let pods opt in with the prometheus.io/scrape annotation.
Instrument applications with counters, gauges, and histograms; with multiple replicas, prefer histograms over summaries so quantiles can be aggregated server-side.
Pre-compute expensive PromQL with recording rules named level:metric:operations, delivered as PrometheusRule resources the operator reconciles without restarts.
Alert on SLO burn rates rather than raw thresholds so pages fire only when the error budget is genuinely at risk.