Kubernetes Monitoring with Prometheus — Deep Dive for Production
Running Kubernetes in production without monitoring is like flying a commercial aircraft with the instrument panel blacked out. Everything might feel fine until it catastrophically isn't. CNCF surveys consistently rank Prometheus as the most widely adopted metrics system in Kubernetes production environments — not because it's the easiest tool, but because it's the most capable pull-based metrics system purpose-built for dynamic, containerized infrastructure. When pods scale up and down every minute, static monitoring configs break. Prometheus doesn't just survive this chaos — it thrives in it.
The real problem Prometheus solves is the ephemeral nature of Kubernetes workloads. Traditional monitoring tools expect your target IPs to stay fixed. In Kubernetes, a pod's IP changes every restart. Prometheus solves this with Kubernetes-native service discovery — it queries the Kubernetes API server directly to find what's alive right now, not what was alive when you wrote the config. Layer in custom application metrics, recording rules, and Alertmanager integration, and you have a complete observability stack that reacts to your cluster's actual state.
By the end of this article you'll be able to write production-grade scrape configurations with Kubernetes service discovery and relabeling, expose custom application metrics using the Prometheus Go client library, design efficient recording rules to avoid query-time explosions, and define SLO burn-rate alerts that the Prometheus Operator loads automatically via PrometheusRule resources. Let's build this system from the ground up.
How Prometheus Service Discovery Works Inside Kubernetes
Prometheus uses a pull model — it reaches out to targets and scrapes metrics endpoints, typically on path /metrics, at a configured interval. In a static world you'd list IPs. In Kubernetes, Prometheus uses kubernetes_sd_configs to query the Kubernetes API and discover pods, services, endpoints, nodes, and ingresses dynamically.
When Prometheus starts, it authenticates to the API server using a ServiceAccount token mounted in its pod. It then watches specific resource types. For the endpoints role, Prometheus discovers every Endpoints object across the cluster. For each endpoint address it finds, it creates a scrape target. The magic happens during relabeling — a pipeline that runs before the scrape and lets you filter, rename, and attach labels using values pulled directly from Kubernetes metadata (pod annotations, namespace labels, service names).
The annotation prometheus.io/scrape: 'true' is a community convention that Prometheus relabeling configs check. If the annotation exists and is true, the pod is scraped. This means enabling monitoring for a new application is as simple as adding three lines to its pod spec — no Prometheus config reload needed. Prometheus reconciles new targets automatically every scrape_interval.
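Opting an application in is then purely declarative. A minimal pod-template fragment showing the three community-convention annotations (the port and path values here are illustrative):

```yaml
# Pod template fragment: the three annotations the relabeling pipeline reads.
metadata:
  annotations:
    prometheus.io/scrape: "true"    # opt this pod in to discovery
    prometheus.io/port: "8080"      # which container port serves metrics
    prometheus.io/path: "/metrics"  # optional; /metrics is the default
```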
Understanding the target lifecycle is critical for production. Targets move through three states: unknown, up, and down. A newly discovered target is unknown until its first scrape attempt completes. It goes down when a scrape fails: connection refused, a timeout, a non-2xx HTTP response, or an unparsable metrics body. It is up when the scrape succeeds. When a target disappears from service discovery, Prometheus injects staleness markers for its series — this prevents old time series from polluting range queries.
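Target health is itself queryable: every scrape attempt records the synthetic up series (1 on success, 0 on failure), so the lifecycle above can drive dashboards and alerts directly. Two illustrative queries, assuming the kubernetes-pods job defined in the scrape config that follows:

```promql
# 1 if the last scrape of each target succeeded, 0 if it failed
up{job="kubernetes-pods"}

# Fraction of discovered targets currently failing their scrapes
1 - avg(up{job="kubernetes-pods"})
```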
```yaml
# Core Prometheus scrape configuration for Kubernetes pod discovery.
# It lives inside your prometheus.yml. (With the Prometheus Operator, scrape
# config is usually generated from ServiceMonitor/PodMonitor resources instead;
# PrometheusRule CRs hold recording and alerting rules, not scrape config.)
scrape_configs:
  - job_name: 'kubernetes-pods'
    # Prometheus will query the Kubernetes API server to find all pods.
    kubernetes_sd_configs:
      - role: pod
        # Restrict discovery to specific namespaces for security isolation.
        # Remove this block to discover pods cluster-wide.
        namespaces:
          names:
            - production
            - staging
    # relabel_configs runs BEFORE each scrape — it filters and transforms targets.
    relabel_configs:
      # STEP 1: Only scrape pods that explicitly opt in via annotation.
      # If prometheus.io/scrape is not 'true', drop this target entirely.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # STEP 2: Allow pods to declare a custom metrics path (default is /metrics).
      # e.g. annotation: prometheus.io/path: '/actuator/prometheus'
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # STEP 3: Allow pods to declare a custom port for scraping.
      # e.g. annotation: prometheus.io/port: '8080'
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        # Regex captures the IP from __address__ and combines it with the annotation port.
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # STEP 4: Carry the namespace as a label so dashboards can filter by it.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # STEP 5: Carry the pod name as a label for drill-down in Grafana.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # STEP 6: Carry the app label from the pod so we can group by application.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
      # STEP 7: Never create targets for init containers — their ports are
      # short-lived and would show up as permanently-down targets. (Relabeling
      # regexes are RE2, which has no backreferences, so you cannot compare two
      # labels for equality here, e.g. host IP vs. pod IP for hostNetwork pods.)
      - source_labels: [__meta_kubernetes_pod_container_init]
        action: drop
        regex: 'true'
```
kubernetes-pods (42 / 42 up)
Target State Labels
http://10.0.1.15:8080/metrics UP app="payment-service", kubernetes_namespace="production"
http://10.0.1.22:8080/metrics UP app="user-service", kubernetes_namespace="production"
http://10.0.2.10:9090/actuator/prometheus UP app="order-service", kubernetes_namespace="staging"
...
# Targets without the annotation prometheus.io/scrape='true' are silently dropped
# and never appear in this list — exactly what we want.
Exposing Custom Application Metrics with the Prometheus Client Libraries
Kubernetes infrastructure metrics (CPU, memory, network) come from kube-state-metrics and node-exporter. But the metrics that make or break your SLOs are application-level: request latency, error rates, queue depth, cache hit ratio. These come from instrumenting your own code.
Prometheus has four core metric types you need to understand at the semantic level, not just the API level:
Counter — a value that only goes up (resets to zero on restart). Use it for total requests, total errors, total bytes sent. Never use a counter for something that can decrease. PromQL's rate() and increase() functions unwrap counters properly, handling resets.
Gauge — a value that can go up or down. Use it for current queue depth, active connections, temperature, memory usage. Don't use rate() on a gauge — it's meaningless for values that can decrease; use delta() or deriv() to measure how a gauge changes over time.
Histogram — pre-aggregated bucketed observations. Use it for latency and request size. It exposes one _bucket time series per boundary, plus _sum and _count. The bucket boundaries you choose at instrumentation time are effectively permanent — changing them requires a code change and redeploy, and breaks continuity for queries that span the change.
Summary — client-side computed quantiles. Use it only when you need accurate quantiles and can't aggregate across instances (summaries can't be aggregated in PromQL). In Kubernetes with multiple replicas, histograms are almost always the right choice over summaries.
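The type you pick dictates the PromQL you can meaningfully run later. A few illustrative expressions (the metric names here are hypothetical placeholders):

```promql
# Counter: per-second request rate over the last 5 minutes (rate() handles resets)
rate(http_requests_total[5m])

# Gauge: read the current value directly, or smooth it over time
avg_over_time(queue_depth[10m])

# Histogram: p99 latency from the _bucket series, aggregated across replicas
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```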
The example below shows a production-grade Go HTTP service instrumented with counters, gauges, and a histogram, serving them on a dedicated /metrics endpoint. A summary is deliberately omitted; the comparison table later in the article explains why histograms are the better fit here.
```go
package main

import (
	"math/rand"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// --- Metric Declarations ---
// promauto.New* registers metrics automatically with the default registry.
// Always declare metrics at package level — not inside handlers — to avoid
// re-registering on every request (which panics in production).
var (
	// Counter: tracks total HTTP requests. Labels let us slice by method and status code.
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "payment_service_http_requests_total",
			Help: "Total number of HTTP requests processed by the payment service.",
		},
		[]string{"method", "endpoint", "status_code"},
	)

	// Histogram: tracks request duration. Bucket boundaries are chosen to match
	// our SLO thresholds: 95th percentile < 200ms, 99th percentile < 500ms.
	// DefBuckets (the default) go up to 10s, which is often too coarse for APIs.
	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "payment_service_http_request_duration_seconds",
			Help: "HTTP request duration in seconds, bucketed by endpoint.",
			// Custom buckets aligned to SLO boundaries — critical in production.
			Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5},
		},
		[]string{"method", "endpoint"},
	)

	// Gauge: tracks the number of in-flight (currently processing) requests.
	// This helps detect thread pool saturation before latency spikes show up.
	inFlightRequests = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_in_flight_requests",
			Help: "Number of HTTP requests currently being processed.",
		},
	)

	// Gauge: tracks the depth of the async payment processing queue.
	paymentQueueDepth = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_queue_depth",
			Help: "Current number of payments waiting in the processing queue.",
		},
	)
)

// instrumentedHandler wraps an HTTP handler with Prometheus instrumentation.
// It measures latency, increments the request counter, and tracks in-flight requests.
func instrumentedHandler(endpoint string, handlerFunc http.HandlerFunc) http.HandlerFunc {
	return func(responseWriter http.ResponseWriter, request *http.Request) {
		// Record that one more request is in flight.
		inFlightRequests.Inc()
		// Use defer so this always runs even if the handler panics.
		defer inFlightRequests.Dec()

		// Start the latency timer.
		startTime := time.Now()

		// Wrap the ResponseWriter to capture the status code.
		wrappedWriter := &statusCapturingWriter{ResponseWriter: responseWriter, statusCode: http.StatusOK}

		// Call the actual business logic handler.
		handlerFunc(wrappedWriter, request)

		// Record latency in the histogram — this updates _bucket, _sum, and _count.
		durationSeconds := time.Since(startTime).Seconds()
		httpRequestDuration.WithLabelValues(request.Method, endpoint).Observe(durationSeconds)

		// Increment the total requests counter with the final status code.
		httpRequestsTotal.WithLabelValues(
			request.Method, endpoint, strconv.Itoa(wrappedWriter.statusCode),
		).Inc()
	}
}

// statusCapturingWriter is a thin wrapper around http.ResponseWriter that
// intercepts WriteHeader so we can record the HTTP status code as a label.
type statusCapturingWriter struct {
	http.ResponseWriter
	statusCode int
}

func (scw *statusCapturingWriter) WriteHeader(code int) {
	scw.statusCode = code
	scw.ResponseWriter.WriteHeader(code)
}

// processPayment simulates payment processing with realistic latency variance.
func processPayment(responseWriter http.ResponseWriter, request *http.Request) {
	// Simulate variable processing time between 5ms and 300ms.
	processingTime := time.Duration(5+rand.Intn(295)) * time.Millisecond
	time.Sleep(processingTime)

	// Simulate a 2% error rate — realistic for downstream dependency failures.
	if rand.Float64() < 0.02 {
		http.Error(responseWriter, "upstream payment gateway timeout", http.StatusGatewayTimeout)
		return
	}

	responseWriter.WriteHeader(http.StatusOK)
	responseWriter.Write([]byte(`{"status":"processed"}`))
}

func main() {
	// Simulate queue depth fluctuations in a background goroutine.
	// In production this would read from your actual queue (Kafka, SQS, etc.).
	go func() {
		for {
			// Randomly vary queue depth between 0 and 500 items.
			paymentQueueDepth.Set(float64(rand.Intn(500)))
			time.Sleep(5 * time.Second)
		}
	}()

	// Serve /metrics on its own mux and port so network policies can restrict
	// scrape access without blocking API traffic. (A separate mux is required:
	// registering everything on the DefaultServeMux would expose /metrics and
	// the API on both ports.)
	// NEVER put /metrics behind authentication that Prometheus can't pass.
	metricsMux := http.NewServeMux()
	metricsMux.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":9091", metricsMux) // metrics port

	// Serve the instrumented business API on its own mux.
	apiMux := http.NewServeMux()
	apiMux.HandleFunc("/api/v1/payments", instrumentedHandler("/api/v1/payments", processPayment))
	http.ListenAndServe(":8080", apiMux) // API port
}
```
# HELP payment_service_http_requests_total Total number of HTTP requests processed by the payment service.
# TYPE payment_service_http_requests_total counter
payment_service_http_requests_total{endpoint="/api/v1/payments",method="POST",status_code="200"} 4821
payment_service_http_requests_total{endpoint="/api/v1/payments",method="POST",status_code="504"} 97
# HELP payment_service_http_request_duration_seconds HTTP request duration in seconds, bucketed by endpoint.
# TYPE payment_service_http_request_duration_seconds histogram
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.005"} 0
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.01"} 142
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.1"} 2103
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.2"} 3890
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="0.5"} 4750
payment_service_http_request_duration_seconds_bucket{endpoint="/api/v1/payments",method="POST",le="+Inf"} 4918
payment_service_http_request_duration_seconds_sum{endpoint="/api/v1/payments",method="POST"} 743.21
payment_service_http_request_duration_seconds_count{endpoint="/api/v1/payments",method="POST"} 4918
# HELP payment_service_in_flight_requests Number of HTTP requests currently being processed.
# TYPE payment_service_in_flight_requests gauge
payment_service_in_flight_requests 12
# HELP payment_service_queue_depth Current number of payments waiting in the processing queue.
# TYPE payment_service_queue_depth gauge
payment_service_queue_depth 247
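Wiring this service into the discovery config from earlier is purely a matter of pod-template annotations pointing at the dedicated metrics port. A sketch of the Deployment (the name, image, and replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service           # becomes the `app` target label via STEP 6
      annotations:
        prometheus.io/scrape: "true"   # opt in to the kubernetes-pods job
        prometheus.io/port: "9091"     # scrape the dedicated metrics port
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:1.4.2  # illustrative
          ports:
            - containerPort: 8080  # API traffic
            - containerPort: 9091  # /metrics only
```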
Production-Grade Recording Rules and Alerting That Won't Page You at 3am
Raw PromQL queries against high-cardinality data are expensive. A query like rate(http_requests_total[5m]) across 200 pods runs every time a dashboard loads. In large clusters, this causes Prometheus to churn through millions of samples per query, leading to query timeouts and the dreaded 'query timed out in expression evaluation' error.
Recording rules solve this by pre-computing expensive expressions and storing the result as a new time series. Prometheus evaluates recording rules on its evaluation interval (typically 1m), writes the result into its TSDB, and future queries read that cheap pre-computed series instead of re-scanning the raw data.
Naming matters. The Prometheus community convention for recording rule names is level:metric:operations. For example job:http_requests_total:rate5m means: aggregated at the job level, derived from http_requests_total, computed as a 5-minute rate. Sticking to this convention makes rules self-documenting and searchable.
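Put together, the naming convention plus a recording rule turns an expensive dashboard query into a cheap series lookup. An illustrative before/after, using the payment-service histogram:

```promql
# Before: every dashboard refresh re-aggregates raw bucket series from all pods
histogram_quantile(0.99,
  sum by (endpoint, le) (rate(payment_service_http_request_duration_seconds_bucket[5m])))

# After: one pre-computed series per endpoint, named per the convention
job_endpoint:payment_service_latency_p99:rate5m
```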
Alerts in Prometheus are defined in the same YAML format as recording rules. The critical production insight is that alerts should express SLO burn rates, not raw thresholds. An alert that fires when error rate > 1% will fire constantly during minor blips. An alert based on a multi-window burn rate (Google's SRE model) only fires when you're burning through your error budget fast enough to exhaust it within a prediction window — dramatically reducing noise.
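Google's multi-window form combines a long window (proof the burn is sustained) with a short window (proof it is still happening) and fires only when both exceed the burn-rate threshold. A sketch for a 30-day 99.9% SLO, where a 14.4x burn rate exhausts the budget in roughly two days:

```promql
# Fast-burn page: error ratio >= 14.4x the 0.1% budget in BOTH windows.
(
  sum(rate(payment_service_http_requests_total{status_code=~"5.."}[1h]))
    / sum(rate(payment_service_http_requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(payment_service_http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(payment_service_http_requests_total[5m]))
) > (14.4 * 0.001)
```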
```yaml
# This PrometheusRule custom resource is picked up automatically by the
# Prometheus Operator. No Prometheus restart needed — the operator reconciles it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-slo-rules
  namespace: production
  labels:
    # These labels must match the ruleSelector in your Prometheus CR.
    # If they are missing, the Prometheus Operator silently ignores the resource.
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    # ─────────────────────────────────────────────
    # GROUP 1: Recording rules — pre-compute expensive queries
    # ─────────────────────────────────────────────
    - name: payment_service_recording_rules
      # Evaluate every 1 minute — should be >= scrape_interval.
      interval: 1m
      rules:
        # Pre-compute the 5-minute request rate per job and endpoint.
        # Queries against this are far cheaper than querying the raw counter.
        - record: job_endpoint:payment_service_http_requests_total:rate5m
          expr: |
            rate(payment_service_http_requests_total[5m])

        # Pre-compute the 5xx error ratio per endpoint.
        # This is the core signal for SLO tracking.
        - record: job_endpoint:payment_service_error_ratio:rate5m
          expr: |
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total{status_code=~"5.."}[5m])
            )
            /
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total[5m])
            )

        # Pre-compute p99 latency from the histogram.
        # histogram_quantile needs the _bucket series — this is why we use histograms.
        - record: job_endpoint:payment_service_latency_p99:rate5m
          expr: |
            histogram_quantile(
              0.99,
              sum by (job, endpoint, le) (
                rate(payment_service_http_request_duration_seconds_bucket[5m])
              )
            )

    # ─────────────────────────────────────────────
    # GROUP 2: Alerts — fire on SLO burn rate, not raw thresholds
    # ─────────────────────────────────────────────
    - name: payment_service_alerts
      rules:
        # ALERT 1: High-burn-rate alert (fast burn — page immediately).
        # Fires when the 5-minute error ratio is 14.4x the SLO budget.
        # At that rate a 30-day 99.9% SLO budget is exhausted in roughly
        # two days (30d / 14.4 ≈ 50h).
        # 'for: 2m' prevents single-scrape blips from firing pages.
        - alert: PaymentServiceHighErrorBurnRate
          expr: |
            job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            team: payments
          annotations:
            # runbook_url is conventionally an annotation (not a label) —
            # Alertmanager can attach it to the PagerDuty payload.
            runbook_url: https://wiki.company.com/runbooks/payment-service-errors
            summary: "Payment service burning error budget at critical rate"
            description: |
              Endpoint {{ $labels.endpoint }} error ratio is {{ $value | humanizePercentage }},
              14.4x above the 0.1% SLO target. At this rate the monthly error budget will be
              exhausted within about two days. Check downstream payment gateway connectivity
              and database connection pool saturation.

        # ALERT 2: Latency SLO breach.
        # Fires if p99 latency exceeds 500ms for 5 consecutive minutes.
        - alert: PaymentServiceHighLatency
          expr: |
            job_endpoint:payment_service_latency_p99:rate5m > 0.5
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment service p99 latency exceeds SLO threshold"
            description: |
              p99 latency for {{ $labels.endpoint }} is {{ $value | humanizeDuration }},
              exceeding the 500ms SLO threshold. Current queue depth:
              {{ with query "payment_service_queue_depth" }}{{ . | first | value | humanize }}{{ end }} items.

        # ALERT 3: Dead queue — queue depth at zero while requests are incoming.
        # This catches silent consumer crashes before users notice.
        # 'and on()' is required: the right-hand sum() has no labels, so a plain
        # 'and' would never match the labeled queue-depth series.
        - alert: PaymentQueueConsumerDead
          expr: |
            payment_service_queue_depth == 0
            and on()
            sum(job_endpoint:payment_service_http_requests_total:rate5m) > 10
          for: 3m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment queue depth is zero but traffic is flowing — consumer likely crashed"
```
prometheusrule.monitoring.coreos.com/payment-service-slo-rules created
# Verify the Prometheus Operator picked up the rules (check operator logs):
# kubectl logs -n monitoring deploy/prometheus-operator | grep payment-service
level=info msg="found PrometheusRule" name=payment-service-slo-rules namespace=production
level=info msg="rule group loaded" group=payment_service_recording_rules rules=3
level=info msg="rule group loaded" group=payment_service_alerts rules=3
# In the Prometheus UI at /rules you'll now see:
Group: payment_service_recording_rules [3 rules, last evaluated 0.84s ago]
job_endpoint:payment_service_http_requests_total:rate5m OK
job_endpoint:payment_service_error_ratio:rate5m OK
job_endpoint:payment_service_latency_p99:rate5m OK
Group: payment_service_alerts [3 rules]
PaymentServiceHighErrorBurnRate INACTIVE
PaymentServiceHighLatency INACTIVE
PaymentQueueConsumerDead INACTIVE
# When an alert fires (e.g. during a payment gateway outage):
# At /alerts in Prometheus UI:
PaymentServiceHighErrorBurnRate FIRING
Labels: endpoint="/api/v1/payments", severity="critical", team="payments"
Value: 0.0187 (18.7x the budget — critical burn rate)
Active: 2m 15s
| Aspect | Prometheus Histogram | Prometheus Summary |
|---|---|---|
| Quantile calculation location | Prometheus server (at query time via histogram_quantile()) | Client library (at instrumentation time) |
| Aggregatable across replicas? | Yes — sum buckets across instances before calling histogram_quantile() | No — pre-computed quantiles cannot be meaningfully averaged |
| Bucket boundary changes | Requires app restart — boundaries are fixed at init | Requires app restart — quantile objectives are fixed at init |
| Accuracy | Approximate — depends on bucket granularity | Configurable accuracy with error bounds (e.g. 0.01 error on 0.99 quantile) |
| Query cost | Higher — scans all bucket time series | Lower — quantile already computed client-side |
| Best for Kubernetes? | Yes — multiple pod replicas need server-side aggregation | Only if single-instance and exact quantiles are mandatory |
| Memory overhead (client) | Low — O(number of buckets) | Higher — sliding time window maintained per quantile |
| SLO burn rate calculations | Excellent — rate() on _count and _bucket work perfectly | Difficult — _count works but quantile series isn't rate()-able |
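The second row of the table is the one that decides most Kubernetes deployments, and it is worth seeing concretely. Below is a toy sketch (not Prometheus's actual implementation, and the counts are made up) showing why histogram buckets aggregate cleanly across replicas: cumulative bucket counts from two pods can be summed element-wise and then interpolated, which is exactly the idea behind server-side histogram_quantile(). Two pre-computed summary quantiles, by contrast, cannot be meaningfully combined.

```go
package main

import "fmt"

// quantileFromBuckets linearly interpolates a quantile value from cumulative
// bucket counts, the same basic idea histogram_quantile() uses server-side.
// bounds are the bucket upper limits ("le"); cumulative has one extra entry
// for the implicit +Inf bucket.
func quantileFromBuckets(bounds, cumulative []float64, rank float64) float64 {
	prevBound, prevCount := 0.0, 0.0
	for i, bound := range bounds {
		if cumulative[i] >= rank {
			// Interpolate between the bucket's lower and upper bound.
			inBucket := cumulative[i] - prevCount
			return prevBound + (bound-prevBound)*((rank-prevCount)/inBucket)
		}
		prevBound, prevCount = bound, cumulative[i]
	}
	return bounds[len(bounds)-1] // rank falls in the +Inf bucket: clamp
}

func main() {
	// Shared bucket boundaries (seconds), fixed at instrumentation time.
	bounds := []float64{0.05, 0.1, 0.2, 0.5}
	// Cumulative observation counts from two replicas, including +Inf.
	replicaA := []float64{40, 70, 95, 100, 100}
	replicaB := []float64{10, 30, 60, 90, 100}

	// Step 1: sum buckets element-wise. Valid because each bucket is itself
	// a cumulative counter; summary quantiles have no such property.
	merged := make([]float64, len(replicaA))
	for i := range replicaA {
		merged[i] = replicaA[i] + replicaB[i]
	}

	// Step 2: interpolate the cluster-wide p90 from the merged buckets.
	total := merged[len(merged)-1]
	rank := 0.9 * total
	fmt.Printf("approximate cluster-wide p90: %.3fs\n",
		quantileFromBuckets(bounds, merged, rank))
	// prints: approximate cluster-wide p90: 0.414s
}
```

Note that the result is approximate: accuracy depends entirely on how fine the bucket boundaries are around the quantile of interest, which is why the article aligns buckets with SLO thresholds.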
🎯 Key Takeaways
Use kubernetes_sd_configs plus relabeling so your targets track the cluster's actual state, and let pods opt in with the prometheus.io/scrape annotation.
Instrument applications with counters, gauges, and histograms; with multiple replicas, prefer histograms over summaries so quantiles can be aggregated server-side.
Pre-compute expensive PromQL with recording rules named level:metric:operations, delivered as PrometheusRule resources the operator reconciles without restarts.
Alert on SLO burn rates rather than raw thresholds so pages fire only when the error budget is genuinely at risk.