Kubernetes Monitoring with Prometheus — Deep Dive for Production
- Prometheus is pull-based with Kubernetes-native service discovery. Annotation-driven scraping means enabling monitoring requires no config changes on the Prometheus side.
- Instrument with Counters (totals), Gauges (current state), and Histograms (latency). Never use Summary in multi-replica deployments.
- Recording rules pre-compute expensive PromQL into cheap time series. They are not optional in production — they are the performance optimization.
- Pull model: Prometheus scrapes /metrics endpoints on targets at configured intervals.
- Service Discovery: Queries Kubernetes API to find pods, services, nodes dynamically — no static IPs.
- Four metric types: Counter (only goes up), Gauge (up/down), Histogram (bucketed observations), Summary (client-side quantiles).
- Recording rules: Pre-compute expensive PromQL into cheap time series.
- Alerting: Prometheus evaluates rules, Alertmanager routes to PagerDuty/Slack.
- Pull model means Prometheus must reach every target. NetworkPolicy misconfigs silently break scraping.
- High-cardinality labels (user_id, request_id) will OOM Prometheus.
- Using Summary instead of Histogram in multi-replica deployments. Summaries cannot be aggregated across instances.
Symptom: Target showing DOWN or UNKNOWN.

```shell
# Check which pods have opted in to scraping.
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.annotations.prometheus\.io/scrape}{"\n"}{end}'
# Test connectivity from inside the Prometheus pod.
kubectl exec -n monitoring deploy/prometheus -- wget -qO- http://<target-ip>:<port>/metrics | head -5
```

Symptom: Prometheus memory growing rapidly.

```shell
# Current number of active series in the TSDB head block.
curl -s http://prometheus:9090/api/v1/query?query=prometheus_tsdb_head_series | jq '.data.result[0].value[1]'
# Top 10 metric names by series count.
curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName | sort_by(.value) | reverse | .[0:10]'
```

Symptom: Recording rules not evaluating.

```shell
# List PrometheusRule CRs whose labels match the Prometheus ruleSelector.
kubectl get prometheusrule -A -o json | jq '.items[] | select(.metadata.labels.prometheus=="kube-prometheus") | .metadata.name'
# Check the operator logs for rule-loading errors.
kubectl logs -n monitoring deploy/prometheus-operator | grep -i 'rule\|error'
```

Symptom: Alerts firing but not reaching PagerDuty/Slack.

```shell
# Show the active Alertmanager configuration.
kubectl exec -n monitoring deploy/alertmanager -- amtool config show
# List alerts currently firing inside Alertmanager.
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state=="firing") | .labels'
```

Symptom: Scrape duration exceeds scrape_interval.

```shell
# Find targets whose scrapes take longer than 10 seconds.
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.scrapeDurationSeconds > 10) | {job: .labels.job, instance: .labels.instance, duration: .scrapeDurationSeconds}'
# Top 5 slowest scrape targets.
curl -s http://prometheus:9090/api/v1/query?query=scrape_duration_seconds | jq '.data.result | sort_by(.value[1]) | reverse | .[0:5]'
```

Production Incident
A developer added a user_id label to track per-user request counts. With 50,000 unique users per hour and roughly 5 value combinations of the other labels (method, endpoint, status_code), the metric generated 50,000 * 5 = 250,000 new time series per hour. Each time series consumes memory in Prometheus's TSDB head block. After 6 hours, the head block contained over 1.5 million active series for a single metric, consuming 24GB of RAM. The Prometheus pod was configured with a 16GB memory limit and was OOMKilled.

Resolution:

1. Removed the user_id label from the counter immediately and redeployed the application.
2. Added a Prometheus recording rule to aggregate by user tier (free, premium, enterprise) instead of individual user_id.
3. Added sample_limit: 1000 to the scrape config to prevent future label explosions from a single target.
4. Deployed a cardinality-linter CI check that rejects metrics with more than 3 labels in code review.
5. Added a Prometheus alert on prometheus_tsdb_head_series > 1000000 to catch future explosions early.

Lessons learned:

- Set sample_limit on scrape configs as a safety net against targets that expose too many series.
- Monitor Prometheus's own metrics: prometheus_tsdb_head_series, prometheus_tsdb_head_chunks, and memory usage. Alert before the OOMKill.
- Enforce cardinality limits in CI/CD. A single bad label can take down monitoring for the entire cluster.

Production Debug Guide

Symptom-first investigation path for Prometheus failures in Kubernetes.
- Target DOWN or UNKNOWN: verify the prometheus.io/scrape annotation is set. Check NetworkPolicy — Prometheus must be able to reach the target pod's IP.
- Alert not reaching receivers: check the rule's for duration — the alert may still be in PENDING state. Check Alertmanager routing — the alert may be firing but silenced or routed to the wrong receiver.
- Memory growing: check prometheus_tsdb_head_series — if over 5M, you have a cardinality problem. Run promtool tsdb analyze to find the highest-cardinality metric names. Look for labels with unbounded values (user_id, request_id). Treat sample_limit as a safety net: a target that exceeds it has its scrape dropped.

Running Kubernetes in production without monitoring is like flying a commercial aircraft with the instrument panel blacked out. Everything might feel fine until it catastrophically isn't. Prometheus is used in over 84% of Kubernetes production environments — not because it's the easiest tool, but because it's the most powerful pull-based metrics system purpose-built for dynamic, containerized infrastructure.
The real problem Prometheus solves is the ephemeral nature of Kubernetes workloads. Traditional monitoring tools expect your target IPs to stay fixed. In Kubernetes, a pod's IP changes every restart. Prometheus solves this with Kubernetes-native service discovery — it queries the Kubernetes API server directly to find what's alive right now, not what was alive when you wrote the config.
This is not a getting-started guide. It covers scrape configurations with relabeling, custom application metrics using client libraries, recording rules to avoid query-time explosions, Alertmanager integration, and the five most expensive mistakes teams make in production.
How Prometheus Service Discovery Works Inside Kubernetes
Prometheus uses a pull model — it reaches out to targets and scrapes metrics endpoints, typically on path /metrics, at a configured interval. In a static world you'd list IPs. In Kubernetes, Prometheus uses kubernetes_sd_configs to query the Kubernetes API and discover pods, services, endpoints, nodes, and ingresses dynamically.
When Prometheus starts, it authenticates to the API server using a ServiceAccount token mounted in its pod. It then watches specific resource types. For the endpoints role, Prometheus discovers every Endpoints object across the cluster. For each endpoint address it finds, it creates a scrape target. The magic happens during relabeling — a pipeline that runs before the scrape and lets you filter, rename, and attach labels using values pulled directly from Kubernetes metadata (pod annotations, namespace labels, service names).
The annotation `prometheus.io/scrape: 'true'` is a community convention that Prometheus relabeling configs check. If the annotation exists and is true, the pod is scraped. This means enabling monitoring for a new application is as simple as adding three lines to its pod spec — no Prometheus config reload needed. Prometheus reconciles new targets automatically every `scrape_interval`.
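Concretely, the opt-in convention amounts to annotations on the pod template (port and path values here are illustrative; the path defaults to /metrics):

```yaml
# Pod template metadata — the three annotations the relabeling pipeline reads.
metadata:
  annotations:
    prometheus.io/scrape: 'true'   # opt in to scraping
    prometheus.io/port: '9091'     # metrics port, stated explicitly
    prometheus.io/path: '/metrics' # metrics path
```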
Understanding the target lifecycle is critical for production. Targets move through states: up, down, and unknown. A target shows unknown before the first scrape result arrives — for example, it was just discovered or the pod hasn't started. It goes down when the scrape fails: the connection is refused, the HTTP request returns a non-200 status, or it times out. Staleness markers are injected after a target disappears — this prevents old time series from polluting range queries.
```yaml
# This is the core Prometheus scrape configuration for Kubernetes pod discovery.
# It lives inside your prometheus.yml (or the additionalScrapeConfigs secret
# if you use the Prometheus Operator).
scrape_configs:
  - job_name: 'kubernetes-pods'
    # Prometheus will query the Kubernetes API server to find all pods.
    kubernetes_sd_configs:
      - role: pod
        # Restrict discovery to specific namespaces for security isolation.
        namespaces:
          names:
            - production
            - staging
    # relabel_configs runs BEFORE each scrape — it filters and transforms targets.
    relabel_configs:
      # STEP 1: Only scrape pods that explicitly opt in via annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # STEP 2: Allow pods to declare a custom metrics path.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # STEP 3: Allow pods to declare a custom port for scraping.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # STEP 4: Carry the namespace as a label.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # STEP 5: Carry the pod name as a label.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
      # STEP 6: Carry the app label from the pod.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: replace
        target_label: app
      # STEP 7: Drop targets running on hostNetwork.
      - source_labels: [__meta_kubernetes_pod_host_ip, __address__]
        regex: '(\d+\.\d+\.\d+\.\d+);\1:\d+'
        action: drop
```
Always set prometheus.io/port explicitly in multi-container pods. This is the most common scrape misconfiguration in production.

- Multi-container pods expose multiple ports. Prometheus picks one — not always the right one.
- The __address__ label determines where Prometheus connects. Relabeling rewrites it.
- Without prometheus.io/port, Prometheus uses the first container port in the pod spec.
- Sidecar containers (Istio, Envoy) often expose ports that are not your metrics port.

A NetworkPolicy that blocks Prometheus leaves targets down or unknown with no useful error. Always include Prometheus's namespace in your NetworkPolicy ingress rules. Use kubectl exec from the Prometheus pod to test connectivity to the target before blaming the scrape config. Set prometheus.io/scrape, prometheus.io/port, and prometheus.io/path explicitly.

Exposing Custom Application Metrics with the Prometheus Client Libraries
Kubernetes infrastructure metrics (CPU, memory, network) come from kube-state-metrics and node-exporter. But the metrics that make or break your SLOs are application-level: request latency, error rates, queue depth, cache hit ratio. These come from instrumenting your own code.
Prometheus has four core metric types you need to understand at the semantic level, not just the API level:
Counter — a value that only goes up (resets to zero on restart). Use it for total requests, total errors, total bytes sent. Never use a counter for something that can decrease. PromQL's rate() and increase() functions unwrap counters properly, handling resets.
Gauge — a value that can go up or down. Use it for current queue depth, active connections, temperature, memory usage. Don't use rate() on a gauge — it's meaningless.
Histogram — pre-aggregated bucketed observations. Use it for latency and request size. It exposes three time series: _bucket, _sum, and _count. The bucket boundaries you choose at instrumentation time are permanent — you can't change them without restarting the process.
Summary — client-side computed quantiles. Use it only when you need accurate quantiles and can't aggregate across instances (summaries can't be aggregated in PromQL). In Kubernetes with multiple replicas, histograms are almost always the right choice over summaries.
```go
package main

import (
	"math/rand"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "payment_service_http_requests_total",
			Help: "Total number of HTTP requests processed by the payment service.",
		},
		[]string{"method", "endpoint", "status_code"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "payment_service_http_request_duration_seconds",
			Help:    "HTTP request duration in seconds, bucketed by endpoint.",
			Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 2.5},
		},
		[]string{"method", "endpoint"},
	)

	inFlightRequests = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_in_flight_requests",
			Help: "Number of HTTP requests currently being processed.",
		},
	)

	paymentQueueDepth = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "payment_service_queue_depth",
			Help: "Current number of payments waiting in the processing queue.",
		},
	)
)

func instrumentedHandler(endpoint string, handlerFunc http.HandlerFunc) http.HandlerFunc {
	return func(responseWriter http.ResponseWriter, request *http.Request) {
		inFlightRequests.Inc()
		defer inFlightRequests.Dec()

		startTime := time.Now()
		wrappedWriter := &statusCapturingWriter{ResponseWriter: responseWriter, statusCode: http.StatusOK}

		handlerFunc(wrappedWriter, request)

		durationSeconds := time.Since(startTime).Seconds()
		httpRequestDuration.WithLabelValues(request.Method, endpoint).Observe(durationSeconds)
		httpRequestsTotal.WithLabelValues(
			request.Method,
			endpoint,
			strconv.Itoa(wrappedWriter.statusCode),
		).Inc()
	}
}

type statusCapturingWriter struct {
	http.ResponseWriter
	statusCode int
}

func (scw *statusCapturingWriter) WriteHeader(code int) {
	scw.statusCode = code
	scw.ResponseWriter.WriteHeader(code)
}

func processPayment(responseWriter http.ResponseWriter, request *http.Request) {
	processingTime := time.Duration(5+rand.Intn(295)) * time.Millisecond
	time.Sleep(processingTime)

	if rand.Float64() < 0.02 {
		http.Error(responseWriter, "upstream payment gateway timeout", http.StatusGatewayTimeout)
		return
	}

	responseWriter.WriteHeader(http.StatusOK)
	responseWriter.Write([]byte(`{"status":"processed"}`))
}

func main() {
	go func() {
		for {
			paymentQueueDepth.Set(float64(rand.Intn(500)))
			time.Sleep(5 * time.Second)
		}
	}()

	http.HandleFunc("/api/v1/payments", instrumentedHandler("/api/v1/payments", processPayment))
	http.Handle("/metrics", promhttp.Handler())

	go http.ListenAndServe(":9091", nil)
	http.ListenAndServe(":8080", nil)
}
```
- Bucket boundaries are permanent until the process restarts.
- histogram_quantile() interpolates between buckets — imprecise if boundaries don't align with your SLO.
- Default buckets (DefBuckets) go up to 10s — too coarse for most APIs.
- Custom buckets aligned to SLO thresholds (e.g., 0.2s for p99 < 200ms) give accurate SLO tracking.
- Histograms expose _bucket, _sum, and _count — three time series per label combination.
A label like user_id with 50,000 unique values creates 50,000 time series per metric per label combination. With 5 labels, that is 250,000 series per hour. Prometheus stores all active series in memory (the TSDB head block). At ~16KB per series, 1 million series = 16GB of RAM. Never use unbounded values as label values. Use bounded labels like user_tier (free/premium/enterprise) instead.

Production-Grade Recording Rules and Alerting That Won't Page You at 3am
Raw PromQL queries against high-cardinality data are expensive. A query like rate(http_requests_total[5m]) across 200 pods runs every time a dashboard loads. In large clusters, this causes Prometheus to churn through millions of samples per query, leading to query timeouts and the dreaded 'query timed out in expression evaluation' error.
Recording rules solve this by pre-computing expensive expressions and storing the result as a new time series. Prometheus evaluates recording rules on its evaluation interval (typically 1m), writes the result into its TSDB, and future queries read that cheap pre-computed series instead of re-scanning the raw data.
Naming matters. The Prometheus community convention for recording rule names is level:metric:operations. For example job:http_requests_total:rate5m means: aggregated at the job level, derived from http_requests_total, computed as a 5-minute rate. Sticking to this convention makes rules self-documenting and searchable.
Alerts in Prometheus are defined in the same YAML format as recording rules. The critical production insight is that alerts should express SLO burn rates, not raw thresholds. An alert that fires when error rate > 1% will fire constantly during minor blips. An alert based on a multi-window burn rate (Google's SRE model) only fires when you're burning through your error budget fast enough to exhaust it within a prediction window — dramatically reducing noise.
```yaml
# PrometheusRule custom resource — picked up automatically by the Prometheus Operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-slo-rules
  namespace: production
  labels:
    prometheus: kube-prometheus
    role: alert-rules
spec:
  groups:
    - name: payment_service_recording_rules
      interval: 1m
      rules:
        - record: job_endpoint:payment_service_http_requests_total:rate5m
          expr: |
            rate(payment_service_http_requests_total[5m])
        - record: job_endpoint:payment_service_error_ratio:rate5m
          expr: |
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total{status_code=~"5.."}[5m])
            )
            /
            sum by (job, endpoint) (
              rate(payment_service_http_requests_total[5m])
            )
        - record: job_endpoint:payment_service_latency_p99:rate5m
          expr: |
            histogram_quantile(
              0.99,
              sum by (job, endpoint, le) (
                rate(payment_service_http_request_duration_seconds_bucket[5m])
              )
            )
    - name: payment_service_alerts
      rules:
        - alert: PaymentServiceHighErrorBurnRate
          expr: |
            job_endpoint:payment_service_error_ratio:rate5m > (14.4 * 0.001)
          for: 2m
          labels:
            severity: critical
            team: payments
            runbook_url: https://wiki.company.com/runbooks/payment-service-errors
          annotations:
            summary: "Payment service burning error budget at critical rate"
            description: |
              Endpoint {{ $labels.endpoint }} error ratio is {{ $value | humanizePercentage }}.
        - alert: PaymentServiceHighLatency
          expr: |
            job_endpoint:payment_service_latency_p99:rate5m > 0.5
          for: 5m
          labels:
            severity: warning
            team: payments
          annotations:
            summary: "Payment service p99 latency exceeds SLO threshold"
        - alert: PaymentQueueConsumerDead
          expr: |
            payment_service_queue_depth == 0
            and
            sum(job_endpoint:payment_service_http_requests_total:rate5m) > 10
          for: 3m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment queue depth is zero but traffic is flowing"
```
Set for to at least 2-3x your scrape_interval. Critical page alerts: for: 2m. Ticket-level alerts: for: 15m.

- Without for: the alert fires on the first bad scrape. Noisy.
- With for: 2m: the alert must be consistently bad for 2 minutes before firing.
- PENDING state: visible in the Prometheus UI but not sent to Alertmanager.
- FIRING state: sent to Alertmanager for routing to PagerDuty/Slack.
- The for duration should be >= 2x scrape_interval to avoid single-scrape flukes.
A raw rate(http_requests_total[5m]) query scans millions of samples. Recording rules pre-compute this once per evaluation interval (1m) and store a cheap-to-read time series. The performance difference is 100x or more. Always create recording rules for any query used in more than one dashboard or alert. The for field prevents noise. Labels on PrometheusRule CRs must exactly match the Prometheus CR's ruleSelector.

Alertmanager: Routing, Silencing, and Deduplication
Prometheus evaluates alert rules and sends firing alerts to Alertmanager. Alertmanager is a separate component responsible for deduplicating, grouping, routing, and silencing alerts before sending them to notification channels (PagerDuty, Slack, email, Opsgenie).
Alertmanager's routing tree is a hierarchical matching system. An alert's labels are matched against the route configuration. The first matching route determines the receiver. This means your alert labels (severity, team, service) must be carefully designed to match your routing tree.
```yaml
# Alertmanager configuration — deployed as a Secret in the monitoring namespace.
# The Prometheus Operator picks this up from the alertmanager.yaml key.
global:
  resolve_timeout: 5m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxx'

route:
  # Default receiver for alerts that don't match any sub-route.
  receiver: 'slack-catchall'
  group_by: ['alertname', 'namespace', 'endpoint']
  group_wait: 30s      # Wait 30s to group similar alerts.
  group_interval: 5m   # Send grouped updates every 5m.
  repeat_interval: 4h  # Re-send unresolved alerts every 4h.
  routes:
    # Critical alerts -> PagerDuty (page the on-call engineer).
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s      # Page faster for critical alerts.
      repeat_interval: 1h  # Re-page every hour if unresolved.
    # Warning alerts -> Slack channel (no page, just notification).
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 1m
      repeat_interval: 4h
    # Team-specific routing.
    - match:
        team: payments
      receiver: 'slack-payments-team'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
        severity: 'critical'
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
        send_resolved: true
  - name: 'slack-payments-team'
    slack_configs:
      - channel: '#payments-alerts'
        title: '[Payments] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
  - name: 'slack-catchall'
    slack_configs:
      - channel: '#alerts-catchall'
        title: 'Unrouted Alert: {{ .GroupLabels.alertname }}'

# Inhibition: suppress warning alerts when a critical alert is already firing
# for the same service. Prevents alert storms.
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'namespace']
```
- Deduplication: Same alert fingerprint = one notification.
- Grouping: group_by determines which alerts are batched together.
- Inhibition: Higher-severity alerts suppress lower-severity alerts for the same context.
- Silences: Temporary muting of alerts during maintenance windows.
- Routing: Label matching determines which receiver (PagerDuty, Slack, email) gets the alert.
The group_by field in Alertmanager controls notification batching. If you group by alertname only, all pods with the same alert are batched into one notification — good for reducing noise. If you group by alertname, pod, each pod gets its own notification — bad during a cluster-wide outage where 200 pods trigger the same alert. A sensible production default is group_by: ['alertname', 'namespace'], which batches by service and namespace. Use group_wait: 30s to allow grouping before sending.

Prometheus Storage: TSDB Internals, Retention, and Thanos/Cortex for Long-Term Storage
Prometheus stores metrics in its own time-series database (TSDB). Understanding TSDB internals is critical for capacity planning, retention tuning, and deciding when to add long-term storage.
TSDB stores data in blocks. Each block covers a 2-hour time range and contains a chunks directory (compressed metric samples) and an index. New samples land in the in-memory head block and are simultaneously appended to an on-disk write-ahead log (WAL) so they survive a crash. Every 2 hours, the head block is compacted into a persistent block and flushed to disk. Old blocks are compacted into larger blocks (e.g., 2h blocks into 1-day blocks) to reduce the number of files.
```yaml
# Prometheus StatefulSet with retention and storage configuration.
# For the kube-prometheus-stack Helm chart, these go in prometheus.prometheusSpec.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus
  namespace: monitoring
spec:
  # Retention: how long to keep data locally.
  # 15d is typical. Longer retention = more disk and memory.
  retention: 15d
  # Retention by size: delete the oldest blocks when storage exceeds this.
  # Use this as a safety net alongside time-based retention.
  retentionSize: 50GB
  # Storage: PVC for persistent TSDB blocks.
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
  # Resources: Prometheus is memory-hungry.
  # Rule of thumb: ~16KB per active time series.
  # 1M series = 16GB RAM. Plan accordingly.
  resources:
    requests:
      cpu: '1'
      memory: 16Gi
    limits:
      cpu: '4'
      memory: 32Gi
  # External labels: applied to all metrics when using Thanos/Cortex.
  # Identifies which Prometheus instance scraped the data.
  externalLabels:
    cluster: production-us-east-1
    environment: production
  # Thanos sidecar: uploads blocks to object storage for long-term retention.
  thanos:
    objectStorageConfig:
      name: thanos-objstore-config
      key: objstore.yml
  # Sample limit: max series per scrape target.
  # Safety net against high-cardinality targets (value is illustrative).
  enforcedSampleLimit: 100000
  # Helm-chart-only settings (kube-prometheus-stack): ensure ServiceMonitors
  # and PodMonitors outside the chart's release are still selected.
  serviceMonitorSelectorNilUsesHelmValues: false
  podMonitorSelectorNilUsesHelmValues: false
```
- Prometheus: Single cluster, short-term (days to weeks). No native long-term storage or HA deduplication.
- Thanos: Sidecar uploads blocks to S3/GCS. Querier federates across Prometheus instances. Compactor reduces storage costs. Best for multi-cluster with object storage.
- Cortex: Horizontally scalable, multi-tenant Prometheus backend; the technology originally behind Grafana Cloud's hosted metrics. Best for multi-tenant SaaS platforms.
- VictoriaMetrics: Drop-in Prometheus replacement with better compression and lower resource usage. Best for single-cluster with high cardinality.
- Decision: Use Thanos for multi-cluster with object storage. Use Cortex for multi-tenant SaaS. Use VictoriaMetrics for single-cluster resource optimization.
Monitor prometheus_tsdb_head_series and process_resident_memory_bytes. Set retention based on disk capacity: 15 days at 1M series with a 15s scrape interval equals approximately 50GB of disk. Use fast SSDs for TSDB — network-attached storage introduces latency that slows compaction and can cause WAL corruption during power loss.

| Aspect | Histogram | Summary | Thanos | Cortex | VictoriaMetrics |
|---|---|---|---|---|---|
| Purpose | Bucketed latency/size observations | Client-side quantile computation | Cross-cluster federation + long-term storage | Multi-tenant Prometheus backend | Drop-in Prometheus replacement |
| Quantile calculation | Server-side (histogram_quantile()) | Client-side (at instrumentation time) | N/A | N/A | N/A |
| Aggregatable across replicas | Yes — sum buckets before quantile | No — pre-computed quantiles cannot be averaged | Yes — via Thanos Querier global view | Yes — native horizontal aggregation | Yes — native horizontal aggregation |
| Best for Kubernetes | Yes — multi-replica needs server-side aggregation | Only single-instance with exact quantiles | Multi-cluster with S3/GCS object storage | Multi-tenant SaaS platforms | Single-cluster with extreme cardinality |
| Retention model | N/A | N/A | Unlimited (blocks in S3/GCS) | Unlimited (horizontally scalable) | Unlimited (highly compressed local TSDB) |
| HA support | N/A | N/A | Yes — Querier deduplicates across replicas | Yes — native replication | Yes — cluster mode |
| Compression ratio | N/A | N/A | Same as Prometheus (inherits TSDB blocks) | Same as Prometheus | Up to 10x better than Prometheus |
| Query compatibility | N/A | N/A | 100% PromQL compatible | 100% PromQL compatible | 100% PromQL compatible |
| Operational complexity | N/A | N/A | Medium — Sidecar + Querier + Store Gateway + Compactor | High — Ingester + Distributor + Querier + Compactor | Low — single binary or cluster mode |
| SLO burn rate support | Excellent — rate() on _count and _bucket | Difficult — quantile series not rate()-able | N/A | N/A | N/A |
| Memory overhead (client) | Low — O(number of buckets) | Higher — sliding window per quantile | N/A | N/A | N/A |
| Bucket boundary changes | Requires app restart | Requires app restart (objective changes) | N/A | N/A | N/A |
🎯 Key Takeaways
- Alertmanager handles deduplication, grouping, routing, and inhibition. Inhibition rules are the most underused feature and the key to preventing alert storms.
- Prometheus TSDB memory is proportional to active series count. High-cardinality labels will OOM Prometheus. Monitor and limit cardinality aggressively.
- Prometheus is designed for short-term, per-cluster monitoring. Add Thanos for cross-cluster federation and long-term retention. Add Cortex for multi-tenant SaaS.
Interview Questions on This Topic
- Q: Explain how Prometheus discovers targets in Kubernetes. What is the role of relabel_configs?
- Q: What is the difference between a Counter and a Gauge? When would you use each?
- Q: Why are Histograms preferred over Summaries in Kubernetes deployments with multiple replicas?
- Q: Explain histogram bucket design. Why is it a 'one-way door' and how does it affect SLO tracking?
- Q: What are recording rules and why are they important for production Prometheus deployments?
- Q: How does the for field in alert rules prevent false positives? What is the recommended value?
- Q: Explain Alertmanager's inhibition rules. How do they prevent alert storms during cascading failures?
- Q: What is the relationship between Prometheus TSDB head block size and memory usage? How do you plan capacity?
- Q: When would you add Thanos or Cortex to your monitoring stack? What problem does each solve?
- Q: How do you debug a target showing as DOWN on the Prometheus /targets page?
- Q: What is the __address__ rewrite trap in multi-container pods? How do you avoid it?
- Q: Explain SLO burn rate alerts. How do they differ from raw threshold alerts?
- Q: How do NetworkPolicies affect Prometheus scraping? How do you debug silent scrape failures?
- Q: What is the sample_limit field in scrape configs? When is it useful?
- Q: Describe the Prometheus TSDB block lifecycle: WAL, head block, compaction, and retention.
Frequently Asked Questions
How does Prometheus discover pods in Kubernetes?
Prometheus uses kubernetes_sd_configs to query the Kubernetes API server for pods, services, endpoints, nodes, and ingresses. It authenticates using a ServiceAccount token. Relabeling configs filter and transform discovered targets before scraping. The conventional prometheus.io/scrape: 'true' annotation opts pods into scraping.
What is the difference between a Counter and a Gauge?
A Counter only goes up (resets to zero on restart). Use it for totals like request count or error count. Use rate() or increase() to compute per-second rates or total increases. A Gauge can go up or down. Use it for current values like queue depth or active connections. Never use rate() or increase() on a Gauge.
Why should I use Histograms instead of Summaries in Kubernetes?
Histograms pre-aggregate observations into buckets on the client; histogram_quantile() then computes quantiles on the server at query time. This allows aggregation across multiple pod replicas. Summaries compute quantiles on the client and cannot be meaningfully aggregated across instances. In Kubernetes with multiple replicas, Histograms are almost always the correct choice.
How do recording rules improve Prometheus performance?
Recording rules pre-compute expensive PromQL expressions (like rate() over a 5-minute window) and store the result as a new time series. Dashboards and alerts query this pre-computed series instead of re-scanning raw data. The performance improvement is 100x or more for queries across high-cardinality data.
What is the `for` field in alert rules?
The for field makes an alert pass through a PENDING state before reaching FIRING. The alert expression must stay true for the full for duration before the alert is sent to Alertmanager. This prevents single-scrape blips from triggering pages. Recommended: for: 2m for critical alerts, for: 15m for warning alerts.
When do I need Thanos or Cortex?
Add Thanos when you need cross-cluster query federation, retention beyond local disk capacity (weeks to years), or Prometheus HA. Add Cortex when you need a horizontally scalable multi-tenant Prometheus backend. For single-cluster with high cardinality, consider VictoriaMetrics as a drop-in replacement.
How do I prevent Prometheus from being OOMKilled?
Monitor prometheus_tsdb_head_series — each active series consumes ~16KB of RAM. Never add unbounded values (user_id, request_id) as label values. Set sample_limit on scrape configs to cap series per target. Set memory requests/limits based on expected series count (1M series = ~16GB). Alert on memory usage before OOMKill.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.