Intermediate 15 min · March 06, 2026

Prometheus & Grafana Setup - Static IPs Cause Outages

No alert fired despite payment failures - static IP targets in Docker caused scrape loss after restart.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

Production tested · Last updated May 23, 2026
Quick Answer
  • Prometheus scrapes metrics from services and stores them in a time-series database
  • Grafana visualises those metrics on customizable dashboards without writing UI code
  • Alertmanager handles notifications when metrics cross thresholds
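  • A well-provisioned Prometheus server can ingest hundreds of thousands of samples per second; series cardinality, not raw sample rate, is usually the bottleneck
  • Scrape targets that silently go missing after a network or address change are among the most common causes of monitoring outages
  • Over-scraping is common: an interval shorter than the time targets need to respond produces timed-out scrapes and gaps (Prometheus rejects a scrape_timeout longer than the scrape_interval)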
Plain-English First

Imagine your app is a car engine. You wouldn't drive cross-country without a dashboard showing your speed, fuel, and temperature — you'd break down without warning. Prometheus is the set of sensors bolted to that engine, constantly measuring everything. Grafana is the beautiful dashboard on your steering wheel that turns those raw sensor readings into dials you can actually understand at a glance. Without this combo, you're driving blind.

Every production system fails eventually — the only question is whether YOU find out first, or your users do. In 2024, a five-minute outage at a mid-sized SaaS company can cost tens of thousands of dollars and destroy user trust built over months. The teams that catch problems in seconds rather than minutes aren't lucky — they have observability pipelines built with tools like Prometheus and Grafana that surface anomalies the moment they appear, not after a support ticket rolls in.

Before Prometheus became the de-facto standard for cloud-native monitoring, teams were duct-taping together cron jobs, custom scripts, and expensive APM vendors to answer the simplest question: 'Is my service healthy right now?' Prometheus solves this with a pull-based model that scrapes metrics from your services on a schedule, stores them in a time-series database, and lets you query them with a powerful expression language called PromQL. Grafana then plugs into that database and lets you visualise, alert on, and share those metrics without writing a single line of UI code.

By the end of this article you'll have a fully working Prometheus and Grafana stack running locally via Docker Compose, a real Node.js app exposing custom business metrics, a PromQL query that actually answers a business question, and an alerting rule that fires before your users notice a problem. This is the exact setup you'd use as a foundation for a production monitoring stack.

What is Prometheus and Grafana Setup?

Prometheus is an open-source time-series database and monitoring system that scrapes metrics from configured targets at regular intervals. Grafana is the analytics and visualization layer that queries Prometheus (and many other data sources) to build dashboards, alerts, and ad-hoc queries.

The combination gives you a complete observability pipeline: instrument your code → expose metrics → scrape → store → query → alert → visualize. No external SaaS required, no per-seat licensing, and full control over retention.

You don't need to be an SRE to run this. A single Docker Compose file gets you a working stack in under 10 minutes. But the simplicity hides depth — get the scrape cadence wrong and you'll either burn your metrics disk or miss critical data.

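Production Insight
A 15-second scrape interval (the value in most example configs; the built-in default is 1m) is fine for CPU and memory but can smooth over short request-rate spikes.
Scrape jobs that expose fast-moving request metrics (latency per endpoint) at around 10s, and slow-moving ones (disk space) at 60s.
Rule: intervals are set per scrape job, not per metric, so split targets into jobs by how fast their metrics move and set scrape_interval and scrape_timeout on each job, keeping the timeout no longer than the interval.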
Key Takeaway
Prometheus stores what you scrape, not what you hope to debug later.
Design your metrics with query patterns in mind from day one.
Scrape sloppy, debug blind.
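
The per-job rule can be checked mechanically before a config ships. The sketch below is a minimal illustration, not Prometheus's own validation: the job objects are hypothetical JavaScript stand-ins for scrape_configs entries, durationToMs and checkScrapeJobs are names invented here, and only simple durations such as 10s or 1m are parsed.

```javascript
// Convert a simple duration ("500ms", "10s", "1m", "2h") to milliseconds.
// Compound values such as "1m30s" are deliberately not handled in this sketch.
function durationToMs(text) {
  const match = /^(\d+)(ms|s|m|h)$/.exec(text);
  if (!match) throw new Error(`unsupported duration: ${text}`);
  const unitMs = { ms: 1, s: 1000, m: 60000, h: 3600000 };
  return Number(match[1]) * unitMs[match[2]];
}

// Return a list of human-readable problems for the given (hypothetical) job objects.
function checkScrapeJobs(jobs) {
  const problems = [];
  for (const { job_name, scrape_interval, scrape_timeout } of jobs) {
    if (!scrape_interval || !scrape_timeout) {
      problems.push(`${job_name}: set both scrape_interval and scrape_timeout explicitly`);
      continue;
    }
    if (durationToMs(scrape_timeout) > durationToMs(scrape_interval)) {
      problems.push(`${job_name}: scrape_timeout ${scrape_timeout} is longer than scrape_interval ${scrape_interval}`);
    }
  }
  return problems;
}

const jobs = [
  { job_name: "api-latency", scrape_interval: "10s", scrape_timeout: "5s" },
  { job_name: "disk-space", scrape_interval: "60s" },                       // timeout missing
  { job_name: "payments", scrape_interval: "5s", scrape_timeout: "10s" },   // timeout > interval
];
const scrapeProblems = checkScrapeJobs(jobs);
console.log(scrapeProblems);
```

Run against the three sample jobs, the check flags disk-space (no explicit timeout) and payments (timeout longer than interval) and passes api-latency.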
Figure: Prometheus & Grafana Setup Flow (thecodeforge.io). From static IPs to reliable monitoring with service discovery: Static IPs Cause Outages (hardcoded targets break when IPs change) → Service Discovery (Consul or Kubernetes for dynamic targets) → Docker Compose Stack (Prometheus, Grafana, exporters in containers) → Instrument Application (expose /metrics endpoint with client libs) → PromQL Queries (query metrics for dashboards and alerts) → Alerting & Storage (avoid disk full and 3 AM false alarms). Warning: static IPs in scrape configs cause silent outages; always use service discovery for production monitoring.

Prometheus Architecture Flow Visual

Understanding the data flow from instrumentation to alerting is critical for debugging production issues. The following diagram shows how metrics travel from your application to Prometheus, then to Alertmanager and Grafana.

Flow overview:
  • Your application exposes metrics at /metrics (pull-based).
  • Prometheus scrapes these targets based on its scrape_configs and stores the data in its TSDB.
  • Prometheus evaluates alerting rules against the stored data and pushes alerts to Alertmanager.
  • Alertmanager handles deduplication, grouping, and routing to notification channels (PagerDuty, Slack, email).
  • Grafana queries Prometheus via its API to render dashboards.

The key architectural decision is that Prometheus pulls from targets, not the other way around. This makes it resilient to network partitions on the target side and gives you control over scrape cadence.

Pull vs Push Misconception
If a target sits behind a firewall that Prometheus cannot reach, it cannot be scraped, and a target whose IP changes needs service discovery rather than a static entry. Use the Pushgateway for short-lived batch jobs and the Blackbox exporter to probe endpoints from the outside. The pull model is a feature, not a bug: it makes the monitoring system the source of truth for target availability.
Production Insight
The flow diagram maps directly to debugging steps:
- No data in Grafana? Check the Prometheus API first.
- No alert fired? Check alerting rules and Alertmanager status.
- Missing targets? Verify scrape configuration and target reachability.
Keep this flow in mind when triaging — it prevents jumping to wrong conclusions.
Key Takeaway
Metrics flow from targets to Prometheus to Alertmanager and Grafana.
Each hop is a potential failure point.
Debug upstream first (Prometheus API), then downstream (Alertmanager, Grafana).
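
The "check the Prometheus API first" step can be scripted. The following is a small sketch that filters the JSON returned by the /api/v1/targets endpoint down to the targets that are not healthy. The function name unhealthyTargets and the sample payload are made up for illustration; only the field names (data.activeTargets, labels, health, lastError) follow the targets API.

```javascript
// Summarise unhealthy targets from a /api/v1/targets response body.
function unhealthyTargets(apiResponse) {
  const targets = apiResponse?.data?.activeTargets ?? [];
  return targets
    .filter((t) => t.health !== "up")
    .map((t) => ({
      job: t.labels?.job ?? "unknown",
      instance: t.labels?.instance ?? "unknown",
      health: t.health,
      lastError: t.lastError || "(none reported)",
    }));
}

// Hand-written sample in the shape of the targets API; all values are invented.
const sampleResponse = {
  status: "success",
  data: {
    activeTargets: [
      { labels: { job: "node-app", instance: "app:3001" }, health: "up", lastError: "" },
      { labels: { job: "payments", instance: "172.18.0.5:8080" }, health: "down", lastError: "connection refused" },
    ],
  },
};

const down = unhealthyTargets(sampleResponse);
console.log(down);
```

Against a live server, the same function can be fed the parsed body of a GET request to /api/v1/targets on the Prometheus HTTP port. A hardcoded container IP such as the one in the sample is exactly the kind of target that shows up here after a restart.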
Prometheus and Grafana Data Flow
Diagram: Application / Exporter (exposes /metrics) → Prometheus Server (scrapes) → TSDB on Disk (stores). Prometheus Server (evaluates rules) → Alertmanager (sends notifications) → PagerDuty / Slack / Email. Grafana (query API) → Prometheus Server; Grafana (dashboards) → User/Operator.

Setting Up the Stack with Docker Compose

Here's a minimal docker-compose.yml that runs Prometheus, Grafana, and a Node.js app that exposes custom metrics. We'll use the prom/prometheus and grafana/grafana official images. The Node.js app uses the prom-client library.

```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
  app:
    build: ./app
    ports:
      - "3001:3001"
    environment:
      - METRICS_PORT=3001
```

The Prometheus config file (prometheus.yml) must define scrape targets. Notice we reference services by Docker Compose service name — the embedded DNS resolver resolves them.

```yaml
scrape_configs:
  - job_name: 'node-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['app:3001']
```

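If the services sit on more than one network, make sure Prometheus and the app share one of them (or add a network alias) so the service name resolves. To keep Prometheus from starting before the app is ready, give the app a healthcheck and use depends_on with condition: service_healthy.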

Common Pitfall: depends_on is not a readiness check
A plain depends_on only orders container start-up, so Prometheus can start before your app is ready to serve metrics. Add a healthcheck to the app service and a depends_on entry with condition: service_healthy on the prometheus service. Otherwise the target shows as DOWN until a later scrape succeeds.
Production Insight
Docker Compose DNS resolves service names internally, but if you override network_mode: host, service discovery breaks.
Always verify connectivity with docker compose exec prometheus wget -q -O- http://app:3001/metrics.
Rule: test the scrape endpoint from inside the Prometheus container, not from the host.
Key Takeaway
Compose files are reproducible local stacks — but don't treat them as production config.
Always add health checks and conditional depends_on.
Scrape target resolution fails silently; verify before relying on it.

Service Discovery: Configuring Consul, Kubernetes, and Docker Targets

Static targets work fine for a single Docker Compose stack, but in production your services scale, restart, and move IPs. That's where service discovery comes in. Prometheus supports multiple discovery mechanisms that dynamically generate targets from your infrastructure catalog.

Consul Service Discovery

If you run Consul, Prometheus can discover targets by querying the Consul API. The following config scrapes all services registered with a metrics tag. Discovery alone does not filter on health status; add a relabel rule on __meta_consul_health if you only want instances that are passing their checks.

```yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        tags:
          - metrics
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_service_address, __meta_consul_service_port]
        separator: ':'
        target_label: __address__
```

Kubernetes Service Discovery

On Kubernetes, Prometheus uses the API to watch pods, services, endpoints, and ingresses. The most common pattern is scraping pods based on annotations.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```

Docker Service Discovery

For containers on a standalone Docker host, use docker_sd_configs (Docker Swarm has its own dockerswarm_sd_configs). This config discovers running containers and keeps only those that carry the prometheus.io/scrape: true label.

```yaml
scrape_configs:
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: 'unix:///var/run/docker.sock'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        replacement: '$1'
        target_label: container
```

Each SD config uses relabel_configs to map the discovered metadata into Prometheus target labels. The key is to extract the IP, port, and any meaningful labels (service name, environment) from the provider's metadata.
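
To see what a replace rule does to discovered metadata, it helps to run one by hand. The sketch below imitates a single relabel step in plain JavaScript, using the address-rewriting regex from the Kubernetes example. It is an approximation for illustration only: relabelReplace is a made-up helper, Prometheus uses RE2 rather than JavaScript regular expressions, and the pod address and port values are invented.

```javascript
// Mimic one "replace" relabel step: join the source label values with the
// separator, match the fully anchored regex, and expand $1, $2 in the replacement.
function relabelReplace(labels, { sourceLabels, separator = ";", regex, replacement, targetLabel }) {
  const joined = sourceLabels.map((name) => labels[name] ?? "").join(separator);
  const anchored = new RegExp(`^(?:${regex})$`);
  const match = anchored.exec(joined);
  if (!match) return labels; // no match: labels are left untouched
  const value = replacement.replace(/\$(\d+)/g, (_, n) => match[Number(n)] ?? "");
  return { ...labels, [targetLabel]: value };
}

const rule = {
  sourceLabels: ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
  regex: "([^:]+)(?::\\d+)?;(\\d+)",
  replacement: "$1:$2",
  targetLabel: "__address__",
};

// Pod with a prometheus.io/port annotation: the port in __address__ is swapped.
const annotated = relabelReplace(
  { __address__: "10.0.0.5:8080", __meta_kubernetes_pod_annotation_prometheus_io_port: "9102" },
  rule
);
console.log(annotated.__address__); // 10.0.0.5:9102

// Pod without the annotation: the regex does not match, so the address is kept.
const plain = relabelReplace({ __address__: "10.0.0.5:8080" }, rule);
console.log(plain.__address__); // 10.0.0.5:8080
```

The joined string for the first pod is 10.0.0.5:8080;9102, which is why the regex contains a semicolon: it is the default separator between source label values.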

Security: Don't Expose the Docker Socket Directly
Giving Prometheus access to the Docker socket is a security risk. Use a dedicated Docker API proxy with read-only access, or switch to file-based discovery if your orchestrator doesn't change often. The docker_sd_configs approach should only be used in trusted environments.
Production Insight
Service discovery eliminates the static IP problem, but introduces latency: changes may take up to refresh_interval to propagate. In Consul, this is typically 30s. In Kubernetes, the watch mechanism is near-instant. Always set refresh_interval based on your scaling cadence — 5s for autoscaling services, 60s for long-running VMs.
Rule: combine SD with a blackbox exporter that pings targets from Prometheus's perspective to validate connectivity independent of SD.
Key Takeaway
Static targets fail in dynamic environments.
Service discovery ties Prometheus to your orchestrator's truth.
Use relabel_configs to transform provider metadata into meaningful labels.

Instrumenting Your Application

Your app must expose a /metrics HTTP endpoint that Prometheus can scrape. For Node.js, the prom-client library collects metrics and serialises them in the Prometheus text exposition format. Create a simple Express app that tracks request count and latency.

```javascript
const express = require('express');
const prometheus = require('prom-client');

const app = express();
// Dedicated registry so only the metrics registered here are exposed.
const register = new prometheus.Registry();

// Default process metrics: CPU, memory, event loop lag, GC.
prometheus.collectDefaultMetrics({ register });

// Histogram observed in seconds; its _count series doubles as the request counter.
const httpRequestDurationSeconds = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register]
});

// Time every request; labels are attached once the response has finished.
app.use((req, res, next) => {
  const end = httpRequestDurationSeconds.startTimer();
  res.on('finish', () => {
    // req.route.path is the route template, which keeps the route label bounded.
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.get('/hello', (req, res) => res.json({ message: 'hello' }));

// METRICS_PORT is set in docker-compose.yml; fall back to 3001 for local runs.
const port = process.env.METRICS_PORT || 3001;
app.listen(port, () => console.log(`Metrics at http://localhost:${port}/metrics`));
```

Key points: register default metrics (CPU, memory, event loop lag), define custom histograms with appropriate buckets, and always use labelNames that correspond to production dimensions (method, route, status). Too many unique label combinations will cause high cardinality — be surgical.

Mental Model: Metrics Are Dimensions, Not Counters
  • A histogram labelled with method, route, and status creates a separate set of series for every (method, route, status) triplet.
  • 5 methods × 20 routes × 5 statuses gives 500 label combinations, which means up to 500 series for each histogram bucket.
  • Prometheus memory grows with the number of active series; as a rough guide, 1M series needs on the order of 2GB of RAM or more.
  • Aggressively limit label variability: use the route template (for example /users/:id) instead of the raw URL path, or aggregate before ingestion.
Production Insight
High cardinality is the silent killer of Prometheus. A single gauge with a user_id label creates one time series per user: 10k users means 10k series for that one metric.
If you must track per-user metrics, push them to a separate TSDB like VictoriaMetrics with higher cardinality tolerance.
Rule: never put unbounded labels (user_id, request_id, session_id) on metrics.
Key Takeaway
Instrumentation is a contract: you decide what's measurable.
Every label you add multiplies the storage cost.
Measure what matters, not everything that moves.
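
The multiplication behind that warning is easy to run before adding a label. This is a back-of-the-envelope sketch: labelCombinations and histogramSeries are illustrative helper names, the counts are upper bounds (not every combination occurs in practice), and the bucket count of 6 matches the histogram defined in the app above.

```javascript
// The number of label combinations is the product of distinct values per label.
function labelCombinations(valuesPerLabel) {
  return Object.values(valuesPerLabel).reduce((product, n) => product * n, 1);
}

// A classic histogram exposes one series per bucket (including +Inf),
// plus one _sum and one _count series, for every label combination.
function histogramSeries(valuesPerLabel, explicitBuckets) {
  const perCombination = explicitBuckets + 1 /* +Inf */ + 2 /* _sum, _count */;
  return labelCombinations(valuesPerLabel) * perCombination;
}

const labelValues = { method: 5, route: 20, status: 5 };

console.log(labelCombinations(labelValues));                            // 500
console.log(histogramSeries(labelValues, 6));                           // 4500
console.log(histogramSeries({ ...labelValues, user_id: 10000 }, 6));    // 45000000
```

One extra unbounded label turns a few thousand series into tens of millions, which is why user_id, request_id, and session_id stay out of label sets.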

PromQL: The Query Language That Makes Metrics Useful

PromQL is the expression language that turns stored metrics into actionable insights. It's deceptively simple — rate(http_requests_total[5m]) gives you requests per second — but mastering aggregations, offset, and subqueries separates the pro from the panic.

  • Request rate per route: sum by (route) (rate(http_request_duration_seconds_count{job="node-app"}[5m]))
  • 95th percentile latency: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
  • Error ratio: sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

Common gotcha: rate() requires a counter. Use it on metrics ending in _total, or on the _count, _sum, and _bucket series of a histogram. For gauges (memory in use, queue depth), use avg_over_time() or max_over_time().
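
Why rate() only makes sense on counters becomes clearer when the calculation is written out. The sketch below is a simplified model, not Prometheus's implementation: simpleRate is an invented name, the samples are made up, and the real function also extrapolates to the edges of the time window.

```javascript
// samples: [{ t: seconds, v: counterValue }, ...] sorted by time.
function simpleRate(samples) {
  if (samples.length < 2) return null; // at least two points are needed
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter restarted from zero, so the new value is all increase.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return increase / seconds;
}

const counterSamples = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 160 },
  { t: 45, v: 20 }, // process restarted; counter reset
  { t: 60, v: 50 },
];

// Increase is 30 + 30 + 20 + 30 = 110 over 60 seconds.
console.log(simpleRate(counterSamples).toFixed(2)); // 1.83
```

The reset handling is the catch: every decrease is treated as a restart. On a gauge, which legitimately goes down, each dip would be counted as a fresh increase and the result would be meaningless.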

Another trap: the aggregation inside histogram_quantile must keep the le label. Aggregating by (le) alone gives you one value for your entire app; use by (le, route) to get per-route latencies.

queries.promql
```promql
# Request rate per route (counter)
sum by (route) (rate(http_request_duration_seconds_count{job="node-app"}[5m]))

# 95th percentile latency per route (histogram)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="node-app"}[5m])) by (le, route)
)

# Error ratio (status 5xx vs total)
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# CPU usage of the app container in cores (a counter, so rate() applies)
sum by (name) (rate(container_cpu_usage_seconds_total{name="app"}[1m]))
```
PromQL Tip: Always Use Aggregators
Queries without by() or without() may return multiple time series depending on label values. Use sum by(...) or avg by(...) to reduce dimensionality and make dashboards predictable.
Production Insight
PromQL queries that touch many high-cardinality series can time out (the Prometheus default query timeout is 2m, and Grafana or a proxy in front may give up sooner).
Use recording rules for expensive queries (e.g., an hourly aggregated error rate) to precompute them, for example every 5 minutes.
Rule: never run an unaggregated PromQL query on a dashboard with auto-refresh under 30s.
Key Takeaway
PromQL is powerful but expensive.
Aggregate early, aggregate often.
Use recording rules for dashboard queries; your CPU budget will thank you.

PromQL Common Query Cheat Sheet (rate, sum, increase)

Here's a quick-reference cheat sheet for the three most frequently used PromQL functions. Bookmark it for when you need to write a query fast.

rate(): per-second average rate of increase over a time window (for counters)
  • rate(metric_total[5m]) → per-second rate over the last 5 minutes
  • rate(http_requests_total{status="500"}[10m]) → error rate per second
  • Always use with counters, such as metrics ending in _total or _count

increase(): total increase over a time window (for counters)
  • increase(http_requests_total[1h]) → total requests in the last hour
  • increase(errors_total[24h]) → total errors in the last day
  • Useful for billing or compliance metrics that need cumulative totals

sum(): aggregates time series across labels
  • sum(rate(http_requests_total[5m])) → total request rate across all instances
  • sum by (service) (rate(http_requests_total[5m])) → request rate per service
  • Combine with rate() or increase() to reduce dimensionality

Common combinations:
```promql
# Slow requests (> 1s) in the last minute: all requests minus those in the le="1" bucket
sum(increase(http_request_duration_seconds_count[1m]))
-
sum(increase(http_request_duration_seconds_bucket{le="1"}[1m]))

# 99th percentile latency by route
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

# Fraction of failed requests per route (multiply by 100 for a percentage)
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (route) (rate(http_requests_total[5m]))

# Memory usage per container in MiB
container_memory_usage_bytes / 1024 / 1024
```

Quick rules of thumb:
  • Counter metric → rate() or increase()
  • Gauge metric → avg_over_time(), max_over_time(), or raw value
  • Histogram metric → histogram_quantile() + rate(_bucket)
  • High cardinality → sum by() to collapse
  • No aggregation → one result per series (often not what you want)

promql-cheat-sheet.promql
```promql
# RATE: Requests per second
rate(http_requests_total[5m])

# INCREASE: Total requests in last hour
increase(http_requests_total[1h])

# SUM + RATE: Total error rate across all instances
sum(rate(http_requests_total{status=~"5.."}[5m]))

# SUM BY: Error rate per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))

# HISTOGRAM QUANTILE: 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

# RATIO: Error fraction per route
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (route) (rate(http_requests_total[5m]))

# COUNTER: CPU usage as a percentage of one core (container_cpu_usage_seconds_total is a counter)
sum by (name) (rate(container_cpu_usage_seconds_total[1m])) * 100
```
Common Mistake: Using rate() on a gauge
rate() is only valid for counters (always increasing). Using it on a gauge that goes up and down (e.g., memory usage) will produce nonsense values. Use avg_over_time() or delta() for gauges.
Production Insight
A well-tuned PromQL query is the difference between a 10ms dashboard load and a 10s timeout.
Always use recording rules for queries that appear on multiple dashboards or have a heavy by() clause.
In a production incident, increase(errors_total[5m]) is quicker to read than rate() when you need the raw count.
Key Takeaway
Memorize the rate, increase, sum combo.
Counter → rate/increase. Gauge → average/raw.
Aggregate by meaningful labels to control cardinality.

Alerting and Grafana Dashboards

Prometheus alerting works in two parts: alerting rules in Prometheus, and alert routing by Alertmanager. Rules are defined in a YAML file and loaded by Prometheus. They are evaluated every evaluation_interval (1m by default, 15s in many example configs) and can carry different severity labels.

Example rule for high error rate:
```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # Aggregate by job so that {{ $labels.job }} is available in the annotation
        expr: |
          (
            sum by (job) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
```

The for: clause prevents flapping — the condition must be true for 2 minutes before an alert fires. Configure Alertmanager to send to email, PagerDuty, or Slack.
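
The effect of for: is easiest to see as a small state machine. The sketch below is a simplified model with invented names and timings (alertStates, a 60-second evaluation cycle): an alert moves from inactive to pending when the expression first becomes true, and to firing only once it has stayed true for the whole hold duration.

```javascript
// Walk through rule evaluations and report the alert state after each one.
// evaluations: [{ t: seconds, conditionTrue: boolean }, ...] in time order.
function alertStates(evaluations, forSeconds) {
  let activeSince = null;
  return evaluations.map(({ t, conditionTrue }) => {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the timer
      return "inactive";
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? "firing" : "pending";
  });
}

// The condition flaps once, then stays true; evaluated every 60s with for: 2m.
const states = alertStates(
  [
    { t: 0, conditionTrue: true },
    { t: 60, conditionTrue: false },
    { t: 120, conditionTrue: true },
    { t: 180, conditionTrue: true },
    { t: 240, conditionTrue: true },
  ],
  120
);
console.log(states); // [ 'pending', 'inactive', 'pending', 'pending', 'firing' ]
```

The single true evaluation at t=0 never pages anyone. The trade-off is detection delay: with for: 2m a real outage is reported at least two minutes after it starts.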

Grafana dashboards can surface these alerts. Create a panel with the query ALERTS{alertstate="firing"} to show active alerts. Better yet, use the Alertmanager data source in Grafana for a dedicated alerts page.

For dashboards, follow the SRE golden signals: latency, traffic, errors, saturation. Start with a system overview panel, then add drill-down panels per service.

alerting-rules.yml
```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
      - alert: InstanceDown
        expr: up{job="node-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
```
Mental Model: Alert Fatigue is a Design Problem
Every alert that fires but doesn't require a human action is noise; it trains your team to ignore the next alert.
  • Alert on symptoms (error rate >5%), not causes (CPU >80%).
  • Use for: to avoid flapping alerts from transient spikes.
  • Only escalate to on-call if the alert requires human intervention within 15 minutes.
  • Dashboards show metrics; alerts show anomalies.
Production Insight
Alertmanager can receive thousands of alerts during an incident, but human brains can process 3-5 per minute.
Use inhibition rules to mute less-severe alerts when a critical alert fires (e.g., mute all InstanceDown alerts if HighErrorRate is already firing for the same cluster).
Rule: one incident should produce one page, not one page per affected alert.
Key Takeaway
Alerting isn't about being loud; it's about being precise.
Every alert should answer: what's broken, why, and what to fix.
If an alert has no clear action attached, it belongs on a dashboard, not a pager.

Alerting That Won't Wake You at 3 AM for Nothing

Most setups stop at dashboards. That’s not monitoring — that’s a screensaver. Real monitoring alerts you when something breaks, not when a metric twitches. Prometheus Alertmanager does this, but it takes configuration to avoid noise.

Start with recording rules. They pre-compute expensive queries so your alert rules don’t hammer Prometheus. Write alert rules that trigger on sustained anomalies, not single spikes. A 5-second CPU burst is noise. 5 minutes at 95% is a problem.

Route alerts by severity and team. Critical goes to PagerDuty, warnings go to Slack, info emails get archived. Use inhibition rules to suppress low-severity alerts when a higher one fires. No one cares about disk latency when the node is down.
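
The routing and inhibition logic described above can be modelled in a few lines. This is a toy sketch, not Alertmanager's implementation: the receiver names, the NodeDown and DiskLatencyHigh alerts, and the rule object are all hypothetical, and the rule shape only loosely mirrors an inhibit rule (source matchers, target matchers, and a list of labels that must be equal).

```javascript
// Hypothetical severity-to-receiver mapping, mirroring the routing described above.
const receivers = { critical: "pagerduty", warning: "slack", info: "email-archive" };

const matches = (alert, matchers) =>
  Object.entries(matchers).every(([label, value]) => alert.labels[label] === value);

// An alert is inhibited when it matches the target matchers and some other alert
// matches the source matchers with identical values for every label in `equal`.
function isInhibited(alert, allAlerts, rule) {
  if (!matches(alert, rule.target)) return false;
  return allAlerts.some(
    (source) =>
      source !== alert &&
      matches(source, rule.source) &&
      rule.equal.every((label) => source.labels[label] === alert.labels[label])
  );
}

function notifications(alerts, rule) {
  return alerts
    .filter((a) => !isInhibited(a, alerts, rule))
    .map((a) => `${receivers[a.labels.severity] ?? "default"} <- ${a.labels.alertname} (${a.labels.instance})`);
}

const inhibitRule = { source: { alertname: "NodeDown" }, target: { severity: "warning" }, equal: ["instance"] };
const firing = [
  { labels: { alertname: "NodeDown", severity: "critical", instance: "node-1" } },
  { labels: { alertname: "DiskLatencyHigh", severity: "warning", instance: "node-1" } },
  { labels: { alertname: "DiskLatencyHigh", severity: "warning", instance: "node-2" } },
];

const sent = notifications(firing, inhibitRule);
console.log(sent);
```

Only two notifications go out: the NodeDown page for node-1 and the disk latency warning for node-2. The warning on node-1 is suppressed because the node it refers to is already known to be down.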

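Test your alerts. Before deploying, run promtool check rules and promtool test rules against your rule files, and amtool check-config against your Alertmanager configuration. Simulate failures in staging. If your first real alert is a false positive, your team stops trusting the system.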

alerting-rules.yml
```yaml
# Alert rule for high CPU: fires after 5 minutes sustained
groups:
  - name: node_alerts
    rules:
      - alert: HighCpuUsage
        # Busy fraction = 1 - idle fraction, averaged over all cores of an instance
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} CPU at {{ $value | humanizePercentage }}"
          description: "CPU >95% for 5 min on {{ $labels.instance }}"
```
Output
Alert rules compiled successfully.
Testing rule node_alerts/HighCpuUsage...
PASS
Production Trap:
Never alert on 'for: 0m'. That fires immediately. You'll get paged for a cron job spike that lasts 30 seconds. Always set a duration.
Key Takeaway
Alert on symptoms, not causes. CPU high matters; CPU high for 5 minutes matters more.

Prometheus Storage: Don't Run Out of Disk at 2 PM on a Tuesday

Prometheus stores time-series data locally by default. That data has a retention period — set it or it fills your disk. Default is 15 days. Adjust based on how far back you need to query. If you never look at last week, drop retention to 7 days and save space.

Prometheus writes data in blocks. Each block initially covers two hours and is immutable once written; blocks are later compacted into larger ones. Retention removes whole blocks, not individual samples, so data is only deleted once an entire block falls outside the retention window, and disk usage can briefly exceed what the retention setting suggests.

TSDB retention is configured with Prometheus startup flags: --storage.tsdb.retention.time=30d. Pair that with --storage.tsdb.retention.size=100GB to cap disk usage. If your blocks exceed 100GB, Prometheus drops the oldest data regardless of age. Set both flags, not just one.
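
Whether the time limit or the size limit bites first can be estimated up front. The sketch below applies the commonly cited sizing rule of thumb (retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression). The function name and the input figures are assumptions chosen for illustration, and real usage varies with series churn and compaction.

```javascript
// needed disk ≈ retention_seconds * ingested_samples_per_second * bytes_per_sample
function estimateDiskBytes({ retentionDays, activeSeries, scrapeIntervalSeconds, bytesPerSample = 2 }) {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  return retentionDays * 86400 * samplesPerSecond * bytesPerSample;
}

// Assumed workload: 500k active series scraped every 15s, kept for 30 days.
const diskBytes = estimateDiskBytes({ retentionDays: 30, activeSeries: 500000, scrapeIntervalSeconds: 15 });
console.log(`${(diskBytes / 1e9).toFixed(1)} GB`); // 172.8 GB
```

With these inputs the estimate is well above 100GB, so a 100GB size cap would start dropping blocks before the 30-day time limit is reached. That is the situation the two flags together are meant to make explicit.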

For long-term storage, use Thanos or Cortex. They let you keep years of data in object storage (S3, GCS). Prometheus itself is ephemeral — its local storage is for fast queries on recent data. Archive everything older to external storage.

prometheus-retention-config.yaml
```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Retention and the storage path are command-line flags, not prometheus.yml settings:
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.retention.size=100GB
#   --storage.tsdb.path=/data/prometheus

# Example docker-compose service override (a separate file)
services:
  prometheus:
    image: prom/prometheus:
    volumes:
      - prometheus_data:/prometheus
    command:
      # Overriding command replaces the image's default flags, so restate them
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=100GB'

volumes:
  prometheus_data:
```
Output
tsdb: start of block: 2024-03-15T14:00:00Z, end: 2024-03-15T16:00:00Z, duration: 2h0m0s
current retention: 30d
current retention size: 100GB
blocks to delete: 0
Senior Shortcut:
Run du -sh /data/prometheus/* to see the size of each block directory. If a freshly written two-hour block exceeds about 1GB, your scrape interval is too aggressive or you're scraping too many series.
Key Takeaway
Set both time and size retention in Prometheus. Disk is not infinite. Archive old data with Thanos or Cortex.

Instrumenting a Spring Boot App

Why instrument your app? Without metrics, you're flying blind. Spring Boot makes this trivial with Micrometer, a facade that feeds Prometheus. Add micrometer-registry-prometheus to your pom.xml or build.gradle. Expose the /actuator/prometheus endpoint. That single change gives you JVM metrics—heap, threads, GC pauses—plus HTTP request counters and timers. Custom metrics? Annotate a method with @Timed or wire a MeterRegistry bean to record counters and gauges. The real power: you can aggregate these across all instances. CPU spikes become visible across your fleet. Out-of-memory warnings appear before the crash. This is observability without a PhD. One dependency, one endpoint, infinite clarity.

application.yml (YAML)
# io.thecodeforge - devops tutorial

management:
  endpoints:
    web:
      exposure:
        include: prometheus
  metrics:
    tags:
      application: my-app
      environment: production
    export:
      prometheus:
        # Spring Boot 2.x key; on 3.x use management.prometheus.metrics.export.enabled
        enabled: true
Output
Prometheus metrics endpoint: /actuator/prometheus
Default tags: application=my-app, environment=production
Production Trap:
Never expose /actuator/prometheus to the public internet. Restrict access with a firewall or authentication gateway. Otherwise, anyone can read your internal metrics: endpoint names, traffic volumes, and JVM details.
Key Takeaway
One Micrometer dependency unlocks JVM, HTTP, and custom metrics. Instrument first, debug later.
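
On the Prometheus side, the actuator endpoint needs its own scrape job because it does not live at the default /metrics path. A minimal sketch, assuming the app is reachable as my-app on port 8080 (both hypothetical):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: 'spring-boot-app'
    # Actuator serves Prometheus metrics here, not at /metrics
    metrics_path: /actuator/prometheus
    scrape_interval: 15s
    static_configs:
      # Hypothetical DNS name and port; use a name, not a container IP
      - targets: ['my-app:8080']
```

After a reload, the job should appear on the Prometheus targets page with the application and environment tags attached to every series.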

Pitfall 1: Overloaded Prometheus

Prometheus puts no cap on the number of series it accepts per target by default (sample_limit defaults to 0, meaning unlimited). Push a server past what its memory can hold and scrapes slow down, queries lag, and storage balloons. Why it happens: instrumenting every SQL query, HTTP header, or loop iteration. Labels with high cardinality, like user_id or session_id, explode the time series count. A single label with 1,000 unique values creates 1,000 series per metric; across 100 metrics that is 100,000 series. To diagnose, check prometheus_tsdb_head_series for the total number of active series, and run promtool tsdb analyze against the data directory to find the worst metrics and labels. Fix by removing high-cardinality labels or pre-aggregating with recording rules: aggregate by a bounded label such as status (sum by (status)) rather than by user_id. Protect storage with retention limits. An overloaded Prometheus fails quietly, so trim early.

prometheus.yml (YAML)
# io.thecodeforge - devops tutorial

# Retention is set with startup flags rather than in this file:
#   --storage.tsdb.retention.time=15d   (keep at most 15 days)
#   --storage.tsdb.retention.size=50GB  (cap on-disk blocks at 50 GB)

scrape_configs:
  - job_name: 'overload-watch'
    scrape_interval: 15s
    # Fail the scrape if a target returns more than 5,000 samples
    sample_limit: 5000
Output
Retention: 15 days or 50 GB. Sample limit: 5,000 per scrape.
Production Trap:
A single misconfigured label like request_id can triple storage overnight. Track samples per target with scrape_samples_scraped, and alert on prometheus_target_scrapes_exceeded_sample_limit_total, which counts scrapes rejected for exceeding sample_limit.
Key Takeaway
High-cardinality labels kill Prometheus. Add sample_limit and analyze cardinality weekly.
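
The recording-rule approach might look like the sketch below. The metric api_requests_total and its user_id label are hypothetical stand-ins for whatever high-cardinality metric you find; the file name is also an assumption.

```yaml
# rules/cardinality.yml (hypothetical file name)
groups:
  - name: cardinality_rollups
    rules:
      # Pre-aggregate away the per-user dimension. Dashboards and alerts
      # query this series instead of the raw metric.
      - record: job_status:api_requests:rate5m
        expr: sum by (job, status) (rate(api_requests_total[5m]))
      # Track how many samples each job contributes per scrape.
      - record: job:scrape_samples_scraped:sum
        expr: sum by (job) (scrape_samples_scraped)
```

A recording rule only makes queries cheaper. The raw series still exist until the label is removed in the application or dropped at scrape time, so treat the rule as a first step rather than the whole fix.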

Pitfall 2: Dashboard Clutter

Why clutter is dangerous: too many panels hide the one signal that matters. A Grafana dashboard with 50 graphs (CPU, memory, and disk per host) overwhelms operators, who can't scan that many panels during an incident. Root cause: copying generic dashboards without trimming. Every panel adds cognitive load. In an incident, you need two things: what's failing and where. The rest is noise. Fix by following the RED method (Rate, Errors, Duration) for each service. Start with 3 panels per service. Aggregate by team. Use variables to filter, not multiply. Remove graphs that no one has looked at in the last 30 days. Provision dashboards from version control so changes are reviewed. A cluttered dashboard slows every incident response. Kill the noise.

dashboard-provider.yml (YAML)
# io.thecodeforge - devops tutorial

apiVersion: 1
providers:
  - name: 'production-dashboards'
    type: file
    # Reject dashboard changes saved from the UI
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
Output
Dashboards loaded from files. UI edits blocked. Enforce version control.
Production Trap:
Dashboards edited in the UI silently diverge from code. Enable provisioning with allowUiUpdates: false to force git-based changes only.
Key Takeaway
Adopt RED metrics (Rate, Errors, Duration). Keep 3 panels per service. Version-control everything.
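
The three RED panels can each be backed by one recording rule. This is a sketch, assuming Micrometer's default HTTP server metric names from a Spring Boot app and that percentile histograms are enabled so the _bucket series exist; adjust the names to your own instrumentation.

```yaml
# rules/red.yml (hypothetical file name)
groups:
  - name: red_method
    rules:
      # Rate: requests per second, per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_server_requests_seconds_count[5m]))
      # Errors: share of requests answered with a 5xx status
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_server_requests_seconds_count[5m]))
      # Duration: p99 latency across all instances of the job
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_server_requests_seconds_bucket[5m])))
```

Each Grafana panel then queries a single pre-computed series filtered by a job variable, which keeps the dashboard at three panels per service.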

Introduction: The Power of Knowing Your Systems

Before metrics, operations teams flew blind. A server crash meant frantic log dives, cascading failures caught hours late, and capacity planning was pure guesswork. Prometheus and Grafana flip that script by giving you real-time visibility into every moving part of your infrastructure. Prometheus scrapes metrics from targets, stores them as time-series data, and exposes a powerful query language (PromQL) for slicing and dicing. Grafana consumes that data to build dashboards and alerts that surface anomalies the moment they happen. Together, they transform opaque systems into instruments you can read and tune. This setup isn’t just about avoiding downtime—it’s about understanding performance trends, correlating events across services, and building confidence that your architecture can scale. When you know exactly what normal looks like, you spot the abnormal before it becomes a crisis.

Why This Matters:
Observability isn't a luxury; it's a requirement for any production system handling user traffic or revenue.
Key Takeaway
Visibility into your systems is the foundation of reliability and proactive troubleshooting.

The Story: From Chaos to Clarity

Our platform’s early days were a mess. A burst of user traffic would silently crash a backend service, but our monitoring—a cobbled-together Nagios with static thresholds—only alerted us after a full outage. We’d spend hours piecing together log fragments, chasing ghosts. The turning point came when we deployed Prometheus to scrape every container’s CPU, memory, and request latency. Grafana turned those raw numbers into a live dashboard: a single pane showing latency spikes, error rates, and resource saturation. Suddenly, we could see a slow database query pushing CPU to 90% thirty minutes before any alert fired. We tuned alerts from noise to signal: only page when p99 latency triples for five minutes. The result? Mean time to detection dropped from hours to under a minute. Our team stopped firefighting and started engineering—because we finally understood what our systems were saying.

The Turning Point:
When the first real incident hit after the setup, a Grafana panel showed the exact moment a misconfigured auto-scaler started starving a cache node. We fixed it in minutes, not hours.
Key Takeaway
Real clarity comes from correlating metrics over time, not just reacting to isolated thresholds.

Case Study: E-Commerce Turnaround

A mid-sized e-commerce site faced weekend flash sales that turned into outages: cart hangs, checkout freezes, and frustrated customers. Before Prometheus, they relied on basic cloud metrics (CPU and memory) but couldn't see the real bottleneck: database connection pool exhaustion caused by a spike in abandoned sessions. We instrumented their Node.js checkout service to expose custom metrics: checkout_requests_total, checkout_duration_seconds, and db_connections_in_use. Grafana displayed a heatmap of request latency across product categories, and plotting rate(checkout_requests_total[5m]) against the db_connections_in_use gauge showed the two diverging during sales. The fix: connection pool tuning and a circuit breaker in front of the database. The next flash sale saw zero downtime, and checkout latency dropped 40%. Prometheus didn't just monitor; it guided the architecture fix.

prometheus-alert-rule.yml (YAML)
# io.thecodeforge - devops tutorial
groups:
  - name: ecommerce
    rules:
      - alert: HighCheckoutLatency
        # Sum bucket rates by le so the p99 covers all instances
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(checkout_duration_seconds_bucket[5m]))
          ) > 3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Checkout p99 latency high"
          description: "Latency > 3s for 2 minutes"
Output
The alert triggers when p99 checkout latency exceeds 3 seconds for two consecutive minutes.
Production Trap:
Don't stop at generic system metrics. Expose business-specific ones (e.g., checkout_requests_total) to see what actually impacts users.
Key Takeaway
Custom business metrics turn Prometheus from a system monitor into a performance debugging tool.
● Production incident · POST-MORTEM · severity: high

The Silent Outage: Missing Metrics Alert

Symptom
Customers reporting payment failures but no alert fired. All service dashboards showed green.
Assumption
The team assumed Prometheus would auto-discover new container IPs via DNS or service discovery.
Root cause
Prometheus was configured with static targets pointing to container IPs that changed after a Docker restart. No service discovery was configured, and the new IPs were never scraped.
Fix
Switch to service discovery: use file_sd_configs, consul_sd_configs, or dns_sd_configs instead of hard-coded IPs. In Docker Compose, you can also target the service name or a network alias, which Docker's DNS resolves to the current container IP after every restart.
Key lesson
  • Never rely on static IPs for Prometheus targets in dynamic environments.
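  • Run a blackbox exporter probe against critical endpoints so you have an independent check that services are reachable, even when their own metrics go missing.
  • Alert on up == 0 with a short for: duration to catch scrape loss, and on absent(up{job="<job>"}) in case a job loses all of its targets.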
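
The fix and the alerting lesson can be sketched as two small files. Both assume a Compose service named payments that serves metrics on port 8080; those names are hypothetical and not taken from the incident.

```yaml
# prometheus.yml (fragment): resolve targets by name on every refresh
scrape_configs:
  - job_name: 'payments'
    dns_sd_configs:
      - names: ['payments']   # hypothetical Compose service name
        type: A               # plain A-record lookup, so the port is given here
        port: 8080
        refresh_interval: 30s
```

```yaml
# rules/scrape-health.yml (hypothetical file name)
groups:
  - name: scrape_health
    rules:
      # A configured target stopped answering scrapes
      - alert: TargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
      # The job has no targets at all, so there is no up series to compare
      - alert: JobHasNoTargets
        expr: absent(up{job="payments"})
        for: 5m
        labels:
          severity: critical
```

The second rule matters with service discovery: if the DNS lookup returns nothing, the up series disappears instead of dropping to 0, and only absent() will notice.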
Production debug guide: Symptom → Action guide for the most common production problems (4 entries)
Symptom · 01
Grafana dashboards show 'No data' for a panel
Fix
Check the Prometheus data source connection in Grafana. Verify that the query returns data in Prometheus's own expression browser. Common cause: time range mismatch — Grafana defaults to last 6 hours but data might be older.
Symptom · 02
Prometheus target shows 'DOWN' with connection refused
Fix
Run docker compose logs <service> to check if the metrics endpoint is listening. Verify the port in the scrape config matches the container's exposed port. Use curl http://localhost:<port>/metrics from the host to confirm accessibility.
Symptom · 03
High memory usage on Prometheus server
Fix
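Check cardinality: promtool tsdb analyze --extended /prometheus/data. Look for metrics with high label value counts. Common culprit: request_latency_seconds_bucket with a user_id label. Reduce label cardinality at the source or drop the offending series with metric_relabel_configs. Changing retention will not help much here, because memory is driven mainly by the number of active series.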
Symptom · 04
Alerts not firing despite rule evaluation showing pending
Fix
Alerts must remain in 'Pending' state for the configured for: duration before firing. Check ALERTS metric in Prometheus. Also verify Alertmanager is reachable from Prometheus and that inhibition rules aren't suppressing the alert.
★ Quick Debug Cheat Sheet for Prometheus & Grafana: run these commands in order when things go dark
Grafana shows no panels at all
Immediate action
Check browser console for 404 on datasource proxy
Commands
curl -s 'http://localhost:9090/api/v1/query?query=up'
curl -s "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up&time=$(date +%s)"
Fix now
Make sure the datasource URL in Grafana matches the Prometheus server URL (internal network name, not localhost).
Prometheus targets show as DOWN
Immediate action
Check if the target process is running and exposing /metrics
Commands
docker compose ps
docker compose logs <target-service> | grep -i metrics
Fix now
If the service is running, verify the metrics endpoint path. If the target serves metrics on a custom path, set metrics_path for that job in the scrape config (the default is /metrics).
Alertmanager not delivering alerts
Immediate action
Check alertmanager logs for receiver errors
Commands
docker compose logs alertmanager | tail -30
curl -s http://localhost:9093/api/v2/alerts | jq '. | length'
Fix now
Verify your receiver configuration (webhook URL, email SMTP) is correct. Use a test webhook service like webhook.site to verify delivery.
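
For the delivery test, a throwaway Alertmanager config that routes everything to a single webhook is usually enough. A minimal sketch; the URL is a placeholder you would replace with the unique address your test service gives you.

```yaml
# alertmanager.yml (test configuration only)
route:
  # Send every alert to the test receiver
  receiver: test-webhook
  group_wait: 10s
receivers:
  - name: test-webhook
    webhook_configs:
      # Placeholder URL, not a real endpoint
      - url: 'https://webhook.site/REPLACE-WITH-YOUR-ID'
        send_resolved: true
```

If the test webhook receives the payload but your real receiver does not, the problem is in the receiver settings rather than in routing or in the connection from Prometheus.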