Prometheus & Grafana Setup - Static IPs Cause Outages
No alert fired despite payment failures - static IP targets in Docker caused scrape loss after restart.
- Prometheus scrapes metrics from services and stores them in a time-series database
- Grafana visualises those metrics on customizable dashboards without writing UI code
- Alertmanager handles notifications when metrics cross thresholds
- A healthy Prometheus instance handles ~1M samples/second per core — cardinality is the bottleneck
- Missing scrape targets caused by network misconfiguration are among the most common production monitoring incidents
- Most engineers over-scrape: a scrape interval shorter than the time targets need to answer within the scrape timeout causes failed scrapes, gaps, and cascading failures
Imagine your app is a car engine. You wouldn't drive cross-country without a dashboard showing your speed, fuel, and temperature — you'd break down without warning. Prometheus is the set of sensors bolted to that engine, constantly measuring everything. Grafana is the beautiful dashboard on your steering wheel that turns those raw sensor readings into dials you can actually understand at a glance. Without this combo, you're driving blind.
Every production system fails eventually — the only question is whether YOU find out first, or your users do. In 2024, a five-minute outage at a mid-sized SaaS company can cost tens of thousands of dollars and destroy user trust built over months. The teams that catch problems in seconds rather than minutes aren't lucky — they have observability pipelines built with tools like Prometheus and Grafana that surface anomalies the moment they appear, not after a support ticket rolls in.
Before Prometheus became the de-facto standard for cloud-native monitoring, teams were duct-taping together cron jobs, custom scripts, and expensive APM vendors to answer the simplest question: 'Is my service healthy right now?' Prometheus solves this with a pull-based model that scrapes metrics from your services on a schedule, stores them in a time-series database, and lets you query them with a powerful expression language called PromQL. Grafana then plugs into that database and lets you visualise, alert on, and share those metrics without writing a single line of UI code.
By the end of this article you'll have a fully working Prometheus and Grafana stack running locally via Docker Compose, a real Node.js app exposing custom business metrics, a PromQL query that actually answers a business question, and an alerting rule that fires before your users notice a problem. This is the exact setup you'd use as a foundation for a production monitoring stack.
What is Prometheus and Grafana Setup?
Prometheus is an open-source time-series database and monitoring system that scrapes metrics from configured targets at regular intervals. Grafana is the analytics and visualization layer that queries Prometheus (and many other data sources) to build dashboards, alerts, and ad-hoc queries.
The combination gives you a complete observability pipeline: instrument your code → expose metrics → scrape → store → query → alert → visualize. No external SaaS required, no per-seat licensing, and full control over retention.
You don't need to be an SRE to run this. A single Docker Compose file gets you a working stack in under 10 minutes. But the simplicity hides depth — get the scrape cadence wrong and you'll either burn your metrics disk or miss critical data.
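To make the cadence trade-off concrete, here is a back-of-the-envelope sketch in JavaScript. It assumes an average of about 2 bytes per stored sample, and the function name and example numbers (150,000 active series, 15 days of retention) are hypothetical. Treat the result as an order-of-magnitude estimate, not a sizing guarantee.

```javascript
// Rough capacity estimate for a Prometheus instance.
// bytesPerSample is an assumed average, not a measured value.
function estimateIngestion({ activeSeries, scrapeIntervalSeconds, retentionDays, bytesPerSample = 2 }) {
  // Every active series produces one sample per scrape.
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  const diskBytes = samplesPerSecond * retentionSeconds * bytesPerSample;
  return { samplesPerSecond, diskBytes, diskGiB: diskBytes / 1024 ** 3 };
}

// Hypothetical workload: 150k series scraped every 15s, kept for 15 days.
const estimate = estimateIngestion({ activeSeries: 150000, scrapeIntervalSeconds: 15, retentionDays: 15 });
console.log(estimate.samplesPerSecond); // 10000
console.log(estimate.diskGiB.toFixed(1), 'GiB');
```

Halving the scrape interval doubles both numbers, which is why cadence is worth deciding deliberately rather than by copy-paste.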
Prometheus Architecture Flow Visual
Understanding the data flow from instrumentation to alerting is critical for debugging production issues. The overview below traces how metrics travel from your application to Prometheus, then to Alertmanager and Grafana.
Flow overview:
- Your application exposes metrics at /metrics (pull-based).
- Prometheus scrapes these targets based on its scrape_configs and stores the data in its TSDB.
- Prometheus evaluates alerting rules against the stored data and pushes alerts to Alertmanager.
- Alertmanager handles deduplication, grouping, and routing to notification channels (PagerDuty, Slack, email).
- Grafana queries Prometheus via its API to render dashboards.
The key architectural decision is that Prometheus pulls from targets, not the other way around. This makes it resilient to network partitions on the target side and gives you control over scrape cadence.
Setting Up the Stack with Docker Compose
Here's a minimal docker-compose.yml that runs Prometheus, Grafana, and a Node.js app that exposes custom metrics. We'll use the prom/prometheus and grafana/grafana official images. The Node.js app uses the prom-client library.
```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
  app:
    build: ./app
    ports:
      - "3001:3001"
    environment:
      - METRICS_PORT=3001
```
The Prometheus config file (prometheus.yml) must define scrape targets. Notice we reference services by Docker Compose service name — the embedded DNS resolver resolves them.
```yaml
scrape_configs:
  - job_name: 'node-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['app:3001']
```
Don't forget to add a network alias if you have multiple networks, or use depends_on: condition: service_healthy to ensure your app is ready before Prometheus starts scraping.
Set condition: service_healthy in the Prometheus depends_on block (this requires a healthcheck on the app service). Otherwise you'll see the target 'DOWN' on the first scrape.
Service Discovery: Configuring Consul, Kubernetes, and Docker Targets
Static targets work fine for a single Docker Compose stack, but in production your services scale, restart, and move IPs. That's where service discovery comes in. Prometheus supports multiple discovery mechanisms that dynamically generate targets from your infrastructure catalog.
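Consul Service Discovery
If you run Consul, Prometheus can discover targets by querying the Consul API. The following config scrapes all services that carry a metrics tag. It does not filter on health-check status by itself; add a relabel rule on the discovered health metadata if you only want passing services.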
```yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        tags:
          - metrics
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_service_address, __meta_consul_service_port]
        separator: ':'
        target_label: __address__
```
Kubernetes Service Discovery
On Kubernetes, Prometheus uses the API to watch pods, services, endpoints, and ingresses. The most common pattern is scraping pods based on annotations.
```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```
Docker Service Discovery
For individual containers on a Docker host, use Docker SD (Docker Swarm has its own dockerswarm_sd_configs mechanism). This config discovers running containers and scrapes those with the prometheus.io/scrape: true label.
```yaml
scrape_configs:
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: 'unix:///var/run/docker.sock'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        replacement: '$1'
        target_label: container
```
Each SD config uses relabel_configs to map the discovered metadata into Prometheus target labels. The key is to extract the IP, port, and any meaningful labels (service name, environment) from the provider's metadata.
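The address rewrite in the Kubernetes config is the least obvious of these relabel steps, so here is a simplified JavaScript model of one action: replace step. It assumes the documented behaviour that source label values are joined with the separator (default ';') and that the regex must match the whole joined string. The function name and the pod address are hypothetical, and JavaScript regexes only approximate the RE2 syntax Prometheus uses.

```javascript
// Simplified model of a single `action: replace` relabel step.
// Not Prometheus code: it only mirrors join -> anchored match -> substitute.
function relabelReplace(labels, { sourceLabels, separator = ';', regex, replacement = '$1', targetLabel }) {
  // Missing source labels contribute an empty string.
  const joined = sourceLabels.map((name) => labels[name] ?? '').join(separator);
  // The regex has to match the entire joined value.
  const anchored = new RegExp(`^(?:${regex})$`);
  if (!anchored.test(joined)) return { ...labels }; // no match: labels unchanged
  return { ...labels, [targetLabel]: joined.replace(anchored, replacement) };
}

// Hypothetical discovered pod: container port 8080, metrics annotated on 9102.
const discovered = {
  __address__: '10.0.0.12:8080',
  __meta_kubernetes_pod_annotation_prometheus_io_port: '9102',
};

const relabeled = relabelReplace(discovered, {
  sourceLabels: ['__address__', '__meta_kubernetes_pod_annotation_prometheus_io_port'],
  regex: '([^:]+)(?::\\d+)?;(\\d+)',
  replacement: '$1:$2',
  targetLabel: '__address__',
});

console.log(relabeled.__address__); // 10.0.0.12:9102
```

When the port annotation is absent the regex does not match, so the discovered address is left as it was.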
Because it needs access to the Docker socket, the docker_sd_configs approach should only be used in trusted environments.
Target changes take up to one refresh_interval to propagate. In Consul, this is typically 30s. In Kubernetes, the watch mechanism is near-instant. Always set refresh_interval based on your scaling cadence: 5s for autoscaling services, 60s for long-running VMs.
Instrumenting Your Application
Your app must expose a /metrics HTTP endpoint that Prometheus can scrape. For Node.js, the prom-client library renders metrics in the Prometheus text exposition format. Create a simple Express app that tracks request count and latency.
```javascript
const express = require('express');
const prometheus = require('prom-client');

const app = express();
const register = new prometheus.Registry();

// Default process metrics: CPU, memory, event loop lag.
prometheus.collectDefaultMetrics({ register });

// Request latency histogram; values are recorded in seconds.
const httpRequestDurationSeconds = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
register.registerMetric(httpRequestDurationSeconds);

// Time every request and record it once the response has been sent.
app.use((req, res, next) => {
  const end = httpRequestDurationSeconds.startTimer();
  res.on('finish', () => {
    // req.route is only set when a route matched; fall back to 'unknown'.
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.get('/hello', (req, res) => res.json({ message: 'hello' }));

// METRICS_PORT is set in docker-compose.yml; default to 3001 for local runs.
const port = process.env.METRICS_PORT || 3001;
app.listen(port, () => console.log(`Metrics at http://localhost:${port}/metrics`));
```
Key points: register default metrics (CPU, memory, event loop lag), define custom histograms with appropriate buckets, and always use labelNames that correspond to production dimensions (method, route, status). Too many unique label combinations will cause high cardinality — be surgical.
- A histogram with method, route, status creates a set of series for each (method, route, status) triplet.
- With 5 methods × 20 routes × 5 statuses, that is up to 500 series per histogram bucket.
- Prometheus uses memory proportional to series count; as a rough rule of thumb, 1M series uses ~2GB RAM.
- Aggressively limit label variability: use a path code (a normalized route name) instead of the raw URL path, or aggregate before ingestion.
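The arithmetic in the bullets above can be sketched as a small JavaScript helper. It relies on the fact that a Prometheus histogram exposes one _bucket series per configured bucket plus a +Inf bucket, along with one _sum and one _count series, for every label combination. The function name is hypothetical, and the result is a worst case that assumes every combination is actually observed.

```javascript
// Worst-case series count for one histogram metric.
function histogramSeriesCount({ labelValueCounts, finiteBuckets }) {
  // Distinct label combinations: the product of each label's value count.
  const labelCombinations = labelValueCounts.reduce((product, n) => product * n, 1);
  // Per combination: finite buckets + the +Inf bucket + _sum + _count.
  const seriesPerCombination = finiteBuckets + 1 + 2;
  return { labelCombinations, seriesPerCombination, total: labelCombinations * seriesPerCombination };
}

// The histogram from the app: 5 methods x 20 routes x 5 statuses, 6 buckets.
const appHistogram = histogramSeriesCount({ labelValueCounts: [5, 20, 5], finiteBuckets: 6 });
console.log(appHistogram.total); // 4500

// Adding a user_id label with 10,000 values multiplies everything by 10,000.
const withUserId = histogramSeriesCount({ labelValueCounts: [5, 20, 5, 10000], finiteBuckets: 6 });
console.log(withUserId.total); // 45000000
```

One extra unbounded label takes a single metric from a few thousand series to tens of millions, which is why label choice matters more than metric count.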
PromQL: The Query Language That Makes Metrics Useful
PromQL is the expression language that turns stored metrics into actionable insights. It's deceptively simple — rate(http_requests_total[5m]) gives you requests per second — but mastering aggregations, offset, and subqueries separates the pro from the panic.
Let's build queries for the Node.js app we instrumented:
- Request rate per route:
  sum by (route) (rate(http_request_duration_seconds_count{job="node-app"}[5m]))
- 95th percentile latency:
  histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
- Error ratio:
  sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
Common gotcha: rate() requires a counter, so only use it on metrics ending with _total or _count. For gauges (CPU, memory), use avg_over_time() or max_over_time().
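A simplified JavaScript model shows why rate() only makes sense on counters. This is a sketch, not Prometheus's algorithm: the real rate() also extrapolates to the edges of the time window, which is omitted here. The function name and sample values are hypothetical.

```javascript
// Simplified per-second rate over a window of counter samples.
// Each sample is { t: seconds, value: counter reading }.
function simpleRate(samples) {
  if (samples.length < 2) return null; // need at least two points
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // A drop is treated as a counter reset (e.g. process restart),
    // so the new reading is counted as growth from zero.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return increase / seconds;
}

// Steady counter: +30 requests every 15 seconds.
const steady = [
  { t: 0, value: 100 }, { t: 15, value: 130 }, { t: 30, value: 160 }, { t: 45, value: 190 },
];
console.log(simpleRate(steady)); // 2

// Same traffic, but the process restarted between t=15 and t=30.
const restarted = [
  { t: 0, value: 100 }, { t: 15, value: 130 }, { t: 30, value: 10 }, { t: 45, value: 40 },
];
console.log(simpleRate(restarted));
```

Because every decrease is interpreted as a reset, feeding this a gauge that legitimately goes down (memory, queue depth) produces growth that never happened.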
Another trap: leaving route out of the by() clause inside histogram_quantile aggregates across all routes, giving you one value for your entire app. Always aggregate by (le, route) to get per-route latencies; le has to stay in the clause because histogram_quantile needs the bucket boundaries.
```promql
# Request rate per route (counter)
sum by (route) (rate(http_request_duration_seconds_count{job="node-app"}[5m]))

# 95th percentile latency per route (histogram)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="node-app"}[5m])) by (le, route)
)

# Error ratio (status 5xx vs total)
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# CPU usage of the app container (a counter, so rate() applies)
avg by (name) (rate(container_cpu_usage_seconds_total{name="app"}[1m]))
```
PromQL Tip: Always Use Aggregators
Queries without by() or without() may return multiple time series depending on label values. Use sum by(...) or avg by(...) to reduce dimensionality and make dashboards predictable.
Production insight: PromQL queries over high-cardinality series can time out (Prometheus's default --query.timeout is 2m). Use recording rules for expensive queries (e.g., hourly aggregated error rate) to precompute them every 5 minutes. Rule: never execute an unaggregated PromQL query on a dashboard with auto-refresh <30s.
Key takeaway: PromQL is powerful but expensive. Aggregate early, aggregate often. Use recording rules for dashboard queries; your CPU budget will thank you.
PromQL Common Query Cheat Sheet (rate, sum, increase)
Here's a quick-reference cheat sheet for the three most frequently used PromQL functions. Bookmark it for when you need to write a query fast.
rate() – per-second average rate of increase over a time window (for counters)
- rate(metric_total[5m]) → requests per second over last 5 minutes
- rate(http_requests_total{status="500"}[10m]) → error rate per second
- Always use with counters ending in _total or _count
increase() – total increase over a time window (for counters)
- increase(http_requests_total[1h]) → total requests in the last hour
- increase(errors_total[24h]) → total errors in the last day
- Useful for billing or compliance metrics that need cumulative totals
sum() – aggregates time series across labels
- sum(rate(http_requests_total[5m])) → total request rate across all instances
- sum by (service) (rate(http_requests_total[5m])) → request rate per service
- Combine with rate() or increase() to reduce dimensionality
Common combinations:
```promql
# Slow requests (latency > 1s) in the last minute:
# all requests minus those that landed in the le="1" bucket
sum(increase(http_request_duration_seconds_count[1m]))
-
sum(increase(http_request_duration_seconds_bucket{le="1"}[1m]))

# 99th percentile latency by endpoint
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Fraction of failed requests per route (multiply by 100 for a percentage)
sum by (route) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (route) (rate(http_requests_total[5m]))

# Memory usage per container in MiB
container_memory_usage_bytes / 1024 / 1024
```
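The histogram_quantile() queries above return an estimate, not an exact percentile. The following JavaScript sketch models the idea: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration with hypothetical bucket counts and a hypothetical function name, and it skips the edge cases the real function handles.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// buckets: [{ le, count }] sorted by le, counts cumulative, last le = Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: the best answer is the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - countBelow;
  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / inBucket);
}

// Hypothetical latency histogram: 100 requests in total.
const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.95, latencyBuckets)); // lands between 0.5 and 1
console.log(histogramQuantile(0.99, latencyBuckets)); // capped at the highest finite bound
```

The estimate can never be finer than the bucket layout, so place bucket boundaries close to the latencies you actually care about, such as your SLO threshold.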
Quick rules of thumb:
- Counter metric → rate() or increase()
- Gauge metric → avg_over_time(), max_over_time(), or raw value
- Histogram metric → histogram_quantile() + rate(_bucket)
- High cardinality → sum by() to collapse
- Missing by() → you get multiple series (often not what you want)
```promql
# RATE: Requests per second
rate(http_requests_total[5m])

# INCREASE: Total requests in last hour
increase(http_requests_total[1h])

# SUM + RATE: Total error rate across all instances
sum(rate(http_requests_total{status=~"5.."}[5m]))

# SUM BY: Error rate per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))

# HISTOGRAM QUANTILE: 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# RATIO: Error fraction per route
sum by (route) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (route) (rate(http_requests_total[5m]))

# CPU: usage percentage (container_cpu_usage_seconds_total is a counter, so rate() applies)
avg by (name) (rate(container_cpu_usage_seconds_total[1m])) * 100
```
Common Mistake: Using rate() on a gauge
rate() is only valid for counters (always increasing). Using it on a gauge that goes up and down (e.g., memory usage) will produce nonsense values. Use avg_over_time() or delta() for gauges.
Production insight: A well-tuned PromQL query is the difference between a 10ms dashboard load and a 10s timeout. Always use recording rules for queries that appear on multiple dashboards or have a heavy by() clause. In a production incident, increase(errors_total[5m]) gets you the raw count more directly than rate().
Key takeaway: Memorize the rate, increase, sum combo. Counter → rate/increase. Gauge → average/raw. Aggregate by meaningful labels to control cardinality.
Alerting and Grafana Dashboards
Prometheus alerting works in two parts: alerting rules in Prometheus, and alert routing by Alertmanager. Rules are defined in a YAML file and loaded by Prometheus. They are evaluated every evaluation_interval (default 1m), not at the scrape interval, and can carry different severities.
Example rule for high error rate:
```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # sum by (job) keeps the job label so {{ $labels.job }} resolves in the summary
        expr: |
          (
            sum by (job) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
```
The for: clause prevents flapping — the condition must be true for 2 minutes before an alert fires. Configure Alertmanager to send to email, PagerDuty, or Slack.
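The effect of for: is easiest to see as a small state machine. The JavaScript sketch below assumes a 60-second evaluation interval and a 2-minute for: duration. The function name and the input sequence are hypothetical; it models only the inactive, pending, and firing states of a single alert.

```javascript
// Model of how `for:` delays an alert.
// conditionByTick[i] is whether the rule expression was true at evaluation i.
function evaluateAlert(conditionByTick, evaluationIntervalSeconds, forSeconds) {
  let activeSince = null; // when the condition first became true
  return conditionByTick.map((conditionTrue, tick) => {
    const now = tick * evaluationIntervalSeconds;
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the clock
      return 'inactive';
    }
    if (activeSince === null) activeSince = now;
    return now - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// A transient spike (two evaluations), a recovery, then a sustained problem.
const states = evaluateAlert([true, true, false, true, true, true], 60, 120);
console.log(states);
// [ 'pending', 'pending', 'inactive', 'pending', 'pending', 'firing' ]
```

The short spike never pages anyone, while the sustained failure fires two minutes after it starts. That delay is the price of avoiding flapping, so keep for: shorter than the outage length you can tolerate.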
Grafana dashboards visualize these alerts. Create a panel with the metric ALERTS{alertstate="firing"} to show active alerts. Better yet, use the Alertmanager data source in Grafana for a dedicated alerts page.
For dashboards, follow the SRE golden signals: latency, traffic, errors, saturation. Start with a system overview panel, then drill-down panels per service.
```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # sum by (job) keeps the job label so {{ $labels.job }} resolves in the summary
        expr: |
          (
            sum by (job) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
      - alert: InstanceDown
        expr: up{job="node-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
```
Mental Model: Alert Fatigue is a Design Problem
Every alert that fires but doesn't require a human action is noise; it trains your team to ignore the next alert.
- Alert on symptoms (error rate >5%), not causes (CPU >80%).
- Use for: to avoid flapping alerts from transient spikes.
- Only escalate to on-call if the alert requires immediate human intervention within 15 minutes.
- A well-designed dashboard should show metrics, alerts should show anomalies.
Production insight: Alertmanager can receive thousands of alerts during an incident, but a human can only triage a handful per minute. Use inhibition rules to silence less-severe alerts when a critical alert fires (e.g., silence all InstanceDown alerts if HighErrorRate is already firing for the same cluster). Rule: alert about the root cause, not every symptom.
Key takeaway: Alerting isn't about being loud, it's about being precise. Every alert should answer: what's broken, why, and what to fix. If you can't script the fix, don't alert; dashboards are better.
The Silent Outage: Missing Metrics Alert
- Never rely on static IPs for Prometheus targets in dynamic environments.
- Always configure a 'tooling' blackbox exporter to verify that Prometheus itself can still reach its targets.
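- Add an alert on the synthetic up metric (up == 0) with a short for: duration to catch total scrape loss, and consider absent(up{job="node-app"}) for targets that disappear from the target list entirely.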
- Run docker compose logs <service> to check if the metrics endpoint is listening. Verify the port in the scrape config matches the container's exposed port. Use curl http://localhost:<port>/metrics from the host to confirm accessibility.
- Run promtool tsdb analyze --extended /prometheus/data. Look for metrics with high label value counts. Common culprit: request_latency_seconds_bucket with a user_id label. Reduce label cardinality or shorten the retention window.
- An alert waits out its full for: duration before firing. Check the ALERTS metric in Prometheus. Also verify Alertmanager is reachable from Prometheus and that inhibition rules aren't suppressing the alert.