Prometheus & Grafana Setup - Static IPs Cause Outages
No alert fired despite payment failures - static IP targets in Docker caused scrape loss after restart.
20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.
- Prometheus scrapes metrics from services and stores them in a time-series database
- Grafana visualises those metrics on customizable dashboards without writing UI code
- Alertmanager handles notifications when metrics cross thresholds
- A healthy Prometheus instance handles ~1M samples/second per core — cardinality is the bottleneck
- Missing scraping targets because of network misconfiguration is the #1 production incident
- Most engineers over-scrape: a scrape interval so short that slow targets run into the scrape timeout causes cascading failures
Imagine your app is a car engine. You wouldn't drive cross-country without a dashboard showing your speed, fuel, and temperature — you'd break down without warning. Prometheus is the set of sensors bolted to that engine, constantly measuring everything. Grafana is the beautiful dashboard on your steering wheel that turns those raw sensor readings into dials you can actually understand at a glance. Without this combo, you're driving blind.
Every production system fails eventually — the only question is whether YOU find out first, or your users do. In 2024, a five-minute outage at a mid-sized SaaS company can cost tens of thousands of dollars and destroy user trust built over months. The teams that catch problems in seconds rather than minutes aren't lucky — they have observability pipelines built with tools like Prometheus and Grafana that surface anomalies the moment they appear, not after a support ticket rolls in.
Before Prometheus became the de-facto standard for cloud-native monitoring, teams were duct-taping together cron jobs, custom scripts, and expensive APM vendors to answer the simplest question: 'Is my service healthy right now?' Prometheus solves this with a pull-based model that scrapes metrics from your services on a schedule, stores them in a time-series database, and lets you query them with a powerful expression language called PromQL. Grafana then plugs into that database and lets you visualise, alert on, and share those metrics without writing a single line of UI code.
By the end of this article you'll have a fully working Prometheus and Grafana stack running locally via Docker Compose, a real Node.js app exposing custom business metrics, a PromQL query that actually answers a business question, and an alerting rule that fires before your users notice a problem. This is the exact setup you'd use as a foundation for a production monitoring stack.
What Is a Prometheus and Grafana Setup?
Prometheus is an open-source time-series database and monitoring system that scrapes metrics from configured targets at regular intervals. Grafana is the analytics and visualization layer that queries Prometheus (and many other data sources) to build dashboards, alerts, and ad-hoc queries.
The combination gives you a complete observability pipeline: instrument your code → expose metrics → scrape → store → query → alert → visualize. No external SaaS required, no per-seat licensing, and full control over retention.
You don't need to be an SRE to run this. A single Docker Compose file gets you a working stack in under 10 minutes. But the simplicity hides depth — get the scrape cadence wrong and you'll either burn your metrics disk or miss critical data.
Prometheus Architecture Flow Visual
Understanding the data flow from instrumentation to alerting is critical for debugging production issues. The flow below traces how metrics travel from your application to Prometheus, then to Alertmanager and Grafana.
Flow overview:
- Your application exposes metrics at /metrics (pull-based).
- Prometheus scrapes these targets based on its scrape_configs and stores the data in its TSDB.
- Prometheus evaluates alerting rules against the stored data and pushes alerts to Alertmanager.
- Alertmanager handles deduplication, grouping, and routing to notification channels (PagerDuty, Slack, email).
- Grafana queries Prometheus via its API to render dashboards.
The key architectural decision is that Prometheus pulls from targets, not the other way around. This makes it resilient to network partitions on the target side and gives you control over scrape cadence.
Setting Up the Stack with Docker Compose
Here's a minimal docker-compose.yml that runs Prometheus, Grafana, and a Node.js app that exposes custom metrics. We'll use the prom/prometheus and grafana/grafana official images. The Node.js app uses the prom-client library.
```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
  app:
    build: ./app
    ports:
      - "3001:3001"
    environment:
      - METRICS_PORT=3001
```
The Prometheus config file (prometheus.yml) must define scrape targets. Notice we reference services by Docker Compose service name — the embedded DNS resolver resolves them.
```yaml
scrape_configs:
  - job_name: 'node-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['app:3001']
```
Don't forget to add a network alias if you have multiple networks, or use depends_on: condition: service_healthy to ensure your app is ready before Prometheus starts scraping.
Add condition: service_healthy to the Prometheus depends_on block. Otherwise you'll see the target 'DOWN' on the first scrape.
Service Discovery: Configuring Consul, Kubernetes, and Docker Targets
Static targets work fine for a single Docker Compose stack, but in production your services scale, restart, and move IPs. That's where service discovery comes in. Prometheus supports multiple discovery mechanisms that dynamically generate targets from your infrastructure catalog.
Consul Service Discovery
If you run Consul, Prometheus can discover targets by querying the Consul API. The following config scrapes all services that carry a metrics tag. Filtering on health-check status is not part of this config and would need an extra relabel rule on the health metadata.
```yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        tags:
          - metrics
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_service_address, __meta_consul_service_port]
        separator: ':'
        target_label: __address__
```
Kubernetes Service Discovery
On Kubernetes, Prometheus uses the API to watch pods, services, endpoints, and ingresses. The most common pattern is scraping pods based on annotations.
```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```
Docker Service Discovery
For individual containers on a Docker host, use Docker SD (Docker Swarm has its own dockerswarm_sd_configs). This config discovers running containers and scrapes those with the prometheus.io/scrape: true label.
```yaml
scrape_configs:
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: 'unix:///var/run/docker.sock'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        replacement: '$1'
        target_label: container
```
Each SD config uses relabel_configs to map the discovered metadata into Prometheus target labels. The key is to extract the IP, port, and any meaningful labels (service name, environment) from the provider's metadata.
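To make the Kubernetes address rewrite less magical, the same relabel step can be mimicked in plain JavaScript. This is only a sketch: relabelAddress is a hypothetical helper, and it assumes Prometheus joins the source label values with ';' and anchors the regex at both ends.

```javascript
// Hypothetical helper mimicking one relabel_configs "replace" step.
// Assumption: source label values are joined with ';' and the regex is fully anchored.
function relabelAddress(address, portAnnotation) {
  const joined = `${address};${portAnnotation}`;
  const pattern = /^([^:]+)(?::\d+)?;(\d+)$/;
  // When the regex does not match, the replace action leaves the target label unchanged.
  return pattern.test(joined) ? joined.replace(pattern, '$1:$2') : address;
}

console.log(relabelAddress('10.0.0.5:8080', '9102')); // port swapped for the annotated one
console.log(relabelAddress('10.0.0.5', '9102'));      // port appended
console.log(relabelAddress('10.0.0.5:8080', ''));     // no annotation: address kept as-is
```

The third call is the case worth remembering: a pod without the port annotation keeps whatever address discovery gave it.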
The docker_sd_configs approach should only be used in trusted environments, because it needs access to the Docker socket.
Discovered changes take up to one refresh_interval to propagate. In Consul, this is typically 30s. In Kubernetes, the watch mechanism is near-instant. Always set refresh_interval based on your scaling cadence: 5s for autoscaling services, 60s for long-running VMs.
Instrumenting Your Application
Your app must expose a /metrics HTTP endpoint that Prometheus can scrape. For Node.js, the prom-client library serves metrics in the Prometheus text exposition format. Create a simple Express app that tracks request count and latency.
```javascript
const express = require('express');
const prometheus = require('prom-client');

const app = express();
const register = new prometheus.Registry();

// Default process metrics: CPU, memory, event loop lag
prometheus.collectDefaultMetrics({ register });

// Request latency histogram, recorded in seconds
const httpRequestDurationSeconds = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
register.registerMetric(httpRequestDurationSeconds);

// Time every request and record it once the response has been sent
app.use((req, res, next) => {
  const end = httpRequestDurationSeconds.startTimer();
  res.on('finish', () => {
    // Label with the route template, not the raw URL, to keep cardinality low
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
  });
  next();
});

// Endpoint that Prometheus scrapes
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.get('/hello', (req, res) => res.json({ message: 'hello' }));

app.listen(3001, () => console.log('Metrics at http://localhost:3001/metrics'));
```
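If you're wondering what Prometheus actually receives from /metrics, it's plain text. The sketch below renders a counter in that text format by hand, purely for illustration: renderCounter is a hypothetical helper (prom-client produces this output for you), and it skips label-value escaping.

```javascript
// Renders a counter in the Prometheus text exposition format.
// Hypothetical helper for illustration only; label values are not escaped here.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels).map(([k, v]) => `${k}="${v}"`).join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

process.stdout.write(renderCounter('http_requests_total', 'Total HTTP requests', [
  { labels: { method: 'GET', route: '/hello' }, value: 42 },
]));
```

Every distinct label combination becomes its own line, and its own time series once scraped.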
Key points: register default metrics (CPU, memory, event loop lag), define custom histograms with appropriate buckets, and always use labelNames that correspond to production dimensions (method, route, status). Too many unique label combinations will cause high cardinality — be surgical.
- A histogram with method, route, status creates series for each (method, route, status) triplet.
- With 5 methods × 20 routes × 5 statuses, that's 500 series per histogram bucket.
- Prometheus uses memory proportional to series count — 1M series uses ~2GB RAM.
- Aggressively limit label variability: use pathcode instead of raw URL path, or use aggregation before ingestion.
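The arithmetic above is easy to underestimate, because each label combination also multiplies across buckets. Here's a rough worst-case estimator, assuming every label combination is actually observed; histogramSeries is a hypothetical helper name, and the bucket count matches the six buckets in the app above.

```javascript
// Rough worst-case series estimate for one Prometheus histogram.
// Each label combination produces one series per bucket, plus +Inf, _sum and _count.
function histogramSeries(labelCardinalities, bucketCount) {
  const combinations = labelCardinalities.reduce((acc, n) => acc * n, 1);
  const seriesPerCombination = bucketCount + 1 /* +Inf */ + 2 /* _sum, _count */;
  return { combinations, total: combinations * seriesPerCombination };
}

// 5 methods x 20 routes x 5 statuses, with six explicit buckets.
const estimate = histogramSeries([5, 20, 5], 6);
console.log(estimate); // { combinations: 500, total: 4500 }
```

One innocent-looking histogram can cost thousands of series, which is why adding a fourth label deserves a second thought.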
PromQL: The Query Language That Makes Metrics Useful
PromQL is the expression language that turns stored metrics into actionable insights. It's deceptively simple — rate(http_requests_total[5m]) gives you requests per second — but mastering aggregations, offset, and subqueries separates the pro from the panic.
Let's build queries for the Node.js app we instrumented:
- Request rate per route: rate(http_request_duration_seconds_count{job="node-app"}[5m])
- 95th percentile latency: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
- Error ratio: sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
Common gotcha: rate() requires a counter; only use it on metrics ending with _total or _count (or on histogram _bucket and _sum series). For gauges (memory usage, queue depth), use avg_over_time() or max_over_time().
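Why rate() misbehaves on gauges becomes obvious once you see roughly what it computes. The following is a simplified model, not Prometheus's implementation: simpleRate is a hypothetical function, timestamps are in seconds, and the extrapolation to the window edges that real rate() performs is left out.

```javascript
// Simplified model of PromQL rate(): per-second increase of a counter.
// Assumption: real rate() also extrapolates to the window edges; that step is skipped here.
function simpleRate(samples) {
  if (samples.length < 2) return null; // at least two samples are needed
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // A drop is treated as a counter reset (process restart): count the new value from zero.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return increase / seconds;
}

const counter = [
  { t: 0, value: 100 },
  { t: 15, value: 130 },
  { t: 30, value: 10 }, // restart: counter went back to 0, then served 10 requests
  { t: 45, value: 40 },
];
console.log(simpleRate(counter)); // (30 + 10 + 30) / 45
```

Feed a gauge into this and every dip gets treated as a restart, which is where the nonsense values come from.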
Another trap: leaving route out of the by() clause inside histogram_quantile aggregates across all labels, giving you one value for your entire app. Always aggregate by (le, route) to get per-route latencies.
```promql
# queries.promql
# Request rate per route (counter)
rate(http_request_duration_seconds_count{job="node-app"}[5m])

# 95th percentile latency per route (histogram)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="node-app"}[5m])) by (le, route)
)

# Error ratio (status 5xx vs total)
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# CPU usage of the app container (counter, so rate() applies)
avg by (name) (rate(container_cpu_usage_seconds_total{name="app"}[1m]))
```
PromQL Tip: Always Use Aggregators
Queries without an aggregation operator may return multiple time series, one per label combination. Use sum by(...) or avg by(...) to reduce dimensionality and make dashboards predictable.
Production insight: PromQL queries over high-cardinality series can time out; the default query timeout is 2 minutes. Use recording rules for expensive queries (e.g., hourly aggregated error rate) to precompute them every 5 minutes. Rule: never execute an unaggregated PromQL query on a dashboard with auto-refresh <30s.
Key takeaway: PromQL is powerful but expensive. Aggregate early, aggregate often. Use recording rules for dashboard queries; your CPU budget will thank you.
PromQL Common Query Cheat Sheet (rate, sum, increase)
Here's a quick-reference cheat sheet for the three most frequently used PromQL functions. Bookmark it for when you need to write a query fast.
rate(): per-second average rate of increase over a time window (for counters)
- rate(metric_total[5m]) → requests per second over last 5 minutes
- rate(http_requests_total{status="500"}[10m]) → error rate per second
- Always use with counters ending in _total or _count
increase(): total increase over a time window (for counters)
- increase(http_requests_total[1h]) → total requests in the last hour
- increase(errors_total[24h]) → total errors in the last day
- Useful for billing or compliance metrics that need cumulative totals
sum(): aggregates time series across labels
- sum(rate(http_requests_total[5m])) → total request rate across all instances
- sum by (service) (rate(http_requests_total[5m])) → request rate per service
- Combine with rate() or increase() to reduce dimensionality
Common combinations:
```promql
# Slow requests (> 1s) in the last minute: all requests minus those in the le="1" bucket
sum(increase(http_request_duration_seconds_count[1m]))
-
sum(increase(http_request_duration_seconds_bucket{le="1"}[1m]))

# 99th percentile latency by endpoint
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# Fraction of failed requests per route
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (route) (rate(http_requests_total[5m]))

# Memory usage per container in MB
container_memory_usage_bytes / 1024 / 1024
```
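The percentile queries above return an estimate, not an exact value, and it helps to know why. Here's a sketch of the bucket interpolation behind histogram_quantile(), assuming cumulative buckets sorted by upper bound and ending in +Inf; histogramQuantile and the sample counts are made up for illustration.

```javascript
// Sketch of how histogram_quantile() estimates a quantile from cumulative buckets.
// Assumption: buckets are sorted by upper bound `le` and the last one is +Inf.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Quantile lands in the +Inf bucket: report the highest finite bound instead.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const upper = buckets[i].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  // Linear interpolation: observations are assumed to be spread evenly inside the bucket.
  return lower + (upper - lower) * ((rank - below) / (buckets[i].count - below));
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // somewhere between 0.5 and 1
```

The answer can never be more precise than your bucket boundaries, so put buckets around the latencies you actually care about.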
Quick rules of thumb:
- Counter metric → rate() or increase()
- Gauge metric → avg_over_time(), max_over_time(), or raw value
- Histogram metric → histogram_quantile() + rate(_bucket)
- High cardinality → sum by() to collapse
- No aggregation → you get multiple series (often not what you want)
```promql
# promql-cheat-sheet.promql
# RATE: Requests per second
rate(http_requests_total[5m])

# INCREASE: Total requests in last hour
increase(http_requests_total[1h])

# SUM + RATE: Total error rate across all instances
sum(rate(http_requests_total{status=~"5.."}[5m]))

# SUM BY: Error rate per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))

# HISTOGRAM QUANTILE: 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# RATIO: Error fraction per route
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (route) (rate(http_requests_total[5m]))

# CPU usage percentage (container_cpu_usage_seconds_total is a counter, so rate() applies)
avg by (name) (rate(container_cpu_usage_seconds_total[1m])) * 100
```
Common Mistake: Using rate() on a gauge
rate() is only valid for counters (always increasing). Using it on a gauge that goes up and down (e.g., memory usage) will produce nonsense values. Use delta() or avg_over_time() for gauges.
Production insight: A well-tuned PromQL query is the difference between a 10ms dashboard load and a 10s timeout. Always use recording rules for queries that appear on multiple dashboards or have a heavy by() clause. In a production incident, increase(errors_total[5m]) is faster to read than rate() when you need to know the raw count quickly.
Key takeaway: Memorize the rate, increase, sum combo. Counter → rate/increase. Gauge → average/raw. Aggregate by meaningful labels to control cardinality.
Alerting and Grafana Dashboards
Prometheus alerting works in two parts: alerting rules in Prometheus, and alert routing by Alertmanager. Rules are defined in a YAML file and loaded by Prometheus. They are evaluated every evaluation_interval (1m by default) and can carry different severity labels.
Example rule for high error rate:
```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # Aggregating by job keeps the job label available to the annotation
        expr: |
          (
            sum by (job) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
```
The for: clause prevents flapping: the condition must be true for 2 minutes before an alert fires. Configure Alertmanager to send to email, PagerDuty, or Slack.
Grafana dashboards visualize these alerts. Create a panel with metric ALERTS{alertstate="firing"} to show active alerts. Better yet, use the Alertmanager datasource in Grafana for a dedicated alerts page.
For dashboards, follow the SRE golden signals: latency, traffic, errors, saturation. Start with a system overview panel, then drill-down panels per service.
```yaml
# alerting-rules.yml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
      - alert: InstanceDown
        expr: up{job="node-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
```
Mental Model: Alert Fatigue is a Design Problem
Every alert that fires but doesn't require a human action is noise; it trains your team to ignore the next alert.
- Alert on symptoms (error rate >5%), not causes (CPU >80%).
- Use for: to avoid flapping alerts from transient spikes.
- Only escalate to on-call if the alert requires immediate human intervention within 15 minutes.
- A well-designed dashboard should show metrics, alerts should show anomalies.
Production insight: Alertmanager can receive thousands of alerts during an incident, but human brains can process 3-5 per minute. Use inhibition rules to silence less-severe alerts when a critical alert fires (e.g., silence all InstanceDown alerts if HighErrorRate is already firing for the same cluster). Rule: alert about the root cause, not every symptom.
Key takeaway: Alerting isn't about being loud; it's about being precise. Every alert should answer: what's broken, why, and what to fix. If you can't script the fix, don't alert; dashboards are better.
Alerting That Won't Wake You at 3 AM for Nothing
Most setups stop at dashboards. That’s not monitoring — that’s a screensaver. Real monitoring alerts you when something breaks, not when a metric twitches. Prometheus Alertmanager does this, but it takes configuration to avoid noise.
Start with recording rules. They pre-compute expensive queries so your alert rules don’t hammer Prometheus. Write alert rules that trigger on sustained anomalies, not single spikes. A 5-second CPU burst is noise. 5 minutes at 95% is a problem.
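The "sustained, not single spike" idea is what the for: clause on a rule enforces. A rough model of that gating, assuming one boolean result per rule evaluation; alertStates and its parameters are hypothetical names, not part of any Prometheus API.

```javascript
// Sketch of how a `for:` clause gates an alert across rule evaluations.
// Hypothetical inputs: one boolean per evaluation, spaced `intervalSeconds` apart.
function alertStates(conditionResults, forSeconds, intervalSeconds) {
  let activeSince = null; // evaluation index at which the condition first became true
  return conditionResults.map((isTrue, i) => {
    if (!isTrue) {
      activeSince = null; // any false evaluation resets the pending timer
      return 'inactive';
    }
    if (activeSince === null) activeSince = i;
    const heldFor = (i - activeSince) * intervalSeconds;
    return heldFor >= forSeconds ? 'firing' : 'pending';
  });
}

// A one-evaluation spike never fires; a sustained breach fires after 2 minutes.
const spike = alertStates([false, true, false, false], 120, 60);
const sustained = alertStates([true, true, true, true], 120, 60);
console.log(spike);     // [ 'inactive', 'pending', 'inactive', 'inactive' ]
console.log(sustained); // [ 'pending', 'pending', 'firing', 'firing' ]
```

The trade-off is detection delay: a longer for: means fewer false pages but a slower first alert.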
Route alerts by severity and team. Critical goes to PagerDuty, warnings go to Slack, info emails get archived. Use inhibition rules to suppress low-severity alerts when a higher one fires. No one cares about disk latency when the node is down.
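Inhibition is easier to reason about as a filter. The sketch below is a simplified stand-in for Alertmanager's inhibit_rules, matching only on severity plus a list of labels that must be equal; the rule shape, the function name, and the alert data are all hypothetical.

```javascript
// Sketch of Alertmanager-style inhibition: a firing source alert mutes matching targets.
// The rule shape below is a simplified, hypothetical stand-in for inhibit_rules.
function applyInhibition(alerts, rule) {
  const sources = alerts.filter((a) => a.labels.severity === rule.sourceSeverity);
  return alerts.filter((alert) => {
    if (alert.labels.severity !== rule.targetSeverity) return true;
    // Mute the target only if some source alert shares every label listed in `equal`.
    const inhibited = sources.some((s) =>
      rule.equal.every((name) => s.labels[name] === alert.labels[name]));
    return !inhibited;
  });
}

const alerts = [
  { name: 'NodeDown', labels: { severity: 'critical', instance: 'db-1' } },
  { name: 'DiskLatencyHigh', labels: { severity: 'warning', instance: 'db-1' } },
  { name: 'DiskLatencyHigh', labels: { severity: 'warning', instance: 'db-2' } },
];
const rule = { sourceSeverity: 'critical', targetSeverity: 'warning', equal: ['instance'] };
const delivered = applyInhibition(alerts, rule).map((a) => `${a.name}@${a.labels.instance}`);
console.log(delivered); // [ 'NodeDown@db-1', 'DiskLatencyHigh@db-2' ]
```

The equal list is what keeps inhibition scoped: the disk warning on db-2 still gets through because no critical alert shares its instance.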
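Test your alerts. Run promtool check rules and promtool test rules against your rule files, and amtool check-config against your Alertmanager config, before deploying. Simulate failures in staging. If your first real alert is a false positive, your team stops trusting the system.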
Prometheus Storage: Don't Run Out of Disk at 2 PM on a Tuesday
Prometheus stores time-series data locally by default. That data has a retention period — set it or it fills your disk. Default is 15 days. Adjust based on how far back you need to query. If you never look at last week, drop retention to 7 days and save space.
Block storage is a thing. Prometheus writes data in two-hour blocks. Each block is immutable after compaction. Deleting old data means dropping entire blocks, not individual metrics. Plan retention in multiples of two hours, or you waste space.
TSDB retention is configured with Prometheus startup flags: --storage.tsdb.retention.time=30d. Pair that with --storage.tsdb.retention.size=100GB to cap disk usage. If your blocks exceed 100GB, Prometheus drops the oldest data regardless of time. Use both flags. Never one.
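Picking a value for the size cap is easier with a rough estimate in hand. A back-of-the-envelope sketch, assuming roughly 1 to 2 bytes per sample after compression (2 is used here to stay cautious); estimateDiskBytes and the workload numbers are hypothetical.

```javascript
// Back-of-the-envelope disk estimate for local TSDB retention.
// Assumption: roughly 1 to 2 bytes per sample after compression; 2 is the cautious default.
function estimateDiskBytes({ activeSeries, scrapeIntervalSeconds, retentionDays, bytesPerSample = 2 }) {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return samplesPerSecond * retentionSeconds * bytesPerSample;
}

// Hypothetical workload: 100,000 series scraped every 15s, kept for 30 days.
const bytes = estimateDiskBytes({ activeSeries: 100000, scrapeIntervalSeconds: 15, retentionDays: 30 });
console.log(`${(bytes / 1e9).toFixed(1)} GB`);
```

Halving the scrape interval doubles the result, which is the "burn your metrics disk" failure mode in numbers.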
For long-term storage, use Thanos or Cortex. They let you keep years of data in object storage (S3, GCS). Prometheus itself is ephemeral — its local storage is for fast queries on recent data. Archive everything older to external storage.
Run ls -lh /data/prometheus and check du -sh * to see block sizes. If any block exceeds 1GB, your scrape interval is too aggressive or you're scraping too many metrics.
Instrumenting a Spring Boot App
Why instrument your app? Without metrics, you're flying blind. Spring Boot makes this trivial with Micrometer, a facade that feeds Prometheus. Add micrometer-registry-prometheus to your pom.xml or build.gradle. Expose the /actuator/prometheus endpoint. That single change gives you JVM metrics—heap, threads, GC pauses—plus HTTP request counters and timers. Custom metrics? Annotate a method with @Timed or wire a MeterRegistry bean to record counters and gauges. The real power: you can aggregate these across all instances. CPU spikes become visible across your fleet. Out-of-memory warnings appear before the crash. This is observability without a PhD. One dependency, one endpoint, infinite clarity.
Never expose /actuator/prometheus to the public internet. Restrict access with a firewall or authentication gateway. Otherwise, anyone can read your JVM and application metrics.
Pitfall 1: Overloaded Prometheus
Prometheus does not cap the number of series per target by default (sample_limit defaults to 0, meaning no limit). Let the count grow unchecked and scrapes slow down, queries lag, storage balloons. Why it happens: instrumenting every SQL query, HTTP header, or loop iteration. Metrics with high cardinality, like user_id or session_id, explode the time series count. A single label with 1,000 unique values creates 1,000 series per metric. Times 100 metrics gives 100,000 series. This collapses the server. To diagnose, check prometheus_tsdb_head_series and note the cardinality. Fix by removing high-cardinality labels or aggregating with recording rules. Use count by (status) instead of sum by (user_id). Run promtool tsdb analyze on blocks. Protect storage with retention limits. An overloaded Prometheus is silent sabotage. Trim the fat.
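Finding the offending label is mostly a counting exercise. The sketch below counts distinct values per label across a set of series; labelCardinality is a hypothetical helper and the series list is made-up sample data, not output from a real server.

```javascript
// Counts distinct values per label across a set of series to spot cardinality offenders.
// The series list is made-up sample data, not output from a real Prometheus server.
function labelCardinality(seriesLabels) {
  const values = new Map();
  for (const labels of seriesLabels) {
    for (const [name, value] of Object.entries(labels)) {
      if (!values.has(name)) values.set(name, new Set());
      values.get(name).add(value);
    }
  }
  // Highest distinct-value count first: the top entry is the label to remove or aggregate away.
  return [...values.entries()]
    .map(([name, set]) => ({ label: name, distinct: set.size }))
    .sort((a, b) => b.distinct - a.distinct);
}

const series = [
  { status: '200', user_id: 'u1' },
  { status: '200', user_id: 'u2' },
  { status: '500', user_id: 'u3' },
  { status: '200', user_id: 'u4' },
];
const ranking = labelCardinality(series);
console.log(ranking);
// [ { label: 'user_id', distinct: 4 }, { label: 'status', distinct: 2 } ]
```

A label whose distinct count grows with traffic, as user_id does here, is the one to drop.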
A single label like request_id can triple storage overnight. Set a per-job sample_limit, watch prometheus_target_scrapes_exceeded_sample_limit_total for scrapes that hit it, and analyze cardinality weekly.
Pitfall 2: Dashboard Clutter
Why clutter is dangerous: too many panels hide the one signal that matters. A Grafana dashboard with 50 graphs—CPU, memory, disk per host—blinds operators. The brain can't scan that fast. Root cause: copying generic dashboards without trimming. Every panel adds cognitive load. In an incident, you need two things: what's failing and where. The rest is noise. Fix by following the RED method—Rate, Errors, Duration—for each service. Start with 3 panels per service. Aggregate by team. Use variables to filter, not multiply. Remove graphs that no one looked at in the last 30 days. Enable dashboard provisioning from version control so changes are reviewed. A cluttered dashboard is an insecure system. Kill the noise.
Set allowUiUpdates: false to force git-based changes only.
Introduction: The Power of Knowing Your Systems
Before metrics, operations teams flew blind. A server crash meant frantic log dives, cascading failures caught hours late, and capacity planning was pure guesswork. Prometheus and Grafana flip that script by giving you real-time visibility into every moving part of your infrastructure. Prometheus scrapes metrics from targets, stores them as time-series data, and exposes a powerful query language (PromQL) for slicing and dicing. Grafana consumes that data to build dashboards and alerts that surface anomalies the moment they happen. Together, they transform opaque systems into instruments you can read and tune. This setup isn’t just about avoiding downtime—it’s about understanding performance trends, correlating events across services, and building confidence that your architecture can scale. When you know exactly what normal looks like, you spot the abnormal before it becomes a crisis.
The Story: From Chaos to Clarity
Our platform’s early days were a mess. A burst of user traffic would silently crash a backend service, but our monitoring—a cobbled-together Nagios with static thresholds—only alerted us after a full outage. We’d spend hours piecing together log fragments, chasing ghosts. The turning point came when we deployed Prometheus to scrape every container’s CPU, memory, and request latency. Grafana turned those raw numbers into a live dashboard: a single pane showing latency spikes, error rates, and resource saturation. Suddenly, we could see a slow database query pushing CPU to 90% thirty minutes before any alert fired. We tuned alerts from noise to signal: only page when p99 latency triples for five minutes. The result? Mean time to detection dropped from hours to under a minute. Our team stopped firefighting and started engineering—because we finally understood what our systems were saying.
Case Study: E-Commerce Turnaround
A mid-sized e-commerce site faced weekend flash sales that turned into outages: cart hangs, checkout freezes, and frustrated customers. Before Prometheus, they relied on basic cloud metrics (CPU and memory) but couldn't see the real bottleneck: database connection pool exhaustion caused by a spike in abandoned sessions. We instrumented their Node.js checkout service to expose custom metrics: checkout_requests_total, checkout_duration_seconds, and db_connections_in_use. Grafana displayed a heatmap of request latency across product categories, and comparing rate(checkout_requests_total[5m]) against avg_over_time(db_connections_in_use[5m]) (a gauge, so no rate()) showed the two diverging during sales. The fix: connection pool tuning and a circuit breaker pattern for the database. The next flash sale saw zero downtime, and checkout latency dropped 40%. Prometheus didn't just monitor; it guided the architecture fix.
The Silent Outage: Missing Metrics Alert
- Never rely on static IPs for Prometheus targets in dynamic environments.
- Always configure a 'tooling' blackbox exporter to verify that Prometheus itself can still reach its targets.
- Add an alert on the synthetic up metric (up == 0) with a short for: duration to catch total scrape loss.
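The last point has a blind spot worth spelling out: up == 0 only fires for targets Prometheus still knows about. Here's a sketch that checks both cases against the JSON body returned by /api/v1/query?query=up. The sample response is hand-written, and findScrapeLoss and expectedJobs are hypothetical names.

```javascript
// Lists targets that are down or missing, given the JSON body of /api/v1/query?query=up.
// The sample response is hand-written for illustration; `expectedJobs` is a hypothetical input.
function findScrapeLoss(response, expectedJobs) {
  const results = response.data.result;
  const down = results
    .filter((r) => r.value[1] === '0') // sample values arrive as strings
    .map((r) => `${r.metric.job}/${r.metric.instance}`);
  const seenJobs = new Set(results.map((r) => r.metric.job));
  // A job with no `up` series at all has no targets, so `up == 0` cannot catch it.
  const missing = expectedJobs.filter((job) => !seenJobs.has(job));
  return { down, missing };
}

const sample = {
  status: 'success',
  data: {
    resultType: 'vector',
    result: [
      { metric: { __name__: 'up', job: 'node-app', instance: 'app:3001' }, value: [1700000000, '0'] },
      { metric: { __name__: 'up', job: 'prometheus', instance: 'localhost:9090' }, value: [1700000000, '1'] },
    ],
  },
};
const loss = findScrapeLoss(sample, ['node-app', 'prometheus', 'payments']);
console.log(loss); // { down: [ 'node-app/app:3001' ], missing: [ 'payments' ] }
```

The missing list is the silent-outage case: a job that has vanished entirely produces no up series to alert on.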
- Run docker compose logs <service> to check if the metrics endpoint is listening. Verify the port in the scrape config matches the container's exposed port. Use curl http://localhost:<port>/metrics from the host to confirm accessibility.
- Run promtool tsdb analyze --extended /prometheus/data. Look for metrics with high label value counts. Common culprit: request_latency_seconds_bucket with user_id label. Reduce label cardinality or shorten the retention window.
- Alerts wait out their for: duration before firing. Check the ALERTS metric in Prometheus. Also verify Alertmanager is reachable from Prometheus and that inhibition rules aren't suppressing the alert.
- Query Prometheus directly: curl -s 'http://localhost:9090/api/v1/query?query=up'
- Query through the Grafana datasource proxy: curl -s "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up&time=`date +%s`"