Intermediate 10 min · March 06, 2026

Prometheus & Grafana Setup - Static IPs Cause Outages

No alert fired despite payment failures - static IP targets in Docker caused scrape loss after restart.

Naren · Founder
Quick Answer
  • Prometheus scrapes metrics from services and stores them in a time-series database
  • Grafana visualises those metrics on customizable dashboards without writing UI code
  • Alertmanager handles notifications when metrics cross thresholds
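  • A well-provisioned Prometheus server can ingest on the order of a million samples per second; series cardinality, not raw ingest rate, is usually the bottleneck
  • Targets that silently go unscraped because of network or discovery misconfiguration are among the most common causes of monitoring incidents
  • Many setups over-scrape: scraping more often than targets can answer within scrape_timeout produces failed scrapes and gaps, and a scrape_timeout longer than scrape_interval is rejected when the config loads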
Plain-English First

Imagine your app is a car engine. You wouldn't drive cross-country without a dashboard showing your speed, fuel, and temperature — you'd break down without warning. Prometheus is the set of sensors bolted to that engine, constantly measuring everything. Grafana is the beautiful dashboard on your steering wheel that turns those raw sensor readings into dials you can actually understand at a glance. Without this combo, you're driving blind.

Every production system fails eventually — the only question is whether YOU find out first, or your users do. In 2024, a five-minute outage at a mid-sized SaaS company can cost tens of thousands of dollars and destroy user trust built over months. The teams that catch problems in seconds rather than minutes aren't lucky — they have observability pipelines built with tools like Prometheus and Grafana that surface anomalies the moment they appear, not after a support ticket rolls in.

Before Prometheus became the de-facto standard for cloud-native monitoring, teams were duct-taping together cron jobs, custom scripts, and expensive APM vendors to answer the simplest question: 'Is my service healthy right now?' Prometheus solves this with a pull-based model that scrapes metrics from your services on a schedule, stores them in a time-series database, and lets you query them with a powerful expression language called PromQL. Grafana then plugs into that database and lets you visualise, alert on, and share those metrics without writing a single line of UI code.

By the end of this article you'll have a fully working Prometheus and Grafana stack running locally via Docker Compose, a real Node.js app exposing custom business metrics, a PromQL query that actually answers a business question, and an alerting rule that fires before your users notice a problem. This is the exact setup you'd use as a foundation for a production monitoring stack.

What is Prometheus and Grafana Setup?

Prometheus is an open-source time-series database and monitoring system that scrapes metrics from configured targets at regular intervals. Grafana is the analytics and visualization layer that queries Prometheus (and many other data sources) to build dashboards, alerts, and ad-hoc queries.

The combination gives you a complete observability pipeline: instrument your code → expose metrics → scrape → store → query → alert → visualize. No external SaaS required, no per-seat licensing, and full control over retention.

You don't need to be an SRE to run this. A single Docker Compose file gets you a working stack in under 10 minutes. But the simplicity hides depth — get the scrape cadence wrong and you'll either burn your metrics disk or miss critical data.

Production Insight
Prometheus's built-in default scrape_interval is 1m; the 15s used in most example configs is fine for CPU and memory but can smooth over short request-rate spikes.
scrape_interval applies per job, not per metric: scrape latency-sensitive jobs (request latency per endpoint) every 10s and slow-moving ones (disk space) every 60s.
Rule: set scrape_interval and scrape_timeout explicitly per job, and keep scrape_timeout at or below scrape_interval.
Key Takeaway
Prometheus stores what you scrape, not what you hope to debug later.
Design your metrics with query patterns in mind from day one.
Scrape sloppy, debug blind.
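
The per-job rule above can be checked mechanically before a config ships. The following is a minimal sketch, not part of Prometheus or promtool: parseDuration and checkScrapeJobs are hypothetical helpers, the job names are made up, and only single-unit durations such as 15s or 1m are handled.

```javascript
// Convert a single-unit Prometheus-style duration ("500ms", "15s", "1m", "2h") to milliseconds.
// Compound forms such as "1m30s" are deliberately not handled in this sketch.
function parseDuration(text) {
  const match = /^(\d+)(ms|s|m|h)$/.exec(text);
  if (!match) throw new Error(`unsupported duration: ${text}`);
  const unitMs = { ms: 1, s: 1000, m: 60000, h: 3600000 };
  return Number(match[1]) * unitMs[match[2]];
}

// Return one human-readable problem per job that breaks the rule:
// both settings must be explicit, and the timeout must not exceed the interval.
function checkScrapeJobs(jobs) {
  const problems = [];
  for (const job of jobs) {
    if (!job.scrape_interval || !job.scrape_timeout) {
      problems.push(`${job.job_name}: set scrape_interval and scrape_timeout explicitly`);
      continue;
    }
    if (parseDuration(job.scrape_timeout) > parseDuration(job.scrape_interval)) {
      problems.push(
        `${job.job_name}: scrape_timeout ${job.scrape_timeout} exceeds scrape_interval ${job.scrape_interval}`
      );
    }
  }
  return problems;
}

// Hypothetical jobs, written as plain objects mirroring scrape_configs entries.
const exampleJobs = [
  { job_name: 'api-latency', scrape_interval: '10s', scrape_timeout: '5s' },
  { job_name: 'disk-space', scrape_interval: '60s', scrape_timeout: '90s' },
  { job_name: 'legacy' },
];

console.log(checkScrapeJobs(exampleJobs));
```

With the sample jobs, the check flags disk-space (timeout longer than interval) and legacy (nothing set explicitly) and leaves api-latency alone.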

Prometheus Architecture Flow Visual

Understanding the data flow from instrumentation to alerting is critical for debugging production issues. The flow below traces how metrics travel from your application to Prometheus, then to Alertmanager and Grafana.

Flow overview:
- Your application exposes metrics at /metrics (pull-based).
- Prometheus scrapes these targets based on its scrape_configs and stores the data in its TSDB.
- Prometheus evaluates alerting rules against the stored data and pushes alerts to Alertmanager.
- Alertmanager handles deduplication, grouping, and routing to notification channels (PagerDuty, Slack, email).
- Grafana queries Prometheus via its API to render dashboards.

The key architectural decision is that Prometheus pulls from targets, not the other way around. This makes it resilient to network partitions on the target side and gives you control over scrape cadence.

Pull vs Push Misconception
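Prometheus must be able to reach a target to scrape it. A target behind a firewall that Prometheus cannot traverse cannot be scraped directly, and a target whose IP keeps changing needs service discovery rather than a static address. Use the Pushgateway for short-lived batch jobs and the Blackbox exporter for endpoint probes. The pull model is a feature, not a bug: it makes the monitoring system the source of truth for target availability.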
Production Insight
The flow diagram maps directly to debugging steps:
- No data in Grafana? Check the Prometheus API first.
- No alert fired? Check alerting rules and Alertmanager status.
- Missing targets? Verify scrape configuration and target reachability.
Keep this flow in mind when triaging — it prevents jumping to wrong conclusions.
Key Takeaway
Metrics flow from targets to Prometheus to Alertmanager and Grafana.
Each hop is a potential failure point.
Debug upstream first (Prometheus API), then downstream (Alertmanager, Grafana).
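
As a concrete version of "check the Prometheus API first", the sketch below filters the JSON returned by Prometheus's GET /api/v1/targets endpoint for targets that are not healthy. findUnhealthyTargets is a hypothetical helper, and the sample object is hand-written to match the shape of that response; the values in it are illustrative.

```javascript
// Pick out targets that Prometheus itself reports as not healthy.
// `response` is the parsed JSON body of GET /api/v1/targets.
function findUnhealthyTargets(response) {
  if (response.status !== 'success') {
    throw new Error('Prometheus API call did not succeed');
  }
  return response.data.activeTargets
    .filter((target) => target.health !== 'up')
    .map((target) => ({
      job: target.labels.job,
      instance: target.labels.instance,
      health: target.health,
      lastError: target.lastError || '',
    }));
}

// Hand-written sample shaped like the API response (illustrative values only).
const sampleTargets = {
  status: 'success',
  data: {
    activeTargets: [
      {
        labels: { job: 'node-app', instance: 'app:3001' },
        scrapeUrl: 'http://app:3001/metrics',
        health: 'up',
        lastError: '',
      },
      {
        labels: { job: 'node-app', instance: '172.18.0.4:3001' },
        scrapeUrl: 'http://172.18.0.4:3001/metrics',
        health: 'down',
        lastError: 'connection refused',
      },
    ],
  },
};

console.log(findUnhealthyTargets(sampleTargets));
```

Against a live stack, the same function could be fed the result of fetching http://localhost:9090/api/v1/targets (Node 18+ ships a global fetch); an empty result means the problem is further downstream.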

Setting Up the Stack with Docker Compose

Here's a minimal docker-compose.yml that runs Prometheus, Grafana, and a Node.js app that exposes custom metrics. We'll use the prom/prometheus and grafana/grafana official images. The Node.js app uses the prom-client library.

```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
  app:
    build: ./app
    ports:
      - "3001:3001"
    environment:
      - METRICS_PORT=3001
```

The Prometheus config file (prometheus.yml) must define scrape targets. Notice that we reference the target by its Docker Compose service name; Docker's embedded DNS resolver resolves it.

```yaml
scrape_configs:
  - job_name: 'node-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['app:3001']
```

Don't forget to add a network alias if you have multiple networks, or use depends_on: condition: service_healthy to ensure your app is ready before Prometheus starts scraping.

Common Pitfall: depends_on is not a readiness check
Prometheus will start before your app is ready to serve metrics. Add a healthcheck to the app service, then give the prometheus service a depends_on block that uses condition: service_healthy. Otherwise the target may show as DOWN until the first successful scrape.
Production Insight
Docker Compose DNS resolves service names internally, but if you override network_mode: host, service-name resolution breaks.
Always verify connectivity with docker compose exec prometheus wget -q -O- http://app:3001/metrics.
Rule: test the scrape endpoint from inside the Prometheus container, not from the host.
Key Takeaway
Compose files are reproducible local stacks — but don't treat them as production config.
Always add health checks and conditional depends_on.
Scrape target resolution fails silently; verify before relying on it.
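
Once the endpoint answers, it is worth confirming that it actually exposes the metrics you expect. The sketch below reads a text-format payload (for example the output of the wget command above) and reports expected metric families that are missing. listMetricFamilies and missingFamilies are hypothetical helpers, the payload is a shortened hand-written sample, and orders_total is a made-up metric name.

```javascript
// Map each metric family declared with a "# TYPE <name> <type>" line to its type.
function listMetricFamilies(body) {
  const families = new Map();
  for (const line of body.split('\n')) {
    const match = /^# TYPE (\S+) (\S+)/.exec(line);
    if (match) families.set(match[1], match[2]);
  }
  return families;
}

// Report which expected metric families do not appear in the payload.
function missingFamilies(body, expected) {
  const present = listMetricFamilies(body);
  return expected.filter((name) => !present.has(name));
}

// Shortened, hand-written sample of what /metrics might return.
const samplePayload = [
  '# HELP http_request_duration_seconds Duration of HTTP requests in seconds',
  '# TYPE http_request_duration_seconds histogram',
  'http_request_duration_seconds_bucket{le="0.01",method="GET",route="/hello",status="200"} 3',
  'http_request_duration_seconds_sum{method="GET",route="/hello",status="200"} 0.012',
  'http_request_duration_seconds_count{method="GET",route="/hello",status="200"} 3',
  '# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.',
  '# TYPE process_cpu_seconds_total counter',
  'process_cpu_seconds_total 0.12',
].join('\n');

// orders_total is hypothetical and intentionally absent from the sample.
console.log(missingFamilies(samplePayload, ['http_request_duration_seconds', 'orders_total']));
// prints [ 'orders_total' ]
```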

Service Discovery: Configuring Consul, Kubernetes, and Docker Targets

Static targets work fine for a single Docker Compose stack, but in production your services scale, restart, and move IPs. That's where service discovery comes in. Prometheus supports multiple discovery mechanisms that dynamically generate targets from your infrastructure catalog.

Consul Service Discovery

If you run Consul, Prometheus can discover targets by querying the Consul API. The following config scrapes all services that carry a metrics tag.

```yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        tags:
          - metrics
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: instance
      - source_labels: [__meta_consul_service_address, __meta_consul_service_port]
        separator: ':'
        target_label: __address__
```

Kubernetes Service Discovery

On Kubernetes, Prometheus uses the API to watch pods, services, endpoints, and ingresses. The most common pattern is scraping pods based on annotations.

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
```

Docker Service Discovery

For standalone Docker containers, use docker_sd_configs (Docker Swarm has its own dockerswarm_sd_configs). This config discovers running containers and keeps only those carrying the prometheus.io/scrape: true label.

```yaml
scrape_configs:
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: 'unix:///var/run/docker.sock'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        replacement: '$1'
        target_label: container
```

Each SD config uses relabel_configs to map the discovered metadata into Prometheus target labels. The key is to extract the IP, port, and any meaningful labels (service name, environment) from the provider's metadata.

Security: Don't Expose the Docker Socket Directly
Giving Prometheus access to the Docker socket is a security risk. Use a dedicated Docker API proxy with read-only access, or switch to file-based discovery if your orchestrator doesn't change often. The docker_sd_configs approach should only be used in trusted environments.
Production Insight
Service discovery eliminates the static IP problem, but introduces latency: changes may take up to refresh_interval to propagate. In Consul, this is typically 30s. In Kubernetes, the watch mechanism is near-instant. Always set refresh_interval based on your scaling cadence — 5s for autoscaling services, 60s for long-running VMs.
Rule: combine SD with a blackbox exporter that pings targets from Prometheus's perspective to validate connectivity independent of SD.
Key Takeaway
Static targets fail in dynamic environments.
Service discovery ties Prometheus to your orchestrator's truth.
Use relabel_configs to transform provider metadata into meaningful labels.
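
To make the relabeling step less abstract, here is a simplified JavaScript simulation of a single relabel rule with action: replace, applied to the Kubernetes __address__ rewrite used in this section. applyReplace is a hypothetical helper; it follows the documented behaviour (join the source labels with the separator, match a fully anchored regex, write the expanded replacement) but uses JavaScript regular expressions instead of RE2 and ignores the other relabel actions.

```javascript
// Apply one relabel rule with action "replace" to a label set.
// Simplified: source labels are joined with the separator (default ";"),
// the regex must match the whole joined string, and the expanded
// replacement is written to target_label. Non-matching rules change nothing.
function applyReplace(labels, rule) {
  const separator = rule.separator ?? ';';
  const joined = rule.source_labels.map((name) => labels[name] ?? '').join(separator);
  const anchored = new RegExp(`^(?:${rule.regex ?? '(.*)'})$`);
  if (!anchored.test(joined)) return labels;
  const value = joined.replace(anchored, rule.replacement ?? '$1');
  return { ...labels, [rule.target_label]: value };
}

// The __address__ rewrite from the kubernetes-pods job.
const addressRule = {
  source_labels: ['__address__', '__meta_kubernetes_pod_annotation_prometheus_io_port'],
  regex: String.raw`([^:]+)(?::\d+)?;(\d+)`,
  replacement: '$1:$2',
  target_label: '__address__',
};

// Illustrative discovered labels: the pod listens on 8080 but annotates 9102 for metrics.
const discovered = {
  __address__: '10.0.0.5:8080',
  __meta_kubernetes_pod_annotation_prometheus_io_port: '9102',
};

console.log(applyReplace(discovered, addressRule).__address__);
// prints 10.0.0.5:9102
```

The joined string is 10.0.0.5:8080;9102, so the first capture group keeps the host, the optional port is discarded, and the annotated port is appended. A pod without the port annotation does not match and keeps its original address.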

Instrumenting Your Application

Your app must expose a /metrics HTTP endpoint that Prometheus can scrape. For Node.js, the prom-client library serves metrics in the Prometheus text exposition format. Create a simple Express app that tracks request count and latency.

```javascript
const express = require('express');
const prometheus = require('prom-client');

const app = express();
const register = new prometheus.Registry();

// Collect default process metrics (CPU, memory, event loop lag) into our registry.
prometheus.collectDefaultMetrics({ register });

// Histogram of request durations in seconds, labelled by method, route, and status.
const httpRequestDurationSeconds = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [register],
});

// Time every request and record the observation once the response has been sent.
app.use((req, res, next) => {
  const end = httpRequestDurationSeconds.startTimer();
  res.on('finish', () => {
    // req.route is only set for matched routes, so unmatched paths collapse into 'unknown'.
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.get('/hello', (req, res) => res.json({ message: 'hello' }));

// METRICS_PORT matches the environment variable set in docker-compose.yml.
const port = Number(process.env.METRICS_PORT) || 3001;
app.listen(port, () => console.log(`Metrics at http://localhost:${port}/metrics`));
```

Key points: register default metrics (CPU, memory, event loop lag), define custom histograms with appropriate buckets, and always use labelNames that correspond to production dimensions (method, route, status). Too many unique label combinations will cause high cardinality, so be surgical.

Mental Model: Metrics Are Dimensions, Not Counters
  • A histogram labelled with method, route, and status creates a set of series for every (method, route, status) combination.
  • 5 methods × 20 routes × 5 statuses = 500 label combinations, and each combination emits one series per bucket plus _sum and _count.
  • Prometheus memory grows roughly in proportion to the number of active series; as a rough guide, budget a few GB of RAM per 1M series.
  • Aggressively limit label variability: use the route template (/users/:id) instead of the raw URL path, or aggregate before ingestion.
Production Insight
High cardinality is the silent killer of Prometheus. A simple gauge with user_id label creates 10k series for 10k users — that's 10k unique time-series per gauge.
If you must track per-user metrics, push them to a separate TSDB like VictoriaMetrics with higher cardinality tolerance.
Rule: never put unbounded labels (user_id, request_id, session_id) on metrics.
Key Takeaway
Instrumentation is a contract: you decide what's measurable.
Every label you add multiplies the storage cost.
Measure what matters, not everything that moves.
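
One way to keep the route label bounded when a raw URL path is all you have is to collapse variable segments before using the path as a label value. The sketch below is an assumption about how that might be done, not something prom-client provides: normalizeRoute is a hypothetical helper that replaces numeric IDs and UUIDs with placeholders and drops the query string.

```javascript
// Collapse variable path segments so the "route" label has a bounded set of values.
function normalizeRoute(path) {
  const uuid = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
  return path
    .split('?')[0] // query strings never belong in a label
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (uuid.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

console.log(normalizeRoute('/users/42/orders/7?page=2'));
// prints /users/:id/orders/:id
```

In the Express middleware from this section, req.route?.path already yields the route template for matched routes, so a helper like this is mainly useful for frameworks or proxies that only expose the raw path.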

PromQL: The Query Language That Makes Metrics Useful

PromQL is the expression language that turns stored metrics into actionable insights. It's deceptively simple: rate(http_requests_total[5m]) gives you requests per second. Mastering aggregations, offset, and subqueries is what separates calm debugging from panic.

  • Request rate per route: sum by (route) (rate(http_request_duration_seconds_count{job="node-app"}[5m]))
  • 95th percentile latency: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
  • Error ratio: sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

Common gotcha: rate() requires a counter. Only use it on monotonically increasing metrics, which by convention end in _total, or on the _count, _sum, and _bucket series of a histogram. For gauges (memory in use, queue depth), use avg_over_time() or max_over_time().
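
To see why rate() only makes sense on counters, it helps to look at a stripped-down version of the calculation. The sketch below is a simplification: the real rate() also extrapolates to the edges of the range window, which is omitted here. simpleRate is a hypothetical helper and the sample values are invented.

```javascript
// Simplified per-second rate over a window of counter samples.
// samples: array of { t: seconds, v: value }, sorted by time.
function simpleRate(samples) {
  if (samples.length < 2) return null; // at least two samples are needed
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop is treated as a counter reset (process restart): count up from zero again.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsed;
}

// A counter that restarts between t=20 and t=40: 40 + 10 + 10 = 60 over 60 seconds.
const counterSamples = [
  { t: 0, v: 100 },
  { t: 20, v: 140 },
  { t: 40, v: 10 },
  { t: 60, v: 20 },
];
console.log(simpleRate(counterSamples)); // prints 1

// A gauge that dips and recovers: every dip is misread as a reset,
// so the result is a positive "rate" even though the value went down overall.
const gaugeSamples = [
  { t: 0, v: 500 },
  { t: 30, v: 300 },
  { t: 60, v: 400 },
];
console.log(simpleRate(gaugeSamples));
```

The reset handling is what makes the function robust for counters and meaningless for gauges.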

Another trap: omitting the by() clause inside histogram_quantile() aggregates across all labels, giving you a single value for your entire app. Always aggregate by (le, route) to get per-route latencies.

queries.promql

```promql
# Request rate per route (counter)
sum by (route) (rate(http_request_duration_seconds_count{job="node-app"}[5m]))

# 95th percentile latency per route (histogram)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{job="node-app"}[5m])) by (le, route)
)

# Error ratio (status 5xx vs total)
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# CPU usage of the app container (container_cpu_usage_seconds_total is a counter, so use rate)
avg by (name) (rate(container_cpu_usage_seconds_total{name="app"}[1m]))
```

PromQL Tip: Always Use Aggregators
Queries without by() or without() may return multiple time-series depending on label values. Use sum by(...) or avg by(...) to reduce dimensionality and make dashboards predictable.
Production Insight
PromQL queries over high-cardinality series can time out; the default --query.timeout is 2m. Use recording rules for expensive queries (e.g., an hourly aggregated error rate) to precompute them every 5 minutes.
Rule: never execute an unaggregated PromQL query on a dashboard with auto-refresh under 30s.
Key Takeaway
PromQL is powerful but expensive.
Aggregate early, aggregate often.
Use recording rules for dashboard queries; your CPU budget will thank you.

PromQL Common Query Cheat Sheet (rate, sum, increase)

Here's a quick-reference cheat sheet for the three most frequently used PromQL functions. Bookmark it for when you need to write a query fast.

rate(): per-second average rate of increase over a time window (for counters)
- rate(metric_total[5m]) → per-second rate over the last 5 minutes
- rate(http_requests_total{status="500"}[10m]) → error rate per second
- Always use with counters (names ending in _total, or histogram _count, _sum, and _bucket series)

increase(): total increase over a time window (for counters)
- increase(http_requests_total[1h]) → total requests in the last hour
- increase(errors_total[24h]) → total errors in the last day
- Useful for billing or compliance metrics that need cumulative totals

sum(): aggregates time series across labels
- sum(rate(http_requests_total[5m])) → total request rate across all instances
- sum by (service) (rate(http_requests_total[5m])) → request rate per service
- Combine with rate() or increase() to reduce dimensionality

Common combinations:

```promql
# Slow requests (> 1s) in the last minute: all requests minus those that finished within 1s
sum(increase(http_request_duration_seconds_count[1m]))
-
sum(increase(http_request_duration_seconds_bucket{le="1"}[1m]))

# 99th percentile latency by route
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

# Fraction of failed requests per route
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (route) (rate(http_requests_total[5m]))

# Memory usage per container in MB
container_memory_usage_bytes / 1024 / 1024
```
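
The histogram_quantile() queries above return estimates, not exact percentiles, because Prometheus only knows how many observations fell into each bucket. The following sketch shows the linear interpolation idea in JavaScript. histogramQuantile is a hypothetical helper, the bucket counts are invented, and the code is a simplified illustration rather than a port of the Prometheus implementation.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le, ending with le = Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const index = buckets.findIndex((bucket) => bucket.count >= rank);
  const bucket = buckets[index];
  // If the rank lands in the +Inf bucket, the best available answer is the highest finite bound.
  if (bucket.le === Infinity) return buckets[buckets.length - 2].le;
  // The lowest bucket is assumed to start at 0.
  const lowerBound = index === 0 ? 0 : buckets[index - 1].le;
  const lowerCount = index === 0 ? 0 : buckets[index - 1].count;
  // Assume observations are spread evenly inside the bucket and interpolate linearly.
  const fraction = (rank - lowerCount) / (bucket.count - lowerCount);
  return lowerBound + (bucket.le - lowerBound) * fraction;
}

// Invented cumulative counts for 100 requests.
const latencyBuckets = [
  { le: 0.1, count: 20 },
  { le: 0.5, count: 60 },
  { le: 1, count: 90 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.75, latencyBuckets)); // prints 0.75
```

The 75th request falls in the (0.5, 1] bucket, halfway between the 60th and 90th, so the estimate is halfway between the bounds. The true value could be anywhere in that bucket, which is why bucket boundaries should sit close to the latencies you care about.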

Quick rules of thumb:
- Counter metric → rate() or increase()
- Gauge metric → avg_over_time(), max_over_time(), or raw value
- Histogram metric → histogram_quantile() + rate(_bucket)
- High cardinality → sum by() to collapse
- Missing by() → you get multiple series (often not what you want)

promql-cheat-sheet.promql

```promql
# RATE: Requests per second
rate(http_requests_total[5m])

# INCREASE: Total requests in last hour
increase(http_requests_total[1h])

# SUM + RATE: Total error rate across all instances
sum(rate(http_requests_total{status=~"5.."}[5m]))

# SUM BY: Error rate per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))

# HISTOGRAM QUANTILE: 99th percentile latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

# RATIO: Error fraction per route (multiply by 100 for a percentage)
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (route) (rate(http_requests_total[5m]))

# CPU: usage percentage (container_cpu_usage_seconds_total is a counter, so rate() applies)
avg by (name) (rate(container_cpu_usage_seconds_total[1m])) * 100
```

Common Mistake: Using rate() on a gauge
rate() is only valid for counters (always increasing). Using it on a gauge that goes up and down (e.g., memory usage) will produce nonsense values. Use avg_over_time() or delta() for gauges.
Production Insight
A well-tuned PromQL query is the difference between a 10ms dashboard load and a 10s timeout. Always use recording rules for queries that appear on multiple dashboards or have a heavy by() clause.
In a production incident, increase(errors_total[5m]) is often easier to read than rate() when you need the raw count quickly.
Key Takeaway
Memorize the rate, increase, sum combo.
Counter → rate/increase. Gauge → average/raw.
Aggregate by meaningful labels to control cardinality.

Alerting and Grafana Dashboards

Prometheus alerting works in two parts: alerting rules in Prometheus, and alert routing by Alertmanager. Rules are defined in a YAML file and loaded by Prometheus. They are evaluated every evaluation_interval (1m by default, often set to 15s), not at scrape time, and can carry different severity labels.

Example rule for high error rate:

```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # Aggregate by job on both sides so {{ $labels.job }} is populated in the annotation.
        expr: |
          (
            sum by (job) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
```

The for: clause prevents flapping: the condition must hold for 2 minutes before the alert fires. Configure Alertmanager to route firing alerts to email, PagerDuty, or Slack.
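
The effect of for: is easier to see as a small state machine. The sketch below walks through a series of rule evaluations and reports the state after each one. alertStates is a hypothetical helper and the evaluation times are invented; it models only the inactive, pending, and firing transitions, not Alertmanager routing.

```javascript
// Report the alert state after each rule evaluation.
// evaluations: [{ t: seconds, conditionTrue: boolean }] in time order.
// forSeconds: the rule's "for:" duration in seconds.
function alertStates(evaluations, forSeconds) {
  let activeSince = null;
  return evaluations.map(({ t, conditionTrue }) => {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the clock
      return 'inactive';
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// One evaluation per minute with for: 2m (120 seconds).
const evaluations = [
  { t: 0, conditionTrue: true },
  { t: 60, conditionTrue: true },
  { t: 120, conditionTrue: true },
  { t: 180, conditionTrue: false },
  { t: 240, conditionTrue: true },
];

console.log(alertStates(evaluations, 120));
// prints [ 'pending', 'pending', 'firing', 'inactive', 'pending' ]
```

Note how a single false evaluation at t=180 sends the alert back to inactive, so the next breach has to wait the full for: duration again.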

Grafana dashboards visualize these alerts. Create a panel with the query ALERTS{alertstate="firing"} to show active alerts. Better yet, use the Alertmanager datasource in Grafana for a dedicated alerts page.

For dashboards, follow the SRE golden signals: latency, traffic, errors, saturation. Start with a system overview panel, then add drill-down panels per service.

alerting-rules.yml

```yaml
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # Aggregate by job on both sides so {{ $labels.job }} is populated in the annotation.
        expr: |
          (
            sum by (job) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
            /
            sum by (job) (rate(http_request_duration_seconds_count[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
      - alert: InstanceDown
        expr: up{job="node-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
```

Mental Model: Alert Fatigue is a Design Problem
Every alert that fires but doesn't require a human action is noise: it trains your team to ignore the next alert.
  • Alert on symptoms (error rate >5%), not causes (CPU >80%).
  • Use for: to avoid flapping alerts from transient spikes.
  • Only escalate to on-call if the alert requires immediate human intervention within 15 minutes.
  • A well-designed dashboard should show metrics; alerts should show anomalies.
Production Insight
Alertmanager can receive thousands of alerts during an incident, but a human can process only a handful per minute. Use inhibition rules to mute lower-severity alerts while a critical alert for the same cluster is already firing.
Rule: one incident should page once, not once per affected alert.
Key Takeaway
Alerting isn't about being loud; it's about being precise.
Every alert should answer: what's broken, why, and what to fix.
If nobody can act on it, don't alert; put it on a dashboard instead.

Production incident · Post-mortem · Severity: high

The Silent Outage: Missing Metrics Alert

Symptom
Customers reporting payment failures but no alert fired. All service dashboards showed green.
Assumption
The team assumed Prometheus would auto-discover new container IPs via DNS or service discovery.
Root cause
Prometheus was configured with static targets pointing to container IPs that changed after a Docker restart. No service discovery was configured, and the new IPs were never scraped.
Fix
Switch to service discovery: use file_sd_configs, consul_sd_configs, or dns_sd_configs instead of hard-coded IPs. In Docker Compose, target the service name or a network alias so the name resolves consistently after a restart.
Key lesson
  • Never rely on static IPs for Prometheus targets in dynamic environments.
  • Always run a Blackbox exporter probe alongside your scrapes to verify that Prometheus itself can still reach its targets.
  • Alert on up == 0 with a short for: duration, and on absent(up{job="..."}) for critical jobs, to catch total scrape loss.
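
The last lesson can be sketched as a check over the result of the instant query up. The code below takes the data.result array returned by GET /api/v1/query?query=up and reports both targets that are down and expected jobs that have no up series at all. detectScrapeLoss is a hypothetical helper, the sample result is hand-written, and payments is a made-up job name standing in for a service whose targets were never scraped.

```javascript
// Detect scrape loss from the instant-vector result of the query "up".
// result: data.result from GET /api/v1/query?query=up
// expectedJobs: job names that must always have at least one target.
function detectScrapeLoss(result, expectedJobs) {
  const down = result
    .filter((series) => series.value[1] === '0') // sample values arrive as strings
    .map((series) => `${series.metric.job}/${series.metric.instance}`);
  const seenJobs = new Set(result.map((series) => series.metric.job));
  const absent = expectedJobs.filter((job) => !seenJobs.has(job));
  return { down, absent };
}

// Hand-written sample: a stale static IP that no longer answers, and no series for "payments".
const upResult = [
  { metric: { __name__: 'up', job: 'node-app', instance: '172.18.0.4:3001' }, value: [1772755200, '0'] },
  { metric: { __name__: 'up', job: 'prometheus', instance: 'localhost:9090' }, value: [1772755200, '1'] },
];

console.log(detectScrapeLoss(upResult, ['node-app', 'payments']));
// prints { down: [ 'node-app/172.18.0.4:3001' ], absent: [ 'payments' ] }
```

The down list corresponds to an up == 0 alert, and the absent list corresponds to an absent() alert; the incident above would have been caught by either.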
Production debug guide
Symptom → Action guide for the most common production problems
Symptom · 01
Grafana dashboards show 'No data' for a panel
Fix
Check the Prometheus data source connection in Grafana. Verify that the query returns data in Prometheus's own expression browser. Common cause: time range mismatch — Grafana defaults to last 6 hours but data might be older.
Symptom · 02
Prometheus target shows 'DOWN' with connection refused
Fix
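Run docker compose logs <service> to check whether the metrics endpoint is listening. Verify that the port in the scrape config matches the port the app listens on inside the container. curl http://localhost:<port>/metrics from the host confirms the app is up, but also test from inside the Prometheus container (docker compose exec prometheus wget -q -O- http://<service>:<port>/metrics), since that is the network path Prometheus actually uses.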
Symptom · 03
High memory usage on Prometheus server
Fix
Check cardinality: promtool tsdb analyze --extended /prometheus/data. Look for metrics with high label value counts. Common culprit: request_latency_seconds_bucket with a user_id label. Reduce label cardinality at the source, or drop the offending series with metric_relabel_configs; a longer retention window does not reduce memory use.
Symptom · 04
Alerts not firing despite rule evaluation showing pending
Fix
Alerts must remain in 'Pending' state for the configured for: duration before firing. Check ALERTS metric in Prometheus. Also verify Alertmanager is reachable from Prometheus and that inhibition rules aren't suppressing the alert.
Quick Debug Cheat Sheet for Prometheus & Grafana
Run these commands in order when things go dark
Grafana shows no panels at all
Immediate action
Check browser console for 404 on datasource proxy
Commands
curl -s "http://localhost:9090/api/v1/query?query=up"
curl -s -u admin:admin "http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up&time=$(date +%s)"
Fix now
Make sure the datasource URL in Grafana matches the Prometheus server URL (internal network name, not localhost).
Prometheus targets show as DOWN
Immediate action
Check if the target process is running and exposing /metrics
Commands
docker compose ps
docker compose logs <target-service> | grep -i metrics
Fix now
If the service is running, verify the metrics endpoint path. If the app serves metrics somewhere other than /metrics, set metrics_path for that job in the scrape config.
Alertmanager not delivering alerts
Immediate action
Check alertmanager logs for receiver errors
Commands
docker compose logs alertmanager | tail -30
curl -s http://localhost:9093/api/v2/alerts | jq '. | length'
Fix now
Verify your receiver configuration (webhook URL, email SMTP) is correct. Use a test webhook service like webhook.site to verify delivery.