Intermediate 7 min · March 06, 2026

Database Monitoring Tools

Idle-in-Transaction Detection — pg_stat_activity Alerting

Q: What is the best free database monitoring tool for MySQL and PostgreSQL?

The Prometheus + Grafana stack with mysqld_exporter and postgres_exporter is the industry-standard free option used by teams at every scale. It's free, open-source, runs in Docker, has pre-built community dashboards, and supports alerting via Alertmanager. For a managed option, Datadog and New Relic both have excellent database monitoring but charge per host.

Q: How do I find which queries are slowing down my PostgreSQL database?

First, install the pg_stat_statements extension (CREATE EXTENSION pg_stat_statements) and add it to shared_preload_libraries in postgresql.conf. Then query SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 20. This shows you normalized queries sorted by average execution time — the top results are your optimization targets.

Q: What is the difference between monitoring and observability for databases?

Monitoring tracks predefined metrics — you know in advance what you want to measure (query latency, connection count, cache hit rate) and you set thresholds on those. Observability is the ability to understand why your database is behaving the way it is from any angle, even for failure modes you didn't predict. Monitoring answers 'is something wrong?', observability answers 'why is it wrong?'. In practice, the pg_stat_* views and performance_schema are your observability layer — they let you ask new questions you didn't plan for.

Q: Should I use the same exporter for development and production?

No. Use a separate monitoring user per environment with environment-specific passwords. In development, you can use a more permissive configuration to explore metrics, but production credentials must be strictly read-only and restricted to the exporter's IP or container. Never reuse passwords across environments — if a developer's local monitoring stack is compromised, production should not be affected.

Q: My PostgreSQL exporter shows 'pg_stat_activity_count' but not 'pg_stat_user_tables'. What's wrong?

The postgres_exporter has multiple collectors, and not all are enabled by default. You need to explicitly enable the table statistics collector. Add the command-line flag --collect.stat_user_tables to the exporter startup command. For Docker, pass it in the 'command' section of the docker-compose.yml. Alternatively, use the newer postgres_exporter v0.15+ which has automatic metric collection — check the documentation for your version.

Dead rows hit 200% of live data when idle transactions blocked autovacuum for hours.

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Written from production experience, not tutorials.

✓ Production

production tested

July 18, 2026

last updated

2,466

articles · all by Naren

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

MySQL and PostgreSQL monitoring uses the Prometheus + Grafana stack: exporters scrape internal metrics (performance_schema, pg_stat_*) every 15s
Core metrics to watch: Threads_running (MySQL), active queries (PG), buffer pool hit rate, VACUUM health, lock contention
mysqld_exporter on port 9104, postgres_exporter on port 9187 — both expose /metrics endpoints for Prometheus
Pre-built Grafana dashboards: MySQL ID 7362, PostgreSQL ID 9628 — import in 30 seconds, then customize thresholds
Production insight: idle-in-transaction connections in PG block VACUUM and cause silent table bloat — alert on state='idle in transaction' lasting >2min
Biggest mistake: using exporter with database superuser — create a dedicated read-only monitoring user with minimum grants

✦ Definition~90s read

What is Database Monitoring Tools?

Idle-in-transaction detection is a critical PostgreSQL monitoring pattern that catches connections holding open transactions without doing any work — a common source of bloat, lock contention, and connection pool exhaustion. When an application opens a transaction, performs some reads or writes, then goes silent (e.g., waiting on an external API or user input), that transaction remains open, holding locks and preventing autovacuum from reclaiming dead tuples.

★

Imagine your database is a busy restaurant kitchen.

Over time, these zombie transactions accumulate, degrading query performance and eventually maxing out connection limits. The pg_stat_activity view exposes this directly via the state column showing 'idle in transaction' and the state_change timestamp, letting you compute duration and alert before damage spreads.

This pattern sits at the intersection of application behavior and database health — it's not a metric you'd typically get from OS-level exporters like node_exporter. Instead, you query PostgreSQL's system views directly, often via a custom exporter or a Prometheus PostgreSQL adapter.

The alternative is relying on connection pool middleware (PgBouncer, Pgpool-II) to enforce timeouts, but those are blunt instruments that can break legitimate long-running transactions. Idle-in-transaction detection gives you surgical visibility: you can identify the exact backend PID, query, application name, and client address responsible, then alert or kill selectively.

Tools like pg_terminate_backend() let you automate cleanup, but only if you know which transactions to target.

In production environments handling thousands of connections — think Rails apps with 50+ puma workers each holding a pool of 5 connections — even a 1% idle-in-transaction rate can consume 50+ connections per host. Combined with connection pooling limits (e.g., PgBouncer's max_client_conn default of 100), this quickly becomes a capacity crisis.

The Prometheus alert rule typically fires when any transaction has been idle for more than 5-10 minutes, with severity escalating based on count and duration. This is one of the highest-signal, lowest-noise alerts you can set up, because idle-in-transaction is almost never intentional in OLTP workloads — it's always a bug or a misconfiguration waiting to cascade.

Plain-English First

Imagine your database is a busy restaurant kitchen. Orders come in, food goes out, and everything runs smoothly — until it doesn't. A database monitoring tool is like a health inspector who lives in that kitchen 24/7, watching every chef, every stove burner, and every order ticket. The moment something starts burning or a chef freezes up, the inspector sounds an alarm before the whole dinner service collapses. That's exactly what monitoring does for your database: it watches everything so you don't have to.

Every production application eventually hits the same wall: something goes wrong with the database at 2 AM, a customer calls to complain, and you're staring at logs trying to figure out what happened 40 minutes ago. Slow queries, connection pool exhaustion, disk space creeping toward 100% — these aren't hypothetical problems. They happen to every team, at every scale, and they're almost always preventable with the right visibility. Database monitoring isn't optional infrastructure — it's the difference between being proactive and being constantly reactive.

MySQL and PostgreSQL are the two most widely deployed open-source databases in the world, and each has its own internal metrics, query execution engine, and failure modes. A tool that works brilliantly for MySQL's InnoDB buffer pool might give you almost nothing useful for PostgreSQL's VACUUM process. Understanding not just which tools exist, but what they actually measure and why those metrics matter, is what separates engineers who can debug production incidents in minutes from those who spend hours guessing.

By the end of this article you'll know how to query the internal monitoring views of both MySQL and PostgreSQL to get real-time health data, how to set up Prometheus with the right exporters to scrape those metrics automatically, how to visualize everything in Grafana, and — most importantly — you'll understand exactly why each metric matters so you can make smart decisions when something breaks. We'll also cover the traps that catch most engineers off guard the first time they set this up.

What Your Database Is Actually Telling You — Core Metrics That Matter

Before you install a single monitoring tool, you need to speak your database's language. Both MySQL and PostgreSQL expose rich internal metrics through system views and information schema tables. The problem is there are hundreds of them, and most tutorials just dump a list without explaining which ones actually predict failure.

Think of database health in three layers: throughput (how much work is being done), latency (how long that work takes), and resource pressure (how close you are to running out of something critical). Every useful metric falls into one of these buckets.

For MySQL, the most critical real-time signals live in performance_schema and information_schema. For PostgreSQL, pg_stat_activity, pg_stat_bgwriter, and pg_locks are your best friends. The queries below show you how to pull actionable snapshots from each — not just raw numbers, but numbers with context.

One critical thing most developers miss: a single metric in isolation is almost meaningless. A query taking 500ms is fine for a reporting job, catastrophic for a login endpoint. Always correlate metrics with the workload context before raising the alarm.

mysql_health_snapshot.sqlSQL

-- ============================================================
-- MySQL Health Snapshot: throughput, connections, slow queries
-- Run this against any MySQL 5.7+ or MySQL 8+ instance
-- ============================================================

-- 1. Current connection pressure
--    max_connections is set in my.cnf — if you're at 80%+ you're at risk
SELECT 
    variable_name,
    variable_value
FROM performance_schema.global_status
WHERE variable_name IN (
    'Threads_connected',       -- active connections right now
    'Threads_running',         -- connections actively executing a query (the hot number)
    'Max_used_connections',    -- historical peak — tells you your worst day
    'Connection_errors_max_connections'  -- non-zero means users got rejected
);

-- 2. Query throughput — reads vs writes ratio
--    A suddenly spiking Com_select with flat Com_insert suggests a runaway report query
SELECT 
    variable_name,
    variable_value,
    -- calculate reads-to-writes ratio for context
    CASE variable_name
        WHEN 'Com_select' THEN 'READ'
        WHEN 'Com_insert' THEN 'WRITE'
        WHEN 'Com_update' THEN 'WRITE'
        WHEN 'Com_delete' THEN 'WRITE'
        ELSE 'OTHER'
    END AS operation_type
FROM performance_schema.global_status
WHERE variable_name IN (
    'Com_select', 'Com_insert', 'Com_update', 'Com_delete'
);

-- 3. InnoDB buffer pool efficiency — the single most important MySQL memory metric
--    If Innodb_buffer_pool_reads is high relative to Innodb_buffer_pool_read_requests,
--    your buffer pool is too small — you're reading from disk instead of RAM
SELECT 
    (SELECT variable_value 
     FROM performance_schema.global_status 
     WHERE variable_name = 'Innodb_buffer_pool_read_requests') AS total_logical_reads,
    (SELECT variable_value 
     FROM performance_schema.global_status 
     WHERE variable_name = 'Innodb_buffer_pool_reads') AS physical_disk_reads,
    ROUND(
        100 - (
            (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_reads') /
            (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_read_requests')
        ) * 100,
        2
    ) AS buffer_pool_hit_rate_pct;  -- aim for 99%+ in production

-- 4. Top 5 slowest normalized queries (requires performance_schema enabled)
--    This tells you WHERE to focus optimization effort
SELECT 
    DIGEST_TEXT AS normalized_query,
    COUNT_STAR AS execution_count,
    ROUND(AVG_TIMER_WAIT / 1000000000, 3) AS avg_latency_ms,
    ROUND(MAX_TIMER_WAIT / 1000000000, 3) AS max_latency_ms,
    SUM_ROWS_EXAMINED AS total_rows_examined,
    SUM_ROWS_SENT AS total_rows_returned
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT IS NOT NULL
ORDER BY AVG_TIMER_WAIT DESC
LIMIT 5;

Output

-- Query 1: Connection pressure

+----------------------------------+----------------+

| variable_name | variable_value |

+----------------------------------+----------------+

| Connection_errors_max_connections| 0 |

| Max_used_connections | 47 |

| Threads_connected | 23 |

| Threads_running | 4 |

+----------------------------------+----------------+

-- Query 3: Buffer pool hit rate

+---------------------+---------------------+------------------------+

| total_logical_reads | physical_disk_reads | buffer_pool_hit_rate_pct|

+---------------------+---------------------+------------------------+

| 8423917 | 12043 | 99.86 |

+---------------------+---------------------+------------------------+

-- Query 4: Slowest queries

+------------------------------------------+---+-------+--------+

+------------------------------------------+---+-------+--------+

| SELECT * FROM orders WHERE created_at... | 47| 840.3 | 2103.1 |

| UPDATE users SET last_login = ? WHERE.. | 312| 12.4 | 89.2 |

+------------------------------------------+---+-------+--------+

⚠ Watch Out: Threads_running vs Threads_connected

Most engineers alarm on Threads_connected, but the real danger signal is Threads_running. You can have 200 connections where 199 are idle (waiting for the app to send the next query) — that's fine. If Threads_running hits double digits on a standard app server, you have a concurrency problem right now. Set your alert on Threads_running, not Threads_connected.

📊 Production Insight

If you only monitor Threads_connected, you'll miss the moment your database is actually drowning in active work.

Threads_running is the true concurrency alarm — anything above 20 on a 4-core server means queries are fighting for CPU.

Rule: set for: 1m on Threads_running alerts to avoid transient batch job spikes waking you at 3 AM.

🎯 Key Takeaway

Threads_running tells you how many queries are executing RIGHT NOW.

Threads_connected tells you how many idle clients have the socket open — not useful for alerting.

Trust Threads_running as your concurrency canary, not Threads_connected.

Which metric to watch for concurrency?

IfMySQL, need to know if database is overloaded by active queries

→

UseWatch Threads_running from performance_schema.global_status. Alert if >20 for 1 minute.

IfPostgreSQL, need to know if connections are causing contention

→

UseWatch count of active state from pg_stat_activity. Also check wait_event_type != NULL to see blocked queries.

IfYou see high connection count but no performance issue

→

UseIgnore it — connections pool idle connections. Concentrate on active queries and lock waits.

thecodeforge.io

Database Monitoring Tools

PostgreSQL Deep-Dive — Querying pg_stat Views for Real Operational Insight

PostgreSQL's monitoring story is fundamentally different from MySQL's. Where MySQL centralizes most metrics in performance_schema, PostgreSQL scatters them across a family of pg_stat_* views, each focused on a specific subsystem. This design is actually more powerful once you know where to look, because each view gives you surgical precision instead of one giant table.

The three views you absolutely must understand are pg_stat_activity (who's doing what right now), pg_stat_bgwriter (how hard your background writer is working), and pg_locks (who's waiting on whom). Together they cover the three most common categories of PostgreSQL incidents: long-running queries, I/O bottlenecks, and lock contention.

There's also a uniquely PostgreSQL problem that has no MySQL equivalent: table bloat from VACUUM. PostgreSQL uses MVCC (Multi-Version Concurrency Control), which means old row versions pile up in the table file until VACUUM cleans them. If VACUUM falls behind — because it's disabled, misconfigured, or being blocked by a long transaction — your tables can grow to 10x their logical size and queries slow to a crawl. pg_stat_user_tables exposes exactly this data.

The queries below are battle-tested snippets we'd use in a real production incident response runbook.

postgres_health_snapshot.sqlSQL

-- ============================================================
-- PostgreSQL Health Snapshot for production incident triage
-- Tested on PostgreSQL 13, 14, 15, 16
-- ============================================================

-- 1. What is running RIGHT NOW? — First query to run during any incident
--    Queries older than 30 seconds in a transactional app are almost always a problem
SELECT 
    pid,                                    -- process ID — use this with pg_terminate_backend()
    usename AS db_user,
    application_name,
    state,                                  -- 'active', 'idle', 'idle in transaction' (danger!)
    wait_event_type,                        -- NULL means it's actually running, not blocked
    wait_event,                             -- e.g. 'Lock', 'ClientRead', 'DataFileRead'
    ROUND(EXTRACT(EPOCH FROM (NOW() - query_start))::numeric, 1) AS running_for_seconds,
    LEFT(query, 120) AS query_preview       -- first 120 chars — enough to identify it
FROM pg_stat_activity
WHERE state != 'idle'                       -- filter out idle connections
  AND query_start IS NOT NULL
ORDER BY running_for_seconds DESC NULLS LAST;

-- 2. Lock contention map — who is blocked by whom?
--    If you see rows here, someone is waiting and someone else needs to finish or be killed
SELECT 
    blocked.pid                  AS blocked_pid,
    blocked.usename              AS blocked_user,
    blocking.pid                 AS blocking_pid,
    blocking.usename             AS blocking_user,
    ROUND(EXTRACT(EPOCH FROM (NOW() - blocked.query_start))::numeric, 1) AS blocked_for_seconds,
    LEFT(blocked.query, 100)     AS blocked_query,
    LEFT(blocking.query, 100)    AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_locks AS blocked_locks 
    ON blocked.pid = blocked_locks.pid
JOIN pg_locks AS blocking_locks 
    ON blocking_locks.granted = TRUE
    AND blocked_locks.relation = blocking_locks.relation  -- same table
    AND blocked_locks.pid != blocking_locks.pid           -- different processes
JOIN pg_stat_activity AS blocking 
    ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;   -- only rows where the lock request is WAITING

-- 3. VACUUM health — tables with stale rows that VACUUM hasn't cleaned yet
--    n_dead_tup > n_live_tup * 0.2 means 20%+ of the table is dead weight
SELECT 
    schemaname,
    relname AS table_name,
    n_live_tup AS live_rows,
    n_dead_tup AS dead_rows,
    ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_row_pct,
    last_autovacuum,                        -- NULL means autovacuum has NEVER run on this table
    last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000                    -- only tables with meaningful bloat
ORDER BY dead_row_pct DESC
LIMIT 10;

-- 4. Cache hit ratio per table — similar to MySQL's buffer pool hit rate
--    Below 95% on a busy OLTP table means you need more shared_buffers or RAM
SELECT 
    relname AS table_name,
    heap_blks_read  AS disk_reads,
    heap_blks_hit   AS cache_hits,
    ROUND(
        100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0),
        2
    ) AS cache_hit_pct
FROM pg_statio_user_tables
WHERE heap_blks_hit + heap_blks_read > 0   -- skip empty tables
ORDER BY cache_hit_pct ASC                 -- worst performers first
LIMIT 10;

Output

-- Query 1: Active queries

+------+--------+-------+--------------------+----------+------------------+--------------------+

+------+--------+-------+--------------------+----------+------------------+--------------------+

| 8821 | app |active | Lock | relation | 127.4 | UPDATE orders SET |

| 9043 | app |active | NULL | NULL | 2.1 | SELECT * FROM prod |

+------+--------+-------+--------------------+----------+------------------+--------------------+

-- Query 3: VACUUM health (example with bloated table)

+--------+-------------------+----------+-----------+--------------+---------------------+

+--------+-------------------+----------+-----------+--------------+---------------------+

| public | user_sessions | 45230 | 38901 | 46.3 | 2024-01-10 08:12:44 |

| public | audit_log | 210000 | 18000 | 7.9 | 2024-01-12 14:01:22 |

+--------+-------------------+----------+-----------+--------------+---------------------+

⚠ Watch Out: 'idle in transaction' Is a Ticking Time Bomb

In pg_stat_activity, a state of 'idle in transaction' means the application opened a transaction, sent a query, got the result, but never issued COMMIT or ROLLBACK. The transaction is still holding locks. If this sits for minutes, it blocks VACUUM from cleaning dead rows AND can block other writes. Search your application code for any place you open a transaction in a try block but don't have a finally block that guarantees a commit or rollback.

📊 Production Insight

We've seen a single 'idle in transaction' connection cause 10x table bloat over two weeks.

The database's VACUUM process simply skipped the bloated table because it couldn't get a lock.

Always alert on state='idle in transaction' lasting >60 seconds — it's the quietest killer of PostgreSQL performance.

🎯 Key Takeaway

idle in transaction holds locks that block VACUUM and writes.

A long-running idle transaction causes table bloat silently.

Fix: enforce transaction timeout (idle_in_transaction_session_timeout) or fix code to always commit/rollback.

What to check first when PostgreSQL is slow?

Ifpg_stat_activity shows many queries in 'active' state with wait_event != NULL

→

UseRun the lock contention query (above) to find blocking sessions.

Ifpg_stat_activity shows no active queries but performance is still bad

→

UseCheck pg_stat_user_tables for dead_row_pct > 30 — that's bloat from behind VACUUM.

IfCache hit ratio below 95% for a hot table

→

UseIncrease shared_buffers or add more RAM. Check if sequential scans are thrashing the cache by adding missing indexes.

Prometheus + Exporters + Grafana — Building a Production Monitoring Stack

Running those SQL queries by hand is great for incident triage, but you can't watch a terminal 24/7. What you want is a system that scrapes those metrics automatically every 15 seconds, stores them historically so you can see trends, and alerts you before the disk fills up or the query latency spikes. That's exactly what the Prometheus + Grafana stack does.

The architecture is simple: Prometheus is a time-series database that periodically scrapes HTTP endpoints called 'exporters'. For MySQL there's mysqld_exporter, for PostgreSQL there's postgres_exporter. Each exporter connects to your database, runs the equivalent of the queries above, and exposes the results as a /metrics HTTP endpoint that Prometheus can consume. Grafana then connects to Prometheus and lets you build dashboards with those metrics.

What makes this stack genuinely powerful isn't the dashboards — it's Prometheus's alerting rules. You write a rule like 'if the PostgreSQL dead_row_pct for any table stays above 30% for more than 10 minutes, fire an alert'. Grafana Alerting or Prometheus Alertmanager then notifies your Slack channel or PagerDuty. You stop checking dashboards and start only seeing problems when they actually need attention.

The configuration below gets you from zero to a working stack with both exporters running.

docker-compose-db-monitoring.ymlYAML

# ============================================================
# Full monitoring stack: Prometheus + MySQL exporter +
# PostgreSQL exporter + Grafana
# Run with: docker compose up -d
# ============================================================

version: '3.8'

services:

  # --- Prometheus: scrapes and stores all metrics ---------------
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"          # access Prometheus UI at localhost:9090
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # our scrape config below
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'  # keep 30 days of history
    restart: unless-stopped

  # --- MySQL Exporter: translates MySQL metrics to Prometheus format ---
  mysqld_exporter:
    image: prom/mysqld-exporter:v0.15.1
    container_name: mysqld_exporter
    ports:
      - "9104:9104"          # Prometheus scrapes this endpoint
    environment:
      # Use a dedicated monitoring user — never root!
      # CREATE USER 'monitoring'@'%' IDENTIFIED BY 'strongpassword';
      # GRANT SELECT, PROCESS, REPLICATION CLIENT ON *.* TO 'monitoring'@'%';
      DATA_SOURCE_NAME: "monitoring:strongpassword@tcp(mysql_host:3306)/"
    command:
      # Explicitly enable the collectors you want — don't collect everything
      - '--collect.global_status'        # Com_select, Threads_running, etc.
      - '--collect.global_variables'     # max_connections, innodb_buffer_pool_size
      - '--collect.info_schema.innodb_metrics'  # InnoDB internals
      - '--collect.perf_schema.eventsstatements'  # slow query data
      - '--collect.auto_increment.columns'  # catch auto_increment columns near overflow!
    restart: unless-stopped

  # --- PostgreSQL Exporter: same idea for Postgres ---------------
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    container_name: postgres_exporter
    ports:
      - "9187:9187"          # Prometheus scrapes this endpoint
    environment:
      # Create a dedicated monitoring role in PostgreSQL:
      # CREATE ROLE monitoring WITH LOGIN PASSWORD 'strongpassword';
      # GRANT pg_monitor TO monitoring;  -- pg_monitor role added in PG10+
      DATA_SOURCE_NAME: "postgresql://monitoring:strongpassword@postgres_host:5432/postgres?sslmode=disable"
    restart: unless-stopped

  # --- Grafana: dashboards and alerting UI ----------------------
  grafana:
    image: grafana/grafana:10.4.1
    container_name: grafana
    ports:
      - "3000:3000"          # access Grafana at localhost:3000
    environment:
      GF_SECURITY_ADMIN_PASSWORD: changeme_in_production
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana_provisioning:/etc/grafana/provisioning  # auto-provision datasources
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Output

$ docker compose up -d

[+] Running 5/5

✔ Network db_monitoring_default Created

✔ Container prometheus Started

✔ Container mysqld_exporter Started

✔ Container postgres_exporter Started

✔ Container grafana Started

# Verify mysqld_exporter is scraping successfully:

$ curl -s localhost:9104/metrics | grep mysql_global_status_threads_running

mysql_global_status_threads_running 4

# Verify postgres_exporter is working:

$ curl -s localhost:9187/metrics | grep pg_stat_activity_count

pg_stat_activity_count{datname="appdb",state="active"} 7

pg_stat_activity_count{datname="appdb",state="idle"} 18

pg_stat_activity_count{datname="appdb",state="idle in transaction"} 2

💡Pro Tip: Import Community Dashboards, Don't Build From Scratch

Grafana has a public dashboard library at grafana.com/grafana/dashboards. Dashboard ID 7362 is the standard MySQL overview dashboard that works directly with mysqld_exporter. Dashboard ID 9628 is the equivalent for PostgreSQL with postgres_exporter. In Grafana go to Dashboards → Import → enter the ID. You get a production-grade dashboard in 30 seconds — then customize from there instead of building panels from scratch.

📊 Production Insight

We've seen teams spend two weeks building custom dashboards that already exist as open-source.

The community dashboards (7362, 9628) track all the right metrics out of the box.

Rule: import first, customize second — never build from zero unless you have unique requirements.

🎯 Key Takeaway

Prometheus scrapes exporters, Grafana visualizes, Alertmanager notifies.

Import dashboards (7362 MySQL, 9628 Postgres) — they track every metric that matters.

Don't build your own dashboards from scratch until you've outgrown the community ones.

Should you build your own dashboard or import?

IfYou need monitoring for a standard MySQL or PostgreSQL deployment

→

UseImport IDs 7362 (MySQL) and 9628 (PostgreSQL) — they cover 90% of what you need.

IfYou need custom metrics (e.g., application-level query latency percentile)

→

UseImport and extend. Add panels on top of the community dashboard. Don't rebuild the entire visualization.

IfYour database version is very old or has custom patches

→

UseImport first, then verify each panel works. Some metrics may not be present in older versions — remove or replace those panels.

thecodeforge.io

Database Monitoring Tools

Prometheus Alert Rules — Catching Failures Before Your Users Do

Dashboards are reactive. Alert rules are proactive. The real value of this monitoring stack isn't the pretty graphs — it's the 3 AM Slack message that says 'PostgreSQL table user_sessions is 46% dead rows — VACUUM may be blocked' before a developer even notices slowness.

Prometheus alert rules use PromQL, Prometheus's query language. They follow a simple pattern: define an expression, a threshold, a duration (how long the condition must persist before alerting), and a severity. The duration is critical — without it, you'll get flooded with noisy one-off alerts for transient spikes that self-resolve.

The rules below cover the five scenarios that account for the vast majority of MySQL and PostgreSQL production incidents: connection exhaustion, query latency spikes, replication lag, disk space pressure, and PostgreSQL VACUUM falling behind. Treat these as your starting template — tune the thresholds to match your specific workload after you've observed normal baseline values for a week.

alert_rules.ymlYAML

# ============================================================
# Prometheus Alert Rules for MySQL + PostgreSQL
# Save as alert_rules.yml and reference in prometheus.yml
# ============================================================

groups:
  - name: mysql_alerts
    rules:

      # Alert when MySQL is approaching connection limit
      # Symptom without this: "Too many connections" errors reaching users
      - alert: MySQLConnectionsNearLimit
        expr: |
          mysql_global_status_threads_connected /
          mysql_global_variables_max_connections * 100 > 80
        for: 2m          # must stay above 80% for 2 full minutes — avoids noisy transient spikes
        labels:
          severity: warning
        annotations:
          summary: "MySQL connection pool above 80% on {{ $labels.instance }}"
          description: "Currently at {{ $value | printf '%.1f' }}% of max_connections. Check for connection leaks or increase pool size."

      # Alert on active threads spike — the real concurrency danger signal
      - alert: MySQLHighThreadsRunning
        expr: mysql_global_status_threads_running > 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL thread storm on {{ $labels.instance }}"
          description: "{{ $value }} threads actively running queries. Investigate for table locks or a missing index causing full table scans."

      # InnoDB buffer pool hit rate drop — means you're reading from disk constantly
      - alert: MySQLBufferPoolHitRateLow
        expr: |
          (
            mysql_global_status_innodb_buffer_pool_read_requests -
            mysql_global_status_innodb_buffer_pool_reads
          ) /
          mysql_global_status_innodb_buffer_pool_read_requests * 100 < 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL InnoDB buffer pool hit rate below 95% on {{ $labels.instance }}"
          description: "Hit rate: {{ $value | printf '%.1f' }}%. Consider increasing innodb_buffer_pool_size if RAM allows."

  - name: postgresql_alerts
    rules:

      # Long-running queries blocking everything else
      - alert: PostgreSQLLongRunningQuery
        expr: |
          max by (datname) (
            pg_stat_activity_max_tx_duration{state="active"}
          ) > 300   # 300 seconds = 5 minutes
        for: 0m     # alert immediately — a 5-min query is already a problem
        labels:
          severity: warning
        annotations:
          summary: "Long-running query on PostgreSQL {{ $labels.datname }}"
          description: "Longest active query has been running {{ $value | printf '%.0f' }} seconds. Use pg_stat_activity to identify and consider pg_terminate_backend()."

      # Idle-in-transaction connections — these hold locks and block VACUUM
      - alert: PostgreSQLIdleInTransactionConnections
        expr: |
          pg_stat_activity_count{state="idle in transaction"} > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Too many idle-in-transaction connections on {{ $labels.datname }}"
          description: "{{ $value }} connections stuck in open transactions. Check application code for missing COMMIT/ROLLBACK in error paths."

      # Replication lag — replica is falling behind the primary
      # Critical for any app using replicas for reads
      - alert: PostgreSQLReplicationLagHigh
        expr: pg_replication_lag > 30   # 30 seconds of lag
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication lag {{ $value }}s on {{ $labels.instance }}"
          description: "Replica is {{ $value | printf '%.0f' }} seconds behind primary. Reads from this replica may return stale data."

      # Disk space — catches the #1 cause of surprise database downtime
      - alert: DatabaseDiskSpaceCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} /
           node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL data directory below 15% free on {{ $labels.instance }}"
          description: "Only {{ $value | printf '%.1f' }}% disk space remaining. Database will halt writes when disk is full."

Output

# When an alert fires, Prometheus marks it PENDING first (during the 'for' window),

# then FIRING once the duration threshold is met.

# Example Alertmanager notification payload sent to Slack:

{

"status": "firing",

"labels": {

"alertname": "PostgreSQLIdleInTransactionConnections",

"datname": "production_db",

"severity": "warning"

"annotations": {

"summary": "Too many idle-in-transaction connections on production_db",

"description": "8 connections stuck in open transactions. Check application code for missing COMMIT/ROLLBACK in error paths."

"startsAt": "2024-01-15T03:42:00Z"

}

🔥Interview Gold: Why Alert Duration (for:) Matters

Interviewers love asking why Prometheus alerts have a 'for' duration. Without it, a transient 2-second CPU spike fires an alert, wakes someone up, and resolves itself before they even open their laptop — this is called alert fatigue and it leads teams to start ignoring alerts entirely. The 'for' duration means 'this condition must persist continuously for X minutes before I care'. Tune it based on how quickly each specific problem causes user impact.

📊 Production Insight

Alert fatigue is the silent killer of monitoring adoption. Start with longer 'for' durations (5m for subtle issues, 30s for immediate threats).

A team that ignores alerts is functionally unmonitored — and they don't know it until a customer calls.

Rule: every alert that fires should require action. If it doesn't, suppress it or remove it.

🎯 Key Takeaway

Dashboards are reactive, alerts are proactive.

Use 'for' durations to avoid alert fatigue — never fire on transient blips.

Tune thresholds after observing baseline for one week — don't guess production values.

How long should the 'for' duration be?

IfMetric changes slowly (e.g., disk space, table bloat)

→

UseSet for: 5–10 minutes. These problems don't cause instant downtime and tend to resolve slowly.

IfMetric changes quickly (e.g., connection surge, query latency spike)

→

UseSet for: 1–3 minutes. Long enough to avoid transient blips, short enough to catch real problems before user impact.

IfImmediate user-facing impact (e.g., database unreachable, too many connection errors)

→

UseSet for: 0m (alert immediately). These need human attention right away.

Debugging Your Monitoring Stack — Common Failures and Fixes

Your database monitoring stack itself can break. Exporters stop responding, Prometheus loses connection, Grafana panels show 'No data'. When that happens you're flying blind — exactly when you need monitoring the most.

Most monitoring stack failures fall into three categories: configuration drift (scrape targets change but prometheus.yml doesn't), credential rotation (database passwords change but exporter DATA_SOURCE_NAME doesn't), and version mismatches (exporter updated and metrics renamed).

The quickest way to verify the whole chain: start at the exporter, then Prometheus, then Grafana. Don't jump to conclusions. A common mistake is debugging Grafana panels when the exporter itself is down.

Below are the debugging commands and patterns we use in production to restore monitoring fast.

debug_monitoring_stack.shBASH

# ============================================================
# Debugging the Prometheus + Exporter + Grafana chain
# Run these in order — exporter first, then Prometheus, then Grafana
# ============================================================

# Step 1: Can the exporter reach the database?
# If this fails, exporter can't collect metrics.
docker exec mysqld_exporter /bin/sh -c \
  'mysql -u monitoring -pstrongpassword -h mysql_host -e "SELECT 1"'

# Step 2: Is the exporter exposing metrics correctly?
# Expected: a page full of metric lines, not HTML or error.
# If you see "connection refused", exporter is not running or wrong port.
curl -s localhost:9104/metrics | head -20

# Step 3: Can Prometheus reach the exporter?
# Run from inside the Prometheus container.
docker exec prometheus wget -q -O- http://mysqld_exporter:9104/metrics 2>&1 || \
  echo "Prometheus cannot reach exporter"

# Step 4: Check Prometheus targets page (human read)
# All targets should be UP. If DOWN, check network and hostnames.
open http://localhost:9090/targets

# Step 5: Check Prometheus rules status
# Look for PENDING or FIRING alerts. If no firing alerts but you expect one, evaluate rule expression manually.
open http://localhost:9090/rules

# Step 6: Verify Prometheus data source in Grafana
# Go to Configuration → Data Sources → Prometheus → Test Connection
# If fails, Grafana cannot query Prometheus — check URL (should be http://prometheus:9090 inside Docker)

# Step 7: If all else fails — restart everything and watch startup logs
docker compose logs -f --tail=100 prometheus mysqld_exporter postgres_exporter grafana

Output

$ curl -s localhost:9104/metrics | head -8

# HELP mysql_global_status_threads_connected MySQL global status variable

# TYPE mysql_global_status_threads_connected untyped

mysql_global_status_threads_connected 45

$ docker exec prometheus wget -q -O- http://mysqld_exporter:9104/metrics > /dev/null 2>&1 && echo "OK"

⚠ Don't Debug Grafana First — Always Start at the Exporter

We see this pattern constantly: a developer notices a dashboard panel is empty and starts editing queries in Grafana. 90% of the time the exporter is down or Prometheus can't reach it. Always start at the data source — exporter → Prometheus → Grafana. If the exporter returns metrics, Prometheus is scraping them, and Grafana's data source test passes, then you can look at panel configuration.

📊 Production Insight

When your monitoring stack goes down, you're blind during the highest-risk time.

We automate a health check that runs every 5 minutes: exporter /metrics → Prometheus target UP → Grafana data source connected.

If any step fails, it pages the on-call engineer before a real failure happens.

🎯 Key Takeaway

Monitoring stack failures follow a chain: exporter → Prometheus → Grafana.

Always debug from bottom up: exporter first, then Prometheus scraping, then Grafana visualization.

Automate a periodic health check of the whole chain to catch monitoring outages early.

Where to start when monitoring shows no data?

IfExporter /metrics returns 200 OK with data

→

UseCheck Prometheus target status and scrape configuration. The exporter is healthy.

IfExporter /metrics returns connection refused or empty response

→

UseExporter is down, misconfigured, or cannot reach database. Restart exporter and check its logs.

IfExporter works, Prometheus scrapes, but Grafana shows no data

→

UseCheck Prometheus data source in Grafana. Ensure URL uses correct service name (e.g., http://prometheus:9090). Test connection.

The Hidden Cost of Connections — Why Your Pool Settings Are Sabotaging Performance

Most devs think connection pooling is a set-and-forget knob. It's not. The default pool size in most frameworks (looking at you, Hibernate's 50) is a ticket to connection storms. Every idle connection burns memory — typically 2-10MB per connection in PostgreSQL, more in Oracle. That's not your problem until you hit 200 connections and the OOM killer starts eyeing your postmaster.

The real signal isn't connection count. It's connection churn — rate of new connections per second. High churn means your pool is too small or your queries are too slow, forcing clients to open fresh connections while others sit idle. Monitor pg_stat_activity for state = 'idle in transaction' — that's your tax on sloppy ORM usage.

Why this matters: Connection storms trigger lock escalation, checkpoint flooding, and replication lag. A single misconfigured pool can bring down a cluster faster than any query. Fix it by capping pools at (2 * core_count) + effective_spindle_count and monitoring wait_event for client_read or client_write — those are network backpressure signals from your pool being a bottleneck.

ConnectionChurnAnalysis.sqlSQL

// io.thecodeforge — database tutorial

-- Find connections stuck idle in transaction (your worst offenders)
SELECT pid, usename, application_name, state, 
       now() - query_start AS query_age,
       wait_event_type, wait_event
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND backend_type = 'client backend'
  AND wait_event IS NOT NULL
ORDER BY query_age DESC;

-- Connection churn: new connections per second over last 5 min
SELECT count(*) / 300.0 AS conns_per_sec
FROM pg_stat_activity
WHERE backend_start > now() - interval '5 minutes';

-- Pool health check: max connections used vs configured
SELECT count(*) AS active_connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity
WHERE backend_type = 'client backend';

Output

-------+---------+------------------+---------------------+-----------+-----------------+--------------

conns_per_sec

----------------

12.47

active_connections | max_connections | pct_used

--------------------+-----------------+----------

187 | 200 | 93.5

⚠ Production Trap:

If your pool size exceeds PostgreSQL's max_connections (default 100), you'll get 'FATAL: sorry, too many clients already' — not a graceful degrade. Always set your pool max 10-15% below the server's max_connections to leave headroom for monitoring connections.

🎯 Key Takeaway

Monitor connection churn and idle-in-transaction count, not just total connections. One stuck transaction blocks vacuum and can cascade into disk bloat.

Lock Contention Mapping — Finding the Queries That Serialize Your Database

When your app slows to a crawl and CPU is idle, you're almost certainly looking at lock contention. PostgreSQL's lock system is fine-grained — row-level, page-level, table-level — but one long-running DDL or UPDATE without a proper index can escalate to blocking everything.

The standard pg_locks view is a firehose. You need to filter for the real troublemakers: granted locks blocking waiting locks. The worst pattern is an AccessExclusiveLock on a table — that's your DDL (ALTER TABLE, VACUUM FULL) that blocks all reads and writes. Next is RowExclusiveLock contention on hot rows — often from ORM-generated UPDATEs that lack index support, turning row locks into table scans.

Run this query to find the blocked query, the blocker's PID, and how long the victim has been waiting. Kill the blocker only if you know what it's doing — an accidental SIGKILL on a long-running migration will leave a corrupted schema. Instead, set lock_timeout per session for DDL operations (PostgreSQL 9.6+) to cap the damage.

LockContentionMap.sqlSQL

// io.thecodeforge — database tutorial

WITH lock_waits AS (
  SELECT 
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocked.wait_event AS blocked_wait,
    blocker.pid AS blocker_pid,
    blocker.query AS blocker_query,
    blocked.wait_start AS blocked_since
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked 
    ON blocked_locks.pid = blocked.pid
  JOIN pg_catalog.pg_locks blocker_locks 
    ON blocked_locks.locktype = blocker_locks.locktype
    AND blocked_locks.database IS NOT DISTINCT FROM blocker_locks.database
    AND blocked_locks.relation IS NOT DISTINCT FROM blocker_locks.relation
    AND blocked_locks.page IS NOT DISTINCT FROM blocker_locks.page
    AND blocked_locks.tuple IS NOT DISTINCT FROM blocker_locks.tuple
  JOIN pg_catalog.pg_stat_activity blocker 
    ON blocker_locks.pid = blocker.pid
  WHERE NOT blocked_locks.granted
    AND blocker_locks.granted
)
SELECT * FROM lock_waits
ORDER BY blocked_since ASC;

Output

-------------+------------------------------------------------------+--------------+-------------+--------------------------------------------+------------------------

4521 | UPDATE orders SET status = 'shipped' WHERE id = 123; | LWLock | 4509 | UPDATE orders SET status = 'pending' WHERE id = 123; | 2025-01-15 14:23:11.45

4522 | SELECT * FROM orders WHERE id = 123; | ClientRead | 4509 | UPDATE orders SET status = 'pending' WHERE id = 123; | 2025-01-15 14:23:12.01

💡Senior Shortcut:

Long-running lock chains often stem from missing indexes on foreign keys. A child table WITHOUT an index on the FK column forces a full table lock on every parent UPDATE. Add CREATE INDEX CONCURRENTLY to avoid downtime.

🎯 Key Takeaway

Always check lock contention before blaming queries. A single missing index can turn a 1ms UPDATE into a 10-second table lock cascade.

pg_stat_statements: PostgreSQL Query Monitoring

The pg_stat_statements extension is a powerful tool for tracking query performance in PostgreSQL. It captures execution statistics for all SQL statements, including query text, calls, total time, rows returned, and I/O timing. To enable it, add pg_stat_statements to shared_preload_libraries in postgresql.conf and create the extension in your database. Once active, query pg_stat_statements to identify slow queries, high-frequency queries, and those with excessive I/O. For example, to find the top 10 queries by total execution time:

``sql SELECT query, calls, total_time, rows, shared_blks_hit, shared_blks_read FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10; ``

This reveals which queries consume the most resources. You can also filter by database or user. Use pg_stat_statements_reset() to reset statistics for a fresh baseline. Combine with pg_stat_activity to correlate query performance with active sessions. This extension is invaluable for performance tuning and capacity planning.

pg_stat_statements_top_queries.sqlSQL

SELECT query, calls, total_time, rows, shared_blks_hit, shared_blks_read
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

💡Reset Statistics for Accurate Baselines

📊 Production Insight

In production, monitor pg_stat_statements regularly to detect regressions after deployments. Set up alerts when total_time per query exceeds thresholds, indicating potential performance issues.

🎯 Key Takeaway

pg_stat_statements provides granular query-level performance data, enabling you to pinpoint and optimize resource-intensive SQL statements.

Performance Schema: MySQL Monitoring Deep-Dive

MySQL's Performance Schema provides detailed instrumentation for monitoring server internals. It tracks events like waits, statements, transactions, and memory usage. Enable it by setting performance_schema=ON in my.cnf (default in MySQL 5.7+). Key tables include events_statements_summary_by_digest for query analysis, events_waits_summary_global_by_event_name for wait events, and memory_summary_global_by_event_name for memory usage. To find the top queries by total latency:

``sql SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1000000000 AS total_latency_ms FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 10; ``

This reveals slow queries. For I/O waits, query events_waits_summary_global_by_event_name where EVENT_NAME contains 'wait/io'. Performance Schema has minimal overhead and is essential for deep MySQL observability. Use it alongside sys schema for simplified queries.

performance_schema_top_queries.sqlSQL

SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1000000000 AS total_latency_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;

🔥Performance Schema vs Slow Query Log

📊 Production Insight

In production, enable Performance Schema and use the sys schema for easier queries. Monitor events_statements_summary_by_digest to detect query regressions and set alerts on high latency or error counts.

🎯 Key Takeaway

MySQL Performance Schema offers comprehensive, low-overhead instrumentation for query performance, waits, and memory, enabling deep operational insight.

thecodeforge.io

Database Monitoring Tools

OpenTelemetry for Database Observability

OpenTelemetry (OTel) provides a vendor-neutral framework for collecting traces, metrics, and logs from databases. For PostgreSQL, use the OTel Collector with the PostgreSQL receiver to gather metrics like active connections, transaction duration, and query counts. For MySQL, use the MySQL receiver. Configure the collector to export data to Prometheus, Jaeger, or other backends. Example OTel Collector configuration for PostgreSQL:

```yaml receivers: postgresql: endpoint: localhost:5432 username: otel_user password: ${env:DB_PASSWORD} databases: - mydb metrics: postgresql.connections: enabled: true postgresql.transactions: enabled: true

exporters: prometheus: endpoint: 0.0.0.0:9464

service: pipelines: metrics: receivers: [postgresql] exporters: [prometheus] ```

This enables database metrics in Prometheus format. For tracing, instrument your application with OTel SDKs to capture database call spans. OTel provides a unified view across your stack, reducing tool fragmentation. Use it to correlate database performance with application behavior.

otel-collector-config.yamlYAML

receivers:
  postgresql:
    endpoint: localhost:5432
    username: otel_user
    password: ${env:DB_PASSWORD}
    databases:
      - mydb
    metrics:
      postgresql.connections:
        enabled: true
      postgresql.transactions:
        enabled: true

exporters:
  prometheus:
    endpoint: 0.0.0.0:9464

service:
  pipelines:
    metrics:
      receivers: [postgresql]
      exporters: [prometheus]

⚠ Secure Database Credentials

📊 Production Insight

In production, deploy the OTel Collector as a sidecar or daemonset. Use database receivers to feed metrics into your existing Prometheus/Grafana stack, and enable tracing to debug slow queries end-to-end.

🎯 Key Takeaway

OpenTelemetry standardizes database observability by collecting metrics and traces, enabling correlation across your entire stack with minimal vendor lock-in.

● Production incidentPOST-MORTEMseverity: high

The Idle-in-Transaction That Killed Query Performance

Symptom

Queries on the user_sessions table that used to take 50ms started taking 2s. CPU and disk IO were normal. Pg_stat_activity showed a few 'idle in transaction' connections lasting hours.

Assumption

The team assumed it was an indexing or hardware problem because there were no obviously stuck 'active' queries. They spent three days running VACUUM manually and adding indexes before checking lock chains.

Root cause

The application had a try block that started a transaction and a catch block that logged the error but never called ROLLBACK. The transaction stayed open, holding advisory locks that blocked autovacuum from cleaning dead rows on user_sessions. Dead rows grew to 200% of live rows.

Fix

Terminate the hung backends with pg_terminate_backend(pid), then force a VACUUM FULL on user_sessions to reclaim space. Then fix the application code: wrap the transaction in a try-finally block that ensures COMMIT or ROLLBACK.

Key lesson

Always alert on 'idle in transaction' connections lasting over 60 seconds — they're the root cause of many silent performance degradations.
Autovacuum logs show when it's being blocked — check postgresql.log for 'skipping vacuum of table because no free space' or 'canceling autovacuum'.
Include pg_locks monitoring in your dashboard — it visualizes lock chains that connect idle transactions to active query slowness.

Production debug guideCommon issues with MySQL and PostgreSQL monitoring, what they look like, and the exact command or fix to apply.5 entries

Symptom · 01

MySQL exporter returns no data for Threads_running

→

Fix

Check exporter permissions: must have PROCESS privilege. Run: GRANT PROCESS ON . TO 'monitoring'@'%'; Also verify exporter is scraping the correct MySQL instance and is not blocked by firewall.

Symptom · 02

PostgreSQL cache hit ratio stuck below 90%

→

Fix

Query pg_statio_user_tables to identify problematic tables. Increase shared_buffers (up to 25% of RAM if <64GB). For a quick win, check if sequential scans are thrashing the cache — add missing indexes.

Symptom · 03

Prometheus shows stale metrics (values not updating)

→

Fix

Check exporter endpoint: curl localhost:9104/metrics | head -20. If stale, restart exporter. Then check Prometheus targets page (localhost:9090/targets) for scrape failures. Common cause: scrape_interval too short for exporters behind VPN.

Symptom · 04

Replication lag warning fires intermittently

→

Fix

Check pg_replication_lag in Grafana vs actual lag from pg_stat_replication. If lag spikes correlate with large data loads, it's expected. Warn if lag >30s for more than 5min during normal traffic. Tune wal_keep_size or increase replica resources.

Symptom · 05

Grafana dashboard panels show 'No data' for PostgreSQL

→

Fix

Verify postgres_exporter is running and the Grafana data source URL points to the Prometheus server, not directly to the exporter. Grafana needs Prometheus as the intermediary — never configure Grafana to scrape exporters directly.

★ Quick Debug: Monitoring Stack HealthWhen your monitoring itself breaks — exporters down, Prometheus not scraping, Grafana not showing data — use these commands to diagnose and fix fast.

Exporter /metrics endpoint returns connection refused−

Immediate action

Restart the exporter container and check its logs for startup errors.

Commands

docker logs mysqld_exporter 2>&1 | tail -50

curl -v localhost:9104/metrics 2>&1

Fix now

Exporter is not running or bound to wrong interface. If using Docker, verify port mapping (-p 9104:9104) and that the exporter can reach the database host from inside the container.

Prometheus target shows 'Get ... dial tcp: connection refused'+

Grafana 'Explore' can query Prometheus but dashboard panels show nothing+

Alert rules never fire even when conditions met+

MySQL vs PostgreSQL Monitoring Comparison

Aspect	MySQL (mysqld_exporter)	PostgreSQL (postgres_exporter)
Primary metrics source	performance_schema + information_schema	pg_stat_* system views
Connection metric to watch	Threads_running (not Threads_connected)	pg_stat_activity WHERE state='active'
Memory health indicator	InnoDB buffer pool hit rate (target 99%+)	shared_buffers cache hit rate (target 95%+)
Unique failure mode	Auto-increment column overflow (silent data corruption)	VACUUM falling behind (table bloat + query slowdown)
Replication monitoring	Seconds_Behind_Master via SHOW SLAVE STATUS	pg_replication_lag in seconds from pg_stat_replication
Lock detection	information_schema.INNODB_TRX + INNODB_LOCKS	pg_locks JOIN pg_stat_activity
Slow query history	performance_schema.events_statements_summary_by_digest	pg_stat_statements extension (must be installed)
Required permissions for exporter	SELECT, PROCESS, REPLICATION CLIENT	pg_monitor role (PostgreSQL 10+) or manual GRANT
Prometheus exporter port	9104 (default)	9187 (default)
Best free Grafana dashboard	Dashboard ID 7362	Dashboard ID 9628

⚙ Quick Reference

10 commands from this guide

File	Command / Code	Purpose
mysql_health_snapshot.sql	SELECT	What Your Database Is Actually Telling You
postgres_health_snapshot.sql	SELECT	PostgreSQL Deep-Dive
docker-compose-db-monitoring.yml	version: '3.8'	Prometheus + Exporters + Grafana
alert_rules.yml	groups:	Prometheus Alert Rules
debug_monitoring_stack.sh	docker exec mysqld_exporter /bin/sh -c \	Debugging Your Monitoring Stack
ConnectionChurnAnalysis.sql	SELECT pid, usename, application_name, state,	The Hidden Cost of Connections
LockContentionMap.sql	WITH lock_waits AS (	Lock Contention Mapping
pg_stat_statements_top_queries.sql	SELECT query, calls, total_time, rows, shared_blks_hit, shared_blks_read	pg_stat_statements
performance_schema_top_queries.sql	SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1000000000 AS total_latency_ms	Performance Schema
otel-collector-config.yaml	receivers:	OpenTelemetry for Database Observability

Key takeaways

Threads_running in MySQL (not Threads_connected) is the real concurrency alarm

it tells you how many queries are actually executing right now, not just connected and idle.

PostgreSQL's 'idle in transaction' state is the silent killer

it holds locks, blocks VACUUM, and causes table bloat without showing any obviously 'stuck' queries in most monitoring setups.

The Prometheus + mysqld_exporter/postgres_exporter + Grafana stack costs zero dollars and gives you historical trending, alerting, and dashboards that commercial tools charge thousands for monthly.

Always create a dedicated read-only monitoring user for your exporters

never point monitoring tools at your application user or superuser, even in development, because habits formed in dev become incidents in prod.

Alert 'for' durations are critical

too short causes alert fatigue, too long causes missed incidents. Tune each alert based on how fast the metric changes and how quickly it impacts users.

Debug monitoring stack failures from bottom up

exporter → Prometheus → Grafana. Start at the data source, not the dashboard.

Import community Grafana dashboards (7362 for MySQL, 9628 for PostgreSQL)

they track every metric that matters and save you weeks of development time.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

If a production MySQL server suddenly shows Threads_running spiking to 8...

Q02SENIOR

A PostgreSQL table that had queries running in 10ms is now taking 2 seco...

Q03SENIOR

Someone asks you to set up alerting for database disk space. They sugges...

Q04SENIOR

How would you monitor auto-increment column overflow in MySQL? What happ...

Q01 of 04SENIOR

If a production MySQL server suddenly shows Threads_running spiking to 80 but Threads_connected is only 95 out of a max of 500, what does that tell you and what's your first investigation step?

ANSWER

It tells you the database is under heavy concurrency load — 80 queries are actively executing, probably fighting for CPU, disk, or locks. The low Threads_connected relative to max means you haven't exhausted the connection pool, but you're close to CPU saturation. First investigation step: query performance_schema.events_statements_summary_by_digest to find which queries are consuming the most time, and check information_schema.INNODB_TRX for any long-running transactions or lock waits. Also check SHOW PROCESSLIST to see if many threads are in 'Sending data' or 'Table lock' state. Most likely cause: a missing index causing a full table scan, or a blocking lock chain.

FAQ · 5 QUESTIONS

Frequently Asked Questions

What is the best free database monitoring tool for MySQL and PostgreSQL?

How do I find which queries are slowing down my PostgreSQL database?

What is the difference between monitoring and observability for databases?

Should I use the same exporter for development and production?

My PostgreSQL exporter shows 'pg_stat_activity_count' but not 'pg_stat_user_tables'. What's wrong?

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Written from production experience, not tutorials.

✓ Verified

production tested

July 18, 2026

last updated

2,466

articles · all by Naren

🔥

That's MySQL & PostgreSQL. Mark it forged?

7 min read · try the examples if you haven't