
Database Monitoring Tools for MySQL & PostgreSQL Explained

In Plain English 🔥
Imagine your database is a busy restaurant kitchen. Orders come in, food goes out, and everything runs smoothly — until it doesn't. A database monitoring tool is like a health inspector who lives in that kitchen 24/7, watching every chef, every stove burner, and every order ticket. The moment something starts burning or a chef freezes up, the inspector sounds an alarm before the whole dinner service collapses. That's exactly what monitoring does for your database: it watches everything so you don't have to.

Every production application eventually hits the same wall: something goes wrong with the database at 2 AM, a customer calls to complain, and you're staring at logs trying to figure out what happened 40 minutes ago. Slow queries, connection pool exhaustion, disk space creeping toward 100% — these aren't hypothetical problems. They happen to every team, at every scale, and they're almost always preventable with the right visibility. Database monitoring isn't optional infrastructure — it's the difference between being proactive and being constantly reactive.

MySQL and PostgreSQL are the two most widely deployed open-source databases in the world, and each has its own internal metrics, query execution engine, and failure modes. A tool that works brilliantly for MySQL's InnoDB buffer pool might give you almost nothing useful for PostgreSQL's VACUUM process. Understanding not just which tools exist, but what they actually measure and why those metrics matter, is what separates engineers who can debug production incidents in minutes from those who spend hours guessing.

By the end of this article you'll know how to query the internal monitoring views of both MySQL and PostgreSQL to get real-time health data, how to set up Prometheus with the right exporters to scrape those metrics automatically, how to visualize everything in Grafana, and — most importantly — you'll understand exactly why each metric matters so you can make smart decisions when something breaks. We'll also cover the traps that catch most engineers off guard the first time they set this up.

What Your Database Is Actually Telling You — Core Metrics That Matter

Before you install a single monitoring tool, you need to speak your database's language. Both MySQL and PostgreSQL expose rich internal metrics through system views and information schema tables. The problem is there are hundreds of them, and most tutorials just dump a list without explaining which ones actually predict failure.

Think of database health in three layers: throughput (how much work is being done), latency (how long that work takes), and resource pressure (how close you are to running out of something critical). Every useful metric falls into one of these buckets.

For MySQL, the most critical real-time signals live in performance_schema and information_schema. For PostgreSQL, pg_stat_activity, pg_stat_bgwriter, and pg_locks are your best friends. The queries below show you how to pull actionable snapshots from each — not just raw numbers, but numbers with context.

One critical thing most developers miss: a single metric in isolation is almost meaningless. A query taking 500ms is fine for a reporting job, catastrophic for a login endpoint. Always correlate metrics with the workload context before raising the alarm.
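That correlation point can be made concrete: the same 500 ms reading is an incident on one endpoint and business as usual on another. A toy Python sketch, with per-endpoint latency budgets that are invented for illustration:

```python
# Sketch: the same latency reading means different things per workload.
# The budgets below are hypothetical examples, not recommendations.
LATENCY_BUDGET_MS = {
    "login": 50,               # user-facing auth path: must be fast
    "checkout": 200,           # user-facing, some tolerance
    "nightly_report": 60_000,  # batch job: minutes are fine
}

def is_alarming(endpoint: str, observed_ms: float) -> bool:
    """A 500 ms query is fine for a report, catastrophic for login."""
    budget = LATENCY_BUDGET_MS.get(endpoint, 500)  # fallback budget
    return observed_ms > budget

print(is_alarming("login", 500))           # True: alarm
print(is_alarming("nightly_report", 500))  # False: fine
```

The lookup table is the important part: alert thresholds belong next to workload context, not in a single global number.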

mysql_health_snapshot.sql · SQL
-- ============================================================
-- MySQL Health Snapshot: throughput, connections, slow queries
-- Run this against any MySQL 5.7+ or MySQL 8+ instance
-- ============================================================

-- 1. Current connection pressure
--    max_connections is set in my.cnf — if you're at 80%+ you're at risk
SELECT 
    variable_name,
    variable_value
FROM performance_schema.global_status
WHERE variable_name IN (
    'Threads_connected',       -- active connections right now
    'Threads_running',         -- connections actively executing a query (the hot number)
    'Max_used_connections',    -- historical peak — tells you your worst day
    'Connection_errors_max_connections'  -- non-zero means users got rejected
);

-- 2. Query throughput — reads vs writes ratio
--    A suddenly spiking Com_select with flat Com_insert suggests a runaway report query
SELECT 
    variable_name,
    variable_value,
    -- calculate reads-to-writes ratio for context
    CASE variable_name
        WHEN 'Com_select' THEN 'READ'
        WHEN 'Com_insert' THEN 'WRITE'
        WHEN 'Com_update' THEN 'WRITE'
        WHEN 'Com_delete' THEN 'WRITE'
        ELSE 'OTHER'
    END AS operation_type
FROM performance_schema.global_status
WHERE variable_name IN (
    'Com_select', 'Com_insert', 'Com_update', 'Com_delete'
);

-- 3. InnoDB buffer pool efficiency — the single most important MySQL memory metric
--    If Innodb_buffer_pool_reads is high relative to Innodb_buffer_pool_read_requests,
--    your buffer pool is too small — you're reading from disk instead of RAM
SELECT 
    (SELECT variable_value 
     FROM performance_schema.global_status 
     WHERE variable_name = 'Innodb_buffer_pool_read_requests') AS total_logical_reads,
    (SELECT variable_value 
     FROM performance_schema.global_status 
     WHERE variable_name = 'Innodb_buffer_pool_reads') AS physical_disk_reads,
    ROUND(
        100 - (
            (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_reads') /
            (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_read_requests')
        ) * 100,
        2
    ) AS buffer_pool_hit_rate_pct;  -- aim for 99%+ in production

-- 4. Top 5 slowest normalized queries (requires performance_schema enabled)
--    This tells you WHERE to focus optimization effort
SELECT 
    DIGEST_TEXT AS normalized_query,
    COUNT_STAR AS execution_count,
    ROUND(AVG_TIMER_WAIT / 1000000000, 3) AS avg_latency_ms,
    ROUND(MAX_TIMER_WAIT / 1000000000, 3) AS max_latency_ms,
    SUM_ROWS_EXAMINED AS total_rows_examined,
    SUM_ROWS_SENT AS total_rows_returned
FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST_TEXT IS NOT NULL
ORDER BY AVG_TIMER_WAIT DESC
LIMIT 5;
▶ Output
-- Query 1: Connection pressure
+-----------------------------------+----------------+
| variable_name                     | variable_value |
+-----------------------------------+----------------+
| Connection_errors_max_connections | 0              |
| Max_used_connections              | 47             |
| Threads_connected                 | 23             |
| Threads_running                   | 4              |
+-----------------------------------+----------------+

-- Query 3: Buffer pool hit rate
+---------------------+---------------------+--------------------------+
| total_logical_reads | physical_disk_reads | buffer_pool_hit_rate_pct |
+---------------------+---------------------+--------------------------+
| 8423917             | 12043               | 99.86                    |
+---------------------+---------------------+--------------------------+

-- Query 4: Slowest queries
+-------------------------------------------+-----+--------+--------+
| normalized_query                          | cnt | avg_ms | max_ms |
+-------------------------------------------+-----+--------+--------+
| SELECT * FROM orders WHERE created_at ... |  47 |  840.3 | 2103.1 |
| UPDATE users SET last_login = ? WHERE ... | 312 |   12.4 |   89.2 |
+-------------------------------------------+-----+--------+--------+
⚠️
Watch Out: Threads_running vs Threads_connected
Most engineers alarm on Threads_connected, but the real danger signal is Threads_running. You can have 200 connections where 199 are idle (waiting for the app to send the next query) — that's fine. If Threads_running hits double digits on a standard app server, you have a concurrency problem right now. Set your alert on Threads_running, not Threads_connected.
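The buffer-pool arithmetic from query 3 is worth internalizing, since it comes up in every MySQL memory discussion. The same formula in plain Python, using the sample numbers from the output above:

```python
def buffer_pool_hit_rate(read_requests: int, disk_reads: int) -> float:
    """Mirror of the SQL formula: 100 - (physical reads / logical reads) * 100.

    read_requests: Innodb_buffer_pool_read_requests (logical reads)
    disk_reads:    Innodb_buffer_pool_reads (requests that had to hit disk)
    """
    if read_requests == 0:
        return 100.0  # no traffic yet, so nothing has missed the cache
    return round(100 - (disk_reads / read_requests) * 100, 2)

# Sample values from the output above: 8,423,917 logical reads, 12,043 from disk
print(buffer_pool_hit_rate(8_423_917, 12_043))  # 99.86, healthy (target 99%+)
```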

PostgreSQL Deep-Dive — Querying pg_stat Views for Real Operational Insight

PostgreSQL's monitoring story is fundamentally different from MySQL's. Where MySQL centralizes most metrics in performance_schema, PostgreSQL scatters them across a family of pg_stat_* views, each focused on a specific subsystem. This design is actually more powerful once you know where to look, because each view gives you surgical precision instead of one giant table.

The three views you absolutely must understand are pg_stat_activity (who's doing what right now), pg_stat_bgwriter (how hard your background writer is working), and pg_locks (who's waiting on whom). Together they cover the three most common categories of PostgreSQL incidents: long-running queries, I/O bottlenecks, and lock contention.

There's also a uniquely PostgreSQL problem that has no MySQL equivalent: table bloat from VACUUM. PostgreSQL uses MVCC (Multi-Version Concurrency Control), which means old row versions pile up in the table file until VACUUM cleans them. If VACUUM falls behind — because it's disabled, misconfigured, or being blocked by a long transaction — your tables can grow to 10x their logical size and queries slow to a crawl. pg_stat_user_tables exposes exactly this data.

The queries below are battle-tested snippets we'd use in a real production incident response runbook.

postgres_health_snapshot.sql · SQL
-- ============================================================
-- PostgreSQL Health Snapshot for production incident triage
-- Tested on PostgreSQL 13, 14, 15, 16
-- ============================================================

-- 1. What is running RIGHT NOW? — First query to run during any incident
--    Queries older than 30 seconds in a transactional app are almost always a problem
SELECT 
    pid,                                    -- process ID — use this with pg_terminate_backend()
    usename AS db_user,
    application_name,
    state,                                  -- 'active', 'idle', 'idle in transaction' (danger!)
    wait_event_type,                        -- NULL means it's actually running, not blocked
    wait_event,                             -- e.g. 'Lock', 'ClientRead', 'DataFileRead'
    ROUND(EXTRACT(EPOCH FROM (NOW() - query_start))::numeric, 1) AS running_for_seconds,
    LEFT(query, 120) AS query_preview       -- first 120 chars — enough to identify it
FROM pg_stat_activity
WHERE state != 'idle'                       -- filter out idle connections
  AND query_start IS NOT NULL
ORDER BY running_for_seconds DESC NULLS LAST;

-- 2. Lock contention map — who is blocked by whom?
--    If you see rows here, someone is waiting and someone else needs to finish or be killed
SELECT 
    blocked.pid                  AS blocked_pid,
    blocked.usename              AS blocked_user,
    blocking.pid                 AS blocking_pid,
    blocking.usename             AS blocking_user,
    ROUND(EXTRACT(EPOCH FROM (NOW() - blocked.query_start))::numeric, 1) AS blocked_for_seconds,
    LEFT(blocked.query, 100)     AS blocked_query,
    LEFT(blocking.query, 100)    AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_locks AS blocked_locks 
    ON blocked.pid = blocked_locks.pid
JOIN pg_locks AS blocking_locks 
    ON blocking_locks.granted = TRUE
    AND blocked_locks.relation = blocking_locks.relation  -- same table
    AND blocked_locks.pid != blocking_locks.pid           -- different processes
JOIN pg_stat_activity AS blocking 
    ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;   -- only rows where the lock request is WAITING

-- 3. VACUUM health — tables with stale rows that VACUUM hasn't cleaned yet
--    n_dead_tup > n_live_tup * 0.2 means 20%+ of the table is dead weight
SELECT 
    schemaname,
    relname AS table_name,
    n_live_tup AS live_rows,
    n_dead_tup AS dead_rows,
    ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_row_pct,
    last_autovacuum,                        -- NULL means autovacuum has NEVER run on this table
    last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000                    -- only tables with meaningful bloat
ORDER BY dead_row_pct DESC
LIMIT 10;

-- 4. Cache hit ratio per table — similar to MySQL's buffer pool hit rate
--    Below 95% on a busy OLTP table means you need more shared_buffers or RAM
SELECT 
    relname AS table_name,
    heap_blks_read  AS disk_reads,
    heap_blks_hit   AS cache_hits,
    ROUND(
        100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0),
        2
    ) AS cache_hit_pct
FROM pg_statio_user_tables
WHERE heap_blks_hit + heap_blks_read > 0   -- skip empty tables
ORDER BY cache_hit_pct ASC                 -- worst performers first
LIMIT 10;
▶ Output
-- Query 1: Active queries
+------+---------+--------+-----------------+------------+---------------------+--------------------+
| pid  | db_user | state  | wait_event_type | wait_event | running_for_seconds | query_preview      |
+------+---------+--------+-----------------+------------+---------------------+--------------------+
| 8821 | app     | active | Lock            | relation   |               127.4 | UPDATE orders SET  |
| 9043 | app     | active | NULL            | NULL       |                 2.1 | SELECT * FROM prod |
| 7201 | report  | active | NULL            | NULL       |              1840.3 | SELECT COUNT(*) FR |
+------+---------+--------+-----------------+------------+---------------------+--------------------+

-- Query 3: VACUUM health (example with bloated table)
+--------+---------------+-----------+-----------+--------------+---------------------+
| schema | table_name    | live_rows | dead_rows | dead_row_pct | last_autovacuum     |
+--------+---------------+-----------+-----------+--------------+---------------------+
| public | user_sessions |     45230 |     38901 |         46.2 | 2024-01-10 08:12:44 |
| public | audit_log     |    210000 |     18000 |          7.9 | 2024-01-12 14:01:22 |
+--------+---------------+-----------+-----------+--------------+---------------------+
⚠️
Watch Out: 'idle in transaction' Is a Ticking Time Bomb
In pg_stat_activity, a state of 'idle in transaction' means the application opened a transaction, sent a query, got the result, but never issued COMMIT or ROLLBACK. The transaction is still holding locks. If this sits for minutes, it blocks VACUUM from cleaning dead rows AND can block other writes. Search your application code for any place you open a transaction in a try block but don't have a finally block that guarantees a commit or rollback.
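The fix on the application side is structural: every opened transaction must end in exactly one COMMIT or ROLLBACK, even when the code between them throws. A minimal sketch of the pattern; the Connection class here is a stand-in for a real DB-API driver such as psycopg2, which exposes the same commit/rollback calls:

```python
class Connection:
    """Stand-in for a DB-API connection, just to illustrate session state."""
    def __init__(self):
        self.state = "idle"
    def execute(self, sql):
        self.state = "idle in transaction"  # query ran, transaction still open
    def commit(self):
        self.state = "idle"
    def rollback(self):
        self.state = "idle"

def update_last_login(conn, user_id):
    try:
        conn.execute("UPDATE users SET last_login = NOW() WHERE id = %s")
        conn.commit()        # happy path: transaction closed
    except Exception:
        conn.rollback()      # error path: transaction STILL closed
        raise
    # Either way, the session never lingers in 'idle in transaction'

conn = Connection()
update_last_login(conn, 42)
print(conn.state)  # idle
```

With a real driver the same guarantee usually comes for free from a context manager (`with conn:` in psycopg2 commits or rolls back on exit); the point is that no code path may leave the transaction open.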

Prometheus + Exporters + Grafana — Building a Production Monitoring Stack

Running those SQL queries by hand is great for incident triage, but you can't watch a terminal 24/7. What you want is a system that scrapes those metrics automatically every 15 seconds, stores them historically so you can see trends, and alerts you before the disk fills up or the query latency spikes. That's exactly what the Prometheus + Grafana stack does.

The architecture is simple: Prometheus is a time-series database that periodically scrapes HTTP endpoints called 'exporters'. For MySQL there's mysqld_exporter, for PostgreSQL there's postgres_exporter. Each exporter connects to your database, runs the equivalent of the queries above, and exposes the results as a /metrics HTTP endpoint that Prometheus can consume. Grafana then connects to Prometheus and lets you build dashboards with those metrics.

What makes this stack genuinely powerful isn't the dashboards — it's Prometheus's alerting rules. You write a rule like 'if the PostgreSQL dead_row_pct for any table stays above 30% for more than 10 minutes, fire an alert'. Grafana Alerting or Prometheus Alertmanager then notifies your Slack channel or PagerDuty. You stop checking dashboards and start only seeing problems when they actually need attention.

The configuration below gets you from zero to a working stack with both exporters running.

docker-compose-db-monitoring.yml · YAML
# ============================================================
# Full monitoring stack: Prometheus + MySQL exporter +
# PostgreSQL exporter + Grafana
# Run with: docker compose up -d
# ============================================================

version: '3.8'

services:

  # --- Prometheus: scrapes and stores all metrics ---------------
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"          # access Prometheus UI at localhost:9090
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # our scrape config below
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'  # keep 30 days of history
    restart: unless-stopped

  # --- MySQL Exporter: translates MySQL metrics to Prometheus format ---
  mysqld_exporter:
    image: prom/mysqld-exporter:v0.15.1
    container_name: mysqld_exporter
    ports:
      - "9104:9104"          # Prometheus scrapes this endpoint
    environment:
      # Use a dedicated monitoring user — never root!
      # CREATE USER 'monitoring'@'%' IDENTIFIED BY 'strongpassword';
      # GRANT SELECT, PROCESS, REPLICATION CLIENT ON *.* TO 'monitoring'@'%';
      # NOTE: mysqld_exporter v0.15+ removed the old DATA_SOURCE_NAME variable;
      # credentials are now passed via the --mysqld.* flags plus this env var
      MYSQLD_EXPORTER_PASSWORD: "strongpassword"
    command:
      - '--mysqld.address=mysql_host:3306'
      - '--mysqld.username=monitoring'
      # Explicitly enable the collectors you want — don't collect everything
      - '--collect.global_status'        # Com_select, Threads_running, etc.
      - '--collect.global_variables'     # max_connections, innodb_buffer_pool_size
      - '--collect.info_schema.innodb_metrics'  # InnoDB internals
      - '--collect.perf_schema.eventsstatements'  # slow query data
      - '--collect.auto_increment.columns'  # catch auto_increment columns near overflow!
    restart: unless-stopped

  # --- PostgreSQL Exporter: same idea for Postgres ---------------
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    container_name: postgres_exporter
    ports:
      - "9187:9187"          # Prometheus scrapes this endpoint
    environment:
      # Create a dedicated monitoring role in PostgreSQL:
      # CREATE ROLE monitoring WITH LOGIN PASSWORD 'strongpassword';
      # GRANT pg_monitor TO monitoring;  -- pg_monitor role added in PG10+
      DATA_SOURCE_NAME: "postgresql://monitoring:strongpassword@postgres_host:5432/postgres?sslmode=disable"
    restart: unless-stopped

  # --- Grafana: dashboards and alerting UI ----------------------
  grafana:
    image: grafana/grafana:10.4.1
    container_name: grafana
    ports:
      - "3000:3000"          # access Grafana at localhost:3000
    environment:
      GF_SECURITY_ADMIN_PASSWORD: changeme_in_production
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana_provisioning:/etc/grafana/provisioning  # auto-provision datasources
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
▶ Output
$ docker compose up -d
[+] Running 5/5
✔ Network db_monitoring_default Created
✔ Container prometheus Started
✔ Container mysqld_exporter Started
✔ Container postgres_exporter Started
✔ Container grafana Started

# Verify mysqld_exporter is scraping successfully:
$ curl -s localhost:9104/metrics | grep mysql_global_status_threads_running
mysql_global_status_threads_running 4

# Verify postgres_exporter is working:
$ curl -s localhost:9187/metrics | grep pg_stat_activity_count
pg_stat_activity_count{datname="appdb",state="active"} 7
pg_stat_activity_count{datname="appdb",state="idle"} 18
pg_stat_activity_count{datname="appdb",state="idle in transaction"} 2
⚠️
Pro Tip: Import Community Dashboards, Don't Build From Scratch
Grafana has a public dashboard library at grafana.com/grafana/dashboards. Dashboard ID 7362 is the standard MySQL overview dashboard that works directly with mysqld_exporter. Dashboard ID 9628 is the equivalent for PostgreSQL with postgres_exporter. In Grafana go to Dashboards → Import → enter the ID. You get a production-grade dashboard in 30 seconds — then customize from there instead of building panels from scratch.
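One missing piece: the Compose file above mounts a ./prometheus.yml that this article never shows. A minimal sketch of what it could contain, assuming the default Compose network where container names resolve as hostnames; job names and intervals here are illustrative choices, not requirements:

```yaml
# prometheus.yml: minimal scrape config for the stack above
global:
  scrape_interval: 15s       # default for all jobs
  evaluation_interval: 15s   # how often alert rules are evaluated

rule_files:
  - /etc/prometheus/alert_rules.yml   # the rules file mounted in the Compose config

scrape_configs:
  - job_name: mysql
    static_configs:
      - targets: ['mysqld_exporter:9104']   # container names resolve on the Compose network
  - job_name: postgresql
    static_configs:
      - targets: ['postgres_exporter:9187']
```

After `docker compose up -d`, the Prometheus UI at localhost:9090 under Status → Targets should show both jobs as UP; if a target is DOWN, the error column usually names the problem (bad credentials, wrong host, firewall).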

Prometheus Alert Rules — Catching Failures Before Your Users Do

Dashboards are reactive. Alert rules are proactive. The real value of this monitoring stack isn't the pretty graphs — it's the 3 AM Slack message that says 'PostgreSQL table user_sessions is 46% dead rows — VACUUM may be blocked' before a developer even notices slowness.

Prometheus alert rules use PromQL, Prometheus's query language. They follow a simple pattern: define an expression, a threshold, a duration (how long the condition must persist before alerting), and a severity. The duration is critical — without it, you'll get flooded with noisy one-off alerts for transient spikes that self-resolve.
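The pending-then-firing behavior of the `for` duration is easy to simulate. A toy Python model (not Prometheus's actual implementation) showing why a one-scrape spike never pages anyone:

```python
def alert_state(samples, threshold, for_periods):
    """Toy model of Prometheus 'for': the condition must hold for
    `for_periods` consecutive scrapes before the alert fires.

    samples: metric values, one per scrape interval
    Returns the final state: 'inactive', 'pending', or 'firing'.
    """
    consecutive = 0
    state = "inactive"
    for value in samples:
        if value > threshold:
            consecutive += 1
            state = "firing" if consecutive >= for_periods else "pending"
        else:
            consecutive = 0           # any healthy scrape resets the clock
            state = "inactive"
    return state

# A one-scrape transient spike never fires (this is the noise filter):
print(alert_state([10, 95, 12, 11], threshold=80, for_periods=8))  # inactive
# A sustained breach does:
print(alert_state([90] * 10, threshold=80, for_periods=8))         # firing
```

With a 15 s scrape interval, `for_periods=8` corresponds to `for: 2m` in the rules below.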

The rules below cover the five scenarios that account for the vast majority of MySQL and PostgreSQL production incidents: connection exhaustion, query latency spikes, replication lag, disk space pressure, and PostgreSQL VACUUM falling behind. Treat these as your starting template — tune the thresholds to match your specific workload after you've observed normal baseline values for a week.

alert_rules.yml · YAML
# ============================================================
# Prometheus Alert Rules for MySQL + PostgreSQL
# Save as alert_rules.yml and reference in prometheus.yml
# ============================================================

groups:
  - name: mysql_alerts
    rules:

      # Alert when MySQL is approaching connection limit
      # Symptom without this: "Too many connections" errors reaching users
      - alert: MySQLConnectionsNearLimit
        expr: |
          mysql_global_status_threads_connected /
          mysql_global_variables_max_connections * 100 > 80
        for: 2m          # must stay above 80% for 2 full minutes — avoids noisy transient spikes
        labels:
          severity: warning
        annotations:
          summary: "MySQL connection pool above 80% on {{ $labels.instance }}"
          description: "Currently at {{ $value | printf '%.1f' }}% of max_connections. Check for connection leaks or increase pool size."

      # Alert on active threads spike — the real concurrency danger signal
      - alert: MySQLHighThreadsRunning
        expr: mysql_global_status_threads_running > 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL thread storm on {{ $labels.instance }}"
          description: "{{ $value }} threads actively running queries. Investigate for table locks or a missing index causing full table scans."

      # InnoDB buffer pool hit rate drop — means you're reading from disk constantly
      - alert: MySQLBufferPoolHitRateLow
        expr: |
          (
            mysql_global_status_innodb_buffer_pool_read_requests -
            mysql_global_status_innodb_buffer_pool_reads
          ) /
          mysql_global_status_innodb_buffer_pool_read_requests * 100 < 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL InnoDB buffer pool hit rate below 95% on {{ $labels.instance }}"
          description: "Hit rate: {{ $value | printf '%.1f' }}%. Consider increasing innodb_buffer_pool_size if RAM allows."

  - name: postgresql_alerts
    rules:

      # Long-running queries blocking everything else
      - alert: PostgreSQLLongRunningQuery
        expr: |
          max by (datname) (
            pg_stat_activity_max_tx_duration{state="active"}
          ) > 300   # 300 seconds = 5 minutes
        for: 0m     # alert immediately — a 5-min query is already a problem
        labels:
          severity: warning
        annotations:
          summary: "Long-running query on PostgreSQL {{ $labels.datname }}"
          description: "Longest active query has been running {{ $value | printf '%.0f' }} seconds. Use pg_stat_activity to identify and consider pg_terminate_backend()."

      # Idle-in-transaction connections — these hold locks and block VACUUM
      - alert: PostgreSQLIdleInTransactionConnections
        expr: |
          pg_stat_activity_count{state="idle in transaction"} > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Too many idle-in-transaction connections on {{ $labels.datname }}"
          description: "{{ $value }} connections stuck in open transactions. Check application code for missing COMMIT/ROLLBACK in error paths."

      # Replication lag — replica is falling behind the primary
      # Critical for any app using replicas for reads
      - alert: PostgreSQLReplicationLagHigh
        expr: pg_replication_lag > 30   # 30 seconds of lag
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication lag {{ $value }}s on {{ $labels.instance }}"
          description: "Replica is {{ $value | printf '%.0f' }} seconds behind primary. Reads from this replica may return stale data."

      # Disk space — catches the #1 cause of surprise database downtime
      # NOTE: node_filesystem_* metrics come from node_exporter, which must
      # also be running on the database host (it is not part of the stack above)
      - alert: DatabaseDiskSpaceCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} /
           node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL data directory below 15% free on {{ $labels.instance }}"
          description: "Only {{ $value | printf '%.1f' }}% disk space remaining. Database will halt writes when disk is full."
▶ Output
# When an alert fires, Prometheus marks it PENDING first (during the 'for' window),
# then FIRING once the duration threshold is met.
# Example Alertmanager notification payload sent to Slack:

{
  "status": "firing",
  "labels": {
    "alertname": "PostgreSQLIdleInTransactionConnections",
    "datname": "production_db",
    "severity": "warning"
  },
  "annotations": {
    "summary": "Too many idle-in-transaction connections on production_db",
    "description": "8 connections stuck in open transactions. Check application code for missing COMMIT/ROLLBACK in error paths."
  },
  "startsAt": "2024-01-15T03:42:00Z"
}
🔥
Interview Gold: Why Alert Duration (for:) Matters
Interviewers love asking why Prometheus alerts have a 'for' duration. Without it, a transient 2-second CPU spike fires an alert, wakes someone up, and resolves itself before they even open their laptop — this is called alert fatigue and it leads teams to start ignoring alerts entirely. The 'for' duration means 'this condition must persist continuously for X minutes before I care'. Tune it based on how quickly each specific problem causes user impact.
| Aspect | MySQL (mysqld_exporter) | PostgreSQL (postgres_exporter) |
|---|---|---|
| Primary metrics source | performance_schema + information_schema | pg_stat_* system views |
| Connection metric to watch | Threads_running (not Threads_connected) | pg_stat_activity WHERE state='active' |
| Memory health indicator | InnoDB buffer pool hit rate (target 99%+) | shared_buffers cache hit rate (target 95%+) |
| Unique failure mode | Auto-increment column overflow (silent data corruption) | VACUUM falling behind (table bloat + query slowdown) |
| Replication monitoring | Seconds_Behind_Master via SHOW SLAVE STATUS (SHOW REPLICA STATUS in MySQL 8.0.22+) | pg_replication_lag in seconds from pg_stat_replication |
| Lock detection | information_schema.INNODB_TRX + performance_schema.data_locks (INNODB_LOCKS before MySQL 8.0) | pg_locks JOIN pg_stat_activity |
| Slow query history | performance_schema.events_statements_summary_by_digest | pg_stat_statements extension (must be installed) |
| Required permissions for exporter | SELECT, PROCESS, REPLICATION CLIENT | pg_monitor role (PostgreSQL 10+) or manual GRANT |
| Prometheus exporter port | 9104 (default) | 9187 (default) |
| Best free Grafana dashboard | Dashboard ID 7362 | Dashboard ID 9628 |

🎯 Key Takeaways

  • Threads_running in MySQL (not Threads_connected) is the real concurrency alarm — it tells you how many queries are actually executing right now, not just connected and idle.
  • PostgreSQL's 'idle in transaction' state is the silent killer — it holds locks, blocks VACUUM, and causes table bloat without showing any obviously 'stuck' queries in most monitoring setups.
  • The Prometheus + mysqld_exporter/postgres_exporter + Grafana stack costs zero dollars and gives you historical trending, alerting, and dashboards that commercial tools charge thousands for monthly.
  • Always create a dedicated read-only monitoring user for your exporters — never point monitoring tools at your application user or superuser, even in development, because habits formed in dev become incidents in prod.

⚠ Common Mistakes to Avoid

  • Mistake 1: Running the database exporter as a superuser/root — Symptom: works fine, but if the exporter is compromised, the attacker has full database access. Fix: create a dedicated monitoring user with the minimum required grants. For MySQL: GRANT SELECT, PROCESS, REPLICATION CLIENT ON *.* TO 'monitoring'@'localhost'. For PostgreSQL 10+: GRANT pg_monitor TO monitoring_role. The exporter only needs to READ, never write.
  • Mistake 2: Setting Prometheus scrape_interval too aggressively (e.g. every 1 second) — Symptom: the exporter itself becomes a performance problem, and the database starts showing 'monitoring overhead' in its own slow query log. Fix: use 15s for most metrics, 60s for heavy collection like events_statements_summary. Match your scrape interval to how quickly the metric actually changes in practice — Threads_running changes fast, table sizes change slowly.
  • Mistake 3: Ignoring the 'idle in transaction' state in pg_stat_activity and only alarming on 'active' queries — Symptom: PostgreSQL VACUUM gets blocked by a session that appears idle, tables balloon with dead rows, all queries slow down gradually over days, and no alert fires because no query looks 'stuck'. Fix: add a separate alert specifically for connections in 'idle in transaction' state lasting more than 60 seconds — this catches the application bug (missing COMMIT/ROLLBACK) before it causes table bloat.

Interview Questions on This Topic

  • Q: If a production MySQL server suddenly shows Threads_running spiking to 80 but Threads_connected is only 95 out of a max of 500, what does that tell you and what's your first investigation step?
  • Q: A PostgreSQL table that had queries running in 10ms is now taking 2 seconds with no schema changes. You check pg_stat_activity and see nothing unusual. What other views do you check and what are you looking for?
  • Q: Someone asks you to set up alerting for database disk space. They suggest setting a single alert at 5% free space. What's wrong with that approach and how would you design the alerting strategy instead?

Frequently Asked Questions

What is the best free database monitoring tool for MySQL and PostgreSQL?

The Prometheus + Grafana stack with mysqld_exporter and postgres_exporter is the industry-standard free option used by teams at every scale. It's free, open-source, runs in Docker, has pre-built community dashboards, and supports alerting via Alertmanager. For a managed option, Datadog and New Relic both have excellent database monitoring but charge per host.

How do I find which queries are slowing down my PostgreSQL database?

First, install the pg_stat_statements extension (CREATE EXTENSION pg_stat_statements) and add it to shared_preload_libraries in postgresql.conf. Then query SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 20. This shows you normalized queries sorted by average execution time — the top results are your optimization targets.

What is the difference between monitoring and observability for databases?

Monitoring tracks predefined metrics — you know in advance what you want to measure (query latency, connection count, cache hit rate) and you set thresholds on those. Observability is the ability to understand why your database is behaving the way it is from any angle, even for failure modes you didn't predict. Monitoring answers 'is something wrong?', observability answers 'why is it wrong?'. In practice, the pg_stat_* views and performance_schema are your observability layer — they let you ask new questions you didn't plan for.

TheCodeForge Editorial Team Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
