Senior 16 min · April 26, 2026

PostgreSQL Read Replicas — Lag Caused 400 Double Charges

A 800ms-2s replica lag caused 400 double charges despite Stripe idempotency keys.

N
Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Written from production experience, not tutorials.

Follow
Production
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • A read replica is a hot standby that applies WAL from the primary and serves read queries — but it's always behind the primary by some replication lag.
  • Streaming replication: primary runs walsender → replica runs walreceiver → recovery process replays WAL. Lag accumulates at network, disk I/O, or apply stage.
  • Replica reads are eventually consistent — writing to primary then SELECTing from replica is a race condition that will lose updates.
  • Performance insight: a 2048-bit RSA key encrypts 256 bytes; OAEP padding adds ~42 bytes overhead, reducing max plaintext to 214 bytes.
  • Production pitfall: routing a read that gates a financial transaction to the replica — replication lag as low as 200ms causes double charges.
  • Biggest mistake: treating synchronous_commit as a panacea — it only confirms WAL arrived on replica, not that it's applied. Use synchronous_commit = remote_apply for zero-lag reads, but accept 2-5x write latency penalty.
✦ Definition~90s read
What is Read Replicas in PostgreSQL?

PostgreSQL read replicas (standbys) are copies of your primary database that receive a continuous stream of write-ahead log (WAL) records via streaming replication. They exist to offload read queries—analytics, reporting, dashboard refreshes—from the primary, which is already busy handling writes and maintaining transactional consistency.

Imagine your company's most popular restaurant has one chef and fifty hungry customers.

The core trade-off is that replicas are eventually consistent: they lag behind the primary by some amount of time or data, and that lag can be milliseconds or minutes depending on network, hardware, and workload. When you route a read that depends on a recent write to a replica that hasn't caught up, you get stale data—and in the case that triggered this article, that stale data caused a billing system to double-charge customers because it read 'no charge yet' from a lagging replica, then the primary committed the charge, and the application retried the operation.

This is a classic distributed systems footgun: replicas are not a free scaling lever, they're a consistency hazard you must design around.

Under the hood, streaming replication works by the primary writing every transaction's WAL records to disk, then shipping them to one or more standby servers over a TCP connection. The standby replays those records in real time, rebuilding the exact same data pages.

The lag you care about is the gap between the last WAL position the primary has written and the last one the standby has replayed. You measure it with pg_stat_replication (replay_lag) or by comparing system timestamps. Panic when lag exceeds your application's tolerance—for a billing system that's probably under 100ms.

The fix isn't to make replicas faster (though faster disks help), it's to route reads intelligently: either use synchronous_commit on the primary (which kills write throughput) or, more practically, have your connection pooler or proxy check replica lag before routing and fall back to the primary when lag is too high. Never trust application developers to remember this—they won't.

Failover promotion is where replicas earn their keep beyond read scaling. When the primary dies, you promote a standby to become the new primary by running pg_ctl promote or using a tool like Patroni. This replays the remaining WAL, opens the replica for writes, and you redirect your application connections.

The danger: if you promoted a replica that was lagging, you lose any transactions that hadn't arrived yet. That's why you use synchronous replication for critical data—it ensures at least one standby has every committed transaction before the primary acknowledges the commit.

Cascading replication lets you chain standbys (replica feeding another replica) to reduce load on the primary, but it increases lag at each hop. For billing systems, never cascade—keep your replicas directly connected to the primary and monitor lag with alerts, not dashboards.

Plain-English First

Imagine your company's most popular restaurant has one chef and fifty hungry customers. The chef can cook AND take orders, but doing both at once means both get done badly. So you hire a team of runners who memorised the entire menu — they can answer every question customers have, take drink orders, describe every dish in detail, but they can't actually cook anything. That's a read replica: an exact copy of your database that can answer every read question your app throws at it, so your primary database can focus entirely on writes. The 'menu' gets updated to the runners every few seconds — that slight delay is replication lag, and it's the thing that will eventually cause you a production incident if you don't plan for it.

Your primary database is not a read server that also handles writes — treating it like one is how you end up with a 4AM PagerDuty alert because a reporting query from the analytics team locked up your checkout flow. Read replicas aren't just a scaling trick; they're an architectural boundary that forces you to think honestly about which operations actually need the freshest data and which ones are lying to themselves about it.

The problem read replicas solve is brutally simple: PostgreSQL's primary instance is single-writer by design. Every SELECT that hits the primary competes for buffer pool space, CPU cycles, and I/O bandwidth with every INSERT, UPDATE, and DELETE in your write path. At low load this is invisible. At production scale with mixed OLTP and reporting workloads running against the same instance, you start seeing query plans degrade, autovacuum fall behind, and connection counts spike in ways that look like traffic problems but are really resource contention problems. The old hack was to run analytics queries at 2AM with pg_cron and hope the batch window held. It doesn't hold. It never holds.

After reading this you'll be able to set up streaming replication with correct recovery configuration, route read traffic intelligently without trusting application developers to do it manually, diagnose and mitigate replication lag before it causes stale-read incidents, design a promotion strategy that survives primary failure without data loss, and know the exact scenarios where read replicas will make your architecture worse, not better.

How Read Replicas in PostgreSQL Actually Work — And Why Lag Matters

A read replica is a separate PostgreSQL instance that continuously applies write-ahead log (WAL) data from a primary server. The primary handles all writes; replicas serve read-only queries. This is asynchronous streaming replication by default — the replica is always behind the primary by some amount of time.

Replicas apply WAL in a single process, sequentially. A heavy write burst on the primary (e.g., batch inserts, index rebuilds) creates a backlog on the replica. The replica's apply lag grows. Queries on the replica see a stale snapshot. PostgreSQL does not block reads on the replica — it serves whatever data has been replayed so far. There is no built-in mechanism to wait for consistency.

Use read replicas when your read workload exceeds a single node's capacity or when you want to isolate reporting/analytics from production writes. But never assume replica data is current. If your application reads from a replica and then writes based on that read, you risk acting on stale state — exactly the pattern that caused double charges in this case.

Replication Lag Is Not a Bug — It's a Design Property
PostgreSQL streaming replication is asynchronous by design. There is no guarantee a replica has seen the latest write. Treat every replica read as potentially stale.
Production Insight
A payment service read a user's balance from a replica to check for sufficient funds before charging. The replica lagged 200ms behind the primary. Two concurrent requests both saw the old balance, both approved the charge, and the user was double-charged.
Exact symptom: duplicate charges with identical timestamps in the audit log, but balance checks showed sufficient funds.
Rule of thumb: never read-then-write from a replica. If you must, use pg_current_wal_lsn() to verify the replica has applied the primary's last write before proceeding.
Key Takeaway
Read replicas are eventually consistent — never assume they reflect the primary's latest state.
Always route read-then-write operations through the primary to avoid race conditions.
Monitor replica lag in production with pg_stat_replication and alert when lag exceeds your application's tolerance (e.g., >100ms).
PostgreSQL Read Replicas & Replication Lag THECODEFORGE.IO PostgreSQL Read Replicas & Replication Lag How streaming replication works and why lag causes double charges Primary DB Handles writes; generates WAL records WAL Sender Streams WAL to replicas continuously Replica WAL Receiver Receives and applies WAL to replica Replication Lag Delay between write and replica visibility Stale Read Application reads old data from replica Double Charge User retries due to stale balance; both succeed ⚠ Replication lag can cause stale reads leading to double charges Use synchronous replication or read-after-write consistency THECODEFORGE.IO
thecodeforge.io
PostgreSQL Read Replicas & Replication Lag
Read Replicas Postgresql

Streaming Replication Internals: What's Actually Happening Under the Hood

Before you configure anything, you need to know what PostgreSQL is actually shipping between instances. Streaming replication works by sending WAL — Write-Ahead Log — records from primary to standby in near-real-time. Every change made to the primary is first written to WAL before it touches the actual data files. The standby connects to the primary using the replication protocol, receives these WAL records as a stream, and applies them to its own data files. The standby is replaying the exact same operations the primary executed, in the same order, against its own copy of the data.

This has a critical implication that trips up almost everyone: the replica isn't just a copy of the data files — it's a copy of the primary's entire physical state, including every bloated table, every suboptimal index, every dead tuple autovacuum hasn't cleaned up yet. You can't add an index on the replica that doesn't exist on the primary (unless you use logical replication, which is a different beast entirely). You also can't run DDL on a streaming replica. It's physically coupled to the primary's state.

The WAL sender process on the primary and the WAL receiver process on the replica maintain a persistent TCP connection. The replica's recovery process applies the incoming WAL records. You can monitor the health of this pipeline on the primary via pg_stat_replication and on the replica via pg_stat_wal_receiver — these two views tell you everything you need to know about whether your replication pipeline is healthy. If you're not monitoring both, you're flying blind.

One more nuance: replication slots vs wal_keep_size. A replication slot is a persistent marker on the primary that says 'don't recycle WAL until this replica has consumed it'. Without a slot, you rely on wal_keep_size — a fixed retention window. If a replica disconnects longer than wal_keep_size covers, you need a full pg_basebackup to resync. Slots prevent that, but if the replica never reconnects, the primary's disk will fill. Choose based on your tolerance for full resync vs disk space.

io/thecodeforge/postgres/StreamingReplicationSetup.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
-- io.thecodeforge.postgres.replication — Production setup
-- ============================================================
-- PRIMARY configuration (postgresql.conf)
-- ============================================================

-- Enable WAL archiving level — 'replica' is sufficient for streaming replication
ALTER SYSTEM SET wal_level = 'replica';

-- Allow at least as many WAL senders as you have replicas + 1 for pg_basebackup
ALTER SYSTEM SET max_wal_senders = 5;

-- Keep enough WAL segments to allow replica to catch up after a brief disconnect
-- At 16MB per segment, this is 1GB of WAL retained — tune based on your write volume
ALTER SYSTEM SET wal_keep_size = '1GB';

-- Enable hot_standby on primary (needed so replica can serve reads)
-- Technically a replica setting, but set it on primary too for when roles flip
ALTER SYSTEM SET hot_standby = on;

-- Reload config — no restart needed for most of these
SELECT pg_reload_conf();

-- ============================================================
-- Create replication user (PRIMARY)
-- Never use a superuser for replication — least privilege matters
-- ============================================================

CREATE USER replicator
    WITH REPLICATION        -- grants REPLICATION privilege specifically
    LOGIN
    PASSWORD 'use_a_real_secret_from_vault_not_this';

-- ============================================================
-- pg_hba.conf on PRIMARY — allow replica to connect
-- Add this line to pg_hba.conf, then SELECT pg_reload_conf();
-- ============================================================

-- host  replication  replicator  10.0.1.50/32  scram-sha-256

-- ============================================================
-- On the REPLICA host — take base backup from primary
-- Run as postgres OS user on the REPLICA machine
-- ============================================================

-- pg_basebackup \
--   --host=10.0.1.10 \
--   --username=replicator \
--   --pgdata=/var/lib/postgresql/15/main \
--   --wal-method=stream \
--   --write-recovery-conf \
--   --checkpoint=fast \
--   --progress
--
-- --write-recovery-conf automatically creates standby.signal
-- and writes primary_conninfo into postgresql.auto.conf

-- ============================================================
-- REPLICA configuration (postgresql.conf)
-- ============================================================

-- Connection string to primary — vault this credential properly
ALTER SYSTEM SET primary_conninfo = 
    'host=10.0.1.10 port=5432 user=replicator password=your_secret application_name=replica_01';

-- hot_standby = on allows read queries while in recovery mode
-- Without this the replica is dark — no reads, just applying WAL
ALTER SYSTEM SET hot_standby = on;

-- Send feedback to primary about which transactions are still active on replica
-- This prevents primary from vacuuming rows the replica still needs for queries
-- Trade-off: primary retains dead tuples longer — monitor table bloat
ALTER SYSTEM SET hot_standby_feedback = on;

-- How long to wait before cancelling a conflicting query on the replica
-- -1 means wait indefinitely — use this only if your replica queries are short
-- In practice, set to 30s-120s depending on your reporting query SLAs
ALTER SYSTEM SET max_standby_streaming_delay = '120s';

-- ============================================================
-- Monitoring replication health — run on PRIMARY
-- ============================================================

SELECT
    application_name,                          -- identifies which replica
    client_addr,                               -- replica's IP
    state,                                     -- 'streaming' is healthy; 'catchup' means it's behind
    sent_lsn,                                  -- how far primary has sent WAL
    write_lsn,                                 -- how far replica has written WAL to disk
    flush_lsn,                                 -- how far replica has fsync'd WAL
    replay_lsn,                                -- how far replica has APPLIED WAL
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
    sync_state                                 -- 'async' or 'sync'
FROM pg_stat_replication;

-- ============================================================
-- Monitoring lag in seconds — run on REPLICA
-- ============================================================

SELECT
    CASE WHEN pg_is_in_recovery() THEN 'replica' ELSE 'primary' END AS role,
    EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS lag_seconds,
    pg_last_xact_replay_timestamp() AS last_applied_at
FROM pg_stat_wal_receiver;
Output
-- On PRIMARY (healthy replica):
-- application_name | client_addr | state | sync_state | replication_lag_bytes
-- replica_01 | 10.0.1.50 | streaming | async | 4096
-- On REPLICA (healthy):
-- role | lag_seconds | last_applied_at
-- replica | 0.8 | 2024-03-15 14:32:01.234+00
Production Trap: wal_keep_size Too Small
If your replica disconnects (network blip, deploy, anything) and can't reconnect before the primary has recycled the WAL segments it needs, you'll see 'ERROR: requested WAL segment has already been removed' in replica logs. The replica can no longer catch up via streaming and needs a full pg_basebackup restart. Set wal_keep_size to at least 10 minutes of WAL volume at peak write rate, or use WAL archiving to S3 as a safety net — set archive_mode = on and archive_command to copy WAL to durable storage. Monitor: SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) FROM pg_stat_replication;
Production Insight
A team set wal_keep_size = 512MB on a primary that generated 200MB/min of WAL. A replica disconnected for 3 minutes for maintenance. The primary rotated past 512MB of WAL in 2.5 minutes. Replica reconnected, asked for WAL segment 0/5A000028, primary responded 'segment removed'. Replica needed a full pg_basebackup — 2 hours of downtime.
Fix: set wal_keep_size = 5GB, monitor pg_stat_replication lag_bytes, add WAL archiving to S3.
Rule: wal_keep_size × 2 should cover your longest expected replica downtime × peak WAL generation rate.
Key Takeaway
Streaming replication: primary → walsender → TCP → walreceiver → recovery apply on replica.
Monitor both pg_stat_replication (primary) and pg_stat_wal_receiver (replica).
Physical replicas are exact copies — same indexes, same bloat, same everything.
wal_keep_size must be larger than your worst-case replica disconnect time × write rate.

Connection Routing Without Trusting Your Application Developers

Here's the brutal truth about read replica routing: if you rely on application developers to manually choose between a primary DSN and a replica DSN in their code, you will have primary reads that should be on the replica and replica reads that should be on the primary, and you won't know which is which until something breaks. Application developers are solving business problems. Database topology is not their mental model and it shouldn't have to be.

The right architecture puts routing below the application layer. PgBouncer is the most common first layer — it pools connections and can be configured with multiple backends. But PgBouncer alone doesn't route by intent. For intent-based routing you need either a proxy that understands read/write splitting (Pgpool-II, RDS Proxy, or a sidecar proxy like ProxySQL's PostgreSQL mode) or a connection wrapper in your DAL that maps transaction context to the right pool.

The pattern I've seen work in production at scale: two PgBouncer pools — one pointing to primary, one pointing to replica — with a lightweight DAL wrapper that routes based on transaction mode. Any call inside an explicit BEGIN...COMMIT block goes to primary. Any bare SELECT outside a transaction block goes to the replica pool. This gets you 80% of the benefit with minimal application code changes. The remaining 20% — cases where a SELECT needs fresh data but isn't in a write transaction — you handle with an explicit annotation or by accepting primary reads for those paths.

One anti-pattern to avoid: putting the replica pool behind a load balancer that does health checks but not read/write splitting. You'll end up with reads on primary anyway. Another trap: using statement-based routing (e.g., check if query starts with 'SELECT') fails for SELECT inside transactions, SELECT...FOR UPDATE (needs primary), and CTEs. The transaction-boundary approach is simpler and more reliable.

io/thecodeforge/postgres/DatabaseRouter.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
# io.thecodeforge.postgres.routing — Production routing layer

import contextlib
from enum import Enum, auto
from typing import Generator
import psycopg2
import psycopg2.pool
from threading import local


class ReadPolicy(Enum):
    EVENTUAL = auto()   # replica pool — accepts lag
    STRONG = auto()     # primary pool — fresh data


class DatabaseRouter:
    """
    Routes database connections to primary or replica based on context.
    Thread-local state tracks whether we're inside a write transaction.
    
    PgBouncer endpoints:
      - PRIMARY:  pgbouncer-primary:6432  (transaction mode)
      - REPLICA:  pgbouncer-replica:6432  (session mode for replicas)
    """

    def __init__(self, primary_dsn: str, replica_dsn: str, pool_size: int = 20):
        self._local = local()
        self._primary_pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=2, maxconn=pool_size, dsn=primary_dsn,
            application_name='order_service_primary'
        )
        self._replica_pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=5, maxconn=pool_size * 2, dsn=replica_dsn,
            application_name='order_service_replica'
        )

    def _in_write_transaction(self) -> bool:
        return getattr(self._local, 'write_transaction_active', False)

    def _get_connection(self, policy: ReadPolicy):
        # If inside a write transaction, MUST use primary — consistency overrides
        if self._in_write_transaction() or policy == ReadPolicy.STRONG:
            return self._primary_pool.getconn(), self._primary_pool
        return self._replica_pool.getconn(), self._replica_pool

    @contextlib.contextmanager
    def write_transaction(self) -> Generator:
        """All reads inside this block go to primary — prevents stale-read bugs."""
        conn = self._primary_pool.getconn()
        self._local.write_transaction_active = True
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            self._local.write_transaction_active = False
            self._primary_pool.putconn(conn)

    @contextlib.contextmanager
    def read_query(self, policy: ReadPolicy = ReadPolicy.EVENTUAL) -> Generator:
        """Default to replica — caller must opt into STRONG for fresh reads."""
        conn, pool = self._get_connection(policy)
        try:
            yield conn
            conn.rollback()  # ends implicit transaction, returns to pool cleanly
        except Exception:
            conn.rollback()
            raise
        finally:
            pool.putconn(conn)


# ============================================================
# Checkout flow with proper routing
# ============================================================

def process_checkout(router: DatabaseRouter, cart_id: str, user_id: str) -> dict:
    # Inventory check — replica is fine, lag acceptable
    with router.read_query(ReadPolicy.EVENTUAL) as conn:
        cursor = conn.cursor()
        cursor.execute(
            'SELECT product_id, quantity FROM inventory WHERE cart_id = %s',
            (cart_id,)
        )
        inventory = cursor.fetchall()

    if not inventory:
        raise ValueError('Cart empty')

    # Write transaction — everything inside hits primary
    with router.write_transaction() as conn:
        cursor = conn.cursor()
        cursor.execute(
            'INSERT INTO orders (cart_id, user_id, status) VALUES (%s, %s, %s) RETURNING id',
            (cart_id, user_id, 'pending')
        )
        order_id = cursor.fetchone()[0]

        cursor.execute(
            'INSERT INTO payments (order_id, status, created_at) VALUES (%s, %s, NOW())',
            (order_id, 'initiated')
        )

    # Order summary read — replica is fine, display only
    with router.read_query(ReadPolicy.EVENTUAL) as conn:
        cursor = conn.cursor()
        cursor.execute(
            'SELECT id, status, created_at FROM orders WHERE id = %s',
            (order_id,)
        )
        order = cursor.fetchone()

    return {'order_id': order_id, 'status': order[1] if order else 'pending'}
Output
# No runtime output — this is a module.
# When deployed:
# write_transaction() → pgbouncer-primary:6432
# read_query(EVENTUAL) → pgbouncer-replica:6432
# read_query(STRONG) → pgbouncer-primary:6432
#
# pg_stat_activity shows:
# application_name = 'order_service_primary' on primary
# application_name = 'order_service_replica' on replica
# Lets you immediately see which app is hitting which host in monitoring.
Never Do This: SELECT After INSERT Without Explicit Routing
INSERT to primary, then SELECT from replica to confirm the insert — this returns the old state if lag >0ms, which is always. The exact symptom: your service inserts a record, immediately reads it back to populate a response, gets zero rows, then throws a 404 or NullPointerException. The fix: any SELECT that confirms or reads data written in the same request must go to primary, or use RETURNING clause in the INSERT to get the data back in the same round-trip without a second query at all.
Production Insight
The double-charge incident at the start of this article happened because a SELECT after INSERT went to the replica. The developer assumed 'replication lag is milliseconds' — but under peak load, it hit 800ms. The retry logic re-read status every 200ms. On the third retry, the replica still hadn't applied the INSERT. The customer was charged twice.
Rule: any read that gates a financial write goes to primary. Document this in your runbook and enforce it in code reviews. The cost of a double-charge incident is ten times the cost of an extra primary read.
Key Takeaway
Route at the connection pool / DAL level — don't make app devs choose.
Transaction boundary is the correct routing signal: BEGIN...COMMIT blocks go to primary.
SELECT after INSERT must go to primary, or use RETURNING to avoid the SELECT entirely.
Monitor pg_stat_activity.application_name to verify routing is working as intended.

Replication Lag: Why It Happens, How to Measure It, When to Panic

Replication lag is not a bug. It's a fundamental property of async replication. What is a bug is building a system that assumes lag is zero, or monitoring for it reactively instead of proactively. I've watched teams add read replicas to solve a performance problem and spend the next month firefighting stale-read bugs that were ten times harder to debug than the original performance problem.

Lag accumulates from three different places and you need to know which one is causing yours. Network lag is the time for WAL bytes to travel from primary to replica — usually <5ms on properly networked instances in the same AZ. I/O lag on the replica is the time to write and fsync incoming WAL segments — on a replica with an undersized disk or competing I/O from read queries, this is where lag balloons. Apply lag is the time for the recovery process to apply WAL records to data pages — on tables with high contention or during bulk loads, apply lag can spike to seconds even with fast I/O.

The real danger is cumulative lag under write bursts. If your primary ingests a bulk load at 200MB/s of WAL for 60 seconds, and your replica can only sustain 150MB/s of apply throughput, you're 6GB of WAL behind when the load stops. At steady state the replica catches up — but for those minutes, any read against the replica is returning data that's potentially minutes old. You need to monitor pg_stat_replication on primary continuously, alert at 10 seconds of lag, and have a circuit breaker that routes to primary automatically when lag exceeds your SLA threshold.

A subtle point: replica lag reported in bytes (pg_wal_lsn_diff) is more precise than lag in seconds (replay_lag). Bytes tell you exactly how much work is pending. Seconds include network jitter and OS scheduling. For alerting, use seconds with a hysteresis (e.g., lag > 30s for 2 consecutive checks). For root cause analysis, examine the three byte-level components.

io/thecodeforge/postgres/ReplicationLagMonitoring.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
-- ============================================================
-- Comprehensive lag monitoring view — create on PRIMARY
-- ============================================================

CREATE OR REPLACE VIEW v_replication_health AS
SELECT
    r.application_name,
    r.client_addr,
    r.state,
    r.sync_state,
    -- Byte-level lag breakdown
    pg_wal_lsn_diff(r.sent_lsn, r.write_lsn)   AS network_lag_bytes,
    pg_wal_lsn_diff(r.write_lsn, r.flush_lsn)  AS disk_lag_bytes,
    pg_wal_lsn_diff(r.flush_lsn, r.replay_lsn) AS apply_lag_bytes,
    pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn)  AS total_lag_bytes,
    r.replay_lag,                               -- human time
    rs.slot_name,
    rs.active AS slot_active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), rs.confirmed_flush_lsn) AS slot_lag_bytes
FROM
    pg_stat_replication r
    LEFT JOIN pg_replication_slots rs ON rs.active_pid = r.pid
ORDER BY total_lag_bytes DESC;

-- ============================================================
-- Circuit breaker: check if replica is fresh enough for reads
-- Run on REPLICA — embed this in your connection pool health check
-- ============================================================

CREATE OR REPLACE FUNCTION fn_replica_is_fresh(
    max_lag_seconds NUMERIC DEFAULT 30
) RETURNS BOOLEAN AS $$
DECLARE
    lag_seconds NUMERIC;
BEGIN
    SELECT COALESCE(
        EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())),
        999999  -- if NULL, replica hasn't replayed anything yet
    ) INTO lag_seconds;
    RETURN lag_seconds <= max_lag_seconds;
END;
$$ LANGUAGE plpgsql STABLE;

-- Usage: SELECT fn_replica_is_fresh(30); -- TRUE if lag < 30s

-- ============================================================
-- Find what's causing lag spikes: conflicting queries on replica
-- Run on REPLICA
-- ============================================================

SELECT
    pid,
    now() - query_start AS duration,
    wait_event_type,
    wait_event,
    LEFT(query, 120) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
  AND (wait_event_type = 'Lock' OR wait_event LIKE '%recovery%')
ORDER BY duration DESC;
Output
-- v_replication_health on PRIMARY:
-- application_name | network_lag_bytes | disk_lag_bytes | apply_lag_bytes | total_lag_bytes | replay_lag
-- replica_01 | 0 | 0 | 8192 | 8192 | 00:00:00.8
-- fn_replica_is_fresh(30) on REPLICA:
-- fn_replica_is_fresh
-- t
-- Conflict query on REPLICA:
-- pid | duration | wait_event_type | wait_event | query_preview
-- 1234 | 00:02:15 | Lock | relation | SELECT COUNT(*) FROM orders WHERE status = 'pending'
Production Trap: Replication Slots Without Monitoring = Disk Death
If you use a replication slot (required for logical replication, optional for physical) and the replica goes offline, the primary will retain ALL WAL produced since the slot went inactive — forever — until the replica reconnects. I've seen this fill a 500GB disk in under 6 hours during a replica maintenance window that ran long. The error when disk fills: 'PANIC: could not write to file pg_wal/XXXX: No space left on device' — which crashes the primary. Run SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS retained_bytes FROM pg_replication_slots; and alert if retained_bytes exceeds 20GB. If a slot goes inactive and you can't reconnect the replica in minutes, DROP the slot immediately and rebuild the replica.
Production Insight
A healthcare analytics platform had a replica with 3TB of storage. A developer ran a query that took 45 minutes on the replica, blocking WAL apply for the entire duration. Replay lag grew to 45 minutes. The monitoring system was only checking write_lag, which remained low (WAL was being received and written, just not applied). No alert fired. The dashboard showed data from 45 minutes ago for an hour before anyone noticed.
Fix: add alert on replay_lag and apply_lag_bytes. Also implement pg_stat_activity monitoring that kills queries running longer than max_standby_streaming_delay.
Key Takeaway
Lag has three components: network, disk I/O, and apply. Monitor all three — each points to a different bottleneck.
Use circuit breakers: if lag > SLA, route reads back to primary automatically.
Replication slots are a double-edged sword — they prevent resync but can fill your disk if a replica dies.
Alert on replay_lag, not just write_lag. Write_lag being low doesn't mean your data is fresh.

Promotion and Failover: What Actually Happens When Your Primary Dies

Failover is where most PostgreSQL read replica setups reveal they were built with optimism rather than engineering. The scenario: primary becomes unavailable. Every second of downtime is revenue loss. You need to promote a replica to primary, update your connection strings, ensure the old primary never comes back as a rogue writer, and ideally do all of this without data loss. If you haven't rehearsed this, you don't actually have HA — you have a hope.

PostgreSQL's promotion mechanism is clean: on the standby, run pg_ctl promote or create a trigger file, and the replica exits recovery mode and starts accepting writes. The problem is everything around it. Your application's connection pool is still pointing at the dead primary's IP. Other replicas are still trying to stream from the dead primary. If the old primary recovers and starts accepting writes while the new primary is also accepting writes — split-brain — you have two diverging transaction histories and neither node knows about the other's writes.

Patroni is the industry answer to this. It uses a distributed consensus store (etcd or Consul) to maintain a single authoritative leader, handles automatic promotion, updates a virtual IP or DNS entry that your application uses, fences the old primary (prevents it from accepting writes even if it recovers), and handles adding the old primary back as a replica via pg_rewind. If you're running PostgreSQL in production without Patroni or an equivalent, you have a manual failover runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.

A specific gotcha: after promoting a replica, the timeline ID changes. The old primary, if it comes back online, is on a diverged timeline. It cannot rejoin as a replica without pg_rewind, which replays WAL from the divergence point to bring it back into sync. Running pg_rewind requires the old primary to be stopped and requires the new primary to have wal_log_hints = on or data checksums enabled. If those aren't set, pg_rewind fails and you need a full pg_basebackup.

io/thecodeforge/postgres/FailoverValidation.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
-- ============================================================
-- Pre-failover health check — run on PRIMARY
-- ============================================================

SELECT
    application_name,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS unconfirmed_bytes,
    replay_lag,
    sync_state
FROM pg_stat_replication
ORDER BY unconfirmed_bytes ASC;

-- For zero-RPO failover: only promote if unconfirmed_bytes = 0

-- ============================================================
-- Manual promotion (run on REPLICA being promoted)
-- ============================================================

-- Shell: pg_ctl promote -D /var/lib/postgresql/15/main
-- OR if trigger_file is configured: touch /tmp/postgresql.trigger.5432

-- After promotion, confirm:
SELECT
    pg_is_in_recovery() AS is_still_replica,     -- should be false
    pg_current_wal_lsn() AS current_lsn,
    current_setting('transaction_read_only') AS read_only;

-- ============================================================
-- After promotion: re-sync old primary using pg_rewind
-- Run on OLD PRIMARY (now stopped)
-- ============================================================

-- pg_rewind \
--   --target-pgdata=/var/lib/postgresql/15/main \
--   --source-server='host=10.0.1.50 user=replicator dbname=postgres' \
--   --progress
--
-- Then: echo 'primary_conninfo = ...' >> postgresql.auto.conf
-- touch standby.signal
-- pg_ctl start

-- ============================================================
-- Verify topology after failover — run on new PRIMARY
-- ============================================================

SELECT
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
Output
-- Pre-failover (replica is caught up):
-- application_name | state | unconfirmed_bytes | replay_lag
-- replica_01 | streaming | 0 | 00:00:00
-- After promotion (on new primary):
-- is_still_replica | current_lsn | read_only
-- f | 0/5B000028 | off
-- Topology after re-sync:
-- application_name | client_addr | state | lag_bytes
-- old_primary | 10.0.1.10 | streaming | 0
Production Trap: The Zombie Primary That Keeps Writing
After promoting a replica to primary, if the old primary recovers from a network partition (not actually dead — just unreachable) and restarts without being fenced, you have two nodes accepting writes. Your application will write to both, guided by stale DNS or stale connection pool entries. This is split-brain and it corrupts both databases. The fix: STONITH (Shoot The Other Node In The Head) — Patroni uses watchdog devices or cloud API calls to forcibly power off the old primary before completing promotion. If you can't fence the old primary, you can't safely complete failover. This is why you never promote a replica manually without confirming the old primary is unreachable and fenced.
Production Insight
A fintech company had a primary database in AWS us-east-1 and a replica in us-west-2. A routing issue made us-east-1 unreachable from the application but the database was still running. The on-call engineer manually promoted the us-west-2 replica. 5 minutes later, network connectivity recovered. Suddenly both us-east-1 and us-west-2 were accepting writes. The application wrote to whichever node had the faster network path. Two different transaction histories evolved. Reconciling the split-brain took 3 days of manual data repair.
Rule: never promote a replica manually without fencing the old primary — power it down, block its security group, or use Patroni's STONITH. The 5 minutes of extra downtime is worth avoiding split-brain.
Key Takeaway
Failover is not just pg_ctl promote — it's fencing, DNS, and re-sync of other replicas.
Patroni is the industry standard for automated failover; don't build your own.
pg_rewind re-syncs a diverged primary after failover — requires wal_log_hints = on.
Split-brain is the single most dangerous outcome of a failover. Fence the old primary before promotion.

Cascading Replication: How to Make Your Standbys Work for a Living

Most teams treat standby replicas as passive sponges that just soak up WAL and wait to be promoted. That's wasteful. Cascading replication lets a replica serve as a replication source for downstream replicas, offloading the primary's network and CPU burden. The WHY is simple: your primary shouldn't be sending the same WAL data to thirty application clients and fifteen replicas. It's a bottleneck you're paying for in latency and risk.

Set standby.sync to on and hot_standby to on in the replica that'll be your cascade source. That replica then becomes both a read target and a feeder. But here's the catch—cascading adds latency at each hop. If you chain three levels, your replica at the bottom sees lag that's the sum of all upstream propagation delays. Measure it with pg_stat_replication on each node. A common production pattern is two-tier: primary feeds two regional cascades, each feeds five application replicas. Your primary sends only 2 streams instead of 12. That's a win.

Never trust a cascade replica for critical failover unless you've validated its WAL position is within acceptable lag. Otherwise you promote a stale node and lose committed transactions.

CascadeTopologyCheck.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — database tutorial

-- Check which replicas are feeding others
SELECT
    client_addr,
    application_name,
    state,
    sync_state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn) AS lag_bytes
FROM pg_stat_replication;

-- On a cascade replica, check what IT is sending downstream
SELECT
    client_addr,
    application_name,
    state
FROM pg_stat_replication;

-- Expected output for a 2-tier cascade:
--  client_addr    | application_name | state     | sync_state
--  192.168.1.10   | cascade_us_east  | streaming | async
--  192.168.1.20   | cascade_us_west  | streaming | async
Output
client_addr | application_name | state | lag_bytes
192.168.1.10 | cascade_us_east | streaming | 56780
192.168.1.20 | cascade_us_west | streaming | 102400
Production Trap:
Cascade replicas don't report their downstream lag to the primary. You have to query each cascade node individually to see if your deep replicas are falling behind.
Key Takeaway
Cascade replication reduces primary load by letting replicas serve as WAL sources, but every hop adds latency—monitor each tier independently.

Synchronous Replication: When You Absolutely Cannot Lose a Transaction

Streaming replication in async mode means the primary commits before the replica acknowledges receipt. That's fine for reporting dashboards. It's not fine for payment systems, order processing, or anything where a primary crash could lose the last few seconds of writes. Synchronous replication forces the primary to wait for at least one replica to confirm the WAL was flushed to disk before reporting success to the client.

The cost is write latency—every INSERT, UPDATE, or DELETE now waits for a network round-trip to the sync standby. In a cross-datacenter setup with 10ms latency, that adds 10ms to every write. You don't want this on every connection. Use synchronous_standby_names to specify which replicas are sync targets, and set synchronous_commit = remote_write for a lighter guarantee (replica received it but hasn't flushed) or on for full flush confirmation.

Production pitfall: if your sync standby dies, the primary blocks all writes until either the standby comes back or you remove it from the sync list. Always have a quorum commit setup or a monitoring alert that auto-demotes a dead sync standby before your users start screaming.

SyncReplicationConfig.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// io.thecodeforge — database tutorial

-- On primary: postgresql.conf
-- synchronous_standby_names = '1 (dc1_replica, dc2_replica)'

-- Check sync replication status
SELECT
    application_name,
    sync_state,  -- 'sync' means synchronous
    sync_priority,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
WHERE sync_state = 'sync';

-- Example output:
-- application_name | sync_state | sync_priority | write_lag
-- dc1_replica      | sync       | 1             | 00:00:00.003
-- dc2_replica      | potential  | 2             | 00:00:00.042
Output
application_name | sync_state | sync_priority | write_lag
dc1_replica | sync | 1 | 00:00:00.003
dc2_replica | potential | 2 | 00:00:00.042
Production Trap:
If your sync standby goes down and you have only one synchronous replica configured, the primary will block ALL writes until it comes back or you restart the primary without the sync standby name.
Key Takeaway
Synchronous replication guarantees zero data loss but introduces write latency proportional to network distance—never use it across high-latency links without understanding the tradeoff.

WAL Archive & Continuous Archival: Your Replicas Can't Save You From Disk Failure

Replicas are great for read scaling and failover. They are not backups. If your primary suffers a physical disk failure that corrupts the WAL files before they're replicated, all your replicas will replicate that corruption. That's the day you learn streaming replication is not a backup strategy. You need continuous WAL archiving to a separate storage system, preferably an object store like S3 or a dedicated backup server.

Set archive_mode = on and archive_command to copy completed WAL segments to your archive location. PostgreSQL will call this command every time it finishes writing a WAL segment (usually 16MB). The archive target must be durable—a separate volume, different physical hardware, or cloud object storage. Combine this with pg_basebackup for full base backups and slot-based WAL retention so you don't lose WAL before replicas consume it.

Production pattern: full backup every 6 hours, WAL archiving every 5 minutes, and a separate recovery.conf that can restore from archive in minutes. Test your restore process quarterly. I've seen teams that had two years of WAL archives but couldn't actually get PostgreSQL to restore from them because they changed the archive format without testing. Don't be that team.

WALArchiveConfig.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// io.thecodeforge — database tutorial

-- postgresql.conf
archive_mode = on
archive_command = 'rsync -a %p backup@archive-server:/wal_archive/%f'
wal_level = replica
max_wal_senders = 10

-- Verify archive status
SELECT
    last_archived_wal,
    last_archived_time,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;

-- Example output:
-- last_archived_wal      | last_archived_time
-- 000000010000000A000000BB | 2024-05-14 13:45:12

-- If last_failed_wal is not null, your archiving is broken.
-- Fix it before you need it.
Output
last_archived_wal | last_archived_time
000000010000000A000000BB | 2024-05-14 13:45:12
last_failed_wal | last_failed_time
NULL | NULL
Production Trap:
A failed archive command doesn't crash the primary—PostgreSQL just logs the error and keeps running. Monitor pg_stat_archiver in your alerting or your backup will silently rot.
Key Takeaway
Replicas don't protect against storage corruption—continuous WAL archiving to a separate durable target is your actual backup. Test restoration quarterly.

Project Structure: Don't Let Your Repo Look Like a Junior Dev's Desktop

Most PostgreSQL replication guides dump everything into a single directory and call it a day. That works until your third replica joins the cluster and you can't tell which config belongs to which node. Stop that.

Your project structure is your first line of defense against operational chaos. Separate primary config, replica config, and initialization scripts into distinct directories. Use environment-specific override files so staging doesn't accidentally connect to prod. If your docker-compose.yml references a config path that doesn't exist, your replica won't start — that's a feature, not a bug.

Every production cluster I've managed that survived more than six months had a clear directory layout: pg-primary/, pg-replica/, pg-init/, and a top-level docker-compose.yml. No ambiguity. No guessing. You should be able to clone the repo and have a working three-node cluster in under two minutes. If you can't, your structure is wrong.

project-structure.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// io.thecodeforge — database tutorial

-- Project directory layout for read replica cluster
-- Each component gets its own config and scripts

├── docker-compose.yml          # Orchestrates primary + replicas
├── .env                        # Shared env vars (DB name, passwords)
├── pg-primary/
│   ├── postgresql.conf         # Primary-specific settings
│   ├── pg_hba.conf             # Replication user access
│   └── init.sql                # Initial schema + replication user
├── pg-replica/
│   ├── postgresql.conf         # Replica-optimized settings
│   ├── setup-replica.sh        # Streaming replication bootstrap
│   └── .pgpass                  # Credentials for replication user
└── pg-replica-2/
    ├── postgresql.conf
    ├── setup-replica.sh
    └── .pgpass
Output
Directory structure for a 3-node cluster (1 primary, 2 replicas)
Production Trap:
Never store .pgpass files in version control. Use Docker secrets or a vault. One leaked password = full cluster compromise.
Key Takeaway
Three directories. One compose file. No ambiguity. Your future self running incident response at 3 AM will thank you.

Step 5: Write the Docker Compose File — Because Manual Setup Is for Masochists

You could manually configure PostgreSQL on three different servers. Or you could write a Docker Compose file and have a working cluster in 30 seconds. Your choice.

The compose file must define three services: one primary and two replicas. The primary exposes port 5432. Replicas expose different ports (5433, 5434) to avoid conflicts during local development. Each replica depends on the primary — if the primary isn't ready, the replicas will retry.

Critical detail: the replica setup script runs on container start, not build time. You need the primary to be fully initialized before the replica can pull the base backup. Use a health check on the primary service. Use depends_on with condition: service_healthy. Without that, your replica will fail silently and you'll waste an hour debugging why WAL streaming never started.

One more thing: mount your config directories as read-only volumes. If a misconfigured replica tries to write to its own config, it should crash immediately. Fail fast, not silently.

docker-compose.ymlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
// io.thecodeforge — database tutorial

version: '3.8'

services:
  pg-primary:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - ./pg-primary/postgresql.conf:/etc/postgresql/postgresql.conf:ro
      - ./pg-primary/pg_hba.conf:/etc/postgresql/pg_hba.conf:ro
      - ./pg-primary/init.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  pg-replica-1:
    image: postgres:16
    depends_on:
      pg-primary:
        condition: service_healthy
    volumes:
      - ./pg-replica/postgresql.conf:/etc/postgresql/postgresql.conf:ro
      - ./pg-replica/setup-replica.sh:/docker-entrypoint-initdb.d/setup-replica.sh
    ports:
      - "5433:5432"
    environment:
      PG_PRIMARY_HOST: pg-primary
      PG_REPLICA_USER: replicator
      PG_REPLICA_PASSWORD: ${REPLICA_PASSWORD}
Output
docker-compose.yml - Read replica cluster with health checks
Senior Shortcut:
Use PG_REPLICA_PASSWORD from environment, not a hardcoded value. One env file per environment. Keep secrets out of your compose file.
Key Takeaway
A health-checked primary with dependent replicas is the minimum viable production pattern. Anything less is asking for downtime.

Cleaning Up Orphaned WAL Files After Replica Promotion

When you promote a read replica to primary, PostgreSQL leaves behind WAL segments on the old primary that can fill disk and confuse monitoring. The old primary continues writing WAL files even after it's orphaned, because it doesn't know it's been demoted. You must stop PostgreSQL on the old primary first, then manually remove stale WAL files from pg_wal. Use pg_archivecleanup with the last valid WAL segment from the new primary to avoid removing files still needed for recovery. The cleanup tool reads the archive status and deletes everything before the given segment. Always run a CHECKPOINT on the new primary before cleaning, because checkpoint writes a new timeline history file that marks the split point. Failure to clean orphaned WAL causes disk full errors and prevents the old primary from being safely reused as a standby. Implement a monitoring alert for pg_wal directory size and automatically run archive cleanup after any failover test.

cleanup_orphaned_wal.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// io.thecodeforge — database tutorial

-- Stop the orphaned primary
pg_ctl stop -D /var/lib/postgresql/data

-- On the NEW primary, find the latest checkpoint's WAL
SELECT pg_current_wal_lsn();
SELECT pg_walfile_name(pg_current_wal_lsn());

-- Output example: 00000001000000000000000A
-- Run on the OLD primary
pg_archivecleanup /var/lib/postgresql/15/pg_wal 00000001000000000000000A

-- Verify cleaning
SELECT count(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}';
Output
pg_archivecleanup: removing file "00000001000000000000000009"
pg_archivecleanup: removing file "00000001000000000000000008"
count
-------
3
(1 row)
Production Trap:
Never run pg_archivecleanup on a running primary. It deletes WAL files that in-flight transactions still need. Always confirm the standby is fully caught up and promoted before cleaning.
Key Takeaway
Always orphan old primary's WAL after promotion — run pg_archivecleanup with the new primary's latest WAL segment.

What We Covered: The Only Three Rules You Need for Read Replicas

You now know everything necessary to run read replicas without surprises. First rule: lag is measured, not guessed — monitor pg_stat_replication for write_lag and flush_lag, and set alerts at 10 seconds for OLTP, 60 seconds for batch reporting. Second rule: failure requires a sequence — stop the old primary, promote the replica, update your connection router's DNS, then clean orphaned WAL. Never skip the stop step; it prevents split-brain where two servers accept writes. Third rule: disk failures still kill you — WAL archiving to cloud storage is the only safety net. Replicas are for read scaling, not backup. Combine synchronous replication for zero-loss critical transactions with cascading replicas for reporting loads. Use pg_rewind to reattach the old primary as a standby after failover instead of rebuilding from scratch. These three rules — measure lag, follow sequence, archive WAL — eliminate 90% of replica-related outages. The remaining 10% is people ignoring them.

rules_check.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// io.thecodeforge — database tutorial

-- Rule 1: Lag check
SELECT application_name, write_lag, flush_lag, replay_lag 
FROM pg_stat_replication;

-- Rule 2: Failover checklist
-- 1. pg_ctl stop on old primary
-- 2. SELECT pg_promote() on replica
-- 3. Update application DNS/haproxy
-- 4. Run pg_archivecleanup on old primary

-- Rule 3: Verify WAL archiving works
SELECT archived_count, failed_count 
FROM pg_stat_archiver
WHERE failed_count > 0;
Output
application_name | write_lag | flush_lag | replay_lag
-----------------+-----------+-----------+------------
replica1 | 00:00:02 | 00:00:03 | 00:00:03
(1 row)
archived_count | failed_count
----------------+--------------
15234 | 0
(1 row)
Golden Rule:
If you only remember one thing: async replicas will lag. Design your application to tolerate stale reads, or use synchronous replication but accept the write latency penalty.
Key Takeaway
Three rules prevent 90% of replica failures: measure lag, follow failover sequence, archive WAL off-server.

Introduction: Why Read Replicas Exist (and Why You Should Care)

PostgreSQL read replicas solve a fundamental tension: your primary database must handle writes with strict consistency, but reads can tolerate slight staleness for massive throughput gains. Without replicas, every analytics query, dashboard refresh, and background job competes for the same CPU and I/O as your critical write path. The result? Latency spikes, connection pool exhaustion, and eventually, angry users. Replicas offload read traffic to separate nodes that asynchronously replay the write-ahead log (WAL) from the primary. This isn't just about scaling—it's about isolation. A badly written reporting query can bring a replica to its knees, but your primary stays healthy. The trade-off? Replication lag. Data written to the primary may take milliseconds (or seconds) to appear on a replica. If your application cannot tolerate even 100ms of staleness, replicas are the wrong tool. But for the 90% use case—read-heavy workloads, geographical distribution, or hot standby—replicas are the simplest, most battle-tested pattern in the PostgreSQL ecosystem.

intro_replica_check.sqlSQL
1
2
3
4
5
6
7
8
9
10
// io.thecodeforge — database tutorial
// Check if current session is on a replica
SELECT pg_is_in_recovery() AS is_replica;

-- Check replication lag in bytes
SELECT pg_wal_lsn_diff(
  pg_current_wal_lsn(),
  pg_last_wal_receive_lsn()
) AS lag_bytes
FROM pg_stat_wal_receiver;
Output
is_replica | lag_bytes
------------+----------
t | 24576
Production Trap:
Never assume 0 lag. Always measure actual lag with pg_stat_replication before routing reads. A replica that's 10 seconds behind is useless for real-time dashboards.
Key Takeaway
Read replicas decouple read and write workloads, but always measure lag before trusting them.

Prerequisites: What You Need Before Touching Replicas

Before deploying PostgreSQL read replicas, three non-negotiable foundations must exist. First, a properly configured primary with wal_level = replica (or logical for logical replication) and max_wal_senders set high enough to support your replica count plus one buffer. Without these, replicas cannot stream WAL. Second, network connectivity: replicas must reach the primary on port 5432 (or your custom port) with firewall rules allowing PostgreSQL's streaming replication protocol—this is not HTTP, so load balancers often break it. Third, authentication: create a dedicated replication user with REPLICATION privilege and store credentials in a vault, not in code comments. Bonus prerequisite: understand your application's tolerance for stale reads. If your business logic demands causal consistency (e.g., "show me my order immediately after placing it"), replicas will betray you. Finally, ensure PostgreSQL version parity between primary and replicas—major version mismatches silently corrupt replication. On the operational side, you'll need monitoring (Prometheus + pg_stat_replication), a way to replay WAL after failover, and a plan for handling replication slot bloat. Skip any of these, and your replica setup will fail in production at the worst possible moment.

prereq_setup.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
// io.thecodeforge — database tutorial
-- Verify WAL level on primary
SHOW wal_level;

-- Create replication user
CREATE USER repl_user WITH REPLICATION PASSWORD 'strong_password';

-- Check max_wal_senders
SHOW max_wal_senders;

-- Grant necessary permissions
GRANT CONNECT ON DATABASE your_db TO repl_user;
Output
wal_level | max_wal_senders
------------+----------------
replica | 10
Architecture Check:
Replicas require persistent network tunnels. If using cloud VPCs, ensure security groups allow inbound from replica IPs—and never expose the primary's replication port to the internet.
Key Takeaway
WAL level, replication user, network access, and version parity are mandatory before a single replica can stream.
● Production incidentPOST-MORTEMseverity: high

The Replica That Made Us Double-Charge 400 Customers

Symptom
Payment service started showing duplicate charge reports sporadically after a deploy. Charges were confirmed in Stripe but the order status in the UI showed 'pending'. Customers were being charged, not getting their order, and retrying — getting charged again.
Assumption
The team assumed the bug was in the idempotency key logic in the payment microservice — someone had pushed a change to the retry handler that day.
Root cause
The order status check before triggering a payment was routed to the read replica to reduce load on the primary. Replication lag was sitting at 800ms–2s under peak load. The write confirming 'payment initiated' hadn't propagated to the replica by the time the retry logic re-read order status, so the check returned 'pending' and triggered a second charge. The idempotency key on Stripe stopped the double-charge at the Stripe layer for most customers but 400 orders had a race condition where two separate charge attempts hit Stripe within the same idempotency window.
Fix
Immediate fix: set all payment status reads to route to primary — added a / force_primary / hint that PgBouncer's routing rules respected. Permanent fix: any read that gates a financial write must go to primary. Added a ReadPolicy.STRONG annotation in the DAL that bypassed the replica pool entirely. Bumped synchronous_commit = remote_apply for the payments schema specifically using ALTER DATABASE payments_db SET synchronous_commit = 'remote_apply'; to ensure writes were applied on replica before acknowledging to the application.
Key lesson
  • If a read result gates a write that moves money, it goes to the primary — full stop, no exceptions, not even at 1AM when the primary CPU graph looks scary.
Production debug guideQuick reference for diagnosing replica issues in production6 entries
Symptom · 01
Replica falling behind — lag climbing past 30 seconds
Fix
Check iostat on replica disk. WAL apply is sequential and disk-bound. If disk iowait > 20%, move replica to faster storage or separate WAL volume. Run the lag breakdown query from Section 3 to see if network, disk, or apply is the bottleneck.
Symptom · 02
ERROR: canceling statement due to conflict with recovery in replica logs
Fix
A long-running read query on the replica is blocking WAL apply. Set hot_standby_feedback = on on the replica to prevent vacuum from cleaning rows the replica still needs. Increase max_standby_streaming_delay to 120s. For analytics queries that can't be short, move them to a dedicated replica.
Symptom · 03
Primary disk filling up — pg_wal directory growing unbounded
Fix
Check for inactive replication slots: SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots;. If a replica is down and using a slot, the primary retains all WAL indefinitely. Drop the slot with SELECT pg_drop_replication_slot('slot_name'); then rebuild the replica.
Symptom · 04
Queries on replica are slower than identical queries on primary
Fix
Replica suffers from different query plans because autovacuum doesn't run as aggressively. Run ANALYZE on replica manually after major writes. For identical plan problems, enable log_duration on both nodes and compare — replica may lack indexes that exist on primary (but physical replication includes them). The issue is usually statistics, not structure.
Symptom · 05
Promoted replica won't accept writes — ERROR: cannot execute INSERT in a read-only transaction
Fix
You forgot to remove the standby.signal file after promotion. PostgreSQL checks for this file at startup; if it exists, the server starts in recovery mode regardless of primary_conninfo. Delete standby.signal and restart PostgreSQL. After promotion, pg_is_in_recovery() should return false.
Symptom · 06
Replication connection timeout — FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
Fix
Network or firewall issue between primary and replica. Check pg_hba.conf on primary — add host replication replicator replica_ip/32 md5. Test telnet from replica to primary on port 5432. Increase wal_sender_timeout on primary (default 60s).
★ Quick Replication Debug Cheat SheetCommands to diagnose replication issues in under 60 seconds
Check if replica is actually replicating
Immediate action
Run on replica: SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();
Commands
SELECT pg_is_in_recovery() AS is_standby, pg_last_wal_receive_lsn() AS received, pg_last_wal_replay_lsn() AS replayed, now() - pg_last_xact_replay_timestamp() AS lag_seconds;
Run on primary: SELECT application_name, state, sync_state, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;
Fix now
If lag_bytes > 1GB or lag_seconds > 30, investigate bottleneck. If state = 'catchup', replica is still initialising.
Find what's blocking replication on replica+
Immediate action
Run on replica: SELECT pid, now() - query_start AS duration, wait_event_type, wait_event, LEFT(query, 80) FROM pg_stat_activity WHERE state != 'idle' AND wait_event_type = 'Lock' ORDER BY duration DESC;
Commands
SELECT pid, pg_blocking_pids(pid) AS blockers, now() - query_start AS duration, query FROM pg_stat_activity WHERE wait_event LIKE '%recovery%';
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE wait_event = 'recovery' AND now() - query_start > interval '30 seconds';
Fix now
Cancel the blocking query with pg_cancel_backend(pid). If that fails, pg_terminate_backend(pid). Then add hot_standby_feedback = on to prevent recurrence.
Check if replication slot is causing disk bloat+
Immediate action
Run on primary: SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS retained_wal_size FROM pg_replication_slots;
Commands
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) / (1024*1024) AS retained_mb FROM pg_replication_slots;
SELECT pg_drop_replication_slot('slot_name') WHERE active = false AND retained_mb > 10240;
Fix now
If a slot is inactive and retaining >10GB of WAL, drop it. The replica will need a fresh pg_basebackup to rejoin, but that's better than a full primary disk.
Test if replica can be promoted+
Immediate action
Run on replica: CHECKPOINT; then SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
Commands
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS apply_lag_bytes;
SHOW primary_conninfo; SHOW restore_command;
Fix now
If apply_lag_bytes > 0, any promotion will lose those transactions. For zero-RPO promotion, wait for apply_lag_bytes to reach 0. For minimal RPO, document the loss window.
Async vs Sync vs Logical Replication
AspectAsync Streaming ReplicationSync Streaming ReplicationLogical Replication
Replication lagMilliseconds to seconds under loadZero — primary waits for replica ackSeconds to minutes depending on change volume
Write throughput impact on primaryNear-zero — fire and forgetSignificant — every write waits for remote flush/apply (2-5x latency)Low — only published tables are replicated
RPO on primary failureUp to several seconds of data lossZero data loss (remote_apply) or minimal (remote_write)Variable — depends on subscription lag
Replica structureIdentical physical copyIdentical physical copySubscriber can have different indexes, partitioning
Can promote to primary?Yes — full promotion via pg_ctl promoteYes — zero data loss promotionNo — subscriber can't be promoted
DDL on replicaNo — follows primary exactlyNo — follows primary exactlyNo — DDL not replicated automatically
Use casePrimary HA replica and read scalingFinancial systems, zero-RPO requirementPartial replication, cross-version upgrades

Key takeaways

1
Replication lag is not a failure mode
it's a design parameter. Define your RPO and SLA before choosing async vs sync replication, not after your first stale-read incident.
2
Any read that gates a financial write goes to the primary. No exceptions. Not even when the primary CPU is spiking. A double-charge is ten times more expensive to resolve than an extra primary read.
3
Reach for read replicas when your pg_stat_activity shows read queries holding connections and CPU while write latency is climbing
not when someone just says 'we need to scale'. The symptom is the signal.
4
Logical replication and physical streaming replication look similar from the outside but are completely different animals under the hood
physical replicas can be promoted to primary during failover, logical subscribers cannot. Mixing them in your HA pool is a disaster waiting for a production incident to express itself.
5
walsender → walreceiver → recovery apply. Monitor each stage with sent_lsn, write_lsn, flush_lsn, replay_lsn. High replay_lag with low write_lag means recovery contention
long queries on the replica are blocking WAL apply.
6
Patroni is the industry standard for automated failover; don't build your own. If you're not using Patroni, you have a manual runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.
7
wal_keep_size + WAL archiving is safer than replication slots for most setups. Slots prevent resync but can fill your disk. Monitor pg_replication_slots.retained_bytes and alert at 10GB.

Common mistakes to avoid

5 patterns
×

Pointing analytics queries at the same replica handling OLTP read traffic

Symptom
Production reads start timing out, lag spikes to 30+ seconds, 'ERROR: canceling statement due to conflict with recovery' appears in replica logs.
Fix
Run a dedicated analytics replica with max_standby_streaming_delay = -1 and hot_standby_feedback = on, isolated from the OLTP replica pool entirely. Use separate PgBouncer pools with distinct application_names to monitor separation.
×

Using a replication slot on a replica that has routine maintenance windows

Symptom
Primary disk fills during the maintenance window, 'PANIC: could not write to file pg_wal: No space left on device' crashes the primary.
Fix
Don't use replication slots for physical streaming unless you need them (e.g., Patroni uses them internally). For simple streaming replication, wal_keep_size is safer. Monitor SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots; and alert above 10GB. If a slot goes inactive, drop it and rebuild the replica.
×

Running pg_basebackup without --wal-method=stream

Symptom
Base backup completes but replica fails to start with 'FATAL: could not connect to the primary server' or immediately falls behind because it's missing WAL from the backup window.
Fix
Always use --wal-method=stream during pg_basebackup. This streams WAL produced during the backup in parallel, so the replica has a consistent starting point. Alternatively configure WAL archiving so the replica can fetch the missing segments from archive.
×

Assuming replica will automatically reconnect after primary IP change

Symptom
Replica goes to 'disconnected' state, pg_stat_wal_receiver shows 'waiting for WAL to become available' indefinitely.
Fix
primary_conninfo uses a hostname backed by a DNS record that updates on failover (Patroni's virtual IP or Route53 with short TTL). Never hardcode IP addresses in primary_conninfo — use a stable DNS name. For Kubernetes, use a Service with a stable ClusterIP.
×

Not testing promotion before you need it

Symptom
During an actual outage, promotion succeeds but the application's connection pool doesn't reconnect because it cached the dead primary's connection, leading to continued downtime.
Fix
Test failover in staging monthly. Verify PgBouncer or your proxy updates its upstream on promotion. Confirm application connection retry logic handles the reconnect window with exponential backoff. The first failover you run should never be a real incident. Document the RTO and RPO of your failover process.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Walk me through exactly what happens at the WAL level when a PostgreSQL ...
Q02SENIOR
You have a read replica that's receiving 10,000 read queries per minute ...
Q03SENIOR
You've promoted a replica to primary during an incident. The old primary...
Q04SENIOR
Your synchronous replica is configured with synchronous_commit = remote_...
Q01 of 04SENIOR

Walk me through exactly what happens at the WAL level when a PostgreSQL primary executes an UPDATE while a streaming replica is connected — where does each process hand off to the next, and where can lag accumulate in that chain?

ANSWER
The primary writes the UPDATE to WAL. The walsender process reads that WAL record and streams it over TCP to the replica's walreceiver. The walreceiver writes the record to replica's WAL (write_lag). The walreceiver then flushes the WAL to disk (flush_lag). The replica's recovery process replays the WAL record against replica's data pages (replay_lag). Lag accumulates at each stage: network latency (write to send), replica disk I/O (write to flush), and recovery contention (flush to replay). The primary does not wait for any of these in async mode — it just sends and forgets, which is why data loss on primary failure is possible. The replay stage is single-threaded, so a long-running query on the replica can stall replay entirely, causing replay_lag to spike while write_lag remains low.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
How much replication lag is acceptable for a PostgreSQL read replica?
02
What's the difference between a read replica and a standby in PostgreSQL?
03
How do I force a specific query to always run on the primary and never hit the replica?
04
Can PostgreSQL read replicas fall so far behind that they can never catch up?
05
What's the performance impact of turning on hot_standby_feedback?
N
Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Written from production experience, not tutorials.

Follow
Verified
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
🔥

That's MySQL & PostgreSQL. Mark it forged?

16 min read · try the examples if you haven't

Previous
Database Monitoring Tools
12 / 13 · MySQL & PostgreSQL
Next
SQL Server Management Studio (SSMS): Installation and Features