Skip to content
Home Database PostgreSQL Read Replicas — Lag Caused 400 Double Charges

PostgreSQL Read Replicas — Lag Caused 400 Double Charges

Where developers are forged. · Structured learning · Free forever.
📍 Part of: MySQL & PostgreSQL → Topic 12 of 13
A 800ms-2s replica lag caused 400 double charges despite Stripe idempotency keys.
🔥 Advanced — solid Database foundation required
In this tutorial, you'll learn
A 800ms-2s replica lag caused 400 double charges despite Stripe idempotency keys.
  • Replication lag is not a failure mode — it's a design parameter. Define your RPO and SLA before choosing async vs sync replication, not after your first stale-read incident.
  • Any read that gates a financial write goes to the primary. No exceptions. Not even when the primary CPU is spiking. A double-charge is ten times more expensive to resolve than an extra primary read.
  • Reach for read replicas when your pg_stat_activity shows read queries holding connections and CPU while write latency is climbing — not when someone just says 'we need to scale'. The symptom is the signal.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • A read replica is a hot standby that applies WAL from the primary and serves read queries — but it's always behind the primary by some replication lag.
  • Streaming replication: primary runs walsender → replica runs walreceiver → recovery process replays WAL. Lag accumulates at network, disk I/O, or apply stage.
  • Replica reads are eventually consistent — writing to primary then SELECTing from replica is a race condition that will lose updates.
  • Performance insight: a 2048-bit RSA key encrypts 256 bytes; OAEP padding adds ~42 bytes overhead, reducing max plaintext to 214 bytes.
  • Production pitfall: routing a read that gates a financial transaction to the replica — replication lag as low as 200ms causes double charges.
  • Biggest mistake: treating synchronous_commit as a panacea — it only confirms WAL arrived on replica, not that it's applied. Use synchronous_commit = remote_apply for zero-lag reads, but accept 2-5x write latency penalty.
🚨 START HERE

Quick Replication Debug Cheat Sheet

Commands to diagnose replication issues in under 60 seconds
🟡

Check if replica is actually replicating

Immediate ActionRun on replica: SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();
Commands
SELECT pg_is_in_recovery() AS is_standby, pg_last_wal_receive_lsn() AS received, pg_last_wal_replay_lsn() AS replayed, now() - pg_last_xact_replay_timestamp() AS lag_seconds;
Run on primary: SELECT application_name, state, sync_state, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;
Fix NowIf lag_bytes > 1GB or lag_seconds > 30, investigate bottleneck. If state = 'catchup', replica is still initialising.
🟡

Find what's blocking replication on replica

Immediate ActionRun on replica: SELECT pid, now() - query_start AS duration, wait_event_type, wait_event, LEFT(query, 80) FROM pg_stat_activity WHERE state != 'idle' AND wait_event_type = 'Lock' ORDER BY duration DESC;
Commands
SELECT pid, pg_blocking_pids(pid) AS blockers, now() - query_start AS duration, query FROM pg_stat_activity WHERE wait_event LIKE '%recovery%';
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE wait_event = 'recovery' AND now() - query_start > interval '30 seconds';
Fix NowCancel the blocking query with pg_cancel_backend(pid). If that fails, pg_terminate_backend(pid). Then add hot_standby_feedback = on to prevent recurrence.
🟡

Check if replication slot is causing disk bloat

Immediate ActionRun on primary: SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS retained_wal_size FROM pg_replication_slots;
Commands
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) / (1024*1024) AS retained_mb FROM pg_replication_slots;
SELECT pg_drop_replication_slot('slot_name') WHERE active = false AND retained_mb > 10240;
Fix NowIf a slot is inactive and retaining >10GB of WAL, drop it. The replica will need a fresh pg_basebackup to rejoin, but that's better than a full primary disk.
🟡

Test if replica can be promoted

Immediate ActionRun on replica: CHECKPOINT; then SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
Commands
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS apply_lag_bytes;
SHOW primary_conninfo; SHOW restore_command;
Fix NowIf apply_lag_bytes > 0, any promotion will lose those transactions. For zero-RPO promotion, wait for apply_lag_bytes to reach 0. For minimal RPO, document the loss window.
Production Incident

The Replica That Made Us Double-Charge 400 Customers

A payment service started showing duplicate charge reports sporadically after a deploy. Charges were confirmed in Stripe but the order status in the UI showed 'pending'. Customers were being charged, not getting their order, and retrying — getting charged again.
SymptomPayment service started showing duplicate charge reports sporadically after a deploy. Charges were confirmed in Stripe but the order status in the UI showed 'pending'. Customers were being charged, not getting their order, and retrying — getting charged again.
AssumptionThe team assumed the bug was in the idempotency key logic in the payment microservice — someone had pushed a change to the retry handler that day.
Root causeThe order status check before triggering a payment was routed to the read replica to reduce load on the primary. Replication lag was sitting at 800ms–2s under peak load. The write confirming 'payment initiated' hadn't propagated to the replica by the time the retry logic re-read order status, so the check returned 'pending' and triggered a second charge. The idempotency key on Stripe stopped the double-charge at the Stripe layer for most customers but 400 orders had a race condition where two separate charge attempts hit Stripe within the same idempotency window.
FixImmediate fix: set all payment status reads to route to primary — added a / force_primary / hint that PgBouncer's routing rules respected. Permanent fix: any read that gates a financial write must go to primary. Added a ReadPolicy.STRONG annotation in the DAL that bypassed the replica pool entirely. Bumped synchronous_commit = remote_apply for the payments schema specifically using ALTER DATABASE payments_db SET synchronous_commit = 'remote_apply'; to ensure writes were applied on replica before acknowledging to the application.
Key Lesson
If a read result gates a write that moves money, it goes to the primary — full stop, no exceptions, not even at 1AM when the primary CPU graph looks scary.
Production Debug Guide

Quick reference for diagnosing replica issues in production

Replica falling behind — lag climbing past 30 secondsCheck iostat on replica disk. WAL apply is sequential and disk-bound. If disk iowait > 20%, move replica to faster storage or separate WAL volume. Run the lag breakdown query from Section 3 to see if network, disk, or apply is the bottleneck.
ERROR: canceling statement due to conflict with recovery in replica logsA long-running read query on the replica is blocking WAL apply. Set hot_standby_feedback = on on the replica to prevent vacuum from cleaning rows the replica still needs. Increase max_standby_streaming_delay to 120s. For analytics queries that can't be short, move them to a dedicated replica.
Primary disk filling up — pg_wal directory growing unboundedCheck for inactive replication slots: SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots;. If a replica is down and using a slot, the primary retains all WAL indefinitely. Drop the slot with SELECT pg_drop_replication_slot('slot_name'); then rebuild the replica.
Queries on replica are slower than identical queries on primaryReplica suffers from different query plans because autovacuum doesn't run as aggressively. Run ANALYZE on replica manually after major writes. For identical plan problems, enable log_duration on both nodes and compare — replica may lack indexes that exist on primary (but physical replication includes them). The issue is usually statistics, not structure.
Promoted replica won't accept writes — ERROR: cannot execute INSERT in a read-only transactionYou forgot to remove the standby.signal file after promotion. PostgreSQL checks for this file at startup; if it exists, the server starts in recovery mode regardless of primary_conninfo. Delete standby.signal and restart PostgreSQL. After promotion, pg_is_in_recovery() should return false.
Replication connection timeout — FATAL: could not receive data from WAL stream: server closed the connection unexpectedlyNetwork or firewall issue between primary and replica. Check pg_hba.conf on primary — add host replication replicator replica_ip/32 md5. Test telnet from replica to primary on port 5432. Increase wal_sender_timeout on primary (default 60s).

Your primary database is not a read server that also handles writes — treating it like one is how you end up with a 4AM PagerDuty alert because a reporting query from the analytics team locked up your checkout flow. Read replicas aren't just a scaling trick; they're an architectural boundary that forces you to think honestly about which operations actually need the freshest data and which ones are lying to themselves about it.

The problem read replicas solve is brutally simple: PostgreSQL's primary instance is single-writer by design. Every SELECT that hits the primary competes for buffer pool space, CPU cycles, and I/O bandwidth with every INSERT, UPDATE, and DELETE in your write path. At low load this is invisible. At production scale with mixed OLTP and reporting workloads running against the same instance, you start seeing query plans degrade, autovacuum fall behind, and connection counts spike in ways that look like traffic problems but are really resource contention problems. The old hack was to run analytics queries at 2AM with pg_cron and hope the batch window held. It doesn't hold. It never holds.

After reading this you'll be able to set up streaming replication with correct recovery configuration, route read traffic intelligently without trusting application developers to do it manually, diagnose and mitigate replication lag before it causes stale-read incidents, design a promotion strategy that survives primary failure without data loss, and know the exact scenarios where read replicas will make your architecture worse, not better.

Streaming Replication Internals: What's Actually Happening Under the Hood

Before you configure anything, you need to know what PostgreSQL is actually shipping between instances. Streaming replication works by sending WAL — Write-Ahead Log — records from primary to standby in near-real-time. Every change made to the primary is first written to WAL before it touches the actual data files. The standby connects to the primary using the replication protocol, receives these WAL records as a stream, and applies them to its own data files. The standby is replaying the exact same operations the primary executed, in the same order, against its own copy of the data.

This has a critical implication that trips up almost everyone: the replica isn't just a copy of the data files — it's a copy of the primary's entire physical state, including every bloated table, every suboptimal index, every dead tuple autovacuum hasn't cleaned up yet. You can't add an index on the replica that doesn't exist on the primary (unless you use logical replication, which is a different beast entirely). You also can't run DDL on a streaming replica. It's physically coupled to the primary's state.

The WAL sender process on the primary and the WAL receiver process on the replica maintain a persistent TCP connection. The replica's recovery process applies the incoming WAL records. You can monitor the health of this pipeline on the primary via pg_stat_replication and on the replica via pg_stat_wal_receiver — these two views tell you everything you need to know about whether your replication pipeline is healthy. If you're not monitoring both, you're flying blind.

One more nuance: replication slots vs wal_keep_size. A replication slot is a persistent marker on the primary that says 'don't recycle WAL until this replica has consumed it'. Without a slot, you rely on wal_keep_size — a fixed retention window. If a replica disconnects longer than wal_keep_size covers, you need a full pg_basebackup to resync. Slots prevent that, but if the replica never reconnects, the primary's disk will fill. Choose based on your tolerance for full resync vs disk space.

io/thecodeforge/postgres/StreamingReplicationSetup.sql · SQL
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103
-- io.thecodeforge.postgres.replication — Production setup
-- ============================================================
-- PRIMARY configuration (postgresql.conf)
-- ============================================================

-- Enable WAL archiving level — 'replica' is sufficient for streaming replication
ALTER SYSTEM SET wal_level = 'replica';

-- Allow at least as many WAL senders as you have replicas + 1 for pg_basebackup
ALTER SYSTEM SET max_wal_senders = 5;

-- Keep enough WAL segments to allow replica to catch up after a brief disconnect
-- At 16MB per segment, this is 1GB of WAL retained — tune based on your write volume
ALTER SYSTEM SET wal_keep_size = '1GB';

-- Enable hot_standby on primary (needed so replica can serve reads)
-- Technically a replica setting, but set it on primary too for when roles flip
ALTER SYSTEM SET hot_standby = on;

-- Reload config — no restart needed for most of these
SELECT pg_reload_conf();

-- ============================================================
-- Create replication user (PRIMARY)
-- Never use a superuser for replication — least privilege matters
-- ============================================================

CREATE USER replicator
    WITH REPLICATION        -- grants REPLICATION privilege specifically
    LOGIN
    PASSWORD 'use_a_real_secret_from_vault_not_this';

-- ============================================================
-- pg_hba.conf on PRIMARY — allow replica to connect
-- Add this line to pg_hba.conf, then SELECT pg_reload_conf();
-- ============================================================

-- host  replication  replicator  10.0.1.50/32  scram-sha-256

-- ============================================================
-- On the REPLICA host — take base backup from primary
-- Run as postgres OS user on the REPLICA machine
-- ============================================================

-- pg_basebackup \
--   --host=10.0.1.10 \
--   --username=replicator \
--   --pgdata=/var/lib/postgresql/15/main \
--   --wal-method=stream \
--   --write-recovery-conf \
--   --checkpoint=fast \
--   --progress
--
-- --write-recovery-conf automatically creates standby.signal
-- and writes primary_conninfo into postgresql.auto.conf

-- ============================================================
-- REPLICA configuration (postgresql.conf)
-- ============================================================

-- Connection string to primary — vault this credential properly
ALTER SYSTEM SET primary_conninfo = 
    'host=10.0.1.10 port=5432 user=replicator password=your_secret application_name=replica_01';

-- hot_standby = on allows read queries while in recovery mode
-- Without this the replica is dark — no reads, just applying WAL
ALTER SYSTEM SET hot_standby = on;

-- Send feedback to primary about which transactions are still active on replica
-- This prevents primary from vacuuming rows the replica still needs for queries
-- Trade-off: primary retains dead tuples longer — monitor table bloat
ALTER SYSTEM SET hot_standby_feedback = on;

-- How long to wait before cancelling a conflicting query on the replica
-- -1 means wait indefinitely — use this only if your replica queries are short
-- In practice, set to 30s-120s depending on your reporting query SLAs
ALTER SYSTEM SET max_standby_streaming_delay = '120s';

-- ============================================================
-- Monitoring replication health — run on PRIMARY
-- ============================================================

SELECT
    application_name,                          -- identifies which replica
    client_addr,                               -- replica's IP
    state,                                     -- 'streaming' is healthy; 'catchup' means it's behind
    sent_lsn,                                  -- how far primary has sent WAL
    write_lsn,                                 -- how far replica has written WAL to disk
    flush_lsn,                                 -- how far replica has fsync'd WAL
    replay_lsn,                                -- how far replica has APPLIED WAL
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
    sync_state                                 -- 'async' or 'sync'
FROM pg_stat_replication;

-- ============================================================
-- Monitoring lag in seconds — run on REPLICA
-- ============================================================

SELECT
    CASE WHEN pg_is_in_recovery() THEN 'replica' ELSE 'primary' END AS role,
    EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS lag_seconds,
    pg_last_xact_replay_timestamp() AS last_applied_at
FROM pg_stat_wal_receiver;
▶ Output
-- On PRIMARY (healthy replica):
-- application_name | client_addr | state | sync_state | replication_lag_bytes
-- replica_01 | 10.0.1.50 | streaming | async | 4096

-- On REPLICA (healthy):
-- role | lag_seconds | last_applied_at
-- replica | 0.8 | 2024-03-15 14:32:01.234+00
⚠ Production Trap: wal_keep_size Too Small
If your replica disconnects (network blip, deploy, anything) and can't reconnect before the primary has recycled the WAL segments it needs, you'll see 'ERROR: requested WAL segment has already been removed' in replica logs. The replica can no longer catch up via streaming and needs a full pg_basebackup restart. Set wal_keep_size to at least 10 minutes of WAL volume at peak write rate, or use WAL archiving to S3 as a safety net — set archive_mode = on and archive_command to copy WAL to durable storage. Monitor: SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) FROM pg_stat_replication;
📊 Production Insight
A team set wal_keep_size = 512MB on a primary that generated 200MB/min of WAL. A replica disconnected for 3 minutes for maintenance. The primary rotated past 512MB of WAL in 2.5 minutes. Replica reconnected, asked for WAL segment 0/5A000028, primary responded 'segment removed'. Replica needed a full pg_basebackup — 2 hours of downtime.
Fix: set wal_keep_size = 5GB, monitor pg_stat_replication lag_bytes, add WAL archiving to S3.
Rule: wal_keep_size × 2 should cover your longest expected replica downtime × peak WAL generation rate.
🎯 Key Takeaway
Streaming replication: primary → walsender → TCP → walreceiver → recovery apply on replica.
Monitor both pg_stat_replication (primary) and pg_stat_wal_receiver (replica).
Physical replicas are exact copies — same indexes, same bloat, same everything.
wal_keep_size must be larger than your worst-case replica disconnect time × write rate.

Connection Routing Without Trusting Your Application Developers

Here's the brutal truth about read replica routing: if you rely on application developers to manually choose between a primary DSN and a replica DSN in their code, you will have primary reads that should be on the replica and replica reads that should be on the primary, and you won't know which is which until something breaks. Application developers are solving business problems. Database topology is not their mental model and it shouldn't have to be.

The right architecture puts routing below the application layer. PgBouncer is the most common first layer — it pools connections and can be configured with multiple backends. But PgBouncer alone doesn't route by intent. For intent-based routing you need either a proxy that understands read/write splitting (Pgpool-II, RDS Proxy, or a sidecar proxy like ProxySQL's PostgreSQL mode) or a connection wrapper in your DAL that maps transaction context to the right pool.

The pattern I've seen work in production at scale: two PgBouncer pools — one pointing to primary, one pointing to replica — with a lightweight DAL wrapper that routes based on transaction mode. Any call inside an explicit BEGIN...COMMIT block goes to primary. Any bare SELECT outside a transaction block goes to the replica pool. This gets you 80% of the benefit with minimal application code changes. The remaining 20% — cases where a SELECT needs fresh data but isn't in a write transaction — you handle with an explicit annotation or by accepting primary reads for those paths.

One anti-pattern to avoid: putting the replica pool behind a load balancer that does health checks but not read/write splitting. You'll end up with reads on primary anyway. Another trap: using statement-based routing (e.g., check if query starts with 'SELECT') fails for SELECT inside transactions, SELECT...FOR UPDATE (needs primary), and CTEs. The transaction-boundary approach is simpler and more reliable.

io/thecodeforge/postgres/DatabaseRouter.py · PYTHON
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115
# io.thecodeforge.postgres.routing — Production routing layer

import contextlib
from enum import Enum, auto
from typing import Generator
import psycopg2
import psycopg2.pool
from threading import local


class ReadPolicy(Enum):
    EVENTUAL = auto()   # replica pool — accepts lag
    STRONG = auto()     # primary pool — fresh data


class DatabaseRouter:
    """
    Routes database connections to primary or replica based on context.
    Thread-local state tracks whether we're inside a write transaction.
    
    PgBouncer endpoints:
      - PRIMARY:  pgbouncer-primary:6432  (transaction mode)
      - REPLICA:  pgbouncer-replica:6432  (session mode for replicas)
    """

    def __init__(self, primary_dsn: str, replica_dsn: str, pool_size: int = 20):
        self._local = local()
        self._primary_pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=2, maxconn=pool_size, dsn=primary_dsn,
            application_name='order_service_primary'
        )
        self._replica_pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=5, maxconn=pool_size * 2, dsn=replica_dsn,
            application_name='order_service_replica'
        )

    def _in_write_transaction(self) -> bool:
        return getattr(self._local, 'write_transaction_active', False)

    def _get_connection(self, policy: ReadPolicy):
        # If inside a write transaction, MUST use primary — consistency overrides
        if self._in_write_transaction() or policy == ReadPolicy.STRONG:
            return self._primary_pool.getconn(), self._primary_pool
        return self._replica_pool.getconn(), self._replica_pool

    @contextlib.contextmanager
    def write_transaction(self) -> Generator:
        """All reads inside this block go to primary — prevents stale-read bugs."""
        conn = self._primary_pool.getconn()
        self._local.write_transaction_active = True
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            self._local.write_transaction_active = False
            self._primary_pool.putconn(conn)

    @contextlib.contextmanager
    def read_query(self, policy: ReadPolicy = ReadPolicy.EVENTUAL) -> Generator:
        """Default to replica — caller must opt into STRONG for fresh reads."""
        conn, pool = self._get_connection(policy)
        try:
            yield conn
            conn.rollback()  # ends implicit transaction, returns to pool cleanly
        except Exception:
            conn.rollback()
            raise
        finally:
            pool.putconn(conn)


# ============================================================
# Checkout flow with proper routing
# ============================================================

def process_checkout(router: DatabaseRouter, cart_id: str, user_id: str) -> dict:
    # Inventory check — replica is fine, lag acceptable
    with router.read_query(ReadPolicy.EVENTUAL) as conn:
        cursor = conn.cursor()
        cursor.execute(
            'SELECT product_id, quantity FROM inventory WHERE cart_id = %s',
            (cart_id,)
        )
        inventory = cursor.fetchall()

    if not inventory:
        raise ValueError('Cart empty')

    # Write transaction — everything inside hits primary
    with router.write_transaction() as conn:
        cursor = conn.cursor()
        cursor.execute(
            'INSERT INTO orders (cart_id, user_id, status) VALUES (%s, %s, %s) RETURNING id',
            (cart_id, user_id, 'pending')
        )
        order_id = cursor.fetchone()[0]

        cursor.execute(
            'INSERT INTO payments (order_id, status, created_at) VALUES (%s, %s, NOW())',
            (order_id, 'initiated')
        )

    # Order summary read — replica is fine, display only
    with router.read_query(ReadPolicy.EVENTUAL) as conn:
        cursor = conn.cursor()
        cursor.execute(
            'SELECT id, status, created_at FROM orders WHERE id = %s',
            (order_id,)
        )
        order = cursor.fetchone()

    return {'order_id': order_id, 'status': order[1] if order else 'pending'}
▶ Output
# No runtime output — this is a module.
# When deployed:
# write_transaction() → pgbouncer-primary:6432
# read_query(EVENTUAL) → pgbouncer-replica:6432
# read_query(STRONG) → pgbouncer-primary:6432
#
# pg_stat_activity shows:
# application_name = 'order_service_primary' on primary
# application_name = 'order_service_replica' on replica
# Lets you immediately see which app is hitting which host in monitoring.
⚠ Never Do This: SELECT After INSERT Without Explicit Routing
INSERT to primary, then SELECT from replica to confirm the insert — this returns the old state if lag >0ms, which is always. The exact symptom: your service inserts a record, immediately reads it back to populate a response, gets zero rows, then throws a 404 or NullPointerException. The fix: any SELECT that confirms or reads data written in the same request must go to primary, or use RETURNING clause in the INSERT to get the data back in the same round-trip without a second query at all.
📊 Production Insight
The double-charge incident at the start of this article happened because a SELECT after INSERT went to the replica. The developer assumed 'replication lag is milliseconds' — but under peak load, it hit 800ms. The retry logic re-read status every 200ms. On the third retry, the replica still hadn't applied the INSERT. The customer was charged twice.
Rule: any read that gates a financial write goes to primary. Document this in your runbook and enforce it in code reviews. The cost of a double-charge incident is ten times the cost of an extra primary read.
🎯 Key Takeaway
Route at the connection pool / DAL level — don't make app devs choose.
Transaction boundary is the correct routing signal: BEGIN...COMMIT blocks go to primary.
SELECT after INSERT must go to primary, or use RETURNING to avoid the SELECT entirely.
Monitor pg_stat_activity.application_name to verify routing is working as intended.

Replication Lag: Why It Happens, How to Measure It, When to Panic

Replication lag is not a bug. It's a fundamental property of async replication. What is a bug is building a system that assumes lag is zero, or monitoring for it reactively instead of proactively. I've watched teams add read replicas to solve a performance problem and spend the next month firefighting stale-read bugs that were ten times harder to debug than the original performance problem.

Lag accumulates from three different places and you need to know which one is causing yours. Network lag is the time for WAL bytes to travel from primary to replica — usually <5ms on properly networked instances in the same AZ. I/O lag on the replica is the time to write and fsync incoming WAL segments — on a replica with an undersized disk or competing I/O from read queries, this is where lag balloons. Apply lag is the time for the recovery process to apply WAL records to data pages — on tables with high contention or during bulk loads, apply lag can spike to seconds even with fast I/O.

The real danger is cumulative lag under write bursts. If your primary ingests a bulk load at 200MB/s of WAL for 60 seconds, and your replica can only sustain 150MB/s of apply throughput, you're 6GB of WAL behind when the load stops. At steady state the replica catches up — but for those minutes, any read against the replica is returning data that's potentially minutes old. You need to monitor pg_stat_replication on primary continuously, alert at 10 seconds of lag, and have a circuit breaker that routes to primary automatically when lag exceeds your SLA threshold.

A subtle point: replica lag reported in bytes (pg_wal_lsn_diff) is more precise than lag in seconds (replay_lag). Bytes tell you exactly how much work is pending. Seconds include network jitter and OS scheduling. For alerting, use seconds with a hysteresis (e.g., lag > 30s for 2 consecutive checks). For root cause analysis, examine the three byte-level components.

io/thecodeforge/postgres/ReplicationLagMonitoring.sql · SQL
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
-- ============================================================
-- Comprehensive lag monitoring view — create on PRIMARY
-- ============================================================

CREATE OR REPLACE VIEW v_replication_health AS
SELECT
    r.application_name,
    r.client_addr,
    r.state,
    r.sync_state,
    -- Byte-level lag breakdown
    pg_wal_lsn_diff(r.sent_lsn, r.write_lsn)   AS network_lag_bytes,
    pg_wal_lsn_diff(r.write_lsn, r.flush_lsn)  AS disk_lag_bytes,
    pg_wal_lsn_diff(r.flush_lsn, r.replay_lsn) AS apply_lag_bytes,
    pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn)  AS total_lag_bytes,
    r.replay_lag,                               -- human time
    rs.slot_name,
    rs.active AS slot_active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), rs.confirmed_flush_lsn) AS slot_lag_bytes
FROM
    pg_stat_replication r
    LEFT JOIN pg_replication_slots rs ON rs.active_pid = r.pid
ORDER BY total_lag_bytes DESC;

-- ============================================================
-- Circuit breaker: check if replica is fresh enough for reads
-- Run on REPLICA — embed this in your connection pool health check
-- ============================================================

CREATE OR REPLACE FUNCTION fn_replica_is_fresh(
    max_lag_seconds NUMERIC DEFAULT 30
) RETURNS BOOLEAN AS $$
DECLARE
    lag_seconds NUMERIC;
BEGIN
    SELECT COALESCE(
        EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())),
        999999  -- if NULL, replica hasn't replayed anything yet
    ) INTO lag_seconds;
    RETURN lag_seconds <= max_lag_seconds;
END;
$$ LANGUAGE plpgsql STABLE;

-- Usage: SELECT fn_replica_is_fresh(30); -- TRUE if lag < 30s

-- ============================================================
-- Find what's causing lag spikes: conflicting queries on replica
-- Run on REPLICA
-- ============================================================

SELECT
    pid,
    now() - query_start AS duration,
    wait_event_type,
    wait_event,
    LEFT(query, 120) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
  AND (wait_event_type = 'Lock' OR wait_event LIKE '%recovery%')
ORDER BY duration DESC;
▶ Output
-- v_replication_health on PRIMARY:
-- application_name | network_lag_bytes | disk_lag_bytes | apply_lag_bytes | total_lag_bytes | replay_lag
-- replica_01 | 0 | 0 | 8192 | 8192 | 00:00:00.8

-- fn_replica_is_fresh(30) on REPLICA:
-- fn_replica_is_fresh
-- t

-- Conflict query on REPLICA:
-- pid | duration | wait_event_type | wait_event | query_preview
-- 1234 | 00:02:15 | Lock | relation | SELECT COUNT(*) FROM orders WHERE status = 'pending'
⚠ Production Trap: Replication Slots Without Monitoring = Disk Death
If you use a replication slot (required for logical replication, optional for physical) and the replica goes offline, the primary will retain ALL WAL produced since the slot went inactive — forever — until the replica reconnects. I've seen this fill a 500GB disk in under 6 hours during a replica maintenance window that ran long. The error when disk fills: 'PANIC: could not write to file pg_wal/XXXX: No space left on device' — which crashes the primary. Run SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS retained_bytes FROM pg_replication_slots; and alert if retained_bytes exceeds 20GB. If a slot goes inactive and you can't reconnect the replica in minutes, DROP the slot immediately and rebuild the replica.
📊 Production Insight
A healthcare analytics platform had a replica with 3TB of storage. A developer ran a query that took 45 minutes on the replica, blocking WAL apply for the entire duration. Replay lag grew to 45 minutes. The monitoring system was only checking write_lag, which remained low (WAL was being received and written, just not applied). No alert fired. The dashboard showed data from 45 minutes ago for an hour before anyone noticed.
Fix: add alert on replay_lag and apply_lag_bytes. Also implement pg_stat_activity monitoring that kills queries running longer than max_standby_streaming_delay.
🎯 Key Takeaway
Lag has three components: network, disk I/O, and apply. Monitor all three — each points to a different bottleneck.
Use circuit breakers: if lag > SLA, route reads back to primary automatically.
Replication slots are a double-edged sword — they prevent resync but can fill your disk if a replica dies.
Alert on replay_lag, not just write_lag. Write_lag being low doesn't mean your data is fresh.

Promotion and Failover: What Actually Happens When Your Primary Dies

Failover is where most PostgreSQL read replica setups reveal they were built with optimism rather than engineering. The scenario: primary becomes unavailable. Every second of downtime is revenue loss. You need to promote a replica to primary, update your connection strings, ensure the old primary never comes back as a rogue writer, and ideally do all of this without data loss. If you haven't rehearsed this, you don't actually have HA — you have a hope.

PostgreSQL's promotion mechanism is clean: on the standby, run pg_ctl promote or create a trigger file, and the replica exits recovery mode and starts accepting writes. The problem is everything around it. Your application's connection pool is still pointing at the dead primary's IP. Other replicas are still trying to stream from the dead primary. If the old primary recovers and starts accepting writes while the new primary is also accepting writes — split-brain — you have two diverging transaction histories and neither node knows about the other's writes.

Patroni is the industry answer to this. It uses a distributed consensus store (etcd or Consul) to maintain a single authoritative leader, handles automatic promotion, updates a virtual IP or DNS entry that your application uses, fences the old primary (prevents it from accepting writes even if it recovers), and handles adding the old primary back as a replica via pg_rewind. If you're running PostgreSQL in production without Patroni or an equivalent, you have a manual failover runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.

A specific gotcha: after promoting a replica, the timeline ID changes. The old primary, if it comes back online, is on a diverged timeline. It cannot rejoin as a replica without pg_rewind, which replays WAL from the divergence point to bring it back into sync. Running pg_rewind requires the old primary to be stopped and requires the new primary to have wal_log_hints = on or data checksums enabled. If those aren't set, pg_rewind fails and you need a full pg_basebackup.

io/thecodeforge/postgres/FailoverValidation.sql · SQL
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152
-- ============================================================
-- Pre-failover health check — run on PRIMARY
-- ============================================================

SELECT
    application_name,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS unconfirmed_bytes,
    replay_lag,
    sync_state
FROM pg_stat_replication
ORDER BY unconfirmed_bytes ASC;

-- For zero-RPO failover: only promote if unconfirmed_bytes = 0

-- ============================================================
-- Manual promotion (run on REPLICA being promoted)
-- ============================================================

-- Shell: pg_ctl promote -D /var/lib/postgresql/15/main
-- OR if trigger_file is configured: touch /tmp/postgresql.trigger.5432

-- After promotion, confirm:
SELECT
    pg_is_in_recovery() AS is_still_replica,     -- should be false
    pg_current_wal_lsn() AS current_lsn,
    current_setting('transaction_read_only') AS read_only;

-- ============================================================
-- After promotion: re-sync old primary using pg_rewind
-- Run on OLD PRIMARY (now stopped)
-- ============================================================

-- pg_rewind \
--   --target-pgdata=/var/lib/postgresql/15/main \
--   --source-server='host=10.0.1.50 user=replicator dbname=postgres' \
--   --progress
--
-- Then: echo 'primary_conninfo = ...' >> postgresql.auto.conf
-- touch standby.signal
-- pg_ctl start

-- ============================================================
-- Verify topology after failover — run on new PRIMARY
-- ============================================================

SELECT
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
▶ Output
-- Pre-failover (replica is caught up):
-- application_name | state | unconfirmed_bytes | replay_lag
-- replica_01 | streaming | 0 | 00:00:00

-- After promotion (on new primary):
-- is_still_replica | current_lsn | read_only
-- f | 0/5B000028 | off

-- Topology after re-sync:
-- application_name | client_addr | state | lag_bytes
-- old_primary | 10.0.1.10 | streaming | 0
⚠ Production Trap: The Zombie Primary That Keeps Writing
After promoting a replica to primary, if the old primary recovers from a network partition (not actually dead — just unreachable) and restarts without being fenced, you have two nodes accepting writes. Your application will write to both, guided by stale DNS or stale connection pool entries. This is split-brain and it corrupts both databases. The fix: STONITH (Shoot The Other Node In The Head) — Patroni uses watchdog devices or cloud API calls to forcibly power off the old primary before completing promotion. If you can't fence the old primary, you can't safely complete failover. This is why you never promote a replica manually without confirming the old primary is unreachable and fenced.
📊 Production Insight
A fintech company had a primary database in AWS us-east-1 and a replica in us-west-2. A routing issue made us-east-1 unreachable from the application but the database was still running. The on-call engineer manually promoted the us-west-2 replica. 5 minutes later, network connectivity recovered. Suddenly both us-east-1 and us-west-2 were accepting writes. The application wrote to whichever node had the faster network path. Two different transaction histories evolved. Reconciling the split-brain took 3 days of manual data repair.
Rule: never promote a replica manually without fencing the old primary — power it down, block its security group, or use Patroni's STONITH. The 5 minutes of extra downtime is worth avoiding split-brain.
🎯 Key Takeaway
Failover is not just pg_ctl promote — it's fencing, DNS, and re-sync of other replicas.
Patroni is the industry standard for automated failover; don't build your own.
pg_rewind re-syncs a diverged primary after failover — requires wal_log_hints = on.
Split-brain is the single most dangerous outcome of a failover. Fence the old primary before promotion.
🗂 Async vs Sync vs Logical Replication
Trade-offs between replication modes — none is universally best
AspectAsync Streaming ReplicationSync Streaming ReplicationLogical Replication
Replication lagMilliseconds to seconds under loadZero — primary waits for replica ackSeconds to minutes depending on change volume
Write throughput impact on primaryNear-zero — fire and forgetSignificant — every write waits for remote flush/apply (2-5x latency)Low — only published tables are replicated
RPO on primary failureUp to several seconds of data lossZero data loss (remote_apply) or minimal (remote_write)Variable — depends on subscription lag
Replica structureIdentical physical copyIdentical physical copySubscriber can have different indexes, partitioning
Can promote to primary?Yes — full promotion via pg_ctl promoteYes — zero data loss promotionNo — subscriber can't be promoted
DDL on replicaNo — follows primary exactlyNo — follows primary exactlyNo — DDL not replicated automatically
Use casePrimary HA replica and read scalingFinancial systems, zero-RPO requirementPartial replication, cross-version upgrades

🎯 Key Takeaways

  • Replication lag is not a failure mode — it's a design parameter. Define your RPO and SLA before choosing async vs sync replication, not after your first stale-read incident.
  • Any read that gates a financial write goes to the primary. No exceptions. Not even when the primary CPU is spiking. A double-charge is ten times more expensive to resolve than an extra primary read.
  • Reach for read replicas when your pg_stat_activity shows read queries holding connections and CPU while write latency is climbing — not when someone just says 'we need to scale'. The symptom is the signal.
  • Logical replication and physical streaming replication look similar from the outside but are completely different animals under the hood — physical replicas can be promoted to primary during failover, logical subscribers cannot. Mixing them in your HA pool is a disaster waiting for a production incident to express itself.
  • walsender → walreceiver → recovery apply. Monitor each stage with sent_lsn, write_lsn, flush_lsn, replay_lsn. High replay_lag with low write_lag means recovery contention — long queries on the replica are blocking WAL apply.
  • Patroni is the industry standard for automated failover; don't build your own. If you're not using Patroni, you have a manual runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.
  • wal_keep_size + WAL archiving is safer than replication slots for most setups. Slots prevent resync but can fill your disk. Monitor pg_replication_slots.retained_bytes and alert at 10GB.

⚠ Common Mistakes to Avoid

    Pointing analytics queries at the same replica handling OLTP read traffic
    Symptom

    Production reads start timing out, lag spikes to 30+ seconds, 'ERROR: canceling statement due to conflict with recovery' appears in replica logs.

    Fix

    Run a dedicated analytics replica with max_standby_streaming_delay = -1 and hot_standby_feedback = on, isolated from the OLTP replica pool entirely. Use separate PgBouncer pools with distinct application_names to monitor separation.

    Using a replication slot on a replica that has routine maintenance windows
    Symptom

    Primary disk fills during the maintenance window, 'PANIC: could not write to file pg_wal: No space left on device' crashes the primary.

    Fix

    Don't use replication slots for physical streaming unless you need them (e.g., Patroni uses them internally). For simple streaming replication, wal_keep_size is safer. Monitor SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots; and alert above 10GB. If a slot goes inactive, drop it and rebuild the replica.

    Running pg_basebackup without --wal-method=stream
    Symptom

    Base backup completes but replica fails to start with 'FATAL: could not connect to the primary server' or immediately falls behind because it's missing WAL from the backup window.

    Fix

    Always use --wal-method=stream during pg_basebackup. This streams WAL produced during the backup in parallel, so the replica has a consistent starting point. Alternatively configure WAL archiving so the replica can fetch the missing segments from archive.

    Assuming replica will automatically reconnect after primary IP change
    Symptom

    Replica goes to 'disconnected' state, pg_stat_wal_receiver shows 'waiting for WAL to become available' indefinitely.

    Fix

    primary_conninfo uses a hostname backed by a DNS record that updates on failover (Patroni's virtual IP or Route53 with short TTL). Never hardcode IP addresses in primary_conninfo — use a stable DNS name. For Kubernetes, use a Service with a stable ClusterIP.

    Not testing promotion before you need it
    Symptom

    During an actual outage, promotion succeeds but the application's connection pool doesn't reconnect because it cached the dead primary's connection, leading to continued downtime.

    Fix

    Test failover in staging monthly. Verify PgBouncer or your proxy updates its upstream on promotion. Confirm application connection retry logic handles the reconnect window with exponential backoff. The first failover you run should never be a real incident. Document the RTO and RPO of your failover process.

Interview Questions on This Topic

  • QWalk me through exactly what happens at the WAL level when a PostgreSQL primary executes an UPDATE while a streaming replica is connected — where does each process hand off to the next, and where can lag accumulate in that chain?SeniorReveal
    The primary writes the UPDATE to WAL. The walsender process reads that WAL record and streams it over TCP to the replica's walreceiver. The walreceiver writes the record to replica's WAL (write_lag). The walreceiver then flushes the WAL to disk (flush_lag). The replica's recovery process replays the WAL record against replica's data pages (replay_lag). Lag accumulates at each stage: network latency (write to send), replica disk I/O (write to flush), and recovery contention (flush to replay). The primary does not wait for any of these in async mode — it just sends and forgets, which is why data loss on primary failure is possible. The replay stage is single-threaded, so a long-running query on the replica can stall replay entirely, causing replay_lag to spike while write_lag remains low.
  • QYou have a read replica that's receiving 10,000 read queries per minute from a reporting dashboard. The DBA wants to add an index to speed up the report queries. What are your options given that the replica is a physical streaming standby, and what are the trade-offs of each?SeniorReveal
    Physical streaming replicas are exact copies of the primary — you cannot add an index on the replica that doesn't exist on the primary. Options: (1) Add the index on the primary — it will replicate to the replica automatically. Trade-off: writes on primary will be slower due to index maintenance, and the index will consume storage on both nodes. (2) Create a separate logical replication subscriber instead of a physical standby. Logical replication allows different indexes on subscriber. Trade-off: logical replicas cannot be promoted to primary for HA (they're not crash-consistent). (3) Create a dedicated analytics replica using pg_rewind to fork the data directory, then add indexes only on that node — but this means the analytics replica is no longer receiving updates from the primary. You'd need to refresh it periodically. (4) Use a materialised view on the replica — but refreshing it loses the benefit of real-time data. The correct choice depends on whether HA promotion is required for this node. For a pure analytics node where HA isn't required, logical replication is the cleanest answer.
  • QYou've promoted a replica to primary during an incident. The old primary comes back online 20 minutes later. What exact steps do you take to safely re-integrate it as a replica, and what's the single most dangerous thing that can happen if you get the sequence wrong?SeniorReveal
    Steps: (1) On the old primary, shut down PostgreSQL with pg_ctl stop -m fast. (2) Run pg_rewind with the new primary as source: pg_rewind --target-pgdata=/path/to/old --source-server='host=new_primary_ip' (requires wal_log_hints = on or data checksums enabled). (3) After pg_rewind succeeds, update primary_conninfo to point to the new primary. (4) Create standby.signal file in the old data directory. (5) Start PostgreSQL on the old primary as a replica. The most dangerous thing: starting the old primary without running pg_rewind first. If the old primary is started in recovery mode, it will attempt to replay WAL from its own timeline — but it diverged from the new primary's timeline at the promotion point. The result is two diverged databases that accept writes independently — split-brain. Connections will route to both, data corruption follows. To prevent this, always fence the old primary (power it off, isolate its network) before promoting a replica, and never start it without pg_rewind.
  • QYour synchronous replica is configured with synchronous_commit = remote_apply. The replica host has a 500ms disk I/O spike. Describe exactly what happens to write latency on the primary during those 500ms, what error or timeout the application sees if it exceeds synchronous_standby_names wait limits, and what configuration change you'd make to add resilience without abandoning synchronous replication entirely.SeniorReveal
    With synchronous_commit = remote_apply, the primary waits for the replica to confirm that the WAL has been written, flushed, AND applied to data pages before returning COMMIT to the client. During the 500ms I/O spike on the replica, the primary blocks on every COMMIT. The application experiences write latency that mirrors the replica's slowdown: each COMMIT takes an additional 500ms plus network round-trip. If the replica's I/O spike continues longer than synchronous_standby_names's timeout (there is no explicit timeout — the primary waits forever unless you set synchronous_standby_names with a quorum configuration), the primary will stall indefinitely. The application will see hung connections, not explicit errors. To add resilience: use quorum synchronous replication: synchronous_standby_names = 'ANY 1 (replica1, replica2)'. This tells PostgreSQL to wait for any ONE of the two replicas to acknowledge. If replica1 has a disk spike, replica2 can still ACK and writes continue. The trade-off: you need at least two replicas, and you lose zero-RPO if both replicas fail simultaneously, but you gain write availability during single-replica degradation.

Frequently Asked Questions

How much replication lag is acceptable for a PostgreSQL read replica?

It depends entirely on what the replica's reads are gating downstream. For display-only reads with no downstream writes — product listings, dashboards, reporting — up to 30 seconds of lag is often fine. For anything that influences a write decision — inventory checks, payment status reads, permission checks — lag tolerance is zero and those reads belong on the primary. Set your alert threshold at 10 seconds for OLTP replicas and treat anything above 30 seconds as a page-worthy incident. For financial systems, use synchronous_commit = remote_apply on the transactions that matter, not async replicas.

What's the difference between a read replica and a standby in PostgreSQL?

They're the same thing — PostgreSQL calls it a standby; cloud providers call it a read replica. A physical standby in hot_standby mode accepts read queries while continuously applying WAL from the primary. The term 'read replica' is a cloud marketing wrapper around PostgreSQL's physical streaming standby. The concrete rule: if it's in recovery mode (pg_is_in_recovery() = true) and accepts reads, it's what AWS/GCP call a read replica. Logical replication subscribers are different — they accept writes and DDL but cannot be promoted to primary for HA.

How do I force a specific query to always run on the primary and never hit the replica?

Route it at the connection level, not the query level. The cleanest pattern: maintain two connection pools — one pointing to your primary DSN, one to your replica DSN — and route based on intent annotations in your DAL. Don't put routing hints in SQL comments and hope your proxy interprets them correctly. If you're using PgBouncer, configure two separate listen ports — one for primary, one for replica — and have your application code choose the port. The alternative is a SET command: 'SET transaction_read_only = off' on a replica will fail immediately, which you can use as a circuit breaker, but that's a hack not a strategy. For Postgres drivers that support 'target_session_attrs=read-write', use that in the connection string to prefer primaries.

Can PostgreSQL read replicas fall so far behind that they can never catch up?

Yes — and it's not rare. If the primary generates WAL faster than the replica can apply it on a sustained basis, the replica never catches up. The early warning sign is apply_lag_bytes increasing monotonically in pg_stat_replication even when primary write volume is constant. This happens when the replica's disk is undersized relative to the primary's write throughput, or when long-running read queries on the replica continuously conflict with WAL apply, forcing the recovery process to wait or cancel. The permanent fix is either faster replica storage (NVMe vs SSD), a dedicated analytics replica for heavy queries, or reducing primary write volume by batching writes. The temporary fix — cancelling conflicting queries — buys time but doesn't solve the structural problem. If the replica falls months behind, the fastest recovery is often to drop it and recreate with a fresh pg_basebackup.

What's the performance impact of turning on hot_standby_feedback?

hot_standby_feedback tells the primary's backends that the replica still has open transactions that might need older row versions. This prevents vacuum from cleaning those rows, causing bloat on the primary. The trade-off: without it, long-running queries on the replica get cancelled during vacuum conflicts. With it, your primary tables bloat more. Monitor pg_stat_user_tables.n_dead_tup and autovacuum effectiveness. For replicas with short queries (<10s), leave hot_standby_feedback off. For replicas with long analytics queries, turn it on but also increase autovacuum_vacuum_scale_factor on primary. The exact values depend on your write volume; start with scale_factor=0.05 and monitor bloat weekly. In practice, most OLTP replicas with sub-second queries can keep it off.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousDatabase Monitoring ToolsNext →SQL Server Management Studio (SSMS): Installation and Features
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged