Senior 16 min · April 26, 2026

PostgreSQL Read Replicas — Lag Caused 400 Double Charges

Q: How much replication lag is acceptable for a PostgreSQL read replica?

It depends entirely on what the replica's reads are gating downstream. For display-only reads with no downstream writes — product listings, dashboards, reporting — up to 30 seconds of lag is often fine. For anything that influences a write decision — inventory checks, payment status reads, permission checks — lag tolerance is zero and those reads belong on the primary. Set your alert threshold at 10 seconds for OLTP replicas and treat anything above 30 seconds as a page-worthy incident. For financial systems, use synchronous_commit = remote_apply on the transactions that matter, not async replicas.

Q: What's the difference between a read replica and a standby in PostgreSQL?

They're the same thing — PostgreSQL calls it a standby; cloud providers call it a read replica. A physical standby in hot_standby mode accepts read queries while continuously applying WAL from the primary. The term 'read replica' is a cloud marketing wrapper around PostgreSQL's physical streaming standby. The concrete rule: if it's in recovery mode (pg_is_in_recovery() = true) and accepts reads, it's what AWS/GCP call a read replica. Logical replication subscribers are different — they accept writes and DDL but cannot be promoted to primary for HA.

Q: How do I force a specific query to always run on the primary and never hit the replica?

Route it at the connection level, not the query level. The cleanest pattern: maintain two connection pools — one pointing to your primary DSN, one to your replica DSN — and route based on intent annotations in your DAL. Don't put routing hints in SQL comments and hope your proxy interprets them correctly. If you're using PgBouncer, configure two separate listen ports — one for primary, one for replica — and have your application code choose the port. The alternative is a SET command: 'SET transaction_read_only = off' on a replica will fail immediately, which you can use as a circuit breaker, but that's a hack not a strategy. For Postgres drivers that support 'target_session_attrs=read-write', use that in the connection string to prefer primaries.

Q: Can PostgreSQL read replicas fall so far behind that they can never catch up?

Yes — and it's not rare. If the primary generates WAL faster than the replica can apply it on a sustained basis, the replica never catches up. The early warning sign is apply_lag_bytes increasing monotonically in pg_stat_replication even when primary write volume is constant. This happens when the replica's disk is undersized relative to the primary's write throughput, or when long-running read queries on the replica continuously conflict with WAL apply, forcing the recovery process to wait or cancel. The permanent fix is either faster replica storage (NVMe vs SSD), a dedicated analytics replica for heavy queries, or reducing primary write volume by batching writes. The temporary fix — cancelling conflicting queries — buys time but doesn't solve the structural problem. If the replica falls months behind, the fastest recovery is often to drop it and recreate with a fresh pg_basebackup.

Q: What's the performance impact of turning on hot_standby_feedback?

hot_standby_feedback tells the primary's backends that the replica still has open transactions that might need older row versions. This prevents vacuum from cleaning those rows, causing bloat on the primary. The trade-off: without it, long-running queries on the replica get cancelled during vacuum conflicts. With it, your primary tables bloat more. Monitor pg_stat_user_tables.n_dead_tup and autovacuum effectiveness. For replicas with short queries (<10s), leave hot_standby_feedback off. For replicas with long analytics queries, turn it on but also increase autovacuum_vacuum_scale_factor on primary. The exact values depend on your write volume; start with scale_factor=0.05 and monitor bloat weekly. In practice, most OLTP replicas with sub-second queries can keep it off.

A 800ms-2s replica lag caused 400 double charges despite Stripe idempotency keys.

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Written from production experience, not tutorials.

✓ Production

production tested

May 23, 2026

last updated

1,554

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

A read replica is a hot standby that applies WAL from the primary and serves read queries — but it's always behind the primary by some replication lag.
Streaming replication: primary runs walsender → replica runs walreceiver → recovery process replays WAL. Lag accumulates at network, disk I/O, or apply stage.
Replica reads are eventually consistent — writing to primary then SELECTing from replica is a race condition that will lose updates.
Performance insight: a 2048-bit RSA key encrypts 256 bytes; OAEP padding adds ~42 bytes overhead, reducing max plaintext to 214 bytes.
Production pitfall: routing a read that gates a financial transaction to the replica — replication lag as low as 200ms causes double charges.
Biggest mistake: treating synchronous_commit as a panacea — it only confirms WAL arrived on replica, not that it's applied. Use synchronous_commit = remote_apply for zero-lag reads, but accept 2-5x write latency penalty.

✦ Definition~90s read

What is Read Replicas in PostgreSQL?

PostgreSQL read replicas (standbys) are copies of your primary database that receive a continuous stream of write-ahead log (WAL) records via streaming replication. They exist to offload read queries—analytics, reporting, dashboard refreshes—from the primary, which is already busy handling writes and maintaining transactional consistency.

★

Imagine your company's most popular restaurant has one chef and fifty hungry customers.

The core trade-off is that replicas are eventually consistent: they lag behind the primary by some amount of time or data, and that lag can be milliseconds or minutes depending on network, hardware, and workload. When you route a read that depends on a recent write to a replica that hasn't caught up, you get stale data—and in the case that triggered this article, that stale data caused a billing system to double-charge customers because it read 'no charge yet' from a lagging replica, then the primary committed the charge, and the application retried the operation.

This is a classic distributed systems footgun: replicas are not a free scaling lever, they're a consistency hazard you must design around.

Under the hood, streaming replication works by the primary writing every transaction's WAL records to disk, then shipping them to one or more standby servers over a TCP connection. The standby replays those records in real time, rebuilding the exact same data pages.

The lag you care about is the gap between the last WAL position the primary has written and the last one the standby has replayed. You measure it with pg_stat_replication (replay_lag) or by comparing system timestamps. Panic when lag exceeds your application's tolerance—for a billing system that's probably under 100ms.

The fix isn't to make replicas faster (though faster disks help), it's to route reads intelligently: either use synchronous_commit on the primary (which kills write throughput) or, more practically, have your connection pooler or proxy check replica lag before routing and fall back to the primary when lag is too high. Never trust application developers to remember this—they won't.

Failover promotion is where replicas earn their keep beyond read scaling. When the primary dies, you promote a standby to become the new primary by running pg_ctl promote or using a tool like Patroni. This replays the remaining WAL, opens the replica for writes, and you redirect your application connections.

The danger: if you promoted a replica that was lagging, you lose any transactions that hadn't arrived yet. That's why you use synchronous replication for critical data—it ensures at least one standby has every committed transaction before the primary acknowledges the commit.

Cascading replication lets you chain standbys (replica feeding another replica) to reduce load on the primary, but it increases lag at each hop. For billing systems, never cascade—keep your replicas directly connected to the primary and monitor lag with alerts, not dashboards.

Plain-English First

Imagine your company's most popular restaurant has one chef and fifty hungry customers. The chef can cook AND take orders, but doing both at once means both get done badly. So you hire a team of runners who memorised the entire menu — they can answer every question customers have, take drink orders, describe every dish in detail, but they can't actually cook anything. That's a read replica: an exact copy of your database that can answer every read question your app throws at it, so your primary database can focus entirely on writes. The 'menu' gets updated to the runners every few seconds — that slight delay is replication lag, and it's the thing that will eventually cause you a production incident if you don't plan for it.

Your primary database is not a read server that also handles writes — treating it like one is how you end up with a 4AM PagerDuty alert because a reporting query from the analytics team locked up your checkout flow. Read replicas aren't just a scaling trick; they're an architectural boundary that forces you to think honestly about which operations actually need the freshest data and which ones are lying to themselves about it.

The problem read replicas solve is brutally simple: PostgreSQL's primary instance is single-writer by design. Every SELECT that hits the primary competes for buffer pool space, CPU cycles, and I/O bandwidth with every INSERT, UPDATE, and DELETE in your write path. At low load this is invisible. At production scale with mixed OLTP and reporting workloads running against the same instance, you start seeing query plans degrade, autovacuum fall behind, and connection counts spike in ways that look like traffic problems but are really resource contention problems. The old hack was to run analytics queries at 2AM with pg_cron and hope the batch window held. It doesn't hold. It never holds.

After reading this you'll be able to set up streaming replication with correct recovery configuration, route read traffic intelligently without trusting application developers to do it manually, diagnose and mitigate replication lag before it causes stale-read incidents, design a promotion strategy that survives primary failure without data loss, and know the exact scenarios where read replicas will make your architecture worse, not better.

How Read Replicas in PostgreSQL Actually Work — And Why Lag Matters

A read replica is a separate PostgreSQL instance that continuously applies write-ahead log (WAL) data from a primary server. The primary handles all writes; replicas serve read-only queries. This is asynchronous streaming replication by default — the replica is always behind the primary by some amount of time.

Replicas apply WAL in a single process, sequentially. A heavy write burst on the primary (e.g., batch inserts, index rebuilds) creates a backlog on the replica. The replica's apply lag grows. Queries on the replica see a stale snapshot. PostgreSQL does not block reads on the replica — it serves whatever data has been replayed so far. There is no built-in mechanism to wait for consistency.

Use read replicas when your read workload exceeds a single node's capacity or when you want to isolate reporting/analytics from production writes. But never assume replica data is current. If your application reads from a replica and then writes based on that read, you risk acting on stale state — exactly the pattern that caused double charges in this case.

Replication Lag Is Not a Bug — It's a Design Property

PostgreSQL streaming replication is asynchronous by design. There is no guarantee a replica has seen the latest write. Treat every replica read as potentially stale.

Production Insight

A payment service read a user's balance from a replica to check for sufficient funds before charging. The replica lagged 200ms behind the primary. Two concurrent requests both saw the old balance, both approved the charge, and the user was double-charged.

Exact symptom: duplicate charges with identical timestamps in the audit log, but balance checks showed sufficient funds.

Rule of thumb: never read-then-write from a replica. If you must, use pg_current_wal_lsn() to verify the replica has applied the primary's last write before proceeding.

Key Takeaway

Read replicas are eventually consistent — never assume they reflect the primary's latest state.

Always route read-then-write operations through the primary to avoid race conditions.

Monitor replica lag in production with pg_stat_replication and alert when lag exceeds your application's tolerance (e.g., >100ms).

thecodeforge.io

PostgreSQL Read Replicas & Replication Lag

Read Replicas Postgresql

Streaming Replication Internals: What's Actually Happening Under the Hood

Before you configure anything, you need to know what PostgreSQL is actually shipping between instances. Streaming replication works by sending WAL — Write-Ahead Log — records from primary to standby in near-real-time. Every change made to the primary is first written to WAL before it touches the actual data files. The standby connects to the primary using the replication protocol, receives these WAL records as a stream, and applies them to its own data files. The standby is replaying the exact same operations the primary executed, in the same order, against its own copy of the data.

This has a critical implication that trips up almost everyone: the replica isn't just a copy of the data files — it's a copy of the primary's entire physical state, including every bloated table, every suboptimal index, every dead tuple autovacuum hasn't cleaned up yet. You can't add an index on the replica that doesn't exist on the primary (unless you use logical replication, which is a different beast entirely). You also can't run DDL on a streaming replica. It's physically coupled to the primary's state.

The WAL sender process on the primary and the WAL receiver process on the replica maintain a persistent TCP connection. The replica's recovery process applies the incoming WAL records. You can monitor the health of this pipeline on the primary via pg_stat_replication and on the replica via pg_stat_wal_receiver — these two views tell you everything you need to know about whether your replication pipeline is healthy. If you're not monitoring both, you're flying blind.

One more nuance: replication slots vs wal_keep_size. A replication slot is a persistent marker on the primary that says 'don't recycle WAL until this replica has consumed it'. Without a slot, you rely on wal_keep_size — a fixed retention window. If a replica disconnects longer than wal_keep_size covers, you need a full pg_basebackup to resync. Slots prevent that, but if the replica never reconnects, the primary's disk will fill. Choose based on your tolerance for full resync vs disk space.

io/thecodeforge/postgres/StreamingReplicationSetup.sqlSQL

100

101

102

103

-- io.thecodeforge.postgres.replication — Production setup
-- ============================================================
-- PRIMARY configuration (postgresql.conf)
-- ============================================================

-- Enable WAL archiving level — 'replica' is sufficient for streaming replication
ALTER SYSTEM SET wal_level = 'replica';

-- Allow at least as many WAL senders as you have replicas + 1 for pg_basebackup
ALTER SYSTEM SET max_wal_senders = 5;

-- Keep enough WAL segments to allow replica to catch up after a brief disconnect
-- At 16MB per segment, this is 1GB of WAL retained — tune based on your write volume
ALTER SYSTEM SET wal_keep_size = '1GB';

-- Enable hot_standby on primary (needed so replica can serve reads)
-- Technically a replica setting, but set it on primary too for when roles flip
ALTER SYSTEM SET hot_standby = on;

-- Reload config — no restart needed for most of these
SELECT pg_reload_conf();

-- ============================================================
-- Create replication user (PRIMARY)
-- Never use a superuser for replication — least privilege matters
-- ============================================================

CREATE USER replicator
    WITH REPLICATION        -- grants REPLICATION privilege specifically
    LOGIN
    PASSWORD 'use_a_real_secret_from_vault_not_this';

-- ============================================================
-- pg_hba.conf on PRIMARY — allow replica to connect
-- Add this line to pg_hba.conf, then SELECT pg_reload_conf();
-- ============================================================

-- host  replication  replicator  10.0.1.50/32  scram-sha-256

-- ============================================================
-- On the REPLICA host — take base backup from primary
-- Run as postgres OS user on the REPLICA machine
-- ============================================================

-- pg_basebackup \
--   --host=10.0.1.10 \
--   --username=replicator \
--   --pgdata=/var/lib/postgresql/15/main \
--   --wal-method=stream \
--   --write-recovery-conf \
--   --checkpoint=fast \
--   --progress
--
-- --write-recovery-conf automatically creates standby.signal
-- and writes primary_conninfo into postgresql.auto.conf

-- ============================================================
-- REPLICA configuration (postgresql.conf)
-- ============================================================

-- Connection string to primary — vault this credential properly
ALTER SYSTEM SET primary_conninfo = 
    'host=10.0.1.10 port=5432 user=replicator password=your_secret application_name=replica_01';

-- hot_standby = on allows read queries while in recovery mode
-- Without this the replica is dark — no reads, just applying WAL
ALTER SYSTEM SET hot_standby = on;

-- Send feedback to primary about which transactions are still active on replica
-- This prevents primary from vacuuming rows the replica still needs for queries
-- Trade-off: primary retains dead tuples longer — monitor table bloat
ALTER SYSTEM SET hot_standby_feedback = on;

-- How long to wait before cancelling a conflicting query on the replica
-- -1 means wait indefinitely — use this only if your replica queries are short
-- In practice, set to 30s-120s depending on your reporting query SLAs
ALTER SYSTEM SET max_standby_streaming_delay = '120s';

-- ============================================================
-- Monitoring replication health — run on PRIMARY
-- ============================================================

SELECT
    application_name,                          -- identifies which replica
    client_addr,                               -- replica's IP
    state,                                     -- 'streaming' is healthy; 'catchup' means it's behind
    sent_lsn,                                  -- how far primary has sent WAL
    write_lsn,                                 -- how far replica has written WAL to disk
    flush_lsn,                                 -- how far replica has fsync'd WAL
    replay_lsn,                                -- how far replica has APPLIED WAL
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
    sync_state                                 -- 'async' or 'sync'
FROM pg_stat_replication;

-- ============================================================
-- Monitoring lag in seconds — run on REPLICA
-- ============================================================

SELECT
    CASE WHEN pg_is_in_recovery() THEN 'replica' ELSE 'primary' END AS role,
    EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS lag_seconds,
    pg_last_xact_replay_timestamp() AS last_applied_at
FROM pg_stat_wal_receiver;

Output

-- On PRIMARY (healthy replica):

-- application_name | client_addr | state | sync_state | replication_lag_bytes

-- replica_01 | 10.0.1.50 | streaming | async | 4096

-- On REPLICA (healthy):

-- role | lag_seconds | last_applied_at

-- replica | 0.8 | 2024-03-15 14:32:01.234+00

Production Trap: wal_keep_size Too Small

If your replica disconnects (network blip, deploy, anything) and can't reconnect before the primary has recycled the WAL segments it needs, you'll see 'ERROR: requested WAL segment has already been removed' in replica logs. The replica can no longer catch up via streaming and needs a full pg_basebackup restart. Set wal_keep_size to at least 10 minutes of WAL volume at peak write rate, or use WAL archiving to S3 as a safety net — set archive_mode = on and archive_command to copy WAL to durable storage. Monitor: SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) FROM pg_stat_replication;

Production Insight

A team set wal_keep_size = 512MB on a primary that generated 200MB/min of WAL. A replica disconnected for 3 minutes for maintenance. The primary rotated past 512MB of WAL in 2.5 minutes. Replica reconnected, asked for WAL segment 0/5A000028, primary responded 'segment removed'. Replica needed a full pg_basebackup — 2 hours of downtime.

Fix: set wal_keep_size = 5GB, monitor pg_stat_replication lag_bytes, add WAL archiving to S3.

Rule: wal_keep_size × 2 should cover your longest expected replica downtime × peak WAL generation rate.

Key Takeaway

Streaming replication: primary → walsender → TCP → walreceiver → recovery apply on replica.

Monitor both pg_stat_replication (primary) and pg_stat_wal_receiver (replica).

Physical replicas are exact copies — same indexes, same bloat, same everything.

wal_keep_size must be larger than your worst-case replica disconnect time × write rate.

Connection Routing Without Trusting Your Application Developers

Here's the brutal truth about read replica routing: if you rely on application developers to manually choose between a primary DSN and a replica DSN in their code, you will have primary reads that should be on the replica and replica reads that should be on the primary, and you won't know which is which until something breaks. Application developers are solving business problems. Database topology is not their mental model and it shouldn't have to be.

The right architecture puts routing below the application layer. PgBouncer is the most common first layer — it pools connections and can be configured with multiple backends. But PgBouncer alone doesn't route by intent. For intent-based routing you need either a proxy that understands read/write splitting (Pgpool-II, RDS Proxy, or a sidecar proxy like ProxySQL's PostgreSQL mode) or a connection wrapper in your DAL that maps transaction context to the right pool.

The pattern I've seen work in production at scale: two PgBouncer pools — one pointing to primary, one pointing to replica — with a lightweight DAL wrapper that routes based on transaction mode. Any call inside an explicit BEGIN...COMMIT block goes to primary. Any bare SELECT outside a transaction block goes to the replica pool. This gets you 80% of the benefit with minimal application code changes. The remaining 20% — cases where a SELECT needs fresh data but isn't in a write transaction — you handle with an explicit annotation or by accepting primary reads for those paths.

One anti-pattern to avoid: putting the replica pool behind a load balancer that does health checks but not read/write splitting. You'll end up with reads on primary anyway. Another trap: using statement-based routing (e.g., check if query starts with 'SELECT') fails for SELECT inside transactions, SELECT...FOR UPDATE (needs primary), and CTEs. The transaction-boundary approach is simpler and more reliable.

io/thecodeforge/postgres/DatabaseRouter.pyPYTHON

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

# io.thecodeforge.postgres.routing — Production routing layer

import contextlib
from enum import Enum, auto
from typing import Generator
import psycopg2
import psycopg2.pool
from threading import local


class ReadPolicy(Enum):
    EVENTUAL = auto()   # replica pool — accepts lag
    STRONG = auto()     # primary pool — fresh data


class DatabaseRouter:
    """
    Routes database connections to primary or replica based on context.
    Thread-local state tracks whether we're inside a write transaction.
    
    PgBouncer endpoints:
      - PRIMARY:  pgbouncer-primary:6432  (transaction mode)
      - REPLICA:  pgbouncer-replica:6432  (session mode for replicas)
    """

    def __init__(self, primary_dsn: str, replica_dsn: str, pool_size: int = 20):
        self._local = local()
        self._primary_pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=2, maxconn=pool_size, dsn=primary_dsn,
            application_name='order_service_primary'
        )
        self._replica_pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=5, maxconn=pool_size * 2, dsn=replica_dsn,
            application_name='order_service_replica'
        )

    def _in_write_transaction(self) -> bool:
        return getattr(self._local, 'write_transaction_active', False)

    def _get_connection(self, policy: ReadPolicy):
        # If inside a write transaction, MUST use primary — consistency overrides
        if self._in_write_transaction() or policy == ReadPolicy.STRONG:
            return self._primary_pool.getconn(), self._primary_pool
        return self._replica_pool.getconn(), self._replica_pool

    @contextlib.contextmanager
    def write_transaction(self) -> Generator:
        """All reads inside this block go to primary — prevents stale-read bugs."""
        conn = self._primary_pool.getconn()
        self._local.write_transaction_active = True
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            self._local.write_transaction_active = False
            self._primary_pool.putconn(conn)

    @contextlib.contextmanager
    def read_query(self, policy: ReadPolicy = ReadPolicy.EVENTUAL) -> Generator:
        """Default to replica — caller must opt into STRONG for fresh reads."""
        conn, pool = self._get_connection(policy)
        try:
            yield conn
            conn.rollback()  # ends implicit transaction, returns to pool cleanly
        except Exception:
            conn.rollback()
            raise
        finally:
            pool.putconn(conn)


# ============================================================
# Checkout flow with proper routing
# ============================================================

def process_checkout(router: DatabaseRouter, cart_id: str, user_id: str) -> dict:
    # Inventory check — replica is fine, lag acceptable
    with router.read_query(ReadPolicy.EVENTUAL) as conn:
        cursor = conn.cursor()
        cursor.execute(
            'SELECT product_id, quantity FROM inventory WHERE cart_id = %s',
            (cart_id,)
        )
        inventory = cursor.fetchall()

    if not inventory:
        raise ValueError('Cart empty')

    # Write transaction — everything inside hits primary
    with router.write_transaction() as conn:
        cursor = conn.cursor()
        cursor.execute(
            'INSERT INTO orders (cart_id, user_id, status) VALUES (%s, %s, %s) RETURNING id',
            (cart_id, user_id, 'pending')
        )
        order_id = cursor.fetchone()[0]

        cursor.execute(
            'INSERT INTO payments (order_id, status, created_at) VALUES (%s, %s, NOW())',
            (order_id, 'initiated')
        )

    # Order summary read — replica is fine, display only
    with router.read_query(ReadPolicy.EVENTUAL) as conn:
        cursor = conn.cursor()
        cursor.execute(
            'SELECT id, status, created_at FROM orders WHERE id = %s',
            (order_id,)
        )
        order = cursor.fetchone()

    return {'order_id': order_id, 'status': order[1] if order else 'pending'}

Output

# No runtime output — this is a module.

# When deployed:

# write_transaction() → pgbouncer-primary:6432

# read_query(EVENTUAL) → pgbouncer-replica:6432

# read_query(STRONG) → pgbouncer-primary:6432

# pg_stat_activity shows:

# application_name = 'order_service_primary' on primary

# application_name = 'order_service_replica' on replica

# Lets you immediately see which app is hitting which host in monitoring.

Never Do This: SELECT After INSERT Without Explicit Routing

INSERT to primary, then SELECT from replica to confirm the insert — this returns the old state if lag >0ms, which is always. The exact symptom: your service inserts a record, immediately reads it back to populate a response, gets zero rows, then throws a 404 or NullPointerException. The fix: any SELECT that confirms or reads data written in the same request must go to primary, or use RETURNING clause in the INSERT to get the data back in the same round-trip without a second query at all.

Production Insight

The double-charge incident at the start of this article happened because a SELECT after INSERT went to the replica. The developer assumed 'replication lag is milliseconds' — but under peak load, it hit 800ms. The retry logic re-read status every 200ms. On the third retry, the replica still hadn't applied the INSERT. The customer was charged twice.

Rule: any read that gates a financial write goes to primary. Document this in your runbook and enforce it in code reviews. The cost of a double-charge incident is ten times the cost of an extra primary read.

Key Takeaway

Route at the connection pool / DAL level — don't make app devs choose.

Transaction boundary is the correct routing signal: BEGIN...COMMIT blocks go to primary.

SELECT after INSERT must go to primary, or use RETURNING to avoid the SELECT entirely.

Monitor pg_stat_activity.application_name to verify routing is working as intended.

Replication Lag: Why It Happens, How to Measure It, When to Panic

Replication lag is not a bug. It's a fundamental property of async replication. What is a bug is building a system that assumes lag is zero, or monitoring for it reactively instead of proactively. I've watched teams add read replicas to solve a performance problem and spend the next month firefighting stale-read bugs that were ten times harder to debug than the original performance problem.

Lag accumulates from three different places and you need to know which one is causing yours. Network lag is the time for WAL bytes to travel from primary to replica — usually <5ms on properly networked instances in the same AZ. I/O lag on the replica is the time to write and fsync incoming WAL segments — on a replica with an undersized disk or competing I/O from read queries, this is where lag balloons. Apply lag is the time for the recovery process to apply WAL records to data pages — on tables with high contention or during bulk loads, apply lag can spike to seconds even with fast I/O.

The real danger is cumulative lag under write bursts. If your primary ingests a bulk load at 200MB/s of WAL for 60 seconds, and your replica can only sustain 150MB/s of apply throughput, you're 6GB of WAL behind when the load stops. At steady state the replica catches up — but for those minutes, any read against the replica is returning data that's potentially minutes old. You need to monitor pg_stat_replication on primary continuously, alert at 10 seconds of lag, and have a circuit breaker that routes to primary automatically when lag exceeds your SLA threshold.

A subtle point: replica lag reported in bytes (pg_wal_lsn_diff) is more precise than lag in seconds (replay_lag). Bytes tell you exactly how much work is pending. Seconds include network jitter and OS scheduling. For alerting, use seconds with a hysteresis (e.g., lag > 30s for 2 consecutive checks). For root cause analysis, examine the three byte-level components.

io/thecodeforge/postgres/ReplicationLagMonitoring.sqlSQL

-- ============================================================
-- Comprehensive lag monitoring view — create on PRIMARY
-- ============================================================

CREATE OR REPLACE VIEW v_replication_health AS
SELECT
    r.application_name,
    r.client_addr,
    r.state,
    r.sync_state,
    -- Byte-level lag breakdown
    pg_wal_lsn_diff(r.sent_lsn, r.write_lsn)   AS network_lag_bytes,
    pg_wal_lsn_diff(r.write_lsn, r.flush_lsn)  AS disk_lag_bytes,
    pg_wal_lsn_diff(r.flush_lsn, r.replay_lsn) AS apply_lag_bytes,
    pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn)  AS total_lag_bytes,
    r.replay_lag,                               -- human time
    rs.slot_name,
    rs.active AS slot_active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), rs.confirmed_flush_lsn) AS slot_lag_bytes
FROM
    pg_stat_replication r
    LEFT JOIN pg_replication_slots rs ON rs.active_pid = r.pid
ORDER BY total_lag_bytes DESC;

-- ============================================================
-- Circuit breaker: check if replica is fresh enough for reads
-- Run on REPLICA — embed this in your connection pool health check
-- ============================================================

CREATE OR REPLACE FUNCTION fn_replica_is_fresh(
    max_lag_seconds NUMERIC DEFAULT 30
) RETURNS BOOLEAN AS $$
DECLARE
    lag_seconds NUMERIC;
BEGIN
    SELECT COALESCE(
        EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())),
        999999  -- if NULL, replica hasn't replayed anything yet
    ) INTO lag_seconds;
    RETURN lag_seconds <= max_lag_seconds;
END;
$$ LANGUAGE plpgsql STABLE;

-- Usage: SELECT fn_replica_is_fresh(30); -- TRUE if lag < 30s

-- ============================================================
-- Find what's causing lag spikes: conflicting queries on replica
-- Run on REPLICA
-- ============================================================

SELECT
    pid,
    now() - query_start AS duration,
    wait_event_type,
    wait_event,
    LEFT(query, 120) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
  AND (wait_event_type = 'Lock' OR wait_event LIKE '%recovery%')
ORDER BY duration DESC;

Output

-- v_replication_health on PRIMARY:

-- replica_01 | 0 | 0 | 8192 | 8192 | 00:00:00.8

-- fn_replica_is_fresh(30) on REPLICA:

-- fn_replica_is_fresh

-- t

-- Conflict query on REPLICA:

-- pid | duration | wait_event_type | wait_event | query_preview

-- 1234 | 00:02:15 | Lock | relation | SELECT COUNT(*) FROM orders WHERE status = 'pending'

Production Trap: Replication Slots Without Monitoring = Disk Death

If you use a replication slot (required for logical replication, optional for physical) and the replica goes offline, the primary will retain ALL WAL produced since the slot went inactive — forever — until the replica reconnects. I've seen this fill a 500GB disk in under 6 hours during a replica maintenance window that ran long. The error when disk fills: 'PANIC: could not write to file pg_wal/XXXX: No space left on device' — which crashes the primary. Run SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS retained_bytes FROM pg_replication_slots; and alert if retained_bytes exceeds 20GB. If a slot goes inactive and you can't reconnect the replica in minutes, DROP the slot immediately and rebuild the replica.

Production Insight

A healthcare analytics platform had a replica with 3TB of storage. A developer ran a query that took 45 minutes on the replica, blocking WAL apply for the entire duration. Replay lag grew to 45 minutes. The monitoring system was only checking write_lag, which remained low (WAL was being received and written, just not applied). No alert fired. The dashboard showed data from 45 minutes ago for an hour before anyone noticed.

Fix: add alert on replay_lag and apply_lag_bytes. Also implement pg_stat_activity monitoring that kills queries running longer than max_standby_streaming_delay.

Key Takeaway

Lag has three components: network, disk I/O, and apply. Monitor all three — each points to a different bottleneck.

Use circuit breakers: if lag > SLA, route reads back to primary automatically.

Replication slots are a double-edged sword — they prevent resync but can fill your disk if a replica dies.

Alert on replay_lag, not just write_lag. Write_lag being low doesn't mean your data is fresh.

Promotion and Failover: What Actually Happens When Your Primary Dies

Failover is where most PostgreSQL read replica setups reveal they were built with optimism rather than engineering. The scenario: primary becomes unavailable. Every second of downtime is revenue loss. You need to promote a replica to primary, update your connection strings, ensure the old primary never comes back as a rogue writer, and ideally do all of this without data loss. If you haven't rehearsed this, you don't actually have HA — you have a hope.

PostgreSQL's promotion mechanism is clean: on the standby, run pg_ctl promote or create a trigger file, and the replica exits recovery mode and starts accepting writes. The problem is everything around it. Your application's connection pool is still pointing at the dead primary's IP. Other replicas are still trying to stream from the dead primary. If the old primary recovers and starts accepting writes while the new primary is also accepting writes — split-brain — you have two diverging transaction histories and neither node knows about the other's writes.

Patroni is the industry answer to this. It uses a distributed consensus store (etcd or Consul) to maintain a single authoritative leader, handles automatic promotion, updates a virtual IP or DNS entry that your application uses, fences the old primary (prevents it from accepting writes even if it recovers), and handles adding the old primary back as a replica via pg_rewind. If you're running PostgreSQL in production without Patroni or an equivalent, you have a manual failover runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.

A specific gotcha: after promoting a replica, the timeline ID changes. The old primary, if it comes back online, is on a diverged timeline. It cannot rejoin as a replica without pg_rewind, which replays WAL from the divergence point to bring it back into sync. Running pg_rewind requires the old primary to be stopped and requires the new primary to have wal_log_hints = on or data checksums enabled. If those aren't set, pg_rewind fails and you need a full pg_basebackup.

io/thecodeforge/postgres/FailoverValidation.sqlSQL

-- ============================================================
-- Pre-failover health check — run on PRIMARY
-- ============================================================

SELECT
    application_name,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS unconfirmed_bytes,
    replay_lag,
    sync_state
FROM pg_stat_replication
ORDER BY unconfirmed_bytes ASC;

-- For zero-RPO failover: only promote if unconfirmed_bytes = 0

-- ============================================================
-- Manual promotion (run on REPLICA being promoted)
-- ============================================================

-- Shell: pg_ctl promote -D /var/lib/postgresql/15/main
-- OR if trigger_file is configured: touch /tmp/postgresql.trigger.5432

-- After promotion, confirm:
SELECT
    pg_is_in_recovery() AS is_still_replica,     -- should be false
    pg_current_wal_lsn() AS current_lsn,
    current_setting('transaction_read_only') AS read_only;

-- ============================================================
-- After promotion: re-sync old primary using pg_rewind
-- Run on OLD PRIMARY (now stopped)
-- ============================================================

-- pg_rewind \
--   --target-pgdata=/var/lib/postgresql/15/main \
--   --source-server='host=10.0.1.50 user=replicator dbname=postgres' \
--   --progress
--
-- Then: echo 'primary_conninfo = ...' >> postgresql.auto.conf
-- touch standby.signal
-- pg_ctl start

-- ============================================================
-- Verify topology after failover — run on new PRIMARY
-- ============================================================

SELECT
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;

Output

-- Pre-failover (replica is caught up):

-- application_name | state | unconfirmed_bytes | replay_lag

-- replica_01 | streaming | 0 | 00:00:00

-- After promotion (on new primary):

-- is_still_replica | current_lsn | read_only

-- f | 0/5B000028 | off

-- Topology after re-sync:

-- application_name | client_addr | state | lag_bytes

-- old_primary | 10.0.1.10 | streaming | 0

Production Trap: The Zombie Primary That Keeps Writing

After promoting a replica to primary, if the old primary recovers from a network partition (not actually dead — just unreachable) and restarts without being fenced, you have two nodes accepting writes. Your application will write to both, guided by stale DNS or stale connection pool entries. This is split-brain and it corrupts both databases. The fix: STONITH (Shoot The Other Node In The Head) — Patroni uses watchdog devices or cloud API calls to forcibly power off the old primary before completing promotion. If you can't fence the old primary, you can't safely complete failover. This is why you never promote a replica manually without confirming the old primary is unreachable and fenced.

Production Insight

A fintech company had a primary database in AWS us-east-1 and a replica in us-west-2. A routing issue made us-east-1 unreachable from the application but the database was still running. The on-call engineer manually promoted the us-west-2 replica. 5 minutes later, network connectivity recovered. Suddenly both us-east-1 and us-west-2 were accepting writes. The application wrote to whichever node had the faster network path. Two different transaction histories evolved. Reconciling the split-brain took 3 days of manual data repair.

Rule: never promote a replica manually without fencing the old primary — power it down, block its security group, or use Patroni's STONITH. The 5 minutes of extra downtime is worth avoiding split-brain.

Key Takeaway

Failover is not just pg_ctl promote — it's fencing, DNS, and re-sync of other replicas.

Patroni is the industry standard for automated failover; don't build your own.

pg_rewind re-syncs a diverged primary after failover — requires wal_log_hints = on.

Split-brain is the single most dangerous outcome of a failover. Fence the old primary before promotion.

Cascading Replication: How to Make Your Standbys Work for a Living

Most teams treat standby replicas as passive sponges that just soak up WAL and wait to be promoted. That's wasteful. Cascading replication lets a replica serve as a replication source for downstream replicas, offloading the primary's network and CPU burden. The WHY is simple: your primary shouldn't be sending the same WAL data to thirty application clients and fifteen replicas. It's a bottleneck you're paying for in latency and risk.

Set standby.sync to on and hot_standby to on in the replica that'll be your cascade source. That replica then becomes both a read target and a feeder. But here's the catch—cascading adds latency at each hop. If you chain three levels, your replica at the bottom sees lag that's the sum of all upstream propagation delays. Measure it with pg_stat_replication on each node. A common production pattern is two-tier: primary feeds two regional cascades, each feeds five application replicas. Your primary sends only 2 streams instead of 12. That's a win.

Never trust a cascade replica for critical failover unless you've validated its WAL position is within acceptable lag. Otherwise you promote a stale node and lose committed transactions.

CascadeTopologyCheck.sqlSQL

// io.thecodeforge — database tutorial

-- Check which replicas are feeding others
SELECT
    client_addr,
    application_name,
    state,
    sync_state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn) AS lag_bytes
FROM pg_stat_replication;

-- On a cascade replica, check what IT is sending downstream
SELECT
    client_addr,
    application_name,
    state
FROM pg_stat_replication;

-- Expected output for a 2-tier cascade:
--  client_addr    | application_name | state     | sync_state
--  192.168.1.10   | cascade_us_east  | streaming | async
--  192.168.1.20   | cascade_us_west  | streaming | async

Output

client_addr | application_name | state | lag_bytes

192.168.1.10 | cascade_us_east | streaming | 56780

192.168.1.20 | cascade_us_west | streaming | 102400

Production Trap:

Cascade replicas don't report their downstream lag to the primary. You have to query each cascade node individually to see if your deep replicas are falling behind.

Key Takeaway

Cascade replication reduces primary load by letting replicas serve as WAL sources, but every hop adds latency—monitor each tier independently.

Synchronous Replication: When You Absolutely Cannot Lose a Transaction

Streaming replication in async mode means the primary commits before the replica acknowledges receipt. That's fine for reporting dashboards. It's not fine for payment systems, order processing, or anything where a primary crash could lose the last few seconds of writes. Synchronous replication forces the primary to wait for at least one replica to confirm the WAL was flushed to disk before reporting success to the client.

The cost is write latency—every INSERT, UPDATE, or DELETE now waits for a network round-trip to the sync standby. In a cross-datacenter setup with 10ms latency, that adds 10ms to every write. You don't want this on every connection. Use synchronous_standby_names to specify which replicas are sync targets, and set synchronous_commit = remote_write for a lighter guarantee (replica received it but hasn't flushed) or on for full flush confirmation.

Production pitfall: if your sync standby dies, the primary blocks all writes until either the standby comes back or you remove it from the sync list. Always have a quorum commit setup or a monitoring alert that auto-demotes a dead sync standby before your users start screaming.

SyncReplicationConfig.sqlSQL

// io.thecodeforge — database tutorial

-- On primary: postgresql.conf
-- synchronous_standby_names = '1 (dc1_replica, dc2_replica)'

-- Check sync replication status
SELECT
    application_name,
    sync_state,  -- 'sync' means synchronous
    sync_priority,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
WHERE sync_state = 'sync';

-- Example output:
-- application_name | sync_state | sync_priority | write_lag
-- dc1_replica      | sync       | 1             | 00:00:00.003
-- dc2_replica      | potential  | 2             | 00:00:00.042

Output

application_name | sync_state | sync_priority | write_lag

dc1_replica | sync | 1 | 00:00:00.003

dc2_replica | potential | 2 | 00:00:00.042

Production Trap:

If your sync standby goes down and you have only one synchronous replica configured, the primary will block ALL writes until it comes back or you restart the primary without the sync standby name.

Key Takeaway

Synchronous replication guarantees zero data loss but introduces write latency proportional to network distance—never use it across high-latency links without understanding the tradeoff.

WAL Archive & Continuous Archival: Your Replicas Can't Save You From Disk Failure

Replicas are great for read scaling and failover. They are not backups. If your primary suffers a physical disk failure that corrupts the WAL files before they're replicated, all your replicas will replicate that corruption. That's the day you learn streaming replication is not a backup strategy. You need continuous WAL archiving to a separate storage system, preferably an object store like S3 or a dedicated backup server.

Set archive_mode = on and archive_command to copy completed WAL segments to your archive location. PostgreSQL will call this command every time it finishes writing a WAL segment (usually 16MB). The archive target must be durable—a separate volume, different physical hardware, or cloud object storage. Combine this with pg_basebackup for full base backups and slot-based WAL retention so you don't lose WAL before replicas consume it.

Production pattern: full backup every 6 hours, WAL archiving every 5 minutes, and a separate recovery.conf that can restore from archive in minutes. Test your restore process quarterly. I've seen teams that had two years of WAL archives but couldn't actually get PostgreSQL to restore from them because they changed the archive format without testing. Don't be that team.

WALArchiveConfig.sqlSQL

// io.thecodeforge — database tutorial

-- postgresql.conf
archive_mode = on
archive_command = 'rsync -a %p backup@archive-server:/wal_archive/%f'
wal_level = replica
max_wal_senders = 10

-- Verify archive status
SELECT
    last_archived_wal,
    last_archived_time,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;

-- Example output:
-- last_archived_wal      | last_archived_time
-- 000000010000000A000000BB | 2024-05-14 13:45:12

-- If last_failed_wal is not null, your archiving is broken.
-- Fix it before you need it.

Output

last_archived_wal | last_archived_time

000000010000000A000000BB | 2024-05-14 13:45:12

last_failed_wal | last_failed_time

NULL | NULL

Production Trap:

A failed archive command doesn't crash the primary—PostgreSQL just logs the error and keeps running. Monitor pg_stat_archiver in your alerting or your backup will silently rot.

Key Takeaway

Replicas don't protect against storage corruption—continuous WAL archiving to a separate durable target is your actual backup. Test restoration quarterly.

Project Structure: Don't Let Your Repo Look Like a Junior Dev's Desktop

Most PostgreSQL replication guides dump everything into a single directory and call it a day. That works until your third replica joins the cluster and you can't tell which config belongs to which node. Stop that.

Your project structure is your first line of defense against operational chaos. Separate primary config, replica config, and initialization scripts into distinct directories. Use environment-specific override files so staging doesn't accidentally connect to prod. If your docker-compose.yml references a config path that doesn't exist, your replica won't start — that's a feature, not a bug.

Every production cluster I've managed that survived more than six months had a clear directory layout: pg-primary/, pg-replica/, pg-init/, and a top-level docker-compose.yml. No ambiguity. No guessing. You should be able to clone the repo and have a working three-node cluster in under two minutes. If you can't, your structure is wrong.

project-structure.sqlSQL

// io.thecodeforge — database tutorial

-- Project directory layout for read replica cluster
-- Each component gets its own config and scripts

├── docker-compose.yml          # Orchestrates primary + replicas
├── .env                        # Shared env vars (DB name, passwords)
├── pg-primary/
│   ├── postgresql.conf         # Primary-specific settings
│   ├── pg_hba.conf             # Replication user access
│   └── init.sql                # Initial schema + replication user
├── pg-replica/
│   ├── postgresql.conf         # Replica-optimized settings
│   ├── setup-replica.sh        # Streaming replication bootstrap
│   └── .pgpass                  # Credentials for replication user
└── pg-replica-2/
    ├── postgresql.conf
    ├── setup-replica.sh
    └── .pgpass

Output

Directory structure for a 3-node cluster (1 primary, 2 replicas)

Production Trap:

Never store .pgpass files in version control. Use Docker secrets or a vault. One leaked password = full cluster compromise.

Key Takeaway

Three directories. One compose file. No ambiguity. Your future self running incident response at 3 AM will thank you.

Step 5: Write the Docker Compose File — Because Manual Setup Is for Masochists

You could manually configure PostgreSQL on three different servers. Or you could write a Docker Compose file and have a working cluster in 30 seconds. Your choice.

The compose file must define three services: one primary and two replicas. The primary exposes port 5432. Replicas expose different ports (5433, 5434) to avoid conflicts during local development. Each replica depends on the primary — if the primary isn't ready, the replicas will retry.

Critical detail: the replica setup script runs on container start, not build time. You need the primary to be fully initialized before the replica can pull the base backup. Use a health check on the primary service. Use depends_on with condition: service_healthy. Without that, your replica will fail silently and you'll waste an hour debugging why WAL streaming never started.

One more thing: mount your config directories as read-only volumes. If a misconfigured replica tries to write to its own config, it should crash immediately. Fail fast, not silently.

docker-compose.ymlSQL

// io.thecodeforge — database tutorial

version: '3.8'

services:
  pg-primary:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - ./pg-primary/postgresql.conf:/etc/postgresql/postgresql.conf:ro
      - ./pg-primary/pg_hba.conf:/etc/postgresql/pg_hba.conf:ro
      - ./pg-primary/init.sql:/docker-entrypoint-initdb.d/init.sql
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  pg-replica-1:
    image: postgres:16
    depends_on:
      pg-primary:
        condition: service_healthy
    volumes:
      - ./pg-replica/postgresql.conf:/etc/postgresql/postgresql.conf:ro
      - ./pg-replica/setup-replica.sh:/docker-entrypoint-initdb.d/setup-replica.sh
    ports:
      - "5433:5432"
    environment:
      PG_PRIMARY_HOST: pg-primary
      PG_REPLICA_USER: replicator
      PG_REPLICA_PASSWORD: ${REPLICA_PASSWORD}

Output

docker-compose.yml - Read replica cluster with health checks

Senior Shortcut:

Use PG_REPLICA_PASSWORD from environment, not a hardcoded value. One env file per environment. Keep secrets out of your compose file.

Key Takeaway

A health-checked primary with dependent replicas is the minimum viable production pattern. Anything less is asking for downtime.

Cleaning Up Orphaned WAL Files After Replica Promotion

When you promote a read replica to primary, PostgreSQL leaves behind WAL segments on the old primary that can fill disk and confuse monitoring. The old primary continues writing WAL files even after it's orphaned, because it doesn't know it's been demoted. You must stop PostgreSQL on the old primary first, then manually remove stale WAL files from pg_wal. Use pg_archivecleanup with the last valid WAL segment from the new primary to avoid removing files still needed for recovery. The cleanup tool reads the archive status and deletes everything before the given segment. Always run a CHECKPOINT on the new primary before cleaning, because checkpoint writes a new timeline history file that marks the split point. Failure to clean orphaned WAL causes disk full errors and prevents the old primary from being safely reused as a standby. Implement a monitoring alert for pg_wal directory size and automatically run archive cleanup after any failover test.

cleanup_orphaned_wal.sqlSQL

// io.thecodeforge — database tutorial

-- Stop the orphaned primary
pg_ctl stop -D /var/lib/postgresql/data

-- On the NEW primary, find the latest checkpoint's WAL
SELECT pg_current_wal_lsn();
SELECT pg_walfile_name(pg_current_wal_lsn());

-- Output example: 00000001000000000000000A
-- Run on the OLD primary
pg_archivecleanup /var/lib/postgresql/15/pg_wal 00000001000000000000000A

-- Verify cleaning
SELECT count(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}';

Output

pg_archivecleanup: removing file "00000001000000000000000009"

pg_archivecleanup: removing file "00000001000000000000000008"

count

-------

(1 row)

Production Trap:

Never run pg_archivecleanup on a running primary. It deletes WAL files that in-flight transactions still need. Always confirm the standby is fully caught up and promoted before cleaning.

Key Takeaway

Always orphan old primary's WAL after promotion — run pg_archivecleanup with the new primary's latest WAL segment.

What We Covered: The Only Three Rules You Need for Read Replicas

You now know everything necessary to run read replicas without surprises. First rule: lag is measured, not guessed — monitor pg_stat_replication for write_lag and flush_lag, and set alerts at 10 seconds for OLTP, 60 seconds for batch reporting. Second rule: failure requires a sequence — stop the old primary, promote the replica, update your connection router's DNS, then clean orphaned WAL. Never skip the stop step; it prevents split-brain where two servers accept writes. Third rule: disk failures still kill you — WAL archiving to cloud storage is the only safety net. Replicas are for read scaling, not backup. Combine synchronous replication for zero-loss critical transactions with cascading replicas for reporting loads. Use pg_rewind to reattach the old primary as a standby after failover instead of rebuilding from scratch. These three rules — measure lag, follow sequence, archive WAL — eliminate 90% of replica-related outages. The remaining 10% is people ignoring them.

rules_check.sqlSQL

// io.thecodeforge — database tutorial

-- Rule 1: Lag check
SELECT application_name, write_lag, flush_lag, replay_lag 
FROM pg_stat_replication;

-- Rule 2: Failover checklist
-- 1. pg_ctl stop on old primary
-- 2. SELECT pg_promote() on replica
-- 3. Update application DNS/haproxy
-- 4. Run pg_archivecleanup on old primary

-- Rule 3: Verify WAL archiving works
SELECT archived_count, failed_count 
FROM pg_stat_archiver
WHERE failed_count > 0;

Output

application_name | write_lag | flush_lag | replay_lag

-----------------+-----------+-----------+------------

replica1 | 00:00:02 | 00:00:03 | 00:00:03

(1 row)

archived_count | failed_count

----------------+--------------

15234 | 0

(1 row)

Golden Rule:

If you only remember one thing: async replicas will lag. Design your application to tolerate stale reads, or use synchronous replication but accept the write latency penalty.

Key Takeaway

Three rules prevent 90% of replica failures: measure lag, follow failover sequence, archive WAL off-server.

Introduction: Why Read Replicas Exist (and Why You Should Care)

PostgreSQL read replicas solve a fundamental tension: your primary database must handle writes with strict consistency, but reads can tolerate slight staleness for massive throughput gains. Without replicas, every analytics query, dashboard refresh, and background job competes for the same CPU and I/O as your critical write path. The result? Latency spikes, connection pool exhaustion, and eventually, angry users. Replicas offload read traffic to separate nodes that asynchronously replay the write-ahead log (WAL) from the primary. This isn't just about scaling—it's about isolation. A badly written reporting query can bring a replica to its knees, but your primary stays healthy. The trade-off? Replication lag. Data written to the primary may take milliseconds (or seconds) to appear on a replica. If your application cannot tolerate even 100ms of staleness, replicas are the wrong tool. But for the 90% use case—read-heavy workloads, geographical distribution, or hot standby—replicas are the simplest, most battle-tested pattern in the PostgreSQL ecosystem.

intro_replica_check.sqlSQL

// io.thecodeforge — database tutorial
// Check if current session is on a replica
SELECT pg_is_in_recovery() AS is_replica;

-- Check replication lag in bytes
SELECT pg_wal_lsn_diff(
  pg_current_wal_lsn(),
  pg_last_wal_receive_lsn()
) AS lag_bytes
FROM pg_stat_wal_receiver;

Output

is_replica | lag_bytes

------------+----------

t | 24576

Production Trap:

Never assume 0 lag. Always measure actual lag with pg_stat_replication before routing reads. A replica that's 10 seconds behind is useless for real-time dashboards.

Key Takeaway

Read replicas decouple read and write workloads, but always measure lag before trusting them.

Prerequisites: What You Need Before Touching Replicas

Before deploying PostgreSQL read replicas, three non-negotiable foundations must exist. First, a properly configured primary with wal_level = replica (or logical for logical replication) and max_wal_senders set high enough to support your replica count plus one buffer. Without these, replicas cannot stream WAL. Second, network connectivity: replicas must reach the primary on port 5432 (or your custom port) with firewall rules allowing PostgreSQL's streaming replication protocol—this is not HTTP, so load balancers often break it. Third, authentication: create a dedicated replication user with REPLICATION privilege and store credentials in a vault, not in code comments. Bonus prerequisite: understand your application's tolerance for stale reads. If your business logic demands causal consistency (e.g., "show me my order immediately after placing it"), replicas will betray you. Finally, ensure PostgreSQL version parity between primary and replicas—major version mismatches silently corrupt replication. On the operational side, you'll need monitoring (Prometheus + pg_stat_replication), a way to replay WAL after failover, and a plan for handling replication slot bloat. Skip any of these, and your replica setup will fail in production at the worst possible moment.

prereq_setup.sqlSQL

// io.thecodeforge — database tutorial
-- Verify WAL level on primary
SHOW wal_level;

-- Create replication user
CREATE USER repl_user WITH REPLICATION PASSWORD 'strong_password';

-- Check max_wal_senders
SHOW max_wal_senders;

-- Grant necessary permissions
GRANT CONNECT ON DATABASE your_db TO repl_user;

Output

wal_level | max_wal_senders

------------+----------------

replica | 10

Architecture Check:

Replicas require persistent network tunnels. If using cloud VPCs, ensure security groups allow inbound from replica IPs—and never expose the primary's replication port to the internet.

Key Takeaway

WAL level, replication user, network access, and version parity are mandatory before a single replica can stream.

● Production incidentPOST-MORTEMseverity: high

The Replica That Made Us Double-Charge 400 Customers

Symptom

Payment service started showing duplicate charge reports sporadically after a deploy. Charges were confirmed in Stripe but the order status in the UI showed 'pending'. Customers were being charged, not getting their order, and retrying — getting charged again.

Assumption

The team assumed the bug was in the idempotency key logic in the payment microservice — someone had pushed a change to the retry handler that day.

Root cause

The order status check before triggering a payment was routed to the read replica to reduce load on the primary. Replication lag was sitting at 800ms–2s under peak load. The write confirming 'payment initiated' hadn't propagated to the replica by the time the retry logic re-read order status, so the check returned 'pending' and triggered a second charge. The idempotency key on Stripe stopped the double-charge at the Stripe layer for most customers but 400 orders had a race condition where two separate charge attempts hit Stripe within the same idempotency window.

Fix

Immediate fix: set all payment status reads to route to primary — added a / force_primary / hint that PgBouncer's routing rules respected. Permanent fix: any read that gates a financial write must go to primary. Added a ReadPolicy.STRONG annotation in the DAL that bypassed the replica pool entirely. Bumped synchronous_commit = remote_apply for the payments schema specifically using ALTER DATABASE payments_db SET synchronous_commit = 'remote_apply'; to ensure writes were applied on replica before acknowledging to the application.

Key lesson

If a read result gates a write that moves money, it goes to the primary — full stop, no exceptions, not even at 1AM when the primary CPU graph looks scary.

Production debug guideQuick reference for diagnosing replica issues in production6 entries

Symptom · 01

Replica falling behind — lag climbing past 30 seconds

→

Fix

Check iostat on replica disk. WAL apply is sequential and disk-bound. If disk iowait > 20%, move replica to faster storage or separate WAL volume. Run the lag breakdown query from Section 3 to see if network, disk, or apply is the bottleneck.

Symptom · 02

ERROR: canceling statement due to conflict with recovery in replica logs

→

Fix

A long-running read query on the replica is blocking WAL apply. Set hot_standby_feedback = on on the replica to prevent vacuum from cleaning rows the replica still needs. Increase max_standby_streaming_delay to 120s. For analytics queries that can't be short, move them to a dedicated replica.

Symptom · 03

Primary disk filling up — pg_wal directory growing unbounded

→

Fix

Check for inactive replication slots: SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots;. If a replica is down and using a slot, the primary retains all WAL indefinitely. Drop the slot with SELECT pg_drop_replication_slot('slot_name'); then rebuild the replica.

Symptom · 04

Queries on replica are slower than identical queries on primary

→

Fix

Replica suffers from different query plans because autovacuum doesn't run as aggressively. Run ANALYZE on replica manually after major writes. For identical plan problems, enable log_duration on both nodes and compare — replica may lack indexes that exist on primary (but physical replication includes them). The issue is usually statistics, not structure.

Symptom · 05

Promoted replica won't accept writes — ERROR: cannot execute INSERT in a read-only transaction

→

Fix

You forgot to remove the standby.signal file after promotion. PostgreSQL checks for this file at startup; if it exists, the server starts in recovery mode regardless of primary_conninfo. Delete standby.signal and restart PostgreSQL. After promotion, pg_is_in_recovery() should return false.

Symptom · 06

Replication connection timeout — FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

→

Fix

Network or firewall issue between primary and replica. Check pg_hba.conf on primary — add host replication replicator replica_ip/32 md5. Test telnet from replica to primary on port 5432. Increase wal_sender_timeout on primary (default 60s).

★ Quick Replication Debug Cheat SheetCommands to diagnose replication issues in under 60 seconds

Check if replica is actually replicating−

Immediate action

Run on replica: SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();

Commands

SELECT pg_is_in_recovery() AS is_standby, pg_last_wal_receive_lsn() AS received, pg_last_wal_replay_lsn() AS replayed, now() - pg_last_xact_replay_timestamp() AS lag_seconds;

Run on primary: SELECT application_name, state, sync_state, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;

Fix now

If lag_bytes > 1GB or lag_seconds > 30, investigate bottleneck. If state = 'catchup', replica is still initialising.

Find what's blocking replication on replica+

Check if replication slot is causing disk bloat+

Test if replica can be promoted+

Async vs Sync vs Logical Replication

Aspect	Async Streaming Replication	Sync Streaming Replication	Logical Replication
Replication lag	Milliseconds to seconds under load	Zero — primary waits for replica ack	Seconds to minutes depending on change volume
Write throughput impact on primary	Near-zero — fire and forget	Significant — every write waits for remote flush/apply (2-5x latency)	Low — only published tables are replicated
RPO on primary failure	Up to several seconds of data loss	Zero data loss (remote_apply) or minimal (remote_write)	Variable — depends on subscription lag
Replica structure	Identical physical copy	Identical physical copy	Subscriber can have different indexes, partitioning
Can promote to primary?	Yes — full promotion via pg_ctl promote	Yes — zero data loss promotion	No — subscriber can't be promoted
DDL on replica	No — follows primary exactly	No — follows primary exactly	No — DDL not replicated automatically
Use case	Primary HA replica and read scaling	Financial systems, zero-RPO requirement	Partial replication, cross-version upgrades

Key takeaways

Replication lag is not a failure mode

it's a design parameter. Define your RPO and SLA before choosing async vs sync replication, not after your first stale-read incident.

Any read that gates a financial write goes to the primary. No exceptions. Not even when the primary CPU is spiking. A double-charge is ten times more expensive to resolve than an extra primary read.

Reach for read replicas when your pg_stat_activity shows read queries holding connections and CPU while write latency is climbing

not when someone just says 'we need to scale'. The symptom is the signal.

Logical replication and physical streaming replication look similar from the outside but are completely different animals under the hood

physical replicas can be promoted to primary during failover, logical subscribers cannot. Mixing them in your HA pool is a disaster waiting for a production incident to express itself.

walsender → walreceiver → recovery apply. Monitor each stage with sent_lsn, write_lsn, flush_lsn, replay_lsn. High replay_lag with low write_lag means recovery contention

long queries on the replica are blocking WAL apply.

Patroni is the industry standard for automated failover; don't build your own. If you're not using Patroni, you have a manual runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.

wal_keep_size + WAL archiving is safer than replication slots for most setups. Slots prevent resync but can fill your disk. Monitor pg_replication_slots.retained_bytes and alert at 10GB.

Common mistakes to avoid

5 patterns

Pointing analytics queries at the same replica handling OLTP read traffic

Symptom

Production reads start timing out, lag spikes to 30+ seconds, 'ERROR: canceling statement due to conflict with recovery' appears in replica logs.

Fix

Run a dedicated analytics replica with max_standby_streaming_delay = -1 and hot_standby_feedback = on, isolated from the OLTP replica pool entirely. Use separate PgBouncer pools with distinct application_names to monitor separation.

Using a replication slot on a replica that has routine maintenance windows

Symptom

Primary disk fills during the maintenance window, 'PANIC: could not write to file pg_wal: No space left on device' crashes the primary.

Fix

Don't use replication slots for physical streaming unless you need them (e.g., Patroni uses them internally). For simple streaming replication, wal_keep_size is safer. Monitor SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots; and alert above 10GB. If a slot goes inactive, drop it and rebuild the replica.

Running pg_basebackup without --wal-method=stream

Symptom

Base backup completes but replica fails to start with 'FATAL: could not connect to the primary server' or immediately falls behind because it's missing WAL from the backup window.

Fix

Always use --wal-method=stream during pg_basebackup. This streams WAL produced during the backup in parallel, so the replica has a consistent starting point. Alternatively configure WAL archiving so the replica can fetch the missing segments from archive.

Assuming replica will automatically reconnect after primary IP change

Symptom

Replica goes to 'disconnected' state, pg_stat_wal_receiver shows 'waiting for WAL to become available' indefinitely.

Fix

primary_conninfo uses a hostname backed by a DNS record that updates on failover (Patroni's virtual IP or Route53 with short TTL). Never hardcode IP addresses in primary_conninfo — use a stable DNS name. For Kubernetes, use a Service with a stable ClusterIP.

Not testing promotion before you need it

Symptom

During an actual outage, promotion succeeds but the application's connection pool doesn't reconnect because it cached the dead primary's connection, leading to continued downtime.

Fix

Test failover in staging monthly. Verify PgBouncer or your proxy updates its upstream on promotion. Confirm application connection retry logic handles the reconnect window with exponential backoff. The first failover you run should never be a real incident. Document the RTO and RPO of your failover process.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Walk me through exactly what happens at the WAL level when a PostgreSQL ...

Q02SENIOR

You have a read replica that's receiving 10,000 read queries per minute ...

Q03SENIOR

You've promoted a replica to primary during an incident. The old primary...

Q04SENIOR

Your synchronous replica is configured with synchronous_commit = remote_...

Q01 of 04SENIOR

Walk me through exactly what happens at the WAL level when a PostgreSQL primary executes an UPDATE while a streaming replica is connected — where does each process hand off to the next, and where can lag accumulate in that chain?

ANSWER

The primary writes the UPDATE to WAL. The walsender process reads that WAL record and streams it over TCP to the replica's walreceiver. The walreceiver writes the record to replica's WAL (write_lag). The walreceiver then flushes the WAL to disk (flush_lag). The replica's recovery process replays the WAL record against replica's data pages (replay_lag). Lag accumulates at each stage: network latency (write to send), replica disk I/O (write to flush), and recovery contention (flush to replay). The primary does not wait for any of these in async mode — it just sends and forgets, which is why data loss on primary failure is possible. The replay stage is single-threaded, so a long-running query on the replica can stall replay entirely, causing replay_lag to spike while write_lag remains low.

FAQ · 5 QUESTIONS

Frequently Asked Questions

How much replication lag is acceptable for a PostgreSQL read replica?

What's the difference between a read replica and a standby in PostgreSQL?

How do I force a specific query to always run on the primary and never hit the replica?

Can PostgreSQL read replicas fall so far behind that they can never catch up?

What's the performance impact of turning on hot_standby_feedback?

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Written from production experience, not tutorials.

✓ Verified

production tested

May 23, 2026

last updated

1,554

articles · all by Naren

🔥

That's MySQL & PostgreSQL. Mark it forged?

16 min read · try the examples if you haven't