
PostgreSQL Read Replicas: Internals, Setup & Production Gotchas

In Plain English 🔥
Imagine a library with one original book that thousands of people want to read at the same time. Instead of making everyone wait in line at the single original, the librarian makes exact photocopies and hands them out to separate reading rooms — everyone can read simultaneously. That's a read replica: an exact, continuously-updated copy of your database that handles read traffic so your primary database isn't crushed. When someone returns a page to the original (a write), the photocopier automatically updates every copy within milliseconds.

At a certain scale, your PostgreSQL primary database starts groaning under the weight of thousands of simultaneous SELECT queries. Analytics dashboards, reporting jobs, search endpoints — they all hammer the same box that's also trying to handle your writes. The result is lock contention, CPU saturation, and latency spikes that affect paying users. This isn't a theoretical problem: it's the exact scaling wall that companies like Shopify, GitHub, and Notion have all hit and solved with read replicas.

Read replicas solve this by exploiting a fundamental truth about most web applications: roughly 80-90% of database operations are reads. By shipping every write from your primary to one or more standby servers in near-real-time using PostgreSQL's Write-Ahead Log (WAL), you can spread that read load horizontally without any application-level sharding complexity. The primary stays focused on writes; replicas absorb reads.

By the end of this article you'll understand exactly how WAL-based streaming replication works under the hood, how to set up a production-grade replica from scratch, how to route queries intelligently between primary and replica, and — critically — how to avoid the replication lag traps that have caused production incidents at companies you've heard of. This is the article that gets you from 'I know replicas exist' to 'I can architect and operate them confidently.'

How PostgreSQL Streaming Replication Actually Works Under the Hood

Every change in PostgreSQL — an INSERT, UPDATE, DELETE, even a schema change — is first written to the Write-Ahead Log (WAL) before it touches the actual data files. This isn't just for crash recovery; it's the backbone of replication.

When you enable streaming replication, the replica connects to the primary and starts a persistent TCP connection via the replication protocol. The primary runs a walsender process; the replica runs a walreceiver process. As new WAL records are generated on the primary, walsender streams them across the wire in near-real-time. The walreceiver writes them to the replica's own WAL, and then a separate startup process (in recovery mode) replays those WAL records against the replica's data files.

This replay is sequential and single-threaded by design — WAL records encode physical page changes, so they must be applied in exact order. This is why replicas have a hard ceiling on replay throughput: if your primary generates WAL faster than the replica can replay it, you accumulate replication lag.
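Lag in bytes is just arithmetic on WAL positions. PostgreSQL reports these positions as LSNs (log sequence numbers) such as 0/A1234F00: a 64-bit byte offset into the WAL stream, written as two hex halves. A minimal Python sketch of the math that the server-side pg_wal_lsn_diff() function performs:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/A1234F00' to an absolute byte position.
    The part before the slash is the high 32 bits; the part after is the low 32."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Byte distance between what the primary has sent and what the replica
    has applied -- the same quantity pg_wal_lsn_diff() computes server-side."""
    return parse_lsn(sent_lsn) - parse_lsn(replay_lsn)


print(lag_bytes("0/A1234F00", "0/A1234D80"))  # 384
```

Those 384 bytes are the replica's replay queue; divide by your primary's WAL generation rate to turn a byte lag into a rough time lag.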

Critically, the replica is in a permanent 'hot standby' state — it can serve read queries while simultaneously replaying WAL. PostgreSQL achieves this with careful conflict management: if a query on the replica touches a page that the WAL replay needs to modify, the query is cancelled with the error ERROR: canceling statement due to conflict with recovery. This is one of the most misunderstood behaviours in PostgreSQL replication.
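These conflict cancellations are transient: once the conflicting WAL replay finishes, the same query usually succeeds on retry. A hedged sketch of a retry wrapper, where RecoveryConflictError is a stand-in for whatever exception your driver raises for this cancellation:

```python
import time


class RecoveryConflictError(Exception):
    """Stand-in for the driver error raised when PostgreSQL cancels a replica
    query with 'canceling statement due to conflict with recovery'."""


def run_with_conflict_retry(query_fn, retries: int = 3, backoff_s: float = 0.5):
    """Re-run a replica read that was cancelled by WAL replay.
    These cancellations are transient, so a short linear backoff is enough."""
    for attempt in range(retries):
        try:
            return query_fn()
        except RecoveryConflictError:
            if attempt == retries - 1:
                raise                       # out of retries: surface the error
            time.sleep(backoff_s * (attempt + 1))


# Simulated query that fails once with a recovery conflict, then succeeds
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RecoveryConflictError("canceling statement due to conflict with recovery")
    return [{"id": 1}]


print(run_with_conflict_retry(flaky_query, backoff_s=0.01))  # [{'id': 1}]
```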

check_replication_status.sql · SQL
-- ============================================================
-- Run this on your PRIMARY to inspect all connected replicas.
-- pg_stat_replication is the single most important view for
-- understanding the health of your replication setup.
-- ============================================================

SELECT
    client_addr,                         -- IP address of the replica
    application_name,                    -- Name set by the replica (useful when you have many)
    state,                               -- 'streaming' is healthy; 'catchup' means it's behind
    sent_lsn,                            -- Last WAL position sent from primary
    write_lsn,                           -- Last WAL position written to disk on replica
    flush_lsn,                           -- Last WAL position flushed (durable) on replica
    replay_lsn,                          -- Last WAL position actually applied to replica data files
    -- Calculate lag in bytes — this is raw distance, not time
    pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replication_lag_bytes,
    -- Calculate lag as a human-readable time estimate
    -- write_lag: network delay; flush_lag: disk delay; replay_lag: apply delay
    write_lag,
    flush_lag,
    replay_lag,
    sync_state                           -- 'async' or 'sync' — see the comparison table below
FROM
    pg_stat_replication
ORDER BY
    replay_lag DESC NULLS LAST;          -- Put the most-lagged replica first

-- ============================================================
-- Run this on your REPLICA to check its own recovery state.
-- pg_is_in_recovery() = true means this server IS a replica.
-- ============================================================

SELECT
    pg_is_in_recovery()                  AS is_replica,
    pg_last_wal_receive_lsn()            AS last_wal_received,
    pg_last_wal_replay_lsn()             AS last_wal_replayed,
    -- This is the lag in bytes from what was received vs replayed
    pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_queue_bytes,
    pg_last_xact_replay_timestamp()      AS last_transaction_replayed_at,
    -- How many seconds behind is this replica right now?
    EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
▶ Output
-- On PRIMARY (pg_stat_replication):
client_addr | application_name | state | replication_lag_bytes | write_lag | flush_lag | replay_lag | sync_state
-------------+------------------+-----------+-----------------------+------------------+------------------+------------------+------------
10.0.1.22 | replica-us-east | streaming | 0 bytes | 00:00:00.001823 | 00:00:00.002104 | 00:00:00.018431 | async
10.0.1.45 | replica-us-west | streaming | 16384 bytes | 00:00:00.002011 | 00:00:00.003982 | 00:00:00.312044 | async

-- On REPLICA (pg_last_* functions):
is_replica | last_wal_received | last_wal_replayed | replay_queue_bytes | last_transaction_replayed_at | lag_seconds
-----------+-------------------+-------------------+--------------------+-------------------------------+-------------
t | 0/A1234F00 | 0/A1234D80 | 384 | 2024-11-14 09:23:41.882+00 | 0.31
⚠️
Watch Out: replay_lag and write_lag measure different things. write_lag measures network + disk latency from primary to replica. replay_lag measures everything, including the time to apply WAL to data files. A high replay_lag with a low write_lag means your replica's disk is struggling to keep up with WAL application — consider a faster SSD or reducing concurrent writes on the primary.

Setting Up a Production-Grade Streaming Replica from Scratch

Setting up a replica isn't just running one command — there's a specific sequence of steps and each one has a production-relevant reason. We'll walk through it properly.

First, the primary needs the right configuration. wal_level must be replica or logical (not minimal), max_wal_senders must be greater than zero (set it to at least the number of replicas plus 2 for headroom), and wal_keep_size (PostgreSQL 13+) or wal_keep_segments ensures WAL files aren't deleted before the replica has consumed them — critical during network interruptions.

Next, you create a dedicated replication user on the primary. Never use a superuser for replication; the replication role only needs the REPLICATION attribute, which limits what it can do if compromised.

The initial data copy uses pg_basebackup — it takes a consistent snapshot of the primary's data directory while the database stays online. This is your replica's starting point. Then you add a standby.signal file and configure postgresql.conf on the replica to point at the primary. The replica boots, connects, requests WAL from where the base backup ended, and streams from there.

PG 12 changed recovery configuration significantly: recovery.conf was absorbed into postgresql.conf. If you're on PG 12+ (you should be), use primary_conninfo in postgresql.conf and create an empty standby.signal file — not recovery.conf.

setup_postgres_replica.sh · BASH
#!/bin/bash
# ============================================================
# STEP 1: Configure the PRIMARY server (run these on primary)
# ============================================================

# Edit postgresql.conf on the primary — these are the minimal
# required settings for streaming replication.
cat >> /etc/postgresql/16/main/postgresql.conf << 'EOF'

# --- Replication settings ---
wal_level = replica          # Enables WAL content needed for replication
max_wal_senders = 5          # Max simultaneous replica connections (+2 buffer)
wal_keep_size = 1024         # Keep 1GB of WAL on disk for lagging replicas
                             # Prevents "requested WAL segment has already been removed"
hot_standby = on             # Allow replicas to serve read queries (default on in PG 10+)

# Optional but important for monitoring:
track_commit_timestamp = on  # Needed for pg_last_committed_xact() on replica
EOF

# Create a dedicated replication user — minimal privileges, dedicated password
psql -U postgres -c "
    CREATE ROLE replication_user
    WITH REPLICATION LOGIN
    PASSWORD 'use-a-real-secret-manager-in-production';
"

# Allow the replica's IP to connect for replication in pg_hba.conf
# Replace 10.0.1.22/32 with your actual replica IP
echo "host replication replication_user 10.0.1.22/32 scram-sha-256" \
    >> /etc/postgresql/16/main/pg_hba.conf

# Reload configuration (no restart needed for pg_hba changes)
pg_ctlcluster 16 main reload


# ============================================================
# STEP 2: Take a base backup (run on the REPLICA server)
# ============================================================

# pg_basebackup copies the primary's data directory live.
# -R flag automatically writes primary_conninfo into postgresql.conf
#    AND creates the standby.signal file — huge time saver.
# -P shows progress. -Xs streams WAL during the backup itself
#    so no WAL is missed between backup start and end.
pg_basebackup \
    --host=10.0.1.10 \
    --port=5432 \
    --username=replication_user \
    --pgdata=/var/lib/postgresql/16/main \
    --wal-method=stream \
    --write-recovery-conf \
    --checkpoint=fast \
    --progress \
    --verbose

# Output during backup:
# pg_basebackup: initiating base backup, waiting for checkpoint to complete
# pg_basebackup: checkpoint completed
# pg_basebackup: write-ahead log start point: 0/3000028 on timeline 1
# pg_basebackup: starting background WAL receiver
# 30521/30521 kB (100%), 1/1 tablespace
# pg_basebackup: write-ahead log end point: 0/3000100
# pg_basebackup: syncing data to disk ...
# pg_basebackup: renaming backup_manifest.tmp to backup_manifest
# pg_basebackup: base backup completed


# ============================================================
# STEP 3: Verify the replica auto-configured correctly
# (pg_basebackup -R does this, but let's see what it wrote)
# ============================================================
grep -A5 'primary_conninfo' /var/lib/postgresql/16/main/postgresql.conf

# Expected output:
# primary_conninfo = 'host=10.0.1.10 port=5432 user=replication_user \
#                     password=use-a-real-secret-manager-in-production \
#                     application_name=replica-us-east'

ls -la /var/lib/postgresql/16/main/standby.signal
# Expected output:
# -rw------- 1 postgres postgres 0 Nov 14 09:15 standby.signal
# An empty file — its mere presence tells PostgreSQL "I am a replica"


# ============================================================
# STEP 4: Start the replica
# ============================================================
pg_ctlcluster 16 main start

# Tail the logs to confirm it connected and started streaming:
tail -f /var/log/postgresql/postgresql-16-main.log

# Healthy startup output:
# LOG:  entering standby mode
# LOG:  redo starts at 0/3000028
# LOG:  consistent recovery state reached at 0/3000100
# LOG:  database system is ready to accept read-only connections
# LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
▶ Output
# On PRIMARY — verify replica is connected and streaming:
psql -U postgres -c "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"

client_addr | state | replay_lag
-------------+-----------+------------
10.0.1.22 | streaming | 00:00:00.021

# On REPLICA — confirm it's in recovery and serving reads:
psql -U postgres -c "SELECT pg_is_in_recovery(), pg_last_xact_replay_timestamp();"

pg_is_in_recovery | pg_last_xact_replay_timestamp
-------------------+-------------------------------
t | 2024-11-14 09:23:55.123+00
⚠️
Pro Tip: Set application_name in primary_conninfo. Add application_name=replica-us-east to your primary_conninfo string. It shows up in pg_stat_replication's application_name column, which is invaluable when you're debugging which of your five replicas is lagging. Without it, all replicas show as 'walreceiver' and you're guessing.

Routing Reads to Replicas: Connection Pooling, Drivers and Lag-Aware Logic

Having a replica running is only half the job. The harder part is routing traffic correctly — and doing it in a way that doesn't silently corrupt your application's data consistency guarantees.

The naive approach is splitting reads and writes at the application level: your ORM sends SELECT to a read DSN and everything else to a write DSN. This works, but it breaks for any pattern where you write and immediately read your own write — the replica may not have replayed that WAL yet, so you read stale data. Classic example: user registers, you redirect to their profile page, the profile page reads from the replica, and it 404s because the replica hasn't replayed the INSERT yet.

There are three increasingly sophisticated solutions: (1) Read critical post-write queries from the primary for a short window. (2) Use synchronous commit for specific transactions. (3) Track the LSN of your write and wait for the replica to catch up past it before reading from it.
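Strategy (1) can be sketched as a small router that pins a user's reads to the primary for a short window after any write. StickyPrimaryRouter is a hypothetical helper, and a real deployment would keep the last-write timestamps in a shared store such as Redis rather than process memory:

```python
import time


class StickyPrimaryRouter:
    """After a user writes, pin that user's reads to the primary for a short
    window; outside the window, reads go to a replica. In-process sketch only:
    production code needs a shared cache so every app server sees the pin."""

    def __init__(self, pin_seconds: float = 5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock                      # injectable for testing
        self._last_write: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"                    # recent write: replica may be stale
        return "replica"


router = StickyPrimaryRouter(pin_seconds=5.0)
router.record_write("user-42")
print(router.target_for_read("user-42"))   # primary
print(router.target_for_read("user-99"))   # replica
```

Pick the pin window from your observed p99 replication lag, with headroom; a window shorter than real lag reintroduces the stale-read bug.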

For connection management, PgBouncer sits in front of your databases and provides connection pooling — essential because each PostgreSQL connection consumes ~5-10MB of RAM and a backend process. For read/write splitting with automatic failover awareness, tools like HAProxy with custom health checks, or purpose-built solutions like RDS Proxy or pgpool-II, are common in production.

The cleanest modern approach for application-level routing is the drivers' built-in target-server selection — psycopg3 and asyncpg accept target_session_attrs=read-write or prefer-standby in the connection string, and the JDBC driver offers the equivalent targetServerType parameter.

replica_aware_database_client.py · PYTHON
import psycopg
from psycopg.rows import dict_row
import logging
from contextlib import contextmanager
from typing import Generator

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ============================================================
# Connection strings for primary and replica.
# In production, pull these from environment variables or
# a secrets manager — never hardcode credentials.
# ============================================================
PRIMARY_DSN = (
    "host=10.0.1.10 "
    "port=5432 "
    "dbname=app_production "
    "user=app_user "
    "password=secret "
    "target_session_attrs=read-write"  # Guarantees we connect to writable primary
)

REPLICA_DSN = (
    "host=10.0.1.22,10.0.1.45 "        # List multiple replicas — driver picks one
    "port=5432 "
    "dbname=app_production "
    "user=app_user "
    "password=secret "
    "target_session_attrs=prefer-standby"  # Prefer replica, fall back to primary
)


@contextmanager
def get_primary_connection() -> Generator[psycopg.Connection, None, None]:
    """Use for all writes and reads where consistency is critical."""
    with psycopg.connect(PRIMARY_DSN, row_factory=dict_row) as conn:
        yield conn


@contextmanager
def get_replica_connection() -> Generator[psycopg.Connection, None, None]:
    """Use for reads that can tolerate a small amount of lag."""
    with psycopg.connect(REPLICA_DSN, row_factory=dict_row) as conn:
        yield conn


def get_replica_lag_seconds(replica_conn: psycopg.Connection) -> float:
    """
    Check how many seconds behind this replica is before trusting
    it for a time-sensitive read. Returns 0.0 if on primary.
    """
    with replica_conn.cursor() as cursor:
        cursor.execute("""
            SELECT
                CASE
                    WHEN pg_is_in_recovery() THEN
                        EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
                    ELSE
                        0  -- We're actually on the primary, no lag
                END AS lag_seconds;
        """)
        result = cursor.fetchone()
        return float(result['lag_seconds'] or 0)


def create_user_and_fetch_profile(username: str, email: str) -> dict:
    """
    Demonstrates the write-then-read consistency problem.

    WRONG pattern: write to primary, immediately read from replica.
    The replica may not have replayed the INSERT yet.

    RIGHT pattern: write to primary, confirm WAL position, then
    either read from primary or wait for replica to catch up.
    """
    created_user_id = None
    write_lsn = None

    # --- WRITE: Always goes to primary ---
    with get_primary_connection() as primary_conn:
        with primary_conn.cursor() as cursor:
            cursor.execute(
                """
                INSERT INTO users (username, email, created_at)
                VALUES (%s, %s, now())
                RETURNING id;
                """,
                (username, email)
            )
            created_user_id = cursor.fetchone()['id']
            primary_conn.commit()

            # Capture the LSN immediately after commit.
            # This is the WAL position the replica must reach
            # before it will have our new user.
            cursor.execute("SELECT pg_current_wal_lsn() AS current_lsn;")
            write_lsn = cursor.fetchone()['current_lsn']
            logger.info(f"User {created_user_id} created. WAL LSN at write: {write_lsn}")

    # --- READ: Use replica, but verify it's caught up past our write ---
    with get_replica_connection() as replica_conn:
        with replica_conn.cursor() as cursor:
            # Wait for the replica to replay past our write LSN.
            # pg_wal_replay_wait() is a procedure, so invoke it with
            # CALL (not SELECT). It raises an error if the timeout
            # (in milliseconds) expires before the LSN is reached.
            # A 5-second ceiling is production-reasonable.
            cursor.execute(
                """
                CALL pg_wal_replay_wait(%s, timeout => 5000);
                """,
                (write_lsn,)
                # NOTE: pg_wal_replay_wait is PostgreSQL 17+.
                # For older versions, use a polling loop with
                # pg_last_wal_replay_lsn() and pg_wal_lsn_diff().
            )

            lag_after_wait = get_replica_lag_seconds(replica_conn)
            logger.info(f"Replica lag after wait: {lag_after_wait:.3f}s")

            # Now safe to read — replica has replayed our write
            cursor.execute(
                """
                SELECT id, username, email, created_at
                FROM users
                WHERE id = %s;
                """,
                (created_user_id,)
            )
            user_profile = cursor.fetchone()
            return dict(user_profile)


# ============================================================
# ANALYTICS QUERIES — these are perfect replica candidates:
# large scans, aggregate reports, no freshness requirement.
# ============================================================
def get_daily_signups_report(days_back: int = 30) -> list[dict]:
    """Heavy aggregation — ideal for replica to absorb."""
    with get_replica_connection() as conn:
        lag = get_replica_lag_seconds(conn)
        if lag > 60:
            # If replica is more than 60s behind, warn but continue —
            # for a daily report, 60s stale data is acceptable.
            logger.warning(f"Replica is {lag:.1f}s behind. Report may be slightly stale.")

        with conn.cursor() as cursor:
            cursor.execute(
                """
                SELECT
                    date_trunc('day', created_at) AS signup_date,
                    COUNT(*) AS new_users
                FROM users
                WHERE created_at >= now() - make_interval(days => %s)
                GROUP BY 1
                ORDER BY 1 DESC;
                """,
                (days_back,)
            )
            return [dict(row) for row in cursor.fetchall()]


if __name__ == "__main__":
    # Demonstrate lag-safe write-then-read
    profile = create_user_and_fetch_profile("alice_dev", "alice@example.com")
    print(f"Profile fetched from replica: {profile}")

    # Demonstrate analytics query routed to replica
    report = get_daily_signups_report(days_back=7)
    print(f"Daily signups (last 7 days): {report}")
▶ Output
INFO:replica_aware_database_client:User 4821 created. WAL LSN at write: 0/A2341F80
INFO:replica_aware_database_client:Replica lag after wait: 0.019s
Profile fetched from replica: {'id': 4821, 'username': 'alice_dev', 'email': 'alice@example.com', 'created_at': datetime.datetime(2024, 11, 14, 9, 24, 1, 123456, tzinfo=datetime.timezone.utc)}

Daily signups (last 7 days): [
{'signup_date': datetime.datetime(2024, 11, 14, ...), 'new_users': 312},
{'signup_date': datetime.datetime(2024, 11, 13, ...), 'new_users': 298},
{'signup_date': datetime.datetime(2024, 11, 12, ...), 'new_users': 341}
]
🔥
Interview Gold: pg_wal_replay_wait() is a PostgreSQL 17 gem. PostgreSQL 17 introduced the procedure pg_wal_replay_wait(lsn, timeout), which blocks until the replica has replayed past a given LSN. Before PG 17, you had to poll pg_last_wal_replay_lsn() in a loop — ugly and chatty. Mentioning this in interviews signals you track PostgreSQL releases, not just textbook knowledge.
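For versions without that procedure, the polling fallback looks like this. A hedged sketch with the database call injected so the loop can be shown in isolation; wait_for_replay and get_replay_lsn are illustrative names, not PostgreSQL APIs, and in real code get_replay_lsn would run SELECT pg_last_wal_replay_lsn() on the replica:

```python
import time


def wait_for_replay(target_lsn: int, get_replay_lsn, timeout_s: float = 5.0,
                    poll_interval_s: float = 0.05, clock=time.monotonic,
                    sleep=time.sleep) -> bool:
    """Poll the replica until its replay position reaches target_lsn,
    or give up after timeout_s. LSNs here are absolute byte positions."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if get_replay_lsn() >= target_lsn:
            return True                     # replica has our write: safe to read
        sleep(poll_interval_s)
    return False                            # caller should fall back to primary


# Simulated replica whose replay position advances 128 bytes per poll
state = {"lsn": 0}
def fake_replay_lsn() -> int:
    state["lsn"] += 128
    return state["lsn"]


print(wait_for_replay(512, fake_replay_lsn, sleep=lambda s: None))  # True
```

The chattiness is the real cost: each poll is a round-trip to the replica, which is exactly what the PG 17 procedure eliminates.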

Synchronous vs Asynchronous Replication — The Durability vs Performance Trade-off

By default, PostgreSQL replication is asynchronous: the primary commits a transaction and returns success to the client before confirming the replica has received or replayed that WAL. This is fast — sub-millisecond commits — but if the primary crashes at exactly the wrong moment and you fail over, the last few committed transactions may never have reached the replica. From the new primary's perspective, that data is gone.

Synchronous replication changes this guarantee: the primary won't return success to the client until at least one designated synchronous replica has written (or flushed) the WAL. This gives you zero-RPO (Recovery Point Objective) — no committed transaction is ever lost. The cost is latency: every commit now waits for a network round-trip to the replica. On a LAN this might add 1-2ms. Cross-datacenter, it could add 20-80ms per commit — devastating for write-heavy OLTP workloads.
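That latency compounds into a throughput ceiling for any single connection committing serially. A back-of-envelope calculation, using rough figures in the spirit of the numbers above (real workloads regain throughput through concurrent connections, so this is a per-connection bound, not a cluster-wide one):

```python
def max_serial_commits_per_sec(base_commit_ms: float, replica_rtt_ms: float) -> float:
    """Upper bound on commits/sec for one connection committing serially:
    each synchronous commit pays the local commit cost plus a round-trip
    to the synchronous replica before the client sees success."""
    return 1000.0 / (base_commit_ms + replica_rtt_ms)


print(round(max_serial_commits_per_sec(1.0, 0.0)))    # 1000 (async: no replica wait)
print(round(max_serial_commits_per_sec(1.0, 2.0)))    # 333 (sync on a LAN)
print(round(max_serial_commits_per_sec(1.0, 40.0)))   # 24 (sync cross-datacenter)
```

This is why cross-region synchronous replication is usually reserved for the handful of transactions that genuinely need zero RPO.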

The synchronous_commit parameter is PostgreSQL's genius here: it's configurable per-session or per-transaction. You don't have to choose globally. You can require synchronous durability for financial transactions and use async for event logging, in the same application.

The synchronous_standby_names setting controls which replicas participate in synchronous commits, and it supports powerful quorum syntax: ANY 2 (replica1, replica2, replica3) means the primary waits for any 2 of 3 replicas — a quorum approach that tolerates one replica being down without stalling all writes.

One sharp edge: if synchronous_standby_names names a replica that disconnects, the primary will stall ALL commits indefinitely. This has taken down production systems. The fix is monitoring replica connectivity with aggressive alerting and having a runbook to disable synchronous commit if a sync replica fails.

synchronous_replication_config.sql · SQL
-- ============================================================
-- On PRIMARY: configure which replicas must acknowledge commits.
-- The application_name in primary_conninfo must match these names.
-- ============================================================

-- Option A: Wait for ONE specific replica (simple, brittle)
-- ALTER SYSTEM SET synchronous_standby_names = 'replica-us-east';

-- Option B: Quorum — wait for ANY 1 of 2 replicas (resilient)
-- If one replica goes down, commits still succeed via the other.
ALTER SYSTEM SET synchronous_standby_names =
    'ANY 1 (replica-us-east, replica-eu-west)';

SELECT pg_reload_conf(); -- Apply without restart

-- Verify the setting took effect:
SHOW synchronous_standby_names;
-- Result: ANY 1 (replica-us-east, replica-eu-west)


-- ============================================================
-- Per-transaction synchronous_commit control.
-- Use this to selectively require durability without
-- making every transaction pay the latency cost.
-- ============================================================

-- Example: financial transfer requires synchronous durability
BEGIN;
    SET LOCAL synchronous_commit = on;   -- This transaction: wait for replica ACK
    UPDATE accounts SET balance = balance - 500.00 WHERE account_id = 1001;
    UPDATE accounts SET balance = balance + 500.00 WHERE account_id = 2004;
    INSERT INTO transfer_log (from_account, to_account, amount, transferred_at)
        VALUES (1001, 2004, 500.00, now());
COMMIT;
-- The COMMIT here will block until at least 1 of the 2 replicas acknowledges
-- the WAL write. Safe to tell the user "your transfer is confirmed."

-- Example: event tracking can tolerate some loss — use async for speed
BEGIN;
    SET LOCAL synchronous_commit = off;  -- This transaction: don't wait for replica
    INSERT INTO page_view_events (user_id, page_url, viewed_at)
        VALUES (8821, '/products/widget', now());
COMMIT;
-- Returns immediately. In a crash scenario, this event might be lost.
-- That's an acceptable trade-off for non-critical analytics data.


-- ============================================================
-- Monitor synchronous replication ACK status in real time
-- ============================================================
SELECT
    application_name,
    sync_state,              -- 'sync', 'potential', or 'async'
    sent_lsn,
    flush_lsn,
    replay_lsn,
    -- 'sync': currently the active synchronous replica
    -- 'potential': would become sync if the active one fails
    -- 'async': purely asynchronous
    CASE sync_state
        WHEN 'sync'      THEN 'ACTIVE sync replica — commits wait for this one'
        WHEN 'potential' THEN 'STANDBY sync replica — takes over if sync fails'
        WHEN 'async'     THEN 'Pure async — commits never wait for this one'
    END AS sync_role_description
FROM
    pg_stat_replication
ORDER BY
    sync_state;
▶ Output
SHOW synchronous_standby_names;
synchronous_standby_names
------------------------------------------
ANY 1 (replica-us-east, replica-eu-west)

-- pg_stat_replication sync states:
application_name | sync_state | sync_role_description
--------------------+------------+-----------------------------------------------
replica-us-east | sync | ACTIVE sync replica — commits wait for this one
replica-eu-west | potential | STANDBY sync replica — takes over if sync fails
⚠️
Watch Out: A disconnected synchronous replica stalls ALL writes. If your synchronous replica goes offline and you don't have a quorum fallback or a monitoring alert, your primary will queue all new commits indefinitely — no timeout, no automatic degradation. Set up an alert on pg_stat_replication WHERE sync_state = 'sync' returning zero rows, and have a break-glass runbook: ALTER SYSTEM SET synchronous_standby_names = ''; SELECT pg_reload_conf(); to immediately fall back to async.
| Aspect | Asynchronous Replication | Synchronous Replication |
|---------------------------------|------------------------------------------|----------------------------------------------------------|
| Commit latency | Unaffected — returns before replica ACKs | Adds 1-80ms per commit (network round-trip to replica) |
| Data loss on primary crash (RPO)| Possible — last few transactions lost | Zero — no committed transaction is ever lost |
| Write throughput | Full primary throughput | Limited by slowest sync replica's throughput |
| Replica failure impact | None — primary continues normally | All writes stall until replica reconnects or config changes |
| Best for | Analytics replicas, reporting, read scaling | Financial data, compliance workloads, zero-RPO requirements |
| synchronous_commit setting | off / local | remote_write / on / remote_apply |
| Quorum support | N/A | ANY N (replica1, replica2, ...) syntax |
| Typical production usage | 90%+ of replica deployments | Used alongside async for mixed workloads |

🎯 Key Takeaways

  • PostgreSQL streaming replication is WAL-based: every change is first written to WAL, then streamed to replicas via a persistent walsender/walreceiver TCP connection. Replay is sequential and single-threaded, which is your ceiling on how fast lag can be recovered.
  • Replication lag has three distinct components — write_lag (network), flush_lag (disk on replica), replay_lag (WAL application). Each points to a different bottleneck. Monitoring all three in pg_stat_replication is non-negotiable for production operations.
  • The write-then-read consistency trap is the #1 real-world bug caused by replicas. The correct solution is LSN-based read-after-write validation using pg_wal_replay_wait() in PG17+ (or an LSN polling loop on older versions), not 'sleep for 100ms and hope' or always routing to primary.
  • Synchronous replication with a single named standby is an availability trap: one replica failure stalls all primary writes indefinitely. Always use quorum syntax (ANY N of M) in synchronous_standby_names for any synchronous setup with production SLAs.

⚠ Common Mistakes to Avoid

  • Mistake 1: Not setting wal_keep_size high enough — Symptom: replica fails with 'requested WAL segment has already been removed' after any network hiccup or maintenance window, forcing a full pg_basebackup re-sync. Fix: Set wal_keep_size to at least 1-2GB, AND configure a WAL archive (archive_mode = on with archive_command pointing to S3 or similar) so replicas can fetch historical WAL from the archive instead of failing outright. WAL archiving is your safety net.
  • Mistake 2: Reading from a replica immediately after a write without accounting for replication lag — Symptom: your application writes a record then immediately reads it back from the replica, getting a 'not found' result or stale data — a read-your-writes consistency violation. Users see their own changes disappear. Fix: Either use pg_wal_replay_wait() (PG17+) to wait for the replica to catch up to your write's LSN before reading, or route post-write reads to the primary for a short session window. Never assume replica freshness after a write without explicit verification.
  • Mistake 3: Forgetting that long-running queries on hot standby get cancelled — Symptom: replica users see ERROR: canceling statement due to conflict with recovery sporadically and unpredictably, especially during heavy writes on the primary. Fix: Set hot_standby_feedback = on so the replica tells the primary 'don't vacuum rows my active queries are reading'. Also increase max_standby_streaming_delay and max_standby_archive_delay. Be aware that hot_standby_feedback can cause table bloat on the primary by preventing VACUUM from cleaning dead rows needed by replica queries — it's a trade-off you have to consciously accept.
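The fix for Mistake 2 can be sketched as a two-step flow, assuming PostgreSQL 17+ and two connections (one to the primary, one to the replica). The table, column names, and LSN value below are illustrative:

```sql
-- On the PRIMARY: perform the write, then capture the current WAL position.
-- Any LSN taken after COMMIT is a safe "wait target" covering that write.
INSERT INTO users (id, name) VALUES (42, 'alice');
COMMIT;
SELECT pg_current_wal_lsn();             -- e.g. '0/3000628'; hand this to the reader

-- On the REPLICA: block (up to 200 ms) until that LSN has been replayed.
-- pg_wal_replay_wait() is a procedure (PG17+) and raises an error on timeout
-- unless its third argument (no_error) is set to true.
CALL pg_wal_replay_wait('0/3000628', 200);
SELECT * FROM users WHERE id = 42;       -- guaranteed to see the write
```

In practice the LSN is carried through the application layer (e.g. stashed in the user's session after a write) rather than copied by hand between two psql prompts.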

Interview Questions on This Topic

  • Q: Walk me through what happens at the byte level when a transaction commits on a PostgreSQL primary and how that change propagates to a streaming replica. Where can replication lag accumulate in this pipeline?
  • Q: Your application writes a user record and immediately redirects the user to their profile page, which reads from a replica. Users are intermittently seeing 404s on their own profiles. What's happening and how would you fix it without routing all reads to the primary?
  • Q: You've configured synchronous replication with a single synchronous replica. At 2am the replica host crashes. What happens to your primary? What's your immediate remediation, and how would you redesign the setup to prevent this class of outage?
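For the first question, any answer about where lag accumulates should be grounded in the primary-side view. This query is a minimal example of reading all three lag components at once:

```sql
-- Run on the PRIMARY: one row per connected standby.
SELECT application_name,
       state,
       write_lag,     -- WAL sent but not yet written on the standby (network)
       flush_lag,     -- written but not yet flushed to the standby's disk
       replay_lag,    -- flushed but not yet applied (not yet visible to queries)
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```

Each column maps to one stage of the pipeline the question asks about: network transfer, standby fsync, and single-threaded WAL replay.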

Frequently Asked Questions

Can a PostgreSQL read replica fall behind and how do you fix it?

Yes — replication lag accumulates when the primary generates WAL faster than the replica can replay it. This is most common during bulk writes, large batch jobs, or when the replica's disk is under pressure. Check replay_lag in pg_stat_replication on the primary. To reduce lag: ensure the replica has equivalent or better disk I/O than the primary, avoid running heavy analytical queries on the replica during peak write periods, and on PG15+ make sure WAL prefetching (recovery_prefetch) is enabled so replay spends less time stalled on random page reads.
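The primary-side replay_lag check has a replica-side counterpart, useful when you can only reach the standby:

```sql
-- Run on the REPLICA: how stale is this standby right now?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay,
       pg_last_wal_receive_lsn() AS received_lsn,   -- newest WAL received
       pg_last_wal_replay_lsn()  AS replayed_lsn;   -- newest WAL applied
```

One caveat: pg_last_xact_replay_timestamp() reports the commit time of the last replayed transaction, so on a write-idle primary this "delay" grows even when the replica is fully caught up. Compare the two LSNs to distinguish real lag from idleness.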

Does a PostgreSQL read replica need to be the same hardware spec as the primary?

For a pure read replica, you can sometimes under-provision CPU compared to the primary since it's not handling writes or WAL generation overhead. However, disk I/O on the replica must be at least as fast as the primary because WAL replay is disk-bound. If the replica also needs to serve heavy analytical queries, it may actually need MORE RAM than the primary to keep large working sets cached without evicting the pages WAL replay needs from shared_buffers.

What's the difference between a read replica and a standby in PostgreSQL?

They're the same thing — the terminology is interchangeable in PostgreSQL's world. A 'hot standby' is a replica that accepts read queries while in recovery mode. It's called a 'standby' because it can be promoted to primary if the primary fails (via pg_promote() or a failover tool like Patroni). 'Read replica' is the application-centric term emphasising its role in read scaling. Every read replica is a hot standby, but a hot standby's primary purpose is high availability — read scaling is a bonus.
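A quick way to confirm which role a given connection landed on — and the manual promotion call mentioned above:

```sql
SELECT pg_is_in_recovery();   -- true on a standby, false on the primary

-- On the standby, as superuser, to promote it to a standalone primary:
-- SELECT pg_promote();       -- returns true once promotion completes
```

Failover tools like Patroni wrap pg_promote() with health checks and fencing; calling it by hand is for break-glass situations.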

TheCodeForge Editorial Team (Verified Author)

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
