PostgreSQL Read Replicas — Lag Caused 400 Double Charges
- Replication lag is not a failure mode — it's a design parameter. Define your RPO and SLA before choosing async vs sync replication, not after your first stale-read incident.
- Any read that gates a financial write goes to the primary. No exceptions. Not even when the primary CPU is spiking. A double-charge is ten times more expensive to resolve than an extra primary read.
- Reach for read replicas when your pg_stat_activity shows read queries holding connections and CPU while write latency is climbing — not when someone just says 'we need to scale'. The symptom is the signal.
- A read replica is a hot standby that applies WAL from the primary and serves read queries — but it's always behind the primary by some replication lag.
- Streaming replication: primary runs walsender → replica runs walreceiver → recovery process replays WAL. Lag accumulates at network, disk I/O, or apply stage.
- Replica reads are eventually consistent — writing to primary then SELECTing from replica is a race condition that will lose updates.
- Performance insight: a 2048-bit RSA key encrypts 256 bytes; OAEP padding adds ~42 bytes overhead, reducing max plaintext to 214 bytes.
- Production pitfall: routing a read that gates a financial transaction to the replica — replication lag as low as 200ms causes double charges.
- Biggest mistake: treating synchronous_commit as a panacea — it only confirms WAL arrived on replica, not that it's applied. Use synchronous_commit = remote_apply for zero-lag reads, but accept 2-5x write latency penalty.
Quick Replication Debug Cheat Sheet
Check if replica is actually replicating
SELECT pg_is_in_recovery() AS is_standby, pg_last_wal_receive_lsn() AS received, pg_last_wal_replay_lsn() AS replayed, now() - pg_last_xact_replay_timestamp() AS lag_seconds;Run on primary: SELECT application_name, state, sync_state, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;Find what's blocking replication on replica
SELECT pid, pg_blocking_pids(pid) AS blockers, now() - query_start AS duration, query FROM pg_stat_activity WHERE wait_event LIKE '%recovery%';SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE wait_event = 'recovery' AND now() - query_start > interval '30 seconds';Check if replication slot is causing disk bloat
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) / (1024*1024) AS retained_mb FROM pg_replication_slots;SELECT pg_drop_replication_slot('slot_name') WHERE active = false AND retained_mb > 10240;Test if replica can be promoted
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS apply_lag_bytes;SHOW primary_conninfo; SHOW restore_command;Production Incident
/ force_primary / hint that PgBouncer's routing rules respected. Permanent fix: any read that gates a financial write must go to primary. Added a ReadPolicy.STRONG annotation in the DAL that bypassed the replica pool entirely. Bumped synchronous_commit = remote_apply for the payments schema specifically using ALTER DATABASE payments_db SET synchronous_commit = 'remote_apply'; to ensure writes were applied on replica before acknowledging to the application.Production Debug GuideQuick reference for diagnosing replica issues in production
pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots;. If a replica is down and using a slot, the primary retains all WAL indefinitely. Drop the slot with SELECT pg_drop_replication_slot('slot_name'); then rebuild the replica.pg_is_in_recovery() should return false.Your primary database is not a read server that also handles writes — treating it like one is how you end up with a 4AM PagerDuty alert because a reporting query from the analytics team locked up your checkout flow. Read replicas aren't just a scaling trick; they're an architectural boundary that forces you to think honestly about which operations actually need the freshest data and which ones are lying to themselves about it.
The problem read replicas solve is brutally simple: PostgreSQL's primary instance is single-writer by design. Every SELECT that hits the primary competes for buffer pool space, CPU cycles, and I/O bandwidth with every INSERT, UPDATE, and DELETE in your write path. At low load this is invisible. At production scale with mixed OLTP and reporting workloads running against the same instance, you start seeing query plans degrade, autovacuum fall behind, and connection counts spike in ways that look like traffic problems but are really resource contention problems. The old hack was to run analytics queries at 2AM with pg_cron and hope the batch window held. It doesn't hold. It never holds.
After reading this you'll be able to set up streaming replication with correct recovery configuration, route read traffic intelligently without trusting application developers to do it manually, diagnose and mitigate replication lag before it causes stale-read incidents, design a promotion strategy that survives primary failure without data loss, and know the exact scenarios where read replicas will make your architecture worse, not better.
Streaming Replication Internals: What's Actually Happening Under the Hood
Before you configure anything, you need to know what PostgreSQL is actually shipping between instances. Streaming replication works by sending WAL — Write-Ahead Log — records from primary to standby in near-real-time. Every change made to the primary is first written to WAL before it touches the actual data files. The standby connects to the primary using the replication protocol, receives these WAL records as a stream, and applies them to its own data files. The standby is replaying the exact same operations the primary executed, in the same order, against its own copy of the data.
This has a critical implication that trips up almost everyone: the replica isn't just a copy of the data files — it's a copy of the primary's entire physical state, including every bloated table, every suboptimal index, every dead tuple autovacuum hasn't cleaned up yet. You can't add an index on the replica that doesn't exist on the primary (unless you use logical replication, which is a different beast entirely). You also can't run DDL on a streaming replica. It's physically coupled to the primary's state.
The WAL sender process on the primary and the WAL receiver process on the replica maintain a persistent TCP connection. The replica's recovery process applies the incoming WAL records. You can monitor the health of this pipeline on the primary via pg_stat_replication and on the replica via pg_stat_wal_receiver — these two views tell you everything you need to know about whether your replication pipeline is healthy. If you're not monitoring both, you're flying blind.
One more nuance: replication slots vs wal_keep_size. A replication slot is a persistent marker on the primary that says 'don't recycle WAL until this replica has consumed it'. Without a slot, you rely on wal_keep_size — a fixed retention window. If a replica disconnects longer than wal_keep_size covers, you need a full pg_basebackup to resync. Slots prevent that, but if the replica never reconnects, the primary's disk will fill. Choose based on your tolerance for full resync vs disk space.
-- io.thecodeforge.postgres.replication — Production setup -- ============================================================ -- PRIMARY configuration (postgresql.conf) -- ============================================================ -- Enable WAL archiving level — 'replica' is sufficient for streaming replication ALTER SYSTEM SET wal_level = 'replica'; -- Allow at least as many WAL senders as you have replicas + 1 for pg_basebackup ALTER SYSTEM SET max_wal_senders = 5; -- Keep enough WAL segments to allow replica to catch up after a brief disconnect -- At 16MB per segment, this is 1GB of WAL retained — tune based on your write volume ALTER SYSTEM SET wal_keep_size = '1GB'; -- Enable hot_standby on primary (needed so replica can serve reads) -- Technically a replica setting, but set it on primary too for when roles flip ALTER SYSTEM SET hot_standby = on; -- Reload config — no restart needed for most of these SELECT pg_reload_conf(); -- ============================================================ -- Create replication user (PRIMARY) -- Never use a superuser for replication — least privilege matters -- ============================================================ CREATE USER replicator WITH REPLICATION -- grants REPLICATION privilege specifically LOGIN PASSWORD 'use_a_real_secret_from_vault_not_this'; -- ============================================================ -- pg_hba.conf on PRIMARY — allow replica to connect -- Add this line to pg_hba.conf, then SELECT pg_reload_conf(); -- ============================================================ -- host replication replicator 10.0.1.50/32 scram-sha-256 -- ============================================================ -- On the REPLICA host — take base backup from primary -- Run as postgres OS user on the REPLICA machine -- ============================================================ -- pg_basebackup \ -- --host=10.0.1.10 \ -- --username=replicator \ -- --pgdata=/var/lib/postgresql/15/main \ -- --wal-method=stream \ -- --write-recovery-conf \ -- --checkpoint=fast \ -- --progress -- -- --write-recovery-conf automatically creates standby.signal -- and writes primary_conninfo into postgresql.auto.conf -- ============================================================ -- REPLICA configuration (postgresql.conf) -- ============================================================ -- Connection string to primary — vault this credential properly ALTER SYSTEM SET primary_conninfo = 'host=10.0.1.10 port=5432 user=replicator password=your_secret application_name=replica_01'; -- hot_standby = on allows read queries while in recovery mode -- Without this the replica is dark — no reads, just applying WAL ALTER SYSTEM SET hot_standby = on; -- Send feedback to primary about which transactions are still active on replica -- This prevents primary from vacuuming rows the replica still needs for queries -- Trade-off: primary retains dead tuples longer — monitor table bloat ALTER SYSTEM SET hot_standby_feedback = on; -- How long to wait before cancelling a conflicting query on the replica -- -1 means wait indefinitely — use this only if your replica queries are short -- In practice, set to 30s-120s depending on your reporting query SLAs ALTER SYSTEM SET max_standby_streaming_delay = '120s'; -- ============================================================ -- Monitoring replication health — run on PRIMARY -- ============================================================ SELECT application_name, -- identifies which replica client_addr, -- replica's IP state, -- 'streaming' is healthy; 'catchup' means it's behind sent_lsn, -- how far primary has sent WAL write_lsn, -- how far replica has written WAL to disk flush_lsn, -- how far replica has fsync'd WAL replay_lsn, -- how far replica has APPLIED WAL pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes, sync_state -- 'async' or 'sync' FROM pg_stat_replication; -- ============================================================ -- Monitoring lag in seconds — run on REPLICA -- ============================================================ SELECT CASE WHEN pg_is_in_recovery() THEN 'replica' ELSE 'primary' END AS role, EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS lag_seconds, pg_last_xact_replay_timestamp() AS last_applied_at FROM pg_stat_wal_receiver;
-- application_name | client_addr | state | sync_state | replication_lag_bytes
-- replica_01 | 10.0.1.50 | streaming | async | 4096
-- On REPLICA (healthy):
-- role | lag_seconds | last_applied_at
-- replica | 0.8 | 2024-03-15 14:32:01.234+00
pg_current_wal_lsn(), replay_lsn)) FROM pg_stat_replication;Connection Routing Without Trusting Your Application Developers
Here's the brutal truth about read replica routing: if you rely on application developers to manually choose between a primary DSN and a replica DSN in their code, you will have primary reads that should be on the replica and replica reads that should be on the primary, and you won't know which is which until something breaks. Application developers are solving business problems. Database topology is not their mental model and it shouldn't have to be.
The right architecture puts routing below the application layer. PgBouncer is the most common first layer — it pools connections and can be configured with multiple backends. But PgBouncer alone doesn't route by intent. For intent-based routing you need either a proxy that understands read/write splitting (Pgpool-II, RDS Proxy, or a sidecar proxy like ProxySQL's PostgreSQL mode) or a connection wrapper in your DAL that maps transaction context to the right pool.
The pattern I've seen work in production at scale: two PgBouncer pools — one pointing to primary, one pointing to replica — with a lightweight DAL wrapper that routes based on transaction mode. Any call inside an explicit BEGIN...COMMIT block goes to primary. Any bare SELECT outside a transaction block goes to the replica pool. This gets you 80% of the benefit with minimal application code changes. The remaining 20% — cases where a SELECT needs fresh data but isn't in a write transaction — you handle with an explicit annotation or by accepting primary reads for those paths.
One anti-pattern to avoid: putting the replica pool behind a load balancer that does health checks but not read/write splitting. You'll end up with reads on primary anyway. Another trap: using statement-based routing (e.g., check if query starts with 'SELECT') fails for SELECT inside transactions, SELECT...FOR UPDATE (needs primary), and CTEs. The transaction-boundary approach is simpler and more reliable.
# io.thecodeforge.postgres.routing — Production routing layer import contextlib from enum import Enum, auto from typing import Generator import psycopg2 import psycopg2.pool from threading import local class ReadPolicy(Enum): EVENTUAL = auto() # replica pool — accepts lag STRONG = auto() # primary pool — fresh data class DatabaseRouter: """ Routes database connections to primary or replica based on context. Thread-local state tracks whether we're inside a write transaction. PgBouncer endpoints: - PRIMARY: pgbouncer-primary:6432 (transaction mode) - REPLICA: pgbouncer-replica:6432 (session mode for replicas) """ def __init__(self, primary_dsn: str, replica_dsn: str, pool_size: int = 20): self._local = local() self._primary_pool = psycopg2.pool.ThreadedConnectionPool( minconn=2, maxconn=pool_size, dsn=primary_dsn, application_name='order_service_primary' ) self._replica_pool = psycopg2.pool.ThreadedConnectionPool( minconn=5, maxconn=pool_size * 2, dsn=replica_dsn, application_name='order_service_replica' ) def _in_write_transaction(self) -> bool: return getattr(self._local, 'write_transaction_active', False) def _get_connection(self, policy: ReadPolicy): # If inside a write transaction, MUST use primary — consistency overrides if self._in_write_transaction() or policy == ReadPolicy.STRONG: return self._primary_pool.getconn(), self._primary_pool return self._replica_pool.getconn(), self._replica_pool @contextlib.contextmanager def write_transaction(self) -> Generator: """All reads inside this block go to primary — prevents stale-read bugs.""" conn = self._primary_pool.getconn() self._local.write_transaction_active = True try: yield conn conn.commit() except Exception: conn.rollback() raise finally: self._local.write_transaction_active = False self._primary_pool.putconn(conn) @contextlib.contextmanager def read_query(self, policy: ReadPolicy = ReadPolicy.EVENTUAL) -> Generator: """Default to replica — caller must opt into STRONG for fresh reads.""" conn, pool = self._get_connection(policy) try: yield conn conn.rollback() # ends implicit transaction, returns to pool cleanly except Exception: conn.rollback() raise finally: pool.putconn(conn) # ============================================================ # Checkout flow with proper routing # ============================================================ def process_checkout(router: DatabaseRouter, cart_id: str, user_id: str) -> dict: # Inventory check — replica is fine, lag acceptable with router.read_query(ReadPolicy.EVENTUAL) as conn: cursor = conn.cursor() cursor.execute( 'SELECT product_id, quantity FROM inventory WHERE cart_id = %s', (cart_id,) ) inventory = cursor.fetchall() if not inventory: raise ValueError('Cart empty') # Write transaction — everything inside hits primary with router.write_transaction() as conn: cursor = conn.cursor() cursor.execute( 'INSERT INTO orders (cart_id, user_id, status) VALUES (%s, %s, %s) RETURNING id', (cart_id, user_id, 'pending') ) order_id = cursor.fetchone()[0] cursor.execute( 'INSERT INTO payments (order_id, status, created_at) VALUES (%s, %s, NOW())', (order_id, 'initiated') ) # Order summary read — replica is fine, display only with router.read_query(ReadPolicy.EVENTUAL) as conn: cursor = conn.cursor() cursor.execute( 'SELECT id, status, created_at FROM orders WHERE id = %s', (order_id,) ) order = cursor.fetchone() return {'order_id': order_id, 'status': order[1] if order else 'pending'}
# When deployed:
# write_transaction() → pgbouncer-primary:6432
# read_query(EVENTUAL) → pgbouncer-replica:6432
# read_query(STRONG) → pgbouncer-primary:6432
#
# pg_stat_activity shows:
# application_name = 'order_service_primary' on primary
# application_name = 'order_service_replica' on replica
# Lets you immediately see which app is hitting which host in monitoring.
Replication Lag: Why It Happens, How to Measure It, When to Panic
Replication lag is not a bug. It's a fundamental property of async replication. What is a bug is building a system that assumes lag is zero, or monitoring for it reactively instead of proactively. I've watched teams add read replicas to solve a performance problem and spend the next month firefighting stale-read bugs that were ten times harder to debug than the original performance problem.
Lag accumulates from three different places and you need to know which one is causing yours. Network lag is the time for WAL bytes to travel from primary to replica — usually <5ms on properly networked instances in the same AZ. I/O lag on the replica is the time to write and fsync incoming WAL segments — on a replica with an undersized disk or competing I/O from read queries, this is where lag balloons. Apply lag is the time for the recovery process to apply WAL records to data pages — on tables with high contention or during bulk loads, apply lag can spike to seconds even with fast I/O.
The real danger is cumulative lag under write bursts. If your primary ingests a bulk load at 200MB/s of WAL for 60 seconds, and your replica can only sustain 150MB/s of apply throughput, you're 6GB of WAL behind when the load stops. At steady state the replica catches up — but for those minutes, any read against the replica is returning data that's potentially minutes old. You need to monitor pg_stat_replication on primary continuously, alert at 10 seconds of lag, and have a circuit breaker that routes to primary automatically when lag exceeds your SLA threshold.
A subtle point: replica lag reported in bytes (pg_wal_lsn_diff) is more precise than lag in seconds (replay_lag). Bytes tell you exactly how much work is pending. Seconds include network jitter and OS scheduling. For alerting, use seconds with a hysteresis (e.g., lag > 30s for 2 consecutive checks). For root cause analysis, examine the three byte-level components.
-- ============================================================ -- Comprehensive lag monitoring view — create on PRIMARY -- ============================================================ CREATE OR REPLACE VIEW v_replication_health AS SELECT r.application_name, r.client_addr, r.state, r.sync_state, -- Byte-level lag breakdown pg_wal_lsn_diff(r.sent_lsn, r.write_lsn) AS network_lag_bytes, pg_wal_lsn_diff(r.write_lsn, r.flush_lsn) AS disk_lag_bytes, pg_wal_lsn_diff(r.flush_lsn, r.replay_lsn) AS apply_lag_bytes, pg_wal_lsn_diff(r.sent_lsn, r.replay_lsn) AS total_lag_bytes, r.replay_lag, -- human time rs.slot_name, rs.active AS slot_active, pg_wal_lsn_diff(pg_current_wal_lsn(), rs.confirmed_flush_lsn) AS slot_lag_bytes FROM pg_stat_replication r LEFT JOIN pg_replication_slots rs ON rs.active_pid = r.pid ORDER BY total_lag_bytes DESC; -- ============================================================ -- Circuit breaker: check if replica is fresh enough for reads -- Run on REPLICA — embed this in your connection pool health check -- ============================================================ CREATE OR REPLACE FUNCTION fn_replica_is_fresh( max_lag_seconds NUMERIC DEFAULT 30 ) RETURNS BOOLEAN AS $$ DECLARE lag_seconds NUMERIC; BEGIN SELECT COALESCE( EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 999999 -- if NULL, replica hasn't replayed anything yet ) INTO lag_seconds; RETURN lag_seconds <= max_lag_seconds; END; $$ LANGUAGE plpgsql STABLE; -- Usage: SELECT fn_replica_is_fresh(30); -- TRUE if lag < 30s -- ============================================================ -- Find what's causing lag spikes: conflicting queries on replica -- Run on REPLICA -- ============================================================ SELECT pid, now() - query_start AS duration, wait_event_type, wait_event, LEFT(query, 120) AS query_preview FROM pg_stat_activity WHERE state != 'idle' AND (wait_event_type = 'Lock' OR wait_event LIKE '%recovery%') ORDER BY duration DESC;
-- application_name | network_lag_bytes | disk_lag_bytes | apply_lag_bytes | total_lag_bytes | replay_lag
-- replica_01 | 0 | 0 | 8192 | 8192 | 00:00:00.8
-- fn_replica_is_fresh(30) on REPLICA:
-- fn_replica_is_fresh
-- t
-- Conflict query on REPLICA:
-- pid | duration | wait_event_type | wait_event | query_preview
-- 1234 | 00:02:15 | Lock | relation | SELECT COUNT(*) FROM orders WHERE status = 'pending'
pg_current_wal_lsn(), confirmed_flush_lsn) AS retained_bytes FROM pg_replication_slots; and alert if retained_bytes exceeds 20GB. If a slot goes inactive and you can't reconnect the replica in minutes, DROP the slot immediately and rebuild the replica.Promotion and Failover: What Actually Happens When Your Primary Dies
Failover is where most PostgreSQL read replica setups reveal they were built with optimism rather than engineering. The scenario: primary becomes unavailable. Every second of downtime is revenue loss. You need to promote a replica to primary, update your connection strings, ensure the old primary never comes back as a rogue writer, and ideally do all of this without data loss. If you haven't rehearsed this, you don't actually have HA — you have a hope.
PostgreSQL's promotion mechanism is clean: on the standby, run pg_ctl promote or create a trigger file, and the replica exits recovery mode and starts accepting writes. The problem is everything around it. Your application's connection pool is still pointing at the dead primary's IP. Other replicas are still trying to stream from the dead primary. If the old primary recovers and starts accepting writes while the new primary is also accepting writes — split-brain — you have two diverging transaction histories and neither node knows about the other's writes.
Patroni is the industry answer to this. It uses a distributed consensus store (etcd or Consul) to maintain a single authoritative leader, handles automatic promotion, updates a virtual IP or DNS entry that your application uses, fences the old primary (prevents it from accepting writes even if it recovers), and handles adding the old primary back as a replica via pg_rewind. If you're running PostgreSQL in production without Patroni or an equivalent, you have a manual failover runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.
A specific gotcha: after promoting a replica, the timeline ID changes. The old primary, if it comes back online, is on a diverged timeline. It cannot rejoin as a replica without pg_rewind, which replays WAL from the divergence point to bring it back into sync. Running pg_rewind requires the old primary to be stopped and requires the new primary to have wal_log_hints = on or data checksums enabled. If those aren't set, pg_rewind fails and you need a full pg_basebackup.
-- ============================================================ -- Pre-failover health check — run on PRIMARY -- ============================================================ SELECT application_name, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS unconfirmed_bytes, replay_lag, sync_state FROM pg_stat_replication ORDER BY unconfirmed_bytes ASC; -- For zero-RPO failover: only promote if unconfirmed_bytes = 0 -- ============================================================ -- Manual promotion (run on REPLICA being promoted) -- ============================================================ -- Shell: pg_ctl promote -D /var/lib/postgresql/15/main -- OR if trigger_file is configured: touch /tmp/postgresql.trigger.5432 -- After promotion, confirm: SELECT pg_is_in_recovery() AS is_still_replica, -- should be false pg_current_wal_lsn() AS current_lsn, current_setting('transaction_read_only') AS read_only; -- ============================================================ -- After promotion: re-sync old primary using pg_rewind -- Run on OLD PRIMARY (now stopped) -- ============================================================ -- pg_rewind \ -- --target-pgdata=/var/lib/postgresql/15/main \ -- --source-server='host=10.0.1.50 user=replicator dbname=postgres' \ -- --progress -- -- Then: echo 'primary_conninfo = ...' >> postgresql.auto.conf -- touch standby.signal -- pg_ctl start -- ============================================================ -- Verify topology after failover — run on new PRIMARY -- ============================================================ SELECT application_name, client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes FROM pg_stat_replication;
-- application_name | state | unconfirmed_bytes | replay_lag
-- replica_01 | streaming | 0 | 00:00:00
-- After promotion (on new primary):
-- is_still_replica | current_lsn | read_only
-- f | 0/5B000028 | off
-- Topology after re-sync:
-- application_name | client_addr | state | lag_bytes
-- old_primary | 10.0.1.10 | streaming | 0
| Aspect | Async Streaming Replication | Sync Streaming Replication | Logical Replication |
|---|---|---|---|
| Replication lag | Milliseconds to seconds under load | Zero — primary waits for replica ack | Seconds to minutes depending on change volume |
| Write throughput impact on primary | Near-zero — fire and forget | Significant — every write waits for remote flush/apply (2-5x latency) | Low — only published tables are replicated |
| RPO on primary failure | Up to several seconds of data loss | Zero data loss (remote_apply) or minimal (remote_write) | Variable — depends on subscription lag |
| Replica structure | Identical physical copy | Identical physical copy | Subscriber can have different indexes, partitioning |
| Can promote to primary? | Yes — full promotion via pg_ctl promote | Yes — zero data loss promotion | No — subscriber can't be promoted |
| DDL on replica | No — follows primary exactly | No — follows primary exactly | No — DDL not replicated automatically |
| Use case | Primary HA replica and read scaling | Financial systems, zero-RPO requirement | Partial replication, cross-version upgrades |
🎯 Key Takeaways
- Replication lag is not a failure mode — it's a design parameter. Define your RPO and SLA before choosing async vs sync replication, not after your first stale-read incident.
- Any read that gates a financial write goes to the primary. No exceptions. Not even when the primary CPU is spiking. A double-charge is ten times more expensive to resolve than an extra primary read.
- Reach for read replicas when your pg_stat_activity shows read queries holding connections and CPU while write latency is climbing — not when someone just says 'we need to scale'. The symptom is the signal.
- Logical replication and physical streaming replication look similar from the outside but are completely different animals under the hood — physical replicas can be promoted to primary during failover, logical subscribers cannot. Mixing them in your HA pool is a disaster waiting for a production incident to express itself.
- walsender → walreceiver → recovery apply. Monitor each stage with sent_lsn, write_lsn, flush_lsn, replay_lsn. High replay_lag with low write_lag means recovery contention — long queries on the replica are blocking WAL apply.
- Patroni is the industry standard for automated failover; don't build your own. If you're not using Patroni, you have a manual runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.
- wal_keep_size + WAL archiving is safer than replication slots for most setups. Slots prevent resync but can fill your disk. Monitor pg_replication_slots.retained_bytes and alert at 10GB.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QWalk me through exactly what happens at the WAL level when a PostgreSQL primary executes an UPDATE while a streaming replica is connected — where does each process hand off to the next, and where can lag accumulate in that chain?SeniorReveal
- QYou have a read replica that's receiving 10,000 read queries per minute from a reporting dashboard. The DBA wants to add an index to speed up the report queries. What are your options given that the replica is a physical streaming standby, and what are the trade-offs of each?SeniorReveal
- QYou've promoted a replica to primary during an incident. The old primary comes back online 20 minutes later. What exact steps do you take to safely re-integrate it as a replica, and what's the single most dangerous thing that can happen if you get the sequence wrong?SeniorReveal
- QYour synchronous replica is configured with synchronous_commit = remote_apply. The replica host has a 500ms disk I/O spike. Describe exactly what happens to write latency on the primary during those 500ms, what error or timeout the application sees if it exceeds synchronous_standby_names wait limits, and what configuration change you'd make to add resilience without abandoning synchronous replication entirely.SeniorReveal
Frequently Asked Questions
How much replication lag is acceptable for a PostgreSQL read replica?
It depends entirely on what the replica's reads are gating downstream. For display-only reads with no downstream writes — product listings, dashboards, reporting — up to 30 seconds of lag is often fine. For anything that influences a write decision — inventory checks, payment status reads, permission checks — lag tolerance is zero and those reads belong on the primary. Set your alert threshold at 10 seconds for OLTP replicas and treat anything above 30 seconds as a page-worthy incident. For financial systems, use synchronous_commit = remote_apply on the transactions that matter, not async replicas.
What's the difference between a read replica and a standby in PostgreSQL?
They're the same thing — PostgreSQL calls it a standby; cloud providers call it a read replica. A physical standby in hot_standby mode accepts read queries while continuously applying WAL from the primary. The term 'read replica' is a cloud marketing wrapper around PostgreSQL's physical streaming standby. The concrete rule: if it's in recovery mode (pg_is_in_recovery() = true) and accepts reads, it's what AWS/GCP call a read replica. Logical replication subscribers are different — they accept writes and DDL but cannot be promoted to primary for HA.
How do I force a specific query to always run on the primary and never hit the replica?
Route it at the connection level, not the query level. The cleanest pattern: maintain two connection pools — one pointing to your primary DSN, one to your replica DSN — and route based on intent annotations in your DAL. Don't put routing hints in SQL comments and hope your proxy interprets them correctly. If you're using PgBouncer, configure two separate listen ports — one for primary, one for replica — and have your application code choose the port. The alternative is a SET command: 'SET transaction_read_only = off' on a replica will fail immediately, which you can use as a circuit breaker, but that's a hack not a strategy. For Postgres drivers that support 'target_session_attrs=read-write', use that in the connection string to prefer primaries.
Can PostgreSQL read replicas fall so far behind that they can never catch up?
Yes — and it's not rare. If the primary generates WAL faster than the replica can apply it on a sustained basis, the replica never catches up. The early warning sign is apply_lag_bytes increasing monotonically in pg_stat_replication even when primary write volume is constant. This happens when the replica's disk is undersized relative to the primary's write throughput, or when long-running read queries on the replica continuously conflict with WAL apply, forcing the recovery process to wait or cancel. The permanent fix is either faster replica storage (NVMe vs SSD), a dedicated analytics replica for heavy queries, or reducing primary write volume by batching writes. The temporary fix — cancelling conflicting queries — buys time but doesn't solve the structural problem. If the replica falls months behind, the fastest recovery is often to drop it and recreate with a fresh pg_basebackup.
What's the performance impact of turning on hot_standby_feedback?
hot_standby_feedback tells the primary's backends that the replica still has open transactions that might need older row versions. This prevents vacuum from cleaning those rows, causing bloat on the primary. The trade-off: without it, long-running queries on the replica get cancelled during vacuum conflicts. With it, your primary tables bloat more. Monitor pg_stat_user_tables.n_dead_tup and autovacuum effectiveness. For replicas with short queries (<10s), leave hot_standby_feedback off. For replicas with long analytics queries, turn it on but also increase autovacuum_vacuum_scale_factor on primary. The exact values depend on your write volume; start with scale_factor=0.05 and monitor bloat weekly. In practice, most OLTP replicas with sub-second queries can keep it off.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.