PostgreSQL Read Replicas — Lag Caused 400 Double Charges
A 800ms-2s replica lag caused 400 double charges despite Stripe idempotency keys.
20+ years shipping high-throughput database systems. Written from production experience, not tutorials.
- A read replica is a hot standby that applies WAL from the primary and serves read queries — but it's always behind the primary by some replication lag.
- Streaming replication: primary runs walsender → replica runs walreceiver → recovery process replays WAL. Lag accumulates at network, disk I/O, or apply stage.
- Replica reads are eventually consistent — writing to primary then SELECTing from replica is a race condition that will lose updates.
- Performance insight: a 2048-bit RSA key encrypts 256 bytes; OAEP padding adds ~42 bytes overhead, reducing max plaintext to 214 bytes.
- Production pitfall: routing a read that gates a financial transaction to the replica — replication lag as low as 200ms causes double charges.
- Biggest mistake: treating synchronous_commit as a panacea — it only confirms WAL arrived on replica, not that it's applied. Use synchronous_commit = remote_apply for zero-lag reads, but accept 2-5x write latency penalty.
Imagine your company's most popular restaurant has one chef and fifty hungry customers. The chef can cook AND take orders, but doing both at once means both get done badly. So you hire a team of runners who memorised the entire menu — they can answer every question customers have, take drink orders, describe every dish in detail, but they can't actually cook anything. That's a read replica: an exact copy of your database that can answer every read question your app throws at it, so your primary database can focus entirely on writes. The 'menu' gets updated to the runners every few seconds — that slight delay is replication lag, and it's the thing that will eventually cause you a production incident if you don't plan for it.
Your primary database is not a read server that also handles writes — treating it like one is how you end up with a 4AM PagerDuty alert because a reporting query from the analytics team locked up your checkout flow. Read replicas aren't just a scaling trick; they're an architectural boundary that forces you to think honestly about which operations actually need the freshest data and which ones are lying to themselves about it.
The problem read replicas solve is brutally simple: PostgreSQL's primary instance is single-writer by design. Every SELECT that hits the primary competes for buffer pool space, CPU cycles, and I/O bandwidth with every INSERT, UPDATE, and DELETE in your write path. At low load this is invisible. At production scale with mixed OLTP and reporting workloads running against the same instance, you start seeing query plans degrade, autovacuum fall behind, and connection counts spike in ways that look like traffic problems but are really resource contention problems. The old hack was to run analytics queries at 2AM with pg_cron and hope the batch window held. It doesn't hold. It never holds.
After reading this you'll be able to set up streaming replication with correct recovery configuration, route read traffic intelligently without trusting application developers to do it manually, diagnose and mitigate replication lag before it causes stale-read incidents, design a promotion strategy that survives primary failure without data loss, and know the exact scenarios where read replicas will make your architecture worse, not better.
How Read Replicas in PostgreSQL Actually Work — And Why Lag Matters
A read replica is a separate PostgreSQL instance that continuously applies write-ahead log (WAL) data from a primary server. The primary handles all writes; replicas serve read-only queries. This is asynchronous streaming replication by default — the replica is always behind the primary by some amount of time.
Replicas apply WAL in a single process, sequentially. A heavy write burst on the primary (e.g., batch inserts, index rebuilds) creates a backlog on the replica. The replica's apply lag grows. Queries on the replica see a stale snapshot. PostgreSQL does not block reads on the replica — it serves whatever data has been replayed so far. There is no built-in mechanism to wait for consistency.
Use read replicas when your read workload exceeds a single node's capacity or when you want to isolate reporting/analytics from production writes. But never assume replica data is current. If your application reads from a replica and then writes based on that read, you risk acting on stale state — exactly the pattern that caused double charges in this case.
pg_current_wal_lsn() to verify the replica has applied the primary's last write before proceeding.pg_stat_replication and alert when lag exceeds your application's tolerance (e.g., >100ms).Streaming Replication Internals: What's Actually Happening Under the Hood
Before you configure anything, you need to know what PostgreSQL is actually shipping between instances. Streaming replication works by sending WAL — Write-Ahead Log — records from primary to standby in near-real-time. Every change made to the primary is first written to WAL before it touches the actual data files. The standby connects to the primary using the replication protocol, receives these WAL records as a stream, and applies them to its own data files. The standby is replaying the exact same operations the primary executed, in the same order, against its own copy of the data.
This has a critical implication that trips up almost everyone: the replica isn't just a copy of the data files — it's a copy of the primary's entire physical state, including every bloated table, every suboptimal index, every dead tuple autovacuum hasn't cleaned up yet. You can't add an index on the replica that doesn't exist on the primary (unless you use logical replication, which is a different beast entirely). You also can't run DDL on a streaming replica. It's physically coupled to the primary's state.
The WAL sender process on the primary and the WAL receiver process on the replica maintain a persistent TCP connection. The replica's recovery process applies the incoming WAL records. You can monitor the health of this pipeline on the primary via pg_stat_replication and on the replica via pg_stat_wal_receiver — these two views tell you everything you need to know about whether your replication pipeline is healthy. If you're not monitoring both, you're flying blind.
One more nuance: replication slots vs wal_keep_size. A replication slot is a persistent marker on the primary that says 'don't recycle WAL until this replica has consumed it'. Without a slot, you rely on wal_keep_size — a fixed retention window. If a replica disconnects longer than wal_keep_size covers, you need a full pg_basebackup to resync. Slots prevent that, but if the replica never reconnects, the primary's disk will fill. Choose based on your tolerance for full resync vs disk space.
pg_current_wal_lsn(), replay_lsn)) FROM pg_stat_replication;Connection Routing Without Trusting Your Application Developers
Here's the brutal truth about read replica routing: if you rely on application developers to manually choose between a primary DSN and a replica DSN in their code, you will have primary reads that should be on the replica and replica reads that should be on the primary, and you won't know which is which until something breaks. Application developers are solving business problems. Database topology is not their mental model and it shouldn't have to be.
The right architecture puts routing below the application layer. PgBouncer is the most common first layer — it pools connections and can be configured with multiple backends. But PgBouncer alone doesn't route by intent. For intent-based routing you need either a proxy that understands read/write splitting (Pgpool-II, RDS Proxy, or a sidecar proxy like ProxySQL's PostgreSQL mode) or a connection wrapper in your DAL that maps transaction context to the right pool.
The pattern I've seen work in production at scale: two PgBouncer pools — one pointing to primary, one pointing to replica — with a lightweight DAL wrapper that routes based on transaction mode. Any call inside an explicit BEGIN...COMMIT block goes to primary. Any bare SELECT outside a transaction block goes to the replica pool. This gets you 80% of the benefit with minimal application code changes. The remaining 20% — cases where a SELECT needs fresh data but isn't in a write transaction — you handle with an explicit annotation or by accepting primary reads for those paths.
One anti-pattern to avoid: putting the replica pool behind a load balancer that does health checks but not read/write splitting. You'll end up with reads on primary anyway. Another trap: using statement-based routing (e.g., check if query starts with 'SELECT') fails for SELECT inside transactions, SELECT...FOR UPDATE (needs primary), and CTEs. The transaction-boundary approach is simpler and more reliable.
Replication Lag: Why It Happens, How to Measure It, When to Panic
Replication lag is not a bug. It's a fundamental property of async replication. What is a bug is building a system that assumes lag is zero, or monitoring for it reactively instead of proactively. I've watched teams add read replicas to solve a performance problem and spend the next month firefighting stale-read bugs that were ten times harder to debug than the original performance problem.
Lag accumulates from three different places and you need to know which one is causing yours. Network lag is the time for WAL bytes to travel from primary to replica — usually <5ms on properly networked instances in the same AZ. I/O lag on the replica is the time to write and fsync incoming WAL segments — on a replica with an undersized disk or competing I/O from read queries, this is where lag balloons. Apply lag is the time for the recovery process to apply WAL records to data pages — on tables with high contention or during bulk loads, apply lag can spike to seconds even with fast I/O.
The real danger is cumulative lag under write bursts. If your primary ingests a bulk load at 200MB/s of WAL for 60 seconds, and your replica can only sustain 150MB/s of apply throughput, you're 6GB of WAL behind when the load stops. At steady state the replica catches up — but for those minutes, any read against the replica is returning data that's potentially minutes old. You need to monitor pg_stat_replication on primary continuously, alert at 10 seconds of lag, and have a circuit breaker that routes to primary automatically when lag exceeds your SLA threshold.
A subtle point: replica lag reported in bytes (pg_wal_lsn_diff) is more precise than lag in seconds (replay_lag). Bytes tell you exactly how much work is pending. Seconds include network jitter and OS scheduling. For alerting, use seconds with a hysteresis (e.g., lag > 30s for 2 consecutive checks). For root cause analysis, examine the three byte-level components.
pg_current_wal_lsn(), confirmed_flush_lsn) AS retained_bytes FROM pg_replication_slots; and alert if retained_bytes exceeds 20GB. If a slot goes inactive and you can't reconnect the replica in minutes, DROP the slot immediately and rebuild the replica.Promotion and Failover: What Actually Happens When Your Primary Dies
Failover is where most PostgreSQL read replica setups reveal they were built with optimism rather than engineering. The scenario: primary becomes unavailable. Every second of downtime is revenue loss. You need to promote a replica to primary, update your connection strings, ensure the old primary never comes back as a rogue writer, and ideally do all of this without data loss. If you haven't rehearsed this, you don't actually have HA — you have a hope.
PostgreSQL's promotion mechanism is clean: on the standby, run pg_ctl promote or create a trigger file, and the replica exits recovery mode and starts accepting writes. The problem is everything around it. Your application's connection pool is still pointing at the dead primary's IP. Other replicas are still trying to stream from the dead primary. If the old primary recovers and starts accepting writes while the new primary is also accepting writes — split-brain — you have two diverging transaction histories and neither node knows about the other's writes.
Patroni is the industry answer to this. It uses a distributed consensus store (etcd or Consul) to maintain a single authoritative leader, handles automatic promotion, updates a virtual IP or DNS entry that your application uses, fences the old primary (prevents it from accepting writes even if it recovers), and handles adding the old primary back as a replica via pg_rewind. If you're running PostgreSQL in production without Patroni or an equivalent, you have a manual failover runbook that will be executed under maximum stress by someone who hasn't slept. That's not HA.
A specific gotcha: after promoting a replica, the timeline ID changes. The old primary, if it comes back online, is on a diverged timeline. It cannot rejoin as a replica without pg_rewind, which replays WAL from the divergence point to bring it back into sync. Running pg_rewind requires the old primary to be stopped and requires the new primary to have wal_log_hints = on or data checksums enabled. If those aren't set, pg_rewind fails and you need a full pg_basebackup.
Cascading Replication: How to Make Your Standbys Work for a Living
Most teams treat standby replicas as passive sponges that just soak up WAL and wait to be promoted. That's wasteful. Cascading replication lets a replica serve as a replication source for downstream replicas, offloading the primary's network and CPU burden. The WHY is simple: your primary shouldn't be sending the same WAL data to thirty application clients and fifteen replicas. It's a bottleneck you're paying for in latency and risk.
Set standby.sync to on and hot_standby to on in the replica that'll be your cascade source. That replica then becomes both a read target and a feeder. But here's the catch—cascading adds latency at each hop. If you chain three levels, your replica at the bottom sees lag that's the sum of all upstream propagation delays. Measure it with pg_stat_replication on each node. A common production pattern is two-tier: primary feeds two regional cascades, each feeds five application replicas. Your primary sends only 2 streams instead of 12. That's a win.
Never trust a cascade replica for critical failover unless you've validated its WAL position is within acceptable lag. Otherwise you promote a stale node and lose committed transactions.
Synchronous Replication: When You Absolutely Cannot Lose a Transaction
Streaming replication in async mode means the primary commits before the replica acknowledges receipt. That's fine for reporting dashboards. It's not fine for payment systems, order processing, or anything where a primary crash could lose the last few seconds of writes. Synchronous replication forces the primary to wait for at least one replica to confirm the WAL was flushed to disk before reporting success to the client.
The cost is write latency—every INSERT, UPDATE, or DELETE now waits for a network round-trip to the sync standby. In a cross-datacenter setup with 10ms latency, that adds 10ms to every write. You don't want this on every connection. Use synchronous_standby_names to specify which replicas are sync targets, and set synchronous_commit = remote_write for a lighter guarantee (replica received it but hasn't flushed) or on for full flush confirmation.
Production pitfall: if your sync standby dies, the primary blocks all writes until either the standby comes back or you remove it from the sync list. Always have a quorum commit setup or a monitoring alert that auto-demotes a dead sync standby before your users start screaming.
WAL Archive & Continuous Archival: Your Replicas Can't Save You From Disk Failure
Replicas are great for read scaling and failover. They are not backups. If your primary suffers a physical disk failure that corrupts the WAL files before they're replicated, all your replicas will replicate that corruption. That's the day you learn streaming replication is not a backup strategy. You need continuous WAL archiving to a separate storage system, preferably an object store like S3 or a dedicated backup server.
Set archive_mode = on and archive_command to copy completed WAL segments to your archive location. PostgreSQL will call this command every time it finishes writing a WAL segment (usually 16MB). The archive target must be durable—a separate volume, different physical hardware, or cloud object storage. Combine this with pg_basebackup for full base backups and slot-based WAL retention so you don't lose WAL before replicas consume it.
Production pattern: full backup every 6 hours, WAL archiving every 5 minutes, and a separate recovery.conf that can restore from archive in minutes. Test your restore process quarterly. I've seen teams that had two years of WAL archives but couldn't actually get PostgreSQL to restore from them because they changed the archive format without testing. Don't be that team.
pg_stat_archiver in your alerting or your backup will silently rot.Project Structure: Don't Let Your Repo Look Like a Junior Dev's Desktop
Most PostgreSQL replication guides dump everything into a single directory and call it a day. That works until your third replica joins the cluster and you can't tell which config belongs to which node. Stop that.
Your project structure is your first line of defense against operational chaos. Separate primary config, replica config, and initialization scripts into distinct directories. Use environment-specific override files so staging doesn't accidentally connect to prod. If your docker-compose.yml references a config path that doesn't exist, your replica won't start — that's a feature, not a bug.
Every production cluster I've managed that survived more than six months had a clear directory layout: pg-primary/, pg-replica/, pg-init/, and a top-level docker-compose.yml. No ambiguity. No guessing. You should be able to clone the repo and have a working three-node cluster in under two minutes. If you can't, your structure is wrong.
Step 5: Write the Docker Compose File — Because Manual Setup Is for Masochists
You could manually configure PostgreSQL on three different servers. Or you could write a Docker Compose file and have a working cluster in 30 seconds. Your choice.
The compose file must define three services: one primary and two replicas. The primary exposes port 5432. Replicas expose different ports (5433, 5434) to avoid conflicts during local development. Each replica depends on the primary — if the primary isn't ready, the replicas will retry.
Critical detail: the replica setup script runs on container start, not build time. You need the primary to be fully initialized before the replica can pull the base backup. Use a health check on the primary service. Use depends_on with condition: service_healthy. Without that, your replica will fail silently and you'll waste an hour debugging why WAL streaming never started.
One more thing: mount your config directories as read-only volumes. If a misconfigured replica tries to write to its own config, it should crash immediately. Fail fast, not silently.
PG_REPLICA_PASSWORD from environment, not a hardcoded value. One env file per environment. Keep secrets out of your compose file.Cleaning Up Orphaned WAL Files After Replica Promotion
When you promote a read replica to primary, PostgreSQL leaves behind WAL segments on the old primary that can fill disk and confuse monitoring. The old primary continues writing WAL files even after it's orphaned, because it doesn't know it's been demoted. You must stop PostgreSQL on the old primary first, then manually remove stale WAL files from pg_wal. Use pg_archivecleanup with the last valid WAL segment from the new primary to avoid removing files still needed for recovery. The cleanup tool reads the archive status and deletes everything before the given segment. Always run a CHECKPOINT on the new primary before cleaning, because checkpoint writes a new timeline history file that marks the split point. Failure to clean orphaned WAL causes disk full errors and prevents the old primary from being safely reused as a standby. Implement a monitoring alert for pg_wal directory size and automatically run archive cleanup after any failover test.
What We Covered: The Only Three Rules You Need for Read Replicas
You now know everything necessary to run read replicas without surprises. First rule: lag is measured, not guessed — monitor pg_stat_replication for write_lag and flush_lag, and set alerts at 10 seconds for OLTP, 60 seconds for batch reporting. Second rule: failure requires a sequence — stop the old primary, promote the replica, update your connection router's DNS, then clean orphaned WAL. Never skip the stop step; it prevents split-brain where two servers accept writes. Third rule: disk failures still kill you — WAL archiving to cloud storage is the only safety net. Replicas are for read scaling, not backup. Combine synchronous replication for zero-loss critical transactions with cascading replicas for reporting loads. Use pg_rewind to reattach the old primary as a standby after failover instead of rebuilding from scratch. These three rules — measure lag, follow sequence, archive WAL — eliminate 90% of replica-related outages. The remaining 10% is people ignoring them.
Introduction: Why Read Replicas Exist (and Why You Should Care)
PostgreSQL read replicas solve a fundamental tension: your primary database must handle writes with strict consistency, but reads can tolerate slight staleness for massive throughput gains. Without replicas, every analytics query, dashboard refresh, and background job competes for the same CPU and I/O as your critical write path. The result? Latency spikes, connection pool exhaustion, and eventually, angry users. Replicas offload read traffic to separate nodes that asynchronously replay the write-ahead log (WAL) from the primary. This isn't just about scaling—it's about isolation. A badly written reporting query can bring a replica to its knees, but your primary stays healthy. The trade-off? Replication lag. Data written to the primary may take milliseconds (or seconds) to appear on a replica. If your application cannot tolerate even 100ms of staleness, replicas are the wrong tool. But for the 90% use case—read-heavy workloads, geographical distribution, or hot standby—replicas are the simplest, most battle-tested pattern in the PostgreSQL ecosystem.
Prerequisites: What You Need Before Touching Replicas
Before deploying PostgreSQL read replicas, three non-negotiable foundations must exist. First, a properly configured primary with wal_level = replica (or logical for logical replication) and max_wal_senders set high enough to support your replica count plus one buffer. Without these, replicas cannot stream WAL. Second, network connectivity: replicas must reach the primary on port 5432 (or your custom port) with firewall rules allowing PostgreSQL's streaming replication protocol—this is not HTTP, so load balancers often break it. Third, authentication: create a dedicated replication user with REPLICATION privilege and store credentials in a vault, not in code comments. Bonus prerequisite: understand your application's tolerance for stale reads. If your business logic demands causal consistency (e.g., "show me my order immediately after placing it"), replicas will betray you. Finally, ensure PostgreSQL version parity between primary and replicas—major version mismatches silently corrupt replication. On the operational side, you'll need monitoring (Prometheus + pg_stat_replication), a way to replay WAL after failover, and a plan for handling replication slot bloat. Skip any of these, and your replica setup will fail in production at the worst possible moment.
The Replica That Made Us Double-Charge 400 Customers
/ force_primary / hint that PgBouncer's routing rules respected. Permanent fix: any read that gates a financial write must go to primary. Added a ReadPolicy.STRONG annotation in the DAL that bypassed the replica pool entirely. Bumped synchronous_commit = remote_apply for the payments schema specifically using ALTER DATABASE payments_db SET synchronous_commit = 'remote_apply'; to ensure writes were applied on replica before acknowledging to the application.- If a read result gates a write that moves money, it goes to the primary — full stop, no exceptions, not even at 1AM when the primary CPU graph looks scary.
pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots;. If a replica is down and using a slot, the primary retains all WAL indefinitely. Drop the slot with SELECT pg_drop_replication_slot('slot_name'); then rebuild the replica.pg_is_in_recovery() should return false.SELECT pg_is_in_recovery() AS is_standby, pg_last_wal_receive_lsn() AS received, pg_last_wal_replay_lsn() AS replayed, now() - pg_last_xact_replay_timestamp() AS lag_seconds;Run on primary: SELECT application_name, state, sync_state, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;Key takeaways
Common mistakes to avoid
5 patternsPointing analytics queries at the same replica handling OLTP read traffic
Using a replication slot on a replica that has routine maintenance windows
pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots; and alert above 10GB. If a slot goes inactive, drop it and rebuild the replica.Running pg_basebackup without --wal-method=stream
Assuming replica will automatically reconnect after primary IP change
Not testing promotion before you need it
Interview Questions on This Topic
Walk me through exactly what happens at the WAL level when a PostgreSQL primary executes an UPDATE while a streaming replica is connected — where does each process hand off to the next, and where can lag accumulate in that chain?
Frequently Asked Questions
20+ years shipping high-throughput database systems. Written from production experience, not tutorials.
That's MySQL & PostgreSQL. Mark it forged?
16 min read · try the examples if you haven't