
Database Replication Explained: Architecture, Lag, and Production Pitfalls

📍 Part of: Database Design → Topic 7 of 16
Database replication deep-dive: master-slave vs multi-master, replication lag, conflict resolution, WAL shipping, and real production gotchas engineers face at scale.
🔥 Advanced — solid Database foundation required
In this tutorial, you'll learn
  • Replication is for availability and read scalability, not for backups — a DROP TABLE on the primary propagates to every replica in milliseconds.
  • Understand your consistency requirements concretely before choosing async or sync: can your application tolerate a 500ms window of stale data, or does every read need to reflect the most recent write?
  • Write-Ahead Logs are both the crash recovery mechanism and the replication stream — they are the single source of truth for every change the primary has made.
Quick Answer
  • Database replication copies data from a primary node to one or more replicas for read scalability and high availability
  • Primary-Replica is the standard: all writes go to one primary, replicas serve reads and act as hot standbys
  • Asynchronous replication is fast but risks data loss if the primary crashes before shipping WAL; synchronous blocks until a replica acknowledges
  • Replication lag is the delay between a commit on the primary and its appearance on the replica — measure it in seconds or LSN distance
  • Replication is NOT a backup — DROP TABLE propagates to replicas in milliseconds; always maintain separate point-in-time snapshots
  • Automate failover with Patroni or Orchestrator — manual promotion during a 3 AM outage is how data gets lost
🚨 START HERE
Replication Emergency Cheat Sheet
When replication breaks in production, run these commands in order. Resist the urge to jump to fixes — diagnose first.
🟡 Replica lag is growing and not recovering
Immediate Action: Confirm the replica is still connected and actively applying WAL — don't assume
Commands
SELECT client_addr, state, sent_lsn, replay_lsn, (sent_lsn - replay_lsn) AS lag_bytes FROM pg_stat_replication;
SELECT now() - pg_last_xact_replay_timestamp() AS replica_lag;
Fix Now: If state is not 'streaming', check network and pg_hba.conf. If lag_bytes is climbing, check replica disk I/O with iostat -x 1 — if %util is near 100%, the replica disk cannot keep up with applying WAL and serving reads simultaneously.
🟡 Primary disk is full and writes are failing
Immediate Action: Find inactive replication slots pinning WAL — this is the cause 80% of the time
Commands
SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
Fix Now: SELECT pg_drop_replication_slot('dead_slot_name'); — verify disk recovers with df -h within 60 seconds. If no inactive slots exist, look for a single massive table with runaway autovacuum or an unfinished large import.
🟡 Replica won't start — 'requested WAL segment not found'
Immediate Action: The replica needs WAL the primary already recycled — there is no partial fix here
Commands
pg_controldata /var/lib/postgresql/data | grep 'Latest checkpoint'
pg_basebackup -h primary-host -D /var/lib/postgresql/data -U replicator -vP -R
Fix Now: Full resync via pg_basebackup is the only path. Once resynced, create a replication slot for this replica immediately and set max_slot_wal_keep_size as a safety valve — otherwise this will happen again the next time the replica falls behind.
🟡 Suspected split brain — two nodes both accepting writes
Immediate Action: Identify which node the application is currently writing to — do not touch anything until you know
Commands
SELECT pg_is_in_recovery(); -- false = primary, true = replica/standby
SELECT * FROM pg_stat_activity WHERE state = 'active' AND query NOT LIKE '%pg_stat%';
Fix Now: Fence the old primary first — revoke its VIP, stop its PostgreSQL process, or remove it from pg_hba.conf. Then verify the correct primary with SELECT pg_current_wal_lsn() on both nodes — the higher LSN is the node with more committed data. Rebuild the other as a replica using pg_rewind or pg_basebackup.
Production Incident: Replication slot bloat fills primary disk — full outage during peak traffic
A replica went offline for maintenance for 3 days. PostgreSQL retained WAL segments for its replication slot. The primary's disk filled to 100%, all writes failed, and the entire application went down during peak traffic.
Symptom: Primary database disk usage climbed steadily from 40% to 100% over 72 hours. Write queries started failing with 'could not extend file: No space left on device'. The application returned 500 errors on every write path. Read traffic on the healthy replicas continued working — which actually made the initial triage harder, because everything looked fine from the read side.
Assumption: The team assumed a runaway log file or table bloat on the primary. They ran VACUUM FULL across several large tables and purged several gigabytes of application logs. Disk usage didn't budge. At that point someone opened pg_replication_slots for the first time.
Root cause: A PostgreSQL replication slot had been configured for a replica taken offline for OS patching. The slot pinned WAL segments so the replica could catch up cleanly on return. Over 3 days, 180 GB of WAL accumulated in pg_wal. No alerting existed for replication slot lag, slot retention size, or inactive slots. The slot just sat there, invisible to the team, holding the WAL hostage.
Fix: Dropped the stale slot with pg_drop_replication_slot(). PostgreSQL reclaimed 180 GB immediately — disk dropped from 100% to 22% within a minute. Writes resumed without a restart. The team then added monitoring alerts on pg_replication_slots where active = false and pg_wal_lsn_diff() exceeds 1 GB, added a runbook step to explicitly drop or disable replication slots before taking replicas offline for any maintenance longer than a few hours, and set max_slot_wal_keep_size = '10GB' as a hard safety valve going forward.
Key Lesson
  • Replication slots pin WAL segments — a dead replica's slot will fill the primary's disk silently and completely
  • Monitor pg_replication_slots.active and pg_wal_lsn_diff() — alert well before the disk threshold, not at 95%
  • Before taking a replica offline for extended maintenance, drop its slot or set max_slot_wal_keep_size — do not assume it's safe to leave it
  • Disk-full on a primary is a total write outage, not a degraded state — treat slot monitoring with the same urgency as CPU or memory
Production Debug Guide: Common replication failures and how to diagnose them without guessing
Replica shows stale data after writes on primary
Check replication lag with SELECT now() - pg_last_xact_replay_timestamp() on the replica. If lag is growing rather than stable, check network throughput between primary and replica and whether the replica's disk I/O is saturated applying WAL. A replica serving heavy read traffic can starve the WAL apply process — they compete for the same I/O.
Replication stopped — replica is not receiving WAL
Check pg_stat_replication on the primary. If the replica's row is missing entirely, the streaming connection dropped. Check pg_hba.conf allows the replication user from the replica's address, and verify basic network connectivity between the two hosts. If the row is present but state is not 'streaming', the replica may be replaying a backlog — watch whether it progresses.
Primary disk filling up with no obvious cause
Run SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes FROM pg_replication_slots. Look for inactive slots retaining large amounts of WAL. Drop dead slots immediately with pg_drop_replication_slot(). This is the single most common surprise I see on Postgres clusters that weren't set up with slot monitoring.
Replica crashes with 'requested WAL segment has already been removed'
The replica fell so far behind that the primary recycled the WAL segments it still needs. The only recovery path is a full resync: pg_basebackup from the primary, then start fresh. After resyncing, create a replication slot for this replica so the primary stops recycling WAL it hasn't consumed. Set max_slot_wal_keep_size to prevent the slot from growing unbounded if this happens again.
Write latency spiked after enabling synchronous replication
Expected behavior — the primary now blocks on every write until at least one replica acknowledges the WAL write. Measure the network RTT between primary and replica with ping or traceroute. If the replica is in a different AZ, you're paying 2–10ms per write. Consider synchronous_standby_names = 'FIRST 1 (replica1)' to ensure you block on only one fast replica, or use synchronous_commit = 'remote_apply' selectively per transaction rather than cluster-wide.
Split brain after failover — two nodes both accepting writes
Stop write traffic to the old primary immediately — this is not a situation where you investigate first. Use your fencing mechanism (STONITH, VIP revocation, removing from pg_hba.conf) to ensure the old primary cannot accept new writes. Then compare pg_current_wal_lsn() on both nodes to determine which has more data. Rebuild the lagging node as a replica of the correct primary. Do not try to merge diverged write streams — it ends badly.

At scale, a single database server cannot absorb millions of concurrent reads without buckling — and if it goes down, your entire product goes with it. Database replication is the engineering answer to both problems simultaneously: it spreads read load across multiple servers and keeps a warm standby ready the moment your primary fails.

But replication is deceptively complex under the hood. The problems it solves — availability, durability, and read scalability — arrive bundled with a new class of trade-offs rooted in the CAP theorem and the physical reality of networks. A user sees a stale balance 800 milliseconds after a deposit. Two nodes in a multi-master cluster silently accept conflicting writes and produce corrupt state. A replica goes offline for a weekend and quietly fills the primary's disk to 100%.

I've dealt with all three of those in production, and the pattern is always the same: the failure mode was known, the monitoring wasn't in place, and the runbook didn't exist. This article is the thing I wish existed when I was setting up my first streaming replication cluster.

By the end you'll understand how replication works at the WAL and binary log level, how to reason about lag and its real-world consequences for your users, and how to design a topology that actually survives the failure modes that catch production systems off guard — not just the happy path.

Core Architectures: Primary-Replica vs. Multi-Primary

In a Primary-Replica setup — the architecture that powers the vast majority of production databases you'll encounter — all writes go to a single Primary node. That node records every change in a Write-Ahead Log (WAL) or Binary Log and ships those events to one or more Replica nodes. Replicas are read-only; they apply the log entries to their own copy of the data to stay in sync. This is the industry default for a reason: there is exactly one source of truth, conflict resolution is trivially simple (there are no conflicts), and the operational model is easy to reason about.

Multi-Primary replication allows writes on any node, which sounds like the availability holy grail until you actually implement it. If two clients update the same row on different primaries within the same time window, the system must resolve the conflict using strategies like Last Write Wins (LWW), Vector Clocks, or custom application logic. Last Write Wins sounds simple but silently discards legitimate writes — whichever timestamp wins, the other write vanishes with no error and no log entry. Vector Clocks are more correct but require your application layer to understand and handle conflict signals. Neither is free.

I've seen teams reach for Multi-Primary when their real problem was read scaling, which Primary-Replica solves without any of that complexity. The rule is straightforward: don't adopt multi-primary until you've genuinely exhausted vertical scaling and read-replica offloading, and your team understands distributed consensus deeply enough to operate it at 3 AM.
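To make the Last Write Wins failure mode concrete, here is a minimal Python sketch. The `VersionedRow` type and `lww_merge` function are illustrative, not any real database's internals: two primaries accept writes to the same row milliseconds apart, and the merge silently discards one of them.

```python
from dataclasses import dataclass

@dataclass
class VersionedRow:
    value: str
    ts: float  # wall-clock timestamp attached by the node that accepted the write

def lww_merge(a: VersionedRow, b: VersionedRow) -> VersionedRow:
    # Last Write Wins: keep the row with the higher timestamp.
    # The losing write is discarded with no error and no log entry.
    return a if a.ts >= b.ts else b

# Two primaries accept conflicting writes to the same row almost concurrently.
write_on_node1 = VersionedRow(value="email=alice@new.example", ts=1700000000.120)
write_on_node2 = VersionedRow(value="plan=enterprise",         ts=1700000000.125)

merged = lww_merge(write_on_node1, write_on_node2)
print(merged.value)  # node2's clock was 5ms ahead, so node1's update silently vanishes
```

Note that the "winner" is decided by clock comparison alone: if node1's clock had drifted 10ms ahead, the outcome would flip, and neither client would ever learn a write was lost.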

io/thecodeforge/db/ReplicaStatus.sql · SQL
-- Run this on the Primary to see connected standbys and their lag
SELECT 
    client_addr, 
    state, 
    sent_lsn, 
    write_lsn, 
    flush_lsn, 
    replay_lsn,
    (sent_lsn - replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;

-- Run this on a Replica to confirm it is in recovery and check replay timestamp
SELECT 
    pg_is_in_recovery()                  AS is_replica,
    pg_last_xact_replay_timestamp()       AS last_replayed_at,
    now() - pg_last_xact_replay_timestamp() AS current_lag;
▶ Output
 client_addr |   state   | replication_lag_bytes
-------------+-----------+-----------------------
 10.0.1.5    | streaming |                  4096
 10.0.1.6    | streaming |                 98304
Mental Model
Replication Architecture Mental Model
Think of Primary-Replica as a newspaper printing press: one editor (primary) writes the story, and the press (WAL shipping) distributes identical copies to every newsstand (replica). Multi-Primary is like letting every newsstand run its own editorial desk — eventually two editors write conflicting front pages about the same event and you have to decide which one is true.
  • Primary-Replica: one source of truth, replicas are read-only copies — zero conflict resolution needed
  • Multi-Primary: any node accepts writes — requires conflict resolution (LWW, Vector Clocks, or application logic)
  • The vast majority of production systems run Primary-Replica — it is simpler, safer, and sufficient until write throughput genuinely demands multi-primary
  • Split brain is the #1 operational risk of multi-primary: two nodes accept conflicting writes, data diverges, and the divergence is usually silent
📊 Production Insight
Primary-Replica is the default for a reason — it eliminates conflict resolution entirely by design. Multi-primary adds complexity that compounds at every layer: application code, ORM assumptions, monitoring, backup strategy, and failover logic. The decision should be driven by a hard write-throughput ceiling, not a desire for symmetry. Use Primary-Replica by default. Adopt multi-primary only when horizontal write scaling is a documented, measured hard requirement and your team can operate distributed consensus under pressure.

Synchronous vs. Asynchronous Replication

This is the most consequential trade-off in replication design, and it's worth spending time with rather than defaulting to whatever the tutorial used.

In asynchronous replication, the primary confirms the write to the client as soon as it's committed locally. The WAL is shipped to replicas in the background, after the acknowledgment. This is fast — write latency is bounded only by local disk speed — but it creates a window where committed transactions exist only on the primary. If the primary crashes before shipping that WAL, those transactions are gone. The replica becomes the new primary with a gap in its history that nobody can fill.

In synchronous replication, the primary waits until at least one replica confirms it has received and durably written the WAL before acknowledging the commit to the client. This closes the data loss window entirely — at least one copy of every committed transaction exists before the client gets a response. The cost is that write latency now includes the network round-trip to the replica. If your replica is in a different availability zone, you're paying 2–10ms per write, every write, forever.

Neither is obviously correct. The right choice depends on what your data is worth and what your users will tolerate. A social media feed can absorb async lag without most users noticing. A financial transaction system where a committed payment might vanish on primary failure is a different conversation entirely.
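A toy simulation makes the async durability window tangible. Everything here is illustrative (two Python lists stand in for the primary's and replica's WAL); only the ordering of "acknowledge the client" versus "ship the WAL" matters.

```python
# Toy model of the asynchronous durability window. The lists stand in for the
# primary's and replica's WAL; the ordering of ack vs. ship is the point.

def commit(txn, primary_wal, replica_wal, mode):
    primary_wal.append(txn)          # 1. commit durably on the primary
    if mode == "sync":
        replica_wal.append(txn)      # 2. ship WAL and wait for the replica's ack
    # async: the client is acknowledged now; WAL ships later (maybe never,
    # if the primary crashes first)
    return "ack"

# Async: the client sees "ack", the primary crashes before shipping WAL,
# and a replica is promoted. The committed payment does not exist on it.
primary, replica = [], []
commit("INSERT payment 42", primary, replica, mode="async")
print("survives failover?", "INSERT payment 42" in replica)   # False

# Sync: the ack is not returned until the replica holds the WAL.
primary, replica = [], []
commit("INSERT payment 42", primary, replica, mode="sync")
print("survives failover?", "INSERT payment 42" in replica)   # True
```

The sync variant pays for its guarantee in latency: step 2 is a real network round trip on every single commit, which is exactly the trade-off described above.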

io/thecodeforge/db/docker-compose.yml · DOCKER
# io.thecodeforge: Quick-start Postgres Primary-Replica cluster for local development
# Do not use this config verbatim in production — add TLS, auth, and resource limits
services:
  db-primary:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: forge_secret
      POSTGRES_USER: forge_admin
    command: |
      postgres
        -c wal_level=replica
        -c max_wal_senders=10
        -c max_replication_slots=10
        -c wal_keep_size=256MB

  db-replica:
    image: postgres:16-alpine
    depends_on:
      - db-primary
    environment:
      PGPASSWORD: forge_secret
    command: |
      /bin/sh -c "
      until pg_basebackup -h db-primary -D /var/lib/postgresql/data \
        -U forge_admin -vP -W --wal-method=stream -R; do
        echo 'Primary not ready, retrying in 2s...'; sleep 2
      done
      echo 'Basebackup complete. Starting replica...'
      postgres"
▶ Output
db-primary | LOG: database system is ready to accept connections
db-replica | NOTICE: pg_basebackup: initiating base backup
db-replica | NOTICE: pg_basebackup: base backup completed
db-replica | LOG: started streaming WAL from primary at 0/1000000
⚠ Watch Out: Synchronous Replication Multiplies Write Latency
Synchronous replication ties your write latency to the network round-trip between primary and replica. If your replica is in a different availability zone, expect 2–10ms added per write. At 1,000 writes per second, that is a measurable throughput ceiling and a latency floor you cannot escape without moving the replica closer. Always benchmark synchronous replication under production-representative write load before enabling it — the latency impact is rarely what people expect, and it only surprises you at the worst time.
📊 Production Insight
Async replication risks losing committed transactions if the primary crashes before shipping WAL — the window is usually small but nonzero. Sync replication eliminates that window but adds network RTT to every write, and the RTT is not negotiable once your topology is fixed. Measure both before committing to either. Use async by default for most workloads. Enable sync only for financial or compliance-critical write paths where data loss is a harder constraint than latency.
🎯 Key Takeaway
Asynchronous replication gives you fast writes, but the primary can crash before shipping WAL — the last few committed transactions may vanish with no way to recover them. Synchronous replication closes that window entirely but gates every write on the replica's network RTT. Use async by default. Enable sync only when a compliance or financial requirement makes zero data loss non-negotiable — and benchmark the latency impact under realistic write load before you ship it.
Choosing Synchronous vs. Asynchronous Replication
If: General web application — can tolerate potential loss of the last few committed writes on catastrophic primary failure
Use: Asynchronous — lower write latency, higher throughput, simpler operational model
If: Financial transactions or compliance requirement where zero committed-data loss is mandatory
Use: Synchronous — accept the latency penalty for guaranteed durability, benchmark write throughput under load
If: Cross-region deployment where the replica is 50–200ms round-trip from the primary
Use: Asynchronous with fast WAL shipping and monitoring — a synchronous commit across that link adds the full round-trip to every write and craters write throughput
If: Need durability for specific write paths but cannot afford synchronous latency cluster-wide
Use: synchronous_commit = 'remote_apply' selectively per transaction at the application layer — pay the sync tax only where it is genuinely required

Replication Lag: Causes, Measurement, and Consequences

Replication lag is the time gap between a transaction committing on the primary and that same transaction becoming visible on a replica. It's measured two ways: time-based (how many seconds is the replica behind?) and log-based (how many bytes of WAL has the replica not yet applied?). Both metrics matter and they tell you different things. Time tells you the user-visible impact. Bytes tell you how much backlog the replica is carrying and whether it's catching up or falling further behind.

Lag is not a bug in your replication setup. It is a fundamental, expected property of asynchronous replication, and the engineering question is never 'how do I eliminate lag?' but 'is this lag within acceptable bounds for my use case?' A 200ms lag on a social media activity feed is invisible to users. A 200ms lag on a bank account balance immediately after a wire transfer is a customer complaint and potentially a compliance issue. The same 200ms, completely different problem severity.

The causes of lag that I see most often in production: long-running transactions on the primary that generate a large WAL burst when they commit, replica hardware that is weaker than the primary (slower IOPS means slower WAL apply), network congestion between primary and replica — especially on shared cloud network links during peak hours, and replicas running heavy read queries that compete with the WAL apply process for disk I/O. That last one is subtle and frequently overlooked: your read traffic is physically competing with replication for the same disk, and replication will lose if the disk is saturated.

io/thecodeforge/db/LagMonitoring.sql · SQL
-- io.thecodeforge: Replication lag monitoring — run both methods, not just one

-- Method 1: Time-based lag (run on the REPLICA)
-- Tells you user-visible staleness in wall-clock time
SELECT 
    now() - pg_last_xact_replay_timestamp() AS time_lag,
    CASE
        WHEN now() - pg_last_xact_replay_timestamp() > INTERVAL '5 seconds' THEN 'PAGE: lag exceeds 5s'
        WHEN now() - pg_last_xact_replay_timestamp() > INTERVAL '1 second'  THEN 'WARN: lag exceeds 1s'
        ELSE 'OK'
    END AS lag_status;

-- Method 2: Byte-based lag (run on the PRIMARY)
-- Tells you WAL backlog depth — a replica at 0s time lag can still carry 500MB of queued WAL
SELECT 
    client_addr,
    state,
    sent_lsn,
    replay_lsn,
    pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_pretty,
    pg_wal_lsn_diff(sent_lsn, replay_lsn)                  AS lag_bytes,
    CASE
        WHEN pg_wal_lsn_diff(sent_lsn, replay_lsn) > 104857600 THEN 'PAGE: >100MB lag'
        WHEN pg_wal_lsn_diff(sent_lsn, replay_lsn) > 10485760  THEN 'WARN: >10MB lag'
        ELSE 'OK'
    END AS lag_status
FROM pg_stat_replication;
▶ Output
-- Replica output
 time_lag | lag_status
----------+-----------
 00:00:00 | OK

-- Primary output
 client_addr |   state   | lag_pretty | lag_bytes | lag_status
-------------+-----------+------------+-----------+-----------------
 10.0.1.5    | streaming | 512 kB     |    524288 | OK
 10.0.1.6    | streaming | 48 MB      |  50331648 | WARN: >10MB lag
💡Pro Tip: Monitor Both Time and Bytes — They Catch Different Failure Modes
Time-based lag tells you whether users are seeing stale data right now. Byte-based lag tells you whether the replica is about to fall behind the moment write volume spikes. A replica can show 0 seconds of time lag while carrying 500 MB of queued WAL — it looks fine until your next batch job fires and the backlog doubles in 60 seconds. Alert on both metrics with separate thresholds, and graph them together so you can see the correlation when a write burst hits.
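The same dual-threshold idea can live at the alerting layer as well as in SQL. This small Python check is a sketch: the function name is made up, and the thresholds simply mirror the example queries in this section, so tune them to your own SLA.

```python
# Dual-threshold lag classification for an alerting pipeline. Thresholds
# mirror the example SQL in this section; tune them to your own SLA.

def classify_lag(time_lag_seconds, wal_backlog_bytes):
    # Either signal alone can justify a page: stale reads (time) or a deep
    # WAL backlog (bytes) that will turn into stale reads at the next burst.
    if time_lag_seconds > 5 or wal_backlog_bytes > 100 * 1024 * 1024:
        return "PAGE"
    if time_lag_seconds > 1 or wal_backlog_bytes > 10 * 1024 * 1024:
        return "WARN"
    return "OK"

print(classify_lag(0.2, 512 * 1024))           # OK
print(classify_lag(0.0, 500 * 1024 * 1024))    # PAGE: 0s time lag, huge WAL backlog
print(classify_lag(2.5, 1024))                 # WARN: stale reads, tiny backlog
```

The second case is the one the Pro Tip warns about: zero seconds of time lag, yet hundreds of megabytes of queued WAL waiting to bite during the next write burst.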
📊 Production Insight
Replication lag is inevitable in async replication — the question is whether it is within your application's tolerance for stale data. A 500ms lag on a read-your-own-write flow means users see their own edits disappear for half a second immediately after saving, which erodes trust fast. Measure lag continuously. Set alert thresholds based on your actual consistency requirements, not round numbers. And alert before users complain, not after.
🎯 Key Takeaway
Lag is a property of async replication, not a malfunction — measure it continuously and set thresholds before users find your threshold for you. Monitor both time-based lag (user impact) and byte-based lag (how much work is queued). If your application requires read-after-write consistency, implement it explicitly: route the user's reads to the primary for a short window after their write rather than pretending the replica is always current.
Responding to Replication Lag
If: Lag is stable and under your SLA threshold — not growing
Use: No action required — this is normal async replication behavior; document the baseline
If: Lag is growing steadily over time during normal traffic
Use: Check replica disk I/O with iostat -x 1 and whether heavy read queries compete with WAL apply. Consider a dedicated WAL-apply-only replica for failover and separate read replicas for query traffic
If: Lag spikes during batch writes or large transactions, then recovers
Use: Expected burst behavior — monitor recovery time. If it takes longer to recover each cycle, the replica hardware is under-provisioned for your write growth rate
If: Users report seeing stale data immediately after their own writes
Use: Implement read-after-write consistency — route that user's reads to the primary for N seconds after their last write, or track the write LSN and wait for the replica to catch up to it before serving the read
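The read-after-write strategy can be sketched at the application layer. This `ReadYourWritesRouter` class is a hypothetical illustration of the "sticky primary window" approach, not a library API; production versions often track the write LSN instead and wait for the replica to replay past it.

```python
import time

class ReadYourWritesRouter:
    """Route a user's reads to the primary for a short window after their
    last write, so they always see their own changes despite replica lag."""

    def __init__(self, sticky_seconds=2.0):
        self.sticky_seconds = sticky_seconds
        self._last_write_at = {}   # user_id -> monotonic time of last write

    def record_write(self, user_id, now=None):
        # Call this after every successful write for the user.
        self._last_write_at[user_id] = time.monotonic() if now is None else now

    def pick_endpoint(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        last = self._last_write_at.get(user_id)
        if last is not None and now - last < self.sticky_seconds:
            return "primary"   # inside the sticky window: replica may be stale
        return "replica"       # safe to serve from a read replica

router = ReadYourWritesRouter(sticky_seconds=2.0)
router.record_write("alice", now=100.0)
print(router.pick_endpoint("alice", now=100.5))   # primary (0.5s after her write)
print(router.pick_endpoint("alice", now=103.0))   # replica (window expired)
print(router.pick_endpoint("bob", now=100.5))     # replica (no recent write)
```

The `now` parameter exists only to make the sketch deterministic; real callers would omit it. Size the window to your observed p99 replication lag plus a margin, not to a round number.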

Write-Ahead Logs and Binary Logs: The Replication Plumbing

Every mainstream relational database uses a write-ahead log as the physical foundation of replication. In PostgreSQL it's the WAL. In MySQL it's the Binary Log (binlog). The concept is the same in both: before any change is applied to the actual data files, it's written sequentially to the log first. The log is the authoritative record of every change the primary made, in the exact order it was made. Replicas consume this log to reconstruct the primary's state on their own storage.

WAL serves two distinct purposes that are easy to conflate. First, crash recovery: if the primary dies mid-write, the WAL records enough information to replay or roll back the incomplete transaction on restart. The data files are always recoverable from a consistent checkpoint plus the WAL that follows it. Second, replication: the primary streams WAL segments to connected replicas over a network socket, and replicas apply them in sequence. These two purposes share the same physical log, which is why WAL configuration affects both durability and replication behavior.

The wal_level setting in PostgreSQL determines how much detail the WAL contains. The replica level is the minimum for physical streaming replication — it records enough to reconstruct data changes. The logical level adds row-level detail needed for logical replication and change data capture pipelines like Debezium. Logical WAL generates more volume per write — roughly 2–4x in high-UPDATE workloads — so don't set it unless you actually need it. And changing wal_level requires a server restart, so plan this before you have replicas depending on it.

io/thecodeforge/db/WALConfiguration.sql · SQL
-- io.thecodeforge: PostgreSQL WAL configuration audit and replication slot monitoring

-- Check current WAL level — requires server restart to change
SHOW wal_level;
SHOW max_wal_senders;
SHOW max_replication_slots;

-- Recommended postgresql.conf settings for a replication primary:
-- wal_level = replica            -- minimum for physical streaming replication
-- max_wal_senders = 10          -- max concurrent replication connections (replicas + tools)
-- max_replication_slots = 10    -- max slots; each slot pins WAL until the replica consumes it
-- wal_keep_size = '1GB'         -- WAL to retain without slots (safety net, not a substitute for slots)
-- max_slot_wal_keep_size = '10GB' -- hard cap on WAL any single slot can pin (CRITICAL — set this)

-- Monitor active WAL position
SELECT 
    pg_current_wal_lsn()                          AS current_lsn,
    pg_walfile_name(pg_current_wal_lsn())          AS current_wal_file;

-- Audit replication slots — pay attention to inactive slots retaining large WAL
SELECT 
    slot_name,
    slot_type,
    active,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal,
    CASE
        WHEN active = false 
         AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824
        THEN 'ALERT: inactive slot retaining >1GB — check immediately'
        ELSE 'OK'
    END AS slot_status
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
▶ Output
 slot_name | slot_type | active | retained_wal | slot_status
-----------+-----------+--------+--------------+---------------------------------------------------------
 replica_1 | physical  | true   | 16 MB        | OK
 replica_2 | physical  | false  | 2 GB         | ALERT: inactive slot retaining >1GB — check immediately
🔥Why WAL Retention Matters More Than Most Engineers Realize
Replication slots tell the primary: do not recycle any WAL segment until this replica has consumed it. That guarantee is critical when a replica needs to catch up after a brief network interruption. The problem is the guarantee has no expiry. An offline replica's slot pins WAL indefinitely — 1 hour, 3 days, however long it takes someone to notice. Without max_slot_wal_keep_size set, a single dead replica's slot will eventually fill the primary's entire pg_wal directory, causing a complete write outage. Set that parameter. Monitor slot retention. Drop slots before extended maintenance windows.
📊 Production Insight
WAL is simultaneously the crash recovery mechanism and the replication stream — it is the single source of truth for every change the database has ever made. Replication slots are a powerful safety net for replicas, but without monitoring and a size cap, one offline replica's slot becomes a time bomb for the primary's disk. Always set max_slot_wal_keep_size, monitor inactive slot retention, and treat a slot retaining more than 1 GB as an alert-level event.
🎯 Key Takeaway
WAL is the replication stream — every change is logged sequentially before touching the data files, and replicas apply that log to stay in sync. Replication slots are valuable but dangerous without limits: a dead replica's slot pins WAL indefinitely and will fill your primary's disk. Set max_slot_wal_keep_size unconditionally. Alert on inactive slots retaining more than 1 GB. A disk-full primary is a total outage — not a performance degradation, not a warning. A total outage.
WAL Configuration Decisions
If: Setting up physical streaming replication only
Use: wal_level = replica — sufficient for streaming replication, minimal WAL overhead
If: Need logical replication, CDC, or Debezium integration
Use: wal_level = logical — generates more WAL volume per write, but enables row-level change streams and selective table replication
If: Using replication slots for any replica
Use: max_slot_wal_keep_size as a mandatory safety valve — this single setting prevents one offline replica from filling the primary's disk
If: Not using replication slots
Use: wal_keep_size large enough to cover your longest expected replica downtime — WAL beyond this limit is eligible for recycling and a lagging replica may lose its position

Automated Failover: When the Primary Dies

Manual failover — SSH into a replica, run pg_ctl promote, update your connection strings, restart the application, redirect traffic — works in a controlled staging exercise during business hours. In production at 3 AM with pager alerts firing, adrenaline running, and half the team half-asleep, manual failover is how data gets lost and how the wrong replica gets promoted. I've seen both happen.

Automated failover tools handle the entire sequence: detect the primary failure, select the most up-to-date eligible replica, promote it, reconfigure remaining replicas to follow the new primary, and update the application's connection endpoint. All of that in under 30 seconds, without human judgment calls under pressure.

The two dominant tools for PostgreSQL are Patroni (built on etcd or Consul for distributed consensus and leader election) and Repmgr (simpler setup, less operational overhead for smaller clusters). For MySQL, Orchestrator is the standard. All three solve the same core problem: ensuring that exactly one node holds the write role at any moment. The 'exactly one' constraint is the hard part — distributed systems make this genuinely difficult because a network partition can make a live primary look dead to the failover system.

Failover tooling alone is not enough. The application must know where to send writes after failover completes. This requires either a connection pooler (PgBouncer, ProxySQL) or DNS-based service discovery (Route53, Consul) that can be updated programmatically as part of the failover sequence. Hard-coding the primary's IP address in your application configuration file is a guarantee that your automated failover will still require manual application intervention — which defeats most of the point.
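As a sketch of the pooler approach: a minimal, hypothetical PgBouncer configuration (the hostname db-primary.internal and database name appdb are placeholders, not from this article). The application always connects to port 6432; the failover hook rewrites only the host= entry and reloads PgBouncer:

```ini
; pgbouncer.ini — minimal sketch; all names are illustrative
[databases]
; The failover automation rewrites this host= entry, then issues RELOAD
appdb = host=db-primary.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; transaction-level pooling suits typical web workloads
max_client_conn = 1000
default_pool_size = 20
```

Because the application only ever sees the pooler's stable address, promotion completes without an application restart — the piece manual failover always forgets.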

io/thecodeforge/db/patroni.yml · YAML
# io.thecodeforge: Patroni configuration for a 3-node PostgreSQL HA cluster
# Patroni uses a DCS (etcd/consul) for distributed leader election and lock management
# Run one instance of this per PostgreSQL node — change 'name' and 'connect_address' per node

scope: forge-cluster
name: node-1   # change to node-2, node-3 on other nodes
namespace: /db/

restapi:
  listen: 0.0.0.0:8008
  connect_address: node-1.internal:8008

etcd3:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379
  # etcd cluster must have quorum for Patroni to elect a leader

bootstrap:
  dcs:
    ttl: 30                           # leader lock TTL in seconds
    loop_wait: 10                     # how often Patroni checks its status
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB — replica must be within 1MB of primary to be promotion-eligible
    synchronous_mode: true            # require at least one synchronous replica before committing
    synchronous_mode_strict: false    # do not block writes if no sync replica is available (degrades to async)
  pg_hba:
    - host replication replicator 10.0.0.0/8 md5
    - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node-1.internal:5432
  data_dir: /var/lib/postgresql/16/main
  bin_dir: /usr/lib/postgresql/16/bin
  parameters:
    max_connections: 200
    shared_buffers: 2GB
    wal_level: replica
    max_wal_senders: 10
    max_replication_slots: 10
    max_slot_wal_keep_size: 10GB     # safety valve — prevents slot bloat from filling disk
  authentication:
    replication:
      username: replicator
      password: forge_repl_secret
    superuser:
      username: postgres
      password: forge_admin_secret
▶ Output
2026-03-05 14:22:01 INFO: promoted self to leader by acquiring session lock
2026-03-05 14:22:02 INFO: no action. I am the leader with the cluster lock
2026-03-05 14:22:12 INFO: Lock owner: node-1; I am node-1
2026-03-05 14:22:22 INFO: node-2 is now a streaming standby
⚠ Watch Out: Failover Without Fencing Causes Split Brain
If the old primary is unreachable due to a network partition — but not actually dead — it may still be accepting writes from clients whose connection strings haven't updated yet. Promoting a new primary in that scenario gives you two nodes simultaneously accepting writes with diverging histories. Automated failover must include fencing: stopping the old primary process, revoking its VIP, or removing it from the connection pooler before the promotion completes. Without fencing, you are trading one failure mode for a worse one.
📊 Production Insight
Manual failover under pressure is how data gets lost and how the wrong node gets promoted — automate it before you need it. Failover without fencing causes split brain, which is harder to recover from than the original primary failure. Use Patroni with etcd for leader election and ensure your fencing mechanism (VIP revocation, STONITH, pg_hba.conf update) is part of the automated sequence. Never hard-code primary IPs in application config — the failover is only complete when the application is writing to the new primary without human intervention.
🎯 Key Takeaway
Automated failover with Patroni or Orchestrator converts a 3 AM manual recovery sequence into a sub-30-second automated event. Fencing is not optional — without it the old primary can continue accepting writes and you have split brain, which is a worse problem than the original failure. Never hard-code the primary IP in application config. Use a connection pooler or service discovery layer that updates automatically on promotion — otherwise your automated failover still requires a manual step to be useful.
Failover Strategy Decisions
If: Single-region PostgreSQL deployment, need fast automatic failover with leader election
Use: Patroni with etcd — handles leader election, replica promotion, and standby reconfiguration, and integrates with HAProxy for connection routing
If: MySQL ecosystem, need topology visualization and failover orchestration
Use: Orchestrator — visualizes the full replication topology, detects primary failure, and orchestrates graceful failovers with configurable hooks
If: Multi-region deployment with cross-region replicas and strict RPO requirements
Use: Patroni with synchronous_mode enabled and region-aware candidate priority — prefer a same-region replica for promotion to minimize RPO and avoid cross-region WAL gaps
If: Application currently hard-codes the primary host or IP
Use: Introduce PgBouncer or ProxySQL as a connection proxy before attempting automated failover — the app connects to the proxy, failover updates the proxy's backend, and no app restart is required

Designing Your Replication Topology for Production

A replication topology is not just 'one primary and a couple of replicas.' It's a deliberate design that accounts for how reads are routed, how WAL shipping load is distributed across the primary's network interface, what the failover blast radius looks like, and whether your replicas will actually help when the primary fails — or whether they'll fail alongside it because they share the same infrastructure failure domain.

The most common production topology is a star: one primary with N replicas, all directly connected to the primary via streaming replication. Read traffic is distributed across replicas via a load balancer or connection pooler. HAProxy and PgBouncer in pool mode are the standard choices here. This works well for single-region deployments up to roughly 10 replicas, at which point the primary's WAL shipping starts to become a bottleneck — it's shipping the same WAL stream N times over the same network interface.

Beyond 10 replicas, or whenever you can measure WAL shipping as a CPU or network ceiling on the primary, cascading (hierarchical) replication is worth evaluating. Tier-1 replicas connect directly to the primary. Tier-2 replicas connect to tier-1 replicas and inherit their lag in addition to their own. The primary's WAL shipping burden drops proportionally, but you add a lag tier at each level — tier-2 replicas are always at least as far behind as tier-1. Monitor each tier's lag separately.
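Cascading requires no special mode in PostgreSQL — on a tier-2 standby, primary_conninfo simply points at a tier-1 replica instead of the primary. A sketch with placeholder hostnames and slot name, reusing the replicator credentials from the Patroni example above:

```ini
# postgresql.auto.conf on a tier-2 standby (hostnames and slot name are placeholders)
# Pointing at the tier-1 replica instead of the primary is all cascading requires,
# along with the usual standby.signal file in the data directory
primary_conninfo = 'host=rep1-a.internal port=5432 user=replicator password=forge_repl_secret'
primary_slot_name = 'tier2_a_slot'   # create this slot on the tier-1 replica, not the primary
```

The tier-1 replica must itself accept replication connections: max_wal_senders above zero and a pg_hba.conf entry for the replication user — both already in place if it was cloned with the primary's configuration.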

For multi-region deployments, place at least one replica in each region. Route reads to the nearest replica for latency. But be clear-eyed about what cross-region replication lag means: a primary in us-east-1 and a replica in eu-west-1 will carry 80–150ms of lag under normal conditions, and more during network events. Writes always go to the single primary — and if that primary is in a different region than your users, those writes pay the cross-region latency too. Geographic distribution of replicas improves read latency; it does not improve write latency unless you go multi-primary, with all the complexity that brings.

io/thecodeforge/db/topology-diagram.txt · TEXT
-- io.thecodeforge: Common Production Replication Topologies

-- STAR TOPOLOGY (recommended default, up to ~10 replicas)
-- All replicas connect directly to the primary
-- Primary ships WAL to every replica — network bandwidth scales linearly with replica count
--
--                    [Primary]
--                 /      |      \
--           [Rep1]    [Rep2]    [Rep3]
--         (reads)    (reads)   (hot standby)

-- CASCADING TOPOLOGY (for 10+ replicas, or when WAL shipping is a primary bottleneck)
-- Tier-1 replicas connect to primary; tier-2 replicas connect to tier-1
-- Reduces primary WAL shipping load — each tier-1 ships to its tier-2 children
-- Lag at tier-2 = tier-1 lag + tier-2 lag — monitor each tier separately
--
--                    [Primary]
--                 /             \
--           [Rep1-A]           [Rep1-B]
--           /     \            /      \
--     [Rep2-A] [Rep2-B]  [Rep2-C]  [Rep2-D]

-- MULTI-REGION TOPOLOGY (for geographic read distribution)
-- One or more replicas per region — route reads to nearest replica
-- Cross-region lag is 80-150ms under normal conditions — set SLAs accordingly
-- Writes still go to the single primary — cross-region writes pay the RTT
--
--   [Primary: us-east-1] ----WAL----> [Replica: eu-west-1]
--          |                                  |
--   [Local Read Replicas]          [Local Read Replicas]
--   (us-east-1 reads)              (eu-west-1 reads)
▶ Output
Topology selected based on replica count, read volume, geographic distribution, and acceptable replication lag per tier.
💡Pro Tip: Provision Replicas for WAL Apply Throughput, Not Just Read Traffic
The most common and expensive topology mistake I see is giving replicas weaker hardware than the primary to save cost — smaller instance types, slower disks. The reasoning is that replicas are 'just for reads.' The problem is replicas apply WAL continuously in the background while simultaneously serving read queries, and both operations are heavily I/O-bound. If your replica's disk cannot keep pace with the primary's write rate plus its read workload, lag grows and never recovers. Provision replicas with equal or better disk IOPS than the primary. The cost difference is almost always worth it relative to the operational pain of a permanently lagging replica.
📊 Production Insight
Replicas must match or exceed the primary's disk I/O capacity — WAL apply is I/O-bound and competes directly with read query I/O. Cascading replication solves WAL shipping bottlenecks on the primary but adds lag at each tier. Multi-region replicas reduce read latency per region but do nothing for write latency and carry inherent cross-region lag. Provision replicas for write-apply capacity, not just read-serving capacity — a replica that cannot keep up with the primary's write rate provides degraded availability exactly when you need it most.
🎯 Key Takeaway
Star topology is the right default — direct connections, easy to monitor, easy to understand under pressure. Add cascading replication only when you can measure WAL shipping as a bottleneck on the primary, not speculatively. Multi-region replicas give you read locality but introduce cross-region lag that belongs in your SLA documentation, not as a surprise. The single most important hardware decision: provision replicas for WAL apply throughput. A replica that cannot keep up with the primary's write rate lags permanently — and a permanently lagging replica is a hot standby that isn't ready when you need it.
Choosing a Replication Topology
If: Single region, up to ~10 replicas, primary WAL shipping is not a measured bottleneck
Use: Star topology — all replicas connect directly to the primary; simplest to monitor and operate
If: Single region, 10+ replicas, or primary network bandwidth is a measured WAL shipping bottleneck
Use: Cascading topology — tier-1 replicas connect to the primary, tier-2 replicas connect to tier-1; monitor lag per tier separately
If: Multi-region deployment, reads must be low-latency in each region
Use: Place at least one replica per region, route reads to the nearest replica, and accept and document cross-region lag in your consistency SLA
If: Zero-RPO requirement across regions — cannot tolerate any data loss on cross-region primary failure
Use: Synchronous replication to a cross-region replica — benchmark the cross-region RTT impact on write latency before committing; it is often larger than expected
🗂 Asynchronous vs. Synchronous Replication
Trade-offs for consistency, latency, durability, and operational complexity
Dimension | Asynchronous Replication | Synchronous Replication
Write Latency | Low — bounded by local disk speed only | Higher — bounded by local disk plus network RTT to the replica
Data Loss Risk on Primary Crash | Small but nonzero — WAL in transit is lost | Zero — replica must durably acknowledge before the client gets confirmation
Read Consistency | Eventual — replica lag introduces stale reads | Strong — if reading from the synchronous replica
Write Availability on Replica Failure | Unaffected — primary continues accepting writes | Blocked if no synchronous replica is available (configurable with synchronous_mode_strict)
Operational Complexity | Standard — straightforward to monitor and operate | Higher — requires quorum configuration, failover handling, and latency benchmarking
Typical Use Case | Web applications, analytics replicas, read scaling | Financial transactions, compliance workloads, zero-RPO requirements

🎯 Key Takeaways

  • Replication is for availability and read scalability, not for backups — a DROP TABLE on the primary propagates to every replica in milliseconds.
  • Understand your consistency requirements concretely before choosing async or sync: can your application tolerate a 500ms window of stale data, or does every read need to reflect the most recent write?
  • Write-Ahead Logs are both the crash recovery mechanism and the replication stream — they are the single source of truth for every change the primary has made.
  • Monitor LSN distance (byte lag) alongside time-based lag — a replica at zero seconds of lag can be carrying hundreds of megabytes of queued WAL and about to fall behind hard.
  • Automate failover with Patroni or Orchestrator, include fencing in the automation, and test the full sequence on a quarterly cadence — knowing it works in theory is not the same as having proven it works under load.

⚠ Common Mistakes to Avoid

    Treating replicas as backups
    Symptom

    A developer runs DROP TABLE on the primary — accidentally or during a bad migration. The change propagates to all replicas within milliseconds. Every copy of the data is gone simultaneously. The team discovers that 'we had three replicas' does not help at all.

    Fix

    Replication is for availability and read scaling, not for data protection against destructive operations. Maintain separate point-in-time backups using pg_dump, WAL archiving with pgBackRest or Barman, or a managed backup service. Test restores on a schedule — a backup you haven't tested is not a backup, it's a hope.

    Ignoring replication lag monitoring until users complain
    Symptom

    Users report seeing stale balances, missing profile updates, or phantom orders. Application logs show no errors. The replicas are 30–90 seconds behind the primary but no alerting exists. The team finds out from a support ticket, not a monitor.

    Fix

    Monitor replication lag continuously with alerts set before users hit the threshold. Use pg_stat_replication for byte-based lag and pg_last_xact_replay_timestamp() for time-based lag. Define thresholds based on your actual consistency SLA — alert at 1s, page at 5s is a reasonable starting point for most applications. Adjust from there based on measured user impact.
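Both measurements come from two short queries — a sketch assuming PostgreSQL 10 or newer (pg_stat_replication adopted the *_lsn column names in 10):

```sql
-- On the primary: byte-based lag per connected replica
SELECT application_name,
       client_addr,
       state,                               -- 'streaming' is healthy
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
       ) AS replay_lag
FROM pg_stat_replication;

-- On the replica: time-based lag (NULL until at least one transaction has replayed)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```

One caveat: the time-based number misleads on an idle primary — with no new commits, pg_last_xact_replay_timestamp() stops advancing and the apparent delay grows even though nothing is behind. Cross-check against the byte-based gap before paging anyone.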

    Under-provisioning replica hardware relative to the primary
    Symptom

    Replication lag grows steadily during peak write hours. The replica's disk I/O is saturated — it's simultaneously applying WAL and serving read queries, and the disk cannot do both at the capacity the primary generates. Lag climbs through the day and never recovers overnight.

    Fix

    Replicas do work proportional to the primary's write rate. WAL apply is I/O-bound. Provision replicas with equal or better disk IOPS and throughput than the primary. If read traffic is heavy enough to compete meaningfully with WAL apply, consider dedicated replica pools: one set of replicas for failover (WAL apply priority), a separate set for read traffic.

    Hard-coding database IP addresses in application configuration
    Symptom

    The primary fails. Patroni promotes a replica within 20 seconds. The application still connects to the old primary's IP. Writes fail. Engineers spend 30 minutes updating config files, redeploying the application, and restarting connection pools while users see errors. The automated failover was only half-automated.

    Fix

    Use a connection pooler (PgBouncer, ProxySQL) or service discovery (Consul, Route53 with low TTL) as an abstraction layer. The application connects to a stable address that never changes. The failover process updates the pooler's or DNS record's backend target. No application restart required.

    Running VACUUM FULL or large DDL on the primary without accounting for the WAL impact
    Symptom

    VACUUM FULL rewrites an entire table, generating a large WAL burst in a short window. Replicas fall 10–60 minutes behind. If synchronous replication is enabled, the primary blocks on every subsequent write until the replica catches up — application latency spikes to seconds per query during the maintenance window. Engineers didn't plan for this because 'it's just maintenance.'

    Fix

    Schedule heavyweight maintenance (VACUUM FULL, CREATE INDEX CONCURRENTLY, ALTER TABLE with rewrites) during documented low-traffic windows. If synchronous replication is enabled cluster-wide, consider temporarily setting synchronous_commit = 'local' for the maintenance session to prevent primary write blocking, then re-enable after the replica catches up. Monitor lag during and after any large maintenance operation.
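The session-level escape hatch mentioned in the fix looks like this — synchronous_commit is settable per session, so no cluster-wide change or restart is needed (the table name is a placeholder):

```sql
-- Maintenance session only: acknowledge commits after the local WAL flush,
-- without waiting for the synchronous replica
SET synchronous_commit = 'local';

VACUUM FULL big_events_table;   -- placeholder table name

-- Restore this session to the cluster-wide default
RESET synchronous_commit;
```

Note this only relaxes durability for the maintenance session's own commits — the WAL burst still has to reach and be applied by the replicas, so lag monitoring during the window remains essential.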

    Assuming automated failover tools work without ever testing them
    Symptom

    During a real primary failure, Patroni promotes a replica but the connection pooler config wasn't updated as part of the failover sequence and nobody checked. The application continues writing to the old primary for 8 minutes until someone notices. Or the promoted replica had 45 seconds of unapplied WAL because maximum_lag_on_failover was set to an arbitrarily high value — the data loss was silent.

    Fix

    Run controlled failover drills on a quarterly schedule in a production-equivalent environment. Stop the primary deliberately. Verify Patroni promotes the correct replica within your target RTO. Verify the connection pooler or DNS updates. Verify application writes succeed on the new primary without manual intervention. Verify data consistency after the exercise. Document every gap found and close it before the next drill.

Interview Questions on This Topic

  • Q (Senior): Explain the CAP theorem and how it applies to choosing between synchronous and asynchronous replication.
    The CAP theorem states a distributed system can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition tolerance. Since network partitions are not optional — they happen whether you plan for them or not — the practical choice is between CP and AP systems. Synchronous replication is CP: the primary blocks writes until a replica acknowledges, guaranteeing that all nodes see the same data, but if the replica is unreachable (partition), writes block or fail — trading availability for consistency. Asynchronous replication is AP: writes succeed on the primary immediately regardless of replica state, so the system stays available through a partition, but replicas may serve stale data — trading consistency for availability. The choice should be driven by what your application can tolerate: data loss on primary failure (async risk) versus write unavailability during a partition (sync risk). Neither is universally correct.
  • Q (Senior): What is split brain in a database cluster, and how do quorum-based systems prevent it?
    Split brain occurs when a network partition causes two nodes to simultaneously believe they are the primary. Both accept writes independently. Their data histories diverge silently with no immediate error signal, and merging the diverged histories after recovery is generally impossible without data loss or manual reconciliation. Quorum-based systems prevent this by requiring a majority vote — N/2 + 1 nodes agreeing — before any node can hold the leader role. A node that cannot reach a majority of the quorum cannot acquire or retain the leader lock, and it demotes itself to read-only. Tools like Patroni use etcd, which implements the Raft consensus algorithm, to maintain a distributed leader lock with a TTL. Only the node holding the lock can accept writes. If the lock holder can't renew its lock before TTL expiry, the quorum elects a new leader. Fencing mechanisms — stopping the old primary process via STONITH or revoking its VIP — add a second layer of protection to ensure the old primary cannot accept writes even if its process is stuck and hasn't self-demoted cleanly.
  • Q (Senior): You're seeing a sudden spike in replication lag on a replica that was previously healthy. Walk me through your debugging process.
    Start by confirming the replica is still connected: query pg_stat_replication on the primary and look for the replica's row. If it's missing, the streaming connection dropped — check network connectivity and pg_hba.conf allows the replication user. If the row exists, check the state column — 'streaming' is healthy, 'catchup' means it's behind and actively recovering, anything else needs investigation. Next, measure lag in bytes with pg_wal_lsn_diff(sent_lsn, replay_lsn) — is the gap growing or stable? If growing, the replica is consuming WAL slower than the primary is generating it. SSH to the replica and run iostat -x 1 — if %util on the data disk is near 100%, the disk is saturated. Check whether heavy read queries are competing with WAL apply for I/O. Check for long-running transactions on the primary that may have generated a WAL burst on commit — look at pg_stat_activity for transactions older than a few minutes. If none of these surface the cause, check whether the replica is running on instance hardware weaker than the primary — that's a provisioning problem and the lag will never resolve under production write load.
  • Q (Mid-level): How does statement-based replication differ from row-based replication, and which is safer for non-deterministic functions?
    Statement-Based Replication ships the SQL statement itself to the replica. The replica re-executes it — INSERT INTO orders VALUES(NOW(), ...) becomes a new function call on the replica, and if there's any time difference between the execution on primary and replica, NOW() returns a different value. Non-deterministic functions like NOW(), UUID(), RAND(), and user-defined functions with side effects produce different results on the replica, causing data divergence that's invisible at replication time and only discovered later. Row-Based Replication ships the actual row data that resulted from the statement on the primary — the literal timestamp value, not the function call. The replica applies the exact bytes the primary produced, making non-deterministic functions safe. MySQL has supported both modes historically and defaults to RBR (binlog_format=ROW) for safety. PostgreSQL's physical replication is inherently block-level and does not have this distinction — it replicates the physical data pages, not SQL statements. Logical replication in PostgreSQL also operates at the row level. For any workload involving non-deterministic functions, RBR is the only safe choice.
  • Q (Mid-level): What is a replication slot in PostgreSQL, and why is it dangerous if a replica goes offline for an extended period?
    A replication slot is a server-side tracking mechanism that tells the primary: retain all WAL segments that this specific consumer has not yet acknowledged receiving. Without a slot, the primary recycles WAL segments on a schedule regardless of whether any replica has consumed them — a replica that falls behind can find its required WAL already deleted. Slots solve that problem by creating a guarantee. The danger is the guarantee has no expiry and no size limit by default. If the replica holding the slot goes offline — for maintenance, due to a failure, or because someone forgot about it — the slot continues pinning WAL indefinitely. The primary's pg_wal directory grows continuously until the disk fills. A full disk on the primary causes a total write outage: PostgreSQL cannot write WAL, so it cannot commit any transaction. The fix is two-part: set max_slot_wal_keep_size as a hard cap that limits how much WAL any slot can retain, and add monitoring alerts for inactive slots retaining more than a threshold you're comfortable with — I use 1 GB as the alert threshold and 10 GB as the hard cap.
  • Q (Mid-level): What is the difference between physical replication and logical replication in PostgreSQL?
    Physical replication operates at the storage block level. It ships WAL byte-for-byte, and the replica applies those bytes to produce an exact copy of the primary's data files — same schema, same tables, same everything, including internal page layout. You cannot replicate a subset of tables, filter rows, or replicate to a different PostgreSQL major version. Physical replication is used for high-availability standby setups because it's simple, low-overhead, and the replica is ready to promote immediately. Logical replication operates at the row level using a publication and subscription model. You define which tables to replicate (a publication), and the subscriber receives row-level INSERT, UPDATE, and DELETE events for those tables. You can replicate across different PostgreSQL versions, replicate a subset of tables, and even filter rows. This makes logical replication the right tool for selective data distribution, online major version migrations, and CDC pipelines that feed downstream systems like Kafka or data warehouses. The trade-offs: logical replication has higher per-row processing overhead, does not replicate DDL changes automatically (schema changes must be applied manually on the subscriber), and requires wal_level = logical on the primary.
  • Q (Junior): Explain read-after-write consistency and how you would implement it in a Primary-Replica architecture.
    Read-after-write consistency is the guarantee that after a user performs a write, any subsequent read — by that same user — will reflect their write, even if the system uses eventually-consistent replicas for reads. Without this guarantee, a user saves their profile update, immediately navigates to their profile page, and sees the old data because the read hit a replica that hasn't caught up yet. Frustrating and erodes trust immediately. In a Primary-Replica architecture, there are three implementation approaches in increasing order of precision. First: timestamp-based routing — record the time of each user's last write in a session cookie or cache entry, and route that user's reads to the primary for N seconds (where N is slightly larger than your observed maximum lag). Simple to implement, slightly wastes primary read capacity. Second: LSN-based routing — capture the write's LSN from pg_current_wal_lsn() after the transaction commits, store it per user, and route reads to the primary until a replica confirms via pg_last_xact_replay_timestamp() that it has advanced past that LSN. More precise, but requires application plumbing to pass LSN through the stack. Third: synchronous replica — ensure the write is committed to a synchronous replica before responding to the user, then route reads to that replica. Zero-lag guarantee, but adds synchronous write latency to every write for every user, not just those who need read-after-write consistency. Most applications use approach one for its simplicity, with the session window tuned conservatively.

Frequently Asked Questions

Does replication improve write performance?

No — and synchronous replication actively decreases it. Replication adds overhead: WAL generation, log shipping, and in the synchronous case, blocking on replica acknowledgment before the primary can respond to the client. Write latency in synchronous mode is gated by the network round-trip to the replica. To scale write throughput beyond a single primary's capacity, you need either sharding (partitioning data across multiple primaries, each responsible for a subset) or a multi-primary architecture with conflict resolution. Neither is simple. Exhaust vertical scaling and read offloading to replicas before going there.

What is replication lag?

Replication lag is the delay between a transaction committing on the primary and that same transaction becoming visible on a replica. It's measured two ways: time-based (how many seconds behind the primary is the replica?) and byte-based (how many bytes of WAL has the replica not yet applied?). Lag is a fundamental property of asynchronous replication — it's not a bug or a misconfiguration. The engineering question is not how to eliminate it but whether it is within the bounds your application can tolerate, and whether you have monitoring to know when it isn't.

What happens to the replica if the primary fails?

In a properly configured high-availability setup, one replica is promoted to become the new primary. The promotion sequence involves: ensuring the replica has applied all available WAL, transitioning it from read-only standby mode to read-write primary mode, reconfiguring any remaining replicas to follow the new primary, and redirecting application write traffic to the new primary's endpoint. Automated tools like Patroni handle this sequence in under 30 seconds. Manual promotion requires running pg_ctl promote on the chosen replica and then updating application connection strings — a process that takes minutes at best and is prone to errors under pressure. Invest in automation before you need it.

What is the difference between a hot standby and a warm standby?

A hot standby continuously receives and applies WAL from the primary while simultaneously accepting read-only queries from application clients. This is the standard in modern PostgreSQL streaming replication — replicas offload read traffic from the primary while remaining ready to promote on a moment's notice. A warm standby also receives and applies WAL continuously, but it does not accept query connections until a failover is explicitly triggered. Warm standbys use fewer resources (no read traffic overhead) and are appropriate for dedicated disaster recovery nodes where cost matters more than read offloading. Hot standbys are the default choice for production HA clusters.

Naren, Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged