Database replication distributes data from a primary to replicas for scalability and failover
Primary-Replica architecture: one write node, many read nodes — zero conflicts by design
Async replication trades durability for speed; sync guarantees zero data loss at the cost of RTT
Replication lag is the delay between primary commit and replica visibility — monitor in seconds and bytes
Replication is not backup — destructive DDL propagates instantly, always maintain independent backups
Automate failover with Patroni or Orchestrator; manual promotion at 3 AM is where data gets lost
✦ Definition~90s read
What is Database Replication?
Database replication is the process of copying data from one database server (the primary) to one or more other servers (replicas) to maintain consistency across multiple nodes. It exists primarily to solve two problems: read scalability (offloading query traffic from the primary) and high availability (surviving node failures without data loss).
★
Imagine a wildly popular recipe book.
But the illusion of free read scaling is dangerous — every replica adds write amplification on the primary, consumes network bandwidth, and, critically, creates replication slots that can bloat WAL (Write-Ahead Log) retention on PostgreSQL and similar systems. If a replica falls behind or disconnects, the primary must keep old WAL segments until that slot is consumed, potentially filling disk and killing the primary.
Replication is not a scaling panacea; it's a trade-off between consistency, latency, and operational complexity.
In the ecosystem, replication architectures fall into two camps: primary-replica (single-writer, many readers) and multi-primary (multiple writers). Primary-replica is simpler and dominates production systems — PostgreSQL streaming replication, MySQL Group Replication, and AWS RDS Multi-AZ all use variants.
Multi-primary, like CockroachDB or Galera Cluster, offers write scalability but introduces conflict resolution nightmares (last-write-wins, CRDTs, or application-level merging). For most OLTP workloads, primary-replica is the right choice; multi-primary is for edge cases where write availability trumps consistency guarantees.
Conflict resolution patterns matter only when you allow writes to multiple nodes. Synchronous replication (primary waits for replica acknowledgment) guarantees zero data loss but kills write latency — every transaction waits for the slowest replica. Asynchronous replication (fire-and-forget) gives low latency but risks data loss on primary failure.
The decision matrix boils down to: can you tolerate losing the last few seconds of writes? If yes, async is fine for read scaling. If no, synchronous replication with a single replica (or quorum commit) is mandatory, but you must monitor replication lag obsessively — lag causes stale reads, breaks application logic, and, in PostgreSQL, bloats replication slots until the primary crashes.
Tools like pg_stat_replication, Patroni, and orchestrator help, but they don't fix the fundamental physics: replication is not free, and slot bloat is the silent killer of primaries.
Plain-English First
Imagine a wildly popular recipe book. The library keeps one original copy, but thousands of people want to read it simultaneously. So the librarian makes identical copies and distributes them to branches across the city. When someone adds a new recipe to the original, those changes gradually sync to every branch. Database replication works exactly like that — your primary database is the original book, replicas are the branch copies, and the WAL is the courier running between them. The system stays responsive under demand because readers never queue behind each other at a single shelf.
At scale, a single database server cannot handle millions of concurrent reads. When it dies, the whole product goes dark. Database replication solves both problems by spreading reads across servers and keeping a hot standby ready to take over.
But replication introduces its own failure modes. Stale data after a write, replication slots silently filling the primary's disk, and split-brain scenarios during failover are real. I've debugged each of these in production, and the pattern is always the same: the monitoring wasn't there.
This article covers how replication works under the hood — WAL streaming, lag measurement, topology design — and the failure patterns that catch most teams off guard. By the end, you'll know how to design a topology that survives production, not just the happy path.
Database Replication — The Illusion of Free Read Scaling
Database replication is the process of copying data from a primary database to one or more replica databases to distribute read load and provide failover capacity. The core mechanic is a continuous stream of write-ahead log (WAL) records shipped from primary to replicas, which replay them to maintain an eventually consistent copy. This is not a cache — replicas hold full table data, not a subset.
In practice, replication introduces a lag window: replicas are always slightly behind the primary. PostgreSQL uses physical replication slots that pin WAL segments until all replicas confirm receipt. If a replica falls behind or disconnects, WAL accumulates on the primary, eventually filling disk and crashing the primary. This is slot bloat — a silent killer that takes down production databases with no query spike.
Use replication when you need read-heavy workloads to scale beyond a single node's capacity, or when you require a hot standby for rapid failover. It is not a substitute for proper indexing or connection pooling — those are cheaper and simpler. Replication buys you read throughput at the cost of operational complexity and a new failure mode: the primary can die because a replica is slow.
Slot Bloat Is Not a Replica Problem — It's a Primary Problem
A disconnected or slow replica causes WAL to pile up on the primary, not the replica. The primary runs out of disk and crashes, taking all replicas with it.
Production Insight
A team added a replica for reporting but never monitored replication lag. The replica fell 6 hours behind due to a slow query. The primary's WAL grew to fill the disk, causing a full outage for all services.
The symptom: primary disk usage climbs steadily while replica lag increases. No spike in queries or connections — just a slow, silent disk fill.
Rule of thumb: set replication slot max_wal_size to 1/10 of available disk, and alert on lag exceeding 5 minutes. Never rely on replicas to keep the primary alive.
Key Takeaway
Replication does not make reads faster — it makes them possible without killing the primary.
A single slow replica can destroy your primary via WAL accumulation — monitor lag, not just replica health.
Replication is a scaling tool for reads, not a substitute for query optimization or connection pooling.
thecodeforge.io
Replication Slot Bloat: Primary Killer
Database Replication
Core Architectures: Primary-Replica vs. Multi-Primary
In a Primary-Replica setup — the architecture that powers the vast majority of production databases you'll encounter — all writes go to a single Primary node. That node records every change in a Write-Ahead Log (WAL) or Binary Log and ships those events to one or more Replica nodes. Replicas are read-only; they apply the log entries to their own copy of the data to stay in sync. This is the industry default for a reason: there is exactly one source of truth, conflict resolution is trivially simple (there are no conflicts), and the operational model is easy to reason about.
Multi-Primary replication allows writes on any node, which sounds like the availability holy grail until you actually implement it. If two clients update the same row on different primaries within the same time window, the system must resolve the conflict using strategies like Last Write Wins (LWW), Vector Clocks, or custom application logic. Last Write Wins sounds simple but silently discards legitimate writes — whichever timestamp wins, the other write vanishes with no error and no log entry. Vector Clocks are more correct but require your application layer to understand and handle conflict signals. Neither is free.
I've seen teams reach for Multi-Primary when their real problem was read scaling, which Primary-Replica solves without any of that complexity. The rule is straightforward: don't adopt multi-primary until you've genuinely exhausted vertical scaling and read-replica offloading, and your team understands distributed consensus deeply enough to operate it at 3 AM.
io/thecodeforge/db/ReplicaStatus.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Run this on the Primary to see connected standbys and their lagSELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
(sent_lsn - replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;
-- Run this on a Replica to confirm it is in recovery and check replay timestampSELECTpg_is_in_recovery() AS is_replica,
pg_last_xact_replay_timestamp() AS last_replayed_at,
now() - pg_last_xact_replay_timestamp() AS current_lag;
Output
client_addr | state | replication_lag_bytes
------------+-----------+----------------------
10.0.1.5 | streaming | 4096
10.0.1.6 | streaming | 98304
Replication Architecture Mental Model
Primary-Replica: one source of truth, replicas are read-only copies — zero conflict resolution needed
Multi-Primary: any node accepts writes — requires conflict resolution (LWW, Vector Clocks, or application logic)
99% of production systems run Primary-Replica — it is simpler, safer, and sufficient until write throughput genuinely demands multi-primary
Split brain is the #1 operational risk of multi-primary: two nodes accept conflicting writes, data diverges, and the divergence is usually silent
Production Insight
Primary-Replica eliminates conflict resolution by design — use it by default.
Multi-primary adds complexity across every layer and requires distributed consensus.
Only adopt multi-primary when horizontal write scaling is a proven hard requirement.
Most teams that chase multi-primary are really solving a read-scale problem that Primary-Replica handles with zero application changes.
Key Takeaway
Default to Primary-Replica. Only reach for multi-primary when write throughput genuinely demands it and your team can operate distributed consensus under pressure. The simplicity of a single writer saves you from years of conflict-resolution headaches.
Conflict Resolution Patterns
When multiple nodes accept writes — as in multi-primary or active-active topologies — conflicts are inevitable. Two clients update the same row on different primaries before replication propagates either change. The system must decide which version wins and how to reconcile the loser. This section covers the three dominant approaches: Last Writer Wins, Vector Clocks, and CRDTs.
Last Writer Wins (LWW) is the simplest and most common. Each write carries a timestamp. The operation with the most recent timestamp wins; the losing write is silently discarded. LWW is easy to implement and fast — no coordination needed during writes. The cost is silent data loss: if two writes happen within the same clock tick, or if one node's clock is skewed, legitimate updates vanish without notification. LWW is acceptable for low-criticality data like session preferences or user-agent strings. It is dangerous for financial or inventory data where every update matters.
Vector Clocks track causal relationships by maintaining a mapping of node -> logical counter. When a write arrives, the vector clock is compared to the current version. If one clock dominates (all counters >= the other's and at least one >), there is no conflict and the later version wins. If neither dominates, a genuine conflict exists and must be resolved, typically by returning both versions to the application layer. Vector clocks never silently lose data, but they require application-level conflict handlers, add metadata overhead per row, and make the read path more complex.
CRDTs (Conflict-free Replicated Data Types) are mathematical structures designed so that concurrent updates always converge without coordination. For example, a counter that only increments can use a G-Counter where each node stores its own increment count and the total is the sum. CRDTs eliminate conflicts entirely but only support a limited set of data types (counters, sets, registers). They are not a general-purpose solution for relational tables with arbitrary schemas. CRDTs are most common in real-time collaboration tools (shared document editing) and eventually-consistent key-value stores.
For most database replication scenarios, LWW with careful clock synchronization is the pragmatic default, and multi-primary should be avoided unless the write throughput requirement genuinely mandates it. If you must use multi-primary, plan for conflict resolution in your application layer from day one — retrofitting it later is painful and error-prone.
io/thecodeforge/db/conflict_resolver.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# io.thecodeforge: Example conflict resolver using Vector Clocks# This is a simplified simulation — real systems use persistent storage for clock dataclassVectorClock:
def__init__(self, clocks=None):
self.clocks = clocks or {} # node_id -> counterdefincrement(self, node_id):
self.clocks[node_id] = self.clocks.get(node_id, 0) + 1defdominates(self, other):
# Returns True if self > other (self is strictly newer)
has_greater = Falsefor node, counter inself.clocks.items():
other_counter = other.clocks.get(node, 0)
if counter < other_counter:
returnFalseif counter > other_counter:
has_greater = Truefor node, counter in other.clocks.items():
if node notinself.clocks and counter > 0:
returnFalsereturn has_greater
defconcurrent(self, other):
returnnotself.dominates(other) andnot other.dominates(self)
# Example usage
a = VectorClock({'A': 3, 'B': 1})
b = VectorClock({'A': 2, 'B': 2})
print(f"a dominates b: {a.dominates(b)}") # False (A:3>2 but B:1<2)print(f"concurrent: {a.concurrent(b)}") # True -> conflict!print(f"b dominates a: {b.dominates(a)}") # False
Output
a dominates b: False
concurrent: True
b dominates a: False
Clock Skew Breaks Last-Writer-Wins
LWW relies on timestamps being reasonably synchronized across nodes. If a node's clock drifts by even a few seconds, writes from that node can incorrectly win against genuinely newer writes from other nodes. Use NTP with tight configuration (multiple servers, frequent sync) on every node in a multi-primary cluster. In cloud environments with fragmented VM clocks, consider using monotonic logical counters (Lamport timestamps) instead of wall-clock time.
Production Insight
Multi-primary with LWW silently loses data when writes conflict. Multi-primary with Vector Clocks never loses data but requires application-level handlers for concurrent writes.
If the number of conflicts is low, LWW is acceptable for non-critical data. For critical data, either avoid multi-primary or invest in application-level conflict resolution.
CRDTs are elegant but limited in scope — they are not a drop-in replacement for relational table replication.
In practice, you'll rarely need multi-primary. But if you do, test your conflict resolution under realistic write contention before going to production.
Key Takeaway
Conflict resolution is the hardest part of multi-primary replication. LWW is simple but loses data silently; Vector Clocks are correct but require application awareness; CRDTs are safe but only work for specific data types. Default to Primary-Replica to avoid the problem entirely. If you must go multi-primary, invest heavily in testing conflict scenarious.
Synchronous vs. Asynchronous Replication
This is the most consequential trade-off in replication design, and it's worth spending time with rather than defaulting to whatever the tutorial used.
In asynchronous replication, the primary confirms the write to the client as soon as it's committed locally. The WAL is shipped to replicas in the background, after the acknowledgment. This is fast — write latency is bounded only by local disk speed — but it creates a window where committed transactions exist only on the primary. If the primary crashes before shipping that WAL, those transactions are gone. The replica becomes the new primary with a gap in its history that nobody can fill.
In synchronous replication, the primary waits until at least one replica confirms it has received and durably written the WAL before acknowledging the commit to the client. This closes the data loss window entirely — at least one copy of every committed transaction exists before the client gets a response. The cost is that write latency now includes the network round-trip to the replica. If your replica is in a different availability zone, you're paying 2–10ms per write, every write, forever.
Neither is obviously correct. The right choice depends on what your data is worth and what your users will tolerate. A social media feed can absorb async lag without most users noticing. A financial transaction system where a committed payment might vanish on primary failure is a different conversation entirely.
io/thecodeforge/db/docker-compose.ymlDOCKER
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# io.thecodeforge: Quick-start PostgresPrimary-Replica cluster for local development
# Do not use this config verbatim in production — add TLS, auth, and resource limits
services:
db-primary:
image: postgres:16-alpine
environment:
POSTGRES_PASSWORD: forge_secret
POSTGRES_USER: forge_admin
command: |
postgres
-c wal_level=replica
-c max_wal_senders=10
-c max_replication_slots=10
-c wal_keep_size=256MB
db-replica:
image: postgres:16-alpine
depends_on:
- db-primary
environment:
PGPASSWORD: forge_secret
command: |
/bin/sh -c "
until pg_basebackup -h db-primary -D /var/lib/postgresql/data \
-U forge_admin -vP -W --wal-method=stream -R; do
echo 'Primary not ready, retrying in 2s...'; sleep 2
done
echo 'Basebackup complete. Starting replica...'
postgres"
Output
db-primary | LOG: database system is ready to accept connections
db-replica | NOTICE: pg_basebackup: initiating base backup
db-replica | NOTICE: pg_basebackup: base backup completed
db-replica | LOG: started streaming WAL from primary at 0/1000000
Synchronous replication ties your write latency to the network round-trip between primary and replica. If your replica is in a different availability zone, expect 2–10ms added per write. At 1,000 writes per second, that is a measurable throughput ceiling and a latency floor you cannot escape without moving the replica closer. Always benchmark synchronous replication under production-representative write load before enabling it — the latency impact is rarely what people expect, and it only surprises you at the worst time.
Production Insight
Async replication risks losing committed transactions if the primary crashes before shipping WAL — the window is usually small but nonzero.
Sync replication eliminates that window but adds network RTT to every write, and the RTT is not negotiable once your topology is fixed.
Measure both before committing to either. Use async by default for most workloads. Enable sync only for financial or compliance-critical write paths where data loss is a harder constraint than latency.
Also consider selective sync: use synchronous_commit='remote_apply' for specific high-value transactions rather than cluster-wide.
Key Takeaway
Asynchronous replication gives you fast writes, but the primary can crash before shipping WAL — the last few committed transactions may vanish with no way to recover them. Synchronous replication closes that window entirely but gates every write on the replica's network RTT. Use async by default. Enable sync only when a compliance or financial requirement makes zero data loss non-negotiable — and benchmark the latency impact under realistic write load before you ship it.
Choosing Synchronous vs. Asynchronous Replication
IfGeneral web application — can tolerate potential loss of the last few committed writes on catastrophic primary failure
IfFinancial transactions or compliance requirement where zero committed-data loss is mandatory
→
UseUse synchronous — accept the latency penalty for guaranteed durability, benchmark write throughput under load
IfMulti-AZ deployment where replica is 50–200ms round-trip from primary
→
UseUse asynchronous with fast WAL shipping and monitoring — synchronous across AZs can add 5–15ms per write and crater your write throughput
IfNeed durability for specific write paths but cannot afford synchronous latency cluster-wide
→
UseUse synchronous_commit = 'remote_apply' selectively per transaction at the application layer — pay the sync tax only where it is genuinely required
Sync vs Async Decision Matrix
To help you choose between synchronous and asynchronous replication, here is a visual decision matrix summarizing key trade-offs across common dimensions. Use this alongside the other content in this section to evaluate your specific use case.
Decision Matrix: Async vs Sync Replication
Use this matrix to evaluate which mode fits your requirements. The row with the most 'Async' or 'Sync' checks should guide your choice, but always test under production-like load.
Production Insight
This matrix distills the trade-offs discussed earlier. Use it as a quick reference when evaluating a new workload.
Remember that you can mix modes per transaction in PostgreSQL and MySQL — you don't have to pick one for the entire cluster.
For example, use async for bulk imports and sync for individual customer-facing writes.
Key Takeaway
Async for speed and simplicity, sync for durability at the cost of latency. Know your data's value before choosing.
Replication Lag: Causes, Measurement, and Consequences
Replication lag is the time gap between a transaction committing on the primary and that same transaction becoming visible on a replica. It's measured two ways: time-based (how many seconds is the replica behind?) and log-based (how many bytes of WAL has the replica not yet applied?). Both metrics matter and they tell you different things. Time tells you the user-visible impact. Bytes tell you how much backlog the replica is carrying and whether it's catching up or falling further behind.
Lag is not a bug in your replication setup. It is a fundamental, expected property of asynchronous replication, and the engineering question is never 'how do I eliminate lag?' but 'is this lag within acceptable bounds for my use case?' A 200ms lag on a social media activity feed is invisible to users. A 200ms lag on a bank account balance immediately after a wire transfer is a customer complaint and potentially a compliance issue. The same 200ms, completely different problem severity.
The causes of lag that I see most often in production: long-running transactions on the primary that generate a large WAL burst when they commit, replica hardware that is weaker than the primary (slower IOPS means slower WAL apply), network congestion between primary and replica — especially on shared cloud network links during peak hours, and replicas running heavy read queries that compete with the WAL apply process for disk I/O. That last one is subtle and frequently overlooked: your read traffic is physically competing with replication for the same disk, and replication will lose if the disk is saturated.
io/thecodeforge/db/LagMonitoring.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
-- io.thecodeforge: Replication lag monitoring — run both methods, not just one-- Method 1: Time-based lag (run on the REPLICA)-- Tells you user-visible staleness in wall-clock timeSELECTnow() - pg_last_xact_replay_timestamp() AS time_lag,
CASEWHENnow() - pg_last_xact_replay_timestamp() > INTERVAL'5 seconds'THEN'PAGE: lag exceeds 5s'WHENnow() - pg_last_xact_replay_timestamp() > INTERVAL'1 second'THEN'WARN: lag exceeds 1s'ELSE'OK'ENDAS lag_status;
-- Method 2: Byte-based lag (run on the PRIMARY)-- Tells you WAL backlog depth — a replica at 0s time lag can still carry 500MB of queued WALSELECT
client_addr,
state,
sent_lsn,
replay_lsn,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_pretty,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
CASEWHENpg_wal_lsn_diff(sent_lsn, replay_lsn) > 104857600THEN'PAGE: >100MB lag'WHENpg_wal_lsn_diff(sent_lsn, replay_lsn) > 10485760THEN'WARN: >10MB lag'ELSE'OK'ENDAS lag_status
FROM pg_stat_replication;
Output
-- Replica output
time_lag | lag_status
---------+-----------
00:00:00 | OK
-- Primary output
client_addr | state | lag_pretty | lag_bytes | lag_status
Pro Tip: Monitor Both Time and Bytes — They Catch Different Failure Modes
Time-based lag tells you whether users are seeing stale data right now. Byte-based lag tells you whether the replica is about to fall behind the moment write volume spikes. A replica can show 0 seconds of time lag while carrying 500 MB of queued WAL — it looks fine until your next batch job fires and the backlog doubles in 60 seconds. Alert on both metrics with separate thresholds, and graph them together so you can see the correlation when a write burst hits.
Production Insight
Replication lag is inevitable in async replication — the question is whether it is within your application's tolerance for stale data.
A 500ms lag on a read-your-own-write flow means users see their own edits disappear for half a second immediately after saving, which erodes trust fast.
Measure lag continuously. Set alert thresholds based on your actual consistency requirements, not round numbers.
And alert before users complain, not after.
Also track the rate of change of byte lag — a rising rate indicates a systemic issue, not a transient spike.
Key Takeaway
Lag is a property of async replication, not a malfunction — measure it continuously and set thresholds before users find your threshold for you. Monitor both time-based lag (user impact) and byte-based lag (how much work is queued). If your application requires read-after-write consistency, implement it explicitly: route the user's reads to the primary for a short window after their write rather than pretending the replica is always current.
Responding to Replication Lag
IfLag is stable and under your SLA threshold — not growing
→
UseNo action required — this is normal async replication behavior, document the baseline
IfLag is growing steadily over time during normal traffic
→
UseCheck replica disk I/O with iostat -x 1. Check whether heavy read queries compete with WAL apply. Consider a dedicated WAL-apply-only replica for failover and separate read replicas for query traffic
IfLag spikes during batch writes or large transactions then recovers
→
UseExpected burst behavior — monitor recovery time. If it takes longer to recover each cycle, the replica hardware is under-provisioned for your write growth rate
IfUsers report seeing stale data immediately after their own writes
→
UseImplement read-after-write consistency: route that user's reads to the primary for N seconds after their last write, or track the write LSN and wait for the replica to catch up to it before serving the read
Write-Ahead Logs and Binary Logs: The Replication Plumbing
Every mainstream relational database uses a write-ahead log as the physical foundation of replication. In PostgreSQL it's the WAL. In MySQL it's the Binary Log (binlog). The concept is the same in both: before any change is applied to the actual data files, it's written sequentially to the log first. The log is the authoritative record of every change the primary made, in the exact order it was made. Replicas consume this log to reconstruct the primary's state on their own storage.
WAL serves two distinct purposes that are easy to conflate. First, crash recovery: if the primary dies mid-write, the WAL records enough information to replay or roll back the incomplete transaction on restart. The data files are always recoverable from a consistent checkpoint plus the WAL that follows it. Second, replication: the primary streams WAL segments to connected replicas over a network socket, and replicas apply them in sequence. These two purposes share the same physical log, which is why WAL configuration affects both durability and replication behavior.
The wal_level setting in PostgreSQL determines how much detail the WAL contains. The replica level is the minimum for physical streaming replication — it records enough to reconstruct data changes. The logical level adds row-level detail needed for logical replication and change data capture pipelines like Debezium. Logical WAL generates more volume per write — roughly 2–4x in high-UPDATE workloads — so don't set it unless you actually need it. And changing wal_level requires a server restart, so plan this before you have replicas depending on it.
io/thecodeforge/db/WALConfiguration.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
-- io.thecodeforge: PostgreSQL WAL configuration audit and replication slot monitoring-- Check current WAL level — requires server restart to changeSHOW wal_level;
SHOW max_wal_senders;
SHOW max_replication_slots;
-- Recommended postgresql.conf settings for a replication primary:-- wal_level = replica -- minimum for physical streaming replication-- max_wal_senders = 10 -- max concurrent replication connections (replicas + tools)-- max_replication_slots = 10 -- max slots; each slot pins WAL until the replica consumes it-- wal_keep_size = '1GB' -- WAL to retain without slots (safety net, not a substitute for slots)-- max_slot_wal_keep_size = '10GB' -- hard cap on WAL any single slot can pin (CRITICAL — set this)-- Monitor active WAL positionSELECTpg_current_wal_lsn() AS current_lsn,
pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file;
-- Audit replication slots — pay attention to inactive slots retaining large WALSELECT
slot_name,
slot_type,
active,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal,
CASEWHEN active = false
ANDpg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824THEN'ALERT: inactive slot retaining >1GB — check immediately'ELSE'OK'ENDAS slot_status
FROM pg_replication_slots
ORDERBYpg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
Output
slot_name | slot_type | active | retained_wal | slot_status
Why WAL Retention Matters More Than Most Engineers Realize
Replication slots tell the primary: do not recycle any WAL segment until this replica has consumed it. That guarantee is critical when a replica needs to catch up after a brief network interruption. The problem is the guarantee has no expiry. An offline replica's slot pins WAL indefinitely — 1 hour, 3 days, however long it takes someone to notice. Without max_slot_wal_keep_size set, a single dead replica's slot will eventually fill the primary's entire pg_wal directory, causing a complete write outage. Set that parameter. Monitor slot retention. Drop slots before extended maintenance windows.
Production Insight
WAL is simultaneously the crash recovery mechanism and the replication stream — it is the single source of truth for every change the database has ever made.
Replication slots are a powerful safety net for replicas, but without monitoring and a size cap, one offline replica's slot becomes a time bomb for the primary's disk.
Always set max_slot_wal_keep_size, monitor inactive slot retention, and treat a slot retaining more than 1 GB as an alert-level event.
Also, document slot ownership — tag each slot with a human-readable description in the slot_name so you know which replica it protects.
Key Takeaway
WAL is the replication stream — every change is logged sequentially before touching the data files, and replicas apply that log to stay in sync. Replication slots are valuable but dangerous without limits: a dead replica's slot pins WAL indefinitely and will fill your primary's disk. Set max_slot_wal_keep_size unconditionally. Alert on inactive slots retaining more than 1 GB. A disk-full primary is a total outage — not a performance degradation, not a warning. A total outage.
IfNeed logical replication, CDC, or Debezium integration
→
UseSet wal_level = logical — generates more WAL volume per write, but enables row-level change streams and selective table replication
IfUsing replication slots for any replica
→
UseSet max_slot_wal_keep_size as a mandatory safety valve — this single setting prevents one offline replica from filling the primary's disk
IfNot using replication slots
→
UseSet wal_keep_size large enough to cover your longest expected replica downtime — WAL beyond this limit is eligible for recycling and a lagging replica may lose its position
Automated Failover: When the Primary Dies
Manual failover — SSH into a replica, run pg_ctl promote, update your connection strings, restart the application, redirect traffic — works in a controlled staging exercise during business hours. In production at 3 AM with pager alerts firing, adrenaline running, and half the team half-asleep, manual failover is how data gets lost and how the wrong replica gets promoted. I've seen both happen.
Automated failover tools handle the entire sequence: detect the primary failure, select the most up-to-date eligible replica, promote it, reconfigure remaining replicas to follow the new primary, and update the application's connection endpoint. All of that in under 30 seconds, without human judgment calls under pressure.
The two dominant tools for PostgreSQL are Patroni (built on etcd or Consul for distributed consensus and leader election) and Repmgr (simpler setup, less operational overhead for smaller clusters). For MySQL, Orchestrator is the standard. All three solve the same core problem: ensuring that exactly one node holds the write role at any moment. The 'exactly one' constraint is the hard part — distributed systems make this genuinely difficult because a network partition can make a live primary look dead to the failover system.
Failover tooling alone is not enough. The application must know where to send writes after failover completes. This requires either a connection pooler (PgBouncer, ProxySQL) or DNS-based service discovery (Route53, Consul) that can be updated programmatically as part of the failover sequence. Hard-coding the primary's IP address in your application configuration file is a guarantee that your automated failover will still require manual application intervention — which defeats most of the point.
io/thecodeforge/db/patroni.ymlYAML
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
# io.thecodeforge: Patroni configuration for a 3-node PostgreSQLHA cluster
# Patroni uses a DCS (etcd/consul) for distributed leader election and lock management
# Run one instance of this per PostgreSQL node — change 'name' and 'connect_address' per node
scope: forge-cluster
name: node-1 # change to node-2, node-3 on other nodes
namespace: /db/
restapi:
listen: 0.0.0.0:8008
connect_address: node-1.internal:8008
etcd3:
hosts: etcd-1:2379, etcd-2:2379, etcd-3:2379
dcs:
ttl: 30 # leader lock TTL in seconds
loop_wait: 10 # how often Patroni checks its status
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB — replica must be within 1MB of primary to be promotion-eligible
synchronous_mode: true # require at least one synchronous replica before committing
synchronous_mode_strict: false # do not block writes if no sync replica is available (degrades to async)
pg_hba:
- host replication replicator 10.0.0.0/8 md5
- host all all 0.0.0.0/0 md5
postgresql:
listen: 0.0.0.0:5432
connect_address: node-1.internal:5432
data_dir: /var/lib/postgresql/16/main
bin_dir: /usr/lib/postgresql/16/bin
parameters:
max_connections: 200
shared_buffers: 2GB
wal_level: replica
max_wal_senders: 10
max_replication_slots: 10
max_slot_wal_keep_size: 10GB # safety valve — prevents slot bloat from filling disk
authentication:
replication:
username: replicator
password: forge_repl_secret
superuser:
username: postgres
password: forge_admin_secret
Output
2026-03-05 14:22:01 INFO: promoted self to leader by acquiring session lock
2026-03-05 14:22:02 INFO: no action. I am the leader with the cluster lock
2026-03-05 14:22:12 INFO: Lock owner: node-1; I am node-1
2026-03-05 14:22:22 INFO: node-2 is now a streaming standby
2026-03-05 14:22:32 INFO: node-3 is now a streaming standby
Watch Out: Failover Without Fencing Causes Split Brain
If the old primary is unreachable due to a network partition — but not actually dead — it may still be accepting writes from clients whose connection strings haven't updated yet. Promoting a new primary in that scenario gives you two nodes simultaneously accepting writes with diverging histories. Automated failover must include fencing: stopping the old primary process, revoking its VIP, or removing it from the connection pooler before the promotion completes. Without fencing, you are trading one failure mode for a worse one.
Production Insight
Manual failover under pressure is how data gets lost and how the wrong node gets promoted — automate it before you need it.
Failover without fencing causes split brain, which is harder to recover from than the original primary failure.
Use Patroni with etcd for leader election and ensure your fencing mechanism (VIP revocation, STONITH, pg_hba.conf update) is part of the automated sequence.
Never hard-code primary IPs in application config — the failover is only complete when the application is writing to the new primary without human intervention.
Also test failover regularly, at least quarterly, to ensure the automation still works and everyone knows the procedure.
Key Takeaway
Automated failover with Patroni or Orchestrator converts a 3 AM manual recovery sequence into a sub-30-second automated event. Fencing is not optional — without it the old primary can continue accepting writes and you have split brain, which is a worse problem than the original failure. Never hard-code the primary IP in application config. Use a connection pooler or service discovery layer that updates automatically on promotion — otherwise your automated failover still requires a manual step to be useful.
Failover Strategy Decisions
IfSingle-region PostgreSQL deployment, need fast automatic failover with leader election
→
UseUse Patroni with etcd — handles leader election, replica promotion, standby reconfiguration, and integrates with HAProxy for connection routing
IfMySQL ecosystem, need topology visualization and failover orchestration
→
UseUse Orchestrator — visualizes the full replication topology, detects primary failure, and orchestrates graceful failovers with configurable hooks
IfMulti-region deployment with cross-region replicas and strict RPO requirements
→
UseUse Patroni with synchronous_mode enabled and region-aware candidate priority — prefer a same-region replica for promotion to minimize RPO and avoid cross-region WAL gaps
IfApplication currently hard-codes the primary host or IP
→
UseIntroduce PgBouncer or ProxySQL as a connection proxy before attempting automated failover — the app connects to the proxy, failover updates the proxy's backend, no app restart required
Designing Your Replication Topology for Production
A replication topology is not just 'one primary and a couple of replicas.' It's a deliberate design that accounts for how reads are routed, how WAL shipping load is distributed across the primary's network interface, what the failover blast radius looks like, and whether your replicas will actually help when the primary fails — or whether they'll fail alongside it because they share the same infrastructure failure domain.
The most common production topology is a star: one primary with N replicas, all directly connected to the primary via streaming replication. Read traffic is distributed across replicas via a load balancer or connection pooler. HAProxy and PgBouncer in pool mode are the standard choices here. This works well for single-region deployments up to roughly 10 replicas, at which point the primary's WAL shipping starts to become a bottleneck — it's shipping the same WAL stream N times over the same network interface.
Beyond 10 replicas, or whenever you can measure WAL shipping as a CPU or network ceiling on the primary, cascading (hierarchical) replication is worth evaluating. Tier-1 replicas connect directly to the primary. Tier-2 replicas connect to tier-1 replicas and inherit their lag in addition to their own. The primary's WAL shipping burden drops proportionally, but you add a lag tier at each level — tier-2 replicas are always at least as far behind as tier-1. Monitor each tier's lag separately.
For multi-region deployments, place at least one replica in each region. Route reads to the nearest replica for latency. But be clear-eyed about what cross-region replication lag means: a primary in us-east-1 and a replica in eu-west-1 will carry 80–150ms of lag under normal conditions, and more during network events. Writes always go to the single primary — and if that primary is in a different region than your users, those writes pay the cross-region latency too. Geographic distribution of replicas improves read latency; it does not improve write latency unless you go multi-primary, with all the complexity that brings.
io/thecodeforge/db/topology-diagram.txtTEXT
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- io.thecodeforge: CommonProductionReplicationTopologies
-- STARTOPOLOGY (recommended default, up to ~10 replicas)
-- All replicas connect directly to the primary
-- Primary ships WAL to every replica — network bandwidth scales linearly with replica count
--
-- [Primary]
-- / | \n-- [Rep1] [Rep2] [Rep3]
-- (reads) (reads) (hot standby)
-- CASCADINGTOPOLOGY (for10+ replicas, or when WAL shipping is a primary bottleneck)
-- Tier-1 replicas connect to primary; tier-2 replicas connect to tier-1
-- Reduces primary WAL shipping load — each tier-1 ships to its tier-2 children
-- Lag at tier-2 = tier-1 lag + tier-2 lag — monitor each tier separately
--
-- [Primary]
-- / \n-- [Rep1-A] [Rep1-B]
-- / \ / \n-- [Rep2-A] [Rep2-B] [Rep2-C] [Rep2-D]
-- MULTI-REGIONTOPOLOGY (for geographic read distribution)
-- One or more replicas per region — route reads to nearest replica
-- Cross-region lag is 80-150ms under normal conditions — set SLAs accordingly
-- Writes still go to the single primary — cross-region writes pay the RTT
--
-- [Primary: us-east-1] ----WAL----> [Replica: eu-west-1]
-- | |
-- [LocalReadReplicas] [LocalReadReplicas]
-- (us-east-1 reads) (eu-west-1 reads)
Output
Topology selected based on replica count, read volume, geographic distribution, and acceptable replication lag per tier.
Pro Tip: Provision Replicas for WAL Apply Throughput, Not Just Read Traffic
The most common and expensive topology mistake I see is giving replicas weaker hardware than the primary to save cost — smaller instance types, slower disks. The reasoning is that replicas are 'just for reads.' The problem is replicas apply WAL continuously in the background while simultaneously serving read queries, and both operations are heavily I/O-bound. If your replica's disk cannot keep pace with the primary's write rate plus its read workload, lag grows and never recovers. Provision replicas with equal or better disk IOPS than the primary. The cost difference is almost always worth it relative to the operational pain of a permanently lagging replica.
Production Insight
Replicas must match or exceed the primary's disk I/O capacity — WAL apply is I/O-bound and competes directly with read query I/O.
Cascading replication solves WAL shipping bottlenecks on the primary but adds lag at each tier.
Multi-region replicas reduce read latency per region but do nothing for write latency and carry inherent cross-region lag.
Provision replicas for write-apply capacity, not just read-serving capacity — a replica that cannot keep up with the primary's write rate provides degraded availability exactly when you need it most.
Also consider colocating replicas in the same failure domain as the primary for fast failover, and separate read replicas in other domains for availability.
Key Takeaway
Star topology is the right default — direct connections, easy to monitor, easy to understand under pressure. Add cascading replication only when you can measure WAL shipping as a bottleneck on the primary, not speculatively. Multi-region replicas give you read locality but introduce cross-region lag that belongs in your SLA documentation, not as a surprise. The single most important hardware decision: provision replicas for WAL apply throughput. A replica that cannot keep up with the primary's write rate lags permanently — and a permanently lagging replica is a hot standby that isn't ready when you need it.
Choosing a Replication Topology
IfSingle region, up to ~10 replicas, primary WAL shipping is not a measured bottleneck
→
UseUse star topology — all replicas connect directly to the primary, simplest to monitor and operate
IfSingle region, 10+ replicas, or primary network bandwidth is a measured WAL shipping bottleneck
→
UseUse cascading topology — tier-1 replicas connect to primary, tier-2 replicas connect to tier-1. Monitor lag per tier separately
IfMulti-region deployment, reads must be low-latency in each region
→
UsePlace at least one replica per region, route reads to the nearest replica, accept and document cross-region lag in your consistency SLA
IfZero-RPO requirement across regions — cannot tolerate any data loss on cross-region primary failure
→
UseUse synchronous replication to a cross-region replica — benchmark the cross-region RTT impact on write latency before committing, it is often larger than expected
Architectural Topology Diagram
The following Mermaid diagram visualizes the three major replication topologies: star, cascading, and multi-region. Use it as a reference when designing your deployment.
Choosing the Right Topology
Star is the default for simplicity and low lag. Use cascading when the primary's WAL shipping becomes a measured bottleneck. Use multi-region only when you need geographic read locality, and accept the cross-region lag trade-off.
Production Insight
Visualizing your topology helps you reason about failure domains. In star, the primary is a single point of failure for writes, but reads are distributed.
In cascading, a tier-1 replica failing can cause all its downstream replicas to re-sync.
In multi-region, a region failure may isolate replicas.
Choose based on your tolerance for lag and your ability to tune replication parameters per tier.
The diagram also helps in runbook documentation — new team members can quickly understand the data flow.
Key Takeaway
Star for simplicity, cascading for scale, multi-region for geographic reads. Always document your topology and test failover scenarios.
Replication Topology Comparison
Health Check Dashboard Metrics
A replication health dashboard should surface the metrics that let you detect problems before they become outages. Below is a recommended set of metrics with alert thresholds.
Build a Single Pane of Glass
Aggregate these metrics into a Grafana dashboard or your monitoring tool of choice. Include a timeline of per-replica lag alongside write throughput on the primary. When a batch job spikes write volume, you'll see the lag climb in real time and can correlate it immediately.
Production Insight
The most telling metric is often the combination of time lag and byte lag. A replica may have low time lag but high byte lag — meaning it's applying recently committed transactions but has a large backlog of older WAL. That replica is one write burst away from falling far behind.
Monitor both together and set alerts that catch the 'ready to fall off a cliff' state.
Also track the rate of change of byte lag — a linearly increasing rate means you are approaching a tipping point.
Key Takeaway
Monitor time lag, byte lag, slot health, and replica state. Alert on inactive slots immediately — they are the most common silent killer of primaries. Add the rate of change of byte lag to catch systemic issues before they become outages.
Replication Security: Encryption and Authentication
Replication streams are sensitive — they carry every data change your database makes. If an attacker intercepts the WAL stream, they have a near-complete copy of your data. Secure replication connections with SSL/TLS encryption between primary and replicas. In PostgreSQL, set ssl = on in postgresql.conf and require sslmode=verify-full in pg_hba.conf for the replication user. For MySQL, enable --ssl-mode=VERIFY_IDENTITY for the replication connection.
Authentication matters just as much. Create a dedicated replication user with minimal privileges — it should only have REPLICATION or REPLICATION SLAVE privileges, not superuser. Use strong passwords or certificate-based authentication. Rotate credentials regularly.
Network segmentation is your last line of defense. Place replication traffic on a private subnet or dedicated VLAN. Use firewall rules to allow replication only from known replica IPs. Never expose replication ports (5432 for PostgreSQL, 3306 for MySQL) to the public internet.
I've seen security audits catch replication streams flowing over unencrypted connections between cloud VPCs. It's an easy fix that many teams overlook because 'it's internal traffic.' Internal traffic still crosses shared infrastructure.
Unencrypted Replication is a Data Leak Waiting to Happen
Without TLS, anyone with access to the network path between primary and replica can sniff the entire WAL stream. That includes every insert, update, and delete — potentially PII, financial data, or credentials. Enable SSL/TLS on replication connections even if both nodes are in the same VPC. The performance overhead is negligible compared to the risk.
Production Insight
Replication traffic carries a complete record of every change — treat it with the same security posture as the database itself.
Use TLS for all replication connections, create a dedicated replication user with minimal privileges, and restrict network access to known replica IPs.
These are low-effort changes that prevent high-impact data exposure. Encryption overhead on replication is typically under 5% — it's not a valid reason to skip it.
Also consider using client certificates for mutual authentication as an additional layer.
Key Takeaway
Secure replication like any other sensitive data path. Enable TLS, use dedicated users with minimal privileges, and never expose replication ports to the internet. The overhead is negligible; the risk of not doing it is severe.
Replication in Cloud Environments (AWS RDS, Aurora, Cloud SQL)
Managed cloud databases simplify replication setup but constrain your control. Understanding those constraints prevents surprises during failover or scaling.
AWS RDS for PostgreSQL uses a 'Multi-AZ' deployment for high availability — it's automatic synchronous replication between AZs. The read replica feature uses asynchronous replication and can be promoted manually. RDS does not expose replication slots natively, so you cannot use logical replication with custom slots. If you need logical replication, use RDS Custom or self-managed on EC2.
Aurora has a completely different architecture — it's not standard replication. Aurora uses a shared-storage volume (6 copies across 3 AZs) rather than WAL shipping. 'Replicas' up to 15 can be added with virtually no lag (typically single-digit milliseconds). Failover is automatic within 30 seconds. Aurora shines for read-heavy workloads but has limitations: it does not support all PostgreSQL extensions and has a higher per-row overhead.
Google Cloud SQL for PostgreSQL offers 'grooming' for high availability via synchronous replication within the same region. Read replicas are asynchronous and can be promoted. Cloud SQL supports pglogical for logical replication if you enable the flag. Understand that failover in Cloud SQL typically takes 1-2 minutes due to DNS propagation and instance restart.
When using any managed service, test failover during low-traffic periods. Know the failover time, understand how connections are redirected, and ensure your application retries on connection loss. The cloud handles the infrastructure, but you own the operational readiness.
Production Insight
Managed cloud databases handle the heavy lifting of replication but abstract away critical details.
You can't query pg_replication_slots in RDS. You can't set max_slot_wal_keep_size. Aurora doesn't use WAL at all.
These aren't bad things — they're just different failure modes. Test failover behavior in a staging environment. Measure failover time.
Know how your application reacts to a connection drop. The cloud does not make the problem of replication monitoring go away; it moves it up the stack.
Always have a fallback plan if the managed service's replication limits become a constraint.
Key Takeaway
Cloud-managed replication trades control for convenience. Understand the architecture of your chosen service — Aurora's shared storage vs. standard WAL shipping, RDS's lack of slot visibility, Cloud SQL's failover duration. Test failover before you need it. Always ensure your application retries connections after a primary change.
Choosing a Cloud Replication Solution
IfNeed near-zero read replica lag and automatic failover with minimal management
→
UseUse Aurora — low lag, 30-second failover, but check extension compatibility and cost
IfNeed full control over replication settings (slots, wal_level, etc.)
→
UseUse self-managed PostgreSQL on EC2 or RDS Custom — you own the configuration and monitoring
IfSimple Multi-AZ availability with minimal configuration
→
UseUse RDS Multi-AZ or Cloud SQL HA — automatic synchronous replication within region, but failover time may be longer
IfNeed logical replication for CDC or selective table replication
→
UseUse RDS PostgreSQL with logical replication support (RDS Custom or higher version) or Cloud SQL with pglogical flag enabled
Disaster Recovery Planning with Replication
Replication is part of a disaster recovery plan, but it's not the whole plan. A DR plan based solely on replication fails when someone accidentally drops a table (the DROP propagates to replicas), when a widespread outage takes down both primary and replicas (if they're in the same region), or when a bug in application code corrupts data (the corruption spreads to all copies).
Use replication for availability and read scaling, not for backup. Always maintain independent point-in-time backups using pg_dump, WAL archiving (pgBackRest, Barman), or managed backup services. Test restores on a schedule — a backup you haven't tested is not a backup.
Define recovery objectives. RPO (Recovery Point Objective) is how much data you can afford to lose. RTO (Recovery Time Objective) is how fast you need to recover. Synchronous replication gives you RPO=0 (within a region). Asynchronous replication has a small RPO (seconds of WAL in transit). WAL archiving provides point-in-time recovery to any transaction. Map these to your DR scenarios.
Plan for region failures. If your database is in a single region, a regional outage takes everything down. Use cross-region replicas or backups in another region. Cross-region replication has higher lag and cost but reduces recovery time. Test a regional failover process at least once per quarter.
Document and practice the runbook. The best plan is worthless if nobody knows how to execute it under pressure. Run a DR drill during off-peak hours. Simulate a primary failure, a replica failure, a region failure, and a DROP TABLE incident. Measure actual RPO and RTO. Adjust based on results.
Replication is Availability, Not Backup
A common misconception: 'We have three replicas, so we don't need backups.' As the production incident in this article shows, a DROP TABLE propagates to all replicas in milliseconds. Replication protects against server failure, not human error or corruption. Always maintain independent backups with point-in-time recovery capability. Test restores regularly.
Production Insight
Replication is one layer of a defense-in-depth DR strategy.
Layer 1 is replication (server failure tolerance).
Layer 2 is backups (human error recovery).
Layer 3 is cross-region failover (regional outage tolerance).
Each layer has different RPO/RTO characteristics. Document them. Test them.
The drill that reveals a gap in your runbook is a success, not a failure — better to find it in a drill than in an incident.
Include a scheduled quarterly 'game day' where you simulate different failure scenarios.
Key Takeaway
Replication does not replace backups. Use it for availability, but maintain independent point-in-time backups for data recovery. Define RPO/RTO per layer. Practice DR drills regularly. The cost of not testing is measured in lost data and extended downtime.
Replication Topologies: Beyond the Binary Choice
Most guides stop at master-slave vs multi-master. That's the difference between owning a hammer and knowing there are screwdrivers. In production, you will build hybrids because read patterns are never uniform. A star topology works when your primary lives in us-east-1 and replicas fan out to other regions. Chain replication (replica receiving from another replica) reduces load on the primary but introduces cascading lag. Tree topologies solve geographic distribution without overwhelming a single node. Pick topology based on latency budgets, not dogma. If your replica in Singapore lags by 200ms because it syncs directly from Virginia, you should have used a regional intermediate. Measure round-trip times between nodes before wiring them together. Grafana dashboards that show per-replica lag tiers are not optional—they are your early warning system.
Chain replication with more than three hops creates a serialization bottleneck. If your lag graph shows spikes every time a mid-tier replica restarts, you have a recovery storm waiting. Limit chain depth to 2. Use intermediate reconciliation jobs for deeper trees.
Key Takeaway
Match replication topology to your geographic latency profile, not the vendor default. Three hops of chain replication is three points of failure.
Read-Your-Writes Consistency: The Bear Trap
Async replication is fantastic for throughput. Then you write a record, redirect the user to a screen that reads from a replica, and the data is gone. The user reloads. Still gone. They file a ticket and your on-call finds zero rows because replication lag swallowed the write. This is the read-your-writes consistency gap and it burns every junior who thinks "eventually consistent" means "fast enough." The fix is session stickiness. Route reads for a session back to the replica that confirmed the write if and only if that replica has caught up past the write's binlog position. Use an in-memory cache keyed on session_id that stores the last write LSN. Every read from a replica must check that LSN against the replica's current executed_gtid_set. If the replica is behind, stall the read or route to the primary. This adds 1ms per request—cheap insurance against a midnight pager.
read_your_writes_middleware.pyPYTHON
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# io.thecodeforge.replicationimport time
from threading importLockclassReadYourWritesGuard:
def__init__(self, db_cluster):
self._write_lsn = {}
self._lock = Lock()
self._cluster = db_cluster
defrecord_write(self, session_id: str, lsn: int):
withself._lock:
self._write_lsn[session_id] = (lsn, time.monotonic())
defget_safe_replica(self, session_id: str, timeout_ms=100):
write_lsn, timestamp = self._write_lsn.get(session_id, (0, 0))
if time.monotonic() - timestamp > 0.5:
return self._cluster.any_replica() # expired
start = time.monotonic_ns()
while (time.monotonic_ns() - start) < timeout_ms * 1_000_000:
for replica inself._cluster.replicas_by_lag():
if replica.executed_gtid >= write_lsn:
return replica
time.sleep(0.001)
# timeout: fall back to primaryreturnself._cluster.primary
Output
Session 'order_8821' attempted read on us-east-replica-3
-> replica lagging by 120 transactions (GTID: 1-200-450)
-> waited 42ms until replica caught up (GTID: 1-200-570)
-> read succeeded, latency acceptable
Production Trap:
Never route a write confirmation API call back to a replica without checking the GTID set. We saw a payment service confirm charges on a replica still applying the INSERT, leading to duplicate charges and angry customers.
Key Takeaway
Stale reads from replicas after a write are a consistency bug, not a configuration issue. Always guard reads with write position tracking.
● Production incidentPOST-MORTEMseverity: high
Replication slot bloat fills primary disk — full outage during peak traffic
Symptom
Primary database disk usage climbed steadily from 40% to 100% over 72 hours. Write queries started failing with 'could not extend file: No space left on device'. The application returned 500 errors on every write path. Read traffic on the healthy replicas continued working — which actually made the initial triage harder, because everything looked fine from the read side.
Assumption
The team assumed a runaway log file or table bloat on the primary. They ran VACUUM FULL across several large tables and purged several gigabytes of application logs. Disk usage didn't budge. At that point someone opened pg_replication_slots for the first time.
Root cause
A PostgreSQL replication slot had been configured for a replica taken offline for OS patching. The slot pinned WAL segments so the replica could catch up cleanly on return. Over 3 days, 180 GB of WAL accumulated in pg_wal. No alerting existed for replication slot lag, slot retention size, or inactive slots. The slot just sat there, invisible to the team, holding the WAL hostage.
Fix
Dropped the stale slot with pg_drop_replication_slot(). PostgreSQL reclaimed 180 GB immediately — disk dropped from 100% to 22% within a minute. Writes resumed without a restart. The team then added monitoring alerts on pg_replication_slots where active = false and pg_wal_lsn_diff() exceeds 1 GB, added a runbook step to explicitly drop or disable replication slots before taking replicas offline for any maintenance longer than a few hours, and set max_slot_wal_keep_size = '10GB' as a hard safety valve going forward.
Key lesson
Replication slots pin WAL segments — a dead replica's slot will fill the primary's disk silently and completely
Monitor pg_replication_slots.active and pg_wal_lsn_diff() — alert well before the disk threshold, not at 95%
Before taking a replica offline for extended maintenance, drop its slot or set max_slot_wal_keep_size — do not assume it's safe to leave it
Disk-full on a primary is a total write outage, not a degraded state — treat slot monitoring with the same urgency as CPU or memory
Document slot ownership. Tag each slot with the hostname or purpose so you know which replica it belongs to. An orphan slot with no known owner is a ticking bomb.
Production debug guideCommon replication failures and how to diagnose them without guessing7 entries
Symptom · 01
Replica shows stale data after writes on primary
→
Fix
Check replication lag with SELECT now() - pg_last_xact_replay_timestamp() on the replica. If lag is growing rather than stable, check network throughput between primary and replica and whether the replica's disk I/O is saturated applying WAL. A replica serving heavy read traffic can starve the WAL apply process — they compete for the same I/O.
Symptom · 02
Replication stopped — replica is not receiving WAL
→
Fix
Check pg_stat_replication on the primary. If the replica's row is missing entirely, the streaming connection dropped. Check pg_hba.conf allows the replication user from the replica's address, and verify basic network connectivity between the two hosts. If the row is present but state is not 'streaming', the replica may be replaying a backlog — watch whether it progresses.
Symptom · 03
Primary disk filling up with no obvious cause
→
Fix
Run SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes FROM pg_replication_slots. Look for inactive slots retaining large amounts of WAL. Drop dead slots immediately with pg_drop_replication_slot(). This is the single most common surprise I see on Postgres clusters that weren't set up with slot monitoring.
Symptom · 04
Replica crashes with 'requested WAL segment has already been removed'
→
Fix
The replica fell so far behind that the primary recycled the WAL segments it still needs. The only recovery path is a full resync: pg_basebackup from the primary, then start fresh. After resyncing, create a replication slot for this replica so the primary stops recycling WAL it hasn't consumed. Set max_slot_wal_keep_size to prevent the slot from growing unbounded if this happens again.
Symptom · 05
Write latency spiked after enabling synchronous replication
→
Fix
Expected behavior — the primary now blocks on every write until at least one replica acknowledges the WAL write. Measure the network RTT between primary and replica with ping or traceroute. If the replica is in a different AZ, you're paying 2–10ms per write. Consider synchronous_standby_names = 'FIRST 1 (replica1)' to ensure you block on only one fast replica, or use synchronous_commit = 'remote_apply' selectively per transaction rather than cluster-wide.
Symptom · 06
Split brain after failover — two nodes both accepting writes
→
Fix
Stop write traffic to the old primary immediately — this is not a situation where you investigate first. Use your fencing mechanism (STONITH, VIP revocation, removing from pg_hba.conf) to ensure the old primary cannot accept new writes. Then compare pg_current_wal_lsn() on both nodes to determine which has more data. Rebuild the lagging node as a replica of the correct primary. Do not try to merge diverged write streams — it ends badly.
Symptom · 07
Replica unexpectedly becomes primary without failover invoked
→
Fix
Check if any other system or person manually promoted the replica. If not, look at your HA tool logs (Patroni, repmgr). Misconfiguration or a bug in the HA tool can cause false promotion. Verify that your DCS (etcd, consul) is healthy and that only one node holds the leader lock. Audit the sequence of events and tighten your failover thresholds.
★ Replication Emergency Cheat SheetWhen replication breaks in production, run these commands in order. Resist the urge to jump to fixes — diagnose first.
Replica lag is growing and not recovering−
Immediate action
Confirm the replica is still connected and actively applying WAL — don't assume
Commands
SELECT client_addr, state, sent_lsn, replay_lsn, (sent_lsn - replay_lsn) AS lag_bytes FROM pg_stat_replication;
SELECT now() - pg_last_xact_replay_timestamp() AS replica_lag;
Fix now
If state is not 'streaming', check network and pg_hba.conf. If lag_bytes is climbing, check replica disk I/O with iostat -x 1 — if %util is near 100%, the replica disk cannot keep up with WAL apply and serving reads simultaneously.
Primary disk is full and writes are failing+
Immediate action
Find inactive replication slots pinning WAL — this is the cause 80% of the time
Commands
SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
Fix now
SELECT pg_drop_replication_slot('dead_slot_name'); — verify disk recovers with df -h within 60 seconds. If no inactive slots exist, look for a single massive table with runaway autovacuum or an unfinished large import.
Replica won't start — 'requested WAL segment not found'+
Immediate action
The replica needs WAL the primary already recycled — there is no partial fix here
Full resync via pg_basebackup is the only path. Once resynced, create a replication slot for this replica immediately and set max_slot_wal_keep_size as a safety valve — otherwise this will happen again the next time the replica falls behind.
Suspected split brain — two nodes both accepting writes+
Immediate action
Identify which node the application is currently writing to — do not touch anything until you know
SELECT * FROM pg_stat_activity WHERE state = 'active' AND query NOT LIKE '%pg_stat%';
Fix now
Fence the old primary first — revoke its VIP, stop its PostgreSQL process, or remove it from pg_hba.conf. Then verify the correct primary with SELECT pg_current_wal_lsn() on both nodes — the higher LSN is the node with more committed data. Rebuild the other as a replica using pg_rewind or pg_basebackup.
Replica lag spikes after a batch job completes+
Immediate action
Check if the batch job generated a large WAL burst. The spike is expected, but confirm recovery time is acceptable.
Commands
SELECT now() - pg_last_xact_replay_timestamp() AS replica_lag;
SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn) AS byte_lag FROM pg_stat_replication;
Fix now
If lag exceeds your SLA for more than 5 minutes after the batch finishes, the replica I/O is insufficient for your write burst rate. Consider throttling the batch job or upgrading replica storage IOPS.
Asynchronous vs. Synchronous Replication
Dimension
Asynchronous Replication
Synchronous Replication
Write Latency
Low — bounded by local disk speed only
Higher — bounded by local disk plus network RTT to replica
Data Loss Risk on Primary Crash
Small but nonzero — WAL in transit is lost
Zero — replica must durably acknowledge before client gets confirmation
Read Consistency
Eventual — replica lag introduces stale reads
Strong — if reading from the synchronous replica
Write Availability on Replica Failure
Unaffected — primary continues accepting writes
Blocked if no synchronous replica is available (configurable with synchronous_mode_strict)
Operational Complexity
Standard — straightforward to monitor and operate
Higher — requires quorum configuration, failover handling, and latency benchmarking
Typical Use Case
Web applications, analytics replicas, read scaling
A developer runs DROP TABLE on the primary — accidentally or during a bad migration. The change propagates to all replicas within milliseconds. Every copy of the data is gone simultaneously. The team discovers that 'we had three replicas' does not help at all.
Fix
Replication is for availability and read scaling, not for data protection against destructive operations. Maintain separate point-in-time backups using pg_dump, WAL archiving with pgBackRest or Barman, or a managed backup service. Test restores on a schedule — a backup you haven't tested is not a backup, it's a hope.
×
Ignoring replication lag monitoring until users complain
Symptom
Users report seeing stale balances, missing profile updates, or phantom orders. Application logs show no errors. The replicas are 30–90 seconds behind the primary but no alerting exists. The team finds out from a support ticket, not a monitor.
Fix
Monitor replication lag continuously with alerts set before users hit the threshold. Use pg_stat_replication for byte-based lag and pg_last_xact_replay_timestamp() for time-based lag. Define thresholds based on your actual consistency SLA — alert at 1s, page at 5s is a reasonable starting point for most applications. Adjust from there based on measured user impact.
×
Under-provisioning replica hardware relative to the primary
Symptom
Replication lag grows steadily during peak write hours. The replica's disk I/O is saturated — it's simultaneously applying WAL and serving read queries, and the disk cannot do both at the capacity the primary generates. Lag climbs through the day and never recovers overnight.
Fix
Replicas do work proportional to the primary's write rate. WAL apply is I/O-bound. Provision replicas with equal or better disk IOPS and throughput than the primary. If read traffic is heavy enough to compete meaningfully with WAL apply, consider dedicated replica pools: one set of replicas for failover (WAL apply priority), a separate set for read traffic.
×
Hard-coding database IP addresses in application configuration
Symptom
The primary fails. Patroni promotes a replica within 20 seconds. The application still connects to the old primary's IP. Writes fail. Engineers spend 30 minutes updating config files, redeploying the application, and restarting connection pools while users see errors. The automated failover was only half-automated.
Fix
Use a connection pooler (PgBouncer, ProxySQL) or service discovery (Consul, Route53 with low TTL) as an abstraction layer. The application connects to a stable address that never changes. The failover process updates the pooler's or DNS record's backend target. No application restart required.
×
Running VACUUM FULL or large DDL on the primary without accounting for the WAL impact
Symptom
VACUUM FULL rewrites an entire table, generating a large WAL burst in a short window. Replicas fall 10–60 minutes behind. If synchronous replication is enabled, the primary blocks on every subsequent write until the replica catches up — application latency spikes to seconds per query during the maintenance window. Engineers didn't plan for this because 'it's just maintenance.'
Fix
Schedule heavyweight maintenance (VACUUM FULL, CREATE INDEX CONCURRENTLY, ALTER TABLE with rewrites) during documented low-traffic windows. If synchronous replication is enabled cluster-wide, consider temporarily setting synchronous_commit = 'local' for the maintenance session to prevent primary write blocking, then re-enable after the replica catches up. Monitor lag during and after any large maintenance operation.
×
Assuming all replicas are identical and interchangeable
Symptom
During failover, the promoted replica has a different hardware profile or different configuration (e.g., shared_buffers, max_connections) than the primary. After promotion, the new primary underperforms or fails under the same write load, causing a cascading outage.
Fix
Ensure all replicas that are eligible for promotion have the same or better hardware and configuration as the primary. Use infrastructure-as-code to maintain identical postgresql.conf files across all failover candidates. Include configuration consistency in your failover readiness checks.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
What is the difference between synchronous and asynchronous replication,...
Q02SENIOR
Explain how replication slots work in PostgreSQL and why they can cause ...
Q03SENIOR
What is replication lag, and how do you measure and mitigate it?
Q04SENIOR
Describe a common failure scenario during automated failover and how to ...
Q05SENIOR
How would you design a multi-region replication topology for low read la...
Q01 of 05SENIOR
What is the difference between synchronous and asynchronous replication, and when would you choose one over the other?
ANSWER
Asynchronous replication commits writes on the primary immediately and ships WAL to replicas in the background. This minimizes write latency but creates a small window of potential data loss if the primary crashes before the WAL is shipped. Synchronous replication waits for at least one replica to acknowledge the WAL write before confirming to the client. This guarantees zero data loss but adds network round-trip latency to every write. Choose async for most web applications where occasional data loss on catastrophic failure is acceptable. Choose sync for financial transactions or compliance workloads where you cannot lose committed data. You can also mix modes per transaction using synchronous_commit settings.
Q02 of 05SENIOR
Explain how replication slots work in PostgreSQL and why they can cause disk-full outages.
ANSWER
A replication slot tells the primary to retain WAL segments until the consuming replica has confirmed receipt. This ensures a replica can catch up after a brief disconnection. However, if a replica goes offline for an extended period, its slot pins those WAL segments indefinitely. Without a cap like max_slot_wal_keep_size, the primary's pg_wal directory grows unbounded until the disk fills up. This caused the exact outage described in the article: a replica down for 3 days caused 180 GB of WAL accumulation, filling the primary disk and stopping all writes. Monitoring pg_replication_slots.active and pg_wal_lsn_diff() prevents this.
Q03 of 05SENIOR
What is replication lag, and how do you measure and mitigate it?
ANSWER
Replication lag is the delay between a transaction committing on the primary and becoming visible on a replica. It's measured in two ways: time-based (using pg_last_xact_replay_timestamp() on the replica) and byte-based (using pg_wal_lsn_diff() on the primary). Time lag tells you user-visible staleness; byte lag tells you the backlog depth. Common causes include under-provisioned replica disks (WAL apply competes with read I/O), network congestion, and large transactions generating a WAL burst. Mitigation involves provisioning replicas with adequate I/O, using dedicated WAL-apply replicas separate from read replicas, and implementing read-after-write consistency in the application if needed.
Q04 of 05SENIOR
Describe a common failure scenario during automated failover and how to prevent it.
ANSWER
A common failure is split-brain: when the old primary is unreachable due to a network partition but not actually dead, and a new primary is promoted. Both nodes may accept writes, leading to divergent data. Prevention requires fencing the old primary before promotion — mechanisms include STONITH (shoot the other node in the head), revoking its VIP, or removing it from the connection pooler. Tools like Patroni integrate with fencing via hooks. Also ensure application connection strings point to a proxy (PgBouncer, HAProxy) that updates its backend during failover, not to a hard-coded IP.
Q05 of 05SENIOR
How would you design a multi-region replication topology for low read latency while maintaining data durability?
ANSWER
Place a primary in one region and at least one replica in each other region. Route read traffic to the nearest replica for low latency. For durability, configure synchronous replication to at least one replica in a different availability zone within the primary's region. Cross-region replication should be asynchronous to avoid high write latency — document the expected lag (typically 80–150ms) in your SLA. For zero data loss across regions, consider using sync replication to a cross-region replica, but benchmark the write latency impact first. Always have independent point-in-time backups in each region.
01
What is the difference between synchronous and asynchronous replication, and when would you choose one over the other?
SENIOR
02
Explain how replication slots work in PostgreSQL and why they can cause disk-full outages.
SENIOR
03
What is replication lag, and how do you measure and mitigate it?
SENIOR
04
Describe a common failure scenario during automated failover and how to prevent it.
SENIOR
05
How would you design a multi-region replication topology for low read latency while maintaining data durability?