Database Replication Explained: Architecture, Lag, and Production Pitfalls
- Database replication copies data from a primary node to one or more replicas for read scalability and high availability
- Primary-Replica is the standard: all writes go to one primary, replicas serve reads and act as hot standbys
- Asynchronous replication is fast but risks data loss if the primary crashes before shipping WAL; synchronous blocks until a replica acknowledges
- Replication lag is the delay between a commit on the primary and its appearance on the replica — measure it in seconds or LSN distance
- Replication is NOT a backup — DROP TABLE propagates to replicas in milliseconds; always maintain separate point-in-time snapshots
- Automate failover with Patroni or Orchestrator — manual promotion during a 3 AM outage is how data gets lost
Production Debug Guide

Common replication failures and how to diagnose them without guessing.

Replica lag is growing and not recovering

```sql
-- On the primary: per-replica lag in bytes
SELECT client_addr, state, sent_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- On the replica: time-based lag
SELECT now() - pg_last_xact_replay_timestamp() AS replica_lag;
```

Measure time lag with now() - pg_last_xact_replay_timestamp() on the replica. If lag is growing rather than stable, check network throughput between primary and replica and whether the replica's disk I/O is saturated applying WAL. A replica serving heavy read traffic can starve the WAL apply process — they compete for the same I/O.

Primary disk is full and writes are failing

```sql
-- On the primary: WAL retained by each replication slot
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Total size of the WAL directory
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```

Check pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) for each slot and look for inactive slots retaining large amounts of WAL. Drop dead slots immediately with pg_drop_replication_slot(). This is the single most common surprise I see on Postgres clusters that weren't set up with slot monitoring.

Replica won't start — 'requested WAL segment not found'

```shell
# Inspect the replica's last known checkpoint
pg_controldata /var/lib/postgresql/data | grep 'Latest checkpoint'

# Re-seed the replica with a fresh base backup from the primary
pg_basebackup -h primary-host -D /var/lib/postgresql/data -U replicator -vP -R
```

The primary has already recycled WAL segments the replica still needs, so the replica cannot catch up by streaming. Rebuild it from a fresh base backup, and prevent a recurrence with a replication slot or a larger wal_keep_size.

Suspected split brain — two nodes both accepting writes

```sql
-- Run on each node: false = primary, true = replica/standby
SELECT pg_is_in_recovery();

-- Check what's actively writing on each suspected primary
SELECT * FROM pg_stat_activity
WHERE state = 'active' AND query NOT LIKE '%pg_stat%';
```

Compare pg_current_wal_lsn() on both nodes to determine which has more data. Rebuild the lagging node as a replica of the correct primary. Do not try to merge diverged write streams — it ends badly.

Production Incident

An inactive replication slot, left behind when a replica was taken offline, had pinned WAL until the primary's disk hit 100% and writes failed. The fix was a single call to pg_drop_replication_slot(). PostgreSQL reclaimed 180 GB immediately — disk dropped from 100% to 22% within a minute. Writes resumed without a restart. The team then added monitoring alerts on pg_replication_slots where active = false and pg_wal_lsn_diff() exceeds 1 GB, added a runbook step to explicitly drop or disable replication slots before taking replicas offline for any maintenance longer than a few hours, and set max_slot_wal_keep_size = '10GB' as a hard safety valve going forward.

Lessons:
- Alert on pg_wal_lsn_diff() per slot — alert well before the disk threshold, not at 95%
- Before taking a replica offline for extended maintenance, drop its slot or set max_slot_wal_keep_size — do not assume it's safe to leave it
- Disk-full on a primary is a total write outage, not a degraded state — treat slot monitoring with the same urgency as CPU or memory

At scale, a single database server cannot absorb millions of concurrent reads without buckling — and if it goes down, your entire product goes with it. Database replication is the engineering answer to both problems simultaneously: it spreads read load across multiple servers and keeps a warm standby ready the moment your primary fails.
But replication is deceptively complex under the hood. The problems it solves — availability, durability, and read scalability — arrive bundled with a new class of trade-offs rooted in the CAP theorem and the physical reality of networks. A user sees a stale balance 800 milliseconds after a deposit. Two nodes in a multi-master cluster silently accept conflicting writes and produce corrupt state. A replica goes offline for a weekend and quietly fills the primary's disk to 100%.
I've dealt with all three of those in production, and the pattern is always the same: the failure mode was known, the monitoring wasn't in place, and the runbook didn't exist. This article is the thing I wish existed when I was setting up my first streaming replication cluster.
By the end you'll understand how replication works at the WAL and binary log level, how to reason about lag and its real-world consequences for your users, and how to design a topology that actually survives the failure modes that catch production systems off guard — not just the happy path.
Core Architectures: Primary-Replica vs. Multi-Primary
In a Primary-Replica setup — the architecture that powers the vast majority of production databases you'll encounter — all writes go to a single Primary node. That node records every change in a Write-Ahead Log (WAL) or Binary Log and ships those events to one or more Replica nodes. Replicas are read-only; they apply the log entries to their own copy of the data to stay in sync. This is the industry default for a reason: there is exactly one source of truth, conflict resolution is trivially simple (there are no conflicts), and the operational model is easy to reason about.
Multi-Primary replication allows writes on any node, which sounds like the availability holy grail until you actually implement it. If two clients update the same row on different primaries within the same time window, the system must resolve the conflict using strategies like Last Write Wins (LWW), Vector Clocks, or custom application logic. Last Write Wins sounds simple but silently discards legitimate writes — whichever timestamp wins, the other write vanishes with no error and no log entry. Vector Clocks are more correct but require your application layer to understand and handle conflict signals. Neither is free.
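To make the silent-discard failure concrete, here is a minimal, hypothetical sketch of Last-Write-Wins merging in Python. The `Write` record and `lww_merge` helper are illustrative, not any database's actual API:

```python
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    timestamp: float  # wall-clock commit time reported by each node

def lww_merge(a: Write, b: Write) -> Write:
    """Last-Write-Wins: keep whichever write carries the later timestamp.
    The losing write is discarded silently — no error, no log entry."""
    return a if a.timestamp >= b.timestamp else b

# Two primaries accept conflicting updates to the same row near-concurrently.
node_a = Write(value="email=alice@old.com", timestamp=1700000000.120)
node_b = Write(value="email=alice@new.com", timestamp=1700000000.118)

winner = lww_merge(node_a, node_b)
# node_b's update vanishes, even though the client that issued it saw a success.
```

Note that the "loser" here is the semantically newer value: clock skew of a few milliseconds between nodes is enough to make LWW keep the wrong write, which is exactly why it needs careful scrutiny before adoption.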
I've seen teams reach for Multi-Primary when their real problem was read scaling, which Primary-Replica solves without any of that complexity. The rule is straightforward: don't adopt multi-primary until you've genuinely exhausted vertical scaling and read-replica offloading, and your team understands distributed consensus deeply enough to operate it at 3 AM.
```sql
-- Run this on the Primary to see connected standbys and their lag
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;

-- Run this on a Replica to confirm it is in recovery and check replay timestamp
SELECT pg_is_in_recovery() AS is_replica,
       pg_last_xact_replay_timestamp() AS last_replayed_at,
       now() - pg_last_xact_replay_timestamp() AS current_lag;
```
```
 client_addr |   state   | replication_lag_bytes
-------------+-----------+-----------------------
 10.0.1.5    | streaming |                  4096
 10.0.1.6    | streaming |                 98304
```
- Primary-Replica: one source of truth, replicas are read-only copies — zero conflict resolution needed
- Multi-Primary: any node accepts writes — requires conflict resolution (LWW, Vector Clocks, or application logic)
- The overwhelming majority of production systems run Primary-Replica — it is simpler, safer, and sufficient until write throughput genuinely demands multi-primary
- Split brain is the #1 operational risk of multi-primary: two nodes accept conflicting writes, data diverges, and the divergence is usually silent
Synchronous vs. Asynchronous Replication
This is the most consequential trade-off in replication design, and it's worth spending time with rather than defaulting to whatever the tutorial used.
In asynchronous replication, the primary confirms the write to the client as soon as it's committed locally. The WAL is shipped to replicas in the background, after the acknowledgment. This is fast — write latency is bounded only by local disk speed — but it creates a window where committed transactions exist only on the primary. If the primary crashes before shipping that WAL, those transactions are gone. The replica becomes the new primary with a gap in its history that nobody can fill.
In synchronous replication, the primary waits until at least one replica confirms it has received and durably written the WAL before acknowledging the commit to the client. This closes the data loss window entirely — at least one copy of every committed transaction exists before the client gets a response. The cost is that write latency now includes the network round-trip to the replica. If your replica is in a different availability zone, you're paying 2–10ms per write, every write, forever.
Neither is obviously correct. The right choice depends on what your data is worth and what your users will tolerate. A social media feed can absorb async lag without most users noticing. A financial transaction system where a committed payment might vanish on primary failure is a different conversation entirely.
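The timing difference can be sketched as a toy in-memory simulation. These are hypothetical classes, not real Postgres internals; the only difference between the two modes is when the client's acknowledgment is sent relative to WAL shipping:

```python
class Replica:
    def __init__(self):
        self.wal = []

    def apply(self, record):
        self.wal.append(record)

class Primary:
    def __init__(self, replica, synchronous: bool):
        self.wal = []
        self.unshipped = []  # committed locally but not yet on the replica
        self.replica = replica
        self.synchronous = synchronous

    def commit(self, record) -> str:
        self.wal.append(record)
        if self.synchronous:
            # Ship and wait for the replica's ack before answering the client.
            self.replica.apply(record)
            return "ack"
        # Async: answer immediately; shipping happens in the background.
        self.unshipped.append(record)
        return "ack"

    def background_ship(self):
        while self.unshipped:
            self.replica.apply(self.unshipped.pop(0))

# Async primary crashes before the background shipper ever runs:
replica = Replica()
p = Primary(replica, synchronous=False)
p.commit("INSERT payment 42")                        # client sees "ack"
lost = [r for r in p.wal if r not in replica.wal]    # committed, but gone after failover
```

In the async case `lost` contains the acknowledged payment; with `synchronous=True` the same sequence loses nothing, at the price of every commit waiting on the replica.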
```yaml
# io.thecodeforge: Quick-start Postgres Primary-Replica cluster for local development
# Do not use this config verbatim in production — add TLS, auth, and resource limits
services:
  db-primary:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: forge_secret
      POSTGRES_USER: forge_admin
    command: |
      postgres
        -c wal_level=replica
        -c max_wal_senders=10
        -c max_replication_slots=10
        -c wal_keep_size=256MB

  db-replica:
    image: postgres:16-alpine
    depends_on:
      - db-primary
    environment:
      PGPASSWORD: forge_secret
    command: |
      /bin/sh -c "
      until pg_basebackup -h db-primary -D /var/lib/postgresql/data \
            -U forge_admin -vP --wal-method=stream -R; do
        echo 'Primary not ready, retrying in 2s...'; sleep 2
      done
      echo 'Basebackup complete. Starting replica...'
      postgres"
```

(Note: pg_basebackup reads the password from the PGPASSWORD environment variable here; a -W flag would force an interactive prompt and hang the container.)
```
db-replica | NOTICE: pg_basebackup: initiating base backup
db-replica | NOTICE: pg_basebackup: base backup completed
db-replica | LOG: started streaming WAL from primary at 0/1000000
```
Replication Lag: Causes, Measurement, and Consequences
Replication lag is the time gap between a transaction committing on the primary and that same transaction becoming visible on a replica. It's measured two ways: time-based (how many seconds is the replica behind?) and log-based (how many bytes of WAL has the replica not yet applied?). Both metrics matter and they tell you different things. Time tells you the user-visible impact. Bytes tell you how much backlog the replica is carrying and whether it's catching up or falling further behind.
Lag is not a bug in your replication setup. It is a fundamental, expected property of asynchronous replication, and the engineering question is never 'how do I eliminate lag?' but 'is this lag within acceptable bounds for my use case?' A 200ms lag on a social media activity feed is invisible to users. A 200ms lag on a bank account balance immediately after a wire transfer is a customer complaint and potentially a compliance issue. The same 200ms, completely different problem severity.
The causes of lag that I see most often in production: long-running transactions on the primary that generate a large WAL burst when they commit, replica hardware that is weaker than the primary (slower IOPS means slower WAL apply), network congestion between primary and replica — especially on shared cloud network links during peak hours, and replicas running heavy read queries that compete with the WAL apply process for disk I/O. That last one is subtle and frequently overlooked: your read traffic is physically competing with replication for the same disk, and replication will lose if the disk is saturated.
```sql
-- io.thecodeforge: Replication lag monitoring — run both methods, not just one

-- Method 1: Time-based lag (run on the REPLICA)
-- Tells you user-visible staleness in wall-clock time
SELECT now() - pg_last_xact_replay_timestamp() AS time_lag,
       CASE
         WHEN now() - pg_last_xact_replay_timestamp() > INTERVAL '5 seconds'
           THEN 'PAGE: lag exceeds 5s'
         WHEN now() - pg_last_xact_replay_timestamp() > INTERVAL '1 second'
           THEN 'WARN: lag exceeds 1s'
         ELSE 'OK'
       END AS lag_status;

-- Method 2: Byte-based lag (run on the PRIMARY)
-- Tells you WAL backlog depth — a replica at 0s time lag can still carry 500MB of queued WAL
SELECT client_addr, state, sent_lsn, replay_lsn,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS lag_pretty,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
       CASE
         WHEN pg_wal_lsn_diff(sent_lsn, replay_lsn) > 104857600 THEN 'PAGE: >100MB lag'
         WHEN pg_wal_lsn_diff(sent_lsn, replay_lsn) > 10485760  THEN 'WARN: >10MB lag'
         ELSE 'OK'
       END AS lag_status
FROM pg_stat_replication;
```
```
-- Replica output
 time_lag | lag_status
----------+------------
 00:00:00 | OK

-- Primary output
 client_addr |   state   | lag_pretty | lag_bytes | lag_status
-------------+-----------+------------+-----------+-----------------
 10.0.1.5    | streaming | 512 kB     |    524288 | OK
 10.0.1.6    | streaming | 48 MB      |  50331648 | WARN: >10MB lag
```
Write-Ahead Logs and Binary Logs: The Replication Plumbing
Every mainstream relational database uses a write-ahead log as the physical foundation of replication. In PostgreSQL it's the WAL. In MySQL it's the Binary Log (binlog). The concept is the same in both: before any change is applied to the actual data files, it's written sequentially to the log first. The log is the authoritative record of every change the primary made, in the exact order it was made. Replicas consume this log to reconstruct the primary's state on their own storage.
WAL serves two distinct purposes that are easy to conflate. First, crash recovery: if the primary dies mid-write, the WAL records enough information to replay or roll back the incomplete transaction on restart. The data files are always recoverable from a consistent checkpoint plus the WAL that follows it. Second, replication: the primary streams WAL segments to connected replicas over a network socket, and replicas apply them in sequence. These two purposes share the same physical log, which is why WAL configuration affects both durability and replication behavior.
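The log-before-data ordering is the whole trick, and it can be sketched in a few lines of toy Python. This `MiniWAL` is a hypothetical illustration of the principle, not Postgres's actual on-disk format:

```python
class MiniWAL:
    """Toy write-ahead log: every change hits the log before the 'data files',
    so a crash mid-apply is recoverable by replaying from the last checkpoint."""

    def __init__(self):
        self.log: list[tuple[int, str, int]] = []  # (lsn, key, value)
        self.data: dict[str, int] = {}             # stands in for the data files
        self.next_lsn = 1
        self.checkpoint_lsn = 0

    def write(self, key: str, value: int, crash_before_apply: bool = False):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.log.append((lsn, key, value))  # 1. log first, always
        if crash_before_apply:
            return                          # simulated crash: data files are now stale
        self.data[key] = value              # 2. then apply to the data files

    def recover(self):
        # Replay everything after the last checkpoint — the same ordered
        # stream of records a replica consumes to stay in sync.
        for lsn, key, value in self.log:
            if lsn > self.checkpoint_lsn:
                self.data[key] = value

db = MiniWAL()
db.write("balance", 100)
db.write("balance", 250, crash_before_apply=True)  # crash between log and data
db.recover()                                        # replay restores consistency
# db.data["balance"] is now 250
```

A replica is, in effect, a node that runs `recover()` continuously against a log it receives over the network, which is why the same stream serves both purposes.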
The wal_level setting in PostgreSQL determines how much detail the WAL contains. The replica level is the minimum for physical streaming replication — it records enough to reconstruct data changes. The logical level adds row-level detail needed for logical replication and change data capture pipelines like Debezium. Logical WAL generates more volume per write — roughly 2–4x in high-UPDATE workloads — so don't set it unless you actually need it. And changing wal_level requires a server restart, so plan this before you have replicas depending on it.
```sql
-- io.thecodeforge: PostgreSQL WAL configuration audit and replication slot monitoring

-- Check current WAL level — requires server restart to change
SHOW wal_level;
SHOW max_wal_senders;
SHOW max_replication_slots;

-- Recommended postgresql.conf settings for a replication primary:
-- wal_level = replica              -- minimum for physical streaming replication
-- max_wal_senders = 10             -- max concurrent replication connections (replicas + tools)
-- max_replication_slots = 10       -- max slots; each slot pins WAL until the replica consumes it
-- wal_keep_size = '1GB'            -- WAL to retain without slots (safety net, not a substitute for slots)
-- max_slot_wal_keep_size = '10GB'  -- hard cap on WAL any single slot can pin (CRITICAL — set this)

-- Monitor active WAL position
SELECT pg_current_wal_lsn() AS current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file;

-- Audit replication slots — pay attention to inactive slots retaining large WAL
SELECT slot_name, slot_type, active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal,
       CASE
         WHEN active = false
              AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824
           THEN 'ALERT: inactive slot retaining >1GB — check immediately'
         ELSE 'OK'
       END AS slot_status
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```
```
 slot_name | slot_type | active | retained_wal | slot_status
-----------+-----------+--------+--------------+---------------------------------------------------------
 replica_1 | physical  | true   | 16 MB        | OK
 replica_2 | physical  | false  | 2 GB         | ALERT: inactive slot retaining >1GB — check immediately
```
Automated Failover: When the Primary Dies
Manual failover — SSH into a replica, run pg_ctl promote, update your connection strings, restart the application, redirect traffic — works in a controlled staging exercise during business hours. In production at 3 AM with pager alerts firing, adrenaline running, and half the team half-asleep, manual failover is how data gets lost and how the wrong replica gets promoted. I've seen both happen.
Automated failover tools handle the entire sequence: detect the primary failure, select the most up-to-date eligible replica, promote it, reconfigure remaining replicas to follow the new primary, and update the application's connection endpoint. All of that in under 30 seconds, without human judgment calls under pressure.
The two dominant tools for PostgreSQL are Patroni (built on etcd or Consul for distributed consensus and leader election) and Repmgr (simpler setup, less operational overhead for smaller clusters). For MySQL, Orchestrator is the standard. All three solve the same core problem: ensuring that exactly one node holds the write role at any moment. The 'exactly one' constraint is the hard part — distributed systems make this genuinely difficult because a network partition can make a live primary look dead to the failover system.
Failover tooling alone is not enough. The application must know where to send writes after failover completes. This requires either a connection pooler (PgBouncer, ProxySQL) or DNS-based service discovery (Route53, Consul) that can be updated programmatically as part of the failover sequence. Hard-coding the primary's IP address in your application configuration file is a guarantee that your automated failover will still require manual application intervention — which defeats most of the point.
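The "exactly one writer" constraint is also worth enforcing defensively at the routing layer. Here is a hedged sketch: `find_primary` is a hypothetical helper consuming the results of running `SELECT pg_is_in_recovery()` against each candidate node, refusing to route writes when the cluster's answer is ambiguous:

```python
def find_primary(recovery_status: dict[str, bool]) -> str:
    """Pick the write endpoint from per-node pg_is_in_recovery() results:
    False means primary, True means standby. Refuse to route if zero or
    multiple nodes claim the primary role (possible split brain)."""
    primaries = [host for host, in_recovery in recovery_status.items()
                 if not in_recovery]
    if len(primaries) != 1:
        raise RuntimeError(f"expected exactly one primary, found {primaries}")
    return primaries[0]

# Healthy cluster: one primary, two standbys.
find_primary({"node-1": False, "node-2": True, "node-3": True})  # -> "node-1"

# Split brain: two nodes report pg_is_in_recovery() = false.
# Failing closed here (no writes) beats writing to the wrong node.
try:
    find_primary({"node-1": False, "node-2": False, "node-3": True})
except RuntimeError:
    pass  # alert and stop routing writes until the cluster is fenced
```

This check is a belt-and-suspenders complement to Patroni's leader lock, not a replacement for proper fencing.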
```yaml
# io.thecodeforge: Patroni configuration for a 3-node PostgreSQL HA cluster
# Patroni uses a DCS (etcd/consul) for distributed leader election and lock management
# Run one instance of this per PostgreSQL node — change 'name' and 'connect_address' per node
scope: forge-cluster
name: node-1          # change to node-2, node-3 on other nodes
namespace: /db/

restapi:
  listen: 0.0.0.0:8008
  connect_address: node-1.internal:8008

etcd3:
  # etcd cluster must have quorum for Patroni to elect a leader
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30                           # leader lock TTL in seconds
    loop_wait: 10                     # how often Patroni checks its status
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB — replica must be within 1MB of primary to be promotion-eligible
    synchronous_mode: true            # require at least one synchronous replica before committing
    synchronous_mode_strict: false    # do not block writes if no sync replica is available (degrades to async)
  pg_hba:
    - host replication replicator 10.0.0.0/8 md5
    - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node-1.internal:5432
  data_dir: /var/lib/postgresql/16/main
  bin_dir: /usr/lib/postgresql/16/bin
  parameters:
    max_connections: 200
    shared_buffers: 2GB
    wal_level: replica
    max_wal_senders: 10
    max_replication_slots: 10
    max_slot_wal_keep_size: 10GB      # safety valve — prevents slot bloat from filling disk
  authentication:
    replication:
      username: replicator
      password: forge_repl_secret
    superuser:
      username: postgres
      password: forge_admin_secret
```
```
2026-03-05 14:22:02 INFO: no action. I am the leader with the cluster lock
2026-03-05 14:22:12 INFO: Lock owner: node-1; I am node-1
2026-03-05 14:22:22 INFO: node-2 is now a streaming standby
```
Designing Your Replication Topology for Production
A replication topology is not just 'one primary and a couple of replicas.' It's a deliberate design that accounts for how reads are routed, how WAL shipping load is distributed across the primary's network interface, what the failover blast radius looks like, and whether your replicas will actually help when the primary fails — or whether they'll fail alongside it because they share the same infrastructure failure domain.
The most common production topology is a star: one primary with N replicas, all directly connected to the primary via streaming replication. Read traffic is distributed across replicas via a load balancer or connection pooler. HAProxy and PgBouncer in pool mode are the standard choices here. This works well for single-region deployments up to roughly 10 replicas, at which point the primary's WAL shipping starts to become a bottleneck — it's shipping the same WAL stream N times over the same network interface.
Beyond 10 replicas, or whenever you can measure WAL shipping as a CPU or network ceiling on the primary, cascading (hierarchical) replication is worth evaluating. Tier-1 replicas connect directly to the primary. Tier-2 replicas connect to tier-1 replicas and inherit their lag in addition to their own. The primary's WAL shipping burden drops proportionally, but you add a lag tier at each level — tier-2 replicas are always at least as far behind as tier-1. Monitor each tier's lag separately.
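The additive-lag property is simple but worth making explicit when setting per-tier alert thresholds. A tiny illustrative helper (hypothetical, working in milliseconds):

```python
def effective_lag_ms(tier_lags_ms: list[int]) -> list[int]:
    """Cumulative lag per tier in a cascading topology: a tier-N replica
    trails the primary by at least the sum of every tier's apply lag
    along its replication path."""
    total, out = 0, []
    for lag in tier_lags_ms:
        total += lag
        out.append(total)
    return out

# tier-1 trails the primary by 200ms; tier-2 adds its own 300ms apply lag
effective_lag_ms([200, 300])  # -> [200, 500]
```

The practical consequence: a tier-2 alert threshold copied verbatim from tier-1 will either fire constantly or, if loosened blindly, mask a genuinely stuck tier-1.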
For multi-region deployments, place at least one replica in each region. Route reads to the nearest replica for latency. But be clear-eyed about what cross-region replication lag means: a primary in us-east-1 and a replica in eu-west-1 will carry 80–150ms of lag under normal conditions, and more during network events. Writes always go to the single primary — and if that primary is in a different region than your users, those writes pay the cross-region latency too. Geographic distribution of replicas improves read latency; it does not improve write latency unless you go multi-primary, with all the complexity that brings.
```
-- io.thecodeforge: Common Production Replication Topologies

-- STAR TOPOLOGY (recommended default, up to ~10 replicas)
-- All replicas connect directly to the primary
-- Primary ships WAL to every replica — network bandwidth scales linearly with replica count
--
--            [Primary]
--           /    |    \
--      [Rep1] [Rep2] [Rep3]
--     (reads) (reads) (hot standby)

-- CASCADING TOPOLOGY (for 10+ replicas, or when WAL shipping is a primary bottleneck)
-- Tier-1 replicas connect to primary; tier-2 replicas connect to tier-1
-- Reduces primary WAL shipping load — each tier-1 ships to its tier-2 children
-- Lag at tier-2 = tier-1 lag + tier-2 lag — monitor each tier separately
--
--              [Primary]
--              /       \
--        [Rep1-A]     [Rep1-B]
--         /    \       /    \
--   [Rep2-A] [Rep2-B] [Rep2-C] [Rep2-D]

-- MULTI-REGION TOPOLOGY (for geographic read distribution)
-- One or more replicas per region — route reads to nearest replica
-- Cross-region lag is 80-150ms under normal conditions — set SLAs accordingly
-- Writes still go to the single primary — cross-region writes pay the RTT
--
-- [Primary: us-east-1] ----WAL----> [Replica: eu-west-1]
--          |                                |
-- [Local Read Replicas]            [Local Read Replicas]
--   (us-east-1 reads)                (eu-west-1 reads)
```
| Dimension | Asynchronous Replication | Synchronous Replication |
|---|---|---|
| Write Latency | Low — bounded by local disk speed only | Higher — bounded by local disk plus network RTT to replica |
| Data Loss Risk on Primary Crash | Small but nonzero — WAL in transit is lost | Zero — replica must durably acknowledge before client gets confirmation |
| Read Consistency | Eventual — replica lag introduces stale reads | Stronger — but only if commits must be applied (not merely received) on the replica, e.g. synchronous_commit = remote_apply |
| Write Availability on Replica Failure | Unaffected — primary continues accepting writes | Blocked if no synchronous replica is available (configurable with synchronous_mode_strict) |
| Operational Complexity | Standard — straightforward to monitor and operate | Higher — requires quorum configuration, failover handling, and latency benchmarking |
| Typical Use Case | Web applications, analytics replicas, read scaling | Financial transactions, compliance workloads, zero-RPO requirements |
🎯 Key Takeaways
- Replication is for availability and read scalability, not for backups — a DROP TABLE on the primary propagates to every replica in milliseconds.
- Understand your consistency requirements concretely before choosing async or sync: can your application tolerate a 500ms window of stale data, or does every read need to reflect the most recent write?
- Write-Ahead Logs are both the crash recovery mechanism and the replication stream — they are the single source of truth for every change the primary has made.
- Monitor LSN distance (byte lag) alongside time-based lag — a replica at zero seconds of lag can be carrying hundreds of megabytes of queued WAL and about to fall behind hard.
- Automate failover with Patroni or Orchestrator, include fencing in the automation, and test the full sequence on a quarterly cadence — knowing it works in theory is not the same as having proven it works under load.
Interview Questions on This Topic
- (Senior) Explain the CAP theorem and how it applies to choosing between synchronous and asynchronous replication.
- (Senior) What is split brain in a database cluster, and how do quorum-based systems prevent it?
- (Senior) You're seeing a sudden spike in replication lag on a replica that was previously healthy. Walk me through your debugging process.
- (Mid-level) How does statement-based replication differ from row-based replication, and which is safer for non-deterministic functions?
- (Mid-level) What is a replication slot in PostgreSQL and why is it dangerous if a replica goes offline for an extended period?
- (Mid-level) What is the difference between physical replication and logical replication in PostgreSQL?
- (Junior) Explain read-after-write consistency and how you would implement it in a Primary-Replica architecture.
Frequently Asked Questions
Does replication improve write performance?
No — and synchronous replication actively decreases it. Replication adds overhead: WAL generation, log shipping, and in the synchronous case, blocking on replica acknowledgment before the primary can respond to the client. Write latency in synchronous mode is gated by the network round-trip to the replica. To scale write throughput beyond a single primary's capacity, you need either sharding (partitioning data across multiple primaries, each responsible for a subset) or a multi-primary architecture with conflict resolution. Neither is simple. Exhaust vertical scaling and read offloading to replicas before going there.
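To contrast with replication: sharding scales writes by making each primary own a disjoint subset of the keys. A deliberately simplified sketch of hash-modulo routing (the shard hostnames are made up; real systems usually prefer consistent hashing so that resharding doesn't remap every key):

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Deterministically map a key to one of N write primaries. Unlike
    replication, each shard owns a disjoint subset of rows, so total
    write throughput scales with the number of primaries."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

SHARDS = ["pg-shard-0.internal", "pg-shard-1.internal", "pg-shard-2.internal"]
primary_for_user = SHARDS[shard_for("user-8731", len(SHARDS))]
```

Note that each shard still needs its own replicas for availability, so sharding multiplies, rather than replaces, your replication topology.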
What is replication lag?
Replication lag is the delay between a transaction committing on the primary and that same transaction becoming visible on a replica. It's measured two ways: time-based (how many seconds behind the primary is the replica?) and byte-based (how many bytes of WAL has the replica not yet applied?). Lag is a fundamental property of asynchronous replication — it's not a bug or a misconfiguration. The engineering question is not how to eliminate it but whether it is within the bounds your application can tolerate, and whether you have monitoring to know when it isn't.
What happens to the replica if the primary fails?
In a properly configured high-availability setup, one replica is promoted to become the new primary. The promotion sequence involves: ensuring the replica has applied all available WAL, transitioning it from read-only standby mode to read-write primary mode, reconfiguring any remaining replicas to follow the new primary, and redirecting application write traffic to the new primary's endpoint. Automated tools like Patroni handle this sequence in under 30 seconds. Manual promotion requires running pg_ctl promote on the chosen replica and then updating application connection strings — a process that takes minutes at best and is prone to errors under pressure. Invest in automation before you need it.
What is the difference between a hot standby and a warm standby?
A hot standby continuously receives and applies WAL from the primary while simultaneously accepting read-only queries from application clients. This is the standard in modern PostgreSQL streaming replication — replicas offload read traffic from the primary while remaining ready to promote on a moment's notice. A warm standby also receives and applies WAL continuously, but it does not accept query connections until a failover is explicitly triggered. Warm standbys use fewer resources (no read traffic overhead) and are appropriate for dedicated disaster recovery nodes where cost matters more than read offloading. Hot standbys are the default choice for production HA clusters.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.