MVCC Bloat — Autovacuum Failure at 500 Writes/sec
Dead tuples hit 1M on a 2M-row table at 500 writes/sec, inflating it to 12GB and crashing queries.
20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.
- MVCC gives every transaction a consistent snapshot of data without blocking writes or reads
- PostgreSQL uses xmin/xmax tuples; InnoDB uses undo logs with rollback pointers
- Snapshot isolation prevents dirty reads, but write skew still slips through
- Version storage overhead can add 20-50% disk usage under heavy writes
- Autovacuum bloat is the #1 silent killer of query performance in Postgres
- Tuning idle_in_transaction_session_timeout prevents catastrophic snapshot buildup
Imagine a library with one copy of a popular book. Without MVCC, if someone is reading it, you have to wait until they're done before you can even look at it. With MVCC, the librarian secretly makes a photocopy of the book the moment you sit down — so you read your private snapshot while someone else edits the original. When you're done, you compare notes. Nobody waits. Nobody blocks. That's MVCC — readers never block writers, and writers never block readers, because everyone works from their own timestamped version of the data.
Every high-traffic production database faces the same brutal tension: dozens of queries are reading rows at the exact same moment other queries are updating those same rows. Get this wrong and you're choosing between data inconsistency (dirty reads, phantom rows) or grinding serialisation locks that tank your throughput. This isn't a theoretical concern — it's why Instagram, Stripe, and every SaaS at scale cares deeply about how their database engine handles concurrent access.
Multi-Version Concurrency Control (MVCC) solves this by flipping the fundamental assumption. Instead of locking a row so only one person touches it at a time, the database keeps multiple timestamped versions of every row simultaneously. A reader gets a consistent snapshot of the world as it existed when their transaction began. A writer creates a new version without destroying the old one. The two operations proceed in parallel, completely independently. Lock contention drops dramatically, and read throughput scales linearly with your hardware.
By the end of this article you'll understand exactly how PostgreSQL stores row versions on disk (the xmin/xmax system), how InnoDB's undo log chain differs from that approach, why MVCC doesn't eliminate all anomalies (write skew is still lurking), how to tune autovacuum before table bloat kills your query plans, and what to say when an interviewer asks you to compare snapshot isolation with serialisable isolation. This is the deep, production-grade understanding that separates engineers who just use databases from engineers who run them confidently at scale.
What is Multi-version Concurrency Control?
MVCC is the engine that lets your database serve reads and writes at the same time without locking everything. Here's the core idea: every row carries hidden metadata that tells the database which transaction created it and which (if any) deleted it. A read query gets a snapshot of committed data as of the moment it started. While that query runs, another transaction can update rows and commit — but the first query still sees the old version.
That's the magic: readers don't block writers, and writers don't block readers. But it's not free. Every update leaves a dead version behind — a dead tuple in PostgreSQL, an undo record in InnoDB. If those dead versions pile up, your queries slow down, your storage fills, and eventually your pager goes off at 3 AM.
You don't have a choice about whether your database uses MVCC — every major engine does. Your choice is whether you understand the costs and how to manage them.
- A snapshot is like a git checkout at a specific commit hash.
- Updates create new commits; old commits still exist until garbage collection (vacuum/purge).
- Long-running transactions are branches that never merge — they prevent cleanup of old commits.
- Write skew is like two people cherry-picking different parts of the same file — each thinks they're right, but the result is inconsistent.
How PostgreSQL Stores Row Versions: xmin and xmax
PostgreSQL stores multiple versions of a row directly in the same heap table. Each row header contains two critical transaction IDs: xmin (the transaction that created this version) and xmax (the transaction that deleted or updated it, or 0 if current). When a transaction updates a row, PostgreSQL marks the old version with its XID in xmax and inserts a new version with the same XID in xmin. A snapshot is simply a set of in-progress transaction IDs at a given moment. A row version is visible to a snapshot if its xmin is committed and before the snapshot's horizon, and its xmax is either 0 or in-progress (i.e., the deleting transaction hasn't committed yet). This is the heart of MVCC: visibility depends on the transaction snapshot, not on locks.
PostgreSQL also uses Heap-Only Tuples (HOT) when an update only changes non-indexed columns and the new version fits on the same page. HOT updates avoid the index overhead of inserting a new tuple and are much cheaper because pruning (cleanup) can remove the old version without a full index scan. You can spot HOT updates by comparing pg_stat_all_tables.n_tup_hot_upd to n_tup_upd — a low ratio indicates indexes are hurting your update performance.
How InnoDB Uses the Undo Log — Rollback to Any Point
MySQL InnoDB takes a different approach. Instead of storing multiple row versions in the table space, InnoDB keeps only one version in the clustered index and stores older versions in a separate undo tablespace. Each row in the index has a DB_ROLL_PTR that points to the undo log entry for the previous version. When a transaction updates a row, InnoDB writes the old values to an undo log record and updates the current row. A snapshot is constructed by reading the current row, then following the rollback pointer to reconstruct older versions as needed. This means that table bloat is less of an issue in InnoDB — the table space itself doesn't grow with dead tuples. However, the undo tablespace can grow very large if long-running transactions prevent the purge of old undo records.
InnoDB's purge system runs automatically and is much less configurable than PostgreSQL's vacuum. You can control undo truncation from MySQL 8.0+ with innodb_undo_log_truncate and innodb_max_undo_log_size. If these aren't set correctly, your undo tablespace can silently eat up disk — and you won't see it in SHOW TABLE STATUS. You need to check INFORMATION_SCHEMA.INNODB_METRICS or INNODB_UNDO_TABLESPACES.
SHOW TABLE STATUS — many teams miss the 50GB undo file until it fills the disk.Isolation Levels Under MVCC: Read Committed vs Repeatable Read vs Serializable
MVCC implements snapshot isolation (SI) at the default levels. In PostgreSQL, Read Committed and Repeatable Read use different snapshot strategies. Read Committed takes a new snapshot for each statement within a transaction, meaning you can see changes committed by other transactions between your own statements. Repeatable Read takes a single snapshot at the start of the transaction — all subsequent statements see the same version of the database. Serializable adds predicate locking to SI to prevent all anomalies including write skew, but it comes with overhead and the possibility of serialisation failures (which you must handle by retrying). InnoDB's default is Repeatable Read with a similar snapshot, but it also uses next-key locking to prevent phantoms, which can lead to more lock contention than PostgreSQL's approach.
Write skew is the anomaly that catches most teams off guard. Imagine a medical scheduling system: two doctors both query the on-call roster, see one slot left, and both assign themselves. Under REPEATABLE READ, both transactions commit — now two doctors believe they're on call. The conflict is invisible because each transaction only reads the existing slot count, then updates a separate row. MVCC's snapshot isolation prevents lost updates and dirty reads, but it does not detect this overlapping read-then-write pattern. To prevent write skew, you need either SERIALIZABLE isolation (which uses predicate locks or SSI in Postgres) or explicit locking with SELECT ... FOR UPDATE on the rows you plan to change.
- Read Committed: each statement gets a new
git checkout— changes from others appear between queries. - Repeatable Read: you
git stashat the start and never see other people's commits until yougit pull(commit). - Serializable: no one else can
git pushto the same file while you're working — you get the whole repo locked. - Write skew: two people read the same rows, both decide to change different columns, and neither sees the other's change — until it's too late.
REPEATABLE READ but not SERIALIZABLE.SELECT FOR UPDATE on the rows you're about to change, or switch to SERIALIZABLE and handle retry logic.SERIALIZABLE or explicit locking.Autovacuum Tuning: The Most Common Production Mistake
PostgreSQL's autovacuum is designed to run in the background, but its default settings are conservative. For tables with high write throughput, the default autovacuum_vacuum_scale_factor = 0.2 means autovacuum won't kick in until 20% of the table is dead tuples. On a 10GB table, that's 2GB of dead space before cleanup starts — and by then, the table has likely grown much larger. Tuning involves adjusting autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, and autovacuum_vacuum_cost_limit per table. The cost_delay mechanism paces vacuum work to prevent I/O spikes, but it also means vacuum can fall behind under sustained load. InnoDB doesn't have a vacuum equivalent — it purges undo records automatically — but the undo tablespace still needs trimming.
To tune autovacuum per table, use storage parameters: ``sql ALTER TABLE io.thecodeforge.orders SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000); ` You should also monitor pg_stat_all_tables.n_dead_tup and set up alerts when the ratio to n_live_tup exceeds 30%. The pg_stat_user_tables` view is your friend. And don't forget that long-running transactions (especially idle-in-transaction) block vacuum from removing dead tuples even if autovacuum runs.
autovacuum_vacuum_scale_factor = 0.01 and autovacuum_vacuum_threshold = 1000 — query times dropped from 8s to 12ms.n_dead_tup / n_live_tup every hour — if it's above 30%, call your DBA.Why MVCC Is Used: The Problem Nobody Told You About
You think MVCC exists to make reads fast. Wrong. MVCC exists so you don't need to lock every row when one transaction sneezes.
Without MVCC, a simple SELECT blocks an UPDATE. An UPDATE blocks a DELETE. You get chain-locking, deadlocks at 3 AM, and a pager that ruins your sleep. MVCC kills that by keeping old versions alive for readers while writers scribble on new copies.
The real reason? Latency hiding. When a transaction holds a row lock, everyone else queues. Queues turn into timeout cascades. MVCC lets readers skip the queue entirely. They grab the last committed version and move on.
If you're running an e-commerce platform and a product update blocks inventory checks, you lose orders. MVCC prevents that by decoupling read consistency from write contention. It's not a feature — it's a survival mechanism.
Types of MVCC in DBMS: Pick Your Poison
Not all MVCC is created equal. There are three major flavors, and picking wrong means either constant retries or a storage bill that rivals your rent.
Timestamp-Based MVCC tags every transaction with a monotonically increasing timestamp. The database compares timestamps to decide visibility. It's simple, deterministic, and breaks under clock skew. If your system clock jumps, transactions see ghosts. Don't use this on distributed databases without hardware-level clock sync.
Snapshot-Based MVCC gives each transaction a point-in-time snapshot of the entire database at start. PostgreSQL and Oracle use this. Every write creates a new row version visible only to transactions that started after the write committed. Reads never block. Writes check for conflicts at commit time. The cost? Bloat. Old row versions pile up until a vacuum or purge cleans them.
History-Based MVCC keeps every row version forever — or at least until a retention policy says otherwise. Great for audit trails and time-travel queries. Terrible for write-heavy workloads because the history chain grows unbounded. InnoDB's undo log is a hybrid: it keeps enough history for rollback and consistent reads, then prunes aggressively.
Hybrid MVCC combines approaches. CockroachDB uses a hybrid clock (HLC) for timestamps but snapshots internally. The trade-off: more complexity, better resilience against clock drift.
Two-Phase Locking: Why Your Reads Block Writes (And How to Fix It)
You think MVCC means no locks? Wrong. Two-phase locking (2PL) is the dirty secret under the hood. MVCC eliminates read-write conflicts for standard queries, but the moment you touch Serializable isolation or DDL, 2PL kicks in and your transactions serialize.
Phase one: acquire all locks. Read locks, write locks, predicate locks — you grab them as you go. Phase two: release them all at commit or rollback. No releasing early. That's the rule. Why? Because early release lets phantom reads and write skew slip through. Serializable mode enforces this with index-range locks (gap locks in MySQL, predicate locks in PostgreSQL).
Here's the production pain: long transactions accumulate locks. A report that scans 10 million rows under Serializable holds read locks on every index page it touches. Meanwhile, a simple UPDATE on a single row blocks — not because of MVCC, but because 2PL won't let the write proceed until the reader releases. Monitor pg_locks or performance_schema.data_locks. Short transactions are not optional.
Shortcut: if you see deadlocks under Serializable, you're likely holding locks in different orders across transactions. Enforce a consistent lock order in your app code.
MVCC + 2PL: The Truth About Snapshot Isolation and Write Conflicts
You've read that MVCC gives you snapshot isolation: readers never block writers, writers never block readers. That's true — until a write conflict happens. Then 2PL muscles in and aborts one transaction. Here's the why.
Snapshot isolation uses row versions to let each transaction see a consistent snapshot from its start time. No locks needed for reads. But when two transactions try to UPDATE the same row concurrently, the database must serialize the writes. It can't just merge two versions — that's a conflict. PostgreSQL's Serializable Snapshot Isolation (SSI) detects these conflicts using predicate locks and aborts one transaction. InnoDB uses a simpler approach: the second updater waits (via 2PL) on a row lock from the first.
Production takeaway: 'no locks' is marketing. Real MVCC implementations use a hybrid. Reads are lock-free. Conflicting writes are not. This is why you see occasional serialization failures under Repeatable Read — the database is honest about conflicts rather than corrupting data.
If your app retries serialization failures, you're working with MVCC + 2PL. Design for that. Exponential backoff. Three retries max. Don't pretend the locks don't exist.
Introduction
Multiversion Concurrency Control (MVCC) is the engine that lets modern databases handle thousands of simultaneous transactions without locking everything to a crawl. The core problem is simple: a database must keep data consistent when multiple users read and write at the same time. Traditional locking would force a writer to block every reader, killing throughput. MVCC solves this by keeping old versions of each row around, so a reader sees a consistent snapshot of the database as it existed when their transaction began, no matter what other transactions are doing. This means reads never wait for writes, and writes never wait for reads. Understanding MVCC's purpose is critical before diving into specific implementations because it explains why every design trade-off — from storage overhead to vacuum management — exists. Without MVCC, high-concurrency workloads would grind to a halt. With it, systems like PostgreSQL and MySQL can handle thousands of transactions per second while maintaining full ACID guarantees.
Two-Phase Locking Protocol
Two-Phase Locking (2PL) is the strict sibling of MVCC. While MVCC avoids locks for reads, 2PL ensures serializability by forcing transactions to acquire all locks before releasing any. It works in two phases: a growing phase where locks are obtained (never released), and a shrinking phase where locks are released (never acquired). The critical property is that 2PL prevents dirty reads, non-repeatable reads, and phantom rows by locking rows, index ranges, or tables. However, the cost is severe: readers block writers and writers block readers. In practice, 2PL is rarely used alone in modern databases. Instead, systems like MySQL's InnoDB combine MVCC with a lighter form of 2PL (called strict 2PL) to handle write conflicts while keeping reads lock-free. The trade-off is clear: 2PL guarantees stronger consistency at the price of throughput. Understanding 2PL matters because it explains the original solution and why MVCC evolved to fix its blocking behavior.
The Autovacuum That Ate the Weekend
- Default autovacuum settings are tuned for OLTP workloads with moderate write rates — not high-write tables.
- Monitor n_dead_tup / n_live_tup ratio per table; alert when it exceeds 30%.
- Always test VACUUM timing under production write load before going live.
- Don't assume 'it's been fine for months' — write patterns change, and bloat accumulates silently.
SELECT schemaname, relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC; -- PgSHOW TABLE STATUS LIKE 'your_table'; -- Inno (Data_free > 100MB indicates bloat)Key takeaways
Common mistakes to avoid
6 patternsMemorising syntax before understanding the concept
Skipping practice and only reading theory
Assuming autovacuum default settings work for all tables
Using REPEATABLE READ (or Read Committed) for financial transactions that need absolute consistency
Ignoring the undo tablespace in MySQL InnoDB
Not setting idle_in_transaction_session_timeout in PostgreSQL
Interview Questions on This Topic
Explain how MVCC works differently in PostgreSQL vs MySQL InnoDB.
Frequently Asked Questions
20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.
That's SQL Advanced. Mark it forged?
11 min read · try the examples if you haven't