MVCC Bloat — Autovacuum Failure at 500 Writes/sec
Dead tuples hit 1M on a 2M-row table at 500 writes/sec, inflating it to 12GB and crashing queries.
- MVCC gives every transaction a consistent snapshot of data without blocking writes or reads
- PostgreSQL uses xmin/xmax tuples; InnoDB uses undo logs with rollback pointers
- Snapshot isolation prevents dirty reads, but write skew still slips through
- Version storage overhead can add 20-50% disk usage under heavy writes
- Autovacuum bloat is the #1 silent killer of query performance in Postgres
- Tuning idle_in_transaction_session_timeout prevents catastrophic snapshot buildup
Imagine a library with one copy of a popular book. Without MVCC, if someone is reading it, you have to wait until they're done before you can even look at it. With MVCC, the librarian secretly makes a photocopy of the book the moment you sit down — so you read your private snapshot while someone else edits the original. When you're done, you compare notes. Nobody waits. Nobody blocks. That's MVCC — readers never block writers, and writers never block readers, because everyone works from their own timestamped version of the data.
Every high-traffic production database faces the same brutal tension: dozens of queries are reading rows at the exact same moment other queries are updating those same rows. Get this wrong and you're choosing between data inconsistency (dirty reads, phantom rows) or grinding serialisation locks that tank your throughput. This isn't a theoretical concern — it's why Instagram, Stripe, and every SaaS at scale cares deeply about how their database engine handles concurrent access.
Multi-Version Concurrency Control (MVCC) solves this by flipping the fundamental assumption. Instead of locking a row so only one person touches it at a time, the database keeps multiple timestamped versions of every row simultaneously. A reader gets a consistent snapshot of the world as it existed when their transaction began. A writer creates a new version without destroying the old one. The two operations proceed in parallel, completely independently. Lock contention drops dramatically, and read throughput scales linearly with your hardware.
By the end of this article you'll understand exactly how PostgreSQL stores row versions on disk (the xmin/xmax system), how InnoDB's undo log chain differs from that approach, why MVCC doesn't eliminate all anomalies (write skew is still lurking), how to tune autovacuum before table bloat kills your query plans, and what to say when an interviewer asks you to compare snapshot isolation with serialisable isolation. This is the deep, production-grade understanding that separates engineers who just use databases from engineers who run them confidently at scale.
What is Multi-version Concurrency Control?
MVCC is the engine that lets your database serve reads and writes at the same time without locking everything. Here's the core idea: every row carries hidden metadata that tells the database which transaction created it and which (if any) deleted it. A read query gets a snapshot of committed data as of the moment it started. While that query runs, another transaction can update rows and commit — but the first query still sees the old version.
That's the magic: readers don't block writers, and writers don't block readers. But it's not free. Every update leaves a dead version behind — a dead tuple in PostgreSQL, an undo record in InnoDB. If those dead versions pile up, your queries slow down, your storage fills, and eventually your pager goes off at 3 AM.
You don't have a choice about whether your database uses MVCC — every major engine does. Your choice is whether you understand the costs and how to manage them.
- A snapshot is like a git checkout at a specific commit hash.
- Updates create new commits; old commits still exist until garbage collection (vacuum/purge).
- Long-running transactions are branches that never merge — they prevent cleanup of old commits.
- Write skew is like two people cherry-picking different parts of the same file — each thinks they're right, but the result is inconsistent.
How PostgreSQL Stores Row Versions: xmin and xmax
PostgreSQL stores multiple versions of a row directly in the same heap table. Each row header contains two critical transaction IDs: xmin (the transaction that created this version) and xmax (the transaction that deleted or updated it, or 0 if current). When a transaction updates a row, PostgreSQL marks the old version with its XID in xmax and inserts a new version with the same XID in xmin. A snapshot is simply a set of in-progress transaction IDs at a given moment. A row version is visible to a snapshot if its xmin is committed and before the snapshot's horizon, and its xmax is either 0 or in-progress (i.e., the deleting transaction hasn't committed yet). This is the heart of MVCC: visibility depends on the transaction snapshot, not on locks.
PostgreSQL also uses Heap-Only Tuples (HOT) when an update only changes non-indexed columns and the new version fits on the same page. HOT updates avoid the index overhead of inserting a new tuple and are much cheaper because pruning (cleanup) can remove the old version without a full index scan. You can spot HOT updates by comparing pg_stat_all_tables.n_tup_hot_upd to n_tup_upd — a low ratio indicates indexes are hurting your update performance.
How InnoDB Uses the Undo Log — Rollback to Any Point
MySQL InnoDB takes a different approach. Instead of storing multiple row versions in the table space, InnoDB keeps only one version in the clustered index and stores older versions in a separate undo tablespace. Each row in the index has a DB_ROLL_PTR that points to the undo log entry for the previous version. When a transaction updates a row, InnoDB writes the old values to an undo log record and updates the current row. A snapshot is constructed by reading the current row, then following the rollback pointer to reconstruct older versions as needed. This means that table bloat is less of an issue in InnoDB — the table space itself doesn't grow with dead tuples. However, the undo tablespace can grow very large if long-running transactions prevent the purge of old undo records.
InnoDB's purge system runs automatically and is much less configurable than PostgreSQL's vacuum. You can control undo truncation from MySQL 8.0+ with innodb_undo_log_truncate and innodb_max_undo_log_size. If these aren't set correctly, your undo tablespace can silently eat up disk — and you won't see it in SHOW TABLE STATUS. You need to check INFORMATION_SCHEMA.INNODB_METRICS or INNODB_UNDO_TABLESPACES.
SHOW TABLE STATUS — many teams miss the 50GB undo file until it fills the disk.Isolation Levels Under MVCC: Read Committed vs Repeatable Read vs Serializable
MVCC implements snapshot isolation (SI) at the default levels. In PostgreSQL, Read Committed and Repeatable Read use different snapshot strategies. Read Committed takes a new snapshot for each statement within a transaction, meaning you can see changes committed by other transactions between your own statements. Repeatable Read takes a single snapshot at the start of the transaction — all subsequent statements see the same version of the database. Serializable adds predicate locking to SI to prevent all anomalies including write skew, but it comes with overhead and the possibility of serialisation failures (which you must handle by retrying). InnoDB's default is Repeatable Read with a similar snapshot, but it also uses next-key locking to prevent phantoms, which can lead to more lock contention than PostgreSQL's approach.
Write skew is the anomaly that catches most teams off guard. Imagine a medical scheduling system: two doctors both query the on-call roster, see one slot left, and both assign themselves. Under REPEATABLE READ, both transactions commit — now two doctors believe they're on call. The conflict is invisible because each transaction only reads the existing slot count, then updates a separate row. MVCC's snapshot isolation prevents lost updates and dirty reads, but it does not detect this overlapping read-then-write pattern. To prevent write skew, you need either SERIALIZABLE isolation (which uses predicate locks or SSI in Postgres) or explicit locking with SELECT ... FOR UPDATE on the rows you plan to change.
- Read Committed: each statement gets a new
git checkout— changes from others appear between queries. - Repeatable Read: you
git stashat the start and never see other people's commits until yougit pull(commit). - Serializable: no one else can
git pushto the same file while you're working — you get the whole repo locked. - Write skew: two people read the same rows, both decide to change different columns, and neither sees the other's change — until it's too late.
REPEATABLE READ but not SERIALIZABLE.SELECT FOR UPDATE on the rows you're about to change, or switch to SERIALIZABLE and handle retry logic.SERIALIZABLE or explicit locking.Autovacuum Tuning: The Most Common Production Mistake
PostgreSQL's autovacuum is designed to run in the background, but its default settings are conservative. For tables with high write throughput, the default autovacuum_vacuum_scale_factor = 0.2 means autovacuum won't kick in until 20% of the table is dead tuples. On a 10GB table, that's 2GB of dead space before cleanup starts — and by then, the table has likely grown much larger. Tuning involves adjusting autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, and autovacuum_vacuum_cost_limit per table. The cost_delay mechanism paces vacuum work to prevent I/O spikes, but it also means vacuum can fall behind under sustained load. InnoDB doesn't have a vacuum equivalent — it purges undo records automatically — but the undo tablespace still needs trimming.
To tune autovacuum per table, use storage parameters: ``sql ALTER TABLE io.thecodeforge.orders SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000); ` You should also monitor pg_stat_all_tables.n_dead_tup and set up alerts when the ratio to n_live_tup exceeds 30%. The pg_stat_user_tables` view is your friend. And don't forget that long-running transactions (especially idle-in-transaction) block vacuum from removing dead tuples even if autovacuum runs.
autovacuum_vacuum_scale_factor = 0.01 and autovacuum_vacuum_threshold = 1000 — query times dropped from 8s to 12ms.n_dead_tup / n_live_tup every hour — if it's above 30%, call your DBA.The Autovacuum That Ate the Weekend
- Default autovacuum settings are tuned for OLTP workloads with moderate write rates — not high-write tables.
- Monitor n_dead_tup / n_live_tup ratio per table; alert when it exceeds 30%.
- Always test VACUUM timing under production write load before going live.
- Don't assume 'it's been fine for months' — write patterns change, and bloat accumulates silently.
Key takeaways
Common mistakes to avoid
6 patternsMemorising syntax before understanding the concept
Skipping practice and only reading theory
Assuming autovacuum default settings work for all tables
Using REPEATABLE READ (or Read Committed) for financial transactions that need absolute consistency
Ignoring the undo tablespace in MySQL InnoDB
Not setting idle_in_transaction_session_timeout in PostgreSQL
Interview Questions on This Topic
Explain how MVCC works differently in PostgreSQL vs MySQL InnoDB.
Frequently Asked Questions
That's SQL Advanced. Mark it forged?
6 min read · try the examples if you haven't