Junior 6 min · March 06, 2026

MVCC Bloat — Autovacuum Failure at 500 Writes/sec

Dead tuples hit 1M on a 2M-row table at 500 writes/sec, inflating it to 12GB and crashing queries.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • MVCC gives every transaction a consistent snapshot of data without blocking writes or reads
  • PostgreSQL uses xmin/xmax tuples; InnoDB uses undo logs with rollback pointers
  • Snapshot isolation prevents dirty reads, but write skew still slips through
  • Version storage overhead can add 20-50% disk usage under heavy writes
  • Autovacuum bloat is the #1 silent killer of query performance in Postgres
  • Tuning idle_in_transaction_session_timeout prevents catastrophic snapshot buildup
Plain-English First

Imagine a library with one copy of a popular book. Without MVCC, if someone is reading it, you have to wait until they're done before you can even look at it. With MVCC, the librarian secretly makes a photocopy of the book the moment you sit down — so you read your private snapshot while someone else edits the original. When you're done, you compare notes. Nobody waits. Nobody blocks. That's MVCC — readers never block writers, and writers never block readers, because everyone works from their own timestamped version of the data.

Every high-traffic production database faces the same brutal tension: dozens of queries are reading rows at the exact same moment other queries are updating those same rows. Get this wrong and you're choosing between data inconsistency (dirty reads, phantom rows) or grinding serialisation locks that tank your throughput. This isn't a theoretical concern — it's why Instagram, Stripe, and every SaaS at scale cares deeply about how their database engine handles concurrent access.

Multi-Version Concurrency Control (MVCC) solves this by flipping the fundamental assumption. Instead of locking a row so only one person touches it at a time, the database keeps multiple timestamped versions of every row simultaneously. A reader gets a consistent snapshot of the world as it existed when their transaction began. A writer creates a new version without destroying the old one. The two operations proceed in parallel, completely independently. Lock contention drops dramatically, and read throughput scales linearly with your hardware.

By the end of this article you'll understand exactly how PostgreSQL stores row versions on disk (the xmin/xmax system), how InnoDB's undo log chain differs from that approach, why MVCC doesn't eliminate all anomalies (write skew is still lurking), how to tune autovacuum before table bloat kills your query plans, and what to say when an interviewer asks you to compare snapshot isolation with serialisable isolation. This is the deep, production-grade understanding that separates engineers who just use databases from engineers who run them confidently at scale.

What is Multi-version Concurrency Control?

MVCC is the engine that lets your database serve reads and writes at the same time without locking everything. Here's the core idea: every row carries hidden metadata that tells the database which transaction created it and which (if any) deleted it. A read query gets a snapshot of committed data as of the moment it started. While that query runs, another transaction can update rows and commit — but the first query still sees the old version.

That's the magic: readers don't block writers, and writers don't block readers. But it's not free. Every update leaves a dead version behind — a dead tuple in PostgreSQL, an undo record in InnoDB. If those dead versions pile up, your queries slow down, your storage fills, and eventually your pager goes off at 3 AM.

You don't have a choice about whether your database uses MVCC — every major engine does. Your choice is whether you understand the costs and how to manage them.

pg_visibility.sqlSQL
1
2
3
4
-- Query transaction visibility metadata in PostgreSQL
SELECT xmin, xmax, ctid, * FROM io.thecodeforge.orders WHERE id = 42;
-- xmin = creating transaction ID
-- xmax = deleting transaction ID (0 if current)
Output
xmin | xmax | ctid | id | status
1234 | 0 | (0,1) | 42 | active
Think of MVCC like Git branches
  • A snapshot is like a git checkout at a specific commit hash.
  • Updates create new commits; old commits still exist until garbage collection (vacuum/purge).
  • Long-running transactions are branches that never merge — they prevent cleanup of old commits.
  • Write skew is like two people cherry-picking different parts of the same file — each thinks they're right, but the result is inconsistent.
Production Insight
MVCC's version storage is not free — disk usage can increase 20-50% under sustained writes.
Dead tuple accumulation in Postgres is the #1 performance degradation vector in production.
Rule: measure bloat before it measures you. Set up n_dead_tup alerts.
Key Takeaway
MVCC lets readers and writers coexist without blocking.
But the hidden cost is dead versions that must be cleaned up.
You don't get to ignore version storage — it will come for you at 3 AM.
When to Use MVCC vs Lock-Based Concurrency
IfRead-dominant workload (90%+ reads)
UseMVCC wins — readers never block writers, throughput scales with cores.
IfHeavy contention on few rows (e.g., account balance updates)
UseMVCC still helps, but you need optimistic concurrency control or SELECT FOR UPDATE to avoid lost updates.
IfNeed strict serializability for financial transactions
UseMVCC snapshot isolation is not enough; use SERIALIZABLE isolation with predicate locking or application-level contention management.

How PostgreSQL Stores Row Versions: xmin and xmax

PostgreSQL stores multiple versions of a row directly in the same heap table. Each row header contains two critical transaction IDs: xmin (the transaction that created this version) and xmax (the transaction that deleted or updated it, or 0 if current). When a transaction updates a row, PostgreSQL marks the old version with its XID in xmax and inserts a new version with the same XID in xmin. A snapshot is simply a set of in-progress transaction IDs at a given moment. A row version is visible to a snapshot if its xmin is committed and before the snapshot's horizon, and its xmax is either 0 or in-progress (i.e., the deleting transaction hasn't committed yet). This is the heart of MVCC: visibility depends on the transaction snapshot, not on locks.

PostgreSQL also uses Heap-Only Tuples (HOT) when an update only changes non-indexed columns and the new version fits on the same page. HOT updates avoid the index overhead of inserting a new tuple and are much cheaper because pruning (cleanup) can remove the old version without a full index scan. You can spot HOT updates by comparing pg_stat_all_tables.n_tup_hot_upd to n_tup_upd — a low ratio indicates indexes are hurting your update performance.

Production Insight
Long-running read transactions hold back the snapshot horizon, preventing dead tuples from being cleaned.
If a transaction runs for an hour, every row updated during that hour creates a version that cannot be vacuumed.
Rule: set statement_timeout and idle_in_transaction_session_timeout to avoid snapshot bloat.
Key Takeaway
xmin and xmax are the core of PostgreSQL MVCC.
A row is visible to a snapshot if xmin is committed and xmax is invisible (0 or uncommitted).
If you see tuple bloat, check which snapshots are blocking vacuum with pg_stat_activity.backend_xmin.
Choosing Between Heap-Only Tuples (HOT) and Full Updates
IfUpdate changes only indexed columns or uses values that move the row to a different page
UseFull update: new version placed on a new page, old version becomes dead tuple until vacuum.
IfUpdate changes only non-indexed columns and row stays on same page
UseHOT update: new version placed on same page, pruning can remove old version without full vacuum scan. Much cheaper.

How InnoDB Uses the Undo Log — Rollback to Any Point

MySQL InnoDB takes a different approach. Instead of storing multiple row versions in the table space, InnoDB keeps only one version in the clustered index and stores older versions in a separate undo tablespace. Each row in the index has a DB_ROLL_PTR that points to the undo log entry for the previous version. When a transaction updates a row, InnoDB writes the old values to an undo log record and updates the current row. A snapshot is constructed by reading the current row, then following the rollback pointer to reconstruct older versions as needed. This means that table bloat is less of an issue in InnoDB — the table space itself doesn't grow with dead tuples. However, the undo tablespace can grow very large if long-running transactions prevent the purge of old undo records.

InnoDB's purge system runs automatically and is much less configurable than PostgreSQL's vacuum. You can control undo truncation from MySQL 8.0+ with innodb_undo_log_truncate and innodb_max_undo_log_size. If these aren't set correctly, your undo tablespace can silently eat up disk — and you won't see it in SHOW TABLE STATUS. You need to check INFORMATION_SCHEMA.INNODB_METRICS or INNODB_UNDO_TABLESPACES.

Production Insight
InnoDB's undo log is invisible to SHOW TABLE STATUS — many teams miss the 50GB undo file until it fills the disk.
Undo truncation with innodb_undo_log_truncate helps, but it only works with separate undo tablespaces (enabled by default in 8.0+).
Rule: monitor INNODB_METRICS for undo memory usage, not just table size.
Key Takeaway
InnoDB trades table bloat for undo log bloat.
Undo log growth is invisible unless you explicitly monitor it.
MVCC's hidden cost shifts from main table to undo tablespace — but it's still a version management cost you must track.

Isolation Levels Under MVCC: Read Committed vs Repeatable Read vs Serializable

MVCC implements snapshot isolation (SI) at the default levels. In PostgreSQL, Read Committed and Repeatable Read use different snapshot strategies. Read Committed takes a new snapshot for each statement within a transaction, meaning you can see changes committed by other transactions between your own statements. Repeatable Read takes a single snapshot at the start of the transaction — all subsequent statements see the same version of the database. Serializable adds predicate locking to SI to prevent all anomalies including write skew, but it comes with overhead and the possibility of serialisation failures (which you must handle by retrying). InnoDB's default is Repeatable Read with a similar snapshot, but it also uses next-key locking to prevent phantoms, which can lead to more lock contention than PostgreSQL's approach.

Write skew is the anomaly that catches most teams off guard. Imagine a medical scheduling system: two doctors both query the on-call roster, see one slot left, and both assign themselves. Under REPEATABLE READ, both transactions commit — now two doctors believe they're on call. The conflict is invisible because each transaction only reads the existing slot count, then updates a separate row. MVCC's snapshot isolation prevents lost updates and dirty reads, but it does not detect this overlapping read-then-write pattern. To prevent write skew, you need either SERIALIZABLE isolation (which uses predicate locks or SSI in Postgres) or explicit locking with SELECT ... FOR UPDATE on the rows you plan to change.

Think of Snapshots Like Git Branches
  • Read Committed: each statement gets a new git checkout — changes from others appear between queries.
  • Repeatable Read: you git stash at the start and never see other people's commits until you git pull (commit).
  • Serializable: no one else can git push to the same file while you're working — you get the whole repo locked.
  • Write skew: two people read the same rows, both decide to change different columns, and neither sees the other's change — until it's too late.
Production Insight
Write skew is the trickiest MVCC anomaly — it survives REPEATABLE READ but not SERIALIZABLE.
Example: two doctors both query the on-call list (one slot left), both proceed to assign themselves — now two doctors are on call.
Fix: use SELECT FOR UPDATE on the rows you're about to change, or switch to SERIALIZABLE and handle retry logic.
Key Takeaway
Snapshot isolation prevents dirty reads and non-repeatable reads.
But write skew remains unless you use SERIALIZABLE or explicit locking.
Know your anomaly tolerances: CRUD apps often accept write skew, but scheduling/booking systems cannot.

Autovacuum Tuning: The Most Common Production Mistake

PostgreSQL's autovacuum is designed to run in the background, but its default settings are conservative. For tables with high write throughput, the default autovacuum_vacuum_scale_factor = 0.2 means autovacuum won't kick in until 20% of the table is dead tuples. On a 10GB table, that's 2GB of dead space before cleanup starts — and by then, the table has likely grown much larger. Tuning involves adjusting autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, and autovacuum_vacuum_cost_limit per table. The cost_delay mechanism paces vacuum work to prevent I/O spikes, but it also means vacuum can fall behind under sustained load. InnoDB doesn't have a vacuum equivalent — it purges undo records automatically — but the undo tablespace still needs trimming.

To tune autovacuum per table, use storage parameters: ``sql ALTER TABLE io.thecodeforge.orders SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000); ` You should also monitor pg_stat_all_tables.n_dead_tup and set up alerts when the ratio to n_live_tup exceeds 30%. The pg_stat_user_tables` view is your friend. And don't forget that long-running transactions (especially idle-in-transaction) block vacuum from removing dead tuples even if autovacuum runs.

Production Insight
Deferred vacuum is the #1 cause of query degradation in Postgres at scale.
We tuned a table with 2000 writes/sec by setting autovacuum_vacuum_scale_factor = 0.01 and autovacuum_vacuum_threshold = 1000 — query times dropped from 8s to 12ms.
Rule: measure your write TPS on each table and tune autovacuum per table, not globally.
Key Takeaway
Defaults are for demos, not production.
High-write tables need lower scale_factor and higher cost_limit.
Monitor n_dead_tup / n_live_tup every hour — if it's above 30%, call your DBA.
● Production incidentPOST-MORTEMseverity: high

The Autovacuum That Ate the Weekend

Symptom
Queries on a frequently updated table degrade over hours, then crash the application with 'could not extend file' or deadlock timeouts. pg_stat_user_tables shows n_dead_tup > 50% of n_live_tup.
Assumption
The DBA assumed autovacuum would handle it. The dev team assumed the slow query was an indexing problem.
Root cause
The table had 2M rows, 500 writes/sec, and a default autovacuum_vacuum_threshold of 50 + 0.05 * row count. That threshold is crossed at ~100k dead rows — but the table hit 1M dead rows before autovacuum's cost_delay allowed it to catch up. Dead tuples inflated the table to 12GB, and sequential scans became painful.
Fix
Set autovacuum_vacuum_scale_factor = 0.01 for this table, lowered autovacuum_vacuum_threshold to 1000, and ensured autovacuum_max_workers were not starved by other tables. Also increased maintenance_work_mem to 1GB for the vacuum process.
Key lesson
  • Default autovacuum settings are tuned for OLTP workloads with moderate write rates — not high-write tables.
  • Monitor n_dead_tup / n_live_tup ratio per table; alert when it exceeds 30%.
  • Always test VACUUM timing under production write load before going live.
  • Don't assume 'it's been fine for months' — write patterns change, and bloat accumulates silently.
Production debug guideSymptom → Action4 entries
Symptom · 01
Queries slow down over time; table size much larger than data volume
Fix
Check pg_stat_user_tables.n_dead_tup vs n_live_tup. If > 30%, run VACUUM (but not FULL unless desperate). Set up autovacuum alerting.
Symptom · 02
Transactions hang with 'cannot serialize access' or 'snapshot too old' errors
Fix
Check pg_stat_activity for long-running idle in transaction queries. Kill them or set idle_in_transaction_session_timeout.
Symptom · 03
Write skew anomalies despite using SERIALIZABLE isolation
Fix
SERIALIZABLE doesn't prevent all write skew unless you use predicate locking. If you need true serializability, use SSI (PostgreSQL) or SELECT ... FOR UPDATE on overlapping ranges.
Symptom · 04
InnoDB undo tablespace grows unboundedly
Fix
Check INNODB_METRICS for undo logs. Set innodb_undo_log_truncate = ON and innodb_max_undo_log_size to a reasonable limit (default 1GB is often too large). Monitor with SHOW ENGINE INNODB STATUS.
★ MVCC Quick Debug Cheat SheetUse these commands to diagnose MVCC problems in PostgreSQL (Pg) and MySQL InnoDB (Inno).
Table bloat / dead tuples accumulating
Immediate action
Check dead tuple count
Commands
SELECT schemaname, relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC; -- Pg
SHOW TABLE STATUS LIKE 'your_table'; -- Inno (Data_free > 100MB indicates bloat)
Fix now
Run VACUUM table_name; (not FULL) for PostgreSQL, or OPTIMIZE TABLE for InnoDB (with pt-online-schema-change if big).
Long-running idle-in-transaction queries blocking vacuum+
Immediate action
Identify blocking sessions
Commands
SELECT pid, state, query, age(now(), xact_start) FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_start; -- Pg
SELECT * FROM information_schema.innodb_trx WHERE trx_operation_state = 'idle' AND trx_started < NOW() - INTERVAL 5 MINUTE; -- Inno
Fix now
SET idle_in_transaction_session_timeout = '5min'; in postgresql.conf, then reload. For current, pg_terminate_backend(pid).
Snapshot too old error (PostgreSQL)+
Immediate action
Check old_snapshot_threshold setting
Commands
SHOW old_snapshot_threshold; -- Pg (default: -1, meaning disabled)
SELECT count(*) FROM pg_stat_activity WHERE backend_xmin IS NOT NULL; -- Pg (backends holding old snapshots)
Fix now
Set old_snapshot_threshold to something like '1h' (requires restart). Ensure long-reporting queries finish within that window.
Transaction ID wraparound imminent+
Immediate action
Check current transaction ID and age
Commands
SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database; -- Pg (age should be below 2 billion)
SELECT max(age(datfrozenxid)) FROM pg_database; -- Pg
Fix now
Set autovacuum_freeze_max_age to 200 million (default 200M). Run VACUUM FREEZE on oldest databases. If age reaches 2B, database stops accepting writes.
InnoDB rollback segment too large+
Immediate action
Check undo tablespace size
Commands
SELECT NAME, VALUE/1024/1024/1024 as GB FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME LIKE '%undo%' AND SUBSYSTEM = 'transaction'; -- Inno
SHOW VARIABLES LIKE 'innodb_undo_log_truncate'; -- Inno
Fix now
Enable innodb_undo_log_truncate = ON and set reasonable max undo log size (e.g., innodb_max_undo_log_size = 2G) in my.cnf, then restart.
MVCC Implementation Comparison: PostgreSQL vs MySQL InnoDB
AspectPostgreSQLMySQL InnoDB
Version storageHeap table — dead tuples exist in same table spaceUndo tablespace — only current version in index, old versions in rollback segments
Bloat riskHigh if autovacuum falls behind. Dead tuples bloat heap.Low in table space; undo tablespace can grow large if long transactions hold references.
Vacuum / PurgeManual VACUUM or autovacuum needed to reclaim dead tuple space.Automatic purge of undo records; no vacuum, but undo truncation must be enabled.
Snapshot isolation levelRead Committed (statement-level snapshot) and Repeatable Read (transaction-level snapshot). SERIALIZABLE uses SSI.Repeatable Read default (with next-key locking). SERIALIZABLE uses two-phase locking, not SI.
Write skew protectionOnly SERIALIZABLE (SSI) prevents write skew. Repeatable Read allows it.Only SERIALIZABLE (with locking) prevents write skew; Repeatable Read allows it, but next-key locks reduce some forms.
Performance for read-heavy workloadsExcellent — read queries never block. No locks on reads except at SERIALIZABLE.Excellent, but next-key locking in Repeatable Read can cause lock contention under heavy index access.

Key takeaways

1
You now understand what MVCC is and why it exists
2
You've seen how PostgreSQL and InnoDB implement version storage differently
3
Snapshot isolation prevents dirty reads but not write skew; use SERIALIZABLE for absolute consistency
4
Autovacuum tuning is not optional
monitor dead tuple ratio per table
5
PostgreSQL uses xmin/xmax; InnoDB uses undo logs
both store versions but with different space trade-offs
6
Long-running transactions are the silent killers of MVCC performance

Common mistakes to avoid

6 patterns
×

Memorising syntax before understanding the concept

Symptom
Unable to troubleshoot production bloat or snapshot age issues because the underlying MVCC visibility rules are not understood.
Fix
Instead of memorising SQL syntax, focus on the visibility algorithm: a row is visible if xmin is committed and xmax is 0 or uncommitted. Run test scenarios to internalise the rules.
×

Skipping practice and only reading theory

Symptom
When a real incident occurs (e.g., snapshot too old), you have no hands-on experience to diagnose if it's a long transaction or a vacuum issue.
Fix
Set up a test database with pgbench or sysbench. Simulate high write loads and practice monitoring n_dead_tup, running VACUUM, and watching the effects.
×

Assuming autovacuum default settings work for all tables

Symptom
Queries gradually degrade as table bloat increases. pg_stat_user_tables shows high n_dead_tup.
Fix
Per-table tuning: set autovacuum_vacuum_scale_factor = 0.01 for write-heavy tables, monitor n_dead_tup / n_live_tup ratio.
×

Using REPEATABLE READ (or Read Committed) for financial transactions that need absolute consistency

Symptom
Write skew anomalies: two transactions read the same data, each updates a subset, and the final state violates constraints.
Fix
Use SERIALIZABLE isolation level and implement retry logic for serialisation failures.
×

Ignoring the undo tablespace in MySQL InnoDB

Symptom
Disk space fills up due to undolog growth. Long-running transactions prevent purge.
Fix
Enable innodb_undo_log_truncate, set innodb_max_undo_log_size, and monitor with INNODB_METRICS.
×

Not setting idle_in_transaction_session_timeout in PostgreSQL

Symptom
Vacuum unable to remove dead tuples because a long-running transaction holds a snapshot. Table bloat extreme.
Fix
Set idle_in_transaction_session_timeout to a reasonable value (e.g., 5 minutes) in postgresql.conf.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain how MVCC works differently in PostgreSQL vs MySQL InnoDB.
Q02SENIOR
What is write skew and why does MVCC snapshot isolation not prevent it?
Q03SENIOR
How does PostgreSQL determine whether a row version is visible to a snap...
Q04SENIOR
You have a table in PostgreSQL with 500 writes per second and queries ar...
Q05JUNIOR
What is the difference between Read Committed and Repeatable Read in Pos...
Q01 of 05SENIOR

Explain how MVCC works differently in PostgreSQL vs MySQL InnoDB.

ANSWER
PostgreSQL stores dead row versions directly in the heap table with xmin and xmax transaction IDs. Old versions stay visible until VACUUM reclaims them. InnoDB keeps only one version in the clustered index and uses undo logs to reconstruct older versions on demand. This means PostgreSQL's table bloat is more visible but simpler to inspect; InnoDB's version storage is in a separate undo tablespace that grows silently. PostgreSQL requires autovacuum to clean dead tuples; InnoDB uses an automatic purge of undo records. The trade-off: Postgres has better visibility into bloat but more tuning surface; InnoDB is more hands-off but can hide disk growth until it's critical.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is Multi-version Concurrency Control in simple terms?
02
Why doesn't MVCC prevent all concurrency anomalies?
03
How do I know if my PostgreSQL table needs vacuuming?
04
Does MySQL InnoDB have a vacuum equivalent?
05
What's the biggest performance impact of MVCC?
🔥

That's SQL Advanced. Mark it forged?

6 min read · try the examples if you haven't

Previous
Database Locking Mechanisms
16 / 16 · SQL Advanced
Next
Introduction to NoSQL Databases