Junior 11 min · March 06, 2026

Multi-version Concurrency Control

MVCC Bloat — Autovacuum Failure at 500 Writes/sec

Q: What is Multi-version Concurrency Control in simple terms?

MVCC lets multiple transactions see a consistent snapshot of data without blocking each other. Think of it like a library that makes photocopies of a book each time someone starts reading — everyone sees their own copy while others edit the original.

Q: Why doesn't MVCC prevent all concurrency anomalies?

MVCC's snapshot isolation prevents dirty reads, non-repeatable reads, and phantoms, but it does not prevent write skew. Write skew occurs when two transactions read overlapping data, then each updates a different part of that data based on the snapshot, unaware of the other's change. Only serializable isolation or explicit locking can prevent write skew.

Q: How do I know if my PostgreSQL table needs vacuuming?

Check pg_stat_user_tables.n_dead_tup vs n_live_tup. If the ratio exceeds 30%, run VACUUM (not VACUUM FULL unless you need to reclaim disk space). Also monitor for long-running transactions that prevent vacuum from cleaning old rows.

Q: Does MySQL InnoDB have a vacuum equivalent?

No. InnoDB automatically purges undo records when they are no longer needed by any active transaction. However, the undo tablespace can grow large if transactions run for a long time. Use innodb_undo_log_truncate to control undo tablespace size.

Q: What's the biggest performance impact of MVCC?

The version storage overhead. In PostgreSQL, dead tuples bloat the heap and cause index scans to be slower. In InnoDB, undo log growth can fill disk space unnoticed. Both require monitoring and active management (autovacuum, undo truncation).

Dead tuples hit 1M on a 2M-row table at 500 writes/sec, inflating it to 12GB and crashing queries.

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.

✓ Production tested · Last updated May 23, 2026


⚡Quick Answer

MVCC gives every transaction a consistent snapshot of data without blocking writes or reads
PostgreSQL uses xmin/xmax tuples; InnoDB uses undo logs with rollback pointers
Snapshot isolation prevents dirty reads, but write skew still slips through
Version storage overhead can add 20-50% disk usage under heavy writes
Autovacuum bloat is the #1 silent killer of query performance in Postgres
Tuning idle_in_transaction_session_timeout prevents catastrophic snapshot buildup

✦ Definition · ~90s read

What is Multi-version Concurrency Control?

MVCC is the engine that lets your database serve reads and writes at the same time without locking everything. Here's the core idea: every row carries hidden metadata that tells the database which transaction created it and which (if any) deleted it. A read query gets a snapshot of committed data as of the moment it started.

While that query runs, another transaction can update rows and commit — but the first query still sees the old version.

That's the magic: readers don't block writers, and writers don't block readers. But it's not free. Every update leaves a dead version behind — a dead tuple in PostgreSQL, an undo record in InnoDB. If those dead versions pile up, your queries slow down, your storage fills, and eventually your pager goes off at 3 AM.

You rarely have a choice about whether your database uses MVCC: PostgreSQL, MySQL InnoDB, and Oracle all build on it. Your choice is whether you understand the costs and how to manage them.

Plain-English First

Imagine a library with one copy of a popular book. Without MVCC, if someone is reading it, you have to wait until they're done before you can even look at it. With MVCC, the librarian secretly makes a photocopy of the book the moment you sit down — so you read your private snapshot while someone else edits the original. When you're done, you compare notes. Nobody waits. Nobody blocks. That's MVCC — readers never block writers, and writers never block readers, because everyone works from their own timestamped version of the data.

Every high-traffic production database faces the same brutal tension: dozens of queries are reading rows at the exact same moment other queries are updating those same rows. Get this wrong and you're choosing between data inconsistency (dirty reads, phantom rows) or grinding serialisation locks that tank your throughput. This isn't a theoretical concern — it's why Instagram, Stripe, and every SaaS at scale cares deeply about how their database engine handles concurrent access.

Multi-Version Concurrency Control (MVCC) solves this by flipping the fundamental assumption. Instead of locking a row so only one person touches it at a time, the database keeps multiple timestamped versions of a row simultaneously. A reader gets a consistent snapshot of the world as it existed when their transaction began. A writer creates a new version without destroying the old one. The two operations proceed in parallel without waiting on each other. Lock contention drops dramatically, and read throughput is limited by your hardware rather than by writers.

By the end of this article you'll understand exactly how PostgreSQL stores row versions on disk (the xmin/xmax system), how InnoDB's undo log chain differs from that approach, why MVCC doesn't eliminate all anomalies (write skew is still lurking), how to tune autovacuum before table bloat kills your query plans, and what to say when an interviewer asks you to compare snapshot isolation with serialisable isolation. This is the deep, production-grade understanding that separates engineers who just use databases from engineers who run them confidently at scale.

What is Multi-version Concurrency Control?

pg_visibility.sqlSQL

-- Query transaction visibility metadata in PostgreSQL
SELECT xmin, xmax, ctid, * FROM io.thecodeforge.orders WHERE id = 42;
-- xmin = ID of the transaction that created this row version
-- xmax = ID of the transaction that deleted or locked it (0 if none)

Output

xmin | xmax | ctid | id | status

1234 | 0 | (0,1) | 42 | active

Think of MVCC like Git branches

A snapshot is like a git checkout at a specific commit hash.
Updates create new commits; old commits still exist until garbage collection (vacuum/purge).
Long-running transactions are branches that never merge — they prevent cleanup of old commits.
Write skew is like two people cherry-picking different parts of the same file — each thinks they're right, but the result is inconsistent.

Production Insight

MVCC's version storage is not free — disk usage can increase 20-50% under sustained writes.

Dead tuple accumulation in Postgres is the #1 performance degradation vector in production.

Rule: measure bloat before it measures you. Set up n_dead_tup alerts.

Key Takeaway

MVCC lets readers and writers coexist without blocking.

But the hidden cost is dead versions that must be cleaned up.

You don't get to ignore version storage — it will come for you at 3 AM.

When to Use MVCC vs Lock-Based Concurrency

If: Read-dominant workload (90%+ reads) → Use: MVCC wins. Readers never block writers, and throughput scales with cores.

If: Heavy contention on few rows (e.g., account balance updates) → Use: MVCC still helps, but you need optimistic concurrency control or SELECT FOR UPDATE to avoid lost updates.

If: Need strict serializability for financial transactions → Use: MVCC snapshot isolation is not enough; use SERIALIZABLE isolation with predicate locking or application-level contention management.

How PostgreSQL Stores Row Versions: xmin and xmax

PostgreSQL stores multiple versions of a row directly in the same heap table. Each row header contains two critical transaction IDs: xmin (the transaction that created this version) and xmax (the transaction that deleted or updated it, or 0 if none has). When a transaction updates a row, PostgreSQL stamps the old version's xmax with its XID and inserts a new version whose xmin is that same XID. A snapshot records which transactions were still in progress at a given moment, along with the range of XIDs that had been assigned by then. A row version is visible to a snapshot if its xmin committed before the snapshot was taken, and its xmax is either 0, aborted, or belongs to a transaction that had not committed when the snapshot was taken (still in progress, or started later). This is the heart of MVCC: visibility depends on the transaction snapshot, not on locks.
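The visibility rule above can be sketched as a small Python model. This is a simplified illustration, not PostgreSQL's actual implementation: the names `Snapshot`, `TupleVersion`, `next_xid`, and the `committed` set are assumptions made for the example, and it ignores a transaction's own writes, subtransactions, hint bits, and XID wraparound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    next_xid: int            # first XID not yet assigned when the snapshot was taken
    in_progress: frozenset   # XIDs below next_xid that were still running


@dataclass(frozen=True)
class TupleVersion:
    xmin: int  # XID that created this row version
    xmax: int  # XID that deleted/updated it, 0 if none


def committed_before(xid, snap, committed):
    """True if xid committed and that commit happened before the snapshot."""
    if xid >= snap.next_xid or xid in snap.in_progress:
        return False  # started later, or still running at snapshot time
    return xid in committed  # otherwise it finished: committed or aborted


def is_visible(tup, snap, committed):
    """Apply the xmin/xmax rule to one row version."""
    if not committed_before(tup.xmin, snap, committed):
        return False  # the creator is not visible to this snapshot
    if tup.xmax == 0:
        return True   # never deleted or updated
    # Visible unless the deleter's commit is itself visible to the snapshot.
    return not committed_before(tup.xmax, snap, committed)
```

With a snapshot taken while transaction 107 was running, a row deleted by 107 is still visible, while a row deleted by an already committed transaction is not.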

PostgreSQL also uses Heap-Only Tuples (HOT) when an update only changes non-indexed columns and the new version fits on the same page. HOT updates avoid the index overhead of inserting a new tuple and are much cheaper because pruning (cleanup) can remove the old version without a full index scan. You can spot HOT updates by comparing pg_stat_all_tables.n_tup_hot_upd to n_tup_upd — a low ratio indicates indexes are hurting your update performance.

Production Insight

Long-running read transactions hold back the snapshot horizon, preventing dead tuples from being cleaned.

If a transaction runs for an hour, every row updated during that hour creates a version that cannot be vacuumed.

Rule: set statement_timeout and idle_in_transaction_session_timeout to avoid snapshot bloat.

Key Takeaway

xmin and xmax are the core of PostgreSQL MVCC.

A row is visible to a snapshot if its xmin committed before the snapshot and its xmax is unset (0), aborted, or not yet committed as of the snapshot.

If you see tuple bloat, check which snapshots are blocking vacuum with pg_stat_activity.backend_xmin.

Choosing Between Heap-Only Tuples (HOT) and Full Updates

If: Update changes any indexed column, or the new version does not fit on the same page → Use: Full update. The new version gets fresh index entries, and the old version stays as a dead tuple until vacuum.

If: Update changes only non-indexed columns and the new version fits on the same page → Use: HOT update. The new version is placed on the same page, and pruning can remove the old version without a full vacuum scan. Much cheaper.

How InnoDB Uses the Undo Log — Rollback to Any Point

MySQL InnoDB takes a different approach. Instead of storing multiple row versions in the table space, InnoDB keeps only one version in the clustered index and stores older versions in a separate undo tablespace. Each row in the index has a DB_ROLL_PTR that points to the undo log entry for the previous version. When a transaction updates a row, InnoDB writes the old values to an undo log record and updates the current row. A snapshot is constructed by reading the current row, then following the rollback pointer to reconstruct older versions as needed. This means that table bloat is less of an issue in InnoDB — the table space itself doesn't grow with dead tuples. However, the undo tablespace can grow very large if long-running transactions prevent the purge of old undo records.
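To make the rollback-pointer walk concrete, here is a minimal Python sketch. It is a toy model under stated assumptions: `RowVersion` and `read_consistent` are hypothetical names, and the read view is reduced to a plain set of transaction IDs whose changes the reader may see, which is far simpler than InnoDB's real read-view logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    trx_id: int                       # transaction that wrote this version (DB_TRX_ID)
    values: dict                      # column values for this version
    roll_ptr: Optional["RowVersion"]  # previous version in the undo log (DB_ROLL_PTR)


def read_consistent(current, visible_trx_ids):
    """Start at the current row and follow roll pointers to a visible version."""
    version = current
    while version is not None and version.trx_id not in visible_trx_ids:
        version = version.roll_ptr  # step back one version through the undo log
    return version.values if version is not None else None


# Build a chain: trx 10 inserted stock=10, trx 20 set 9, trx 30 set 8.
v1 = RowVersion(10, {"stock": 10}, None)
v2 = RowVersion(20, {"stock": 9}, v1)
v3 = RowVersion(30, {"stock": 8}, v2)
```

A reader whose view includes transactions 10 and 20 but not 30 walks one step back from the current row and sees `stock = 9`. The longer the oldest reader stays open, the longer this chain has to be kept, which is why long transactions grow the undo tablespace.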

InnoDB's purge system runs automatically and is much less configurable than PostgreSQL's vacuum. You can control undo truncation with innodb_undo_log_truncate and innodb_max_undo_log_size (available since MySQL 5.7, with truncation enabled by default in 8.0). If these aren't set correctly, your undo tablespace can silently eat up disk, and you won't see it in SHOW TABLE STATUS. You need to check INFORMATION_SCHEMA.INNODB_METRICS or INFORMATION_SCHEMA.INNODB_TABLESPACES.

Production Insight

InnoDB's undo log is invisible to SHOW TABLE STATUS — many teams miss the 50GB undo file until it fills the disk.

Undo truncation with innodb_undo_log_truncate helps, but it only works with separate undo tablespaces (enabled by default in 8.0+).

Rule: monitor INNODB_METRICS for undo growth (for example the trx_rseg_history_len counter), not just table size.

Key Takeaway

InnoDB trades table bloat for undo log bloat.

Undo log growth is invisible unless you explicitly monitor it.

MVCC's hidden cost shifts from main table to undo tablespace — but it's still a version management cost you must track.

Isolation Levels Under MVCC: Read Committed vs Repeatable Read vs Serializable

MVCC implements snapshot isolation (SI) at the default levels. In PostgreSQL, Read Committed and Repeatable Read use different snapshot strategies. Read Committed takes a new snapshot for each statement within a transaction, meaning you can see changes committed by other transactions between your own statements. Repeatable Read takes a single snapshot at the start of the transaction — all subsequent statements see the same version of the database. Serializable adds predicate locking to SI to prevent all anomalies including write skew, but it comes with overhead and the possibility of serialisation failures (which you must handle by retrying). InnoDB's default is Repeatable Read with a similar snapshot, but it also uses next-key locking to prevent phantoms, which can lead to more lock contention than PostgreSQL's approach.

Write skew is the anomaly that catches most teams off guard. Imagine a medical scheduling system: two doctors both query the on-call roster, see one slot left, and both assign themselves. Under REPEATABLE READ, both transactions commit, and now two doctors believe they're on call. The conflict is invisible because each transaction only reads the existing slot count, then writes a separate row. MVCC's snapshot isolation prevents lost updates and dirty reads, but it does not detect this overlapping read-then-write pattern. To prevent write skew, you need either SERIALIZABLE isolation (SSI in Postgres, locking reads in InnoDB) or explicit locking with SELECT ... FOR UPDATE on the rows your decision depends on, such as the roster row that holds the slot count.
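The on-call scenario can be reproduced with a toy snapshot-isolation store in Python. This is a sketch for intuition only: `Store`, `Tx`, and `take_slot` are hypothetical names, and the conflict check models a simple "overlapping write sets abort" rule rather than any specific engine's implementation.

```python
class SerializationFailure(Exception):
    """Raised when a commit conflicts with a concurrent committed write."""


class Store:
    def __init__(self):
        self.rows = {}        # committed key -> value
        self.commit_log = []  # one set of written keys per commit

    def begin(self):
        return Tx(self)


class Tx:
    def __init__(self, store):
        self.store = store
        self.snapshot = dict(store.rows)        # frozen view taken at start
        self.start_seq = len(store.commit_log)  # later commits are concurrent
        self.writes = {}

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort only if a concurrent commit wrote one of the same keys.
        for keys in self.store.commit_log[self.start_seq:]:
            if keys & set(self.writes):
                raise SerializationFailure("concurrent update")
        self.store.rows.update(self.writes)
        self.store.commit_log.append(set(self.writes))


def take_slot(tx, doctor, capacity=1):
    """Read the roster from the snapshot, then insert a row for this doctor."""
    on_call = [k for k in tx.snapshot if k.startswith("oncall:")]
    if len(on_call) < capacity:
        tx.write("oncall:" + doctor, True)


def demo():
    store = Store()
    a, b = store.begin(), store.begin()
    take_slot(a, "alice")
    take_slot(b, "bob")
    a.commit()
    b.commit()  # write sets do not overlap, so nothing stops this commit
    return sorted(store.rows)
```

`demo()` ends with two on-call rows even though the capacity is one. If both transactions also wrote a shared key (the equivalent of locking the roster row), the second commit would raise `SerializationFailure` instead.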

Think of Snapshots Like Git Branches

Read Committed: each statement gets a fresh git checkout, so changes from others appear between queries.
Repeatable Read: you check out one commit at the start and never see other people's commits until your transaction ends.
Serializable: you work as if alone on the repo; if someone's push conflicts with what you read, one of you is rejected and has to redo the work.
Write skew: two people read the same rows, both decide to change different rows based on what they saw, and neither sees the other's change until it's too late.

Production Insight

Write skew is the trickiest MVCC anomaly — it survives REPEATABLE READ but not SERIALIZABLE.

Example: two doctors both query the on-call list (one slot left), both proceed to assign themselves — now two doctors are on call.

Fix: use SELECT FOR UPDATE on the rows your decision reads (here, the roster slot), or switch to SERIALIZABLE and handle retry logic.

Key Takeaway

Snapshot isolation prevents dirty reads and non-repeatable reads.

But write skew remains unless you use SERIALIZABLE or explicit locking.

Know your anomaly tolerances: CRUD apps often accept write skew, but scheduling/booking systems cannot.

Autovacuum Tuning: The Most Common Production Mistake

PostgreSQL's autovacuum is designed to run in the background, but its default settings are conservative. For tables with high write throughput, the default autovacuum_vacuum_scale_factor = 0.2 means autovacuum won't kick in until 20% of the table is dead tuples. On a 10GB table, that's 2GB of dead space before cleanup starts — and by then, the table has likely grown much larger. Tuning involves adjusting autovacuum_vacuum_threshold, autovacuum_vacuum_scale_factor, and autovacuum_vacuum_cost_limit per table. The cost_delay mechanism paces vacuum work to prevent I/O spikes, but it also means vacuum can fall behind under sustained load. InnoDB doesn't have a vacuum equivalent — it purges undo records automatically — but the undo tablespace still needs trimming.
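The trigger arithmetic is easy to check by hand. The sketch below assumes the documented rule that autovacuum considers a table once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × row count; the function names are made up for illustration, and it assumes every write leaves exactly one dead tuple.

```python
def autovacuum_trigger(reltuples, threshold=50, scale_factor=0.2):
    """Dead tuples needed before autovacuum considers the table."""
    return threshold + scale_factor * reltuples


def seconds_until_trigger(reltuples, dead_per_sec, threshold=50, scale_factor=0.2):
    """How long a steady write rate takes to reach the trigger point."""
    return autovacuum_trigger(reltuples, threshold, scale_factor) / dead_per_sec


if __name__ == "__main__":
    rows, rate = 2_000_000, 500  # table size and write rate from the incident in this article
    for label, thr, sf in [("default", 50, 0.2), ("tuned", 1000, 0.01)]:
        trigger = autovacuum_trigger(rows, thr, sf)
        wait = seconds_until_trigger(rows, rate, thr, sf)
        print(f"{label}: trigger at {trigger:,.0f} dead tuples, about {wait:,.0f}s of writes")
```

For a 2M-row table the defaults wait for roughly 400,000 dead tuples, while a scale factor of 0.01 with a threshold of 1000 starts cleanup at about 21,000. At 500 writes/sec that is the difference between vacuuming every 13 minutes or so and every 42 seconds.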

To tune autovacuum per table, use storage parameters:

ALTER TABLE io.thecodeforge.orders SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000);

You should also monitor pg_stat_user_tables.n_dead_tup and set up alerts when its ratio to n_live_tup exceeds 30%. And don't forget that long-running transactions (especially idle-in-transaction sessions) block vacuum from removing dead tuples even if autovacuum runs.

Production Insight

Deferred vacuum is the #1 cause of query degradation in Postgres at scale.

We tuned a table with 2000 writes/sec by setting autovacuum_vacuum_scale_factor = 0.01 and autovacuum_vacuum_threshold = 1000 — query times dropped from 8s to 12ms.

Rule: measure your write TPS on each table and tune autovacuum per table, not globally.

Key Takeaway

Defaults are for demos, not production.

High-write tables need lower scale_factor and higher cost_limit.

Monitor n_dead_tup / n_live_tup every hour — if it's above 30%, call your DBA.

Why MVCC Is Used: The Problem Nobody Told You About

You think MVCC exists to make reads fast. Wrong. MVCC exists so you don't need to lock every row when one transaction sneezes.

Without MVCC, a simple SELECT blocks an UPDATE. An UPDATE blocks a DELETE. You get chain-locking, deadlocks at 3 AM, and a pager that ruins your sleep. MVCC kills that by keeping old versions alive for readers while writers scribble on new copies.

The real reason? Latency hiding. When a transaction holds a row lock, everyone else queues. Queues turn into timeout cascades. MVCC lets readers skip the queue entirely. They grab the last committed version and move on.

If you're running an e-commerce platform and a product update blocks inventory checks, you lose orders. MVCC prevents that by decoupling read consistency from write contention. It's not a feature — it's a survival mechanism.

ReadContentionDemo.sqlSQL

-- io.thecodeforge - database tutorial
-- Without MVCC (lock-based engine): SELECT blocks UPDATE
-- Session 1
BEGIN;
SELECT stock_count FROM inventory WHERE product_id = 1001;
-- Shared lock held while the transaction stays open

-- Session 2
UPDATE inventory SET stock_count = stock_count - 1 WHERE product_id = 1001;
-- Hangs until Session 1 commits

-- With MVCC: SELECT sees a committed snapshot, UPDATE proceeds
-- Session 1
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT stock_count FROM inventory WHERE product_id = 1001;
-- Returns stock_count = 50 (snapshot taken for this statement)

-- Session 2
BEGIN;
UPDATE inventory SET stock_count = 49 WHERE product_id = 1001;
COMMIT;
-- Session 1 still holds the 50 it read; under READ COMMITTED a re-read returns 49

Output

Session 1: stock_count = 50

Session 2: UPDATE successful immediately

Session 1 (re-read): stock_count = 49

Production Trap:

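MVCC doesn't eliminate write conflicts. Two concurrent UPDATEs on the same row still contend: under READ COMMITTED the second waits for the first to finish, and under REPEATABLE READ or SERIALIZABLE in PostgreSQL it fails with a serialization error. Always retry on 40001 error codes.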

Key Takeaway

MVCC trades storage for concurrency. If your reads block on writes, you're not using MVCC correctly.

Types of MVCC in DBMS: Pick Your Poison

Not all MVCC is created equal. There are four major flavors, and picking wrong means either constant retries or a storage bill that rivals your rent.

Timestamp-Based MVCC tags every transaction with a monotonically increasing timestamp. The database compares timestamps to decide visibility. It's simple, deterministic, and breaks under clock skew. If your system clock jumps, transactions see ghosts. Don't use this on distributed databases without hardware-level clock sync.

Snapshot-Based MVCC gives each transaction a point-in-time snapshot of the entire database at start. PostgreSQL and Oracle use this. Every write creates a new row version visible only to transactions that started after the write committed. Reads never block. Writes check for conflicts when they touch a row or at commit, depending on the engine. The cost? Bloat. Old row versions pile up until a vacuum or purge cleans them.

History-Based MVCC keeps every row version forever — or at least until a retention policy says otherwise. Great for audit trails and time-travel queries. Terrible for write-heavy workloads because the history chain grows unbounded. InnoDB's undo log is a hybrid: it keeps enough history for rollback and consistent reads, then prunes aggressively.

Hybrid MVCC combines approaches. CockroachDB uses a hybrid clock (HLC) for timestamps but snapshots internally. The trade-off: more complexity, better resilience against clock drift.

MVCCImplementationCheck.sqlSQL

-- io.thecodeforge - database tutorial
-- Check your database's MVCC flavor
-- PostgreSQL: snapshot-based, uses xmin/xmax
SELECT txid_current();
-- pg_stat_activity is a view, so it exposes backend_xid/backend_xmin rather than xmin/xmax
SELECT pid, backend_xid, backend_xmin FROM pg_stat_activity;

-- MySQL/InnoDB: history-based via undo log
SHOW ENGINE INNODB STATUS\G
-- Look for "History list length" under TRANSACTIONS

-- CockroachDB: hybrid, uses HLC
SHOW CLUSTER SETTING server.clock.forward_jump_check_enabled;

-- Oracle: snapshot-based with UNDO tablespace
SELECT BEGIN_TIME, END_TIME, UNDOBLKS FROM V$UNDOSTAT;

Output

PostgreSQL: txid_current = 12345

MySQL/InnoDB: History list length = 47

CockroachDB: server.clock.forward_jump_check_enabled = on

Senior Shortcut:

If your workload is read-heavy, snapshot-based MVCC wins. If you need point-in-time recovery, history-based wins. Both need aggressive cleanup — monitor your 'dead tuple' count in PostgreSQL or 'history list length' in MySQL.

Key Takeaway

Know your MVCC type before tuning. Snapshot systems need vacuuming. History systems need undo log limits. Timestamp systems need clock hygiene.

Two-Phase Locking: Why Your Reads Block Writes (And How to Fix It)

You think MVCC means no locks? Wrong. Two-phase locking (2PL) is the dirty secret under the hood. MVCC eliminates read-write conflicts for standard queries, but the moment you touch Serializable isolation in a lock-based engine such as InnoDB, or run DDL, 2PL kicks in and your transactions serialize.

Phase one: acquire all locks. Read locks, write locks, range locks: you grab them as you go. Phase two: release them all at commit or rollback. No releasing early. That's the rule. Why? Because early release lets phantom reads and write skew slip through. InnoDB's Serializable mode enforces this with next-key and gap locks on index ranges. PostgreSQL takes a different route: its Serializable level uses predicate locks (SIReadLock) that never block anyone, and it aborts a transaction with a serialization failure when it detects a dangerous pattern.

Here's the production pain: long transactions accumulate locks. In InnoDB, a report that scans 10 million rows under Serializable holds shared locks on every index record it touches. Meanwhile, a simple UPDATE on a single row blocks, not because of MVCC, but because 2PL won't let the write proceed until the reader releases. In PostgreSQL the same report doesn't block the UPDATE, but it raises the odds that one of the two is aborted and must retry. Monitor pg_locks or performance_schema.data_locks. Short transactions are not optional.

Shortcut: if you see deadlocks under Serializable, you're likely holding locks in different orders across transactions. Enforce a consistent lock order in your app code.

TwoPhaseLockDemo.sqlSQL

-- io.thecodeforge - database tutorial
-- MySQL InnoDB: under SERIALIZABLE, plain SELECTs take shared locks

-- Session A: acquires shared locks on rows 1-100
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT * FROM orders WHERE id BETWEEN 1 AND 100;
-- Locks held. Phase 1.

-- Session B: tries to update row 50 → BLOCKED
UPDATE orders SET status = 'shipped' WHERE id = 50;
-- Waits for Session A's shared lock.

-- Session C: inspect the waiting lock while B is blocked
SELECT object_name, lock_type, lock_mode, lock_status, lock_data
FROM performance_schema.data_locks WHERE lock_status = 'WAITING';

-- Session A: commits → Phase 2 releases locks
COMMIT;
-- Session B: now proceeds.
-- (PostgreSQL SERIALIZABLE would not block here; it uses non-blocking
-- predicate locks and may abort one transaction with SQLSTATE 40001.)

Output

object_name | lock_type | lock_mode | lock_status | lock_data

-------------+-----------+---------------+-------------+-----------

orders | RECORD | X,REC_NOT_GAP | WAITING | 50

Production Trap:

Never use Serializable isolation without monitoring lock waits and serialization failures. One slow transaction can bring your entire write path to a crawl. Use SET lock_timeout in PostgreSQL or innodb_lock_wait_timeout in MySQL to fail fast on contention.

Key Takeaway

2PL forces locks to be held until commit — long transactions under Serializable kill concurrency. Keep them short or drop to Repeatable Read.

MVCC + 2PL: The Truth About Snapshot Isolation and Write Conflicts

You've read that MVCC gives you snapshot isolation: readers never block writers, writers never block readers. That's true until a write conflict happens. Then row locks muscle in, and one transaction either waits or aborts. Here's the why.

Snapshot isolation uses row versions to let each transaction see a consistent snapshot from its start time. No locks needed for reads. But when two transactions try to UPDATE the same row concurrently, the database must serialize the writes. It can't just merge two versions; that's a conflict. In both engines the second updater waits on the first's row lock. The difference is what happens once the first commits: under Repeatable Read, PostgreSQL applies a first-updater-wins rule and aborts the waiter with a serialization failure, while InnoDB lets the waiter proceed against the newly committed row.

Production takeaway: 'no locks' is marketing. Real MVCC implementations use a hybrid. Reads are lock-free. Conflicting writes are not. This is why you see occasional serialization failures under Repeatable Read in PostgreSQL: the database is honest about conflicts rather than corrupting data.

If your app retries serialization failures, you're working with MVCC + 2PL. Design for that. Exponential backoff. Three retries max. Don't pretend the locks don't exist.
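A retry wrapper along those lines might look like the following Python sketch. It is driver-agnostic by design: `SerializationFailure` is a stand-in exception defined here for illustration (a real driver raises its own error type carrying SQLSTATE 40001), and `run_with_retry` and its parameters are hypothetical names.

```python
import random
import time


class SerializationFailure(Exception):
    """Stand-in for a driver error carrying SQLSTATE 40001."""
    sqlstate = "40001"


def run_with_retry(txn, max_retries=3, base_delay=0.05, sleep=time.sleep):
    """Run txn(); on serialization failure, back off exponentially and retry."""
    for attempt in range(max_retries + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # Jitter keeps colliding transactions from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
```

The callable passed as `txn` should open a fresh transaction on every call, since the failed one must be rolled back before retrying. The `sleep` parameter exists only so tests can skip the real delay.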

WriteConflictDemo.sqlSQL

-- io.thecodeforge - database tutorial

-- Session A: Begin snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT stock FROM inventory WHERE product_id = 42;
-- Returns 10

-- Session B: Same snapshot, reads 10
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT stock FROM inventory WHERE product_id = 42;
-- Returns 10

-- Session A: Updates first and takes the row lock
UPDATE inventory SET stock = 9 WHERE product_id = 42;
-- Succeeds. Row version created.

-- Session B: Tries same update
UPDATE inventory SET stock = 9 WHERE product_id = 42;
-- BLOCKED in both engines until Session A commits or rolls back

-- Session A commits
COMMIT;
-- Session B, InnoDB: the UPDATE proceeds against the committed row
-- Session B, PostgreSQL: ERROR: could not serialize access due to concurrent update; roll back and retry

Output

-- PostgreSQL case:

ERROR: could not serialize access due to concurrent update

-- InnoDB case:

Query OK, 1 row affected (1.23 sec)

Senior Shortcut:

Design your write path to handle serialization failures. Use optimistic concurrency control with version columns (e.g., UPDATE ... WHERE version = :old_version) to avoid 2PL locking entirely for high-contention rows.
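Here is a small runnable sketch of the version-column pattern. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; the UPDATE has the same shape in PostgreSQL or MySQL. The table layout and the helper names (`make_db`, `read_row`, `decrement_stock`) are assumptions for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory table with a version column and one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO inventory VALUES (42, 10, 1)")
    conn.commit()
    return conn


def read_row(conn, product_id):
    """Return (stock, version) as the caller currently sees them."""
    return conn.execute(
        "SELECT stock, version FROM inventory WHERE product_id = ?", (product_id,)
    ).fetchone()


def decrement_stock(conn, product_id, seen_stock, seen_version):
    """Write only if the row still has the version we read; report success."""
    cur = conn.execute(
        "UPDATE inventory SET stock = ?, version = version + 1 "
        "WHERE product_id = ? AND version = ?",
        (seen_stock - 1, product_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means someone else updated first
```

Two callers that both read `(10, 1)` will not both succeed: the first update bumps the version to 2, so the second matches zero rows and has to re-read and retry instead of silently overwriting.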

Key Takeaway

MVCC gives lock-free reads, but conflicting writes fall back to row locks: the second writer either waits and proceeds (InnoDB) or waits and aborts (PostgreSQL under Repeatable Read or Serializable). Expect and retry failures.

Introduction

Multiversion Concurrency Control (MVCC) is the engine that lets modern databases handle thousands of simultaneous transactions without locking everything to a crawl. The core problem is simple: a database must keep data consistent when multiple users read and write at the same time. Traditional locking would force a writer to block every reader, killing throughput. MVCC solves this by keeping old versions of each row around, so a reader sees a consistent snapshot of the database as it existed when their transaction began, no matter what other transactions are doing. This means reads never wait for writes, and writes never wait for reads. Understanding MVCC's purpose is critical before diving into specific implementations because it explains why every design trade-off — from storage overhead to vacuum management — exists. Without MVCC, high-concurrency workloads would grind to a halt. With it, systems like PostgreSQL and MySQL can handle thousands of transactions per second while maintaining full ACID guarantees.

mvcc_intro.sqlSQL

-- io.thecodeforge - database tutorial
-- Show MVCC in action: two concurrent sessions
-- Session A (Transaction 1)
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 100);
-- Session B (Transaction 2): REPEATABLE READ keeps one snapshot for the whole transaction
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- This read runs before A commits, so the row is not visible
SELECT balance FROM accounts WHERE id = 1;
-- Returns empty
-- Now A commits (Session A)
COMMIT;
-- Session B still sees old snapshot
SELECT balance FROM accounts WHERE id = 1;
-- Still empty! No blocking needed.
COMMIT;

Output

Empty (before commit)

Empty (after commit, still old snapshot)

Production Trap:

The snapshot isolation that makes reads free also hides current writes. Long-running transactions keep making decisions on stale data, which invites write skew if you don't match isolation levels to your workload.

Key Takeaway

MVCC separates read and write paths: readers see snapshots, writers create versions.

Two-Phase Locking Protocol

Two-Phase Locking (2PL) is the strict sibling of MVCC. While MVCC avoids locks for reads, 2PL ensures serializability by forcing transactions to acquire all locks before releasing any. It works in two phases: a growing phase where locks are obtained (never released), and a shrinking phase where locks are released (never acquired). The critical property is that 2PL prevents dirty reads, non-repeatable reads, and phantom rows by locking rows, index ranges, or tables. However, the cost is severe: readers block writers and writers block readers. In practice, 2PL is rarely used alone in modern databases. Instead, systems like MySQL's InnoDB combine MVCC for reads with strict 2PL for writes (write locks are held until commit), which handles write conflicts while keeping plain reads lock-free. The trade-off is clear: 2PL guarantees stronger consistency at the price of throughput. Understanding 2PL matters because it explains the original solution and why MVCC evolved to fix its blocking behavior.
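The growing and shrinking phases can be sketched in a few lines of Python. This is a deliberately tiny model under stated assumptions: `TwoPhaseTxn` is a hypothetical class, it only handles exclusive locks, and a blocked request simply returns False where a real lock manager would queue the caller.

```python
class TwoPhaseTxn:
    """Toy strict 2PL transaction over a shared dict of key -> owner name."""

    def __init__(self, name, lock_table):
        self.name = name
        self.lock_table = lock_table
        self.held = set()
        self.shrinking = False  # flips to True once locks start being released

    def lock(self, key):
        """Growing phase: try to take an exclusive lock on key."""
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot acquire after releasing")
        owner = self.lock_table.get(key)
        if owner is not None and owner != self.name:
            return False  # held by another transaction; a real engine would wait
        self.lock_table[key] = self.name
        self.held.add(key)
        return True

    def commit(self):
        """Shrinking phase: release every lock at once (strict 2PL)."""
        self.shrinking = True
        for key in self.held:
            del self.lock_table[key]
        self.held.clear()
```

With a shared `lock_table = {}`, a second transaction's `lock("accounts:1")` returns False while the first holds the row and succeeds only after the first commits, which mirrors the blocking behavior described above.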

two_phase_locking.sqlSQL

-- io.thecodeforge - database tutorial
-- Two-Phase Locking example (simulated)
-- Transaction T1
BEGIN;
-- Phase 1 (growing): acquire locks
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;
-- T1 has exclusive lock on row id=1
UPDATE accounts SET balance = 200 WHERE id = 1;
-- T2's SELECT ... FOR UPDATE blocks here
COMMIT;
-- Phase 2 (shrinking): all locks released at commit
-- Transaction T2: now acquires the lock and proceeds
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- Returns 200 (latest committed)
COMMIT;

Output

T1 updates balance to 200.

T2 waits until T1 commits, then reads 200.

Production Trap:

2PL can cause deadlocks when two transactions hold locks and wait for each other. Always set a lock timeout (e.g., innodb_lock_wait_timeout) to avoid hung connections piling up.

Key Takeaway

2PL ensures serializability but blocks conflicting concurrent access; use MVCC to keep reads moving.

● Production incident · Post-mortem · Severity: high

The Autovacuum That Ate the Weekend

Symptom

Queries on a frequently updated table degrade over hours, then crash the application with 'could not extend file' or deadlock timeouts. pg_stat_user_tables shows n_dead_tup > 50% of n_live_tup.

Assumption

The DBA assumed autovacuum would handle it. The dev team assumed the slow query was an indexing problem.

Root cause

The table had 2M rows, 500 writes/sec, and the default autovacuum trigger of 50 + 0.2 × row count (autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times the table's row estimate). That trigger is crossed at ~400k dead rows, which is already 20% bloat, and the table hit 1M dead rows before autovacuum's cost_delay allowed it to catch up. Dead tuples inflated the table to 12GB, and sequential scans became painful.

Fix

Set autovacuum_vacuum_scale_factor = 0.01 for this table, lowered autovacuum_vacuum_threshold to 1000, and ensured autovacuum_max_workers were not starved by other tables. Also increased maintenance_work_mem to 1GB for the vacuum process.

Key lesson

Default autovacuum settings are tuned for OLTP workloads with moderate write rates — not high-write tables.
Monitor n_dead_tup / n_live_tup ratio per table; alert when it exceeds 30%.
Always test VACUUM timing under production write load before going live.
Don't assume 'it's been fine for months' — write patterns change, and bloat accumulates silently.

Production debug guide · Symptom → Action · 4 entries

Symptom · 01

Queries slow down over time; table size much larger than data volume

→

Fix

Check pg_stat_user_tables.n_dead_tup vs n_live_tup. If > 30%, run VACUUM (but not FULL unless desperate). Set up autovacuum alerting.

Symptom · 02

Transactions fail with 'could not serialize access' or 'snapshot too old' errors

→

Fix

Check pg_stat_activity for long-running or idle-in-transaction sessions. Kill them or set idle_in_transaction_session_timeout. Retry transactions that fail with a serialization error.

Symptom · 03

Write skew anomalies despite using SERIALIZABLE isolation

→

Fix

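In PostgreSQL, SERIALIZABLE is implemented as SSI and does prevent write skew, but only among transactions that all run at SERIALIZABLE. Check that no participant runs at a lower level (REPEATABLE READ included) and that the application retries on serialization failures (SQLSTATE 40001). Where SERIALIZABLE is not an option, lock the rows the decision depends on with SELECT ... FOR UPDATE.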
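The explicit-locking route can be sketched as follows. The doctors table and its columns are hypothetical, and the rule being protected is "at least one doctor stays on call".

```sql
-- Hypothetical schema: doctors(name text PRIMARY KEY, on_call boolean)
BEGIN;

-- Lock every row the decision depends on. A concurrent session running the
-- same statement waits here until this transaction commits or rolls back.
SELECT name
FROM doctors
WHERE on_call
FOR UPDATE;

-- The application checks that at least two rows came back, and only then:
UPDATE doctors SET on_call = false WHERE name = 'alice';

COMMIT;
```

FOR UPDATE locks only rows that already exist, so it does not stop another session from inserting a new matching row. It is also not allowed together with aggregate functions, which is why the sketch selects the rows themselves instead of count(*).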

Symptom · 04

InnoDB undo tablespace grows unboundedly

→

Fix

Check information_schema.INNODB_METRICS for undo and purge counters. Set innodb_undo_log_truncate = ON and lower innodb_max_undo_log_size if the 1GB default is too large for your disk budget. Monitor the history list length with SHOW ENGINE INNODB STATUS.
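A sketch of those checks in MySQL, assuming a version where both variables can be set at runtime. The 512 MiB limit is an illustrative value, not a recommendation.

```sql
-- Enable truncation and cap the undo tablespace size (value in bytes).
SET GLOBAL innodb_undo_log_truncate = ON;
SET GLOBAL innodb_max_undo_log_size = 536870912;  -- 512 MiB, illustrative

-- History list length: undo records that purge has not yet processed.
SELECT `NAME`, `COUNT`
FROM information_schema.INNODB_METRICS
WHERE `NAME` = 'trx_rseg_history_len';

-- Oldest open transactions, which are the usual reason purge cannot advance.
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;
```

Truncation can only reclaim space once purge has caught up, so a steadily rising history list length usually points at a long-running transaction rather than at the size limit.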

★ MVCC Quick Debug Cheat Sheet

Use these commands to diagnose MVCC problems in PostgreSQL (Pg) and MySQL InnoDB (Inno).

Table bloat / dead tuples accumulating

Immediate action

Check dead tuple count

Commands

SELECT schemaname, relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC; -- Pg

SHOW TABLE STATUS LIKE 'your_table'; -- Inno (Data_free > 100MB indicates bloat)

Fix now

Run VACUUM table_name; (not FULL) for PostgreSQL, or OPTIMIZE TABLE for InnoDB (with pt-online-schema-change if big).
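For PostgreSQL, a manual pass might look like the sketch below, with orders standing in as a hypothetical table name.

```sql
-- Size before, for comparison.
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS total_size;

-- Plain VACUUM with progress output; ANALYZE refreshes planner statistics too.
VACUUM (VERBOSE, ANALYZE) orders;

-- Check that the dead tuple count dropped.
SELECT n_live_tup, n_dead_tup, last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

Plain VACUUM marks dead space as reusable but usually does not shrink the file on disk, so the reported size may stay the same even though new writes stop growing the table.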

Long-running idle-in-transaction queries blocking vacuum

Snapshot too old error (PostgreSQL)

Transaction ID wraparound imminent

InnoDB rollback segment too large

MVCC Implementation Comparison: PostgreSQL vs MySQL InnoDB

Aspect	PostgreSQL	MySQL InnoDB
Version storage	Heap table — dead tuples exist in same table space	Undo tablespace — only current version in index, old versions in rollback segments
Bloat risk	High if autovacuum falls behind. Dead tuples bloat heap.	Low in table space; undo tablespace can grow large if long transactions hold references.
Vacuum / Purge	Manual VACUUM or autovacuum needed to reclaim dead tuple space.	Automatic purge of undo records; no vacuum, but undo truncation must be enabled.
Snapshot isolation level	Read Committed (statement-level snapshot) and Repeatable Read (transaction-level snapshot). SERIALIZABLE uses SSI.	Repeatable Read default (with next-key locking). SERIALIZABLE uses two-phase locking, not SI.
Write skew protection	Only SERIALIZABLE (SSI) prevents write skew. Repeatable Read allows it.	Only SERIALIZABLE (with locking) prevents write skew; Repeatable Read allows it, but next-key locks reduce some forms.
Performance for read-heavy workloads	Excellent — read queries never block. No locks on reads except at SERIALIZABLE.	Excellent, but next-key locking in Repeatable Read can cause lock contention under heavy index access.

Key takeaways

You now understand what MVCC is and why it exists

You've seen how PostgreSQL and InnoDB implement version storage differently

Snapshot isolation prevents dirty reads but not write skew; use SERIALIZABLE for absolute consistency

Autovacuum tuning is not optional: monitor dead tuple ratio per table

PostgreSQL uses xmin/xmax; InnoDB uses undo logs. Both store versions but with different space trade-offs

Long-running transactions are the silent killers of MVCC performance

Common mistakes to avoid

6 patterns

Memorising syntax before understanding the concept

Symptom

Unable to troubleshoot production bloat or snapshot age issues because the underlying MVCC visibility rules are not understood.

Fix

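Instead of memorising SQL syntax, focus on the visibility algorithm. Simplified: a row version is visible to a snapshot if the transaction in xmin committed before the snapshot was taken, and xmax is 0, belongs to an aborted transaction, or belongs to a transaction that had not committed when the snapshot was taken. Run test scenarios to internalise the rules.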
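A small test scenario along those lines, using PostgreSQL's hidden system columns. The table mvcc_demo is made up for the exercise, and the actual transaction IDs will differ on every system.

```sql
-- Scratch table for watching row versions.
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val text);
INSERT INTO mvcc_demo VALUES (1, 'a');

-- xmin is the inserting transaction; xmax is 0 while nobody has
-- deleted or updated the row version.
SELECT ctid, xmin, xmax, * FROM mvcc_demo;

-- Session 1: update the row but do not commit yet.
BEGIN;
UPDATE mvcc_demo SET val = 'b' WHERE id = 1;

-- Session 2 (separate connection), while session 1 is still open:
--   SELECT ctid, xmin, xmax, * FROM mvcc_demo;
-- It still returns val = 'a', now with a non-zero xmax: the old version is
-- marked by the uncommitted updater but remains visible to other snapshots.

-- Session 1:
COMMIT;

-- Session 2 now sees a new row version (different ctid, new xmin).
SELECT ctid, xmin, xmax, * FROM mvcc_demo;
```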

Skipping practice and only reading theory

Symptom

When a real incident occurs (e.g., snapshot too old), you have no hands-on experience to diagnose if it's a long transaction or a vacuum issue.

Fix

Set up a test database with pgbench or sysbench. Simulate high write loads and practice monitoring n_dead_tup, running VACUUM, and watching the effects.

Assuming autovacuum default settings work for all tables

Symptom

Queries gradually degrade as table bloat increases. pg_stat_user_tables shows high n_dead_tup.

Fix

Per-table tuning: set autovacuum_vacuum_scale_factor = 0.01 for write-heavy tables, monitor n_dead_tup / n_live_tup ratio.

Using REPEATABLE READ (or Read Committed) for financial transactions that need absolute consistency

Symptom

Write skew anomalies: two transactions read the same data, each updates a subset, and the final state violates constraints.

Fix

Use SERIALIZABLE isolation level and implement retry logic for serialisation failures.
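The classic on-call example shows what the retry logic has to handle. It is a sketch with a hypothetical doctors table, run from two separate connections against PostgreSQL.

```sql
-- Hypothetical schema: doctors(name text PRIMARY KEY, on_call boolean),
-- with 'alice' and 'bob' both on call.

-- Session A
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors WHERE on_call;
UPDATE doctors SET on_call = false WHERE name = 'alice';

-- Session B, started before A commits
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors WHERE on_call;
UPDATE doctors SET on_call = false WHERE name = 'bob';

-- Session A
COMMIT;

-- Session B
COMMIT;
-- One of the two transactions is expected to be rejected with a
-- serialization failure (SQLSTATE 40001). The application should
-- roll back and re-run that whole transaction from the start.
```

On the retry the count reflects the other session's committed change, so the application can refuse to take the last doctor off call.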

Ignoring the undo tablespace in MySQL InnoDB

Symptom

Disk space fills up due to undo log growth. Long-running transactions prevent purge.

Fix

Enable innodb_undo_log_truncate, set innodb_max_undo_log_size, and monitor with INNODB_METRICS.

Not setting idle_in_transaction_session_timeout in PostgreSQL

Symptom

Vacuum unable to remove dead tuples because a long-running transaction holds a snapshot. Table bloat extreme.

Fix

Set idle_in_transaction_session_timeout to a reasonable value (e.g., 5 minutes) in postgresql.conf.
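The same setting can be applied without editing the file by hand. This is a sketch: the 5 minute value follows the example above, and app_user is a hypothetical role name.

```sql
-- Cluster-wide default, written to postgresql.auto.conf and picked up on reload.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();

-- Or scope it to one application role; applies to that role's new sessions.
-- ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';

-- Check the value in effect for the current session.
SHOW idle_in_transaction_session_timeout;
```

Scoping the timeout to application roles leaves interactive admin sessions and long maintenance jobs unaffected.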

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain how MVCC works differently in PostgreSQL vs MySQL InnoDB.

Q02 · SENIOR

What is write skew and why does MVCC snapshot isolation not prevent it?

Q03 · SENIOR

How does PostgreSQL determine whether a row version is visible to a snap...

Q04 · SENIOR

You have a table in PostgreSQL with 500 writes per second and queries ar...

Q05 · JUNIOR

What is the difference between Read Committed and Repeatable Read in Pos...

Q01 of 05 · SENIOR

Explain how MVCC works differently in PostgreSQL vs MySQL InnoDB.

ANSWER

PostgreSQL stores old row versions directly in the heap table, tagged with xmin and xmax transaction IDs. Dead versions remain in the heap, invisible to new snapshots, until VACUUM reclaims them. InnoDB keeps only the current version in the clustered index and uses undo logs to reconstruct older versions on demand. This means PostgreSQL's table bloat is more visible and simpler to inspect, while InnoDB's version storage sits in a separate undo tablespace that grows silently. PostgreSQL requires autovacuum to clean dead tuples; InnoDB purges undo records automatically. The trade-off: Postgres has better visibility into bloat but more tuning surface; InnoDB is more hands-off but can hide disk growth until it's critical.

Last updated: May 23, 2026
