Senior 3 min · June 25, 2026

LSM Trees and SSTables: How They Work and Why They Dominate Write-Heavy Workloads

Q: What is the difference between an LSM Tree and a B-Tree?

LSM Trees optimize for writes by batching them into sequential writes, while B-Trees optimize for reads with in-place updates. LSM Trees have higher write amplification but can absorb more writes per second. B-Trees have better read performance for range scans and point lookups.

Q: How do I reduce write amplification in an LSM Tree?

Use size-tiered compaction instead of leveled, increase the level size multiplier (e.g., from 10 to 20), or reduce the number of levels. Also, increase memtable size to reduce flush frequency. Each compaction pass rewrites data, so fewer compactions = less amplification.

Q: How do I tune bloom filters in RocksDB for better read performance?

Set bloom_filter_bits_per_key in the column family options. 10 bits gives ~1% false positive rate; 20 bits gives ~0.02%. Monitor rocksdb_bloom_filter_useful and rocksdb_bloom_filter_full_positive to see actual effectiveness. Also enable format_version=5 for partitioned filters.

Q: What causes tombstone accumulation and how do I fix it?

Tombstones are markers for deleted keys. They accumulate when compaction doesn't run frequently enough to merge SSTables and drop the tombstones. Fix: increase compaction frequency (reduce min_merge_width), or trigger manual compaction. Also, avoid frequent deletes of individual keys — batch them.

LSM Trees explained with production internals: compaction, bloom filters, write amplification, and when to avoid them.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

LSM Trees absorb high write throughput by converting random writes into sequential ones, then use background compaction to maintain read performance. SSTables are the immutable on-disk files that store sorted key-value pairs, indexed for efficient lookup.

✦ Definition~90s read

What is LSM Trees and SSTables?

An LSM Tree (Log-Structured Merge-Tree) is a data structure that batches writes into in-memory tables, then flushes them as immutable SSTables on disk, merging them in the background. It's the engine behind Cassandra, RocksDB, LevelDB, and Bigtable.

★

Imagine you're a librarian who gets a flood of new book requests every second.

Plain-English First

Imagine you're a librarian who gets a flood of new book requests every second. Instead of shelving each book immediately (which would be chaos), you dump them all into a pile on your desk. When the pile gets big enough, you sort the pile and write it into a new, permanent shelf section. Later, you merge old shelf sections together to keep things tidy. That's an LSM Tree — the pile is the memtable, the shelf sections are SSTables, and the merging is compaction.

You've got a write-heavy workload — time-series metrics, user events, chat messages — and your B-tree starts choking. Writes slow to a crawl, your disk I/O spikes, and the pager goes off at 2 AM. The problem? B-trees do random writes, and spinning disks (even SSDs) hate random writes. LSM Trees solve this by turning every write into a sequential append. That's the entire trick. And it's why every modern write-optimized store — Cassandra, RocksDB, Bigtable — uses them. By the end of this, you'll understand the internals well enough to tune compaction, diagnose write amplification, and know exactly when NOT to use an LSM Tree.

The Write Path: From Memtable to SSTable

Every write hits the memtable — an in-memory sorted structure (usually a skip list or red-black tree). This is fast: O(log n) insert, no disk I/O. When the memtable hits a size threshold (e.g., 64MB in RocksDB), it's frozen and a new memtable takes over. The frozen one is flushed to disk as an SSTable — a sorted, immutable file. This flush is a sequential write, so it's fast even on spinning rust. The key insight: all writes are sequential appends until compaction. No random writes, ever. This is why LSM Trees can absorb 100k+ writes/sec on a single node.

MemtableFlush.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Simplified memtable flush logic (RocksDB-like)
class Memtable {
    SkipList<String, String> data = new SkipList<>();
    int sizeBytes = 0;
    static final int MAX_SIZE = 64 * 1024 * 1024; // 64MB

    void put(String key, String value) {
        data.insert(key, value);
        sizeBytes += key.length() + value.length();
        if (sizeBytes >= MAX_SIZE) {
            triggerFlush(); // freeze and hand off to flush thread
        }
    }

    void triggerFlush() {
        // Flush thread picks this up, writes sorted data to SSTable
        // This is sequential write — no seeks
    }
}

// SSTable on disk: sorted key-value pairs + index + bloom filter
// File format: [Data Block 1][Data Block 2]...[Index][Bloom Filter][Footer]

Output

No direct output — this is internal logic.

Senior Shortcut: Memtable Size Tuning

Larger memtables reduce flush frequency but increase memory pressure. Rule of thumb: set memtable size to 1/4 of your write buffer budget. In RocksDB, write_buffer_size and max_write_buffer_number control this.

thecodeforge.io

LSM Tree Write & Read Path Flow

Lsm Trees Sstables

thecodeforge.io

Write Path: Memtable to SSTable

Lsm Trees Sstables

The Read Path: Why LSM Trees Can Be Slow (and How Bloom Filters Save You)

Reads are the Achilles' heel. To find a key, you must search the memtable, then all SSTables from newest to oldest. Without optimization, that's O(#SSTables * log N) — terrible. Two things fix this: bloom filters and a multi-level index. Bloom filters tell you 'this key is definitely not in this SSTable' with a configurable false positive rate. The index (a sparse map of key → offset) lets you binary-search within an SSTable. In practice, a point lookup touches 1-2 SSTables on average if bloom filters are tuned right. But range scans still suck — they may need to merge across multiple SSTables.

PointLookup.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

class SSTable {
    BloomFilter bloom;
    Index index; // key -> offset in data file
    RandomAccessFile dataFile;

    String get(String key) {
        if (!bloom.mightContain(key)) return null; // definitely not here
        long offset = index.lookup(key);
        if (offset == -1) return null;
        // Read block at offset, binary search within block
        return readBlockAndSearch(offset, key);
    }
}

// Read path: check memtable -> check each SSTable in L0 -> check L1 -> ...
// Bloom filter avoids most disk reads. Tune bits_per_key: 10 bits = ~1% false positive.

Output

No direct output — this is internal logic.

Production Trap: Bloom Filter Memory

Bloom filters live in memory. With 10 bits per key and 1 billion keys, that's 1.25GB per SSTable. If you have 100 SSTables, that's 125GB. Use partitioned bloom filters (RocksDB's format_version=5) to only load relevant filter blocks.

Compaction: The Background Merge That Keeps Everything Sane

Compaction is the process of merging multiple SSTables into one, discarding overwritten and deleted keys. Without compaction, you'd have thousands of tiny SSTables and reads would be catastrophic. There are two main strategies: size-tiered (Cassandra default) merges SSTables of similar size; leveled (RocksDB, LevelDB) organizes SSTables into levels with exponentially increasing size. Leveled compaction is better for reads (fewer SSTables to check) but causes more write amplification. Size-tiered is simpler and has less write amplification but worse read performance. Choose based on your workload: read-heavy → leveled, write-heavy → size-tiered.

CompactionStrategy.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Leveled compaction: each level is 10x the previous
// L0: files from memtable flushes (overlapping ranges)
// L1: non-overlapping, sorted, 10x size of L0
// L2: 100x L0, etc.
// When L1 exceeds target size, pick a file and merge with overlapping files in L2

class LeveledCompaction {
    static final double LEVEL_MULTIPLIER = 10.0;
    long maxBytesForLevel(int level) {
        return (long)(baseLevelSize * Math.pow(LEVEL_MULTIPLIER, level));
    }

    void maybeCompact(Level[] levels) {
        for (int i = 0; i < levels.length - 1; i++) {
            if (levels[i].totalSize > maxBytesForLevel(i)) {
                // Pick a file from level i, merge with overlapping files in level i+1
                doCompaction(levels[i], levels[i+1]);
            }
        }
    }
}

// Write amplification: each byte written may be rewritten multiple times during compaction.
// Leveled: ~10-30x. Size-tiered: ~3-5x.

Output

No direct output — this is internal logic.

Interview Gold: Write Amplification

Interviewers love write amplification. Remember: leveled compaction has higher write amplification because data moves up levels. If your workload is write-heavy and you care about SSD lifespan, size-tiered is better. If reads are critical, leveled wins.

When NOT to Use an LSM Tree

LSM Trees are not a silver bullet. If your workload is read-heavy with large range scans (e.g., 'SELECT * FROM orders WHERE user_id = ? ORDER BY timestamp'), a B-tree will crush an LSM Tree because data is stored contiguously on disk. LSM Trees also suffer from space amplification — old versions of keys linger until compaction catches up. If you need strict ACID transactions with point-in-time snapshots, a B-tree with MVCC (like Postgres) is simpler. Also, if your data fits entirely in memory, an LSM Tree adds unnecessary complexity. Use a hash map or a simple sorted array.

Never Do This: LSM for OLTP

Don't use Cassandra or RocksDB for a high-consistency OLTP workload with many small random reads. You'll fight with tombstone accumulation and read repair. Use Postgres or MySQL with InnoDB.

thecodeforge.io

LSM Tree vs B-Tree for Reads

Lsm Trees Sstables

● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom

Cassandra node OOM-killed every 6 hours. Heap was 4GB, data size 500GB.

Assumption

Thought it was a memory leak in the JVM. Tried increasing heap to 8GB — still died.

Root cause

Compaction was falling behind. Pending compactions grew unbounded, causing memtable flushes to pile up. Each flush consumed memory for new SSTable writers, and the OS page cache ballooned. The real culprit: compaction throughput was throttled by a single-threaded compaction strategy on a 32-core machine.

Fix

Switched from SizeTieredCompactionStrategy to LeveledCompactionStrategy. Set concurrent_compactors to 4. Also increased memtable_heap_space_in_mb to 2048 to reduce flush frequency.

Key lesson

Compaction is not optional background work — it's the heartbeat of an LSM Tree.
If it stalls, the system dies.

Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries

Symptom · 01

Write stalls: 'Write stall' in RocksDB logs, or Cassandra 'FlushWriter' threads blocked

→

Fix

1. Check 'rocksdb.db.write.stall' metric. 2. Reduce max_write_buffer_number to limit memtable count. 3. Increase write_buffer_size to reduce flush frequency. 4. Add more compaction threads (max_background_compactions).

Symptom · 02

Read latency spikes: P99 read latency jumps 10x

→

Fix

1. Check bloom filter false positive rate. 2. Increase bits_per_key. 3. Check number of SSTables per level — if L0 has many files, compaction is falling behind. 4. Force a manual compaction (compact-range in Cassandra, CompactRange in RocksDB).

Symptom · 03

Disk space growing faster than data: 'Space amplification' metric high

→

Fix

1. Check tombstone count (estimate-num-keys vs actual). 2. If many deletes, tune compaction filter to drop tombstones earlier. 3. Increase min_merge_width to merge more files per compaction. 4. Consider switching to leveled compaction if using size-tiered.

★ LSM Trees and SSTables Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.

Write stalls: `rocksdb.db.write.stall` > 0−

Immediate action

Check pending compaction bytes

Commands

curl localhost:8080/metrics | grep rocksdb_compaction_pending_bytes

curl localhost:8080/metrics | grep rocksdb_num_running_compactions

Fix now

Increase max_background_compactions to number of CPU cores, reduce write_buffer_size by half

High read latency: P99 > 100ms+

Disk space explosion: `df -h` shows 90%++

OOM: Java heap or RSS exceeds limit+

Feature / Aspect	LSM Tree	B-Tree
Write throughput	High (sequential writes)	Moderate (random writes)
Read throughput (point lookup)	Moderate (bloom filter helps)	High (direct index)
Range scan performance	Poor (merge across SSTables)	Excellent (contiguous blocks)
Write amplification	3-30x	1x (in-place update)
Space amplification	Moderate (old versions)	Low (in-place)
Compaction overhead	Background, can spike CPU/IO	None (splits/merges rare)
Typical use case	Write-heavy, time-series, logs	OLTP, read-heavy, transactions

Key takeaways

LSM Trees convert random writes into sequential ones

that's the entire superpower.

Bloom filters are essential for read performance; tune bits_per_key based on your false positive tolerance.

Compaction is not optional

if it falls behind, the system stalls. Monitor pending compaction bytes.

Don't use LSM Trees for read-heavy OLTP with large range scans; B-trees are better there.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How does an LSM Tree handle concurrent writes and reads without locking?

Q02SENIOR

When would you choose leveled compaction over size-tiered in production?

Q03SENIOR

What happens when compaction can't keep up with writes? How do you mitig...

Q04JUNIOR

What is an SSTable and how is it structured?

Q05SENIOR

You notice read latency spikes after a large delete operation. What's th...

Q06SENIOR

How would you design an LSM Tree for a system that must guarantee no dat...

Q01 of 06SENIOR

How does an LSM Tree handle concurrent writes and reads without locking?

ANSWER

Writes go to a thread-local memtable or a concurrent skip list. Reads check the memtable first (lock-free or RCU), then SSTables (immutable, no lock needed). Compaction operates on old SSTables, not the active memtable. This minimizes contention.

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is the difference between an LSM Tree and a B-Tree?

How do I reduce write amplification in an LSM Tree?

How do I tune bloom filters in RocksDB for better read performance?

What causes tombstone accumulation and how do I fix it?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

✓ Verified

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

🔥

That's Database Internals. Mark it forged?

3 min read · try the examples if you haven't