Senior 3 min · June 25, 2026

LSM Trees and SSTables: How They Work and Why They Dominate Write-Heavy Workloads

LSM Trees explained with production internals: compaction, bloom filters, write amplification, and when to avoid them.

N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Follow
Production
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer

LSM Trees absorb high write throughput by converting random writes into sequential ones, then use background compaction to maintain read performance. SSTables are the immutable on-disk files that store sorted key-value pairs, indexed for efficient lookup.

✦ Definition~90s read
What is LSM Trees and SSTables?

An LSM Tree (Log-Structured Merge-Tree) is a data structure that batches writes into in-memory tables, then flushes them as immutable SSTables on disk, merging them in the background. It's the engine behind Cassandra, RocksDB, LevelDB, and Bigtable.

Imagine you're a librarian who gets a flood of new book requests every second.
Plain-English First

Imagine you're a librarian who gets a flood of new book requests every second. Instead of shelving each book immediately (which would be chaos), you dump them all into a pile on your desk. When the pile gets big enough, you sort the pile and write it into a new, permanent shelf section. Later, you merge old shelf sections together to keep things tidy. That's an LSM Tree — the pile is the memtable, the shelf sections are SSTables, and the merging is compaction.

You've got a write-heavy workload — time-series metrics, user events, chat messages — and your B-tree starts choking. Writes slow to a crawl, your disk I/O spikes, and the pager goes off at 2 AM. The problem? B-trees do random writes, and spinning disks (even SSDs) hate random writes. LSM Trees solve this by turning every write into a sequential append. That's the entire trick. And it's why every modern write-optimized store — Cassandra, RocksDB, Bigtable — uses them. By the end of this, you'll understand the internals well enough to tune compaction, diagnose write amplification, and know exactly when NOT to use an LSM Tree.

The Write Path: From Memtable to SSTable

Every write hits the memtable — an in-memory sorted structure (usually a skip list or red-black tree). This is fast: O(log n) insert, no disk I/O. When the memtable hits a size threshold (e.g., 64MB in RocksDB), it's frozen and a new memtable takes over. The frozen one is flushed to disk as an SSTable — a sorted, immutable file. This flush is a sequential write, so it's fast even on spinning rust. The key insight: all writes are sequential appends until compaction. No random writes, ever. This is why LSM Trees can absorb 100k+ writes/sec on a single node.

MemtableFlush.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// io.thecodeforge — System Design tutorial

// Simplified memtable flush logic (RocksDB-like)
class Memtable {
    SkipList<String, String> data = new SkipList<>();
    int sizeBytes = 0;
    static final int MAX_SIZE = 64 * 1024 * 1024; // 64MB

    void put(String key, String value) {
        data.insert(key, value);
        sizeBytes += key.length() + value.length();
        if (sizeBytes >= MAX_SIZE) {
            triggerFlush(); // freeze and hand off to flush thread
        }
    }

    void triggerFlush() {
        // Flush thread picks this up, writes sorted data to SSTable
        // This is sequential write — no seeks
    }
}

// SSTable on disk: sorted key-value pairs + index + bloom filter
// File format: [Data Block 1][Data Block 2]...[Index][Bloom Filter][Footer]
Output
No direct output — this is internal logic.
Senior Shortcut: Memtable Size Tuning
Larger memtables reduce flush frequency but increase memory pressure. Rule of thumb: set memtable size to 1/4 of your write buffer budget. In RocksDB, write_buffer_size and max_write_buffer_number control this.
LSM Tree Write & Read Path Flow THECODEFORGE.IO LSM Tree Write & Read Path Flow From Memtable to SSTable with compaction and read optimization Write to Memtable In-memory sorted tree, fast writes Flush to SSTable Sorted immutable file on disk Compaction Merge Background merge of SSTables Read with Bloom Filter Check filter, then binary search Return Result Data found or not found ⚠ Read amplification from multiple SSTables Use bloom filters and proper compaction strategy THECODEFORGE.IO
thecodeforge.io
LSM Tree Write & Read Path Flow
Lsm Trees Sstables
Write Path: Memtable to SSTableTHECODEFORGE.IOWrite Path: Memtable to SSTableIn-memory insert, then flush to diskWrite RequestInsert into active memtableMemtable FullThreshold hit (e.g., 64MB)Freeze & SwitchOld memtable frozen, new activeFlush to DiskWrite sorted SSTable file⚠ No disk I/O until flush — fast writes, but read cost laterTHECODEFORGE.IO
thecodeforge.io
Write Path: Memtable to SSTable
Lsm Trees Sstables

The Read Path: Why LSM Trees Can Be Slow (and How Bloom Filters Save You)

Reads are the Achilles' heel. To find a key, you must search the memtable, then all SSTables from newest to oldest. Without optimization, that's O(#SSTables * log N) — terrible. Two things fix this: bloom filters and a multi-level index. Bloom filters tell you 'this key is definitely not in this SSTable' with a configurable false positive rate. The index (a sparse map of key → offset) lets you binary-search within an SSTable. In practice, a point lookup touches 1-2 SSTables on average if bloom filters are tuned right. But range scans still suck — they may need to merge across multiple SSTables.

PointLookup.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// io.thecodeforge — System Design tutorial

class SSTable {
    BloomFilter bloom;
    Index index; // key -> offset in data file
    RandomAccessFile dataFile;

    String get(String key) {
        if (!bloom.mightContain(key)) return null; // definitely not here
        long offset = index.lookup(key);
        if (offset == -1) return null;
        // Read block at offset, binary search within block
        return readBlockAndSearch(offset, key);
    }
}

// Read path: check memtable -> check each SSTable in L0 -> check L1 -> ...
// Bloom filter avoids most disk reads. Tune bits_per_key: 10 bits = ~1% false positive.
Output
No direct output — this is internal logic.
Production Trap: Bloom Filter Memory
Bloom filters live in memory. With 10 bits per key and 1 billion keys, that's 1.25GB per SSTable. If you have 100 SSTables, that's 125GB. Use partitioned bloom filters (RocksDB's format_version=5) to only load relevant filter blocks.

Compaction: The Background Merge That Keeps Everything Sane

Compaction is the process of merging multiple SSTables into one, discarding overwritten and deleted keys. Without compaction, you'd have thousands of tiny SSTables and reads would be catastrophic. There are two main strategies: size-tiered (Cassandra default) merges SSTables of similar size; leveled (RocksDB, LevelDB) organizes SSTables into levels with exponentially increasing size. Leveled compaction is better for reads (fewer SSTables to check) but causes more write amplification. Size-tiered is simpler and has less write amplification but worse read performance. Choose based on your workload: read-heavy → leveled, write-heavy → size-tiered.

CompactionStrategy.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// io.thecodeforge — System Design tutorial

// Leveled compaction: each level is 10x the previous
// L0: files from memtable flushes (overlapping ranges)
// L1: non-overlapping, sorted, 10x size of L0
// L2: 100x L0, etc.
// When L1 exceeds target size, pick a file and merge with overlapping files in L2

class LeveledCompaction {
    static final double LEVEL_MULTIPLIER = 10.0;
    long maxBytesForLevel(int level) {
        return (long)(baseLevelSize * Math.pow(LEVEL_MULTIPLIER, level));
    }

    void maybeCompact(Level[] levels) {
        for (int i = 0; i < levels.length - 1; i++) {
            if (levels[i].totalSize > maxBytesForLevel(i)) {
                // Pick a file from level i, merge with overlapping files in level i+1
                doCompaction(levels[i], levels[i+1]);
            }
        }
    }
}

// Write amplification: each byte written may be rewritten multiple times during compaction.
// Leveled: ~10-30x. Size-tiered: ~3-5x.
Output
No direct output — this is internal logic.
Interview Gold: Write Amplification
Interviewers love write amplification. Remember: leveled compaction has higher write amplification because data moves up levels. If your workload is write-heavy and you care about SSD lifespan, size-tiered is better. If reads are critical, leveled wins.

When NOT to Use an LSM Tree

LSM Trees are not a silver bullet. If your workload is read-heavy with large range scans (e.g., 'SELECT * FROM orders WHERE user_id = ? ORDER BY timestamp'), a B-tree will crush an LSM Tree because data is stored contiguously on disk. LSM Trees also suffer from space amplification — old versions of keys linger until compaction catches up. If you need strict ACID transactions with point-in-time snapshots, a B-tree with MVCC (like Postgres) is simpler. Also, if your data fits entirely in memory, an LSM Tree adds unnecessary complexity. Use a hash map or a simple sorted array.

Never Do This: LSM for OLTP
Don't use Cassandra or RocksDB for a high-consistency OLTP workload with many small random reads. You'll fight with tombstone accumulation and read repair. Use Postgres or MySQL with InnoDB.
LSM Tree vs B-Tree for ReadsTHECODEFORGE.IOLSM Tree vs B-Tree for ReadsWhen LSM Trees fall shortLSM Tree WeaknessSlow range scans across SSTablesSpace amplification (stale data)Reads need bloom filter checkB-Tree StrengthContiguous on-disk dataEfficient ordered scansNo write amplificationUse B-trees for read-heavy workloads with large range queriesTHECODEFORGE.IO
thecodeforge.io
LSM Tree vs B-Tree for Reads
Lsm Trees Sstables
● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom
Cassandra node OOM-killed every 6 hours. Heap was 4GB, data size 500GB.
Assumption
Thought it was a memory leak in the JVM. Tried increasing heap to 8GB — still died.
Root cause
Compaction was falling behind. Pending compactions grew unbounded, causing memtable flushes to pile up. Each flush consumed memory for new SSTable writers, and the OS page cache ballooned. The real culprit: compaction throughput was throttled by a single-threaded compaction strategy on a 32-core machine.
Fix
Switched from SizeTieredCompactionStrategy to LeveledCompactionStrategy. Set concurrent_compactors to 4. Also increased memtable_heap_space_in_mb to 2048 to reduce flush frequency.
Key lesson
  • Compaction is not optional background work — it's the heartbeat of an LSM Tree.
  • If it stalls, the system dies.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Write stalls: 'Write stall' in RocksDB logs, or Cassandra 'FlushWriter' threads blocked
Fix
1. Check 'rocksdb.db.write.stall' metric. 2. Reduce max_write_buffer_number to limit memtable count. 3. Increase write_buffer_size to reduce flush frequency. 4. Add more compaction threads (max_background_compactions).
Symptom · 02
Read latency spikes: P99 read latency jumps 10x
Fix
1. Check bloom filter false positive rate. 2. Increase bits_per_key. 3. Check number of SSTables per level — if L0 has many files, compaction is falling behind. 4. Force a manual compaction (compact-range in Cassandra, CompactRange in RocksDB).
Symptom · 03
Disk space growing faster than data: 'Space amplification' metric high
Fix
1. Check tombstone count (estimate-num-keys vs actual). 2. If many deletes, tune compaction filter to drop tombstones earlier. 3. Increase min_merge_width to merge more files per compaction. 4. Consider switching to leveled compaction if using size-tiered.
★ LSM Trees and SSTables Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Write stalls: `rocksdb.db.write.stall` > 0
Immediate action
Check pending compaction bytes
Commands
curl localhost:8080/metrics | grep rocksdb_compaction_pending_bytes
curl localhost:8080/metrics | grep rocksdb_num_running_compactions
Fix now
Increase max_background_compactions to number of CPU cores, reduce write_buffer_size by half
High read latency: P99 > 100ms+
Immediate action
Check bloom filter false positive rate
Commands
curl localhost:8080/metrics | grep rocksdb_bloom_filter_full_positive
curl localhost:8080/metrics | grep rocksdb_bloom_filter_useful
Fix now
Set bloom_filter_bits_per_key = 20 in column family options
Disk space explosion: `df -h` shows 90%++
Immediate action
Check tombstone ratio
Commands
curl localhost:8080/metrics | grep rocksdb_estimate_num_keys
du -sh /data/rocksdb/*
Fix now
Run manual compaction: curl -X POST localhost:8080/compact?from=0&to=100
OOM: Java heap or RSS exceeds limit+
Immediate action
Check memtable memory
Commands
curl localhost:8080/metrics | grep rocksdb_cur_size_active_memtable
curl localhost:8080/metrics | grep rocksdb_block_cache_usage
Fix now
Reduce write_buffer_size and max_write_buffer_number, set block_cache_size to 1/3 of heap
Feature / AspectLSM TreeB-Tree
Write throughputHigh (sequential writes)Moderate (random writes)
Read throughput (point lookup)Moderate (bloom filter helps)High (direct index)
Range scan performancePoor (merge across SSTables)Excellent (contiguous blocks)
Write amplification3-30x1x (in-place update)
Space amplificationModerate (old versions)Low (in-place)
Compaction overheadBackground, can spike CPU/IONone (splits/merges rare)
Typical use caseWrite-heavy, time-series, logsOLTP, read-heavy, transactions

Key takeaways

1
LSM Trees convert random writes into sequential ones
that's the entire superpower.
2
Bloom filters are essential for read performance; tune bits_per_key based on your false positive tolerance.
3
Compaction is not optional
if it falls behind, the system stalls. Monitor pending compaction bytes.
4
Don't use LSM Trees for read-heavy OLTP with large range scans; B-trees are better there.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does an LSM Tree handle concurrent writes and reads without locking?
Q02SENIOR
When would you choose leveled compaction over size-tiered in production?
Q03SENIOR
What happens when compaction can't keep up with writes? How do you mitig...
Q04JUNIOR
What is an SSTable and how is it structured?
Q05SENIOR
You notice read latency spikes after a large delete operation. What's th...
Q06SENIOR
How would you design an LSM Tree for a system that must guarantee no dat...
Q01 of 06SENIOR

How does an LSM Tree handle concurrent writes and reads without locking?

ANSWER
Writes go to a thread-local memtable or a concurrent skip list. Reads check the memtable first (lock-free or RCU), then SSTables (immutable, no lock needed). Compaction operates on old SSTables, not the active memtable. This minimizes contention.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between an LSM Tree and a B-Tree?
02
How do I reduce write amplification in an LSM Tree?
03
How do I tune bloom filters in RocksDB for better read performance?
04
What causes tombstone accumulation and how do I fix it?
N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Follow
Verified
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
🔥

That's Database Internals. Mark it forged?

3 min read · try the examples if you haven't

Previous
NoSQL Store Types Compared
9 / 9 · Database Internals
Next
Distributed Consensus: Paxos and Raft