LSM Trees absorb high write throughput by converting random writes into sequential ones, then use background compaction to maintain read performance. SSTables are the immutable on-disk files that store sorted key-value pairs, indexed for efficient lookup.
✦ Definition~90s read
What is LSM Trees and SSTables?
An LSM Tree (Log-Structured Merge-Tree) is a data structure that batches writes into in-memory tables, then flushes them as immutable SSTables on disk, merging them in the background. It's the engine behind Cassandra, RocksDB, LevelDB, and Bigtable.
★
Imagine you're a librarian who gets a flood of new book requests every second.
Plain-English First
Imagine you're a librarian who gets a flood of new book requests every second. Instead of shelving each book immediately (which would be chaos), you dump them all into a pile on your desk. When the pile gets big enough, you sort the pile and write it into a new, permanent shelf section. Later, you merge old shelf sections together to keep things tidy. That's an LSM Tree — the pile is the memtable, the shelf sections are SSTables, and the merging is compaction.
You've got a write-heavy workload — time-series metrics, user events, chat messages — and your B-tree starts choking. Writes slow to a crawl, your disk I/O spikes, and the pager goes off at 2 AM. The problem? B-trees do random writes, and spinning disks (even SSDs) hate random writes. LSM Trees solve this by turning every write into a sequential append. That's the entire trick. And it's why every modern write-optimized store — Cassandra, RocksDB, Bigtable — uses them. By the end of this, you'll understand the internals well enough to tune compaction, diagnose write amplification, and know exactly when NOT to use an LSM Tree.
The Write Path: From Memtable to SSTable
Every write hits the memtable — an in-memory sorted structure (usually a skip list or red-black tree). This is fast: O(log n) insert, no disk I/O. When the memtable hits a size threshold (e.g., 64MB in RocksDB), it's frozen and a new memtable takes over. The frozen one is flushed to disk as an SSTable — a sorted, immutable file. This flush is a sequential write, so it's fast even on spinning rust. The key insight: all writes are sequential appends until compaction. No random writes, ever. This is why LSM Trees can absorb 100k+ writes/sec on a single node.
MemtableFlush.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// io.thecodeforge — SystemDesign tutorial
// Simplified memtable flush logic (RocksDB-like)
classMemtable {
SkipList<String, String> data = newSkipList<>();
int sizeBytes = 0;
staticfinalint MAX_SIZE = 64 * 1024 * 1024; // 64MB
voidput(String key, String value) {
data.insert(key, value);
sizeBytes += key.length() + value.length();
if (sizeBytes >= MAX_SIZE) {
triggerFlush(); // freeze and hand off to flush thread
}
}
voidtriggerFlush() {
// Flush thread picks this up, writes sorted data to SSTable
// This is sequential write — no seeks
}
}
// SSTable on disk: sorted key-value pairs + index + bloom filter
// File format: [DataBlock1][DataBlock2]...[Index][BloomFilter][Footer]
Output
No direct output — this is internal logic.
Senior Shortcut: Memtable Size Tuning
Larger memtables reduce flush frequency but increase memory pressure. Rule of thumb: set memtable size to 1/4 of your write buffer budget. In RocksDB, write_buffer_size and max_write_buffer_number control this.
thecodeforge.io
LSM Tree Write & Read Path Flow
Lsm Trees Sstables
thecodeforge.io
Write Path: Memtable to SSTable
Lsm Trees Sstables
The Read Path: Why LSM Trees Can Be Slow (and How Bloom Filters Save You)
Reads are the Achilles' heel. To find a key, you must search the memtable, then all SSTables from newest to oldest. Without optimization, that's O(#SSTables * log N) — terrible. Two things fix this: bloom filters and a multi-level index. Bloom filters tell you 'this key is definitely not in this SSTable' with a configurable false positive rate. The index (a sparse map of key → offset) lets you binary-search within an SSTable. In practice, a point lookup touches 1-2 SSTables on average if bloom filters are tuned right. But range scans still suck — they may need to merge across multiple SSTables.
PointLookup.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// io.thecodeforge — SystemDesign tutorial
classSSTable {
BloomFilter bloom;
Index index; // key -> offset in data file
RandomAccessFile dataFile;
Stringget(String key) {
if (!bloom.mightContain(key)) returnnull; // definitely not here
long offset = index.lookup(key);
if (offset == -1) returnnull;
// Read block at offset, binary search within block
returnreadBlockAndSearch(offset, key);
}
}
// Read path: check memtable -> check each SSTable in L0 -> check L1 -> ...
// Bloom filter avoids most disk reads. Tune bits_per_key: 10 bits = ~1% false positive.
Output
No direct output — this is internal logic.
Production Trap: Bloom Filter Memory
Bloom filters live in memory. With 10 bits per key and 1 billion keys, that's 1.25GB per SSTable. If you have 100 SSTables, that's 125GB. Use partitioned bloom filters (RocksDB's format_version=5) to only load relevant filter blocks.
Compaction: The Background Merge That Keeps Everything Sane
Compaction is the process of merging multiple SSTables into one, discarding overwritten and deleted keys. Without compaction, you'd have thousands of tiny SSTables and reads would be catastrophic. There are two main strategies: size-tiered (Cassandra default) merges SSTables of similar size; leveled (RocksDB, LevelDB) organizes SSTables into levels with exponentially increasing size. Leveled compaction is better for reads (fewer SSTables to check) but causes more write amplification. Size-tiered is simpler and has less write amplification but worse read performance. Choose based on your workload: read-heavy → leveled, write-heavy → size-tiered.
CompactionStrategy.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// io.thecodeforge — SystemDesign tutorial
// Leveled compaction: each level is 10x the previous
// L0: files from memtable flushes (overlapping ranges)
// L1: non-overlapping, sorted, 10x size of L0
// L2: 100x L0, etc.
// WhenL1 exceeds target size, pick a file and merge with overlapping files in L2classLeveledCompaction {
staticfinaldouble LEVEL_MULTIPLIER = 10.0;
longmaxBytesForLevel(int level) {
return (long)(baseLevelSize * Math.pow(LEVEL_MULTIPLIER, level));
}
voidmaybeCompact(Level[] levels) {
for (int i = 0; i < levels.length - 1; i++) {
if (levels[i].totalSize > maxBytesForLevel(i)) {
// Pick a file from level i, merge with overlapping files in level i+1doCompaction(levels[i], levels[i+1]);
}
}
}
}
// Write amplification: each byte written may be rewritten multiple times during compaction.
// Leveled: ~10-30x. Size-tiered: ~3-5x.
Output
No direct output — this is internal logic.
Interview Gold: Write Amplification
Interviewers love write amplification. Remember: leveled compaction has higher write amplification because data moves up levels. If your workload is write-heavy and you care about SSD lifespan, size-tiered is better. If reads are critical, leveled wins.
When NOT to Use an LSM Tree
LSM Trees are not a silver bullet. If your workload is read-heavy with large range scans (e.g., 'SELECT * FROM orders WHERE user_id = ? ORDER BY timestamp'), a B-tree will crush an LSM Tree because data is stored contiguously on disk. LSM Trees also suffer from space amplification — old versions of keys linger until compaction catches up. If you need strict ACID transactions with point-in-time snapshots, a B-tree with MVCC (like Postgres) is simpler. Also, if your data fits entirely in memory, an LSM Tree adds unnecessary complexity. Use a hash map or a simple sorted array.
Never Do This: LSM for OLTP
Don't use Cassandra or RocksDB for a high-consistency OLTP workload with many small random reads. You'll fight with tombstone accumulation and read repair. Use Postgres or MySQL with InnoDB.
thecodeforge.io
LSM Tree vs B-Tree for Reads
Lsm Trees Sstables
● Production incidentPOST-MORTEMseverity: high
The 4GB Container That Kept Dying
Symptom
Cassandra node OOM-killed every 6 hours. Heap was 4GB, data size 500GB.
Assumption
Thought it was a memory leak in the JVM. Tried increasing heap to 8GB — still died.
Root cause
Compaction was falling behind. Pending compactions grew unbounded, causing memtable flushes to pile up. Each flush consumed memory for new SSTable writers, and the OS page cache ballooned. The real culprit: compaction throughput was throttled by a single-threaded compaction strategy on a 32-core machine.
Fix
Switched from SizeTieredCompactionStrategy to LeveledCompactionStrategy. Set concurrent_compactors to 4. Also increased memtable_heap_space_in_mb to 2048 to reduce flush frequency.
Key lesson
Compaction is not optional background work — it's the heartbeat of an LSM Tree.
If it stalls, the system dies.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Write stalls: 'Write stall' in RocksDB logs, or Cassandra 'FlushWriter' threads blocked
→
Fix
1. Check 'rocksdb.db.write.stall' metric. 2. Reduce max_write_buffer_number to limit memtable count. 3. Increase write_buffer_size to reduce flush frequency. 4. Add more compaction threads (max_background_compactions).
Symptom · 02
Read latency spikes: P99 read latency jumps 10x
→
Fix
1. Check bloom filter false positive rate. 2. Increase bits_per_key. 3. Check number of SSTables per level — if L0 has many files, compaction is falling behind. 4. Force a manual compaction (compact-range in Cassandra, CompactRange in RocksDB).
Symptom · 03
Disk space growing faster than data: 'Space amplification' metric high
→
Fix
1. Check tombstone count (estimate-num-keys vs actual). 2. If many deletes, tune compaction filter to drop tombstones earlier. 3. Increase min_merge_width to merge more files per compaction. 4. Consider switching to leveled compaction if using size-tiered.
★ LSM Trees and SSTables Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Reduce write_buffer_size and max_write_buffer_number, set block_cache_size to 1/3 of heap
Feature / Aspect
LSM Tree
B-Tree
Write throughput
High (sequential writes)
Moderate (random writes)
Read throughput (point lookup)
Moderate (bloom filter helps)
High (direct index)
Range scan performance
Poor (merge across SSTables)
Excellent (contiguous blocks)
Write amplification
3-30x
1x (in-place update)
Space amplification
Moderate (old versions)
Low (in-place)
Compaction overhead
Background, can spike CPU/IO
None (splits/merges rare)
Typical use case
Write-heavy, time-series, logs
OLTP, read-heavy, transactions
Key takeaways
1
LSM Trees convert random writes into sequential ones
that's the entire superpower.
2
Bloom filters are essential for read performance; tune bits_per_key based on your false positive tolerance.
3
Compaction is not optional
if it falls behind, the system stalls. Monitor pending compaction bytes.
4
Don't use LSM Trees for read-heavy OLTP with large range scans; B-trees are better there.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
How does an LSM Tree handle concurrent writes and reads without locking?
Q02SENIOR
When would you choose leveled compaction over size-tiered in production?
Q03SENIOR
What happens when compaction can't keep up with writes? How do you mitig...
Q04JUNIOR
What is an SSTable and how is it structured?
Q05SENIOR
You notice read latency spikes after a large delete operation. What's th...
Q06SENIOR
How would you design an LSM Tree for a system that must guarantee no dat...
Q01 of 06SENIOR
How does an LSM Tree handle concurrent writes and reads without locking?
ANSWER
Writes go to a thread-local memtable or a concurrent skip list. Reads check the memtable first (lock-free or RCU), then SSTables (immutable, no lock needed). Compaction operates on old SSTables, not the active memtable. This minimizes contention.
Q02 of 06SENIOR
When would you choose leveled compaction over size-tiered in production?
ANSWER
Leveled compaction for read-heavy workloads (fewer SSTables to check) at the cost of higher write amplification (10-30x). Size-tiered for write-heavy workloads where you can tolerate slower reads. Example: time-series metrics → size-tiered; user profile lookups → leveled.
Q03 of 06SENIOR
What happens when compaction can't keep up with writes? How do you mitigate?
ANSWER
Memtable flushes pile up, L0 file count grows, read latency spikes, and eventually write stalls occur. Mitigations: increase compaction threads, throttle write rate via write stall triggers, or pre-partition data to spread load. In Cassandra, reduce memtable_flush_writers and increase concurrent_compactors.
Q04 of 06JUNIOR
What is an SSTable and how is it structured?
ANSWER
An SSTable is an immutable, sorted file of key-value pairs. It consists of data blocks (each containing sorted keys), an index block (sparse key→offset map), a bloom filter block, and a footer. The immutability allows safe concurrent reads without locks.
Q05 of 06SENIOR
You notice read latency spikes after a large delete operation. What's the likely cause and fix?
ANSWER
Tombstones from deletes accumulate and are only removed during compaction. They cause extra work during reads (skipping tombstones). Fix: trigger a manual compaction to merge SSTables and drop tombstones. In Cassandra, run nodetool compact on the keyspace.
Q06 of 06SENIOR
How would you design an LSM Tree for a system that must guarantee no data loss on crash?
ANSWER
Use a write-ahead log (WAL) that is fsynced before acknowledging the write. The memtable is recoverable from the WAL on restart. SSTables are immutable so they survive crashes. Ensure WAL is on a separate device or has its own fsync policy.
01
How does an LSM Tree handle concurrent writes and reads without locking?
SENIOR
02
When would you choose leveled compaction over size-tiered in production?
SENIOR
03
What happens when compaction can't keep up with writes? How do you mitigate?
SENIOR
04
What is an SSTable and how is it structured?
JUNIOR
05
You notice read latency spikes after a large delete operation. What's the likely cause and fix?
SENIOR
06
How would you design an LSM Tree for a system that must guarantee no data loss on crash?
SENIOR
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
What is the difference between an LSM Tree and a B-Tree?
LSM Trees optimize for writes by batching them into sequential writes, while B-Trees optimize for reads with in-place updates. LSM Trees have higher write amplification but can absorb more writes per second. B-Trees have better read performance for range scans and point lookups.
Was this helpful?
02
How do I reduce write amplification in an LSM Tree?
Use size-tiered compaction instead of leveled, increase the level size multiplier (e.g., from 10 to 20), or reduce the number of levels. Also, increase memtable size to reduce flush frequency. Each compaction pass rewrites data, so fewer compactions = less amplification.
Was this helpful?
03
How do I tune bloom filters in RocksDB for better read performance?
Set bloom_filter_bits_per_key in the column family options. 10 bits gives ~1% false positive rate; 20 bits gives ~0.02%. Monitor rocksdb_bloom_filter_useful and rocksdb_bloom_filter_full_positive to see actual effectiveness. Also enable format_version=5 for partitioned filters.
Was this helpful?
04
What causes tombstone accumulation and how do I fix it?
Tombstones are markers for deleted keys. They accumulate when compaction doesn't run frequently enough to merge SSTables and drop the tombstones. Fix: increase compaction frequency (reduce min_merge_width), or trigger manual compaction. Also, avoid frequent deletes of individual keys — batch them.