Daily Software Engineering: The LSM Tree: Why Write-Heavy Databases Stopped Using B-Trees

2026-05-30

B-trees have ruled databases for 50 years, but they have a dirty secret: an in-place update is a random I/O. You locate the leaf page, read it, modify it, and write it back. On spinning disks that costs a seek; on SSDs it shows up as write amplification that shortens the drive's lifespan. For write-heavy workloads (time-series, logs, metrics, event streams), the B-tree becomes the bottleneck. The Log-Structured Merge Tree (LSM tree) flips the tradeoff: it turns random writes into sequential writes and pays for that on reads.

How it works:

Memtable: Writes go into an in-memory sorted structure (usually a skip list or red-black tree). The insert itself is fast because it touches no disk pages.
WAL: Before a write is applied to the memtable, it is appended to a write-ahead log on disk so it survives a crash. The append is sequential, durable, and cheap.
SSTables: When the memtable fills, flush it to disk as an immutable Sorted String Table. Sequential write, no in-place updates.
Compaction: A background process merges smaller SSTables into larger ones, discarding deleted/overwritten keys. Sequential reads and writes.
Reads: Check memtable first, then SSTables from newest to oldest. Bloom filters skip files that can't contain the key.
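
The steps above can be sketched in a few dozen lines. This is a toy under loud assumptions, not how any real engine is written: the class name TinyLSM and its methods are hypothetical, the WAL and SSTables are plain in-memory lists standing in for files, a dict sorted at flush time stands in for a skip list, and Bloom filters are left out.

```python
import bisect

TOMBSTONE = object()  # marks a deleted key until compaction drops it


class TinyLSM:
    """Toy LSM tree. Everything lives in memory; a real engine writes files."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.wal = []        # stand-in for the on-disk write-ahead log
        self.memtable = {}   # stand-in for a skip list; sorted at flush time
        self.sstables = []   # immutable sorted runs, oldest first

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log first, for durability
        self.memtable[key] = value      # 2. then apply in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)        # a delete is just another write

    def flush(self):
        # Freeze the memtable into a sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal.clear()                # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for table in reversed(self.sstables):        # newest to oldest
            i = bisect.bisect_left(table, (key,))    # binary search by key
            if i < len(table) and table[i][0] == key:
                value = table[i][1]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Full compaction: merge every run into one, newest value wins.
        merged = {}
        for table in self.sstables:     # oldest first, so later runs overwrite
            merged.update(table)
        # Dropping tombstones is only safe here because all runs are merged.
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.sstables = [live] if live else []
```

Note that nothing is ever updated in place: an overwrite or delete just adds a newer entry, and the stale one lingers until compact() runs. That is the source of both the sequential-write win and the space overhead discussed below.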

Real-world example: Cassandra, RocksDB, LevelDB, ScyllaDB, and HBase are all LSM-based. When Discord migrated their trillions of messages from Cassandra to ScyllaDB (both LSM), they handled millions of writes per second on commodity hardware — something a B-tree-backed Postgres cluster would need aggressive sharding to match. Meanwhile, MySQL/InnoDB (B-tree) still wins for OLTP workloads with heavy read-modify-write patterns.

The fundamental tradeoff:

Write amplification: B-tree ≈ 1-3x (each write rewrites a page). LSM with leveled compaction ≈ 10-30x (each key gets rewritten repeatedly as compactions move it down the levels). Sounds bad, but that I/O is sequential, which is far cheaper than random I/O on spinning disks and still easier on SSDs.
Read amplification: B-tree ≈ log(N) page reads for N keys. LSM ≈ one check per level in the worst case, mitigated by Bloom filters, which reach a ~1% false-positive rate at 10 bits/key.
Space amplification: LSM has dead data sitting around until compaction. Plan for 1.5-2x your live dataset size on disk.
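
To see where the "~1% at 10 bits/key" figure comes from, here is a small sketch using the standard Bloom filter approximation p = (1 - e^(-k/b))^k, where b is bits per key and k is the number of hash functions. The function names are made up for illustration, and the 7-level example is an assumed tree depth, not a measurement.

```python
import math


def optimal_num_hashes(bits_per_key):
    # The false-positive rate is minimized near k = b * ln 2.
    return max(1, round(bits_per_key * math.log(2)))


def bloom_false_positive_rate(bits_per_key, num_hashes=None):
    # Standard approximation for a filter holding exactly the keys it was sized for.
    k = num_hashes if num_hashes is not None else optimal_num_hashes(bits_per_key)
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def expected_wasted_reads(num_levels, bits_per_key):
    # For a key that exists nowhere, each level's filter can lie independently.
    return num_levels * bloom_false_positive_rate(bits_per_key)


print(optimal_num_hashes(10))            # hash functions at 10 bits/key
print(bloom_false_positive_rate(10))     # a little under 1%
print(expected_wasted_reads(7, 10))      # assumed 7-level tree
```

At 10 bits/key the rate comes out a little under 1%, so a lookup for an absent key in a 7-level tree wastes well under one SSTable read on average.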

Rule of thumb: If your write:read ratio exceeds 10:1 (lots of writes), or you're ingesting time-ordered data, reach for LSM. If your workload is read-heavy with small updates scattered everywhere (banking, inventory), stick with B-trees. The "10:1" cutoff isn't magic; it's roughly where LSM's compaction overhead stops dominating and its sequential-write advantage wins.

Tune compaction strategy (leveled vs. size-tiered) based on whether you need predictable read latency (leveled) or lower write amplification (size-tiered).
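
A rough sketch of how the two strategies decide what to merge, assuming simplified rules: size-tiered groups SSTables of similar size and merges a group once it has enough members, while leveled gives each level a target size that grows by a fixed fanout. The function names, the 0.5x-1.5x bucket window, the threshold of 4, and the fanout of 10 are illustrative choices here, not any particular database's configuration.

```python
def size_tiered_buckets(sizes, low=0.5, high=1.5):
    # Group SSTable sizes so each bucket holds tables close to its average size.
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if low * avg <= size <= high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])   # no similar bucket found, start a new one
    return buckets


def pick_size_tiered_compaction(sizes, min_threshold=4):
    # Merge the first bucket that has accumulated enough similar-sized tables.
    for bucket in size_tiered_buckets(sizes):
        if len(bucket) >= min_threshold:
            return bucket
    return None


def leveled_target_sizes(base_size, fanout=10, num_levels=4):
    # Each level may hold `fanout` times more data than the one above it.
    return [base_size * fanout ** i for i in range(num_levels)]


print(pick_size_tiered_compaction([10, 11, 9, 12, 100, 400]))
print(leveled_target_sizes(10))
```

Size-tiered rewrites a table only when enough peers of its size pile up, so each key is rewritten fewer times, but many overlapping runs can exist at once and reads may have to check all of them. Leveled keeps each level to a bounded size and pushes overflow down, which rewrites data more often but limits how many runs a read must consult.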

See it in action: Check out The Secret Sauce Behind NoSQL: LSM Tree by ByteByteGo to see this theory applied.

Key Takeaway: LSM trees trade read complexity and disk space for sequential-write throughput — pick them when writes dominate and SSDs make random I/O the enemy.
