2026-05-30
B-trees have ruled databases for 50 years, but they have a dirty secret: updates happen in place, so writes turn into random I/O. You locate the leaf page, read it, modify it, write it back. On spinning disks that's a seek penalty; on SSDs it's write amplification, because changing a few bytes means rewriting a whole page, and that wears the drive out faster. For write-heavy workloads (time-series, logs, metrics, event streams), the B-tree becomes the bottleneck. The Log-Structured Merge Tree (LSM tree) flips the tradeoff: turn random writes into sequential writes, and pay for it on reads.
How it works:
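The core mechanics can be sketched as a toy in-memory model. This is a minimal illustration, not how any particular engine is implemented: writes land in an in-memory table (the memtable), a full memtable is flushed as an immutable sorted run (standing in for an on-disk SSTable), reads check the memtable and then the runs from newest to oldest, and compaction merges runs while dropping overwritten or deleted entries. The class name `ToyLSM`, the `memtable_limit` parameter, and the tombstone sentinel are hypothetical; a real engine would also add a write-ahead log and Bloom filters, which are left out here.

```python
import bisect

TOMBSTONE = object()  # hypothetical marker for a deleted key


class ToyLSM:
    """Toy model: a dict memtable plus a list of immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.runs = []  # oldest run first, newest run last

    def put(self, key, value):
        # Writes only touch memory; no existing run is modified in place.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # Deletes are writes too: record a tombstone that shadows older values.
        self.put(key, TOMBSTONE)

    def flush(self):
        # One sequential write of a sorted run (an SSTable in a real engine).
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # This is the read cost: the memtable first, then every run, newest first.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for run in reversed(self.runs):
            # (key,) sorts just before (key, value), so this finds the key's slot.
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                value = run[i][1]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge all runs; newer entries overwrite older ones, tombstones are dropped.
        merged = {}
        for run in self.runs:
            merged.update(run)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.runs = [live] if live else []


db = ToyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)    # memtable is full: flushed as run 1
db.put("a", 10)
db.delete("b")    # memtable is full again: flushed as run 2
print(db.get("a"), db.get("b"), len(db.runs))
db.compact()
print(db.runs)
```

Note that `get` may have to look at every run before giving up on a key. That is the read-side cost of this design, and compaction is what keeps the number of runs in check.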
Real-world example: Cassandra, RocksDB, LevelDB, ScyllaDB, and HBase are all LSM-based. When Discord migrated their trillions of messages from Cassandra to ScyllaDB (both LSM), they handled millions of writes per second on commodity hardware — something a B-tree-backed Postgres cluster would need aggressive sharding to match. Meanwhile, MySQL/InnoDB (B-tree) still wins for OLTP workloads with heavy read-modify-write patterns.
The fundamental tradeoff:
Rule of thumb: If your write:read ratio exceeds 10:1 (lots of writes), or you're ingesting time-ordered data, reach for LSM. If your workload is read-heavy with small updates scattered everywhere (banking, inventory), stick with B-trees. The "10:1" cutoff isn't magic; it's roughly where LSM's sequential-write advantage starts to outweigh its compaction overhead.
Once you're on an LSM engine, tune the compaction strategy (leveled vs. size-tiered) based on whether you need predictable read latency (leveled) or lower write amplification (size-tiered).
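To get a feel for the leveled vs. size-tiered choice, here is a back-of-envelope sketch. It uses commonly cited worst-case approximations, which are assumptions rather than measurements from any engine. With a size ratio `fanout` between levels and `levels` levels, leveled compaction rewrites each entry about `fanout` times per level but keeps one sorted run per level. Size-tiered writes each entry about once per level but can accumulate up to `fanout` runs per level. The function names and example sizes are hypothetical, and the model ignores Bloom filters, caching, and write-ahead-log traffic.

```python
def level_count(data_bytes, memtable_bytes, fanout):
    """Levels needed if level i can hold memtable_bytes * fanout**i bytes."""
    levels = 1
    capacity = memtable_bytes * fanout
    while capacity < data_bytes:
        capacity *= fanout
        levels += 1
    return levels


def estimate_amplification(data_bytes, memtable_bytes, fanout):
    """Rough worst-case estimates; real engines will differ."""
    levels = level_count(data_bytes, memtable_bytes, fanout)
    return {
        # Leveled: ~fanout rewrites per level, one run to check per level.
        "leveled": {"write_amp": fanout * levels, "runs_per_lookup": levels},
        # Size-tiered: ~one rewrite per level, up to fanout runs per level.
        "size_tiered": {"write_amp": levels, "runs_per_lookup": fanout * levels},
    }


MIB = 2**20
# Hypothetical sizing: 64 MiB memtable, 64,000 MiB of data, fanout of 10.
for strategy, numbers in estimate_amplification(64_000 * MIB, 64 * MIB, 10).items():
    print(strategy, numbers)
```

The two strategies swap the same factor between the write side and the read side, which is why the choice comes down to which of the two your workload can least afford.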
