When to use which storage engine, and what write amplification really costs

Modern storage engines are largely built on two families of data structures: B-Trees and LSM Trees. Both organize data on disk to support efficient reads and writes, but they optimize for very different workloads. The real trade-offs are not only about speed; they also involve write amplification, read behavior, space usage, and operational cost.

The fundamental difference: in-place updates vs sequential writes

B-Trees are designed for in-place updates. When a record changes, the database finds the page that holds it and modifies that page directly; buffering and caching in memory reduce how often the page must actually be written to disk. Pages split when they overflow (and may merge when they empty out), which keeps the tree balanced and allows fast point lookups and range scans.
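
To make the in-place idea concrete, here is a minimal sketch in Python. It is not a real B-Tree: it models only a single level of sorted leaf pages, and the names PagedStore and PAGE_SIZE are hypothetical, chosen for illustration. An update to an existing key overwrites the record inside its page, and a page that overflows is split in two.

```python
import bisect

PAGE_SIZE = 4  # hypothetical capacity: max records per page


class PagedStore:
    """Toy sorted store made of fixed-capacity pages (a stand-in for B-Tree leaves)."""

    def __init__(self):
        self.pages = [[]]      # each page is a sorted list of (key, value)
        self.page_writes = 0   # how many pages we would have written to disk

    def _find_page(self, key):
        # The first key of each later page plays the role of a separator key.
        firsts = [page[0][0] for page in self.pages[1:]]
        return bisect.bisect_right(firsts, key)

    def put(self, key, value):
        idx = self._find_page(key)
        page = self.pages[idx]
        keys = [k for k, _ in page]
        pos = bisect.bisect_left(keys, key)
        if pos < len(page) and page[pos][0] == key:
            page[pos] = (key, value)        # in-place overwrite of the record
        else:
            page.insert(pos, (key, value))  # new record, kept in sorted order
        self.page_writes += 1
        if len(page) > PAGE_SIZE:
            # Overflow: split the page in two so no page exceeds capacity.
            mid = len(page) // 2
            self.pages[idx:idx + 1] = [page[:mid], page[mid:]]
            self.page_writes += 1           # the split writes a second page

    def get(self, key):
        page = self.pages[self._find_page(key)]
        keys = [k for k, _ in page]
        pos = bisect.bisect_left(keys, key)
        if pos < len(page) and page[pos][0] == key:
            return page[pos][1]
        return None


store = PagedStore()
for k in range(1, 11):
    store.put(k, f"v{k}")
store.put(3, "updated")  # touches exactly one page
```

Note that the update to key 3 rewrites a whole page even though only one record changed. A real engine adds inner nodes, a buffer pool, and a log on top of this, none of which is modeled here.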

LSM Trees take the opposite approach. Each write is appended to a write-ahead log for durability and inserted into a sorted in-memory structure (the memtable). Once the memtable fills, it is flushed to disk as an immutable sorted file, and background processes continuously merge and compact these files to keep the structure organized.
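
The write path described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular engine implements it: the class name ToyLSM and the memtable_limit parameter are hypothetical, the log and the flushed files are plain in-memory lists, and compaction simply merges everything into one file.

```python
class ToyLSM:
    """Toy LSM write path: log append, memtable insert, flush, compaction."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.wal = []        # stands in for an append-only log file
        self.memtable = {}   # in-memory buffer of recent writes
        self.sstables = []   # flushed files, oldest first; each is sorted

    def put(self, key, value):
        self.wal.append((key, value))  # sequential append for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # Write the memtable out as one immutable, sorted file.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []  # the log is no longer needed once data is flushed

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search newest file first so the latest version of a key wins.
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all files, oldest first, so newer values overwrite older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []


db = ToyLSM(memtable_limit=3)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10), ("d", 4), ("e", 5)]:
    db.put(key, value)
files_before = len(db.sstables)
db.compact()
```

Nothing on "disk" is ever modified after it is written: the update to key a creates a second copy in a newer file, and the stale copy is only dropped when compaction rewrites the data.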

This design makes LSM Trees very efficient for write-heavy workloads, because sequential I/O is much faster than random updates. The gap is largest on spinning disks, but the approach is still beneficial on SSDs.

Write amplification: the real cost of LSM Trees

Write amplification means the system writes more data to disk than the user actually inserted or updated. In LSM Trees, this happens because data is repeatedly rewritten during compaction as files are merged across levels.
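
Put as a ratio, write amplification is the number of bytes written to storage divided by the number of bytes the user wrote. The short Python sketch below works through one example. The byte counts are made-up illustrative figures, not measurements from any real engine, and the function name write_amplification is just a label for the ratio.

```python
def write_amplification(user_bytes, device_bytes):
    """Ratio of bytes physically written to bytes the user asked to write."""
    if user_bytes <= 0:
        raise ValueError("user_bytes must be positive")
    return device_bytes / user_bytes


GB = 1024 ** 3

# Hypothetical tally for 1 GB of user writes (illustrative numbers only).
user_bytes = 1 * GB
device_writes = {
    "write-ahead log": 1 * GB,  # every write is logged once
    "memtable flush": 1 * GB,   # then written once more as a flushed file
    "compaction": 8 * GB,       # then rewritten as files merge across levels
}

wa = write_amplification(user_bytes, sum(device_writes.values()))
print(wa)  # 10.0
```

In this made-up tally, each user byte costs ten bytes of device writes, most of it from compaction. The actual figure depends on the compaction strategy, the number of levels, and the workload.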