Putting It All Together

LSM Tree — Full Architecture#

This is how all the pieces connect inside a single node. The write path goes through the left side (WAL + Memtable), the read path goes through the right side (Memtable → Bloom Filters → SSTables).

graph TD
    subgraph Memory
        MT["Memtable - sorted in-memory tree"]
        BF1["Bloom Filter - SSTable 3"]
        BF2["Bloom Filter - SSTable 2"]
        BF3["Bloom Filter - SSTable 1"]
    end

    subgraph Disk
        WAL["WAL - append-only log"]
        SS3["SSTable 3 - newest - sorted"]
        SS2["SSTable 2 - sorted"]
        SS1["SSTable 1 - oldest - sorted"]
    end

    W["put request"] --> WAL
    WAL --> MT
    MT -->|memtable full| SS3

    R["get request"] --> MT
    MT -->|not found| BF1
    BF1 -->|probably here| SS3
    BF1 -->|not here - SKIP| BF2
    BF2 -->|probably here| SS2
    BF2 -->|not here - SKIP| BF3
    BF3 -->|probably here| SS1

Write Path — End to End#

sequenceDiagram
    participant Client
    participant Node
    participant WAL as WAL on disk
    participant MT as Memtable in memory
    participant SS as SSTable on disk

    Client->>Node: put key:123 val:Alice
    Node->>WAL: append entry - sequential write
    Node->>MT: insert into sorted tree
    Node->>Client: ack - write complete

    Note over MT: memtable reaches 32-64 MB threshold

    MT->>SS: flush sorted data as new SSTable
    Note over WAL: old WAL deleted
    Note over MT: fresh memtable + new WAL created

Append to WAL — sequential write to disk. This is for crash recovery only.
Insert into memtable — O(log n) insert into an in-memory sorted tree (red-black tree or skip list).
Ack to client — done. The write is durable (WAL on disk) and queryable (memtable in memory).
Background flush — when the memtable reaches its size threshold, it's flushed to disk as a sorted SSTable. The old WAL is deleted and a fresh memtable + new WAL are created.

Every disk write in this path is sequential — appending to the WAL, flushing a sorted SSTable. No random I/O, no seeking, no page splits.

Read Path — End to End#

sequenceDiagram
    participant Client
    participant Node
    participant MT as Memtable
    participant BF3 as Bloom Filter 3
    participant SS3 as SSTable 3
    participant BF2 as Bloom Filter 2
    participant SS2 as SSTable 2
    participant BF1 as Bloom Filter 1
    participant SS1 as SSTable 1

    Client->>Node: get key:123
    Node->>MT: check memtable
    MT-->>Node: not found

    Node->>BF3: check Bloom filter for SSTable 3
    BF3-->>Node: definitely not here - SKIP

    Node->>BF2: check Bloom filter for SSTable 2
    BF2-->>Node: probably here
    Node->>SS2: binary search SSTable 2
    SS2-->>Node: FOUND key:123 = Alice

    Node->>Client: return Alice

Check memtable — in-memory lookup, O(log n). If found, return immediately — this is the newest data.
Check Bloom filter for newest SSTable — a few bit checks in memory. If "definitely not here," skip the SSTable entirely. No disk read.
If Bloom filter says "probably here" — binary search the SSTable on disk. If found, return. If not found (false positive), move to the next SSTable.
Repeat for each SSTable, newest to oldest, until the key is found or all SSTables are exhausted.

The Bloom filters are the critical optimization. Without them, every SSTable check is a disk read. With them, most SSTables are skipped with just a few bit checks in memory. Only the SSTable that actually contains the key (or the rare false positive) triggers a disk read.

What Happens to Deleted Keys?#

A delete doesn't remove data — it writes a tombstone (a special marker that says "this key was deleted at timestamp X"). The tombstone flows through the same path as any write: WAL → memtable → eventually flushed to an SSTable.

During reads, if the node finds a tombstone, it returns "not found" to the client. During compaction, when the tombstone meets the old value for that key, the old value is discarded. After the tombstone's grace period expires (typically 10 days — needed for anti-entropy to propagate the delete to all replicas), compaction removes the tombstone itself.

Summary — Why Each Piece Exists#

Component        Purpose                              Lives in
─────────        ───────                              ────────
WAL              Crash recovery for memtable           Disk (append-only)
Memtable         Fast writes + sorted batching         Memory
SSTable          Persistent sorted data                Disk (immutable)
Bloom Filter     Skip SSTables that don't have the key Memory
Compaction       Reduce SSTable count + reclaim space   Background process

Each component exists because of a specific problem: - WAL → memtable is in memory, crashes lose data → append to disk first - Memtable → random writes to sorted files are slow → batch and sort in memory - SSTable → memory is limited → flush sorted data to disk - Bloom filter → too many SSTables to check per read → skip irrelevant ones cheaply - Compaction → SSTables accumulate endlessly → merge them to reduce count and remove stale data

Interview framing

"The LSM Tree write path is: append to WAL for durability, insert into an in-memory sorted memtable, ack to client. When the memtable fills up, flush it to disk as a sorted SSTable. Every disk write is sequential — no random I/O. The read path checks the memtable first, then each SSTable newest to oldest. Bloom filters sit in memory and let us skip SSTables that definitely don't contain the key — reducing read amplification from checking every SSTable to checking only the relevant ones. Compaction runs in the background, merging SSTables to reduce their count and discard stale values."