Final Architecture

Final Architecture — Everything Connected#

This is the complete architecture reflecting every deep dive decision. The base design was: client → LB → coordinator → hash → replicas → quorum → respond. Now every component has depth behind it.

flowchart TD
    Client([Client])
    LB[Load Balancer]

    subgraph Cluster - 1200 Nodes
        subgraph Any Node = Coordinator
            Coord[Coordinator Logic]
            Ring[Consistent Hash Ring - 256 vnodes per node]
            Membership[Membership Table - via gossip]
        end

        subgraph Replica Set N=3
            subgraph Node B
                BF_B[Bloom Filters - in memory]
                MT_B[Memtable - sorted in memory]
                WAL_B[WAL - append only]
                SST_B[SSTables - sorted immutable files]
            end
            subgraph Node C
                BF_C[Bloom Filters]
                MT_C[Memtable]
                WAL_C[WAL]
                SST_C[SSTables]
            end
            subgraph Node D
                BF_D[Bloom Filters]
                MT_D[Memtable]
                WAL_D[WAL]
                SST_D[SSTables]
            end
        end
    end

    Client --> LB
    LB -->|round-robin| Coord
    Coord -->|hash key + lookup ring| Ring
    Ring -->|find replica set| Membership
    Coord --> MT_B
    Coord --> MT_C
    Coord --> MT_D

Complete Write Path — End to End#

Every decision we made is in this flow. Nothing is hand-waved.

1. Client sends:
   POST /api/v1/item
   Body: { "key": "user:123", "value": "base64(raw bytes)", "ttl": 86400 }

2. Load balancer picks any healthy node (round-robin) → Node A

3. Node A becomes the coordinator:
   a. Decodes base64 value back to raw bytes
   b. Hashes "user:123" → ring position 4500
   c. Looks up local membership table (populated by gossip)
   d. Walks the ring → finds Node B, C, D own this range (N=3, distinct physical nodes via vnodes)
   e. Checks membership table → all three are alive

4. Coordinator forwards write to Node B, Node C, Node D
   Each receiving node:
   a. Append to WAL on disk (sequential write — crash recovery)
   b. Insert into memtable (in-memory sorted tree — red-black tree or skip list)
   c. Store metadata: key + value + timestamp + expiry (current time + TTL)
   d. Ack back to coordinator

5. Coordinator waits for W=2 acks (quorum)
   → 2 of 3 nodes confirmed → write is durable

6. If Node D is down:
   → Coordinator writes to Node E (next on ring) with hint "belongs to Node D"
   → Hinted handoff — Node E forwards to Node D when it recovers
   → Hint expires after ~3 hours if Node D stays down

7. Coordinator responds to client:
   201 Created { "key": "user:123", "timestamp": 1713400000000 }

8. Background (on each replica node, independently):
   → Memtable fills up (32-64 MB) → flush to SSTable (sorted, immutable, sequential write)
   → Old WAL deleted, fresh memtable + new WAL created
   → Bloom filter built for the new SSTable
   → Compaction merges SSTables in background (size-tiered or leveled)

sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant A as Coordinator
    participant B as Node B
    participant NC as Node C
    participant D as Node D

    C->>LB: PUT user:123 + base64 value + TTL
    LB->>A: round-robin pick
    Note over A: hash key → ring → find B C D via membership table

    A->>B: write key + raw bytes + timestamp + expiry
    A->>NC: write key + raw bytes + timestamp + expiry
    A->>D: write key + raw bytes + timestamp + expiry

    Note over B: WAL append → memtable insert
    Note over NC: WAL append → memtable insert
    Note over D: WAL append → memtable insert

    B->>A: ack
    NC->>A: ack
    Note over A: W=2 satisfied
    A->>C: 201 Created + timestamp

    Note over B: Later: memtable flush → SSTable + Bloom filter

Complete Read Path — Eventual Consistency (R=1)#

Fastest path. One network hop to one replica. No comparison, no repair.

1. Client sends:
   GET /api/v1/item?key=user:123&consistency=eventual

2. Load balancer → Node A becomes coordinator

3. Coordinator hashes key → finds Node B, C, D on ring

4. Coordinator picks the NEAREST or least-loaded replica (say Node B)
   → Sends read to Node B only (R=1)

5. Node B read path:
   a. Check memtable → if found and not expired → return
   b. Check Bloom filter for SSTable-4 (newest)
      → "definitely not here" → SKIP
   c. Check Bloom filter for SSTable-3
      → "probably here" → binary search SSTable-3
      → Verify checksum → data intact
      → Check expiry: now < expiry → NOT expired
      → FOUND: return value + timestamp
   d. If expired: return "not found"

6. Coordinator responds:
   200 OK { "key": "user:123", "value": "base64(raw bytes)", "timestamp": ... }

sequenceDiagram
    participant C as Client
    participant A as Coordinator
    participant B as Node B

    C->>A: GET user:123 consistency=eventual
    Note over A: hash key → ring → pick nearest replica
    A->>B: read request

    Note over B: memtable → miss
    Note over B: Bloom filter SSTable-4 → not here → skip
    Note over B: Bloom filter SSTable-3 → probably here
    Note over B: binary search SSTable-3 → found
    Note over B: checksum valid → not expired

    B->>A: value + timestamp
    A->>C: 200 OK + base64 value

Making eventual consistency reads even faster#

At R=1, the read already only hits one node. But we can make it faster:

1. Nearest replica routing — instead of picking a random replica, the coordinator picks the one with the lowest network latency. If the coordinator itself is one of the replicas, it reads from its own local storage — zero network hops.

Replica set: Node B, Node C, Node D
Coordinator is Node B

R=1 read → Node B reads from its OWN local storage
  → No network hop at all
  → Just memtable + Bloom filter + SSTable lookup
  → Sub-millisecond if data is in memtable or OS page cache

This happens naturally in our system — the load balancer can route the request to any node, and with 256 vnodes per node, every node owns a portion of the key space. There's a good chance the coordinator itself is one of the replicas.

2. OS page cache — frequently read SSTables stay in the operating system's page cache (RAM). The binary search in the SSTable doesn't actually hit the physical disk — it reads from cached pages in memory. At 500 reads/sec per node, hot data stays warm in the page cache.

Cold read (first time):  memtable miss → Bloom filter → SSTable from disk → 1-5ms
Warm read (cached):      memtable miss → Bloom filter → SSTable from page cache → <1ms
Hot read (in memtable):  memtable hit → <0.1ms

3. Row cache (optional) — a dedicated in-memory cache for the most frequently read keys. Before even checking the memtable, check the row cache. If the key is there, return immediately. Cassandra supports this as an opt-in feature per table.

With row cache:
  get("user:123")
    → Row cache: HIT → return immediately (microseconds)
    → Skip memtable, Bloom filters, SSTables entirely

Without row cache:
  get("user:123")
    → Memtable → Bloom filters → SSTable → disk read (milliseconds)

Row cache is only useful for read-heavy keys that are accessed repeatedly. For keys that are read once and never again, the cache entry just wastes memory.

Eventual consistency read performance stack:

  Row cache (if enabled)     → microseconds (pure memory lookup)
  Memtable hit               → <0.1ms (in-memory sorted tree)
  SSTable in page cache      → <1ms (OS-level caching)
  SSTable from disk          → 1-5ms (Bloom filter skips most SSTables)
  Coordinator is a replica   → zero network hop (reads local storage)

Complete Read Path — Strong Consistency (R=2)#

Slower but guarantees the latest value. Coordinator contacts multiple replicas and compares.

1. Client sends:
   GET /api/v1/item?key=user:123&consistency=strong

2. Coordinator hashes key → finds Node B, C, D on ring

3. Coordinator sends read to ALL THREE: Node B, C, D
   → Waits for R=2 responses (quorum)

4. Two responses come back:
   Node B: "Alice", timestamp=1005, checksum valid
   Node C: "Alice", timestamp=1005, checksum valid

   Both agree → return "Alice"

   OR:
   Node B: "Alice", timestamp=1005
   Node D: "Bob",   timestamp=1002

   Disagree → pick higher timestamp → return "Alice"
   → Trigger read repair: send "Alice" (ts=1005) to Node D in background

5. Coordinator responds:
   200 OK { "key": "user:123", "value": "base64...", "timestamp": 1005 }

sequenceDiagram
    participant C as Client
    participant A as Coordinator
    participant B as Node B
    participant NC as Node C
    participant D as Node D

    C->>A: GET user:123 consistency=strong
    A->>B: read request
    A->>NC: read request
    A->>D: read request

    B->>A: Alice ts=1005
    D->>A: Bob ts=1002
    Note over A: R=2 met. Compare: 1005 > 1002
    A->>C: 200 OK Alice

    Note over A: Read repair in background
    A->>D: write Alice ts=1005

Why strong consistency is slower#

Eventual (R=1):  1 network hop → 1 response → done
Strong (R=2):    3 network hops → wait for 2 responses → compare → done

The difference:
  → Extra network round trips (must wait for slowest of 2 responses)
  → Comparison logic at coordinator
  → Potential read repair write in background

Typical latency:
  Eventual: 1-5ms
  Strong:   5-20ms

The client chooses per request. Same KV store, same data, same nodes — just a different consistency parameter in the request.

Complete Delete Path#

Delete is just a write — a write of a tombstone. Same replication, same quorum, same flow.

1. Client sends: DELETE /api/v1/item?key=user:123

2. Coordinator hashes key → finds Node B, C, D

3. Coordinator sends TOMBSTONE write to all three:
   Key: "user:123"
   Value: TOMBSTONE marker
   Timestamp: current time

4. Each node: WAL append → memtable insert (tombstone entry)

5. W=2 acks → respond 200 OK { "deleted": true }

6. Future reads find the tombstone → return 404

7. Background:
   → Anti-entropy propagates tombstone to any replica that missed it
   → Tombstone grace period: 10 days
   → After grace period: compaction removes tombstone from disk

Background Processes — Always Running#

These processes run continuously on every node, independent of client requests:

Process              Frequency        Purpose
───────              ─────────        ───────
Gossip               Every 1 second   Exchange membership tables with random neighbor
                                      Heartbeat counter increment
                                      Failure detection (frozen heartbeats)

Compaction           Continuous       Merge SSTables (size-tiered or leveled)
                                      Drop expired entries (TTL cleanup)
                                      Remove tombstones past grace period
                                      Reclaim disk space

Anti-entropy         Every few hours  Merkle tree comparison with replica partners
                                      Fix ALL diverged keys (not just read ones)
                                      Propagate tombstones

Memtable flush       On threshold     When memtable hits 32-64 MB
                                      Flush to SSTable + build Bloom filter
                                      Delete old WAL, create fresh ones

Hinted handoff       On node recovery Forward stored hints to recovered node

Fault Tolerance Summary — What Happens When Things Break#

Failure                  What happens                         Recovery
───────                  ────────────                         ────────
Node dies                Quorum still met (W=2 of N=3)        Hinted handoff → read repair → anti-entropy
                         Hinted handoff on another node

Node down > 10 days      Tombstone grace period exceeded       Full data rebuild (treat as new node)

Disk corruption          Checksum detects bad data             Read repair from healthy replica
                         Node refuses to serve corrupt data

Disk full                Writes fail, reads still work         Monitoring + reserved space for compaction
                         Compaction death spiral risk

Network partition        Group A and B can't communicate       Tunable consistency per request
                         Strong reads may fail (R=2)           Self-heals when partition resolves
                         Eventual reads succeed (R=1)          Gossip + read repair + anti-entropy

Coordinator dies         Client sees timeout                   Retry on different coordinator
                         Data may already be on replicas       Safe because put/delete are idempotent

Traffic spike            All nodes overloaded                  Rate limiting + circuit breakers
                                                              Exponential backoff + jitter on retries

Concurrent writes        Same key written by two clients       LWW (highest timestamp wins)
                                                              Optional: siblings for merge use cases

Decisions Made and Why#

Decision                      Why
────────                      ───
Leaderless architecture       No SPOF, any node coordinates, tunable consistency on both reads AND writes

Consistent hashing + vnodes   Even load distribution, graceful scaling, node failure spreads load thinly

Quorum (W=2, R=2, N=3)       Strong consistency when W+R>N, eventual when R=1

LSM Tree storage engine       Sequential writes, handles any read/write ratio, general-purpose

Size-tiered compaction        Default for write-heavy. Leveled available for read-heavy tables

Bloom filters                 Skip irrelevant SSTables, critical for non-existent key lookups

Gossip protocol               Decentralised membership, no central registry, O(log n) propagation

Heartbeat counters            Failure detection without synchronized clocks

LWW conflict resolution       Simple, deterministic, covers 95% of use cases

Tombstones for deletes        Prevent resurrection during anti-entropy

TTL as entry metadata         Self-describing expiry, no tombstone needed, lazy cleanup in compaction

Checksums at every level      Detect silent disk corruption before serving bad data

Rack-aware replica placement  Prevent correlated failures from killing all replicas

Interview framing

"The final architecture is a leaderless KV store with 1,200 nodes. Any node can coordinate any request. Consistent hashing with 256 vnodes per node distributes data evenly. Writes go to N=3 replicas with W=2 quorum — each node appends to a WAL and inserts into an in-memory memtable, which periodically flushes to sorted SSTables on disk. Reads are tunable: R=1 for eventual consistency (fast, single hop, can read from local storage if coordinator is a replica), R=2 for strong consistency (quorum, triggers read repair on stale nodes). Bloom filters skip irrelevant SSTables. Gossip protocol handles membership and failure detection. Three repair layers — hinted handoff, read repair, anti-entropy — guarantee eventual convergence. LWW resolves conflicts. Tombstones prevent resurrection. Rate limiting, circuit breakers, and exponential backoff protect against cascading failures."