Skip to content

The Problem

The core idea

When two replicas silently diverge — one missing a write the other has — you need a way to find which specific keys are different without comparing every single row. Comparing 1 billion rows over the network is impractical. Merkle Trees let you find the diverged sections efficiently.


The problem — silent divergence#

You have a primary and three replicas. Primary sends a write to all three. Due to a network blip, Replica 2 misses one write silently. No error, no alarm — Replica 2 just never received it.

Primary  → W1, W2, W3, W4, W5
Replica1 → W1, W2, W3, W4, W5  ✓
Replica2 → W1, W2,     W4, W5  ← missed W3 silently
Replica3 → W1, W2, W3, W4, W5  ✓

This works in a leaderless system like Cassandra where each write is an independent key-value operation — W3 writes to user:123, W4 writes to user:456. They have no ordering dependency. Replica 2 can happily accept W4 with no knowledge that W3 ever existed.

Replica 2 thinks it is healthy. The primary thinks replication succeeded. But the data is wrong. This is silent divergence — and it is very real in Cassandra's architecture.

Cassandra calls the process of detecting and fixing this anti-entropy repair — a periodic background job that compares replicas and patches any gaps.


Why WAL offset doesn't help here#

A natural instinct is to compare WAL offsets. If Replica 2's WAL is behind, you can replay from where it diverged.

WAL offset works well for streaming new changes forward — it tells you what happened next in sequence. But it breaks down for silent divergence because:

  • Cassandra's writes are not sequential. Each write is an independent operation to a different key. There is no single ordered log where "missing W3" shows up as a gap — Replica 2's log just skips it with no indication anything is wrong.
  • Even if offsets matched, a write could have been applied but corrupted — offset looks fine, data is wrong.
  • When comparing two independent replicas across data centers that were not always in sync, there is no shared WAL to compare against.

WAL is the right tool for replication lag. Merkle Trees are the right tool for finding what is already out of sync.


The naive fix — compare every row#

If you want to find which keys differ between Replica 1 and Replica 2:

Replica1 has 1,000,000,000 rows
Replica2 has 1,000,000,000 rows
Send all rows over the network and compare one by one

That is transferring billions of records purely to find the few that differ. Completely impractical. You need a way to narrow down the search — find which sections of the data are different without looking at everything.

That is exactly what Merkle Trees solve.