Node Failure
Node Failure Is Routine at Scale#
We have 1,200 nodes. At that scale, node failures aren't exceptional events that trigger alarms — they're routine, expected, and happening constantly. If each node has 99.9% uptime, on average 1-2 nodes are down at any given time.
1,200 nodes × 0.1% failure rate = 1.2 nodes down on average
This isn't "what if a node dies" — it's "which node is down right now?"
The system must handle node failures without any human intervention, without data loss, and without clients even noticing. Every mechanism we've built so far — quorum writes, hinted handoff, read repair, anti-entropy — exists precisely for this.
The Three Recovery Layers — Timeline#
Node D goes down. A write comes in for a key whose replica set is Node B, Node C, and Node D (N=3). Here's what happens over time:
First 3 hours — Hinted Handoff#
The coordinator tries to send the write to Node B, Node C, and Node D. Node D doesn't respond. The coordinator picks the next available node on the ring (say Node E) and writes the data there with a hint: "this belongs to Node D."
sequenceDiagram
participant Coord as Coordinator
participant B as Node B
participant C as Node C
participant D as Node D
participant E as Node E
Coord->>B: write user:123
Coord->>C: write user:123
Coord->>D: write user:123
Note over D: DOWN - no response
Coord->>E: write user:123 with hint for Node D
B->>Coord: ack
C->>Coord: ack
E->>Coord: ack
Note over Coord: W=2 satisfied by B and C. Write succeeds. Node E holds the data temporarily. When Node D comes back online, Node E detects it (via gossip) and forwards the hinted data to Node D. The write is recovered.
3 hours to 10 days — Read Repair + Anti-Entropy#
Hints expire after ~3 hours. If Node D is still down when the hint expires, Node E deletes it. The data is now only on Node B and Node C — replication factor degraded from 3 to 2.
But two mechanisms catch this:
Read repair — whenever a client reads a key that Node D was supposed to hold, the coordinator contacts Node B and Node C (R=2 for quorum). When Node D eventually comes back and gets included in a read, the coordinator discovers it's stale and sends it the correct value in the background.
Anti-entropy — the periodic background process compares Merkle trees between Node D and its replica partners. It finds every key that Node D missed during its downtime and syncs them — not just the ones that happen to be read.
Down < 3 hours:
→ Hinted handoff covers all writes during the outage
→ Node D recovers fully when hints are forwarded
Down 3 hours - 10 days:
→ Hints expired, some writes may be missing from Node D
→ Read repair fixes keys as clients read them
→ Anti-entropy catches everything read repair missed
→ Node D eventually converges with Node B and Node C
Beyond 10 days — Full Rebuild Required#
If Node D has been down for longer than the tombstone grace period (typically 10 days), it can't be safely synced with anti-entropy. Here's why:
During those 10+ days, some keys may have been explicitly deleted. The delete wrote a tombstone on Node B and Node C. After 10 days, compaction cleaned up the tombstone — Node B and Node C no longer have any record that the key was deleted.
When Node D comes back, anti-entropy compares it with Node B:
Node B: "user:789" → (nothing — tombstone was cleaned up)
Node D: "user:789" → "Alice" (still has the old data, never saw the delete)
Anti-entropy: Node D has data, Node B doesn't
→ Copies "Alice" back to Node B
→ The deleted data is RESURRECTED
To prevent this, a node that's been down longer than the grace period must be treated as a brand new node. Its data is wiped completely, and a full copy is done from the healthy replicas. No incremental sync, no anti-entropy — start fresh.
Down < 3 hours: Hinted handoff handles it
Down 3 hours - 10 days: Read repair + anti-entropy handles it
Down > 10 days: DANGER — treat as new node, full data rebuild
Can Data Be Permanently Lost?#
Node D being down — for hours, days, or even months — doesn't risk data loss. The same data exists on Node B and Node C (N=3 replication). As long as at least one replica has the data, it can be recovered.
Node D down for 30 days → data still on B and C → safe
Node D AND Node C both down → data still on B → safe
Node B, C, AND D all down → data lost → catastrophic
The probability of all three replicas failing simultaneously (with independent failures):
Single node failure probability: 0.1%
All 3 replicas failing: 0.001 × 0.001 × 0.001 = 0.000000001
= one in a billion chance
The Real Danger — Correlated Failures#
Independent failures are astronomically unlikely to take out all three replicas. But correlated failures can. If all three replicas happen to sit on the same physical rack:
Rack power failure → all 3 nodes on that rack lose power simultaneously
→ All 3 replicas for some keys are gone
→ Data loss
This is why real systems use rack-aware or availability-zone-aware replica placement. The three replicas for any key are spread across different racks (or different data centers), so a single physical failure can never take out all copies:
Node B: Rack 1
Node C: Rack 2
Node D: Rack 3
Rack 1 power failure → only Node B goes down
→ Node C (Rack 2) and Node D (Rack 3) still have the data
→ No data loss
Interview framing
"At 1,200 nodes, node failure is routine — 1-2 nodes are down at any given time. We handle it with three layers: hinted handoff covers writes during short outages (under 3 hours), read repair fixes stale data as clients read it, and anti-entropy periodically syncs everything both missed. If a node is down longer than the tombstone grace period (10 days), it must do a full data rebuild to avoid resurrecting deleted data. Data is only permanently lost if all N=3 replicas fail simultaneously — we prevent this with rack-aware replica placement so a single physical failure can't take out all copies."