
Handling Node Failure

A cache node goes down. What happens to the keys it owned? What happens to the traffic?


Without replicas — graceful degradation

With consistent hashing, the keys that belonged to the failed node get routed to the next node clockwise on the ring:

Before failure:
  Node A → keys 1–25M
  Node B → keys 25M–50M  ← goes down
  Node C → keys 50M–75M

After failure:
  Node A → keys 1–25M
  Node C → keys 25M–75M  ← absorbs Node B's slice
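The remap above can be sketched with a minimal hash ring. This is an illustrative toy (one hash point per node, MD5 onto a 32-bit ring), not a production implementation: removing a node changes ownership only for the keys that node held, and those keys land on its clockwise successor.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Map a string to a point on a 2**32 ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        # First node clockwise from the key's position on the ring.
        hashes = [p for p, _ in self.points]
        i = bisect.bisect_right(hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

    def remove(self, node: str):
        self.points = [(p, n) for p, n in self.points if n != node]

ring = Ring(["A", "B", "C"])
keys = [f"key:{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.remove("B")                  # Node B goes down
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
assert all(before[k] == "B" for k in moved)   # only B's keys moved
assert all(after[k] != "B" for k in keys)     # nothing routes to B anymore
```

Keys owned by A and C keep their owner; only B's slice is reassigned, which is the whole point of consistent hashing.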

Immediate effect:
  → keys in the 25M–50M range → cache miss (Node B is gone, keys lost)
  → those requests hit DB
  → DB gradually repopulates Node C with those keys
  → after a few minutes, hit rate recovers

Blast radius:
  → only Node B's keys are affected (~1/N of keyspace)
  → Node A's keys: completely unaffected ✓
  → Node C's keys: completely unaffected ✓

Consistent hashing limits the blast radius of a single node failure to that node's slice of the keyspace. With naive modulo placement (`hash(key) % N`), removing a node changes N itself, so roughly (N-1)/N of all keys remap: nearly every key moves.
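The repopulation step described above is just the standard cache-aside read path. A minimal sketch (the `DictCache` class and `db` dict are stand-ins for a real cache client and database, not from the text):

```python
class DictCache:
    """Minimal in-memory stand-in for a cache client (hypothetical;
    real code would use e.g. a Redis client)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value, ttl):
        self.store[key] = value   # TTL ignored in this sketch

db = {"user:1": "alice"}   # stand-in for the backing database
cache = DictCache()        # empty, like the node that absorbed B's range

def get(key, ttl=300):
    value = cache.get(key)
    if value is None:               # miss: e.g. key lost when its node died
        value = db[key]             # request falls through to the DB
        cache.set(key, value, ttl)  # DB read repopulates the new owner
    return value

get("user:1")                       # first read misses and repopulates
assert cache.get("user:1") == "alice"
```

Each miss costs one DB read, which is why the hit rate recovers gradually as live traffic touches each key in the lost range.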


With replicas — seamless failover

If each node has a replica, the replica is promoted to primary when the primary fails:

Node B dies
→ Redis Sentinel detects failure (~10 seconds)
→ Node B's replica promoted to primary
→ keys in 25M–50M range: still available from the replica ✓
→ no cache miss at all ✓
→ brief (~10-30s) window of reduced write availability while failover completes

With replication, a single node failure is invisible to users — reads continue serving from the replica.
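The ~10 second detection window is not fixed; it is governed by Sentinel configuration. A minimal `sentinel.conf` sketch with illustrative values (the master name `mymaster`, host, port, and quorum are assumptions):

```conf
# sentinel.conf (illustrative values)
sentinel monitor mymaster 127.0.0.1 6379 2        # quorum of 2 sentinels must agree
sentinel down-after-milliseconds mymaster 10000   # ~10 s before flagging the primary down
sentinel failover-timeout mymaster 30000          # upper bound on the failover process
```

Lowering `down-after-milliseconds` shrinks the unavailability window but increases the risk of false-positive failovers on transient network blips.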


The thundering herd on recovery

When a failed node comes back online, it starts empty. The keys that were routed to other nodes during the outage now remap back:

Node B recovers → consistent hashing routes its keys back to it
→ Node B is empty → cache miss for every key in 25M–50M range
→ all those requests hit DB simultaneously
→ mini stampede

Fix: bring the recovered node back gradually, using a warm-up period or shadow mode before taking live traffic.
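One way to implement the gradual return is a probabilistic traffic ramp: during a warm-up window, route only a growing fraction of the recovered node's keys back to it, leaving the rest on the node that absorbed them, where the data is still warm. A sketch under assumed names (`route`, the 300 s window, and the owner arguments are all illustrative):

```python
import random
import time

def route(key, ring_owner, fallback_owner, recovered_at, warmup_s=300):
    """Pick which node serves `key` while `ring_owner` warms back up.

    During the warm-up window, a linearly growing fraction of requests
    goes to the recovered node; the remainder stays on the node that
    absorbed its keys during the outage.
    """
    elapsed = time.time() - recovered_at
    fraction = min(1.0, elapsed / warmup_s)   # ramps 0 -> 1 over warmup_s
    if random.random() < fraction:
        return ring_owner        # recovered node takes traffic gradually
    return fallback_owner        # absorbing node still serves the rest

# Well past the warm-up window, all traffic is back on the ring owner.
assert route("k", "B", "C", recovered_at=time.time() - 1000) == "B"
```

The ramp spreads the recovered node's cold misses over the warm-up period instead of taking them all at once, which is what turns the stampede into a trickle.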


Summary

No replicas + consistent hashing:
  Node failure → ~1/N of keys become misses → DB absorbs that traffic briefly
  Keys repopulate naturally as requests come in

With replicas:
  Node failure → replica promotes → seamless, zero cache misses
  Better but costs double the memory

Recovery:
  Empty node comes back → keys remap back → another wave of misses
  Fix: warm up before returning to live traffic