Cache Avalanche#
Thousands of cache keys expire at exactly the same time. Mass cache miss across the entire keyspace. The DB gets hammered simultaneously by every miss.
How it happens#
Cache Avalanche is Cache Stampede at scale — not one key expiring, but thousands or millions.
E-commerce site, Black Friday prep:
Midnight: bulk-load 50,000 product pages into cache, all TTL = 5 minutes
12:05am: all 50,000 keys expire simultaneously
→ every product page request → cache miss
→ 50,000 DB queries at once
→ DB collapses ✗
The cause is always the same: keys created in a batch with identical TTL values — they were born together, so they die together.
Other common triggers:
Cache restart → all keys lost → manually re-warm with identical TTLs → all expire together
Scheduled refresh job → refreshes all keys at the same time → same TTL → expire together
Deployment → new cache instance → warm all keys simultaneously → all expire together
How it differs from Stampede#
Cache Stampede → ONE key expires → burst of misses for that key → recovers quickly
Cache Avalanche → MANY keys expire simultaneously → sustained mass miss → DB collapse
Avalanche is more severe and more sustained. A stampede resolves once one request repopulates the key. An avalanche requires the DB to handle thousands of simultaneous queries across the entire key set.
Fix — TTL Jitter#
Add randomness to the TTL on bulk loads so expirations are spread out over a time window:
Without jitter:
All 50,000 keys → TTL = 300s → expire at exactly the same second
With jitter:
Each key → TTL = 300s + random(0, 60s)
Key A → TTL = 312s
Key B → TTL = 347s
Key C → TTL = 301s
Key D → TTL = 358s
...
Result: expirations spread across a 60-second window
→ ~833 misses/second instead of 50,000 at once ✓
→ DB sees a gentle, manageable trickle of cache misses
One line of code. Completely solves the problem.
# Without jitter — dangerous
cache.set(key, value, ttl=300)
# With jitter — safe
import random
cache.set(key, value, ttl=300 + random.randint(0, 60))
Other fixes#
Refresh-Ahead on the entire batch — instead of letting keys expire, proactively refresh them before expiry. More complex, but prevents misses entirely.
Staggered bulk-loading — when loading a large batch, insert keys in groups with different TTL offsets:
Keys 1-10,000: TTL = 300s
Keys 10,001-20,000: TTL = 310s
Keys 20,001-30,000: TTL = 320s
→ expirations naturally staggered across 30-second windows
Interview framing
"I'd add jitter to TTL values on bulk loads — instead of all keys expiring at the same time, each gets a random offset of ±30 seconds. This spreads expiry across a time window so the DB sees a trickle of misses rather than a simultaneous flood."