Cache Penetration#

Requests for keys that don't exist in the DB. Every request is a cache miss. Every request hits the DB. DB returns null. Nothing gets cached. The cycle repeats forever.


Why it's different from a normal cache miss#

A normal cache miss eventually resolves itself:

Normal miss:
  → cache miss → DB returns data → store in cache
  → next request → cache hit ✓
  → self-healing

Cache penetration never resolves:

Penetration:
  GET /user/99999999  (user doesn't exist)
  → cache miss
  → DB returns null
  → nothing to cache  ← this is the key difference
  → next request → cache miss again
  → DB again → null again → nothing cached
  → infinite loop

At 1,000 requests/second for non-existent keys, that's 1,000 DB queries/second all returning null. The DB does real work (table scan, index lookup) for zero useful result.
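The loop above can be sketched with a naive cache-aside read path. This is a minimal illustration using plain dicts in place of a real cache and DB (the names `get_user`, `db_queries`, and the key format are hypothetical):

```python
cache = {}
db = {"user:1": {"name": "Alice"}}  # user:99999999 does not exist
db_queries = 0

def get_user(key):
    """Naive cache-aside read: vulnerable to penetration."""
    global db_queries
    if key in cache:            # cache hit
        return cache[key]
    db_queries += 1             # cache miss -> real DB query
    value = db.get(key)         # DB returns None for missing keys
    if value is not None:
        cache[key] = value      # only real data gets cached
    return value                # None is never cached -> penetration

for _ in range(1000):
    get_user("user:99999999")
print(db_queries)               # 1000: every single request hit the DB
```

Because the `if value is not None` guard skips caching, the missing key never populates the cache and every repeat request falls through to the DB.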


Common causes#

Malicious attack    → attacker deliberately queries non-existent IDs
                      to exhaust DB connections
Bugs in code        → generating invalid IDs, wrong join keys
Deleted records     → data existed once but was deleted;
                      cache wasn't invalidated and now DB has nothing

Fix 1 — Cache the Null#

DB returns null for user:99999999
→ cache.set("user:99999999", NULL, TTL=60s)   ← cache the absence

Next 1,000 requests for user:99999999:
→ cache hit → return null immediately ✓
→ DB sees zero queries ✓

Keep the TTL short on null entries. If the record gets created later, the null expires and real data gets cached on the next request.

The risk: memory consumption from null entries. If an attacker queries millions of different non-existent IDs, you fill your cache with null entries. Set a very short TTL (30-60 seconds) and monitor cache memory usage.
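A sketch of the fix, again with dicts standing in for the cache and DB. The sentinel object distinguishes "cached null" from "not in cache", and the TTL values are illustrative, not recommendations:

```python
import time

NULL = object()     # sentinel: "we checked, the record doesn't exist"
NULL_TTL = 60       # seconds; keep null entries short-lived
DATA_TTL = 3600

cache = {}          # key -> (value, expires_at)
db = {"user:1": {"name": "Alice"}}
db_queries = 0

def get_user(key):
    global db_queries
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        value = entry[0]
        return None if value is NULL else value        # hit, real or null
    db_queries += 1
    value = db.get(key)
    if value is None:
        cache[key] = (NULL, time.time() + NULL_TTL)    # cache the absence
    else:
        cache[key] = (value, time.time() + DATA_TTL)
    return value

for _ in range(1000):
    get_user("user:99999999")
print(db_queries)   # 1: only the first request reached the DB
```

With Redis, the equivalent is storing a sentinel string with a short expiry (e.g. `SET user:99999999 "__null__" EX 60`) so the absence expires on its own.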


Fix 2 — Bloom Filter (better for attacks)#

A Bloom filter answers: "has this key ever been inserted into the DB?"

Request arrives for user:99999999
→ check Bloom filter: "has user:99999999 ever been inserted?"
→ NO (definitely not) → return 404 immediately ✓
   → cache never touched, DB never touched

→ YES (or maybe) → proceed normally to cache → DB

Why Bloom filters work here:

- No false negatives — if the filter says NO, the key definitely doesn't exist. Safe to reject immediately.
- Possible false positives — the filter might say YES for a key that doesn't exist, but the rate is low and tunable (typically < 1%).
- Space-efficient — a Bloom filter for 100 million keys at a 1% false-positive rate fits in roughly 120MB. Storing those same keys in a hash set would take gigabytes.

Put the Bloom filter in front of the cache layer. Non-existent keys never reach cache or DB at all.
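A minimal Bloom filter gate might look like the sketch below (bit array plus k hash positions derived from SHA-256). The sizes are illustrative and untuned; in production you'd use a library such as RedisBloom rather than hand-rolling this:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "probably present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# Populate with every key ever inserted into the DB.
bf = BloomFilter()
db = {"user:1": {"name": "Alice"}}
for key in db:
    bf.add(key)

def lookup(key):
    if not bf.might_contain(key):
        return None   # definitely absent: cache and DB never touched
    # Otherwise fall through to the normal cache -> DB path (not shown).
    return db.get(key)
```

Every insert into the DB must also call `bf.add`, and because standard Bloom filters don't support deletion, the filter needs a periodic rebuild from the source of truth.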

Bloom filters in production

Cassandra, HBase, and PostgreSQL all use Bloom filters internally to avoid disk lookups for non-existent keys. The pattern is well-established.


Which fix to use#

Cache null values   → simple, works for small datasets, short-lived fix
                      vulnerable to memory exhaustion under attack
Bloom filter        → better under attack, scales to billions of keys
                      more infrastructure, requires periodic rebuilding
                      as new records are inserted

For a system under active attack or with adversarial input, Bloom filter is the correct answer.