
Hot Partition — Edge Cases

The edge cases that break the happy path

The salting mechanism works cleanly when everything is running. These are the scenarios where something goes wrong — a user was offline when the conversation was salted, Redis restarts, or a new app server comes up cold.


Edge case 1 — User was offline during salting

Scenario:
- conv_abc123 was normal (N=1) when Bob went offline
- While Bob was offline, the conversation got hot — N bumped to 4
- Messages were written to conv_abc123#0 through conv_abc123#3
- Bob reconnects
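To make the scenario concrete, here is a minimal sketch of how salted writes could land in conv_abc123#0 through #3. The registry dict, function name, and hash choice are illustrative assumptions, not the system's actual implementation:

```python
import hashlib

registry = {"conv_abc123": 4}  # bumped to N=4 while Bob was offline

def salted_partition_key(conversation_id: str, message_id: str) -> str:
    """Pick one of the N salted partitions for a write."""
    n = registry.get(conversation_id, 1)  # unknown conversations default to N=1
    if n == 1:
        return conversation_id  # normal conversations keep the plain key
    salt = int(hashlib.md5(message_id.encode()).hexdigest(), 16) % n
    return f"{conversation_id}#{salt}"

key = salted_partition_key("conv_abc123", "msg_001")  # e.g. "conv_abc123#2"
```

Any stable hash of the message ID works here; the only requirement is that writes spread evenly across the N salted partitions.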

What happens without the fix: Bob's client requests chat history. If the app server handling Bob's request doesn't know about the salting, it queries only conv_abc123 (N=1) and finds nothing — all messages were written to the salted partitions. Bob sees an empty chat or missing messages.

The fix: The registry lookup happens at the app server on every read — not at the client. Bob's client just asks for conv_abc123 history. The app server:

1. GET registry[conv_abc123] → max_N = 4
2. Scatter-gather across conv_abc123#0 through #3
3. Return complete history to Bob
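The three steps above can be sketched as follows, with plain dicts standing in for Redis and DynamoDB (names and data are illustrative):

```python
registry = {"conv_abc123": 4}  # Redis stand-in: conversation_id -> max_N

# DynamoDB stand-in: partition key -> messages
partitions = {
    "conv_abc123#0": ["hey"],
    "conv_abc123#1": ["you there?"],
    "conv_abc123#2": [],
    "conv_abc123#3": ["ok, later"],
}

def fetch_history(conversation_id: str) -> list:
    max_n = registry.get(conversation_id, 1)   # step 1: registry lookup
    if max_n == 1:
        return partitions.get(conversation_id, [])
    messages = []
    for i in range(max_n):                     # step 2: scatter-gather
        messages.extend(partitions.get(f"{conversation_id}#{i}", []))
    return messages                            # step 3: complete history
    # a real implementation would sort the merged messages by timestamp

history = fetch_history("conv_abc123")
```

Because the lookup happens inside fetch_history on every call, the client never needs to know whether the conversation was salted.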

Bob being offline during the salting is irrelevant — the registry is the source of truth, and it's always consulted on every read. Bob's client is completely unaware of salting.


Edge case 2 — Redis restarts

Scenario: Redis holding the hot partition registry crashes and restarts.

What happens:
- Redis with AOF replays the log and recovers all registry entries
- Recovery time depends on AOF file size — for a registry with millions of entries, replay takes seconds to minutes
- During recovery, app servers get null from registry lookups → treat all conversations as N=1 → queries miss salted partitions

The fix — warm-up from DynamoDB: On Redis restart, before serving traffic, the hot partition service rebuilds the registry from a backup:

Option A: DynamoDB backup table
  → On every registry update, also write to a DynamoDB table: conversation_id → max_N
  → On Redis restart: scan DynamoDB backup → repopulate Redis
  → Slower recovery but zero data loss

Option B: AOF replay (no backup needed)
  → Redis replays AOF on restart automatically
  → If AOF is intact, full recovery with no manual intervention
  → AOF corruption is rare but possible

For production, use both — AOF as the fast path, DynamoDB backup as the fallback if AOF is corrupted.
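Option A's warm-up step can be sketched in a few lines, with dicts standing in for the DynamoDB backup table and the freshly restarted Redis (names and data are assumptions):

```python
# Backup table: conversation_id -> max_N, written on every registry update
dynamodb_backup = {"conv_abc123": 4, "conv_def456": 8}
redis_registry = {}  # empty after a restart with a corrupted AOF

def warm_up(redis_store: dict, backup: dict) -> int:
    """Repopulate the registry from the backup table; returns entry count."""
    for conversation_id, max_n in backup.items():  # full backup-table scan
        redis_store[conversation_id] = max_n
    return len(redis_store)

warm_up(redis_registry, dynamodb_backup)
```

The key property is that the service finishes this scan before serving traffic, so app servers never see null lookups for conversations that were salted before the crash.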

During the recovery window: App servers should detect Redis unavailability and fall back to reading from the DynamoDB backup table directly. Slower (~5-10ms instead of ~1ms) but correct.
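A sketch of that fallback lookup, with an exception standing in for Redis unavailability (RedisDown, redis_get, and the backup dict are illustrative names):

```python
class RedisDown(Exception):
    """Raised when the registry in Redis is unreachable."""

dynamodb_backup = {"conv_abc123": 4}  # conversation_id -> max_N

def redis_get(conversation_id: str) -> int:
    raise RedisDown  # simulate Redis still recovering

def lookup_max_n(conversation_id: str) -> int:
    try:
        return redis_get(conversation_id)               # fast path: ~1ms
    except RedisDown:
        return dynamodb_backup.get(conversation_id, 1)  # ~5-10ms but correct

n = lookup_max_n("conv_abc123")
```

In production the fallback would also emit a metric, since sustained fallback traffic means Redis has been down longer than expected.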


Edge case 3 — New app server comes up cold

Scenario: A new app server is added to the fleet (auto-scaling during peak traffic). It has no local state — no counters, no cached registry entries.

What happens:
- Registry lookups: the new server queries Redis fresh on every request → correct, no issue
- Local WPS counters: start at 0 → the server will under-detect hot conversations until counters build up

The fix: The local WPS counter is a detection mechanism, not a routing mechanism. Under-detection on a new server means it takes slightly longer to detect that a conversation is hot — but the registry already has the correct max_N from when the conversation was first detected as hot by other servers. Writes and reads are routed correctly regardless. The new server's detection will catch up within a few seconds as traffic flows through it.
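One way to picture the local counter is a per-conversation one-second sliding window; the class below is a hypothetical sketch (only the 1,000 WPS threshold comes from this document):

```python
import time
from collections import defaultdict, deque

HOT_THRESHOLD_WPS = 1000  # detection threshold mentioned in the doc

class WpsCounter:
    """Counts writes per conversation over a 1-second sliding window."""
    def __init__(self):
        self.windows = defaultdict(deque)  # conversation_id -> write timestamps

    def record_write(self, conversation_id: str, now=None) -> bool:
        """Record a write; return True if the conversation looks hot."""
        now = time.monotonic() if now is None else now
        window = self.windows[conversation_id]
        window.append(now)
        while window and window[0] <= now - 1.0:  # drop writes older than 1s
            window.popleft()
        return len(window) > HOT_THRESHOLD_WPS

counter = WpsCounter()  # a fresh (cold) server starts at zero everywhere
```

A new server's windows are empty, so it under-reports for at most one window's worth of traffic — which is why detection lag here never affects routing correctness.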


Edge case 4 — Hot partition service crashes

Scenario: The service that consumes from the Redis Stream and updates the registry goes down.

What happens:
- Existing registry entries remain intact — max_N doesn't decrease, existing salting still works
- New hot conversations are not detected — their N stays at 1 even as they exceed 1,000 WPS
- DynamoDB throttling begins for newly hot conversations

The fix:
- Run multiple instances of the hot partition service — if one crashes, others continue consuming from the Redis Stream
- Redis Stream retains unprocessed events — when the service recovers, it replays missed events and catches up
- Add alerting on DynamoDB throttle metrics — a sudden spike in ProvisionedThroughputExceeded errors signals the hot partition service may be down
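Replay safety falls out of the registry update being a ratchet: since max_N never decreases, applying the same Stream event twice is harmless. A minimal sketch of that idempotent update (function and event shape are assumptions):

```python
registry = {}  # conversation_id -> max_N

def apply_event(conversation_id: str, proposed_n: int) -> int:
    """Apply a 'bump N' event; replay-safe because N only ratchets up."""
    current = registry.get(conversation_id, 1)
    registry[conversation_id] = max(current, proposed_n)
    return registry[conversation_id]

apply_event("conv_abc123", 4)  # first processing
apply_event("conv_abc123", 4)  # replay after a crash: no change
apply_event("conv_abc123", 2)  # stale event: N does not decrease
```

This is what lets a recovered instance blindly re-consume the Stream's retained events without corrupting the registry.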


Summary

| Edge case | Risk | Fix |
| --- | --- | --- |
| User offline during salting | Missing messages on reconnect | Registry always consulted on read — client unaware of salting |
| Redis restart | Registry unavailable during recovery | AOF replay (fast) + DynamoDB backup (fallback) |
| New app server cold start | Slow hot detection | Doesn't affect routing — registry in Redis is authoritative |
| Hot partition service crash | New hot conversations not detected | Multiple instances + Redis Stream replay on recovery |