Scheduler Down — Fault Isolation#

How It Propagates#

The scheduler service polls the Scheduler DB every second for due notifications and publishes them to Kafka. If the scheduler crashes, polling stops. Scheduled notifications continue accumulating in the Scheduler DB — writing is unaffected since the app server still writes to it. But nothing is reading from it and dispatching to Kafka.

Users with scheduled notifications miss their delivery time. A birthday notification scheduled for 9:00am sits in the Scheduler DB undelivered while the scheduler is down.

Impact on immediate notifications: zero. Immediate notifications bypass the Scheduler DB entirely — they go straight from the app server to Kafka. Only scheduled notifications are affected.

Detection#

Scheduler service heartbeat missing (scheduler emits a heartbeat to a monitoring endpoint every 5 seconds)
Scheduler DB row count growing with no corresponding Kafka publish rate
Scheduled notification delivery lag metric climbing
Alert fires to on-call

Containment — Redis Distributed Lock and Standby Instances#

The scheduler runs as multiple instances, but only one dispatches at a time — controlled by a Redis distributed lock:

Redis SET scheduler_lock <instance_id> NX EX 5
→ only the instance that wins the lock runs the poll-and-dispatch loop
→ lock TTL: 5 seconds
→ if leader crashes, lock expires in 5 seconds
→ standby instance wins the lock and takes over

Maximum downtime: 5 seconds — the TTL of the Redis lock. After that, a standby instance takes over automatically with no manual intervention.

Recovery — Catch-Up Dispatch#

When the new scheduler instance takes over, it queries the Scheduler DB for all notifications where scheduled_at <= now that were never dispatched — these are the notifications that missed their window while the leader was down.

The new scheduler publishes all of them to Kafka immediately. Workers process them through the normal pipeline — preference check, deduplication, send to external provider.

Scheduler was down for 10 minutes:
  → 10 minutes of scheduled notifications sitting in Scheduler DB
  → New scheduler queries: SELECT * WHERE scheduled_at <= now
  → Publishes all missed notifications to Kafka
  → Workers drain them within seconds

Scheduled Notifications Are Late, Not Lost#

If the scheduler is down for 10 minutes, notifications scheduled during that window are delivered ~10 minutes late. They are never lost — the Scheduler DB is the durable store. As long as the Scheduler DB is up, notifications survive a scheduler crash.

The lateness is bounded by the outage duration + the 5-second lock TTL. For most scheduled notification types (birthdays, reminders, marketing) a 10-minute delay is imperceptible. For time-sensitive scheduled notifications (OTPs with a specific send time), the jitter rules already applied at intake mean exact timing was never guaranteed anyway.

The Scheduler DB is the durability guarantee

The scheduler service is stateless — it reads from the DB and publishes to Kafka. Crashing and restarting loses no data because all state lives in the Scheduler DB. A stateless service with durable backing store can always recover cleanly.

What If the Scheduler DB Is Also Down?#

If both the scheduler service and the Scheduler DB are down simultaneously, scheduled notifications cannot be dispatched or even written. New scheduled notification requests from the app server fail at write time.

This is treated as a catastrophic dual failure — the app server returns 503 for scheduled notification requests, and calling services are responsible for retrying. Immediate notifications are unaffected.

Summary#

Failure	Impact	Recovery time
Scheduler service crashes	Scheduled notifications delayed	5 seconds (Redis lock TTL)
Scheduler service + Redis down	Scheduled notifications delayed until Redis recovers	Minutes
Scheduler DB down	Scheduled notifications lost if not yet written	Catastrophic — treat as P0

Immediate notifications are never affected by scheduler failures

The scheduler only handles notifications with a future scheduled_at. Immediate notifications go directly from app server to Kafka — the scheduler is not in their path at all.