# Final Design — Notification System
## Architecture Diagram

```mermaid
flowchart TD
    CS[Calling Services - Instagram / Uber / Bank] --> AS[App Server]
    AS -->|scheduled_at is future + jitter| SDB[(Scheduler DB - Cassandra)]
    SDB -->|poll every 1s| SS[Scheduler Service]
    SS -->|due notifications| K
    AS -->|immediate notifications| RB[(Redis Buffer - Kafka downtime only)]
    RB --> K
    AS -->|normal path| K[Kafka Topics]
    K --> KP[notifications-push]
    K --> KS[notifications-sms]
    K --> KE[notifications-email]
    KP -->|18 partitions| PW[Push Workers x18]
    KS --> SW[SMS Workers]
    KE -->|84 partitions| EW[Email Workers x84]
    PW --> RC1[(Redis - Bloom Filter + Preferences Cache + Token Buckets)]
    SW --> RC1
    EW --> RC1
    RC1 -->|cache miss| PG[(PostgreSQL - Preferences)]
    PW -->|HTTP2 - 10 conn x 1000 streams| APNs[APNs / FCM]
    SW -->|account pool - 500 accounts| TW[Twilio]
    EW -->|IP pool - 84 IPs - batch 1K| SG[SendGrid]
    APNs --> UD[User Device]
    TW --> UD
    SG --> UD
    PW -->|DELIVERED / FAILED| CDB[(Cassandra - notifications + notification_content)]
    SW --> CDB
    EW --> CDB
    PW -->|on failure| PDLQ[notifications-push-dlq]
    SW -->|on failure| SDLQ[notifications-sms-dlq]
    EW -->|on failure| EDLQ[notifications-email-dlq]
    PDLQ --> RW[Retry Workers]
    SDLQ --> RW
    EDLQ --> RW
    RW -->|exponential backoff| APNs
    RW -->|exponential backoff| TW
    RW -->|exponential backoff| SG
    RW -->|all retries exhausted| FT[notifications-failed topic]
    FT --> CS
```

## Every Decision Explained
### Intake — App Server

The app server is the single entry point for all calling services. It validates incoming requests, checks `scheduled_at`, and routes accordingly:

- `scheduled_at = null` → publish directly to Kafka (immediate path)
- `scheduled_at` in the future → write to the Scheduler DB, with jitter applied (low-priority) or at the exact time (OTPs, alerts)
- Kafka temporarily down → buffer critical notifications in Redis, drain when Kafka recovers

The app server returns 202 Accepted immediately — it never waits for delivery. The `notification_id` in the response is the idempotency key for deduplication.
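The routing decision can be sketched in a few lines. Everything here is illustrative: the function name, the returned tuple, and the 0–300 s jitter window are assumptions, not the actual implementation.

```python
import random
import uuid

def route(scheduled_at, priority, kafka_up=True):
    """Decide the intake path for one notification (illustrative sketch).
    Returns (path, notification_id, effective_send_time)."""
    notification_id = str(uuid.uuid4())  # idempotency key returned to the caller
    if scheduled_at is None:
        # immediate path; fall back to the Redis buffer if Kafka is down
        return ("kafka" if kafka_up else "redis-buffer"), notification_id, None
    # hypothetical 0-300s jitter window for low-priority scheduled sends;
    # OTPs and alerts keep their exact requested time
    jitter = 0 if priority == "high" else random.randint(0, 300)
    return "scheduler-db", notification_id, scheduled_at + jitter
```

Note that the caller gets its `notification_id` back regardless of which path the message takes — that is what makes retries from the calling service safe.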
### Scheduling — Scheduler DB + Scheduler Service
Scheduled notifications go to Cassandra with minute_bucket as partition key and scheduled_at ASC as clustering key. The Scheduler Service polls every second, queries the current minute bucket for due notifications, and publishes them to Kafka.
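A minute-bucket partition key might be derived like this — the exact key format is an assumption, but the idea is that every notification due within the same minute lands in one Cassandra partition, so the poller reads a single partition per tick:

```python
from datetime import datetime, timezone

def minute_bucket(ts: float) -> str:
    """Partition key for the scheduler table: all notifications due in the
    same minute share one partition (key format is illustrative)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")
```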
Only one Scheduler Service instance dispatches at a time — controlled by a Redis distributed lock with 5-second TTL. Standby instances take over within 5 seconds of leader failure. Scheduled notifications are late on failure, never lost.
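The leader lock is a plain Redis `SET NX EX`. The sketch below assumes a redis-py-style client (`set(..., nx=True, ex=...)`, `get`, `delete`); the key name and the token-checked release are illustrative choices:

```python
import uuid

class DispatchLock:
    """Scheduler leader election via SET NX with a 5-second TTL (sketch)."""
    def __init__(self, redis, key="scheduler:leader", ttl=5):
        self.redis, self.key, self.ttl = redis, key, ttl
        self.token = str(uuid.uuid4())  # only the owner may release the lock

    def acquire(self):
        # NX: succeed only if no other instance holds the lock;
        # EX: the lock expires on its own, so a crashed leader is
        # replaced by a standby within the TTL
        return bool(self.redis.set(self.key, self.token, nx=True, ex=self.ttl))

    def release(self):
        # check-then-delete is a simplification; production code would use
        # a Lua script to make the compare-and-delete atomic
        if self.redis.get(self.key) == self.token:
            self.redis.delete(self.key)
```

Because the lock auto-expires, a failed leader never blocks dispatch for more than the TTL — which is exactly the "late, never lost" property claimed above.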
### Kafka — Three Topics, Per-Channel Isolation
One topic per channel prevents the slowest channel from poisoning the fastest:
| Topic | Partitions | Partition Key | Consumer Group |
|---|---|---|---|
| notifications-push | 18 | receiver user_id | push-workers |
| notifications-sms | varies | receiver user_id | sms-workers |
| notifications-email | 84 | receiver user_id | email-workers |
Using the receiver's `user_id` as partition key guarantees per-user ordering — a comment notification never arrives before the post notification it belongs to, for the same user. Partitioning by sender_id instead would create hot partitions (all 400M of Kylie Jenner's follower notifications on one partition).
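The hot-partition argument is easy to demonstrate with a stand-in for Kafka's key partitioner. Kafka actually uses murmur2; md5 here keeps the sketch stdlib-only, and the behavior it shows is the same:

```python
import hashlib
from collections import Counter

def partition_for(key: str, partitions: int = 18) -> int:
    """Stable hash of the message key modulo partition count
    (illustrative stand-in for Kafka's default partitioner)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % partitions

# Keying by receiver spreads a celebrity fan-out across all partitions;
# keying by sender would pin every one of those messages to one partition.
receiver_load = Counter(partition_for(f"user-{i}") for i in range(10_000))
sender_load = Counter(partition_for("kylie") for _ in range(10_000))
```

Same key → same partition is also what gives the per-user ordering guarantee: all of one receiver's notifications are consumed in sequence by one worker.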
### Redis — Three Jobs

Redis serves three distinct purposes:

1. **Preferences cache** — user opt-out settings, populated lazily from PostgreSQL. A cache miss falls back to PostgreSQL; on Redis failure, fall back to the default (assume opted-in).
2. **Bloom filter** — deduplication on `notification_id`. False positives (skipping a valid notification) are acceptable; false negatives (missing a duplicate) never happen. On Redis failure, skip deduplication and accept duplicates — delivery beats dedup.
3. **Token buckets** — per-Twilio-account and per-SendGrid-IP rate limiting. On Redis failure, the circuit breaker handles 429s.
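The token-bucket job can be sketched in a few lines. In production the bucket state lives in Redis (typically a Lua script making refill-and-take atomic) so that all workers share one bucket per account or IP; the class below is a single-process stand-in:

```python
import time

class TokenBucket:
    """Minimal token bucket: tokens refill continuously at `rate` per
    second up to `capacity`; a send consumes one token (sketch)."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_take(self, n: int = 1) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A bucket with `rate=1000` models one Twilio account's 1K/sec cap; a bucket with `rate=100` models one warmed SendGrid IP.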
### Push Workers — 18 Instances

Derived from APNs math:

- APNs latency: 50 ms per request → 20 requests/sec per HTTP/2 stream
- HTTP/2: 1,000 concurrent streams per connection
- 10 connections per worker → 10 × 1,000 × 20 = 200K push/sec per worker
- 3.5M push/sec ÷ 200K per worker = 17.5 → 18 workers

Each worker: consume a batch of 10K → Bloom filter check → preference check → async HTTP/2 to APNs/FCM → write DELIVERED to Cassandra → commit Kafka offset.
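That per-batch pipeline reads naturally as a loop. Every collaborator below (`bloom`, `prefs`, `apns`, `cassandra`, `consumer`) is an illustrative stub, and the real worker fires its APNs calls asynchronously over pooled HTTP/2 connections rather than one at a time:

```python
def process_batch(batch, bloom, prefs, apns, cassandra, consumer):
    """One push-worker iteration (sketch with stub collaborators)."""
    for msg in batch:
        if bloom.seen(msg["notification_id"]):        # dedup: drop duplicates
            continue
        if not prefs.allows(msg["user_id"], "push"):  # opt-out check
            continue
        apns.send(msg)                                # async HTTP/2 in reality
        cassandra.mark(msg["notification_id"], "DELIVERED")
    consumer.commit()  # commit the offset only after the whole batch is handled
```

Committing the offset last means a crashed worker replays the batch — the Bloom filter then absorbs the replayed duplicates.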
### SMS Workers — Account Pool
Twilio caps at 1K/sec per account. 500K SMS/sec would need 500 accounts — but intake filtering means only high-priority SMS (OTPs, fraud alerts) enters the SMS topic. Real volume is a small fraction of 500K/sec.
Workers round-robin across the Twilio account pool. On 429 → read Retry-After header, mark account blocked in Redis, route to next account. Load shedding: never overload healthy accounts to compensate for blocked ones.
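The round-robin-with-skip behavior can be sketched as follows; in production the blocked-until timestamps live in Redis so all workers see them, and a plain dict stands in here:

```python
from itertools import cycle

class AccountPool:
    """Round-robin over Twilio accounts, skipping blocked ones (sketch)."""
    def __init__(self, accounts):
        self._ring = cycle(accounts)
        self._size = len(accounts)
        self.blocked = {}  # account -> unblock_at (epoch seconds)

    def next_account(self, now: float):
        # try each account at most once per call
        for _ in range(self._size):
            acct = next(self._ring)
            if self.blocked.get(acct, 0) <= now:
                return acct
        return None  # every account rate-limited: shed load, don't overload

    def block(self, acct, retry_after: float, now: float):
        # called on a 429: honor the provider's Retry-After header
        self.blocked[acct] = now + retry_after
```

Returning `None` when the whole pool is blocked is the load-shedding rule from above made explicit: healthy accounts are never pushed past their cap to cover for blocked ones.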
### Email Workers — 84 Instances + IP Pool

Email's 2-minute SLO enables batching:

- 5M notifications/sec × 20% email = 1M emails/sec peak intake
- a 1M-email burst drained over the 120 s SLO window = 8,333 emails/sec sustained
- 8,333 ÷ 100 emails/sec per IP = 84 IPs

Workers batch 1,000 emails per SendGrid API call — roughly 9 API calls/sec total. Marketing and digest emails are filtered at intake — only transactional email enters the real-time pipeline.
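The batching itself is simple chunking; the sketch below shows where the ~9 calls/sec figure comes from (the real worker would also pin each batch to a warmed sending IP):

```python
def batch_emails(emails, batch_size=1000):
    """Chunk one second's worth of queued emails into
    SendGrid-sized batches (sketch)."""
    return [emails[i:i + batch_size] for i in range(0, len(emails), batch_size)]
```

One sustained second of traffic — 8,333 emails — yields eight full batches plus a 333-email remainder: nine API calls.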
### Cassandra — Two Tables
| Table | Partition Key | Clustering Key | Purpose |
|---|---|---|---|
| notifications | user_id | created_at DESC | Status tracking, hot queries |
| notification_content | notification_id | — | Full content, cold queries |
300 nodes (100 shards × 3 replicas) for 10M writes/sec (5M inserts + 5M status updates).
### PostgreSQL — Preferences
Preferences need strong consistency — opt-out must be immediately visible. Low write volume (users rarely change settings). ACID guarantees. Cached in Redis; PostgreSQL only sees cache-miss reads.
### DLQ + Retry Workers

Workers publish failures explicitly to channel-specific DLQ topics — unlike SQS or RabbitMQ, Kafka has no built-in dead-letter mechanism. Retry workers consume from the DLQs with exponential backoff (1s, 2s, 4s — 3 retries). Once all retries are exhausted, they mark the notification FAILED in Cassandra and publish it to the notifications-failed topic. Calling services subscribe to notifications-failed and react — fall back to another channel, alert, or log.
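The retry policy above can be sketched end to end. The collaborators (`send`, `mark_failed`, `publish_failed`, `sleep`) are illustrative stubs, and a real retry worker would schedule the delays via delay topics or timers rather than a blocking sleep:

```python
def retry_schedule(base=1.0, retries=3):
    """Exponential backoff delays for a DLQ'd notification: 1s, 2s, 4s."""
    return [base * 2 ** i for i in range(retries)]

def handle_dlq_message(msg, send, mark_failed, publish_failed, sleep):
    """Retry worker sketch: re-attempt with backoff, then give up.
    `send` raises on provider failure."""
    for delay in retry_schedule():
        sleep(delay)  # stand-in for a delay topic / timer
        try:
            send(msg)
            return True
        except Exception:
            continue
    mark_failed(msg["notification_id"])  # final FAILED status in Cassandra
    publish_failed(msg)                  # hand off to notifications-failed
    return False
```

The `notifications-failed` publish is what closes the loop with calling services: the pipeline never silently drops a notification, it hands the decision back to the owner.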
## Scale Summary
| Component | Count | Throughput |
|---|---|---|
| App Servers | Horizontally scaled | 5M intake/sec |
| Kafka Brokers | ~10 | 5M messages/sec |
| Push Workers | 18 | 3.5M push/sec |
| SMS Workers | Twilio-bound | 500K SMS/sec (intake filtered) |
| Email Workers | 84 | 8,333 emails/sec sustained |
| Cassandra Nodes | 300 | 10M writes/sec |
| Redis | Clustered | Preferences + bloom filter + token buckets |
| PostgreSQL | Replicated | Preferences (low write) |
| Twilio Accounts | 500 | 500K SMS/sec aggregate |
| SendGrid IPs | 84 | 8,333 emails/sec |