# Final Design — Notification System
## Architecture Diagram

```mermaid
flowchart TD
    CS[Calling Services - Instagram / Uber / Bank] --> AS[App Server]
    AS -->|scheduled_at is future + jitter| SDB[(Scheduler DB - Cassandra)]
    SDB -->|poll every 1s| SS[Scheduler Service]
    SS -->|due notifications| K
    AS -->|immediate notifications| RB[(Redis Buffer - Kafka downtime only)]
    RB --> K
    AS -->|normal path| K[Kafka Topics]
    K --> KP[notifications-push]
    K --> KS[notifications-sms]
    K --> KE[notifications-email]
    KP -->|18 partitions| PW[Push Workers x18]
    KS --> SW[SMS Workers]
    KE -->|84 partitions| EW[Email Workers x84]
    PW --> RC1[(Redis - Bloom Filter + Preferences Cache + Token Buckets)]
    SW --> RC1
    EW --> RC1
    RC1 -->|cache miss| PG[(PostgreSQL - Preferences)]
    PW -->|HTTP2 - 10 conn x 1000 streams| APNs[APNs / FCM]
    SW -->|account pool - 500 accounts| TW[Twilio]
    EW -->|IP pool - 84 IPs - batch 1K| SG[SendGrid]
    APNs --> UD[User Device]
    TW --> UD
    SG --> UD
    PW -->|DELIVERED / FAILED| CDB[(Cassandra - notifications + notification_content)]
    SW --> CDB
    EW --> CDB
    PW -->|on failure| PDLQ[notifications-push-dlq]
    SW -->|on failure| SDLQ[notifications-sms-dlq]
    EW -->|on failure| EDLQ[notifications-email-dlq]
    PDLQ --> RW[Retry Workers]
    SDLQ --> RW
    EDLQ --> RW
    RW -->|exponential backoff| APNs
    RW -->|exponential backoff| TW
    RW -->|exponential backoff| SG
    RW -->|all retries exhausted| FT[notifications-failed topic]
    FT --> CS
```

## Every Decision Explained
### Intake — App Server

The app server is the single entry point for all calling services. It validates incoming requests, checks `scheduled_at`, and routes accordingly:

- `scheduled_at = null` → publish directly to Kafka (immediate path)
- `scheduled_at` in the future → write to the Scheduler DB, with jitter applied (low-priority) or at the exact time (OTPs, alerts)
- Kafka temporarily down → buffer critical notifications in Redis, drain when Kafka recovers

The app server returns 202 Accepted immediately — it never waits for delivery. The `notification_id` in the response is the idempotency key for deduplication.
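The routing decision can be sketched in a few lines. Everything here is illustrative: the function name, the returned tuple, and the 0–300 s jitter window are assumptions, not the actual implementation.

```python
import random
import uuid

def route(scheduled_at, priority, kafka_up=True):
    """Decide the intake path for one notification (illustrative sketch).
    Returns (path, notification_id, effective_send_time)."""
    notification_id = str(uuid.uuid4())  # idempotency key returned to the caller
    if scheduled_at is None:
        # immediate path; fall back to the Redis buffer if Kafka is down
        return ("kafka" if kafka_up else "redis-buffer"), notification_id, None
    # hypothetical 0-300s jitter window for low-priority scheduled sends;
    # OTPs and alerts keep their exact requested time
    jitter = 0 if priority == "high" else random.randint(0, 300)
    return "scheduler-db", notification_id, scheduled_at + jitter
```

Note that the caller gets its `notification_id` back regardless of which path the message takes — that is what makes retries from the calling service safe.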
### Scheduling — Scheduler DB + Scheduler Service
Scheduled notifications go to Cassandra with minute_bucket as partition key and scheduled_at ASC as clustering key. The Scheduler Service polls every second, queries the current minute bucket for due notifications, and publishes them to Kafka.
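A minute-bucket partition key might be derived like this — the exact key format is an assumption, but the idea is that every notification due within the same minute lands in one Cassandra partition, so the poller reads a single partition per tick:

```python
from datetime import datetime, timezone

def minute_bucket(ts: float) -> str:
    """Partition key for the scheduler table: all notifications due in the
    same minute share one partition (key format is illustrative)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")
```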
Only one Scheduler Service instance dispatches at a time — controlled by a Redis distributed lock with 5-second TTL. Standby instances take over within 5 seconds of leader failure. Scheduled notifications are late on failure, never lost.
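The leader lock is a plain Redis `SET NX EX`. The sketch below assumes a redis-py-style client (`set(..., nx=True, ex=...)`, `get`, `delete`); the key name and the token-checked release are illustrative choices:

```python
import uuid

class DispatchLock:
    """Scheduler leader election via SET NX with a 5-second TTL (sketch)."""
    def __init__(self, redis, key="scheduler:leader", ttl=5):
        self.redis, self.key, self.ttl = redis, key, ttl
        self.token = str(uuid.uuid4())  # only the owner may release the lock

    def acquire(self):
        # NX: succeed only if no other instance holds the lock;
        # EX: the lock expires on its own, so a crashed leader is
        # replaced by a standby within the TTL
        return bool(self.redis.set(self.key, self.token, nx=True, ex=self.ttl))

    def release(self):
        # check-then-delete is a simplification; production code would use
        # a Lua script to make the compare-and-delete atomic
        if self.redis.get(self.key) == self.token:
            self.redis.delete(self.key)
```

Because the lock auto-expires, a failed leader never blocks dispatch for more than the TTL — which is exactly the "late, never lost" property claimed above.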
### Kafka — Three Topics, Per-Channel Isolation
One topic per channel prevents the slowest channel from poisoning the fastest:
| Topic | Partitions | Partition Key | Consumer Group |
|---|---|---|---|
| notifications-push | 18 | receiver user_id | push-workers |
| notifications-sms | varies | receiver user_id | sms-workers |
| notifications-email | 84 | receiver user_id | email-workers |
Using the receiver's `user_id` as partition key guarantees per-user ordering — a comment notification never arrives before the post notification it belongs to, for the same user. Partitioning by sender_id instead would create hot partitions (all 400M of Kylie Jenner's follower notifications on one partition).
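The hot-partition argument is easy to demonstrate with a stand-in for Kafka's key partitioner. Kafka actually uses murmur2; md5 here keeps the sketch stdlib-only, and the behavior it shows is the same:

```python
import hashlib
from collections import Counter

def partition_for(key: str, partitions: int = 18) -> int:
    """Stable hash of the message key modulo partition count
    (illustrative stand-in for Kafka's default partitioner)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % partitions

# Keying by receiver spreads a celebrity fan-out across all partitions;
# keying by sender would pin every one of those messages to one partition.
receiver_load = Counter(partition_for(f"user-{i}") for i in range(10_000))
sender_load = Counter(partition_for("kylie") for _ in range(10_000))
```

Same key → same partition is also what gives the per-user ordering guarantee: all of one receiver's notifications are consumed in sequence by one worker.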
### Redis — Three Jobs

Redis serves three distinct purposes:

1. **Preferences cache** — user opt-out settings, populated lazily from PostgreSQL. A cache miss falls back to PostgreSQL; on Redis failure, fall back to the default (assume opted-in).
2. **Bloom filter** — deduplication on `notification_id`. False positives (skipping a valid notification) are acceptable; false negatives (missing a duplicate) never happen. On Redis failure, skip deduplication and accept duplicates — delivery beats dedup.
3. **Token buckets** — per-Twilio-account and per-SendGrid-IP rate limiting. On Redis failure, the circuit breaker handles 429s.
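The token-bucket job can be sketched in a few lines. In production the bucket state lives in Redis (typically a Lua script making refill-and-take atomic) so that all workers share one bucket per account or IP; the class below is a single-process stand-in:

```python
import time

class TokenBucket:
    """Minimal token bucket: tokens refill continuously at `rate` per
    second up to `capacity`; a send consumes one token (sketch)."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_take(self, n: int = 1) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A bucket with `rate=1000` models one Twilio account's 1K/sec cap; a bucket with `rate=100` models one warmed SendGrid IP.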
### Push Workers — 18 Instances

Derived from APNs math:

- APNs latency: 50 ms per request → 20 requests/sec per HTTP/2 stream
- HTTP/2: 1,000 concurrent streams per connection
- 10 connections per worker → 10 × 1,000 × 20 = 200K push/sec per worker
- 3.5M push/sec ÷ 200K per worker = 17.5 → 18 workers

Each worker: consume a batch of 10K → Bloom filter check → preference check → async HTTP/2 to APNs/FCM → write DELIVERED to Cassandra → commit Kafka offset.
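That per-batch pipeline reads naturally as a loop. Every collaborator below (`bloom`, `prefs`, `apns`, `cassandra`, `consumer`) is an illustrative stub, and the real worker fires its APNs calls asynchronously over pooled HTTP/2 connections rather than one at a time:

```python
def process_batch(batch, bloom, prefs, apns, cassandra, consumer):
    """One push-worker iteration (sketch with stub collaborators)."""
    for msg in batch:
        if bloom.seen(msg["notification_id"]):        # dedup: drop duplicates
            continue
        if not prefs.allows(msg["user_id"], "push"):  # opt-out check
            continue
        apns.send(msg)                                # async HTTP/2 in reality
        cassandra.mark(msg["notification_id"], "DELIVERED")
    consumer.commit()  # commit the offset only after the whole batch is handled
```

Committing the offset last means a crashed worker replays the batch — the Bloom filter then absorbs the replayed duplicates.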
### SMS Workers — Account Pool
Twilio caps at 1K/sec per account. 500K SMS/sec would need 500 accounts — but intake filtering means only high-priority SMS (OTPs, fraud alerts) enters the SMS topic. Real volume is a small fraction of 500K/sec.
Workers round-robin across the Twilio account pool. On 429 → read Retry-After header, mark account blocked in Redis, route to next account. Load shedding: never overload healthy accounts to compensate for blocked ones.
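The round-robin-with-skip behavior can be sketched as follows; in production the blocked-until timestamps live in Redis so all workers see them, and a plain dict stands in here:

```python
from itertools import cycle

class AccountPool:
    """Round-robin over Twilio accounts, skipping blocked ones (sketch)."""
    def __init__(self, accounts):
        self._ring = cycle(accounts)
        self._size = len(accounts)
        self.blocked = {}  # account -> unblock_at (epoch seconds)

    def next_account(self, now: float):
        # try each account at most once per call
        for _ in range(self._size):
            acct = next(self._ring)
            if self.blocked.get(acct, 0) <= now:
                return acct
        return None  # every account rate-limited: shed load, don't overload

    def block(self, acct, retry_after: float, now: float):
        # called on a 429: honor the provider's Retry-After header
        self.blocked[acct] = now + retry_after
```

Returning `None` when the whole pool is blocked is the load-shedding rule from above made explicit: healthy accounts are never pushed past their cap to cover for blocked ones.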
### Email Workers — 84 Instances + IP Pool

Email's 2-minute SLO enables batching:

- 5M notifications/sec × 20% email = 1M emails/sec peak intake
- a 1M-email burst drained over the 120 s SLO window = 8,333 emails/sec sustained
- 8,333 ÷ 100 emails/sec per IP = 84 IPs

Workers batch 1,000 emails per SendGrid API call — roughly 9 API calls/sec total. Marketing and digest emails are filtered at intake — only transactional email enters the real-time pipeline.
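The batching itself is simple chunking; the sketch below shows where the ~9 calls/sec figure comes from (the real worker would also pin each batch to a warmed sending IP):

```python
def batch_emails(emails, batch_size=1000):
    """Chunk one second's worth of queued emails into
    SendGrid-sized batches (sketch)."""
    return [emails[i:i + batch_size] for i in range(0, len(emails), batch_size)]
```

One sustained second of traffic — 8,333 emails — yields eight full batches plus a 333-email remainder: nine API calls.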
### Cassandra — Two Tables
| Table | Partition Key | Clustering Key | Purpose |
|---|---|---|---|
| notifications | user_id | created_at DESC | Status tracking, hot queries |
| notification_content | notification_id | — | Full content, cold queries |
300 nodes (100 shards × 3 replicas) for 10M writes/sec (5M inserts + 5M status updates).
### PostgreSQL — Preferences
Preferences need strong consistency — opt-out must be immediately visible. Low write volume (users rarely change settings). ACID guarantees. Cached in Redis; PostgreSQL only sees cache-miss reads.
### DLQ + Retry Workers

Workers publish failures explicitly to channel-specific DLQ topics — unlike SQS or RabbitMQ, Kafka has no built-in dead-letter mechanism. Retry workers consume from the DLQs with exponential backoff (1s, 2s, 4s — 3 retries). Once all retries are exhausted, they mark the notification FAILED in Cassandra and publish it to the notifications-failed topic. Calling services subscribe to notifications-failed and react — fall back to another channel, alert, or log.
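The retry policy above can be sketched end to end. The collaborators (`send`, `mark_failed`, `publish_failed`, `sleep`) are illustrative stubs, and a real retry worker would schedule the delays via delay topics or timers rather than a blocking sleep:

```python
def retry_schedule(base=1.0, retries=3):
    """Exponential backoff delays for a DLQ'd notification: 1s, 2s, 4s."""
    return [base * 2 ** i for i in range(retries)]

def handle_dlq_message(msg, send, mark_failed, publish_failed, sleep):
    """Retry worker sketch: re-attempt with backoff, then give up.
    `send` raises on provider failure."""
    for delay in retry_schedule():
        sleep(delay)  # stand-in for a delay topic / timer
        try:
            send(msg)
            return True
        except Exception:
            continue
    mark_failed(msg["notification_id"])  # final FAILED status in Cassandra
    publish_failed(msg)                  # hand off to notifications-failed
    return False
```

The `notifications-failed` publish is what closes the loop with calling services: the pipeline never silently drops a notification, it hands the decision back to the owner.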
## Scale Summary
| Component | Count | Throughput |
|---|---|---|
| App Servers | Horizontally scaled | 5M intake/sec |
| Kafka Brokers | ~10 | 5M messages/sec |
| Push Workers | 18 | 3.5M push/sec |
| SMS Workers | Twilio-bound | 500K SMS/sec (intake filtered) |
| Email Workers | 84 | 8,333 emails/sec sustained |
| Cassandra Nodes | 300 | 10M writes/sec |
| Redis | Clustered | Preferences + bloom filter + token buckets |
| PostgreSQL | Replicated | Preferences (low write) |
| Twilio Accounts | 500 | 500K SMS/sec aggregate |
| SendGrid IPs | 84 | 8,333 emails/sec |