Estimation
Users and Activity#
Start with the real world: roughly 10 billion people on Earth, about 1 billion of them are on a platform like Instagram or WhatsApp. Of those 1 billion registered users, assume 50% are Daily Active Users (DAU) — 500 million people using the product on any given day.
Each active user performs about 50 actions per day — likes, comments, DMs, story views. Each action typically generates 1 notification for the receiver (not the sender). So:
QPS#
There are approximately 86,400 seconds in a day — round to 100,000 for easy math:
Traffic is never uniform. Peak hours (evening, major events) can spike to 10× sustained:
Multi-Channel Multiplier#
Here's what candidates often miss: one notification event can trigger sends on multiple channels. A user might have push + email both enabled. The system supports 3 channels — push, SMS, email. On average, assume 2 channel sends per event:
This is the harder number. The fan-out infrastructure (Kafka, channel workers) must handle 5M dispatches per second at peak, not 2.5M.
Storage#
Each notification record contains: - ID: 8 bytes - User ID, sender ID: 16 bytes
- Title + body: ~200 bytes - Deep link + metadata: ~100 bytes - Status, timestamps: ~50 bytes
Total: ~300–400 bytes per notification record. Call it 400 bytes.
Wait — that seems large. The key insight is notifications are not permanent. Nobody queries "did I get that Instagram like in January?" After 30–90 days, notifications are useless. Cap retention at 90 days:
That's still large. But most of this is cold storage — only recent notifications (last 7 days) are hot. And you can compress heavily since notification payloads are repetitive text. With 10× compression, cold storage drops to ~770TB for 90 days. Manageable.
What to store vs what to drop
You don't need to store the full notification payload forever. Store: ID, user ID, channel, delivery status, timestamp. Drop: full body/title after 30 days. This brings storage down dramatically.
Bandwidth#
At peak, 5M sends/sec, each notification payload ~300 bytes:
A standard NIC handles 10Gbps. At peak you need slightly more than one NIC's worth of capacity — so you need either: - Multiple dispatch nodes (each handling a shard of the traffic), or - 25Gbps NICs on dispatcher machines
This is not a problem in practice since you'll have many dispatcher nodes anyway, but it's worth flagging that a single-machine dispatcher falls over at peak.
Interview framing for bandwidth
"At 5M sends/sec with ~300 bytes per payload, we're looking at ~12Gbps outbound at peak. A single machine can't handle this, which is one reason we need a horizontally scalable dispatcher pool behind Kafka — each worker node handles a partition and the total throughput is distributed."
Summary Table#
| Metric | Value |
|---|---|
| Registered users | 1B |
| DAU | 500M |
| Notifications/day | 25B |
| Sustained QPS (events) | 250K/sec |
| Peak QPS (events) | 2.5M/sec |
| Peak QPS (actual sends) | 5M/sec (×2 channel multiplier) |
| Payload size | ~300 bytes |
| Peak bandwidth | ~12Gbps |
| Storage per day | ~86TB (raw), compress to ~8TB |
| Retention | 90 days |
| Working storage | ~720TB compressed |