Estimation Framework
Estimation without a framework produces random numbers. Estimation with a framework produces justified architecture decisions.
The goal is never "get the exact number." The goal is: use the number to justify your next design choice.
The 6-step framework (plus Step 0 for assumptions): use this order every time
Step 0: State assumptions out loud
Before any calculation, say what you're assuming. Interviewers want to see your reasoning, not a precise number. If your assumption is wrong, they'll correct you. If it's reasonable, they'll nod and you continue.
"I'll assume 100M MAU, with about 10% DAU, so 10M daily active users."
"I'll assume each user creates 0.1 URLs per day on average."
"I'll assume 10 redirects per user per day, so a 100:1 read-to-write ratio."
Step 1: Users (DAU/MAU)
Start with who uses the system.
DAU = MAU × 10% (a typical engagement rate for consumer apps)
What does each user DO per day? (reads? writes? both?)
Common assumptions:
Social app: 10–50 actions/day per DAU
URL shortener: 0.1 creates + 10 redirects per DAU
Chat: 50–100 messages sent/day per DAU
Video streaming: 3–5 videos watched/day per DAU
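A minimal sketch of Steps 0 and 1 in Python, assuming the URL shortener numbers used on this page. The `UserAssumptions` class and its field names are illustrative, not from any library; the point is that every later number derives from a handful of stated inputs that an interviewer can challenge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAssumptions:
    """Hypothetical container for the Step 0 assumptions."""
    mau: int                      # monthly active users
    dau_ratio: float = 0.10       # DAU as a fraction of MAU (10% rule of thumb)
    writes_per_dau: float = 0.1   # URL shortener: creates per DAU per day
    reads_per_dau: float = 10.0   # URL shortener: redirects per DAU per day

    @property
    def dau(self) -> int:
        return round(self.mau * self.dau_ratio)

    @property
    def writes_per_day(self) -> int:
        return round(self.dau * self.writes_per_dau)

    @property
    def reads_per_day(self) -> int:
        return round(self.dau * self.reads_per_dau)


url_shortener = UserAssumptions(mau=100_000_000)
print(url_shortener.dau, url_shortener.writes_per_day, url_shortener.reads_per_day)
# 10000000 1000000 100000000
```

Changing one field (say, `dau_ratio=0.2`) reprices every downstream estimate, which is the behavior you want when an assumption gets corrected mid-interview.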
Step 2: QPS (always separate reads from writes)
Avg write QPS = DAU × writes_per_day / 86,400
Avg read QPS = DAU × reads_per_day / 86,400
Peak QPS = avg × 3–5 (or × 10 for viral systems)
Always sanity-check your read:write ratio:
URL shortener: 100:1 or higher (redirects massively outnumber creates)
Social media feed: 100:1 (many more reads than posts)
Chat: 1:1 (every message sent is also received)
Ride-sharing: 1:1 (driver sends location, user reads it)
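The QPS formulas above can be sketched as two small functions. The function names and the default multiplier are assumptions chosen for illustration; the inputs are the URL shortener figures (10M DAU, 0.1 creates and 10 redirects per DAU per day).

```python
SECONDS_PER_DAY = 86_400


def avg_qps(dau: int, actions_per_dau_per_day: float) -> float:
    """Average queries per second for one kind of action."""
    return dau * actions_per_dau_per_day / SECONDS_PER_DAY


def peak_qps(avg: float, multiplier: float = 5.0) -> float:
    """Peak estimate: 3-5x the average, or about 10x for viral systems."""
    return avg * multiplier


dau = 10_000_000
write_avg = avg_qps(dau, 0.1)
read_avg = avg_qps(dau, 10)
read_write_ratio = read_avg / write_avg  # sanity check against expectations

print(f"writes: {write_avg:.0f}/s avg, {peak_qps(write_avg, 10):.0f}/s peak")
print(f"reads:  {read_avg:.0f}/s avg, {peak_qps(read_avg, 10):.0f}/s peak")
print(f"read:write = {read_write_ratio:.0f}:1")
```

Keeping reads and writes as separate calls makes the ratio check a one-line division rather than something you eyeball.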
Step 3: Storage
Always do this in two parts: metadata and media.
Metadata storage = records_per_day × bytes_per_record × days_retained
× replication_factor (3)
× index_overhead (1.3–1.5)
Media storage = records_per_day × % with media × media_size
× transcoding_factor (for video: ×10)
× replication_factor (3)
Storage for common systems (URL shortener over 10 years; the others per year):
URL shortener: 50B URLs × 500 bytes = 25 TB raw → ~112 TB with 3× replication and 1.5× index overhead (budget ~250 TB)
Twitter (text): ~500 GB/year text (negligible vs media)
Twitter (media): ~16 PB/year photos
YouTube: ~180 PB/year video uploads (transcoded)
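As a sketch, the two storage formulas translate directly into code. Function and parameter names are illustrative. The metadata call uses the URL shortener inputs (1M records/day, 500 bytes, 10 years) with no safety margin applied; the media call uses made-up inputs (10% of records carry a 2 MB photo) purely to show the shape of the calculation.

```python
TB = 10**12  # decimal terabyte


def metadata_storage_bytes(records_per_day, bytes_per_record, days_retained,
                           replication_factor=3, index_overhead=1.5):
    """Total metadata footprint over the retention period."""
    raw = records_per_day * bytes_per_record * days_retained
    return raw * replication_factor * index_overhead


def media_storage_bytes_per_day(records_per_day, media_fraction, media_size_bytes,
                                transcoding_factor=1.0, replication_factor=3):
    """Media added per day; multiply by retention days for a total."""
    return (records_per_day * media_fraction * media_size_bytes
            * transcoding_factor * replication_factor)


metadata = metadata_storage_bytes(1_000_000, 500, 365 * 10)
media_per_day = media_storage_bytes_per_day(1_000_000, 0.10, 2_000_000)

print(f"metadata over 10 years: {metadata / TB:.1f} TB")
print(f"media per day: {media_per_day / TB:.2f} TB")
```

For video, pass `transcoding_factor=10` as the formula above suggests; the multiplier usually dominates everything else in the estimate.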
Step 4: Bandwidth
Incoming = write QPS × avg write payload size
Outgoing = read QPS × avg response size
Check against server NIC: 10 Gbps = 1.25 GB/s
If outgoing > 10 Gbps per server → CDN or more servers
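A minimal sketch of the bandwidth check, assuming a single 10 Gbps NIC per server. The traffic figures (50k reads/sec with 2 KB responses, 500 writes/sec with 1 KB payloads) are hypothetical inputs, not numbers from this page.

```python
NIC_BYTES_PER_SEC = 10e9 / 8  # 10 Gbps NIC = 1.25 GB/s


def bandwidth_bytes_per_sec(qps: float, payload_bytes: int) -> float:
    """Bandwidth for one direction of traffic."""
    return qps * payload_bytes


def nic_utilization(bytes_per_sec: float, nic: float = NIC_BYTES_PER_SEC) -> float:
    """Fraction of one NIC consumed by this traffic."""
    return bytes_per_sec / nic


outgoing = bandwidth_bytes_per_sec(50_000, 2_000)  # read QPS x response size
incoming = bandwidth_bytes_per_sec(500, 1_000)     # write QPS x payload size

# Above 100% of one NIC, the decision is a CDN or more servers.
needs_scale_out = outgoing > NIC_BYTES_PER_SEC

print(f"outgoing: {outgoing / 1e6:.0f} MB/s ({nic_utilization(outgoing):.0%} of one NIC)")
print(f"incoming: {incoming / 1e6:.1f} MB/s")
print("CDN or more servers needed:", needs_scale_out)
```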
Step 5: Cache sizing
Cache size = 20% × (data read per day)
= 20% × (avg read QPS × 86,400 × record size)
OR simpler:
Active working set = URLs created in last 3 days × 20% viral factor
Cache size = active working set × record size
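Both cache-sizing approaches can be sketched side by side. The function names are illustrative, and the inputs are the URL shortener figures (about 1,157 reads/sec on average, 1M URLs created per day, 500-byte records). Note that the read-based estimate counts repeat reads of the same record more than once, so it tends to act as an upper bound.

```python
SECONDS_PER_DAY = 86_400


def cache_bytes_from_reads(avg_read_qps, record_bytes, hot_fraction=0.2):
    """20% of the bytes read in one day."""
    return hot_fraction * avg_read_qps * SECONDS_PER_DAY * record_bytes


def cache_bytes_from_working_set(records_per_day, record_bytes,
                                 window_days=3, hot_fraction=0.2):
    """Records created in the recent window, scaled by the hot ("viral") fraction."""
    return records_per_day * window_days * hot_fraction * record_bytes


by_reads = cache_bytes_from_reads(1_157, 500)
by_working_set = cache_bytes_from_working_set(1_000_000, 500)

print(f"read-based estimate:  {by_reads / 1e9:.1f} GB")
print(f"working-set estimate: {by_working_set / 1e6:.0f} MB")
```

When the two estimates differ by an order of magnitude, as they do here, say so out loud and size for the larger one; memory at this scale is cheap relative to a cache miss storm.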
Step 6: Server count
App servers = peak QPS / QPS per server (use 1k–5k for CRUD)
DB nodes = (peak read QPS × cache miss rate) / reads_per_node (10k–50k for Postgres)
+ write QPS / writes_per_node (5k–10k)
Redis nodes = cache size / memory per node (64–256 GB)
(throughput almost never the constraint)
Always add 20–30% headroom (N+1 minimum).
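The server-count formulas and the headroom rule can be sketched as follows. All function names and default capacities are assumptions picked from the ranges above, and `db_nodes` covers throughput only; a storage-driven shard count is a separate calculation.

```python
import math


def with_headroom(nodes: int, headroom: float = 0.3) -> int:
    """Add 20-30% headroom, and never fewer than N+1."""
    return max(math.ceil(nodes * (1 + headroom)), nodes + 1)


def app_servers(peak_qps: float, qps_per_server: int = 2_000) -> int:
    return with_headroom(math.ceil(peak_qps / qps_per_server))


def db_nodes(peak_read_qps: float, cache_miss_rate: float, write_qps: float,
             reads_per_node: int = 10_000, writes_per_node: int = 5_000) -> int:
    # Only cache misses reach the database; writes always do.
    load = (peak_read_qps * cache_miss_rate / reads_per_node
            + write_qps / writes_per_node)
    return with_headroom(math.ceil(load))


def redis_nodes(cache_bytes: float, memory_per_node_bytes: int = 64 * 10**9) -> int:
    # Sized by memory, since throughput is almost never the constraint.
    return with_headroom(math.ceil(cache_bytes / memory_per_node_bytes))


# Hypothetical inputs: 12k peak reads/sec, 20% cache miss rate, 120 writes/sec.
print("app servers:", app_servers(12_000, qps_per_server=5_000))
print("db nodes:   ", db_nodes(12_000, 0.2, 120))
print("redis nodes:", redis_nodes(300 * 10**6))
```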
Full worked example: URL Shortener
Step 0: Assumptions
100M MAU, 10% DAU = 10M DAU
Each DAU: 0.1 creates/day, 10 reads/day
Step 1: Users
10M DAU
Step 2: QPS
Writes: 10M × 0.1 = 1M/day ÷ 86,400 ≈ 12/sec avg → ~120/sec peak (10×)
Reads: 10M × 10 = 100M/day ÷ 86,400 ≈ 1,157/sec avg → ~12k/sec peak (10×)
Read:write ratio = 100:1
Step 3: Storage (10 years)
URLs: 1M/day × 365 × 10 = 3.65B → plan for ~50B (a deliberately generous ~14× growth margin)
Per URL: 500 bytes
Raw: 50B × 500 bytes = 25 TB
With replication (3×) + indexes (1.5×): 25 × 3 × 1.5 ≈ 112 TB → budget 250 TB for headroom
Step 4: Bandwidth
Outgoing: 12k reads/sec × 200 bytes (301 response headers) = 2.4 MB/s → fits 10 Gbps ✓
Incoming: 120 writes/sec × 500 bytes = 60 KB/s → negligible
Step 5: Cache
Active window: 3 days (80% of traffic from recent URLs)
Active URLs: 1M/day × 3 days = 3M → 20% viral = 600k hot URLs
Cache size: 600k × 500 bytes = 300 MB → a few GB of Redis leaves ample buffer
Step 6: Servers
Redirect app servers: 12k / 5k per server ≈ 2.4 → 3, plus headroom → 4 servers
Creation app servers: 120 / 5k per server < 1 → 2 servers (N+1)
DB: 9 shards (250 TB / ~30 TB per machine ≈ 8.3, rounded up)
Redis: a few GB → 1 node (64 GB) is plenty; add a replica or use a cluster for HA
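Recomputing an example from its stated assumptions is a quick way to catch arithmetic slips. The sketch below chains the steps in one script. The 10× peak multiplier is an assumption taken from the "viral" option in Step 2, and `planned_urls` is the example's 50B planning figure rather than a derived value; variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12

# Steps 0-1: assumptions and users
mau = 100_000_000
dau = mau // 10                     # 10% of MAU
writes_per_day = dau // 10          # 0.1 creates per DAU
reads_per_day = dau * 10            # 10 redirects per DAU

# Step 2: QPS
write_avg = writes_per_day / SECONDS_PER_DAY
read_avg = reads_per_day / SECONDS_PER_DAY
peak_multiplier = 10                # assumption: viral system
ratio = read_avg / write_avg

# Step 3: storage over 10 years
urls_10y = writes_per_day * 365 * 10
planned_urls = 50_000_000_000       # planning figure with growth margin
raw_tb = planned_urls * 500 / TB    # 500 bytes per URL
replicated_tb = raw_tb * 3 * 1.5    # 3x replication, 1.5x index overhead

# Step 5: cache from the 3-day working set, 20% hot
hot_urls = writes_per_day * 3 * 20 // 100
cache_mb = hot_urls * 500 / 10**6

print(f"QPS avg  writes {write_avg:.0f}/s, reads {read_avg:.0f}/s ({ratio:.0f}:1)")
print(f"QPS peak writes {write_avg * peak_multiplier:.0f}/s, "
      f"reads {read_avg * peak_multiplier:.0f}/s")
print(f"URLs in 10 years: {urls_10y:,} (planning for {planned_urls:,})")
print(f"Storage: {raw_tb:.0f} TB raw, {replicated_tb:.1f} TB replicated + indexed")
print(f"Cache: {hot_urls:,} hot URLs, {cache_mb:.0f} MB")
```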
Architecture decision cheat sheet
| Metric | Threshold | Architecture implication |
|---|---|---|
| Read QPS | > 1k | Multiple app servers + load balancer |
| Read QPS | > 10k | Redis caching |
| Read QPS | > 100k | DB read replicas |
| Read QPS | > 1M | Local in-process cache on app servers |
| Write QPS | > 1k | Async queue (Kafka/SQS) |
| Write QPS | > 10k | Shard DB primaries |
| Write QPS | > 100k | LSM DB (Cassandra) |
| Storage | > 1 TB | Plan archival strategy |
| Storage | > 10 TB | Sharding required |
| Storage | > 100 TB | Tiered storage (SSD + S3) |
| Has media | any | CDN mandatory |
| Global users | any | Regional replicas, CDN |
| Latency SLO | < 50ms | Redis cache mandatory |
| Latency SLO | < 10ms | Local in-process cache |
| Latency SLO | < 1ms | In-process only, no network |
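The cheat sheet can be treated as a lookup over thresholds. This is a minimal sketch that encodes the table rows as data; the `implications` function and its parameter names are hypothetical, and the comparisons are strict (`>` for load and storage, `<` for latency) to match the table.

```python
READ_QPS = [
    (1_000, "Multiple app servers + load balancer"),
    (10_000, "Redis caching"),
    (100_000, "DB read replicas"),
    (1_000_000, "Local in-process cache on app servers"),
]
WRITE_QPS = [
    (1_000, "Async queue (Kafka/SQS)"),
    (10_000, "Shard DB primaries"),
    (100_000, "LSM DB (Cassandra)"),
]
STORAGE_TB = [
    (1, "Plan archival strategy"),
    (10, "Sharding required"),
    (100, "Tiered storage (SSD + S3)"),
]
LATENCY_MS = [
    (50, "Redis cache mandatory"),
    (10, "Local in-process cache"),
    (1, "In-process only, no network"),
]


def implications(read_qps=0, write_qps=0, storage_tb=0, latency_slo_ms=None,
                 has_media=False, global_users=False):
    """Return every architecture implication whose threshold is crossed."""
    out = [action for limit, action in READ_QPS if read_qps > limit]
    out += [action for limit, action in WRITE_QPS if write_qps > limit]
    out += [action for limit, action in STORAGE_TB if storage_tb > limit]
    if latency_slo_ms is not None:
        out += [action for limit, action in LATENCY_MS if latency_slo_ms < limit]
    if has_media:
        out.append("CDN mandatory")
    if global_users:
        out.append("Regional replicas, CDN")
    return out


for line in implications(read_qps=12_000, write_qps=120, storage_tb=112):
    print("-", line)
```

Encoding the thresholds as data keeps the function trivial and makes it easy to adjust a row when your own experience disagrees with the table.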
Interview framing
"I'll start with users: 100M MAU, ~10M DAU. 10 reads per DAU = 100M reads/day ÷ 86,400 = ~1,200/sec avg, peak 10× = ~12k/sec. Storage: 1 KB per record × 100M new records/year = 100 GB/year. Cache: 80/20 rule, cache 20% = 20 GB. Read QPS > 10k so I'll add Redis. Storage stays well under 1 TB for years at this rate, so I'll skip sharding for now and revisit it if growth accelerates."