Measuring Latency — Notification System#
You cannot store billions of latency measurements. You don't need to.
At 5M notifications/sec, that is over 400 billion data points per day; storing each raw delivery time with its metadata would mean tens of terabytes daily just to answer one question. The trick is to throw away the raw numbers and keep only the shape of the distribution.
What Latency Means for a Notification System#
For a URL shortener, latency is simple — time from request to response on a single HTTP call. For a notification system, latency is the time from when the calling service submits the notification to when the external provider acknowledges delivery. This spans multiple hops:
Calling service submits → App Server → Kafka → Worker → APNs/Twilio/SendGrid → ack
Total latency = Kafka publish time + consumer lag + worker processing time + provider round trip
Each channel has its own latency profile. Push is measured in seconds, SMS in tens of seconds, email in minutes. You cannot aggregate them — a single "notification system p95" number is meaningless.
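To make the breakdown concrete, here is a minimal sketch (field and function names are hypothetical) that computes each component of the total latency from timestamps recorded at the hops above:

```python
from dataclasses import dataclass

@dataclass
class NotificationTimestamps:
    # Hypothetical fields -- one timestamp recorded at each hop.
    submitted_at: float       # calling service hands the notification to the app server
    published_at: float       # app server publishes to Kafka
    consumed_at: float        # worker pulls the message off the topic
    provider_acked_at: float  # APNs/Twilio/SendGrid acknowledges delivery

def delivery_latency_breakdown(ts: NotificationTimestamps) -> dict:
    """Total latency = publish time + time waiting in Kafka + worker/provider time."""
    return {
        "publish": ts.published_at - ts.submitted_at,
        "queue_wait": ts.consumed_at - ts.published_at,      # surfaces as consumer lag
        "worker_and_provider": ts.provider_acked_at - ts.consumed_at,
        "total": ts.provider_acked_at - ts.submitted_at,
    }
```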
Histograms — Keep the Shape, Throw Away the Raw Values#
Each worker maintains a histogram — bucket counters for ranges of delivery latency. Every successful send increments exactly one bucket.
Push worker histogram buckets:
0-1s: most notifications (fast APNs response)
1-3s: slightly slow (APNs momentarily backed up)
3-5s: approaching SLO threshold
5-10s: SLO breach territory
10s+: severe degradation
SMS worker histogram buckets:
0-5s: fast Twilio response
5-15s: normal range
15-30s: approaching SLO threshold
30-60s: SLO breach territory
60s+: severe degradation
Email worker histogram buckets:
0-30s: fast SendGrid response
30-60s: normal range
60-120s: approaching SLO threshold
120-300s: SLO breach territory
300s+: severe degradation
Each worker instance maintains its own histogram counters in memory — incrementing a counter is a single atomic operation, essentially free at 5M/sec.
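A minimal sketch of such a per-worker histogram, using the push bucket boundaries above (the lock stands in for the atomic increment a production metrics library would use):

```python
import bisect
from threading import Lock

class LatencyHistogram:
    """Fixed-bucket latency histogram; one instance per worker, kept in memory."""

    def __init__(self, upper_bounds_seconds):
        # e.g. (1, 3, 5, 10) for push; the implicit last bucket is "10s+"
        self.bounds = list(upper_bounds_seconds)
        self.counts = [0] * (len(self.bounds) + 1)
        self._lock = Lock()

    def observe(self, latency_seconds: float) -> None:
        # First bucket whose upper bound covers the observed latency.
        idx = bisect.bisect_left(self.bounds, latency_seconds)
        with self._lock:
            self.counts[idx] += 1

# Push worker: buckets match the 0-1s / 1-3s / 3-5s / 5-10s / 10s+ ranges above.
push_histogram = LatencyHistogram((1, 3, 5, 10))
push_histogram.observe(0.4)   # fast APNs response -> 0-1s bucket
push_histogram.observe(6.2)   # SLO breach territory -> 5-10s bucket
```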
Computing p95 from a Histogram#
p95 means: the latency value below which 95% of notifications were delivered. With 3.5M push notifications/sec across 18 workers, each worker handles ~195K sends/sec.
For one worker in one second:
Bucket   Count     Running total
0-1s:    180,000 → 180,000
1-3s:    12,000  → 192,000   ← 95% of 195,000 = 185,250 falls here
3-5s:    2,500   → 194,500
5-10s:   400     → 194,900
10s+:    100     → 195,000
p95 lands in the 1-3s bucket — well under the 5s SLO. ✓
The tradeoff: you lose precision within the bucket. You know p95 is somewhere between 1s and 3s, not exactly where. For SLO tracking this is fine — you care whether you're above or below the threshold, not the exact millisecond.
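The cumulative walk is mechanical. A sketch, reusing the bounds and counts from the worked example above:

```python
def percentile_from_histogram(bounds, counts, quantile=0.95):
    """Walk cumulative counts until they cross quantile * total; return that bucket's range.

    bounds = bucket upper edges in seconds, e.g. (1, 3, 5, 10);
    counts has one extra slot for the overflow bucket ("10s+").
    """
    total = sum(counts)
    target = quantile * total
    running = 0
    for i, count in enumerate(counts):
        running += count
        if running >= target:
            lower = bounds[i - 1] if i > 0 else 0
            upper = bounds[i] if i < len(bounds) else float("inf")
            return lower, upper
    return None

# One push worker, one second of traffic (the worked example above).
bounds = (1, 3, 5, 10)
counts = [180_000, 12_000, 2_500, 400, 100]
print(percentile_from_histogram(bounds, counts))  # -> (1, 3): p95 is in the 1-3s bucket
```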
Merging Histograms Across the Fleet#
18 push workers each maintain their own histogram. To get fleet-wide p95, add the bucket counts together:
Worker 1: 0-1s: 180,000 1-3s: 12,000 3-5s: 2,500 ...
Worker 2: 0-1s: 179,000 1-3s: 12,500 3-5s: 2,400 ...
...
Worker 18: 0-1s: 178,000 1-3s: 11,800 3-5s: 2,600 ...
---------------------------------------------------------
Fleet: 0-1s: 3.2M 1-3s: 216K 3-5s: 45K ...
Histograms are mergeable by design — just add the counters. This is why histograms are the standard tool for distributed latency measurement.
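A sketch of the merge, with illustrative per-worker counts; the fleet-wide p95 then comes from the same cumulative walk as before:

```python
def merge_histograms(per_worker_counts):
    """Fleet-wide histogram = element-wise sum of each worker's bucket counters."""
    merged = [0] * len(per_worker_counts[0])
    for counts in per_worker_counts:
        for i, c in enumerate(counts):
            merged[i] += c
    return merged

# 18 push workers, counts in bucket order (0-1s, 1-3s, 3-5s, 5-10s, 10s+).
fleet = merge_histograms([
    [180_000, 12_000, 2_500, 400, 100],   # worker 1
    [179_000, 12_500, 2_400, 350, 90],    # worker 2
    # ... workers 3-18 ...
])
```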
Kafka Consumer Lag — The Hidden Latency#
Delivery latency is not just the external provider round trip. If a notification sits in the Kafka topic for 30 seconds before a worker picks it up, that's 30 seconds of latency the histogram doesn't capture.
Kafka consumer lag is a separate metric — how many messages behind the latest offset is each consumer group?
notifications-push consumer group lag: 0 messages → workers keeping up ✓
notifications-push consumer group lag: ~100M messages → workers falling behind ✗
~30 seconds of backlog at a drain rate of 3.5M/sec
Consumer lag is the earliest warning signal for capacity problems. Latency SLI breaches come after — consumer lag starts rising before users notice slow delivery.
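A sketch of the lag arithmetic, with made-up offsets chosen to match the 30-second example above (real offsets come from the broker and the consumer group's committed positions):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag = sum over partitions of (latest offset - committed offset)."""
    return sum(
        end_offsets[p] - committed_offsets.get(p, 0)
        for p in end_offsets
    )

def backlog_seconds(lag_messages, drain_rate_per_sec):
    """How far behind the workers are, in wall-clock time, at the current drain rate."""
    return lag_messages / drain_rate_per_sec

# Hypothetical snapshot for the notifications-push consumer group (two partitions).
end = {0: 9_000_000_000, 1: 9_100_000_000}        # latest offset per partition
committed = {0: 8_950_000_000, 1: 9_045_000_000}  # last offset each worker committed
lag = consumer_lag(end, committed)                 # 105M messages behind
print(backlog_seconds(lag, 3_500_000))             # ~30 seconds of backlog at 3.5M/sec
```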
Who Does the Scraping#
Each worker exposes its histogram counters on /metrics. A Prometheus instance (or Datadog agent) scrapes all workers every 15 seconds; per channel, the per-worker bucket counts are summed and fleet-wide p95 is computed from the merged histogram.
Every 15 seconds:
Prometheus → scrapes all 18 push workers → merges → fleet push p95
Prometheus → scrapes all SMS workers → merges → fleet SMS p95
Prometheus → scrapes all 84 email workers → merges → fleet email p95
Prometheus → scrapes Kafka consumer lag → per-topic lag metric
Four time-series (push p95, SMS p95, email p95, and Kafka consumer lag), updated every 15 seconds, each independently alertable.
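In practice you rarely hand-roll the histogram; the Prometheus client library provides one. A sketch of what a push worker might expose, assuming a hypothetical metric name notification_delivery_seconds; the commented PromQL query shows how the fleet-wide merge and p95 happen at query time:

```python
from prometheus_client import Histogram, start_http_server

# Hypothetical metric name; bucket edges match the push worker ranges above.
DELIVERY_LATENCY = Histogram(
    "notification_delivery_seconds",
    "Time from worker pickup to provider ack",
    ["channel"],
    buckets=(1, 3, 5, 10),
)

def record_push_delivery(latency_seconds: float) -> None:
    DELIVERY_LATENCY.labels(channel="push").observe(latency_seconds)

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    # Fleet-wide p95 per channel, computed at query time from the summed buckets:
    # histogram_quantile(0.95,
    #   sum by (le, channel) (rate(notification_delivery_seconds_bucket[5m])))
```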
Interview framing
"Latency is measured per channel — push, SMS, email have different SLOs and different delivery paths. Each worker maintains a histogram in memory, Prometheus scrapes and merges every 15 seconds. But delivery latency alone isn't enough — Kafka consumer lag is a separate metric that catches capacity problems before they show up in the latency histogram. Rising lag means the problem is coming. Rising p95 means it's already here."