Observability — Alerting#
Knowing your SLI is breached is useless if nobody finds out for an hour. Alerting closes the loop — when SLI diverges from SLO, a human gets paged immediately.
The Alert Rules#
IF API latency p99 > 200ms
FOR more than 2 minutes
→ page on-call engineer
IF TTFF p99 > 2000ms
FOR more than 2 minutes
→ page on-call engineer
IF stream start availability < 99.99%
FOR more than 2 minutes
→ page on-call engineer
IF buffering ratio > 0.1%
FOR more than 2 minutes
→ page on-call engineer
Why You Need a Sustained Breach Window#
Without the duration condition, a single spiky second triggers a page. At 500,000 API requests/second, occasional spikes happen — a GC pause on one BFF instance, a brief Redis connection reset, a CDN node rebalancing. These self-resolve in seconds.
If you alert on every spike, you wake engineers at 3am for events that resolved before anyone could respond. This is alert fatigue — engineers start ignoring pages because most are false alarms. When the real incident happens, the page gets ignored too.
The 2-minute window filters out transient spikes while still catching real degradation fast enough to act.
CDN brief hiccup (10 seconds): condition met but not sustained → no page, self-resolved
Redis connection reset (20 seconds): same → no page
Actual CDN node failure (30 minutes): condition met for > 2 minutes → page fires
Leading Indicator Alerts — Act Before SLO Breaches#
SLO-based alerts tell you when you've already failed users. Leading indicator alerts tell you something is degrading before the SLO is breached.
IF Redis cache hit ratio < 90%
FOR more than 3 minutes
→ warning alert
Reason: cache miss spike means more DB reads → API latency climbing
IF CDN cache miss rate > 5%
FOR more than 3 minutes
→ warning alert
Reason: CDN misses mean more S3 fetches → TTFF climbing on stream starts
IF BFF fan-out timeout rate > 1%
FOR more than 2 minutes
→ warning alert
Reason: genre services timing out → home feed rows being dropped
IF CDN bandwidth utilisation > 80% on any node
→ warning alert
Reason: node approaching saturation → buffering ratio about to spike
IF circuit breaker state = OPEN on any genre service
→ page immediately (no sustained window needed)
Reason: service confirmed down — graceful degradation already active, needs attention
IF transcoding queue depth > 10,000 jobs
FOR more than 5 minutes
→ warning alert
Reason: backlog growing → new content not ready → TTFF spikes on launch night
Circuit breaker alerts are exceptions to the sustained window rule. A circuit breaker opening means a service has already failed — not a transient spike. Every second it stays open, some users see a missing row. Immediate page is warranted.
The Full Alerting Loop#
flowchart LR
SLO["SLO: TTFF p99 < 2s"] --> SLI["SLI: measured every 15s\nvia client telemetry"]
SLI -->|below 2s| OK["✓ no action"]
SLI -->|above 2s for 2 min| Alert["PagerDuty page\nto on-call engineer"]
Alert --> Action["Engineer investigates:\ncheck CDN miss rate\ncheck transcoding queue\ncheck BFF timeouts"] Every 15 seconds Prometheus computes fleet-wide p99 TTFF from the telemetry histograms. If it stays below 2 seconds, nothing happens. If it crosses 2 seconds and stays there for 2 minutes, PagerDuty pages the on-call engineer with the exact metric, current value, and a graph showing when it started.
Tooling#
Prometheus + Grafana + Alertmanager — Prometheus scrapes metrics from BFF instances and ingests telemetry histograms, Grafana renders dashboards per component (BFF, CDN, genre services, transcoding), Alertmanager routes pages to PagerDuty.
Client telemetry pipeline — a separate ingestion service receives TTFF and buffering ratio events from player SDKs globally. These feed into the same Prometheus histograms as server-side metrics so all four SLIs land in one alerting system.
Interview framing
"Alert rule: if TTFF p99 exceeds 2 seconds for more than 2 minutes, page on-call. 2-minute window prevents alert fatigue from transient CDN hiccups. Leading indicators: Redis cache hit ratio and CDN miss rate warn before SLO breaches. Circuit breaker state is an immediate page — no sustained window needed, that's a confirmed service failure."