Alerting
Knowing your SLI is breached is useless if nobody finds out for an hour
With 1,200 nodes, things break constantly. Alerting closes the loop — when SLI diverges from SLO, a human gets paged immediately. But alert on everything and you drown in noise. Alert on nothing and you miss real incidents.
The alert rules for our KV store#
Each SLO maps to an alert rule. The rule fires when the SLI breaches the SLO for a sustained period:
IF p99_ec_read_latency > 10ms
FOR more than 2 minutes
→ page on-call engineer (CRITICAL)
IF p99_sc_read_latency > 50ms
FOR more than 2 minutes
→ page on-call engineer (CRITICAL)
IF p99_write_latency > 20ms
FOR more than 2 minutes
→ page on-call engineer (CRITICAL)
IF availability < 99.99%
FOR more than 2 minutes
→ page on-call engineer (CRITICAL)
Why you need a sustained breach window#
Without the "for more than 2 minutes" condition, every transient spike triggers a page. At 1,200 nodes with constant background processes, transient spikes are inevitable:
Causes of transient latency spikes (self-resolving):
→ Compaction kicks in on a node → disk I/O spikes for 30 seconds → finishes
→ Memtable flush on a node → brief write pause → resumes
→ GC pause on a JVM-based node → 200ms stall → recovers
→ One replica slow to respond → quorum met by faster replicas → no client impact
→ Gossip briefly marks a node as suspected → indirect probes clear it
All of these cause a momentary SLI dip that resolves within seconds. Paging an engineer at 3am for a compaction-induced blip that lasted 15 seconds is alert fatigue — engineers start ignoring pages, and when the real incident happens, the page gets dismissed.
The 2-minute window filters out transient self-resolving events while catching real sustained degradation fast enough to act on.
Compaction spike (30 seconds): alert condition met but not sustained → no page
Partition causing quorum failures: condition sustained for 5 minutes → page fires at 2 minutes
Warning alerts vs critical alerts#
Not everything needs to wake someone up. Some metrics are leading indicators — they signal that something is drifting toward an SLO breach, but hasn't crossed the line yet.
CRITICAL (page immediately):
→ SLO breach sustained for 2+ minutes
→ Any of the 4 SLO alert rules above
WARNING (send to Slack, don't page):
→ p99 EC read > 7ms for 5 minutes (approaching 10ms SLO)
→ Compaction backlog > 50 SSTables (reads will slow down soon)
→ Hinted handoff queue > 10,000 hints (nodes are staying down too long)
→ Disk usage > 80% on any node (approaching full, compaction at risk)
→ Read repair rate > 5% of reads (replicas are diverging unusually)
→ Bloom filter false positive rate > 2% (read amplification increasing)
→ Tombstone ratio > 30% of reads (too many uncompacted deletes)
INFORMATIONAL (dashboard only, no notification):
→ Anti-entropy differences found per cycle
→ SSTable count per node
→ Memtable flush frequency
→ Gossip message rate
The warning alerts give the team time to investigate and fix before it becomes a critical incident. If compaction backlog is growing, an engineer can investigate during business hours instead of being woken up at 3am when p99 finally breaches SLO.
The full loop — SLO → SLI → Alert → Action#
SLO: p99 EC read latency < 10ms ← the promise
SLI: actual p99 measured every 15 seconds ← the reality
Alert: fires when SLI > SLO for > 2 minutes ← the notification
Action: on-call engineer investigates ← the response
Every 15 seconds, Prometheus computes cluster-wide p99 for each operation type. If EC read p99 stays below 10ms, nothing happens. If it crosses 10ms and stays there for 2 minutes, the alert fires — PagerDuty pages the on-call engineer.
The engineer gets: - Which SLO is breaching (EC read latency) - The current value (p99 = 15ms) - A link to the latency dashboard showing the spike - Per-node breakdown to identify the outlier
flowchart LR
SLO["SLO: p99 < 10ms"] --> SLI["SLI: measured every 15s"]
SLI -->|breach sustained 2+ min| Alert["Alert fires"]
Alert --> Page["PagerDuty pages on-call"]
Page --> Debug["Engineer checks per-node histograms"]
Debug --> Fix["Fix: restart compaction / add capacity / fix partition"]
Fix --> SLI Prometheus vs managed services for 1,200 nodes#
Prometheus (self-hosted): - Scraping 1,200 nodes every 15 seconds = 80 scrapes/sec — well within Prometheus capacity - Pairs with Grafana for dashboards and Alertmanager for routing pages - More control over retention, alert rules, and custom metrics - You manage the Prometheus infrastructure (storage, HA, federation)
Datadog / Grafana Cloud (managed): - Send metrics from 1,200 nodes to their service - No ops burden — no Prometheus servers to manage - Built-in dashboards, alerting, anomaly detection - Cost: at 1,200 nodes with dozens of metrics each, the bill gets significant
For infrastructure as critical as a KV store, many teams run both — Prometheus for real-time operational metrics and alerting, plus a managed service for long-term storage and cross-team dashboards.
Interview framing
"We have four critical alert rules — one per SLO. Each fires when the SLI breaches the threshold for more than 2 minutes, filtering out transient spikes from compaction or GC pauses. Below critical alerts, we have warning alerts on leading indicators — compaction backlog, hinted handoff queue depth, disk usage, Bloom filter false positive rate. These give us time to fix problems before they become SLO breaches. The full loop: Prometheus scrapes all 1,200 nodes every 15 seconds, computes cluster-wide SLIs, alerts fire when SLI diverges from SLO, PagerDuty pages on-call with the exact metric and per-node breakdown."