Alerting

Alert tiers

Critical: page immediately

These mean the SLO is breached or data is corrupted right now.

| Alert | Condition | Why |
|---|---|---|
| Latency SLO breach | p99 > 5 ms sustained for 2 minutes | Callers are being slowed down platform-wide |
| Availability breach | Success rate < 99.99% for 5 minutes | Write path is failing for callers |
| Duplicate ID detected | Duplicate ID count > 0 | Data corruption: P0 incident, zero tolerance |
| All nodes unhealthy | LB has no healthy nodes | Complete outage: no IDs can be generated |
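
As a rough illustration, the four critical conditions can be written as predicates over a metrics snapshot. This is a sketch under assumptions, not the service's actual alerting configuration: the `Snapshot` type and its field names are hypothetical. The sustained-duration requirement is deliberately left out, so the function only reports whether each condition holds at one instant.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time metrics; field names are illustrative."""
    p99_latency_ms: float
    requests: int
    successes: int
    duplicate_ids: int
    healthy_nodes: int


def critical_conditions(s: Snapshot) -> list[str]:
    """Return the names of critical conditions that hold for this snapshot."""
    firing = []
    if s.p99_latency_ms > 5.0:
        firing.append("latency_slo_breach")
    # A success rate below 99.99% means more than 1 failure per 10,000 requests.
    # Integer comparison avoids floating-point rounding right at the threshold.
    if s.requests > 0 and s.successes * 10_000 < s.requests * 9_999:
        firing.append("availability_breach")
    if s.duplicate_ids > 0:
        firing.append("duplicate_id_detected")
    if s.healthy_nodes == 0:
        firing.append("all_nodes_unhealthy")
    return firing
```

For example, 9,999 successes out of 10,000 requests sits exactly on the 99.99% line and does not count as a breach, while 9,998 out of 10,000 does.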

Warning: investigate soon

These are leading indicators — the system is healthy now but trending towards a problem.

| Alert | Condition | Why |
|---|---|---|
| High clock skew frequency | > 5 clock skew events/minute on any node | NTP or hardware clock is misbehaving |
| Clock skew wait > 10 ms | Any single wait exceeds 10 ms | Larger than expected NTP correction |
| Node latency divergence | One node's p99 > 3x that of the other nodes | Node-specific hardware or resource problem |
| Single node down | One node fails health check | Reduced capacity, increased load on remaining nodes |
| Sequence counter saturation | Sequence reaches 4095 (max) within a single ms | Node is receiving 4096 or more requests in one ms, which is unexpected |
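
Two of these warning conditions are simple enough to sketch in code. The example below is illustrative only. The function names are hypothetical, and "3x other nodes" is interpreted here as "3x the median p99 of the other nodes", which is an assumption; the source does not say how the other nodes are aggregated.

```python
from statistics import median

MAX_SEQUENCE = 4095  # largest sequence value within one millisecond


def divergent_nodes(p99_by_node: dict[str, float], factor: float = 3.0) -> list[str]:
    """Nodes whose p99 exceeds `factor` times the median p99 of the other nodes."""
    flagged = []
    for node, p99 in p99_by_node.items():
        others = [v for n, v in p99_by_node.items() if n != node]
        # With a single node there is nothing to compare against.
        if others and p99 > factor * median(others):
            flagged.append(node)
    return flagged


def sequence_saturated(max_sequence_seen: int) -> bool:
    """True if the sequence counter reached its maximum within a millisecond."""
    return max_sequence_seen >= MAX_SEQUENCE
```

Comparing each node against the median of its peers, rather than the mean, keeps one slow node from inflating the baseline it is measured against.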

Informational: log and monitor

| Alert | Condition |
|---|---|
| Node restart | Any ID generator node restarts |
| NTP sync event | NTP correction applied to any node |
| Deployment | New version rolled out |

Sustained breach window

Don't alert on a single bad second; brief spikes happen. For latency and availability, alert only when the condition is sustained:

  • Critical latency: 2 consecutive minutes above threshold
  • Availability: 5 consecutive minutes below threshold
  • Duplicate ID: immediate — zero tolerance, no window

A 2-minute window prevents false alarms from brief GC pauses or network blips while still catching real degradation quickly.
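
The sustained-window rule can be sketched as a small state tracker. This is a minimal illustration, not the monitoring system's real implementation: the `SustainedCondition` class is hypothetical, and it assumes the condition is evaluated periodically with a monotonically increasing timestamp in seconds.

```python
from typing import Optional


class SustainedCondition:
    """Fires only after a condition has held continuously for `window_s` seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._breach_start: Optional[float] = None

    def observe(self, now_s: float, breached: bool) -> bool:
        """Record one evaluation; return True if the alert should fire."""
        if not breached:
            self._breach_start = None  # any healthy sample resets the window
            return False
        if self._breach_start is None:
            self._breach_start = now_s  # first breached sample starts the clock
        return now_s - self._breach_start >= self.window_s
```

With this shape, the latency alert would use `SustainedCondition(120)`, availability would use `SustainedCondition(300)`, and the duplicate ID alert would use `SustainedCondition(0)`, which fires on the first breached sample.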