Alerting#

Alert tiers#

These mean the SLO is breached or data is corrupted right now.

Alert	Condition	Why
Latency SLO breach	p99 > 5ms sustained for 2 minutes	Callers are being slowed down platform-wide
Availability breach	Success rate < 99.99% for 5 minutes	Write path is failing for callers
Duplicate ID detected	Duplicate ID count > 0	Data corruption — P0 incident, no tolerance
All nodes unhealthy	LB has no healthy nodes	Complete outage — no IDs can be generated

These are leading indicators — the system is healthy now but trending towards a problem.

Alert	Condition	Why
High clock skew frequency	>5 clock skew events/minute on any node	NTP or hardware clock is misbehaving
Clock skew wait > 10ms	Any single wait exceeds 10ms	Larger than expected NTP correction
Node latency divergence	One node's p99 > 3x other nodes	Node-specific hardware or resource problem
Single node down	One node fails health check	Reduced capacity, increased load on remaining nodes
Sequence counter saturation	Sequence hitting 4095 (max) per ms	Node receiving more than 4096 requests/ms — unexpected

Alert	Condition
Node restart	Any ID generator node restarts
NTP sync event	NTP correction applied to any node
Deployment	New version rolled out

Don't alert on a single bad second — brief spikes happen. Alert only when the condition is sustained:

A 2-minute window prevents false alarms from brief GC pauses or network blips while still catching real degradation quickly.