Measuring Availability#

What counts as success#

A request is successful if: - The node returns a valid int64 ID with HTTP 200 - The ID is unique (no duplicate)

A request is a failure if: - The node returns a 5xx error - The request times out - The load balancer cannot route to any healthy node

A request that waits a few milliseconds due to clock skew and then succeeds is not a failure — it's a delayed success.

What counts as failure#

Scenario	Counts as failure?
5xx from node	✅ yes
Request timeout	✅ yes
All nodes down	✅ yes
Clock skew wait (1–10ms) then success	❌ no — delayed success
LB returns 429 rate limit	✅ yes — caller couldn't get an ID

Availability calculation#

Availability = successful requests / total requests

At 99.99% SLO:
Allowed failures = 0.01% of requests
At 1M req/sec → 100 failures/second allowed before breaching SLO

Track availability as a rolling 5-minute window and a rolling 1-hour window. Short windows catch sudden outages. Longer windows track slow degradation.

Per-node availability#

Track availability per node separately. If Node 3 has 98% availability while Nodes 1 and 2 are at 99.99%, the aggregate might still look healthy — but Node 3 is sick and needs investigation before it fully fails.