Error Budget#
What the budget is#
At 99.99% availability SLO:
Allowed downtime per year = 0.01% × 365 × 24 × 60 = 52.6 minutes/year
Allowed downtime per month = 0.01% × 30 × 24 × 60 = 4.3 minutes/month
Every minute the service is unavailable or breaching its latency SLO consumes from this budget.
What consumes the budget#
| Event | Budget consumed | Notes |
|---|---|---|
| Node crash + failover | ~30 seconds | LB detects failure, stops routing, callers retry |
| Full cluster restart (rolling deploy) | ~1–2 minutes | Nodes restart one at a time, brief reduced capacity |
| NTP correction causing wait spikes | Seconds | Brief latency SLO breach, not full unavailability |
| Hardware failure on one node | Until replacement | Reduced capacity, remaining nodes absorb load |
Duplicate IDs consume infinite budget#
A duplicate ID is not an error budget problem — it is a correctness failure. Error budgets measure availability and latency. Data corruption is outside the budget model entirely.
If a duplicate ID is ever detected, the incident response is not "how much budget did we consume?" — it is "which records are corrupted, how do we fix the data, and what code change caused this?"
Duplicate IDs are not an SLO issue — they are a correctness incident
SLOs measure degradation. A duplicate ID is a bug that corrupted production data. Treat it as a P0 incident with a full post-mortem, not as budget consumption.
Budget policy#
Healthy budget (>50% remaining): Deploy freely. Experiment with node configurations. Normal operations.
Degraded budget (10–50% remaining): Freeze non-critical changes. Investigate what consumed the budget. No risky deployments.
Budget exhausted (<10% remaining): Feature freeze. Only critical fixes. Post-mortem required before any changes. Focus entirely on reliability improvements.
Rolling deploy strategy#
When deploying a new version, restart nodes one at a time:
Node 1 restarts → Nodes 2 and 3 absorb all traffic
Node 1 healthy → Node 2 restarts → Nodes 1 and 3 serve traffic
Node 2 healthy → Node 3 restarts → Nodes 1 and 2 serve traffic
Each restart causes a brief capacity reduction — not an outage. Budget consumed per deploy is typically under 30 seconds — well within the 4.3-minute monthly allowance.