Reliability vs Availability
Both measure how "healthy" a system is. So why are they different?
Because a system can be perfectly available and completely broken at the same time.
The core distinction#
Availability asks: can users reach the system? Reliability asks: when they reach it, do they get correct answers?
These are independent. You can have any combination:
| Available | Not Available | |
|---|---|---|
| Reliable | System is up, responses are correct ✅ | System is down, but when it was up it was correct |
| Not Reliable | System is up, responses are wrong ❌ | System is down and was returning wrong data anyway |
The dangerous quadrant is available but not reliable — users are reaching the system, getting responses, and trusting those responses. But the responses are wrong.
Same failure, different diagnosis#
| What happened | Which problem |
|---|---|
| Server crashes, users get connection refused | Availability — system unreachable |
| Server is up, pricing bug returns $0 for all products | Reliability — wrong response |
| DB goes down, users get 503 | Availability — dependency failure |
| DB replication lag, users see stale data | Reliability — incorrect data |
| Too many requests, server rejects with 503 | Availability — overwhelmed |
| Too many concurrent writes, data gets corrupted | Reliability — correctness failure under load |
The 5xx reminder#
HTTP status codes make the split concrete:
- 503 — server never processed the request → Availability problem
- 500 — server processed it but failed → Reliability problem
A 500 means the system was available enough to receive and process your request — it just couldn't complete it correctly.
Different solutions#
Solving availability does not solve reliability. They require completely separate engineering work.
| Problem | Solution |
|---|---|
| System crashes | Add redundancy, automated failover |
| System returns stale data | Fix cache invalidation |
| System is overloaded | Add capacity, load balancing |
| System corrupts writes under concurrency | Fix locking, transactions |
| System is unreachable | Fix network, eliminate SPOF |
| System returns wrong results | Fix the bug, fix replication logic |
Adding more servers does not fix a reliability problem
A hundred servers all returning wrong answers is not reliable — it's just very available at being wrong.
Measured separately as SLIs#
Because they're independent problems, they need independent measurements:
Availability SLI = successful requests / total requests target: 99.9%
Reliability SLI = correct responses / total responses target: < 0.1% error rate
A system can hit its availability SLO (99.9% uptime) and completely miss its reliability SLO (2% error rate) at the same time.
In an interview — always address both explicitly
"For availability I'd eliminate SPOFs with redundancy and automatic failover. For reliability I'd ensure strong consistency on writes, proper cache invalidation, and track error rate as a separate SLI from uptime."