Reliability vs Availability

Both measure how "healthy" a system is. So why are they different?

Because a system can be perfectly available and completely broken at the same time.


The core distinction

Availability asks: can users reach the system? Reliability asks: when they reach it, do they get correct answers?

These are independent. You can have any combination:

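|              | Available                              | Not Available                                      |
|--------------|----------------------------------------|----------------------------------------------------|
| Reliable     | System is up, responses are correct ✅ | System is down, but when it was up it was correct  |
| Not Reliable | System is up, responses are wrong ❌   | System is down and was returning wrong data anyway |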

The dangerous quadrant is available but not reliable — users are reaching the system, getting responses, and trusting those responses. But the responses are wrong.
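The quadrants can be told apart with a known-answer probe: send a request whose correct result is known in advance, then check reachability and correctness separately. The sketch below is a minimal illustration, not a monitoring tool; `probe` and the three stand-in functions are hypothetical names, and a raised `ConnectionError` is assumed to mean "unreachable".

```python
from typing import Callable


def probe(call: Callable[[], int], expected: int) -> str:
    """Run one known-answer check and name the quadrant it lands in.

    `call` is a hypothetical stand-in for a real request. It is assumed to
    raise ConnectionError when the system cannot be reached.
    """
    try:
        actual = call()
    except ConnectionError:
        # Unreachable: this single probe says nothing about correctness.
        return "not available"
    if actual == expected:
        return "available and reliable"
    return "available but not reliable"


def healthy_price() -> int:
    return 1999  # price in cents, matches the expected value


def buggy_price() -> int:
    return 0  # the service answers, but the answer is wrong


def crashed_price() -> int:
    raise ConnectionError("connection refused")


for check in (healthy_price, buggy_price, crashed_price):
    print(check.__name__, "->", probe(check, expected=1999))
```

Only a check that compares the answer against a known value can detect the dangerous quadrant; a plain "is it up?" ping reports `buggy_price` as healthy.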


Same failure, different diagnosis

| What happened                                          | Which problem                               |
|--------------------------------------------------------|---------------------------------------------|
| Server crashes, users get connection refused           | Availability: system unreachable            |
| Server is up, pricing bug returns $0 for all products  | Reliability: wrong response                 |
| DB goes down, users get 503                            | Availability: dependency failure            |
| DB replication lag, users see stale data               | Reliability: incorrect data                 |
| Too many requests, server rejects with 503             | Availability: overwhelmed                   |
| Too many concurrent writes, data gets corrupted        | Reliability: correctness failure under load |

The 5xx reminder

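As a rule of thumb, HTTP status codes make the split concrete:

  • 503 (Service Unavailable): the server could not take on the request at all → Availability problem
  • 500 (Internal Server Error): the server accepted the request but failed while processing it → Reliability problem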

A 500 means the system was available enough to receive and process your request — it just couldn't complete it correctly.
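The rule of thumb above can be written down as a small classifier. This is a sketch of the heuristic used in this section, not a standard mapping; `diagnose` and its `body_is_correct` parameter are hypothetical names. It also covers the case status codes cannot reveal on their own: a 2xx response whose body is wrong, like the $0 pricing bug.

```python
def diagnose(status: int, body_is_correct: bool = True) -> str:
    """Map one HTTP response to the kind of problem it suggests.

    Follows this section's heuristic: 503 points at availability,
    500 points at reliability. Other codes are left unclassified.
    """
    if status == 503:
        return "availability"  # request was never processed
    if status == 500:
        return "reliability"  # request was processed but failed
    if 200 <= status < 300:
        # A success code with a wrong body is still a reliability failure.
        return "ok" if body_is_correct else "reliability"
    return "unclassified"


print(diagnose(503))
print(diagnose(500))
print(diagnose(200, body_is_correct=False))
print(diagnose(200))
```

`body_is_correct` has to come from somewhere outside the HTTP layer, such as a known-answer check or a comparison against a source of truth.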


Different solutions

Solving availability does not solve reliability. They require completely separate engineering work.

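| Problem                                  | Solution                                                    |
|------------------------------------------|-------------------------------------------------------------|
| System crashes                           | Add redundancy, automated failover                          |
| System returns stale data                | Fix cache invalidation                                      |
| System is overloaded                     | Add capacity, load balancing                                |
| System corrupts writes under concurrency | Fix locking, transactions                                   |
| System is unreachable                    | Fix network, eliminate single points of failure (SPOFs)     |
| System returns wrong results             | Fix the bug, fix replication logic                          |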
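To illustrate the "fix locking" row, here is a minimal sketch of a check-then-write operation made safe under concurrency. The `Inventory` class and `reserve_concurrently` helper are hypothetical, and the in-process `threading.Lock` merely stands in for whatever a real system would use, such as a database transaction or row lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Inventory:
    """Hypothetical stock counter that must never oversell."""

    def __init__(self, stock: int) -> None:
        self._stock = stock
        self._lock = threading.Lock()

    def reserve(self) -> bool:
        # The check and the decrement happen under one lock, so two
        # threads cannot both observe the last unit and both take it.
        with self._lock:
            if self._stock > 0:
                self._stock -= 1
                return True
            return False

    @property
    def stock(self) -> int:
        return self._stock


def reserve_concurrently(inventory: Inventory, attempts: int, workers: int = 8) -> int:
    """Fire `attempts` reservations from a thread pool; return how many succeeded."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda _: inventory.reserve(), range(attempts)))
    return sum(outcomes)


inventory = Inventory(stock=100)
sold = reserve_concurrently(inventory, attempts=500)
print(f"sold={sold}, remaining={inventory.stock}")
```

Adding servers would only increase the number of concurrent writers here; the fix is in how the write itself is guarded.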

Adding more servers does not fix a reliability problem

A hundred servers all returning wrong answers is not reliable — it's just very available at being wrong.
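A short piece of arithmetic shows why. The sketch below assumes replicas fail independently, each being up with the same probability, and that every replica runs the same buggy code; both function names and the example numbers are illustrative, not measurements.

```python
def availability_with_replicas(per_replica_up: float, replicas: int) -> float:
    """Probability that at least one replica is up.

    Assumes independent failures, which real deployments only approximate.
    """
    return 1.0 - (1.0 - per_replica_up) ** replicas


def correct_fraction(bug_rate: float, replicas: int) -> float:
    """Fraction of responses that are correct.

    Every replica runs the same code, so the replica count has no effect.
    """
    return 1.0 - bug_rate


for n in (1, 2, 3, 100):
    up = availability_with_replicas(per_replica_up=0.99, replicas=n)
    ok = correct_fraction(bug_rate=0.02, replicas=n)
    print(f"replicas={n:3d}  availability={up:.6f}  correct={ok:.2f}")
```

Availability climbs toward 1 as replicas are added, while the correct fraction stays where the bug left it.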


Measured separately as SLIs

Because they're independent problems, they need independent measurements:

Availability SLI  =  successful requests / total requests        target: ≥ 99.9%
Reliability SLI   =  correct responses / total responses         target: ≥ 99.9% (error rate < 0.1%)

A system can meet its availability SLO (99.9% of requests succeed) and badly miss its reliability SLO (for example, 2% of responses are wrong) at the same time.
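The two ratios can be computed from the same request log, as long as each entry records both whether a response was served and whether it was correct. The sketch below uses a hypothetical `Request` record and made-up counts; how "correct" gets determined in practice is outside its scope.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    served: bool   # did the system return a response at all?
    correct: bool  # was the response right? ignored when served is False


def availability_sli(log: Sequence[Request]) -> float:
    """Successful requests / total requests."""
    if not log:
        raise ValueError("empty log")
    return sum(r.served for r in log) / len(log)


def reliability_sli(log: Sequence[Request]) -> float:
    """Correct responses / total responses (only served requests count)."""
    served = [r for r in log if r.served]
    if not served:
        raise ValueError("no responses to evaluate")
    return sum(r.correct for r in served) / len(served)


# Illustrative log: 10,000 requests, 5 never served, 200 served but wrong.
log = (
    [Request(served=False, correct=False)] * 5
    + [Request(served=True, correct=False)] * 200
    + [Request(served=True, correct=True)] * 9795
)

SLO = 0.999
print(f"availability {availability_sli(log):.4f}  meets SLO: {availability_sli(log) >= SLO}")
print(f"reliability  {reliability_sli(log):.4f}  meets SLO: {reliability_sli(log) >= SLO}")
```

With these made-up numbers the availability SLI clears 99.9% while roughly 2% of responses are wrong, so a dashboard that tracks only the first ratio would show a healthy system.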

In an interview — always address both explicitly

"For availability I'd eliminate SPOFs with redundancy and automatic failover. For reliability I'd ensure strong consistency on writes, proper cache invalidation, and track error rate as a separate SLI from uptime."