Availability#
Your system is running. But can users actually use it?
That's availability. Not whether your servers are on — whether the entire path from user to response is working.
What it is#
You open WhatsApp to send a message. It shows "Connecting..." and never connects.
The servers are running. The code is deployed. But you can't use it. That experience — the system being unreachable or non-functional when a user tries to use it — is what availability measures.
Availability = what percentage of the time is your system actually usable?
Example — system was down for 1 hour in a 30 day month:
What causes unavailability#
| Cause | Example |
|---|---|
| Node failure | A server crashes or loses power |
| Traffic overload | Too many requests, threads exhausted, requests rejected — 503 |
| Deployments | Engineer pushes new code, server restarts, brief window of no service |
| Network partition | Servers are fine but the network between them or to users goes down |
| Dependency failure | Your code is fine but a service you depend on goes down (Stripe, AWS S3) |
Availability is not just "is my server on or off"
It's whether the entire path from user to response is working. Any single point in that path failing = unavailability.
Why it matters in system design#
Every system design interview will ask about availability either directly or indirectly.
- "What happens if this server goes down?" — availability question
- "How do you handle failures?" — availability question
- "What's your SLO?" — availability question
It connects directly to what you've already learned: - SLO — your availability target (99.9%, 99.99%) - Error Budget — how much downtime you're allowed - Latency — a system that's too slow is effectively unavailable to the user