Interview Cheatsheet — Availability#
When does availability come up in an interview and what do you actually say?
Three moments — requirements, component proposals, and scaling discussion.
Moment 1 — Requirements Phase#
Before drawing anything, ask:
"What availability SLO are we targeting — 99.9% or 99.99%?"
Then use the answer to justify your architecture:
| SLO | What it implies |
|---|---|
| 99.9% | Single region, automated failover, 43 min downtime/month acceptable |
| 99.99% | Multi-AZ mandatory, zero-downtime deployments, 4 min downtime/month |
| 99.999% | Multi-region active-active, sub-second failover, 26 sec downtime/month |
This one question signals seniority immediately
Most candidates jump straight to designing. You're quantifying the availability requirement first. That's what senior engineers do.
Moment 2 — Every Component You Draw#
Every time you add a box to the diagram, the interviewer can ask "what happens if this goes down?"
Always follow this pattern:
"This is a potential SPOF. To eliminate it I'd run [component] in [active-active / active-passive]. Health checks detect failure within seconds and traffic automatically reroutes."
For stateless components (app servers, API servers): "I'd run three app servers in active-active behind the load balancer. Any server can handle any request — if one dies the other two absorb the traffic instantly with no failover delay."
For stateful components (databases): "I'd run the database in active-passive — primary handles all writes, replica is on hot standby. Automated failover promotes the replica within 30 seconds if the primary dies."
For the load balancer itself: "The load balancer is also a SPOF. I'd run two load balancers — if one dies, DNS or a floating IP switches traffic to the second one."
Moment 3 — Scaling Discussion#
When you add multiple servers for throughput, availability comes for free — but say it out loud:
"I'm adding three app servers behind a load balancer. This handles the throughput requirement AND eliminates the app server as a SPOF — three servers in parallel at 99.9% each gives us 99.9999% availability at that tier."
Two birds, one stone. Interviewers love when you connect decisions to multiple requirements.
The One Trap to Avoid#
Never say "we'll add redundancy" without specifying what type and how failover works.
❌ "We'll make the database redundant"
✅ "We'll run the database in Active-Passive — primary handles all writes, replica is on standby. Automated failover promotes the replica within 30 seconds if the primary dies. Until a new replica is provisioned, we're temporarily back to a SPOF — so we'd alert on this and provision a new replica automatically."
The Full Availability Checklist for Every Design#
- [ ] Asked for the availability SLO before designing
- [ ] Every component in the diagram has redundancy — no SPOFs
- [ ] Specified active-active for stateless, active-passive for stateful
- [ ] Mentioned health checks and automatic failover
- [ ] Connected parallel redundancy to the availability math when relevant
- [ ] Measured availability from user perspective (request success rate) not server uptime