Interview Cheatsheet — Reliability#
When does reliability come up in an interview and what do you actually say?
Three moments — requirements, component design, and failure discussion.
Moment 1 — Requirements Phase#
Before designing, ask two questions:
"What's the RTO — how long can the system be down after a failure?" "What's the RPO — how much data loss is acceptable?"
Then use the answers to justify your architecture:
| RTO | What to build |
|---|---|
| Hours | Restore from backup on failure |
| Minutes | Warm standby — secondary system ready but idle |
| Seconds | Hot standby — Active-Passive with automated failover |
| Zero | Active-Active multi-region |
| RPO | What to build |
|---|---|
| 24 hours | Daily backups |
| 1 hour | Hourly snapshots |
| Minutes | Async replication with small lag |
| Zero | Synchronous replication — warn about write latency cost |
These two questions immediately signal seniority
Most candidates start drawing boxes. You're quantifying failure tolerance first.
Moment 2 — Distinguishing Availability from Reliability#
When an interviewer asks "how do you handle failures?" — most candidates only talk about uptime. Go further:
"I'd separate availability and reliability as two distinct SLIs. For availability I'd track request success rate and eliminate SPOFs with redundancy. For reliability I'd track error rate — 500s, wrong responses, stale data — separately. A system can be fully available and completely unreliable at the same time."
Then give a concrete example relevant to the system you're designing: - E-commerce → pricing service bug returning $0 for all products - Chat app → replication lag causing messages to appear out of order - News feed → stale cache showing 3-hour-old posts as new
Moment 3 — Failure Discussion#
When asked "what happens when this component fails?" — cover both MTBF and MTTR:
"To keep MTBF high I'd use canary deployments and chaos testing to catch weaknesses before they hit production. But failures are inevitable at scale — so I'd focus equally on MTTR: automated alerting so we know within seconds, runbooks so engineers aren't improvising during incidents, and automated rollback for bad deploys."
Then tie it back to your RTO:
"Our RTO is 15 minutes, so our entire recovery process — detection, diagnosis, rollback — needs to consistently complete in under 15 minutes. That drives the investment in observability and automation."
The Reliability Checklist for Every Design#
- [ ] Asked for RTO and RPO before designing
- [ ] Separated availability SLI (uptime) from reliability SLI (error rate)
- [ ] Identified at least one way the system can be available but return wrong data
- [ ] Specified replication strategy based on RPO (async vs sync)
- [ ] Specified recovery strategy based on RTO (backup restore / warm standby / hot standby / active-active)
- [ ] Addressed both MTBF (prevention) and MTTR (fast recovery)
Quick Reference#
Reliability = correct answers consistently over time
Availability = can users reach the system?
(a system can be both available AND unreliable)
MTBF = Total uptime / Number of failures (higher = better)
MTTR = Total downtime / Number of failures (lower = better)
Availability = MTBF / (MTBF + MTTR)
RTO = max acceptable downtime after failure (drives recovery architecture)
RPO = max acceptable data loss after failure (drives replication strategy)
503 → availability problem (server never received the request)
500 → reliability problem (server received it, processed it, failed)