Interview Cheatsheet

When does fault tolerance come up in an interview?

It comes up the moment you add any dependency — a database, a downstream service, a third-party API. The interviewer will ask "what happens when X fails?" — this cheatsheet is your answer.

The One-Line Answer

"I design for failure as the default, not the exception — timeouts on every call, retries with backoff, circuit breakers for sustained failures, bulkheads to contain blast radius, and graceful degradation so users always get something useful."

Moment 1: "What happens when Service X fails?"

Walk through the full failure-handling chain:

"If the Payment service fails, timeouts come first: they ensure we don't wait forever, so threads are freed quickly. Retries with exponential backoff handle transient failures, as long as the call is safe to retry. If failures are sustained, the circuit breaker opens and requests fail immediately instead of waiting, so no resources are wasted. The bulkhead ensures Payment's failure doesn't starve Recommendations or Notifications. And graceful degradation means the user sees a clear error on payment while the rest of the app works normally."

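The retry step of that chain can be sketched in a few lines of Python. This is a minimal illustration, not a production client: `call_with_retries`, `TransientError`, and the default values are hypothetical names and numbers chosen for the example, and the per-call timeout is assumed to be enforced inside `operation` itself (for instance through the HTTP client's own timeout setting).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Run `operation`, retrying transient failures with backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random delay up to the ceiling, so that many
            # clients retrying at once do not hit the service in lockstep.
            sleep(random.uniform(0, ceiling))
```

Only wrap calls that are safe to repeat. The `sleep` parameter is injectable purely so the delays can be observed in tests without actually waiting.
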
Moment 2: "What are the different ways a service can fail?"

The three failure modes:

Type	Description	Detection	Fix
Crash	Service dies, unreachable	Health check, heartbeat	Redundancy + failover
Slow	Alive but responding too slowly	Timeout	Timeout → retry → circuit breaker
Byzantine	Running but returning wrong answers	Data validation, anomaly detection	Checksums, monitoring, alerts
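
As a rough illustration of the Detection column, the sketch below maps the outcome of one call onto the three modes. Everything here is an assumption made for the example: `classify_response` and `checksum` are hypothetical helpers, and the response is assumed to be a dict carrying a `payload` plus a SHA-256 `checksum` computed by the sender. A checksum catches corruption in transit or storage; it will not catch a service that computes a wrong answer and checksums it faithfully, which is why the table also lists anomaly detection.

```python
import hashlib
import json


def checksum(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_response(fetch):
    """Map the outcome of `fetch()` onto crash / slow / byzantine / ok.

    `fetch` is a hypothetical callable returning
    {"payload": {...}, "checksum": "<hex digest>"}.
    """
    try:
        response = fetch()
    except TimeoutError:
        return "slow"       # alive, but did not answer in time
    except ConnectionError:
        return "crash"      # unreachable (refused, reset, ...)
    if checksum(response["payload"]) != response["checksum"]:
        return "byzantine"  # answered, but the data does not verify
    return "ok"
```
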

Moment 3: "How do you prevent cascading failures?"

Two tools — always mention both:

Bulkhead: "Each downstream service gets its own thread pool — Payment's slowness exhausts its own 20 threads, not the shared pool. Recommendations and Notifications are completely unaffected."
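
A minimal sketch of the bulkhead idea follows. Note the swap: instead of a dedicated thread pool per service, it caps concurrent calls per service with a semaphore, which gives the same isolation in less code. `Bulkhead`, `BulkheadFullError`, and the pool sizes are hypothetical names and numbers for illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a service's compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Non-blocking acquire: reject at once rather than queue behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()  # always give the slot back


# One compartment per downstream dependency; the sizes are illustrative.
bulkheads = {
    "payment": Bulkhead("payment", max_concurrent=20),
    "recommendations": Bulkhead("recommendations", max_concurrent=10),
}
```

When all 20 payment slots are held by slow calls, further payment calls are rejected immediately, while `bulkheads["recommendations"]` still has its own 10 slots available.
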

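Circuit Breaker: "After N consecutive failures, the circuit opens and no more requests go to the broken service. Calls fail immediately and free their threads at once. Every 30 seconds the breaker lets a trial request through (the half-open state): if it succeeds the circuit closes, otherwise it stays open."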
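
The state machine behind a circuit breaker is small enough to sketch. The version below is a single-threaded illustration under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and the threshold of 5 are hypothetical, the 30-second recovery window comes from the text above, and a real implementation would need locking so that only one trial request runs at a time.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0      # consecutive failures seen so far
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Recovery window elapsed: let this one call through as a trial.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                # Open the circuit, or re-open it after a failed trial.
                self._opened_at = self._clock()
            raise
        # Success: reset the count and close the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

The `clock` parameter exists only so tests can advance time without sleeping.
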

Graceful Degradation Decision Framework

Not everything should degrade gracefully.

Can it degrade?	Examples
✅ Yes	Recommendations, search suggestions, social features, non-critical UI
❌ No	Payments, bank balance, medical data, anything where wrong data = harm

The rule: if wrong data could cause financial loss, legal liability, or safety issues, fail hard.
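
In code, the two rows of that table look quite different. The sketch below is illustrative only: `get_recommendations` and `get_balance` are hypothetical functions, and `fetch` stands in for whatever call reaches the downstream service.

```python
def get_recommendations(fetch, fallback=()):
    """Degradable path: on any failure, return a safe default instead of an error."""
    try:
        return list(fetch())
    except Exception:
        # For example a cached or generic "popular items" list.
        return list(fallback)


def get_balance(fetch):
    """Must-fail-hard path: never substitute a guessed or stale balance."""
    # No try/except on purpose: the exception propagates so the caller
    # shows an explicit error rather than a wrong number.
    return fetch()
```
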

The Full Fault Tolerance Checklist

[ ] Named all three failure modes — crash, slow, byzantine
[ ] Timeouts on every downstream call — connect + read + write
[ ] Retries with exponential backoff + jitter for transient failures
[ ] Identified which operations are safe to retry (idempotent) vs not (payments, creates)
[ ] Circuit breaker for sustained failures — N failures → open → test every 30s
[ ] Bulkhead — isolated thread/connection pools per downstream service
[ ] Graceful degradation — decided which paths can degrade and which must fail hard
[ ] Mentioned redundancy for crash failures