Interview Cheatsheet

When does fault tolerance come up in an interview?

It comes up the moment you add any dependency — a database, a downstream service, a third-party API. The interviewer will ask "what happens when X fails?" — this cheatsheet is your answer.


The One-Line Answer

"I design for failure as the default, not the exception — timeouts on every call, retries with backoff, circuit breakers for sustained failures, bulkheads to contain blast radius, and graceful degradation so users always get something useful."


Moment 1: "What happens when Service X fails?"

Walk through the full failure handling chain:

"If the Payment service fails, timeouts come first: no call waits forever, so threads are freed quickly. Retries with exponential backoff handle transient failures. If failures are sustained, the circuit breaker opens and requests fail immediately instead of tying up threads and connections. The bulkhead ensures Payment's failure doesn't starve Recommendations or Notifications. And graceful degradation means the user sees a clear error on payment while the rest of the app works normally."

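If the interviewer asks for the retry step in code, a minimal sketch might look like the following. It assumes the wrapped call enforces its own timeout and signals a retryable failure by raising a hypothetical `TransientError`; the function name, parameters, and default values are illustrative, not taken from any specific library.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, a timeout)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(), retrying transient failures with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Backoff ceiling doubles each attempt: base, 2x, 4x, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(ceiling * rng())
```

Only `TransientError` is retried here; any other exception propagates immediately, which keeps non-retryable failures (bad input, auth errors) from being hammered.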

Moment 2: "What are the different ways a service can fail?"

The three failure modes:

| Type | Description | Detection | Fix |
|---|---|---|---|
| Crash | Service dies, unreachable | Health check, heartbeat | Redundancy + failover |
| Slow | Alive but too slow | Timeout | Timeout → retry → circuit breaker |
| Byzantine | Running but returning wrong answers | Data validation, anomaly detection | Checksums, monitoring, alerts |
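
As one illustration of the "checksums" fix for the Byzantine row, here is a small sketch that verifies a payload against a SHA-256 digest. It assumes the sender ships a digest alongside the payload; the function names are hypothetical. Note the limit of the technique: it catches corruption between sender and receiver, not a service that computes a wrong answer and then checksums it faithfully.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex-encoded SHA-256 digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def verify_payload(payload: bytes, expected_checksum: str) -> bytes:
    """Return the payload if its digest matches; otherwise raise so the caller can alert."""
    if sha256_hex(payload) != expected_checksum:
        raise ValueError("checksum mismatch: response may be corrupted")
    return payload
```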

Moment 3: "How do you prevent cascading failures?"

Two tools — always mention both:

Bulkhead: "Each downstream service gets its own thread pool — Payment's slowness exhausts its own 20 threads, not the shared pool. Recommendations and Notifications are completely unaffected."
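
A rough sketch of the bulkhead idea follows. Instead of separate thread pools it uses a per-dependency semaphore to cap concurrency, which gives the same isolation in fewer lines; the class name, exception, and limits are illustrative assumptions.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency's concurrency limit is already in use."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject right away instead of queueing
        # behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream service; the sizes are illustrative.
payment = Bulkhead("payment", max_concurrent=20)
recommendations = Bulkhead("recommendations", max_concurrent=10)
```

When all 20 payment slots are held by slow calls, further payment calls are rejected immediately, while `recommendations.run(...)` still has its own 10 slots.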

Circuit Breaker: "After N consecutive failures, the circuit opens and no more requests go to the broken service. Calls fail immediately, which frees threads right away. Every 30 seconds the breaker lets a trial request through (the half-open state) and closes again if it succeeds."
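
The state machine behind that answer can be sketched in a few lines. This is a simplified, single-threaded illustration, not a production breaker: the class name, the default threshold of 5, and the injectable `clock` are assumptions, and a real implementation would add locking and limit the half-open state to one probe at a time.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open: failing fast")
            # Recovery window elapsed: half-open, let this call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed probe.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```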


Graceful Degradation Decision Framework

Not everything should degrade gracefully

| Can it degrade? | Examples |
|---|---|
| ✅ Yes | Recommendations, search suggestions, social features, non-critical UI |
| ❌ No | Payments, bank balance, medical data, anything where wrong data causes harm |

The rule: If wrong data causes financial loss, legal liability, or safety issues — fail hard.
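
The rule can be shown side by side in code. In this sketch the function names, the `PaymentUnavailable` exception, and the "bestsellers" fallback are all hypothetical; the point is only the contrast between a path that substitutes a safe default and one that refuses to guess.

```python
class PaymentUnavailable(Exception):
    """Raised when a payment cannot be processed; never replaced by a guess."""


def get_recommendations(user_id, fetch, fallback=("bestsellers",)):
    """Degradable path: on any failure, serve a generic fallback list."""
    try:
        return list(fetch(user_id))
    except Exception:
        return list(fallback)


def charge(order_id, submit_charge):
    """Fail-hard path: surface a clear error instead of substituting data."""
    try:
        return submit_charge(order_id)
    except Exception as exc:
        raise PaymentUnavailable(
            f"payment for order {order_id} could not be processed"
        ) from exc
```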


The Full Fault Tolerance Checklist

  • [ ] Named all three failure modes: crash, slow, Byzantine
  • [ ] Timeouts on every downstream call — connect + read + write
  • [ ] Retries with exponential backoff + jitter for transient failures
  • [ ] Identified which operations are safe to retry (idempotent) vs not (payments, creates)
  • [ ] Circuit breaker for sustained failures — N failures → open → test every 30s
  • [ ] Bulkhead — isolated thread/connection pools per downstream service
  • [ ] Graceful degradation — decided which paths can degrade and which must fail hard
  • [ ] Mentioned redundancy for crash failures
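
A common follow-up to the idempotency item is "so how do you retry a payment safely?" One widely used answer is an idempotency key: the client sends the same key on every retry, and the server returns the stored result instead of charging again. The sketch below is an in-memory illustration under that assumption; the class and field names are hypothetical, and a real system would persist the keys and handle concurrent requests.

```python
class PaymentProcessor:
    """Toy server-side handler that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> stored result
        self.charges_made = 0   # how many real charges were executed

    def charge(self, idempotency_key, amount_cents):
        # A retry with a known key returns the original result
        # without charging a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```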