
Fault Tolerance: Overview

A fault-tolerant system doesn't pretend failures won't happen. It's designed knowing they will.

Every system fails: servers crash, networks drop, services slow down. Fault tolerance is the art of keeping the system useful despite those failures. This folder covers how failures happen, how to contain them, and the patterns that prevent one broken component from taking down everything else.


Files in this folder

| File | Topic |
| --- | --- |
| 01-Fault-Tolerance.md | What it is, the three failure modes: crash, slow, Byzantine |
| 02-Graceful-Degradation.md | Return something useful rather than total failure |
| 03-Bulkhead.md | Isolate failures so one component can't take down others |
| 04-Timeout-Retry-Backoff.md | Don't wait forever, retry smartly, back off exponentially |
| 05-Circuit-Breaker.md | Stop trying when you know something is broken |
| 06-Interview-Cheatsheet.md | What to say in an interview, full checklist |