MTBF & MTTR
Your system is reliable right now. But how does it behave across weeks and months?
MTBF and MTTR are the two numbers that answer that question.
MTBF: Mean Time Between Failures
How long does your system run before something breaks?
If your server crashes once every 30 days on average, your MTBF is 30 days.
MTBF = Total operational time / Number of failures
Example:
System ran for 300 hours, failed 3 times
MTBF = 300 / 3 = 100 hours between failures
Higher MTBF = more reliable. The system breaks less often.
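The arithmetic above fits in a few lines of code. This is a minimal sketch, not a standard library API: the function name `mtbf` and its parameters are made up for illustration, and it assumes you already track operational hours and failure counts somewhere.

```python
def mtbf(total_operational_hours: float, failure_count: int) -> float:
    """Mean time between failures, in hours (hypothetical helper)."""
    if failure_count <= 0:
        # With zero failures the ratio is undefined, not "infinitely reliable".
        raise ValueError("MTBF is undefined without at least one failure")
    return total_operational_hours / failure_count


# Same numbers as the example: 300 hours of operation, 3 failures.
print(mtbf(300, 3))  # 100.0
```

Raising an error on zero failures is a design choice: a short observation window with no failures tells you very little about the real MTBF.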
MTTR: Mean Time To Recovery
When something breaks, how long until it's working again?
This covers the whole incident: detecting the failure, alerting the team, diagnosing, fixing, deploying the fix, and verifying the system is healthy again.
MTTR = Total downtime / Number of failures
Example:
3 failures, each took 30 minutes to fix
MTTR = 90 minutes / 3 = 30 minutes per failure
Lower MTTR = more resilient. You recover faster.
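In practice you would usually compute MTTR from incident timestamps rather than by hand. The sketch below assumes each incident is recorded as a pair of `datetime` values: when the failure started and when the system was verified healthy. The `mttr` function and the incident log are hypothetical, made up to match the example above.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery across (failed_at, healthy_at) pairs."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one failure")
    # Sum each outage's duration, starting from a zero-length timedelta.
    total_downtime = sum(
        (healthy_at - failed_at for failed_at, healthy_at in incidents),
        timedelta(),
    )
    return total_downtime / len(incidents)


# Hypothetical incident log: three outages of 30 minutes each.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    (datetime(2024, 1, 12, 14, 15), datetime(2024, 1, 12, 14, 45)),
    (datetime(2024, 1, 27, 22, 40), datetime(2024, 1, 27, 23, 10)),
]

print(mttr(incidents))  # 0:30:00
```

Measuring from the moment the failure started, not from the moment someone noticed it, keeps detection time inside the number.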
Why both matter together
MTBF tells you how often you fall down. MTTR tells you how fast you get back up.
A system with low MTBF but very low MTTR can still be highly available: it breaks often but recovers in seconds. Netflix's chaos engineering practice leans on this idea. Inject failures deliberately in production, drive MTTR toward zero, and don't rely on MTBF being high.
A system with high MTBF but terrible MTTR is dangerous — it rarely breaks, but when it does, it's down for hours.
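To make the comparison concrete, here is a rough sketch that estimates downtime over a 30-day window for two made-up systems. It assumes each failure cycle is one MTBF of uptime followed by one MTTR of downtime. The function name and both profiles are hypothetical, picked only to illustrate the point.

```python
def expected_downtime_hours(
    window_hours: float, mtbf_hours: float, mttr_hours: float
) -> float:
    """Estimated downtime in a window, assuming steady failure cycles."""
    # One cycle = mtbf_hours of uptime followed by mttr_hours of downtime.
    cycles = window_hours / (mtbf_hours + mttr_hours)
    return cycles * mttr_hours


WINDOW = 30 * 24  # 30 days, in hours

# Hypothetical profile A: fails about once an hour, recovers in 5 seconds.
flaky_but_fast = expected_downtime_hours(WINDOW, mtbf_hours=1.0, mttr_hours=5 / 3600)

# Hypothetical profile B: fails about once a month, takes 6 hours to recover.
sturdy_but_slow = expected_downtime_hours(WINDOW, mtbf_hours=700.0, mttr_hours=6.0)

print(f"A: {flaky_but_fast:.2f} h down, B: {sturdy_but_slow:.2f} h down")
```

With these invented numbers, the system that fails hundreds of times ends up with about 1 hour of downtime, while the one that fails roughly once ends up with about 6.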
The availability connection
These two numbers together determine your availability:
Availability = MTBF / (MTBF + MTTR)
Example:
MTBF = 99 hours (fails once every 99 hours)
MTTR = 1 hour (takes 1 hour to recover)
Availability = 99 / (99 + 1) = 99/100 = 99%
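Want 99.9%? Either make MTBF much larger (fail less often) or make MTTR much smaller (recover faster). With the numbers above, that means raising MTBF from 99 hours to 999 hours, or cutting MTTR from 1 hour to about 6 minutes. Two completely different engineering strategies.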
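You can rearrange the availability formula to ask "what would it take?" for a given target. This sketch is only algebra on Availability = MTBF / (MTBF + MTTR). The function names are made up for illustration, and targets are expressed as fractions (0.999 for 99.9%).

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def required_mtbf(target: float, mttr_hours: float) -> float:
    """MTBF needed to hit a target availability at a fixed MTTR."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    # Solve A = MTBF / (MTBF + MTTR) for MTBF.
    return target * mttr_hours / (1 - target)


def required_mttr(target: float, mtbf_hours: float) -> float:
    """MTTR needed to hit a target availability at a fixed MTBF."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    # Solve A = MTBF / (MTBF + MTTR) for MTTR.
    return mtbf_hours * (1 - target) / target


print(f"{availability(99, 1):.2%}")               # 99.00%
print(f"{required_mtbf(0.999, 1):.0f} hours")     # 999 hours
print(f"{required_mttr(0.999, 99) * 60:.1f} min") # 5.9 min
```

Starting from the 99% example, the same extra nine costs either a tenfold increase in time between failures or a tenfold cut in recovery time.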
How you improve each
Improving MTBF means preventing failures from happening:
- Better hardware
- Code reviews, testing, canary deployments
- Chaos engineering: find weaknesses before they cause a real outage
Improving MTTR means recovering faster when failures do happen:
- Automated alerting: know right away when something breaks
- Runbooks: engineers follow a playbook during an incident instead of improvising
- Automated rollback: bad deploy? One command reverts it
- Good observability: logs, metrics, and traces so you find the root cause fast
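Before picking an MTTR improvement, it helps to see where recovery time actually goes. The sketch below breaks one incident into the phases listed earlier (detect, alert, diagnose, fix, deploy, verify). Every duration is an invented number for illustration, not a measurement, and the helper names are hypothetical.

```python
# Hypothetical minutes spent in each recovery phase of one incident.
phases = {
    "detect": 12,
    "alert": 3,
    "diagnose": 25,
    "fix": 10,
    "deploy": 8,
    "verify": 2,
}


def recovery_minutes(phase_minutes: dict[str, int]) -> int:
    """Total recovery time is the sum of all phases."""
    return sum(phase_minutes.values())


def slowest_phase(phase_minutes: dict[str, int]) -> str:
    """The phase where an improvement would pay off most."""
    return max(phase_minutes, key=phase_minutes.get)


print(recovery_minutes(phases))  # 60
print(slowest_phase(phases))     # diagnose

# Assumption: automated alerting cuts detection from 12 minutes to 1.
with_alerting = {**phases, "detect": 1}
print(recovery_minutes(with_alerting))  # 49
```

In this made-up incident, diagnosis dominates, so observability would likely help more than a faster deploy pipeline.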
The key insight
Many engineers think mostly about MTBF: "how do I prevent failures?" But at scale, failures are inevitable. The teams that maintain high availability focus just as hard on MTTR: "when it breaks, how fast can we recover?"
Designing for low MTTR is often more cost-effective than designing for high MTBF.
Prevention gets steeply more expensive as you chase rarer and rarer failures. Fast recovery is an engineering discipline you can build comparatively cheaply: better alerting, runbooks, and automated rollback typically cost far less than trying to eliminate every possible failure.