MTTR vs RTO
You have a replica running. But what is it actually doing while it waits?
And what does RPO = 0 really cost you?
Warm standby vs hot standby: what's the actual difference
Both have a replica running on a separate machine, continuously syncing with the primary. The difference is what that replica is doing while it waits.
Hot standby — the replica is live and actively serving read traffic. Failover is automated: a health check detects the primary is dead and promotes the replica within seconds. No human involved.
Primary ──writes──▶ Replica (serving reads, fully synced)
Primary dies at 3:00:00
Health check fires at 3:00:05
Replica promoted at 3:00:08
Traffic redirected at 3:00:10
RTO = ~10 seconds
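The automated detection step in the timeline above can be sketched as a small failure detector. This is a minimal illustration, not any particular tool's implementation: the `FailoverMonitor` class, the `failure_threshold` parameter, and the rule of waiting for several consecutive failed checks (to avoid promoting on a single dropped probe) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Promotes the replica after several consecutive failed health checks."""

    failure_threshold: int = 3      # hypothetical: failed checks required before promoting
    consecutive_failures: int = 0
    promoted: bool = False

    def record_check(self, primary_healthy: bool) -> bool:
        """Feed in one health-check result; return True if this call triggers promotion."""
        if self.promoted:
            return False
        if primary_healthy:
            # Any successful check resets the counter.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.promoted = True
            return True
        return False


def seconds_until_promotion(check_results, interval_s, monitor):
    """Return the elapsed seconds at which promotion fires, or None if it never does."""
    for i, healthy in enumerate(check_results, start=1):
        if monitor.record_check(healthy):
            return i * interval_s
    return None
```

With a 1-second check interval and a threshold of 3, the primary has to miss three checks in a row before the replica is promoted. A lower threshold shortens RTO but makes a false failover more likely.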
Warm standby — the replica is running and staying in sync, but serving no traffic. It's just sitting there replicating. When the primary dies, someone (or a script) has to manually promote it, update the load balancer config, run health checks, and verify it's ready. That process takes minutes.
Primary ──writes──▶ Replica (syncing, idle — serving nothing)
Primary dies at 3:00:00
Engineer gets paged at 3:02:00
Promotes replica, updates config at 3:08:00
RTO = ~8 minutes
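The RTO figures in the two timelines are just the gap between the moment the primary dies and the moment traffic is flowing again. A minimal sketch of that arithmetic, using the timestamps from the examples above (the helper name `rto_seconds` is made up for illustration):

```python
from datetime import datetime


def rto_seconds(failure_at: str, traffic_restored_at: str) -> float:
    """Seconds between the primary failing and traffic being served again."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(traffic_restored_at, fmt) - datetime.strptime(failure_at, fmt)
    return delta.total_seconds()


hot = rto_seconds("3:00:00", "3:00:10")    # hot standby timeline
warm = rto_seconds("3:00:00", "3:08:00")   # warm standby timeline
print(f"hot: {hot:.0f} s, warm: {warm / 60:.0f} min")  # prints "hot: 10 s, warm: 8 min"
```

In the warm standby case the first two minutes are spent waiting for a human to be paged, before any recovery work starts.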
| | Hot Standby | Warm Standby |
|---|---|---|
| Serving traffic? | Yes — reads | No — just replicating |
| Failover | Automated, seconds | Manual or scripted, minutes |
| RTO | Seconds | Minutes |
| Cost | Higher (doing real work) | Slightly lower (idle server) |
Warm standby is a spare tyre in the boot — you still have to pull over and change it. Hot standby is a second engine that kicks in automatically.
The hidden cost of RPO = 0
RPO (recovery point objective) = 0 means zero data loss: every write must be confirmed on the replica before the user gets a response. That's synchronous replication.
Synchronous replication adds latency to every write, because the write has to travel to the replica and be confirmed before the user gets a response. If the replica is in another datacenter, that adds a full network round trip, often 50–100ms between distant regions, to every single write operation.
Async replication (RPO = seconds):
User writes → Primary confirms → User gets response ✓
↓ (background)
Replica replicates
Sync replication (RPO = 0):
User writes → Primary writes → Replica writes → Both confirm → User gets response ✓
←————————— 50–100ms round trip if cross-datacenter ————————————→
This is why most systems use asynchronous replication — slightly higher RPO (seconds of potential data loss), but no latency penalty on writes. The business decides which trade-off is acceptable.
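The trade-off can be made concrete with a toy latency model. This is a sketch under simplifying assumptions: the 5 ms local commit time is an invented figure, the replica's own commit time is folded into the round trip, and the function names are hypothetical.

```python
def write_latency_ms(local_commit_ms: float, replica_rtt_ms: float, synchronous: bool) -> float:
    """Latency the user sees for a single write.

    Async: the primary confirms as soon as its own commit finishes.
    Sync:  the primary also waits one round trip for the replica's confirmation.
    """
    if synchronous:
        return local_commit_ms + replica_rtt_ms
    return local_commit_ms


def max_serial_writes_per_second(latency_ms: float) -> float:
    """Upper bound on back-to-back writes over one connection at this latency."""
    return 1000.0 / latency_ms


LOCAL_COMMIT_MS = 5.0  # assumed figure, for illustration only

for rtt_ms in (1.0, 50.0, 100.0):  # same rack vs. cross-datacenter round trips
    sync_ms = write_latency_ms(LOCAL_COMMIT_MS, rtt_ms, synchronous=True)
    async_ms = write_latency_ms(LOCAL_COMMIT_MS, rtt_ms, synchronous=False)
    print(f"rtt={rtt_ms:>5.1f} ms  async={async_ms:.1f} ms  sync={sync_ms:.1f} ms")
```

Under these assumptions a 50 ms round trip turns a 5 ms write into a 55 ms write, and a single connection issuing writes one after another drops from about 200 writes per second to about 18.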
RPO = 0 is not free. Every write pays a latency tax equal to the round-trip time to your replica. At large scale (say 100k writes/sec), adding 50ms to every write is a significant cost: each write stays open 50ms longer, so roughly 5,000 additional writes are in flight at any moment. Reserve synchronous replication for data where loss is truly unacceptable, such as financial transactions and billing records.
How they drive architecture decisions
| RTO | Strategy | What it requires |
|---|---|---|
| Hours | Snapshot restore | Backup file in remote storage (S3), no standby server needed |
| Minutes | Warm standby replica | A replica is running but not serving traffic, manual failover |
| Seconds | Hot standby replica | A replica is live, automated failover promotes it within seconds |
| Zero | Active-Active | Multiple live primaries that all accept writes, so there is no failover step |
| RPO | Strategy | What it requires |
|---|---|---|
| 24 hours | Daily snapshots to S3 | Simplest and cheapest — acceptable for non-critical data |
| 1 hour | Hourly snapshots | Still snapshot-based, just more frequent |
| Seconds to minutes | Async replication | Live replica with small replication lag |
| Zero | Sync replication | Write confirmed only after both primary and replica have written it |
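The two tables can be read as lookup rules: given a target, pick the cheapest strategy that still meets it. A minimal sketch follows. The exact cut-offs (60 seconds, one hour) are assumptions chosen to match the orders of magnitude in the tables, not hard limits, and the function names are hypothetical.

```python
def recovery_strategy(rto_seconds: float) -> str:
    """Map a recovery time objective to the strategy from the RTO table."""
    if rto_seconds <= 0:
        return "active-active"
    if rto_seconds < 60:          # assumed boundary between "seconds" and "minutes"
        return "hot standby"
    if rto_seconds < 3600:        # assumed boundary between "minutes" and "hours"
        return "warm standby"
    return "snapshot restore"


def backup_strategy(rpo_seconds: float) -> str:
    """Map a recovery point objective to the strategy from the RPO table."""
    if rpo_seconds <= 0:
        return "synchronous replication"
    if rpo_seconds < 3600:        # below an hour, snapshots are assumed too coarse
        return "asynchronous replication"
    return "periodic snapshots"


# Example: the two systems a designer might be asked about.
print(recovery_strategy(10), "+", backup_strategy(0))
print(recovery_strategy(4 * 3600), "+", backup_strategy(24 * 3600))
```

The two answers are independent: a system can tolerate hours of downtime yet no data loss, or the reverse, so each target is looked up separately.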
MTTR vs RTO: they sound similar but are different things
- MTTR (mean time to recovery): how long recovery actually takes, averaged over past incidents. A measurement.
- RTO (recovery time objective): the maximum downtime the business will accept. A target you design to.
You design your system so that MTTR stays below RTO. If your RTO is 30 minutes, your recovery process — detection, failover, verification — must consistently complete in under 30 minutes.
RTO = 30 min (the ceiling the business set)
MTTR = 8 min (what actually happens based on incident history)
→ You have headroom. You're meeting the SLA comfortably.
If MTTR creeps up to 35 min → you're breaching RTO → time to improve your recovery process.
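The comparison above can be sketched as a short check over incident history. The incident durations below are made-up numbers chosen to reproduce the 8-minute and 35-minute cases, and `rto_report` is a hypothetical helper. It also reports the worst single incident, because an average below the RTO does not guarantee that every incident stayed below it.

```python
from statistics import mean


def rto_report(recovery_minutes, rto_minutes):
    """Summarise past recovery times against the RTO ceiling."""
    worst = max(recovery_minutes)
    return {
        "mttr": mean(recovery_minutes),                               # average recovery time
        "worst": worst,                                               # slowest single recovery
        "breaches": sum(1 for m in recovery_minutes if m > rto_minutes),
        "mttr_within_rto": mean(recovery_minutes) <= rto_minutes,
    }


RTO_MINUTES = 30

healthy = rto_report([6, 9, 7, 10], RTO_MINUTES)     # MTTR = 8 min, no breaches
degraded = rto_report([20, 25, 60], RTO_MINUTES)     # MTTR = 35 min, one breach

print(healthy)
print(degraded)
```

In the second history only one incident ran long, but it was enough to push the average past the ceiling.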
In an interview — ask for both before designing
"What's the RTO and RPO for this system?"
The answers tell you exactly what backup strategy and replication model to use. A fintech system with RPO = 0 needs synchronous replication to another datacenter. An internal analytics dashboard with RPO = 24 hours just needs daily snapshots to S3. Same question, completely different architectures.