MTTR vs RTO

You have a replica running. But what is it actually doing while it waits?

And what does RPO = 0 really cost you?


Warm standby vs hot standby: what's the actual difference

Both have a replica running on a separate machine, continuously syncing with the primary. The difference is what that replica is doing while it waits.

Hot standby — the replica is live and actively serving read traffic. Failover is automated: a health check detects the primary is dead and promotes the replica within seconds. No human involved.

Primary  ──writes──▶  Replica (serving reads, fully synced)

Primary dies at 3:00:00
Health check fires  at 3:00:05
Replica promoted    at 3:00:08
Traffic redirected  at 3:00:10
RTO = ~10 seconds
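
As a rough illustration of the automated path, here is a minimal Python sketch of a health-check loop that promotes the replica after several consecutive failed probes. The function name `run_failover_monitor`, the `failures_needed` threshold of 3, and the hard-coded list of probe results are all assumptions made for this example. A real setup would probe over the network on a timer and usually rely on an existing failover tool.

```python
from typing import Callable, Iterable, Optional


def run_failover_monitor(
    probe_results: Iterable[bool],
    promote_replica: Callable[[], None],
    failures_needed: int = 3,  # assumed threshold, tune per system
) -> Optional[int]:
    """Promote the replica after `failures_needed` consecutive failed probes.

    Returns the index of the probe that triggered promotion, or None if the
    primary never looked dead.
    """
    consecutive_failures = 0
    for i, healthy in enumerate(probe_results):
        if healthy:
            # A single good probe resets the counter, so one dropped
            # packet does not cause a needless failover.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failures_needed:
            promote_replica()
            return i
    return None


# Simulated probes: one blip at index 2, then the primary really dies.
promoted = []
tick = run_failover_monitor(
    [True, True, False, True, False, False, False],
    promote_replica=lambda: promoted.append("replica-1"),
)
print(tick, promoted)
```

Requiring several failures in a row trades a few seconds of extra detection time for fewer false failovers, which is part of why the timeline above shows a gap between the primary dying and the health check firing.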

With a warm standby, the replica is running and staying in sync, but serving no traffic. It's just sitting there replicating. When the primary dies, someone has to promote it (by hand or by running a script), update the load balancer config, run health checks, and verify it's ready. That process takes minutes.

Primary  ──writes──▶  Replica (syncing, idle — serving nothing)

Primary dies at 3:00:00
Engineer gets paged  at 3:02:00
Promotes replica,
updates config       at 3:08:00
RTO = ~8 minutes

                   Hot Standby                Warm Standby
Serving traffic?   Yes (reads)                No (just replicating)
Failover           Automated, seconds         Manual or scripted, minutes
RTO                Seconds                    Minutes
Cost               Higher (doing real work)   Slightly lower (idle server)
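
The RTO figures in the two timelines are just the gap between the moment the primary dies and the moment traffic is being served again. A small Python sketch of that arithmetic, using the timestamps from the timelines above (the helper name `recovery_seconds` is made up for this example, and it assumes both timestamps fall on the same day):

```python
from datetime import datetime


def recovery_seconds(failed_at: str, serving_again_at: str) -> float:
    """Seconds between the failure and the moment traffic is served again."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(failed_at, fmt)
    end = datetime.strptime(serving_again_at, fmt)
    return (end - start).total_seconds()


# Hot standby: primary dies at 3:00:00, traffic redirected at 3:00:10.
hot = recovery_seconds("3:00:00", "3:00:10")

# Warm standby: primary dies at 3:00:00, config updated at 3:08:00.
warm = recovery_seconds("3:00:00", "3:08:00")

print(hot, warm)  # 10.0 480.0
```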

Warm standby is a spare tyre in the boot — you still have to pull over and change it. Hot standby is a second engine that kicks in automatically.


The hidden cost of RPO = 0

RPO = 0 means zero data loss — every write must be confirmed on the replica before the user gets a response. That's synchronous replication.

Synchronous replication adds latency to every write, because the write has to travel to the replica and be acknowledged before the user gets a response. If the replica is in another datacenter, that can add 50–100ms to every single write operation, depending on the distance between the two sites.

Async replication (RPO = seconds):
User writes → Primary confirms → User gets response ✓
                ↓ (background)
            Replica replicates

Sync replication (RPO = 0):
User writes → Primary writes → Replica writes → Both confirm → User gets response ✓
              ←————————— 50–100ms round trip if cross-datacenter ————————————→
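
The two flows can be reduced to a toy latency model. The sketch below is only illustrative: the function name, the 2 ms local write time, and the 50 ms round trip are assumptions, and the replica's own write time is treated as part of the round trip.

```python
def client_write_latency_ms(
    local_write_ms: float,
    replica_rtt_ms: float,
    synchronous: bool,
) -> float:
    """Latency the user sees for one write under each replication mode."""
    if synchronous:
        # The primary waits for the replica's acknowledgement before replying.
        return local_write_ms + replica_rtt_ms
    # Async: the primary replies right away and replicates in the background.
    return local_write_ms


async_ms = client_write_latency_ms(2.0, 50.0, synchronous=False)
sync_ms = client_write_latency_ms(2.0, 50.0, synchronous=True)
print(async_ms, sync_ms)  # 2.0 52.0
```

In this model the asynchronous write is faster precisely because the reply does not wait for the replica, which is also why writes acknowledged just before a crash can be lost.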

This is why most systems use asynchronous replication — slightly higher RPO (seconds of potential data loss), but no latency penalty on writes. The business decides which trade-off is acceptable.

RPO = 0 is not free. Every write pays a latency tax of at least the round-trip time to your replica. At very high write rates (say 100k writes/sec), adding 50ms to every write means roughly 5,000 more writes in flight at any moment, each holding its connection open for longer. Reserve synchronous replication for data where loss is truly unacceptable, such as financial transactions and billing records.
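
One way to put a number on that cost is Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. A minimal sketch using the 100k writes/sec and 50ms figures from the paragraph above (the function name is made up for this example):

```python
def extra_in_flight_writes(writes_per_second: float, added_latency_ms: float) -> float:
    """Additional concurrent writes caused by the added latency (Little's law)."""
    return writes_per_second * added_latency_ms / 1000


print(extra_in_flight_writes(100_000, 50))  # 5000.0
```

Throughput stays the same, but the system has to hold about 5,000 more writes open at once, which shows up as more connections, more memory, and longer-held locks.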


How they drive architecture decisions

RTO       Strategy               What it requires
Hours     Snapshot restore       Backup file in remote storage (S3), no standby server needed
Minutes   Warm standby replica   A replica is running but not serving traffic, manual failover
Seconds   Hot standby replica    A replica is live, automated failover promotes it instantly
Zero      Active-Active          Multiple live primaries, no failover; writes go to all

RPO                  Strategy                What it requires
24 hours             Daily snapshots to S3   Simplest and cheapest; acceptable for non-critical data
1 hour               Hourly snapshots        Still snapshot-based, just more frequent
Seconds to minutes   Async replication       Live replica with small replication lag
Zero                 Sync replication        Write confirmed only after both primary and replica have written it
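
The two tables can be read as lookup rules: given a target, pick the cheapest strategy that still meets it. The Python sketch below encodes that reading. The exact cut-off values (one minute, one hour, 24 hours) are assumptions chosen to match the rows above, not hard industry thresholds, and the function names are invented for this example.

```python
from datetime import timedelta


def recovery_strategy(rto: timedelta) -> str:
    """Cheapest recovery strategy from the RTO table that meets the target."""
    if rto <= timedelta(0):
        return "active-active"
    if rto < timedelta(minutes=1):  # assumed boundary for "seconds"
        return "hot standby replica"
    if rto < timedelta(hours=1):  # assumed boundary for "minutes"
        return "warm standby replica"
    return "snapshot restore"


def backup_strategy(rpo: timedelta) -> str:
    """Cheapest backup strategy from the RPO table that meets the target."""
    if rpo <= timedelta(0):
        return "sync replication"
    if rpo < timedelta(hours=1):
        return "async replication"
    if rpo < timedelta(hours=24):
        return "hourly snapshots"
    return "daily snapshots"


print(recovery_strategy(timedelta(minutes=30)))  # warm standby replica
print(backup_strategy(timedelta(0)))             # sync replication
print(backup_strategy(timedelta(hours=24)))      # daily snapshots
```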

MTTR vs RTO: they sound similar but are different things

  • MTTR (mean time to recovery): how long recovery actually takes, averaged over past incidents. A measurement.
  • RTO (recovery time objective): the maximum downtime the business will accept. A target you design to.

You design your system so that MTTR stays below RTO. If your RTO is 30 minutes, your recovery process — detection, failover, verification — must consistently complete in under 30 minutes.

RTO = 30 min  (the ceiling the business set)
MTTR = 8 min  (what actually happens based on incident history)

→ You have headroom. You're meeting the SLA comfortably.

If MTTR creeps up to 35 min → you're breaching RTO → time to improve your recovery process.
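
Checking this against real incident data is a short calculation. The sketch below uses made-up recovery times and a hypothetical helper, `mttr_report`. It also reports the worst incident and the number of breaches, because an average can stay under the RTO while a single incident exceeds it.

```python
from statistics import mean


def mttr_report(recovery_minutes: list[float], rto_minutes: float) -> dict:
    """Summarise past recovery times against the RTO ceiling."""
    return {
        "mttr": mean(recovery_minutes),
        "worst": max(recovery_minutes),
        "breaches": sum(1 for m in recovery_minutes if m > rto_minutes),
    }


# Hypothetical incident history, in minutes of downtime per incident.
steady = mttr_report([6, 9, 7, 10, 8], rto_minutes=30)
spiky = mttr_report([6, 9, 7, 35, 8], rto_minutes=30)

print(steady)
print(spiky)
```

In the second history the MTTR is 13 minutes, well under the 30-minute RTO, yet one incident took 35 minutes and breached it. That is why the recovery process needs to finish under the RTO consistently, not just on average.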

In an interview — ask for both before designing

"What's the RTO and RPO for this system?"

The answers tell you exactly what backup strategy and replication model to use. A fintech system with RPO = 0 needs synchronous replication to another datacenter. An internal analytics dashboard with RPO = 24 hours just needs daily snapshots to S3. Same question, completely different architectures.