Backups
Why Replication Is Not Enough
Replication copies data across nodes and regions in real time. That's the problem.
Bug runs → deletes all 50M user records
Deletion replicates to all regions in milliseconds
All regions now have zero user records
Replication faithfully copied the corruption everywhere
You need a point-in-time snapshot taken before the corruption happened. That's a backup.
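
The scenario above can be sketched as a toy simulation. The three region names and the tiny user table are made up for illustration, and replication is modeled as simply applying each delete to every in-memory copy:

```python
import copy

# Three hypothetical regions, each holding a replica of the same user table.
regions = {
    "us-east": {1: "alice", 2: "bob", 3: "carol"},
    "eu-west": {1: "alice", 2: "bob", 3: "carol"},
    "ap-south": {1: "alice", 2: "bob", 3: "carol"},
}


def replicate_delete(replicas, user_id):
    """Apply a delete to every replica, as replication would."""
    for table in replicas.values():
        table.pop(user_id, None)


# A backup is an independent copy taken before the bug runs.
snapshot = copy.deepcopy(regions["us-east"])

# The buggy job deletes every user; each delete is replicated everywhere.
for user_id in list(regions["us-east"]):
    replicate_delete(regions, user_id)

records_after_bug = {name: len(table) for name, table in regions.items()}

# Recovery: load the snapshot back into every region.
for name in regions:
    regions[name] = copy.deepcopy(snapshot)

records_after_restore = {name: len(table) for name, table in regions.items()}

print(records_after_bug)      # every region is empty
print(records_after_restore)  # every region is back to 3 records
```

The snapshot survives only because it is not part of the replication stream. Anything that receives replicated writes receives the bad delete too.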
Full Backup
A complete copy of all data at a specific point in time.
Restore is simple: copy the backup back, and the database is in the state it had when the backup was taken.
Problem: Copying 10TB every day is expensive and slow. Doing it hourly is impractical.
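
To get a rough feel for "slow", here is a back-of-the-envelope calculation. The 200 MB/s sustained throughput is purely an assumption for illustration, not a measured figure. Plug in your own number:

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes

# Assumption: sustained end-to-end copy throughput of 200 MB/s.
ASSUMED_THROUGHPUT = 200 * MB


def copy_hours(size_bytes, throughput_bytes_per_s):
    """Hours needed to copy size_bytes at a constant throughput."""
    return size_bytes / throughput_bytes_per_s / 3600


hours = copy_hours(10 * TB, ASSUMED_THROUGHPUT)
print(f"{hours:.1f} hours per full backup")  # 13.9 hours at the assumed rate
```

Under that assumption a single full backup takes most of a day, so an hourly full backup could never finish before the next one is due.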
Incremental Backup
Only copy data that changed since the last backup.
Sunday 2am → Full backup → 10TB copied
Monday 2am → Incremental → only 50GB changed → copy 50GB
Tuesday 2am → Incremental → only 30GB changed → copy 30GB
Wednesday 2am → Incremental → only 80GB changed → copy 80GB
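
Using the numbers from the schedule above, a quick calculation compares the volume copied over these four days against taking a full backup every day (decimal units, 1 TB = 1000 GB):

```python
TB_IN_GB = 1000  # decimal units

full_size_gb = 10 * TB_IN_GB
# Changed data per incremental, in GB, taken from the schedule above.
changed_gb = {"Monday": 50, "Tuesday": 30, "Wednesday": 80}

# Strategy A: a full backup on all four days (Sunday through Wednesday).
daily_full_total_gb = full_size_gb * (1 + len(changed_gb))

# Strategy B: one full backup on Sunday, then incrementals.
incremental_total_gb = full_size_gb + sum(changed_gb.values())

print(daily_full_total_gb)   # 40000
print(incremental_total_gb)  # 10160
```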
Much cheaper. But restore is more complex:
Corruption on Wednesday 3pm:
Step 1: Restore Sunday full backup → database at Sunday 2am state
Step 2: Apply Monday incremental → database at Monday 2am state
Step 3: Apply Tuesday incremental → database at Tuesday 2am state
Step 4: Apply Wednesday incremental → database at Wednesday 2am state
Lost: Wednesday 2am → Wednesday 3pm (13 hours)
Think of it like Git:
Full backup = initial commit (entire codebase)
Incremental = each subsequent commit (only the diff)
Restore = checkout initial + apply all commits in order
Problem: 6 months of daily incrementals = 180 files to apply on restore. Very slow recovery.
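
A minimal sketch of the restore chain, assuming a simplified model in which the database is a dictionary and each incremental records the keys that changed and the keys that were deleted. The `changed`/`deleted` layout is hypothetical, not any real backup tool's format:

```python
def apply_incremental(state, incremental):
    """Apply one incremental: upsert changed keys, then drop deleted keys."""
    new_state = dict(state)
    new_state.update(incremental.get("changed", {}))
    for key in incremental.get("deleted", []):
        new_state.pop(key, None)
    return new_state


def restore(full_backup, incrementals):
    """Start from the full backup and apply every incremental in order."""
    state = dict(full_backup)
    for incremental in incrementals:  # order matters: oldest first
        state = apply_incremental(state, incremental)
    return state


full = {"user:1": "alice", "user:2": "bob"}
chain = [
    {"changed": {"user:3": "carol"}, "deleted": []},
    {"changed": {"user:1": "alice2"}, "deleted": ["user:2"]},
]

restored = restore(full, chain)
print(restored)  # {'user:1': 'alice2', 'user:3': 'carol'}
```

Restore work grows linearly with the length of the chain, and every file in the chain must be intact: one missing or damaged incremental makes everything after it unrecoverable.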
The Middle Ground Strategy
Take a fresh full backup periodically, incrementals in between:
Week 1 Sunday → Full backup (reset the chain)
Mon → Sat → Daily incrementals (max 6 files to apply)
Week 2 Sunday → Full backup (reset the chain)
Mon → Sat → Daily incrementals
Worst case restore: Full backup + 6 incrementals. Manageable.
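
As a sketch of how a restore tool might pick the files it needs under this schedule, the hypothetical helper below returns the most recent Sunday full backup plus the daily incrementals up to the target day. Backups are identified only by their date here, which is an assumption for illustration:

```python
from datetime import date, timedelta


def backups_needed(target: date):
    """Return (full_backup_day, incremental_days) needed to restore to
    the backup taken on `target`, assuming fulls run every Sunday and
    incrementals run on every other day."""
    # date.weekday(): Monday = 0 ... Sunday = 6
    days_since_sunday = (target.weekday() + 1) % 7
    full_day = target - timedelta(days=days_since_sunday)
    incrementals = [
        full_day + timedelta(days=offset)
        for offset in range(1, days_since_sunday + 1)
    ]
    return full_day, incrementals


# 2024-01-06 is a Saturday: the worst case for this schedule.
full_day, incrementals = backups_needed(date(2024, 1, 6))
print(full_day)           # 2023-12-31, a Sunday
print(len(incrementals))  # 6
```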
Backup Frequency and RPO
Backup frequency directly determines your RPO (Recovery Point Objective), the maximum amount of recent data you can afford to lose.
Full backup every week, incrementals every day:
Corruption Wednesday 3pm
Last backup: Wednesday 2am
Data lost: 13 hours (up to ~24 hours in the worst case) → RPO = 24 hours
Incrementals every hour:
Corruption Wednesday 3pm
Last backup: Wednesday 2pm
Data lost: 1 hour → RPO = 1 hour
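
The arithmetic behind these figures is just the gap between the last good backup and the failure. A small sketch, with an arbitrary calendar date chosen so that the day is a Wednesday:

```python
from datetime import datetime, timedelta


def data_lost(last_backup: datetime, failure: datetime) -> timedelta:
    """Everything written after the last backup is gone."""
    if failure < last_backup:
        raise ValueError("failure must not precede the last backup")
    return failure - last_backup


def worst_case_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure lands just before the next backup runs,
    so the loss approaches one full backup interval."""
    return backup_interval


# Hourly incrementals: last backup Wednesday 2pm, corruption Wednesday 3pm.
# 2024-01-10 is an arbitrary Wednesday picked for the example.
loss = data_lost(datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 0))
print(loss)                                 # 1:00:00
print(worst_case_loss(timedelta(hours=1)))  # 1:00:00
print(worst_case_loss(timedelta(days=1)))   # 1 day, 0:00:00
```

When sizing a schedule, plan against the worst case rather than the loss in any one incident.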
Continuous backup (WAL archiving):
Stream the WAL (write-ahead log) to backup storage in real time
RPO = seconds
More frequent backups = smaller RPO = less data loss = higher cost.
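
Continuous backup works by pairing a base snapshot with an archived log of every change, then replaying the log up to a chosen moment. The sketch below is a deliberately simplified model: the tuple-based log format and integer timestamps are assumptions for illustration and do not reflect any real database's WAL format.

```python
def recover(base_snapshot, wal, target_time):
    """Rebuild state by replaying log entries up to and including target_time.

    `wal` is a list of (timestamp, op, key, value) tuples in timestamp order.
    """
    state = dict(base_snapshot)
    for timestamp, op, key, value in wal:
        if timestamp > target_time:
            break  # stop just before the unwanted changes
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


base = {"user:1": "alice"}
wal = [
    (10, "set", "user:2", "bob"),
    (20, "set", "user:3", "carol"),
    (30, "delete", "user:1", None),  # the buggy job starts here
    (31, "delete", "user:2", None),
]

# Recover to the moment just before the bug ran.
recovered = recover(base, wal, target_time=29)
print(recovered)  # {'user:1': 'alice', 'user:2': 'bob', 'user:3': 'carol'}
```

Because the log is archived as it is written, the recoverable point can be any moment covered by the log, not only the instants at which a backup job ran.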
Backups vs Replication: Not Alternatives
These solve different problems.
| | Protects against | Does NOT protect against |
|---|---|---|
| Replication | Hardware failure, node/DC outage | Logical corruption (replicates instantly) |
| Backups | Logical corruption, accidental deletion | Downtime during a hardware failure; writes made since the last backup |
Use both:
Replication → keeps system available when hardware fails
Backups → lets you recover from bugs, accidental deletes, ransomware
Interview framing:
"I'd use replication for hardware failure and availability, and backups (full weekly, incremental daily) for logical corruption and accidental deletion. Backup frequency determines RPO, so for critical data I'd do hourly incrementals or continuous WAL archiving to get RPO down to an hour or less."