SPOF and Redundancy#
What's the single thing that can take your entire system down?
That's your SPOF. And redundancy is the only way to eliminate it.
SPOF — Single Point of Failure#
A SPOF is any component in your system that, if it fails, takes the entire system down with it.
Examples: - One server serving all traffic → server dies → system is down - One database with no replica → DB crashes → all data is unreachable - One load balancer → load balancer fails → no requests get through - One datacenter → datacenter loses power → everything is gone
Every SPOF is a ticking clock
It's not a question of if it will fail — it's when. Hardware fails. Networks go down. Power cuts happen. Design assuming failure is inevitable.
The Answer to Every SPOF — Redundancy#
The answer to every availability problem is the same single word: redundancy.
Remove every single point of failure by having a backup.
| SPOF | Redundancy solution |
|---|---|
| One server | Run two or more servers |
| One database | Replicate it — primary + replica |
| One datacenter | Deploy in multiple datacenters |
| One region | Deploy in multiple regions |
| One load balancer | Run two load balancers |
The pattern is always the same — if one fails, the other keeps serving.
Redundancy alone isn't enough — you need Failover#
Redundancy gives you a backup. Failover is the mechanism that automatically switches to that backup when the primary fails.
Manual failover is not good enough
If your primary server dies at 3am and an engineer has to manually switch to the backup — you're down for however long it takes that engineer to wake up, log in, and fix it. That could be 30 minutes.
At a 99.99% SLO you only have 4.32 minutes of downtime allowed per month.
Failover must be automatic.
How automatic failover works#
- A health check continuously pings each server — "are you alive?"
- If a server stops responding, it's marked as unhealthy
- Traffic is automatically rerouted to healthy servers
- An alert fires so engineers know to investigate
This happens in seconds — not minutes. No human needed.
The cascading failure problem#
Redundancy protects against one component failing. But what about this scenario:
Your primary DB fails → replica gets promoted → now you have only one DB again → that one fails too → system is down.
This is a cascading failure — one failure triggers another. Redundancy must be maintained continuously, not just set up once and forgotten.
Redundancy is not a one-time setup — it's an ongoing operational requirement
The moment your backup becomes your primary, you need a new backup.