N+1 Redundancy#

You know you need redundancy. But how many backups is enough?

N+1 gives you a formula instead of a gut feeling.

What it is#

N = the number of components you need to handle your load, measured at peak rather than on an average day. +1 = one extra, always.

If you need 3 servers to handle your traffic, you run 4. If one dies, you still have exactly 3, which is enough for the full load. No server gets pushed past its capacity. Users don't notice anything.

Why "just run two" isn't a rule#

"Run two servers" is a fixed number that ignores load. What if you need 10 servers to handle peak traffic? Running 2 is nowhere near enough: you're under capacity before anything fails, and if one dies you're at 1 and the system collapses.

N+1 scales with what you actually need:

Need 1 server? Run 2. (1+1)
Need 5 servers? Run 6. (5+1)
Need 10 servers? Run 11. (10+1)

The "+1" is always exactly one spare — not two, not three. Just enough to absorb one failure without degrading service.
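The rule is easy to turn into arithmetic once you know peak load and per-server capacity. The sketch below is one way to do it, not a sizing tool: the function names and the traffic figures are made up for illustration, and it assumes every server has the same capacity.

```python
import math


def required_n(peak_load: float, per_server_capacity: float) -> int:
    """Smallest number of servers that can carry peak_load (this is N)."""
    if peak_load <= 0 or per_server_capacity <= 0:
        raise ValueError("peak_load and per_server_capacity must be positive")
    return math.ceil(peak_load / per_server_capacity)


def servers_to_run(peak_load: float, per_server_capacity: float, spares: int = 1) -> int:
    """N plus the chosen number of spares (spares=1 gives N+1)."""
    return required_n(peak_load, per_server_capacity) + spares


# Hypothetical numbers: 5,000 requests/s at peak, 500 requests/s per server.
n = required_n(5000, 500)
print(f"N = {n}, run {servers_to_run(5000, 500)}")
```

With these numbers N comes out to 10, so you run 11, matching the last line of the list. Rounding up matters: a load that needs 2.5 servers needs 3, not 2.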

It applies everywhere, not just servers#

Component	N (what you need)	N+1 (what you run)
App servers	3	4
Database replicas	2	3
Power supplies in a server	1	2
Datacenters	1	2
Network links between DCs	1	2

Same logic at every layer. One failure → still fully operational.
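Because the check is identical at every layer, it can be written once and applied to the whole stack. This is a minimal sketch using the numbers from the table; the `LAYERS` dict and both function names are illustrative, not part of any real tool.

```python
# (needed, running) per layer, copied from the table. The dict itself is illustrative.
LAYERS = {
    "App servers": (3, 4),
    "Database replicas": (2, 3),
    "Power supplies in a server": (1, 2),
    "Datacenters": (1, 2),
    "Network links between DCs": (1, 2),
}


def survives_one_failure(needed: int, running: int) -> bool:
    """True if losing one component still leaves at least `needed`."""
    return running - 1 >= needed


def layers_without_spare(layers: dict[str, tuple[int, int]]) -> list[str]:
    """Names of layers where a single failure drops you below N."""
    return [
        name
        for name, (needed, running) in layers.items()
        if not survives_one_failure(needed, running)
    ]


# Every layer in the table has a spare, so this prints an empty list.
print(layers_without_spare(LAYERS))
```

Any layer that shows up in the result is a single point of failure, however well the other layers are provisioned.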

The moment N+1 breaks down#

N+1 protects you against one failure at a time.

The danger is the gap between failure and replacement:

You're running N+1 — 4 servers, need 3
One server dies → you're now at N — 3 servers, need 3 — still fine
Before you provision a replacement, a second server dies → you're at N-1 — 2 servers, need 3 → service degrades or drops

This is why N+1 requires fast replacement, not just fast detection. The "+1" is a buffer, not a permanent safety net.

The moment your +1 is used up and you're running at exactly N, you're exposed.

Provision a replacement immediately — don't wait until the next maintenance window.
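The three steps of that failure sequence can be traced in a few lines. This is a sketch with a hypothetical `buffer_state` helper; the labels it returns are wording chosen for this example.

```python
def buffer_state(running: int, needed: int) -> str:
    """Describe how much spare capacity is left."""
    spare = running - needed
    if spare > 0:
        return f"N+{spare}: buffer intact"
    if spare == 0:
        return "N: still serving, but exposed"
    # spare is negative here, so this renders as "N-1", "N-2", ...
    return f"N{spare}: under capacity"


needed = 3
running = 4
for event in ("start", "first server dies", "second server dies"):
    if event != "start":
        running -= 1
    print(f"{event}: {running} running, {buffer_state(running, needed)}")
```

The middle state is the one worth alerting on: nothing is broken yet, but the next failure will be visible to users.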

N+2 — when one spare isn't enough#

Setup for all three scenarios below: you need 3 servers, so you run N+1 = 4.

Planned maintenance#

You take a server offline to apply a security patch.

Before maintenance:  4 servers running  (N+1 — buffer intact)
During maintenance:  3 servers running  (exactly N — zero buffer)

If one more server dies during that window, you're at 2 but need 3. Service degrades.

With N+2 = 5 servers:

Before maintenance:  5 servers running
During maintenance:  4 servers running  (still N+1 — buffer intact)

Now a failure during maintenance still leaves you at exactly N. Still safe.
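The maintenance comparison comes down to one subtraction: servers running, minus servers deliberately offline, minus N. Here is a small sketch of that, with `failures_tolerated` as a made-up helper name.

```python
def failures_tolerated(running: int, needed: int, in_maintenance: int = 0) -> int:
    """How many unplanned failures can happen before capacity drops below N.

    A negative result means you are already under capacity.
    """
    return running - in_maintenance - needed


needed = 3
for label, running in (("N+1", 4), ("N+2", 5)):
    during = failures_tolerated(running, needed, in_maintenance=1)
    print(f"{label}: tolerates {during} failure(s) while one server is being patched")
```

With N+1 the answer during patching is 0, and with N+2 it is 1. Planned maintenance spends a spare just as a failure does.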

Slow provisioning#

A server dies. You're now at exactly N, with no buffer. You start spinning up a replacement, but it takes 20 minutes to boot, configure, and join the cluster.

For those 20 minutes you have zero buffer. A second failure in that window puts you under capacity.

With N+2 = 5 servers:

Server 1 dies  →  4 servers left  (still N+1 — buffer intact)
Provisioning takes 20 minutes...
Server 2 dies in that window  →  3 servers left  (exactly N — still safe)

The second spare covers the provisioning gap.
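How risky the gap is depends on how long it lasts and how often servers fail. The sketch below puts a rough number on it under strong assumptions that are not from this article: failures are independent, each server fails at a constant rate, and the 1,000-hour MTBF is a made-up figure. Treat it as a way to compare window lengths, not as a reliability model.

```python
import math


def p_another_failure(servers_left: int, mtbf_hours: float, window_minutes: float) -> float:
    """Probability that at least one of `servers_left` fails within the window.

    Assumes independent failures at a constant rate of 1/mtbf_hours per server.
    """
    fleet_rate_per_hour = servers_left / mtbf_hours
    return 1 - math.exp(-fleet_rate_per_hour * window_minutes / 60)


# Hypothetical: 3 servers left after the first failure, 1,000-hour MTBF each.
for minutes in (2, 20, 240):
    p = p_another_failure(3, 1000, minutes)
    print(f"{minutes:>3} min window: {p:.4%} chance of a second failure")
```

For short windows the risk grows roughly in proportion to the window length, which is why shaving provisioning time and adding a second spare are two answers to the same problem. Real failures are often correlated (same rack, same bad deploy), so the true risk can be higher than this model suggests.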

Mission-critical systems#

Think payment processors or hospital systems. The question isn't "can two servers die at once?" but "what's the cost if they do?"

Startup — two simultaneous failures → some users annoyed → N+1 is fine
Payment processor — two simultaneous failures during peak → transactions failing → millions in losses + regulatory consequences → N+2 is worth the cost

The mental model#

Configuration	What it survives
N+1	One failure
N+2	One failure during maintenance or a replacement window, or two failures at once
N+K	K overlapping failures or planned outages

The cost scales linearly — one more instance per +1. The decision is: what's the cost of a second failure in my system?
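Linear in absolute terms does not mean equally painful at every size: one spare on top of N = 1 doubles the bill, while one spare on top of N = 10 adds a tenth. A quick sketch, assuming every instance costs the same (the `spare_overhead` name is made up for this example):

```python
def spare_overhead(n: int, k: int = 1) -> float:
    """Extra capacity paid for k spares, as a fraction of N."""
    if n <= 0 or k < 0:
        raise ValueError("n must be positive and k non-negative")
    return k / n


for n in (1, 5, 10):
    print(
        f"N={n}: N+1 costs {spare_overhead(n):.0%} extra, "
        f"N+2 costs {spare_overhead(n, 2):.0%} extra"
    )
```

This is the number to weigh against the cost of a second failure: on a large fleet an extra spare is cheap insurance, on a tiny one it is a significant share of the budget.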

In an interview#

"I'd run N+1 for the app servers — one spare to absorb any single failure. For the database tier, given slow replica provisioning, I'd consider N+2 to cover the replacement window."