Network Partitions#
What a Partition Is#
A network partition = two or more nodes in a distributed system are alive and running, but cannot communicate with each other.
The classic scenario:
Node A — Mumbai data center
Node B — Singapore data center
Undersea cable cut by a ship's anchor
Node A: running perfectly, serving Mumbai users
Node B: running perfectly, serving Singapore users
But: Node A and Node B cannot talk to each other
Both nodes are healthy. The network between them is not.
Partition vs Crash — Critical Distinction#
These are completely different problems requiring different handling
Server crash:
Node goes down → stops responding
Health check fails → detected quickly
Fix: failover to another node
Network partition:
Both nodes alive and running
Neither can reach the other
Each thinks the other might be dead
Fix: much harder — you don't know if the other node is dead or just unreachable
Why partition is harder:
During a crash — you know the node is gone. During a partition — you don't know if: - The other node is dead - The network between you is down - You are the one isolated
You can't tell from inside the partition.
Why Partitions Are Inevitable#
Every distributed system will experience partitions. This is not a maybe — it is a certainty at scale.
Common causes:
Physical:
Undersea cable cut (happens multiple times per year globally)
Power outage in one data center
Router failure between data centers
Software:
Network misconfiguration
Firewall rule change
DNS failure
Operational:
Data center maintenance
Cloud provider outage (AWS us-east-1 taking down half the internet)
BGP routing issues
At large scale — Netflix, Google, Amazon — partitions happen multiple times per week. The system must be designed to handle them, not avoid them.
What a Partition Looks Like#
Normal operation:
Node A ←──────────────→ Node B
(constant replication, heartbeats, coordination)
During partition:
Node A ✗──────────────✗ Node B
(all communication lost)
Node A: "Is Node B dead? Or is the network down?"
Node B: "Is Node A dead? Or is the network down?"
Neither knows.
After partition heals:
Node A ←──────────────→ Node B
(communication restored, must reconcile diverged state)
The Fundamental Problem#
During a partition, each isolated node must make a decision for every incoming request:
User request arrives at Node B (Singapore)
Node B cannot reach Node A (Mumbai)
Node B doesn't know if its data is fresh
Option 1: Serve the request → might return stale data
Option 2: Refuse the request → user gets an error
This decision — serve or refuse — is the heart of the CAP theorem. Covered in the next file.