Why Distributed Systems Are Hard#
The core idea
A single machine is easy to reason about. The moment you add a second machine, one fundamental problem appears that never existed before — you can never be 100% sure a message was received.
The single machine world#
One server. You write to memory, you read from memory. It's consistent. You crash, everything stops together. Simple.
There's no ambiguity. Either the operation happened or it didn't. Either the machine is up or it's down.
The moment you add a second machine#
The moment you scale horizontally — two app servers, a primary and replica DB, any two machines that talk to each other — one uncomfortable question appears:
Did the other side receive my message?
You send a message over the network. The network is unreliable. The message might have been lost. The other side might have received it but the reply got lost. The other side might be slow. You have no way to know.
This is the Two Generals Problem — the formal name for this impossibility.
Two Generals Problem#
Two armies are camped on opposite sides of a city. They need to attack at the same time — if only one attacks, they lose. General A sends a messenger to General B: "Attack at dawn."
The messenger has to cross enemy territory. He might get captured. General B gets the message and sends a confirmation back — but that confirmation might get captured too.
Even if General B sends 100 confirmations, General A can never be 100% sure one got through. And General B knows that — so he's not sure if A knows he confirmed.
Neither side can ever be fully certain. No matter how many messages they exchange.
This isn't a war story. This is your payment service calling a bank API.
What this looks like in production#
Your payment service calls a bank API: "Charge this user ₹500." The network drops. You get no response.
What do you do? You retry. But here's the problem — what if the first request did go through, and only the response got lost? Now the user gets charged twice.
Client → "Charge ₹500" → Bank
Bank charges ₹500
Bank → "OK" → [network drops]
Client sees no response
Client → "Charge ₹500" → Bank (retry)
Bank charges ₹500 again
User charged twice
This is the Two Generals Problem in production. You retried because you had to — retries are the only tool you have against an unreliable network. But retries create duplicates.
The three things every distributed protocol needs#
Because you can never have certainty, every distributed system is designed around managing uncertainty. Three mechanisms make this work:
Retries — because messages can be lost, you must retry. Not retrying means accepting data loss.
Acknowledgments — the receiver explicitly confirms receipt. This shrinks the uncertainty window, but doesn't eliminate it (the ack itself can be lost).
Idempotency — the operation can be safely executed multiple times and produce the same result. This is what makes retries safe. The bank checks: "have I already processed this request?" If yes, it returns the previous result instead of charging again.
This trio is the foundation of every distributed protocol
Consistent hashing, quorums, leader election, Raft consensus — they all exist because of this one uncomfortable truth. Understanding why distributed systems are hard is what makes every other distributed concept make sense.
What distributed systems do instead of solving the problem#
The Two Generals Problem is mathematically unsolvable — you cannot guarantee delivery over an unreliable network. So distributed systems don't try to solve it. They accept the uncertainty and design around it.
- Accept that messages can be lost → build retries in
- Accept that retries cause duplicates → build idempotency in
- Accept that nodes can fail → build redundancy in
- Accept that you can't have a global clock → build logical clocks (covered in 5.8)
- Accept that you can't have perfect consistency → make explicit trade-offs (CAP theorem)
This is the mindset shift. You're not writing code that prevents failures. You're writing code that expects failures and stays correct anyway.