Failures

2PC looks clean in the happy path. The problems surface when things go wrong — and there are more failure modes than just the coordinator crashing. Every node in the system can fail at different points, and each produces a different kind of mess.

Failure 1 — Participant crashes during Phase 1 (before voting)#

A participant crashes before it can send its YES/NO vote back to the coordinator.

sequenceDiagram
    participant C as Coordinator
    participant P as Payment Service
    participant I as Inventory Service
    participant O as Order Service

    C->>P: PREPARE
    C->>I: PREPARE
    C->>O: PREPARE
    P-->>C: YES
    I-->>C: YES
    Note over O: 💀 Crashes before replying

What happens: The coordinator waits for O's vote. After a timeout it treats the missing vote as NO and sends ABORT to everyone.

Outcome: Clean rollback. Payment and Inventory never committed anything — they just locked resources and released them on ABORT. No inconsistency.

This is the safe failure case — crashing before voting means you never promised anything.

Failure 2 — Coordinator crashes after Phase 1, before Phase 2#

All participants voted YES and are now holding their locks, waiting for the coordinator's COMMIT or ABORT.

sequenceDiagram
    participant C as Coordinator
    participant P as Payment Service
    participant I as Inventory Service
    participant O as Order Service

    C->>P: PREPARE
    C->>I: PREPARE
    C->>O: PREPARE
    P-->>C: YES
    I-->>C: YES
    O-->>C: YES
    Note over C: 💀 Coordinator crashes here
    Note over P: Locked. Waiting...
    Note over I: Locked. Waiting...
    Note over O: Locked. Waiting...

What happens: Participants are stuck. They cannot proceed because they don't know what the coordinator decided. They cannot rollback on their own either — the coordinator might have sent COMMIT to one participant before crashing.

Outcome: All participants hold their locks indefinitely. The system is blocked until the coordinator recovers. This is called an in-doubt transaction.

The blocking problem

Participants voted YES — they promised they're ready. They cannot unilaterally rollback because another participant may have already committed. They must wait. Locks are held. Other transactions queue behind them.

Failure 3 — Coordinator crashes mid-Phase 2 (partial commit)#

The coordinator sends COMMIT to some participants, then crashes before reaching the rest.

sequenceDiagram
    participant C as Coordinator
    participant P as Payment Service
    participant I as Inventory Service
    participant O as Order Service

    C->>P: COMMIT
    P-->>C: ACK ✓
    Note over C: 💀 Crashes here
    Note over I: Still waiting...
    Note over O: Still waiting...

What happens: Payment Service committed — money is deducted. Inventory and Order are still in-doubt, holding locks.

Outcome: The system is in a permanently inconsistent state until the coordinator recovers. If Inventory and Order rollback on their own — the user got charged but no order exists. They cannot safely rollback. They must wait.

This is the most dangerous failure case in 2PC.

Failure 4 — Participant crashes during Phase 2 (after receiving COMMIT)#

The coordinator sends COMMIT to all participants. One participant receives the COMMIT, starts applying it, then crashes mid-commit.

sequenceDiagram
    participant C as Coordinator
    participant P as Payment Service
    participant I as Inventory Service
    participant O as Order Service

    C->>P: COMMIT
    C->>I: COMMIT
    C->>O: COMMIT
    P-->>C: ACK ✓
    I-->>C: ACK ✓
    Note over O: Receives COMMIT, starts applying...
    Note over O: 💀 Crashes mid-commit

What happens: Order Service crashes after receiving COMMIT but before finishing. When it recovers, it checks its WAL. Because it wrote the COMMIT decision to WAL before applying it (this is why WAL is written first), it knows it should commit — it replays the WAL and finishes the commit.

Outcome: Self-healing on recovery. The WAL ensures the participant always knows what it was supposed to do. No inconsistency.

WAL saves you here

In Phase 1, every participant writes its YES vote to its WAL before sending it. In Phase 2, every participant writes the COMMIT/ABORT decision to its WAL before applying it. On crash recovery, the participant reads its WAL and knows exactly what to do — no ambiguity.

Failure 5 — Participant crashes before receiving Phase 2 message#

The coordinator sends COMMIT to all participants. Before the message reaches Order Service, Order Service crashes.

sequenceDiagram
    participant C as Coordinator
    participant P as Payment Service
    participant I as Inventory Service
    participant O as Order Service

    C->>P: COMMIT
    C->>I: COMMIT
    C->>O: COMMIT
    P-->>C: ACK ✓
    I-->>C: ACK ✓
    Note over O: 💀 Crashes before receiving COMMIT

What happens: Order Service recovers with no COMMIT in its WAL. It doesn't know the outcome. It contacts the coordinator asking "what was the decision for transaction XYZ?"

The coordinator checks its own WAL — it wrote COMMIT before sending out the messages — and tells Order Service to commit.

Outcome: Order Service commits on recovery. Consistent. This requires the coordinator to be up and reachable when Order Service recovers.

Failure 6 — Network partition (message lost in transit)#

The coordinator sends COMMIT but the network drops the message before it reaches a participant. The participant never receives it.

What happens: Same as Failure 5 from the participant's perspective — it never got the COMMIT. On timeout it contacts the coordinator asking for the decision.

Outcome: Coordinator retries the COMMIT. Eventually consistent — but during the partition window, the participant is blocked holding its lock.

All failure cases summarised#

Failure	When	Outcome
Participant crashes before voting (Phase 1)	Pre-vote	Safe — coordinator times out, sends ABORT, clean rollback
Coordinator crashes after all votes, before Phase 2	Post-vote	Blocking — participants hold locks indefinitely, in-doubt
Coordinator crashes mid-Phase 2 (partial COMMIT)	Mid-commit	Dangerous — partial commit, permanent inconsistency until coordinator recovers
Participant crashes mid-commit (after receiving COMMIT)	Mid-apply	Self-healing — WAL replay on recovery, no inconsistency
Participant crashes before receiving Phase 2 message	Pre-receive	Recoverable — asks coordinator for decision on recovery
Network drops Phase 2 message	In-transit	Recoverable — coordinator retries, participant blocked during partition

The fundamental problem

2PC cannot handle a coordinator crash mid-Phase 2 without blocking. Participants are stuck in-doubt until the coordinator comes back. There is no safe way to resolve this without external intervention. This is an inherent limitation of the protocol — not a bug that can be fixed.

The solutions to these failures are covered in the next file.