Orchestration
Saga Orchestration
The orchestrator knows the entire flow and tells each service what to do next, step by step. Services don't talk to each other directly; they only talk to the orchestrator. The full story of every saga lives in one place.
The happy path — Swiggy order#
```mermaid
sequenceDiagram
    participant Orch as Orchestrator
    participant P as Payment Service
    participant I as Inventory Service
    participant O as Order Service
    Orch->>P: charge payment
    P-->>Orch: payment_success ✓
    Orch->>I: deduct inventory
    I-->>Orch: inventory_updated ✓
    Orch->>O: create order
    O-->>Orch: order_created ✓
    Note over Orch: saga complete ✓
```

One brain. The orchestrator drives every step. The full flow of the saga lives in the orchestrator's state, not scattered across Kafka topics and service logs.
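The happy path above can be sketched as a simple loop. This is a minimal illustration, not a production orchestrator — all names (`STEPS`, `run_saga`, `call_service`) are hypothetical:

```python
# Hypothetical sketch: the orchestrator drives a fixed list of steps,
# calling each service in turn and recording every result in its state.
STEPS = ["charge_payment", "deduct_inventory", "create_order"]

def run_saga(order_id, call_service, state):
    """Drive every step in order; stop at the first failure."""
    for step in STEPS:
        state[step] = "in_progress"
        ok = call_service(step, order_id)        # synchronous command + ACK
        state[step] = "completed" if ok else "failed"
        if not ok:
            return False                         # caller would now compensate
    return True
```

Because the loop lives in one process, `state` is the single record of how far the saga got — exactly what the failure paths below rely on.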
The failure path — Order Service fails#
```mermaid
sequenceDiagram
    participant Orch as Orchestrator
    participant P as Payment Service
    participant I as Inventory Service
    participant O as Order Service
    Orch->>P: charge payment
    P-->>Orch: payment_success ✓
    Orch->>I: deduct inventory
    I-->>Orch: inventory_updated ✓
    Orch->>O: create order
    O-->>Orch: order_failed ✗
    Orch->>I: reverse inventory
    I-->>Orch: inventory_reversed ✓
    Orch->>P: refund payment
    P-->>Orch: payment_refunded ✓
    Note over Orch: saga rolled back ✓
```

When Order Service fails, the orchestrator knows exactly which steps succeeded and triggers compensating transactions in reverse, without anyone else needing to know about the failure.
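The reverse-order rollback can be sketched as follows — a hypothetical helper, assuming each forward step maps to one compensating command:

```python
# Hypothetical sketch: undo completed steps in reverse order of execution.
FORWARD_ORDER = ["charge_payment", "deduct_inventory", "create_order"]
COMPENSATIONS = {
    "charge_payment": "refund_payment",
    "deduct_inventory": "reverse_inventory",
}

def compensate(state, call_service, order_id):
    """Issue compensating commands for every completed step, newest first."""
    done = [s for s in FORWARD_ORDER if state.get(s) == "completed"]
    issued = []
    for step in reversed(done):
        if step in COMPENSATIONS:            # the failed step itself needs no undo
            call_service(COMPENSATIONS[step], order_id)
            issued.append(COMPENSATIONS[step])
            state[step] = "compensated"
    return issued
```

Only steps recorded as completed are touched — which is why the orchestrator's state must be accurate before compensation starts.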
Debugging is trivial#
Six months later, a bug is reported for order 123. You open the orchestrator's state for that order:
```
order_id: 123
step 1 → charge_payment    → SUCCESS (10:01:32)
step 2 → deduct_inventory  → SUCCESS (10:01:33)
step 3 → create_order      → FAILED  (10:01:34)  ← here
step 4 → reverse_inventory → SUCCESS (10:01:35)
step 5 → refund_payment    → FAILED  (10:01:36)  ← and here
```
The full picture in one place. You instantly see that the refund failed — you know exactly where to look.
This is orchestration's biggest operational advantage — centralised observability.
Failure cases and solutions#
Failure 1 — Orchestrator crashes before recording ACK (double execution)#
The orchestrator sends "refund payment" to Payment Service. Payment Service executes the refund and replies. The orchestrator crashes before recording the ACK in its DB.
```mermaid
sequenceDiagram
    participant Orch as Orchestrator
    participant P as Payment Service
    Orch->>P: refund payment
    P->>P: refund executes ✓
    P-->>Orch: payment_refunded
    Note over Orch: 💀 crashes before writing ACK to DB
    Note over Orch: restarts
    Orch->>Orch: reads DB → refund_payment = in_progress
    Orch->>P: refund payment (retry)
    P->>P: refund executes again 😬
```

What happens: The orchestrator retries because its DB still shows the step as in-progress. The refund runs twice, and the user is refunded double.
Solution — orchestrator persists state before sending, services are idempotent:
Orchestrator side:
1. Write to DB: step=refund_payment, status=in_progress
2. Send "refund payment" to Payment Service
3. Receive ACK
4. Write to DB: step=refund_payment, status=completed
On restart, the orchestrator reads its DB, sees the step is in-progress, and retries. But Payment Service handles the retry safely:
```python
# Service side idempotency
if payment.status != "refunded":
    process_refund()
    payment.status = "refunded"
    db.save(payment)
# retry → already refunded → skip
```
Both sides must be fault-tolerant independently. The orchestrator retries. The service ignores duplicates.
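The orchestrator's half of the contract — persist intent before sending, completion after the ACK — can be sketched like this. Names (`execute_step`, `recover`, the dict-as-DB) are hypothetical illustrations, not a real framework API:

```python
# Hypothetical sketch: persist-before-send on the orchestrator side.
# `db` stands in for the orchestrator's durable store.
def execute_step(db, send, order_id, step):
    """Write intent to the DB before sending; write completion after the ACK."""
    db[(order_id, step)] = "in_progress"     # 1. persist intent first
    ack = send(step, order_id)               # 2–3. send command, await ACK
    if ack:
        db[(order_id, step)] = "completed"   # 4. record completion
    return ack

def recover(db, send, order_id):
    """On restart, re-issue any step still marked in_progress."""
    for (oid, step), status in list(db.items()):
        if oid == order_id and status == "in_progress":
            execute_step(db, send, oid, step)
```

If the process dies between steps 3 and 4, `recover` re-sends the command — and the service-side idempotency check above turns that duplicate into a no-op.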
Failure 2 — Service crashes without responding#
The orchestrator sends "deduct inventory" to Inventory Service. Inventory Service crashes mid-execution and never responds.
```mermaid
sequenceDiagram
    participant Orch as Orchestrator
    participant I as Inventory Service
    Orch->>I: deduct inventory
    Note over I: 💀 crashes mid-execution
    Note over Orch: waiting...
    Note over Orch: timeout fires
    Orch->>I: deduct inventory (retry)
    I->>I: deduct stock ✓
    I-->>Orch: inventory_updated ✓
```

What happens: The orchestrator waits for a response, hits its timeout, and retries. Inventory Service recovers and processes the command on the retry.
Solution — orchestrator timeout + retry with exponential backoff + service idempotency:
Service must be idempotent — if the first execution partially ran before the crash, the retry must not double-deduct stock.
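A minimal sketch of the timeout-plus-backoff retry, with hypothetical names and delay values chosen only for illustration:

```python
import time

# Hypothetical sketch: retry on timeout with exponential backoff.
# Only safe because the service is idempotent — a retry after a
# partial execution must not double-deduct stock.
def call_with_retry(send, step, max_retries=3, base_delay=0.01):
    """Retry `send` on timeout, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return send(step)
        except TimeoutError:
            time.sleep(base_delay * (2 ** attempt))   # 0.01s, 0.02s, 0.04s, ...
    raise TimeoutError(f"{step}: max retries reached")
```

The backoff gives the crashed service room to restart instead of hammering it the instant the timeout fires.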
Failure 3 — Service responds with failure#
Inventory Service receives "deduct inventory" but its DB is down. It responds with an explicit failure.
```mermaid
sequenceDiagram
    participant Orch as Orchestrator
    participant P as Payment Service
    participant I as Inventory Service
    Orch->>P: charge payment
    P-->>Orch: payment_success ✓
    Orch->>I: deduct inventory
    I-->>Orch: inventory_failed ✗
    Orch->>P: refund payment
    P-->>Orch: payment_refunded ✓
    Note over Orch: saga rolled back ✓
```

What happens: The orchestrator receives the failure, knows exactly which steps succeeded, and triggers compensating transactions in reverse. Only Payment Service needs to be compensated — Order Service never started.
This is orchestration's advantage over choreography — the orchestrator knows the exact saga state at all times. No guessing, no tracing across Kafka topics.
Failure 4 — Compensation itself fails#
The orchestrator tries to refund the payment as a compensating transaction — but Payment Service's DB is down and the refund fails.
```mermaid
sequenceDiagram
    participant Orch as Orchestrator
    participant P as Payment Service
    Orch->>P: refund payment
    P-->>Orch: refund_failed ✗
    Orch->>P: refund payment (retry 1)
    P-->>Orch: refund_failed ✗
    Orch->>P: refund payment (retry 2)
    P-->>Orch: refund_failed ✗
    Note over Orch: max retries reached → DLQ + alert
```

What happens: The orchestrator retries with exponential backoff. After N failures, it marks the saga as stuck and sends the failed step to a Dead Letter Queue (DLQ). An alert fires.
Solution — retry + DLQ + manual intervention:
Orchestrator DB:

```
order_id: 123
step: refund_payment
status: STUCK
last_error: "Payment DB timeout"
retry_count: 5
```
A human or automated recovery process investigates — retries the refund once Payment DB recovers, or escalates to finance for manual resolution. This is why payment systems combine Saga with end-of-day reconciliation — to catch anything the automated retries couldn't resolve.
Compensation failure requires human intervention
There is no automatic resolution when the compensating transaction itself keeps failing. The orchestrator surfaces it clearly — stuck step, error log, DLQ alert. In choreography the same failure is far harder to detect because no single service owns the full saga state.
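The retry-then-park pattern can be sketched as below — hypothetical names throughout, with a plain list standing in for a real DLQ:

```python
# Hypothetical sketch: retry a compensating transaction, then park it
# on a dead letter queue (here just a list) once retries are exhausted.
def compensate_with_dlq(send, step, dlq, max_retries=3):
    """Return 'completed' on success, 'STUCK' after max_retries failures."""
    last_error = None
    for _ in range(max_retries):
        ok, err = send(step)          # (success flag, error message)
        if ok:
            return "completed"
        last_error = err
    dlq.append({"step": step, "status": "STUCK",
                "last_error": last_error, "retry_count": max_retries})
    return "STUCK"                    # alerting would fire on this status
```

The DLQ entry carries everything a human needs — which step, the last error, how many times it was tried — mirroring the stuck-saga record shown above.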
Failure 5 — Orchestrator crashes mid-compensation#
The orchestrator has already started compensating — reversed inventory — then crashes before refunding the payment.
```mermaid
sequenceDiagram
    participant Orch as Orchestrator
    participant P as Payment Service
    participant I as Inventory Service
    Orch->>I: reverse inventory
    I-->>Orch: inventory_reversed ✓
    Orch->>Orch: writes DB: inventory_reversed = completed
    Note over Orch: 💀 crashes before sending refund
    Note over Orch: restarts
    Orch->>Orch: reads DB → next step = refund_payment
    Orch->>P: refund payment
    P-->>Orch: payment_refunded ✓
```

What happens: The orchestrator persisted every completed step to its DB before crashing. On restart it reads its DB, sees inventory was reversed but payment was not refunded yet, and resumes from exactly that point.
Solution — durable orchestrator state in DB:
The orchestrator's DB is the source of truth for the saga. Every step — forward and compensating — is written to the DB before being executed. On recovery, the orchestrator replays from the last known state.
This is why a production orchestrator (like Temporal or AWS Step Functions) persists every event to a durable store — it can always recover to exactly where it left off, no matter how many times it crashes.
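The resume-from-durable-state behaviour can be sketched as below — a hypothetical illustration of the idea, not how Temporal or Step Functions is actually implemented:

```python
# Hypothetical sketch: on restart, skip compensation steps the DB already
# records as completed and resume from the first unfinished one.
COMP_ORDER = ["reverse_inventory", "refund_payment"]

def resume_compensation(db, send, order_id):
    """Replay the compensation chain from the last known state."""
    for step in COMP_ORDER:
        if db.get((order_id, step)) == "completed":
            continue                          # finished before the crash
        if send(step, order_id):
            db[(order_id, step)] = "completed"
```

Because every completed step was written to the DB before the crash, the replay is deterministic no matter how many times the orchestrator dies and restarts.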
Failure 6 — Saga timeout — service never responds and retries exhausted#
Inventory Service never responds to "deduct inventory" — not a failure message, just silence. The orchestrator retries but hits its max retry limit.
What happens: The saga is stuck in-progress. The user is charged but the order is neither placed nor cancelled.
Solution — saga-level timeout with forced compensation:
```
Order placed at 10:00 AM
Saga must complete by 10:05 AM

10:05 AM → saga timeout fires
Orchestrator checks current state:
  payment   → SUCCESS
  inventory → IN_PROGRESS (stuck)
Orchestrator forces compensation:
  → cancel inventory step
  → refund payment
```
The orchestrator owns the clock for the entire saga. If it hasn't completed within the deadline, it compensates everything that succeeded so far and marks the order as failed. The user gets a refund. No silent money loss.
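The deadline check itself is simple; a hypothetical sketch (times in seconds, names invented for illustration):

```python
# Hypothetical sketch: a saga-level deadline. Past the deadline, return
# the completed steps so the orchestrator can force-compensate them.
def steps_to_compensate(started_at, now, deadline_seconds, state):
    """None while within budget; otherwise the list of steps to undo."""
    if now - started_at < deadline_seconds:
        return None                           # saga still within its budget
    return [s for s, status in state.items() if status == "completed"]
```

Anything still in progress at the deadline is simply cancelled; only the steps that definitely succeeded need a compensating transaction.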
All failure cases summarised#
| Failure | What happens | Solution |
|---|---|---|
| Orchestrator crashes before recording ACK | Retries on recovery, double execution risk | Persist state before sending + service idempotency |
| Service crashes without responding | Orchestrator hits timeout, retries | Timeout + exponential backoff retry + idempotency |
| Service responds with failure | Orchestrator knows exact state, triggers compensation | Orchestrator drives reverse compensation in order |
| Compensation itself fails | Saga stuck in inconsistent state | Retry + DLQ + alert + manual intervention |
| Orchestrator crashes mid-compensation | Compensation incomplete on restart | Durable orchestrator DB — resumes from last known state |
| Saga timeout — retries exhausted | Order stuck, user charged but no action | Saga-level timeout + forced compensation |
Orchestration — trade-offs#
| Strength | Weakness |
|---|---|
| Full saga state in one place — easy to debug | Orchestrator is a single point of failure (mitigated by making it fault-tolerant) |
| Clear flow — easy to reason about | Orchestrator becomes a bottleneck at very high throughput |
| Easier to implement idempotency — one source of retries | More upfront complexity — need to build and maintain the orchestrator |
| Centralised monitoring and alerting | Services are coupled to the orchestrator |