Choreography
There is no central brain. Each service listens to Kafka, reacts to events by doing its local work, and publishes its own events for the next service to pick up. Services coordinate by reacting to each other — like dancers following the music, not a conductor.
The happy path: Swiggy order
A user places an order. Three services need to act in sequence.
sequenceDiagram
participant U as User
participant P as Payment Service
participant K as Kafka
participant I as Inventory Service
participant O as Order Service
U->>P: place order
P->>P: charge card ✓
P->>K: publish "payment_success"
K->>I: deliver "payment_success"
I->>I: deduct stock ✓
I->>K: publish "inventory_updated"
K->>O: deliver "inventory_updated"
O->>O: create order ✓
O->>K: publish "order_created"
No coordinator. Each service:
1. Does its local work
2. Publishes a success event
3. Leaves the next step to whichever service consumes that event
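The happy path can be sketched with a toy in-memory bus standing in for Kafka. This is only an illustration: `InMemoryBus` and the three handler functions are hypothetical names, and a real deployment would use Kafka topics and separate processes.

```python
from collections import defaultdict
from typing import Callable


class InMemoryBus:
    """Toy stand-in for Kafka: event type -> list of subscriber callbacks."""

    def __init__(self) -> None:
        self.subscribers = defaultdict(list)
        self.log = []  # every event type published, in order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in self.subscribers[event_type]:
            handler(payload)


bus = InMemoryBus()


def charge_card(order: dict) -> None:
    # Payment Service: do the local work, then publish a success event.
    bus.publish("payment_success", order)


def deduct_stock(order: dict) -> None:
    # Inventory Service: reacts to "payment_success".
    bus.publish("inventory_updated", order)


def create_order(order: dict) -> None:
    # Order Service: reacts to "inventory_updated".
    bus.publish("order_created", order)


# Each service only knows which event it listens to, not who produced it.
bus.subscribe("payment_success", deduct_stock)
bus.subscribe("inventory_updated", create_order)

charge_card({"order_id": 123})
print(bus.log)  # ['payment_success', 'inventory_updated', 'order_created']
```

Nothing in the sketch calls another service directly; the sequence emerges from the subscriptions alone.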
The failure path: Order Service fails
Order Service fails. Now the saga needs to unwind.
sequenceDiagram
participant P as Payment Service
participant K as Kafka
participant I as Inventory Service
participant O as Order Service
K->>O: deliver "inventory_updated"
O->>O: create order ✗ DB down
O->>K: publish "order_failed"
K->>I: deliver "order_failed"
I->>I: add stock back ✓
I->>K: publish "inventory_reversed"
K->>P: deliver "inventory_reversed"
P->>P: refund user ✓
P->>K: publish "payment_refunded"
Each service listens to both success and failure events. On failure, it runs its compensating transaction and publishes a reversal event for the previous service to pick up. The saga unwinds itself automatically.
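The unwinding chain can be sketched the same way. The handler names, the `stock` dictionary, and the `refunds` list are assumptions made for illustration; the event names are the ones used in the diagram.

```python
from collections import defaultdict

handlers = defaultdict(list)   # event type -> subscribed callbacks
published = []                 # event types in publish order
stock = {"burger": 9}          # one unit was already deducted for order 123
refunds = []                   # order ids that were refunded


def publish(event_type, payload):
    published.append(event_type)
    for handler in handlers[event_type]:
        handler(payload)


def on_inventory_updated(order):
    # Order Service: the local work fails (simulated DB outage),
    # so it publishes a failure event instead of "order_created".
    publish("order_failed", order)


def on_order_failed(order):
    # Inventory Service: compensating transaction, then pass the reversal upstream.
    stock[order["item"]] += 1
    publish("inventory_reversed", order)


def on_inventory_reversed(order):
    # Payment Service: compensating transaction.
    refunds.append(order["order_id"])
    publish("payment_refunded", order)


handlers["inventory_updated"].append(on_inventory_updated)
handlers["order_failed"].append(on_order_failed)
handlers["inventory_reversed"].append(on_inventory_reversed)

publish("inventory_updated", {"order_id": 123, "item": "burger"})
print(published)
```

Each compensation is triggered by the failure or reversal event of the step after it, so the saga unwinds in reverse order.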
Failure cases and solutions
Failure 1: Service crashes before publishing its event
Payment Service charges the card successfully but crashes before it can publish "payment_success" to Kafka.
sequenceDiagram
participant P as Payment Service
participant K as Kafka
P->>P: charge card ✓
Note over P: 💀 crashes before publishing
Note over K: "payment_success" never arrives
Note over K: Inventory Service never starts
What happens: The saga never progresses. Inventory Service is waiting for an event that never comes. The user was charged but the order was never placed.
Solution — Outbox pattern:
Never publish directly to Kafka from the service. Instead, write the event to an outbox table in the same local DB transaction as the business operation:
Transaction:
1. charge card (payments table)
2. write "payment_success" to outbox table
→ single ACID commit — both succeed or both fail
Separate outbox poller:
→ reads outbox table
→ publishes to Kafka
→ marks row as published
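If the service crashes after the DB commit, the outbox row is already written. The poller picks it up on recovery and publishes to Kafka, so the event is not lost. The poller can itself crash after publishing but before marking the row, in which case the event is published twice. The outbox therefore gives at-least-once publishing, and consumers still need to be idempotent.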
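A minimal sketch of the outbox pattern, using SQLite as the local database. The table layout, `charge_card`, and `poll_outbox` are assumptions for illustration, and the `publish` callback stands in for a real Kafka producer.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (order_id INTEGER PRIMARY KEY, status TEXT)")
db.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)"
)


def charge_card(order_id: int) -> None:
    # One local transaction: the business row and the outbox row commit together.
    with db:  # commits on success, rolls back if an exception is raised
        db.execute("INSERT INTO payments VALUES (?, 'charged')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("payment_success", json.dumps({"order_id": order_id})),
        )


def poll_outbox(publish) -> int:
    """Publish every unpublished row, then mark it. Returns the number of rows sent."""
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        # A crash between publish and this update causes a re-publish on the next poll.
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
charge_card(123)
poll_outbox(lambda event_type, payload: sent.append((event_type, payload)))
print(sent)
```

In production the poller is often replaced by change data capture on the outbox table, but the transactional write is the part that closes the gap described above.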
Failure 2: Service crashes after receiving event but before processing
Inventory Service receives "payment_success" from Kafka but crashes before deducting stock.
sequenceDiagram
participant K as Kafka
participant I as Inventory Service
K->>I: deliver "payment_success"
Note over I: 💀 crashes before processing
Note over K: offset not committed
K->>I: redelivers "payment_success"
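I->>I: deduct stock ✓
What happens: Inventory Service never committed the offset for this message, so when it restarts (or another instance in its consumer group takes over the partition) it resumes from the last committed offset and reads "payment_success" again. It processes the event on this second delivery.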
Solution — Kafka at-least-once delivery + idempotency:
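Kafka has no per-message ACK from consumers; a consumer records its progress by committing offsets. Committing only after processing gives at-least-once delivery: no event is skipped, but your service might process the same event twice. Make every operation idempotent: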
# check before acting
if inventory.status != "deducted_for_order_123":
    deduct_stock()
    inventory.status = "deducted_for_order_123"
    # save the stock change and the status marker in one DB transaction
    db.save(inventory)
# second delivery → already deducted → skip
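A more general variant of the same check is to record each event's ID in a dedupe table inside the same transaction as the business change. The sketch below assumes events carry a unique `event_id` field, which the original events do not show; the table names are also hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stock (item TEXT PRIMARY KEY, quantity INTEGER)")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("INSERT INTO stock VALUES ('burger', 10)")
db.commit()


def handle_payment_success(event: dict) -> bool:
    """Deduct stock once per event_id. Returns False for a duplicate delivery."""
    try:
        with db:  # dedupe marker and stock change commit or roll back together
            db.execute(
                "INSERT INTO processed_events VALUES (?)", (event["event_id"],)
            )
            db.execute(
                "UPDATE stock SET quantity = quantity - 1 WHERE item = ?",
                (event["item"],),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key clash: this event was already applied


event = {"event_id": "evt-1", "item": "burger"}
first = handle_payment_success(event)
second = handle_payment_success(event)  # simulated redelivery
quantity = db.execute(
    "SELECT quantity FROM stock WHERE item = 'burger'"
).fetchone()[0]
print(first, second, quantity)  # True False 9
```

Because the marker and the deduction share one transaction, a crash between them cannot leave the stock deducted without the marker, or the marker set without the deduction.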
Failure 3: Service crashes after processing but before committing the offset (double processing risk)
Payment Service receives "inventory_reversed", runs the refund, but crashes before committing the offset to Kafka.
sequenceDiagram
participant K as Kafka
participant P as Payment Service
K->>P: deliver "inventory_reversed"
P->>P: refund executes ✓
Note over P: 💀 crashes before offset commit
K->>P: redelivers "inventory_reversed"
P->>P: refund executes again 😬
What happens: The offset was never committed, so Payment Service reads the message again after restarting. The refund runs twice and the user is refunded double.
Solution — idempotency check before acting:
if payment.status != "refunded":
    process_refund()
    payment.status = "refunded"
    db.save(payment)
# second delivery → status already "refunded" → skip
Same pattern as Failure 2. Every step in a choreography saga must be idempotent — assume every message will be delivered more than once.
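The interaction between offset commits and idempotency can be simulated without a broker. `Partition` below is a toy model of one partition plus one consumer group's committed offset, not a Kafka client API; the other names are likewise assumptions.

```python
class Partition:
    """Toy model of one Kafka partition and one consumer group's committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message the group should read

    def poll(self):
        # A restarted consumer resumes from the last committed offset.
        return self.messages[self.committed:]

    def commit(self, next_offset):
        self.committed = next_offset


refunded = set()    # order ids already refunded (a DB table in practice)
refund_calls = []   # every time money actually moved


def refund(order_id):
    if order_id in refunded:  # idempotency check
        return
    refund_calls.append(order_id)
    refunded.add(order_id)


def consume(partition, crash_before_commit=False):
    start = partition.committed
    for i, order_id in enumerate(partition.poll()):
        refund(order_id)
        if crash_before_commit:
            raise RuntimeError("consumer died before committing its offset")
        partition.commit(start + i + 1)  # commit only after processing


partition = Partition([123])
try:
    consume(partition, crash_before_commit=True)  # refund ran, offset not committed
except RuntimeError:
    pass
consume(partition)  # restart: the same message is read again
print(refund_calls, partition.committed)  # [123] 1
```

The message is delivered twice, but the refund moves money only once because the handler checks its own state first.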
Failure 4: Compensation itself fails
Order Service publishes "order_failed". Inventory Service tries to add stock back — but its DB is down. The compensation fails.
sequenceDiagram
participant K as Kafka
participant I as Inventory Service
K->>I: deliver "order_failed"
I->>I: add stock back ✗ DB down
Note over I: offset not committed
K->>I: redelivers "order_failed"
I->>I: add stock back ✗ DB still down
What happens: The message is read again and Inventory Service keeps failing. The compensation is stuck and stock is not restored. The saga stays in an inconsistent state until the DB recovers.
Solution — retry with exponential backoff + dead letter queue:
Retry 1 → fails → wait 1s
Retry 2 → fails → wait 2s
Retry 3 → fails → wait 4s
...
After N retries → send to Dead Letter Queue (DLQ)
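The DLQ holds the failed message. Kafka has no built-in DLQ: it is an ordinary topic that the consumer (or its client framework) publishes to once retries are exhausted. An alert fires. A human or automated process investigates, then manually restores stock or triggers a different compensation path. The failure is never silently lost.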
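The retry schedule above can be sketched as a small wrapper around the event handler. `process_with_retry` and its parameters are hypothetical, and the `dead_letters` list stands in for publishing to a DLQ topic.

```python
import time


def process_with_retry(message, handler, dead_letters, max_attempts=4,
                       base_delay=1.0, sleep=time.sleep):
    """Try handler(message) up to max_attempts times, doubling the wait each time.

    Returns True on success. After the last failure the message is appended to
    dead_letters and False is returned.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    dead_letters.append({"message": message, "error": repr(last_error)})
    return False


def restore_stock(message):
    raise ConnectionError("inventory DB is down")  # simulated outage


dlq, waits = [], []
ok = process_with_retry(
    {"event": "order_failed", "order_id": 123},
    restore_stock,
    dlq,
    sleep=waits.append,  # record the delays instead of actually sleeping
)
print(ok, waits, len(dlq))  # False [1.0, 2.0, 4.0] 1
```

Passing `sleep` as a parameter keeps the example fast to run; a real consumer would also add jitter and would alert on every message that reaches the DLQ.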
Compensation failure is the hardest problem in Saga
There is no automatic resolution. If the compensating transaction keeps failing, you need human intervention or a fallback path. This is why payment systems combine Saga with end-of-day reconciliation — to catch anything that fell through.
Failure 5: Kafka goes down mid-saga
Payment succeeds. Kafka goes down before delivering "payment_success" to Inventory Service.
What happens: Inventory Service never starts. The saga is frozen mid-way. User is charged, order is not placed.
Solution — Kafka replication + durability:
Kafka replicates each partition across brokers. When the producer uses acks=all, "payment_success" is written to the leader and every in-sync replica before Payment Service gets an ACK. So if one Kafka broker goes down, another has the message. The message is not lost.
Payment Service → Kafka leader (writes to replicated log)
→ Broker 1 ✓
→ Broker 2 ✓
→ Broker 3 ✓
→ ACK sent to Payment Service only after replication
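If the entire Kafka cluster goes down and comes back up, the messages are still on disk and consumers resume from their last committed offsets, as long as the outage is shorter than the topic's retention period. Inventory Service will eventually receive "payment_success".
This is the case for a durable, replicated log such as Kafka over a queue that holds messages only in memory. Both Kafka and brokers like RabbitMQ can lose messages if not configured carefully (for Kafka, acks and the replication factor; for RabbitMQ, durable queues and persistent messages).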
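The durability described here depends on configuration. As a rough sketch: `acks`, `enable.idempotence`, and `min.insync.replicas` are standard Kafka settings, while the values, the `replication_factor` key, and the helper function are assumptions chosen for illustration.

```python
# Producer side: standard Kafka producer configuration keys.
producer_config = {
    "acks": "all",               # wait for all in-sync replicas, not just the leader
    "enable.idempotence": True,  # broker de-duplicates producer retries
}

# Topic side: illustrative values.
topic_settings = {
    "replication_factor": 3,     # copies of each partition, set at topic creation
    "min.insync.replicas": 2,    # acks=all writes fail below this many in-sync copies
}


def tolerated_broker_failures(topic: dict) -> int:
    """Brokers that can be lost while acks=all writes are still accepted."""
    return topic["replication_factor"] - topic["min.insync.replicas"]


print(tolerated_broker_failures(topic_settings))  # 1
```

With these values one broker can fail without blocking writes; losing a second broker makes producers see errors rather than silently accepting under-replicated writes.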
Failure 6: Saga gets stuck with no further events (silent failure)
A service processes an event, its DB call fails, it doesn't publish anything — not a success event, not a failure event. The saga simply stops. No compensation runs. No alert fires.
Inventory Service receives "payment_success"
→ DB call fails silently
→ publishes nothing
→ Order Service never starts
→ Payment Service never gets a reversal signal
→ user is charged, nothing happens
Solution — saga timeout + monitoring:
Every saga should have a timeout. If an order hasn't completed within N minutes, an external monitor fires an alert or triggers a forced compensation:
Order placed at 10:00 AM
Order not completed by 10:05 AM → timeout triggers
→ compensate: refund payment, restore stock
In choreography this is harder to implement because no single service owns the full saga state. This is one of the core reasons teams move to orchestration — the orchestrator tracks the full state and can enforce timeouts centrally.
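One way to get a timeout in a choreographed saga is a separate watchdog that subscribes to every saga topic and tracks only start and finish. `SagaMonitor` and the choice of terminal events below are assumptions, not part of the original design.

```python
from datetime import datetime, timedelta

# Events after which the saga is finished, one way or the other.
TERMINAL_EVENTS = {"order_created", "payment_refunded"}


class SagaMonitor:
    """Hypothetical watchdog that sees every event of every saga."""

    def __init__(self, timeout: timedelta) -> None:
        self.timeout = timeout
        self.started = {}      # order_id -> time of the first event seen
        self.finished = set()  # order ids that reached a terminal event

    def on_event(self, order_id: int, event_type: str, at: datetime) -> None:
        self.started.setdefault(order_id, at)
        if event_type in TERMINAL_EVENTS:
            self.finished.add(order_id)

    def timed_out(self, now: datetime) -> list:
        """Order ids that started more than `timeout` ago and never finished."""
        return sorted(
            order_id
            for order_id, start in self.started.items()
            if order_id not in self.finished and now - start > self.timeout
        )


monitor = SagaMonitor(timeout=timedelta(minutes=5))
ten = datetime(2024, 1, 1, 10, 0)
monitor.on_event(123, "payment_success", ten)                       # then silence
monitor.on_event(124, "payment_success", ten)
monitor.on_event(124, "order_created", ten + timedelta(minutes=1))  # completed
print(monitor.timed_out(ten + timedelta(minutes=6)))  # [123]
```

The monitor only detects stuck sagas. Deciding which compensations to trigger still requires knowing how far each saga got, which is the state an orchestrator would hold.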
All failure cases summarised
| Failure | What happens | Solution |
|---|---|---|
| Service crashes before publishing event | Saga stuck, event never sent | Outbox pattern — write event to DB in same transaction |
| Service crashes before processing event | Kafka redelivers, processed on recovery | Kafka at-least-once + idempotency |
| Service crashes after processing, before offset commit | Double processing on redelivery | Idempotency check before every operation |
| Compensation itself fails | Saga stuck in inconsistent state | Retry + exponential backoff + Dead Letter Queue |
| Kafka goes down mid-saga | Saga frozen until Kafka recovers; acknowledged messages are not lost | Kafka replication: messages survive broker failure |
| Silent failure — no event published | Saga never progresses, no alert | Saga timeout + monitoring + forced compensation |
The debugging problem
Six months later, a bug is reported — an order got charged but never refunded. Where do you start?
In choreography, the full flow is spread across multiple services and multiple Kafka topics:
payment_success → Payment Service logs
inventory_updated → Inventory Service logs
order_failed → Order Service logs
inventory_reversed → Inventory Service logs
payment_refunded → Payment Service logs (missing?)
You have to trace through Kafka logs across 3 different services to reconstruct what happened to order 123. There is no single place that shows you the full picture.
This is choreography's biggest operational weakness — distributed observability. You need distributed tracing (e.g. Jaeger, Zipkin) and correlation IDs on every event just to follow one order through the system.
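A sketch of how a correlation ID lets you rebuild one order's timeline from a shared log. The `Event` fields, `follow_up`, and `timeline` are hypothetical; tracing systems such as Jaeger or Zipkin propagate a trace ID in a similar way.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Event:
    event_type: str
    service: str
    correlation_id: str  # same value on every event of one saga
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def follow_up(cause: Event, event_type: str, service: str) -> Event:
    """Build the next event, copying the correlation id from the one that caused it."""
    return Event(event_type, service, cause.correlation_id)


def timeline(events, correlation_id):
    """Event types for one saga, in the order they were logged."""
    return [e.event_type for e in events if e.correlation_id == correlation_id]


first = Event("payment_success", "payment", correlation_id="order-123")
second = follow_up(first, "inventory_updated", "inventory")
third = follow_up(second, "order_failed", "order")
other = Event("payment_success", "payment", correlation_id="order-456")

central_log = [first, other, second, third]  # e.g. logs shipped from all services
print(timeline(central_log, "order-123"))
```

The timeline for order 123 stops at "order_failed", which points straight at the Inventory Service compensation as the place to look.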
Choreography: trade-offs
| Strength | Weakness |
|---|---|
| No central coordinator to fail (Kafka is still a shared dependency) | Hard to debug: flow is spread across services |
| Fully decentralised | Each service must implement idempotency independently |
| Services are loosely coupled | Easy to lose track of the overall saga state |
| Simple to add new steps — just listen to an event | No single place to see "what happened to order 123" |