Challenge: Your checkout succeeded... but your inventory didn't
You run an e-commerce platform. A customer clicks "Place Order".
Payment succeeds, but the inventory reservation fails. Now you've charged a customer for something you can't ship.
If this were a monolith with a single database, you'd wrap everything in a transaction and roll back. In a distributed system, you don't get that luxury.
So you need a way to coordinate multiple independent services and still achieve a business-level "all-or-nothing" outcome.
Welcome to the Saga pattern.
Your system is decomposed into services. Each service owns its data. Cross-service operations are distributed transactions in disguise.
Traditional 2PC (Two-Phase Commit) tries to make distributed operations behave like a single ACID transaction. But 2PC has sharp edges:
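A Saga replaces a single distributed transaction with a sequence of local transactions, each committed independently by the service that owns the data, plus compensating actions that undo completed steps when a later step fails.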
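In production, assume that messages can be duplicated, delayed, reordered, or lost, and that a timeout tells you nothing about whether the remote operation succeeded.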
Pause and think: If each local transaction commits independently, what prevents the system from ending up in a "half-done" state?
(Write down your guess: retries? timeouts? compensation? idempotency? reconciliation?)
Think of a restaurant group with separate counters:
There's no single manager who can "uncharge your card" automatically if ingredients aren't available - unless you design a refund process (compensation).
A Saga is that refund-and-reversal playbook.
Airline booking: reserve the seat, take payment, issue the ticket.
If ticket issuing fails, you might cancel the seat and refund payment. That's a Saga.
Key insight: A Saga does not give you atomicity across services. It gives you a business-consistent outcome over time using coordination + compensations + reconciliation.
You want to reason about correctness. With ACID transactions, you reason about isolation and atomicity. With Sagas, you reason about states and transitions.
Pause and think: In a Saga, what are the two "directions" the state machine can move?
Imagine a delivery route:
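Sagas move forward through local transactions (T1 -> T2 -> T3) and, when a step fails, backward through compensations for the steps that already completed.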
Key insight: Model Sagas explicitly as a durable workflow/state machine with replay-safe transitions - not as a "transaction with retries".
[IMAGE: State machine diagram. Forward steps T1 -> T2 -> T3. Compensation steps C2 <- C1. Terminal states: SUCCEEDED, FAILED, COMPENSATED, and optionally PARTIAL_ACCEPTED (domain-specific).]
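The state machine in the diagram can be sketched as a transition table. This is a minimal illustration, not a production workflow engine: the intermediate state names and event names are assumptions chosen for the checkout example, and state is kept in memory rather than persisted.

```python
from enum import Enum


class SagaState(Enum):
    STARTED = "STARTED"
    PAYMENT_AUTHORIZED = "PAYMENT_AUTHORIZED"
    INVENTORY_RESERVED = "INVENTORY_RESERVED"
    COMPENSATING = "COMPENSATING"
    SUCCEEDED = "SUCCEEDED"
    COMPENSATED = "COMPENSATED"
    FAILED_NEEDS_MANUAL = "FAILED_NEEDS_MANUAL"


TERMINAL = {
    SagaState.SUCCEEDED,
    SagaState.COMPENSATED,
    SagaState.FAILED_NEEDS_MANUAL,
}

# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (SagaState.STARTED, "payment_authorized"): SagaState.PAYMENT_AUTHORIZED,
    # Nothing has been committed yet, so there is nothing to undo.
    (SagaState.STARTED, "payment_failed"): SagaState.COMPENSATED,
    (SagaState.PAYMENT_AUTHORIZED, "inventory_reserved"): SagaState.INVENTORY_RESERVED,
    (SagaState.PAYMENT_AUTHORIZED, "inventory_failed"): SagaState.COMPENSATING,
    (SagaState.INVENTORY_RESERVED, "shipment_created"): SagaState.SUCCEEDED,
    (SagaState.INVENTORY_RESERVED, "shipment_failed"): SagaState.COMPENSATING,
    (SagaState.COMPENSATING, "compensation_done"): SagaState.COMPENSATED,
    (SagaState.COMPENSATING, "compensation_failed"): SagaState.FAILED_NEEDS_MANUAL,
}


def next_state(state, event):
    """Return the next state; events that do not apply leave the state unchanged."""
    if state in TERMINAL:
        return state
    return TRANSITIONS.get((state, event), state)
```

Because an event that does not match the current state is ignored, replaying a duplicate or late event does not move the saga, which is the replay-safe property the transitions need.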
A Saga should have explicit terminal states and policies:
- SUCCEEDED: all required steps completed
- COMPENSATED: forward steps undone to an acceptable business state
- FAILED_NEEDS_MANUAL: cannot compensate automatically
- TIMED_OUT: exceeded SLA; may still complete later unless fenced

You need to implement a Saga. You must decide: do services coordinate implicitly via events, or explicitly via a controller?
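Two common styles: choreography, where services react to each other's events with no central controller, and orchestration, where a central controller issues commands and tracks progress.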
Pause and think: Which one sounds more scalable? Which one sounds easier to debug at 3am?
Key insight: Choreography optimizes for local autonomy; orchestration optimizes for global visibility and control.
- Order Service emits OrderCreated
- Payment Service consumes OrderCreated, authorizes/charges, emits PaymentAuthorized or PaymentFailed
- Inventory Service consumes PaymentAuthorized, reserves, emits InventoryReserved or InventoryFailed
- Shipping Service consumes InventoryReserved, creates shipment, emits ShipmentCreated or ShipmentFailed
- Order Service marks the order COMPLETED or CANCELLED

[IMAGE: Event flow diagram with services as boxes and events as arrows. Include: OrderCreated -> PaymentAuthorized -> InventoryReserved -> ShipmentCreated -> OrderCompleted; InventoryFailed -> PaymentVoided/Refunded -> OrderCancelled]

Pause and think: Where does the system decide to compensate? Is there a single place that "knows" the whole workflow?
Key insight: Choreography spreads workflow logic across services. The system's behavior emerges from event interactions.
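To make the emergent behavior concrete, here is a small in-memory sketch of the failure path. The bus (`subscribe`, `publish`) and the handler functions are hypothetical stand-ins for a real broker and real services; only the event names come from the flow above.

```python
from collections import defaultdict, deque

subscribers = defaultdict(list)  # event type -> list of handlers
trace = []                       # event types in delivery order


def subscribe(event_type):
    """Register a handler for one event type."""
    def register(handler):
        subscribers[event_type].append(handler)
        return handler
    return register


def publish(first_event):
    """Deliver events in FIFO order; each handler returns the events it emits."""
    queue = deque([first_event])
    while queue:
        event = queue.popleft()
        trace.append(event["type"])
        for handler in subscribers[event["type"]]:
            queue.extend(handler(event))


@subscribe("OrderCreated")
def payment_service(event):
    return [{"type": "PaymentAuthorized", "orderId": event["orderId"]}]


@subscribe("PaymentAuthorized")
def inventory_service(event):
    # Simulate an out-of-stock item.
    return [{"type": "InventoryFailed", "orderId": event["orderId"]}]


@subscribe("InventoryFailed")
def payment_compensation(event):
    return [{"type": "PaymentVoided", "orderId": event["orderId"]}]


@subscribe("PaymentVoided")
def order_service(event):
    return [{"type": "OrderCancelled", "orderId": event["orderId"]}]


publish({"type": "OrderCreated", "orderId": "o-1"})
```

No function in this sketch describes the whole workflow. The compensation path exists only because `payment_compensation` happens to subscribe to InventoryFailed, which is exactly what makes choreographed flows hard to read end to end.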
PaymentAuthorized may arrive twice.
Production pattern: store a processed-message key (e.g., eventId) in an inbox/dedup table with TTL aligned to replay window.
The real issue is ordering: a stale event can arrive after a newer one (e.g., PaymentFailed arrives after PaymentAuthorized due to retries).

Mitigation: consumers must validate current aggregate state (or saga step version) before applying.
Inventory emits InventoryFailed, but Order Service never receives it.
A new service subscribes to PaymentAuthorized and changes behavior. Suddenly the Saga has new hidden dependencies.
During a partition, some services may continue processing while others lag.
Teams often say: "We're event-driven, so we're loosely coupled." Then they add 12 subscribers to a single event.
Key insight: Choreography reduces control-plane coupling but can increase semantic coupling unless events are versioned and governed.
Pick the true statement(s). Pause first.
A. In choreography, there is no need for a state machine.
B. In orchestration, services do not need idempotency.
C. In choreography, adding a new subscriber can change system behavior without changing producers.
D. In orchestration, the orchestrator is a single point of failure.
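Reveal: C is true. A and B are false, because both styles need durable state and idempotent handlers. D is true only if the orchestrator runs as a single instance; it is a critical component, but it can be engineered for HA.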
Key insight: Both styles require idempotency, durable state, and careful failure handling. The difference is where the workflow logic lives.
Implement the same checkout with an orchestrator.
The orchestrator drives the flow:
1. CreateOrder
2. AuthorizePayment
3. ReserveInventory
4. CreateShipment

On failure, it triggers compensations:

- VoidOrRefundPayment + CancelOrder

[IMAGE: Sequence diagram showing orchestrator sending commands to services and receiving replies/events. Include compensation path and retries.]
Pause and think: What new failure mode did we introduce by adding an orchestrator?
Key insight: Orchestration centralizes workflow decisions, improving observability and control, but it adds a critical component that must be engineered for HA and correctness.
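A minimal sketch of that control flow, assuming in-process callables in place of remote services. `Step`, `SagaResult`, and `run_saga` are hypothetical names, and the sketch deliberately omits what a real orchestrator needs most: persisting state before each step, retries, and timers.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    compensate: Optional[Callable[[dict], None]] = None


@dataclass
class SagaResult:
    status: str
    log: List[str] = field(default_factory=list)


def run_saga(steps, ctx):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    result = SagaResult(status="SUCCEEDED")
    done = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as exc:
            result.log.append(f"{step.name} failed: {exc}")
            result.status = "COMPENSATED"
            for prev in reversed(done):
                if prev.compensate is None:
                    continue
                try:
                    prev.compensate(ctx)
                    result.log.append(f"compensated {prev.name}")
                except Exception:
                    # A compensation that fails needs a human.
                    result.status = "FAILED_NEEDS_MANUAL"
                    result.log.append(f"could not compensate {prev.name}")
            return result
        result.log.append(step.name)
        done.append(step)
    return result


def reserve_inventory(ctx):
    raise RuntimeError("out of stock")  # simulated failure


ctx = {}
steps = [
    Step("CreateOrder",
         lambda c: c.update(order="CREATED"),
         lambda c: c.update(order="CANCELLED")),
    Step("AuthorizePayment",
         lambda c: c.update(payment="AUTHORIZED"),
         lambda c: c.update(payment="VOIDED")),
    Step("ReserveInventory", reserve_inventory,
         lambda c: c.update(inventory="RELEASED")),
    Step("CreateShipment", lambda c: c.update(shipment="CREATED")),
]
result = run_saga(steps, ctx)
```

Here the saga ends in COMPENSATED with the payment voided and the order cancelled. The failed step itself is not compensated, because it never committed anything.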
Orchestrator sends ReserveInventory. Inventory reserves, but response is lost.
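Mitigation: make each command idempotent with a key (e.g., sagaId + stepName), so a retry returns the original result instead of reserving twice.

Due to retry, the orchestrator might send CancelPayment twice.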
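One way to make retried commands harmless is to store the reply under an idempotency key and replay it on a repeat. The sketch below is an assumption-laden illustration: `InventoryService` is a hypothetical in-memory service, and in a real service the key and the stock change would be written in the same local transaction.

```python
class InventoryService:
    """Hypothetical service that deduplicates commands by (saga_id, step_name)."""

    def __init__(self, stock):
        self.stock = stock
        self._replies = {}  # idempotency key -> stored reply

    def reserve(self, saga_id, step_name, qty):
        key = (saga_id, step_name)
        if key in self._replies:
            # Retry of a command already handled: replay the stored reply.
            return self._replies[key]
        if self.stock >= qty:
            self.stock -= qty
            reply = "RESERVED"
        else:
            reply = "REJECTED"
        self._replies[key] = reply
        return reply


svc = InventoryService(stock=1)
first = svc.reserve("saga-1", "ReserveInventory", 1)
retry = svc.reserve("saga-1", "ReserveInventory", 1)   # lost reply, resent
other = svc.reserve("saga-2", "ReserveInventory", 1)   # a different saga
```

The same approach applies to compensations such as CancelPayment: the second delivery finds the stored reply and does nothing further.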
Centralized logic means a single bug can affect all flows.
Production mitigation: feature flags, canarying, workflow versioning, and strong unit tests on the state machine.
Key insight: Orchestration centralizes control flow, not data ownership. The monolith risk comes from putting domain logic into the orchestrator instead of services.
| Dimension | Choreography Saga | Orchestration Saga |
|---|---|---|
| Workflow logic location | Distributed across services | Centralized in orchestrator |
| Operational visibility | Harder (need tracing across events) | Easier (single workflow view) |
| Coupling | Lower control coupling, higher semantic coupling risk | Higher control coupling to orchestrator API |
| Change management | Adding steps can be tricky (many subscribers) | Easier to change flow in one place |
| Failure handling | Distributed compensation logic | Centralized compensation logic |
| Testing | Complex integration tests across services | Orchestrator can be tested as a state machine |
| Runtime dependencies | Broker/event bus is critical | Orchestrator + persistence is critical |
| Scaling | Naturally scales with services | Orchestrator must scale; often horizontally |
| Best when | Simple flows, high autonomy, event-native domains | Complex workflows, strict auditability, need clear ownership |
Key insight: Neither is universally better. Choose based on workflow complexity, observability needs, and organizational ownership.
[IMAGE: Decision matrix: x-axis workflow complexity, y-axis need for auditability/visibility; show choreography favored bottom-left, orchestration top-right, hybrid in the middle.]
- AuthorizePayment command
- PaymentAuthorized event

A. Idempotency keys + dedup store (inbox)
B. Outbox pattern + reliable event publishing
C. Saga step fencing tokens / version checks
D. Exactly-once delivery
E. Durable workflow state + at-least-once commands
Reveal (one reasonable mapping):
Why not D? Because "exactly-once delivery" is rarely achievable end-to-end; we typically build effectively-once processing from at-least-once + idempotency.
A common fencing technique is a monotonic saga step/version stored with the aggregate. A late message with an older step/version is ignored.
[IMAGE: Fencing diagram showing COMPENSATED state with a higher version; late-arriving ReserveInventorySucceeded with lower version is rejected.]
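The version check can be sketched in a few lines. `OrderAggregate` and `apply_message` are hypothetical names, and the version numbers are arbitrary; the only rule that matters is that a message is applied only if its version is strictly newer than the one stored with the aggregate.

```python
from dataclasses import dataclass


@dataclass
class OrderAggregate:
    state: str = "PENDING_INVENTORY"
    version: int = 2


def apply_message(order, new_state, msg_version):
    """Apply a message only if it is newer than what the aggregate has seen."""
    if msg_version <= order.version:
        return False  # stale or duplicate message: ignore it
    order.state = new_state
    order.version = msg_version
    return True


order = OrderAggregate()
compensated = apply_message(order, "COMPENSATED", 4)
# A late ReserveInventorySucceeded carrying an older version is rejected.
late = apply_message(order, "INVENTORY_RESERVED", 3)
```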
Compensations are not perfect inverses.
If you can't undo a step (email sent, external bank transfer settled), you can only take a follow-up business action, such as a refund, a credit note, or a manual case.

Typical forward/compensation pairs:

- ReserveInventory -> ReleaseInventory
- ChargeCard -> RefundCard (or VoidAuthorization if not captured)

Key insight: Compensations must be modeled as first-class business operations with their own failure modes, SLAs, and audit trails.
[IMAGE: Compensation ladder: void auth (best) -> refund (ok) -> credit note (ok) -> manual case (last resort).]
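The ladder can be expressed as an ordered list of attempts, falling through to a manual case when every rung fails. This is a sketch under stated assumptions: the payment is a plain dict, the `gateway_up` flag simulates a refund outage, and all function names are hypothetical.

```python
class CompensationFailed(Exception):
    """Raised by a rung that cannot reverse this payment."""


def void_authorization(payment):
    if payment["captured"]:
        raise CompensationFailed("already captured")
    payment["status"] = "VOIDED"


def refund(payment):
    if not payment.get("gateway_up", True):
        raise CompensationFailed("gateway unavailable")
    payment["status"] = "REFUNDED"


def credit_note(payment):
    payment["status"] = "CREDITED"


LADDER = [
    ("void_authorization", void_authorization),
    ("refund", refund),
    ("credit_note", credit_note),
]


def compensate_payment(payment, ladder=LADDER):
    """Try each rung in order; return the name of the first one that succeeds."""
    for name, attempt in ladder:
        try:
            attempt(payment)
            return name
        except CompensationFailed:
            continue
    payment["status"] = "FAILED_NEEDS_MANUAL"
    return "manual_case"
```

For example, `compensate_payment({"captured": True, "gateway_up": False})` skips the void (already captured) and the refund (gateway down) and lands on the credit note.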
Two customers try to buy the last item.
Without global isolation, anomalies include:
[IMAGE: Concurrency diagram showing two sagas competing for one inventory token; only one gets it.]
Key insight: Sagas shift isolation problems from the database to the application. You must choose domain-appropriate concurrency control.
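For the last-item race, one domain-appropriate control is an atomic conditional update inside the Inventory service's own database. The sketch uses `sqlite3` as a stand-in for that database; the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, available INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 1)")
conn.commit()


def reserve(conn, sku, qty):
    """Atomically decrement stock; succeeds only if enough units remain."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE inventory SET available = available - ? "
            "WHERE sku = ? AND available >= ?",
            (qty, sku, qty),
        )
        return cur.rowcount == 1


first_saga = reserve(conn, "sku-1", 1)   # gets the last item
second_saga = reserve(conn, "sku-1", 1)  # finds nothing left
```

The check and the decrement happen in one statement, so two sagas cannot both observe `available = 1` and both succeed. The losing saga receives a clean rejection and can compensate.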
A Saga design is only as reliable as its messaging and persistence patterns.
You typically need:
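If a service updates its DB and then publishes an event as two separate writes, you can get a committed DB change whose event is never published, for example when the service crashes between the two.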
Now other services never learn about the change.
Outbox solves this by writing the event to an outbox table in the same local transaction, then publishing asynchronously.
Two details matter in the relay. If it claims rows with FOR UPDATE SKIP LOCKED, the claim must happen inside an explicit transaction, because the row locks last only as long as the transaction. And published_at must be set only after the broker has accepted the message, never before.
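A minimal sketch of the outbox write and a single-relay publisher follows. It makes several assumptions: `sqlite3` stands in for the service database, the `publish` callback stands in for a broker client, and the table layout is illustrative. SQLite has no FOR UPDATE SKIP LOCKED, so the sketch assumes one relay process; with several relays on PostgreSQL, rows would be claimed with that clause inside a transaction.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    event_id TEXT PRIMARY KEY,
    payload TEXT NOT NULL,
    published_at TEXT
);
""")


def create_order(db, order_id):
    """Write the state change and its event in one local transaction."""
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'CREATED')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()),
             json.dumps({"type": "OrderCreated", "orderId": order_id})),
        )


def relay_once(db, publish):
    """Publish unpublished rows; mark a row only after the broker accepts it."""
    rows = db.execute(
        "SELECT event_id, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY rowid"
    ).fetchall()
    sent = 0
    for event_id, payload in rows:
        try:
            publish(event_id, payload)
        except Exception:
            break  # leave the row unpublished; the next run retries it
        with db:
            db.execute(
                "UPDATE outbox SET published_at = datetime('now') "
                "WHERE event_id = ?",
                (event_id,),
            )
        sent += 1
    return sent
```

If the relay crashes after `publish` returns but before the UPDATE commits, the row is published again on the next run. That window is why the consumer side still has to deduplicate.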
Important production note: this still provides at-least-once publishing. Consumers must deduplicate by event_id.
For effectively-once processing, consumers should store processed event_id (or (producer, sequence)):
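A sketch of such an inbox, assuming `sqlite3` in place of the consumer's database and hypothetical table names. The event id and the handler's side effect are written in the same local transaction, so a duplicate delivery fails on the primary key and changes nothing.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE inbox (event_id TEXT PRIMARY KEY);
CREATE TABLE reservations (order_id TEXT NOT NULL);
""")


def handle_payment_authorized(db, event_id, order_id):
    """Record the event id and apply the effect in one transaction."""
    try:
        with db:  # rolls back both inserts if either fails
            db.execute("INSERT INTO inbox (event_id) VALUES (?)", (event_id,))
            db.execute(
                "INSERT INTO reservations (order_id) VALUES (?)", (order_id,)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: already processed


first = handle_payment_authorized(db, "evt-1", "o-1")
duplicate = handle_payment_authorized(db, "evt-1", "o-1")
```

Rows in the inbox table can be expired once they are older than the broker's replay window.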
[IMAGE: Inbox/outbox diagram showing producer outbox -> broker -> consumer inbox/dedup table -> handler.]
Choreography works best with facts.
- Version event names, e.g. PaymentAuthorized.v1
- Include orderId, sagaId, causationId, correlationId

If the message is addressed to a specific service and implies obligation ("do X"), it's a command. Commands are usually better as direct RPC or a command queue with clear ownership.
[IMAGE: Facts vs commands diagram: broadcast facts on event bus; targeted commands via orchestrator/command channel.]
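A message envelope carrying those identifiers might look like the sketch below. The `Envelope` class, its field names, and the `caused_by` helper are assumptions for illustration, not a standard schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    type: str                           # versioned name, e.g. "PaymentAuthorized.v1"
    order_id: str                       # business key
    saga_id: str                        # workflow instance
    correlation_id: str                 # shared by every message in the chain
    causation_id: Optional[str] = None  # id of the message that caused this one
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def caused_by(parent, type_):
    """Build a follow-up message that keeps the chain's ids and points at its cause."""
    return Envelope(
        type=type_,
        order_id=parent.order_id,
        saga_id=parent.saga_id,
        correlation_id=parent.correlation_id,
        causation_id=parent.message_id,
    )


root = Envelope("OrderCreated.v1", "o-1", "saga-1", "corr-1")
child = caused_by(root, "PaymentAuthorized.v1")
```

Following `causation_id` links backward reconstructs the causal chain for a single order, while `correlation_id` groups everything that belongs to the same request.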
Building your own orchestrator means implementing:
Workflow engines (Temporal/Cadence/Zeebe/Step Functions) provide these primitives.
The original snippet used fetch without importing it (Node < 18) and persisted state to a local file (not HA). Below is a didactic example that:
- uses Node 18+ (global fetch) or imports node-fetch

Production note: a real orchestrator also needs timers for long-running steps, workflow versioning, and a way to query status.
[IMAGE: Orchestrator durability diagram: orchestrator instances (HA) + durable DB + command dispatch + replies/events.]
Order placed -> send confirmation email -> send SMS
Debit wallet -> credit merchant -> initiate bank transfer -> confirm settlement
Create ride request -> match driver -> driver accepts -> start trip -> end trip
Reveal (one reasonable answer):
Minimum identifiers to propagate:
- sagaId (workflow instance)
- orderId (business key)
- correlationId (ties a request chain together)
- causationId (points to the message that caused this message)
- stepName + attempt

Also:
[IMAGE: Trace timeline showing sagaId across services with retries and compensation steps.]
Orchestrator calls Payment Service. It times out.
Did the payment fail? Or did it succeed but the response got lost?
Only safe assumption after a timeout: you do not know.
Mitigations:
- Query a status API to learn what actually happened (e.g., GetPaymentStatus)

[IMAGE: Unknown outcome diagram: request sent -> timeout -> two possible realities (succeeded/failed) -> reconcile via status API.]
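The reconcile-after-timeout path can be sketched as follows. `FakePaymentService` is a hypothetical stand-in that commits the payment and then loses the reply; `get_status` plays the role of GetPaymentStatus, and the idempotency key format is an assumption.

```python
class FakePaymentService:
    """Stand-in for a remote service that commits but loses the reply."""

    def __init__(self, commits=True):
        self.commits = commits
        self.payments = {}

    def authorize(self, idempotency_key, amount):
        if self.commits:
            self.payments.setdefault(idempotency_key, "AUTHORIZED")
        raise TimeoutError("reply lost")

    def get_status(self, idempotency_key):
        return self.payments.get(idempotency_key, "NOT_FOUND")


def authorize_with_reconcile(service, key, amount):
    """Call the service; on timeout, ask what happened instead of guessing."""
    try:
        service.authorize(key, amount)
        return "AUTHORIZED"
    except TimeoutError:
        # Outcome unknown: the payment may or may not have been recorded.
        status = service.get_status(key)
        if status == "NOT_FOUND":
            return "RETRY"  # retry with the same key, never a new one
        return status


committed = authorize_with_reconcile(
    FakePaymentService(commits=True), "saga-1:AuthorizePayment", 100
)
not_committed = authorize_with_reconcile(
    FakePaymentService(commits=False), "saga-1:AuthorizePayment", 100
)
```

A NOT_FOUND answer right after a timeout can also mean the original request is still in flight, which is why the retry must reuse the same idempotency key.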
While a Saga is in progress, users refresh the UI.
Expose in-progress states explicitly:
- PENDING_PAYMENT
- PENDING_INVENTORY
- PENDING_SHIPMENT
- COMPLETING

Avoid lying with a premature "confirmed".
Patterns:
[IMAGE: UI state diagram mapping saga steps to user-visible statuses.]
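A small sketch of that mapping, assuming the saga state is available to the API layer as a string. The user-facing labels and the terminal states listed here are illustrative, not prescribed.

```python
USER_STATUS = {
    "PENDING_PAYMENT": "Processing payment",
    "PENDING_INVENTORY": "Checking stock",
    "PENDING_SHIPMENT": "Preparing shipment",
    "COMPLETING": "Finalizing order",
    "SUCCEEDED": "Order confirmed",
    "COMPENSATED": "Order cancelled, payment reversed",
}


def user_status(saga_state):
    """Map an internal saga state to a label that never overstates progress."""
    # Unknown or internal-only states fall back to a neutral in-progress label.
    return USER_STATUS.get(saga_state, "Processing your order")
```

Only SUCCEEDED maps to "confirmed", so a refresh during the saga never shows a promise the system might have to take back.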
Many real systems mix both styles.
Example:
Key insight: Use orchestration for cross-domain workflows; use choreography for intra-domain reactions.
[IMAGE: Hybrid architecture diagram: orchestrator at domain boundary; internal domain event mesh inside each bounded context.]
Choreography often broadcasts events broadly, increasing risk of:
Mitigations:
[IMAGE: PII minimization diagram: event contains customerId token; PII fetched only by authorized service.]
Common domains:
Common infrastructure choices:
Design a checkout Saga for:
Constraints:
- PENDING_SHIPMENT, or
- OrderCompleted / OrderCancelled facts.

[IMAGE: Final architecture diagram showing orchestrator, services, broker, outbox/inbox tables, and external dependencies; include correlation IDs flow.]
- FAILED_NEEDS_MANUAL and page/queue for ops.

If you can answer "After a timeout, how do I know what happened?" and "How do I undo this step safely?", you're thinking like a distributed systems engineer.