Saga Pattern Explained: Managing Distributed Transactions Without Two-Phase Commit

Learn the saga pattern for distributed transactions — choreography vs orchestration, compensating actions, and real examples from e-commerce systems.

saga-pattern, distributed-transactions, microservices, event-driven, distributed-systems

Saga Pattern

The saga pattern manages distributed transactions by breaking them into a sequence of local transactions, each paired with a compensating action that undoes its work if a later step fails. This removes the need for distributed locks or two-phase commit.

What It Really Means

In a monolith, a database transaction can atomically update orders, inventory, and payments in a single BEGIN...COMMIT. In a microservices architecture, the Order Service, Inventory Service, and Payment Service each own their own database. There is no single transaction that spans all three.
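
For contrast with what follows, here is a minimal sketch of the monolith case using Python's built-in sqlite3 module. The table names, columns, and the place_order helper are illustrative assumptions, not taken from any real system; the point is only that one local transaction covers all three writes.

```python
import sqlite3

# Hypothetical single-database schema: one table per concern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0));
    CREATE TABLE payments  (order_id INTEGER, amount_cents INTEGER);
    INSERT INTO inventory VALUES ('widget', 1);
""")

def place_order(conn, order_id, sku, qty, amount_cents):
    """Update all three tables atomically; any error rolls back every write."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO orders VALUES (?, 'CONFIRMED')", (order_id,))
            conn.execute("UPDATE inventory SET qty = qty - ? WHERE sku = ?", (qty, sku))
            conn.execute("INSERT INTO payments VALUES (?, ?)", (order_id, amount_cents))
        return True
    except sqlite3.IntegrityError:
        return False

ok_first = place_order(conn, 1, "widget", 1, 1999)   # enough stock
ok_second = place_order(conn, 2, "widget", 1, 1999)  # CHECK fails, order row is rolled back
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The second call inserts an order row before the inventory update violates the CHECK constraint, yet no trace of that order survives. That automatic rollback is exactly what is lost once each table lives behind a different service.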

Two-phase commit (2PC) could coordinate them, but 2PC has serious problems: it locks resources across services during the protocol, a coordinator failure blocks everyone, and it does not work well across network boundaries or with heterogeneous systems.

The saga pattern, introduced by Hector Garcia-Molina and Kenneth Salem in 1987, takes a different approach. Instead of one big atomic transaction, it runs a sequence of smaller, local transactions, each of which commits on its own. If step 3 of 5 fails, the saga runs compensating transactions for steps 2 and 1, in that order, to undo their effects. The end result is that either every step succeeds or the effects of every completed step are compensated, which gives eventual consistency without distributed locks.
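
The core control flow can be sketched in a few lines of Python. This is an in-process illustration only: run_saga and the step functions are hypothetical names, and a real saga would call remote services and persist its progress.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)

def run_saga(steps: List[Step]) -> bool:
    """Run each action in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # The failed step made no durable change, so only earlier steps are undone.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True

log: List[str] = []

def fail() -> None:
    raise RuntimeError("step 3 failed")

steps: List[Step] = [
    (lambda: log.append("do 1"), lambda: log.append("undo 1")),
    (lambda: log.append("do 2"), lambda: log.append("undo 2")),
    (fail, lambda: log.append("undo 3")),
    (lambda: log.append("do 4"), lambda: log.append("undo 4")),
    (lambda: log.append("do 5"), lambda: log.append("undo 5")),
]

succeeded = run_saga(steps)
# log is now ["do 1", "do 2", "undo 2", "undo 1"] and succeeded is False
```

Steps 4 and 5 never run, and step 3 is not compensated because its local transaction never committed.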

Sagas are used extensively in e-commerce (order processing), travel booking (flights + hotels + cars), financial services (money transfers), and any microservices system that needs cross-service data consistency.

How It Works in Practice

The Two Coordination Styles

Choreography (event-driven): Each service listens for events and performs its step, then emits the next event. No central coordinator.

Orchestration (central coordinator): A Saga Orchestrator tells each service what to do and handles failures.

Real-World: E-Commerce Order Processing

An order saga might have these steps:

Step | Service   | Action                    | Compensating Action
1    | Order     | Create order (PENDING)    | Cancel order
2    | Inventory | Reserve items             | Release items
3    | Payment   | Authorize payment         | Void authorization
4    | Shipping  | Schedule shipment         | Cancel shipment
5    | Order     | Confirm order (CONFIRMED) | None (no compensation needed)

If step 3 (payment) fails, the orchestrator runs the compensating actions for steps 2 and 1 in reverse order. The customer sees the order move to CANCELLED status.

Real-World: Travel Booking (Choreography)

A travel booking platform uses choreography:

  1. Flight Service books flight → emits FlightBooked
  2. Hotel Service hears FlightBooked, books hotel → emits HotelBooked
  3. Car Rental Service hears HotelBooked, books car → emits CarBooked
  4. Booking Service hears CarBooked, confirms trip → emits TripConfirmed

If the car rental fails, Car Rental Service emits CarBookingFailed. Hotel Service hears it, cancels the hotel, and emits HotelCancelled. Flight Service hears HotelCancelled and cancels the flight. Each service only knows about the events it subscribes to.
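
The event chain above can be simulated with a toy in-process event bus. This is a sketch under several assumptions: EventBus is a synchronous stand-in for a real message broker, and the TripRequested and FlightCancelled event names are invented here to start and end the chain; they do not appear in the description above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]

class EventBus:
    """Toy synchronous, in-process stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # every event emitted, in order

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def emit(self, event: str, trip_id: str) -> None:
        self.history.append(event)
        for handler in self.handlers[event]:
            handler(trip_id)

def wire_services(bus: EventBus, car_available: bool) -> None:
    def on_hotel_booked(trip_id: str) -> None:
        # Car Rental Service: book a car, or report that it could not.
        bus.emit("CarBooked" if car_available else "CarBookingFailed", trip_id)

    # Forward path: each service reacts to the previous service's event.
    bus.subscribe("TripRequested", lambda t: bus.emit("FlightBooked", t))
    bus.subscribe("FlightBooked", lambda t: bus.emit("HotelBooked", t))
    bus.subscribe("HotelBooked", on_hotel_booked)
    bus.subscribe("CarBooked", lambda t: bus.emit("TripConfirmed", t))
    # Compensation path: failures propagate backwards through the same services.
    bus.subscribe("CarBookingFailed", lambda t: bus.emit("HotelCancelled", t))
    bus.subscribe("HotelCancelled", lambda t: bus.emit("FlightCancelled", t))

failed_bus = EventBus()
wire_services(failed_bus, car_available=False)
failed_bus.emit("TripRequested", "trip-1")

ok_bus = EventBus()
wire_services(ok_bus, car_available=True)
ok_bus.emit("TripRequested", "trip-2")
```

Note that no single function describes the whole workflow; it exists only as the sum of the subscriptions, which is both the appeal and the cost of choreography.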

Implementation
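
Below is a minimal sketch of an orchestrated saga for the e-commerce order example, not a production implementation. The SagaStep and SagaOrchestrator classes, the service functions, and the card_declined flag are all hypothetical; real services would be remote calls, and the orchestrator would persist its state so it can resume after a crash.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Context = Dict[str, object]

@dataclass
class SagaStep:
    name: str
    action: Callable[[Context], None]
    compensation: Optional[Callable[[Context], None]] = None

@dataclass
class SagaOrchestrator:
    steps: List[SagaStep]
    log: List[str] = field(default_factory=list)

    def execute(self, ctx: Context) -> bool:
        done: List[SagaStep] = []
        for step in self.steps:
            try:
                step.action(ctx)
            except Exception as exc:
                self.log.append(f"{step.name}: failed ({exc})")
                self._compensate(done, ctx)
                return False
            self.log.append(f"{step.name}: ok")
            done.append(step)
        return True

    def _compensate(self, done: List[SagaStep], ctx: Context) -> None:
        # Undo completed steps in reverse order, skipping steps with no compensation.
        for step in reversed(done):
            if step.compensation is not None:
                step.compensation(ctx)
                self.log.append(f"{step.name}: compensated")

# Stand-ins for calls to the Order, Inventory, Payment, and Shipping services.
def create_order(ctx: Context) -> None: ctx["status"] = "PENDING"
def cancel_order(ctx: Context) -> None: ctx["status"] = "CANCELLED"
def reserve_items(ctx: Context) -> None: ctx["reserved"] = True
def release_items(ctx: Context) -> None: ctx["reserved"] = False
def void_authorization(ctx: Context) -> None: ctx["authorized"] = False
def schedule_shipment(ctx: Context) -> None: ctx["shipment"] = "SCHEDULED"
def cancel_shipment(ctx: Context) -> None: ctx["shipment"] = "CANCELLED"
def confirm_order(ctx: Context) -> None: ctx["status"] = "CONFIRMED"

def authorize_payment(ctx: Context) -> None:
    if ctx.get("card_declined"):  # hypothetical flag used to simulate a failure
        raise RuntimeError("card declined")
    ctx["authorized"] = True

def build_order_saga() -> SagaOrchestrator:
    return SagaOrchestrator([
        SagaStep("create_order", create_order, cancel_order),
        SagaStep("reserve_items", reserve_items, release_items),
        SagaStep("authorize_payment", authorize_payment, void_authorization),
        SagaStep("schedule_shipment", schedule_shipment, cancel_shipment),
        SagaStep("confirm_order", confirm_order),  # final step, nothing to undo
    ])

declined_ctx: Context = {"card_declined": True}
declined_saga = build_order_saga()
declined_ok = declined_saga.execute(declined_ctx)

happy_ctx: Context = {}
happy_ok = build_order_saga().execute(happy_ctx)
```

With a declined card, the orchestrator releases the items and cancels the order, in that order, and never schedules a shipment. Keeping the step list in one place is what makes the orchestrated style easier to read and test than the equivalent set of event subscriptions.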


Trade-offs

Advantages

  • No distributed locks: Each service uses only local transactions — no cross-service locking
  • High availability: Services are not blocked waiting for each other (unlike 2PC)
  • Service autonomy: Each service owns its data and transactions independently
  • Works across heterogeneous systems: Services can use different databases, languages, and protocols

Disadvantages

  • Complexity: Designing compensating actions for every step is non-trivial (how do you "undo" sending an email?)
  • No isolation: Intermediate states are visible to other transactions. A concurrent query might see the order as PENDING with inventory reserved but payment not yet charged.
  • Compensation can fail: If a compensating action fails, you need manual intervention or a dead letter queue
  • Debugging difficulty: Tracing failures across multiple services and compensating actions requires good distributed tracing
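
One common way to handle the "compensation can fail" case is to retry with backoff and park the work in a dead letter queue when retries run out. The sketch below assumes an in-memory list as the dead letter queue and invents the compensate_with_retry helper and the two failing functions for illustration.

```python
import time
from typing import Callable, List, Tuple

dead_letters: List[Tuple[str, str]] = []  # (compensation name, last error)

def compensate_with_retry(
    name: str,
    compensate: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> bool:
    """Retry a compensation with exponential backoff; dead-letter it if all attempts fail."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate()
            return True
        except Exception as exc:
            last_error = str(exc)
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    # Out of retries: record the failure for manual or automated follow-up.
    dead_letters.append((name, last_error))
    return False

attempts = {"release_items": 0}

def flaky_release() -> None:
    # Fails on the first call only, like a transient network error.
    attempts["release_items"] += 1
    if attempts["release_items"] < 2:
        raise ConnectionError("inventory service unreachable")

def broken_void() -> None:
    raise ConnectionError("payment service unreachable")

released = compensate_with_retry("release_items", flaky_release)
voided = compensate_with_retry("void_authorization", broken_void)
```

Retrying is only safe if the compensation is idempotent, since a call that timed out may already have taken effect on the remote side.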

Common Misconceptions

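  • "Sagas provide ACID transactions": Sagas give up isolation and offer only eventual consistency. During execution, the system is in an intermediate state where some steps are committed and others are not, and other transactions can observe that state. Atomicity holds only in the weaker sense that the saga eventually either completes or is fully compensated. This is often summarized as ACD (Atomicity, Consistency, Durability) without Isolation.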

  • "Choreography is always better than orchestration" — Choreography is simpler for 2-3 services but becomes unmanageable for complex workflows. With 10 services and conditional branching, orchestration with a central coordinator is far easier to understand, test, and debug.

  • "Compensating actions perfectly undo the original action" — Some actions are not fully reversible. You cannot un-send a notification, un-ship a package, or un-charge a credit card (you can only issue a refund, which is different). Saga design must account for semantic reversibility.

  • "Sagas replace 2PC entirely" — For cases requiring strict atomicity (e.g., transferring money between two accounts that must always balance), sagas introduce temporary inconsistency. Some domains genuinely need 2PC or similar strong consistency guarantees.

How This Appears in Interviews

The saga pattern is a high-frequency interview topic for microservices design:

  • "Design an order processing system with multiple services" is the canonical saga question. Walk through the steps, compensating actions, and whether you choose choreography or orchestration.
  • "How do you handle distributed transactions in microservices?" — compare sagas, 2PC, and eventual consistency. Explain why sagas are preferred for most microservice architectures.
  • "What happens if a compensating action fails?" — discuss retry with idempotency, dead letter queues, and manual intervention dashboards.
  • "Choreography vs orchestration — when do you pick each?" — choreography for simple linear workflows between 2-3 services; orchestration for complex workflows with branching, parallel steps, or many services.
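
For the "retry with idempotency" answer above, it can help to show what an idempotent compensation looks like. This is a sketch with a hypothetical InventoryService and an in-memory set of seen keys; in a real service the key and the stock change would need to be committed in the same local transaction.

```python
from typing import Set

class InventoryService:
    """Hypothetical service whose compensation is safe to deliver more than once."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._seen_keys: Set[str] = set()

    def release_items(self, idempotency_key: str, qty: int) -> bool:
        """Return True if applied, False if this key was already processed."""
        if idempotency_key in self._seen_keys:
            return False  # duplicate delivery: do not add the stock back twice
        self.stock += qty
        self._seen_keys.add(idempotency_key)
        return True

inventory = InventoryService(stock=8)
first = inventory.release_items("order-42:release", qty=2)
duplicate = inventory.release_items("order-42:release", qty=2)
# inventory.stock is 10, not 12: the retried message had no further effect
```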
