Saga pattern for distributed transactions
A sequence of local transactions, each with a compensating action that undoes it if a later step fails.
A saga is how you run a multi-service transaction without distributed locks. You break the transaction into local steps. Each step has a "do" and an "undo." If any step fails, run the undos for the steps that already completed, in reverse order.
The pattern
Booking a trip needs a hotel, a flight, and a car. You can't run two-phase commit across three independent providers. So:
- Book hotel. If it fails, abort.
- Book flight. If it fails, cancel hotel.
- Book car. If it fails, cancel flight, then cancel hotel.
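The three steps above can be sketched as a tiny in-process runner. This is a minimal illustration, not a production orchestrator: `Step`, `run_saga`, and the in-memory "providers" are hypothetical names, and the sketch assumes a step that raises has left nothing behind to undo.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    do: Callable[[], None]    # the local transaction
    undo: Callable[[], None]  # its compensating action


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.do()
        except Exception:
            # Assumption: the step that raised left nothing behind,
            # so only the earlier, completed steps are undone.
            for done in reversed(completed):
                done.undo()
            return False
        completed.append(step)
    return True


# Hypothetical in-memory stand-ins for the three providers.
log: List[str] = []


def fail_car() -> None:
    raise RuntimeError("no cars available")


trip = [
    Step("hotel", lambda: log.append("book hotel"), lambda: log.append("cancel hotel")),
    Step("flight", lambda: log.append("book flight"), lambda: log.append("cancel flight")),
    Step("car", fail_car, lambda: log.append("cancel car")),
]

ok = run_saga(trip)
print(ok, log)
```

Here the car step fails, so the log ends with "cancel flight" followed by "cancel hotel", the reverse of the booking order.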
Two flavors
Orchestration. A central coordinator drives the saga. Each step is an HTTP call or message. The coordinator knows the full sequence and the compensations, which makes the flow easier to reason about. Use Temporal or AWS Step Functions; writing your own is possible but rarely worth it.
Choreography. Each service emits an event when done. The next service subscribes. No central coordinator. Looser coupling but harder to track the flow.
Default to orchestration. Choreography sounds elegant but debugging "where did the saga get stuck" is brutal without a central record.
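To make the choreography flavor concrete, here is a rough sketch using an in-memory event bus in place of a real broker. The `EventBus` class, the topic names, and the `flight_available` flag are all made up for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """Toy synchronous pub/sub; a real system would use a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # kept only so the demo can show the flow

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        self.history.append(topic)
        for handler in list(self._handlers[topic]):
            handler(payload)


bus = EventBus()


def on_trip_requested(evt: dict) -> None:
    # Hotel service: books the room, then announces it.
    bus.publish("hotel.booked", evt)


def on_hotel_booked(evt: dict) -> None:
    # Flight service: reacts to the hotel event, succeeds or fails.
    topic = "flight.booked" if evt["flight_available"] else "flight.failed"
    bus.publish(topic, evt)


def on_flight_failed(evt: dict) -> None:
    # Hotel service again: compensates by cancelling its own booking.
    bus.publish("hotel.cancelled", evt)


bus.subscribe("trip.requested", on_trip_requested)
bus.subscribe("hotel.booked", on_hotel_booked)
bus.subscribe("flight.failed", on_flight_failed)

bus.publish("trip.requested", {"trip_id": "t-1", "flight_available": False})
print(bus.history)
```

No function here knows the whole sequence; each handler only knows which event it reacts to. The `history` list exists purely so the sketch can print the flow. A real choreographed system has no such single record unless you build one, which is exactly the debugging problem described above.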
Compensations are not rollbacks
A DB rollback erases the write. A compensation runs a new transaction that undoes the effect. The history shows both. Examples:
- Payment charge -> refund. The bank statement shows charge then refund.
- Send email -> send apology email. The first email was still sent.
- Reserve inventory -> release inventory. Inventory count returns to original.
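The payment example can be shown with an append-only ledger. This is a toy sketch with a hypothetical `Ledger` class: the refund is a new entry, so the net effect is zero while the history still shows both events.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Ledger:
    # Append-only list of (kind, signed amount in cents).
    entries: List[Tuple[str, int]] = field(default_factory=list)

    def charge(self, cents: int) -> None:
        self.entries.append(("charge", cents))

    def refund(self, cents: int) -> None:
        # Compensation: a new entry, not a deletion of the charge.
        self.entries.append(("refund", -cents))

    def balance(self) -> int:
        return sum(amount for _, amount in self.entries)


ledger = Ledger()
ledger.charge(12_000)
ledger.refund(12_000)
print(ledger.balance(), ledger.entries)
```

A database rollback would leave `entries` empty; the compensation leaves two entries that sum to zero.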
Some actions cannot be compensated: launching a missile, posting to a public timeline. A saga is the wrong tool for these, because there is no "undo" to run.
Failures during compensation
What if the compensation itself fails? The saga is stuck. Options:
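- Retry with backoff until it succeeds. This only works for transient failures, and the compensation must be safe to run more than once.
- Manual intervention. Alert humans. This is the saga equivalent of a dead-letter queue (DLQ).
- Forward recovery. Instead of undoing, retry or repair the failed step so the saga can run to completion.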
Production sagas need a "stuck saga" inbox that humans review.
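One way to combine the first two options is to retry a compensation a bounded number of times and then park the saga for humans. The sketch below assumes this combination rather than truly retrying forever; `compensate_with_retry` and the `stuck_sagas` list are hypothetical stand-ins for a real retry policy and a real review inbox.

```python
import time
from typing import Callable, List

stuck_sagas: List[str] = []  # hypothetical "stuck saga" inbox


def compensate_with_retry(
    saga_id: str,
    undo: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry `undo` with exponential backoff; park the saga if it never succeeds."""
    for attempt in range(max_attempts):
        try:
            undo()
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    stuck_sagas.append(saga_id)  # hand off to humans
    return False


attempts = {"n": 0}


def flaky_cancel() -> None:
    # Fails twice with a transient error, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("provider timeout")


def broken_cancel() -> None:
    raise RuntimeError("booking not found")


no_sleep = lambda seconds: None  # skip real waiting in the demo
recovered = compensate_with_retry("saga-1", flaky_cancel, sleep=no_sleep)
parked = compensate_with_retry("saga-2", broken_cancel, max_attempts=3, sleep=no_sleep)
print(recovered, parked, stuck_sagas)
```

The transient failure recovers on the third attempt; the permanent one ends up in the inbox instead of looping silently.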
Workflow engines
Don't write a saga orchestrator from scratch. Use:
- Temporal. The gold standard. Code your workflow as normal functions; the engine handles persistence and retries, and you write the compensations as ordinary code in the workflow.
- AWS Step Functions. Managed. JSON state machine.
- Cadence. Temporal's predecessor, still used.
These engines persist workflow state in their own DB. Workflow code resumes from where it crashed.
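The idea of resuming after a crash can be illustrated with a toy journal. This is not how any of these engines is implemented; it is only a sketch of the concept, with a hypothetical `run_durable` function that records finished step names in a JSON file and skips them on the next run.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple


def run_durable(journal_path: str, steps: List[Tuple[str, Callable[[], None]]]) -> None:
    """Run steps in order, skipping any already recorded in the journal."""
    done: List[str] = []
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            done = json.load(f)
    for name, action in steps:
        if name in done:
            continue  # finished before the crash; do not repeat it
        action()
        done.append(name)
        with open(journal_path, "w") as f:
            json.dump(done, f)


calls: List[str] = []
crash = {"flight": True}


def book_hotel() -> None:
    calls.append("hotel")


def book_flight() -> None:
    if crash["flight"]:
        crash["flight"] = False
        raise RuntimeError("process crashed")  # simulated crash
    calls.append("flight")


steps = [("hotel", book_hotel), ("flight", book_flight)]
journal = os.path.join(tempfile.mkdtemp(), "saga.json")

try:
    run_durable(journal, steps)  # first run dies at the flight step
except RuntimeError:
    pass
run_durable(journal, steps)      # second run resumes at the flight step
print(calls)
```

The hotel is booked once, not twice. Note the gap between running a step and writing the journal: a crash there would repeat the step on resume, which is one reason steps should be idempotent.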
My rule
If a business operation spans 3+ services with rollback semantics, use a workflow engine and define compensations. If 2 services, often a synchronous call with try/catch is enough. Don't reach for saga until you actually have a multi-step distributed transaction.
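For the two-service case, the "synchronous call with try/catch" might look like the sketch below. `place_order` and the three callables are hypothetical; in practice they would be calls to the two services.

```python
from typing import Callable, List


def place_order(
    reserve: Callable[[], None],
    charge: Callable[[], None],
    release: Callable[[], None],
) -> bool:
    """Reserve stock, then charge; release the stock if the charge fails."""
    reserve()
    try:
        charge()
    except Exception:
        release()  # the only compensation needed
        return False
    return True


events: List[str] = []


def declined() -> None:
    raise RuntimeError("card declined")


placed = place_order(
    reserve=lambda: events.append("reserve"),
    charge=declined,
    release=lambda: events.append("release"),
)
print(placed, events)
```

With one step to undo, there is no sequence to track, so a workflow engine adds little.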
Learn more
- Article. Microservices.io: Saga pattern. Chris Richardson.
- Paper
- Docs. Temporal: workflow as saga orchestrator. Temporal docs.