Saga pattern for distributed transactions
Why 2PC failed, how sagas replace it, orchestration vs choreography, compensations that don't lie, and why you should use Temporal.
A saga is what you reach for when you need a transaction across multiple services and you cannot use a database transaction because the services don't share a database. The pattern originally comes from a 1987 paper by Garcia-Molina and Salem about long-running database transactions. It was rediscovered for microservices in the 2010s, and it has become the canonical answer to "how do we do distributed transactions."
The honest answer is: you don't, not in the ACID sense. You build something that looks transactional from the user's perspective using compensations and careful orchestration. That something is the saga.
Why 2PC doesn't work in microservices
Two-phase commit (2PC) in theory gives you atomic commits across multiple services. In phase 1, the coordinator asks all participants "can you commit?" Each participant prepares and replies yes or no. In phase 2, if all said yes, the coordinator says "commit"; otherwise it says "abort."
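As a rough illustration of the two phases, a toy in-process coordinator might look like the sketch below. The Participant class and its will_vote_yes flag are invented for the example; a real participant would take locks and persist a prepare record before voting.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Toy 2PC participant; will_vote_yes stands in for real prepare logic."""
    name: str
    will_vote_yes: bool = True
    state: str = "idle"

    def prepare(self) -> bool:
        # A real participant would acquire locks and write a prepare record here.
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the single global decision.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

Between the two loops, every participant that voted yes sits in the prepared state waiting for the coordinator, which is the window the problems below are about.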
Problems:
- Blocking. Participants must hold locks from the moment they vote yes until the coordinator's decision arrives. If the coordinator dies mid-protocol, they keep holding those locks until it recovers, which may be never.
- Liveness. A slow participant blocks everyone.
- Not supported. Most modern services (Kafka, REST APIs, SaaS APIs like Stripe) don't speak 2PC. Even where databases support XA transactions, they are widely avoided in practice.
- Performance. The separate prepare and commit round trips roughly double the latency.
2PC works in tightly coupled internal systems (a single bank with XA-aware DBs). It does not work for "we use Stripe for payments, Twilio for SMS, our internal API for orders."
What a saga actually is
A saga is a sequence of local transactions T1, T2, ..., Tn. Each Ti has a compensating transaction Ci that semantically undoes Ti.
If T1...Tk succeed but Tk+1 fails, run Ck, Ck-1, ..., C1 in reverse.
The whole thing is NOT atomic. Intermediate states are observable. You trade strong consistency for the ability to span services.
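The definition above can be sketched as a small in-memory executor. This is a minimal illustration, not a production design: Step, SagaFailed, and run_saga are names invented here, and nothing is persisted.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Step:
    name: str
    action: Callable[[], Any]          # Ti: a local transaction
    compensate: Callable[[Any], None]  # Ci: semantically undoes Ti, given Ti's result


class SagaFailed(Exception):
    pass


def run_saga(steps: list[Step]) -> list[Any]:
    completed: list[tuple[Step, Any]] = []
    for step in steps:
        try:
            result = step.action()
        except Exception as exc:
            # T(k+1) failed: run Ck, ..., C1 in reverse order.
            for done, done_result in reversed(completed):
                done.compensate(done_result)
            raise SagaFailed(f"step {step.name!r} failed") from exc
        completed.append((step, result))
    return [result for _, result in completed]
```

Between a step succeeding and a later compensation running, other readers can see the intermediate state, which is exactly the non-atomicity described above.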
Compensations vs rollbacks
A rollback erases history. A compensation creates new history that semantically undoes the original action.
Examples:
| Action | Compensation | History |
|---|---|---|
| Charge $50 | Refund $50 | Statement: -$50, +$50 |
| Send email | Send apology | Inbox: both emails |
| Reserve inventory | Release inventory | Count returns to original |
| Create user | Soft-delete user | DB: user with deleted_at |
Some actions cannot be compensated:
- Sending a physical letter.
- Posting to a public Twitter timeline.
- Firing a missile.
For these, compensation doesn't work. You need different approaches: explicit user confirmation steps, delayed execution windows (for example, Stripe can place an authorization hold on a card and capture it later, typically within about 7 days), or just accepting that this step cannot be undone and structuring the saga to do it last.
Orchestration
A central orchestrator drives the saga. It knows the sequence of steps, when to invoke compensations, and stores the saga state.
    class BookingOrchestrator:
        # hotel_service, flight_service and car_service are clients defined elsewhere.
        def execute(self, trip):
            try:
                hotel_id = hotel_service.book(trip.dates)
                self.state.record("hotel", hotel_id)
                try:
                    flight_id = flight_service.book(trip.dates)
                    self.state.record("flight", flight_id)
                    try:
                        car_id = car_service.book(trip.dates)
                        self.state.record("car", car_id)
                        return True  # success
                    except Exception:
                        flight_service.cancel(flight_id)
                        raise
                except Exception:
                    hotel_service.cancel(hotel_id)
                    raise
            except Exception:
                return False  # failure

This naive code is terrible. It can't survive a process crash. The orchestrator state lives in memory; restart loses it.
Real orchestration persists state at every step. Hence workflow engines.
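To make "persists state at every step" concrete, here is a minimal sketch that records each completed step in a table and skips recorded steps on restart. SagaLog, run_durable, and the saga_steps table are hypothetical names, and sqlite3 is only a stand-in for whatever durable store the orchestrator really uses.

```python
import sqlite3
from typing import Callable


class SagaLog:
    """Record of completed steps; sqlite3 stands in for a real durable store."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS saga_steps ("
            "saga_id TEXT, step TEXT, result TEXT, PRIMARY KEY (saga_id, step))"
        )

    def completed(self, saga_id: str) -> dict[str, str]:
        rows = self.conn.execute(
            "SELECT step, result FROM saga_steps WHERE saga_id = ?", (saga_id,)
        )
        return dict(rows.fetchall())

    def record(self, saga_id: str, step: str, result: str) -> None:
        with self.conn:  # commit before moving on to the next step
            self.conn.execute(
                "INSERT INTO saga_steps VALUES (?, ?, ?)", (saga_id, step, result)
            )


def run_durable(
    log: SagaLog, saga_id: str, steps: list[tuple[str, Callable[[], str]]]
) -> dict[str, str]:
    done = log.completed(saga_id)
    for name, action in steps:
        if name in done:
            continue  # finished before the restart; do not run it again
        done[name] = action()
        log.record(saga_id, name, done[name])
    return done
```

A crash between action() and record() still re-runs that one step after restart, so each step has to tolerate being executed twice. Handling that gap, plus retries, timers, and compensation bookkeeping, is the work a workflow engine takes off your hands.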
Choreography
No central orchestrator. Each service emits events. Other services react.
- Hotel service: receives booking request, books, emits hotel.booked.
- Flight service: subscribes to hotel.booked, books, emits flight.booked or flight.failed.
- Hotel service: subscribes to flight.failed, cancels.

    Order created -> Hotel books -> Flight books -> Car books
                          |               |              |
                          v               v              v
                On failure of any: cascading cancel events
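The event flow above can be sketched with an in-memory bus. This is only an illustration: EventBus and wire_services are invented for the example, delivery is synchronous, and the booking.requested and hotel.cancelled event names are assumptions added to complete the flow.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """In-memory stand-in for a message broker; delivery here is synchronous."""

    def __init__(self) -> None:
        self.handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.history: list[str] = []

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        self.history.append(topic)
        for handler in list(self.handlers[topic]):
            handler(payload)


def wire_services(bus: EventBus, flights_available: bool) -> None:
    def hotel_book(evt: dict) -> None:
        bus.publish("hotel.booked", {**evt, "hotel_id": "H1"})

    def flight_book(evt: dict) -> None:
        if flights_available:
            bus.publish("flight.booked", {**evt, "flight_id": "F1"})
        else:
            bus.publish("flight.failed", evt)

    def hotel_cancel(evt: dict) -> None:
        # The hotel service compensates its own booking when the flight fails.
        bus.publish("hotel.cancelled", {"hotel_id": evt["hotel_id"]})

    bus.subscribe("booking.requested", hotel_book)
    bus.subscribe("hotel.booked", flight_book)
    bus.subscribe("flight.failed", hotel_cancel)
```

Note that no single function describes the whole saga: the sequence only exists as the combination of subscriptions, which is the reasoning problem listed under the cons below.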
Pros:
- Loose coupling. No coordinator.
- Each service is autonomous.
Cons:
- Hard to reason about the flow. The saga isn't anywhere; it's distributed across N services.
- Hard to track. "Where is saga X stuck?" requires checking every service.
- Failure handling is duplicated across services.
- Cyclic dependencies easy to create accidentally.
Choreography sounds elegant. In practice it becomes a tangled mess at >3 services. Use orchestration.
Workflow engines
Don't write a saga orchestrator from scratch. Use a workflow engine.
Temporal
Code your workflow as normal functions. The engine handles persistence and retries; you write the compensation logic as ordinary try/except blocks.
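    from datetime import timedelta

    from temporalio import workflow
    from temporalio.common import RetryPolicy

    # Illustrative options. Activities need a timeout, and without a bounded retry
    # policy a failing activity is retried by default and the except branches never run.
    OPTS = dict(
        start_to_close_timeout=timedelta(seconds=30),
        retry_policy=RetryPolicy(maximum_attempts=3),
    )

    # book_* and cancel_* are activities defined elsewhere; Trip is the result type.
    @workflow.defn
    class BookTripWorkflow:
        @workflow.run
        async def run(self, trip):
            hotel_id = await workflow.execute_activity(book_hotel, trip, **OPTS)
            try:
                flight_id = await workflow.execute_activity(book_flight, trip, **OPTS)
            except Exception:
                await workflow.execute_activity(cancel_hotel, hotel_id, **OPTS)
                raise
            try:
                car_id = await workflow.execute_activity(book_car, trip, **OPTS)
            except Exception:
                await workflow.execute_activity(cancel_flight, flight_id, **OPTS)
                await workflow.execute_activity(cancel_hotel, hotel_id, **OPTS)
                raise
            return Trip(hotel_id, flight_id, car_id)

The engine persists workflow state at every step. If the worker crashes mid-saga, another worker resumes from the last checkpoint. The saga survives infrastructure failures.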
AWS Step Functions
Define the saga as a state machine in JSON. AWS handles execution. Good for AWS-native shops.
Cadence, Conductor, Camunda
Other workflow engines. Temporal, which began as a fork of Cadence by its original creators, is a common choice for new projects.
I default to Temporal. The programming model is close to regular code (workflow functions must be deterministic), plus survivability.
Idempotency in sagas
Every step in a saga must be idempotent. The engine WILL retry steps. The same step might execute twice during failover.
If book_hotel is not idempotent, a saga retry creates two reservations.
Implement idempotency:
- Pass an idempotency key derived from the saga ID + step name.
- The hotel service uses the key to dedup.
Temporal makes this easy: workflow ID + activity ID is automatically stable across retries. Pass it as the idempotency key downstream.
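A minimal sketch of both halves, assuming a hypothetical HotelService that accepts a caller-supplied key. The hashing scheme and the in-memory dedup table are illustrative choices, not something a particular engine or API prescribes.

```python
import hashlib


def idempotency_key(saga_id: str, step_name: str) -> str:
    # Same saga + same step always yields the same key, across retries and workers.
    return hashlib.sha256(f"{saga_id}:{step_name}".encode()).hexdigest()


class HotelService:
    """Hypothetical downstream service that dedups on the caller-supplied key."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}
        self.reservations: list[str] = []

    def book(self, key: str, dates: str) -> str:
        if key in self._by_key:
            return self._by_key[key]  # retry: return the original reservation
        reservation_id = f"res-{len(self.reservations) + 1}"
        self.reservations.append(reservation_id)
        self._by_key[key] = reservation_id
        return reservation_id
```

In a real service the key lookup and the reservation write would need to share one database transaction (or a unique constraint on the key), otherwise two concurrent retries can both pass the check.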
Stuck sagas
What if a compensation itself fails permanently? The saga is stuck in an inconsistent state. Options:
- Retry with exponential backoff. For transient failures (network, downstream blip).
- Alert humans. For permanent failures, park the saga in a dead-letter queue (DLQ) and surface it on a dashboard of stuck sagas for human triage.
- Forward-recovery. Instead of compensating, fix forward. "Couldn't refund the card. Issue a credit instead."
- Accept the inconsistency. Sometimes the wrong answer is the only answer. Log it, move on.
Production sagas always have a stuck-saga workflow with human escalation.
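The first two options combine naturally: retry the compensation with backoff, then hand it to humans. The sketch below assumes a plain list as the dead-letter queue and an injectable sleep function; the function name, parameters, and defaults are all hypothetical.

```python
import time
from typing import Callable


def compensate_with_escalation(
    compensation: Callable[[], None],
    dead_letters: list[dict],
    saga_id: str,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    last_error = ""
    for attempt in range(max_attempts):
        try:
            compensation()
            return True
        except Exception as exc:
            last_error = repr(exc)
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    # Retries exhausted: park the saga for human triage instead of dropping it.
    dead_letters.append({"saga_id": saga_id, "error": last_error})
    return False
```

A production version would likely add jitter and distinguish errors that are known to be permanent, so that those skip the retries and escalate immediately.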
Semantic locks
Sometimes you need to prevent other operations from acting on resources that are mid-saga.
Pattern: when T1 runs, set a status field. When the saga completes or compensates, clear it.
    UPDATE accounts SET status = 'transfer_in_progress' WHERE id = $1;

Other operations check status before acting. This acts as an application-level lock without actually holding a DB lock for the duration of the saga.
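One detail worth sketching: if two sagas run the plain UPDATE at the same time, both believe they own the account. Making the update conditional and checking the affected row count closes that gap. The example below uses sqlite3 as a stand-in database and assumes an 'active' status for unlocked accounts; both are assumptions for illustration.

```python
import sqlite3


def try_lock_account(conn: sqlite3.Connection, account_id: int) -> bool:
    # Conditional update: succeeds only if no other saga has marked the row.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET status = 'transfer_in_progress' "
            "WHERE id = ? AND status = 'active'",
            (account_id,),
        )
    return cur.rowcount == 1


def unlock_account(conn: sqlite3.Connection, account_id: int) -> None:
    # Called when the saga completes or finishes compensating.
    with conn:
        conn.execute(
            "UPDATE accounts SET status = 'active' "
            "WHERE id = ? AND status = 'transfer_in_progress'",
            (account_id,),
        )
```

A saga that fails to take the lock can either fail fast or wait and retry; which one is right is a business decision.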
Sagas vs other patterns
Saga vs outbox. Outbox solves dual-write (DB + event bus). Saga solves multi-service transactions. Often used together: each saga step uses outbox to atomically update DB + emit "step done" event.
Saga vs event sourcing. Orthogonal. You can use saga without event sourcing. Event sourcing makes it easier to replay saga state.
Saga vs 2PC. 2PC = strong consistency, blocking, requires participant cooperation. Saga = eventual consistency, non-blocking, works across any participant.
Real example: order processing
User places an order:
- Reserve inventory.
- Charge payment.
- Create shipment.
- Send confirmation email.
If charge fails: release inventory. If shipment creation fails: refund payment, release inventory. If email fails: not critical, log and continue. Don't compensate.
Note the asymmetry: not every failure compensates everything. Email failure shouldn't refund a successful charge. The compensation graph is part of the business logic, not just mechanical.
Code in Temporal:
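    from datetime import timedelta

    from temporalio import workflow
    from temporalio.common import RetryPolicy

    async def ex(activity, arg):
        # Shorthand for execute_activity. Timeout and retry limit are illustrative;
        # retries must be bounded or a failing step never reaches its except branch.
        return await workflow.execute_activity(
            activity,
            arg,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

    # The activities and the Order result type are defined elsewhere.
    @workflow.defn
    class OrderWorkflow:
        @workflow.run
        async def run(self, order):
            reservation_id = await ex(reserve_inventory, order)
            try:
                charge_id = await ex(charge_payment, order)
                try:
                    shipment_id = await ex(create_shipment, order)
                except Exception:
                    await ex(refund_payment, charge_id)
                    raise
            except Exception:
                await ex(release_inventory, reservation_id)
                raise
            # Email is best-effort: log and continue, never compensate.
            try:
                await ex(send_email, order)
            except Exception:
                workflow.logger.warning("email failed")
            return Order(reservation_id, charge_id, shipment_id)

This is the kind of code Temporal makes manageable. Without an engine, the persistence and retry logic would dwarf the business logic.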
When NOT to use a saga
- Single service. Just use a local DB transaction.
- Two services with no failure consequence. A try/catch is fine.
- Operations that are not compensable. Sending a missile is not a saga step.
- You haven't thought through what each compensation actually means. Compensations are business decisions; saga adds infrastructure to enforce them.
A saga is overkill for many situations where engineers reach for it. "Distributed transaction" is a buzzword. The actual question is "what consistency does this business need." Often it is much weaker than ACID, and a simple async event with idempotency is enough.
Observability
For each saga:
- Current state. Which step is running.
- Latency per step. Where time is spent.
- Success/failure rate. Overall and per step.
- Compensation triggers. When and why sagas roll back.
- Stuck sagas count. Should be zero or near-zero. Alert otherwise.
Temporal has a UI showing the live state of every workflow. Use it.
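If you are not on an engine with a built-in UI, most of the list above can be derived from per-saga records. The sketch below assumes a hypothetical SagaRecord shape and status vocabulary, and leaves out per-step latency, which would need timestamps on each step.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class SagaRecord:
    saga_id: str
    status: str        # "running", "completed", "compensated", or "stuck"
    current_step: str
    failed_step: Optional[str] = None  # the step whose failure triggered compensation


def summarize(records: list[SagaRecord]) -> dict:
    finished = [r for r in records if r.status in ("completed", "compensated")]
    return {
        "by_status": dict(Counter(r.status for r in records)),
        "success_rate": (
            sum(r.status == "completed" for r in finished) / len(finished)
            if finished
            else None
        ),
        "compensations_by_step": dict(
            Counter(r.failed_step for r in records if r.failed_step)
        ),
        "stuck": [r.saga_id for r in records if r.status == "stuck"],
    }
```

The stuck list is the one to alert on; the other numbers are for dashboards and trend analysis.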
What I would tell a junior engineer
Sagas are not optional in microservices; they are the cost of choosing microservices. If you have a multi-service business operation that must be atomic from the user's view, you need a saga. Use a workflow engine - the homegrown ones always degenerate into half-implemented Temporals. And remember the saga itself is part of your domain model: the compensations are real business decisions, not just rollbacks.
Learn more
- Paper: Sagas: original paper (1987) (Cornell)
- Article: Microservices.io: Saga (Chris Richardson)
- Docs: Temporal documentation (Temporal)
- Docs
- Article: Designing Data-Intensive Applications, ch 7 (Martin Kleppmann)