Part 1: SAGA Patterns - Fundamentals and Theory
Series Navigation: Index | Part 1 |
Part 2 - Choreography |
Part 3 - Orchestration |
Part 4 - Implementation |
Part 5 - Advanced |
Part 6 - Pitfalls |
Part 7 - Interview
Table of Contents
- The Problem This Pattern Solves
- A Quick Story to Set the Stage
- Understanding ACID Transactions
- The Monolith Era: When ACID Was Easy
- The Microservices Era: When ACID Became Hard
- Why You Cannot Just Use a Distributed Database
- CAP Theorem: The Fundamental Constraint
- BASE Properties: The Practical Alternative
- Two-Phase Commit (2PC): The Failed Solution
- Enter the SAGA Pattern
- How SAGA Works: The Core Mechanics
- Compensating Transactions: The Heart of SAGA
- Types of SAGA: Choreography vs Orchestration
- SAGA vs 2PC: A Direct Comparison
- When to Use SAGA and When Not To
- Trade-Offs You Must Accept
- Key Terminology Reference
- Summary
1. The Problem This Pattern Solves
Imagine you are building an e-commerce system. A customer places an order. Your system must:
- Create the order record
- Charge the customer's credit card
- Reserve items from inventory
- Schedule a delivery shipment
In a monolithic application, these four operations happen inside a single database transaction.
Either ALL succeed or ALL fail. No half-completed states. The database guarantees this.
Now imagine each of these steps runs in a SEPARATE microservice, each with its OWN database.
Order Service Payment Service Inventory Service Shipping Service
[orders DB] [payments DB] [inventory DB] [shipping DB]
(MySQL RDS #1) (MySQL RDS #2) (MySQL RDS #3) (MySQL RDS #4)
The Question: How do you guarantee that either ALL four steps succeed, or ALL four steps
are rolled back - when there is NO single database transaction spanning all four databases?
This is the distributed transaction problem. SAGA is the most practical solution.
2. A Quick Story to Set the Stage
Think of booking an international trip:
- You book a flight with Airline A
- You book a hotel in the destination city
- You book a rental car
- You buy travel insurance
If the rental car company suddenly has no cars available, what happens?
Option A (Strict atomicity): Cancel everything and refund everything. You wait hours for
refunds to appear on your credit card. You start again from scratch.
Option B (Compensating actions): The travel agent calls the airline and hotel, says
"the customer needs to cancel because the rental car failed." The airline marks the seat
available again. The hotel releases the room. This is compensation.
Option B is what a SAGA does. Not perfect rollback in a mathematical sense, but a series of
compensating actions that bring the system back to a consistent state.
3. Understanding ACID Transactions
Before understanding why SAGA exists, you must deeply understand ACID:
Atomicity
"All or nothing." Either every operation in the transaction succeeds, or none of them do.
If you crash in the middle, the database rolls back all partial changes.
BEGIN TRANSACTION;
INSERT INTO orders (id, status, amount) VALUES ('ORD-001', 'PENDING', 100.00);
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 'PROD-101';
INSERT INTO payments (order_id, amount, status) VALUES ('ORD-001', 100.00, 'CHARGED');
COMMIT; -- Either all three succeed, or none of them are persistedConsistency
The database moves from one valid state to another valid state. Constraints (foreign keys,
unique indexes, check constraints) are always honored.
Isolation
Concurrent transactions do not see each other's intermediate (uncommitted) data. One
transaction does not read dirty data from another in-progress transaction.
Durability
Once a transaction is committed, it persists even through crashes, power failures, or other
system failures. Data written to disk is permanent.
Why ACID Matters
ACID is a promise from your database that you can write code without worrying about:
- Partial writes
- Data corruption from concurrent access
- Lost updates
- Reads of temporary data
The problem is this promise ONLY works within a SINGLE database connection.
4. The Monolith Era: When ACID Was Easy
MONOLITH APPLICATION
+--------------------------------------------------+
| |
| Order Module -> Payment Module -> Inventory |
| |
| All share ONE database |
| |
+--------------------------------------------------+
|
v
+--------------------------------------------------+
| SINGLE MySQL DATABASE |
| orders | payments | inventory | shipping |
| All tables, one connection |
+--------------------------------------------------+
In a monolith with a single database, this code works perfectly:
@Transactional // Spring manages the transaction boundary
public void placeOrder(OrderRequest request) {
// Step 1: Create Order
Order order = orderRepository.save(new Order(request));
// Step 2: Process Payment (same DB, same transaction)
Payment payment = paymentRepository.save(new Payment(order));
// Step 3: Reserve Inventory (same DB, same transaction)
inventoryRepository.decrementStock(request.getProductId(), request.getQuantity());
// Step 4: Create Shipment (same DB, same transaction)
shipmentRepository.save(new Shipment(order));
// If ANY exception is thrown, Spring rolls back ALL of the above
// The database guarantees this with its transaction log
}This is beautiful. Simple. Safe. But it requires ALL operations to hit the SAME database.
5. The Microservices Era: When ACID Became Hard
When you split your monolith into microservices:
Order Service Payment Service Inventory Service
+--------------+ +---------------+ +----------------+
| Spring Boot | | Spring Boot | | Spring Boot |
| orders DB | | payments DB | | inventory DB |
| (MySQL RDS1) | | (MySQL RDS2) | | (MySQL RDS3) |
+--------------+ +---------------+ +----------------+
| | |
| REST or Kafka | |
+--------> call ------->+--------> call ------->+
The @Transactional annotation CANNOT span across these three separate databases.
They run on different machines, different JVMs, different database connections.
What actually happens when payment fails after order creation?
Timeline of a failed order:
T1: Order Service creates order (committed to orders DB - permanent!)
T2: Payment Service rejects payment (card declined)
T3: ??? The order record exists in the DB but payment failed
Result: DATABASE INCONSISTENCY
- orders DB shows order in PENDING state
- payments DB shows no payment
- inventory DB not touched
- Customer sees "order placed" but nothing will ship
This is the core problem. Data is permanently written to one database before we know
whether all other steps will succeed.
The Naive Solutions (And Why They All Fail)
Approach 1: Accept inconsistency and fix manually
- Result: Data corruption, angry customers, manual reconciliation jobs at 3am
Approach 2: Use one giant database shared by all services
- Violates microservices principle of data ownership
- Creates a shared database - the tightest form of coupling
- Kills independent deployability and scalability
Approach 3: Use distributed transactions (2PC)
- This sounds correct but fails badly in practice (see Section 9)
Approach 4: SAGA Pattern
- This is the correct answer for most cases
6. Why You Cannot Just Use a Distributed Database
"Why not use a NewSQL database like Google Spanner, CockroachDB, or YugabyteDB that
supports distributed ACID transactions?"
This is a valid question. Here is why it is usually not the answer:
Technical Reasons
| Problem | Explanation |
|---|---|
| Network latency | Cross-region transactions take 50-200ms per operation |
| Lock contention | Global locks across nodes dramatically reduce throughput |
| Vendor lock-in | You are tied to one database vendor for all services |
| Cost | Global distributed databases are expensive at scale |
| Operational complexity | Requires specialized expertise to operate |
Architectural Reasons
In microservices, services OWN their data. The entire point is that:
- Order Service owns the orders database
- Payment Service owns the payments database
- These are deployment and operational units that evolve independently
If both services share even the same distributed database schema, you have re-created
a distributed monolith - the worst of both worlds.
When Distributed Databases ARE the Answer
- When your entire system naturally fits one data model
- When you need strong consistency at global scale (financial systems, inventory)
- When you have a single team owning all services
- When operational complexity is manageable
For most microservices systems though, SAGA is the right answer.
7. CAP Theorem: The Fundamental Constraint
Eric Brewer's CAP theorem states that a distributed data store can provide at most
TWO of the following three guarantees simultaneously:
CONSISTENCY
/
/ <-- can only pick two
/ \
AVAILABILITY --------- PARTITION TOLERANCE
Consistency (C)
Every read receives the most recent write or an error.
All nodes see the same data at the same time.
Availability (A)
Every request receives a non-error response.
The system always responds (though the data may not be the latest).
Partition Tolerance (P)
The system continues to operate even when network messages between nodes are lost
or delayed.
The Practical Implication
Network partitions ARE going to happen in real distributed systems. Cables fail,
switches fail, network congestion causes timeouts. Therefore P is NOT optional.
You are always choosing between C and A:
-
CP systems (choose consistency over availability): Return an error when uncertain.
Examples: MySQL with synchronous replication, HBase, ZooKeeper. -
AP systems (choose availability over consistency): Return potentially stale data
but always respond. Examples: DynamoDB, Cassandra, CouchDB.
How This Relates to SAGA
SAGA is fundamentally an AP approach:
- Services remain available and process requests even during network issues
- Consistency is achieved EVENTUALLY through compensation
- During a saga execution, the system may be in an intermediate inconsistent state
- Final consistency is guaranteed only after all saga steps complete
8. BASE Properties: The Practical Alternative
BASE is the alternative to ACID for distributed systems:
Basically Available
The system guarantees availability. Responses are always given, though the data
might not be the most up-to-date.
Soft State
The state of the system may change over time, even without input.
During a saga, data is in a soft, transitional state.
Eventually Consistent
The system will BECOME consistent over time, given that it receives no new input.
After a saga completes (successfully or via compensation), all services will reflect
a consistent state.
ACID vs BASE Comparison
| Property | ACID | BASE |
|---|---|---|
| Consistency | Strong (immediate) | Eventual (over time) |
| Isolation | Full (reads see committed data only) | Partial (reads may see uncommitted saga state) |
| Rollback | Database handles it | Application handles via compensation |
| Atomicity | Database guarantees | Application guarantees through compensating transactions |
| Suitable for | Single-database operations | Cross-service, multi-database operations |
| Complexity | Low (DB handles it) | High (application must manage) |
| Performance | Lower (locks) | Higher (no cross-service locks) |
The Key Mental Shift
Moving from monolith to microservices requires accepting:
"I cannot have strong consistency across service boundaries.
I must design for eventual consistency and compensable failures."
This is not a weakness. It is a design choice that enables scale, availability,
and independent evolution of services.
9. Two-Phase Commit (2PC): The Failed Solution
Before SAGAs became popular, engineers tried to solve this with 2PC (Two-Phase Commit).
Understanding why 2PC fails is crucial to appreciating SAGA.
How 2PC Works
2PC uses a Transaction Coordinator (TC) that manages two phases:
PHASE 1: PREPARE (Voting Phase)
==============<mark class="obsidian-highlight">
Transaction Coordinator (TC)
|
|-- "Can you commit?" --> Order Service --> "Yes, vote COMMIT"
|-- "Can you commit?" --> Payment Service --> "Yes, vote COMMIT"
|-- "Can you commit?" --> Inventory Service --> "Yes, vote COMMIT"
|
| (TC waits for all votes)
PHASE 2: COMMIT (Decision Phase)
</mark>==============
| (if ALL voted COMMIT)
|
|-- "COMMIT now!" --> Order Service --> Commits, releases locks
|-- "COMMIT now!" --> Payment Service --> Commits, releases locks
|-- "COMMIT now!" --> Inventory Service --> Commits, releases locks
|
| (if ANY voted ABORT)
|
|-- "ROLLBACK!" --> All services --> Roll back their changes
Why 2PC Fails in Microservices
Problem 1: Blocking Protocol
During Phase 1, all participants hold database locks waiting for the Phase 2 decision.
In a distributed system, this means locks can be held for seconds or minutes waiting
for a response from another service across the network.
Order Service holds locks on order records
Payment Service holds locks on payment records
Inventory Service holds locks on inventory records
All waiting for the TC to tell them to commit or rollback
= MASSIVE lock contention, terrible throughput
Problem 2: Single Point of Failure - The Coordinator
What happens if the Transaction Coordinator crashes AFTER Phase 1 but BEFORE Phase 2?
TC asks all participants: "Can you commit?"
All say: "Yes"
TC prepares to send commit decision...
TC CRASHES
+---- Order Service: locked, waiting for commit/rollback decision
All +---- Payment Service: locked, waiting
participants+---- Inventory Service: locked, waiting
They are stuck. They cannot commit on their own. They cannot rollback on their own.
They CANNOT release their locks.
= SYSTEM DEADLOCK until coordinator recovers
Problem 3: Network Partition After Prepare
TC sends COMMIT to Order Service -- OK
TC sends COMMIT to Payment Service -- NETWORK PARTITION
TC sends COMMIT to Inventory Service -- COMMIT received
Result:
- Order Service: COMMITTED
- Payment Service: UNKNOWN STATE (did it commit? did it not?)
- Inventory Service: COMMITTED
= SPLIT BRAIN INCONSISTENCY
Problem 4: Performance at Scale
2PC requires at minimum 2 round trips across the network for EVERY transaction.
In a system processing 10,000 transactions per second, this is catastrophic.
Problem 5: Heterogeneous Systems
2PC requires ALL participants to support the XA protocol. Most modern services use
message queues, HTTP APIs, and NoSQL stores that have no notion of XA.
The Verdict on 2PC
| Issue | Severity |
|---|---|
| Blocking under coordinator failure | Critical |
| Performance overhead | High |
| Scalability limitation | High |
| Limited ecosystem support | Medium |
| Operational complexity | High |
| Cloud-native incompatibility | High |
2PC works well within a single database system (MySQL uses it internally for
replication). It does NOT work well across independent microservices.
10. Enter the SAGA Pattern
The SAGA pattern was first described by Hector Garcia-Molina and Kenneth Salem in
their 1987 paper "SAGAS" (Princeton University, Department of Computer Science).
The original definition was for Long-Lived Transactions (LLTs) - transactions that
span minutes or hours and cannot hold database locks for that duration.
The Core Insight from the 1987 Paper:
A saga is a sequence of transactions T1, T2, ..., Tn that can be interleaved with
other transactions. Each Ti has a corresponding compensating transaction Ci that
semantically undoes the effect of Ti.
Modern Definition
A SAGA is a sequence of local transactions where:
- Each local transaction updates data within a SINGLE service/database
- Each local transaction publishes an event or sends a command to trigger the next step
- If any step fails, compensating transactions are executed in reverse order to undo
the effects of the preceding steps
SAGA HAPPY PATH:
T1 (Order Created) -> T2 (Payment Processed) -> T3 (Inventory Reserved) -> T4 (Shipment Scheduled)
| | | |
Commit Commit Commit Commit
SAGA COMPLETE
SAGA FAILURE PATH (failure at T3):
T1 (Order Created) -> T2 (Payment Processed) -> T3 (Inventory Reserve FAILS)
| | |
Commit Commit FAIL
|
v
C2 (Refund Payment) <- C1 (Cancel Order)
Commit Commit
SAGA COMPENSATED
What Makes SAGA Different from 2PC?
| Aspect | 2PC | SAGA |
|---|---|---|
| Consistency | Strong (atomic) | Eventual |
| Locking | Holds locks across network | No cross-service locks |
| Failure handling | Coordinator decides | Application compensates |
| Performance | Blocking | Non-blocking |
| Availability | Low (coordinator SPOF) | High |
| Complexity | Protocol complexity | Application logic complexity |
| Recovery | Automatic (coordinator) | Must design compensations |
SAGA trades strong atomicity for availability and performance. It accepts that
the system may be in an inconsistent state temporarily, and uses application-level
logic to restore consistency.
11. How SAGA Works: The Core Mechanics
The Sequence of Events
Let us trace through the e-commerce order saga:
STEP 1: Customer places order
---------------------------------
OrderService.createOrder()
- INSERT INTO orders (id, status) VALUES ('ORD-001', 'PENDING')
- COMMIT (local transaction completes, this is permanent)
- Publish OrderCreatedEvent to Kafka/SQS
STEP 2: Payment processing
---------------------------------
PaymentService receives OrderCreatedEvent
- Begin local transaction
- Call payment gateway API (external)
- INSERT INTO payments (order_id, amount, status) VALUES ('ORD-001', 99.99, 'CHARGED')
- COMMIT (local transaction completes)
- Publish PaymentProcessedEvent
STEP 3: Inventory reservation
---------------------------------
InventoryService receives PaymentProcessedEvent
- Begin local transaction
- SELECT FOR UPDATE on inventory row (check quantity)
- UPDATE inventory SET reserved = reserved + 1 WHERE product_id = 'PROD-101'
- INSERT INTO reservations (order_id, product_id, quantity) VALUES (...)
- COMMIT
- Publish InventoryReservedEvent
STEP 4: Shipping schedule
---------------------------------
ShippingService receives InventoryReservedEvent
- Begin local transaction
- INSERT INTO shipments (order_id, status) VALUES ('ORD-001', 'SCHEDULED')
- COMMIT
- Publish ShipmentScheduledEvent
SAGA COMPLETE: All 4 steps succeeded
---------------------------------
OrderService receives ShipmentScheduledEvent
- UPDATE orders SET status = 'CONFIRMED' WHERE id = 'ORD-001'
Failure Scenario
Failure at Step 3 (Inventory Out of Stock):
---------------------------------
InventoryService receives PaymentProcessedEvent
- Begin local transaction
- SELECT FOR UPDATE on inventory row
- Quantity available: 0 (out of stock!)
- ROLLBACK local transaction
- Publish InventoryReservationFailedEvent
COMPENSATION BEGINS (reverse order):
---------------------------------
Step C2: Payment Compensation
PaymentService receives InventoryReservationFailedEvent
- Begin local transaction
- UPDATE payments SET status = 'REFUNDED' WHERE order_id = 'ORD-001'
- Call payment gateway refund API
- COMMIT
- Publish PaymentRefundedEvent
Step C1: Order Compensation
OrderService receives PaymentRefundedEvent
- Begin local transaction
- UPDATE orders SET status = 'CANCELLED' WHERE id = 'ORD-001'
- COMMIT
- Publish OrderCancelledEvent
SAGA COMPENSATED: System is back to consistent state
All records updated to reflect cancellation
The Key Points
- Local transactions commit immediately - No holding for global outcome
- Compensation is NOT rollback - It is a new forward transaction that undoes effects
- The order DB has a CANCELLED order - Not an absent order record
- The payment DB has a REFUNDED record - Not a deleted record
- The audit trail is preserved - Every step is recorded permanently
12. Compensating Transactions: The Heart of SAGA
A compensating transaction is NOT the same as a database rollback. This distinction
is fundamental and frequently misunderstood.
Semantic vs Syntactic Undo
Database Rollback (Syntactic Undo):
- Removes all trace of the original operation
- Happens automatically before commit
- The original transaction never "happened" from the database's perspective
- No audit trail
Compensating Transaction (Semantic Undo):
- Creates a NEW forward transaction that reverses the EFFECTS of the original
- Executes AFTER the original transaction already committed
- Both the original action AND the compensation are permanently recorded
- Full audit trail preserved
- Must be designed explicitly by the developer
DATABASE ROLLBACK:
T1: INSERT INTO orders VALUES (...) <-- this insert
ROLLBACK <-- this erases the insert
Result: orders table unchanged, no record of attempted insert
SAGA COMPENSATION:
T1: INSERT INTO orders (id, status) VALUES ('ORD-001', 'PENDING') <-- commits
COMMIT
... time passes, other things happen ...
C1: UPDATE orders SET status = 'CANCELLED' WHERE id = 'ORD-001' <-- commits
COMMIT
Result: orders table has CANCELLED order, full audit trail preserved
Properties of Good Compensating Transactions
1. Idempotent
If a compensating transaction is executed multiple times (due to retries), the result
must be the same as if it was executed once. This is CRITICAL.
// BAD: Not idempotent - double compensation causes a double refund
public void refundPayment(String orderId) {
paymentGateway.refund(orderId); // if called twice, refunds twice!
}
// GOOD: Idempotent - safe to call multiple times
public void refundPayment(String orderId) {
Payment payment = paymentRepository.findByOrderId(orderId);
if (payment.getStatus() == PaymentStatus.REFUNDED) {
log.info("Payment {} already refunded, skipping", orderId);
return; // safe to return, already done
}
payment.setStatus(PaymentStatus.REFUNDED);
paymentGateway.refund(payment.getExternalRef());
paymentRepository.save(payment);
}2. Semantically Reversible
The compensation must undo the BUSINESS EFFECT, not just the database change.
Original: Charged customer's credit card $99.99
Compensation: Refunded $99.99 to customer's credit card
NOT: Deleted the payment record (that would be auditing fraud)
3. Retryable
Because network failures can happen during compensation, the compensation must
be safe to retry. (This is related to idempotency.)
4. Non-Failing (By Design)
Compensating transactions should not fail. If a compensation fails, you have
a bigger problem. Design compensations to always succeed (possibly with manual
intervention as a last resort).
Types of Transactions in a SAGA
Pivotal Transaction: The transaction after which you no longer need to
compensate. Once the pivotal transaction commits, the saga can only go forward.
In the order saga:
T1: Create Order (compensable: cancel order)
T2: Process Payment (compensable: refund payment)
T3: Reserve Inventory <-- THIS IS THE PIVOT TRANSACTION in many designs
T4: Schedule Shipment (retriable: shipment can be rescheduled)
T5: Confirm Order (retriable: status update can be retried)
Once inventory is reserved and payment processed, you WANT to fulfill the order.
The pivot is where the saga transitions from "can be undone" to "must complete forward"
Compensable Transactions: Transactions that have a corresponding compensation.
These are the transactions before the pivotal transaction.
Retriable Transactions: Transactions after the pivotal transaction that should
be retried until they succeed (not compensated).
Complete SAGA Model:
[Compensable Txns] -> [Pivot Txn] -> [Retriable Txns]
T1, T2, T3 -> T4 -> T5, T6
These can fail If T4 fails, These should
and trigger compensate always succeed
C3, C2, C1 T1, T2, T3 (with retries)
13. Types of SAGA: Choreography vs Orchestration
There are exactly TWO ways to coordinate a SAGA:
Choreography (Event-Driven)
Services communicate by publishing and listening to events. There is NO central
coordinator. Each service knows what events to react to and what events to emit.
CHOREOGRAPHY FLOW:
Order Service ---OrderCreatedEvent-----> Payment Service
|
PaymentProcessedEvent
|
Inventory Service
|
InventoryReservedEvent
|
Shipping Service
|
ShipmentCreatedEvent
|
Order Service (update to CONFIRMED)
COMPENSATION (payment fails):
Payment Service ---PaymentFailedEvent-----> Order Service (cancel order)
No one service "knows" the whole saga. Each service only knows:
- What events to subscribe to
- What local action to perform
- What event to emit on success or failure
Orchestration (Command-Driven)
A central Orchestrator service drives the saga. It sends commands to services
and receives success/failure responses. The orchestrator maintains state.
ORCHESTRATION FLOW:
SAGA ORCHESTRATOR
(knows all steps)
|
+-------------+-------------+
| | |
ProcessPaymentCmd ReserveInventoryCmd CreateShipmentCmd
| | |
Payment Svc Inventory Svc Shipping Svc
| | |
PaymentResult InventoryResult ShipmentResult
| | |
+-------------+-------------+
|
(orchestrator decides next step
or initiates compensation)
The orchestrator explicitly knows the entire saga flow and coordinates every step.
Quick Comparison
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Decentralized | Centralized |
| Service coupling | Loose (via events) | Tighter (commands + responses) |
| Visibility | Hard to trace flow | Easy to see saga state |
| Business logic | Distributed across services | Concentrated in orchestrator |
| Testing | Complex | Simpler |
| Debugging | Difficult | Straightforward |
| Scalability | Excellent | Good (orchestrator can scale) |
| Event proliferation | Can explode | Controlled |
| When to use | Simple, few services | Complex, many services |
Detailed implementations of both patterns are covered in Parts 2 and 3.
14. SAGA vs 2PC: A Direct Comparison
| Factor | SAGA | 2PC |
|---|---|---|
| Consistency model | Eventual | Strong |
| Performance | High (no distributed locks) | Low (blocking locks) |
| Availability | High (no coordinator SPOF) | Lower (coordinator dependency) |
| Scalability | Excellent | Poor |
| Network failures | Handled via compensation | Can cause deadlock |
| Heterogeneous systems | Works with any service type | Requires XA protocol support |
| Audit trail | Full trail of all operations | Only committed state |
| Complexity | Application logic | Protocol complexity |
| Cloud native | Yes | No |
| Suitable for microservices | Yes | No |
When Would You Still Use 2PC?
2PC makes sense WITHIN a single system (not across microservices):
- MySQL uses 2PC internally for its InnoDB storage engine
- A single service that writes to two data stores (rare, use Outbox instead)
- When absolute strong consistency is non-negotiable and performance is secondary
15. When to Use SAGA and When Not To
Use SAGA When:
-
Multiple services must update their databases as part of one business transaction
Example: Order, Payment, Inventory, Shipping must all update together. -
Long-running business processes span multiple services
Example: Insurance claim processing that involves 10 services over several days. -
Services use different databases and technologies
Example: Order service uses MySQL, Payment uses DynamoDB, Inventory uses Postgres. -
High throughput is required
Example: Processing 50,000 orders per second. 2PC would collapse under this load. -
System must remain available during partial failures
Example: Payment service goes down, but other services should keep running.
Do NOT Use SAGA When:
-
Strong consistency is an absolute business requirement
Example: Financial double-entry bookkeeping where debit and credit MUST be atomic.
Use a single service with a single transactional database instead. -
All data fits in one service/database
If order, payment, inventory all belong to one service - just use @Transactional. -
You are building a simple CRUD application
Over-engineering. Use standard transactional patterns. -
Operations are not compensable
Example: "Send an SMS notification." You cannot un-send an SMS. In this case,
place the SMS step LAST (after the pivot transaction) as a retriable transaction. -
Team lacks experience with distributed systems
SAGAs are complex. Build team competency before adopting in production.
The Decision Matrix
Does the operation span multiple service databases?
|
YES
|
v
Do all operations need to eventually be consistent?
|
YES
|
v
Can you design compensating transactions for each step?
|
YES
|
v
Is the team ready for eventual consistency in the application?
|
YES
|
v
USE SAGA PATTERN
If any answer is NO:
- Reconsider service boundaries (maybe they should be one service)
- Use synchronous request-response with retry
- Accept the strong consistency constraint and design around it
16. Trade-Offs You Must Accept
Trade-Off 1: Lack of Isolation
In a traditional database transaction, no other transaction sees your intermediate data.
In a SAGA, each local transaction commits and becomes visible immediately.
SAGA IN PROGRESS:
T1: Order PENDING (committed, visible to reads)
T2: Payment CHARGED (committed, visible to reads)
T3: (not started yet)
Another user queries order status: sees "PENDING" or partially complete data
The Impact: Users or other systems may see intermediate saga states.
The Mitigation:
- Semantic locking: Add a
saga_in_progressflag - Design UI/API to show "processing" state gracefully
- Use read models (CQRS) that only show final states
Trade-Off 2: Compensation is Business Logic
Compensating transactions are not free. Someone must design, implement, test, and maintain them.
Each new saga step requires a corresponding compensation. This doubles the code surface area.
The Mitigation: Design compensations upfront during system design, not as an afterthought.
Trade-Off 3: Eventual Consistency is Visible
Between a saga starting and completing, the system is in a state that may not match
any logically correct business state. Downstream systems reading data during this window
may get inconsistent results.
The Mitigation:
- Design read models carefully
- Use sagas events to update read models only after saga completion
- Communicate to business stakeholders that "processing" is a valid state
Trade-Off 4: Testing Complexity
Testing a saga requires testing the happy path AND every possible failure scenario:
- Failure at each step
- Failure during compensation
- Network timeouts
- Duplicate events (at-least-once delivery)
- Out-of-order events
The Mitigation: Testcontainers + thorough integration tests covering all paths.
Trade-Off 5: Observability Complexity
A single saga spans multiple services. Debugging requires correlating logs and events
across all services using a correlation ID.
The Mitigation: Distributed tracing (AWS X-Ray, OpenTelemetry), structured logging,
saga-specific dashboards.
17. Key Terminology Reference
| Term | Definition |
|---|---|
| SAGA | A sequence of local transactions coordinated to achieve a larger business goal |
| Local Transaction | A transaction within a single service/database |
| Compensating Transaction | A transaction that semantically undoes the effect of a previous transaction |
| Choreography | SAGA coordination via event publishing and subscription (no central coordinator) |
| Orchestration | SAGA coordination via a central orchestrator that drives all steps |
| Pivot Transaction | The last compensable transaction in a saga; after this point, only forward progress |
| Compensable Transaction | A transaction that has a corresponding compensation |
| Retriable Transaction | A transaction after the pivot that should be retried until it succeeds |
| Idempotent | An operation that produces the same result regardless of how many times it is executed |
| Eventual Consistency | A consistency model where the system will become consistent over time |
| SEC | Saga Execution Coordinator - the entity (service or mechanism) that tracks saga state |
| Dual-Write Problem | The inability to atomically write to a database AND publish an event |
| Outbox Pattern | A pattern that solves the dual-write problem using the database as an event buffer |
| Semantic Lock | A flag on a record indicating it is participating in a saga |
| Compensating Action | Business-level undo (not database rollback) |
| Saga State | The current status of a saga execution: STARTED, IN_PROGRESS, COMPLETED, COMPENSATING |
18. Summary
| Concept | Key Takeaway |
|---|---|
| The Problem | ACID transactions cannot span multiple service databases |
| Why 2PC Fails | Blocking, single point of failure, performance killer |
| CAP Theorem | Distributed systems must choose between Consistency and Availability when partitioned |
| BASE vs ACID | Accept eventual consistency for distributed systems, strong consistency for single services |
| SAGA Basics | Sequence of local transactions + compensating transactions for rollback |
| Compensation | A new forward transaction, not database rollback. Must be idempotent. |
| SAGA Types | Choreography (event-driven) or Orchestration (centralized) |
| Key Trade-Off | No isolation during saga execution; eventual consistency only |
| Use SAGA When | Multi-service transactions with high availability and throughput requirements |
| Do NOT Use When | Single-service data, strong consistency mandatory, operations not compensable |
What Comes Next
You now understand WHY SAGA exists and the theoretical foundations.
Next: Part 2 - Choreography Pattern
Learn how to implement the event-driven choreography approach with complete, production-ready
Spring Boot code, Kafka configuration, and AWS integration.
Series Navigation: Index | Part 1 |
Part 2 |
Part 3 |
Part 4 |
Part 5 |
Part 6 |
Part 7