← Back to Articles
6/6/2026Admin Post

saga demystified part1 fundamentals

Part 1: SAGA Patterns - Fundamentals and Theory

Series Navigation: Index | Part 1 |
Part 2 - Choreography |
Part 3 - Orchestration |
Part 4 - Implementation |
Part 5 - Advanced |
Part 6 - Pitfalls |
Part 7 - Interview


Table of Contents

  1. The Problem This Pattern Solves
  2. A Quick Story to Set the Stage
  3. Understanding ACID Transactions
  4. The Monolith Era: When ACID Was Easy
  5. The Microservices Era: When ACID Became Hard
  6. Why You Cannot Just Use a Distributed Database
  7. CAP Theorem: The Fundamental Constraint
  8. BASE Properties: The Practical Alternative
  9. Two-Phase Commit (2PC): The Failed Solution
  10. Enter the SAGA Pattern
  11. How SAGA Works: The Core Mechanics
  12. Compensating Transactions: The Heart of SAGA
  13. Types of SAGA: Choreography vs Orchestration
  14. SAGA vs 2PC: A Direct Comparison
  15. When to Use SAGA and When Not To
  16. Trade-Offs You Must Accept
  17. Key Terminology Reference
  18. Summary

1. The Problem This Pattern Solves

Imagine you are building an e-commerce system. A customer places an order. Your system must:

  1. Create the order record
  2. Charge the customer's credit card
  3. Reserve items from inventory
  4. Schedule a delivery shipment

In a monolithic application, these four operations happen inside a single database transaction.
Either ALL succeed or ALL fail. No half-completed states. The database guarantees this.

Now imagine each of these steps runs in a SEPARATE microservice, each with its OWN database.

Order Service    Payment Service    Inventory Service    Shipping Service
[orders DB]      [payments DB]      [inventory DB]       [shipping DB]
(MySQL RDS #1)   (MySQL RDS #2)     (MySQL RDS #3)       (MySQL RDS #4)

The Question: How do you guarantee that either ALL four steps succeed, or ALL four steps
are rolled back - when there is NO single database transaction spanning all four databases?

This is the distributed transaction problem. SAGA is the most practical solution.


2. A Quick Story to Set the Stage

Think of booking an international trip:

  • You book a flight with Airline A
  • You book a hotel in the destination city
  • You book a rental car
  • You buy travel insurance

If the rental car company suddenly has no cars available, what happens?

Option A (Strict atomicity): Cancel everything and refund everything. You wait hours for
refunds to appear on your credit card. You start again from scratch.

Option B (Compensating actions): The travel agent calls the airline and hotel, says
"the customer needs to cancel because the rental car failed." The airline marks the seat
available again. The hotel releases the room. This is compensation.

Option B is what a SAGA does. Not perfect rollback in a mathematical sense, but a series of
compensating actions that bring the system back to a consistent state.


3. Understanding ACID Transactions

Before understanding why SAGA exists, you must deeply understand ACID:

Atomicity

"All or nothing." Either every operation in the transaction succeeds, or none of them do.
If you crash in the middle, the database rolls back all partial changes.

BEGIN TRANSACTION;
  INSERT INTO orders (id, status, amount) VALUES ('ORD-001', 'PENDING', 100.00);
  UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 'PROD-101';
  INSERT INTO payments (order_id, amount, status) VALUES ('ORD-001', 100.00, 'CHARGED');
COMMIT;  -- Either all three succeed, or none of them are persisted

Consistency

The database moves from one valid state to another valid state. Constraints (foreign keys,
unique indexes, check constraints) are always honored.

Isolation

Concurrent transactions do not see each other's intermediate (uncommitted) data. One
transaction does not read dirty data from another in-progress transaction.

Durability

Once a transaction is committed, it persists even through crashes, power failures, or other
system failures. Data written to disk is permanent.

Why ACID Matters

ACID is a promise from your database that you can write code without worrying about:

  • Partial writes
  • Data corruption from concurrent access
  • Lost updates
  • Reads of temporary data

The problem is this promise ONLY works within a SINGLE database connection.


4. The Monolith Era: When ACID Was Easy

                MONOLITH APPLICATION
+--------------------------------------------------+
|                                                  |
|  Order Module -> Payment Module -> Inventory     |
|                                                  |
|         All share ONE database                   |
|                                                  |
+--------------------------------------------------+
                       |
                       v
+--------------------------------------------------+
|         SINGLE MySQL DATABASE                    |
|   orders | payments | inventory | shipping       |
|         All tables, one connection               |
+--------------------------------------------------+

In a monolith with a single database, this code works perfectly:

@Transactional  // Spring manages the transaction boundary
public void placeOrder(OrderRequest request) {
    // Step 1: Create Order
    Order order = orderRepository.save(new Order(request));
 
    // Step 2: Process Payment (same DB, same transaction)
    Payment payment = paymentRepository.save(new Payment(order));
 
    // Step 3: Reserve Inventory (same DB, same transaction)
    inventoryRepository.decrementStock(request.getProductId(), request.getQuantity());
 
    // Step 4: Create Shipment (same DB, same transaction)
    shipmentRepository.save(new Shipment(order));
 
    // If ANY exception is thrown, Spring rolls back ALL of the above
    // The database guarantees this with its transaction log
}

This is beautiful. Simple. Safe. But it requires ALL operations to hit the SAME database.


5. The Microservices Era: When ACID Became Hard

When you split your monolith into microservices:

Order Service          Payment Service        Inventory Service
+--------------+       +---------------+      +----------------+
| Spring Boot  |       | Spring Boot   |      | Spring Boot    |
| orders DB    |       | payments DB   |      | inventory DB   |
| (MySQL RDS1) |       | (MySQL RDS2)  |      | (MySQL RDS3)   |
+--------------+       +---------------+      +----------------+
      |                      |                       |
      |    REST or Kafka      |                       |
      +--------> call ------->+--------> call ------->+

The @Transactional annotation CANNOT span across these three separate databases.
They run on different machines, different JVMs, different database connections.

What actually happens when payment fails after order creation?

Timeline of a failed order:

T1: Order Service creates order (committed to orders DB - permanent!)
T2: Payment Service rejects payment (card declined)
T3: ??? The order record exists in the DB but payment failed

Result: DATABASE INCONSISTENCY
- orders DB shows order in PENDING state
- payments DB shows no payment
- inventory DB not touched
- Customer sees "order placed" but nothing will ship

This is the core problem. Data is permanently written to one database before we know
whether all other steps will succeed.

The Naive Solutions (And Why They All Fail)

Approach 1: Accept inconsistency and fix manually

  • Result: Data corruption, angry customers, manual reconciliation jobs at 3am

Approach 2: Use one giant database shared by all services

  • Violates microservices principle of data ownership
  • Creates a shared database - the tightest form of coupling
  • Kills independent deployability and scalability

Approach 3: Use distributed transactions (2PC)

  • This sounds correct but fails badly in practice (see Section 9)

Approach 4: SAGA Pattern

  • This is the correct answer for most cases

6. Why You Cannot Just Use a Distributed Database

"Why not use a NewSQL database like Google Spanner, CockroachDB, or YugabyteDB that
supports distributed ACID transactions?"

This is a valid question. Here is why it is usually not the answer:

Technical Reasons

ProblemExplanation
Network latencyCross-region transactions take 50-200ms per operation
Lock contentionGlobal locks across nodes dramatically reduce throughput
Vendor lock-inYou are tied to one database vendor for all services
CostGlobal distributed databases are expensive at scale
Operational complexityRequires specialized expertise to operate

Architectural Reasons

In microservices, services OWN their data. The entire point is that:

  • Order Service owns the orders database
  • Payment Service owns the payments database
  • These are deployment and operational units that evolve independently

If both services share even the same distributed database schema, you have re-created
a distributed monolith - the worst of both worlds.

When Distributed Databases ARE the Answer

  • When your entire system naturally fits one data model
  • When you need strong consistency at global scale (financial systems, inventory)
  • When you have a single team owning all services
  • When operational complexity is manageable

For most microservices systems though, SAGA is the right answer.


7. CAP Theorem: The Fundamental Constraint

Eric Brewer's CAP theorem states that a distributed data store can provide at most
TWO of the following three guarantees simultaneously:

           CONSISTENCY
          /
         /   <-- can only pick two
        /           \
AVAILABILITY --------- PARTITION TOLERANCE

Consistency (C)

Every read receives the most recent write or an error.
All nodes see the same data at the same time.

Availability (A)

Every request receives a non-error response.
The system always responds (though the data may not be the latest).

Partition Tolerance (P)

The system continues to operate even when network messages between nodes are lost
or delayed.

The Practical Implication

Network partitions ARE going to happen in real distributed systems. Cables fail,
switches fail, network congestion causes timeouts. Therefore P is NOT optional.

You are always choosing between C and A:

  • CP systems (choose consistency over availability): Return an error when uncertain.
    Examples: MySQL with synchronous replication, HBase, ZooKeeper.

  • AP systems (choose availability over consistency): Return potentially stale data
    but always respond. Examples: DynamoDB, Cassandra, CouchDB.

How This Relates to SAGA

SAGA is fundamentally an AP approach:

  • Services remain available and process requests even during network issues
  • Consistency is achieved EVENTUALLY through compensation
  • During a saga execution, the system may be in an intermediate inconsistent state
  • Final consistency is guaranteed only after all saga steps complete

8. BASE Properties: The Practical Alternative

BASE is the alternative to ACID for distributed systems:

Basically Available

The system guarantees availability. Responses are always given, though the data
might not be the most up-to-date.

Soft State

The state of the system may change over time, even without input.
During a saga, data is in a soft, transitional state.

Eventually Consistent

The system will BECOME consistent over time, given that it receives no new input.
After a saga completes (successfully or via compensation), all services will reflect
a consistent state.

ACID vs BASE Comparison

PropertyACIDBASE
ConsistencyStrong (immediate)Eventual (over time)
IsolationFull (reads see committed data only)Partial (reads may see uncommitted saga state)
RollbackDatabase handles itApplication handles via compensation
AtomicityDatabase guaranteesApplication guarantees through compensating transactions
Suitable forSingle-database operationsCross-service, multi-database operations
ComplexityLow (DB handles it)High (application must manage)
PerformanceLower (locks)Higher (no cross-service locks)

The Key Mental Shift

Moving from monolith to microservices requires accepting:

"I cannot have strong consistency across service boundaries.
I must design for eventual consistency and compensable failures."

This is not a weakness. It is a design choice that enables scale, availability,
and independent evolution of services.


9. Two-Phase Commit (2PC): The Failed Solution

Before SAGAs became popular, engineers tried to solve this with 2PC (Two-Phase Commit).
Understanding why 2PC fails is crucial to appreciating SAGA.

How 2PC Works

2PC uses a Transaction Coordinator (TC) that manages two phases:

PHASE 1: PREPARE (Voting Phase)
==============<mark class="obsidian-highlight">
Transaction Coordinator (TC)
         |
         |-- "Can you commit?" --> Order Service     --> "Yes, vote COMMIT"
         |-- "Can you commit?" --> Payment Service   --> "Yes, vote COMMIT"
         |-- "Can you commit?" --> Inventory Service --> "Yes, vote COMMIT"
         |
         | (TC waits for all votes)

PHASE 2: COMMIT (Decision Phase)
</mark>==============
         | (if ALL voted COMMIT)
         |
         |-- "COMMIT now!" --> Order Service     --> Commits, releases locks
         |-- "COMMIT now!" --> Payment Service   --> Commits, releases locks
         |-- "COMMIT now!" --> Inventory Service --> Commits, releases locks
         |
         | (if ANY voted ABORT)
         |
         |-- "ROLLBACK!" --> All services --> Roll back their changes

Why 2PC Fails in Microservices

Problem 1: Blocking Protocol
During Phase 1, all participants hold database locks waiting for the Phase 2 decision.
In a distributed system, this means locks can be held for seconds or minutes waiting
for a response from another service across the network.

Order Service holds locks on order records
Payment Service holds locks on payment records
Inventory Service holds locks on inventory records
All waiting for the TC to tell them to commit or rollback
= MASSIVE lock contention, terrible throughput

Problem 2: Single Point of Failure - The Coordinator

What happens if the Transaction Coordinator crashes AFTER Phase 1 but BEFORE Phase 2?

TC asks all participants: "Can you commit?"
All say: "Yes"
TC prepares to send commit decision...
TC CRASHES
            +---- Order Service: locked, waiting for commit/rollback decision
All         +---- Payment Service: locked, waiting
participants+---- Inventory Service: locked, waiting

They are stuck. They cannot commit on their own. They cannot rollback on their own.
They CANNOT release their locks.
= SYSTEM DEADLOCK until coordinator recovers

Problem 3: Network Partition After Prepare

TC sends COMMIT to Order Service -- OK
TC sends COMMIT to Payment Service -- NETWORK PARTITION
TC sends COMMIT to Inventory Service -- COMMIT received

Result:
- Order Service: COMMITTED
- Payment Service: UNKNOWN STATE (did it commit? did it not?)
- Inventory Service: COMMITTED
= SPLIT BRAIN INCONSISTENCY

Problem 4: Performance at Scale

2PC requires at minimum 2 round trips across the network for EVERY transaction.
In a system processing 10,000 transactions per second, this is catastrophic.

Problem 5: Heterogeneous Systems

2PC requires ALL participants to support the XA protocol. Most modern services use
message queues, HTTP APIs, and NoSQL stores that have no notion of XA.

The Verdict on 2PC

IssueSeverity
Blocking under coordinator failureCritical
Performance overheadHigh
Scalability limitationHigh
Limited ecosystem supportMedium
Operational complexityHigh
Cloud-native incompatibilityHigh

2PC works well within a single database system (MySQL uses it internally for
replication). It does NOT work well across independent microservices.


10. Enter the SAGA Pattern

The SAGA pattern was first described by Hector Garcia-Molina and Kenneth Salem in
their 1987 paper "SAGAS" (Princeton University, Department of Computer Science).

The original definition was for Long-Lived Transactions (LLTs) - transactions that
span minutes or hours and cannot hold database locks for that duration.

The Core Insight from the 1987 Paper:

A saga is a sequence of transactions T1, T2, ..., Tn that can be interleaved with
other transactions. Each Ti has a corresponding compensating transaction Ci that
semantically undoes the effect of Ti.

Modern Definition

A SAGA is a sequence of local transactions where:

  1. Each local transaction updates data within a SINGLE service/database
  2. Each local transaction publishes an event or sends a command to trigger the next step
  3. If any step fails, compensating transactions are executed in reverse order to undo
    the effects of the preceding steps
SAGA HAPPY PATH:
T1 (Order Created) -> T2 (Payment Processed) -> T3 (Inventory Reserved) -> T4 (Shipment Scheduled)
       |                      |                         |                          |
    Commit                  Commit                   Commit                     Commit
                                                                            SAGA COMPLETE

SAGA FAILURE PATH (failure at T3):
T1 (Order Created) -> T2 (Payment Processed) -> T3 (Inventory Reserve FAILS)
       |                      |                         |
    Commit                  Commit                    FAIL
                                                        |
                                                        v
                                         C2 (Refund Payment) <- C1 (Cancel Order)
                                              Commit                   Commit
                                                             SAGA COMPENSATED

What Makes SAGA Different from 2PC?

Aspect2PCSAGA
ConsistencyStrong (atomic)Eventual
LockingHolds locks across networkNo cross-service locks
Failure handlingCoordinator decidesApplication compensates
PerformanceBlockingNon-blocking
AvailabilityLow (coordinator SPOF)High
ComplexityProtocol complexityApplication logic complexity
RecoveryAutomatic (coordinator)Must design compensations

SAGA trades strong atomicity for availability and performance. It accepts that
the system may be in an inconsistent state temporarily, and uses application-level
logic to restore consistency.


11. How SAGA Works: The Core Mechanics

The Sequence of Events

Let us trace through the e-commerce order saga:

STEP 1: Customer places order
---------------------------------
OrderService.createOrder()
  - INSERT INTO orders (id, status) VALUES ('ORD-001', 'PENDING')
  - COMMIT (local transaction completes, this is permanent)
  - Publish OrderCreatedEvent to Kafka/SQS

STEP 2: Payment processing
---------------------------------
PaymentService receives OrderCreatedEvent
  - Begin local transaction
  - Call payment gateway API (external)
  - INSERT INTO payments (order_id, amount, status) VALUES ('ORD-001', 99.99, 'CHARGED')
  - COMMIT (local transaction completes)
  - Publish PaymentProcessedEvent

STEP 3: Inventory reservation
---------------------------------
InventoryService receives PaymentProcessedEvent
  - Begin local transaction
  - SELECT FOR UPDATE on inventory row (check quantity)
  - UPDATE inventory SET reserved = reserved + 1 WHERE product_id = 'PROD-101'
  - INSERT INTO reservations (order_id, product_id, quantity) VALUES (...)
  - COMMIT
  - Publish InventoryReservedEvent

STEP 4: Shipping schedule
---------------------------------
ShippingService receives InventoryReservedEvent
  - Begin local transaction
  - INSERT INTO shipments (order_id, status) VALUES ('ORD-001', 'SCHEDULED')
  - COMMIT
  - Publish ShipmentScheduledEvent

SAGA COMPLETE: All 4 steps succeeded
---------------------------------
OrderService receives ShipmentScheduledEvent
  - UPDATE orders SET status = 'CONFIRMED' WHERE id = 'ORD-001'

Failure Scenario

Failure at Step 3 (Inventory Out of Stock):
---------------------------------
InventoryService receives PaymentProcessedEvent
  - Begin local transaction
  - SELECT FOR UPDATE on inventory row
  - Quantity available: 0 (out of stock!)
  - ROLLBACK local transaction
  - Publish InventoryReservationFailedEvent

COMPENSATION BEGINS (reverse order):
---------------------------------
Step C2: Payment Compensation
  PaymentService receives InventoryReservationFailedEvent
  - Begin local transaction
  - UPDATE payments SET status = 'REFUNDED' WHERE order_id = 'ORD-001'
  - Call payment gateway refund API
  - COMMIT
  - Publish PaymentRefundedEvent

Step C1: Order Compensation
  OrderService receives PaymentRefundedEvent
  - Begin local transaction
  - UPDATE orders SET status = 'CANCELLED' WHERE id = 'ORD-001'
  - COMMIT
  - Publish OrderCancelledEvent

SAGA COMPENSATED: System is back to consistent state
All records updated to reflect cancellation

The Key Points

  1. Local transactions commit immediately - No holding for global outcome
  2. Compensation is NOT rollback - It is a new forward transaction that undoes effects
  3. The order DB has a CANCELLED order - Not an absent order record
  4. The payment DB has a REFUNDED record - Not a deleted record
  5. The audit trail is preserved - Every step is recorded permanently

12. Compensating Transactions: The Heart of SAGA

A compensating transaction is NOT the same as a database rollback. This distinction
is fundamental and frequently misunderstood.

Semantic vs Syntactic Undo

Database Rollback (Syntactic Undo):

  • Removes all trace of the original operation
  • Happens automatically before commit
  • The original transaction never "happened" from the database's perspective
  • No audit trail

Compensating Transaction (Semantic Undo):

  • Creates a NEW forward transaction that reverses the EFFECTS of the original
  • Executes AFTER the original transaction already committed
  • Both the original action AND the compensation are permanently recorded
  • Full audit trail preserved
  • Must be designed explicitly by the developer
DATABASE ROLLBACK:
  T1: INSERT INTO orders VALUES (...)  <-- this insert
  ROLLBACK                             <-- this erases the insert
  Result: orders table unchanged, no record of attempted insert

SAGA COMPENSATION:
  T1: INSERT INTO orders (id, status) VALUES ('ORD-001', 'PENDING')  <-- commits
  COMMIT
  ... time passes, other things happen ...
  C1: UPDATE orders SET status = 'CANCELLED' WHERE id = 'ORD-001'   <-- commits
  COMMIT
  Result: orders table has CANCELLED order, full audit trail preserved

Properties of Good Compensating Transactions

1. Idempotent
If a compensating transaction is executed multiple times (due to retries), the result
must be the same as if it was executed once. This is CRITICAL.

// BAD: Not idempotent - double compensation causes a double refund
public void refundPayment(String orderId) {
    paymentGateway.refund(orderId);  // if called twice, refunds twice!
}
 
// GOOD: Idempotent - safe to call multiple times
public void refundPayment(String orderId) {
    Payment payment = paymentRepository.findByOrderId(orderId);
    if (payment.getStatus() == PaymentStatus.REFUNDED) {
        log.info("Payment {} already refunded, skipping", orderId);
        return;  // safe to return, already done
    }
    payment.setStatus(PaymentStatus.REFUNDED);
    paymentGateway.refund(payment.getExternalRef());
    paymentRepository.save(payment);
}

2. Semantically Reversible
The compensation must undo the BUSINESS EFFECT, not just the database change.

Original: Charged customer's credit card $99.99
Compensation: Refunded $99.99 to customer's credit card
NOT: Deleted the payment record (that would be auditing fraud)

3. Retryable
Because network failures can happen during compensation, the compensation must
be safe to retry. (This is related to idempotency.)

4. Non-Failing (By Design)
Compensating transactions should not fail. If a compensation fails, you have
a bigger problem. Design compensations to always succeed (possibly with manual
intervention as a last resort).

Types of Transactions in a SAGA

Pivotal Transaction: The transaction after which you no longer need to
compensate. Once the pivotal transaction commits, the saga can only go forward.

In the order saga:
T1: Create Order  (compensable: cancel order)
T2: Process Payment (compensable: refund payment)
T3: Reserve Inventory  <-- THIS IS THE PIVOT TRANSACTION in many designs
T4: Schedule Shipment  (retriable: shipment can be rescheduled)
T5: Confirm Order  (retriable: status update can be retried)

Once inventory is reserved and payment processed, you WANT to fulfill the order.
The pivot is where the saga transitions from "can be undone" to "must complete forward"

Compensable Transactions: Transactions that have a corresponding compensation.
These are the transactions before the pivotal transaction.

Retriable Transactions: Transactions after the pivotal transaction that should
be retried until they succeed (not compensated).

Complete SAGA Model:
[Compensable Txns] -> [Pivot Txn] -> [Retriable Txns]
T1, T2, T3         ->   T4        ->    T5, T6
These can fail      If T4 fails,   These should
and trigger         compensate     always succeed
C3, C2, C1         T1, T2, T3     (with retries)

13. Types of SAGA: Choreography vs Orchestration

There are exactly TWO ways to coordinate a SAGA:

Choreography (Event-Driven)

Services communicate by publishing and listening to events. There is NO central
coordinator. Each service knows what events to react to and what events to emit.

CHOREOGRAPHY FLOW:

Order Service  ---OrderCreatedEvent----->  Payment Service
                                                |
                                   PaymentProcessedEvent
                                                |
                                   Inventory Service
                                                |
                                   InventoryReservedEvent
                                                |
                                   Shipping Service
                                                |
                                   ShipmentCreatedEvent
                                                |
                                   Order Service (update to CONFIRMED)

COMPENSATION (payment fails):

Payment Service  ---PaymentFailedEvent----->  Order Service (cancel order)

No one service "knows" the whole saga. Each service only knows:

  • What events to subscribe to
  • What local action to perform
  • What event to emit on success or failure

Orchestration (Command-Driven)

A central Orchestrator service drives the saga. It sends commands to services
and receives success/failure responses. The orchestrator maintains state.

ORCHESTRATION FLOW:

                   SAGA ORCHESTRATOR
                   (knows all steps)
                         |
           +-------------+-------------+
           |             |             |
  ProcessPaymentCmd  ReserveInventoryCmd  CreateShipmentCmd
           |             |             |
     Payment Svc    Inventory Svc   Shipping Svc
           |             |             |
  PaymentResult    InventoryResult  ShipmentResult
           |             |             |
           +-------------+-------------+
                         |
                   (orchestrator decides next step
                    or initiates compensation)

The orchestrator explicitly knows the entire saga flow and coordinates every step.

Quick Comparison

AspectChoreographyOrchestration
CoordinationDecentralizedCentralized
Service couplingLoose (via events)Tighter (commands + responses)
VisibilityHard to trace flowEasy to see saga state
Business logicDistributed across servicesConcentrated in orchestrator
TestingComplexSimpler
DebuggingDifficultStraightforward
ScalabilityExcellentGood (orchestrator can scale)
Event proliferationCan explodeControlled
When to useSimple, few servicesComplex, many services

Detailed implementations of both patterns are covered in Parts 2 and 3.


14. SAGA vs 2PC: A Direct Comparison

FactorSAGA2PC
Consistency modelEventualStrong
PerformanceHigh (no distributed locks)Low (blocking locks)
AvailabilityHigh (no coordinator SPOF)Lower (coordinator dependency)
ScalabilityExcellentPoor
Network failuresHandled via compensationCan cause deadlock
Heterogeneous systemsWorks with any service typeRequires XA protocol support
Audit trailFull trail of all operationsOnly committed state
ComplexityApplication logicProtocol complexity
Cloud nativeYesNo
Suitable for microservicesYesNo

When Would You Still Use 2PC?

2PC makes sense WITHIN a single system (not across microservices):

  • MySQL uses 2PC internally for its InnoDB storage engine
  • A single service that writes to two data stores (rare, use Outbox instead)
  • When absolute strong consistency is non-negotiable and performance is secondary

15. When to Use SAGA and When Not To

Use SAGA When:

  1. Multiple services must update their databases as part of one business transaction
    Example: Order, Payment, Inventory, Shipping must all update together.

  2. Long-running business processes span multiple services
    Example: Insurance claim processing that involves 10 services over several days.

  3. Services use different databases and technologies
    Example: Order service uses MySQL, Payment uses DynamoDB, Inventory uses Postgres.

  4. High throughput is required
    Example: Processing 50,000 orders per second. 2PC would collapse under this load.

  5. System must remain available during partial failures
    Example: Payment service goes down, but other services should keep running.

Do NOT Use SAGA When:

  1. Strong consistency is an absolute business requirement
    Example: Financial double-entry bookkeeping where debit and credit MUST be atomic.
    Use a single service with a single transactional database instead.

  2. All data fits in one service/database
    If order, payment, inventory all belong to one service - just use @Transactional.

  3. You are building a simple CRUD application
    Over-engineering. Use standard transactional patterns.

  4. Operations are not compensable
    Example: "Send an SMS notification." You cannot un-send an SMS. In this case,
    place the SMS step LAST (after the pivot transaction) as a retriable transaction.

  5. Team lacks experience with distributed systems
    SAGAs are complex. Build team competency before adopting in production.

The Decision Matrix

Does the operation span multiple service databases?
  |
  YES
  |
  v
Do all operations need to eventually be consistent?
  |
  YES
  |
  v
Can you design compensating transactions for each step?
  |
  YES
  |
  v
Is the team ready for eventual consistency in the application?
  |
  YES
  |
  v
USE SAGA PATTERN

If any answer is NO:
- Reconsider service boundaries (maybe they should be one service)
- Use synchronous request-response with retry
- Accept the strong consistency constraint and design around it

16. Trade-Offs You Must Accept

Trade-Off 1: Lack of Isolation

In a traditional database transaction, no other transaction sees your intermediate data.
In a SAGA, each local transaction commits and becomes visible immediately.

SAGA IN PROGRESS:
T1: Order PENDING (committed, visible to reads)
T2: Payment CHARGED (committed, visible to reads)
T3: (not started yet)

Another user queries order status: sees "PENDING" or partially complete data

The Impact: Users or other systems may see intermediate saga states.

The Mitigation:

  • Semantic locking: Add a saga_in_progress flag
  • Design UI/API to show "processing" state gracefully
  • Use read models (CQRS) that only show final states

Trade-Off 2: Compensation is Business Logic

Compensating transactions are not free. Someone must design, implement, test, and maintain them.
Each new saga step requires a corresponding compensation. This doubles the code surface area.

The Mitigation: Design compensations upfront during system design, not as an afterthought.

Trade-Off 3: Eventual Consistency is Visible

Between a saga starting and completing, the system is in a state that may not match
any logically correct business state. Downstream systems reading data during this window
may get inconsistent results.

The Mitigation:

  • Design read models carefully
  • Use sagas events to update read models only after saga completion
  • Communicate to business stakeholders that "processing" is a valid state

Trade-Off 4: Testing Complexity

Testing a saga requires testing the happy path AND every possible failure scenario:

  • Failure at each step
  • Failure during compensation
  • Network timeouts
  • Duplicate events (at-least-once delivery)
  • Out-of-order events

The Mitigation: Testcontainers + thorough integration tests covering all paths.

Trade-Off 5: Observability Complexity

A single saga spans multiple services. Debugging requires correlating logs and events
across all services using a correlation ID.

The Mitigation: Distributed tracing (AWS X-Ray, OpenTelemetry), structured logging,
saga-specific dashboards.


17. Key Terminology Reference

TermDefinition
SAGAA sequence of local transactions coordinated to achieve a larger business goal
Local TransactionA transaction within a single service/database
Compensating TransactionA transaction that semantically undoes the effect of a previous transaction
ChoreographySAGA coordination via event publishing and subscription (no central coordinator)
OrchestrationSAGA coordination via a central orchestrator that drives all steps
Pivot TransactionThe last compensable transaction in a saga; after this point, only forward progress
Compensable TransactionA transaction that has a corresponding compensation
Retriable TransactionA transaction after the pivot that should be retried until it succeeds
IdempotentAn operation that produces the same result regardless of how many times it is executed
Eventual ConsistencyA consistency model where the system will become consistent over time
SECSaga Execution Coordinator - the entity (service or mechanism) that tracks saga state
Dual-Write ProblemThe inability to atomically write to a database AND publish an event
Outbox PatternA pattern that solves the dual-write problem using the database as an event buffer
Semantic LockA flag on a record indicating it is participating in a saga
Compensating ActionBusiness-level undo (not database rollback)
Saga StateThe current status of a saga execution: STARTED, IN_PROGRESS, COMPLETED, COMPENSATING

18. Summary

ConceptKey Takeaway
The ProblemACID transactions cannot span multiple service databases
Why 2PC FailsBlocking, single point of failure, performance killer
CAP TheoremDistributed systems must choose between Consistency and Availability when partitioned
BASE vs ACIDAccept eventual consistency for distributed systems, strong consistency for single services
SAGA BasicsSequence of local transactions + compensating transactions for rollback
CompensationA new forward transaction, not database rollback. Must be idempotent.
SAGA TypesChoreography (event-driven) or Orchestration (centralized)
Key Trade-OffNo isolation during saga execution; eventual consistency only
Use SAGA WhenMulti-service transactions with high availability and throughput requirements
Do NOT Use WhenSingle-service data, strong consistency mandatory, operations not compensable

What Comes Next

You now understand WHY SAGA exists and the theoretical foundations.

Next: Part 2 - Choreography Pattern

Learn how to implement the event-driven choreography approach with complete, production-ready
Spring Boot code, Kafka configuration, and AWS integration.


Series Navigation: Index | Part 1 |
Part 2 |
Part 3 |
Part 4 |
Part 5 |
Part 6 |
Part 7