Part 1: SAGA Patterns - Fundamentals and Theory

The Problem This Pattern Solves
A Quick Story to Set the Stage
Understanding ACID Transactions
The Monolith Era: When ACID Was Easy
The Microservices Era: When ACID Became Hard
Why You Cannot Just Use a Distributed Database
CAP Theorem: The Fundamental Constraint
BASE Properties: The Practical Alternative
Two-Phase Commit (2PC): The Failed Solution
Enter the SAGA Pattern
How SAGA Works: The Core Mechanics
Compensating Transactions: The Heart of SAGA
Types of SAGA: Choreography vs Orchestration
SAGA vs 2PC: A Direct Comparison
When to Use SAGA and When Not To
Trade-Offs You Must Accept
Key Terminology Reference
Summary

1. The Problem This Pattern Solves

Imagine you are building an e-commerce system. A customer places an order. Your system must:

Create the order record
Charge the customer's credit card
Reserve items from inventory
Schedule a delivery shipment

In a monolithic application, these four operations happen inside a single database transaction.
Either ALL succeed or ALL fail. No half-completed states. The database guarantees this.

Now imagine each of these steps runs in a SEPARATE microservice, each with its OWN database.

Order Service    Payment Service    Inventory Service    Shipping Service
[orders DB]      [payments DB]      [inventory DB]       [shipping DB]
(MySQL RDS #1)   (MySQL RDS #2)     (MySQL RDS #3)       (MySQL RDS #4)

The Question: How do you guarantee that either ALL four steps succeed, or ALL four steps
are rolled back - when there is NO single database transaction spanning all four databases?

This is the distributed transaction problem. SAGA is the most practical solution.

2. A Quick Story to Set the Stage

Think of booking an international trip:

You book a flight with Airline A
You book a hotel in the destination city
You book a rental car
You buy travel insurance

If the rental car company suddenly has no cars available, what happens?

Option A (Strict atomicity): Cancel everything and refund everything. You wait hours for
refunds to appear on your credit card. You start again from scratch.

Option B (Compensating actions): The travel agent calls the airline and hotel, says
"the customer needs to cancel because the rental car failed." The airline marks the seat
available again. The hotel releases the room. This is compensation.

Option B is what a SAGA does. Not perfect rollback in a mathematical sense, but a series of
compensating actions that bring the system back to a consistent state.

3. Understanding ACID Transactions

Before understanding why SAGA exists, you must deeply understand ACID:

Atomicity

"All or nothing." Either every operation in the transaction succeeds, or none of them do.
If you crash in the middle, the database rolls back all partial changes.

BEGIN TRANSACTION;
  INSERT INTO orders (id, status, amount) VALUES ('ORD-001', 'PENDING', 100.00);
  UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 'PROD-101';
  INSERT INTO payments (order_id, amount, status) VALUES ('ORD-001', 100.00, 'CHARGED');
COMMIT;  -- Either all three succeed, or none of them are persisted

Consistency

The database moves from one valid state to another valid state. Constraints (foreign keys,
unique indexes, check constraints) are always honored.

Isolation

Concurrent transactions do not see each other's intermediate (uncommitted) data. One
transaction does not read dirty data from another in-progress transaction.

Durability

Once a transaction is committed, it persists even through crashes, power failures, or other
system failures. Data written to disk is permanent.

Why ACID Matters

ACID is a promise from your database that you can write code without worrying about:

Partial writes
Data corruption from concurrent access
Lost updates
Reads of temporary data

The problem is this promise ONLY works within a SINGLE database connection.

4. The Monolith Era: When ACID Was Easy

                MONOLITH APPLICATION
+--------------------------------------------------+
|                                                  |
|  Order Module -> Payment Module -> Inventory     |
|                                                  |
|         All share ONE database                   |
|                                                  |
+--------------------------------------------------+
                       |
                       v
+--------------------------------------------------+
|         SINGLE MySQL DATABASE                    |
|   orders | payments | inventory | shipping       |
|         All tables, one connection               |
+--------------------------------------------------+

In a monolith with a single database, this code works perfectly:

@Transactional  // Spring manages the transaction boundary
public void placeOrder(OrderRequest request) {
    // Step 1: Create Order
    Order order = orderRepository.save(new Order(request));
 
    // Step 2: Process Payment (same DB, same transaction)
    Payment payment = paymentRepository.save(new Payment(order));
 
    // Step 3: Reserve Inventory (same DB, same transaction)
    inventoryRepository.decrementStock(request.getProductId(), request.getQuantity());
 
    // Step 4: Create Shipment (same DB, same transaction)
    shipmentRepository.save(new Shipment(order));
 
    // If ANY exception is thrown, Spring rolls back ALL of the above
    // The database guarantees this with its transaction log
}

This is beautiful. Simple. Safe. But it requires ALL operations to hit the SAME database.

5. The Microservices Era: When ACID Became Hard

When you split your monolith into microservices:

Order Service          Payment Service        Inventory Service
+--------------+       +---------------+      +----------------+
| Spring Boot  |       | Spring Boot   |      | Spring Boot    |
| orders DB    |       | payments DB   |      | inventory DB   |
| (MySQL RDS1) |       | (MySQL RDS2)  |      | (MySQL RDS3)   |
+--------------+       +---------------+      +----------------+
      |                      |                       |
      |    REST or Kafka      |                       |
      +--------> call ------->+--------> call ------->+

The @Transactional annotation CANNOT span across these three separate databases.
They run on different machines, different JVMs, different database connections.

What actually happens when payment fails after order creation?

Timeline of a failed order:

T1: Order Service creates order (committed to orders DB - permanent!)
T2: Payment Service rejects payment (card declined)
T3: ??? The order record exists in the DB but payment failed

Result: DATABASE INCONSISTENCY
- orders DB shows order in PENDING state
- payments DB shows no payment
- inventory DB not touched
- Customer sees "order placed" but nothing will ship

This is the core problem. Data is permanently written to one database before we know
whether all other steps will succeed.

The Naive Solutions (And Why They All Fail)

Approach 1: Accept inconsistency and fix manually

Result: Data corruption, angry customers, manual reconciliation jobs at 3am

Approach 2: Use one giant database shared by all services

Violates microservices principle of data ownership
Creates a shared database - the tightest form of coupling
Kills independent deployability and scalability

Approach 3: Use distributed transactions (2PC)

This sounds correct but fails badly in practice (see Section 9)

Approach 4: SAGA Pattern

This is the correct answer for most cases

6. Why You Cannot Just Use a Distributed Database

"Why not use a NewSQL database like Google Spanner, CockroachDB, or YugabyteDB that
supports distributed ACID transactions?"

This is a valid question. Here is why it is usually not the answer:

Technical Reasons

Problem	Explanation
Network latency	Cross-region transactions take 50-200ms per operation
Lock contention	Global locks across nodes dramatically reduce throughput
Vendor lock-in	You are tied to one database vendor for all services
Cost	Global distributed databases are expensive at scale
Operational complexity	Requires specialized expertise to operate

Architectural Reasons

In microservices, services OWN their data. The entire point is that:

Order Service owns the orders database
Payment Service owns the payments database
These are deployment and operational units that evolve independently

If both services share even the same distributed database schema, you have re-created
a distributed monolith - the worst of both worlds.

When Distributed Databases ARE the Answer

When your entire system naturally fits one data model
When you need strong consistency at global scale (financial systems, inventory)
When you have a single team owning all services
When operational complexity is manageable

For most microservices systems though, SAGA is the right answer.

7. CAP Theorem: The Fundamental Constraint

Eric Brewer's CAP theorem states that a distributed data store can provide at most
TWO of the following three guarantees simultaneously:

           CONSISTENCY
          /
         /   <-- can only pick two
        /           \
AVAILABILITY --------- PARTITION TOLERANCE

Consistency (C)

Every read receives the most recent write or an error.
All nodes see the same data at the same time.

Availability (A)

Every request receives a non-error response.
The system always responds (though the data may not be the latest).

Partition Tolerance (P)

The system continues to operate even when network messages between nodes are lost
or delayed.

The Practical Implication

Network partitions ARE going to happen in real distributed systems. Cables fail,
switches fail, network congestion causes timeouts. Therefore P is NOT optional.

You are always choosing between C and A:

CP systems (choose consistency over availability): Return an error when uncertain.
Examples: MySQL with synchronous replication, HBase, ZooKeeper.
AP systems (choose availability over consistency): Return potentially stale data
but always respond. Examples: DynamoDB, Cassandra, CouchDB.

How This Relates to SAGA

SAGA is fundamentally an AP approach:

Services remain available and process requests even during network issues
Consistency is achieved EVENTUALLY through compensation
During a saga execution, the system may be in an intermediate inconsistent state
Final consistency is guaranteed only after all saga steps complete

8. BASE Properties: The Practical Alternative

BASE is the alternative to ACID for distributed systems:

Basically Available

The system guarantees availability. Responses are always given, though the data
might not be the most up-to-date.

Soft State

The state of the system may change over time, even without input.
During a saga, data is in a soft, transitional state.

Eventually Consistent

The system will BECOME consistent over time, given that it receives no new input.
After a saga completes (successfully or via compensation), all services will reflect
a consistent state.

ACID vs BASE Comparison

Property	ACID	BASE
Consistency	Strong (immediate)	Eventual (over time)
Isolation	Full (reads see committed data only)	Partial (reads may see uncommitted saga state)
Rollback	Database handles it	Application handles via compensation
Atomicity	Database guarantees	Application guarantees through compensating transactions
Suitable for	Single-database operations	Cross-service, multi-database operations
Complexity	Low (DB handles it)	High (application must manage)
Performance	Lower (locks)	Higher (no cross-service locks)

The Key Mental Shift

Moving from monolith to microservices requires accepting:

"I cannot have strong consistency across service boundaries.
I must design for eventual consistency and compensable failures."

This is not a weakness. It is a design choice that enables scale, availability,
and independent evolution of services.

9. Two-Phase Commit (2PC): The Failed Solution

Before SAGAs became popular, engineers tried to solve this with 2PC (Two-Phase Commit).
Understanding why 2PC fails is crucial to appreciating SAGA.

How 2PC Works

2PC uses a Transaction Coordinator (TC) that manages two phases:

PHASE 1: PREPARE (Voting Phase)
==============<mark class="obsidian-highlight">
Transaction Coordinator (TC)
         |
         |-- "Can you commit?" --> Order Service     --> "Yes, vote COMMIT"
         |-- "Can you commit?" --> Payment Service   --> "Yes, vote COMMIT"
         |-- "Can you commit?" --> Inventory Service --> "Yes, vote COMMIT"
         |
         | (TC waits for all votes)

PHASE 2: COMMIT (Decision Phase)
</mark>==============
         | (if ALL voted COMMIT)
         |
         |-- "COMMIT now!" --> Order Service     --> Commits, releases locks
         |-- "COMMIT now!" --> Payment Service   --> Commits, releases locks
         |-- "COMMIT now!" --> Inventory Service --> Commits, releases locks
         |
         | (if ANY voted ABORT)
         |
         |-- "ROLLBACK!" --> All services --> Roll back their changes

Why 2PC Fails in Microservices

Problem 1: Blocking Protocol
During Phase 1, all participants hold database locks waiting for the Phase 2 decision.
In a distributed system, this means locks can be held for seconds or minutes waiting
for a response from another service across the network.

Order Service holds locks on order records
Payment Service holds locks on payment records
Inventory Service holds locks on inventory records
All waiting for the TC to tell them to commit or rollback
= MASSIVE lock contention, terrible throughput

Problem 2: Single Point of Failure - The Coordinator

What happens if the Transaction Coordinator crashes AFTER Phase 1 but BEFORE Phase 2?

TC asks all participants: "Can you commit?"
All say: "Yes"
TC prepares to send commit decision...
TC CRASHES
            +---- Order Service: locked, waiting for commit/rollback decision
All         +---- Payment Service: locked, waiting
participants+---- Inventory Service: locked, waiting

They are stuck. They cannot commit on their own. They cannot rollback on their own.
They CANNOT release their locks.
= SYSTEM DEADLOCK until coordinator recovers

Problem 3: Network Partition After Prepare

TC sends COMMIT to Order Service -- OK
TC sends COMMIT to Payment Service -- NETWORK PARTITION
TC sends COMMIT to Inventory Service -- COMMIT received

Result:
- Order Service: COMMITTED
- Payment Service: UNKNOWN STATE (did it commit? did it not?)
- Inventory Service: COMMITTED
= SPLIT BRAIN INCONSISTENCY

Problem 4: Performance at Scale

2PC requires at minimum 2 round trips across the network for EVERY transaction.
In a system processing 10,000 transactions per second, this is catastrophic.

Problem 5: Heterogeneous Systems

2PC requires ALL participants to support the XA protocol. Most modern services use
message queues, HTTP APIs, and NoSQL stores that have no notion of XA.

The Verdict on 2PC

Issue	Severity
Blocking under coordinator failure	Critical
Performance overhead	High
Scalability limitation	High
Limited ecosystem support	Medium
Operational complexity	High
Cloud-native incompatibility	High

2PC works well within a single database system (MySQL uses it internally for
replication). It does NOT work well across independent microservices.

10. Enter the SAGA Pattern

The SAGA pattern was first described by Hector Garcia-Molina and Kenneth Salem in
their 1987 paper "SAGAS" (Princeton University, Department of Computer Science).

The original definition was for Long-Lived Transactions (LLTs) - transactions that
span minutes or hours and cannot hold database locks for that duration.

The Core Insight from the 1987 Paper:

A saga is a sequence of transactions T1, T2, ..., Tn that can be interleaved with
other transactions. Each Ti has a corresponding compensating transaction Ci that
semantically undoes the effect of Ti.

Modern Definition

A SAGA is a sequence of local transactions where:

Each local transaction updates data within a SINGLE service/database
Each local transaction publishes an event or sends a command to trigger the next step
If any step fails, compensating transactions are executed in reverse order to undo
the effects of the preceding steps

SAGA HAPPY PATH:
T1 (Order Created) -> T2 (Payment Processed) -> T3 (Inventory Reserved) -> T4 (Shipment Scheduled)
       |                      |                         |                          |
    Commit                  Commit                   Commit                     Commit
                                                                            SAGA COMPLETE

SAGA FAILURE PATH (failure at T3):
T1 (Order Created) -> T2 (Payment Processed) -> T3 (Inventory Reserve FAILS)
       |                      |                         |
    Commit                  Commit                    FAIL
                                                        |
                                                        v
                                         C2 (Refund Payment) <- C1 (Cancel Order)
                                              Commit                   Commit
                                                             SAGA COMPENSATED

What Makes SAGA Different from 2PC?

Aspect	2PC	SAGA
Consistency	Strong (atomic)	Eventual
Locking	Holds locks across network	No cross-service locks
Failure handling	Coordinator decides	Application compensates
Performance	Blocking	Non-blocking
Availability	Low (coordinator SPOF)	High
Complexity	Protocol complexity	Application logic complexity
Recovery	Automatic (coordinator)	Must design compensations

SAGA trades strong atomicity for availability and performance. It accepts that
the system may be in an inconsistent state temporarily, and uses application-level
logic to restore consistency.

11. How SAGA Works: The Core Mechanics

The Sequence of Events

Let us trace through the e-commerce order saga:

STEP 1: Customer places order
---------------------------------
OrderService.createOrder()
  - INSERT INTO orders (id, status) VALUES ('ORD-001', 'PENDING')
  - COMMIT (local transaction completes, this is permanent)
  - Publish OrderCreatedEvent to Kafka/SQS

STEP 2: Payment processing
---------------------------------
PaymentService receives OrderCreatedEvent
  - Begin local transaction
  - Call payment gateway API (external)
  - INSERT INTO payments (order_id, amount, status) VALUES ('ORD-001', 99.99, 'CHARGED')
  - COMMIT (local transaction completes)
  - Publish PaymentProcessedEvent

STEP 3: Inventory reservation
---------------------------------
InventoryService receives PaymentProcessedEvent
  - Begin local transaction
  - SELECT FOR UPDATE on inventory row (check quantity)
  - UPDATE inventory SET reserved = reserved + 1 WHERE product_id = 'PROD-101'
  - INSERT INTO reservations (order_id, product_id, quantity) VALUES (...)
  - COMMIT
  - Publish InventoryReservedEvent

STEP 4: Shipping schedule
---------------------------------
ShippingService receives InventoryReservedEvent
  - Begin local transaction
  - INSERT INTO shipments (order_id, status) VALUES ('ORD-001', 'SCHEDULED')
  - COMMIT
  - Publish ShipmentScheduledEvent

SAGA COMPLETE: All 4 steps succeeded
---------------------------------
OrderService receives ShipmentScheduledEvent
  - UPDATE orders SET status = 'CONFIRMED' WHERE id = 'ORD-001'

Failure Scenario

Failure at Step 3 (Inventory Out of Stock):
---------------------------------
InventoryService receives PaymentProcessedEvent
  - Begin local transaction
  - SELECT FOR UPDATE on inventory row
  - Quantity available: 0 (out of stock!)
  - ROLLBACK local transaction
  - Publish InventoryReservationFailedEvent

COMPENSATION BEGINS (reverse order):
---------------------------------
Step C2: Payment Compensation
  PaymentService receives InventoryReservationFailedEvent
  - Begin local transaction
  - UPDATE payments SET status = 'REFUNDED' WHERE order_id = 'ORD-001'
  - Call payment gateway refund API
  - COMMIT
  - Publish PaymentRefundedEvent

Step C1: Order Compensation
  OrderService receives PaymentRefundedEvent
  - Begin local transaction
  - UPDATE orders SET status = 'CANCELLED' WHERE id = 'ORD-001'
  - COMMIT
  - Publish OrderCancelledEvent

SAGA COMPENSATED: System is back to consistent state
All records updated to reflect cancellation

The Key Points

Local transactions commit immediately - No holding for global outcome
Compensation is NOT rollback - It is a new forward transaction that undoes effects
The order DB has a CANCELLED order - Not an absent order record
The payment DB has a REFUNDED record - Not a deleted record
The audit trail is preserved - Every step is recorded permanently

12. Compensating Transactions: The Heart of SAGA

A compensating transaction is NOT the same as a database rollback. This distinction
is fundamental and frequently misunderstood.

Semantic vs Syntactic Undo

Database Rollback (Syntactic Undo):

Removes all trace of the original operation
Happens automatically before commit
The original transaction never "happened" from the database's perspective
No audit trail

Compensating Transaction (Semantic Undo):

Creates a NEW forward transaction that reverses the EFFECTS of the original
Executes AFTER the original transaction already committed
Both the original action AND the compensation are permanently recorded
Full audit trail preserved
Must be designed explicitly by the developer

DATABASE ROLLBACK:
  T1: INSERT INTO orders VALUES (...)  <-- this insert
  ROLLBACK                             <-- this erases the insert
  Result: orders table unchanged, no record of attempted insert

SAGA COMPENSATION:
  T1: INSERT INTO orders (id, status) VALUES ('ORD-001', 'PENDING')  <-- commits
  COMMIT
  ... time passes, other things happen ...
  C1: UPDATE orders SET status = 'CANCELLED' WHERE id = 'ORD-001'   <-- commits
  COMMIT
  Result: orders table has CANCELLED order, full audit trail preserved

Properties of Good Compensating Transactions

1. Idempotent
If a compensating transaction is executed multiple times (due to retries), the result
must be the same as if it was executed once. This is CRITICAL.

// BAD: Not idempotent - double compensation causes a double refund
public void refundPayment(String orderId) {
    paymentGateway.refund(orderId);  // if called twice, refunds twice!
}
 
// GOOD: Idempotent - safe to call multiple times
public void refundPayment(String orderId) {
    Payment payment = paymentRepository.findByOrderId(orderId);
    if (payment.getStatus() == PaymentStatus.REFUNDED) {
        log.info("Payment {} already refunded, skipping", orderId);
        return;  // safe to return, already done
    }
    payment.setStatus(PaymentStatus.REFUNDED);
    paymentGateway.refund(payment.getExternalRef());
    paymentRepository.save(payment);
}

2. Semantically Reversible
The compensation must undo the BUSINESS EFFECT, not just the database change.

Original: Charged customer's credit card $99.99
Compensation: Refunded $99.99 to customer's credit card
NOT: Deleted the payment record (that would be auditing fraud)

3. Retryable
Because network failures can happen during compensation, the compensation must
be safe to retry. (This is related to idempotency.)

4. Non-Failing (By Design)
Compensating transactions should not fail. If a compensation fails, you have
a bigger problem. Design compensations to always succeed (possibly with manual
intervention as a last resort).

Types of Transactions in a SAGA

Pivotal Transaction: The transaction after which you no longer need to
compensate. Once the pivotal transaction commits, the saga can only go forward.

In the order saga:
T1: Create Order  (compensable: cancel order)
T2: Process Payment (compensable: refund payment)
T3: Reserve Inventory  <-- THIS IS THE PIVOT TRANSACTION in many designs
T4: Schedule Shipment  (retriable: shipment can be rescheduled)
T5: Confirm Order  (retriable: status update can be retried)

Once inventory is reserved and payment processed, you WANT to fulfill the order.
The pivot is where the saga transitions from "can be undone" to "must complete forward"

Compensable Transactions: Transactions that have a corresponding compensation.
These are the transactions before the pivotal transaction.

Retriable Transactions: Transactions after the pivotal transaction that should
be retried until they succeed (not compensated).

Complete SAGA Model:
[Compensable Txns] -> [Pivot Txn] -> [Retriable Txns]
T1, T2, T3         ->   T4        ->    T5, T6
These can fail      If T4 fails,   These should
and trigger         compensate     always succeed
C3, C2, C1         T1, T2, T3     (with retries)

13. Types of SAGA: Choreography vs Orchestration

There are exactly TWO ways to coordinate a SAGA:

Choreography (Event-Driven)

Services communicate by publishing and listening to events. There is NO central
coordinator. Each service knows what events to react to and what events to emit.

CHOREOGRAPHY FLOW:

Order Service  ---OrderCreatedEvent----->  Payment Service
                                                |
                                   PaymentProcessedEvent
                                                |
                                   Inventory Service
                                                |
                                   InventoryReservedEvent
                                                |
                                   Shipping Service
                                                |
                                   ShipmentCreatedEvent
                                                |
                                   Order Service (update to CONFIRMED)

COMPENSATION (payment fails):

Payment Service  ---PaymentFailedEvent----->  Order Service (cancel order)

No one service "knows" the whole saga. Each service only knows:

What events to subscribe to
What local action to perform
What event to emit on success or failure

Orchestration (Command-Driven)

A central Orchestrator service drives the saga. It sends commands to services
and receives success/failure responses. The orchestrator maintains state.

ORCHESTRATION FLOW:

                   SAGA ORCHESTRATOR
                   (knows all steps)
                         |
           +-------------+-------------+
           |             |             |
  ProcessPaymentCmd  ReserveInventoryCmd  CreateShipmentCmd
           |             |             |
     Payment Svc    Inventory Svc   Shipping Svc
           |             |             |
  PaymentResult    InventoryResult  ShipmentResult
           |             |             |
           +-------------+-------------+
                         |
                   (orchestrator decides next step
                    or initiates compensation)

The orchestrator explicitly knows the entire saga flow and coordinates every step.

Quick Comparison

Aspect	Choreography	Orchestration
Coordination	Decentralized	Centralized
Service coupling	Loose (via events)	Tighter (commands + responses)
Visibility	Hard to trace flow	Easy to see saga state
Business logic	Distributed across services	Concentrated in orchestrator
Testing	Complex	Simpler
Debugging	Difficult	Straightforward
Scalability	Excellent	Good (orchestrator can scale)
Event proliferation	Can explode	Controlled
When to use	Simple, few services	Complex, many services

Detailed implementations of both patterns are covered in Parts 2 and 3.

14. SAGA vs 2PC: A Direct Comparison

Factor	SAGA	2PC
Consistency model	Eventual	Strong
Performance	High (no distributed locks)	Low (blocking locks)
Availability	High (no coordinator SPOF)	Lower (coordinator dependency)
Scalability	Excellent	Poor
Network failures	Handled via compensation	Can cause deadlock
Heterogeneous systems	Works with any service type	Requires XA protocol support
Audit trail	Full trail of all operations	Only committed state
Complexity	Application logic	Protocol complexity
Cloud native	Yes	No
Suitable for microservices	Yes	No

When Would You Still Use 2PC?

2PC makes sense WITHIN a single system (not across microservices):

MySQL uses 2PC internally for its InnoDB storage engine
A single service that writes to two data stores (rare, use Outbox instead)
When absolute strong consistency is non-negotiable and performance is secondary

15. When to Use SAGA and When Not To

Use SAGA When:

Multiple services must update their databases as part of one business transaction
Example: Order, Payment, Inventory, Shipping must all update together.
Long-running business processes span multiple services
Example: Insurance claim processing that involves 10 services over several days.
Services use different databases and technologies
Example: Order service uses MySQL, Payment uses DynamoDB, Inventory uses Postgres.
High throughput is required
Example: Processing 50,000 orders per second. 2PC would collapse under this load.
System must remain available during partial failures
Example: Payment service goes down, but other services should keep running.

Do NOT Use SAGA When:

Strong consistency is an absolute business requirement
Example: Financial double-entry bookkeeping where debit and credit MUST be atomic.
Use a single service with a single transactional database instead.
All data fits in one service/database
If order, payment, inventory all belong to one service - just use @Transactional.
You are building a simple CRUD application
Over-engineering. Use standard transactional patterns.
Operations are not compensable
Example: "Send an SMS notification." You cannot un-send an SMS. In this case,
place the SMS step LAST (after the pivot transaction) as a retriable transaction.
Team lacks experience with distributed systems
SAGAs are complex. Build team competency before adopting in production.

The Decision Matrix

Does the operation span multiple service databases?
  |
  YES
  |
  v
Do all operations need to eventually be consistent?
  |
  YES
  |
  v
Can you design compensating transactions for each step?
  |
  YES
  |
  v
Is the team ready for eventual consistency in the application?
  |
  YES
  |
  v
USE SAGA PATTERN

If any answer is NO:
- Reconsider service boundaries (maybe they should be one service)
- Use synchronous request-response with retry
- Accept the strong consistency constraint and design around it

16. Trade-Offs You Must Accept

Trade-Off 1: Lack of Isolation

In a traditional database transaction, no other transaction sees your intermediate data.
In a SAGA, each local transaction commits and becomes visible immediately.

SAGA IN PROGRESS:
T1: Order PENDING (committed, visible to reads)
T2: Payment CHARGED (committed, visible to reads)
T3: (not started yet)

Another user queries order status: sees "PENDING" or partially complete data

The Impact: Users or other systems may see intermediate saga states.

The Mitigation:

Semantic locking: Add a saga_in_progress flag
Design UI/API to show "processing" state gracefully
Use read models (CQRS) that only show final states

Trade-Off 2: Compensation is Business Logic

Compensating transactions are not free. Someone must design, implement, test, and maintain them.
Each new saga step requires a corresponding compensation. This doubles the code surface area.

The Mitigation: Design compensations upfront during system design, not as an afterthought.

Trade-Off 3: Eventual Consistency is Visible

Between a saga starting and completing, the system is in a state that may not match
any logically correct business state. Downstream systems reading data during this window
may get inconsistent results.

The Mitigation:

Design read models carefully
Use sagas events to update read models only after saga completion
Communicate to business stakeholders that "processing" is a valid state

Trade-Off 4: Testing Complexity

Testing a saga requires testing the happy path AND every possible failure scenario:

Failure at each step
Failure during compensation
Network timeouts
Duplicate events (at-least-once delivery)
Out-of-order events

The Mitigation: Testcontainers + thorough integration tests covering all paths.

Trade-Off 5: Observability Complexity

A single saga spans multiple services. Debugging requires correlating logs and events
across all services using a correlation ID.

The Mitigation: Distributed tracing (AWS X-Ray, OpenTelemetry), structured logging,
saga-specific dashboards.

17. Key Terminology Reference

Term	Definition
SAGA	A sequence of local transactions coordinated to achieve a larger business goal
Local Transaction	A transaction within a single service/database
Compensating Transaction	A transaction that semantically undoes the effect of a previous transaction
Choreography	SAGA coordination via event publishing and subscription (no central coordinator)
Orchestration	SAGA coordination via a central orchestrator that drives all steps
Pivot Transaction	The last compensable transaction in a saga; after this point, only forward progress
Compensable Transaction	A transaction that has a corresponding compensation
Retriable Transaction	A transaction after the pivot that should be retried until it succeeds
Idempotent	An operation that produces the same result regardless of how many times it is executed
Eventual Consistency	A consistency model where the system will become consistent over time
SEC	Saga Execution Coordinator - the entity (service or mechanism) that tracks saga state
Dual-Write Problem	The inability to atomically write to a database AND publish an event
Outbox Pattern	A pattern that solves the dual-write problem using the database as an event buffer
Semantic Lock	A flag on a record indicating it is participating in a saga
Compensating Action	Business-level undo (not database rollback)
Saga State	The current status of a saga execution: STARTED, IN_PROGRESS, COMPLETED, COMPENSATING

18. Summary

Concept	Key Takeaway
The Problem	ACID transactions cannot span multiple service databases
Why 2PC Fails	Blocking, single point of failure, performance killer
CAP Theorem	Distributed systems must choose between Consistency and Availability when partitioned
BASE vs ACID	Accept eventual consistency for distributed systems, strong consistency for single services
SAGA Basics	Sequence of local transactions + compensating transactions for rollback
Compensation	A new forward transaction, not database rollback. Must be idempotent.
SAGA Types	Choreography (event-driven) or Orchestration (centralized)
Key Trade-Off	No isolation during saga execution; eventual consistency only
Use SAGA When	Multi-service transactions with high availability and throughput requirements
Do NOT Use When	Single-service data, strong consistency mandatory, operations not compensable

What Comes Next

You now understand WHY SAGA exists and the theoretical foundations.

Next: Part 2 - Choreography Pattern

Learn how to implement the event-driven choreography approach with complete, production-ready
Spring Boot code, Kafka configuration, and AWS integration.

Series: Saga Demystified

Part 1: SAGA Patterns - Fundamentals and Theory

Table of Contents

1. The Problem This Pattern Solves

2. A Quick Story to Set the Stage

3. Understanding ACID Transactions

Atomicity

Consistency

Isolation

Durability

Why ACID Matters

4. The Monolith Era: When ACID Was Easy

5. The Microservices Era: When ACID Became Hard

The Naive Solutions (And Why They All Fail)

6. Why You Cannot Just Use a Distributed Database

Technical Reasons

Architectural Reasons

When Distributed Databases ARE the Answer

7. CAP Theorem: The Fundamental Constraint

Consistency (C)

Availability (A)

Partition Tolerance (P)

The Practical Implication

How This Relates to SAGA

8. BASE Properties: The Practical Alternative

Basically Available

Soft State

Eventually Consistent

ACID vs BASE Comparison

The Key Mental Shift

9. Two-Phase Commit (2PC): The Failed Solution

How 2PC Works

Why 2PC Fails in Microservices

The Verdict on 2PC

10. Enter the SAGA Pattern

Modern Definition

What Makes SAGA Different from 2PC?

11. How SAGA Works: The Core Mechanics

The Sequence of Events

Failure Scenario

The Key Points

12. Compensating Transactions: The Heart of SAGA

Semantic vs Syntactic Undo

Properties of Good Compensating Transactions

Types of Transactions in a SAGA

13. Types of SAGA: Choreography vs Orchestration

Choreography (Event-Driven)

Orchestration (Command-Driven)

Quick Comparison

14. SAGA vs 2PC: A Direct Comparison

When Would You Still Use 2PC?

15. When to Use SAGA and When Not To

Use SAGA When:

Do NOT Use SAGA When:

The Decision Matrix

16. Trade-Offs You Must Accept

Trade-Off 1: Lack of Isolation

Trade-Off 2: Compensation is Business Logic

Trade-Off 3: Eventual Consistency is Visible

Trade-Off 4: Testing Complexity

Trade-Off 5: Observability Complexity

17. Key Terminology Reference

18. Summary

What Comes Next