← Back to Articles
6/6/2026Admin Post

saga demystified part6 pitfalls

Part 6: SAGA Pitfalls, Anti-Patterns, and Best Practices

Series Navigation: Index |
Part 1 |
Part 2 |
Part 3 |
Part 4 |
Part 5 | Part 6 |
Part 7 - Interview


Table of Contents

  1. Isolation Anomalies in SAGAs
  2. Anti-Pattern: Missing Idempotency
  3. Anti-Pattern: Missing Compensating Transactions
  4. Anti-Pattern: Saga Explosion
  5. Anti-Pattern: Too-Fine-Grained SAGAs
  6. Anti-Pattern: Saga as a Silver Bullet
  7. Anti-Pattern: Blocking in Compensation
  8. Anti-Pattern: Ignoring Pivot Transactions
  9. Anti-Pattern: No Dead Letter Queue
  10. Anti-Pattern: Synchronous Compensation Chain
  11. Production Challenge: Message Duplication
  12. Production Challenge: Out-of-Order Events
  13. Production Challenge: Partial Service Failure During Compensation
  14. Production Challenge: Schema Evolution with Events
  15. Production Challenge: Service Discovery During Compensation
  16. Industry Best Practices
  17. Testing Strategies for Production-Grade SAGAs
  18. Operational Runbook
  19. Summary

1. Isolation Anomalies in SAGAs

Because each SAGA step commits independently, other transactions can read intermediate states.
These are called isolation anomalies. Understanding them is critical for system design.

Anomaly 1: Dirty Reads (Lost Updates)

SCENARIO:
T1: SAGA creates Order (status=PENDING) - COMMITTED
T2: Another process reads Order (sees PENDING) - OK
T3: SAGA processes payment - COMMITTED
T4: Same process reads Order again (sees PENDING still, has stale cache)
T5: SAGA fails, Order goes to CANCELLED - COMMITTED
T4 reader: acts on stale PENDING state that is now invalid

REAL IMPACT:
- Customer dashboard shows "Order Processing" when it is actually cancelled
- Downstream analytics process a PENDING order that becomes CANCELLED

Mitigation: Semantic Locking

// Add a saga_in_progress flag to the order
public class Order {
    private OrderStatus status;
    private boolean sagaInProgress;  // true while saga is active
}
 
// Downstream systems check this flag
public OrderSummary getOrderSummary(String orderId) {
    Order order = orderRepository.findById(orderId).orElseThrow();
 
    if (order.isSagaInProgress()) {
        // Return a safe "processing" representation
        return OrderSummary.processing(orderId);
    }
 
    return OrderSummary.from(order);
}

Anomaly 2: Non-Repeatable Reads

SCENARIO:
Reporting service reads order summary at T1: total = $5000
SAGA processes more orders between T1 and T2
Reporting service reads order summary at T2: total = $6000

Not a bug per se, but if the report aggregates multiple reads in one logical operation,
values will be inconsistent WITHIN the same report.

Mitigation: Snapshot Consistency

// For reports, take a consistent snapshot at the START of the report run
// Store the snapshot in a separate reporting table updated at report time
// Report reads only from snapshot, not live tables
 
@Scheduled(cron = "0 0 1 * * *")  // Daily at 1am
public void generateDailyReportSnapshot() {
    Instant snapshotTime = Instant.now();
 
    // Take consistent snapshot
    List<OrderSummary> completedOrders = orderRepository
        .findByStatusAndCompletedBefore(OrderStatus.CONFIRMED, snapshotTime);
 
    // Store in reporting table
    reportingRepository.saveSnapshot(new DailySnapshot(snapshotTime, completedOrders));
}

Anomaly 3: Lost Updates

SCENARIO:
Two SAGAs for two different orders read inventory simultaneously:
  SAGA-1 reads: 5 items available
  SAGA-2 reads: 5 items available
  SAGA-1 reserves 4 items (4 available)
  SAGA-2 reserves 4 items (0 available, NEGATIVE if not checked!)

Without proper locking, both SAGAs see 5 items and both try to reserve 4.
Result: negative inventory (overselling)

Mitigation: Pessimistic Locking

@Transactional
public void reserveInventory(String productId, int quantity) {
    // PESSIMISTIC WRITE LOCK prevents concurrent reads
    InventoryItem item = inventoryRepository
        .findByProductIdWithLock(productId)
        .orElseThrow();
 
    if (item.getAvailableQuantity() < quantity) {
        throw new InsufficientStockException(productId);
    }
 
    item.setAvailableQuantity(item.getAvailableQuantity() - quantity);
    inventoryRepository.save(item);
    // Lock released when transaction commits
}

Anomaly 4: Phantom Reads

SCENARIO:
Inventory check shows 3 items in stock
Customer 1's SAGA reserves 3 items (available = 0)
Customer 2 reads inventory during SAGA-1 between check and reserve
Customer 2 sees 3 items still (phantom read - data is stale)
Customer 2 tries to order

Mitigation: Commutative Updates
Design operations to be order-independent (commutative). For inventory:

// Instead of: available = available - quantity (order-dependent)
// Use: available = total - reserved (always computable from total and reserved)
 
UPDATE inventory
SET reserved_quantity = reserved_quantity + ?
WHERE product_id = ?
  AND (total_quantity - reserved_quantity) >= ?  -- Check condition in UPDATE

Anomaly 5: Lost Compensation

SCENARIO:
SAGA starts, creates order (T1)
SAGA charges payment (T2)
Inventory fails (T3) - compensation triggered
Compensation: refund payment (C2)
Compensation: cancel order (C1)

PROBLEM: C1 (cancel order) is delayed and arrives AFTER a user action
User reads order status at T=10: PENDING
C1 executes at T=11: status=CANCELLED
User re-orders at T=12 (thinking first order failed)
Result: duplicate orders

Mitigation: Semantic Locking + Event Ordering

// Order cannot be re-created while saga_in_progress = true
// Only set saga_in_progress = false after C1 commits
 
public void createOrder(PlaceOrderRequest request) {
    // Prevent re-order while saga is active
    Optional<Order> existingOrder = orderRepository
        .findPendingSagaForCustomer(request.customerId());
 
    if (existingOrder.isPresent()) {
        throw new DuplicateOrderException(
            "Existing saga in progress for customer: " + request.customerId()
        );
    }
 
    // Create new order with saga_in_progress = true
}

2. Anti-Pattern: Missing Idempotency

What Happens Without Idempotency

Timeline:
T1: Kafka delivers PaymentProcessedEvent to Inventory Service
T2: Inventory Service processes event, reserves inventory
T3: Inventory Service sends ACK to Kafka (ACK lost in network!)
T4: Kafka re-delivers PaymentProcessedEvent (thinks it was not processed)
T5: Inventory Service processes AGAIN (duplicate reservation!)

Result: inventory double-reserved, phantom negative stock

The Fix: Always Implement Idempotency

// WRONG - No idempotency check
@KafkaListener(topics = "payment.events.payment-processed")
public void handlePaymentProcessed(PaymentProcessedEvent event, Acknowledgment ack) {
    inventoryService.reserveInventory(event.orderId(), event.items());
    ack.acknowledge();
    // If ACK fails, message is re-delivered and inventory reserved TWICE
}
 
// CORRECT - Database unique constraint as idempotency guard
@KafkaListener(topics = "payment.events.payment-processed")
@Transactional
public void handlePaymentProcessed(PaymentProcessedEvent event, Acknowledgment ack) {
    // Unique constraint on order_id in reservations table prevents duplicate
    try {
        inventoryService.reserveInventory(event.orderId(), event.items());
    } catch (DataIntegrityViolationException e) {
        // Already processed - safe to acknowledge and continue
        log.info("Duplicate event ignored (idempotent): orderId={}", event.orderId());
    }
    ack.acknowledge();
}

Checklist: Is Your Handler Idempotent?

  • Does it create a record? Add UNIQUE constraint on natural key
  • Does it update a record? Check current status before updating
  • Does it call an external API? Use idempotency key in the API call
  • Does it decrement a counter? Check if already decremented for this event
  • Does it publish an event? Use eventId for deduplication in outbox

3. Anti-Pattern: Missing Compensating Transactions

What Goes Wrong

BAD DESIGN:
Saga steps:
T1: Create Order
T2: Process Payment
T3: Reserve Inventory
T4: Send Email Notification
T5: Update Analytics Dashboard

Compensation only designed for:
C3: Release Inventory
C2: Refund Payment
C1: Cancel Order

MISSING: Compensation for T4 (email) and T5 (analytics)

Result: Customer receives "Order Confirmed" email for a cancelled order
Analytics dashboard shows wrong numbers

The Fix: Design Compensations Upfront

// SAGA DESIGN CHECKLIST
// For EVERY step in your saga, answer these questions BEFORE writing code:
//
// 1. What does this step do? (Forward action)
// 2. What should happen if we need to undo it? (Compensation)
// 3. Can this step be undone? (Some cannot - e.g., "sent an SMS")
// 4. If it cannot be undone, can it be placed AFTER the pivot transaction? (Retriable only)
// 5. Is the compensation idempotent?
// 6. Can the compensation fail? If so, what is the fallback?
 
enum OrderSagaStep {
    // Step                Compensation            Notes
    CREATE_ORDER,         // -> CANCEL_ORDER        Idempotent via order status check
    PROCESS_PAYMENT,      // -> REFUND_PAYMENT      Idempotent via payment status check
    RESERVE_INVENTORY,    // -> RELEASE_INVENTORY   Idempotent via reservation status check
    CREATE_SHIPMENT,      // -> CANCEL_SHIPMENT     After pivot - prefer retry over compensation
    SEND_CONFIRMATION,    // NO compensation!       Cannot unsend email - place LAST (fire-and-forget)
    UPDATE_ANALYTICS      // NO compensation!       Eventually consistent, self-healing
}

Non-Compensable Steps: Placement Strategy

CORRECT SAGA ORDERING:

[Compensable Steps]                    [Pivot]     [Non-Compensable/Retriable Steps]
Create Order ->                        Reserve --->  Create Shipment
Process Payment ->                     Inventory     Send Email (fire-and-forget)
                                                      Update Analytics (eventually consistent)

The non-compensable steps (email, analytics) come AFTER the pivot.
If they fail, we retry. We never compensate backward past the pivot for these.

4. Anti-Pattern: Saga Explosion

The Problem

As a system grows, every business process becomes a saga. Soon you have:

  • 50+ saga types
  • Hundreds of compensating transactions
  • Events proliferating across 20+ topics
  • No one understands the full system

Signs You Have Saga Explosion

  • "We have a saga for every microservice operation"
  • "Nobody knows which service triggers which saga"
  • Event subscriptions form a graph that nobody can draw
  • Adding a new feature requires updating 10 different services' event handlers

The Fix: Right-Size Your SAGAs

QUESTION: "Does this REALLY need a saga?"

Ask yourself:
1. Does this operation span MULTIPLE service databases?
   NO -> Use simple local transaction. No saga needed.

2. Does each step need to commit immediately (not hold locks)?
   NO -> Consider 2PC within a single service.

3. Are there more than 2 services involved?
   NO -> A simple async event may be enough (not a full saga).

RULE: Only create a saga when a business transaction MUST span multiple service
      databases AND each step must commit independently.

5. Anti-Pattern: Too-Fine-Grained SAGAs

The Problem

BAD: Every database update is its own saga step
  Saga: Update Customer Preferences
    T1: Update email preference (Service A)
    T2: Update notification preference (Service B)
    T3: Update UI preference (Service B)
    T4: Sync to analytics (Service C)

This creates enormous overhead for a simple update.
4 Kafka messages, 4 event handlers, 4 idempotency checks, 4 compensations.

The Fix: Coarse-Grained Steps

// BETTER: Coarser steps that batch related changes
// Step 1: Update ALL preferences in Service B in ONE local transaction
// Step 2: Sync to analytics (async, fire-and-forget)
 
// Only need a saga if preferences span DIFFERENT databases in DIFFERENT services
// If email and notification preferences are in the SAME database -> just use @Transactional
 
@Service
public class PreferencesService {
 
    @Transactional
    public void updateAllPreferences(String customerId, PreferencesRequest request) {
        // All in ONE local transaction - NO saga needed!
        preferenceRepository.updateEmailPreference(customerId, request.email());
        preferenceRepository.updateNotificationPreference(customerId, request.notification());
        preferenceRepository.updateUIPreference(customerId, request.ui());
 
        // Async fire-and-forget to analytics (no compensation needed)
        eventPublisher.publishAsync(new PreferencesUpdatedEvent(customerId));
    }
}

6. Anti-Pattern: Saga as a Silver Bullet

Symptoms

  • "We use SAGAs for everything"
  • Single-service operations wrapped in sagas
  • Simple CRUD operations with unnecessary compensation logic
  • Performance problems caused by saga overhead

The Fix: Pattern Selection Framework

When to use SAGA:
+ Multi-service transaction required
+ High availability requirement
+ Independent scalability per service
+ Long-running business process
+ Services use different databases

When NOT to use SAGA:
- All data in one database -> use @Transactional
- Simple CRUD in one service -> no pattern needed
- Strong consistency mandatory -> design differently or use single service
- Team not ready for eventual consistency -> simplify architecture first
- Operations not compensable -> redesign step order or use sagas only for compensable parts

7. Anti-Pattern: Blocking in Compensation

What Goes Wrong

// BAD: Compensation blocks waiting for synchronous response
@Transactional
public void compensateOrder(String orderId) {
    // Block waiting for payment service to confirm refund
    PaymentRefundResponse response = paymentServiceClient.refund(orderId)
        .get(30, TimeUnit.SECONDS);  // BLOCKS the compensation chain!
 
    // If payment service is down, this blocks for 30 seconds
    // Then throws TimeoutException
    // Compensation fails, saga stuck in COMPENSATING state forever
 
    orderRepository.updateStatus(orderId, OrderStatus.CANCELLED);
}

The Fix: Async Compensation

// GOOD: Compensation is event-driven, never blocks
// Compensation chain is driven by events, not synchronous calls
 
// Order Service publishes CompensationRequestedEvent
@Transactional
public void initiateCompensation(String orderId, String reason) {
    // Update status to CANCELLING (semantic lock for compensation in progress)
    Order order = orderRepository.findById(orderId).orElseThrow();
    order.setStatus(OrderStatus.CANCELLING);
    order.setSagaInProgress(true);
    orderRepository.save(order);
 
    // Publish compensation event - Payment Service will react asynchronously
    outboxEventRepository.save(OutboxEvent.of(
        KafkaTopics.COMPENSATION_ORDER,
        orderId,
        new CompensationRequestedEvent(orderId, reason)
    ));
 
    // Return immediately - do NOT wait for compensation to complete
    // Compensation completion will arrive via PaymentRefundedEvent
}

8. Anti-Pattern: Ignoring Pivot Transactions

What Happens When You Ignore the Pivot

Scenario: Order Saga without pivot awareness

Customer places order.
T1: Order Created (compensable)
T2: Payment Charged (compensable)
T3: Inventory Reserved (compensable) <- should be pivot
T4: Shipment Created
T5: Customer Notified

Shipment creation fails. Developer decides to compensate:
C4: Cancel Shipment (fine)
C3: Release Inventory (fine)
C2: Refund Payment (fine)
C1: Cancel Order (fine)

BUT: Customer was notified at T5 with "Order Confirmed"!
Now the order is cancelled but customer has a confirmation email.
Customer is confused. Support tickets flood in.

The Fix: Respect the Pivot

// Mark the pivot transaction explicitly in your design
// Everything before pivot: compensable
// Everything after pivot: either retriable or fire-and-forget
 
enum OrderSagaStep {
    CREATE_ORDER,        // Compensable: cancel order
    PROCESS_PAYMENT,     // Compensable: refund payment
    RESERVE_INVENTORY,   // PIVOT: No compensation after this succeeds
    CREATE_SHIPMENT,     // Post-pivot: Retry up to 5 times before manual intervention
    SEND_NOTIFICATION,   // Post-pivot: Fire-and-forget (after shipment confirmed)
    CONFIRM_ORDER        // Post-pivot: Retry
}
 
// In orchestrator: once pivot commits, no more compensation
private SagaStep getCompensationStart(SagaStep failedAt) {
    return switch (failedAt) {
        case CREATE_ORDER      -> null;                         // Nothing to compensate
        case PROCESS_PAYMENT   -> SagaStep.COMPENSATE_ORDER;   // Cancel order
        case RESERVE_INVENTORY -> SagaStep.COMPENSATE_PAYMENT; // Refund, then cancel
        case CREATE_SHIPMENT,
             SEND_NOTIFICATION,
             CONFIRM_ORDER     -> null;  // POST-PIVOT: No compensation, retry only
    };
}

9. Anti-Pattern: No Dead Letter Queue

What Happens

Scenario: Message permanently fails (bug in consumer code)
- Consumer throws NullPointerException every time
- Kafka retries the message 3 times (with backoff)
- Message returns to main topic partition
- Consumer processes it again: NPE
- Infinite retry loop!
- The partition is BLOCKED - no subsequent messages processed
- The saga is STUCK FOREVER

The Fix: Always Configure DLT

# application.yml
spring:
  kafka:
    consumer:
      # After max retries, message goes to DLT automatically
      # Never configure infinite retries
// ALWAYS configure DLT via Spring Kafka error handler
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> kafkaTemplate) {
    DeadLetterPublishingRecoverer dltRecoverer =
        new DeadLetterPublishingRecoverer(kafkaTemplate,
            (record, exception) -> {
                // Custom routing: send to named DLT topic
                return new TopicPartition(record.topic() + ".DLT", -1);
            }
        );
 
    DefaultErrorHandler errorHandler = new DefaultErrorHandler(
        dltRecoverer,
        new FixedBackOff(2000L, 3L)  // 3 retries, 2 second delay, THEN DLT
    );
 
    // Add monitoring when messages go to DLT
    errorHandler.setRetryListeners((record, ex, deliveryAttempt) ->
        log.warn("Retry attempt {} for topic={}, key={}, error={}",
            deliveryAttempt, record.topic(), record.key(), ex.getMessage())
    );
 
    return errorHandler;
}

10. Anti-Pattern: Synchronous Compensation Chain

The Problem

// BAD: Compensation waits synchronously for each step
@Transactional
public void compensate(SagaInstance saga) {
    // Step 1: Call Payment Service synchronously
    paymentServiceClient.refund(saga.getOrderId());  // 500ms
 
    // Step 2: Call Inventory Service synchronously
    inventoryServiceClient.releaseReservation(saga.getOrderId());  // 300ms
 
    // Step 3: Update order status
    orderService.cancelOrder(saga.getOrderId());  // 100ms
 
    // Total: 900ms of blocking
    // If ANY service is slow, EVERYTHING blocks
    // If ANY service is down, ENTIRE compensation fails
}

The Fix: Event-Driven Compensation Chain

// GOOD: Each compensation step publishes an event and returns immediately
// Next compensation step starts when event is consumed
 
// Payment Service: receives InventoryReservationFailedEvent
@KafkaListener(topics = KafkaTopics.INVENTORY_RESERVATION_FAILED)
@Transactional
public void handleInventoryFailed(InventoryReservationFailedEvent event, Acknowledgment ack) {
    // Do the refund (fast local operation + outbox write)
    paymentService.refund(event.orderId(), event.failureReason());
 
    // Refund triggers PaymentRefundedEvent -> Order Service cancels order
    // This is ASYNC - returns immediately
    ack.acknowledge();
}

11. Production Challenge: Message Duplication

Root Causes

  1. Consumer processes message but ACK is lost in transit
  2. Consumer crashes after processing but before ACK
  3. Outbox publisher publishes event but crashes before marking it as PUBLISHED
  4. Kafka rebalance during consumer group restart

Detection Strategy

// Add duplicate detection metrics
@KafkaListener(topics = KafkaTopics.PAYMENT_PROCESSED)
@Transactional
public void handlePaymentProcessed(PaymentProcessedEvent event, Acknowledgment ack) {
 
    // Check for duplicate
    if (processedEventRepository.existsByEventIdAndServiceId(
            event.eventId(), "inventory-service")) {
        log.warn("DUPLICATE_EVENT detected: eventId={}, topic=PAYMENT_PROCESSED",
            event.eventId());
        meterRegistry.counter("events.duplicates",
            "topic", "payment_processed").increment();
        ack.acknowledge();  // Acknowledge to prevent re-delivery
        return;
    }
 
    // Process event
    inventoryService.reserveInventory(event);
 
    // Mark as processed
    processedEventRepository.save(
        new ProcessedEvent(event.eventId(), "inventory-service", Instant.now())
    );
 
    ack.acknowledge();
}

12. Production Challenge: Out-of-Order Events

How Events Arrive Out of Order

In Kafka, messages within the SAME partition are ordered. Messages across partitions are NOT.

If your saga events go to different partitions (or different topics), they may arrive out of order.

EXPECTED ORDER:
PaymentProcessedEvent -> InventoryReservedEvent -> ShipmentCreatedEvent

ACTUAL ORDER RECEIVED (due to different partitions/consumer lag):
ShipmentCreatedEvent  <- arrives first (from partition 2)
InventoryReservedEvent <- arrives second (from partition 0)
PaymentProcessedEvent  <- arrives last (from partition 1)

A consumer processing ShipmentCreatedEvent has no order record yet!

Solution 1: Ensure Single Partition for Order Events

// Use orderId as Kafka KEY to ensure all events for the same order
// go to the SAME partition (Kafka routes by key hash to partition)
 
kafkaTemplate.send(
    new ProducerRecord<>(
        topic,
        orderId,    // <-- KEY: ensures same partition for same order
        eventPayload
    )
);

Solution 2: State Machine Guard

// Check current saga state before processing
@KafkaListener(topics = KafkaTopics.SHIPMENT_CREATED)
@Transactional
public void handleShipmentCreated(ShipmentCreatedEvent event, Acknowledgment ack) {
    Order order = orderRepository.findById(event.orderId()).orElse(null);
 
    if (order == null || order.getStatus() != OrderStatus.INVENTORY_RESERVED) {
        // Event arrived too early - requeue with delay
        log.warn("Out-of-order event: ShipmentCreated but order status is {}",
            order == null ? "NOT_FOUND" : order.getStatus());
 
        // Option 1: Delay and retry (using retry topic)
        throw new OutOfOrderEventException("Order not ready for shipment confirmation");
 
        // Option 2: Store in a pending event table and retry when ready
    }
 
    // Process normally
    orderService.confirmOrder(event.orderId(), event.shipmentId());
    ack.acknowledge();
}

Solution 3: Pending Events Store

// PendingEventStore.java
@Service
@RequiredArgsConstructor
public class PendingEventStore {
 
    private final PendingEventRepository pendingEventRepository;
 
    /**
     * Store an event that arrived before its prerequisites were met.
     * A scheduler will retry these events periodically.
     */
    @Transactional
    public void storeForLater(String eventType, String orderId, String payload) {
        pendingEventRepository.save(PendingEvent.builder()
            .eventId(UUID.randomUUID().toString())
            .eventType(eventType)
            .orderId(orderId)
            .payload(payload)
            .retryAfter(Instant.now().plusSeconds(5))  // Retry after 5 seconds
            .maxRetries(10)
            .createdAt(Instant.now())
            .build());
    }
 
    @Scheduled(fixedDelay = 5000)
    @Transactional
    public void retryPendingEvents() {
        List<PendingEvent> events = pendingEventRepository
            .findByRetryAfterBeforeOrderByCreatedAtAsc(Instant.now(),
                PageRequest.of(0, 50));
 
        for (PendingEvent event : events) {
            // Re-publish to original topic
            // ... re-publishing logic
 
            if (event.getRetryCount() >= event.getMaxRetries()) {
                event.setStatus("FAILED");
            } else {
                event.setRetryCount(event.getRetryCount() + 1);
                event.setRetryAfter(Instant.now().plusSeconds(
                    (long) Math.pow(2, event.getRetryCount()) * 5  // Exponential backoff
                ));
            }
            pendingEventRepository.save(event);
        }
    }
}

13. Production Challenge: Partial Service Failure During Compensation

Scenario

Order failed at Inventory step.
Compensation:
  C2: Refund Payment     <- Payment Service is DOWN for maintenance
  C1: Cancel Order       <- This must wait for C2

Result: Compensation is stuck. Order is in CANCELLING state forever.
Customer balance not refunded.

Solution: Aggressive Retry with Manual Fallback

// CompensationRetryService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class CompensationRetryService {
 
    private final SagaInstanceRepository sagaRepository;
    private final OrderSagaOrchestrator orchestrator;
    private final AlertService alertService;
 
    /**
     * Find sagas stuck in compensation and retry.
     */
    @Scheduled(fixedDelay = 30000)  // Every 30 seconds
    public void retryStuckCompensations() {
        List<SagaInstance> compensatingSagas = sagaRepository.findByStatus(
            SagaStatus.COMPENSATING
        );
 
        Instant stuckThreshold = Instant.now().minus(5, ChronoUnit.MINUTES);
 
        for (SagaInstance saga : compensatingSagas) {
            if (saga.getUpdatedAt().isBefore(stuckThreshold)) {
                if (saga.getRetryCount() < 20) {
                    log.warn("Retrying stuck compensation: sagaId={}, step={}, attempt={}",
                        saga.getSagaId(), saga.getCurrentStep(), saga.getRetryCount());
 
                    saga.setRetryCount(saga.getRetryCount() + 1);
                    sagaRepository.save(saga);
 
                    orchestrator.replayCurrentCompensationStep(saga);
                } else {
                    // Escalate to human
                    log.error("COMPENSATION PERMANENTLY STUCK: sagaId={}", saga.getSagaId());
                    alertService.sendPagerDutyAlert(
                        "CRITICAL: Manual intervention needed for saga compensation",
                        saga.getSagaId()
                    );
                    saga.setStatus(SagaStatus.FAILED);
                    sagaRepository.save(saga);
                }
            }
        }
    }
}

Manual Intervention API

// SagaManualInterventionController.java
@RestController
@RequestMapping("/admin/sagas")
@RequiredArgsConstructor
public class SagaManualInterventionController {
 
    private final SagaManualInterventionService interventionService;
 
    // Force-complete a compensation manually (after verifying manually)
    @PostMapping("/{sagaId}/compensations/{step}/force-complete")
    @PreAuthorize("hasRole('SAGA_ADMIN')")
    public ResponseEntity<Void> forceCompleteCompensation(
            @PathVariable String sagaId,
            @PathVariable String step,
            @RequestBody ForceCompleteRequest request) {
        interventionService.forceCompleteCompensation(sagaId, step, request.reason());
        return ResponseEntity.ok().build();
    }
 
    // Force-fail a saga (mark as FAILED, stop all processing)
    @PostMapping("/{sagaId}/force-fail")
    @PreAuthorize("hasRole('SAGA_ADMIN')")
    public ResponseEntity<Void> forceFail(
            @PathVariable String sagaId,
            @RequestBody ForceFailRequest request) {
        interventionService.forceFail(sagaId, request.reason(), request.operatorId());
        return ResponseEntity.ok().build();
    }
}

14. Production Challenge: Schema Evolution with Events

The Problem

Events are published to Kafka and consumed by multiple services.
If you change an event's structure (add/remove/rename fields), consumers may break.

BREAKING CHANGE:
v1 Event: OrderCreatedEvent { orderId, customerId, totalAmount }
v2 Event: OrderCreatedEvent { orderId, userId, amount }  // renamed fields!

Consumer expecting "customerId" receives null.
Consumer expecting "totalAmount" receives null.
Saga breaks silently.

Solution: Backward-Compatible Event Evolution

// ALWAYS add fields, never remove or rename
// Use @JsonIgnoreProperties(ignoreUnknown = true) on all event classes
 
@JsonIgnoreProperties(ignoreUnknown = true)  // CRITICAL: ignore unknown fields
public record OrderCreatedEvent(
    String eventId,
    String sagaId,
    String orderId,
    String customerId,
    @JsonProperty("userId") String userId,  // Added in v2 (customerId still works)
    BigDecimal totalAmount,
    @JsonProperty("amount") BigDecimal amount,  // Added in v2 alias
    List<OrderItemDto> items,
    Instant occurredAt
) {
    // Getter that returns either customerId or userId (backward compatible)
    @JsonIgnore
    public String getEffectiveCustomerId() {
        return customerId != null ? customerId : userId;
    }
}

Versioned Event Routing

// Event Router: handles multiple versions of the same event type
@Component
@RequiredArgsConstructor
public class VersionedEventRouter {
 
    @KafkaListener(topics = KafkaTopics.ORDER_CREATED)
    public void handleOrderCreated(
            @Payload String rawPayload,
            @Header("eventVersion") String version,
            Acknowledgment ack) {
 
        switch (version) {
            case "1" -> handleV1OrderCreated(rawPayload);
            case "2" -> handleV2OrderCreated(rawPayload);
            default  -> {
                log.warn("Unknown event version {} - trying latest handler", version);
                handleV2OrderCreated(rawPayload);
            }
        }
        ack.acknowledge();
    }
}

15. Production Challenge: Service Discovery During Compensation

The Problem

SCENARIO:
Order SAGA has 5 participating services.
Compensation is triggered.
Inventory Service was re-deployed at a different address during the saga.
Compensation event for Inventory Service is sent to old address.
No response. Compensation stuck.

Solution: Service Discovery + Event-Driven Communication

// NEVER use hard-coded service addresses in saga compensations
// ALWAYS use event-driven communication (Kafka/SQS topics)
// Topics are stable; service addresses change
 
// BAD:
restTemplate.post("http://inventory-service-v1.internal/api/compensation", request);
 
// GOOD:
kafkaTemplate.send(CommandTopics.INVENTORY_SERVICE, orderId,
    new ReleaseInventoryCommand(sagaId, orderId));
// Inventory Service listens to this topic regardless of its URL

16. Industry Best Practices

Practice 1: Design Compensations Before Forward Steps

SAGA DESIGN TEMPLATE:
For each step, fill in this table BEFORE writing any code:

| Step Name | Forward Action | Compensation | Idempotent? | Can Fail? | Fallback |
|-----------|---------------|--------------|-------------|-----------|----------|
| CREATE_ORDER | INSERT order | UPDATE status=CANCELLED | Yes (status check) | Low risk | Manual |
| PROCESS_PAYMENT | Call gateway | Call gateway refund | Yes (status check) | Medium risk | PagerDuty alert |
| RESERVE_INVENTORY | Decrement stock | Increment stock | Yes (status check) | Low risk | Retry 5x |
| CREATE_SHIPMENT | Call carrier | Call carrier cancel | Yes (status check) | High risk | Manual + retry |

Practice 2: Monitor Every Saga Step Duration

// Set SLA alerts for each step
// If Payment step takes > 10 seconds, alert immediately
 
@Component
public class SagaStepSlaMonitor {
 
    private static final Map<SagaStep, Duration> STEP_SLA_LIMITS = Map.of(
        SagaStep.CREATE_ORDER, Duration.ofSeconds(2),
        SagaStep.PROCESS_PAYMENT, Duration.ofSeconds(10),
        SagaStep.RESERVE_INVENTORY, Duration.ofSeconds(5),
        SagaStep.CREATE_SHIPMENT, Duration.ofSeconds(30)
    );
 
    public void checkStepSla(SagaStep step, Duration actualDuration) {
        Duration slaLimit = STEP_SLA_LIMITS.getOrDefault(step, Duration.ofSeconds(30));
 
        if (actualDuration.compareTo(slaLimit) > 0) {
            log.warn("SLA BREACH: step={}, actual={}ms, limit={}ms",
                step, actualDuration.toMillis(), slaLimit.toMillis());
            meterRegistry.counter("saga.sla.breach", "step", step.name()).increment();
        }
    }
}

Practice 3: Test Every Failure Scenario

// SagaFailureScenarioTests.java
// Test one failure scenario per test method
@SpringBootTest
@Testcontainers
class SagaFailureScenarioTests {
 
    @Test void testPaymentDeclined_ordersGetCancelled() {}
    @Test void testInventoryOutOfStock_paymentRefunded_orderCancelled() {}
    @Test void testShippingUnavailable_sagaRetries() {}
    @Test void testDuplicateOrderCreatedEvent_idempotent() {}
    @Test void testOutOfOrderEvents_handled() {}
    @Test void testSagaTimeout_compensated() {}
    @Test void testCompensationFails_alertSent_manualIntervention() {}
    @Test void testNetworkPartitionDuringPayment_sagaRecovers() {}
}

Practice 4: Use Saga IDs as Correlation IDs

// Every log line, every metric, every trace should include sagaId
// This enables end-to-end saga tracing across all services
 
// In event handlers, extract sagaId and set in MDC
@KafkaListener(topics = "payment.events.payment-processed")
public void handle(PaymentProcessedEvent event, Acknowledgment ack) {
    MDC.put("sagaId", event.sagaId());
    MDC.put("orderId", event.orderId());
 
    try {
        // All log statements within will include sagaId and orderId
        inventoryService.reserveInventory(event);
        ack.acknowledge();
    } finally {
        MDC.remove("sagaId");
        MDC.remove("orderId");
    }
}

Practice 5: Separate Compensation Logic from Business Logic

// BAD: Mixed business and compensation logic
@Service
public class PaymentService {
    public void processPayment(OrderCreatedEvent event) { /* ... */ }
 
    public void refundPayment(String orderId) {
        // Compensation mixed in the same service = tight coupling
    }
 
    public void partialRefund(String orderId, BigDecimal amount) { /* ... */ }
    public void reverseChargeback(String orderId) { /* ... */ }
}
 
// GOOD: Separate compensation handler
@Service
public class PaymentService {
    public PaymentResult processPayment(OrderCreatedEvent event) { /* ... */ }
}
 
@Component
public class PaymentCompensationHandler {
    private final PaymentService paymentService;
 
    public void handleCompensation(RefundPaymentCommand command) {
        paymentService.refund(command.orderId(), command.reason());
    }
}

17. Testing Strategies for Production-Grade SAGAs

Test Pyramid for SAGAs

                    /\
                   /  \
                  / E2E \          <- Full saga from API to database (few, slow)
                 /--------\
                /  Integr. \       <- Service + Kafka + DB (per service, medium)
               /------------\
              /     Unit     \     <- Orchestrator state machine, handlers (many, fast)
             /----------------\

Unit Testing the Orchestrator State Machine

// OrderSagaOrchestratorUnitTest.java
@ExtendWith(MockitoExtension.class)
class OrderSagaOrchestratorUnitTest {
 
    @Mock private SagaInstanceRepository sagaRepository;
    @Mock private KafkaTemplate<String, Object> kafkaTemplate;
 
    @InjectMocks private OrderSagaOrchestrator orchestrator;
 
    @Test
    @DisplayName("Payment success should transition to RESERVE_INVENTORY step")
    void paymentSuccess_shouldTransitionToInventoryStep() {
        SagaInstance saga = createSagaInStep(SagaStep.PROCESS_PAYMENT);
        SagaReplyEnvelope reply = SagaReplyEnvelope.success(saga.getSagaId(),
            "PROCESS_PAYMENT", "{\"paymentId\":\"PAY-001\"}");
 
        orchestrator.handleSagaReply(reply, mock(Acknowledgment.class));
 
        ArgumentCaptor<SagaInstance> sagaCaptor = ArgumentCaptor.forClass(SagaInstance.class);
        verify(sagaRepository).save(sagaCaptor.capture());
        assertThat(sagaCaptor.getValue().getCurrentStep())
            .isEqualTo(SagaStep.RESERVE_INVENTORY);
    }
 
    @Test
    @DisplayName("Inventory failure should trigger payment compensation")
    void inventoryFailure_shouldTriggerPaymentCompensation() {
        SagaInstance saga = createSagaInStep(SagaStep.RESERVE_INVENTORY);
        SagaReplyEnvelope reply = SagaReplyEnvelope.failure(
            saga.getSagaId(), "RESERVE_INVENTORY", "Out of stock", "INSUFFICIENT_STOCK"
        );
 
        orchestrator.handleSagaReply(reply, mock(Acknowledgment.class));
 
        // Verify: compensation command sent to payment service
        verify(kafkaTemplate).send(
            eq(CommandTopics.PAYMENT_SERVICE),
            eq(saga.getOrderId()),
            any(RefundPaymentCommand.class)
        );
    }
}

18. Operational Runbook

When a Saga Gets Stuck

Step 1: Identify stuck sagas
  Query: SELECT * FROM saga_instances
         WHERE status IN ('IN_PROGRESS', 'COMPENSATING')
         AND updated_at < NOW() - INTERVAL 30 MINUTE;

Step 2: Check the current step
  For each stuck saga, check what step it is on.
  Is the required service up? Check CloudWatch / health endpoints.

Step 3: Check for DLT messages
  Are there messages in the DLT for the relevant topics?
  Review: SELECT * FROM dlt_messages WHERE status = 'PENDING_REVIEW';

Step 4: Check Outbox
  Are events stuck in the outbox?
  Query: SELECT COUNT(*) FROM outbox_events
         WHERE status = 'PENDING' AND created_at < NOW() - INTERVAL 10 MINUTE;

Step 5: Attempt automatic recovery
  Use the recovery API: POST /admin/sagas/{sagaId}/retry-current-step

Step 6: If automatic recovery fails
  Review the error, fix the underlying issue, then:
  POST /admin/dlt-messages/{messageId}/replay

Step 7: Last resort - manual force
  POST /admin/sagas/{sagaId}/force-complete-step (with explicit reason)
  Or: POST /admin/sagas/{sagaId}/force-fail (accept data inconsistency, fix manually)

Step 8: Post-incident
  Document the incident, add new test case, fix root cause

19. Summary

CategoryKey Lesson
Isolation AnomaliesUse semantic locking and CQRS read models to hide intermediate saga state
IdempotencyEvery handler MUST be idempotent. Use database unique constraints first.
Missing CompensationsDesign ALL compensations before writing ANY code
Saga ExplosionOnly use SAGA for genuine multi-service, multi-database transactions
Event OrderingUse orderId as Kafka partition key for per-order ordering
Message DuplicationExpected behavior with at-least-once delivery. Idempotency is the fix.
Schema EvolutionOnly add fields, never remove. Use @JsonIgnoreProperties always.
Stuck CompensationsAggressive retry + escalation to human operator
TestingTest every failure scenario, not just the happy path
OperationsBuild runbook before deploying to production
MonitoringsagaId in every log line. Alert on stuck sagas, DLT messages, outbox failures.

Next: Part 7 - Interview Mastery

Master 60+ SAGA interview questions from junior to Principal Architect level,
complete with detailed answers, follow-up questions, and tips for handling tricky scenarios.


Series Navigation: Index |
Part 1 |
Part 2 |
Part 3 |
Part 4 |
Part 5 | Part 6 |
Part 7