Part 6: SAGA Pitfalls, Anti-Patterns, and Best Practices
Series Navigation: Index |
Part 1 |
Part 2 |
Part 3 |
Part 4 |
Part 5 | Part 6 |
Part 7 - Interview
Table of Contents
- Isolation Anomalies in SAGAs
- Anti-Pattern: Missing Idempotency
- Anti-Pattern: Missing Compensating Transactions
- Anti-Pattern: Saga Explosion
- Anti-Pattern: Too-Fine-Grained SAGAs
- Anti-Pattern: Saga as a Silver Bullet
- Anti-Pattern: Blocking in Compensation
- Anti-Pattern: Ignoring Pivot Transactions
- Anti-Pattern: No Dead Letter Queue
- Anti-Pattern: Synchronous Compensation Chain
- Production Challenge: Message Duplication
- Production Challenge: Out-of-Order Events
- Production Challenge: Partial Service Failure During Compensation
- Production Challenge: Schema Evolution with Events
- Production Challenge: Service Discovery During Compensation
- Industry Best Practices
- Testing Strategies for Production-Grade SAGAs
- Operational Runbook
- Summary
1. Isolation Anomalies in SAGAs
Because each SAGA step commits independently, other transactions can read intermediate states.
These are called isolation anomalies. Understanding them is critical for system design.
Anomaly 1: Dirty Reads (Lost Updates)
SCENARIO:
T1: SAGA creates Order (status=PENDING) - COMMITTED
T2: Another process reads Order (sees PENDING) - OK
T3: SAGA processes payment - COMMITTED
T4: Same process reads Order again (sees PENDING still, has stale cache)
T5: SAGA fails, Order goes to CANCELLED - COMMITTED
T4 reader: acts on stale PENDING state that is now invalid
REAL IMPACT:
- Customer dashboard shows "Order Processing" when it is actually cancelled
- Downstream analytics process a PENDING order that becomes CANCELLED
Mitigation: Semantic Locking
// Add a saga_in_progress flag to the order
public class Order {
private OrderStatus status;
private boolean sagaInProgress; // true while saga is active
}
// Downstream systems check this flag
public OrderSummary getOrderSummary(String orderId) {
Order order = orderRepository.findById(orderId).orElseThrow();
if (order.isSagaInProgress()) {
// Return a safe "processing" representation
return OrderSummary.processing(orderId);
}
return OrderSummary.from(order);
}Anomaly 2: Non-Repeatable Reads
SCENARIO:
Reporting service reads order summary at T1: total = $5000
SAGA processes more orders between T1 and T2
Reporting service reads order summary at T2: total = $6000
Not a bug per se, but if the report aggregates multiple reads in one logical operation,
values will be inconsistent WITHIN the same report.
Mitigation: Snapshot Consistency
// For reports, take a consistent snapshot at the START of the report run
// Store the snapshot in a separate reporting table updated at report time
// Report reads only from snapshot, not live tables
@Scheduled(cron = "0 0 1 * * *") // Daily at 1am
public void generateDailyReportSnapshot() {
Instant snapshotTime = Instant.now();
// Take consistent snapshot
List<OrderSummary> completedOrders = orderRepository
.findByStatusAndCompletedBefore(OrderStatus.CONFIRMED, snapshotTime);
// Store in reporting table
reportingRepository.saveSnapshot(new DailySnapshot(snapshotTime, completedOrders));
}Anomaly 3: Lost Updates
SCENARIO:
Two SAGAs for two different orders read inventory simultaneously:
SAGA-1 reads: 5 items available
SAGA-2 reads: 5 items available
SAGA-1 reserves 4 items (4 available)
SAGA-2 reserves 4 items (0 available, NEGATIVE if not checked!)
Without proper locking, both SAGAs see 5 items and both try to reserve 4.
Result: negative inventory (overselling)
Mitigation: Pessimistic Locking
@Transactional
public void reserveInventory(String productId, int quantity) {
// PESSIMISTIC WRITE LOCK prevents concurrent reads
InventoryItem item = inventoryRepository
.findByProductIdWithLock(productId)
.orElseThrow();
if (item.getAvailableQuantity() < quantity) {
throw new InsufficientStockException(productId);
}
item.setAvailableQuantity(item.getAvailableQuantity() - quantity);
inventoryRepository.save(item);
// Lock released when transaction commits
}Anomaly 4: Phantom Reads
SCENARIO:
Inventory check shows 3 items in stock
Customer 1's SAGA reserves 3 items (available = 0)
Customer 2 reads inventory during SAGA-1 between check and reserve
Customer 2 sees 3 items still (phantom read - data is stale)
Customer 2 tries to order
Mitigation: Commutative Updates
Design operations to be order-independent (commutative). For inventory:
// Instead of: available = available - quantity (order-dependent)
// Use: available = total - reserved (always computable from total and reserved)
UPDATE inventory
SET reserved_quantity = reserved_quantity + ?
WHERE product_id = ?
AND (total_quantity - reserved_quantity) >= ? -- Check condition in UPDATEAnomaly 5: Lost Compensation
SCENARIO:
SAGA starts, creates order (T1)
SAGA charges payment (T2)
Inventory fails (T3) - compensation triggered
Compensation: refund payment (C2)
Compensation: cancel order (C1)
PROBLEM: C1 (cancel order) is delayed and arrives AFTER a user action
User reads order status at T=10: PENDING
C1 executes at T=11: status=CANCELLED
User re-orders at T=12 (thinking first order failed)
Result: duplicate orders
Mitigation: Semantic Locking + Event Ordering
// Order cannot be re-created while saga_in_progress = true
// Only set saga_in_progress = false after C1 commits
public void createOrder(PlaceOrderRequest request) {
// Prevent re-order while saga is active
Optional<Order> existingOrder = orderRepository
.findPendingSagaForCustomer(request.customerId());
if (existingOrder.isPresent()) {
throw new DuplicateOrderException(
"Existing saga in progress for customer: " + request.customerId()
);
}
// Create new order with saga_in_progress = true
}2. Anti-Pattern: Missing Idempotency
What Happens Without Idempotency
Timeline:
T1: Kafka delivers PaymentProcessedEvent to Inventory Service
T2: Inventory Service processes event, reserves inventory
T3: Inventory Service sends ACK to Kafka (ACK lost in network!)
T4: Kafka re-delivers PaymentProcessedEvent (thinks it was not processed)
T5: Inventory Service processes AGAIN (duplicate reservation!)
Result: inventory double-reserved, phantom negative stock
The Fix: Always Implement Idempotency
// WRONG - No idempotency check
@KafkaListener(topics = "payment.events.payment-processed")
public void handlePaymentProcessed(PaymentProcessedEvent event, Acknowledgment ack) {
inventoryService.reserveInventory(event.orderId(), event.items());
ack.acknowledge();
// If ACK fails, message is re-delivered and inventory reserved TWICE
}
// CORRECT - Database unique constraint as idempotency guard
@KafkaListener(topics = "payment.events.payment-processed")
@Transactional
public void handlePaymentProcessed(PaymentProcessedEvent event, Acknowledgment ack) {
// Unique constraint on order_id in reservations table prevents duplicate
try {
inventoryService.reserveInventory(event.orderId(), event.items());
} catch (DataIntegrityViolationException e) {
// Already processed - safe to acknowledge and continue
log.info("Duplicate event ignored (idempotent): orderId={}", event.orderId());
}
ack.acknowledge();
}Checklist: Is Your Handler Idempotent?
- Does it create a record? Add UNIQUE constraint on natural key
- Does it update a record? Check current status before updating
- Does it call an external API? Use idempotency key in the API call
- Does it decrement a counter? Check if already decremented for this event
- Does it publish an event? Use eventId for deduplication in outbox
3. Anti-Pattern: Missing Compensating Transactions
What Goes Wrong
BAD DESIGN:
Saga steps:
T1: Create Order
T2: Process Payment
T3: Reserve Inventory
T4: Send Email Notification
T5: Update Analytics Dashboard
Compensation only designed for:
C3: Release Inventory
C2: Refund Payment
C1: Cancel Order
MISSING: Compensation for T4 (email) and T5 (analytics)
Result: Customer receives "Order Confirmed" email for a cancelled order
Analytics dashboard shows wrong numbers
The Fix: Design Compensations Upfront
// SAGA DESIGN CHECKLIST
// For EVERY step in your saga, answer these questions BEFORE writing code:
//
// 1. What does this step do? (Forward action)
// 2. What should happen if we need to undo it? (Compensation)
// 3. Can this step be undone? (Some cannot - e.g., "sent an SMS")
// 4. If it cannot be undone, can it be placed AFTER the pivot transaction? (Retriable only)
// 5. Is the compensation idempotent?
// 6. Can the compensation fail? If so, what is the fallback?
enum OrderSagaStep {
// Step Compensation Notes
CREATE_ORDER, // -> CANCEL_ORDER Idempotent via order status check
PROCESS_PAYMENT, // -> REFUND_PAYMENT Idempotent via payment status check
RESERVE_INVENTORY, // -> RELEASE_INVENTORY Idempotent via reservation status check
CREATE_SHIPMENT, // -> CANCEL_SHIPMENT After pivot - prefer retry over compensation
SEND_CONFIRMATION, // NO compensation! Cannot unsend email - place LAST (fire-and-forget)
UPDATE_ANALYTICS // NO compensation! Eventually consistent, self-healing
}Non-Compensable Steps: Placement Strategy
CORRECT SAGA ORDERING:
[Compensable Steps] [Pivot] [Non-Compensable/Retriable Steps]
Create Order -> Reserve ---> Create Shipment
Process Payment -> Inventory Send Email (fire-and-forget)
Update Analytics (eventually consistent)
The non-compensable steps (email, analytics) come AFTER the pivot.
If they fail, we retry. We never compensate backward past the pivot for these.
4. Anti-Pattern: Saga Explosion
The Problem
As a system grows, every business process becomes a saga. Soon you have:
- 50+ saga types
- Hundreds of compensating transactions
- Events proliferating across 20+ topics
- No one understands the full system
Signs You Have Saga Explosion
- "We have a saga for every microservice operation"
- "Nobody knows which service triggers which saga"
- Event subscriptions form a graph that nobody can draw
- Adding a new feature requires updating 10 different services' event handlers
The Fix: Right-Size Your SAGAs
QUESTION: "Does this REALLY need a saga?"
Ask yourself:
1. Does this operation span MULTIPLE service databases?
NO -> Use simple local transaction. No saga needed.
2. Does each step need to commit immediately (not hold locks)?
NO -> Consider 2PC within a single service.
3. Are there more than 2 services involved?
NO -> A simple async event may be enough (not a full saga).
RULE: Only create a saga when a business transaction MUST span multiple service
databases AND each step must commit independently.
5. Anti-Pattern: Too-Fine-Grained SAGAs
The Problem
BAD: Every database update is its own saga step
Saga: Update Customer Preferences
T1: Update email preference (Service A)
T2: Update notification preference (Service B)
T3: Update UI preference (Service B)
T4: Sync to analytics (Service C)
This creates enormous overhead for a simple update.
4 Kafka messages, 4 event handlers, 4 idempotency checks, 4 compensations.
The Fix: Coarse-Grained Steps
// BETTER: Coarser steps that batch related changes
// Step 1: Update ALL preferences in Service B in ONE local transaction
// Step 2: Sync to analytics (async, fire-and-forget)
// Only need a saga if preferences span DIFFERENT databases in DIFFERENT services
// If email and notification preferences are in the SAME database -> just use @Transactional
@Service
public class PreferencesService {
@Transactional
public void updateAllPreferences(String customerId, PreferencesRequest request) {
// All in ONE local transaction - NO saga needed!
preferenceRepository.updateEmailPreference(customerId, request.email());
preferenceRepository.updateNotificationPreference(customerId, request.notification());
preferenceRepository.updateUIPreference(customerId, request.ui());
// Async fire-and-forget to analytics (no compensation needed)
eventPublisher.publishAsync(new PreferencesUpdatedEvent(customerId));
}
}6. Anti-Pattern: Saga as a Silver Bullet
Symptoms
- "We use SAGAs for everything"
- Single-service operations wrapped in sagas
- Simple CRUD operations with unnecessary compensation logic
- Performance problems caused by saga overhead
The Fix: Pattern Selection Framework
When to use SAGA:
+ Multi-service transaction required
+ High availability requirement
+ Independent scalability per service
+ Long-running business process
+ Services use different databases
When NOT to use SAGA:
- All data in one database -> use @Transactional
- Simple CRUD in one service -> no pattern needed
- Strong consistency mandatory -> design differently or use single service
- Team not ready for eventual consistency -> simplify architecture first
- Operations not compensable -> redesign step order or use sagas only for compensable parts
7. Anti-Pattern: Blocking in Compensation
What Goes Wrong
// BAD: Compensation blocks waiting for synchronous response
@Transactional
public void compensateOrder(String orderId) {
// Block waiting for payment service to confirm refund
PaymentRefundResponse response = paymentServiceClient.refund(orderId)
.get(30, TimeUnit.SECONDS); // BLOCKS the compensation chain!
// If payment service is down, this blocks for 30 seconds
// Then throws TimeoutException
// Compensation fails, saga stuck in COMPENSATING state forever
orderRepository.updateStatus(orderId, OrderStatus.CANCELLED);
}The Fix: Async Compensation
// GOOD: Compensation is event-driven, never blocks
// Compensation chain is driven by events, not synchronous calls
// Order Service publishes CompensationRequestedEvent
@Transactional
public void initiateCompensation(String orderId, String reason) {
// Update status to CANCELLING (semantic lock for compensation in progress)
Order order = orderRepository.findById(orderId).orElseThrow();
order.setStatus(OrderStatus.CANCELLING);
order.setSagaInProgress(true);
orderRepository.save(order);
// Publish compensation event - Payment Service will react asynchronously
outboxEventRepository.save(OutboxEvent.of(
KafkaTopics.COMPENSATION_ORDER,
orderId,
new CompensationRequestedEvent(orderId, reason)
));
// Return immediately - do NOT wait for compensation to complete
// Compensation completion will arrive via PaymentRefundedEvent
}8. Anti-Pattern: Ignoring Pivot Transactions
What Happens When You Ignore the Pivot
Scenario: Order Saga without pivot awareness
Customer places order.
T1: Order Created (compensable)
T2: Payment Charged (compensable)
T3: Inventory Reserved (compensable) <- should be pivot
T4: Shipment Created
T5: Customer Notified
Shipment creation fails. Developer decides to compensate:
C4: Cancel Shipment (fine)
C3: Release Inventory (fine)
C2: Refund Payment (fine)
C1: Cancel Order (fine)
BUT: Customer was notified at T5 with "Order Confirmed"!
Now the order is cancelled but customer has a confirmation email.
Customer is confused. Support tickets flood in.
The Fix: Respect the Pivot
// Mark the pivot transaction explicitly in your design
// Everything before pivot: compensable
// Everything after pivot: either retriable or fire-and-forget
enum OrderSagaStep {
CREATE_ORDER, // Compensable: cancel order
PROCESS_PAYMENT, // Compensable: refund payment
RESERVE_INVENTORY, // PIVOT: No compensation after this succeeds
CREATE_SHIPMENT, // Post-pivot: Retry up to 5 times before manual intervention
SEND_NOTIFICATION, // Post-pivot: Fire-and-forget (after shipment confirmed)
CONFIRM_ORDER // Post-pivot: Retry
}
// In orchestrator: once pivot commits, no more compensation
private SagaStep getCompensationStart(SagaStep failedAt) {
return switch (failedAt) {
case CREATE_ORDER -> null; // Nothing to compensate
case PROCESS_PAYMENT -> SagaStep.COMPENSATE_ORDER; // Cancel order
case RESERVE_INVENTORY -> SagaStep.COMPENSATE_PAYMENT; // Refund, then cancel
case CREATE_SHIPMENT,
SEND_NOTIFICATION,
CONFIRM_ORDER -> null; // POST-PIVOT: No compensation, retry only
};
}9. Anti-Pattern: No Dead Letter Queue
What Happens
Scenario: Message permanently fails (bug in consumer code)
- Consumer throws NullPointerException every time
- Kafka retries the message 3 times (with backoff)
- Message returns to main topic partition
- Consumer processes it again: NPE
- Infinite retry loop!
- The partition is BLOCKED - no subsequent messages processed
- The saga is STUCK FOREVER
The Fix: Always Configure DLT
# application.yml
spring:
kafka:
consumer:
# After max retries, message goes to DLT automatically
# Never configure infinite retries// ALWAYS configure DLT via Spring Kafka error handler
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> kafkaTemplate) {
DeadLetterPublishingRecoverer dltRecoverer =
new DeadLetterPublishingRecoverer(kafkaTemplate,
(record, exception) -> {
// Custom routing: send to named DLT topic
return new TopicPartition(record.topic() + ".DLT", -1);
}
);
DefaultErrorHandler errorHandler = new DefaultErrorHandler(
dltRecoverer,
new FixedBackOff(2000L, 3L) // 3 retries, 2 second delay, THEN DLT
);
// Add monitoring when messages go to DLT
errorHandler.setRetryListeners((record, ex, deliveryAttempt) ->
log.warn("Retry attempt {} for topic={}, key={}, error={}",
deliveryAttempt, record.topic(), record.key(), ex.getMessage())
);
return errorHandler;
}10. Anti-Pattern: Synchronous Compensation Chain
The Problem
// BAD: Compensation waits synchronously for each step
@Transactional
public void compensate(SagaInstance saga) {
// Step 1: Call Payment Service synchronously
paymentServiceClient.refund(saga.getOrderId()); // 500ms
// Step 2: Call Inventory Service synchronously
inventoryServiceClient.releaseReservation(saga.getOrderId()); // 300ms
// Step 3: Update order status
orderService.cancelOrder(saga.getOrderId()); // 100ms
// Total: 900ms of blocking
// If ANY service is slow, EVERYTHING blocks
// If ANY service is down, ENTIRE compensation fails
}The Fix: Event-Driven Compensation Chain
// GOOD: Each compensation step publishes an event and returns immediately
// Next compensation step starts when event is consumed
// Payment Service: receives InventoryReservationFailedEvent
@KafkaListener(topics = KafkaTopics.INVENTORY_RESERVATION_FAILED)
@Transactional
public void handleInventoryFailed(InventoryReservationFailedEvent event, Acknowledgment ack) {
// Do the refund (fast local operation + outbox write)
paymentService.refund(event.orderId(), event.failureReason());
// Refund triggers PaymentRefundedEvent -> Order Service cancels order
// This is ASYNC - returns immediately
ack.acknowledge();
}11. Production Challenge: Message Duplication
Root Causes
- Consumer processes message but ACK is lost in transit
- Consumer crashes after processing but before ACK
- Outbox publisher publishes event but crashes before marking it as PUBLISHED
- Kafka rebalance during consumer group restart
Detection Strategy
// Add duplicate detection metrics
@KafkaListener(topics = KafkaTopics.PAYMENT_PROCESSED)
@Transactional
public void handlePaymentProcessed(PaymentProcessedEvent event, Acknowledgment ack) {
// Check for duplicate
if (processedEventRepository.existsByEventIdAndServiceId(
event.eventId(), "inventory-service")) {
log.warn("DUPLICATE_EVENT detected: eventId={}, topic=PAYMENT_PROCESSED",
event.eventId());
meterRegistry.counter("events.duplicates",
"topic", "payment_processed").increment();
ack.acknowledge(); // Acknowledge to prevent re-delivery
return;
}
// Process event
inventoryService.reserveInventory(event);
// Mark as processed
processedEventRepository.save(
new ProcessedEvent(event.eventId(), "inventory-service", Instant.now())
);
ack.acknowledge();
}12. Production Challenge: Out-of-Order Events
How Events Arrive Out of Order
In Kafka, messages within the SAME partition are ordered. Messages across partitions are NOT.
If your saga events go to different partitions (or different topics), they may arrive out of order.
EXPECTED ORDER:
PaymentProcessedEvent -> InventoryReservedEvent -> ShipmentCreatedEvent
ACTUAL ORDER RECEIVED (due to different partitions/consumer lag):
ShipmentCreatedEvent <- arrives first (from partition 2)
InventoryReservedEvent <- arrives second (from partition 0)
PaymentProcessedEvent <- arrives last (from partition 1)
A consumer processing ShipmentCreatedEvent has no order record yet!
Solution 1: Ensure Single Partition for Order Events
// Use orderId as Kafka KEY to ensure all events for the same order
// go to the SAME partition (Kafka routes by key hash to partition)
kafkaTemplate.send(
new ProducerRecord<>(
topic,
orderId, // <-- KEY: ensures same partition for same order
eventPayload
)
);Solution 2: State Machine Guard
// Check current saga state before processing
@KafkaListener(topics = KafkaTopics.SHIPMENT_CREATED)
@Transactional
public void handleShipmentCreated(ShipmentCreatedEvent event, Acknowledgment ack) {
Order order = orderRepository.findById(event.orderId()).orElse(null);
if (order == null || order.getStatus() != OrderStatus.INVENTORY_RESERVED) {
// Event arrived too early - requeue with delay
log.warn("Out-of-order event: ShipmentCreated but order status is {}",
order == null ? "NOT_FOUND" : order.getStatus());
// Option 1: Delay and retry (using retry topic)
throw new OutOfOrderEventException("Order not ready for shipment confirmation");
// Option 2: Store in a pending event table and retry when ready
}
// Process normally
orderService.confirmOrder(event.orderId(), event.shipmentId());
ack.acknowledge();
}Solution 3: Pending Events Store
// PendingEventStore.java
@Service
@RequiredArgsConstructor
public class PendingEventStore {
private final PendingEventRepository pendingEventRepository;
/**
* Store an event that arrived before its prerequisites were met.
* A scheduler will retry these events periodically.
*/
@Transactional
public void storeForLater(String eventType, String orderId, String payload) {
pendingEventRepository.save(PendingEvent.builder()
.eventId(UUID.randomUUID().toString())
.eventType(eventType)
.orderId(orderId)
.payload(payload)
.retryAfter(Instant.now().plusSeconds(5)) // Retry after 5 seconds
.maxRetries(10)
.createdAt(Instant.now())
.build());
}
@Scheduled(fixedDelay = 5000)
@Transactional
public void retryPendingEvents() {
List<PendingEvent> events = pendingEventRepository
.findByRetryAfterBeforeOrderByCreatedAtAsc(Instant.now(),
PageRequest.of(0, 50));
for (PendingEvent event : events) {
// Re-publish to original topic
// ... re-publishing logic
if (event.getRetryCount() >= event.getMaxRetries()) {
event.setStatus("FAILED");
} else {
event.setRetryCount(event.getRetryCount() + 1);
event.setRetryAfter(Instant.now().plusSeconds(
(long) Math.pow(2, event.getRetryCount()) * 5 // Exponential backoff
));
}
pendingEventRepository.save(event);
}
}
}13. Production Challenge: Partial Service Failure During Compensation
Scenario
Order failed at Inventory step.
Compensation:
C2: Refund Payment <- Payment Service is DOWN for maintenance
C1: Cancel Order <- This must wait for C2
Result: Compensation is stuck. Order is in CANCELLING state forever.
Customer balance not refunded.
Solution: Aggressive Retry with Manual Fallback
// CompensationRetryService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class CompensationRetryService {
private final SagaInstanceRepository sagaRepository;
private final OrderSagaOrchestrator orchestrator;
private final AlertService alertService;
/**
* Find sagas stuck in compensation and retry.
*/
@Scheduled(fixedDelay = 30000) // Every 30 seconds
public void retryStuckCompensations() {
List<SagaInstance> compensatingSagas = sagaRepository.findByStatus(
SagaStatus.COMPENSATING
);
Instant stuckThreshold = Instant.now().minus(5, ChronoUnit.MINUTES);
for (SagaInstance saga : compensatingSagas) {
if (saga.getUpdatedAt().isBefore(stuckThreshold)) {
if (saga.getRetryCount() < 20) {
log.warn("Retrying stuck compensation: sagaId={}, step={}, attempt={}",
saga.getSagaId(), saga.getCurrentStep(), saga.getRetryCount());
saga.setRetryCount(saga.getRetryCount() + 1);
sagaRepository.save(saga);
orchestrator.replayCurrentCompensationStep(saga);
} else {
// Escalate to human
log.error("COMPENSATION PERMANENTLY STUCK: sagaId={}", saga.getSagaId());
alertService.sendPagerDutyAlert(
"CRITICAL: Manual intervention needed for saga compensation",
saga.getSagaId()
);
saga.setStatus(SagaStatus.FAILED);
sagaRepository.save(saga);
}
}
}
}
}Manual Intervention API
// SagaManualInterventionController.java
@RestController
@RequestMapping("/admin/sagas")
@RequiredArgsConstructor
public class SagaManualInterventionController {
private final SagaManualInterventionService interventionService;
// Force-complete a compensation manually (after verifying manually)
@PostMapping("/{sagaId}/compensations/{step}/force-complete")
@PreAuthorize("hasRole('SAGA_ADMIN')")
public ResponseEntity<Void> forceCompleteCompensation(
@PathVariable String sagaId,
@PathVariable String step,
@RequestBody ForceCompleteRequest request) {
interventionService.forceCompleteCompensation(sagaId, step, request.reason());
return ResponseEntity.ok().build();
}
// Force-fail a saga (mark as FAILED, stop all processing)
@PostMapping("/{sagaId}/force-fail")
@PreAuthorize("hasRole('SAGA_ADMIN')")
public ResponseEntity<Void> forceFail(
@PathVariable String sagaId,
@RequestBody ForceFailRequest request) {
interventionService.forceFail(sagaId, request.reason(), request.operatorId());
return ResponseEntity.ok().build();
}
}14. Production Challenge: Schema Evolution with Events
The Problem
Events are published to Kafka and consumed by multiple services.
If you change an event's structure (add/remove/rename fields), consumers may break.
BREAKING CHANGE:
v1 Event: OrderCreatedEvent { orderId, customerId, totalAmount }
v2 Event: OrderCreatedEvent { orderId, userId, amount } // renamed fields!
Consumer expecting "customerId" receives null.
Consumer expecting "totalAmount" receives null.
Saga breaks silently.
Solution: Backward-Compatible Event Evolution
// ALWAYS add fields, never remove or rename
// Use @JsonIgnoreProperties(ignoreUnknown = true) on all event classes
@JsonIgnoreProperties(ignoreUnknown = true) // CRITICAL: ignore unknown fields
public record OrderCreatedEvent(
String eventId,
String sagaId,
String orderId,
String customerId,
@JsonProperty("userId") String userId, // Added in v2 (customerId still works)
BigDecimal totalAmount,
@JsonProperty("amount") BigDecimal amount, // Added in v2 alias
List<OrderItemDto> items,
Instant occurredAt
) {
// Getter that returns either customerId or userId (backward compatible)
@JsonIgnore
public String getEffectiveCustomerId() {
return customerId != null ? customerId : userId;
}
}Versioned Event Routing
// Event Router: handles multiple versions of the same event type
@Component
@RequiredArgsConstructor
public class VersionedEventRouter {
@KafkaListener(topics = KafkaTopics.ORDER_CREATED)
public void handleOrderCreated(
@Payload String rawPayload,
@Header("eventVersion") String version,
Acknowledgment ack) {
switch (version) {
case "1" -> handleV1OrderCreated(rawPayload);
case "2" -> handleV2OrderCreated(rawPayload);
default -> {
log.warn("Unknown event version {} - trying latest handler", version);
handleV2OrderCreated(rawPayload);
}
}
ack.acknowledge();
}
}15. Production Challenge: Service Discovery During Compensation
The Problem
SCENARIO:
Order SAGA has 5 participating services.
Compensation is triggered.
Inventory Service was re-deployed at a different address during the saga.
Compensation event for Inventory Service is sent to old address.
No response. Compensation stuck.
Solution: Service Discovery + Event-Driven Communication
// NEVER use hard-coded service addresses in saga compensations
// ALWAYS use event-driven communication (Kafka/SQS topics)
// Topics are stable; service addresses change
// BAD:
restTemplate.post("http://inventory-service-v1.internal/api/compensation", request);
// GOOD:
kafkaTemplate.send(CommandTopics.INVENTORY_SERVICE, orderId,
new ReleaseInventoryCommand(sagaId, orderId));
// Inventory Service listens to this topic regardless of its URL16. Industry Best Practices
Practice 1: Design Compensations Before Forward Steps
SAGA DESIGN TEMPLATE:
For each step, fill in this table BEFORE writing any code:
| Step Name | Forward Action | Compensation | Idempotent? | Can Fail? | Fallback |
|-----------|---------------|--------------|-------------|-----------|----------|
| CREATE_ORDER | INSERT order | UPDATE status=CANCELLED | Yes (status check) | Low risk | Manual |
| PROCESS_PAYMENT | Call gateway | Call gateway refund | Yes (status check) | Medium risk | PagerDuty alert |
| RESERVE_INVENTORY | Decrement stock | Increment stock | Yes (status check) | Low risk | Retry 5x |
| CREATE_SHIPMENT | Call carrier | Call carrier cancel | Yes (status check) | High risk | Manual + retry |
Practice 2: Monitor Every Saga Step Duration
// Set SLA alerts for each step
// If Payment step takes > 10 seconds, alert immediately
@Component
public class SagaStepSlaMonitor {
private static final Map<SagaStep, Duration> STEP_SLA_LIMITS = Map.of(
SagaStep.CREATE_ORDER, Duration.ofSeconds(2),
SagaStep.PROCESS_PAYMENT, Duration.ofSeconds(10),
SagaStep.RESERVE_INVENTORY, Duration.ofSeconds(5),
SagaStep.CREATE_SHIPMENT, Duration.ofSeconds(30)
);
public void checkStepSla(SagaStep step, Duration actualDuration) {
Duration slaLimit = STEP_SLA_LIMITS.getOrDefault(step, Duration.ofSeconds(30));
if (actualDuration.compareTo(slaLimit) > 0) {
log.warn("SLA BREACH: step={}, actual={}ms, limit={}ms",
step, actualDuration.toMillis(), slaLimit.toMillis());
meterRegistry.counter("saga.sla.breach", "step", step.name()).increment();
}
}
}Practice 3: Test Every Failure Scenario
// SagaFailureScenarioTests.java
// Test one failure scenario per test method
@SpringBootTest
@Testcontainers
class SagaFailureScenarioTests {
@Test void testPaymentDeclined_ordersGetCancelled() {}
@Test void testInventoryOutOfStock_paymentRefunded_orderCancelled() {}
@Test void testShippingUnavailable_sagaRetries() {}
@Test void testDuplicateOrderCreatedEvent_idempotent() {}
@Test void testOutOfOrderEvents_handled() {}
@Test void testSagaTimeout_compensated() {}
@Test void testCompensationFails_alertSent_manualIntervention() {}
@Test void testNetworkPartitionDuringPayment_sagaRecovers() {}
}Practice 4: Use Saga IDs as Correlation IDs
// Every log line, every metric, every trace should include sagaId
// This enables end-to-end saga tracing across all services
// In event handlers, extract sagaId and set in MDC
@KafkaListener(topics = "payment.events.payment-processed")
public void handle(PaymentProcessedEvent event, Acknowledgment ack) {
MDC.put("sagaId", event.sagaId());
MDC.put("orderId", event.orderId());
try {
// All log statements within will include sagaId and orderId
inventoryService.reserveInventory(event);
ack.acknowledge();
} finally {
MDC.remove("sagaId");
MDC.remove("orderId");
}
}Practice 5: Separate Compensation Logic from Business Logic
// BAD: Mixed business and compensation logic
@Service
public class PaymentService {
public void processPayment(OrderCreatedEvent event) { /* ... */ }
public void refundPayment(String orderId) {
// Compensation mixed in the same service = tight coupling
}
public void partialRefund(String orderId, BigDecimal amount) { /* ... */ }
public void reverseChargeback(String orderId) { /* ... */ }
}
// GOOD: Separate compensation handler
@Service
public class PaymentService {
public PaymentResult processPayment(OrderCreatedEvent event) { /* ... */ }
}
@Component
public class PaymentCompensationHandler {
private final PaymentService paymentService;
public void handleCompensation(RefundPaymentCommand command) {
paymentService.refund(command.orderId(), command.reason());
}
}17. Testing Strategies for Production-Grade SAGAs
Test Pyramid for SAGAs
/\
/ \
/ E2E \ <- Full saga from API to database (few, slow)
/--------\
/ Integr. \ <- Service + Kafka + DB (per service, medium)
/------------\
/ Unit \ <- Orchestrator state machine, handlers (many, fast)
/----------------\
Unit Testing the Orchestrator State Machine
// OrderSagaOrchestratorUnitTest.java
@ExtendWith(MockitoExtension.class)
class OrderSagaOrchestratorUnitTest {
@Mock private SagaInstanceRepository sagaRepository;
@Mock private KafkaTemplate<String, Object> kafkaTemplate;
@InjectMocks private OrderSagaOrchestrator orchestrator;
@Test
@DisplayName("Payment success should transition to RESERVE_INVENTORY step")
void paymentSuccess_shouldTransitionToInventoryStep() {
SagaInstance saga = createSagaInStep(SagaStep.PROCESS_PAYMENT);
SagaReplyEnvelope reply = SagaReplyEnvelope.success(saga.getSagaId(),
"PROCESS_PAYMENT", "{\"paymentId\":\"PAY-001\"}");
orchestrator.handleSagaReply(reply, mock(Acknowledgment.class));
ArgumentCaptor<SagaInstance> sagaCaptor = ArgumentCaptor.forClass(SagaInstance.class);
verify(sagaRepository).save(sagaCaptor.capture());
assertThat(sagaCaptor.getValue().getCurrentStep())
.isEqualTo(SagaStep.RESERVE_INVENTORY);
}
@Test
@DisplayName("Inventory failure should trigger payment compensation")
void inventoryFailure_shouldTriggerPaymentCompensation() {
SagaInstance saga = createSagaInStep(SagaStep.RESERVE_INVENTORY);
SagaReplyEnvelope reply = SagaReplyEnvelope.failure(
saga.getSagaId(), "RESERVE_INVENTORY", "Out of stock", "INSUFFICIENT_STOCK"
);
orchestrator.handleSagaReply(reply, mock(Acknowledgment.class));
// Verify: compensation command sent to payment service
verify(kafkaTemplate).send(
eq(CommandTopics.PAYMENT_SERVICE),
eq(saga.getOrderId()),
any(RefundPaymentCommand.class)
);
}
}18. Operational Runbook
When a Saga Gets Stuck
Step 1: Identify stuck sagas
Query: SELECT * FROM saga_instances
WHERE status IN ('IN_PROGRESS', 'COMPENSATING')
AND updated_at < NOW() - INTERVAL 30 MINUTE;
Step 2: Check the current step
For each stuck saga, check what step it is on.
Is the required service up? Check CloudWatch / health endpoints.
Step 3: Check for DLT messages
Are there messages in the DLT for the relevant topics?
Review: SELECT * FROM dlt_messages WHERE status = 'PENDING_REVIEW';
Step 4: Check Outbox
Are events stuck in the outbox?
Query: SELECT COUNT(*) FROM outbox_events
WHERE status = 'PENDING' AND created_at < NOW() - INTERVAL 10 MINUTE;
Step 5: Attempt automatic recovery
Use the recovery API: POST /admin/sagas/{sagaId}/retry-current-step
Step 6: If automatic recovery fails
Review the error, fix the underlying issue, then:
POST /admin/dlt-messages/{messageId}/replay
Step 7: Last resort - manual force
POST /admin/sagas/{sagaId}/force-complete-step (with explicit reason)
Or: POST /admin/sagas/{sagaId}/force-fail (accept data inconsistency, fix manually)
Step 8: Post-incident
Document the incident, add new test case, fix root cause
19. Summary
| Category | Key Lesson |
|---|---|
| Isolation Anomalies | Use semantic locking and CQRS read models to hide intermediate saga state |
| Idempotency | Every handler MUST be idempotent. Use database unique constraints first. |
| Missing Compensations | Design ALL compensations before writing ANY code |
| Saga Explosion | Only use SAGA for genuine multi-service, multi-database transactions |
| Event Ordering | Use orderId as Kafka partition key for per-order ordering |
| Message Duplication | Expected behavior with at-least-once delivery. Idempotency is the fix. |
| Schema Evolution | Only add fields, never remove. Use @JsonIgnoreProperties always. |
| Stuck Compensations | Aggressive retry + escalation to human operator |
| Testing | Test every failure scenario, not just the happy path |
| Operations | Build runbook before deploying to production |
| Monitoring | sagaId in every log line. Alert on stuck sagas, DLT messages, outbox failures. |
Next: Part 7 - Interview Mastery
Master 60+ SAGA interview questions from junior to Principal Architect level,
complete with detailed answers, follow-up questions, and tips for handling tricky scenarios.
Series Navigation: Index |
Part 1 |
Part 2 |
Part 3 |
Part 4 |
Part 5 | Part 6 |
Part 7