Part 7: SAGA Patterns - Interview Mastery Guide

How to Use This Guide

Questions are ordered from MOST frequently asked to most specialized.
Each question includes:

Model answer
Key points to mention (even if not asked)
Common follow-up questions
What interviewers are really testing
Common mistakes to avoid

Expectation by Level:

Level	What They Expect
Junior/Mid (3-5 yrs)	Definition, types, basic trade-offs, simple code
Senior (5-8 yrs)	Production implementation, failure handling, idempotency
Staff/Principal (8+ yrs)	Architecture decisions, trade-offs, when NOT to use, system design
Technical Architect	Strategic choices, team guidance, multi-system impact

Core Foundation Questions (Most Frequently Asked)
Implementation Questions (Frequently Asked)
Failure Handling and Recovery Questions
Advanced Architecture Questions
Tricky and Situational Questions
Principal / Technical Architect Questions
System Design Questions
Recent Industry Trends (2024-2026)
Quick Reference Cheat Sheet
How to Handle Follow-Up Questions

1. Core Foundation Questions

Q1: What is a SAGA pattern and why do we need it?

Model Answer:

A SAGA is a pattern for managing distributed transactions across multiple microservices,
each with its own database. It sequences local transactions where each step commits to its own
database and triggers the next step. If any step fails, compensating transactions are executed
in reverse order to undo the effects of prior steps.

We need it because in microservices, you cannot use a single ACID transaction across multiple
databases. The classic problem: a customer places an order that needs to charge a card (Payment
Service), reserve inventory (Inventory Service), and schedule delivery (Shipping Service).
Each runs on a separate database. You cannot wrap all three in a @Transactional annotation.

SAGA solves this by ensuring either all steps eventually succeed, or all prior effects are
compensated. It accepts eventual consistency instead of requiring strong atomic consistency.
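
The flow described above can be sketched in a few lines of plain Java. This is a minimal in-memory illustration, not a production design: SagaStep, SimpleSaga, and SimpleSagaDemo are hypothetical names, the steps only append to a log, and a real saga would persist its state between steps so it can recover after a crash.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical types for illustration only.
final class SagaStep {
    final String name;
    final Runnable action;        // the local transaction
    final Runnable compensation;  // the semantic undo of that transaction

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

final class SimpleSaga {
    /** Runs steps in order. On failure, compensates completed steps in reverse order. */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step);  // most recently completed step sits on top
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }
}

final class SimpleSagaDemo {
    /** Order saga where the inventory step fails after order and payment succeeded. */
    static List<String> runFailingOrderSaga() {
        List<String> log = new ArrayList<>();
        List<SagaStep> steps = List.of(
            new SagaStep("order", () -> log.add("create order"), () -> log.add("cancel order")),
            new SagaStep("payment", () -> log.add("charge card"), () -> log.add("refund card")),
            new SagaStep("inventory",
                () -> { throw new IllegalStateException("out of stock"); },
                () -> { }));
        SimpleSaga.run(steps);
        return log;
    }
}
```

The failed step itself is not compensated because it never committed; only the steps that completed before it are undone, newest first.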

Key points to mention even if not asked:

It does NOT use database rollback - it uses compensating transactions (forward transactions)
It was first described in 1987 by Garcia-Molina and Salem for long-lived transactions
Two types: Choreography (event-driven) and Orchestration (central coordinator)

Follow-ups:

"Why not just use a distributed database?" (See Part 1)
"How is this different from 2PC?" (Next question)
"What is eventual consistency and how does it affect the user experience?"

Q2: How does SAGA differ from Two-Phase Commit (2PC)?

Model Answer:

2PC uses a Transaction Coordinator that drives two phases: Prepare (all participants vote) and
Commit/Rollback (coordinator decides). It provides strong atomicity but has serious problems:

Blocking: All participants hold locks while waiting for the coordinator's decision.
If the coordinator crashes after Prepare but before Commit, all participants are stuck
holding locks until it recovers, blocking every other transaction that needs the same rows.
Single Point of Failure: The coordinator is a SPOF. If it fails at the wrong moment,
the system enters an indeterminate state.
Performance: Every transaction requires at minimum 2 network round trips.
Ecosystem incompatibility: Requires XA protocol support. Kafka, Redis, and most NoSQL
stores do not support XA.

SAGA avoids all these by:

No distributed locks (each local transaction commits immediately)
No coordinator SPOF (choreography has none; orchestration has a RECOVERABLE coordinator)
No XA required (uses application-level coordination)
Trades strong atomicity for eventual consistency via compensation

The key trade-off in one sentence:
2PC gives you strong atomicity at the cost of availability and performance.
SAGA gives you high availability and performance at the cost of strong atomicity.

Interviewer is testing: Deep understanding of distributed systems trade-offs, not just definitions.

Q3: What are compensating transactions and how do they differ from database rollback?

Model Answer:

A compensating transaction is a NEW, forward-executing transaction that SEMANTICALLY undoes
the effects of a previously committed transaction.

Key distinction: Database rollback removes all trace of the original operation - it is as if
it never happened. A compensating transaction ACKNOWLEDGES the original operation happened
and creates a new operation that reverses its BUSINESS EFFECTS.

Example:

Original: INSERT INTO payments (order_id, amount, status) VALUES ('ORD-001', 99.99, 'CHARGED')
- This commits. The payment record is permanent.
Database rollback: The insert is gone. No evidence of attempted payment.
Compensating transaction: UPDATE payments SET status='REFUNDED', refund_date=NOW() WHERE order_id='ORD-001'
- The original payment row still exists, now marked REFUNDED with a refund date. The audit trail is preserved.

Three critical properties of a good compensating transaction:

Idempotent: If executed multiple times (due to retries), the result is the same.
A refund called twice should not result in two refunds.
Semantically correct: It must undo the BUSINESS EFFECT, not just delete data.
Cancelling an order marks it CANCELLED, not deleted.
Eventually always succeeds: Compensations should be designed to always eventually
complete, even if they require retries. A compensation that can permanently fail is
a system design problem.

Follow-up question: "Can a compensation fail?"
Yes, it can fail transiently. The system must retry. If it fails permanently (e.g., payment
gateway is shut down), you need a manual intervention process. This is why designing
compensations upfront, before forward steps, is critical.

Q4: What are the two types of SAGAs and when do you choose each?

Model Answer:

Choreography: Services communicate by publishing and consuming events. No central
coordinator. Each service reacts to events from other services and emits its own events.

Suitable when: Fewer than 5 services, simple linear flow, teams want loose coupling,
maximum throughput needed (no orchestrator bottleneck)
Problem: Hard to understand where a saga is at any point. Debugging requires looking
at events across many services.

Orchestration: A central orchestrator service drives the entire saga. It sends commands
to each service and receives responses. The orchestrator knows and controls every step.

Suitable when: More than 5 services, complex branching logic, need full saga visibility,
compliance and audit requirements, easier debugging is a priority
Problem: Orchestrator is a new service to build, deploy, and maintain. Services are coupled
to the orchestrator's command interface.

Decision framework:

1. How many services? > 5 -> Orchestration
2. Is the flow linear and simple? Yes -> Choreography may be fine
3. Do you need to answer "where is this saga right now?" easily? Yes -> Orchestration
4. Regulatory compliance and audit? Yes -> Orchestration
5. Maximum throughput? -> Choreography
6. Teams prefer independence? -> Choreography

Pro tip for interviews: Mention that real production systems often use BOTH.
Choreography for simple, high-volume flows. Orchestration for complex business processes.

Q5: How do you handle failures in a SAGA?

Model Answer:

SAGA failure handling has two dimensions: transient failures and business failures.

Transient failures (network timeouts, temporary service unavailability):

Retry with exponential backoff
Circuit breaker to prevent cascading failures
Dead Letter Queue after max retries

Business failures (payment declined, insufficient inventory):

These are NOT errors to retry. They are expected outcomes that trigger compensation.
The saga transitions to compensating state and executes compensating transactions in reverse order.
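
The split between the two failure types can be sketched as a retry helper. This is a minimal sketch under stated assumptions: BusinessFailureException and StepRetrier are hypothetical names, and every other RuntimeException is treated as transient, which a real service would narrow down to specific exception types.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Expected business outcome (for example, card declined). Never retried. Hypothetical type. */
final class BusinessFailureException extends RuntimeException {
    BusinessFailureException(String message) { super(message); }
}

final class StepRetrier {
    /** Delay before retry number `attempt` (1-based): base * 2^(attempt-1), capped at max. */
    static Duration backoff(int attempt, Duration base, Duration max) {
        long factor = 1L << Math.min(attempt - 1, 20);  // bound the shift to avoid overflow
        Duration delay = base.multipliedBy(factor);
        return delay.compareTo(max) > 0 ? max : delay;
    }

    /**
     * Runs the step up to maxAttempts times. Assumption for this sketch: any
     * RuntimeException other than BusinessFailureException is transient.
     */
    static <T> T call(Supplier<T> step, int maxAttempts, Duration base, Duration max) {
        for (int attempt = 1; ; attempt++) {
            try {
                return step.get();
            } catch (BusinessFailureException e) {
                throw e;  // not retried: the caller starts compensation
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e;  // retries exhausted: the caller routes the message to a DLQ
                }
                try {
                    Thread.sleep(backoff(attempt, base, max).toMillis());
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted during backoff", ie);
                }
            }
        }
    }
}
```

A circuit breaker would sit around the same call; it is left out here to keep the sketch short.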

The compensation chain:
When step N fails, execute compensations in reverse order: C(N-1), C(N-2), ..., C(1)

For an order saga:

If Inventory fails:
  1. Refund payment (C2 - payment compensation)
  2. Cancel order (C1 - order compensation)

Do NOT compensate in wrong order:
  Wrong: Cancel order THEN refund payment (order is cancelled before refund confirmed)
  Right: Refund payment THEN cancel order (audit trail shows refund before cancellation)

Important consideration: Steps AFTER the pivot transaction (payment taken, inventory reserved)
should be retried, not compensated. The business has already committed. Shipment creation
should be retried with backoff (5, 10, 20 times) and then escalated to a human, not rolled back.

Q6: What is idempotency and why is it critical for SAGAs?

Model Answer:

Idempotency means that executing an operation N times produces the same result as executing
it once. A POST /payments called 3 times with the same orderId should create exactly ONE payment.

In SAGAs, idempotency is non-negotiable because:

Message brokers use at-least-once delivery. The same event WILL eventually arrive more than once.
The consumer processes the event but crashes or disconnects before the offset is committed: Kafka redelivers the event.
The outbox publisher sends an event and crashes before marking it as PUBLISHED: the event is published again.

Without idempotency:

Double charges (payment processed twice)
Double reservations (inventory decremented twice)
Double shipments
Double refunds (compensation executed twice)

Implementation strategies:

Database unique constraint (strongest):

ALTER TABLE payments ADD UNIQUE INDEX idx_payment_order_id (order_id);
-- A duplicate insert now fails (Spring surfaces it as DataIntegrityViolationException) - catch and return early

Explicit status check (most readable):

if (payment.getStatus() == PaymentStatus.CHARGED) {
    log.info("Already charged, skipping (idempotent)");
    return;
}

Processed event ID store:

if (processedEventRepo.existsByEventId(event.eventId())) {
    return;  // Already handled this exact event
}

Follow-up: "How do you make a compensation idempotent?"
A refund should check: if payment status is already REFUNDED, return without calling the
payment gateway again. The compensation should be a no-op on duplicate execution.
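
The strategies above can be combined in one handler. The following is a minimal in-memory sketch with hypothetical names (IdempotentPaymentHandler and its fields); the set and map stand in for the processed_events and payments tables, and the counters stand in for calls to a payment gateway. In a real service the event-ID insert and the payment write would share one database transaction.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a payment event handler.
final class IdempotentPaymentHandler {
    enum Status { CHARGED, REFUNDED }

    private final Set<String> processedEventIds = new HashSet<>();        // processed_events
    private final Map<String, Status> paymentsByOrder = new HashMap<>();  // payments
    private int gatewayCharges = 0;
    private int gatewayRefunds = 0;

    /** Charges at most once per event ID and at most once per order ID. */
    synchronized void onOrderCreated(String eventId, String orderId) {
        if (!processedEventIds.add(eventId)) {
            return;  // exact same event delivered again
        }
        if (paymentsByOrder.containsKey(orderId)) {
            return;  // same order arriving under a different event ID
        }
        gatewayCharges++;
        paymentsByOrder.put(orderId, Status.CHARGED);
    }

    /** Compensation: a no-op unless the payment is currently CHARGED. */
    synchronized void refund(String orderId) {
        if (paymentsByOrder.get(orderId) != Status.CHARGED) {
            return;  // never charged, or already refunded
        }
        gatewayRefunds++;
        paymentsByOrder.put(orderId, Status.REFUNDED);
    }

    synchronized int gatewayCharges() { return gatewayCharges; }
    synchronized int gatewayRefunds() { return gatewayRefunds; }
}
```

Calling onOrderCreated or refund twice for the same order reaches the gateway only once in each direction.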

Q7: What is the Transactional Outbox Pattern and why is it needed in SAGAs?

Model Answer:

The Transactional Outbox pattern solves the "dual-write problem" - the inability to atomically
write to a database AND publish to a message broker in a single operation.

The problem without outbox:

@Transactional
public void createOrder(OrderRequest req) {
    orderRepo.save(order);          // Write to DB
    kafkaTemplate.send(topic, event); // Publish to Kafka - SEPARATE OPERATION!
    // If the DB commits but Kafka fails - inconsistency!
    // If Kafka succeeds but DB rolls back - phantom event!
}

The outbox solution:

@Transactional
public void createOrder(OrderRequest req) {
    orderRepo.save(order);
    outboxRepo.save(OutboxEvent.of(topic, orderId, event));  // SAME TRANSACTION!
    // Both writes are atomic. Either both succeed or neither does.
}
 
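// Separate scheduler reads the outbox and publishes to Kafka
@Scheduled(fixedDelay = 500)
public void publishOutboxEvents() {
    outboxRepo.findPending().forEach(event -> {
        try {
            // send() is asynchronous: wait for the broker ack before marking the row
            kafkaTemplate.send(event.getTopic(), event.getAggregateId(), event.getPayload()).get();
            outboxRepo.markPublished(event.getEventId());
        } catch (Exception e) {
            if (e instanceof InterruptedException) Thread.currentThread().interrupt();
            // The row stays pending and is retried on the next run
            log.warn("Outbox publish failed for {}", event.getEventId(), e);
        }
    });
}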

Key points:

Outbox guarantees at-least-once delivery (same event may be published multiple times if
publisher crashes between send and mark-published). Consumer MUST be idempotent.
For lower latency than polling, use Debezium CDC, which reads the MySQL binary log directly.
The outbox table is in the SAME database as the service's business tables.
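
Why the outbox gives at-least-once (not exactly-once) delivery can be shown with a small simulation. This is a sketch with hypothetical names (OutboxSimulation, OutboxRow): the lists stand in for the outbox table and the Kafka topic, and the crashAfterSend flag simulates the publisher dying between send and mark-published.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory simulation; no real database or broker is involved.
final class OutboxSimulation {
    static final class OutboxRow {
        final String eventId;
        boolean published = false;

        OutboxRow(String eventId) { this.eventId = eventId; }
    }

    final List<OutboxRow> outbox = new ArrayList<>();  // stands in for the outbox table
    final List<String> broker = new ArrayList<>();     // stands in for the Kafka topic

    /** In a real service this insert shares a transaction with the business write. */
    void saveEvent(String eventId) {
        outbox.add(new OutboxRow(eventId));
    }

    /** One publisher pass. With crashAfterSend, the pass dies after the first send. */
    void publishPending(boolean crashAfterSend) {
        for (OutboxRow row : outbox) {
            if (row.published) {
                continue;
            }
            broker.add(row.eventId);   // send to the broker
            if (crashAfterSend) {
                return;                // simulated crash: the row is still pending
            }
            row.published = true;      // mark published
        }
    }
}
```

After a crashed pass followed by a normal pass, the broker holds the same event twice, which is exactly the duplicate that consumers must tolerate.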

2. Implementation Questions

Q8: How do you implement a SAGA with Spring Boot and Kafka?

Model Answer:

High-level architecture:

Each service has an event handler class (e.g., PaymentSagaEventHandler)
Event handlers listen on Kafka topics with @KafkaListener
Business logic is in a service class
Service writes to DB AND outbox table in same @Transactional method
Outbox publisher (scheduled) reads outbox and sends to Kafka

Key Spring configuration:

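enable-auto-commit: false on consumer (offsets advance only on acknowledgment)
ack-mode: manual on the listener container (required for the Acknowledgment parameter)
acks: all on producer (wait for all in-sync replicas)
enable.idempotence: true on producer (broker discards duplicates caused by producer retries)
Manual acknowledgment: call ack.acknowledge() AFTER successful processing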

Critical: Never auto-commit Kafka offsets.
Auto-commit means the offset is committed regardless of whether processing succeeded.
If the application crashes after commit but before processing completes, the event is lost.
Manual acknowledgment ensures the offset only advances when processing is confirmed.

@KafkaListener(topics = "order.created", groupId = "payment-service")
public void handle(OrderCreatedEvent event, Acknowledgment ack) {
    paymentService.process(event);  // If this throws, offset NOT committed
    ack.acknowledge();              // Only acknowledges after successful processing
}

Follow-up: "How do you handle consumer rebalancing?"
When Kafka rebalances (a consumer joins or leaves the group), partitions are reassigned.
Messages that were processed but not yet committed may be redelivered to the new owner. This is another reason why idempotency is essential.

Q9: How do you design MySQL schemas for SAGA state management?

Model Answer:

Key tables needed:

saga_instances (orchestration only): Stores saga lifecycle
- saga_id (PK), order_id, status, current_step, payload (JSON), version (optimistic lock)
- Index on: status (for finding active sagas), updated_at (for finding stuck sagas)
outbox_events (per service): Transactional outbox
- event_id (PK), aggregate_id (used as the Kafka message key to keep per-aggregate ordering), topic_name, payload (JSON), status, retry_count
- Index on: status + created_at (for publisher batch reads)
processed_events (per service): Idempotency tracking
- event_id + service_id (composite PK)
- Auto-purge old records after 30 days
Business tables should have:
- status column with enum values including saga-in-progress states
- version column for optimistic locking
- saga_id for correlation

Locking strategy:

Inventory tables: pessimistic write lock (FOR UPDATE) - prevent concurrent reservations
Order/Payment tables: optimistic lock (version column) - concurrent updates are rare

-- Inventory: pessimistic because concurrent reservations are common
SELECT * FROM inventory WHERE product_id = ? FOR UPDATE;
 
// Order: optimistic because concurrent order updates are rare
@Version private Long version;  // JPA @Version annotation

Q10: How do you test a SAGA?

Model Answer:

SAGAs need a layered test strategy:

Unit Tests (fast, many):

Test orchestrator state machine transitions with mocked services
Test each event handler with mocked downstream calls
Test idempotency: call handler twice, verify only one write

Integration Tests (medium speed):

Use Testcontainers (Kafka + MySQL containers)
Test happy path: publish events in sequence, verify final state
Test compensation: simulate failure at each step, verify full compensation

Contract Tests:

Verify event schema compatibility (producer and consumer agree)
Use Pact or Spring Cloud Contract
Critical for preventing schema evolution breaking changes

Chaos/Resilience Tests:

Test with duplicate events (idempotency verification)
Test with out-of-order events (state machine guard)
Test with service crashes mid-saga (recovery verification)

Key testing principle:
Test failure paths as rigorously as the happy path.
For an N-step saga, you need N failure test cases (fail at each step) plus
N compensation test cases (verify each compensation runs correctly).

3. Failure Handling and Recovery Questions

Q11: What happens if the payment refund fails during compensation?

Model Answer:

This is a critical scenario with no perfect answer. Options:

Option 1: Aggressive Retry
Retry the refund indefinitely with exponential backoff. Payment gateways come back online.
Store the "refund pending" state and retry every N minutes until it succeeds.

@Scheduled(fixedDelay = 30000)  // Every 30 seconds
public void retryPendingRefunds() {
    paymentRepository.findByStatus(PaymentStatus.REFUND_PENDING)
        .forEach(this::attemptRefund);
}

Option 2: Manual Intervention Pipeline
After N retries, escalate to a human operator. Create a support ticket automatically.
The ops team manually verifies and processes the refund.

Option 3: Compensate the Compensation
If you absolutely cannot refund via the payment gateway, issue a credit or coupon.
This is a business decision, not a technical one.

The key insight for architects:
Compensation failures reveal business requirements that were not thought through. Every
compensation failure scenario should have a defined BUSINESS PROCESS to handle it.
Technical systems can only automate well-defined processes.

What interviewers are testing:
That you understand compensation failures are real, not theoretical. And that you have
a plan for them beyond "it will work."

Q12: How do you handle saga timeouts?

Model Answer:

Every saga should have a maximum allowed execution time. If a saga has not completed
within this time, it is considered stuck and intervention is required.

Detection:
A scheduled job (every 1-5 minutes) queries for sagas that are IN_PROGRESS or COMPENSATING
and have not been updated within the timeout threshold.

Response:

First: attempt to replay the current step's command (maybe the message was lost)
After N replays: alert the operations team via PagerDuty/SNS
After human review: force-complete a step or force-fail the saga

Different timeouts for different sagas:

Order saga: 30 minutes (usually completes in seconds)
International transfer: 24 hours (involves manual review)
Return processing: 7 days (customer ships item back)

Implementation note:
Do NOT rely on broker-level expiry for saga timeouts. Kafka has no per-message TTL, only
topic retention, and messages that expire at the broker are lost events, not controlled saga
timeouts. Use application-level timeout tracking.
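
The detection job can be sketched as a pure function over saga rows. This is a minimal sketch: StuckSagaDetector, SagaInstance, the saga type keys, and the 30-minute fallback are assumptions for illustration; the per-type timeouts are the ones listed above. In production the filter would be a SQL query against saga_instances using the status and updated_at indexes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical types; field names loosely follow the saga_instances table.
final class StuckSagaDetector {
    enum Status { IN_PROGRESS, COMPENSATING, COMPLETED, FAILED }

    record SagaInstance(String sagaId, String type, Status status, Instant updatedAt) { }

    // Timeouts per saga type; the type keys are made up for this sketch.
    static final Map<String, Duration> TIMEOUTS = Map.of(
        "ORDER", Duration.ofMinutes(30),
        "INTERNATIONAL_TRANSFER", Duration.ofHours(24),
        "RETURN", Duration.ofDays(7));

    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(30);  // assumed fallback

    /** Returns IDs of active sagas whose last update is older than their timeout. */
    static List<String> findStuck(List<SagaInstance> sagas, Instant now) {
        return sagas.stream()
            .filter(s -> s.status() == Status.IN_PROGRESS || s.status() == Status.COMPENSATING)
            .filter(s -> Duration.between(s.updatedAt(), now)
                .compareTo(TIMEOUTS.getOrDefault(s.type(), DEFAULT_TIMEOUT)) > 0)
            .map(SagaInstance::sagaId)
            .collect(Collectors.toList());
    }
}
```

Passing the current time in as a parameter keeps the check deterministic and easy to unit test.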

Q13: What is a "dirty read" in the context of SAGAs and how do you prevent it?

Model Answer:

In database transactions, a dirty read is reading uncommitted data from another transaction.
In SAGAs, the analog is reading intermediate saga state that has been committed locally but
represents a logically incomplete business transaction.

Example:

T1: Order created (status=PENDING) - committed to DB
T2: Customer checks order status - reads PENDING
T3: Payment fails, saga compensates, order goes to CANCELLED
The customer saw PENDING, which was a real committed state, but represented incomplete data

Prevention strategies:

Semantic locking: Add saga_in_progress boolean to entities.
Downstream reads return a "processing" response when true.
CQRS read models: Only update read models when saga completes.
Customers query read models, never the saga state tables.
Aggregate pending state: Accept that PENDING is a valid state.
Design UI to show "Order Processing" gracefully.
Saga version/timestamp check: Only show data where saga_completed_at IS NOT NULL.

The key insight:
Unlike ACID isolation which the database enforces for free, SAGA isolation must be
EXPLICITLY DESIGNED by the application developer. This is one of the key costs of using SAGAs.

Q14: How do you handle out-of-order event delivery?

Model Answer:

Out-of-order events happen because:

Different Kafka partitions have independent ordering
Network delays vary
Retry topics introduce delays
Consumer group rebalancing

Prevention (best solution):
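Use the same partition key (orderId) for all events in a saga. Events with the same key on the
same topic go to the same partition, and within a partition Kafka guarantees order. Ordering is
NOT guaranteed across different topics, so handlers still need state guards.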

kafkaTemplate.send(new ProducerRecord<>(topic, orderId, payload));
//                                              ^^^^^^^ same key = same partition

Detection and handling:
Add state guards in event handlers:

if (order.getStatus() != OrderStatus.INVENTORY_RESERVED) {
    // ShipmentCreated event arrived before InventoryReserved was applied
    // Store in pending event table, retry in 5 seconds
    pendingEventStore.storeForLater("ShipmentCreated", orderId, payload);
    return;
}

Acknowledgment strategy:
Do NOT reject (throw an exception) on out-of-order events - once retries are exhausted, the message goes to the DLT.
Instead: store the event for later reprocessing and acknowledge the original message.
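
The park-and-replay idea can be sketched without Kafka. This is a minimal in-memory sketch: OutOfOrderGuard and its enums are hypothetical, the map of parked events stands in for a pending-event table, and replay happens immediately when the missing event arrives rather than on a timer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical state guard; the status names follow the order saga in this guide.
final class OutOfOrderGuard {
    enum OrderStatus { PENDING, PAYMENT_PROCESSED, INVENTORY_RESERVED, SHIPMENT_CREATED }

    enum EventType {
        PAYMENT_PROCESSED(OrderStatus.PENDING, OrderStatus.PAYMENT_PROCESSED),
        INVENTORY_RESERVED(OrderStatus.PAYMENT_PROCESSED, OrderStatus.INVENTORY_RESERVED),
        SHIPMENT_CREATED(OrderStatus.INVENTORY_RESERVED, OrderStatus.SHIPMENT_CREATED);

        final OrderStatus requires;  // status the order must be in before this event applies
        final OrderStatus produces;  // status after the event is applied

        EventType(OrderStatus requires, OrderStatus produces) {
            this.requires = requires;
            this.produces = produces;
        }
    }

    private final Map<String, OrderStatus> orders = new HashMap<>();
    private final Map<String, List<EventType>> parked = new HashMap<>();  // pending-event table

    void createOrder(String orderId) { orders.put(orderId, OrderStatus.PENDING); }

    OrderStatus statusOf(String orderId) { return orders.get(orderId); }

    void onEvent(String orderId, EventType event) {
        OrderStatus current = orders.get(orderId);
        if (current == null) {
            throw new IllegalArgumentException("Unknown order: " + orderId);
        }
        if (current.ordinal() < event.requires.ordinal()) {
            // Arrived too early: park it and return normally (the message is acknowledged)
            parked.computeIfAbsent(orderId, k -> new ArrayList<>()).add(event);
            return;
        }
        if (current != event.requires) {
            return;  // order is already past this step: duplicate, ignore
        }
        orders.put(orderId, event.produces);
        // A step was applied, so parked events may now be ready
        List<EventType> waiting = parked.remove(orderId);
        if (waiting != null) {
            for (EventType e : waiting) {
                onEvent(orderId, e);
            }
        }
    }
}
```

Events that are still early after a replay are simply parked again, so the handler converges once the missing events arrive.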

4. Advanced Architecture Questions

Q15: How does SAGA integrate with CQRS?

Model Answer:

SAGA and CQRS complement each other well.

SAGA operates on the WRITE side (commands). It updates multiple service databases.
CQRS's query models (read side) should only reflect COMPLETED saga states.

The integration pattern:

Saga progresses through steps, updating each service's write-side database
When saga COMPLETES (or is compensated), it publishes a domain event
CQRS projection handler subscribes to domain events
Projection handler updates the read model only on saga completion

Why this matters:
Without this integration, customers querying order status during an active saga would see
intermediate states (PENDING -> PAYMENT_PROCESSED -> INVENTORY_RESERVED). This is confusing.
With CQRS, customers see either their previous final state or "processing" until saga completes.

The critical rule:
Never update CQRS read models with intermediate saga state.
Only project on terminal events (OrderConfirmed, OrderCancelled).

For architects:
This also means your read models are eventually consistent with respect to WRITE state.
A customer who just placed an order will not see it in their "My Orders" immediately.
The UI must handle this gracefully (show the order immediately from the API response,
not from the read model query).
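
The "project only on terminal events" rule can be sketched as a small projection handler. This is a minimal sketch with hypothetical names (OrderReadModel and the event records); a map stands in for the read-model table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical read-side projection; event names follow the order saga in this guide.
final class OrderReadModel {
    interface OrderEvent { String orderId(); }

    record PaymentProcessed(String orderId) implements OrderEvent { }
    record InventoryReserved(String orderId) implements OrderEvent { }
    record OrderConfirmed(String orderId) implements OrderEvent { }
    record OrderCancelled(String orderId) implements OrderEvent { }

    private final Map<String, String> statusByOrder = new HashMap<>();  // read-model table

    /** Updates the read model for terminal events only; intermediate events are ignored. */
    void project(OrderEvent event) {
        if (event instanceof OrderConfirmed) {
            statusByOrder.put(event.orderId(), "CONFIRMED");
        } else if (event instanceof OrderCancelled) {
            statusByOrder.put(event.orderId(), "CANCELLED");
        }
    }

    /** Empty while the saga is still running; the UI can show "processing" in that case. */
    Optional<String> statusOf(String orderId) {
        return Optional.ofNullable(statusByOrder.get(orderId));
    }
}
```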

Q16: When would you NOT use a SAGA?

Model Answer:

This is a critical question that separates senior engineers from juniors. The answer
"always use SAGA for microservices" is wrong.

Do NOT use SAGA when:

All data is in one database: If Order, Payment, and Inventory are all tables in
one MySQL database, use @Transactional. No saga needed.
Strong consistency is a hard business requirement: Financial double-entry bookkeeping
(debit MUST equal credit atomically). Redesign to keep these in one service with one DB.
Simple CRUD without cross-service writes: A service that only creates/updates its own
data does not need a saga.
Operations are not compensable: "Send an SMS" cannot be unsent. Design the saga so
non-compensable operations are after the pivot transaction (retriable, not compensated).
Team is not ready: SAGA adds enormous complexity. If the team cannot yet reason about
eventual consistency, compensating transactions, and idempotency, they will create more
bugs than they solve.
Performance is critical and latency budget is tight: SAGAs add latency (multiple round
trips, event publishing, outbox polling). If you need sub-10ms response, reconsider.

The best answer a principal engineer gives:
"SAGAs are a tool for a specific problem: multi-service transactions with high availability
requirements. If you do not have that specific problem, a simpler solution is better."

Q17: How would you monitor SAGAs in production?

Model Answer:

SAGA monitoring requires three layers:

Layer 1: Business Metrics

Saga success rate (target: > 99%)
Average saga completion time (e.g., order saga should complete in < 5 seconds)
Compensation rate (> 5% compensations indicate a systemic problem)
Sagas stuck > 30 minutes count (should be 0)

Layer 2: Technical Metrics

Kafka consumer lag per topic per service
Outbox pending events count (> 1000 = Kafka may be down)
Dead Letter Topic message count (alert on any new messages)
Circuit breaker state per service

Layer 3: Distributed Tracing

Every saga has a sagaId that is used as the trace correlation ID
Every service logs sagaId and orderId in every log line via MDC
AWS X-Ray or OpenTelemetry traces show end-to-end saga execution timeline

Key alerts to configure:

Saga stuck > 30 minutes: PagerDuty P2
DLT message received: PagerDuty P2
Outbox events stuck > 10 minutes: PagerDuty P1
Compensation rate > 10%: Slack notification
Saga failure rate > 1%: Slack notification

Tools: AWS CloudWatch Insights for log queries, CloudWatch Metrics for dashboards,
AWS X-Ray for distributed tracing.

Q18: Explain the concept of a pivot transaction in a SAGA.

Model Answer:

A pivot transaction is the transaction in a saga after which the saga should ONLY progress
forward (by retrying if necessary), not compensate backward.

Think of it as the "point of no return." Before the pivot: if anything fails, compensate.
After the pivot: if anything fails, retry aggressively until success.

Why it exists:
Some business commitments cannot be undone once made. Once you have:

Taken a customer's money AND reserved their items
Reserved a concert seat AND charged the card
Transferred funds AND notified the recipient

...you are committed. The customer is expecting delivery. You MUST complete the saga forward.

Identifying the pivot:
The pivot is typically the last compensable transaction. Everything before it can be
compensated. Everything after it should be retriable.

In the order saga:

Create Order (compensable: cancel order)
Process Payment (compensable: refund)
Reserve Inventory  <-- PIVOT: after this, customer expects delivery
Create Shipment (retriable: retry until carrier accepts)
Confirm Order (retriable: retry until status updated)
Send Email (fire-and-forget: cannot be undone, so comes last)

Common mistake in interviews:
Many candidates say "the pivot is the last step." Wrong. The pivot is a semantic boundary
based on business commitment, not position in the sequence.
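
The rule "compensate up to the pivot, retry forward after it" can be written as a small policy function. This is a minimal sketch: PivotPolicy and the step identifiers are hypothetical, and the step order and pivot are taken from the order saga listed above.

```java
import java.util.List;

// Hypothetical policy for the order saga described above.
final class PivotPolicy {
    enum OnFailure { COMPENSATE, RETRY_FORWARD }

    static final List<String> STEPS = List.of(
        "CreateOrder", "ProcessPayment", "ReserveInventory",
        "CreateShipment", "ConfirmOrder", "SendEmail");

    static final String PIVOT = "ReserveInventory";

    /**
     * A failure at or before the pivot means the pivot never committed, so the saga
     * compensates. A failure after the pivot is retried until it succeeds.
     */
    static OnFailure onFailure(String failedStep) {
        int failed = STEPS.indexOf(failedStep);
        if (failed < 0) {
            throw new IllegalArgumentException("Unknown step: " + failedStep);
        }
        return failed <= STEPS.indexOf(PIVOT) ? OnFailure.COMPENSATE : OnFailure.RETRY_FORWARD;
    }
}
```

Keeping the pivot as explicit configuration, rather than implied by step position, makes the business decision visible in code review.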

5. Tricky and Situational Questions

Q19: Your SAGA has been running for 2 hours and is stuck at step 3 of 5. Payment was charged. What do you do?

Model Answer:

This requires a methodical approach:

Step 1: Diagnosis

Check saga_instances: what is the current_step and retry_count?
Check saga_step_executions: what error did step 3 produce?
Check Kafka consumer lag: is step 3's consumer running?
Check service health: is the step 3 service healthy?
Check outbox_events: are there stuck pending events?

Step 2: Assess severity

Is payment charged? YES - this is CRITICAL. Customer money is held.
What step is it? Step 3 of 5 (likely inventory or shipping).
Has the step 3 service recovered? Check health endpoint.

Step 3: Take action

If step 3 service is healthy: replay current step command. The saga should proceed.
If step 3 service is down: wait for it to recover. Saga will auto-retry on recovery.
If step 3 will never recover: initiate compensation. Refund payment. Cancel order.
Alert customer.

Step 4: If compensation is also stuck:
Manual intervention. Ops team processes refund manually. Update saga status to FAILED.
Create a tracking ticket for customer support.

What interviewers are testing:
Operational maturity. Not just "how does SAGA work" but "what do YOU do at 2am when
a real saga is stuck and a customer's $500 is in limbo?"

Q20: You have two customers ordering the last item in inventory simultaneously. Both are in SAGAs. What happens?

Model Answer:

This is a classic SAGA concurrency problem. Without proper design, both sagas could succeed
in reserving the same item (overselling).

The race condition:

T1: SAGA-1 reads inventory: 1 item available
T2: SAGA-2 reads inventory: 1 item available  (reads same data, no lock yet)
T3: SAGA-1 decrements: available = 0
T4: SAGA-2 decrements: available = -1  (PROBLEM!)

Solution: Pessimistic locking on inventory

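// Must be called inside a transaction; the lock is held until commit or rollback.
// An explicit @Query is needed because "WithLock" is not a derivable property name.
// The productId field name is assumed here.
@Lock(LockModeType.PESSIMISTIC_WRITE)
@Query("SELECT i FROM InventoryItem i WHERE i.productId = :productId")
Optional<InventoryItem> findByProductIdWithLock(@Param("productId") String productId);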

Now:

T1: SAGA-1 acquires PESSIMISTIC WRITE LOCK on inventory row
T2: SAGA-2 tries to acquire lock - BLOCKED, must wait
T3: SAGA-1 checks: 1 available >= 1 needed. Reserves. Commits. LOCK RELEASED.
T4: SAGA-2 acquires lock. Checks: 0 available < 1 needed. RESERVATION FAILS.
T5: SAGA-2 triggers compensation (refund payment, cancel order).

Alternative: Optimistic locking with retry
Use @Version on the inventory row. If both try to update, one fails with
OptimisticLockingFailureException. The failed one retries. On retry, it sees 0 stock and fails gracefully.
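
The optimistic variant can be sketched without JPA. This is a minimal in-memory sketch: VersionedInventory and Row are hypothetical, and the update method mimics a statement of the form UPDATE ... SET available = ?, version = version + 1 WHERE version = ?, which affects zero rows when another writer got there first.

```java
// Hypothetical in-memory stand-in for a versioned inventory row.
final class VersionedInventory {
    record Row(int available, long version) { }

    private Row row;

    VersionedInventory(int available) {
        this.row = new Row(available, 0);
    }

    synchronized Row read() {
        return row;
    }

    /** Succeeds only if nobody changed the row since expectedVersion was read. */
    synchronized boolean update(long expectedVersion, int newAvailable) {
        if (row.version() != expectedVersion) {
            return false;  // version conflict: another saga updated the row first
        }
        row = new Row(newAvailable, expectedVersion + 1);
        return true;
    }

    /** Read, check stock, conditional update. Reloads and retries on a version conflict. */
    boolean reserve(int quantity, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Row current = read();
            if (current.available() < quantity) {
                return false;  // out of stock: a business failure, triggers compensation
            }
            if (update(current.version(), current.available() - quantity)) {
                return true;
            }
        }
        return false;  // still conflicting after maxAttempts; treated as a failure here
    }
}
```

With one item in stock, the saga that loses the version check reloads the row, sees zero available, and fails cleanly instead of driving the count negative.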

What about the compensation in SAGA-2?
SAGA-2 publishes InventoryReservationFailedEvent -> Payment Service refunds -> Order cancelled.
SAGA-2's customer gets a notification: "Sorry, we're out of stock."

This is the CORRECT behavior. Not a bug. Only one customer can win the race; the other gets a clean, compensated failure.

Q21: If both Choreography and Orchestration have trade-offs, why not combine them?

Model Answer:

Combining both is actually common in mature systems. This is called a "Hybrid" approach.

Pattern: Orchestration of Choreographies

A central orchestrator handles the high-level saga flow
Within each "domain" (e.g., payment domain), choreography handles internal events

SAGA Orchestrator (orchestration)
    |--- ProcessPaymentCommand ---> Payment Bounded Context
    |                                 PaymentService (internal choreography)
    |                                 FraudService (reacts to payment events)
    |                                 LedgerService (reacts to payment events)
    |<-- PaymentProcessedReply ---    (orchestrator only sees the final result)

Pattern: Choreography with Saga Tracker

Services use choreography (events) for communication
A separate Saga Tracker service OBSERVES events and maintains saga state
No central coordinator, but full visibility

// SagaTrackerService.java - observes all saga-related events
@KafkaListener(topicPattern = ".*\\.events\\..*")
public void trackEvent(SagaEvent event) {
    sagaStateRepository.recordEvent(event.sagaId(), event.getClass().getSimpleName());
    // Saga tracker provides visibility WITHOUT participating in the saga flow
}

The key insight for architects:
Choose the coordination model based on the DOMAIN and TEAM structure, not architecture
purity. A team that owns payment and fraud might use choreography within their domain.
The cross-domain coordination uses orchestration for visibility.

Q22: How do you handle schema evolution of events published to Kafka?

Model Answer:

Schema evolution in event-driven systems is one of the hardest problems in practice.

The core constraint:
Events are contracts between producers and consumers. Any breaking change breaks consumers.

Safe changes (backward compatible):

Adding optional fields with defaults
Adding new event types (new topics)

Breaking changes (NEVER do without migration):

Removing fields
Renaming fields
Changing field types

Strategy: Schema Registry
Use AWS Glue Schema Registry or Confluent Schema Registry to enforce compatibility.
Configure compatibility mode: BACKWARD (consumers on the new schema can read old messages) or
FULL (backward and forward: new consumers can read old messages AND old consumers can read new messages).

Strategy: Event Versioning
Include a version field in every event. Route to version-specific handlers.
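
Routing by version can be sketched as a lookup from version number to handler. This is a minimal sketch: VersionedEventRouter is hypothetical, events are plain maps instead of deserialized classes, the v1/v2 field names are made up for illustration, and treating a missing version as v1 is an assumption.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical router; assumes v2 renamed the field "total" to "totalAmount".
final class VersionedEventRouter {
    private static final Map<Integer, Function<Map<String, Object>, String>> HANDLERS = Map.of(
        1, e -> "order " + e.get("orderId") + " amount " + e.get("total"),
        2, e -> "order " + e.get("orderId") + " amount " + e.get("totalAmount"));

    static String handle(Map<String, Object> event) {
        // Assumption: events published before versioning was introduced are v1
        int version = ((Number) event.getOrDefault("version", 1)).intValue();
        Function<Map<String, Object>, String> handler = HANDLERS.get(version);
        if (handler == null) {
            // Unknown version: fail loudly rather than guess at the schema
            throw new IllegalArgumentException("Unsupported event version: " + version);
        }
        return handler.apply(event);
    }
}
```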

Strategy: Tolerant Reader Pattern

@JsonIgnoreProperties(ignoreUnknown = true)  // ALWAYS add this to event classes
public record OrderCreatedEvent(
    String orderId,
    String customerId,
    BigDecimal totalAmount
) {}
// Any new fields added to the event are silently ignored by consumers using old code

Migration process (renaming a field):

Add the new field to the producer alongside the old one (consumers still work - unknown fields ignored)
Deploy all consumers to handle both old and new field names
Remove the old field from the producer (consumers no longer rely on it)

6. Principal / Technical Architect Questions

Q23: You are designing a payment system for a bank. Should you use SAGAs?

Model Answer:

This requires careful analysis. The honest answer is: "It depends on the specific requirement,
but probably not for the core ledger operations."

Core banking ledger: PROBABLY NOT SAGA

Double-entry bookkeeping requires ABSOLUTE atomicity: debit and credit must either both
happen or neither happen. No eventual consistency acceptable.
Financial regulators may require atomic transaction guarantees.
Design: keep debits and credits in ONE service with ONE database. Use @Transactional.

Cross-bank transfers: SAGA IS APPROPRIATE

Transferring between banks involves external systems (SWIFT, ACH, Fedwire).
You cannot enlist these systems in a local transaction, and confirmation may arrive long after the request.
You MUST use SAGA or a similar pattern.
The saga: debit local account -> initiate external transfer -> on confirmation, mark complete.
Compensation: if the transfer fails, credit the local account back.

Specific saga design for bank transfer:

Semantic lock: place "hold" on funds while saga is active (prevents double-spend)
Pivot: external transfer initiation (after this, retry until confirmation)
Compensation: release hold if external transfer is rejected
Long-running: international transfers may take hours/days
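
The hold, capture, and release lifecycle above can be sketched as follows. This is an in-memory illustration under stated assumptions, not a ledger design: `HeldFundsAccount` and its method names are hypothetical, amounts are in cents, and a real system would persist holds in the same database transaction as the saga state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory account illustrating a semantic lock (hold) on funds.
final class HeldFundsAccount {
    private long balance;
    private final Map<String, Long> holds = new HashMap<>();

    HeldFundsAccount(long openingBalance) { this.balance = openingBalance; }

    long balance() { return balance; }

    // Funds not reserved by any active saga.
    long available() {
        long held = holds.values().stream().mapToLong(Long::longValue).sum();
        return balance - held;
    }

    // Saga start: reserve funds without moving them.
    boolean placeHold(String sagaId, long amount) {
        if (holds.containsKey(sagaId)) return true;  // idempotent on retry
        if (amount > available()) return false;      // rejecting this prevents double-spend
        holds.put(sagaId, amount);
        return true;
    }

    // External transfer confirmed: turn the hold into a real debit.
    void capture(String sagaId) {
        Long amount = holds.remove(sagaId);
        if (amount != null) balance -= amount;
    }

    // Compensation: transfer rejected, so the funds become available again.
    void release(String sagaId) { holds.remove(sagaId); }
}
```

Because `capture` and `release` are no-ops when the hold is already gone, both are safe to retry, which matters for a saga that may wait hours or days for confirmation.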

What interviewers are testing:
That you can identify when NOT to apply a pattern. "Use SAGA for everything in microservices"
is wrong. Strong consistency requirements should reshape the service boundary.

Q24: How would you explain the SAGA pattern to a non-technical executive?

Model Answer:

"Imagine our customer service rep trying to book a hotel, flight, and car rental simultaneously
for an overseas trip. They call the hotel, book a room. Call the airline, book a seat. Call
the car company... and the model they want is not available.

Now what? They call back the hotel and cancel the room. They call the airline and cancel the
seat. They try a different car company. Eventually, they book all three or give up and refund
everything.

That is exactly what our distributed transaction system does when processing orders. Each step
(payment, inventory, shipping) is like booking one part of the trip. If any step fails, we have
a trained process that undoes the previous steps - refunding the payment, releasing inventory.

The key difference from a simpler system: we do not wait for all three confirmations before
telling you 'done.' We start all three in sequence, and if anything goes wrong, we handle it
gracefully in the background. You either get a confirmed order or a cancelled order with a
refund - never a stuck 'processing forever' state."

Why this matters in interviews:
Principal Architects are often asked to communicate with business stakeholders. Demonstrating
that you can explain complex patterns in business terms is a differentiator.

Q25: What are the operational challenges of SAGAs at scale (millions of sagas/day)?

Model Answer:

At scale, several operational challenges become critical:

1. Database: saga_instances table growth

At 1 million sagas/day, after 30 days: 30 million rows
Regular archival: Move completed sagas older than 30 days to archive table
Partition the table by created_at (monthly partitions)
Use separate read replicas for monitoring queries

2. Kafka partition scaling

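Per-partition throughput is bounded by consumer processing time, not by Kafka itself: a handler that takes 20-100 ms per event sustains roughly 10-50 sagas/second per partition
Calculate partition count from peak throughput, not the daily average
Increasing partitions triggers a consumer group rebalance (brief processing pause) and changes the key-to-partition mapping, so per-key ordering is not preserved across the change
Plan partition counts 2-3x ahead of current need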
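
The sizing rule above reduces to simple arithmetic. The helper below is a hypothetical sketch; the per-partition throughput and the peak rate are inputs you would have to measure for your own handlers.

```java
// Hypothetical sizing helper; all inputs are estimates measured on your own system.
final class PartitionPlanner {
    static int requiredPartitions(double peakSagasPerSecond,
                                  double sagasPerSecondPerPartition,
                                  double headroomFactor) {
        if (sagasPerSecondPerPartition <= 0) {
            throw new IllegalArgumentException("per-partition throughput must be > 0");
        }
        // Partitions needed at peak, multiplied by headroom, rounded up.
        double raw = peakSagasPerSecond / sagasPerSecondPerPartition * headroomFactor;
        return (int) Math.ceil(raw);
    }
}
```

For example, 1 million sagas/day averages about 11.6 sagas/second (1,000,000 / 86,400). Assuming a 5x peak of 58 sagas/second, 20 sagas/second per partition, and 3x headroom, `requiredPartitions(58, 20, 3)` gives 9 partitions.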

3. Outbox publisher contention

At high volume, the outbox publisher can become a bottleneck
Solution: Multiple outbox publisher instances, each claiming a partition of rows
Or: Switch to Debezium CDC (reads the database transaction log instead of polling the outbox table)
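
One way to let several publisher instances each claim their own rows is to shard by aggregate id. The sketch below is an assumption about how that could look, not a prescribed design: `OutboxSharding` and its methods are hypothetical, and in practice the filter would run as a WHERE clause rather than in memory.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sharding rule: publisher i only publishes rows whose aggregate id
// hashes to shard i, so instances never compete for the same row.
final class OutboxSharding {
    static int shardFor(String aggregateId, int publisherCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(aggregateId.hashCode(), publisherCount);
    }

    static List<String> rowsFor(int publisherIndex, int publisherCount, List<String> aggregateIds) {
        return aggregateIds.stream()
                .filter(id -> shardFor(id, publisherCount) == publisherIndex)
                .collect(Collectors.toList());
    }
}
```

Sharding on the aggregate id rather than the row id keeps all events for one order on one publisher, which preserves their relative order.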

4. Dead Letter Topic backlog

At scale, even 0.01% failure rate = 100 DLT messages/million sagas
Need automated DLT categorization: infrastructure issues (auto-retry) vs bugs (alert)
Track DLT trends, not just absolute counts
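
A minimal sketch of automated DLT categorization, assuming the failure cause is available as a Java exception. The `DltTriage` class and the choice of which exception types count as "infrastructure" are illustrative assumptions; a real classifier would use your own exception hierarchy.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

// Hypothetical triage rule for dead-letter messages, keyed on the failure cause.
final class DltTriage {
    enum Action { AUTO_RETRY, ALERT }

    static Action classify(Throwable cause) {
        // Transient infrastructure problems usually succeed on a later attempt.
        if (cause instanceof IOException || cause instanceof TimeoutException) {
            return Action.AUTO_RETRY;
        }
        // Anything else (validation errors, null pointers, ...) is treated as a bug.
        return Action.ALERT;
    }
}
```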

5. Observability overhead

At 1M sagas/day, CloudWatch Logs Insights queries become expensive
Consider sampling strategies for metrics (100% for errors, 1% for success paths)
Aggregate metrics early (service-level stats, not individual saga logs)
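
The sampling idea can be sketched as a pure function of the saga id. `SagaLogSampler` is a hypothetical name, and hash-based selection is one possible choice among several.

```java
// Hypothetical sampling rule: keep every failed saga and a deterministic ~1% of successes.
final class SagaLogSampler {
    static boolean shouldRecord(String sagaId, boolean failed) {
        if (failed) {
            return true; // 100% of error paths
        }
        // Hash-based sampling keeps or drops ALL log lines of a given saga together,
        // on every service, because each service computes the same answer.
        return Math.floorMod(sagaId.hashCode(), 100) == 0;
    }
}
```

Sampling on the saga id instead of a random number means a sampled saga is traceable end to end, rather than leaving fragments from some services only.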

Q26: Describe how you would migrate a monolith using @Transactional to microservices with SAGAs.

Model Answer:

This is the "strangler fig" migration combined with SAGA introduction. Do it incrementally.

Phase 1: Extract Services Without SAGAs (safe start)
Extract services that have NO cross-service transactions first.
User profile, notifications, analytics - these are read-heavy and do not need SAGA.

Phase 2: Introduce Events Alongside Transactions
Before removing the monolith transaction, add event publishing alongside it (write events to an outbox table in the same transaction to avoid dual writes).
Keep @Transactional as the primary consistency mechanism.
Let consumers (new services) read from events.

Phase 3: Extract One Service at a Time
Extract Payment Service first (it has a natural, clean boundary).
The monolith now calls Payment Service via REST for payment operations.
Payment Service has its own database.
This is where you introduce your FIRST SAGA.

Phase 4: Implement the SAGA
Replace the @Transactional call that spanned order + payment with:

Order Service publishes OrderCreatedEvent
Payment Service consumes and processes payment
Payment Service publishes result
Order Service reacts to result

Phase 5: Validate and Monitor
Run the monolith and the new saga path in parallel (shadow mode), with real side effects such as card charges disabled on the shadow path.
Compare results. Fix discrepancies.
Gradually shift traffic to new path.
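
The comparison step in shadow mode can be as simple as diffing final order statuses from both paths. This is a hypothetical sketch; `ShadowComparator` and the status strings are assumptions, and a real comparison would likely cover amounts and timestamps as well.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical shadow-mode check: the monolith result is treated as the source of truth.
final class ShadowComparator {
    // Returns the sorted order ids whose final status differs, or is missing, on the saga path.
    static List<String> mismatches(Map<String, String> monolithStatus,
                                   Map<String, String> sagaStatus) {
        List<String> diffs = new ArrayList<>();
        for (Map.Entry<String, String> entry : monolithStatus.entrySet()) {
            String shadow = sagaStatus.get(entry.getKey());
            if (!entry.getValue().equals(shadow)) {
                diffs.add(entry.getKey());
            }
        }
        Collections.sort(diffs);
        return diffs;
    }
}
```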

The golden rule:
Never try to extract multiple services simultaneously. The risk of failure is too high.
One service at a time. Validate. Stabilize. Then extract the next.

7. System Design Questions

Q27: Design the SAGA architecture for a ride-hailing application (like Uber).

Model Answer:

Ride-hailing involves multiple time-sensitive transactions:

The main saga: Book a Ride

Step 1: Create Trip Request (Trip Service)
Step 2: Match with Driver (Matching Service) - LONG WAIT, up to 5 minutes
Step 3: Authorize Payment (Payment Service) - semantic lock: hold funds
Step 4: Confirm Driver Acceptance (Trip Service) - PIVOT
Step 5: Track trip in real-time (Tracking Service) - retriable
Step 6: Process final payment when trip ends (Payment Service) - retriable

Key design decisions:

Choreography for real-time events (driver location updates, trip status)
- These events are high volume, low latency, fire-and-forget
- Kafka with short retention
Orchestration for payment saga
- Payment is critical, needs full visibility and compliance
- AWS Step Functions with explicit state for each payment step
Timeout handling (critical)
- Driver not found in 5 min: cancel saga, try wider radius or notify "no driver available"
- Driver accepts but does not arrive: timer escalation
- Payment authorization expires: re-authorize before trip ends
Semantic locking
- Once payment is authorized (hold placed) and the driver is confirmed, the saga is committed
- From that point, cancelling requires compensation (release hold, refund any charged amount)
Non-compensable step strategy
- SMS notification to driver: send only after matching is confirmed (cannot unsend)
- Push notification to rider: send only after driver match (cannot unsend)
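
The "driver not found in 5 minutes" timeout can be sketched with the JDK's `CompletableFuture` (Java 9+). The `DriverMatchTimeout` class and the `NO_DRIVER_AVAILABLE` marker value are hypothetical; a production saga would persist the deadline so it survives a restart, which an in-memory timer does not.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the asynchronous driver-matching call.
final class DriverMatchTimeout {
    static final String NO_DRIVER = "NO_DRIVER_AVAILABLE";

    // Completes the given future with NO_DRIVER if matching has not finished in time.
    // Note: completeOnTimeout acts on the passed future itself and returns it.
    static CompletableFuture<String> matchWithTimeout(CompletableFuture<String> matching,
                                                      long timeout, TimeUnit unit) {
        return matching.completeOnTimeout(NO_DRIVER, timeout, unit);
    }
}
```

The caller can then branch on the result: a driver id continues the saga, while the marker value triggers the cancel or widen-radius path.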

Q28: Design the SAGA for an airline booking system.

Model Answer:

An airline booking needs to handle seat selection, payment, and potentially partner services.

The Booking SAGA:

Step 1: Reserve Seat Temporarily (Seat Service) - 15-minute hold (semantic lock)
Step 2: Check Frequent Flyer Status (FF Service) - may affect price
Step 3: Process Payment (Payment Service) - PIVOT
Step 4: Issue Ticket (Ticket Service) - retriable, critical after pivot
Step 5: Update Frequent Flyer Points (FF Service) - retriable, asynchronous (does not block the booking)
Step 6: Send Confirmation Email (Notification Service) - after confirmation

Special considerations:

Seat hold with TTL: The seat reservation has a 15-minute TTL.
If saga does not complete in 15 minutes, seat is auto-released.
This prevents seats being held indefinitely by stuck sagas.
Overbooking handling: Airlines deliberately oversell by 5-10%.
The seat reservation may "succeed" even if the flight is technically full.
Compensation for overbooking: upgrade, voucher, or rebooking.
Partner airline code-share: Booking involves calling a partner airline's API.
External systems may be slow. Must handle long waits gracefully.
Non-critical lookups such as Step 2 can use a short timeout (for example, 5 seconds) with graceful degradation.
Price volatility during saga: Price may change between step 1 and step 3.
Lock in the price at step 1. Communicate price clearly to customer BEFORE payment.
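
The seat hold with TTL can be sketched as an expiry timestamp checked on every access. This is an in-memory illustration with hypothetical names (`SeatHolds`, `tryHold`); in production the expiry would live in the Seat Service database or a cache with native TTL support.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical seat-hold registry with a 15-minute TTL per hold.
final class SeatHolds {
    private static final Duration TTL = Duration.ofMinutes(15);
    private final Map<String, Instant> expiryBySeat = new HashMap<>();

    // Returns true if the seat was free (or its previous hold had expired) and is now held.
    boolean tryHold(String seat, Instant now) {
        if (isHeld(seat, now)) {
            return false; // still held by another saga
        }
        expiryBySeat.put(seat, now.plus(TTL));
        return true;
    }

    // An expired hold counts as released, so a stuck saga cannot block the seat forever.
    boolean isHeld(String seat, Instant now) {
        Instant expiry = expiryBySeat.get(seat);
        return expiry != null && now.isBefore(expiry);
    }
}
```

Passing `now` in explicitly, instead of reading the system clock inside the class, keeps the expiry logic easy to test.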

8. Recent Industry Trends (2024-2026)

Q29: How has AWS Step Functions Express Workflow changed SAGA design?

Model Answer:

Step Functions Standard Workflows are built for long-running, auditable executions: up to
1 year per execution, with execution history retained for 90 days after completion. They are
billed per state transition, so high-volume, short-duration sagas become expensive.

Step Functions Express Workflow (launched 2019, matured by 2024):

Runs up to 5 minutes per execution
Can handle 100,000+ executions per second
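At-least-once execution guarantee for asynchronous Express Workflows, so every step must be idempotent
Much lower cost than Standard for high-volume
No execution history stored by Step Functions; send it to CloudWatch Logs if you need it (this is a trade-off)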

When to use Express vs Standard:

Criteria	Express	Standard
Duration	< 5 minutes	Up to 1 year
Throughput	100K+/sec	2,000/sec
History	CloudWatch Logs only (if enabled)	90 days
Guarantee	At-least-once	Exactly-once
Cost	Low	Higher
Use case	Order saga	Loan approval saga

The 2024-2026 trend:
Teams moving high-volume order sagas to Express Workflows.
Long-running business process sagas (insurance claims, loan processing) stay on Standard.

Q30: What is the role of Temporal in modern SAGA implementations?

Model Answer:

Temporal is an open-source workflow engine that grew out of Uber's Cadence project (publicly
cited users include Netflix, HashiCorp, and Stripe). It dramatically simplifies SAGA implementation.

What Temporal solves:

No need for custom outbox pattern - Temporal handles durability automatically
No need for custom state machine - workflow code is the state machine
No need for custom retry logic - declarative retry policies
No need for custom timeout tracking - built-in timeouts per activity
Automatic crash recovery - workflow state is rebuilt by replaying its recorded event history, so execution resumes where it left off

Code comparison:

Custom SAGA (infrastructure you write yourself):

State machine + outbox + retry scheduler + timeout manager + recovery job
= ~1000 lines of infrastructure code per saga

Temporal SAGA (clean business logic):

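// Body of a workflow method; each call stands for an activity invocation (names are illustrative).
// Simplified: in the Temporal Java SDK, activity errors surface wrapped in ActivityFailure.
try {
    createOrder();
    processPayment();
    reserveInventory();
    createShipment();
} catch (PaymentException e) {
    // Only the order exists at this point
    cancelOrder("payment failed");
} catch (InventoryException e) {
    // Undo in reverse order: payment first, then the order
    refundPayment();
    cancelOrder("out of stock");
} catch (ShipmentException e) {
    // The last step failed: undo inventory, payment, and order
    releaseInventory();
    refundPayment();
    cancelOrder("shipment failed");
}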
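
The try/catch above grows one branch per step. A common alternative is to register a compensation as each step succeeds and run them in reverse on failure. The sketch below shows that idea in plain Java under stated assumptions: `CompensationStack` is a hypothetical class, not part of the Temporal SDK, and it ignores retries and durability.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: remembers how to undo each completed step.
final class CompensationStack {
    private final Deque<Runnable> compensations = new ArrayDeque<>();

    // Runs a step and, only if it succeeds, records its compensation.
    void run(Runnable step, Runnable compensation) {
        step.run();
        compensations.push(compensation);
    }

    // Undoes completed steps in reverse order (last completed step first).
    void compensate() {
        while (!compensations.isEmpty()) {
            compensations.pop().run();
        }
    }
}
```

A failed step never registers its own compensation, so only steps that actually committed are undone, and adding a new step does not require touching the error handling of the others.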

When to use Temporal:

New systems without existing Kafka/Spring infrastructure
Complex sagas with branching, parallel, and conditional steps
Teams willing to learn Temporal's concepts
When Temporal's operational overhead is acceptable

When NOT to use Temporal:

Existing Spring/Kafka infrastructure is mature and working
Team already skilled in custom saga patterns
Regulatory requirements prevent third-party workflow engine

9. Quick Reference Cheat Sheet

SAGA PATTERN QUICK REFERENCE

What: Sequence of local transactions with compensating transactions for rollback.
Why: Cannot do ACID across multiple service databases.
Types: Choreography (events) | Orchestration (central coordinator)

Compensating Transaction:
  - Forward transaction that reverses business effects
  - NOT database rollback
  - Must be idempotent
  - Must eventually always succeed (with retries/fallback)

Pivot Transaction:
  - The go/no-go step: once it commits, the saga must run to completion.
  - Steps before it are compensable; steps after it are retried, never compensated.
  - Example: After payment charged AND inventory reserved -> ship at all costs

Outbox Pattern:
  - Solve dual-write: write to DB + write to outbox table in SAME transaction
  - Publisher reads outbox and publishes to Kafka
  - Guarantees at-least-once delivery

Idempotency Rule:
  - EVERY event handler MUST be idempotent
  - Use DB unique constraints, status checks, or processed-event table
  - At-least-once delivery = duplicates WILL arrive

Key Trade-offs:
  - No isolation: other transactions see intermediate states (solve with CQRS + semantic locking)
  - Eventual consistency only (not strong consistency)
  - Complexity: double the code surface (forward + compensation for every step)
  - Debugging: harder (distributed across many services)

When to use:
  + Multi-service transactions required
  + High availability and throughput
  + Independent scaling per service

When NOT to use:
  - Single database operations -> @Transactional
  - Strong consistency mandatory -> redesign service boundaries
  - Simple CRUD -> no pattern needed
  - Team not ready -> build expertise first

Key metrics:
  - Saga success rate (target > 99%)
  - Compensation rate (> 5% = systemic issue)
  - Average saga duration
  - Stuck saga count (target = 0)
  - DLT message count (target = 0, alert on any)

10. How to Handle Follow-Up Questions

When They Say "Tell Me More About..."

About compensations:
Go deeper: idempotency, what happens when compensation fails, non-compensable steps,
pivot transaction concept, semantic vs syntactic undo.

About distributed tracing:
Explain sagaId as correlation ID, MDC context propagation, AWS X-Ray, CloudWatch Logs Insights.

About testing:
Unit tests for state machine, integration tests with Testcontainers, chaos testing with
duplicate events and out-of-order events.

When They Say "What Are the Risks?"

Always answer with BOTH the risk AND your mitigation:

"Intermediate states visible to users" -> "mitigated by CQRS read models and semantic locking"
"Compensation failures" -> "mitigated by aggressive retry and human intervention pipeline"
"Event duplication" -> "mitigated by idempotency at every handler"
"Debugging complexity" -> "mitigated by distributed tracing with sagaId correlation"

When They Ask "Have You Implemented This?"

If you have:

Describe the specific saga type (order, payment, etc.)
Mention scale (how many sagas/day, what uptime)
Describe a real failure scenario and how you handled it
Mention what you would do differently now

If you have not:

Say so honestly
Describe how you WOULD implement it based on your learning
Reference this knowledge: "Based on my understanding of the pattern..."
Ask a clarifying question to show you understand the domain: "What type of business
process would this saga handle? That would influence my design choices."

The Architect's Response Formula

For any design question, structure your answer as:

Requirement clarification: "Before deciding, I would need to know..."
Options with trade-offs: "There are two approaches: A and B. A gives X but costs Y. B gives Z but costs W."
Recommendation with context: "Given your scale/team/requirements, I would recommend A because..."
Risk acknowledgment: "The main risks are X and Y. We would mitigate them by..."
Iteration: "We would start with a simple implementation and evolve it as we learn from production."

End of SAGA Patterns in Microservices and Distributed Systems - Complete Guide
Series: 8 documents, 7 Parts + Index

Series: Saga Demystified
