6/6/2026 · Admin Post


Consistency Models - Part 5: Pitfalls, Anti-Patterns, Trade-Offs, and Tips

Navigation: Index | Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6


Table of Contents

  1. Anti-Pattern 1: The Dual-Write Problem
  2. Anti-Pattern 2: Eventual Consistency Everywhere
  3. Anti-Pattern 3: Missing Idempotency in Retry Logic
  4. Anti-Pattern 4: Ignoring Read-Your-Writes in User Flows
  5. Anti-Pattern 5: Using SERIALIZABLE Isolation Everywhere
  6. Anti-Pattern 6: Long-Running Transactions
  7. Anti-Pattern 7: Cache Without Expiry and Without Invalidation
  8. Anti-Pattern 8: Cache Stampede and Thundering Herd
  9. Anti-Pattern 9: Split-Brain in Distributed Locking
  10. Anti-Pattern 10: Out-of-Order Event Processing
  11. Anti-Pattern 11: Tight Coupling via Synchronous Distributed Transactions
  12. Anti-Pattern 12: Missing Compensation in SAGA
  13. Real Production Failures and Lessons
  14. Consistency Trade-Off Matrix
  15. Architectural Decision Framework
  16. Tips and Tricks from Production
  17. Consistency Observability and Debugging

1. Anti-Pattern: The Dual-Write Problem

What It Is

Writing to two different systems (e.g., database + Kafka) in sequence without atomicity. If the second write fails, the systems are in an inconsistent state.

The Bug (Bad Code)

// ANTI-PATTERN: Two separate writes with no atomicity guarantee
@Service
public class OrderServiceBroken {
 
    @Transactional
    public void createOrder(CreateOrderRequest request) {
        Order order = new Order(request);
        orderRepository.save(order);     // Write 1: DB succeeds
 
        // If this fails: DB has the order, Kafka does not
        // Downstream services are never notified -- silent data inconsistency
        kafkaTemplate.send("orders", order.getId().toString(),
            new OrderCreatedEvent(order));
    }
}

Failure scenarios:

  • Network hiccup between app and Kafka -- order saved, event never published
  • Kafka broker down -- order saved, no event, downstream never processes
  • App crashes after DB commit but before Kafka send -- order saved, Kafka empty

The Fix

Use the Outbox Pattern (see Part 3, Section 5) -- write event to DB in the same transaction.

// CORRECT: Atomic write to DB + outbox in one transaction
@Service
public class OrderService {

    @Transactional
    public void createOrder(CreateOrderRequest request) {
        Order order = orderRepository.save(new Order(request));
        // Event written to DB in same atomic transaction
        outboxRepository.save(OutboxEvent.from(order, "ORDER_CREATED"));
        // Separate async publisher reads outbox and publishes to Kafka
    }
}
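The "separate async publisher" mentioned in the fix can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not a production relay: `OutboxRelaySketch`, its `OutboxEvent` class, and the `Consumer<String>` that stands in for a Kafka producer are hypothetical names, and a real relay would poll an outbox table instead of a list. The point it shows is the at-least-once behaviour: an event is marked as published only after the publish call returns, so a failed publish is retried on the next pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for an outbox table and its relay.
final class OutboxRelaySketch {

    static final class OutboxEvent {
        final String aggregateId;
        final String type;
        boolean published; // would be a column in the outbox table

        OutboxEvent(String aggregateId, String type) {
            this.aggregateId = aggregateId;
            this.type = type;
        }
    }

    private final List<OutboxEvent> outbox = new ArrayList<>();

    // In the real service this insert shares a DB transaction with the business write.
    void append(String aggregateId, String type) {
        outbox.add(new OutboxEvent(aggregateId, type));
    }

    // One polling pass: publish pending events in insertion order.
    // Returns how many events were published in this pass.
    int relayOnce(Consumer<String> publisher) {
        int sent = 0;
        for (OutboxEvent event : outbox) {
            if (event.published) {
                continue;
            }
            try {
                publisher.accept(event.type + ":" + event.aggregateId);
                event.published = true; // mark only after a successful publish
                sent++;
            } catch (RuntimeException e) {
                break; // stop here to preserve order; the next pass retries this event
            }
        }
        return sent;
    }

    long pendingCount() {
        return outbox.stream().filter(e -> !e.published).count();
    }
}
```

Because the relay can crash after publishing but before marking the event, consumers may see the same event twice, which is one more reason downstream handlers need to be idempotent.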

Real Impact

This is one of the most common bugs in microservice architectures. A payment service might save a "PAYMENT_RECEIVED" record but never notify the order service, leaving orders in PENDING state forever.


2. Anti-Pattern: Eventual Consistency Everywhere

What It Is

Defaulting to eventual consistency for all data, even data that requires strong consistency. Common in teams that "want to scale" but don't analyze what their data actually needs.

The Bug

// ANTI-PATTERN: Using eventually consistent reads for account balance
@Transactional(readOnly = true)  // Routes to read replica -- may be stale!
public BigDecimal getBalanceBeforeTransfer(Long accountId) {
    // If replica is lagging 2 seconds, the balance here could be $0
    // even though the actual balance is $1000
    // Leading to INCORRECT "insufficient funds" rejections
    return accountRepository.findBalance(accountId);
}

The Risk

  • Banking: User's balance shows $0 on the replica while the actual balance is $1000. Transfer rejected incorrectly.
  • Inventory: Read stale stock count, allow purchase of item that is actually out of stock (overselling).
  • Authentication: Token deleted but still readable from stale cache -- security breach.

The Fix

Identify your data categories:

public class AccountService {
 
    // Money-critical: always read from primary
    @Transactional(isolation = Isolation.READ_COMMITTED)
    public BigDecimal getBalanceForTransfer(Long accountId) {
        // This MUST use the primary -- routes via write datasource
        DataSourceContextHolder.setDataSourceType(DataSourceType.WRITE);
        try {
            return accountRepository.findBalance(accountId);
        } finally {
            DataSourceContextHolder.clearDataSourceType();
        }
    }
 
    // Non-critical display: eventually consistent is fine
    @Transactional(readOnly = true)
    public AccountSummary getDashboardSummary(Long accountId) {
        // Showing $999.90 instead of $1000 for 50ms is acceptable in a dashboard
        return accountRepository.findSummary(accountId);
    }
}

Decision Rule

Ask: "What is the worst case if this read is 1-5 seconds stale?"

  • If the answer involves financial loss, security breach, or correctness violation: use strong consistency
  • If the answer is "user sees slightly outdated display data": eventual consistency is fine

3. Anti-Pattern: Missing Idempotency in Retry Logic

What It Is

Adding retry logic to handle transient failures without making operations idempotent. Retries then cause duplicate side effects.

The Bug

// ANTI-PATTERN: Retry without idempotency
@Retryable(retryFor = RuntimeException.class, maxAttempts = 3)
public void chargeCustomer(Long customerId, BigDecimal amount) {
    // If first attempt charges successfully but response times out,
    // retry will charge the customer AGAIN
    paymentGateway.charge(customerId, amount);
}

Real scenario: Charging $100, the network times out on the response. The customer was actually charged. The retry fires: customer charged $200. Chaos.

The Fix

// CORRECT: Idempotency key prevents duplicate charges
@Retryable(retryFor = RuntimeException.class, maxAttempts = 3,
           backoff = @Backoff(delay = 1000, multiplier = 2))
public PaymentResult chargeCustomer(String idempotencyKey,
                                    Long customerId, BigDecimal amount) {
    // Reuse the record for this key if one exists; otherwise record the intent
    // BEFORE calling the gateway. A UNIQUE constraint on the idempotency key
    // column is assumed, so two concurrent first attempts cannot both insert.
    Payment payment = paymentRepository.findByIdempotencyKey(idempotencyKey)
        .orElseGet(() -> paymentRepository.save(Payment.builder()
            .idempotencyKey(idempotencyKey)
            .customerId(customerId)
            .amount(amount)
            .status(PaymentStatus.PROCESSING)
            .build()));

    // A completed charge is returned as-is; anything else is attempted again
    if (payment.getStatus() == PaymentStatus.SUCCESS) {
        log.info("Returning existing charge for idempotency key {}", idempotencyKey);
        return PaymentResult.from(payment);
    }

    try {
        // Forward the same key so the gateway can de-duplicate a charge whose
        // response was lost (assumes the gateway API accepts an idempotency key)
        GatewayResult result = paymentGateway.charge(idempotencyKey, customerId, amount);
        payment.setStatus(PaymentStatus.SUCCESS);
        payment.setGatewayTransactionId(result.getTransactionId());
    } catch (Exception e) {
        payment.setStatus(PaymentStatus.FAILED);
        payment.setFailureReason(e.getMessage());
        throw e;
    } finally {
        paymentRepository.save(payment);
    }
 
    return PaymentResult.from(payment);
}

Golden Rule for Idempotency

Every operation that can be retried MUST be idempotent. Use unique idempotency keys (provided by client or generated from business context: customerId + orderId + timestamp-bucket).
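When the client does not supply a key, one can be derived from business context as suggested above. The sketch below is one possible way to do that, not a prescribed scheme: `IdempotencyKeys.forCharge` is a hypothetical helper, and hashing with SHA-256 is an assumption made here only to get a fixed-length, opaque key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: derives a deterministic idempotency key from business context.
final class IdempotencyKeys {
    private IdempotencyKeys() {}

    static String forCharge(long customerId, long orderId, Instant requestTime, Duration bucket) {
        if (bucket.getSeconds() <= 0) {
            throw new IllegalArgumentException("bucket must be at least one second");
        }
        // All requests inside the same time bucket map to the same key.
        long bucketIndex = requestTime.getEpochSecond() / bucket.getSeconds();
        String material = customerId + ":" + orderId + ":" + bucketIndex;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(material.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

A caveat of time buckets: a retry that lands just after a bucket boundary gets a different key than the original attempt, so a client-supplied key (or a key without a time component, when one charge per order is the rule) is the more robust choice.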


4. Anti-Pattern: Ignoring Read-Your-Writes in User Flows

What It Is

Routing reads to replicas without accounting for the fact that the user just wrote data and expects to see it immediately.

The Bug (User Experience Disaster)

User action sequence:
1. User updates their profile picture
2. App writes new picture URL to MySQL primary
3. App redirects to profile view page
4. Profile view reads from MySQL replica (10ms lag)
5. User sees their OLD profile picture
6. User thinks the update failed and tries again
7. Multiple duplicate updates, frustrated user

The Fix

// Strategy 1: Force primary read after write
@Service
public class ProfileService {
 
    @Transactional
    public UserProfile updateProfile(Long userId, ProfileUpdateRequest request) {
        User user = userRepository.findById(userId).orElseThrow();
        user.update(request);
        userRepository.save(user);
 
        // Mark session so next reads go to primary
        UserWriteTracker.markRecentWrite(userId);
 
        return UserProfile.from(user);
    }
 
    @Transactional(readOnly = true)
    public UserProfile getProfile(Long userId) {
        // If user recently wrote, read from primary
        if (UserWriteTracker.hasRecentWrite(userId, Duration.ofSeconds(5))) {
            DataSourceContextHolder.setDataSourceType(DataSourceType.WRITE);
        }
        try {
            return userRepository.findById(userId).map(UserProfile::from).orElseThrow();
        } finally {
            DataSourceContextHolder.clearDataSourceType();
        }
    }
}
 
// Strategy 2: Optimistic response -- return updated data from the write response
// Don't round-trip to read; return the object you just saved
@Transactional
public UserProfile updateProfile(Long userId, ProfileUpdateRequest request) {
    User user = userRepository.findById(userId).orElseThrow();
    user.update(request);
    User saved = userRepository.save(user);
    return UserProfile.from(saved);  // Return immediately, no secondary read
}
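Strategy 1 relies on a `UserWriteTracker` that the snippet does not define. A minimal in-memory sketch of such a tracker could look like the following; the class name `RecentWriteTracker`, the instance-based design, and the injected `Clock` are assumptions made for illustration (the original calls static methods). It also only works if a user's requests reach the same application instance; with several instances the timestamp would have to live in a shared store or in the user's session.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-instance tracker of each user's most recent write.
final class RecentWriteTracker {
    private final Map<Long, Instant> lastWriteByUser = new ConcurrentHashMap<>();
    private final Clock clock;

    RecentWriteTracker(Clock clock) {
        this.clock = clock;
    }

    // Call after a successful write for this user.
    void markRecentWrite(long userId) {
        lastWriteByUser.put(userId, clock.instant());
    }

    // True if the user wrote less than `window` ago; such reads should go to the primary.
    boolean hasRecentWrite(long userId, Duration window) {
        Instant lastWrite = lastWriteByUser.get(userId);
        if (lastWrite == null) {
            return false;
        }
        boolean recent = Duration.between(lastWrite, clock.instant()).compareTo(window) < 0;
        if (!recent) {
            // Drop the expired entry, but only if no newer write replaced it meanwhile.
            lastWriteByUser.remove(userId, lastWrite);
        }
        return recent;
    }
}
```

The window should comfortably exceed the replica lag you actually observe; the 5 seconds used above is only as good as that measurement.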

5. Anti-Pattern: Using SERIALIZABLE Isolation Everywhere

What It Is

Using Isolation.SERIALIZABLE for all transactions "to be safe," without understanding the severe performance impact.

The Problem

In lock-based implementations such as MySQL InnoDB, SERIALIZABLE turns plain reads into locking reads that take shared range (next-key) locks. This means:

  • Higher chance of deadlocks
  • Drastically reduced concurrency
  • Much lower throughput
  • Lock wait timeouts increase

Measured in production: switching a high-traffic read endpoint from READ_COMMITTED to SERIALIZABLE reduced throughput by 70% and increased p99 latency from 50ms to 900ms.

The Fix

Use the minimum isolation level required:

// WRONG: SERIALIZABLE for a simple lookup
@Transactional(isolation = Isolation.SERIALIZABLE)
public Product getProduct(Long id) {
    return productRepository.findById(id).orElseThrow();
}
 
// CORRECT: default isolation for simple reads; readOnly = true is a hint, not an isolation level
@Transactional(readOnly = true)  // Uses the datasource default (e.g. READ_COMMITTED if configured)
public Product getProduct(Long id) {
    return productRepository.findById(id).orElseThrow();
}
 
// Only use SERIALIZABLE for critical operations requiring full isolation
@Transactional(isolation = Isolation.SERIALIZABLE)
public void onCallScheduleUpdate(Long departmentId, Long doctorId) {
    // Write skew prevention needed: checking on-call count then updating
    long onCallCount = doctorRepository.countOnCall(departmentId);
    if (onCallCount <= 1) {
        throw new ScheduleViolationException("Cannot remove last on-call doctor");
    }
    doctorRepository.setOffCall(doctorId);
}

Isolation Level Usage Guide

| Operation Type | Recommended Level |
|---|---|
| Simple read for display | readOnly = true (READ_COMMITTED) |
| Read for update (will modify) | READ_COMMITTED with pessimistic or optimistic lock |
| Audit/report (multi-read consistency) | REPEATABLE_READ |
| Critical financial update | READ_COMMITTED + pessimistic lock |
| Write skew prevention | SERIALIZABLE (narrow scope only) |
| Batch report (long running read) | REPEATABLE_READ (snapshot) |

6. Anti-Pattern: Long-Running Transactions

What It Is

Holding a database transaction open for extended periods while doing non-DB work (HTTP calls, file I/O, processing).

The Bug

// ANTI-PATTERN: Transaction held open during external calls
@Transactional
public void processOrderWithExternalCalls(Long orderId) {
    Order order = orderRepository.findById(orderId).orElseThrow();
    order.setStatus(OrderStatus.PROCESSING);
    orderRepository.save(order);
 
    // DANGER: DB transaction is OPEN during this HTTP call (could take 2-10 seconds)
    PaymentResult result = paymentGateway.charge(order.getTotalAmount());  // network call!
 
    // DANGER: DB connection held, locks held, undo log growing
    InventoryResult inv = inventoryService.reserve(order.getItems());  // another network call!
 
    order.setStatus(OrderStatus.CONFIRMED);
    orderRepository.save(order);
}

Production consequences:

  • DB connection pool exhaustion (all connections waiting on external calls)
  • Undo log bloat (MySQL cannot clean old row versions)
  • Increased deadlock probability (locks held longer)
  • Cascading failures (slow external call = all threads blocked = service down)

The Fix

// CORRECT: Narrow transactions, external calls outside transactions
// Note: @Transactional on private methods, or on methods called via this.method(),
// has no effect because Spring's proxy is bypassed. TransactionTemplate makes the
// transaction boundaries explicit instead.
@Service
public class OrderProcessingService {

    private final TransactionTemplate tx;
    // orderRepository, paymentGateway and inventoryService are injected as in the other examples

    public OrderProcessingService(PlatformTransactionManager transactionManager) {
        this.tx = new TransactionTemplate(transactionManager);
        this.tx.setTimeout(5);  // seconds
    }

    public void processOrder(Long orderId) {
        // 1. Short transaction: load and mark processing
        Order order = tx.execute(status -> markOrderProcessing(orderId));

        // 2. External calls OUTSIDE transaction
        PaymentResult paymentResult = paymentGateway.charge(order.getTotalAmount());
        InventoryResult inventoryResult = inventoryService.reserve(order.getItems());

        // 3. Short transaction: update final state
        tx.executeWithoutResult(status -> finalizeOrder(orderId, paymentResult, inventoryResult));
    }

    // Runs inside the first short transaction
    private Order markOrderProcessing(Long orderId) {
        Order order = orderRepository.findByIdWithLock(orderId).orElseThrow();
        order.setStatus(OrderStatus.PROCESSING);
        return orderRepository.save(order);
    }

    // Runs inside the second short transaction
    private void finalizeOrder(Long orderId, PaymentResult payment,
                                InventoryResult inventory) {
        Order order = orderRepository.findById(orderId).orElseThrow();
        order.setStatus(payment.isSuccess() ? OrderStatus.CONFIRMED : OrderStatus.FAILED);
        order.setPaymentTransactionId(payment.getTransactionId());
        orderRepository.save(order);
    }
}

Rule of Thumb: A transaction should last milliseconds, not seconds. If a transaction spans network calls, it is too long.


7. Anti-Pattern: Cache Without Expiry and Without Invalidation

What It Is

Caching data indefinitely or with a very long TTL without a proper invalidation strategy. Data becomes stale and incorrect.

The Bug

// ANTI-PATTERN: Cache with no expiry
redisTemplate.opsForValue().set("user:123:preferences", preferences);
// Never expires! If user changes preferences, cache always returns old value.
 
// ANTI-PATTERN: Very long TTL with no invalidation
redisTemplate.opsForValue().set("product:456:price", price, Duration.ofDays(7));
// Price might change. Customer pays wrong price for up to 7 days.

The Fix

// CORRECT: Appropriate TTL + active invalidation
@Service
public class UserPreferenceService {
 
    private static final Duration PREF_TTL = Duration.ofHours(1);
 
    @Transactional
    public void updatePreferences(Long userId, PreferenceUpdate update) {
        // Update DB
        userPreferenceRepository.update(userId, update);
 
        // Actively invalidate cache
        String cacheKey = "user:" + userId + ":preferences";
        redisTemplate.delete(cacheKey);
 
        log.debug("Invalidated preferences cache for user {}", userId);
    }
 
    public UserPreferences getPreferences(Long userId) {
        String cacheKey = "user:" + userId + ":preferences";
        UserPreferences cached = redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) return cached;
 
        UserPreferences prefs = userPreferenceRepository.findByUserId(userId);
        redisTemplate.opsForValue().set(cacheKey, prefs, PREF_TTL);  // TTL as safety net
        return prefs;
    }
}

Best Practice: Always use TTL as a safety net even when you have active invalidation. TTL catches cases where invalidation fails (Redis unreachable, bug in code that forgot to invalidate).


8. Anti-Pattern: Cache Stampede and Thundering Herd

What It Is

When a cached item expires, many concurrent requests all miss the cache simultaneously, all query the database at the same time, overwhelming it.

The Scenario

Cache key "trending:products" expires at 3:00 PM
At 3:00:00.001 PM: 5,000 requests arrive, all miss cache
5,000 queries hit MySQL simultaneously
MySQL falls over under 5,000 concurrent queries
App returns 500 errors

The Fix - Mutex/Lock-Based Prevention

// CORRECT: Distributed lock prevents stampede
public List<Product> getTrendingProducts() {
    String cacheKey = "trending:products";
 
    // Try cache first (fast path)
    List<Product> cached = (List<Product>) redisTemplate.opsForValue().get(cacheKey);
    if (cached != null) return cached;
 
    // Stampede prevention: only ONE instance rebuilds the cache
    RLock lock = redissonClient.getLock("lock:rebuild:" + cacheKey);
    try {
        if (lock.tryLock(5, 10, TimeUnit.SECONDS)) {
            // Double-check after acquiring lock
            cached = (List<Product>) redisTemplate.opsForValue().get(cacheKey);
            if (cached != null) return cached;
 
            // Only one thread/instance reaches here
            List<Product> fresh = productRepository.findTrending();
            redisTemplate.opsForValue().set(cacheKey, fresh, Duration.ofMinutes(10));
            return fresh;
        } else {
            // Could not get lock -- serve stale if possible, otherwise wait and retry
            return getTrendingProductsWithStaleFallback(cacheKey);
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted waiting for cache rebuild lock", e);
    } finally {
        if (lock.isHeldByCurrentThread()) lock.unlock();
    }
}
 
// Better: Use "soft expiry" -- the cached entry stores its own expiry time,
// so it can keep being served while a background refresh is triggered shortly before it expires
public List<Product> getTrendingProductsProactive() {
    SoftCachedData<List<Product>> softCached = softCache.get("trending:products");
 
    if (softCached != null) {
        // Schedule background refresh if within 20% of expiry
        if (softCached.isNearExpiry(0.20)) {
            cacheRefreshExecutor.submit(() -> refreshTrendingCache());
        }
        return softCached.getData();  // Return current data while refreshing in background
    }
    return refreshTrendingCache();
}
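The `SoftCachedData` wrapper used above is not shown in the article. One way it might be written is sketched below; the field names are assumptions, and `isNearExpiry` takes an explicit `now` argument here (unlike the one-argument call above) purely so the logic can be tested without waiting on the wall clock.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical soft-expiry wrapper: the entry carries its own creation time and TTL.
final class SoftCachedData<T> {
    private final T data;
    private final Instant cachedAt;
    private final Duration ttl;

    SoftCachedData(T data, Instant cachedAt, Duration ttl) {
        this.data = data;
        this.cachedAt = cachedAt;
        this.ttl = ttl;
    }

    T getData() {
        return data;
    }

    // True when no more than `fraction` of the TTL remains at `now`,
    // e.g. fraction = 0.20 means "inside the last 20% of the entry's lifetime".
    boolean isNearExpiry(double fraction, Instant now) {
        long ttlMillis = ttl.toMillis();
        long remainingMillis = ttlMillis - Duration.between(cachedAt, now).toMillis();
        return remainingMillis <= (long) (ttlMillis * fraction);
    }

    // True once the full TTL has elapsed.
    boolean isExpired(Instant now) {
        return !now.isBefore(cachedAt.plus(ttl));
    }
}
```

With a 10-minute TTL and a fraction of 0.20, the background refresh starts once the entry is 8 minutes old, which leaves 2 minutes for the refresh to finish before any request would see a miss. The underlying cache TTL should be set longer than the soft TTL so the entry is still there to serve.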

9. Anti-Pattern: Split-Brain in Distributed Locking

What It Is

Two nodes simultaneously believe they hold the same distributed lock, leading to concurrent conflicting operations.

How It Happens

t=0: Node A acquires Redis lock with 30s TTL
t=1: Node A does work...
t=15: Node A pauses (GC pause, OS scheduling, etc.)
t=30: Lock expires (Node A still paused, unaware)
t=31: Node B acquires the same lock
t=31: Node B starts doing work on the same resource
t=35: Node A resumes from pause, still thinks it holds the lock
t=35: BOTH Node A and Node B are modifying the same resource simultaneously!

The Fix - Fencing Tokens

@Service
public class SafeLockService {
 
    private final RedissonClient redissonClient;
    private final ResourceRepository resourceRepository;
 
    public void processWithFencing(String resourceId) {
        RLock lock = redissonClient.getLock("lock:" + resourceId);
 
        try {
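            // Acquire the lock and obtain its fencing token in one step.
            // Tokens increase monotonically, so a holder with an older token is a zombie.
            // NOTE: tryLockAndGetFencingToken is illustrative; plain RLock has no such method.
            // Recent Redisson versions expose tokens through RFencedLock
            // (redissonClient.getFencedLock(...) and tryLockAndGetToken(...)); check your version.
            long fencingToken = lock.tryLockAndGetFencingToken(10, 30, TimeUnit.SECONDS);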
 
            if (fencingToken < 0) {
                throw new LockAcquisitionException("Could not acquire lock for " + resourceId);
            }
 
            // Include fencing token in every write operation
            doWorkWithFencingToken(resourceId, fencingToken);
 
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }
 
    private void doWorkWithFencingToken(String resourceId, long fencingToken) {
        // Store fencing token in DB alongside the data
        // If token in DB > our token: we're a zombie, abort
        resourceRepository.updateWithFencingToken(resourceId, fencingToken, newData);
    }
}
 
// In repository: reject writes with stale fencing tokens
@Transactional
public void updateWithFencingToken(String resourceId, long fencingToken, Data newData) {
    Resource resource = resourceRepository.findById(resourceId).orElseThrow();
 
    if (resource.getLastFencingToken() >= fencingToken) {
        throw new StaleWriteException("Stale write detected: our token " + fencingToken +
            " is not newer than stored " + resource.getLastFencingToken());
    }
 
    resource.setData(newData);
    resource.setLastFencingToken(fencingToken);
    resourceRepository.save(resource);
}
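The repository check above reads the stored token and then saves, so it is only safe if the row is locked or versioned between those two steps. One way to close that gap is to make the comparison and the write a single atomic step (in SQL, an UPDATE whose WHERE clause requires the stored token to be lower than the incoming one). The sketch below illustrates the idea in memory; `FencedStore` and its methods are hypothetical names, and `ConcurrentHashMap.compute` stands in for the database's atomic update.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical store that accepts a write only if its fencing token is newer
// than the token of the last accepted write for the same resource.
final class FencedStore {

    private static final class Entry {
        final long token;
        final String data;

        Entry(long token, String data) {
            this.token = token;
            this.data = data;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    // Returns true if the write was applied, false if the token was stale.
    boolean write(String resourceId, long fencingToken, String data) {
        boolean[] applied = {false};
        // compute() runs atomically per key, so check and write cannot interleave.
        entries.compute(resourceId, (id, current) -> {
            if (current != null && current.token >= fencingToken) {
                return current; // stale or replayed token: keep the existing value
            }
            applied[0] = true;
            return new Entry(fencingToken, data);
        });
        return applied[0];
    }

    String read(String resourceId) {
        Entry entry = entries.get(resourceId);
        return entry == null ? null : entry.data;
    }
}
```

In the timeline above, Node B's write with the newer token is accepted, and Node A's late write with the older token is rejected when it wakes up, regardless of what Node A believes about the lock.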

Additional Protection: Heartbeats and Watchdog

// Redisson watchdog: automatically extends lock TTL if owner is still alive
// Enabled by default -- lock TTL is renewed every 10 seconds if holder is running
RLock lock = redissonClient.getLock("lock:resource");
lock.lock();  // No explicit TTL -- watchdog auto-renews until unlock()
// If holder dies: watchdog stops, TTL expires naturally, lock released

10. Anti-Pattern: Out-of-Order Event Processing

What It Is

Consuming events without ordering guarantees, leading to state machine violations.

The Bug

Events published in order:
  1. ORDER_CREATED
  2. PAYMENT_RECEIVED
  3. ORDER_SHIPPED
  4. ORDER_DELIVERED

Due to SQS Standard queue or parallel consumers:
  Consumer 1 processes: ORDER_DELIVERED
  Consumer 2 processes: ORDER_CREATED

Result: Order marked DELIVERED before it was even CREATED in the consumer service

The Fix

// Option 1: Use SQS FIFO with message group per order
// All events for orderId=123 go to the same group, processed in order
 
// Option 2: Kafka -- all events for same orderId go to same partition (same consumer)
kafkaTemplate.send("order-events",
    orderId.toString(),  // partition key -- same orderId = same partition = ordered
    event);
 
// Option 3: Application-level ordering enforcement
@KafkaListener(topics = "order-events", groupId = "order-consumer")
@Transactional
public void handleOrderEvent(OrderEvent event) {
    Order order = orderRepository.findById(event.getOrderId()).orElseThrow();
 
    // State machine: only apply event if current state allows it
    if (!order.canTransitionTo(event.getTargetState())) {
        log.warn("Out-of-order event: order {} cannot transition from {} to {}. Event: {}",
            event.getOrderId(), order.getStatus(), event.getTargetState(), event.getType());
 
        // Option A: Dead-letter the event (it may arrive again after in-order events)
        // Option B: Store as "pending" and reprocess after state catches up
        outOfOrderEventStore.save(event);
        return;
    }
 
    order.applyEvent(event);
    orderRepository.save(order);
}
 
// Reprocess stored out-of-order events periodically
@Scheduled(fixedDelay = 5000)
@Transactional
public void reprocessOutOfOrderEvents() {
    List<OrderEvent> pending = outOfOrderEventStore.findPending();
    for (OrderEvent event : pending) {
        Order order = orderRepository.findById(event.getOrderId()).orElse(null);
        if (order != null && order.canTransitionTo(event.getTargetState())) {
            order.applyEvent(event);
            orderRepository.save(order);
            outOfOrderEventStore.delete(event);
        }
    }
}
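Option 3 depends on `order.canTransitionTo(...)`, which the snippet leaves undefined. A minimal sketch of such a state machine follows. The state names and the strictly linear transition table are assumptions for illustration; a real order lifecycle would also have cancellation and failure states.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical linear order lifecycle: CREATED -> PAID -> SHIPPED -> DELIVERED.
final class OrderStateMachine {

    enum State { CREATED, PAID, SHIPPED, DELIVERED }

    // For each state, the set of states it may move to next.
    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.CREATED, EnumSet.of(State.PAID));
        ALLOWED.put(State.PAID, EnumSet.of(State.SHIPPED));
        ALLOWED.put(State.SHIPPED, EnumSet.of(State.DELIVERED));
        ALLOWED.put(State.DELIVERED, EnumSet.noneOf(State.class));
    }

    private State current = State.CREATED;

    boolean canTransitionTo(State target) {
        return ALLOWED.get(current).contains(target);
    }

    // Applies the transition if it is legal; otherwise leaves the state unchanged.
    boolean apply(State target) {
        if (!canTransitionTo(target)) {
            return false;
        }
        current = target;
        return true;
    }

    State current() {
        return current;
    }
}
```

Note that a duplicate of an already applied event is also rejected by this table. A handler built on it would need to tell "too early" (park and retry) apart from "already applied" (drop), for example by comparing the event's target state with the current one.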

11. Anti-Pattern: Tight Coupling via Synchronous Distributed Transactions

What It Is

Using synchronous calls or 2PC (Two-Phase Commit) across microservices to achieve strong consistency. This creates extreme coupling and availability dependencies.

Why 2PC Is Problematic in Microservices

Order Service calls:
  1. BEGIN DISTRIBUTED TRANSACTION
  2. --> Inventory Service: Reserve stock (waits)
  3. --> Payment Service: Charge customer (waits)
  4. COMMIT

If Inventory Service is slow: Order Service waits
If Payment Service crashes: Entire transaction blocked until recovery
If coordinator crashes mid-commit: All participants in limbo indefinitely

2PC availability, assuming three participants at 99% availability each that fail independently: P(system available) = P(all participants available) = 0.99 × 0.99 × 0.99 ≈ 0.97 (worse than any single service).
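The arithmetic generalizes to any number of participants that must all be up at once. The helper below is a small illustrative calculation, with `SerialAvailability` as a hypothetical name, and it assumes failures are independent, which real services that share infrastructure often are not.

```java
// Hypothetical helper: availability of a call chain in which every participant must be up.
final class SerialAvailability {
    private SerialAvailability() {}

    // Multiplies the individual availabilities (each between 0.0 and 1.0).
    static double combined(double... availabilities) {
        double product = 1.0;
        for (double availability : availabilities) {
            product *= availability;
        }
        return product;
    }

    public static void main(String[] args) {
        System.out.printf("3 participants:  %.4f%n", combined(0.99, 0.99, 0.99));
        double ten = 1.0;
        for (int i = 0; i < 10; i++) {
            ten = combined(ten, 0.99);
        }
        System.out.printf("10 participants: %.4f%n", ten);
    }
}
```

Three participants at 0.99 give about 0.9703, and ten give about 0.9044, so every service added to a synchronous transaction lowers the ceiling on the availability of the whole operation.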

The Fix

Use SAGA (choreography or orchestration) for distributed coordination with compensations. See Part 3, Sections 6 and 7.

Rule: Never use distributed transactions (2PC, XA transactions) across microservices. Use SAGAs with compensating transactions instead.


12. Anti-Pattern: Missing Compensation in SAGA

What It Is

Implementing a SAGA without proper compensating transactions for failure scenarios.

The Bug

// ANTI-PATTERN: SAGA without compensation
@KafkaListener(topics = "payment-events")
public void onPaymentFailed(PaymentFailedEvent event) {
    Order order = orderRepository.findById(event.getOrderId()).orElseThrow();
    order.setStatus(OrderStatus.FAILED);
    orderRepository.save(order);
    // FORGOT: Inventory was already reserved! Never released!
    // Result: Inventory count permanently decremented for a failed order
}

The Fix

Every SAGA step must have a defined compensation:

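| Step | Forward Action | Compensating Action |
|---|---|---|
| 1 | Create Order (PENDING) | Cancel Order |
| 2 | Reserve Inventory | Release Inventory Reservation |
| 3 | Charge Payment | Refund Payment |
| 4 | Ship Order | Initiate Return |

// CORRECT: the failure handler triggers compensation for the steps already completed
@KafkaListener(topics = "payment-events")
public void onPaymentFailed(PaymentFailedEvent event) {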
    Order order = orderRepository.findById(event.getOrderId()).orElseThrow();
    order.setStatus(OrderStatus.FAILED);
    orderRepository.save(order);
 
    // COMPENSATION: Release inventory that was reserved in Step 2
    inventoryCompensationPublisher.publish(new ReleaseInventoryCommand(
        event.getOrderId(), order.getItems()));
 
    // Log for observability
    log.info("SAGA compensation: releasing inventory for failed order {}", event.getOrderId());
}
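The step/compensation pairing can also be expressed directly in code, which makes it harder to add a forward action and forget its undo. The following is a minimal orchestration-style sketch under simplifying assumptions: `SagaSketch`, `Step`, and the `step(...)` factory are hypothetical names, everything runs in one process, and compensations are assumed to succeed. A real SAGA persists its progress and retries failed compensations.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical in-process saga runner: each step carries its own compensation.
final class SagaSketch {

    interface Step {
        void execute();     // forward action; throws RuntimeException on failure
        void compensate();  // undoes a previously successful execute()
    }

    // Runs steps in order. If one fails, compensates the completed steps
    // in reverse order and returns false. Returns true if all steps succeed.
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                return false;
            }
        }
        return true;
    }

    // Test/demo helper: a step that records what happened in `log`.
    static Step step(String name, boolean failsOnExecute, List<String> log) {
        return new Step() {
            @Override
            public void execute() {
                if (failsOnExecute) {
                    log.add("fail:" + name);
                    throw new IllegalStateException(name + " failed");
                }
                log.add("do:" + name);
            }

            @Override
            public void compensate() {
                log.add("undo:" + name);
            }
        };
    }
}
```

Running create-order, reserve-inventory, and a failing charge-payment through `run` undoes the inventory reservation first and the order creation second, which matches reading the table above from the failed step upward.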

13. Real Production Failures and Lessons

Failure 1: The Black Friday Inventory Oversell

What happened: An e-commerce platform used DynamoDB with eventually consistent reads to check inventory before checkout. Under high load on Black Friday, the replica lag was 2-3 seconds. 50,000 customers checked inventory, all saw "In Stock" (from stale replicas), all purchased. The actual inventory was 10,000 units. 40,000 orders had to be cancelled.

Root Cause: Inventory check used eventual consistency. Reservation used eventual consistency.

Fix:

// Inventory reservation MUST be atomic and strongly consistent
// Use DynamoDB conditional write: only decrement if quantity > 0
UpdateItemRequest reserve = UpdateItemRequest.builder()
    .tableName("Inventory")
    .key(Map.of("productId", AttributeValue.fromS(productId)))
    .updateExpression("SET quantity = quantity - :qty")
    .conditionExpression("quantity >= :qty")  // Atomic check-and-decrement
    .expressionAttributeValues(Map.of(":qty", AttributeValue.fromN(String.valueOf(quantity))))
    .build();

Lesson: Any operation that allocates a finite resource MUST use atomic, strongly consistent operations. Not "read then write" but "atomic conditional write."
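The difference between "read then write" and "atomic conditional write" can be reproduced without a database. The sketch below mirrors the DynamoDB condition `quantity >= :qty` with a compare-and-set loop; `AtomicStock` is a hypothetical class and the in-memory counter is only a stand-in for the inventory item.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory analogue of a conditional decrement on an inventory item.
final class AtomicStock {
    private final AtomicInteger quantity;

    AtomicStock(int initialQuantity) {
        this.quantity = new AtomicInteger(initialQuantity);
    }

    // Check and decrement as one atomic step: the decrement is applied only if
    // the quantity observed by the check is still the current quantity.
    boolean tryReserve(int qty) {
        while (true) {
            int current = quantity.get();
            if (current < qty) {
                return false; // condition failed: not enough stock
            }
            if (quantity.compareAndSet(current, current - qty)) {
                return true;  // nobody changed the value between check and write
            }
            // Another thread won the race; re-read and try again.
        }
    }

    int remaining() {
        return quantity.get();
    }
}
```

With 10 units in stock and 50 concurrent reservation attempts of one unit each, exactly 10 succeed and the counter never goes below zero, however the threads interleave. A separate read followed by an unconditional write gives no such guarantee.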

Failure 2: The Ghost Payment

What happened: A payments microservice called an external payment gateway. The gateway processed the payment but the network response timed out. The microservice's retry logic retried the call without idempotency keys. Customer was charged twice.

Root Cause: External API calls without idempotency keys + retry logic = duplicate charges.

Fix: Always pass idempotency key when calling external services:

ExternalPaymentRequest request = ExternalPaymentRequest.builder()
    .customerId(customerId)
    .amount(amount)
    .idempotencyKey("charge-" + orderId)  // must be identical on every retry: never include the attempt number
    .build();
externalGatewayClient.charge(request);

Failure 3: The Replica Lag Data Breach

What happened: A healthcare application used read replicas for all reads, including reading authorization data. After a user's access was revoked (written to primary), the replica still showed the user as authorized for 15 seconds. During those 15 seconds, the user accessed protected data.

Root Cause: Authorization and access control reads were using eventually consistent replicas.

Fix: Security-critical reads (authentication, authorization, session validation) MUST always read from the primary:

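// NEVER read auth data from replica
@Transactional  // Not read-only, so it is routed to the primary
public boolean isAuthorized(Long userId, String resource) {
    DataSourceContextHolder.setDataSourceType(DataSourceType.WRITE);
    try {
        return permissionRepository.hasPermission(userId, resource);
    } finally {
        // Clear the routing key so it does not leak to the next request on this thread
        DataSourceContextHolder.clearDataSourceType();
    }
}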

Failure 4: The Lost Update in Financial Reconciliation

What happened: Two concurrent batch jobs both read a reconciliation record (balance = $10,000), both applied their calculations, both wrote back. The second write overwrote the first. One set of transactions was effectively "lost."

Root Cause: Classic lost update problem. Read-modify-write without any concurrency control.

Fix:

// Atomic SQL update instead of read-modify-write
@Transactional
public void applyTransactionBatch(Long accountId, List<Transaction> transactions) {
    BigDecimal totalDelta = transactions.stream()
        .map(t -> t.isDebit() ? t.getAmount().negate() : t.getAmount())
        .reduce(BigDecimal.ZERO, BigDecimal::add);
 
    // Atomic: database calculates new balance, not the application
    int updated = accountRepository.atomicUpdateBalance(accountId, totalDelta);
    if (updated == 0) {
        throw new AccountNotFoundException(accountId);
    }
}
 
// In repository:
@Query(value = "UPDATE accounts SET balance = balance + :delta WHERE id = :id", nativeQuery = true)
@Modifying
int atomicUpdateBalance(@Param("id") Long id, @Param("delta") BigDecimal delta);

14. Consistency Trade-Off Matrix

| Dimension | Strong Consistency | Eventual Consistency |
|---|---|---|
| Data Freshness | Always current | May be seconds/minutes stale |
| Write Latency | Higher (coordination overhead) | Lower (asynchronous) |
| Read Latency | Higher (read from primary or quorum) | Lower (read from nearest replica) |
| Throughput | Lower (serialization overhead) | Higher (parallel processing) |
| Availability | Lower (may block during partition) | Higher (serves requests always) |
| Conflict Risk | None (serialized) | Possible concurrent conflicts |
| Implementation Complexity | Lower (DB handles it) | Higher (app must handle conflicts) |
| Cost (DynamoDB) | 2x read units | 1x read unit |
| Use For | Money, auth, inventory | Catalog, analytics, feeds |

The Trade-Off Decision Tree

Does incorrect data here cause financial loss, security issue, or legal liability?
  YES --> Strong consistency required
  NO  -->
       Does the user who just wrote this data need to see it immediately?
         YES --> Read-Your-Writes (session consistency)
         NO  -->
              Can the business tolerate staleness with no fixed upper bound (seconds to minutes)?
                YES --> Eventual consistency (cheaper, faster)
                NO  -->
                     Can it tolerate a bounded lag of a few seconds?
                       YES --> Bounded staleness
                       NO  --> Strong consistency

15. Architectural Decision Framework

Questions to Answer Before Choosing Consistency Model

  1. Data criticality: What is the business impact of incorrect data? (Financial, reputational, legal?)
  2. Read/write ratio: Is it read-heavy (caching beneficial) or write-heavy?
  3. Concurrency level: How many concurrent users modify the same data?
  4. User expectation: Does the user expect to immediately see their own changes?
  5. Geographic distribution: Is your user base global? Where does the data live?
  6. Volume: How many operations per second? What is the acceptable latency?
  7. Failure tolerance: What happens if a read returns stale data? What happens if a write is delayed?

Data Classification Framework

Tier 1 - Strong Consistency Required:
  - Financial balances, transaction records
  - Authentication tokens, session data
  - Inventory counts for purchase
  - Authorization/permission data
  - Unique constraint enforcement (username, email)
  - Distributed locks and coordination

Tier 2 - Session/Causal Consistency:
  - User profiles and preferences (user sees own changes)
  - Shopping cart (session-scoped)
  - User-generated content (comment threads, posts)
  - Status updates (order tracking)

Tier 3 - Eventual Consistency Acceptable:
  - Product catalog (prices, descriptions)
  - Recommendations and personalization
  - Analytics dashboards
  - Social media feeds (likes, views)
  - Non-critical configuration
  - Search indexes

Tier 4 - Approximate is Fine:
  - Page view counters
  - Rating aggregates
  - Trending scores
  - Activity logs (some loss acceptable)

16. Tips and Tricks from Production

Tip 1: Set Connection-Level Transaction Isolation

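Instead of annotating every method, set the default isolation level once at the connection level:

# In application.yml for services whose transactions default to READ_COMMITTED
spring:
  jpa:
    properties:
      hibernate:
        connection:
          isolation: 2 # java.sql.Connection.TRANSACTION_READ_COMMITTED

Override per-method only when you need SERIALIZABLE or REPEATABLE_READ. If connections come from a pooled DataSource such as HikariCP, verify that the Hibernate property actually takes effect; the pool-level setting spring.datasource.hikari.transaction-isolation: TRANSACTION_READ_COMMITTED is an alternative.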
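
The `2` in the isolation property is the integer constant from `java.sql.Connection`, which is easy to misread in a config review. A small sketch that turns the numeric value back into a name; `IsolationLevels` is a hypothetical helper, while the constants themselves come from the JDK.

```java
import java.sql.Connection;

final class IsolationLevels {

    private IsolationLevels() {}

    /** Maps a JDBC isolation constant (0, 1, 2, 4, 8) to a readable name. */
    static String nameOf(int level) {
        switch (level) {
            case Connection.TRANSACTION_NONE:             return "NONE";
            case Connection.TRANSACTION_READ_UNCOMMITTED: return "READ_UNCOMMITTED";
            case Connection.TRANSACTION_READ_COMMITTED:   return "READ_COMMITTED";
            case Connection.TRANSACTION_REPEATABLE_READ:  return "REPEATABLE_READ";
            case Connection.TRANSACTION_SERIALIZABLE:     return "SERIALIZABLE";
            default:
                throw new IllegalArgumentException("Unknown isolation level: " + level);
        }
    }
}
```

Logging `IsolationLevels.nameOf(connection.getTransactionIsolation())` at startup is one way to confirm which level the pool really hands out.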

Tip 2: Use Batch Fetching to Reduce Lock Contention

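Instead of issuing N separate locking queries, lock all the rows in one statement:

// More efficient: one batch SELECT ... FOR UPDATE instead of N individual locking queries
// findByIdsForUpdate is a custom repository method (for example, a WHERE id IN (...) query
// annotated with @Lock(LockModeType.PESSIMISTIC_WRITE)); call it inside a transaction
List<Account> accounts = accountRepository.findByIdsForUpdate(accountIds);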

Tip 3: Monitor Slow Queries That Hold Locks

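-- Find transactions that are waiting for a lock in MySQL
SELECT trx_id, trx_started, trx_requested_lock_id, trx_mysql_thread_id, trx_query
FROM information_schema.INNODB_TRX
WHERE trx_state = 'LOCK WAIT'
ORDER BY trx_started;

-- Find which transaction is blocking
-- (MySQL 8.0 removed information_schema.INNODB_LOCK_WAITS; use performance_schema.data_lock_waits)
SELECT blocking_engine_transaction_id AS blocking_trx_id,
       blocking_engine_lock_id AS blocking_lock_id,
       requesting_engine_transaction_id AS requested_trx_id
FROM performance_schema.data_lock_waits;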

Tip 4: Use SELECT ... FOR UPDATE SKIP LOCKED for Queue-Like Patterns

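// Instead of a distributed lock for a job queue, use SKIP LOCKED
// Allows multiple workers to claim different rows concurrently without blocking each other
// JPQL has no FOR UPDATE SKIP LOCKED clause, so this is a native query;
// the table and column names (jobs, status, created_at) are assumed here
// Call it inside a @Transactional method: the row locks are released at commit
@Query(value = "SELECT * FROM jobs WHERE status = 'PENDING' " +
               "ORDER BY created_at LIMIT :batchSize " +
               "FOR UPDATE SKIP LOCKED",  // Skip rows locked by other workers
       nativeQuery = true)
List<Job> findAndLockPendingJobs(@Param("batchSize") int batchSize);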

Tip 5: Versioned Cache Keys for Zero-Downtime Cache Invalidation

// Include application version or data version in cache key
// Rolling deploy: new version writes to new key, old version reads from old key
// No cache invalidation storms during deployment
String cacheKey = "product:v" + dataVersion + ":" + productId;
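
To make the effect of the version segment concrete, here is a minimal sketch that uses an in-memory map as a stand-in for the real cache. `VersionedCacheDemo` and its methods are hypothetical; only the key format comes from the line above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class VersionedCacheDemo {

    // Stand-in for Redis or another shared cache
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Same key layout as above: product:v<version>:<id>. */
    static String productKey(int dataVersion, long productId) {
        return "product:v" + dataVersion + ":" + productId;
    }

    void put(int dataVersion, long productId, String value) {
        cache.put(productKey(dataVersion, productId), value);
    }

    /** Returns null on a miss, as Map.get does. */
    String get(int dataVersion, long productId) {
        return cache.get(productKey(dataVersion, productId));
    }
}
```

A reader on version 2 never sees an entry written under version 1, so nothing has to be invalidated during the rollout. The trade-off is that old-version entries are never deleted explicitly, so in a real cache they need a TTL to age out.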

Tip 6: Use @TransactionalEventListener for Post-Commit Events

// The publishing code is the same in both cases; what matters is how the listener is registered
@Transactional
public void createUser(CreateUserRequest request) {
    userRepository.save(new User(request));
    applicationEventPublisher.publishEvent(new UserCreatedEvent(request.getEmail()));
}

// WRONG: a plain @EventListener runs synchronously, inside the transaction
// If the transaction rolls back afterwards, the email has already been sent!
@Component
public class EagerUserCreatedListener {
    @EventListener
    public void handleUserCreated(UserCreatedEvent event) {
        emailService.sendWelcomeEmail(event.getEmail());
    }
}

// CORRECT: delivery is deferred until AFTER the transaction commits
@Component
public class UserCreatedListener {
    @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
    public void handleUserCreated(UserCreatedEvent event) {
        emailService.sendWelcomeEmail(event.getEmail());
        // Runs only after the DB transaction commits; never runs on rollback
        // Not durable: if the process dies right after commit, the event is lost
        // (use the outbox pattern when delivery must be guaranteed)
    }
}
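
The behaviour of an AFTER_COMMIT listener can be pictured as a buffer that holds events until the transaction outcome is known. The sketch below is a conceptual model in plain Java, not Spring's implementation; `AfterCommitBuffer` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Conceptual model only: buffers events and delivers them on commit. */
final class AfterCommitBuffer<E> {

    private final List<E> pending = new ArrayList<>();
    private final Consumer<E> listener;

    AfterCommitBuffer(Consumer<E> listener) {
        this.listener = listener;
    }

    /** Called inside the "transaction": nothing is delivered yet. */
    void publish(E event) {
        pending.add(event);
    }

    /** Commit succeeded: deliver buffered events in publish order. */
    void commit() {
        List<E> toDeliver = new ArrayList<>(pending);
        pending.clear();
        toDeliver.forEach(listener);
    }

    /** Rollback: drop buffered events so the listener never sees them. */
    void rollback() {
        pending.clear();
    }
}
```

The buffer lives only in memory, which is why a crash between commit and delivery loses the event in this model as well.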

Tip 7: Use open-in-view: false in Spring Boot

spring:
  jpa:
    open-in-view: false # CRITICAL for production

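With open-in-view: true (the Spring Boot default), the persistence context stays open for the entire HTTP request lifecycle, including view rendering after the controller method returns. A DB connection acquired during the request can therefore stay checked out while the JSON response is serialized, which leads to connection pool exhaustion under load. With it disabled, lazy associations must be loaded inside the service-layer transaction; accessing them later throws LazyInitializationException.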

Tip 8: Tune HikariCP for AWS Aurora

spring:
  datasource:
    hikari:
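      max-lifetime: 1740000 # 29 minutes -- keep below the shortest server-side or network connection timeout (assumed to be 30 min here; check wait_timeout and any proxy in your setup)
      keepalive-time: 60000 # 1 minute -- pings idle connections so stateful network hops (NAT gateways, load balancers, connection tracking) do not drop them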
      connection-timeout: 30000 # 30 seconds -- fail fast if pool exhausted
      leak-detection-threshold: 60000 # Warn if connection held > 60s (detect long transactions)

17. Consistency Observability and Debugging

Metrics to Track

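import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import java.util.concurrent.atomic.AtomicLong;
import org.springframework.stereotype.Component;

@Component
public class ConsistencyMetrics {

    private final MeterRegistry meterRegistry;

    // A gauge samples a live object, so keep a strong reference and update it in place
    private final AtomicLong outboxLagMs = new AtomicLong();

    public ConsistencyMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        // Register the gauge once; every scrape reads the current value of outboxLagMs
        meterRegistry.gauge("consistency.outbox.lag_ms", outboxLagMs);
    }

    // Track optimistic lock conflicts
    public void recordOptimisticLockConflict(String entityType) {
        meterRegistry.counter("consistency.optimistic_lock.conflict",
            Tags.of("entity", entityType)).increment();
    }

    // Track cache miss rate (high miss rate = potential inconsistency or cold cache)
    public void recordCacheHit(String cacheName, boolean hit) {
        meterRegistry.counter("consistency.cache." + (hit ? "hit" : "miss"),
            Tags.of("cache", cacheName)).increment();
    }

    // Track outbox event processing lag
    // (calling meterRegistry.gauge(name, lagMs) on every update would not change the reported value)
    public void recordOutboxLag(long lagMs) {
        outboxLagMs.set(lagMs);
    }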
 
    // Track idempotency key hits (duplicate request detection)
    public void recordIdempotencyHit(String operationType) {
        meterRegistry.counter("consistency.idempotency.hit",
            Tags.of("operation", operationType)).increment();
    }
}
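
The value passed to `recordOutboxLag` has to come from somewhere. One reasonable definition, assumed here rather than taken from the series, is the age of the oldest event that is still waiting to be published. `OutboxLag` is a hypothetical helper.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

final class OutboxLag {

    private OutboxLag() {}

    /**
     * Lag = age of the oldest event still waiting to be published.
     * An empty outbox reports zero lag.
     */
    static long lagMs(Optional<Instant> oldestPendingCreatedAt, Instant now) {
        return oldestPendingCreatedAt
                .map(createdAt -> Duration.between(createdAt, now).toMillis())
                .map(ms -> Math.max(0L, ms))   // clamp negatives caused by clock skew
                .orElse(0L);
    }
}
```

A scheduled job could read the oldest pending `created_at` from the outbox table, pass it through `lagMs`, and hand the result to `recordOutboxLag`. Alerting on this value catches a stalled publisher even when the pending count looks small.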

SQL Queries for Production Debugging

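-- Check for replica lag on Aurora MySQL
-- replica_lag_in_milliseconds is the lag itself; last_update_timestamp only says how fresh this status row is
SELECT server_id,
       session_id,
       replica_lag_in_milliseconds,
       last_update_timestamp,
       TIMESTAMPDIFF(SECOND, last_update_timestamp, NOW()) AS status_age_seconds
FROM information_schema.replica_host_status;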
 
-- Check for long-running transactions (potential lock holders)
SELECT trx_id,
       trx_started,
       TIMEDIFF(NOW(), trx_started) AS duration,
       trx_state,
       trx_mysql_thread_id,
       SUBSTR(trx_query, 1, 200) AS query
FROM information_schema.INNODB_TRX
WHERE trx_started < NOW() - INTERVAL 5 SECOND  -- running > 5 seconds
ORDER BY trx_started;
 
-- Check current lock waits: who is waiting and who is blocking
-- (MySQL 8.0 removed information_schema.INNODB_LOCK_WAITS; use performance_schema.data_lock_waits)
SELECT r.trx_id AS waiting_trx_id,
       r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query AS waiting_query,
       b.trx_id AS blocking_trx_id,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query AS blocking_query
FROM performance_schema.data_lock_waits w
INNER JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_engine_transaction_id
INNER JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_engine_transaction_id;
 
-- Check pending outbox events (should be near zero in healthy system)
SELECT status, COUNT(*) AS count, MIN(created_at) AS oldest
FROM outbox_events
GROUP BY status;
 
-- Check for stuck PROCESSING outbox events (publisher crashed mid-process)
SELECT * FROM outbox_events
WHERE status = 'PROCESSING'
  AND created_at < DATE_SUB(NOW(), INTERVAL 5 MINUTE);
-- These should be reset to PENDING and retried

Next: Part 6: Interview Questions and Answers -- Comprehensive interview preparation from beginner to Technical Architect level.


Part of the Consistency Models Demystified series
Stack: Java 17, Spring Boot 3.x, MySQL 8.0, AWS