Consistency Models - Part 5: Pitfalls, Anti-Patterns, Trade-Offs, and Tips
Table of Contents
- Anti-Pattern 1: The Dual-Write Problem
- Anti-Pattern 2: Eventual Consistency Everywhere
- Anti-Pattern 3: Missing Idempotency in Retry Logic
- Anti-Pattern 4: Ignoring Read-Your-Writes in User Flows
- Anti-Pattern 5: Using SERIALIZABLE Isolation Everywhere
- Anti-Pattern 6: Long-Running Transactions
- Anti-Pattern 7: Cache Without Expiry and Without Invalidation
- Anti-Pattern 8: Cache Stampede and Thundering Herd
- Anti-Pattern 9: Split-Brain in Distributed Locking
- Anti-Pattern 10: Out-of-Order Event Processing
- Anti-Pattern 11: Tight Coupling via Synchronous Distributed Transactions
- Anti-Pattern 12: Missing Compensation in SAGA
- Real Production Failures and Lessons
- Consistency Trade-Off Matrix
- Architectural Decision Framework
- Tips and Tricks from Production
- Consistency Observability and Debugging
1. Anti-Pattern: The Dual-Write Problem
What It Is
Writing to two different systems (e.g., database + Kafka) in sequence without atomicity. If the second write fails, the systems are in an inconsistent state.
The Bug (Bad Code)
// ANTI-PATTERN: Two separate writes with no atomicity guarantee
@Service
public class OrderServiceBroken {
@Transactional
public void createOrder(CreateOrderRequest request) {
Order order = new Order(request);
orderRepository.save(order); // Write 1: DB succeeds
// If this fails: DB has the order, Kafka does not
// Downstream services are never notified -- silent data inconsistency
kafkaTemplate.send("orders", order.getId().toString(),
new OrderCreatedEvent(order));
}
}
Failure scenarios:
- Network hiccup between app and Kafka -- order saved, event never published
- Kafka broker down -- order saved, no event, downstream never processes
- App crashes after DB commit but before Kafka send -- order saved, Kafka empty
The Fix
Use the Outbox Pattern (see Part 3, Section 5) -- write event to DB in the same transaction.
// CORRECT: Atomic write to DB + outbox in one transaction
@Service
public class OrderService {
    @Transactional
    public void createOrder(CreateOrderRequest request) {
        Order order = orderRepository.save(new Order(request));
        // Event row is written in the same atomic transaction as the order
        outboxRepository.save(OutboxEvent.from(order, "ORDER_CREATED"));
        // A separate async publisher reads the outbox and publishes to Kafka
    }
}
Real Impact
This is one of the most common bugs in microservice architectures. A payment service might save a "PAYMENT_RECEIVED" record but never notify the order service, leaving orders in PENDING state forever.
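The asynchronous publisher that the fix relies on is not shown in this part. The sketch below is one plausible shape for it, assuming a simple polling relay; `OutboxRow`, `OutboxRelay`, the in-memory list standing in for the outbox table, and the `Consumer` standing in for the Kafka client are all hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.function.Consumer;

/** One row of a hypothetical outbox table. */
final class OutboxRow {
    final long id;
    final String type;
    final String payload;
    boolean published;

    OutboxRow(long id, String type, String payload) {
        this.id = id;
        this.type = type;
        this.payload = payload;
    }
}

/** Polls unpublished rows and hands them to a broker client. */
final class OutboxRelay {
    private final List<OutboxRow> table;       // stand-in for the outbox table, in insertion order
    private final Consumer<OutboxRow> broker;  // stand-in for a Kafka send call

    OutboxRelay(List<OutboxRow> table, Consumer<OutboxRow> broker) {
        this.table = table;
        this.broker = broker;
    }

    /** Publishes pending rows in insertion order; returns how many were sent. */
    int publishPending() {
        int sent = 0;
        for (OutboxRow row : table) {
            if (row.published) {
                continue;
            }
            try {
                broker.accept(row);    // may throw if the broker is unreachable
                row.published = true;  // mark only after a successful send
                sent++;
            } catch (RuntimeException e) {
                break;                 // stop here to preserve ordering; the next poll retries
            }
        }
        return sent;
    }
}
```

Because a row is marked published only after the send succeeds, a crash between the two steps causes a re-send on the next poll. A relay of this kind therefore gives at-least-once delivery, and consumers still need to tolerate duplicates.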
2. Anti-Pattern: Eventual Consistency Everywhere
What It Is
Defaulting to eventual consistency for all data, even data that requires strong consistency. Common in teams that "want to scale" but don't analyze what their data actually needs.
The Bug
// ANTI-PATTERN: Using eventually consistent reads for account balance
@Transactional(readOnly = true) // Routes to read replica -- may be stale!
public BigDecimal getBalanceBeforeTransfer(Long accountId) {
// If replica is lagging 2 seconds, the balance here could be $0
// even though the actual balance is $1000
// Leading to INCORRECT "insufficient funds" rejections
return accountRepository.findBalance(accountId);
}
The Risk
- Banking: The stale replica reports a balance of $0 while the actual balance is $1000. Transfer rejected incorrectly.
- Inventory: Read stale stock count, allow purchase of item that is actually out of stock (overselling).
- Authentication: Token deleted but still readable from stale cache -- security breach.
The Fix
Identify your data categories:
public class AccountService {
// Money-critical: always read from primary
@Transactional(isolation = Isolation.READ_COMMITTED)
public BigDecimal getBalanceForTransfer(Long accountId) {
// This MUST use the primary -- routes via write datasource
DataSourceContextHolder.setDataSourceType(DataSourceType.WRITE);
try {
return accountRepository.findBalance(accountId);
} finally {
DataSourceContextHolder.clearDataSourceType();
}
}
// Non-critical display: eventually consistent is fine
@Transactional(readOnly = true)
public AccountSummary getDashboardSummary(Long accountId) {
// Showing $999.90 instead of $1000 for 50ms is acceptable in a dashboard
return accountRepository.findSummary(accountId);
}
}
Decision Rule
Ask: "What is the worst case if this read is 1-5 seconds stale?"
- If the answer involves financial loss, security breach, or correctness violation: use strong consistency
- If the answer is "user sees slightly outdated display data": eventual consistency is fine
3. Anti-Pattern: Missing Idempotency in Retry Logic
What It Is
Adding retry logic to handle transient failures without making operations idempotent. Retries then cause duplicate side effects.
The Bug
// ANTI-PATTERN: Retry without idempotency
@Retryable(retryFor = RuntimeException.class, maxAttempts = 3)
public void chargeCustomer(Long customerId, BigDecimal amount) {
// If first attempt charges successfully but response times out,
// retry will charge the customer AGAIN
paymentGateway.charge(customerId, amount);
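}
Real scenario: A $100 charge succeeds at the gateway, the response times out, the retry charges again, and the customer is billed $200. Chaos.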
The Fix
// CORRECT: Idempotency key prevents duplicate charges
@Retryable(retryFor = RuntimeException.class, maxAttempts = 3,
backoff = @Backoff(delay = 1000, multiplier = 2))
public PaymentResult chargeCustomer(String idempotencyKey,
Long customerId, BigDecimal amount) {
// Check for existing charge first
Optional<Payment> existing = paymentRepository.findByIdempotencyKey(idempotencyKey);
if (existing.isPresent()) {
log.info("Returning existing charge for idempotency key {}", idempotencyKey);
return PaymentResult.from(existing.get());
}
// Create payment record BEFORE calling external gateway
// This ensures we track the intent even if gateway call is slow
Payment payment = Payment.builder()
.idempotencyKey(idempotencyKey)
.customerId(customerId)
.amount(amount)
.status(PaymentStatus.PROCESSING)
.build();
paymentRepository.save(payment);
try {
GatewayResult result = paymentGateway.charge(customerId, amount);
payment.setStatus(PaymentStatus.SUCCESS);
payment.setGatewayTransactionId(result.getTransactionId());
} catch (Exception e) {
payment.setStatus(PaymentStatus.FAILED);
payment.setFailureReason(e.getMessage());
throw e;
} finally {
paymentRepository.save(payment);
}
return PaymentResult.from(payment);
}
Golden Rule for Idempotency
Every operation that can be retried MUST be idempotent. Use unique idempotency keys, either provided by the client or derived from business context (for example customerId + orderId + timestamp-bucket).
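A key derived from business context could be built roughly as follows. This is a sketch under assumptions: the class name `IdempotencyKeys`, the method `forCharge`, and the choice to hash the parts with SHA-256 are illustrative, not something the text prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.time.Instant;

final class IdempotencyKeys {
    private IdempotencyKeys() {}

    /**
     * Derives a deterministic key from customerId + orderId + timestamp bucket.
     * Two calls in the same bucket with the same ids yield the same key.
     */
    static String forCharge(long customerId, long orderId, Instant requestedAt, Duration bucket) {
        long bucketIndex = requestedAt.toEpochMilli() / bucket.toMillis();
        String raw = customerId + ":" + orderId + ":" + bucketIndex;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(raw.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

One caveat of the bucket approach: a retry that crosses a bucket boundary gets a different key. A client-supplied key that is generated once and stored with the request avoids that edge case.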
4. Anti-Pattern: Ignoring Read-Your-Writes in User Flows
What It Is
Routing reads to replicas without accounting for the fact that the user just wrote data and expects to see it immediately.
The Bug (User Experience Disaster)
User action sequence:
1. User updates their profile picture
2. App writes new picture URL to MySQL primary
3. App redirects to profile view page
4. Profile view reads from MySQL replica (10ms lag)
5. User sees their OLD profile picture
6. User thinks the update failed and tries again
7. Multiple duplicate updates, frustrated user
The Fix
// Strategy 1: Force primary read after write
@Service
public class ProfileService {
@Transactional
public UserProfile updateProfile(Long userId, ProfileUpdateRequest request) {
User user = userRepository.findById(userId).orElseThrow();
user.update(request);
userRepository.save(user);
// Mark session so next reads go to primary
UserWriteTracker.markRecentWrite(userId);
return UserProfile.from(user);
}
@Transactional(readOnly = true)
public UserProfile getProfile(Long userId) {
// If user recently wrote, read from primary
if (UserWriteTracker.hasRecentWrite(userId, Duration.ofSeconds(5))) {
DataSourceContextHolder.setDataSourceType(DataSourceType.WRITE);
}
try {
return userRepository.findById(userId).map(UserProfile::from).orElseThrow();
} finally {
DataSourceContextHolder.clearDataSourceType();
}
}
}
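Strategy 1 calls a `UserWriteTracker` that is not defined in this part. One possible shape for it is sketched below, assuming an in-memory map per application instance; the class name `RecentWriteTracker`, the instance-based design, and the injected time source are assumptions made so the sketch can be tested in isolation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Remembers when each user last wrote, so reads can be pinned to the primary for a while. */
final class RecentWriteTracker {
    private final Map<Long, Instant> lastWrite = new ConcurrentHashMap<>();
    private final Supplier<Instant> now; // injected time source; use Instant::now in production

    RecentWriteTracker(Supplier<Instant> now) {
        this.now = now;
    }

    void markRecentWrite(long userId) {
        lastWrite.put(userId, now.get());
    }

    /** True if the user wrote less than {@code window} ago. */
    boolean hasRecentWrite(long userId, Duration window) {
        Instant written = lastWrite.get(userId);
        if (written == null) {
            return false;
        }
        boolean recent = Duration.between(written, now.get()).compareTo(window) < 0;
        if (!recent) {
            lastWrite.remove(userId, written); // drop the expired entry to bound memory
        }
        return recent;
    }
}
```

An in-memory tracker only works if a user's requests keep landing on the same instance. With several instances behind a load balancer, the marker would need to live somewhere shared, such as a cookie or a Redis key with a short TTL. The window should also be longer than the worst replica lag you expect.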
// Strategy 2: Optimistic response -- return updated data from the write response
// Don't round-trip to read; return the object you just saved
@Transactional
public UserProfile updateProfile(Long userId, ProfileUpdateRequest request) {
User user = userRepository.findById(userId).orElseThrow();
user.update(request);
User saved = userRepository.save(user);
return UserProfile.from(saved); // Return immediately, no secondary read
}
5. Anti-Pattern: Using SERIALIZABLE Isolation Everywhere
What It Is
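Using Isolation.SERIALIZABLE for all transactions "to be safe," without understanding the severe performance impact.
The Problem
In MySQL InnoDB, SERIALIZABLE turns every plain SELECT inside a transaction into a locking read that takes shared locks on the rows and ranges it scans. This means: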
- Higher chance of deadlocks
- Drastically reduced concurrency
- Much lower throughput
- Lock wait timeouts increase
One production measurement: switching a high-traffic read endpoint from READ_COMMITTED to SERIALIZABLE reduced throughput by 70% and increased p99 latency from 50ms to 900ms.
The Fix
Use the minimum isolation level required:
// WRONG: SERIALIZABLE for a simple lookup
@Transactional(isolation = Isolation.SERIALIZABLE)
public Product getProduct(Long id) {
return productRepository.findById(id).orElseThrow();
}
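// CORRECT: read-only transaction at the connection's default isolation level for simple reads
// Note: readOnly = true does not change the isolation level; configure READ_COMMITTED separately if the default differs
@Transactional(readOnly = true)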
public Product getProduct(Long id) {
return productRepository.findById(id).orElseThrow();
}
// Only use SERIALIZABLE for critical operations requiring full isolation
@Transactional(isolation = Isolation.SERIALIZABLE)
public void onCallScheduleUpdate(Long departmentId, Long doctorId) {
// Write skew prevention needed: checking on-call count then updating
long onCallCount = doctorRepository.countOnCall(departmentId);
if (onCallCount <= 1) {
throw new ScheduleViolationException("Cannot remove last on-call doctor");
}
doctorRepository.setOffCall(doctorId);
}
Isolation Level Usage Guide
| Operation Type | Recommended Level |
|---|---|
| Simple read for display | readOnly = true (READ_COMMITTED) |
| Read for update (will modify) | READ_COMMITTED with pessimistic or optimistic lock |
| Audit/report (multi-read consistency) | REPEATABLE_READ |
| Critical financial update | READ_COMMITTED + pessimistic lock |
| Write skew prevention | SERIALIZABLE (narrow scope only) |
| Batch report (long running read) | REPEATABLE_READ (snapshot) |
6. Anti-Pattern: Long-Running Transactions
What It Is
Holding a database transaction open for extended periods while doing non-DB work (HTTP calls, file I/O, processing).
The Bug
// ANTI-PATTERN: Transaction held open during external calls
@Transactional
public void processOrderWithExternalCalls(Long orderId) {
Order order = orderRepository.findById(orderId).orElseThrow();
order.setStatus(OrderStatus.PROCESSING);
orderRepository.save(order);
// DANGER: DB transaction is OPEN during this HTTP call (could take 2-10 seconds)
PaymentResult result = paymentGateway.charge(order.getTotalAmount()); // network call!
// DANGER: DB connection held, locks held, undo log growing
InventoryResult inv = inventoryService.reserve(order.getItems()); // another network call!
order.setStatus(OrderStatus.CONFIRMED);
orderRepository.save(order);
}
Production consequences:
- DB connection pool exhaustion (all connections waiting on external calls)
- Undo log bloat (MySQL cannot clean old row versions)
- Increased deadlock probability (locks held longer)
- Cascading failures (slow external call = all threads blocked = service down)
The Fix
// CORRECT: Narrow transactions, external calls outside transactions
@Service
public class OrderProcessingService {
    // Injected via constructor (omitted); assumed to be configured with a short timeout,
    // e.g. transactionTemplate.setTimeout(5)
    private final TransactionTemplate transactionTemplate;
    public void processOrder(Long orderId) {
        // 1. Short transaction: load and mark processing
        Order order = transactionTemplate.execute(status -> markOrderProcessing(orderId));
        // 2. External calls OUTSIDE transaction
        PaymentResult paymentResult = paymentGateway.charge(order.getTotalAmount());
        InventoryResult inventoryResult = inventoryService.reserve(order.getItems());
        // 3. Short transaction: update final state
        transactionTemplate.executeWithoutResult(
            status -> finalizeOrder(orderId, paymentResult, inventoryResult));
    }
    // No @Transactional on the helpers: Spring's proxy-based AOP ignores it on private
    // methods and on calls made through "this", so the boundaries are set explicitly above.
    private Order markOrderProcessing(Long orderId) {
        Order order = orderRepository.findByIdWithLock(orderId).orElseThrow();
        order.setStatus(OrderStatus.PROCESSING);
        return orderRepository.save(order);
    } // Transaction closes when execute(...) returns
    private void finalizeOrder(Long orderId, PaymentResult payment,
                               InventoryResult inventory) {
        Order order = orderRepository.findById(orderId).orElseThrow();
        order.setStatus(payment.isSuccess() ? OrderStatus.CONFIRMED : OrderStatus.FAILED);
        order.setPaymentTransactionId(payment.getTransactionId());
        orderRepository.save(order);
    } // Transaction closes when executeWithoutResult(...) returns
}
Rule of Thumb: A transaction should last milliseconds, not seconds. If a transaction spans a network call, it is too long.
7. Anti-Pattern: Cache Without Expiry and Without Invalidation
What It Is
Caching data indefinitely or with a very long TTL without a proper invalidation strategy. Data becomes stale and incorrect.
The Bug
// ANTI-PATTERN: Cache with no expiry
redisTemplate.opsForValue().set("user:123:preferences", preferences);
// Never expires! If user changes preferences, cache always returns old value.
// ANTI-PATTERN: Very long TTL with no invalidation
redisTemplate.opsForValue().set("product:456:price", price, Duration.ofDays(7));
// Price might change. Customer pays wrong price for up to 7 days.
The Fix
// CORRECT: Appropriate TTL + active invalidation
@Service
public class UserPreferenceService {
private static final Duration PREF_TTL = Duration.ofHours(1);
@Transactional
public void updatePreferences(Long userId, PreferenceUpdate update) {
// Update DB
userPreferenceRepository.update(userId, update);
// Actively invalidate cache
String cacheKey = "user:" + userId + ":preferences";
redisTemplate.delete(cacheKey);
log.debug("Invalidated preferences cache for user {}", userId);
}
public UserPreferences getPreferences(Long userId) {
String cacheKey = "user:" + userId + ":preferences";
UserPreferences cached = redisTemplate.opsForValue().get(cacheKey);
if (cached != null) return cached;
UserPreferences prefs = userPreferenceRepository.findByUserId(userId);
redisTemplate.opsForValue().set(cacheKey, prefs, PREF_TTL); // TTL as safety net
return prefs;
}
}
Best Practice: Always use TTL as a safety net even when you have active invalidation. TTL catches cases where invalidation fails (Redis unreachable, bug in code that forgot to invalidate).
8. Anti-Pattern: Cache Stampede and Thundering Herd
What It Is
When a cached item expires, many concurrent requests all miss the cache simultaneously, all query the database at the same time, overwhelming it.
The Scenario
Cache key "trending:products" expires at 3:00 PM
At 3:00:00.001 PM: 5,000 requests arrive, all miss cache
5,000 queries hit MySQL simultaneously
MySQL falls over under 5,000 concurrent queries
App returns 500 errors
The Fix - Mutex/Lock-Based Prevention
// CORRECT: Distributed lock prevents stampede
public List<Product> getTrendingProducts() {
String cacheKey = "trending:products";
// Try cache first (fast path)
List<Product> cached = (List<Product>) redisTemplate.opsForValue().get(cacheKey);
if (cached != null) return cached;
// Stampede prevention: only ONE instance rebuilds the cache
RLock lock = redissonClient.getLock("lock:rebuild:" + cacheKey);
try {
if (lock.tryLock(5, 10, TimeUnit.SECONDS)) {
// Double-check after acquiring lock
cached = (List<Product>) redisTemplate.opsForValue().get(cacheKey);
if (cached != null) return cached;
// Only one thread/instance reaches here
List<Product> fresh = productRepository.findTrending();
redisTemplate.opsForValue().set(cacheKey, fresh, Duration.ofMinutes(10));
return fresh;
} else {
// Could not get lock -- serve stale if possible, otherwise wait and retry
return getTrendingProductsWithStaleFallback(cacheKey);
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RuntimeException("Interrupted waiting for cache rebuild lock", e);
} finally {
if (lock.isHeldByCurrentThread()) lock.unlock();
}
}
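The soft-expiry variant in the next snippet depends on a `SoftCachedData` wrapper that the text does not define. A minimal sketch of what such a wrapper might look like follows. The field names are assumptions, and `isNearExpiry` takes an explicit `now` argument here (unlike the one-argument call in the snippet) so the logic can be exercised without a real clock.

```java
import java.time.Duration;
import java.time.Instant;

/** A cached value that carries its own logical ("soft") expiry. */
final class SoftCachedData<T> {
    private final T data;
    private final Instant storedAt;
    private final Duration softTtl;

    SoftCachedData(T data, Instant storedAt, Duration softTtl) {
        this.data = data;
        this.storedAt = storedAt;
        this.softTtl = softTtl;
    }

    T getData() {
        return data;
    }

    /**
     * True once less than {@code remainingFraction} of the soft TTL is left,
     * e.g. 0.20 means "within the last 20% of the entry's lifetime".
     */
    boolean isNearExpiry(double remainingFraction, Instant now) {
        long ageMillis = Duration.between(storedAt, now).toMillis();
        long thresholdMillis = (long) (softTtl.toMillis() * (1.0 - remainingFraction));
        return ageMillis >= thresholdMillis;
    }
}
```

The entry would typically be stored in Redis with a hard TTL longer than the soft TTL, so readers always find a value while the background refresh replaces it.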
// Better: Use "soft expiry" -- the cached value records its own logical expiry time
// Readers keep getting the current value while a background refresh starts shortly before that expiry
public List<Product> getTrendingProductsProactive() {
SoftCachedData<List<Product>> softCached = softCache.get("trending:products");
if (softCached != null) {
// Schedule background refresh if within 20% of expiry
if (softCached.isNearExpiry(0.20)) {
cacheRefreshExecutor.submit(() -> refreshTrendingCache());
}
return softCached.getData(); // Return current data while refreshing in background
}
return refreshTrendingCache();
}
9. Anti-Pattern: Split-Brain in Distributed Locking
What It Is
Two nodes simultaneously believe they hold the same distributed lock, leading to concurrent conflicting operations.
How It Happens
t=0: Node A acquires Redis lock with 30s TTL
t=1: Node A does work...
t=15: Node A pauses (GC pause, OS scheduling, etc.)
t=30: Lock expires (Node A still paused, unaware)
t=31: Node B acquires the same lock
t=31: Node B starts doing work on the same resource
t=35: Node A resumes from pause, still thinks it holds the lock
t=35: BOTH Node A and Node B are modifying the same resource simultaneously!
The Fix - Fencing Tokens
@Service
public class SafeLockService {
    private final RedissonClient redissonClient;
    private final ResourceRepository resourceRepository;
    public void processWithFencing(String resourceId, Data newData) {
        RLock lock = redissonClient.getLock("lock:" + resourceId);
        try {
            if (!lock.tryLock(10, 30, TimeUnit.SECONDS)) {
                throw new LockAcquisitionException("Could not acquire lock for " + resourceId);
            }
            // Take the fencing token while holding the lock, from a counter that only increases.
            // A later holder always gets a larger token, so an older token marks a zombie.
            // (The counter key is an assumption; newer Redisson releases also offer RFencedLock.)
            long fencingToken = redissonClient.getAtomicLong("fence:" + resourceId).incrementAndGet();
            // Include fencing token in every write operation
            doWorkWithFencingToken(resourceId, fencingToken, newData);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            if (lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }
    private void doWorkWithFencingToken(String resourceId, long fencingToken, Data newData) {
        // Store fencing token in DB alongside the data
        // If token in DB >= our token: we're a zombie, abort
        resourceRepository.updateWithFencingToken(resourceId, fencingToken, newData);
    }
}
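The effect of a fencing check on the timeline above can be seen in a small in-memory model. This is only an illustration: `FencedResource` is a hypothetical stand-in for the protected row, and the token values are arbitrary.

```java
/** In-memory stand-in for a resource that rejects writes carrying a stale fencing token. */
final class FencedResource {
    private long lastToken = -1;
    private String data = "";

    /** Applies the write only if the token is newer than every token seen so far. */
    synchronized boolean write(long fencingToken, String newData) {
        if (fencingToken <= lastToken) {
            return false; // stale holder (zombie): reject the write
        }
        lastToken = fencingToken;
        data = newData;
        return true;
    }

    synchronized String data() {
        return data;
    }
}
```

Replaying the scenario: Node A acquires the lock with token 33 and pauses, Node B acquires it with token 34 and writes, then Node A wakes up and its write with token 33 is refused. The check and the write must happen as one atomic step; here `synchronized` provides that, and in a relational store the same idea is usually expressed as a single conditional UPDATE on the stored token.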
// In repository: reject writes with stale fencing tokens
@Transactional
public void updateWithFencingToken(String resourceId, long fencingToken, Data newData) {
Resource resource = resourceRepository.findById(resourceId).orElseThrow();
if (resource.getLastFencingToken() >= fencingToken) {
throw new StaleWriteException("Stale write detected: our token " + fencingToken +
" is not newer than stored " + resource.getLastFencingToken());
}
resource.setData(newData);
resource.setLastFencingToken(fencingToken);
resourceRepository.save(resource);
}
Additional Protection: Heartbeats and Watchdog
// Redisson watchdog: automatically extends lock TTL if owner is still alive
// Enabled by default -- lock TTL is renewed every 10 seconds if holder is running
RLock lock = redissonClient.getLock("lock:resource");
lock.lock(); // No explicit TTL -- watchdog auto-renews until unlock()
// If holder dies: watchdog stops, TTL expires naturally, lock released
// Note: the watchdog runs inside the holder's JVM, so a long stop-the-world pause halts it too; fencing tokens are still needed
10. Anti-Pattern: Out-of-Order Event Processing
What It Is
Consuming events without ordering guarantees, leading to state machine violations.
The Bug
Events published in order:
1. ORDER_CREATED
2. PAYMENT_RECEIVED
3. ORDER_SHIPPED
4. ORDER_DELIVERED
Due to SQS Standard queue or parallel consumers:
Consumer 1 processes: ORDER_DELIVERED
Consumer 2 processes: ORDER_CREATED
Result: Order marked DELIVERED before it was even CREATED in the consumer service
The Fix
// Option 1: Use SQS FIFO with message group per order
// All events for orderId=123 go to the same group, processed in order
// Option 2: Kafka -- all events for same orderId go to same partition (same consumer)
kafkaTemplate.send("order-events",
orderId.toString(), // partition key -- same orderId = same partition = ordered
event);
// Option 3: Application-level ordering enforcement
@KafkaListener(topics = "order-events", groupId = "order-consumer")
@Transactional
public void handleOrderEvent(OrderEvent event) {
Order order = orderRepository.findById(event.getOrderId()).orElseThrow();
// State machine: only apply event if current state allows it
if (!order.canTransitionTo(event.getTargetState())) {
log.warn("Out-of-order event: order {} cannot transition from {} to {}. Event: {}",
event.getOrderId(), order.getStatus(), event.getTargetState(), event.getType());
// Option A: Dead-letter the event (it may arrive again after in-order events)
// Option B: Store as "pending" and reprocess after state catches up
outOfOrderEventStore.save(event);
return;
}
order.applyEvent(event);
orderRepository.save(order);
}
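The handler above relies on `canTransitionTo`, which is not shown. One way to express the allowed transitions is an enum that lists the legal successors of each state. The state names below are assumptions derived from the four events in the example.

```java
import java.util.EnumSet;
import java.util.Set;

/** Order lifecycle with an explicit table of legal transitions. */
enum OrderState {
    CREATED, PAID, SHIPPED, DELIVERED;

    /** States that may directly follow this one. */
    Set<OrderState> allowedNext() {
        switch (this) {
            case CREATED:
                return EnumSet.of(PAID);
            case PAID:
                return EnumSet.of(SHIPPED);
            case SHIPPED:
                return EnumSet.of(DELIVERED);
            default:
                return EnumSet.noneOf(OrderState.class); // DELIVERED is terminal
        }
    }

    boolean canTransitionTo(OrderState target) {
        return allowedNext().contains(target);
    }
}
```

With this table, an ORDER_DELIVERED event arriving while the order is still CREATED is rejected and can be parked for later, which is what the handler does with `outOfOrderEventStore`.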
// Reprocess stored out-of-order events periodically
@Scheduled(fixedDelay = 5000)
@Transactional
public void reprocessOutOfOrderEvents() {
List<OrderEvent> pending = outOfOrderEventStore.findPending();
for (OrderEvent event : pending) {
Order order = orderRepository.findById(event.getOrderId()).orElse(null);
if (order != null && order.canTransitionTo(event.getTargetState())) {
order.applyEvent(event);
orderRepository.save(order);
outOfOrderEventStore.delete(event);
}
}
}
11. Anti-Pattern: Tight Coupling via Synchronous Distributed Transactions
What It Is
Using synchronous calls or 2PC (Two-Phase Commit) across microservices to achieve strong consistency. This creates extreme coupling and availability dependencies.
Why 2PC Is Problematic in Microservices
Order Service calls:
1. BEGIN DISTRIBUTED TRANSACTION
2. --> Inventory Service: Reserve stock (waits)
3. --> Payment Service: Charge customer (waits)
4. COMMIT
If Inventory Service is slow: Order Service waits
If Payment Service crashes: Entire transaction blocked until recovery
If coordinator crashes mid-commit: All participants in limbo indefinitely
2PC availability: with three participants that are each 99% available and fail independently, P(system available) = P(all participants available) = 0.99 × 0.99 × 0.99 ≈ 0.97 (worse than any single service).
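The arithmetic generalizes to any number of participants. The helper below is a small sketch that assumes participant failures are independent, which real systems only approximate; the class and method names are made up for illustration.

```java
/** Availability of an operation that needs every participant to be up at once. */
final class CompoundAvailability {
    private CompoundAvailability() {}

    /** Multiplies the individual availabilities (independence assumed). */
    static double allAvailable(double... availabilities) {
        double product = 1.0;
        for (double a : availabilities) {
            product *= a;
        }
        return product;
    }
}
```

For three services at 0.99 each the product is 0.970299, roughly 97.03%. Over a year that is about 10.8 days of potential unavailability, compared with about 3.65 days for a single 99% service.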
The Fix
Use SAGA (choreography or orchestration) for distributed coordination with compensations. See Part 3, Sections 6 and 7.
Rule: Never use distributed transactions (2PC, XA transactions) across microservices. Use SAGAs with compensating transactions instead.
12. Anti-Pattern: Missing Compensation in SAGA
What It Is
Implementing a SAGA without proper compensating transactions for failure scenarios.
The Bug
// ANTI-PATTERN: SAGA without compensation
@KafkaListener(topics = "payment-events")
public void onPaymentFailed(PaymentFailedEvent event) {
Order order = orderRepository.findById(event.getOrderId()).orElseThrow();
order.setStatus(OrderStatus.FAILED);
orderRepository.save(order);
// FORGOT: Inventory was already reserved! Never released!
// Result: Inventory count permanently decremented for a failed order
}
The Fix
Every SAGA step must have a defined compensation:
| Step | Forward Action | Compensating Action |
|---|---|---|
| 1 | Create Order (PENDING) | Cancel Order |
| 2 | Reserve Inventory | Release Inventory Reservation |
| 3 | Charge Payment | Refund Payment |
| 4 | Ship Order | Initiate Return |
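The pairing of forward and compensating actions in the table can be modeled as a list of steps that is unwound in reverse when one step fails. This is a simplified, synchronous sketch; `SagaStep` and `SagaRunner` are hypothetical names, and a real SAGA would exchange messages between services rather than run local callbacks.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** A forward action paired with the action that undoes it. */
final class SagaStep {
    final String name;
    final Runnable action;
    final Runnable compensation;

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

final class SagaRunner {
    private SagaRunner() {}

    /** Runs steps in order; on failure, compensates the completed steps in reverse order. */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step); // most recently completed step ends up on top
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }
}
```

If the payment step throws after the order was created and inventory reserved, the runner releases the reservation first and cancels the order second. The failed step itself is not compensated, because it never completed.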
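// CORRECT: the failure handler also triggers compensation
@KafkaListener(topics = "payment-events")
public void onPaymentFailed(PaymentFailedEvent event) {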
Order order = orderRepository.findById(event.getOrderId()).orElseThrow();
order.setStatus(OrderStatus.FAILED);
orderRepository.save(order);
// COMPENSATION: Release inventory that was reserved in Step 2
inventoryCompensationPublisher.publish(new ReleaseInventoryCommand(
event.getOrderId(), order.getItems()));
// Log for observability
log.info("SAGA compensation: releasing inventory for failed order {}", event.getOrderId());
}
13. Real Production Failures and Lessons
Failure 1: The Black Friday Inventory Oversell
What happened: An e-commerce platform used DynamoDB with eventually consistent reads to check inventory before checkout. Under high load on Black Friday, the replica lag was 2-3 seconds. 50,000 customers checked inventory, all saw "In Stock" (from stale replicas), all purchased. The actual inventory was 10,000 units. 40,000 orders had to be cancelled.
Root Cause: The inventory check used eventually consistent reads, and the reservation was a separate read-then-write instead of an atomic conditional update.
Fix:
// Inventory reservation MUST be atomic and strongly consistent
// Use DynamoDB conditional write: only decrement if quantity >= requested amount
UpdateItemRequest reserve = UpdateItemRequest.builder()
.tableName("Inventory")
.key(Map.of("productId", AttributeValue.fromS(productId)))
.updateExpression("SET quantity = quantity - :qty")
.conditionExpression("quantity >= :qty") // Atomic check-and-decrement
.expressionAttributeValues(Map.of(":qty", AttributeValue.fromN(String.valueOf(quantity))))
.build();
Lesson: Any operation that allocates a finite resource MUST use atomic, strongly consistent operations. Not "read then write" but "atomic conditional write."
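The difference between "read then write" and an atomic conditional write can be shown without a database. The sketch below uses a compare-and-set loop on an in-memory counter as an analogue of the conditional update; `InventoryCounter` is a hypothetical class, not part of any DynamoDB API.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory analogue of "decrement only if quantity >= requested". */
final class InventoryCounter {
    private final AtomicInteger quantity;

    InventoryCounter(int initial) {
        this.quantity = new AtomicInteger(initial);
    }

    /** Reserves qty units if available; the check and the decrement form one atomic step. */
    boolean tryReserve(int qty) {
        while (true) {
            int current = quantity.get();
            if (current < qty) {
                return false; // not enough stock: reject instead of overselling
            }
            if (quantity.compareAndSet(current, current - qty)) {
                return true;
            }
            // another thread changed the value in between; re-read and try again
        }
    }

    int remaining() {
        return quantity.get();
    }
}
```

With 10 units in stock and 50 concurrent buyers, exactly 10 reservations succeed and the counter never drops below zero, no matter how the threads interleave.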
Failure 2: The Ghost Payment
What happened: A payments microservice called an external payment gateway. The gateway processed the payment but the network response timed out. The microservice's retry logic retried the call without idempotency keys. Customer was charged twice.
Root Cause: External API calls without idempotency keys + retry logic = duplicate charges.
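Fix: Always pass an idempotency key when calling external services, and reuse the same key on every retry of the same payment:
ExternalPaymentRequest request = ExternalPaymentRequest.builder()
.customerId(customerId)
.amount(amount)
// The key must stay identical across retries, so do NOT include the attempt number
.idempotencyKey("payment-" + orderId) // or a UUID generated once per payment and stored with it
.build();
externalGatewayClient.charge(request);
Failure 3: The Replica Lag Data Breach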
What happened: A healthcare application used read replicas for all reads, including reading authorization data. After a user's access was revoked (written to primary), the replica still showed the user as authorized for 15 seconds. During those 15 seconds, the user accessed protected data.
Root Cause: Authorization and access control reads were using eventually consistent replicas.
Fix: Security-critical reads (authentication, authorization, session validation) MUST always read from the primary:
// NEVER read auth data from replica
@Transactional
public boolean isAuthorized(Long userId, String resource) {
    DataSourceContextHolder.setDataSourceType(DataSourceType.WRITE); // Forces primary read
    try {
        return permissionRepository.hasPermission(userId, resource);
    } finally {
        // Reset so the routing flag does not leak to later work on this thread
        DataSourceContextHolder.clearDataSourceType();
    }
}
Failure 4: The Lost Update in Financial Reconciliation
What happened: Two concurrent batch jobs both read a reconciliation record (balance = $10,000), both applied their calculations, both wrote back. The second write overwrote the first. One set of transactions was effectively "lost."
Root Cause: Classic lost update problem. Read-modify-write without any concurrency control.
Fix:
// Atomic SQL update instead of read-modify-write
@Transactional
public void applyTransactionBatch(Long accountId, List<Transaction> transactions) {
BigDecimal totalDelta = transactions.stream()
.map(t -> t.isDebit() ? t.getAmount().negate() : t.getAmount())
.reduce(BigDecimal.ZERO, BigDecimal::add);
// Atomic: database calculates new balance, not the application
int updated = accountRepository.atomicUpdateBalance(accountId, totalDelta);
if (updated == 0) {
throw new AccountNotFoundException(accountId);
}
}
// In repository (native SQL, because "accounts" is a table name rather than an entity name):
@Modifying
@Query(value = "UPDATE accounts SET balance = balance + :delta WHERE id = :id", nativeQuery = true)
int atomicUpdateBalance(@Param("id") Long id, @Param("delta") BigDecimal delta);
14. Consistency Trade-Off Matrix
| Dimension | Strong Consistency | Eventual Consistency |
|---|---|---|
| Data Freshness | Always current | May be seconds/minutes stale |
| Write Latency | Higher (coordination overhead) | Lower (asynchronous) |
| Read Latency | Higher (read from primary or quorum) | Lower (read from nearest replica) |
| Throughput | Lower (serialization overhead) | Higher (parallel processing) |
| Availability | Lower (may block during partition) | Higher (serves requests always) |
| Conflict Risk | None (serialized) | Possible concurrent conflicts |
| Implementation Complexity | Lower (DB handles it) | Higher (app must handle conflicts) |
| Cost (DynamoDB) | 2x read units | 1x read unit |
| Use For | Money, auth, inventory | Catalog, analytics, feeds |
The Trade-Off Decision Tree
Does incorrect data here cause financial loss, security issue, or legal liability?
  YES --> Strong consistency required
  NO -->
    Does the user who just wrote this data need to see it immediately?
      YES --> Read-Your-Writes (session consistency)
      NO -->
        Can the business tolerate 1-10 seconds of staleness?
          YES --> Eventual consistency (cheaper, faster)
          NO -->
            Can it tolerate seconds but not minutes?
              YES --> Bounded staleness
              NO --> Strong consistency
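The tree can also be written as a function, which makes the order of the questions explicit and easy to unit test. The sketch below mirrors the four questions one to one; the enum and parameter names are assumptions introduced for illustration.

```java
enum ConsistencyChoice { STRONG, READ_YOUR_WRITES, EVENTUAL, BOUNDED_STALENESS }

final class ConsistencyDecision {
    private ConsistencyDecision() {}

    /**
     * Walks the decision tree top to bottom; the first matching branch wins.
     *
     * @param highStakes            incorrect data causes financial loss, security issue, or legal liability
     * @param writerMustSeeOwnWrite the user who wrote the data must see it immediately
     * @param toleratesShortLag     the third question in the tree is answered YES
     * @param toleratesBoundedLag   the fourth question in the tree is answered YES
     */
    static ConsistencyChoice choose(boolean highStakes,
                                    boolean writerMustSeeOwnWrite,
                                    boolean toleratesShortLag,
                                    boolean toleratesBoundedLag) {
        if (highStakes) {
            return ConsistencyChoice.STRONG;
        }
        if (writerMustSeeOwnWrite) {
            return ConsistencyChoice.READ_YOUR_WRITES;
        }
        if (toleratesShortLag) {
            return ConsistencyChoice.EVENTUAL;
        }
        if (toleratesBoundedLag) {
            return ConsistencyChoice.BOUNDED_STALENESS;
        }
        return ConsistencyChoice.STRONG;
    }
}
```

For example, an account balance used for a transfer is high stakes and maps to STRONG, while a profile page maps to READ_YOUR_WRITES and a trending-products widget to EVENTUAL.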
15. Architectural Decision Framework
Questions to Answer Before Choosing Consistency Model
- Data criticality: What is the business impact of incorrect data? (Financial, reputational, legal?)
- Read/write ratio: Is it read-heavy (caching beneficial) or write-heavy?
- Concurrency level: How many concurrent users modify the same data?
- User expectation: Does the user expect to immediately see their own changes?
- Geographic distribution: Is your user base global? Where does the data live?
- Volume: How many operations per second? What is the acceptable latency?
- Failure tolerance: What happens if a read returns stale data? What happens if a write is delayed?
Data Classification Framework
Tier 1 - Strong Consistency Required:
- Financial balances, transaction records
- Authentication tokens, session data
- Inventory counts for purchase
- Authorization/permission data
- Unique constraint enforcement (username, email)
- Distributed locks and coordination
Tier 2 - Session/Causal Consistency:
- User profiles and preferences (user sees own changes)
- Shopping cart (session-scoped)
- User-generated content (comment threads, posts)
- Status updates (order tracking)
Tier 3 - Eventual Consistency Acceptable:
- Product catalog (prices, descriptions)
- Recommendations and personalization
- Analytics dashboards
- Social media feeds (likes, views)
- Non-critical configuration
- Search indexes
Tier 4 - Approximate is Fine:
- Page view counters
- Rating aggregates
- Trending scores
- Activity logs (some loss acceptable)
16. Tips and Tricks from Production
Tip 1: Set Connection-Level Transaction Isolation
Instead of annotating every method, set the isolation level at the connection level for your use case:
# In application.yml for services that are all READ_COMMITTED
spring:
  jpa:
    properties:
      hibernate:
        connection:
          isolation: 2 # READ_COMMITTED
Override per-method only when you need SERIALIZABLE or REPEATABLE_READ.
Tip 2: Use Batch Fetching to Reduce Lock Contention
Instead of N individual locks, acquire a batch lock:
// More efficient: one batch select FOR UPDATE instead of N individual locks
List<Account> accounts = accountRepository.findByIdsForUpdate(accountIds);
Tip 3: Monitor Slow Queries That Hold Locks
-- Find transactions that are waiting for a lock in MySQL
SELECT trx_id, trx_started, trx_requested_lock_id, trx_mysql_thread_id, trx_query
FROM information_schema.INNODB_TRX
WHERE trx_state = 'LOCK WAIT'
ORDER BY trx_started;
-- Find which transaction is blocking (MySQL 5.7)
-- MySQL 8.0 removed INNODB_LOCK_WAITS; query performance_schema.data_lock_waits or sys.innodb_lock_waits instead
SELECT blocking_trx_id, blocking_lock_id, requesting_trx_id
FROM information_schema.INNODB_LOCK_WAITS;
Tip 4: Use SELECT ... FOR UPDATE SKIP LOCKED for Queue-Like Patterns
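// Instead of distributed lock for a job queue, use SKIP LOCKED
// Allows multiple workers to process different items concurrently without blocking
// Native SQL is used because standard JPQL has no LIMIT or SKIP LOCKED syntax;
// the table and column names (jobs, status, created_at) are assumed.
// Must be called inside a transaction so the row locks are held while the jobs are processed.
@Query(value = "SELECT * FROM jobs WHERE status = 'PENDING' " +
    "ORDER BY created_at LIMIT :batchSize " +
    "FOR UPDATE SKIP LOCKED", // Skip rows locked by other workers
    nativeQuery = true)
List<Job> findAndLockPendingJobs(@Param("batchSize") int batchSize);
Tip 5: Versioned Cache Keys for Zero-Downtime Cache Invalidation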
// Include application version or data version in cache key
// Rolling deploy: new version writes to new key, old version reads from old key
// No cache invalidation storms during deployment
String cacheKey = "product:v" + dataVersion + ":" + productId;
Tip 6: Use @TransactionalEventListener for Post-Commit Events
// WRONG: handling the event with a plain @EventListener.
// The listener runs synchronously inside publishEvent(), before the transaction commits.
// If the transaction then rolls back, the welcome email has already been sent!
@Transactional
public void createUser(CreateUserRequest request) {
    userRepository.save(new User(request));
    applicationEventPublisher.publishEvent(new UserCreatedEvent(request.getEmail()));
}
// CORRECT: publish the same way, but handle the event with @TransactionalEventListener
@Transactional
public void createUser(CreateUserRequest request) {
    User user = userRepository.save(new User(request));
    applicationEventPublisher.publishEvent(new UserCreatedEvent(user));
    // The listener below is not invoked until the commit succeeds
}
@Component
public class UserCreatedListener {
    @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
    public void handleUserCreated(UserCreatedEvent event) {
        emailService.sendWelcomeEmail(event.getEmail());
        // Runs only after the DB transaction commits
    }
}
Tip 7: Use open-in-view: false in Spring Boot
spring:
  jpa:
    open-in-view: false # CRITICAL for production
open-in-view: true (the default) keeps the persistence context, and with it any DB connection it has acquired, open for the entire HTTP request lifecycle, including after the controller method returns. Connections are therefore held while JSON responses are serialized, which leads to connection pool exhaustion under load.
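The exhaustion effect can be pictured with a toy model in which a Semaphore stands in for the connection pool. This is a sketch under simplifying assumptions (all requests are still serializing their responses at the same moment; OpenInViewModel is a hypothetical name), not a measurement of Spring's behaviour:

```java
import java.util.concurrent.Semaphore;

// Toy model (hypothetical names): a Semaphore plays the role of the connection pool.
final class OpenInViewModel {
    // Counts how many of `requests` overlapping requests obtain a connection from a
    // pool of `poolSize` while every earlier request is still serializing its response.
    static int served(int poolSize, int requests, boolean openInView) {
        Semaphore pool = new Semaphore(poolSize);
        int served = 0;
        for (int i = 0; i < requests; i++) {
            if (!pool.tryAcquire()) {
                continue; // pool exhausted: this request would wait, then time out
            }
            served++;
            // ... controller and service run their queries here ...
            if (!openInView) {
                pool.release(); // connection returned before JSON serialization starts
            }
            // with openInView, the permit stays held while the response is serialized
        }
        return served;
    }
}
```

With a pool of 2 and 5 overlapping requests, the model serves only 2 requests when the connection is held through serialization and all 5 when it is returned as soon as the queries finish.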
Tip 8: Tune HikariCP for AWS Aurora
spring:
  datasource:
    hikari:
      max-lifetime: 1740000 # 29 minutes -- stays under the 30-minute cutoff assumed here for Aurora; verify against your wait_timeout
      keepalive-time: 60000 # 1 minute -- keeps idle connections from being dropped by intermediate network devices
      connection-timeout: 30000 # 30 seconds -- fail rather than wait indefinitely if the pool is exhausted
      leak-detection-threshold: 60000 # Warn if connection held > 60s (detect long transactions)
17. Consistency Observability and Debugging
Metrics to Track
@Component
@RequiredArgsConstructor
public class ConsistencyMetrics {
    private final MeterRegistry meterRegistry;
    // Holds the latest outbox lag; the gauge reads it on every scrape
    private final AtomicLong outboxLagMs = new AtomicLong();
    // Register the gauge once. Micrometer keeps only a weak reference to the observed
    // object, so it must be a long-lived field rather than a value boxed per call.
    @PostConstruct
    void registerGauges() {
        meterRegistry.gauge("consistency.outbox.lag_ms", outboxLagMs);
    }
    // Track optimistic lock conflicts
    public void recordOptimisticLockConflict(String entityType) {
        meterRegistry.counter("consistency.optimistic_lock.conflict",
                Tags.of("entity", entityType)).increment();
    }
    // Track cache miss rate (high miss rate = potential inconsistency or cold cache)
    public void recordCacheHit(String cacheName, boolean hit) {
        meterRegistry.counter("consistency.cache." + (hit ? "hit" : "miss"),
                Tags.of("cache", cacheName)).increment();
    }
    // Track outbox event processing lag
    public void recordOutboxLag(long lagMs) {
        outboxLagMs.set(lagMs);
    }
    // Track idempotency key hits (duplicate request detection)
    public void recordIdempotencyHit(String operationType) {
        meterRegistry.counter("consistency.idempotency.hit",
                Tags.of("operation", operationType)).increment();
    }
}
SQL Queries for Production Debugging
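-- Check for replica lag on Aurora reader
-- (replica_lag_in_milliseconds is the lag itself; the age of last_update_timestamp
-- only shows how stale this status row is)
SELECT server_id,
session_id,
replica_lag_in_milliseconds,
last_update_timestamp,
TIMESTAMPDIFF(SECOND, last_update_timestamp, NOW()) AS status_age_seconds
FROM information_schema.replica_host_status;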
-- Check for long-running transactions (potential lock holders)
SELECT trx_id,
trx_started,
TIMEDIFF(NOW(), trx_started) AS duration,
trx_state,
trx_mysql_thread_id,
SUBSTR(trx_query, 1, 200) AS query
FROM information_schema.INNODB_TRX
WHERE TIMEDIFF(NOW(), trx_started) > '00:00:05' -- running > 5 seconds
ORDER BY trx_started;
-- Check current lock waits: who is waiting and who is blocking
-- (MySQL 8.0: performance_schema.data_lock_waits replaces information_schema.INNODB_LOCK_WAITS)
SELECT r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query
FROM performance_schema.data_lock_waits w
INNER JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_engine_transaction_id
INNER JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_engine_transaction_id;
-- Check pending outbox events (should be near zero in healthy system)
SELECT status, COUNT(*) AS count, MIN(created_at) AS oldest
FROM outbox_events
GROUP BY status;
-- Check for stuck PROCESSING outbox events (publisher crashed mid-process)
SELECT * FROM outbox_events
WHERE status = 'PROCESSING'
AND created_at < DATE_SUB(NOW(), INTERVAL 5 MINUTE);
-- These should be reset to PENDING and retried
Next: Part 6: Interview Questions and Answers -- Comprehensive interview preparation from beginner to Technical Architect level.
Part of the Consistency Models Demystified series
Stack: Java 17, Spring Boot 3.x, MySQL 8.0, AWS