Message Queues Demystified - Part 6: Pitfalls and Best Practices
"In theory, there is no difference between theory and practice. In practice, there is."
Every pitfall in this list represents real production incidents that have caused outages,
data loss, or duplicate billing in real systems.
Table of Contents
- The 15 Most Dangerous Pitfalls
- Anti-Patterns That Look Good But Destroy Production Systems
- Best Practices for Producers
- Best Practices for Consumers
- Best Practices for Infrastructure and Operations
- Production Readiness Checklist
- Design Checklist
The 15 Most Dangerous Pitfalls
Pitfall 1: Not Handling Duplicate Messages (Non-Idempotent Consumers)
The mistake:
// DANGEROUS - processing the same message twice will credit the loyalty points twice
@KafkaListener(topics = "order-events")
public void handleOrderPlaced(OrderEvent event) {
loyaltyService.addPoints(event.getCustomerId(), event.getPoints()); // Not idempotent
emailService.sendConfirmation(event.getEmail()); // Sends 2 emails
orderRepository.insert(new Order(event)); // Duplicate row
}Why it happens: At-least-once delivery (the default for all major brokers) guarantees every message is delivered at least once. Duplicates are normal, not exceptions.
The fix:
@KafkaListener(topics = "order-events")
@Transactional
public void handleOrderPlaced(OrderEvent event) {
// Check if already processed - within the same transaction
if (processedEventRepository.existsByEventId(event.getEventId())) {
log.debug("Duplicate event ignored: {}", event.getEventId());
return;
}
// Process
loyaltyService.addPoints(event.getCustomerId(), event.getPoints());
orderRepository.save(new Order(event));
// Mark as processed
processedEventRepository.save(new ProcessedEvent(event.getEventId()));
}Real-world consequence: Users get charged twice, credited twice, or receive 200 welcome emails.
Pitfall 2: Silent Message Loss from Auto-Commit Before Processing
The mistake:
// Spring Kafka with auto-commit ON (the default in some configurations)
// Kafka offsets auto-commit every 5 seconds REGARDLESS of whether processing succeeded
@KafkaListener(topics = "payment-events")
public void handlePayment(PaymentEvent event) {
// Suppose this throws an exception at 3 seconds
// Offset has already been committed at the 5-second auto-commit interval
// This message is LOST if the exception happens after commit but before processing
paymentService.processPayment(event); // Can throw
}The fix:
@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> factory() {
factory.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);
// NEVER use auto-commit for critical data
// Always use manual acknowledgment
return factory;
}
@KafkaListener(topics = "payment-events")
public void handlePayment(PaymentEvent event, Acknowledgment ack) {
paymentService.processPayment(event); // If this throws, no ACK = redelivery
ack.acknowledge(); // Only ACK after successful processing
}Real-world consequence: Financial transactions silently lost. Orders processed but never tracked.
Pitfall 3: Poison Message Causing Infinite Retry Loop
The mistake:
// A message with bad data will retry forever
@KafkaListener(topics = "orders")
public void handleOrder(String messageBody) {
// This message has malformed JSON: {"orderId": null, "amount": "not-a-number"}
OrderEvent event = objectMapper.readValue(messageBody, OrderEvent.class); // Throws
processOrder(event);
// Throws every time -> retried forever -> blocks queue processing
}What happens without DLQ:
Retry 1: FAIL
Retry 2: FAIL (1 second later)
Retry 3: FAIL (2 seconds later)
... forever ...
Retry 1000: FAIL (still the same bad message)
The consumer is stuck. Other messages behind this one are not processed.
The fix:
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> template) {
// Retry with exponential backoff
ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
backOff.setInitialInterval(1000L); // 1 second
backOff.setMultiplier(2.0); // Double each retry
backOff.setMaxInterval(30000L); // Max 30 seconds
// After max retries, send to DLQ
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
return new DefaultErrorHandler(recoverer, backOff);
}Real-world consequence: One bad message brings down an entire processing pipeline. Queue depth grows to infinity.
Pitfall 4: Not Monitoring Consumer Lag
The mistake:
- Deploy the system
- Set up basic CPU/memory alerts
- Forget about consumer lag entirely
What happens:
Day 1: Consumer lag = 0
Day 5: Consumer lag = 50,000 (nobody noticed)
Day 7: Consumer lag = 500,000 (DBA notices slow queries)
Day 8: Consumer lag = 2,000,000 (SLA breach)
Day 9: Consumer lag = 5,000,000 (on-call engineer gets paged at 3am)
It took 9 days of degraded service before anyone noticed.
The fix: Alert on consumer lag exceeding thresholds (see Part 5).
Real-world consequence: SLA breaches, customer complaints, 3am incidents that could have been caught in seconds.
Pitfall 5: The Dual-Write Problem (DB + Message Queue)
The mistake:
@Transactional
public void placeOrder(OrderRequest request) {
orderRepository.save(order); // Step 1: DB write
kafkaTemplate.send("orders", event); // Step 2: Kafka write
// App crashes between Step 1 and Step 2:
// --> Order saved in DB, event never published
// --> Downstream services (Email, Warehouse) never know the order exists
// --> Silent inconsistency - very hard to detect
}The fix: Use the Outbox Pattern (detailed in Part 4).
Real-world consequence: Orders that are in the database but were never fulfilled. Customers who ordered and paid but never received their items.
Pitfall 6: Expecting Ordering Without Partition Keys (Kafka)
The mistake:
// Publishing events WITHOUT a partition key
kafkaTemplate.send("order-events", null, event);
// Kafka uses round-robin distribution: events go to different partitions
// Events for the SAME order can be in different partitions
// Different partitions = different consumer instances
// Events for the SAME order can be processed OUT OF ORDERWhat happens:
OrderPlaced --> Partition 0 --> Consumer A
PaymentReceived --> Partition 1 --> Consumer B (processed before Consumer A)
OrderShipped --> Partition 2 --> Consumer C
Consumer B processes PaymentReceived before Consumer A processes OrderPlaced
Result: payment recorded but order does not exist yet = inconsistency
The fix:
// Use a partition key that groups related messages together
kafkaTemplate.send("order-events", order.getId().toString(), event);
// All events for the SAME order go to the SAME partition
// SAME partition = SAME consumer = GUARANTEED orderingReal-world consequence: State corruption where payments are applied before orders exist, or refunds processed before the original charge.
Pitfall 7: Not Setting Message TTL or Retention Limits
The mistake:
- No TTL on queue messages
- No retention limit on Kafka topics
- No max-length on RabbitMQ queues
What happens:
Month 1: Queue depth = 0, everything fine
Month 2: Consumer has a bug, goes offline for 3 days
Month 3: Queue depth = 50 million messages, 200GB
Month 3: Consumer comes back online
Month 3: Consumer processes 50 million messages, some from 3 months ago
Month 3: Processing 3-month-old "send welcome email" events for users
who already received 10 follow-up emails
Also: Broker disk runs out, system goes down
The fix:
// RabbitMQ: Set message TTL and queue max-length
QueueBuilder.durable("onboarding-emails")
.withArgument("x-message-ttl", 86400000) // 1 day TTL
.withArgument("x-max-length", 100000) // Max 100K messages
.withArgument("x-overflow", "reject-publish")
.build();
// Kafka: Set retention per topic
kafka-configs.sh --alter --entity-type topics --entity-name onboarding-events \
--add-config retention.ms=86400000 // 1 day retentionReal-world consequence: Stale messages processed months later causing incorrect system state. Disk full causing broker outage.
Pitfall 8: Using a Single Partition for "Ordering" When You Actually Need Scale
The mistake:
Teams hear "Kafka guarantees ordering within a partition" and create topics with 1 partition to guarantee ordering.
What happens:
Topic: payment-events (1 partition)
Throughput needed: 10,000 payments/second
Throughput of 1 partition: ~2,000 messages/second
System fails under load. The "ordering" comes at the cost of system availability.
The fix:
- Use partition keys to group messages that need relative ordering
- Use multiple partitions (one per logical entity type)
- Different orders can be in different partitions; same order always in the same partition
// 12 partitions, but all events for order-123 ALWAYS in the same partition
kafkaTemplate.send("payment-events", orderId, paymentEvent);
// hash(orderId) % 12 determines partition
// All events for the same orderId = same partition = orderedReal-world consequence: Payment system throughput bottleneck causing cascading failures.
Pitfall 9: Connecting Too Many Consumers to RabbitMQ
The mistake:
// Each microservice instance opens a NEW connection
// 50 instances x 10 services = 500 connections
// Each connection uses file descriptors and memory
// RabbitMQ has a default limit of 65535 connections
// But managing thousands of idle connections causes CPU overheadThe fix:
// Use a connection pool and share connections
// Spring AMQP handles this, but configure properly:
@Bean
public CachingConnectionFactory connectionFactory() {
CachingConnectionFactory factory = new CachingConnectionFactory();
factory.setChannelCacheSize(25); // 25 channels per connection
factory.setCacheMode(CachingMode.CONNECTION); // Pool connections
factory.setConnectionCacheSize(2); // Reuse 2 connections max per service
return factory;
}Real-world consequence: RabbitMQ CPU saturated handling thousands of idle connections. Memory exhausted.
Pitfall 10: Not Testing Failure Scenarios
The mistake:
Testing only the happy path:
- Does a message get published?
- Does the consumer receive it?
Never testing:
- What happens when the consumer crashes mid-processing?
- What happens when the database is unavailable?
- What happens when Kafka is down for 30 seconds?
- What happens when a poison message appears?
The fix:
@Test
void shouldNotLoseMessageWhenConsumerCrashesBeforeAck() {
// Arrange
String orderId = "test-order-123";
OrderEvent event = new OrderEvent(orderId, ...);
kafkaTemplate.send("orders", orderId, event);
// Simulate consumer crash after receiving but before ACKing
consumerSimulator.receiveAndCrashBeforeAck();
// Act - restart consumer
consumerSimulator.restart();
// Assert - message should be reprocessed
await().atMost(30, SECONDS)
.untilAsserted(() -> {
verify(orderService, times(1)).processOrder(argThat(
e -> e.getOrderId().equals(orderId)));
});
}
@Test
void shouldHandlePoisonMessageWithoutBlockingOthers() {
// Publish a poison message
kafkaTemplate.send("orders", "poison", "INVALID_JSON{{{");
// Publish valid messages
kafkaTemplate.send("orders", "valid-1", new OrderEvent("order-1", ...));
kafkaTemplate.send("orders", "valid-2", new OrderEvent("order-2", ...));
// Assert: poison message went to DLQ, valid messages were processed
await().atMost(30, SECONDS).untilAsserted(() -> {
assertThat(dlqMessages.size()).isEqualTo(1);
assertThat(processedOrders).contains("order-1", "order-2");
});
}Real-world consequence: Data loss discovered in production. Incidents that could have been caught in development.
Pitfall 11: Schema Changes That Break Consumers
The mistake:
// Version 1 - working fine
public class OrderEvent {
private String orderId;
private String status;
}
// Version 2 - developer renames field
public class OrderEvent {
private String orderId;
private String orderStatus; // RENAMED from "status" - breaking change
}If the producer deploys Version 2 BEFORE all consumers are updated to Version 2:
- Consumer still expects
statusfield - Consumer receives
orderStatusfield - Consumer gets null for
status - Downstream logic breaks or corrupts data
The fix:
- Never rename fields - add new fields with new names
- Use Schema Registry with BACKWARD compatibility rules
- Deploy consumers BEFORE producers when making changes
- Use schema versioning in message headers
// Safe evolution - add, never rename
public class OrderEvent {
private String orderId;
private String status; // Keep the old field
private String orderStatus; // Add the new field too, mark old as deprecated
}Real-world consequence: Null pointer exceptions, data corruption, silent failures across services.
Pitfall 12: Tightly Coupling to the Message Queue Implementation
The mistake:
// Scattered across 50 service classes:
rabbitTemplate.convertAndSend("order-exchange", "order.placed", body);
// Now every place in the code knows about exchange names, routing keys
// When you migrate to Kafka, you need to change 50 filesThe fix:
// Abstract the messaging behind an interface
public interface EventPublisher {
void publish(DomainEvent event);
}
@Service
public class KafkaEventPublisher implements EventPublisher {
@Override
public void publish(DomainEvent event) {
kafkaTemplate.send(topicResolver.resolve(event.getType()),
event.getAggregateId(), event);
}
}
// All services just call publisher.publish(event)
// Switching from Kafka to RabbitMQ = change one class, not 50Real-world consequence: Massive refactoring effort when migrating brokers. Tight coupling that makes testing difficult.
Pitfall 13: Not Using Connection Retry and Reconnect Logic
The mistake:
// Producer starts before Kafka is ready
// Or Kafka restarts for maintenance
// Without retry: application crashes with "Unable to connect to Kafka"
// With no reconnect: application stays down until manually restartedThe fix:
# Kafka client auto-reconnect (built-in)
reconnect.backoff.ms=1000
reconnect.backoff.max.ms=10000
retry.backoff.ms=100
# Spring Boot application startup - wait for Kafka
@Component
@Order(Ordered.LOWEST_PRECEDENCE)
public class KafkaHealthWaiter implements ApplicationListener<ApplicationReadyEvent> {
private final KafkaAdmin kafkaAdmin;
@Override
public void onApplicationEvent(ApplicationReadyEvent event) {
RetryTemplate.builder()
.maxAttempts(10)
.exponentialBackoff(1000, 2, 30000)
.build()
.execute(context -> kafkaAdmin.describeTopics("order-events"));
}
}Real-world consequence: Application fails to start during normal Kafka maintenance window. On-call engineer paged at 2am.
Pitfall 14: Not Considering Message Size Limits
The mistake:
// Embedding large objects directly in messages
kafkaTemplate.send("image-events", userId, new ImageEvent(
imageData, // 5MB JPEG image embedded in the message
metadata
));
// Default Kafka max message size: 1MB
// This throws: RecordTooLargeExceptionThe fix: Use the Claim Check Pattern (see Part 2):
// Store large payload in S3, pass reference in message
s3Client.putObject("images-bucket", imageId + ".jpg", imageData);
kafkaTemplate.send("image-events", userId, new ImageEvent(
"s3://images-bucket/" + imageId + ".jpg", // Reference only
metadata
));Real-world consequence: RecordTooLargeException in production. Images silently dropped.
Pitfall 15: The Thundering Herd on Consumer Group Restart
The mistake:
Consumer group processes 10,000 messages/second normally.
Deploy a new version (rolling restart).
All consumer instances restart within 60 seconds.
During restart: messages accumulate (60 seconds x 10,000/sec = 600,000 messages)
All consumers come online simultaneously.
All consumers start processing full throttle at once.
Downstream database receives 100x normal write load.
Database struggles, queries timeout, cascading failure.
The fix:
// Rate limit consumer processing on startup
@Component
public class ThrottledConsumer {
@Value("${consumer.startup.ramp-up-seconds:30}")
private int rampUpSeconds;
private RateLimiter rateLimiter;
@PostConstruct
public void init() {
// Start at 10% of normal rate, increase linearly over ramp-up period
rateLimiter = RateLimiter.create(normalRate * 0.1);
scheduleRampUp();
}
@KafkaListener(topics = "order-events")
public void handleOrder(OrderEvent event, Acknowledgment ack) {
rateLimiter.acquire();
orderService.process(event);
ack.acknowledge();
}
}Anti-Patterns
Anti-Pattern 1: Using a Message Queue as a Database
// WRONG: Using the queue as permanent storage
// Reading the same messages repeatedly without ACKing them
// "Browsing" the queue to check current state
// This is not what queues are for.
// Use a database for persistent state.
// Use a queue for transient messages to be processed once.Anti-Pattern 2: Creating a Queue Per Message Type in RabbitMQ
WRONG:
Queue: user-registered
Queue: user-login
Queue: user-logout
Queue: user-profile-updated
Queue: user-password-reset
Queue: user-email-changed
... 200 more queues ...
Each queue has its own consumers, monitoring, and management overhead.
This creates operational hell.
RIGHT:
Topic: user-events (with event type in the message body/header)
Filter by event type in the consumer
OR use a Topic Exchange with routing key patterns
Anti-Pattern 3: Fire-and-Forget for Critical Operations
// WRONG - fire and forget for critical business operations
kafkaTemplate.send("payment-commands", paymentCommand);
// No ACK confirmation, no error handling, no retry
// If Kafka is temporarily unavailable, payment command is lost
// User is charged but payment is never processed
// RIGHT - ensure delivery for critical operations
try {
RecordMetadata metadata = kafkaTemplate.send("payment-commands", paymentCommand)
.get(5, TimeUnit.SECONDS); // Wait for confirmation
log.info("Payment command published: partition={}, offset={}",
metadata.partition(), metadata.offset());
} catch (TimeoutException | ExecutionException e) {
// Log, alert, possibly save to outbox for retry
throw new PaymentPublishException("Failed to publish payment command", e);
}Anti-Pattern 4: Consuming From Many Topics in a Single Consumer
// WRONG - one consumer handling completely unrelated topics
@KafkaListener(topics = {"order-events", "user-events", "payment-events",
"inventory-events", "shipping-events"})
public void handleEverything(Object event) {
// This is a massive switch statement waiting to happen
// Poor separation of concerns
// Cannot scale topics independently
// One topic's backlog blocks others in same thread
}
// RIGHT - separate listeners per domain
@KafkaListener(topics = "order-events", groupId = "order-processor")
public void handleOrderEvent(OrderEvent event) { ... }
@KafkaListener(topics = "payment-events", groupId = "payment-processor")
public void handlePaymentEvent(PaymentEvent event) { ... }Best Practices for Producers
BP-P1: Always Use Explicit Message IDs
OrderEvent event = OrderEvent.builder()
.eventId(UUID.randomUUID().toString()) // Always set this
.orderId(order.getId())
.build();
// Without this, consumers cannot detect and discard duplicatesBP-P2: Include Schema Version and Event Type in Headers
// Consumers can filter and deserialize correctly without parsing the payload
kafkaTemplate.send(new ProducerRecord<>("orders",
null, // partition (null = auto-assign by key)
order.getId(), // key
event,
List.of(
new RecordHeader("eventType", "OrderPlaced".getBytes()),
new RecordHeader("schemaVersion", "2".getBytes()),
new RecordHeader("sourceService", "order-service".getBytes()),
new RecordHeader("correlationId", correlationId.getBytes())
)
));BP-P3: Use Correlation IDs for Distributed Tracing
// Pass correlation ID from the incoming request through all published events
@KafkaListener(topics = "raw-orders")
public void processAndPublish(RawOrder rawOrder) {
String correlationId = MDC.get("correlationId"); // From incoming request context
ProcessedOrder processed = enrich(rawOrder);
ProducerRecord<String, ProcessedOrder> record = new ProducerRecord<>(
"processed-orders", rawOrder.getOrderId(), processed);
record.headers().add("correlationId", correlationId.getBytes()); // Propagate
kafkaTemplate.send(record);
}BP-P4: Never Block the Caller on Publish for Non-Critical Events
// WRONG for non-critical events - blocks the HTTP request thread
kafkaTemplate.send("analytics", event).get(); // Synchronous - blocks caller
// RIGHT for non-critical events
kafkaTemplate.send("analytics", event)
.whenComplete((result, ex) -> {
if (ex != null) {
log.warn("Analytics event publish failed, not retrying: {}", ex.getMessage());
}
});
// Return HTTP response immediately, don't wait for KafkaBest Practices for Consumers
BP-C1: Always Design for Idempotency
Before writing any consumer, ask: "What happens if this is called twice with the same message?"
If the answer is "bad things happen," add deduplication logic first.
BP-C2: Commit Offsets Only After Successful Processing
// Always use manual acknowledgment in production
factory.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);BP-C3: Log the Message ID, Offset, and Partition at the Start and End
@KafkaListener(topics = "orders")
public void handleOrder(
OrderEvent event,
@Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
@Header(KafkaHeaders.OFFSET) long offset,
Acknowledgment ack) {
log.info("Processing order: eventId={}, orderId={}, partition={}, offset={}",
event.getEventId(), event.getOrderId(), partition, offset);
orderService.process(event);
ack.acknowledge();
log.info("Processed order: eventId={}, partition={}, offset={}",
event.getEventId(), partition, offset);
}BP-C4: Set max.poll.interval.ms Greater Than Your Worst-Case Processing Time
# If processing a message can take up to 2 minutes in the worst case:
max.poll.interval.ms=300000 # Set to 5 minutes (adds buffer)
max.poll.records=50 # Reduce records per poll so the batch is manageableBP-C5: Implement a Circuit Breaker for Downstream Dependencies
@Component
public class ResilientOrderConsumer {
private final CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("database");
@KafkaListener(topics = "orders")
public void handleOrder(OrderEvent event, Acknowledgment ack) {
try {
circuitBreaker.executeRunnable(() -> {
orderRepository.save(new Order(event));
});
ack.acknowledge();
} catch (CallNotPermittedException e) {
// Circuit is open - database unavailable
// Do not ACK - let message be redelivered when circuit closes
log.warn("Circuit open, not processing message: {}", event.getOrderId());
}
}
}Best Practices for Infrastructure and Operations
BP-I1: Always Configure a DLQ - No Exceptions
Every production queue MUST have a Dead Letter Queue. There is no acceptable reason not to.
// This should be the default template for every queue
@Bean
public Queue createProductionQueue(String queueName, String dlqExchange) {
return QueueBuilder.durable(queueName)
.withArgument("x-dead-letter-exchange", dlqExchange)
.withArgument("x-dead-letter-routing-key", queueName + ".dlq")
.withArgument("x-message-ttl", 86400000) // Messages expire after 24 hours
.withArgument("x-queue-type", "quorum") // Always use quorum in production
.build();
}BP-I2: Monitor DLQ Depth with Alerts
# Prometheus alert
- alert: DLQNonEmpty
expr: rabbitmq_queue_messages{queue=~".*dlq.*"} > 0
for: 5m
annotations:
summary: "Dead Letter Queue has messages - investigation required"BP-I3: Use Environment-Specific Topic/Queue Naming
Development: dev.order-events
Staging: staging.order-events
Production: prod.order-events
OR use separate Kafka clusters per environment.
NEVER share a production broker with development.
BP-I4: Implement Health Checks for the Message Queue Connection
@Component
public class KafkaHealthIndicator implements HealthIndicator {
private final AdminClient adminClient;
@Override
public Health health() {
try {
Set<String> topics = adminClient.listTopics()
.names()
.get(3, TimeUnit.SECONDS);
return Health.up()
.withDetail("topicCount", topics.size())
.build();
} catch (Exception e) {
return Health.down()
.withException(e)
.build();
}
}
}BP-I5: Never Use replication.factor=1 in Production (Kafka)
# Create topics with replication factor 3
kafka-topics.sh --create \
--topic order-events \
--replication-factor 3 \ # ALWAYS 3 in production
--partitions 12 \
--config min.insync.replicas=2 # Require 2 ISR for writeProduction Readiness Checklist
Use this before putting any messaging system component into production.
Producer Checklist
- Message IDs (UUIDs) included in every message
- Schema version included in headers
- Event type included in headers
- Correlation/trace IDs propagated
- Producer uses
acks=allfor critical data - Idempotent producer enabled for Kafka
- Error handling and alerting for publish failures
- Dead letter or fallback strategy for when broker is unavailable
- Load tested at 3x expected peak throughput
Consumer Checklist
- Manual acknowledgment configured (no auto-commit)
- Idempotency implemented (duplicate message handling)
- DLQ configured and monitored
- Maximum retry count configured
- Poison message handling (deserialization errors caught)
- Consumer lag monitored with alerts
- Processing errors logged with full context (partition, offset, message ID)
- Circuit breaker for downstream dependencies
- Load tested with realistic message volumes
- Tested with duplicate messages
- Tested with malformed/poison messages
Infrastructure Checklist
- Replication factor = 3 for all production topics/queues
- min.insync.replicas = 2 for Kafka
- Quorum queues used for RabbitMQ (not classic mirrored)
- Brokers spread across multiple availability zones
- Message retention configured appropriately
- Queue/topic capacity limits configured
- Monitoring and dashboards set up
- Alerts configured for: consumer lag, DLQ depth, broker health, disk usage
- Runbooks documented for common incidents
- Disaster recovery tested (broker failure, AZ failure)
Design Checklist
Ask these questions before choosing to use a message queue:
1. Should this communication be synchronous or asynchronous?
- Does the caller need the result to complete the current operation? -> Synchronous
- Is it a background side effect? -> Asynchronous (message queue)
2. Queue or Topic?
- Should only ONE consumer process each message? -> Queue
- Should ALL consumers get every message? -> Topic
3. Ordering Requirements
- Is strict ordering required? -> Design partition key strategy
- Is ordering required only within one entity (e.g., per order)? -> Partition by entity ID
4. Failure Scenarios
- What happens when the consumer is down? -> Queue buffers it
- What happens when the broker is down? -> Producer needs fallback
- What happens when a message cannot be processed? -> DLQ configured?
5. Delivery Guarantee Needed
- Can you afford to lose messages? -> At-most-once
- Can you afford duplicates? -> At-least-once (default) with idempotent consumers
- Must have exactly once? -> Kafka transactions + idempotent consumer (high cost)
6. Consumer Scale
- What is the maximum throughput needed? -> Determines partition count
- How long does processing take per message? -> Determines consumer count
7. Data Sensitivity
- Is this financial data? -> acks=all, idempotent producer, DLQ, Saga pattern
- Is this analytics data? -> at-most-once acceptable, optimize for throughput
Previous: Part 5 - Operations and Performance
Next: Part 7 - Interview Questions
Index: Message Queues Demystified - Index