Message Queues Demystified - Part 6: Pitfalls and Best Practices

"In theory, there is no difference between theory and practice. In practice, there is."
Every pitfall in this list represents real production incidents that have caused outages,
data loss, or duplicate billing in real systems.

The 15 Most Dangerous Pitfalls
Anti-Patterns That Look Good But Destroy Production Systems
Best Practices for Producers
Best Practices for Consumers
Best Practices for Infrastructure and Operations
Production Readiness Checklist
Design Checklist

The 15 Most Dangerous Pitfalls

Pitfall 1: Not Handling Duplicate Messages (Non-Idempotent Consumers)

The mistake:

// DANGEROUS - processing the same message twice will credit the loyalty points twice
@KafkaListener(topics = "order-events")
public void handleOrderPlaced(OrderEvent event) {
    loyaltyService.addPoints(event.getCustomerId(), event.getPoints());  // Not idempotent
    emailService.sendConfirmation(event.getEmail());                     // Sends 2 emails
    orderRepository.insert(new Order(event));                           // Duplicate row
}

Why it happens: At-least-once delivery (the default for all major brokers) guarantees every message is delivered at least once. Duplicates are normal, not exceptions.

The fix:

@KafkaListener(topics = "order-events")
@Transactional
public void handleOrderPlaced(OrderEvent event) {
    // Check if already processed - within the same transaction
    if (processedEventRepository.existsByEventId(event.getEventId())) {
        log.debug("Duplicate event ignored: {}", event.getEventId());
        return;
    }
 
    // Process
    loyaltyService.addPoints(event.getCustomerId(), event.getPoints());
    orderRepository.save(new Order(event));
 
    // Mark as processed
    processedEventRepository.save(new ProcessedEvent(event.getEventId()));
}

Real-world consequence: Users get charged twice, credited twice, or receive 200 welcome emails.

Pitfall 2: Silent Message Loss from Auto-Commit Before Processing

The mistake:

// Spring Kafka with auto-commit ON (the default in some configurations)
// Kafka offsets auto-commit every 5 seconds REGARDLESS of whether processing succeeded
 
@KafkaListener(topics = "payment-events")
public void handlePayment(PaymentEvent event) {
    // Suppose this throws an exception at 3 seconds
    // Offset has already been committed at the 5-second auto-commit interval
    // This message is LOST if the exception happens after commit but before processing
    paymentService.processPayment(event);  // Can throw
}

The fix:

@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> factory() {
    factory.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);
    // NEVER use auto-commit for critical data
    // Always use manual acknowledgment
    return factory;
}
 
@KafkaListener(topics = "payment-events")
public void handlePayment(PaymentEvent event, Acknowledgment ack) {
    paymentService.processPayment(event);  // If this throws, no ACK = redelivery
    ack.acknowledge();  // Only ACK after successful processing
}

Real-world consequence: Financial transactions silently lost. Orders processed but never tracked.

Pitfall 3: Poison Message Causing Infinite Retry Loop

The mistake:

// A message with bad data will retry forever
@KafkaListener(topics = "orders")
public void handleOrder(String messageBody) {
    // This message has malformed JSON: {"orderId": null, "amount": "not-a-number"}
    OrderEvent event = objectMapper.readValue(messageBody, OrderEvent.class);  // Throws
    processOrder(event);
    // Throws every time -> retried forever -> blocks queue processing
}

What happens without DLQ:

Retry 1: FAIL
Retry 2: FAIL (1 second later)
Retry 3: FAIL (2 seconds later)
... forever ...
Retry 1000: FAIL (still the same bad message)
The consumer is stuck. Other messages behind this one are not processed.

The fix:

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> template) {
    // Retry with exponential backoff
    ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1000L);  // 1 second
    backOff.setMultiplier(2.0);         // Double each retry
    backOff.setMaxInterval(30000L);     // Max 30 seconds
 
    // After max retries, send to DLQ
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
 
    return new DefaultErrorHandler(recoverer, backOff);
}

Real-world consequence: One bad message brings down an entire processing pipeline. Queue depth grows to infinity.

Pitfall 4: Not Monitoring Consumer Lag

The mistake:

Deploy the system
Set up basic CPU/memory alerts
Forget about consumer lag entirely

What happens:

Day 1: Consumer lag = 0
Day 5: Consumer lag = 50,000 (nobody noticed)
Day 7: Consumer lag = 500,000 (DBA notices slow queries)
Day 8: Consumer lag = 2,000,000 (SLA breach)
Day 9: Consumer lag = 5,000,000 (on-call engineer gets paged at 3am)

It took 9 days of degraded service before anyone noticed.

The fix: Alert on consumer lag exceeding thresholds (see Part 5).

Real-world consequence: SLA breaches, customer complaints, 3am incidents that could have been caught in seconds.

Pitfall 5: The Dual-Write Problem (DB + Message Queue)

The mistake:

@Transactional
public void placeOrder(OrderRequest request) {
    orderRepository.save(order);      // Step 1: DB write
    kafkaTemplate.send("orders", event);  // Step 2: Kafka write
 
    // App crashes between Step 1 and Step 2:
    // --> Order saved in DB, event never published
    // --> Downstream services (Email, Warehouse) never know the order exists
    // --> Silent inconsistency - very hard to detect
}

The fix: Use the Outbox Pattern (detailed in Part 4).

Real-world consequence: Orders that are in the database but were never fulfilled. Customers who ordered and paid but never received their items.

Pitfall 6: Expecting Ordering Without Partition Keys (Kafka)

The mistake:

// Publishing events WITHOUT a partition key
kafkaTemplate.send("order-events", null, event);
// Kafka uses round-robin distribution: events go to different partitions
// Events for the SAME order can be in different partitions
// Different partitions = different consumer instances
// Events for the SAME order can be processed OUT OF ORDER

What happens:

OrderPlaced   --> Partition 0 --> Consumer A
PaymentReceived --> Partition 1 --> Consumer B (processed before Consumer A)
OrderShipped  --> Partition 2 --> Consumer C

Consumer B processes PaymentReceived before Consumer A processes OrderPlaced
Result: payment recorded but order does not exist yet = inconsistency

The fix:

// Use a partition key that groups related messages together
kafkaTemplate.send("order-events", order.getId().toString(), event);
// All events for the SAME order go to the SAME partition
// SAME partition = SAME consumer = GUARANTEED ordering

Real-world consequence: State corruption where payments are applied before orders exist, or refunds processed before the original charge.

Pitfall 7: Not Setting Message TTL or Retention Limits

The mistake:

No TTL on queue messages
No retention limit on Kafka topics
No max-length on RabbitMQ queues

What happens:

Month 1: Queue depth = 0, everything fine
Month 2: Consumer has a bug, goes offline for 3 days
Month 3: Queue depth = 50 million messages, 200GB
Month 3: Consumer comes back online
Month 3: Consumer processes 50 million messages, some from 3 months ago
Month 3: Processing 3-month-old "send welcome email" events for users
         who already received 10 follow-up emails

Also: Broker disk runs out, system goes down

The fix:

// RabbitMQ: Set message TTL and queue max-length
QueueBuilder.durable("onboarding-emails")
    .withArgument("x-message-ttl", 86400000)  // 1 day TTL
    .withArgument("x-max-length", 100000)       // Max 100K messages
    .withArgument("x-overflow", "reject-publish")
    .build();
 
// Kafka: Set retention per topic
kafka-configs.sh --alter --entity-type topics --entity-name onboarding-events \
  --add-config retention.ms=86400000  // 1 day retention

Real-world consequence: Stale messages processed months later causing incorrect system state. Disk full causing broker outage.

Pitfall 8: Using a Single Partition for "Ordering" When You Actually Need Scale

The mistake:
Teams hear "Kafka guarantees ordering within a partition" and create topics with 1 partition to guarantee ordering.

What happens:

Topic: payment-events (1 partition)
Throughput needed: 10,000 payments/second
Throughput of 1 partition: ~2,000 messages/second

System fails under load. The "ordering" comes at the cost of system availability.

The fix:

Use partition keys to group messages that need relative ordering
Use multiple partitions (one per logical entity type)
Different orders can be in different partitions; same order always in the same partition

// 12 partitions, but all events for order-123 ALWAYS in the same partition
kafkaTemplate.send("payment-events", orderId, paymentEvent);
// hash(orderId) % 12 determines partition
// All events for the same orderId = same partition = ordered

Real-world consequence: Payment system throughput bottleneck causing cascading failures.

Pitfall 9: Connecting Too Many Consumers to RabbitMQ

The mistake:

// Each microservice instance opens a NEW connection
// 50 instances x 10 services = 500 connections
// Each connection uses file descriptors and memory
// RabbitMQ has a default limit of 65535 connections
// But managing thousands of idle connections causes CPU overhead

The fix:

// Use a connection pool and share connections
// Spring AMQP handles this, but configure properly:
 
@Bean
public CachingConnectionFactory connectionFactory() {
    CachingConnectionFactory factory = new CachingConnectionFactory();
    factory.setChannelCacheSize(25);    // 25 channels per connection
    factory.setCacheMode(CachingMode.CONNECTION); // Pool connections
    factory.setConnectionCacheSize(2);  // Reuse 2 connections max per service
    return factory;
}

Real-world consequence: RabbitMQ CPU saturated handling thousands of idle connections. Memory exhausted.

Pitfall 10: Not Testing Failure Scenarios

The mistake:
Testing only the happy path:

Does a message get published?
Does the consumer receive it?

Never testing:

What happens when the consumer crashes mid-processing?
What happens when the database is unavailable?
What happens when Kafka is down for 30 seconds?
What happens when a poison message appears?

The fix:

@Test
void shouldNotLoseMessageWhenConsumerCrashesBeforeAck() {
    // Arrange
    String orderId = "test-order-123";
    OrderEvent event = new OrderEvent(orderId, ...);
    kafkaTemplate.send("orders", orderId, event);
 
    // Simulate consumer crash after receiving but before ACKing
    consumerSimulator.receiveAndCrashBeforeAck();
 
    // Act - restart consumer
    consumerSimulator.restart();
 
    // Assert - message should be reprocessed
    await().atMost(30, SECONDS)
        .untilAsserted(() -> {
            verify(orderService, times(1)).processOrder(argThat(
                e -> e.getOrderId().equals(orderId)));
        });
}
 
@Test
void shouldHandlePoisonMessageWithoutBlockingOthers() {
    // Publish a poison message
    kafkaTemplate.send("orders", "poison", "INVALID_JSON{{{");
 
    // Publish valid messages
    kafkaTemplate.send("orders", "valid-1", new OrderEvent("order-1", ...));
    kafkaTemplate.send("orders", "valid-2", new OrderEvent("order-2", ...));
 
    // Assert: poison message went to DLQ, valid messages were processed
    await().atMost(30, SECONDS).untilAsserted(() -> {
        assertThat(dlqMessages.size()).isEqualTo(1);
        assertThat(processedOrders).contains("order-1", "order-2");
    });
}

Real-world consequence: Data loss discovered in production. Incidents that could have been caught in development.

Pitfall 11: Schema Changes That Break Consumers

The mistake:

// Version 1 - working fine
public class OrderEvent {
    private String orderId;
    private String status;
}
 
// Version 2 - developer renames field
public class OrderEvent {
    private String orderId;
    private String orderStatus;  // RENAMED from "status" - breaking change
}

If the producer deploys Version 2 BEFORE all consumers are updated to Version 2:

Consumer still expects status field
Consumer receives orderStatus field
Consumer gets null for status
Downstream logic breaks or corrupts data

The fix:

Never rename fields - add new fields with new names
Use Schema Registry with BACKWARD compatibility rules
Deploy consumers BEFORE producers when making changes
Use schema versioning in message headers

// Safe evolution - add, never rename
public class OrderEvent {
    private String orderId;
    private String status;        // Keep the old field
    private String orderStatus;   // Add the new field too, mark old as deprecated
}

Real-world consequence: Null pointer exceptions, data corruption, silent failures across services.

Pitfall 12: Tightly Coupling to the Message Queue Implementation

The mistake:

// Scattered across 50 service classes:
rabbitTemplate.convertAndSend("order-exchange", "order.placed", body);
// Now every place in the code knows about exchange names, routing keys
// When you migrate to Kafka, you need to change 50 files

The fix:

// Abstract the messaging behind an interface
public interface EventPublisher {
    void publish(DomainEvent event);
}
 
@Service
public class KafkaEventPublisher implements EventPublisher {
    @Override
    public void publish(DomainEvent event) {
        kafkaTemplate.send(topicResolver.resolve(event.getType()),
            event.getAggregateId(), event);
    }
}
 
// All services just call publisher.publish(event)
// Switching from Kafka to RabbitMQ = change one class, not 50

Real-world consequence: Massive refactoring effort when migrating brokers. Tight coupling that makes testing difficult.

Pitfall 13: Not Using Connection Retry and Reconnect Logic

The mistake:

// Producer starts before Kafka is ready
// Or Kafka restarts for maintenance
// Without retry: application crashes with "Unable to connect to Kafka"
// With no reconnect: application stays down until manually restarted

The fix:

# Kafka client auto-reconnect (built-in)
reconnect.backoff.ms=1000
reconnect.backoff.max.ms=10000
retry.backoff.ms=100
 
# Spring Boot application startup - wait for Kafka
@Component
@Order(Ordered.LOWEST_PRECEDENCE)
public class KafkaHealthWaiter implements ApplicationListener<ApplicationReadyEvent> {
 
    private final KafkaAdmin kafkaAdmin;
 
    @Override
    public void onApplicationEvent(ApplicationReadyEvent event) {
        RetryTemplate.builder()
            .maxAttempts(10)
            .exponentialBackoff(1000, 2, 30000)
            .build()
            .execute(context -> kafkaAdmin.describeTopics("order-events"));
    }
}

Real-world consequence: Application fails to start during normal Kafka maintenance window. On-call engineer paged at 2am.

Pitfall 14: Not Considering Message Size Limits

The mistake:

// Embedding large objects directly in messages
kafkaTemplate.send("image-events", userId, new ImageEvent(
    imageData,           // 5MB JPEG image embedded in the message
    metadata
));
// Default Kafka max message size: 1MB
// This throws: RecordTooLargeException

The fix: Use the Claim Check Pattern (see Part 2):

// Store large payload in S3, pass reference in message
s3Client.putObject("images-bucket", imageId + ".jpg", imageData);
kafkaTemplate.send("image-events", userId, new ImageEvent(
    "s3://images-bucket/" + imageId + ".jpg",  // Reference only
    metadata
));

Real-world consequence: RecordTooLargeException in production. Images silently dropped.

Pitfall 15: The Thundering Herd on Consumer Group Restart

The mistake:

Consumer group processes 10,000 messages/second normally.
Deploy a new version (rolling restart).
All consumer instances restart within 60 seconds.
During restart: messages accumulate (60 seconds x 10,000/sec = 600,000 messages)
All consumers come online simultaneously.
All consumers start processing full throttle at once.
Downstream database receives 100x normal write load.
Database struggles, queries timeout, cascading failure.

The fix:

// Rate limit consumer processing on startup
@Component
public class ThrottledConsumer {
 
    @Value("${consumer.startup.ramp-up-seconds:30}")
    private int rampUpSeconds;
 
    private RateLimiter rateLimiter;
 
    @PostConstruct
    public void init() {
        // Start at 10% of normal rate, increase linearly over ramp-up period
        rateLimiter = RateLimiter.create(normalRate * 0.1);
        scheduleRampUp();
    }
 
    @KafkaListener(topics = "order-events")
    public void handleOrder(OrderEvent event, Acknowledgment ack) {
        rateLimiter.acquire();
        orderService.process(event);
        ack.acknowledge();
    }
}

Anti-Patterns

Anti-Pattern 1: Using a Message Queue as a Database

// WRONG: Using the queue as permanent storage
// Reading the same messages repeatedly without ACKing them
// "Browsing" the queue to check current state
 
// This is not what queues are for.
// Use a database for persistent state.
// Use a queue for transient messages to be processed once.

Anti-Pattern 2: Creating a Queue Per Message Type in RabbitMQ

WRONG:
  Queue: user-registered
  Queue: user-login
  Queue: user-logout
  Queue: user-profile-updated
  Queue: user-password-reset
  Queue: user-email-changed
  ... 200 more queues ...

Each queue has its own consumers, monitoring, and management overhead.
This creates operational hell.

RIGHT:
  Topic: user-events (with event type in the message body/header)
  Filter by event type in the consumer
  OR use a Topic Exchange with routing key patterns

Anti-Pattern 3: Fire-and-Forget for Critical Operations

// WRONG - fire and forget for critical business operations
kafkaTemplate.send("payment-commands", paymentCommand);
// No ACK confirmation, no error handling, no retry
 
// If Kafka is temporarily unavailable, payment command is lost
// User is charged but payment is never processed
 
// RIGHT - ensure delivery for critical operations
try {
    RecordMetadata metadata = kafkaTemplate.send("payment-commands", paymentCommand)
        .get(5, TimeUnit.SECONDS);  // Wait for confirmation
    log.info("Payment command published: partition={}, offset={}",
        metadata.partition(), metadata.offset());
} catch (TimeoutException | ExecutionException e) {
    // Log, alert, possibly save to outbox for retry
    throw new PaymentPublishException("Failed to publish payment command", e);
}

Anti-Pattern 4: Consuming From Many Topics in a Single Consumer

// WRONG - one consumer handling completely unrelated topics
@KafkaListener(topics = {"order-events", "user-events", "payment-events",
                          "inventory-events", "shipping-events"})
public void handleEverything(Object event) {
    // This is a massive switch statement waiting to happen
    // Poor separation of concerns
    // Cannot scale topics independently
    // One topic's backlog blocks others in same thread
}
 
// RIGHT - separate listeners per domain
@KafkaListener(topics = "order-events", groupId = "order-processor")
public void handleOrderEvent(OrderEvent event) { ... }
 
@KafkaListener(topics = "payment-events", groupId = "payment-processor")
public void handlePaymentEvent(PaymentEvent event) { ... }

Best Practices for Producers

BP-P1: Always Use Explicit Message IDs

OrderEvent event = OrderEvent.builder()
    .eventId(UUID.randomUUID().toString())  // Always set this
    .orderId(order.getId())
    .build();
// Without this, consumers cannot detect and discard duplicates

BP-P2: Include Schema Version and Event Type in Headers

// Consumers can filter and deserialize correctly without parsing the payload
kafkaTemplate.send(new ProducerRecord<>("orders",
    null,  // partition (null = auto-assign by key)
    order.getId(),  // key
    event,
    List.of(
        new RecordHeader("eventType", "OrderPlaced".getBytes()),
        new RecordHeader("schemaVersion", "2".getBytes()),
        new RecordHeader("sourceService", "order-service".getBytes()),
        new RecordHeader("correlationId", correlationId.getBytes())
    )
));

BP-P3: Use Correlation IDs for Distributed Tracing

// Pass correlation ID from the incoming request through all published events
@KafkaListener(topics = "raw-orders")
public void processAndPublish(RawOrder rawOrder) {
    String correlationId = MDC.get("correlationId");  // From incoming request context
 
    ProcessedOrder processed = enrich(rawOrder);
 
    ProducerRecord<String, ProcessedOrder> record = new ProducerRecord<>(
        "processed-orders", rawOrder.getOrderId(), processed);
    record.headers().add("correlationId", correlationId.getBytes()); // Propagate
 
    kafkaTemplate.send(record);
}

BP-P4: Never Block the Caller on Publish for Non-Critical Events

// WRONG for non-critical events - blocks the HTTP request thread
kafkaTemplate.send("analytics", event).get();  // Synchronous - blocks caller
 
// RIGHT for non-critical events
kafkaTemplate.send("analytics", event)
    .whenComplete((result, ex) -> {
        if (ex != null) {
            log.warn("Analytics event publish failed, not retrying: {}", ex.getMessage());
        }
    });
// Return HTTP response immediately, don't wait for Kafka

Best Practices for Consumers

BP-C1: Always Design for Idempotency

Before writing any consumer, ask: "What happens if this is called twice with the same message?"

If the answer is "bad things happen," add deduplication logic first.

BP-C2: Commit Offsets Only After Successful Processing

// Always use manual acknowledgment in production
factory.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);

BP-C3: Log the Message ID, Offset, and Partition at the Start and End

@KafkaListener(topics = "orders")
public void handleOrder(
        OrderEvent event,
        @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
        @Header(KafkaHeaders.OFFSET) long offset,
        Acknowledgment ack) {
 
    log.info("Processing order: eventId={}, orderId={}, partition={}, offset={}",
        event.getEventId(), event.getOrderId(), partition, offset);
 
    orderService.process(event);
    ack.acknowledge();
 
    log.info("Processed order: eventId={}, partition={}, offset={}",
        event.getEventId(), partition, offset);
}

BP-C4: Set max.poll.interval.ms Greater Than Your Worst-Case Processing Time

# If processing a message can take up to 2 minutes in the worst case:
max.poll.interval.ms=300000  # Set to 5 minutes (adds buffer)
max.poll.records=50          # Reduce records per poll so the batch is manageable

BP-C5: Implement a Circuit Breaker for Downstream Dependencies

@Component
public class ResilientOrderConsumer {
 
    private final CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("database");
 
    @KafkaListener(topics = "orders")
    public void handleOrder(OrderEvent event, Acknowledgment ack) {
        try {
            circuitBreaker.executeRunnable(() -> {
                orderRepository.save(new Order(event));
            });
            ack.acknowledge();
 
        } catch (CallNotPermittedException e) {
            // Circuit is open - database unavailable
            // Do not ACK - let message be redelivered when circuit closes
            log.warn("Circuit open, not processing message: {}", event.getOrderId());
        }
    }
}

Best Practices for Infrastructure and Operations

BP-I1: Always Configure a DLQ - No Exceptions

Every production queue MUST have a Dead Letter Queue. There is no acceptable reason not to.

// This should be the default template for every queue
@Bean
public Queue createProductionQueue(String queueName, String dlqExchange) {
    return QueueBuilder.durable(queueName)
        .withArgument("x-dead-letter-exchange", dlqExchange)
        .withArgument("x-dead-letter-routing-key", queueName + ".dlq")
        .withArgument("x-message-ttl", 86400000)  // Messages expire after 24 hours
        .withArgument("x-queue-type", "quorum")   // Always use quorum in production
        .build();
}

BP-I2: Monitor DLQ Depth with Alerts

# Prometheus alert
- alert: DLQNonEmpty
  expr: rabbitmq_queue_messages{queue=~".*dlq.*"} > 0
  for: 5m
  annotations:
    summary: "Dead Letter Queue has messages - investigation required"

BP-I3: Use Environment-Specific Topic/Queue Naming

Development:   dev.order-events
Staging:       staging.order-events
Production:    prod.order-events

OR use separate Kafka clusters per environment.
NEVER share a production broker with development.

BP-I4: Implement Health Checks for the Message Queue Connection

@Component
public class KafkaHealthIndicator implements HealthIndicator {
 
    private final AdminClient adminClient;
 
    @Override
    public Health health() {
        try {
            Set<String> topics = adminClient.listTopics()
                .names()
                .get(3, TimeUnit.SECONDS);
            return Health.up()
                .withDetail("topicCount", topics.size())
                .build();
        } catch (Exception e) {
            return Health.down()
                .withException(e)
                .build();
        }
    }
}

BP-I5: Never Use replication.factor=1 in Production (Kafka)

# Create topics with replication factor 3
kafka-topics.sh --create \
  --topic order-events \
  --replication-factor 3 \       # ALWAYS 3 in production
  --partitions 12 \
  --config min.insync.replicas=2  # Require 2 ISR for write

Production Readiness Checklist

Use this before putting any messaging system component into production.

Producer Checklist

Message IDs (UUIDs) included in every message
Schema version included in headers
Event type included in headers
Correlation/trace IDs propagated
Producer uses acks=all for critical data
Idempotent producer enabled for Kafka
Error handling and alerting for publish failures
Dead letter or fallback strategy for when broker is unavailable
Load tested at 3x expected peak throughput

Consumer Checklist

Infrastructure Checklist

Design Checklist

Ask these questions before choosing to use a message queue:

1. Should this communication be synchronous or asynchronous?

Does the caller need the result to complete the current operation? -> Synchronous
Is it a background side effect? -> Asynchronous (message queue)

2. Queue or Topic?

Should only ONE consumer process each message? -> Queue
Should ALL consumers get every message? -> Topic

3. Ordering Requirements

Is strict ordering required? -> Design partition key strategy
Is ordering required only within one entity (e.g., per order)? -> Partition by entity ID

4. Failure Scenarios

What happens when the consumer is down? -> Queue buffers it
What happens when the broker is down? -> Producer needs fallback
What happens when a message cannot be processed? -> DLQ configured?

5. Delivery Guarantee Needed

Can you afford to lose messages? -> At-most-once
Can you afford duplicates? -> At-least-once (default) with idempotent consumers
Must have exactly once? -> Kafka transactions + idempotent consumer (high cost)

6. Consumer Scale

What is the maximum throughput needed? -> Determines partition count
How long does processing take per message? -> Determines consumer count

7. Data Sensitivity

Is this financial data? -> acks=all, idempotent producer, DLQ, Saga pattern
Is this analytics data? -> at-most-once acceptable, optimize for throughput

Previous: Part 5 - Operations and Performance
Next: Part 7 - Interview Questions
Index: Message Queues Demystified - Index

Series: Message Queues Demystified

Message Queues Demystified - Part 6: Pitfalls and Best Practices

Table of Contents

The 15 Most Dangerous Pitfalls

Pitfall 1: Not Handling Duplicate Messages (Non-Idempotent Consumers)

Pitfall 2: Silent Message Loss from Auto-Commit Before Processing

Pitfall 3: Poison Message Causing Infinite Retry Loop

Pitfall 4: Not Monitoring Consumer Lag

Pitfall 5: The Dual-Write Problem (DB + Message Queue)

Pitfall 6: Expecting Ordering Without Partition Keys (Kafka)

Pitfall 7: Not Setting Message TTL or Retention Limits

Pitfall 8: Using a Single Partition for "Ordering" When You Actually Need Scale

Pitfall 9: Connecting Too Many Consumers to RabbitMQ

Pitfall 10: Not Testing Failure Scenarios

Pitfall 11: Schema Changes That Break Consumers

Pitfall 12: Tightly Coupling to the Message Queue Implementation

Pitfall 13: Not Using Connection Retry and Reconnect Logic

Pitfall 14: Not Considering Message Size Limits

Pitfall 15: The Thundering Herd on Consumer Group Restart

Anti-Patterns

Anti-Pattern 1: Using a Message Queue as a Database

Anti-Pattern 2: Creating a Queue Per Message Type in RabbitMQ

Anti-Pattern 3: Fire-and-Forget for Critical Operations

Anti-Pattern 4: Consuming From Many Topics in a Single Consumer

Best Practices for Producers

BP-P1: Always Use Explicit Message IDs

BP-P2: Include Schema Version and Event Type in Headers

BP-P3: Use Correlation IDs for Distributed Tracing

BP-P4: Never Block the Caller on Publish for Non-Critical Events

Best Practices for Consumers

BP-C1: Always Design for Idempotency

BP-C2: Commit Offsets Only After Successful Processing

BP-C3: Log the Message ID, Offset, and Partition at the Start and End

BP-C4: Set max.poll.interval.ms Greater Than Your Worst-Case Processing Time

BP-C5: Implement a Circuit Breaker for Downstream Dependencies

Best Practices for Infrastructure and Operations

BP-I1: Always Configure a DLQ - No Exceptions

BP-I2: Monitor DLQ Depth with Alerts

BP-I3: Use Environment-Specific Topic/Queue Naming

BP-I4: Implement Health Checks for the Message Queue Connection

BP-I5: Never Use replication.factor=1 in Production (Kafka)

Production Readiness Checklist

Producer Checklist

Consumer Checklist

Infrastructure Checklist

Design Checklist