Message Queues Demystified - Part 2: Messaging Patterns and Architecture

Knowing that a message queue exists is table stakes. Knowing which pattern to apply, and why,
separates engineers who use message queues from engineers who master them.

Work Queue (Competing Consumers)
Publish-Subscribe (Fan-out)
Request-Reply (Async RPC)
Fan-in (Aggregation)
Dead Letter Queue (DLQ)
Priority Queue
Claim Check Pattern
Message Routing and Filtering
Event-Driven Architecture
Choreography vs Orchestration
Scatter-Gather Pattern
Transactional Outbox Pattern (Preview)

1. Work Queue (Competing Consumers)

What It Is

The Work Queue is the most fundamental messaging pattern. A single queue holds tasks (messages). Multiple consumer instances compete to grab tasks from the queue. Each task is processed by exactly one consumer.

The Problem It Solves

You have a large volume of work items that can be processed independently. Processing each item takes non-trivial time. You need to process them as fast as possible.

Without this pattern, you have two bad options:

Option A: Process items sequentially in a loop - slow, single point of failure
Option B: Spawn a thread per item - unbounded thread explosion, memory exhaustion

How It Works

Producer (submits jobs)
        |
        v
[Queue: video-transcoding-jobs]
        |
  +-----+-----+-----+
  |     |     |     |
  v     v     v     v
Worker Worker Worker Worker
  1     2     3     4
(each grabs and processes one job at a time)
(when done, immediately grabs the next available job)

Step-by-step flow:

Producer publishes message: {jobId: "video-001", videoUrl: "s3://...", targetFormat: "mp4"}
Worker 1 picks up the message (other workers cannot see this message while Worker 1 has it)
Worker 1 transcodes the video (takes 3 minutes)
Worker 1 sends ACK - message is deleted from queue
Worker 1 immediately picks up the next message

If Worker 1 crashes during step 3:

The message's visibility timeout expires
The queue makes the message visible again
Worker 2 (or any other worker) picks it up and retries

Key Properties

Load distribution: Work is automatically spread across all available workers
Fault tolerance: Worker failures are handled transparently via re-queuing
Elastic scaling: Add workers to increase throughput; remove workers to save cost
Self-pacing: Slow workers are not overloaded (they pull only when ready)

Real-World Use Cases

Video transcoding (Netflix, YouTube uploads)
Image resizing and thumbnail generation
Email and SMS sending at scale
Background report generation
Machine learning inference jobs
Batch data processing

Implementation Example (Spring Boot + RabbitMQ)

// Producer - submits jobs to the queue
@Service
public class VideoProcessingJobSubmitter {
 
    private final RabbitTemplate rabbitTemplate;
 
    public void submitTranscodingJob(String videoId, String s3Url) {
        TranscodeJobMessage job = TranscodeJobMessage.builder()
            .jobId(UUID.randomUUID().toString())
            .videoId(videoId)
            .sourceUrl(s3Url)
            .targetFormats(List.of("mp4", "webm", "hls"))
            .priority("NORMAL")
            .submittedAt(Instant.now())
            .build();
 
        rabbitTemplate.convertAndSend("video-processing", "transcode.job", job);
        log.info("Submitted transcoding job: {}", job.getJobId());
    }
}
 
// Consumer - multiple instances compete for jobs
@Component
@Slf4j
public class VideoTranscodingWorker {
 
    @RabbitListener(queues = "video-transcoding-jobs")
    public void processTranscodeJob(TranscodeJobMessage job, Channel channel, Message message)
            throws IOException {
 
        log.info("Worker picked up job: {}", job.getJobId());
 
        try {
            // This takes several minutes - other workers handle other jobs in parallel
            ffmpegService.transcode(job.getSourceUrl(), job.getTargetFormats());
            videoRepository.markTranscoded(job.getVideoId());
 
            // Manual ACK after successful processing
            channel.basicAck(message.getMessageProperties().getDeliveryTag(), false);
            log.info("Job completed: {}", job.getJobId());
 
        } catch (Exception e) {
            log.error("Job failed: {}", job.getJobId(), e);
            // NACK with requeue=false (will go to DLQ after max retries)
            channel.basicNack(message.getMessageProperties().getDeliveryTag(), false, false);
        }
    }
}

Scaling Considerations

Queue depth: 10,000 jobs
Each job takes: 5 minutes
Target completion: within 2 hours

Workers needed: 10,000 jobs x 5 min / 120 min = ~417 workers

With auto-scaling: Start with 10 workers, scale up to 500 based on queue depth.
After jobs drain: Scale back down to 10 workers.

What It Is

One producer publishes a message (event) to a topic. Every subscriber independently receives and processes their own copy of the message.

The Problem It Solves

An event has occurred in your system. Multiple independent services need to react to this event. You do not want the event source to know about or depend on every service that cares.

How It Works

Event: UserRegistered (customerId: 12345, email: user@example.com, plan: PRO)

Producer (User Service)
        |
        v
[Topic: user-registered]
        |
   +----+----+----+
   |    |    |    |
   v    v    v    v
Email  CRM  Risk  Analytics
Svc    Svc  Svc    Svc
(send  (add (fraud (track
 welcome) to CRM) check) signup)

Each consumer group independently:

Maintains its own "cursor" (offset in Kafka, subscription in SNS)
Processes messages at its own pace
Fails and retries without affecting other consumers
Can be deployed, scaled, and updated independently

Event vs Command - An Important Distinction

Aspect	Event	Command
Meaning	"This happened"	"Please do this"
Example	`OrderPlaced`	`SendEmail`
Direction	Broadcast (1 to many)	Targeted (1 to 1)
Coupling	Producer unaware of consumers	Producer knows who executes
Suitable for	Pub/Sub on a topic	Point-to-Point on a queue

Golden rule: Publish events to topics. Send commands to queues.

Real-World Use Cases

Domain events in microservices (UserRegistered, OrderPlaced, PaymentReceived)
Cache invalidation broadcast to all service instances
Configuration change propagation
Real-time notification systems
Audit and compliance event logging
Data synchronization between services

Implementation Example (Spring Boot + Kafka)

// Producer - publishes the event once
@Service
@RequiredArgsConstructor
public class UserRegistrationService {
 
    private final UserRepository userRepository;
    private final KafkaTemplate<String, UserRegisteredEvent> kafkaTemplate;
 
    @Transactional
    public User registerUser(RegisterUserRequest request) {
        User user = User.create(request);
        userRepository.save(user);
 
        UserRegisteredEvent event = UserRegisteredEvent.builder()
            .eventId(UUID.randomUUID().toString())
            .userId(user.getId())
            .email(user.getEmail())
            .firstName(user.getFirstName())
            .plan(user.getSubscriptionPlan())
            .registeredAt(Instant.now())
            .build();
 
        // Publish to topic - ALL consumers will receive this
        kafkaTemplate.send("user-events", user.getId().toString(), event);
 
        return user;
    }
}
 
// Consumer 1 - Email Service (completely independent)
@Component
@KafkaListener(topics = "user-events", groupId = "email-service-group")
public class WelcomeEmailConsumer {
    public void onUserRegistered(UserRegisteredEvent event) {
        if ("UserRegistered".equals(event.getEventType())) {
            emailService.sendWelcomeEmail(event.getEmail(), event.getFirstName());
        }
    }
}
 
// Consumer 2 - CRM Service (completely independent)
@Component
@KafkaListener(topics = "user-events", groupId = "crm-service-group")
public class CrmSyncConsumer {
    public void onUserRegistered(UserRegisteredEvent event) {
        if ("UserRegistered".equals(event.getEventType())) {
            crmClient.createContact(event.getUserId(), event.getEmail(), event.getPlan());
        }
    }
}

The Power of Adding New Consumers

Month 1: Launch with Email Service consumer.

Month 3: Product wants to add onboarding tooltips.
         Add Onboarding Service consumer.
         Zero changes to User Registration Service.

Month 6: Security team wants fraud screening.
         Add Risk Assessment Service consumer.
         Zero changes to anything existing.

Month 12: Compliance requires audit logging.
          Add Compliance Logger consumer.
          Zero changes to anything existing.

This is the compounding value of Pub/Sub. Each new consumer adds capability without creating technical debt in the producer.

3. Request-Reply (Async RPC)

What It Is

An asynchronous implementation of a request-response interaction. The requester sends a message to a queue, and the worker processes it and sends the result to a reply queue. The requester waits on the reply queue.

The Problem It Solves

You need a response from a long-running operation, but you do not want the calling thread to block for minutes while waiting. You also want the benefits of decoupling (independent scaling, fault tolerance) even for operations that require a result.

How It Works

Requester                   Worker
    |                          |
    | 1. Send request to queue |
    |-----> [Request Queue] -->|
    |       (correlationId:    |
    |        "req-abc123",     |
    |        replyTo:          |
    |        "reply-queue-xyz") |
    |                          |
    |                          | 2. Process (takes time)
    |                          |
    |     3. Send result to replyTo queue
    |<---- [Reply Queue] <-----|
    |      (correlationId:     |
    |       "req-abc123",      |
    |       result: {...})     |
    |                          |
    | 4. Match by correlationId|
    | 5. Return result to caller

The Correlation ID is critical. Without it, the requester cannot match which reply corresponds to which request, especially when many requests are in-flight simultaneously.

Implementation Example

// Requester - sends request and waits for reply
@Service
public class DocumentAnalysisClient {
 
    private final RabbitTemplate rabbitTemplate;
 
    // Timeout after 30 seconds if no response
    @Value("${analysis.timeout.seconds:30}")
    private int timeoutSeconds;
 
    public AnalysisResult analyzeDocument(byte[] documentBytes, String documentType) {
        AnalysisRequest request = AnalysisRequest.builder()
            .requestId(UUID.randomUUID().toString())
            .documentBytes(documentBytes)
            .documentType(documentType)
            .requestedAt(Instant.now())
            .build();
 
        // sendAndReceive is the RabbitMQ helper for request-reply
        // It automatically creates a temporary reply queue and sets correlationId
        AnalysisResult result = (AnalysisResult) rabbitTemplate.convertSendAndReceive(
            "document-analysis",
            "analyze.document",
            request,
            message -> {
                // Set correlation ID for matching
                message.getMessageProperties().setCorrelationId(request.getRequestId());
                return message;
            }
        );
 
        if (result == null) {
            throw new AnalysisTimeoutException("Document analysis timed out after "
                + timeoutSeconds + " seconds");
        }
 
        return result;
    }
}
 
// Worker - processes requests and sends replies
@Component
public class DocumentAnalysisWorker {
 
    @RabbitListener(queues = "document-analysis-queue")
    public AnalysisResult handleAnalysisRequest(AnalysisRequest request) {
        // Just return the result - Spring AMQP handles sending it to the reply queue
        log.info("Analyzing document: requestId={}", request.getRequestId());
 
        // This could take 10-20 seconds for a large document
        return documentAnalyzer.analyze(request.getDocumentBytes(), request.getDocumentType());
    }
}

When to Use Request-Reply

Long-running operations that produce a result the caller needs
When you want async processing but still need the outcome
RPC-style communication between microservices with decoupled scaling

When NOT to Use Request-Reply

If the operation is fast (under 100ms), use synchronous REST
If the caller can truly fire-and-forget, use standard Pub/Sub

4. Fan-in (Aggregation)

What It Is

Multiple producers send messages to a single queue or topic. One or more consumers process all the messages from all sources in a unified way.

The Problem It Solves

You have data coming from many sources that all need to go through the same processing pipeline.

How It Works

IoT Sensor A (temperature data)   -->|
IoT Sensor B (temperature data)   -->|  [Queue: sensor-readings]  --> Data Pipeline
IoT Sensor C (humidity data)      -->|
IoT Sensor D (pressure data)      -->|

Mobile App Events (Android)   -->|
Mobile App Events (iOS)        -->|  [Topic: app-events]  --> Analytics Pipeline
Web App Events                -->|

Real-World Use Cases

Centralized logging from multiple services
IoT data collection from thousands of sensors
Event aggregation for analytics
Multi-region data consolidation
Merging data from multiple upstream systems

5. Dead Letter Queue (DLQ)

What It Is

A Dead Letter Queue (DLQ) is a special queue that receives messages that could not be successfully processed after all retry attempts have been exhausted.

The Problem It Solves

Some messages will always fail processing - maybe due to bad data format, a bug in consumer logic, or referencing a resource that no longer exists. Without a DLQ:

The message is retried forever (infinite loop)
Or the message is silently dropped (data loss)

A DLQ gives you: "We could not process this, but we have not lost it. Here it is for investigation."

How a Message Ends Up in the DLQ

Message arrives at queue
        |
Consumer attempts to process
        |
     Failure
        |
Retry 1 (after 30 seconds)
        |
     Failure
        |
Retry 2 (after 60 seconds)
        |
     Failure
        |
Retry 3 (after 120 seconds)  <-- Max retries reached
        |
     Failure
        |
Move to Dead Letter Queue  <-- Message is preserved for investigation
        |
Alert fires --> Engineering team investigates

Why Messages Fail

Category 1: Transient failures (should succeed on retry)

Network timeouts
Downstream service temporarily unavailable
Database connection pool exhausted

Category 2: Permanent failures (will NEVER succeed on retry - Poison Messages)

Malformed message format (deserialization failure)
Referenced entity deleted (order ID that no longer exists)
Business rule violation (negative amount in payment message)
Bug in consumer code that needs a deployment to fix

The DLQ primarily helps with Category 2. Retries handle Category 1.

Implementation with Exponential Backoff (Spring Boot + RabbitMQ)

# application.yml - RabbitMQ DLQ configuration
spring:
  rabbitmq:
    listener:
      simple:
        retry:
          enabled: true
          initial-interval: 1000 # 1 second first retry
          max-interval: 30000 # 30 second max between retries
          multiplier: 2.0 # Exponential backoff multiplier
          max-attempts: 5 # After 5 attempts, send to DLQ

@Configuration
public class RabbitMQConfig {
 
    @Bean
    public Queue orderProcessingQueue() {
        return QueueBuilder.durable("order-processing")
            .withArgument("x-dead-letter-exchange", "dlx.exchange")
            .withArgument("x-dead-letter-routing-key", "order-processing.dlq")
            .withArgument("x-message-ttl", 86400000)  // Messages expire after 24 hours
            .build();
    }
 
    @Bean
    public Queue orderProcessingDLQ() {
        // The DLQ itself - no further routing from here
        return QueueBuilder.durable("order-processing.dlq").build();
    }
 
    @Bean
    public DirectExchange deadLetterExchange() {
        return new DirectExchange("dlx.exchange");
    }
 
    @Bean
    public Binding dlqBinding() {
        return BindingBuilder
            .bind(orderProcessingDLQ())
            .to(deadLetterExchange())
            .with("order-processing.dlq");
    }
}

DLQ Processing and Alerting

// Monitor and alert on DLQ depth
@Component
@Slf4j
public class DLQMonitor {
 
    private final RabbitAdmin rabbitAdmin;
    private final AlertingService alertingService;
 
    @Scheduled(fixedDelay = 60000)  // Check every minute
    public void checkDLQDepth() {
        QueueInformation info = rabbitAdmin.getQueueInfo("order-processing.dlq");
        if (info != null && info.getMessageCount() > 0) {
            log.warn("DLQ has {} messages - investigation needed", info.getMessageCount());
            alertingService.sendAlert(
                AlertLevel.HIGH,
                "DLQ Alert",
                "order-processing.dlq has " + info.getMessageCount() + " failed messages"
            );
        }
    }
}
 
// DLQ consumer for investigation and manual replay
@Component
public class DLQInvestigationConsumer {
 
    @RabbitListener(queues = "order-processing.dlq")
    public void investigateFailedMessage(
            Message message,
            @Header(AmqpHeaders.DEATH_REASON) String deathReason,
            @Header(AmqpHeaders.DEATH_COUNT) Integer deathCount) {
 
        log.error("Failed message investigation: reason={}, retryCount={}, body={}",
            deathReason, deathCount,
            new String(message.getBody(), StandardCharsets.UTF_8));
 
        // Options:
        // 1. Fix the data and republish to original queue
        // 2. Log to a dead letter store for manual review
        // 3. Send notification to operations team
        // 4. Move to a human review dashboard
 
        failedMessageRepository.save(FailedMessage.from(message, deathReason, deathCount));
    }
}

DLQ Best Practices

Always configure a DLQ - Never let messages be silently dropped
Alert on DLQ depth - A growing DLQ means something is systematically wrong
Build a replay mechanism - Fix the bug, replay from DLQ
Set message TTL on DLQ - Old DLQ messages have no value; delete after 7-30 days
Categorize failure reasons - Track why messages are failing to identify patterns
Do not retry from DLQ automatically - Manual review before replay is almost always worth it

6. Priority Queue

What It Is

A queue where messages have a priority level. Higher-priority messages are delivered to consumers before lower-priority messages, regardless of arrival time.

The Problem It Solves

Not all messages are equally urgent. A password reset email should be sent before a weekly newsletter. A premium customer's order should be processed before a free-tier user's order.

How It Works

Messages arrive in order:          Processing order:
1. newsletter (priority 1)    -->  3. password-reset (priority 10) FIRST
2. order-confirmation (pri 5) -->  2. order-confirmation (priority 5) SECOND
3. password-reset (priority 10)--> 1. newsletter (priority 1) LAST

Implementation (RabbitMQ)

@Configuration
public class PriorityQueueConfig {
 
    @Bean
    public Queue notificationQueue() {
        return QueueBuilder.durable("notifications")
            .withArgument("x-max-priority", 10)  // Priority levels 0 (lowest) to 10 (highest)
            .build();
    }
}
 
@Service
public class NotificationPublisher {
 
    public void sendPasswordReset(String email) {
        MessageProperties props = new MessageProperties();
        props.setPriority(10);  // Highest priority
        Message message = new Message(payload.getBytes(), props);
        rabbitTemplate.send("notifications", message);
    }
 
    public void sendWeeklyNewsletter(String email) {
        MessageProperties props = new MessageProperties();
        props.setPriority(1);   // Lowest priority
        Message message = new Message(payload.getBytes(), props);
        rabbitTemplate.send("notifications", message);
    }
}

Priority Queue in Kafka

Kafka does not natively support priority queues. Common workarounds:

Separate topics per priority: notifications-high, notifications-medium, notifications-low
- Consumer polls high-priority topic first, then medium, then low
Priority routing at producer: Producer decides which topic to write to based on priority
Single consumer checks multiple topics: Polls with different frequency based on priority

Real-World Use Cases

Email sending: transactional emails (high) vs marketing emails (low)
Order processing: express delivery (high) vs standard (low)
Support tickets: premium customers (high) vs free tier (low)
Alert processing: critical alerts (high) vs informational (low)

7. Claim Check Pattern

What It Is

When a message payload is too large to send through the message queue efficiently (or exceeds broker limits), you store the large payload externally and send only a reference (the "claim check") in the message.

The Problem It Solves

Message brokers have payload size limits:

RabbitMQ default: 128MB (not recommended above a few MB for performance)
Kafka default: 1MB per message
AWS SQS: 256KB limit
AWS SQS + S3 Extended Client: up to 2GB (using this exact pattern)

Large payloads in queues cause:

High memory consumption in the broker
Slow serialization/deserialization
Network bandwidth saturation
Broker instability

How It Works

Without Claim Check (problematic):
Producer --> [Huge 50MB PDF embedded in message] --> Consumer
                (broker struggles, memory issues)

With Claim Check:
Step 1: Producer uploads 50MB PDF to S3/Blob Storage
        Gets back a reference: "s3://bucket/documents/doc-12345.pdf"

Step 2: Producer sends lightweight message with reference:
        { "documentId": "doc-12345", "s3Ref": "s3://bucket/documents/doc-12345.pdf" }

Step 3: Consumer receives lightweight message
        Consumer downloads 50MB PDF from S3 directly
        Consumer processes the document
        Consumer optionally deletes from S3 after processing

Implementation Example

// Producer with Claim Check
@Service
public class DocumentProcessingPublisher {
 
    private final S3Client s3Client;
    private final KafkaTemplate<String, DocumentProcessingMessage> kafkaTemplate;
 
    public void submitDocumentForProcessing(String documentId, byte[] documentContent) {
 
        // Step 1: Upload large payload to object storage
        String s3Key = "documents/processing/" + documentId + ".pdf";
        PutObjectRequest putRequest = PutObjectRequest.builder()
            .bucket("document-processing-bucket")
            .key(s3Key)
            .contentType("application/pdf")
            .build();
 
        s3Client.putObject(putRequest, RequestBody.fromBytes(documentContent));
 
        // Step 2: Send lightweight claim check message
        DocumentProcessingMessage message = DocumentProcessingMessage.builder()
            .messageId(UUID.randomUUID().toString())
            .documentId(documentId)
            .payloadReference("s3://document-processing-bucket/" + s3Key)
            .contentType("application/pdf")
            .originalSizeBytes(documentContent.length)
            .submittedAt(Instant.now())
            .build();
 
        kafkaTemplate.send("document-processing", documentId, message);
        log.info("Submitted document for processing: {} ({} bytes)",
            documentId, documentContent.length);
    }
}
 
// Consumer with Claim Check retrieval
@Service
public class DocumentProcessingConsumer {
 
    private final S3Client s3Client;
 
    @KafkaListener(topics = "document-processing", groupId = "doc-processor-group")
    public void processDocument(DocumentProcessingMessage message) {
 
        // Step 3: Retrieve the large payload from object storage
        GetObjectRequest getRequest = GetObjectRequest.builder()
            .bucket(extractBucket(message.getPayloadReference()))
            .key(extractKey(message.getPayloadReference()))
            .build();
 
        byte[] documentContent = s3Client.getObjectAsBytes(getRequest).asByteArray();
 
        log.info("Retrieved document: {} ({} bytes)",
            message.getDocumentId(), documentContent.length);
 
        // Process the full document
        ocrService.extractText(message.getDocumentId(), documentContent);
 
        // Step 4: Clean up the temporary storage (optional)
        deleteFromS3(message.getPayloadReference());
    }
}

When to Apply

Message payload exceeds broker size limits
Payload is a binary file (image, video, PDF, archive)
The same payload might be referenced by multiple messages
Payload storage in the broker would be inefficient or cost-prohibitive

8. Message Routing and Filtering

What It Is

Mechanisms to ensure messages are delivered only to the consumers that care about them, without requiring consumers to filter out irrelevant messages themselves.

Routing in RabbitMQ (Topic Exchange)

Producer publishes to exchange with routing keys:

"order.us-west.placed"     --> US West Order Processing Queue
"order.eu-central.placed"  --> EU Central Order Processing Queue
"order.*.placed"           --> All Regions - Analytics Queue
"order.#"                  --> All Order Events - Audit Queue
"payment.us-west.failed"   --> US West Failure Alerts Queue

Routing key pattern syntax:

* matches exactly one word
# matches zero or more words

// Publish with a specific routing key
rabbitTemplate.convertAndSend("order-exchange", "order.us-west.placed", orderEvent);
 
// Consumer binding - only gets US West orders
@RabbitListener(bindings = @QueueBinding(
    value = @Queue("us-west-orders"),
    exchange = @Exchange(value = "order-exchange", type = ExchangeTypes.TOPIC),
    key = "order.us-west.*"
))
public void handleUsWestOrder(OrderEvent event) { ... }

Message Filtering in Kafka (Consumer-Side)

Kafka does not natively route messages to different queues by content. Filtering happens at the consumer:

@KafkaListener(topics = "order-events")
public void handleEvent(OrderEvent event) {
    // Only process high-value orders
    if (event.getTotalAmount().compareTo(BigDecimal.valueOf(1000)) < 0) {
        log.debug("Skipping low-value order: {}", event.getOrderId());
        return;  // Skip but still ACK (otherwise message keeps coming back)
    }
 
    processHighValueOrder(event);
}

Message Filtering in AWS SNS

SNS supports server-side filter policies, so the queue only receives relevant messages:

{
  "FilterPolicy": {
    "eventType": ["OrderPlaced", "OrderShipped"],
    "region": ["us-west-2", "us-east-1"],
    "orderValue": [{ "numeric": [">=", 100] }]
  }
}

Only messages matching ALL filter conditions are delivered to the subscribed SQS queue. Other messages are silently discarded at SNS - the consumer never even sees them.

9. Event-Driven Architecture

What It Is

An architectural style where services communicate by publishing and consuming events. Services are unaware of each other and react to state changes happening elsewhere in the system.

Core Principles

Events represent immutable facts: "OrderWasPlaced" is a fact about the past, not a command
Services own their domain: Each service is the authoritative source for its domain events
Loose coupling by design: No direct service-to-service dependencies
Eventual consistency: State across services converges over time

The Event-Driven Flow

Traditional (Request-Driven):
User clicks "Place Order"
  -> Order Service
    -> Payment Service (direct HTTP call)
    -> Inventory Service (direct HTTP call)
    -> Email Service (direct HTTP call)
    -> Warehouse Service (direct HTTP call)
  <- Return "Order Confirmed"

Event-Driven:
User clicks "Place Order"
  -> Order Service (creates order, publishes OrderPlaced event)
  <- Return "Order Confirmed" (immediately)

  (In parallel, asynchronously):
  OrderPlaced event -> Payment Service (processes payment)
  OrderPlaced event -> Inventory Service (reserves stock)
  OrderPlaced event -> Email Service (sends confirmation)
  OrderPlaced event -> Warehouse Service (creates pick list)

Event Storming - Discovering the Domain Events

Event Storming is a workshop technique for discovering the events in a system:

Domain Events (orange sticky notes): "What happened?" - OrderPlaced, PaymentReceived, ItemShipped
Commands (blue sticky notes): "What caused it?" - PlaceOrder, ProcessPayment, ShipItem
Aggregates (yellow sticky notes): "What received the command?" - Order, Payment, Shipment
External Systems (pink sticky notes): "What external systems are involved?" - PaymentGateway, ShippingCarrier

Event Types: Domain Events vs Integration Events

Type	Scope	Content	Audience
Domain Event	Within one service	Rich with business context	Internal to the bounded context
Integration Event	Across services	Minimal, stable contract	External consumers

Best practice: Keep your domain events rich for internal use. When publishing to other services, create a stable integration event contract that you version explicitly.

10. Choreography vs Orchestration

This is one of the most critical distinctions in distributed systems design, and it comes up frequently in interviews.

Choreography

Each service knows what to do when it receives an event. There is no central coordinator. Services react to events and emit new events, creating a chain of reactions.

Example: Order fulfillment via choreography

Step 1: Customer places order
        Order Service publishes: OrderPlaced
                                     |
Step 2: Payment Service listens to OrderPlaced
        Payment Service processes payment
        Payment Service publishes: PaymentSucceeded
                                     |
Step 3: Inventory Service listens to PaymentSucceeded
        Inventory Service reserves items
        Inventory Service publishes: InventoryReserved
                                     |
Step 4: Fulfillment Service listens to InventoryReserved
        Fulfillment Service creates shipment
        Fulfillment Service publishes: ShipmentCreated

Visual:

OrderService ---(OrderPlaced)---> PaymentService
                                       |
                                (PaymentSucceeded)
                                       |
                                       v
                               InventoryService
                                       |
                               (InventoryReserved)
                                       |
                                       v
                              FulfillmentService
                                       |
                               (ShipmentCreated)

Pros of Choreography:

Fully decoupled - no central dependency
Easy to add new services (just subscribe to the right event)
No single point of failure

Cons of Choreography:

Hard to understand the overall flow (no single place to see the whole process)
Difficult to track "where is order #12345 in the process?"
Compensating transactions (rollback) are complex
Debugging failures requires tracing events across multiple services

Orchestration

A central Orchestrator service (or Saga Orchestrator) knows the entire workflow and explicitly commands each service to do its part. Services just execute commands and report results.

Example: Order fulfillment via orchestration

Central OrderOrchestrator directs the entire flow:

Step 1: OrderOrchestrator --> PaymentService.charge(orderId)
        PaymentService returns: {success: true, transactionId: "txn-123"}

Step 2: OrderOrchestrator --> InventoryService.reserve(orderId, items)
        InventoryService returns: {success: true, reservationId: "res-456"}

Step 3: OrderOrchestrator --> FulfillmentService.createShipment(orderId)
        FulfillmentService returns: {success: true, trackingId: "trk-789"}

Step 4: OrderOrchestrator marks order as COMPLETED

Failure Handling with Orchestration:

Step 1: charge payment - SUCCESS
Step 2: reserve inventory - FAILURE (out of stock)

Orchestrator executes compensating transactions:
  - Refund the payment (compensate Step 1)
  - Mark order as FAILED
  - Notify customer with reason

Visual:

          OrderOrchestrator
         /       |         \
        v        v          v
  PaymentSvc InventorySvc FulfillmentSvc
  (responds)   (responds)   (responds)

Pros of Orchestration:

Clear visibility of the entire workflow in one place
Easier to implement compensating transactions (rollback)
Straightforward to track "where is this order in the process?"
Simpler debugging

Cons of Orchestration:

Orchestrator becomes a central dependency
Orchestrator must be highly available
Risk of logic creep - orchestrator becomes too smart
Can become a bottleneck

Choreography vs Orchestration - Decision Guide

Scenario	Recommendation
Simple, 2-3 step workflow	Choreography
Complex, multi-step workflow with many failure modes	Orchestration
Workflow spans 5+ services	Orchestration
Need a clear audit trail of workflow state	Orchestration
Adding new participants should be zero-friction	Choreography
Compensating transactions are complex	Orchestration
Team size is large and services are owned by different teams	Choreography (more autonomous)

In practice: Most real-world systems use a hybrid. Use choreography for simple event reactions and orchestration for complex, stateful, multi-step workflows that require compensation.

11. Scatter-Gather Pattern

What It Is

The Scatter-Gather pattern broadcasts a request to multiple services (scatter) and then collects and aggregates all their responses (gather) to form a final result.

Use Case Example

Building a price comparison service that needs to get quotes from multiple insurance providers simultaneously:

Request: Get insurance quotes for customer profile X

Scatter: Send request simultaneously to:
         --> InsurerA.getQuote(profile)
         --> InsurerB.getQuote(profile)
         --> InsurerC.getQuote(profile)
         --> InsurerD.getQuote(profile)

Gather: Wait for responses (with timeout), aggregate all quotes, return best price

         InsurerA: $120/month  -->
         InsurerB: $135/month  -->  Aggregator  --> [Cheapest: $115 from InsurerC]
         InsurerC: $115/month  -->
         InsurerD: timeout     -->  (ignored - partial results acceptable)

Implementation Pattern

@Service
public class InsuranceQuoteAggregator {
 
    private final KafkaTemplate<String, QuoteRequest> kafkaTemplate;
    private final ConcurrentHashMap<String, List<InsuranceQuote>> pendingRequests
        = new ConcurrentHashMap<>();
 
    public List<InsuranceQuote> getQuotes(CustomerProfile profile) {
        String requestId = UUID.randomUUID().toString();
        CountDownLatch latch = new CountDownLatch(insurerTopics.size());
        pendingRequests.put(requestId, new CopyOnWriteArrayList<>());
 
        // SCATTER - broadcast to all insurers
        QuoteRequest request = new QuoteRequest(requestId, profile);
        for (String insurerTopic : insurerTopics) {
            kafkaTemplate.send(insurerTopic, requestId, request);
        }
 
        // GATHER - wait for responses (up to 5 seconds)
        try {
            latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
 
        // Return whatever responses came back within the timeout
        return pendingRequests.remove(requestId);
    }
 
    @KafkaListener(topics = "insurance-quote-responses")
    public void handleQuoteResponse(InsuranceQuote quote) {
        List<InsuranceQuote> quotes = pendingRequests.get(quote.getRequestId());
        if (quotes != null) {
            quotes.add(quote);
        }
    }
}

12. Transactional Outbox Pattern (Preview)

The Outbox Pattern deserves its own deep-dive (covered fully in Part 4 - Advanced Concepts), but here is the concept:

The Problem It Solves

Consider this code:

@Transactional
public void placeOrder(OrderRequest request) {
    orderRepository.save(order);          // Database write
    kafkaTemplate.send("orders", event);  // Message publish
 
    // What happens if the application crashes between these two lines?
    // Or if Kafka is temporarily unavailable?
    // You get: order saved in DB, but event never published = inconsistency
}

This is the dual-write problem: you cannot atomically write to a database AND publish to a message queue in the same transaction.

The Solution (Preview)

Instead of publishing directly, write the event to an outbox table within the same database transaction:

@Transactional
public void placeOrder(OrderRequest request) {
    orderRepository.save(order);
    // Write event to outbox table in the SAME transaction
    outboxRepository.save(new OutboxEvent("order-events", eventPayload));
    // Atomic: either both succeed or both fail
}
// A separate process reads the outbox table and publishes to Kafka
// If publishing fails, it retries - the event is not lost

Full implementation and details in Part 4.

Pattern Selection Summary

Pattern	Use When	Key Characteristic
Work Queue	Distribute tasks across workers	Each task processed by one worker
Pub/Sub	Broadcast events to multiple services	Each subscriber gets all messages
Request-Reply	Need async result from long-running op	Correlation ID links request and reply
Fan-in	Aggregate data from multiple sources	Many producers, unified consumer
Dead Letter Queue	Handle unprocessable messages	Safety net for failed messages
Priority Queue	Some messages more urgent than others	Higher priority processed first
Claim Check	Large payloads exceeding size limits	Store payload externally, pass reference
Message Routing	Different consumers for different messages	Route by key, pattern, or attributes
Choreography	Loosely coupled event-driven services	Services react to events autonomously
Orchestration	Complex workflows needing central control	Coordinator directs each step
Scatter-Gather	Aggregate responses from multiple sources	Broadcast + collect with timeout
Outbox	Atomic DB write + message publish	Guarantee no message loss

Previous: Part 1 - Fundamentals
Next: Part 3 - Technologies Deep Dive
Index: Message Queues Demystified - Index

Series: Message Queues Demystified

Message Queues Demystified - Part 2: Messaging Patterns and Architecture

Table of Contents

1. Work Queue (Competing Consumers)

What It Is

The Problem It Solves

How It Works

Key Properties

Real-World Use Cases

Implementation Example (Spring Boot + RabbitMQ)

Scaling Considerations

2. Publish-Subscribe (Fan-out)

What It Is

The Problem It Solves

How It Works

Event vs Command - An Important Distinction

Real-World Use Cases

Implementation Example (Spring Boot + Kafka)

The Power of Adding New Consumers

3. Request-Reply (Async RPC)

What It Is

The Problem It Solves

How It Works

Implementation Example

When to Use Request-Reply

When NOT to Use Request-Reply

4. Fan-in (Aggregation)

What It Is

The Problem It Solves

How It Works

Real-World Use Cases

5. Dead Letter Queue (DLQ)

What It Is

The Problem It Solves

How a Message Ends Up in the DLQ

Why Messages Fail

Implementation with Exponential Backoff (Spring Boot + RabbitMQ)

DLQ Processing and Alerting

DLQ Best Practices

6. Priority Queue

What It Is

The Problem It Solves

How It Works

Implementation (RabbitMQ)

Priority Queue in Kafka

Real-World Use Cases

7. Claim Check Pattern

What It Is

The Problem It Solves

How It Works

Implementation Example

When to Apply

8. Message Routing and Filtering

What It Is

Routing in RabbitMQ (Topic Exchange)

Message Filtering in Kafka (Consumer-Side)

Message Filtering in AWS SNS

9. Event-Driven Architecture

What It Is

Core Principles

The Event-Driven Flow

Event Storming - Discovering the Domain Events

Event Types: Domain Events vs Integration Events

10. Choreography vs Orchestration

Choreography

Orchestration

Choreography vs Orchestration - Decision Guide

11. Scatter-Gather Pattern

What It Is

Use Case Example

Implementation Pattern

12. Transactional Outbox Pattern (Preview)

The Problem It Solves

The Solution (Preview)

Pattern Selection Summary