← Back to Articles
6/6/2026Admin Post

message queues part2 patterns architecture

Message Queues Demystified - Part 2: Messaging Patterns and Architecture

Knowing that a message queue exists is table stakes. Knowing which pattern to apply, and why,
separates engineers who use message queues from engineers who master them.


Table of Contents

  1. Work Queue (Competing Consumers)
  2. Publish-Subscribe (Fan-out)
  3. Request-Reply (Async RPC)
  4. Fan-in (Aggregation)
  5. Dead Letter Queue (DLQ)
  6. Priority Queue
  7. Claim Check Pattern
  8. Message Routing and Filtering
  9. Event-Driven Architecture
  10. Choreography vs Orchestration
  11. Scatter-Gather Pattern
  12. Transactional Outbox Pattern (Preview)

1. Work Queue (Competing Consumers)

What It Is

The Work Queue is the most fundamental messaging pattern. A single queue holds tasks (messages). Multiple consumer instances compete to grab tasks from the queue. Each task is processed by exactly one consumer.

The Problem It Solves

You have a large volume of work items that can be processed independently. Processing each item takes non-trivial time. You need to process them as fast as possible.

Without this pattern, you have two bad options:

  • Option A: Process items sequentially in a loop - slow, single point of failure
  • Option B: Spawn a thread per item - unbounded thread explosion, memory exhaustion

How It Works

Producer (submits jobs)
        |
        v
[Queue: video-transcoding-jobs]
        |
  +-----+-----+-----+
  |     |     |     |
  v     v     v     v
Worker Worker Worker Worker
  1     2     3     4
(each grabs and processes one job at a time)
(when done, immediately grabs the next available job)

Step-by-step flow:

  1. Producer publishes message: {jobId: "video-001", videoUrl: "s3://...", targetFormat: "mp4"}
  2. Worker 1 picks up the message (other workers cannot see this message while Worker 1 has it)
  3. Worker 1 transcodes the video (takes 3 minutes)
  4. Worker 1 sends ACK - message is deleted from queue
  5. Worker 1 immediately picks up the next message

If Worker 1 crashes during step 3:

  • The message's visibility timeout expires
  • The queue makes the message visible again
  • Worker 2 (or any other worker) picks it up and retries

Key Properties

  • Load distribution: Work is automatically spread across all available workers
  • Fault tolerance: Worker failures are handled transparently via re-queuing
  • Elastic scaling: Add workers to increase throughput; remove workers to save cost
  • Self-pacing: Slow workers are not overloaded (they pull only when ready)

Real-World Use Cases

  • Video transcoding (Netflix, YouTube uploads)
  • Image resizing and thumbnail generation
  • Email and SMS sending at scale
  • Background report generation
  • Machine learning inference jobs
  • Batch data processing

Implementation Example (Spring Boot + RabbitMQ)

// Producer - submits jobs to the queue
@Service
public class VideoProcessingJobSubmitter {
 
    private final RabbitTemplate rabbitTemplate;
 
    public void submitTranscodingJob(String videoId, String s3Url) {
        TranscodeJobMessage job = TranscodeJobMessage.builder()
            .jobId(UUID.randomUUID().toString())
            .videoId(videoId)
            .sourceUrl(s3Url)
            .targetFormats(List.of("mp4", "webm", "hls"))
            .priority("NORMAL")
            .submittedAt(Instant.now())
            .build();
 
        rabbitTemplate.convertAndSend("video-processing", "transcode.job", job);
        log.info("Submitted transcoding job: {}", job.getJobId());
    }
}
 
// Consumer - multiple instances compete for jobs
@Component
@Slf4j
public class VideoTranscodingWorker {
 
    @RabbitListener(queues = "video-transcoding-jobs")
    public void processTranscodeJob(TranscodeJobMessage job, Channel channel, Message message)
            throws IOException {
 
        log.info("Worker picked up job: {}", job.getJobId());
 
        try {
            // This takes several minutes - other workers handle other jobs in parallel
            ffmpegService.transcode(job.getSourceUrl(), job.getTargetFormats());
            videoRepository.markTranscoded(job.getVideoId());
 
            // Manual ACK after successful processing
            channel.basicAck(message.getMessageProperties().getDeliveryTag(), false);
            log.info("Job completed: {}", job.getJobId());
 
        } catch (Exception e) {
            log.error("Job failed: {}", job.getJobId(), e);
            // NACK with requeue=false (will go to DLQ after max retries)
            channel.basicNack(message.getMessageProperties().getDeliveryTag(), false, false);
        }
    }
}

Scaling Considerations

Queue depth: 10,000 jobs
Each job takes: 5 minutes
Target completion: within 2 hours

Workers needed: 10,000 jobs x 5 min / 120 min = ~417 workers

With auto-scaling: Start with 10 workers, scale up to 500 based on queue depth.
After jobs drain: Scale back down to 10 workers.

2. Publish-Subscribe (Fan-out)

What It Is

One producer publishes a message (event) to a topic. Every subscriber independently receives and processes their own copy of the message.

The Problem It Solves

An event has occurred in your system. Multiple independent services need to react to this event. You do not want the event source to know about or depend on every service that cares.

How It Works

Event: UserRegistered (customerId: 12345, email: user@example.com, plan: PRO)

Producer (User Service)
        |
        v
[Topic: user-registered]
        |
   +----+----+----+
   |    |    |    |
   v    v    v    v
Email  CRM  Risk  Analytics
Svc    Svc  Svc    Svc
(send  (add (fraud (track
 welcome) to CRM) check) signup)

Each consumer group independently:

  • Maintains its own "cursor" (offset in Kafka, subscription in SNS)
  • Processes messages at its own pace
  • Fails and retries without affecting other consumers
  • Can be deployed, scaled, and updated independently

Event vs Command - An Important Distinction

AspectEventCommand
Meaning"This happened""Please do this"
ExampleOrderPlacedSendEmail
DirectionBroadcast (1 to many)Targeted (1 to 1)
CouplingProducer unaware of consumersProducer knows who executes
Suitable forPub/Sub on a topicPoint-to-Point on a queue

Golden rule: Publish events to topics. Send commands to queues.

Real-World Use Cases

  • Domain events in microservices (UserRegistered, OrderPlaced, PaymentReceived)
  • Cache invalidation broadcast to all service instances
  • Configuration change propagation
  • Real-time notification systems
  • Audit and compliance event logging
  • Data synchronization between services

Implementation Example (Spring Boot + Kafka)

// Producer - publishes the event once
@Service
@RequiredArgsConstructor
public class UserRegistrationService {
 
    private final UserRepository userRepository;
    private final KafkaTemplate<String, UserRegisteredEvent> kafkaTemplate;
 
    @Transactional
    public User registerUser(RegisterUserRequest request) {
        User user = User.create(request);
        userRepository.save(user);
 
        UserRegisteredEvent event = UserRegisteredEvent.builder()
            .eventId(UUID.randomUUID().toString())
            .userId(user.getId())
            .email(user.getEmail())
            .firstName(user.getFirstName())
            .plan(user.getSubscriptionPlan())
            .registeredAt(Instant.now())
            .build();
 
        // Publish to topic - ALL consumers will receive this
        kafkaTemplate.send("user-events", user.getId().toString(), event);
 
        return user;
    }
}
 
// Consumer 1 - Email Service (completely independent)
@Component
@KafkaListener(topics = "user-events", groupId = "email-service-group")
public class WelcomeEmailConsumer {
    public void onUserRegistered(UserRegisteredEvent event) {
        if ("UserRegistered".equals(event.getEventType())) {
            emailService.sendWelcomeEmail(event.getEmail(), event.getFirstName());
        }
    }
}
 
// Consumer 2 - CRM Service (completely independent)
@Component
@KafkaListener(topics = "user-events", groupId = "crm-service-group")
public class CrmSyncConsumer {
    public void onUserRegistered(UserRegisteredEvent event) {
        if ("UserRegistered".equals(event.getEventType())) {
            crmClient.createContact(event.getUserId(), event.getEmail(), event.getPlan());
        }
    }
}

The Power of Adding New Consumers

Month 1: Launch with Email Service consumer.

Month 3: Product wants to add onboarding tooltips.
         Add Onboarding Service consumer.
         Zero changes to User Registration Service.

Month 6: Security team wants fraud screening.
         Add Risk Assessment Service consumer.
         Zero changes to anything existing.

Month 12: Compliance requires audit logging.
          Add Compliance Logger consumer.
          Zero changes to anything existing.

This is the compounding value of Pub/Sub. Each new consumer adds capability without creating technical debt in the producer.


3. Request-Reply (Async RPC)

What It Is

An asynchronous implementation of a request-response interaction. The requester sends a message to a queue, and the worker processes it and sends the result to a reply queue. The requester waits on the reply queue.

The Problem It Solves

You need a response from a long-running operation, but you do not want the calling thread to block for minutes while waiting. You also want the benefits of decoupling (independent scaling, fault tolerance) even for operations that require a result.

How It Works

Requester                   Worker
    |                          |
    | 1. Send request to queue |
    |-----> [Request Queue] -->|
    |       (correlationId:    |
    |        "req-abc123",     |
    |        replyTo:          |
    |        "reply-queue-xyz") |
    |                          |
    |                          | 2. Process (takes time)
    |                          |
    |     3. Send result to replyTo queue
    |<---- [Reply Queue] <-----|
    |      (correlationId:     |
    |       "req-abc123",      |
    |       result: {...})     |
    |                          |
    | 4. Match by correlationId|
    | 5. Return result to caller

The Correlation ID is critical. Without it, the requester cannot match which reply corresponds to which request, especially when many requests are in-flight simultaneously.

Implementation Example

// Requester - sends request and waits for reply
@Service
public class DocumentAnalysisClient {
 
    private final RabbitTemplate rabbitTemplate;
 
    // Timeout after 30 seconds if no response
    @Value("${analysis.timeout.seconds:30}")
    private int timeoutSeconds;
 
    public AnalysisResult analyzeDocument(byte[] documentBytes, String documentType) {
        AnalysisRequest request = AnalysisRequest.builder()
            .requestId(UUID.randomUUID().toString())
            .documentBytes(documentBytes)
            .documentType(documentType)
            .requestedAt(Instant.now())
            .build();
 
        // sendAndReceive is the RabbitMQ helper for request-reply
        // It automatically creates a temporary reply queue and sets correlationId
        AnalysisResult result = (AnalysisResult) rabbitTemplate.convertSendAndReceive(
            "document-analysis",
            "analyze.document",
            request,
            message -> {
                // Set correlation ID for matching
                message.getMessageProperties().setCorrelationId(request.getRequestId());
                return message;
            }
        );
 
        if (result == null) {
            throw new AnalysisTimeoutException("Document analysis timed out after "
                + timeoutSeconds + " seconds");
        }
 
        return result;
    }
}
 
// Worker - processes requests and sends replies
@Component
public class DocumentAnalysisWorker {
 
    @RabbitListener(queues = "document-analysis-queue")
    public AnalysisResult handleAnalysisRequest(AnalysisRequest request) {
        // Just return the result - Spring AMQP handles sending it to the reply queue
        log.info("Analyzing document: requestId={}", request.getRequestId());
 
        // This could take 10-20 seconds for a large document
        return documentAnalyzer.analyze(request.getDocumentBytes(), request.getDocumentType());
    }
}

When to Use Request-Reply

  • Long-running operations that produce a result the caller needs
  • When you want async processing but still need the outcome
  • RPC-style communication between microservices with decoupled scaling

When NOT to Use Request-Reply

  • If the operation is fast (under 100ms), use synchronous REST
  • If the caller can truly fire-and-forget, use standard Pub/Sub

4. Fan-in (Aggregation)

What It Is

Multiple producers send messages to a single queue or topic. One or more consumers process all the messages from all sources in a unified way.

The Problem It Solves

You have data coming from many sources that all need to go through the same processing pipeline.

How It Works

IoT Sensor A (temperature data)   -->|
IoT Sensor B (temperature data)   -->|  [Queue: sensor-readings]  --> Data Pipeline
IoT Sensor C (humidity data)      -->|
IoT Sensor D (pressure data)      -->|

Mobile App Events (Android)   -->|
Mobile App Events (iOS)        -->|  [Topic: app-events]  --> Analytics Pipeline
Web App Events                -->|

Real-World Use Cases

  • Centralized logging from multiple services
  • IoT data collection from thousands of sensors
  • Event aggregation for analytics
  • Multi-region data consolidation
  • Merging data from multiple upstream systems

5. Dead Letter Queue (DLQ)

What It Is

A Dead Letter Queue (DLQ) is a special queue that receives messages that could not be successfully processed after all retry attempts have been exhausted.

The Problem It Solves

Some messages will always fail processing - maybe due to bad data format, a bug in consumer logic, or referencing a resource that no longer exists. Without a DLQ:

  • The message is retried forever (infinite loop)
  • Or the message is silently dropped (data loss)

A DLQ gives you: "We could not process this, but we have not lost it. Here it is for investigation."

How a Message Ends Up in the DLQ

Message arrives at queue
        |
Consumer attempts to process
        |
     Failure
        |
Retry 1 (after 30 seconds)
        |
     Failure
        |
Retry 2 (after 60 seconds)
        |
     Failure
        |
Retry 3 (after 120 seconds)  <-- Max retries reached
        |
     Failure
        |
Move to Dead Letter Queue  <-- Message is preserved for investigation
        |
Alert fires --> Engineering team investigates

Why Messages Fail

Category 1: Transient failures (should succeed on retry)

  • Network timeouts
  • Downstream service temporarily unavailable
  • Database connection pool exhausted

Category 2: Permanent failures (will NEVER succeed on retry - Poison Messages)

  • Malformed message format (deserialization failure)
  • Referenced entity deleted (order ID that no longer exists)
  • Business rule violation (negative amount in payment message)
  • Bug in consumer code that needs a deployment to fix

The DLQ primarily helps with Category 2. Retries handle Category 1.

Implementation with Exponential Backoff (Spring Boot + RabbitMQ)

# application.yml - RabbitMQ DLQ configuration
spring:
  rabbitmq:
    listener:
      simple:
        retry:
          enabled: true
          initial-interval: 1000 # 1 second first retry
          max-interval: 30000 # 30 second max between retries
          multiplier: 2.0 # Exponential backoff multiplier
          max-attempts: 5 # After 5 attempts, send to DLQ
@Configuration
public class RabbitMQConfig {
 
    @Bean
    public Queue orderProcessingQueue() {
        return QueueBuilder.durable("order-processing")
            .withArgument("x-dead-letter-exchange", "dlx.exchange")
            .withArgument("x-dead-letter-routing-key", "order-processing.dlq")
            .withArgument("x-message-ttl", 86400000)  // Messages expire after 24 hours
            .build();
    }
 
    @Bean
    public Queue orderProcessingDLQ() {
        // The DLQ itself - no further routing from here
        return QueueBuilder.durable("order-processing.dlq").build();
    }
 
    @Bean
    public DirectExchange deadLetterExchange() {
        return new DirectExchange("dlx.exchange");
    }
 
    @Bean
    public Binding dlqBinding() {
        return BindingBuilder
            .bind(orderProcessingDLQ())
            .to(deadLetterExchange())
            .with("order-processing.dlq");
    }
}

DLQ Processing and Alerting

// Monitor and alert on DLQ depth
@Component
@Slf4j
public class DLQMonitor {
 
    private final RabbitAdmin rabbitAdmin;
    private final AlertingService alertingService;
 
    @Scheduled(fixedDelay = 60000)  // Check every minute
    public void checkDLQDepth() {
        QueueInformation info = rabbitAdmin.getQueueInfo("order-processing.dlq");
        if (info != null && info.getMessageCount() > 0) {
            log.warn("DLQ has {} messages - investigation needed", info.getMessageCount());
            alertingService.sendAlert(
                AlertLevel.HIGH,
                "DLQ Alert",
                "order-processing.dlq has " + info.getMessageCount() + " failed messages"
            );
        }
    }
}
 
// DLQ consumer for investigation and manual replay
@Component
public class DLQInvestigationConsumer {
 
    @RabbitListener(queues = "order-processing.dlq")
    public void investigateFailedMessage(
            Message message,
            @Header(AmqpHeaders.DEATH_REASON) String deathReason,
            @Header(AmqpHeaders.DEATH_COUNT) Integer deathCount) {
 
        log.error("Failed message investigation: reason={}, retryCount={}, body={}",
            deathReason, deathCount,
            new String(message.getBody(), StandardCharsets.UTF_8));
 
        // Options:
        // 1. Fix the data and republish to original queue
        // 2. Log to a dead letter store for manual review
        // 3. Send notification to operations team
        // 4. Move to a human review dashboard
 
        failedMessageRepository.save(FailedMessage.from(message, deathReason, deathCount));
    }
}

DLQ Best Practices

  1. Always configure a DLQ - Never let messages be silently dropped
  2. Alert on DLQ depth - A growing DLQ means something is systematically wrong
  3. Build a replay mechanism - Fix the bug, replay from DLQ
  4. Set message TTL on DLQ - Old DLQ messages have no value; delete after 7-30 days
  5. Categorize failure reasons - Track why messages are failing to identify patterns
  6. Do not retry from DLQ automatically - Manual review before replay is almost always worth it

6. Priority Queue

What It Is

A queue where messages have a priority level. Higher-priority messages are delivered to consumers before lower-priority messages, regardless of arrival time.

The Problem It Solves

Not all messages are equally urgent. A password reset email should be sent before a weekly newsletter. A premium customer's order should be processed before a free-tier user's order.

How It Works

Messages arrive in order:          Processing order:
1. newsletter (priority 1)    -->  3. password-reset (priority 10) FIRST
2. order-confirmation (pri 5) -->  2. order-confirmation (priority 5) SECOND
3. password-reset (priority 10)--> 1. newsletter (priority 1) LAST

Implementation (RabbitMQ)

@Configuration
public class PriorityQueueConfig {
 
    @Bean
    public Queue notificationQueue() {
        return QueueBuilder.durable("notifications")
            .withArgument("x-max-priority", 10)  // Priority levels 0 (lowest) to 10 (highest)
            .build();
    }
}
 
@Service
public class NotificationPublisher {
 
    public void sendPasswordReset(String email) {
        MessageProperties props = new MessageProperties();
        props.setPriority(10);  // Highest priority
        Message message = new Message(payload.getBytes(), props);
        rabbitTemplate.send("notifications", message);
    }
 
    public void sendWeeklyNewsletter(String email) {
        MessageProperties props = new MessageProperties();
        props.setPriority(1);   // Lowest priority
        Message message = new Message(payload.getBytes(), props);
        rabbitTemplate.send("notifications", message);
    }
}

Priority Queue in Kafka

Kafka does not natively support priority queues. Common workarounds:

  1. Separate topics per priority: notifications-high, notifications-medium, notifications-low
    • Consumer polls high-priority topic first, then medium, then low
  2. Priority routing at producer: Producer decides which topic to write to based on priority
  3. Single consumer checks multiple topics: Polls with different frequency based on priority

Real-World Use Cases

  • Email sending: transactional emails (high) vs marketing emails (low)
  • Order processing: express delivery (high) vs standard (low)
  • Support tickets: premium customers (high) vs free tier (low)
  • Alert processing: critical alerts (high) vs informational (low)

7. Claim Check Pattern

What It Is

When a message payload is too large to send through the message queue efficiently (or exceeds broker limits), you store the large payload externally and send only a reference (the "claim check") in the message.

The Problem It Solves

Message brokers have payload size limits:

  • RabbitMQ default: 128MB (not recommended above a few MB for performance)
  • Kafka default: 1MB per message
  • AWS SQS: 256KB limit
  • AWS SQS + S3 Extended Client: up to 2GB (using this exact pattern)

Large payloads in queues cause:

  • High memory consumption in the broker
  • Slow serialization/deserialization
  • Network bandwidth saturation
  • Broker instability

How It Works

Without Claim Check (problematic):
Producer --> [Huge 50MB PDF embedded in message] --> Consumer
                (broker struggles, memory issues)

With Claim Check:
Step 1: Producer uploads 50MB PDF to S3/Blob Storage
        Gets back a reference: "s3://bucket/documents/doc-12345.pdf"

Step 2: Producer sends lightweight message with reference:
        { "documentId": "doc-12345", "s3Ref": "s3://bucket/documents/doc-12345.pdf" }

Step 3: Consumer receives lightweight message
        Consumer downloads 50MB PDF from S3 directly
        Consumer processes the document
        Consumer optionally deletes from S3 after processing

Implementation Example

// Producer with Claim Check
@Service
public class DocumentProcessingPublisher {
 
    private final S3Client s3Client;
    private final KafkaTemplate<String, DocumentProcessingMessage> kafkaTemplate;
 
    public void submitDocumentForProcessing(String documentId, byte[] documentContent) {
 
        // Step 1: Upload large payload to object storage
        String s3Key = "documents/processing/" + documentId + ".pdf";
        PutObjectRequest putRequest = PutObjectRequest.builder()
            .bucket("document-processing-bucket")
            .key(s3Key)
            .contentType("application/pdf")
            .build();
 
        s3Client.putObject(putRequest, RequestBody.fromBytes(documentContent));
 
        // Step 2: Send lightweight claim check message
        DocumentProcessingMessage message = DocumentProcessingMessage.builder()
            .messageId(UUID.randomUUID().toString())
            .documentId(documentId)
            .payloadReference("s3://document-processing-bucket/" + s3Key)
            .contentType("application/pdf")
            .originalSizeBytes(documentContent.length)
            .submittedAt(Instant.now())
            .build();
 
        kafkaTemplate.send("document-processing", documentId, message);
        log.info("Submitted document for processing: {} ({} bytes)",
            documentId, documentContent.length);
    }
}
 
// Consumer with Claim Check retrieval
@Service
public class DocumentProcessingConsumer {
 
    private final S3Client s3Client;
 
    @KafkaListener(topics = "document-processing", groupId = "doc-processor-group")
    public void processDocument(DocumentProcessingMessage message) {
 
        // Step 3: Retrieve the large payload from object storage
        GetObjectRequest getRequest = GetObjectRequest.builder()
            .bucket(extractBucket(message.getPayloadReference()))
            .key(extractKey(message.getPayloadReference()))
            .build();
 
        byte[] documentContent = s3Client.getObjectAsBytes(getRequest).asByteArray();
 
        log.info("Retrieved document: {} ({} bytes)",
            message.getDocumentId(), documentContent.length);
 
        // Process the full document
        ocrService.extractText(message.getDocumentId(), documentContent);
 
        // Step 4: Clean up the temporary storage (optional)
        deleteFromS3(message.getPayloadReference());
    }
}

When to Apply

  • Message payload exceeds broker size limits
  • Payload is a binary file (image, video, PDF, archive)
  • The same payload might be referenced by multiple messages
  • Payload storage in the broker would be inefficient or cost-prohibitive

8. Message Routing and Filtering

What It Is

Mechanisms to ensure messages are delivered only to the consumers that care about them, without requiring consumers to filter out irrelevant messages themselves.

Routing in RabbitMQ (Topic Exchange)

Producer publishes to exchange with routing keys:

"order.us-west.placed"     --> US West Order Processing Queue
"order.eu-central.placed"  --> EU Central Order Processing Queue
"order.*.placed"           --> All Regions - Analytics Queue
"order.#"                  --> All Order Events - Audit Queue
"payment.us-west.failed"   --> US West Failure Alerts Queue

Routing key pattern syntax:

  • * matches exactly one word
  • # matches zero or more words
// Publish with a specific routing key
rabbitTemplate.convertAndSend("order-exchange", "order.us-west.placed", orderEvent);
 
// Consumer binding - only gets US West orders
@RabbitListener(bindings = @QueueBinding(
    value = @Queue("us-west-orders"),
    exchange = @Exchange(value = "order-exchange", type = ExchangeTypes.TOPIC),
    key = "order.us-west.*"
))
public void handleUsWestOrder(OrderEvent event) { ... }

Message Filtering in Kafka (Consumer-Side)

Kafka does not natively route messages to different queues by content. Filtering happens at the consumer:

@KafkaListener(topics = "order-events")
public void handleEvent(OrderEvent event) {
    // Only process high-value orders
    if (event.getTotalAmount().compareTo(BigDecimal.valueOf(1000)) < 0) {
        log.debug("Skipping low-value order: {}", event.getOrderId());
        return;  // Skip but still ACK (otherwise message keeps coming back)
    }
 
    processHighValueOrder(event);
}

Message Filtering in AWS SNS

SNS supports server-side filter policies, so the queue only receives relevant messages:

{
  "FilterPolicy": {
    "eventType": ["OrderPlaced", "OrderShipped"],
    "region": ["us-west-2", "us-east-1"],
    "orderValue": [{ "numeric": [">=", 100] }]
  }
}

Only messages matching ALL filter conditions are delivered to the subscribed SQS queue. Other messages are silently discarded at SNS - the consumer never even sees them.


9. Event-Driven Architecture

What It Is

An architectural style where services communicate by publishing and consuming events. Services are unaware of each other and react to state changes happening elsewhere in the system.

Core Principles

  1. Events represent immutable facts: "OrderWasPlaced" is a fact about the past, not a command
  2. Services own their domain: Each service is the authoritative source for its domain events
  3. Loose coupling by design: No direct service-to-service dependencies
  4. Eventual consistency: State across services converges over time

The Event-Driven Flow

Traditional (Request-Driven):
User clicks "Place Order"
  -> Order Service
    -> Payment Service (direct HTTP call)
    -> Inventory Service (direct HTTP call)
    -> Email Service (direct HTTP call)
    -> Warehouse Service (direct HTTP call)
  <- Return "Order Confirmed"

Event-Driven:
User clicks "Place Order"
  -> Order Service (creates order, publishes OrderPlaced event)
  <- Return "Order Confirmed" (immediately)

  (In parallel, asynchronously):
  OrderPlaced event -> Payment Service (processes payment)
  OrderPlaced event -> Inventory Service (reserves stock)
  OrderPlaced event -> Email Service (sends confirmation)
  OrderPlaced event -> Warehouse Service (creates pick list)

Event Storming - Discovering the Domain Events

Event Storming is a workshop technique for discovering the events in a system:

  1. Domain Events (orange sticky notes): "What happened?" - OrderPlaced, PaymentReceived, ItemShipped
  2. Commands (blue sticky notes): "What caused it?" - PlaceOrder, ProcessPayment, ShipItem
  3. Aggregates (yellow sticky notes): "What received the command?" - Order, Payment, Shipment
  4. External Systems (pink sticky notes): "What external systems are involved?" - PaymentGateway, ShippingCarrier

Event Types: Domain Events vs Integration Events

TypeScopeContentAudience
Domain EventWithin one serviceRich with business contextInternal to the bounded context
Integration EventAcross servicesMinimal, stable contractExternal consumers

Best practice: Keep your domain events rich for internal use. When publishing to other services, create a stable integration event contract that you version explicitly.


10. Choreography vs Orchestration

This is one of the most critical distinctions in distributed systems design, and it comes up frequently in interviews.

Choreography

Each service knows what to do when it receives an event. There is no central coordinator. Services react to events and emit new events, creating a chain of reactions.

Example: Order fulfillment via choreography

Step 1: Customer places order
        Order Service publishes: OrderPlaced
                                     |
Step 2: Payment Service listens to OrderPlaced
        Payment Service processes payment
        Payment Service publishes: PaymentSucceeded
                                     |
Step 3: Inventory Service listens to PaymentSucceeded
        Inventory Service reserves items
        Inventory Service publishes: InventoryReserved
                                     |
Step 4: Fulfillment Service listens to InventoryReserved
        Fulfillment Service creates shipment
        Fulfillment Service publishes: ShipmentCreated

Visual:

OrderService ---(OrderPlaced)---> PaymentService
                                       |
                                (PaymentSucceeded)
                                       |
                                       v
                               InventoryService
                                       |
                               (InventoryReserved)
                                       |
                                       v
                              FulfillmentService
                                       |
                               (ShipmentCreated)

Pros of Choreography:

  • Fully decoupled - no central dependency
  • Easy to add new services (just subscribe to the right event)
  • No single point of failure

Cons of Choreography:

  • Hard to understand the overall flow (no single place to see the whole process)
  • Difficult to track "where is order #12345 in the process?"
  • Compensating transactions (rollback) are complex
  • Debugging failures requires tracing events across multiple services

Orchestration

A central Orchestrator service (or Saga Orchestrator) knows the entire workflow and explicitly commands each service to do its part. Services just execute commands and report results.

Example: Order fulfillment via orchestration

Central OrderOrchestrator directs the entire flow:

Step 1: OrderOrchestrator --> PaymentService.charge(orderId)
        PaymentService returns: {success: true, transactionId: "txn-123"}

Step 2: OrderOrchestrator --> InventoryService.reserve(orderId, items)
        InventoryService returns: {success: true, reservationId: "res-456"}

Step 3: OrderOrchestrator --> FulfillmentService.createShipment(orderId)
        FulfillmentService returns: {success: true, trackingId: "trk-789"}

Step 4: OrderOrchestrator marks order as COMPLETED

Failure Handling with Orchestration:

Step 1: charge payment - SUCCESS
Step 2: reserve inventory - FAILURE (out of stock)

Orchestrator executes compensating transactions:
  - Refund the payment (compensate Step 1)
  - Mark order as FAILED
  - Notify customer with reason

Visual:

          OrderOrchestrator
         /       |         \
        v        v          v
  PaymentSvc InventorySvc FulfillmentSvc
  (responds)   (responds)   (responds)

Pros of Orchestration:

  • Clear visibility of the entire workflow in one place
  • Easier to implement compensating transactions (rollback)
  • Straightforward to track "where is this order in the process?"
  • Simpler debugging

Cons of Orchestration:

  • Orchestrator becomes a central dependency
  • Orchestrator must be highly available
  • Risk of logic creep - orchestrator becomes too smart
  • Can become a bottleneck

Choreography vs Orchestration - Decision Guide

ScenarioRecommendation
Simple, 2-3 step workflowChoreography
Complex, multi-step workflow with many failure modesOrchestration
Workflow spans 5+ servicesOrchestration
Need a clear audit trail of workflow stateOrchestration
Adding new participants should be zero-frictionChoreography
Compensating transactions are complexOrchestration
Team size is large and services are owned by different teamsChoreography (more autonomous)

In practice: Most real-world systems use a hybrid. Use choreography for simple event reactions and orchestration for complex, stateful, multi-step workflows that require compensation.


11. Scatter-Gather Pattern

What It Is

The Scatter-Gather pattern broadcasts a request to multiple services (scatter) and then collects and aggregates all their responses (gather) to form a final result.

Use Case Example

Building a price comparison service that needs to get quotes from multiple insurance providers simultaneously:

Request: Get insurance quotes for customer profile X

Scatter: Send request simultaneously to:
         --> InsurerA.getQuote(profile)
         --> InsurerB.getQuote(profile)
         --> InsurerC.getQuote(profile)
         --> InsurerD.getQuote(profile)

Gather: Wait for responses (with timeout), aggregate all quotes, return best price

         InsurerA: $120/month  -->
         InsurerB: $135/month  -->  Aggregator  --> [Cheapest: $115 from InsurerC]
         InsurerC: $115/month  -->
         InsurerD: timeout     -->  (ignored - partial results acceptable)

Implementation Pattern

@Service
public class InsuranceQuoteAggregator {
 
    private final KafkaTemplate<String, QuoteRequest> kafkaTemplate;
    private final ConcurrentHashMap<String, List<InsuranceQuote>> pendingRequests
        = new ConcurrentHashMap<>();
 
    public List<InsuranceQuote> getQuotes(CustomerProfile profile) {
        String requestId = UUID.randomUUID().toString();
        CountDownLatch latch = new CountDownLatch(insurerTopics.size());
        pendingRequests.put(requestId, new CopyOnWriteArrayList<>());
 
        // SCATTER - broadcast to all insurers
        QuoteRequest request = new QuoteRequest(requestId, profile);
        for (String insurerTopic : insurerTopics) {
            kafkaTemplate.send(insurerTopic, requestId, request);
        }
 
        // GATHER - wait for responses (up to 5 seconds)
        try {
            latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
 
        // Return whatever responses came back within the timeout
        return pendingRequests.remove(requestId);
    }
 
    @KafkaListener(topics = "insurance-quote-responses")
    public void handleQuoteResponse(InsuranceQuote quote) {
        List<InsuranceQuote> quotes = pendingRequests.get(quote.getRequestId());
        if (quotes != null) {
            quotes.add(quote);
        }
    }
}

12. Transactional Outbox Pattern (Preview)

The Outbox Pattern deserves its own deep-dive (covered fully in Part 4 - Advanced Concepts), but here is the concept:

The Problem It Solves

Consider this code:

@Transactional
public void placeOrder(OrderRequest request) {
    orderRepository.save(order);          // Database write
    kafkaTemplate.send("orders", event);  // Message publish
 
    // What happens if the application crashes between these two lines?
    // Or if Kafka is temporarily unavailable?
    // You get: order saved in DB, but event never published = inconsistency
}

This is the dual-write problem: you cannot atomically write to a database AND publish to a message queue in the same transaction.

The Solution (Preview)

Instead of publishing directly, write the event to an outbox table within the same database transaction:

@Transactional
public void placeOrder(OrderRequest request) {
    orderRepository.save(order);
    // Write event to outbox table in the SAME transaction
    outboxRepository.save(new OutboxEvent("order-events", eventPayload));
    // Atomic: either both succeed or both fail
}
// A separate process reads the outbox table and publishes to Kafka
// If publishing fails, it retries - the event is not lost

Full implementation and details in Part 4.


Pattern Selection Summary

PatternUse WhenKey Characteristic
Work QueueDistribute tasks across workersEach task processed by one worker
Pub/SubBroadcast events to multiple servicesEach subscriber gets all messages
Request-ReplyNeed async result from long-running opCorrelation ID links request and reply
Fan-inAggregate data from multiple sourcesMany producers, unified consumer
Dead Letter QueueHandle unprocessable messagesSafety net for failed messages
Priority QueueSome messages more urgent than othersHigher priority processed first
Claim CheckLarge payloads exceeding size limitsStore payload externally, pass reference
Message RoutingDifferent consumers for different messagesRoute by key, pattern, or attributes
ChoreographyLoosely coupled event-driven servicesServices react to events autonomously
OrchestrationComplex workflows needing central controlCoordinator directs each step
Scatter-GatherAggregate responses from multiple sourcesBroadcast + collect with timeout
OutboxAtomic DB write + message publishGuarantee no message loss

Previous: Part 1 - Fundamentals
Next: Part 3 - Technologies Deep Dive
Index: Message Queues Demystified - Index