Message Queues - Supplement 3: Trade-Offs & Decision Guide

Every architectural decision involves trade-offs. This guide makes those trade-offs explicit,
concrete, and actionable. Each section answers "when should I choose A over B?" with
criteria, comparison tables, and real-world context.

When to Use Message Queues (and When NOT to)
Where in Your Architecture to Place Message Queues
Technology Selection: Full Decision Matrix
Kafka vs RabbitMQ: Deep Comparison
Cloud-Managed vs Self-Hosted Trade-Offs
Delivery Semantics Trade-Off Table
Queue vs Topic vs Stream: The Full Decision
Partition Strategy Trade-Offs
Message Size Trade-Offs
Synchronous vs Asynchronous Decision Framework
Event-Driven vs Command-Driven Trade-Offs
Consumer Concurrency Trade-Offs
Retention and Storage Trade-Offs
Schema Evolution Strategy Trade-Offs
Full Architectural Decision Trees

1. When to Use Message Queues (and When NOT to)

Apply Message Queues When...

Scenario	Reason Message Queue is Right	Example
Work takes longer than a caller can wait	Caller gets immediate response; work continues in background	Video transcoding, report generation
Multiple downstream services need the same event	Fan-out without coupling to each downstream	Order placed → warehouse + email + analytics
Downstream service is slower than upstream	Queue absorbs traffic spikes; downstream processes at its own pace	Payment processor maxes at 500/sec; orders arrive at 2000/sec during sales
Downstream service availability affects upstream	Decouple: downstream being down does not fail the upstream call	Email service down should not fail order placement
Workload can be parallelized	Add consumers to scale processing without changing producers	Batch image resizing, log processing
Retry/reliability without caller involvement	Failed downstream processing retried automatically	Webhook delivery, external API calls
Audit trail of all events needed	Topic log provides ordered, replayable event history	Financial transactions, compliance
Services are in different teams/deployment cycles	Event contracts are stable; teams deploy independently	Microservices with different release schedules

Do NOT Use Message Queues When...

Scenario	Why Message Queue is Wrong	Better Alternative
Caller needs the response to continue	Async adds latency with no benefit; caller blocks anyway	Synchronous HTTP/gRPC
Operation must be atomic across services	Queue does not provide distributed transactions	Saga pattern (if unavoidable) or monolith
System is a simple CRUD application	Unnecessary complexity for straightforward request/response	REST API
You need to query current state	Queues are not databases; querying is wrong tool	Database query or cache
Real-time user interaction required	Queue latency is too high for sub-100ms user interactions	WebSockets, Server-Sent Events
Two services that could be the same service	Unnecessary inter-service communication	Merge into one service
Request volume is low and predictable	Queue overhead not worth it for < 10 req/sec	Direct HTTP call
All consumers must respond before proceeding	Synchronous fan-out is semantically required	Request-scatter-gather via HTTP

The Decision Question

Ask these questions in order:

1. Does the caller need the result to proceed?
   YES → Use synchronous HTTP/gRPC
   NO  → Consider async. Continue to question 2.

2. Can the operation fail and be retried later without the caller knowing?
   YES → Message queue is a good fit
   NO  → Consider: is this actually two concerns mixed together?

3. Are there multiple consumers that all need this event independently?
   YES → Message queue (pub-sub / topic model) is the right tool
   NO  → Consider: could a direct call handle this? If so, simpler is better.

4. Is the downstream significantly slower or less reliable than upstream?
   YES → Message queue provides buffering and decoupling
   NO  → Direct call might be simpler; evaluate overhead vs benefit.

5. Do you need event replay or time-travel for debugging/analytics?
   YES → Kafka (event log) is specifically designed for this
   NO  → Any queue suffices.
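
The five questions can be read as a short decision procedure. The sketch below is one possible encoding, not a definitive rule set: the class `QueueDecision`, the `Answers` fields, and the returned labels are all hypothetical names chosen for illustration.

```java
/** Hypothetical helper that walks the five questions in order. */
final class QueueDecision {

    /** Answers to the five questions; the field names are illustrative. */
    static final class Answers {
        boolean callerNeedsResult;        // question 1
        boolean retryableWithoutCaller;   // question 2
        boolean multipleConsumers;        // question 3
        boolean downstreamSlowerOrFlaky;  // question 4
        boolean needsReplay;              // question 5
    }

    static String recommend(Answers a) {
        if (a.callerNeedsResult) {
            return "synchronous HTTP/gRPC";   // question 1 ends the walk
        }
        if (a.needsReplay) {
            return "event log (Kafka)";       // question 5 narrows the technology
        }
        // Any "yes" on questions 2-4 is treated here as enough to justify a queue.
        boolean queueFits = a.retryableWithoutCaller
                || a.multipleConsumers
                || a.downstreamSlowerOrFlaky;
        return queueFits ? "message queue" : "direct call";
    }

    public static void main(String[] args) {
        Answers email = new Answers();
        email.retryableWithoutCaller = true;   // e.g. sending a confirmation email
        System.out.println(recommend(email));
    }
}
```

Treating any single "yes" on questions 2 to 4 as sufficient is a simplification; in practice the answers are weighed together.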

2. Where in Your Architecture to Place Message Queues

Placement Patterns

Pattern A: Edge Buffering (Input)

External clients (high, bursty)
         ↓
    [API Gateway]
         ↓
    Message Queue ← Buffer here
         ↓
    Internal Services (capacity-limited)

When to use: External traffic is unpredictable. Internal processing has finite capacity (database writes, payment gateway calls). You need to absorb traffic spikes without rejecting requests.

Trade-off: Adds latency for the user (job accepted but not completed). Requires async result delivery (webhook, polling, WebSocket).

Pattern B: Service Decoupling (Middle)

Service A  →  Message Queue  →  Service B
                    ↓
              Service C
                    ↓
              Service D

When to use: Service A is owned by one team; services B, C, D by other teams. Deployment independence is important. Services B, C, D can be added/removed without changing Service A.

Trade-off: Eventual consistency. Service A cannot know what state Services B/C/D are in. Debugging requires distributed tracing across the queue boundary.

Pattern C: Event Backbone (Architecture-Wide)

All domain services → publish to central event log (Kafka)
All domain services → subscribe to events they care about

Event log becomes the single source of truth for "what happened"

When to use: Event sourcing architecture. Regulatory audit requirements (every event logged). Multiple consumers per domain event. Analytics and ML training data pipeline from production events.

Trade-off: Event schema evolution is critical and difficult. All teams must coordinate on event contracts. Higher operational complexity (Schema Registry, replication, ordering policies).

Pattern D: Work Distribution (Compute Tier)

Job Submission API
       ↓
  [Job Queue]
       ↓
┌───────────────┐
│ Worker Pool   │
│ (elastic)     │
└───────────────┘

When to use: Compute-intensive tasks (ML inference, video processing, report generation). Work items are independent (can be parallelized). Need to scale workers based on queue depth.

Trade-off: Worker scaling adds latency. Job failures must be handled (DLQ, retry policies). Result delivery to the original caller requires a separate mechanism (callback URL, polling).

The "Should I Put a Queue Here?" Decision for Each Service Boundary

For each service-to-service call, evaluate:

Latency requirement:
  < 100ms → probably synchronous HTTP
  100ms–2s → either can work; queue adds resilience
  > 2s (background work) → queue almost always correct

Dependency direction:
  A depends on B for its core function → HTTP (tight coupling acceptable)
  A does something that B needs to know about → Queue (B observes A)

Failure semantics:
  "If B is down, A should fail immediately" → HTTP
  "If B is down, A should continue and B will catch up" → Queue

Traffic pattern:
  Steady, predictable → HTTP is fine
  Bursty, spiky → Queue for buffering
  Batch (overnight processing) → Queue almost always correct

3. Technology Selection: Full Decision Matrix

Requirement	RabbitMQ	Kafka	AWS SQS	Redis Streams	Google Pub/Sub	Azure Service Bus
Message ordering	Per-queue (not partitioned)	Strong (per partition)	FIFO queues only	Per stream	Per ordering key (opt-in)	Session-based
Throughput	~50K msg/sec/broker	>1M msg/sec/broker	Nearly unlimited (standard); ~3K msg/sec with batching (FIFO)	~1M msg/sec	~1M msg/sec	~1M msg/sec
Message replay	No (consumed = gone)	Yes (retention-based)	No	Limited	Limited	No
Push vs Pull	Push (broker pushes)	Pull (consumer polls)	Pull (client polls)	Pull	Push + Pull	Push + Pull
Routing flexibility	Excellent (exchanges)	Limited (topic-based)	None	None	Attribute filtering	Topic/subscription
Exactly-once	Not natively	Yes (transactions, Kafka-to-Kafka)	FIFO queues only (deduplication window)	No	Yes (per region)	Via duplicate detection
Self-hosted	Yes	Yes	No (AWS only)	Yes	No (GCP only)	No (Azure only)
Managed service	CloudAMQP, AmazonMQ	Confluent, MSK, Aiven	Native	Redis Cloud	Native	Native
Operational complexity	Medium	High	Very Low	Low	Very Low	Very Low
Best for	Task queues, routing	Event log, analytics, stream processing	Simple cloud queuing	Simple streams, low infra	Cloud-native event bus	Enterprise .NET/Java

4. Kafka vs RabbitMQ: Deep Comparison

This is the most commonly debated technology choice. Here is the complete picture:

Architecture Philosophy Difference

RabbitMQ: Smart broker, dumb consumers
  - Broker maintains subscriber state
  - Broker routes messages to consumers
  - Broker tracks what each consumer has received
  - Consumers are stateless receivers
  - Message consumed = removed from queue

Kafka: Dumb broker, smart consumers
  - Broker just stores messages in an ordered log
  - Consumers track their own position (offset)
  - Consumers pull messages at their own pace
  - Consumers can re-read messages at any time
  - Messages persist until retention policy expires

This philosophical difference drives ALL the other trade-offs.

Use Case Decision Table

Use Case	Choose RabbitMQ	Choose Kafka
Task queue (each task done once)	✅ Designed for this	⚠️ Works but overkill
Job scheduling with priority	✅ Native priority queues	❌ No native priority
Complex routing (by content/headers)	✅ Topic/Header exchanges	❌ Limited routing
Request-Reply pattern	✅ Reply-to header support	⚠️ Possible but complex
Event sourcing / audit log	❌ Messages not retained	✅ Designed for this
Stream processing (Kafka Streams)	❌ No stream processing	✅ Kafka Streams, ksqlDB
Replay past events for new service	❌ Impossible after consume	✅ Replay from any offset
Analytics data pipeline	❌ Not designed for volume	✅ Built for petabyte scale
IoT data ingestion (millions/sec)	❌ Throughput insufficient	✅ Handles millions/sec
Simple microservice messaging	✅ Easy to set up	⚠️ Complex for simple use
Multiple consumers per message	✅ Fan-out exchanges	✅ Consumer groups
Exactly-once across services	❌ No native support	⚠️ Kafka transactions (Kafka-to-Kafka pipelines only)
Message TTL / expiry	✅ Native support	⚠️ Via retention policies
Team learning curve	✅ Easier	❌ Steeper curve
Operational simplicity	✅ Simpler	❌ More complex (ZK/KRaft)

Performance Trade-Off

RabbitMQ:
  Write latency: 1-5ms (very low)
  Throughput: 50,000-100,000 messages/sec per broker
  Memory footprint: Higher (tracks per-consumer state)
  Disk usage: Lower (messages deleted after consumption)
  Good for: < 1 million messages/day, latency-sensitive tasks

Kafka:
  Write latency: 2-10ms (tunable: lower with acks=1, higher with acks=all)
  Throughput: 1,000,000+ messages/sec per broker
  Memory footprint: Lower on the heap (broker keeps no per-message delivery state; consumers track offsets, and reads lean on the OS page cache)
  Disk usage: Higher (messages retained, not deleted)
  Good for: Billions of messages/day, analytics pipelines, event sourcing

When Teams Pick the Wrong One

Common mistake 1: Using Kafka for a simple task queue
  Symptom: "We have Kafka but only 3 consumers and 1000 events/day"
  Problem: Kafka's complexity (partitions, replication, schema registry) adds
           operational overhead for no benefit at this scale
  Solution: RabbitMQ or even SQS would have been simpler and sufficient

Common mistake 2: Using RabbitMQ for event sourcing
  Symptom: "We need to replay last month's events for our new analytics service"
  Problem: RabbitMQ messages are gone after consumption - cannot replay
  Solution: Kafka (or rebuild from DB with Debezium CDC)

Common mistake 3: Using Kafka for complex routing
  Symptom: "We want to route messages based on 5 header attributes to different queues"
  Problem: Kafka has no native routing - must be done in application code
  Solution: RabbitMQ's Topic or Headers exchange handles this natively

5. Cloud-Managed vs Self-Hosted Trade-Offs

Trade-Off Matrix

Dimension	Self-Hosted (Kafka/RabbitMQ)	Managed (Confluent/MSK/SQS/AmazonMQ)
Upfront cost	Low ($)	None
Ongoing cost	Medium (infra + ops time)	High (per-message + per-hour pricing)
Cost at scale	Lower (EC2 is cheaper than managed)	Can be very expensive at petabyte scale
Operational burden	High (upgrades, disk, replication)	Very Low (vendor handles)
Feature availability	All features, any version	Vendor-supported versions only (may lag upstream releases)
Customization	Full control	Limited to vendor's config options
Vendor lock-in	None (open source)	Medium-High (proprietary APIs/features)
Time to production	Weeks (setup, hardening)	Hours
Disaster recovery	Self-managed (MirrorMaker, backups)	Vendor-provided (often across AZs)
Compliance (FedRAMP, etc.)	You certify	Vendor often pre-certified
SLA	You build it	Vendor provides (99.9–99.99%)

Cost Breakdown Examples

Scenario: 100 million messages/day, 1KB each = 100GB/day

AWS MSK (managed Kafka):
  Broker cost: 3 × m5.xlarge = $0.192/hr × 3 × 720hr = ~$415/month
  Storage: 100GB/day × 7 day retention = 700GB × $0.10/GB = $70/month
  Data transfer: 100GB/day × 30 days × $0.02/GB = $60/month
  Total: ~$545/month

Self-hosted Kafka (AWS EC2):
  Broker cost: 3 × m5.xlarge = $415/month
  EBS storage: 700GB × $0.10 = $70/month
  PLUS operations time: 0.5 FTE × $150K/year = ~$6,250/month
  Total: ~$6,735/month  ← HIGHER despite cheaper infrastructure!

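AWS SQS (managed queue):
  Monthly volume: 100 million messages/day × 30 days = 3,000 million messages
  Requests, unbatched: send + receive + delete = 3 per message = 9,000 million requests
  9,000 million requests × $0.40/million = ~$3,600/month
  With batches of 10 on all three calls: 900 million requests × $0.40/million = ~$360/month
  Total: ~$360–$3,600/month depending on batching (no brokers to run, no fixed cost)

Confluent Cloud (managed Kafka):
  Billed on throughput, storage, and cluster capacity rather than a flat per-message rate
  Price this scenario from ~3TB/month ingress (plus egress) and 700GB retained storage
  (Confluent pricing is expensive at scale but rich in features)

Lesson: For simple queue use cases at low volume, SQS is dramatically cheaper because there is no fixed broker cost; at this volume, per-request charges make batching essential.
For event log/analytics at medium scale, MSK balances cost and operations.
Self-hosted only wins at petabyte scale where infrastructure cost dominates ops cost.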
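
The MSK and self-hosted totals above are plain arithmetic, so they are easy to re-run with your own prices. The following is a minimal sketch that reproduces those two figures; the class and method names are hypothetical, and the rates are the illustrative ones used in this section, not current list prices.

```java
/** Re-computes the MSK and self-hosted figures from the scenario above. */
final class QueueCostSketch {
    static final double HOURS_PER_MONTH = 720.0;

    static double brokerCost(int brokers, double hourlyRate) {
        return brokers * hourlyRate * HOURS_PER_MONTH;
    }

    static double storageCost(double gbPerDay, int retentionDays, double pricePerGb) {
        return gbPerDay * retentionDays * pricePerGb;
    }

    static double transferCost(double gbPerDay, int daysPerMonth, double pricePerGb) {
        return gbPerDay * daysPerMonth * pricePerGb;
    }

    static double opsCost(double fte, double annualSalary) {
        return fte * annualSalary / 12.0;
    }

    static double mskTotal() {
        return brokerCost(3, 0.192) + storageCost(100, 7, 0.10) + transferCost(100, 30, 0.02);
    }

    static double selfHostedTotal() {
        return brokerCost(3, 0.192) + storageCost(100, 7, 0.10) + opsCost(0.5, 150_000);
    }

    public static void main(String[] args) {
        System.out.printf("MSK:         $%.2f/month%n", mskTotal());
        System.out.printf("Self-hosted: $%.2f/month%n", selfHostedTotal());
    }
}
```

The operations term dominates the self-hosted total, which is why the conclusion is sensitive to how much engineer time the cluster really consumes.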

6. Delivery Semantics Trade-Off Table

The Three Guarantees Explained

Semantic	Message Delivered	Duplicates Possible	Data Loss Possible	Default For	Suitable For
At-most-once	0 or 1 times	No	Yes	Analytics, metrics	Sensor readings, telemetry, non-critical events
At-least-once	1 or more times	Yes	No	All major brokers	Orders, notifications, emails (with idempotent consumer)
Exactly-once	Exactly 1 time	No	No	Nowhere (requires special setup)	Financial transactions, billing, inventory deductions

Implementation Cost of Each Semantic

At-most-once:
  Producer: Fire and forget (no acknowledgment wait)
  Consumer: Auto-commit before processing
  Extra infrastructure: None
  Relative cost: 1× (baseline)
  Throughput impact: Highest (no waiting for ACKs)

At-least-once:
  Producer: acks=1 or acks=all + retries enabled
  Consumer: Manual ack after successful processing
  Extra infrastructure: DLQ for failed messages
  Relative cost: 2× (acknowledgment overhead)
  Throughput impact: Medium

Exactly-once (Kafka):
  Producer: enable.idempotence=true + transactional.id set + acks=all
  Consumer: isolation.level=read_committed + transactional processing
  Broker: Kafka transactions enabled, min.insync.replicas=2
  Consumer: Idempotent processing logic (DB deduplication) as backup
  Extra infrastructure: Kafka transactions, Schema Registry, transactional DB
  Relative cost: 5–10× (transaction coordination overhead)
  Throughput impact: Significant reduction (20-50% vs at-least-once)
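
As a rough illustration of the Kafka exactly-once settings listed above, the sketch below collects them into plain `java.util.Properties` objects. The class name and the `transactional.id` value are hypothetical; `enable.auto.commit=false` is an addition not listed above, included on the assumption that offsets are committed through the producer transaction. `min.insync.replicas` is a topic or broker setting and is therefore not shown.

```java
import java.util.Properties;

/** Collects the client-side settings for Kafka exactly-once into Properties objects. */
final class ExactlyOnceConfigSketch {

    static Properties producerProps(String transactionalId) {
        Properties p = new Properties();
        p.setProperty("enable.idempotence", "true");        // broker de-duplicates producer retries
        p.setProperty("transactional.id", transactionalId); // must be stable per producer instance
        p.setProperty("acks", "all");                       // wait for all in-sync replicas
        return p;
    }

    static Properties consumerProps() {
        Properties p = new Properties();
        p.setProperty("isolation.level", "read_committed"); // hide records from aborted transactions
        p.setProperty("enable.auto.commit", "false");       // assumption: offsets go through the transaction
        return p;
    }

    public static void main(String[] args) {
        // "order-processor-1" is a placeholder transactional id.
        System.out.println(producerProps("order-processor-1"));
        System.out.println(consumerProps());
    }
}
```

These properties would be merged with the usual connection and serializer settings before being passed to the Kafka clients.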

When Exactly-Once is NOT Worth It

Myth: "We need exactly-once delivery for financial transactions"
Reality: "We need exactly-once PROCESSING, which is different"

Exactly-once delivery is a property of the MESSAGE TRANSPORT.
Exactly-once processing is achieved by making your CONSUMER idempotent.

The idempotent consumer approach:
  - Uses at-least-once delivery (simpler, faster, more available)
  - Consumer records processed event IDs in a DB with unique constraint
  - If duplicate arrives: DB unique constraint prevents re-processing
  - Net result: exactly-once PROCESSING without exactly-once DELIVERY

This is simpler, cheaper, and has better availability than Kafka transactions.

Kafka transactions are appropriate when:
  - You consume from one Kafka topic AND produce to another Kafka topic
  - Both operations must succeed or fail together atomically
  - Example: Read order event → compute → write to analytics topic (atomic)

Kafka transactions are NOT appropriate for:
  - Consumer → Database (use DB transactions + idempotency key instead)
  - Consumer → External API (external APIs are not part of Kafka's transaction boundary)
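
The idempotent consumer approach described above can be sketched in a few lines. This is an illustration under stated assumptions: an in-memory set stands in for the database table with a unique constraint on the event ID, and the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** At-least-once delivery plus de-duplication gives exactly-once processing. */
final class IdempotentConsumerSketch {
    // Stand-in for a DB table with a unique constraint on event_id.
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger sideEffects = new AtomicInteger();

    /** Returns true if the event was processed, false if it was a duplicate. */
    boolean handle(String eventId) {
        // add() returns false when the ID is already present,
        // which plays the role of a unique-constraint violation.
        if (!processedEventIds.add(eventId)) {
            return false;
        }
        sideEffects.incrementAndGet();   // the actual business logic would run here
        return true;
    }

    int sideEffectCount() {
        return sideEffects.get();
    }

    public static void main(String[] args) {
        IdempotentConsumerSketch consumer = new IdempotentConsumerSketch();
        consumer.handle("evt-1");
        consumer.handle("evt-1");   // redelivery of the same event
        System.out.println(consumer.sideEffectCount());
    }
}
```

With a real database, recording the event ID and applying the business change should happen in the same DB transaction; otherwise a crash between the two steps can still cause a lost or repeated effect.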

7. Queue vs Topic vs Stream: The Full Decision

Definitions Clarified

Queue:
  - One message → consumed by exactly ONE consumer
  - Message is deleted after consumption
  - Consumer position tracked by broker
  - Examples: RabbitMQ queue, AWS SQS, Azure Service Bus queue

Topic (Message Bus):
  - One message → consumed by ALL subscribed consumers
  - Each subscriber gets its own copy
  - Message retained for subscription duration
  - Examples: AWS SNS, Google Pub/Sub, Kafka topics (with multiple consumer groups)

Stream (Event Log):
  - Ordered, persistent log of records
  - Multiple consumers can read the same records at different positions
  - Records not deleted after consumption (retention-based)
  - Consumers manage their own position (offset)
  - Examples: Kafka, AWS Kinesis, Redis Streams

Decision Criteria

Use a Queue when:
  ✓ Each piece of work should be done by exactly one worker
  ✓ Work items are independent of each other
  ✓ You want automatic load distribution across workers
  ✓ Messages do not need to be replayed
  ✓ Example: Email sending queue, job processing queue

Use a Topic (Pub-Sub) when:
  ✓ Multiple services all need to know when something happened
  ✓ Publishers should not need to know who the subscribers are
  ✓ Adding a new consumer does not change the producer
  ✓ Subscribers have different interests/filters on the same event stream
  ✓ Example: OrderPlaced event → (warehouse, email, analytics all subscribe)

Use a Stream (Event Log) when:
  ✓ You need to replay past events (new service joining, bug fix replay)
  ✓ Events need to be processed in strict order
  ✓ Multiple independent consumers at different positions in time
  ✓ Long-term retention for compliance or analytics
  ✓ You want to "rewind" and re-process history
  ✓ Example: Financial transaction log, change data capture, user activity feed

When a Queue AND a Topic Together Make Sense

Pattern: SNS → SQS Fan-Out (AWS)
  SNS Topic (guarantees fan-out to all queues)
       |
  ┌────┴────┬────────────┐
  ↓         ↓            ↓
SQS Queue  SQS Queue   SQS Queue
(warehouse) (email)     (analytics)

Each service gets its own queue:
  - Slow analytics consumer does not block warehouse consumer
  - Each queue has its own DLQ
  - Each queue has its own retry policy
  - Adding a new consumer = add a new SQS queue + subscribe to SNS

This is the right pattern for fan-out with independent consumers.
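
The fan-out behaviour can be modelled in memory to show why one slow consumer does not hold back the others. The sketch below does not use the AWS SDK; `FanOutSketch` and its methods are hypothetical stand-ins for an SNS topic and its subscribed SQS queues.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** In-memory model of a topic that copies each message into one queue per subscriber. */
final class FanOutSketch {
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();

    /** Adding a consumer means adding a queue; the publisher is unchanged. */
    void subscribe(String consumerName) {
        queues.putIfAbsent(consumerName, new ArrayDeque<>());
    }

    /** Every subscribed queue receives its own copy of the message. */
    void publish(String message) {
        for (Queue<String> queue : queues.values()) {
            queue.add(message);
        }
    }

    /** Returns the next message for this consumer, or null if its queue is empty. */
    String poll(String consumerName) {
        Queue<String> queue = queues.get(consumerName);
        return queue == null ? null : queue.poll();
    }

    int depth(String consumerName) {
        Queue<String> queue = queues.get(consumerName);
        return queue == null ? 0 : queue.size();
    }

    public static void main(String[] args) {
        FanOutSketch topic = new FanOutSketch();
        topic.subscribe("warehouse");
        topic.subscribe("analytics");
        topic.publish("OrderPlaced:42");
        System.out.println(topic.poll("warehouse"));   // warehouse drains its own copy
        System.out.println(topic.depth("analytics"));  // analytics backlog is unaffected
    }
}
```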

8. Partition Strategy Trade-Offs

Choosing a Partition Key

Partition Key Choice	Ordering Guarantee	Distribution	Hotspot Risk	Use When
Entity ID (orderId, userId)	All events for same entity ordered	Even (if IDs are random/UUID)	Low	Entity-level ordering required (most common)
User ID	All events for same user ordered	Even	Low for UUID; High if Pareto distribution	User activity streams
Tenant ID	All events for same tenant ordered	Can be hot for large tenants	High if tenant sizes vary greatly	Multi-tenant with equal tenants
Geographic region	Regional ordering	May be uneven	Possible (dense regions)	Regional data sovereignty
Custom hash bucket	Within-bucket ordering	Controlled	Low (by design)	Known hot entities needing distribution
Round-robin (null key)	No ordering guarantee	Perfect	None	Pure throughput, no ordering needed
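
To make the "custom hash bucket" row concrete, here is a minimal sketch of spreading one known hot entity across a fixed number of sub-keys while keeping ordering within each bucket. All names are hypothetical, and `partitionFor` is a simplified stand-in: Kafka's default partitioner hashes the serialized key with murmur2 rather than `String.hashCode()`.

```java
/** Spreads a hot entity over several keys; ordering holds within each bucket only. */
final class BucketedKeySketch {

    /**
     * Builds a key such as "tenant-big#3". Events that share the same
     * sub-entity (for example one order) always land in the same bucket.
     */
    static String bucketedKey(String hotEntityId, String subEntityId, int buckets) {
        int bucket = Math.floorMod(subEntityId.hashCode(), buckets);
        return hotEntityId + "#" + bucket;
    }

    /** Simplified stand-in for the producer's key-to-partition mapping. */
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    public static void main(String[] args) {
        String key = bucketedKey("tenant-big", "order-123", 8);
        System.out.println(key + " -> partition " + partitionFor(key, 12));
    }
}
```

The bucket count is a trade-off: more buckets spread load further but narrow the scope within which ordering is preserved.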

How Many Partitions?

Under-partitioning (too few):
  Problem: Bottleneck throughput at peak load
  Problem: Cannot add more consumers than partitions
  Problem: One slow message blocks all messages on that partition
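  Rule: If unsure, start with more partitions (the count can be raised later but never lowered, and raising it changes the key-to-partition mapping, which breaks per-key ordering for existing keys)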

Over-partitioning (too many):
  Problem: More partitions = more ZooKeeper/controller load
  Problem: Leader election takes longer with more partitions
  Problem: Consumer rebalancing takes longer
  Problem: Replication overhead increases linearly with partition count
  Rule: Do not create 10,000 partitions for a low-volume topic

Partition count formula:
  target_throughput_mb_per_sec / throughput_per_partition_mb_per_sec

  Where throughput_per_partition:
    For producer: ~10–40 MB/sec per partition (varies by hardware)
    For consumer: ~15–50 MB/sec per partition (depends on processing speed)
    Conservative estimate: 10 MB/sec per partition

  Example:
    Target: 500 MB/sec peak
    Per partition: 10 MB/sec (conservative)
    Partitions needed: 50
    Add 20% headroom: 60 partitions
    Round to even number: 60 (or 64 for power-of-2 if custom partitioner used)

Practical defaults:
  Low-traffic topics (< 10 MB/sec): 3–6 partitions
  Medium-traffic topics (10–100 MB/sec): 12–24 partitions
  High-traffic topics (> 100 MB/sec): 48–120 partitions
  Rethink if > 200 partitions: consider separate cluster
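
The partition count formula and its worked example can be written as a small function. This is a sketch with hypothetical names; it uses integer arithmetic and rounds up at each step, which is an assumption about how to treat fractional results.

```java
/** Partition count = ceil(target / per-partition throughput), plus headroom. */
final class PartitionCountSketch {

    private static int ceilDiv(int a, int b) {
        return (a + b - 1) / b;
    }

    /**
     * @param targetMbPerSec       peak throughput the topic must sustain
     * @param perPartitionMbPerSec conservative throughput of one partition
     * @param headroomPercent      extra capacity, e.g. 20 for 20%
     */
    static int partitionsFor(int targetMbPerSec, int perPartitionMbPerSec, int headroomPercent) {
        int base = ceilDiv(targetMbPerSec, perPartitionMbPerSec);
        return ceilDiv(base * (100 + headroomPercent), 100);
    }

    public static void main(String[] args) {
        // The example from the text: 500 MB/sec at 10 MB/sec per partition, 20% headroom.
        System.out.println(partitionsFor(500, 10, 20));
    }
}
```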

Replication Factor Trade-Off

replication.factor=1:
  Cost: Lowest (1× storage)
  Durability: NONE - single broker failure = data loss
  Use for: Development only. Never in production.

replication.factor=2:
  Cost: 2× storage
  Durability: Tolerates 1 broker failure
  Problem: With min.insync.replicas=2, acks=all writes fail whenever either replica is unavailable
  Use for: Low-criticality topics where availability > durability

replication.factor=3 (recommended for production):
  Cost: 3× storage
  Durability: Tolerates 1 broker failure with no data loss
  Availability: Writes succeed as long as 2/3 brokers available
  Use for: All production data (standard practice)

replication.factor=3 + min.insync.replicas=2 is the production-standard combination
  - Tolerates 1 broker failure with zero data loss
  - Remains writable with 2/3 brokers available
  - Refuses writes (safely) when only 1 broker available (better than silent data loss)
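
The availability rules above reduce to one comparison. The sketch below is a simplified model, assuming producers use acks=all and that every unavailable broker hosts a replica of the partition in question; the class and method names are hypothetical.

```java
/** Models whether an acks=all write can succeed for a given replica layout. */
final class ReplicationSketch {

    static boolean writable(int replicationFactor, int minInsyncReplicas, int brokersDown) {
        int inSyncReplicas = Math.max(0, replicationFactor - brokersDown);
        return inSyncReplicas >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(writable(3, 2, 1));  // 2 of 3 replicas left: still writable
        System.out.println(writable(3, 2, 2));  // 1 replica left: writes are refused
        System.out.println(writable(2, 2, 1));  // RF=2 cannot lose any replica
    }
}
```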

9. Message Size Trade-Offs

Message Size Spectrum

Message Size	Approach	Pros	Cons	Example
< 1KB	Include all data inline	Simple, atomic, no external dependency	Fine for most cases	Order status change event
1–100KB	Include inline OR use Claim Check	Inline: simpler; Claim Check: avoids broker overhead	Claim Check adds S3 latency	Product catalog update with images
100KB–1MB	Claim Check (S3/blob reference)	Keeps broker fast and cheap	Extra S3 latency, S3 availability dependency	Large report generation result
> 1MB	Must use Claim Check	Kafka max 1MB default (configurable to 10MB)	S3 becomes critical path	Video upload notification, large datasets

Kafka message.max.bytes Trade-Off

message.max.bytes=1MB (default):
  Benefit: Predictable performance, small messages flow fast
  Limit: Applications must handle RecordTooLargeException

message.max.bytes=10MB:
  Benefit: Larger payloads without external storage
  Cost: 10x memory per in-flight message
  Cost: Network bandwidth spikes from large messages
  Cost: Slower throughput (large messages fill batches faster)
  Use only when: Infrequent large messages AND throughput is not critical

message.max.bytes=50MB+:
  Almost always wrong choice for a message broker
  Use the Claim Check pattern instead:
    - Store payload in S3/GCS/Azure Blob
    - Put reference URL in message (< 1KB)
    - Consumer fetches from S3 when needed
    - S3 is optimized for large object storage; Kafka is not
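
The Claim Check steps above can be sketched as follows. An in-memory map stands in for S3/GCS/Azure Blob, and `ClaimCheckSketch`, `BlobStore`, `Envelope`, the `blob://` reference scheme, and the 100KB threshold are all hypothetical choices for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Small payloads travel inline; large ones are stored externally and referenced. */
final class ClaimCheckSketch {
    static final int INLINE_LIMIT_BYTES = 100 * 1024;   // assumed threshold

    interface BlobStore {
        String put(byte[] payload);      // returns a reference to the stored payload
        byte[] get(String reference);
    }

    /** Stand-in for an object store such as S3. */
    static final class InMemoryBlobStore implements BlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();

        public String put(byte[] payload) {
            String reference = "blob://" + UUID.randomUUID();
            blobs.put(reference, payload.clone());
            return reference;
        }

        public byte[] get(String reference) {
            byte[] payload = blobs.get(reference);
            if (payload == null) {
                throw new IllegalArgumentException("unknown reference: " + reference);
            }
            return payload.clone();
        }
    }

    /** What actually goes on the broker: either the payload or a reference to it. */
    static final class Envelope {
        final byte[] inlinePayload;   // null when the payload is stored externally
        final String reference;       // null when the payload is inline

        Envelope(byte[] inlinePayload, String reference) {
            this.inlinePayload = inlinePayload;
            this.reference = reference;
        }

        boolean isClaimCheck() {
            return reference != null;
        }
    }

    static Envelope wrap(byte[] payload, BlobStore store) {
        if (payload.length <= INLINE_LIMIT_BYTES) {
            return new Envelope(payload, null);
        }
        return new Envelope(null, store.put(payload));
    }

    static byte[] unwrap(Envelope envelope, BlobStore store) {
        return envelope.isClaimCheck() ? store.get(envelope.reference) : envelope.inlinePayload;
    }

    public static void main(String[] args) {
        BlobStore store = new InMemoryBlobStore();
        Envelope envelope = wrap(new byte[200 * 1024], store);
        System.out.println(envelope.isClaimCheck() + " " + unwrap(envelope, store).length);
    }
}
```

The producer must finish the upload before publishing the envelope, so that a consumer never receives a reference to an object that does not exist yet.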

10. Synchronous vs Asynchronous Decision Framework

The Full Trade-Off Comparison

Dimension	Synchronous (HTTP/gRPC)	Asynchronous (Message Queue)
Latency	Low (direct connection)	Higher (queue overhead: 2–20ms)
Coupling	Tight (caller knows callee)	Loose (caller knows event, not consumer)
Availability	Caller fails if callee is down	Caller succeeds even if consumer is down
Consistency	Easier (immediate feedback)	Harder (eventual consistency)
Error handling	Immediate error feedback	Delayed (may discover failure much later)
Backpressure	Natural (caller blocked)	Explicit (queue depth, consumer lag)
Observability	Simple (request-response trace)	Complex (distributed, multi-hop)
Scaling	Scale caller and callee together	Scale independently
Testing	Simple (mock the HTTP call)	Complex (requires broker in test)
Operational complexity	Low	High (broker management, DLQ, replay)
Result delivery	Immediate in response	Webhook, polling, or WebSocket needed

Decision Framework by Use Case

DEFINITELY synchronous (do not use message queue):
  - User authentication (must know result immediately)
  - Inventory availability check before adding to cart
  - Payment authorization result (user waiting at checkout)
  - Search queries (user waiting for results)
  - Any user interaction requiring sub-500ms response

DEFINITELY asynchronous (message queue):
  - Sending confirmation emails (not in the critical user path)
  - Updating analytics dashboards (minutes of delay acceptable)
  - Generating invoices/reports (can take seconds to minutes)
  - Syncing data to external systems (downstream may be slow)
  - Processing uploaded files (video, large CSVs)
  - Any work item where "accepted, processing" is an acceptable response

DEPENDS (evaluate with the 5 questions from Section 1):
  - Inventory reservation (after purchase)
    → Async with Saga for reservation confirmation
  - Payment processing notification to warehouse
    → Async (warehouse does not need immediate notification)
  - Fraud check during payment
    → Synchronous (must know result before accepting payment)
    → But long-running fraud analysis: async webhook callback

The Hybrid Pattern: Sync Accept, Async Process

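// Best of both worlds for long-running operations.
// Order, OrderRequest, OrderAcceptedResponse, OrderPlacedEvent, OrderValidator, and
// OrderRepository are application types assumed to be defined elsewhere;
// order.getId() is assumed to return a String.
import org.springframework.http.ResponseEntity;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {
    private final OrderValidator orderValidator;
    private final OrderRepository orderRepository;
    private final KafkaTemplate<String, OrderPlacedEvent> kafkaTemplate;

    public OrderController(OrderValidator orderValidator, OrderRepository orderRepository,
                           KafkaTemplate<String, OrderPlacedEvent> kafkaTemplate) {
        this.orderValidator = orderValidator;
        this.orderRepository = orderRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @PostMapping("/orders")
    public ResponseEntity<OrderAcceptedResponse> placeOrder(@RequestBody OrderRequest request) {
        // STEP 1: Validate synchronously (fast, < 50ms)
        orderValidator.validate(request);  // Throws if invalid - caller gets 400 immediately

        // STEP 2: Create order record synchronously (gives user an ID)
        Order order = orderRepository.save(Order.create(request));  // 10ms

        // STEP 3: Publish event asynchronously; send() returns a future and does not wait for the broker ack.
        // The save and the send are separate writes: a crash between them loses the event
        // unless a transactional outbox (or similar) is used.
        kafkaTemplate.send("order-events", order.getId(), OrderPlacedEvent.from(order));

        // Total response time: ~60ms
        // Return immediately with order ID and tracking URL; user checks status asynchronously
        return ResponseEntity.accepted().body(
            OrderAcceptedResponse.builder()
                .orderId(order.getId())
                .status("PROCESSING")
                .trackingUrl("/orders/" + order.getId())
                .estimatedCompletionSeconds(30)
                .build()
        );
    }
}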

11. Event-Driven vs Command-Driven Trade-Offs

The Semantic Difference

Command (Imperative): "Do this specific thing"
  - Sent to a specific recipient
  - Recipient is expected to act
  - Sender has intent
  - Example: "ProcessPayment", "SendEmail", "UpdateInventory"
  - Single consumer (one service owns the action)

Event (Declarative): "This thing happened"
  - Published to whoever is interested
  - Publisher does not know or care who reacts
  - Publisher has no intent toward consumers
  - Example: "OrderPlaced", "PaymentProcessed", "InventoryDepleted"
  - Multiple consumers (anyone can react)

Trade-Off Comparison

Dimension	Command-Driven	Event-Driven
Coupling	Medium (knows who does the work)	Low (does not know who reacts)
Responsibility clarity	High (clear ownership)	Lower (any service can add consumers)
Unintended side effects	Low (explicit intent)	Higher (new consumers can be added silently)
Choreography	Harder (explicit chains)	Easier (each service reacts independently)
Debugging	Easier (follow command chain)	Harder (who consumed this event?)
Adding new consumers	Requires changing publisher	No publisher changes needed
Removing consumers	Requires changing publisher	Just stop subscribing
Saga implementation	Orchestration (explicit)	Choreography (implicit)

When to Use Each

Command-Driven is better when:
  - Exactly ONE service should handle the operation
  - Failure must be reported back to the publisher
  - You need clear ownership and accountability
  - The operation is a business operation with side effects
  - Example: payment-service.command → ProcessPayment (one owner)

Event-Driven is better when:
  - Multiple services have independent reactions to the same fact
  - You want to add consumers without changing the publisher
  - The publisher should not know what reactions occur
  - Example: OrderPlaced → warehouse + email + analytics (all react independently)

Mixed approach (most real systems):
  - External input to your domain: Commands (explicit, owned)
  - Internal domain state changes published outward: Events (observable)

  OrderService receives PlaceOrderCommand (command from API layer)
  OrderService publishes OrderPlaced event (event for all observers)
  WarehouseService subscribes to OrderPlaced (observes, acts independently)
  PaymentService subscribes to OrderPlaced (observes, acts independently)
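
The mixed approach can be illustrated with a toy in-process bus that enforces the semantic difference: a command has exactly one handler, an event has any number of subscribers. The class and method names below are hypothetical and no broker is involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Commands go to one owner; events go to whoever subscribed. */
final class CommandVsEventSketch {
    private final Map<String, Consumer<String>> commandHandlers = new HashMap<>();
    private final Map<String, List<Consumer<String>>> eventSubscribers = new HashMap<>();

    /** A command has a single owner, so a second registration is rejected. */
    void registerCommandHandler(String command, Consumer<String> handler) {
        if (commandHandlers.putIfAbsent(command, handler) != null) {
            throw new IllegalStateException(command + " already has an owner");
        }
    }

    /** The sender expects someone to act, so a missing handler is an error. */
    void send(String command, String payload) {
        Consumer<String> handler = commandHandlers.get(command);
        if (handler == null) {
            throw new IllegalStateException("no handler for " + command);
        }
        handler.accept(payload);
    }

    void subscribe(String event, Consumer<String> subscriber) {
        eventSubscribers.computeIfAbsent(event, k -> new ArrayList<>()).add(subscriber);
    }

    /** The publisher does not care who reacts; zero subscribers is fine. */
    int publish(String event, String payload) {
        List<Consumer<String>> subscribers =
                eventSubscribers.getOrDefault(event, Collections.emptyList());
        for (Consumer<String> subscriber : subscribers) {
            subscriber.accept(payload);
        }
        return subscribers.size();
    }

    public static void main(String[] args) {
        CommandVsEventSketch bus = new CommandVsEventSketch();
        bus.subscribe("OrderPlaced", id -> System.out.println("warehouse saw " + id));
        bus.subscribe("OrderPlaced", id -> System.out.println("payment saw " + id));
        // The order service owns PlaceOrderCommand and publishes the resulting event.
        bus.registerCommandHandler("PlaceOrderCommand", id -> bus.publish("OrderPlaced", id));
        bus.send("PlaceOrderCommand", "order-123");
    }
}
```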

12. Consumer Concurrency Trade-Offs

Concurrency Models

Model 1: One consumer thread per partition (Kafka default)
  - Thread A handles Partition 0
  - Thread B handles Partition 1
  - Thread C handles Partition 2

  Pro: Simple, ordering guaranteed within thread
  Con: Concurrency limited by partition count
  Use when: Order matters within a partition (most financial use cases)

Model 2: Async processing pool (multiple threads per partition)
  - Consumer thread polls Partition 0
  - Dispatches to thread pool (10 threads)
  - Thread pool processes concurrently

  Pro: Higher throughput per partition (10x)
  Con: Ordering not guaranteed (multiple threads racing on same partition)
  Con: Offset management is complex (cannot commit until all in-flight done)
  Use when: Processing is IO-bound and order does not matter (analytics, logging)

Model 3: Virtual threads (Java 21+)
  - Each message processed on a virtual thread
  - Virtual threads are lightweight (millions possible)
  - Blocking IO does not waste OS threads

  Pro: Extremely high concurrency for IO-bound processing
  Con: Requires Java 21+
  Con: Still requires careful ordering management
  Use when: High-concurrency, IO-bound processing on Java 21+

Concurrency vs Ordering Trade-Off

Within a single partition, high concurrency + strict ordering = impossible
You must choose:

Option A: Strict ordering (lower concurrency)
  - Single-threaded per partition
  - Kafka: 1 consumer instance per partition
  - Throughput: limited by single-thread processing speed
  - Use for: Financial state machines, order processing with dependencies

Option B: High concurrency (no ordering guarantee)
  - Multiple threads or async processing per partition
  - Throughput: scales with thread pool size
  - Use for: Independent events (each can be processed in any order)
  - Use for: Analytics, logging, notification sending (order does not matter)

Option C: Partial ordering (partition-level ordering, parallel across partitions)
  - One thread per partition (parallel partitions, sequential within partition)
  - Throughput: scales with partition count
  - Use for: Per-entity ordering (all order-123 events in order, but independent of order-456)
  - This is the most common correct approach for most systems
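
Option C can be sketched without a broker: one single-threaded lane per partition, with the key deciding the lane. The class below is an illustrative model, not a Kafka consumer; its name is hypothetical, and `String.hashCode()` stands in for the real partitioner.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Sequential within a lane, parallel across lanes. */
final class PartitionLanes {
    private final ExecutorService[] lanes;

    PartitionLanes(int partitions) {
        lanes = new ExecutorService[partitions];
        for (int i = 0; i < partitions; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();   // one thread per partition
        }
    }

    /** Tasks with the same key always run on the same lane, in submission order. */
    Future<?> submit(String key, Runnable task) {
        int lane = Math.floorMod(key.hashCode(), lanes.length);
        return lanes[lane].submit(task);
    }

    /** Stops accepting work and waits briefly for queued tasks to finish. */
    void shutdownAndWait() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            for (ExecutorService lane : lanes) {
                lane.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PartitionLanes lanes = new PartitionLanes(4);
        for (int i = 0; i < 3; i++) {
            final int step = i;
            lanes.submit("order-123", () -> System.out.println("order-123 step " + step));
            lanes.submit("order-456", () -> System.out.println("order-456 step " + step));
        }
        lanes.shutdownAndWait();
    }
}
```

Steps for one order always print in order; how the two orders interleave with each other is not defined, which is exactly the per-entity guarantee described above.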

13. Retention and Storage Trade-Offs

Retention Policy Decision Matrix

Data Type	Retention Period	Why	Risk of Too Short	Risk of Too Long
Financial transactions	90 days (log) + 7 years (archive)	Regulatory compliance	Regulatory violation	Storage cost (manageable with compression)
Order events	30 days	Bug fix replay + audit	Cannot replay month-old bugs	Storage cost
User activity	7 days	Analytics pipeline + replay	Cannot rebuild analytics from scratch	Privacy compliance (GDPR right to erasure)
Application logs	3–7 days	Debugging recent incidents	Lose context on slow-developing bugs	Storage cost, compliance complexity
Metrics/telemetry	1–3 days	Near-real-time dashboards	Cannot investigate recent incidents	Storage cost
Analytics raw events	7–14 days	ML training data recency	Stale training data	Storage cost
System notifications	1 day	Transient, time-sensitive	Notifications lost if a consumer is down longer than the retention window	Stale or duplicate notifications delivered on replay

Size-Based vs Time-Based Retention

Time-based retention (retention.ms):
  - Keeps messages for a fixed time window
  - Predictable deletion schedule
  - Good for: regulatory compliance (exact time-based SLA)
  - Risk: If message rate surges, storage spikes before time-based cleanup

Size-based retention (retention.bytes):
  - Keeps up to a fixed total size per partition
  - Predictable disk usage
  - Good for: disk capacity management
  - Risk: Retention time varies with traffic: a surge evicts messages sooner than expected,
    and low traffic keeps older messages longer than expected

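Best practice: Use BOTH (whichever limit is reached first triggers cleanup)
  # retention.ms=2592000000 is 30 days; retention.bytes=53687091200 is 50 GiB per partition
  # localhost:9092 is a placeholder bootstrap address; substitute your own broker
  kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type topics --entity-name my-topic \
    --add-config retention.ms=2592000000,retention.bytes=53687091200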
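
To see which limit wins at a given traffic level, a rough calculation helps. The sketch below assumes a steady per-partition write rate and ignores compression and the fact that Kafka deletes whole log segments, so real numbers will be approximate. The function name and the two example rates are illustrative assumptions.

```python
def effective_retention_hours(bytes_per_sec, retention_ms, retention_bytes):
    """Return (hours_retained, limiting_setting) for one partition at a steady write rate."""
    time_limit_hours = retention_ms / 3_600_000
    # Hours of data that fit under retention.bytes at this write rate.
    size_limit_hours = retention_bytes / bytes_per_sec / 3600
    if size_limit_hours < time_limit_hours:
        return size_limit_hours, "retention.bytes"
    return time_limit_hours, "retention.ms"


RETENTION_MS = 2_592_000_000      # 30 days
RETENTION_BYTES = 53_687_091_200  # 50 GiB per partition

# Hypothetical traffic levels: 10 KB/s on a normal day, 1 MB/s during a surge.
normal = effective_retention_hours(10_000, RETENTION_MS, RETENTION_BYTES)
surge = effective_retention_hours(1_000_000, RETENTION_MS, RETENTION_BYTES)
```

At the normal rate the time limit applies and the full 30 days (720 hours) are kept. At the surge rate the size limit takes over and the partition holds only about 15 hours of data, which is the trade-off to check before relying on a topic for replay.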

14. Schema Evolution Strategy Trade-Offs

Evolution Strategy Comparison

Strategy	Compatibility	Complexity	Risk Level	Suitable For
JSON, no schema	None (drift freely)	Very Low	Very High	Prototypes, POCs, never production
JSON with documentation	Informal, trust-based	Low	High	Small teams, trusted internal use
JSON Schema + Schema Registry	Formal, enforced	Medium	Medium	Medium teams, multiple consumers
Avro + Schema Registry	Enforced, binary	High	Low	High-performance, many consumers
Protobuf + Registry	Enforced, binary	High	Low	Cross-language, forward compat critical
Consumer-Driven Contracts (Pact)	Contract-tested	High	Low	Many independent teams

Schema Registry Compatibility Modes

BACKWARD compatibility (most common, recommended default):
  - New schema CAN read messages written with OLD schema
  - Old schema is NOT guaranteed to read messages written with NEW schema
  - Safe pattern: consumers update first, then producers
  - Changes allowed: add optional fields (with defaults), remove fields
  - Changes NOT allowed: add required fields (no default), rename fields, change field types

FORWARD compatibility:
  - Old schema CAN read messages written with NEW schema
  - New schema is NOT guaranteed to read messages written with OLD schema
  - Safe pattern: producers update first, then consumers
  - Changes allowed: add fields (old consumers ignore unknown fields), remove optional fields (those with defaults)
  - Changes NOT allowed: remove required fields, rename fields, change field types

FULL compatibility (most restrictive, safest for long-lived schemas):
  - Both backward AND forward compatible
  - Consumers and producers on adjacent schema versions can read each other's messages
    (use FULL_TRANSITIVE to check against every registered version, not only the latest)
  - Changes allowed: add or remove optional fields (with defaults)
  - Everything else: blocked
  - Use for: Public APIs, schemas shared across many teams

NONE (no compatibility checking):
  - Any change allowed
  - Nothing blocks a breaking change, so producer and consumer must be deployed in lockstep
  - Use for: Development only, never production
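
The rules above follow from one question: can a reader using schema X decode data written with schema Y? The sketch below is a deliberately simplified model, not the Schema Registry's actual checker. A schema is a hypothetical dict of `{field_name: has_default}`, and type changes and renames are not modeled. The reader ignores fields it does not know and needs a default for every field the writer did not send.

```python
def can_read(reader, writer):
    """True if data written with `writer` can be decoded using `reader`.

    Simplified model (assumption): a schema is {field_name: has_default}.
    Unknown fields are ignored; missing fields need a default.
    """
    return all(name in writer or has_default for name, has_default in reader.items())


def compatibility(old, new):
    """Classify a schema change by the strictest mode that would accept it."""
    backward = can_read(reader=new, writer=old)  # new schema reads old data
    forward = can_read(reader=old, writer=new)   # old schema reads new data
    if backward and forward:
        return "FULL"
    if backward:
        return "BACKWARD"
    if forward:
        return "FORWARD"
    return "NONE"


v1 = {"order_id": False, "amount": False}
add_optional = {"order_id": False, "amount": False, "coupon": True}
add_required = {"order_id": False, "amount": False, "currency": False}
remove_required = {"order_id": False}
swap_required = {"order_id": False, "currency": False}
```

Under this model, `compatibility(v1, add_optional)` is `"FULL"`, adding a required field is only `"FORWARD"`, removing a required field is only `"BACKWARD"`, and swapping one required field for another passes only with checking disabled (`"NONE"`).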

15. Full Architectural Decision Trees

Decision Tree 1: Message Broker Selection

START: I need to communicate between services asynchronously

→ Is this for a single cloud provider environment?
    YES → Am I on AWS?
            YES → High throughput or event sourcing needed?
                    YES → Amazon MSK (managed Kafka)
                    NO  → Amazon SQS (managed queue, much simpler and cheaper)
          Am I on GCP?
            YES → Google Cloud Pub/Sub (managed, native integration)
          Am I on Azure?
            YES → Azure Service Bus (queues and topics) or Azure Event Hubs (streaming, offers a Kafka-compatible endpoint)
    NO → Continue...

→ Do I need message replay (re-read past messages)?
    YES → Kafka (replay is built into its retained log; most traditional queue brokers delete messages once acknowledged)
    NO  → Continue...

→ Do I need complex routing (by content, headers, priority)?
    YES → RabbitMQ (richest routing model)
    NO  → Continue...

→ Do I need > 500,000 messages/second throughput?
    YES → Kafka (RabbitMQ typically struggles to sustain this rate)
    NO  → Continue...

→ Do I need stream processing (joins, aggregations, windowed operations)?
    YES → Kafka + Kafka Streams or Flink
    NO  → Continue...

→ Do I value operational simplicity over features?
    YES → SQS (cloud) or RabbitMQ (self-hosted) - much simpler to operate
    NO  → Kafka provides the most power at the cost of complexity

Default recommendation:
  Small teams, simple use cases: → SQS (cloud) or RabbitMQ (self-hosted)
  Large teams, event-driven architecture, analytics: → Kafka

Decision Tree 2: Partition Key Selection

START: What partition key should I use for this Kafka topic?

→ Do messages need strict ordering?
    NO → Use null key (producer spreads records across partitions: round-robin in older clients, sticky batching since Kafka 2.4 - highest throughput)
    YES → Continue...

→ What is the unit of ordering?
    "All events for the same ORDER": → Use orderId as partition key
    "All events for the same USER":  → Use userId as partition key
    "All events for the same ENTITY": → Use entityId as partition key

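→ Is there a risk of hot partitions (one key has far more traffic than others)?
    YES → Is the hot key predictable (known large tenants/merchants)?
            YES → Custom partitioner with multiple slots for hot keys
            NO  → Compound key (entityId + random bucket 0-3)
          Either way, the hot key's events span several partitions, so ordering
          holds only within each slot or bucket, not across the whole entity
    NO  → Simple entity ID hash is fine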

→ Are there any keys that MUST be processed in sequence relative to each other?
    YES (e.g., all payment events for a user must be in order across order IDs):
         → Use a higher-level key (userId) that groups all related entities
    NO  → Entity-level key is sufficient
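
The compound-key branch is easier to judge with a small simulation. This is a sketch under assumptions: `NUM_PARTITIONS`, the `#` separator, and the helper names are hypothetical, and `zlib.crc32` stands in for Kafka's murmur2-based default partitioner. It shows that a plain entity key always maps to one partition, while a bucketed key spreads the same entity over at most four.

```python
import random
import zlib

NUM_PARTITIONS = 12  # hypothetical topic size
HOT_KEY_BUCKETS = 4  # "random bucket 0-3" from the tree above


def partition_for(key):
    # Stable hash so equal keys always map to the same partition.
    # crc32 is a stand-in (assumption); Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def compound_key(entity_id, rng=random):
    # Append a random bucket so one hot entity spreads over up to 4 partitions.
    return f"{entity_id}#{rng.randrange(HOT_KEY_BUCKETS)}"


rng = random.Random(42)  # seeded only to make the demo repeatable
plain = {partition_for("merchant-1") for _ in range(1000)}
spread = {partition_for(compound_key("merchant-1", rng)) for _ in range(1000)}
```

`plain` contains exactly one partition, so all of the merchant's events stay ordered but land on one consumer. `spread` contains up to four partitions, which relieves the hot spot at the cost of ordering across buckets for that merchant.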

Decision Tree 3: Delivery Semantics Selection

START: What delivery guarantee do I need?

→ Can I afford to lose some messages?
    YES (metrics, telemetry, analytics sampling):
         → At-most-once
         → acks=0 (no acknowledgment) or acks=1 (leader only)
         → Auto-commit BEFORE processing
         → Highest throughput
    NO → Continue...

→ Can I make my consumer idempotent (safe to process twice)?
    YES (most well-designed consumers):
         → At-least-once with idempotent consumer
         → acks=all, retries enabled
         → Manual ack AFTER processing
         → Deduplication by event ID in consumer
         → Best balance of reliability and simplicity

    NO (consumer has unavoidable side effects that cannot be deduplicated):
         → Exactly-once via Kafka transactions (high cost; covers Kafka-to-Kafka read-process-write only, not external side effects)
         → Consider redesigning for idempotency first
         → Is the consumer calling an external API?
              → Pass idempotency key to external API (Stripe, Twilio support this)
              → This gives you exactly-once EFFECT without exactly-once DELIVERY
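
The "at-least-once with idempotent consumer" branch can be illustrated with a minimal in-memory sketch. The class name, the `event_id` and `amount` fields, and the set used as a deduplication store are assumptions for illustration; a production consumer would keep processed IDs in a durable store and commit them together with the side effect.

```python
class IdempotentConsumer:
    """Applies each event's effect once, even when the broker redelivers it."""

    def __init__(self):
        self.processed_ids = set()  # assumption: in-memory; use a durable store in production
        self.balance = 0

    def handle(self, event):
        event_id = event["event_id"]
        if event_id in self.processed_ids:
            return False  # duplicate delivery: skip the side effect
        self.balance += event["amount"]
        # Record the ID only after the effect succeeds. In a real system the effect
        # and the ID should be committed atomically (for example, one DB transaction).
        self.processed_ids.add(event_id)
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"event_id": "e1", "amount": 100},
    {"event_id": "e2", "amount": 50},
    {"event_id": "e1", "amount": 100},  # redelivered after a missed ack
]
results = [consumer.handle(e) for e in deliveries]
```

The third delivery is recognized as a duplicate and skipped, so the balance reflects each event once. This is the exactly-once effect the tree describes, achieved on top of at-least-once delivery.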


Series: Message Queues Demystified