6/6/2026 Admin Post


Message Queues - Supplement 3: Trade-Offs & Decision Guide

Series Navigation:
Main Index |
Supplement 1 - Anti-Patterns Extended |
Supplement 2 - Production Challenges |
Supplement 4 - Real-World Architecture

Every architectural decision involves trade-offs. This guide makes those trade-offs explicit,
concrete, and actionable. Each section answers "when should I choose A over B?" with
criteria, comparison tables, and real-world context.


Table of Contents

  1. When to Use Message Queues (and When NOT to)
  2. Where in Your Architecture to Place Message Queues
  3. Technology Selection: Full Decision Matrix
  4. Kafka vs RabbitMQ: Deep Comparison
  5. Cloud-Managed vs Self-Hosted Trade-Offs
  6. Delivery Semantics Trade-Off Table
  7. Queue vs Topic vs Stream: The Full Decision
  8. Partition Strategy Trade-Offs
  9. Message Size Trade-Offs
  10. Synchronous vs Asynchronous Decision Framework
  11. Event-Driven vs Command-Driven Trade-Offs
  12. Consumer Concurrency Trade-Offs
  13. Retention and Storage Trade-Offs
  14. Schema Evolution Strategy Trade-Offs
  15. Full Architectural Decision Trees

1. When to Use Message Queues (and When NOT to)

Apply Message Queues When...

| Scenario | Reason Message Queue is Right | Example |
|---|---|---|
| Work takes longer than a caller can wait | Caller gets immediate response; work continues in background | Video transcoding, report generation |
| Multiple downstream services need the same event | Fan-out without coupling to each downstream | Order placed → warehouse + email + analytics |
| Downstream service is slower than upstream | Queue absorbs traffic spikes; downstream processes at its own pace | Payment processor maxes at 500/sec; orders arrive at 2000/sec during sales |
| Downstream service availability affects upstream | Decouple: downstream being down does not fail the upstream call | Email service down should not fail order placement |
| Workload can be parallelized | Add consumers to scale processing without changing producers | Batch image resizing, log processing |
| Retry/reliability without caller involvement | Failed downstream processing retried automatically | Webhook delivery, external API calls |
| Audit trail of all events needed | Topic log provides ordered, replayable event history | Financial transactions, compliance |
| Services are in different teams/deployment cycles | Event contracts are stable; teams deploy independently | Microservices with different release schedules |

Do NOT Use Message Queues When...

| Scenario | Why Message Queue is Wrong | Better Alternative |
|---|---|---|
| Caller needs the response to continue | Async adds latency with no benefit; caller blocks anyway | Synchronous HTTP/gRPC |
| Operation must be atomic across services | Queue does not provide distributed transactions | Saga pattern (if unavoidable) or monolith |
| System is a simple CRUD application | Unnecessary complexity for straightforward request/response | REST API |
| You need to query current state | Queues are not databases; querying is the wrong tool | Database query or cache |
| Real-time user interaction required | Queue latency is too high for sub-100ms user interactions | WebSockets, Server-Sent Events |
| Two services that could be the same service | Unnecessary inter-service communication | Merge into one service |
| Request volume is low and predictable | Queue overhead not worth it for < 10 req/sec | Direct HTTP call |
| All consumers must respond before proceeding | Synchronous fan-out is semantically required | Request-scatter-gather via HTTP |

The Decision Question

Ask these questions in order:

1. Does the caller need the result to proceed?
   YES → Use synchronous HTTP/gRPC
   NO  → Consider async. Continue to question 2.

2. Can the operation fail and be retried later without the caller knowing?
   YES → Message queue is a good fit
   NO  → Consider: is this actually two concerns mixed together?

3. Are there multiple consumers that all need this event independently?
   YES → Message queue (pub-sub / topic model) is the right tool
   NO  → Consider: could a direct call handle this? If so, simpler is better.

4. Is the downstream significantly slower or less reliable than upstream?
   YES → Message queue provides buffering and decoupling
   NO  → Direct call might be simpler; evaluate overhead vs benefit.

5. Do you need event replay or time-travel for debugging/analytics?
   YES → Kafka (event log) is specifically designed for this
   NO  → Any queue suffices.
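
The five questions above can be read as a short checklist in code. The following is a minimal sketch, not a definitive rule engine: the class, enum, and parameter names are hypothetical, and it simplifies the list by letting the most specific answer win (replay before pub-sub before a plain queue).

```java
/** Hypothetical encoding of the five decision questions; names are illustrative only. */
final class QueueDecision {

    enum Recommendation {
        SYNCHRONOUS_CALL, DIRECT_CALL_LIKELY_SIMPLER, MESSAGE_QUEUE, PUB_SUB_TOPIC, EVENT_LOG
    }

    static Recommendation recommend(boolean callerNeedsResult,
                                    boolean retryableWithoutCaller,
                                    boolean multipleIndependentConsumers,
                                    boolean downstreamSlowerOrLessReliable,
                                    boolean needsReplay) {
        // Question 1 decides outright: a blocked caller gains nothing from async.
        if (callerNeedsResult) {
            return Recommendation.SYNCHRONOUS_CALL;
        }
        // Question 5: replay needs a retained log, so it overrides the other answers.
        if (needsReplay) {
            return Recommendation.EVENT_LOG;
        }
        // Question 3: several independent consumers call for pub-sub.
        if (multipleIndependentConsumers) {
            return Recommendation.PUB_SUB_TOPIC;
        }
        // Questions 2 and 4: retry or buffering needs are met by a plain queue.
        if (retryableWithoutCaller || downstreamSlowerOrLessReliable) {
            return Recommendation.MESSAGE_QUEUE;
        }
        // No question answered in favor of a queue: simpler is better.
        return Recommendation.DIRECT_CALL_LIKELY_SIMPLER;
    }
}
```

For example, sending a confirmation email (caller does not wait, retryable, single consumer) comes out as MESSAGE_QUEUE, while a checkout payment authorization comes out as SYNCHRONOUS_CALL.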

2. Where in Your Architecture to Place Message Queues

Placement Patterns

Pattern A: Edge Buffering (Input)

External clients (high, bursty)
         ↓
    [API Gateway]
         ↓
    Message Queue ← Buffer here
         ↓
    Internal Services (capacity-limited)

When to use: External traffic is unpredictable. Internal processing has finite capacity (database writes, payment gateway calls). You need to absorb traffic spikes without rejecting requests.

Trade-off: Adds latency for the user (job accepted but not completed). Requires async result delivery (webhook, polling, WebSocket).


Pattern B: Service Decoupling (Middle)

Service A  →  Message Queue  →  Service B
                    ↓
              Service C
                    ↓
              Service D

When to use: Service A is owned by one team; services B, C, D by other teams. Deployment independence is important. Services B, C, D can be added/removed without changing Service A.

Trade-off: Eventual consistency. Service A cannot know what state Services B/C/D are in. Debugging requires distributed tracing across the queue boundary.


Pattern C: Event Backbone (Architecture-Wide)

All domain services → publish to central event log (Kafka)
All domain services → subscribe to events they care about

Event log becomes the single source of truth for "what happened"

When to use: Event sourcing architecture. Regulatory audit requirements (every event logged). Multiple consumers per domain event. Analytics and ML training data pipeline from production events.

Trade-off: Event schema evolution is critical and difficult. All teams must coordinate on event contracts. Higher operational complexity (Schema Registry, replication, ordering policies).


Pattern D: Work Distribution (Compute Tier)

Job Submission API
       ↓
  [Job Queue]
       ↓
┌───────────────┐
│ Worker Pool   │
│ (elastic)     │
└───────────────┘

When to use: Compute-intensive tasks (ML inference, video processing, report generation). Work items are independent (can be parallelized). Need to scale workers based on queue depth.

Trade-off: Worker scaling adds latency. Job failures must be handled (DLQ, retry policies). Result delivery to the original caller requires a separate mechanism (callback URL, polling).


The "Should I Put a Queue Here?" Decision for Each Service Boundary

For each service-to-service call, evaluate:

Latency requirement:
  < 100ms → probably synchronous HTTP
  100ms–2s → either can work; queue adds resilience
  > 2s (background work) → queue almost always correct

Dependency direction:
  A depends on B for its core function → HTTP (tight coupling acceptable)
  A does something that B needs to know about → Queue (B observes A)

Failure semantics:
  "If B is down, A should fail immediately" → HTTP
  "If B is down, A should continue and B will catch up" → Queue

Traffic pattern:
  Steady, predictable → HTTP is fine
  Bursty, spiky → Queue for buffering
  Batch (overnight processing) → Queue almost always correct

3. Technology Selection: Full Decision Matrix

| Requirement | RabbitMQ | Kafka | AWS SQS | Redis Streams | Google Pub/Sub | Azure Service Bus |
|---|---|---|---|---|---|---|
| Message ordering | Per-queue (not partitioned) | Strong (per partition) | FIFO queues only | Per stream | Best-effort (per key with ordering keys) | Session-based |
| Throughput (rough, workload-dependent) | ~50K msg/sec/broker | >1M msg/sec/broker | Nearly unlimited (standard); ~3K msg/sec with batching (FIFO) | ~1M msg/sec | ~1M msg/sec | ~1M msg/sec |
| Message replay | No (consumed = gone) | Yes (retention-based) | No | Limited | Limited | No |
| Push vs Pull | Push (broker pushes) | Pull (consumer polls) | Pull (client polls) | Pull | Push + Pull | Push + Pull |
| Routing flexibility | Excellent (exchanges) | Limited (topic-based) | None | None | Attribute filtering | Topic/subscription |
| Exactly-once | Not natively | Yes (transactions) | FIFO queues only (deduplication window) | No | Yes (per region) | Yes |
| Self-hosted | Yes | Yes | No (AWS only) | Yes | No (GCP only) | No (Azure only) |
| Managed service | CloudAMQP, Amazon MQ | Confluent, MSK, Aiven | Native | Redis Cloud | Native | Native |
| Operational complexity | Medium | High | Very Low | Low | Very Low | Very Low |
| Best for | Task queues, routing | Event log, analytics, stream processing | Simple cloud queuing | Simple streams, low infra | Cloud-native event bus | Enterprise .NET/Java |

4. Kafka vs RabbitMQ: Deep Comparison

This is the most commonly debated technology choice. Here is the complete picture:

Architecture Philosophy Difference

RabbitMQ: Smart broker, dumb consumers
  - Broker maintains subscriber state
  - Broker routes messages to consumers
  - Broker tracks what each consumer has received
  - Consumers are stateless receivers
  - Message consumed = removed from queue

Kafka: Dumb broker, smart consumers
  - Broker just stores messages in an ordered log
  - Consumers track their own position (offset)
  - Consumers pull messages at their own pace
  - Consumers can re-read messages at any time
  - Messages persist until retention policy expires

This philosophical difference drives ALL the other trade-offs.
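
The contrast between the two models can be made concrete with two in-memory stand-ins. This is a minimal sketch under stated assumptions: `SimpleQueue` and `SimpleLog` are hypothetical classes, not RabbitMQ or Kafka APIs, and they ignore acknowledgments, persistence, and concurrency.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Queue model: the broker owns delivery state, and a consumed message is gone. */
final class SimpleQueue {
    private final Queue<String> messages = new ArrayDeque<>();

    void publish(String message) {
        messages.add(message);
    }

    /** Returns and removes the next message, or null when the queue is empty. */
    String consume() {
        return messages.poll();
    }
}

/** Log model: the broker only appends; each consumer remembers its own offset. */
final class SimpleLog {
    private final List<String> records = new ArrayList<>();

    void append(String record) {
        records.add(record);
    }

    /** Reads every record from the given offset onward without removing anything. */
    List<String> readFrom(int offset) {
        int start = Math.max(0, Math.min(offset, records.size()));
        return new ArrayList<>(records.subList(start, records.size()));
    }
}
```

Calling `readFrom(0)` twice returns the same records both times, which is what makes replay possible; calling `consume()` twice on a one-message queue returns the message and then null.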

Use Case Decision Table

| Use Case | Choose RabbitMQ | Choose Kafka |
|---|---|---|
| Task queue (each task done once) | ✅ Designed for this | ⚠️ Works but overkill |
| Job scheduling with priority | ✅ Native priority queues | ❌ No native priority |
| Complex routing (by content/headers) | ✅ Topic/Header exchanges | ❌ Limited routing |
| Request-Reply pattern | ✅ Reply-to header support | ⚠️ Possible but complex |
| Event sourcing / audit log | ❌ Messages not retained | ✅ Designed for this |
| Stream processing (Kafka Streams) | ❌ No stream processing | ✅ Kafka Streams, ksqlDB |
| Replay past events for new service | ❌ Impossible after consume | ✅ Replay from any offset |
| Analytics data pipeline | ❌ Not designed for volume | ✅ Built for petabyte scale |
| IoT data ingestion (millions/sec) | ❌ Throughput insufficient | ✅ Handles millions/sec |
| Simple microservice messaging | ✅ Easy to set up | ⚠️ Complex for simple use |
| Multiple consumers per message | ✅ Fan-out exchanges | ✅ Consumer groups |
| Exactly-once across services | ❌ No native support | ✅ Kafka transactions |
| Message TTL / expiry | ✅ Native support | ⚠️ Via retention policies |
| Team learning curve | ✅ Easier | ❌ Steeper curve |
| Operational simplicity | ✅ Simpler | ❌ More complex (ZK/KRaft) |

Performance Trade-Off

RabbitMQ:
  Write latency: 1-5ms (very low)
  Throughput: 50,000-100,000 messages/sec per broker
  Memory footprint: Higher (tracks per-consumer state)
  Disk usage: Lower (messages deleted after consumption)
  Good for: < 1 million messages/day, latency-sensitive tasks

Kafka:
  Write latency: 2-10ms (tunable: lower with acks=1, higher with acks=all)
  Throughput: 1,000,000+ messages/sec per broker
  Memory footprint: Lower (stateless broker, consumers track state)
  Disk usage: Higher (messages retained, not deleted)
  Good for: Billions of messages/day, analytics pipelines, event sourcing

When Teams Pick the Wrong One

Common mistake 1: Using Kafka for a simple task queue
  Symptom: "We have Kafka but only 3 consumers and 1000 events/day"
  Problem: Kafka's complexity (partitions, replication, schema registry) adds
           operational overhead for no benefit at this scale
  Solution: RabbitMQ or even SQS would have been simpler and sufficient

Common mistake 2: Using RabbitMQ for event sourcing
  Symptom: "We need to replay last month's events for our new analytics service"
  Problem: RabbitMQ messages are gone after consumption - cannot replay
  Solution: Kafka (or rebuild from DB with Debezium CDC)

Common mistake 3: Using Kafka for complex routing
  Symptom: "We want to route messages based on 5 header attributes to different queues"
  Problem: Kafka has no native routing - must be done in application code
  Solution: RabbitMQ's Topic or Headers exchange handles this natively

5. Cloud-Managed vs Self-Hosted Trade-Offs

Trade-Off Matrix

| Dimension | Self-Hosted (Kafka/RabbitMQ) | Managed (Confluent/MSK/SQS/Amazon MQ) |
|---|---|---|
| Upfront cost | Low ($) | None |
| Ongoing cost | Medium (infra + ops time) | High (per-message + per-hour pricing) |
| Cost at scale | Lower (EC2 is cheaper than managed) | Can be very expensive at petabyte scale |
| Operational burden | High (upgrades, disk, replication) | Very Low (vendor handles) |
| Feature availability | All features, any version | Vendor-supported versions only (may lag upstream releases) |
| Customization | Full control | Limited to vendor's config options |
| Vendor lock-in | None (open source) | Medium-High (proprietary APIs/features) |
| Time to production | Weeks (setup, hardening) | Hours |
| Disaster recovery | Self-managed (MirrorMaker, backups) | Vendor-provided (often across AZs) |
| Compliance (FedRAMP, etc.) | You certify | Vendor often pre-certified |
| SLA | You build it | Vendor provides (99.9–99.99%) |

Cost Breakdown Examples

Scenario: 100 million messages/day, 1KB each = 100GB/day

AWS MSK (managed Kafka):
  Broker cost: 3 × m5.xlarge = $0.192/hr × 3 × 720hr = ~$415/month
  Storage: 100GB/day × 7 day retention = 700GB × $0.10/GB = $70/month
  Data transfer: 100GB/day × 30 days × $0.02/GB = $60/month
  Total: ~$545/month

Self-hosted Kafka (AWS EC2):
  Broker cost: 3 × m5.xlarge = $415/month
  EBS storage: 700GB × $0.10 = $70/month
  PLUS operations time: 0.5 FTE × $150K/year = ~$6,250/month
  Total: ~$6,735/month  ← HIGHER despite cheaper infrastructure!

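AWS SQS (managed queue):
  100 million messages/day × 30 days = 3,000 million messages/month
  3,000 million requests × $0.40/million = ~$1,200/month for sends alone
  Receive and delete calls are billed as requests too, so unbatched traffic costs roughly 3× that
  (At 100 million messages per MONTH the send cost would be only ~$40/month)

Confluent Cloud (managed Kafka):
  3,000 million messages/month × roughly $0.08/million = ~$240/month at this per-message rate
  (Confluent bills by throughput, storage, and cluster type rather than per message;
   check the current pricing page before relying on this figure)

Lesson: For simple queue use cases at low volume, SQS is dramatically cheaper.
At sustained high volume, per-request pricing overtakes fixed broker pricing.
For event log/analytics at medium scale, MSK balances cost and operations.
Self-hosted only wins at petabyte scale where infrastructure cost dominates ops cost.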
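
The MSK and self-hosted totals above are simple sums, and writing them as code makes it easy to substitute your own rates. This is a minimal sketch: the class and method names are hypothetical, and every rate is copied from the scenario above rather than taken from a current price list.

```java
/** Reproduces the MSK vs self-hosted arithmetic from the scenario; rates are the article's, not live prices. */
final class KafkaCostEstimate {
    static final int HOURS_PER_MONTH = 720;

    static double brokers(int count, double hourlyRateUsd) {
        return count * hourlyRateUsd * HOURS_PER_MONTH;
    }

    static double storage(double gbPerDay, int retentionDays, double usdPerGbMonth) {
        return gbPerDay * retentionDays * usdPerGbMonth;
    }

    static double transfer(double gbPerDay, int daysPerMonth, double usdPerGb) {
        return gbPerDay * daysPerMonth * usdPerGb;
    }

    static double opsTime(double fte, double annualCostUsd) {
        return fte * annualCostUsd / 12.0;
    }

    /** 3 brokers + 7 days of 100GB/day storage + 30 days of data transfer. */
    static double managedMskMonthly() {
        return brokers(3, 0.192) + storage(100, 7, 0.10) + transfer(100, 30, 0.02);
    }

    /** Same brokers and storage, plus half an engineer's time at $150K/year. */
    static double selfHostedMonthly() {
        return brokers(3, 0.192) + storage(100, 7, 0.10) + opsTime(0.5, 150_000);
    }
}
```

With these inputs the operations time ($6,250/month) is more than ten times the infrastructure bill, which is the point the comparison is making.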

6. Delivery Semantics Trade-Off Table

The Three Guarantees Explained

| Semantic | Message Delivered | Duplicates Possible | Data Loss Possible | Default For | Suitable For |
|---|---|---|---|---|---|
| At-most-once | 0 or 1 times | No | Yes | Analytics, metrics | Sensor readings, telemetry, non-critical events |
| At-least-once | 1 or more times | Yes | No | All major brokers | Orders, notifications, emails (with idempotent consumer) |
| Exactly-once | Exactly 1 time | No | No | Nowhere (requires special setup) | Financial transactions, billing, inventory deductions |

Implementation Cost of Each Semantic

At-most-once:
  Producer: Fire and forget (no acknowledgment wait)
  Consumer: Auto-commit before processing
  Extra infrastructure: None
  Relative cost: 1× (baseline)
  Throughput impact: Highest (no waiting for ACKs)

At-least-once:
  Producer: acks=1 or acks=all + retries enabled
  Consumer: Manual ack after successful processing
  Extra infrastructure: DLQ for failed messages
  Relative cost: 2× (acknowledgment overhead)
  Throughput impact: Medium

Exactly-once (Kafka):
  Producer: enable.idempotence=true + transactional.id set + acks=all
  Consumer: isolation.level=read_committed + transactional processing
  Consumer: Idempotent processing logic (DB deduplication) as backup
  Broker: min.insync.replicas=2 (with replication.factor=3)
  Extra infrastructure: Kafka transaction coordination, transactional DB for results that leave Kafka
  Relative cost: 5–10× (transaction coordination overhead)
  Throughput impact: Significant reduction (20-50% vs at-least-once)
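
The settings listed for each semantic can be collected as client configuration. This is a sketch only: the keys are standard Kafka client configuration names, but the class and method names are hypothetical, and bootstrap servers, serializers, and group IDs are left out.

```java
import java.util.Properties;

/** Illustrative Kafka client settings per delivery semantic (connection settings omitted). */
final class DeliveryConfigs {

    /** At-most-once: fire and forget, no acknowledgment wait, no retries. */
    static Properties atMostOnceProducer() {
        Properties p = new Properties();
        p.setProperty("acks", "0");
        p.setProperty("retries", "0");
        p.setProperty("enable.idempotence", "false"); // idempotence requires acks=all
        return p;
    }

    /** At-least-once: wait for in-sync replicas and retry on failure. */
    static Properties atLeastOnceProducer() {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        return p;
    }

    /** Exactly-once: idempotent, transactional producer. */
    static Properties exactlyOnceProducer(String transactionalId) {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("enable.idempotence", "true");
        p.setProperty("transactional.id", transactionalId);
        return p;
    }

    /** Consumer side of exactly-once: skip records from aborted transactions. */
    static Properties readCommittedConsumer() {
        Properties p = new Properties();
        p.setProperty("isolation.level", "read_committed");
        p.setProperty("enable.auto.commit", "false"); // offsets are committed with the transaction
        return p;
    }
}
```

These objects would be passed to the producer and consumer constructors together with the connection settings for your cluster.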

When Exactly-Once is NOT Worth It

Myth: "We need exactly-once delivery for financial transactions"
Reality: "We need exactly-once PROCESSING, which is different"

Exactly-once delivery is a property of the MESSAGE TRANSPORT.
Exactly-once processing is achieved by making your CONSUMER idempotent.

The idempotent consumer approach:
  - Uses at-least-once delivery (simpler, faster, more available)
  - Consumer records processed event IDs in a DB with unique constraint
  - If duplicate arrives: DB unique constraint prevents re-processing
  - Net result: exactly-once PROCESSING without exactly-once DELIVERY

This is simpler, cheaper, and has better availability than Kafka transactions.

Kafka transactions are appropriate when:
  - You consume from one Kafka topic AND produce to another Kafka topic
  - Both operations must succeed or fail together atomically
  - Example: Read order event → compute → write to analytics topic (atomic)

Kafka transactions are NOT appropriate for:
  - Consumer → Database (use DB transactions + idempotency key instead)
  - Consumer → External API (external APIs are not part of Kafka's transaction boundary)
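
The idempotent consumer approach described above can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `IdempotentLedger` is a hypothetical class, the `HashSet` stands in for a database table with a unique constraint on the event ID, and in a real system the ID insert and the balance update would share one database transaction.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** At-least-once delivery plus deduplication gives exactly-once processing. */
final class IdempotentLedger {
    // Stand-in for a table with a UNIQUE constraint on event_id.
    private final Set<String> processedEventIds = new HashSet<>();
    private final Map<String, Long> balancesInCents = new HashMap<>();

    /** Applies the credit once per eventId; returns false when the event is a duplicate. */
    synchronized boolean applyCredit(String eventId, String account, long amountCents) {
        if (!processedEventIds.add(eventId)) {
            return false; // duplicate delivery: the "unique constraint" rejected it
        }
        balancesInCents.merge(account, amountCents, Long::sum);
        return true;
    }

    synchronized long balanceOf(String account) {
        return balancesInCents.getOrDefault(account, 0L);
    }
}
```

Delivering the same event twice leaves the balance unchanged after the first application, so redelivery after a crash or rebalance is harmless.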

7. Queue vs Topic vs Stream: The Full Decision

Definitions Clarified

Queue:
  - One message → consumed by exactly ONE consumer
  - Message is deleted after consumption
  - Consumer position tracked by broker
  - Examples: RabbitMQ queue, AWS SQS, Azure Service Bus queue

Topic (Message Bus):
  - One message → consumed by ALL subscribed consumers
  - Each subscriber gets its own copy
  - Message retained for subscription duration
  - Examples: AWS SNS, Google Pub/Sub, Kafka topics (with multiple consumer groups)

Stream (Event Log):
  - Ordered, persistent log of records
  - Multiple consumers can read the same records at different positions
  - Records not deleted after consumption (retention-based)
  - Consumers manage their own position (offset)
  - Examples: Kafka, AWS Kinesis, Redis Streams

Decision Criteria

Use a Queue when:
  ✓ Each piece of work should be done by exactly one worker
  ✓ Work items are independent of each other
  ✓ You want automatic load distribution across workers
  ✓ Messages do not need to be replayed
  ✓ Example: Email sending queue, job processing queue

Use a Topic (Pub-Sub) when:
  ✓ Multiple services all need to know when something happened
  ✓ Publishers should not need to know who the subscribers are
  ✓ Adding a new consumer does not change the producer
  ✓ Subscribers have different interests/filters on the same event stream
  ✓ Example: OrderPlaced event → (warehouse, email, analytics all subscribe)

Use a Stream (Event Log) when:
  ✓ You need to replay past events (new service joining, bug fix replay)
  ✓ Events need to be processed in strict order
  ✓ Multiple independent consumers at different positions in time
  ✓ Long-term retention for compliance or analytics
  ✓ You want to "rewind" and re-process history
  ✓ Example: Financial transaction log, change data capture, user activity feed

When a Queue AND a Topic Together Make Sense

Pattern: SNS → SQS Fan-Out (AWS)
  SNS Topic (guarantees fan-out to all queues)
       |
  ┌────┴────┬────────────┐
  ↓         ↓            ↓
SQS Queue  SQS Queue   SQS Queue
(warehouse) (email)     (analytics)

Each service gets its own queue:
  - Slow analytics consumer does not block warehouse consumer
  - Each queue has its own DLQ
  - Each queue has its own retry policy
  - Adding a new consumer = add a new SQS queue + subscribe to SNS

This is the right pattern for fan-out with independent consumers

8. Partition Strategy Trade-Offs

Choosing a Partition Key

| Partition Key Choice | Ordering Guarantee | Distribution | Hotspot Risk | Use When |
|---|---|---|---|---|
| Entity ID (orderId, userId) | All events for same entity ordered | Even (if IDs are random/UUID) | Low | Entity-level ordering required (most common) |
| User ID | All events for same user ordered | Even | Low for UUID; High if Pareto distribution | User activity streams |
| Tenant ID | All events for same tenant ordered | Can be hot for large tenants | High if tenant sizes vary greatly | Multi-tenant with equal tenants |
| Geographic region | Regional ordering | May be uneven | Possible (dense regions) | Regional data sovereignty |
| Custom hash bucket | Within-bucket ordering | Controlled | Low (by design) | Known hot entities needing distribution |
| Round-robin (null key) | No ordering guarantee | Perfect | None | Pure throughput, no ordering needed |
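
The ordering and hotspot columns both follow from how a key maps to a partition. The sketch below is illustrative only: `KeyPartitioner` is a hypothetical class, and `String.hashCode` stands in for the real hash (Kafka's default partitioner applies murmur2 to the serialized key bytes).

```java
/** Key-to-partition mapping and a "custom hash bucket" for hot keys. */
final class KeyPartitioner {

    /** The same key always maps to the same partition, which is what preserves per-key order. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /**
     * Spreads one hot key over several sub-keys. Ordering then holds only
     * within each bucket, not across the whole hot key.
     */
    static String saltedKey(String hotKey, long sequence, int buckets) {
        return hotKey + "#" + Math.floorMod(sequence, buckets);
    }
}
```

A large tenant keyed as `tenant-big` always hits one partition; keyed through `saltedKey("tenant-big", seq, 4)` its traffic spreads over up to four partitions at the cost of tenant-wide ordering.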

How Many Partitions?

Under-partitioning (too few):
  Problem: Bottleneck throughput at peak load
  Problem: Cannot add more consumers than partitions
  Problem: One slow message blocks all messages on that partition
  Rule: If unsure, lean toward more partitions (they can be added later but never removed,
        and adding them remaps keys to partitions, which breaks per-key ordering)

Over-partitioning (too many):
  Problem: More partitions = more ZooKeeper/controller load
  Problem: Leader election takes longer with more partitions
  Problem: Consumer rebalancing takes longer
  Problem: Replication overhead increases linearly with partition count
  Rule: Do not create 10,000 partitions for a low-volume topic

Partition count formula:
  target_throughput_mb_per_sec / throughput_per_partition_mb_per_sec

  Where throughput_per_partition:
    For producer: ~10–40 MB/sec per partition (varies by hardware)
    For consumer: ~15–50 MB/sec per partition (depends on processing speed)
    Conservative estimate: 10 MB/sec per partition

  Example:
    Target: 500 MB/sec peak
    Per partition: 10 MB/sec (conservative)
    Partitions needed: 50
    Add 20% headroom: 60 partitions
    Round to even number: 60 (or 64 for power-of-2 if custom partitioner used)

Practical defaults:
  Low-traffic topics (< 10 MB/sec): 3–6 partitions
  Medium-traffic topics (10–100 MB/sec): 12–24 partitions
  High-traffic topics (> 100 MB/sec): 48–120 partitions
  Rethink if > 200 partitions: consider separate cluster
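
The partition count formula and worked example above translate directly into code. This is a minimal sketch with hypothetical names; headroom is passed as a whole percentage to keep the arithmetic exact.

```java
/** Partition sizing: throughput target divided by per-partition throughput, plus headroom. */
final class PartitionMath {

    static int partitionsNeeded(double targetMbPerSec, double perPartitionMbPerSec, int headroomPercent) {
        if (perPartitionMbPerSec <= 0) {
            throw new IllegalArgumentException("per-partition throughput must be positive");
        }
        int base = (int) Math.ceil(targetMbPerSec / perPartitionMbPerSec);
        return (int) Math.ceil(base * (100 + headroomPercent) / 100.0);
    }

    /** Rounds up to the next power of two, for partitioners that rely on one. */
    static int nextPowerOfTwo(int n) {
        if (n <= 1) {
            return 1;
        }
        return Integer.highestOneBit(n - 1) << 1;
    }
}
```

`partitionsNeeded(500, 10, 20)` gives 60, matching the example, and `nextPowerOfTwo(60)` gives 64.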

Replication Factor Trade-Off

replication.factor=1:
  Cost: Lowest (1× storage)
  Durability: NONE - single broker failure = data loss
  Use for: Development only. Never in production.

replication.factor=2:
  Cost: 2× storage
  Durability: Tolerates 1 broker failure
  Problem: with acks=all and min.insync.replicas=2, writes to a partition fail as soon as
           either of its two replicas is offline; with min.insync.replicas=1 writes continue,
           but the surviving copy is a single point of failure
  Use for: Low-criticality topics where availability > durability

replication.factor=3 (recommended for production):
  Cost: 3× storage
  Durability: Tolerates 1 broker failure with no data loss
  Availability: With min.insync.replicas=2, writes succeed as long as 2 of a partition's 3 replicas are in sync
  Use for: All production data (standard practice)

replication.factor=3 + min.insync.replicas=2 + producer acks=all is the production-standard combination
  - Tolerates 1 broker failure with zero data loss
  - Remains writable with 2/3 brokers available
  - Refuses writes (safely) when only 1 broker available (better than silent data loss)
  - min.insync.replicas is enforced only for producers using acks=all
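
The interaction between replication factor and min.insync.replicas reduces to two small checks. This is a simplified model with hypothetical names; it assumes producers use acks=all and ignores leader election and unclean recovery.

```java
/** Simplified write-availability model for a single partition with acks=all producers. */
final class ReplicationCheck {

    /** A write is accepted only while enough replicas are in sync. */
    static boolean acceptsWrites(int inSyncReplicas, int minInsyncReplicas) {
        return inSyncReplicas >= 1 && inSyncReplicas >= minInsyncReplicas;
    }

    /** Replica failures the partition can absorb while still accepting writes. */
    static int tolerableFailures(int replicationFactor, int minInsyncReplicas) {
        return Math.max(0, replicationFactor - minInsyncReplicas);
    }
}
```

With replication factor 3 and min.insync.replicas=2 the partition absorbs one failure; with replication factor 2 and min.insync.replicas=2 it absorbs none, which is the problem described for that setting.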

9. Message Size Trade-Offs

Message Size Spectrum

| Message Size | Approach | Pros | Cons | Example |
|---|---|---|---|---|
| < 1KB | Include all data inline | Simple, atomic, no external dependency | None significant | Order status change event |
| 1–100KB | Include inline OR use Claim Check | Inline: simpler; Claim Check: avoids broker overhead | Claim Check adds S3 latency | Product catalog update with images |
| 100KB–1MB | Claim Check (S3/blob reference) | Keeps broker fast and cheap | Extra S3 latency, S3 availability dependency | Large report generation result |
| > 1MB | Must use Claim Check | Stays under Kafka's ~1MB default limit (configurable higher) | S3 becomes critical path | Video upload notification, large datasets |

Kafka message.max.bytes Trade-Off

message.max.bytes=1MB (default):
  Benefit: Predictable performance, small messages flow fast
  Limit: Applications must handle RecordTooLargeException

message.max.bytes=10MB:
  Benefit: Larger payloads without external storage
  Cost: 10x memory per in-flight message
  Cost: Network bandwidth spikes from large messages
  Cost: Slower throughput (large messages fill batches faster)
  Use only when: Infrequent large messages AND throughput is not critical

message.max.bytes=50MB+:
  Almost always wrong choice for a message broker
  Use the Claim Check pattern instead:
    - Store payload in S3/GCS/Azure Blob
    - Put reference URL in message (< 1KB)
    - Consumer fetches from S3 when needed
    - S3 is optimized for large object storage; Kafka is not
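
The Claim Check steps above can be sketched with an in-memory map standing in for the object store. Everything here is an assumption for illustration: the class name, the `inline:` and `ref:` prefixes, the 100KB threshold, and the treatment of inline payloads as UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Claim Check: large payloads go to blob storage, and the message carries only a reference. */
final class ClaimCheck {
    static final int INLINE_LIMIT_BYTES = 100 * 1024; // assumed threshold

    // Stand-in for S3/GCS/Azure Blob.
    private final Map<String, byte[]> blobStore = new ConcurrentHashMap<>();

    /** Producer side: returns the message body to publish. */
    String toMessage(byte[] payload) {
        if (payload.length <= INLINE_LIMIT_BYTES) {
            return "inline:" + new String(payload, StandardCharsets.UTF_8);
        }
        String key = UUID.randomUUID().toString();
        blobStore.put(key, payload);
        return "ref:" + key; // small message regardless of payload size
    }

    /** Consumer side: fetches the payload from storage only when the message is a reference. */
    byte[] resolve(String message) {
        if (message.startsWith("ref:")) {
            return blobStore.get(message.substring("ref:".length()));
        }
        return message.substring("inline:".length()).getBytes(StandardCharsets.UTF_8);
    }
}
```

A 200KB payload produces a message of a few dozen bytes, so the broker never carries the large object; the trade-off is that the consumer now depends on the blob store being available.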

10. Synchronous vs Asynchronous Decision Framework

The Full Trade-Off Comparison

| Dimension | Synchronous (HTTP/gRPC) | Asynchronous (Message Queue) |
|---|---|---|
| Latency | Low (direct connection) | Higher (queue overhead: 2–20ms) |
| Coupling | Tight (caller knows callee) | Loose (caller knows event, not consumer) |
| Availability | Caller fails if callee is down | Caller succeeds even if consumer is down |
| Consistency | Easier (immediate feedback) | Harder (eventual consistency) |
| Error handling | Immediate error feedback | Delayed (may discover failure much later) |
| Backpressure | Natural (caller blocked) | Explicit (queue depth, consumer lag) |
| Observability | Simple (request-response trace) | Complex (distributed, multi-hop) |
| Scaling | Scale caller and callee together | Scale independently |
| Testing | Simple (mock the HTTP call) | Complex (requires broker in test) |
| Operational complexity | Low | High (broker management, DLQ, replay) |
| Result delivery | Immediate in response | Webhook, polling, or WebSocket needed |

Decision Framework by Use Case

DEFINITELY synchronous (do not use message queue):
  - User authentication (must know result immediately)
  - Inventory availability check before adding to cart
  - Payment authorization result (user waiting at checkout)
  - Search queries (user waiting for results)
  - Any user interaction requiring sub-500ms response

DEFINITELY asynchronous (message queue):
  - Sending confirmation emails (not in the critical user path)
  - Updating analytics dashboards (minutes of delay acceptable)
  - Generating invoices/reports (can take seconds to minutes)
  - Syncing data to external systems (downstream may be slow)
  - Processing uploaded files (video, large CSVs)
  - Any work item where "accepted, processing" is an acceptable response

DEPENDS (evaluate with the 5 questions from Section 1):
  - Inventory reservation (after purchase)
    → Async with Saga for reservation confirmation
  - Payment processing notification to warehouse
    → Async (warehouse does not need immediate notification)
  - Fraud check during payment
    → Synchronous (must know result before accepting payment)
    → But long-running fraud analysis: async webhook callback

The Hybrid Pattern: Sync Accept, Async Process

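// Best of both worlds for long-running operations.
// OrderValidator, OrderRepository, Order, OrderRequest, OrderPlacedEvent, and OrderAcceptedResponse
// are application types assumed to exist elsewhere; order.getId() is assumed to return a String.
import org.springframework.http.ResponseEntity;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class OrderController {

    private final OrderValidator orderValidator;
    private final OrderRepository orderRepository;
    private final KafkaTemplate<String, OrderPlacedEvent> kafkaTemplate;

    public OrderController(OrderValidator orderValidator, OrderRepository orderRepository,
                           KafkaTemplate<String, OrderPlacedEvent> kafkaTemplate) {
        this.orderValidator = orderValidator;
        this.orderRepository = orderRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @PostMapping("/orders")
    public ResponseEntity<OrderAcceptedResponse> placeOrder(@RequestBody OrderRequest request) {
        // STEP 1: Validate synchronously (fast, < 50ms)
        // Throws if invalid; with an exception handler mapped to 400, the caller is rejected immediately
        orderValidator.validate(request);

        // STEP 2: Create order record synchronously (gives user an ID)
        Order order = orderRepository.save(Order.create(request));  // 10ms

        // STEP 3: Publish event asynchronously (send() returns before the broker acknowledges).
        // The DB write and the publish are not atomic; if the event must never be lost,
        // consider a transactional outbox instead of publishing directly here.
        kafkaTemplate.send("order-events", order.getId(), OrderPlacedEvent.from(order));

        // Return immediately with order ID and tracking URL; user can check status asynchronously
        return ResponseEntity.accepted().body(
            OrderAcceptedResponse.builder()
                .orderId(order.getId())
                .status("PROCESSING")
                .trackingUrl("/orders/" + order.getId())
                .estimatedCompletionSeconds(30)
                .build()
        );
        // Total response time: ~60ms
        // User gets immediate feedback; downstream work happens asynchronously
    }
}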

11. Event-Driven vs Command-Driven Trade-Offs

The Semantic Difference

Command (Imperative): "Do this specific thing"
  - Sent to a specific recipient
  - Recipient is expected to act
  - Sender has intent
  - Example: "ProcessPayment", "SendEmail", "UpdateInventory"
  - Single consumer (one service owns the action)

Event (Declarative): "This thing happened"
  - Published to whoever is interested
  - Publisher does not know or care who reacts
  - Publisher has no intent toward consumers
  - Example: "OrderPlaced", "PaymentProcessed", "InventoryDepleted"
  - Multiple consumers (anyone can react)

Trade-Off Comparison

| Dimension | Command-Driven | Event-Driven |
|---|---|---|
| Coupling | Medium (knows who does the work) | Low (does not know who reacts) |
| Responsibility clarity | High (clear ownership) | Lower (any service can add consumers) |
| Unintended side effects | Low (explicit intent) | Higher (new consumers can be added silently) |
| Choreography | Harder (explicit chains) | Easier (each service reacts independently) |
| Debugging | Easier (follow command chain) | Harder (who consumed this event?) |
| Adding new consumers | Requires changing publisher | No publisher changes needed |
| Removing consumers | Requires changing publisher | Just stop subscribing |
| Saga implementation | Orchestration (explicit) | Choreography (implicit) |

When to Use Each

Command-Driven is better when:
  - Exactly ONE service should handle the operation
  - Failure must be reported back to the publisher
  - You need clear ownership and accountability
  - The operation is a business operation with side effects
  - Example: payment-service.command → ProcessPayment (one owner)

Event-Driven is better when:
  - Multiple services have independent reactions to the same fact
  - You want to add consumers without changing the publisher
  - The publisher should not know what reactions occur
  - Example: OrderPlaced → warehouse + email + analytics (all react independently)

Mixed approach (most real systems):
  - External input to your domain: Commands (explicit, owned)
  - Internal domain state changes published outward: Events (observable)

  OrderService receives PlaceOrderCommand (command from API layer)
  OrderService publishes OrderPlaced event (event for all observers)
  WarehouseService subscribes to OrderPlaced (observes, acts independently)
  PaymentService subscribes to OrderPlaced (observes, acts independently)

12. Consumer Concurrency Trade-Offs

Concurrency Models

Model 1: One consumer thread per partition (Kafka default)
  - Thread A handles Partition 0
  - Thread B handles Partition 1
  - Thread C handles Partition 2

  Pro: Simple, ordering guaranteed within thread
  Con: Concurrency limited by partition count
  Use when: Order matters within a partition (most financial use cases)

Model 2: Async processing pool (multiple threads per partition)
  - Consumer thread polls Partition 0
  - Dispatches to thread pool (10 threads)
  - Thread pool processes concurrently

  Pro: Higher throughput per partition (10x)
  Con: Ordering not guaranteed (multiple threads racing on same partition)
  Con: Offset management is complex (cannot commit until all in-flight done)
  Use when: Processing is IO-bound and order does not matter (analytics, logging)

Model 3: Virtual threads (Java 21+)
  - Each message processed on a virtual thread
  - Virtual threads are lightweight (millions possible)
  - Blocking IO does not waste OS threads

  Pro: Extremely high concurrency for IO-bound processing
  Con: Requires Java 21+
  Con: Still requires careful ordering management
  Use when: High-concurrency, IO-bound processing on Java 21+

Concurrency vs Ordering Trade-Off

High concurrency + strict ordering = impossible
You must choose:

Option A: Strict ordering (lower concurrency)
  - Single-threaded per partition
  - Kafka: 1 consumer instance per partition
  - Throughput: limited by single-thread processing speed
  - Use for: Financial state machines, order processing with dependencies

Option B: High concurrency (no ordering guarantee)
  - Multiple threads or async processing per partition
  - Throughput: scales with thread pool size
  - Use for: Independent events (each can be processed in any order)
  - Use for: Analytics, logging, notification sending (order does not matter)

Option C: Partial ordering (partition-level ordering, parallel across partitions)
  - One thread per partition (parallel partitions, sequential within partition)
  - Throughput: scales with partition count
  - Use for: Per-entity ordering (all order-123 events in order, but independent of order-456)
  - This is the most common correct approach for most systems
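
Option C can be sketched without a broker by giving each partition its own single-threaded lane. This is an in-memory illustration, not a Kafka consumer: `PartitionedProcessor` is a hypothetical class, and `String.hashCode` stands in for the real key-to-partition mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** One single-threaded lane per partition: parallel across keys, sequential per key. */
final class PartitionedProcessor {
    private final List<ExecutorService> lanes = new ArrayList<>();

    PartitionedProcessor(int partitions) {
        for (int i = 0; i < partitions; i++) {
            lanes.add(Executors.newSingleThreadExecutor());
        }
    }

    /** Messages with the same key always go to the same lane, so their order is preserved. */
    void submit(String key, String message, Consumer<String> handler) {
        int lane = Math.floorMod(key.hashCode(), lanes.size());
        lanes.get(lane).execute(() -> handler.accept(message));
    }

    /** Stops accepting work and waits for queued messages; returns false on timeout or interrupt. */
    boolean drain(long timeoutSeconds) {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            for (ExecutorService lane : lanes) {
                if (!lane.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Events for order-123 are handled in submission order even while order-456 is processed concurrently on another lane; throughput grows with the number of lanes, as it does with partition count.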

13. Retention and Storage Trade-Offs

Retention Policy Decision Matrix

| Data Type | Retention Period | Why | Risk of Too Short | Risk of Too Long |
|---|---|---|---|---|
| Financial transactions | 90 days (log) + 7 years (archive) | Regulatory compliance | Regulatory violation | Storage cost (manageable with compression) |
| Order events | 30 days | Bug fix replay + audit | Cannot replay month-old bugs | Storage cost |
| User activity | 7 days | Analytics pipeline + replay | Cannot rebuild analytics from scratch | Privacy compliance (GDPR right to erasure) |
| Application logs | 3–7 days | Debugging recent incidents | Lose context on slow-developing bugs | Storage cost, compliance complexity |
| Metrics/telemetry | 1–3 days | Near-real-time dashboards | Cannot investigate recent incidents | Storage cost |
| Analytics raw events | 7–14 days | ML training data recency | Stale training data | Storage cost |
| System notifications | 1 day | Transient, time-sensitive | Duplicate alerts on replay | Stale notifications delivered |

Size-Based vs Time-Based Retention

Time-based retention (retention.ms):
  - Keeps messages for a fixed time window
  - Predictable deletion schedule
  - Good for: regulatory compliance (time-based SLA); Kafka deletes whole log segments, so treat the value as a minimum, not an exact cutoff
  - Risk: If message rate surges, storage spikes before time-based cleanup

Size-based retention (retention.bytes):
  - Keeps up to a fixed total size per partition
  - Predictable disk usage
  - Good for: disk capacity management
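  - Risk: During traffic surges, messages are deleted sooner than expected (possibly before slow consumers read them)
  - Risk: During low traffic, older messages survive longer than expected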

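Best practice: Use BOTH (whichever limit is reached first triggers cleanup)
  # retention.ms=2592000000     -> 30 days
  # retention.bytes=53687091200 -> 50 GiB per partition
  # localhost:9092 is a placeholder; substitute your own bootstrap server
  kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type topics --entity-name my-topic \
    --add-config retention.ms=2592000000,retention.bytes=53687091200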
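
The two retention values are plain unit arithmetic, and it is worth checking which limit a partition will actually hit first at your expected write rate. The helper names below (`retention_ms`, `retention_bytes`, `first_limit_hit`) are hypothetical, and the model assumes a steady per-partition write rate, which real traffic rarely has.

```python
def retention_ms(days: float) -> int:
    """Convert days to the milliseconds expected by retention.ms."""
    return int(days * 24 * 60 * 60 * 1000)


def retention_bytes(gib: float) -> int:
    """Convert GiB to the bytes expected by retention.bytes (a per-partition limit)."""
    return int(gib * 1024 ** 3)


def first_limit_hit(bytes_per_sec: float, ms_limit: int, bytes_limit: int) -> str:
    """Return which limit triggers cleanup first at a steady per-partition write rate."""
    if bytes_per_sec <= 0:
        return "retention.ms"  # nothing is written, so only time can expire data
    seconds_to_fill = bytes_limit / bytes_per_sec
    window_seconds = ms_limit / 1000
    return "retention.bytes" if seconds_to_fill < window_seconds else "retention.ms"


if __name__ == "__main__":
    ms, size = retention_ms(30), retention_bytes(50)
    print(ms, size)
    print(first_limit_hit(100_000, ms, size))  # roughly 100 KB/s per partition
```

If the size limit wins at normal traffic, the effective replay window is shorter than the configured 30 days, which matters for the "bug fix replay" use case in the matrix above.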

14. Schema Evolution Strategy Trade-Offs

Evolution Strategy Comparison

Strategy | Compatibility | Complexity | Risk Level | Suitable For
--- | --- | --- | --- | ---
JSON, no schema | None (drift freely) | Very Low | Very High | Prototypes, POCs, never production
JSON with documentation | Informal, trust-based | Low | High | Small teams, trusted internal use
JSON Schema + Schema Registry | Formal, enforced | Medium | Medium | Medium teams, multiple consumers
Avro + Schema Registry | Enforced, binary | High | Low | High-performance, many consumers
Protobuf + Schema Registry | Enforced, binary | High | Low | Cross-language, forward compat critical
Consumer-Driven Contracts (Pact) | Contract-tested | High | Low | Many independent teams

Schema Registry Compatibility Modes

BACKWARD compatibility (most common, recommended default):
  - Consumers on the NEW schema CAN read messages written with the OLD schema
  - Consumers on the OLD schema are NOT guaranteed to read messages written with the NEW schema
  - Safe pattern: consumers update first, then producers
  - Changes allowed: add optional fields (with defaults), remove fields
  - Changes NOT allowed: add required fields (no default), rename fields, change field types

FORWARD compatibility:
  - Consumers on the OLD schema CAN read messages written with the NEW schema
  - Consumers on the NEW schema are NOT guaranteed to read messages written with the OLD schema
  - Safe pattern: producers update first, then consumers
  - Changes allowed: add fields (old consumers ignore unknown fields), remove optional fields (those with defaults)
  - Changes NOT allowed: remove required fields, rename fields, change field types

FULL compatibility (most restrictive, safest for long-lived schemas):
  - Both backward AND forward compatible
  - Consumers and producers on adjacent schema versions can read each other's messages
    (FULL_TRANSITIVE extends the check to every registered version)
  - Changes allowed: add or remove optional fields (with defaults)
  - Everything else: blocked
  - Use for: Public APIs, schemas shared across many teams

NONE (no compatibility checking):
  - Any change allowed
  - Producer and consumer must be deployed atomically
  - Use for: Development only, never production
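
The add/remove rules behind these modes reduce to one question: can a reader schema fill in every field the writer did not send? The sketch below models a schema as a dict of field name to "has a default" and deliberately ignores type changes, renames, and aliases, so it is a simplified teaching model and not how any particular registry implements its checks. `can_read` and `is_compatible` are hypothetical names.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """True if data written with `writer` can be decoded using `reader`.

    Schemas map field name -> True if the field has a default value.
    Fields the writer sent but the reader does not know are simply ignored.
    Fields the reader expects but the writer omitted must have a default.
    """
    return all(has_default for name, has_default in reader.items() if name not in writer)


def is_compatible(old: dict, new: dict, mode: str) -> bool:
    """Check a schema change against BACKWARD, FORWARD, FULL, or NONE."""
    backward = can_read(reader=new, writer=old)  # new consumers, old data
    forward = can_read(reader=old, writer=new)   # old consumers, new data
    if mode == "BACKWARD":
        return backward
    if mode == "FORWARD":
        return forward
    if mode == "FULL":
        return backward and forward
    if mode == "NONE":
        return True
    raise ValueError(f"unknown compatibility mode: {mode}")


if __name__ == "__main__":
    old = {"id": False, "amount": False}
    add_optional = {"id": False, "amount": False, "currency": True}
    remove_required = {"id": False}
    print(is_compatible(old, add_optional, "FULL"))         # adding an optional field
    print(is_compatible(old, remove_required, "BACKWARD"))  # removing a field
    print(is_compatible(old, remove_required, "FORWARD"))   # old readers still need "amount"
```

Under this model, adding an optional field passes every mode, while removing a required field passes BACKWARD but fails FORWARD and FULL.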

15. Full Architectural Decision Trees

Decision Tree 1: Message Broker Selection

START: I need to communicate between services asynchronously

→ Is this for a single cloud provider environment?
    YES → Am I on AWS?
            YES → High throughput or event sourcing needed?
                    YES → AWS MSK (managed Kafka)
                    NO  → AWS SQS (managed queue, much simpler and cheaper)
          Am I on GCP?
            YES → Google Cloud Pub/Sub (managed, native integration)
          Am I on Azure?
            YES → Azure Service Bus (queues and topics) or Azure Event Hubs (streaming, Kafka-compatible endpoint)
    NO  → Continue...

→ Do I need message replay (re-read past messages)?
    YES → Kafka (log-based storage makes replay a first-class feature; traditional queues delete messages once acknowledged)
    NO  → Continue...

→ Do I need complex routing (by content, headers, priority)?
    YES → RabbitMQ (richest routing model)
    NO  → Continue...

→ Do I need > 500,000 messages/second throughput?
    YES → Kafka (RabbitMQ is hard to sustain at this rate; benchmark before committing to it)
    NO  → Continue...

→ Do I need stream processing (joins, aggregations, windowed operations)?
    YES → Kafka + Kafka Streams or Flink
    NO  → Continue...

→ Do I value operational simplicity over features?
    YES → SQS (cloud) or RabbitMQ (self-hosted) - much simpler to operate
    NO  → Kafka provides the most power at the cost of complexity

Default recommendation:
  Small teams, simple use cases: → SQS (cloud) or RabbitMQ (self-hosted)
  Large teams, event-driven architecture, analytics: → Kafka

Decision Tree 2: Partition Key Selection

START: What partition key should I use for this Kafka topic?

→ Do messages need strict ordering?
    NO → Use null key (producer spreads records across partitions: round-robin in older clients,
         sticky batching since Kafka 2.4 - highest throughput)
    YES → Continue...

→ What is the unit of ordering?
    "All events for the same ORDER": → Use orderId as partition key
    "All events for the same USER":  → Use userId as partition key
    "All events for the same ENTITY": → Use entityId as partition key

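→ Is there a risk of hot partitions (one key has >> traffic than others)?
    YES → Is the hot key predictable (known large tenants/merchants)?
            YES → Custom partitioner with multiple slots for hot keys
            NO  → Compound key (entityId + random bucket 0-3)
          Caveat: both options spread one entity over several partitions, so ordering
          for that entity holds only within a slot/bucket, not across them
    NO  → Simple entity ID hash is fine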

→ Are there any keys that MUST be processed in sequence relative to each other?
    YES (e.g., all payment events for a user must be in order across order IDs):
         → Use a higher-level key (userId) that groups all related entities
    NO  → Entity-level key is sufficient
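
The compound-key branch of this tree can be sketched in a few lines. This is a minimal illustration, assuming a `#` separator that never appears in entity IDs; `bucketed_key` and `entity_of` are hypothetical helper names, not part of any Kafka client.

```python
import random

HOT_KEY_BUCKETS = 4  # matches the "random bucket 0-3" in the tree above


def bucketed_key(entity_id: str, buckets: int = HOT_KEY_BUCKETS, rng=random) -> str:
    """Build a compound partition key such as 'merchant-42#3'.

    Spreading one entity over several keys relieves a hot partition,
    but ordering for that entity then holds only within a single bucket.
    """
    return f"{entity_id}#{rng.randrange(buckets)}"


def entity_of(key: str) -> str:
    """Recover the entity ID from a compound key on the consumer side."""
    return key.rsplit("#", 1)[0]


if __name__ == "__main__":
    key = bucketed_key("merchant-42")
    print(key, "->", entity_of(key))
```

If events for one entity must stay strictly ordered, derive the bucket from something stable (for example the order ID within a merchant) instead of a random number, so related events always share a bucket.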

Decision Tree 3: Delivery Semantics Selection

START: What delivery guarantee do I need?

→ Can I afford to lose some messages?
    YES (metrics, telemetry, analytics sampling):
         → At-most-once
         → acks=0 (no acknowledgment) or acks=1 (leader only)
         → Auto-commit BEFORE processing
         → Highest throughput
    NO → Continue...

→ Can I make my consumer idempotent (safe to process twice)?
    YES (most well-designed consumers):
         → At-least-once with idempotent consumer
         → acks=all, retries enabled
         → Manual ack AFTER processing
         → Deduplication by event ID in consumer
         → Best balance of reliability and simplicity

    NO (consumer has unavoidable side effects that cannot be deduplicated):
         → Consider redesigning for idempotency first
         → Is the consumer only reading from Kafka and writing back to Kafka?
              → Exactly-once via Kafka transactions (high cost)
              → Transactions cover Kafka reads and writes only, not external databases or APIs
         → Is the consumer calling an external API?
              → Pass idempotency key to external API (Stripe, Twilio support this)
              → This gives you exactly-once EFFECT without exactly-once DELIVERY
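
The "at-least-once with idempotent consumer" branch can be sketched as a thin wrapper that remembers which event IDs it has already handled. This is a minimal in-memory illustration: the `event_id` field and the `IdempotentConsumer` class are assumptions, and a production version would keep the processed IDs in a durable store, ideally updated in the same transaction as the side effect.

```python
class IdempotentConsumer:
    """Wrap a handler so that redelivered events are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._processed_ids = set()  # in-memory only; use a durable store in production

    def handle(self, event: dict) -> bool:
        """Process the event unless its ID was seen before. Returns True if processed."""
        event_id = event["event_id"]  # assumed field carrying a unique event ID
        if event_id in self._processed_ids:
            return False  # duplicate delivery: skip the side effect
        self._handler(event)
        # Record the ID only after the handler succeeds, so a failed attempt is retried.
        self._processed_ids.add(event_id)
        return True


if __name__ == "__main__":
    sent = []
    consumer = IdempotentConsumer(lambda e: sent.append(e["event_id"]))
    event = {"event_id": "evt-1", "type": "OrderPlaced"}
    consumer.handle(event)
    consumer.handle(event)  # redelivery after a retry or rebalance
    print(sent)
```

Recording the ID after the handler keeps failed events retryable, at the cost of a small window where a crash between the two steps still yields a duplicate; closing that window is what the shared transaction or an external idempotency key is for.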
