Message Queues - Supplement 3: Trade-Offs & Decision Guide
Every architectural decision involves trade-offs. This guide makes those trade-offs explicit,
concrete, and actionable. Each section answers "when should I choose A over B?" with
criteria, comparison tables, and real-world context.
Table of Contents
- When to Use Message Queues (and When NOT to)
- Where in Your Architecture to Place Message Queues
- Technology Selection: Full Decision Matrix
- Kafka vs RabbitMQ: Deep Comparison
- Cloud-Managed vs Self-Hosted Trade-Offs
- Delivery Semantics Trade-Off Table
- Queue vs Topic vs Stream: The Full Decision
- Partition Strategy Trade-Offs
- Message Size Trade-Offs
- Synchronous vs Asynchronous Decision Framework
- Event-Driven vs Command-Driven Trade-Offs
- Consumer Concurrency Trade-Offs
- Retention and Storage Trade-Offs
- Schema Evolution Strategy Trade-Offs
- Full Architectural Decision Trees
1. When to Use Message Queues (and When NOT to)
Apply Message Queues When...
| Scenario | Reason Message Queue is Right | Example |
|---|---|---|
| Work takes longer than a caller can wait | Caller gets immediate response; work continues in background | Video transcoding, report generation |
| Multiple downstream services need the same event | Fan-out without coupling to each downstream | Order placed → warehouse + email + analytics |
| Downstream service is slower than upstream | Queue absorbs traffic spikes; downstream processes at its own pace | Payment processor maxes at 500/sec; orders arrive at 2000/sec during sales |
| Downstream service availability affects upstream | Decouple: downstream being down does not fail the upstream call | Email service down should not fail order placement |
| Workload can be parallelized | Add consumers to scale processing without changing producers | Batch image resizing, log processing |
| Retry/reliability without caller involvement | Failed downstream processing retried automatically | Webhook delivery, external API calls |
| Audit trail of all events needed | Topic log provides ordered, replayable event history | Financial transactions, compliance |
| Services are in different teams/deployment cycles | Event contracts are stable; teams deploy independently | Microservices with different release schedules |
Do NOT Use Message Queues When...
| Scenario | Why Message Queue is Wrong | Better Alternative |
|---|---|---|
| Caller needs the response to continue | Async adds latency with no benefit; caller blocks anyway | Synchronous HTTP/gRPC |
| Operation must be atomic across services | Queue does not provide distributed transactions | Saga pattern (if unavoidable) or monolith |
| System is a simple CRUD application | Unnecessary complexity for straightforward request/response | REST API |
| You need to query current state | Queues are not databases; querying is wrong tool | Database query or cache |
| Real-time user interaction required | Queue latency is too high for sub-100ms user interactions | WebSockets, Server-Sent Events |
| Two services that could be the same service | Unnecessary inter-service communication | Merge into one service |
| Request volume is low and predictable | Queue overhead not worth it for < 10 req/sec | Direct HTTP call |
| All consumers must respond before proceeding | Synchronous fan-out is semantically required | Request-scatter-gather via HTTP |
The Decision Question
Ask these questions in order:
1. Does the caller need the result to proceed?
YES → Use synchronous HTTP/gRPC
NO → Consider async. Continue to question 2.
2. Can the operation fail and be retried later without the caller knowing?
YES → Message queue is a good fit
NO → Consider: is this actually two concerns mixed together?
3. Are there multiple consumers that all need this event independently?
YES → Message queue (pub-sub / topic model) is the right tool
NO → Consider: could a direct call handle this? If so, simpler is better.
4. Is the downstream significantly slower or less reliable than upstream?
YES → Message queue provides buffering and decoupling
NO → Direct call might be simpler; evaluate overhead vs benefit.
5. Do you need event replay or time-travel for debugging/analytics?
YES → Kafka (event log) is specifically designed for this
NO → Any queue suffices.
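The five questions can be read as a short-circuiting function. The sketch below is one possible encoding, not a definitive rule engine; the `Workload` fields and `Recommendation` names are hypothetical labels introduced here for illustration.

```java
/** Hypothetical encoding of the five decision questions; all names are illustrative. */
final class QueueDecision {
    enum Recommendation { SYNCHRONOUS_CALL, DIRECT_CALL_LIKELY_SIMPLER, MESSAGE_QUEUE, EVENT_LOG }

    /** Answers to the five questions, in order. */
    static final class Workload {
        boolean callerNeedsResult;      // Question 1
        boolean retryableWithoutCaller; // Question 2
        boolean multipleConsumers;      // Question 3
        boolean downstreamSlowOrFlaky;  // Question 4
        boolean needsReplay;            // Question 5
    }

    static Recommendation recommend(Workload w) {
        if (w.callerNeedsResult) {
            return Recommendation.SYNCHRONOUS_CALL; // Question 1 answered YES ends the walk
        }
        boolean queueHelps = w.retryableWithoutCaller || w.multipleConsumers || w.downstreamSlowOrFlaky;
        if (!queueHelps && !w.needsReplay) {
            return Recommendation.DIRECT_CALL_LIKELY_SIMPLER; // questions 2 to 5 were all NO
        }
        // Question 5 decides between an event log and any ordinary queue.
        return w.needsReplay ? Recommendation.EVENT_LOG : Recommendation.MESSAGE_QUEUE;
    }
}
```

A workload with several independent consumers and no replay requirement comes out as `MESSAGE_QUEUE`; setting `needsReplay` moves it to `EVENT_LOG`.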
2. Where in Your Architecture to Place Message Queues
Placement Patterns
Pattern A: Edge Buffering (Input)
External clients (high, bursty)
↓
[API Gateway]
↓
Message Queue ← Buffer here
↓
Internal Services (capacity-limited)
When to use: External traffic is unpredictable. Internal processing has finite capacity (database writes, payment gateway calls). You need to absorb traffic spikes without rejecting requests.
Trade-off: Adds latency for the user (job accepted but not completed). Requires async result delivery (webhook, polling, WebSocket).
Pattern B: Service Decoupling (Middle)
Service A → Message Queue → Service B
↓
Service C
↓
Service D
When to use: Service A is owned by one team; services B, C, D by other teams. Deployment independence is important. Services B, C, D can be added/removed without changing Service A.
Trade-off: Eventual consistency. Service A cannot know what state Services B/C/D are in. Debugging requires distributed tracing across the queue boundary.
Pattern C: Event Backbone (Architecture-Wide)
All domain services → publish to central event log (Kafka)
All domain services → subscribe to events they care about
Event log becomes the single source of truth for "what happened"
When to use: Event sourcing architecture. Regulatory audit requirements (every event logged). Multiple consumers per domain event. Analytics and ML training data pipeline from production events.
Trade-off: Event schema evolution is critical and difficult. All teams must coordinate on event contracts. Higher operational complexity (Schema Registry, replication, ordering policies).
Pattern D: Work Distribution (Compute Tier)
Job Submission API
↓
[Job Queue]
↓
┌───────────────┐
│ Worker Pool │
│ (elastic) │
└───────────────┘
When to use: Compute-intensive tasks (ML inference, video processing, report generation). Work items are independent (can be parallelized). Need to scale workers based on queue depth.
Trade-off: Worker scaling adds latency. Job failures must be handled (DLQ, retry policies). Result delivery to the original caller requires a separate mechanism (callback URL, polling).
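Scaling workers on queue depth usually reduces to a clamped ceiling division. The following is a minimal sketch under assumed parameters; `minWorkers`, `maxWorkers`, and `targetBacklogPerWorker` are hypothetical tuning knobs, not settings of any particular autoscaler.

```java
/** Sketch: size an elastic worker pool from queue depth. All thresholds are assumptions. */
final class WorkerAutoscaler {
    private final int minWorkers;
    private final int maxWorkers;
    private final int targetBacklogPerWorker;

    WorkerAutoscaler(int minWorkers, int maxWorkers, int targetBacklogPerWorker) {
        if (minWorkers < 0 || maxWorkers < minWorkers || targetBacklogPerWorker <= 0) {
            throw new IllegalArgumentException("invalid autoscaler bounds");
        }
        this.minWorkers = minWorkers;
        this.maxWorkers = maxWorkers;
        this.targetBacklogPerWorker = targetBacklogPerWorker;
    }

    /** Workers needed so that each one holds at most targetBacklogPerWorker queued jobs. */
    int desiredWorkers(long queueDepth) {
        long needed = (queueDepth + targetBacklogPerWorker - 1) / targetBacklogPerWorker; // ceiling division
        return (int) Math.max(minWorkers, Math.min(maxWorkers, needed));
    }
}
```

The upper bound matters as much as the lower one: it protects whatever the workers call (database, payment gateway) from a scale-out stampede.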
The "Should I Put a Queue Here?" Decision for Each Service Boundary
For each service-to-service call, evaluate:
Latency requirement:
< 100ms → probably synchronous HTTP
100ms–2s → either can work; queue adds resilience
> 2s (background work) → queue almost always correct
Dependency direction:
A depends on B for its core function → HTTP (tight coupling acceptable)
A does something that B needs to know about → Queue (B observes A)
Failure semantics:
"If B is down, A should fail immediately" → HTTP
"If B is down, A should continue and B will catch up" → Queue
Traffic pattern:
Steady, predictable → HTTP is fine
Bursty, spiky → Queue for buffering
Batch (overnight processing) → Queue almost always correct
3. Technology Selection: Full Decision Matrix
| Requirement | RabbitMQ | Kafka | AWS SQS | Redis Streams | Google Pub/Sub | Azure Service Bus |
|---|---|---|---|---|---|---|
| Message ordering | Per-queue (not partitioned) | Strong (per partition) | FIFO queues only | Per stream | Per ordering key (opt-in); best-effort otherwise | Session-based |
| Throughput | ~50K msg/sec/broker | >1M msg/sec/broker | Nearly unlimited (standard); ~3K msg/sec with batching (FIFO) | ~1M msg/sec | ~1M msg/sec | ~1M msg/sec |
| Message replay | No (consumed = gone) | Yes (retention-based) | No | Limited | Limited | No |
| Push vs Pull | Push (broker pushes) | Pull (consumer polls) | Pull (client polls) | Pull | Push + Pull | Push + Pull |
| Routing flexibility | Excellent (exchanges) | Limited (topic-based) | None | None | Attribute filtering | Topic/subscription |
| Exactly-once | Not natively | Yes (transactions) | No | No | Yes (per region) | Yes |
| Self-hosted | Yes | Yes | No (AWS only) | Yes | No (GCP only) | No (Azure only) |
| Managed service | CloudAMQP, Amazon MQ | Confluent, MSK, Aiven | Native | Redis Cloud | Native | Native |
| Operational complexity | Medium | High | Very Low | Low | Very Low | Very Low |
| Best for | Task queues, routing | Event log, analytics, stream processing | Simple cloud queuing | Simple streams, low infra | Cloud-native event bus | Enterprise .NET/Java |
4. Kafka vs RabbitMQ: Deep Comparison
This is the most commonly debated technology choice. Here is the complete picture:
Architecture Philosophy Difference
RabbitMQ: Smart broker, dumb consumers
- Broker maintains subscriber state
- Broker routes messages to consumers
- Broker tracks what each consumer has received
- Consumers are stateless receivers
- Message consumed = removed from queue
Kafka: Dumb broker, smart consumers
- Broker just stores messages in an ordered log
- Consumers track their own position (offset)
- Consumers pull messages at their own pace
- Consumers can re-read messages at any time
- Messages persist until retention policy expires
This philosophical difference drives ALL the other trade-offs.
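The difference is easiest to see in miniature. The classes below are in-memory stand-ins written for this guide, not how either broker is implemented: one forgets a message once it is handed out, the other keeps the log and lets each consumer group own an offset.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** In-memory stand-ins for the two models; illustrative only. */
final class BrokerModels {
    /** RabbitMQ-style queue: the broker hands out a message and then forgets it. */
    static final class DestructiveQueue {
        private final Queue<String> messages = new ArrayDeque<>();

        void publish(String message) { messages.add(message); }

        /** Returns null when empty; a consumed message is gone. */
        String consume() { return messages.poll(); }
    }

    /** Kafka-style log: messages stay put and each consumer group tracks its own offset. */
    static final class OffsetLog {
        private final List<String> log = new ArrayList<>();
        private final Map<String, Integer> offsets = new HashMap<>();

        void publish(String message) { log.add(message); }

        /** Returns the next unread message for the group, or null when caught up. */
        String poll(String group) {
            int offset = offsets.getOrDefault(group, 0);
            if (offset >= log.size()) {
                return null;
            }
            offsets.put(group, offset + 1);
            return log.get(offset);
        }

        /** Replay: move the group's position back (or forward) to any offset. */
        void seek(String group, int offset) { offsets.put(group, offset); }
    }
}
```

Replay in the log model is just `seek` followed by `poll`; the queue model has no equivalent because nothing is left to re-read.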
Use Case Decision Table
| Use Case | Choose RabbitMQ | Choose Kafka |
|---|---|---|
| Task queue (each task done once) | ✅ Designed for this | ⚠️ Works but overkill |
| Job scheduling with priority | ✅ Native priority queues | ❌ No native priority |
| Complex routing (by content/headers) | ✅ Topic/Header exchanges | ❌ Limited routing |
| Request-Reply pattern | ✅ Reply-to header support | ⚠️ Possible but complex |
| Event sourcing / audit log | ❌ Messages not retained | ✅ Designed for this |
| Stream processing (Kafka Streams) | ❌ No stream processing | ✅ Kafka Streams, ksqlDB |
| Replay past events for new service | ❌ Impossible after consume | ✅ Replay from any offset |
| Analytics data pipeline | ❌ Not designed for volume | ✅ Built for petabyte scale |
| IoT data ingestion (millions/sec) | ❌ Throughput insufficient | ✅ Handles millions/sec |
| Simple microservice messaging | ✅ Easy to set up | ⚠️ Complex for simple use |
| Multiple consumers per message | ✅ Fan-out exchanges | ✅ Consumer groups |
| Exactly-once across services | ❌ No native support | ✅ Kafka transactions |
| Message TTL / expiry | ✅ Native support | ⚠️ Via retention policies |
| Team learning curve | ✅ Easier | ❌ Steeper curve |
| Operational simplicity | ✅ Simpler | ❌ More complex (ZK/KRaft) |
Performance Trade-Off
RabbitMQ:
Write latency: 1-5ms (very low)
Throughput: 50,000-100,000 messages/sec per broker
Memory footprint: Higher (tracks per-consumer state)
Disk usage: Lower (messages deleted after consumption)
Good for: < 1 million messages/day, latency-sensitive tasks
Kafka:
Write latency: 2-10ms (tunable: lower with acks=1, higher with acks=all)
Throughput: 1,000,000+ messages/sec per broker
Memory footprint: Lower (stateless broker, consumers track state)
Disk usage: Higher (messages retained, not deleted)
Good for: Billions of messages/day, analytics pipelines, event sourcing
When Teams Pick the Wrong One
Common mistake 1: Using Kafka for a simple task queue
Symptom: "We have Kafka but only 3 consumers and 1000 events/day"
Problem: Kafka's complexity (partitions, replication, schema registry) adds
operational overhead for no benefit at this scale
Solution: RabbitMQ or even SQS would have been simpler and sufficient
Common mistake 2: Using RabbitMQ for event sourcing
Symptom: "We need to replay last month's events for our new analytics service"
Problem: RabbitMQ messages are gone after consumption - cannot replay
Solution: Kafka (or rebuild from DB with Debezium CDC)
Common mistake 3: Using Kafka for complex routing
Symptom: "We want to route messages based on 5 header attributes to different queues"
Problem: Kafka has no native routing - must be done in application code
Solution: RabbitMQ's Topic or Headers exchange handles this natively
5. Cloud-Managed vs Self-Hosted Trade-Offs
Trade-Off Matrix
| Dimension | Self-Hosted (Kafka/RabbitMQ) | Managed (Confluent/MSK/SQS/Amazon MQ) |
|---|---|---|
| Upfront cost | Low ($) | None |
| Ongoing cost | Medium (infra + ops time) | High (per-message + per-hour pricing) |
| Cost at scale | Lower (EC2 is cheaper than managed) | Can be very expensive at petabyte scale |
| Operational burden | High (upgrades, disk, replication) | Very Low (vendor handles) |
| Feature availability | All features, any version | Vendor-supported versions only (often lags upstream releases) |
| Customization | Full control | Limited to vendor's config options |
| Vendor lock-in | None (open source) | Medium-High (proprietary APIs/features) |
| Time to production | Weeks (setup, hardening) | Hours |
| Disaster recovery | Self-managed (MirrorMaker, backups) | Vendor-provided (often across AZs) |
| Compliance (FedRAMP, etc.) | You certify | Vendor often pre-certified |
| SLA | You build it | Vendor provides (99.9–99.99%) |
Cost Breakdown Examples
Scenario: 100 million messages/day, 1KB each = 100GB/day
AWS MSK (managed Kafka):
Broker cost: 3 × m5.xlarge = $0.192/hr × 3 × 720hr = ~$415/month
Storage: 100GB/day × 7 day retention = 700GB × $0.10/GB = $70/month
Data transfer: 100GB/day × 30 days × $0.02/GB = $60/month
Total: ~$545/month
Self-hosted Kafka (AWS EC2):
Broker cost: 3 × m5.xlarge = $415/month
EBS storage: 700GB × $0.10 = $70/month
PLUS operations time: 0.5 FTE × $150K/year = ~$6,250/month
Total: ~$6,735/month ← HIGHER despite cheaper infrastructure!
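AWS SQS (managed queue):
Send requests: 100 million/day × 30 days = 3,000 million × $0.40/million = ~$1,200/month
Receive and delete calls are billed as requests too: ~3 requests per message = ~$3,600/month unbatched
With full 10-message batches: ~900 million requests = ~$360/month
Total: ~$360–$3,600/month depending on batching, with no brokers to operate
Confluent Cloud (managed Kafka):
Estimated at roughly $8,000/month for this workload
(Confluent pricing is expensive at scale but rich in features)
Lesson: For simple queue use cases, SQS has the lowest operational burden and is cheapest at low volume;
at 100 million messages/day its per-request pricing stays competitive only with batching.
For event log/analytics at medium scale, MSK balances cost and operations.
Self-hosted only wins at petabyte scale where infrastructure cost dominates ops cost.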
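The MSK and self-hosted totals above come from three small formulas, which are worth keeping in a scratch file so the comparison can be re-run with your own prices. This is a sketch using the unit prices quoted in the scenario; the class and method names are hypothetical, and real bills include items not modelled here.

```java
/** Reproduces the MSK and self-hosted arithmetic from the scenario; prices are the scenario's, not live quotes. */
final class MonthlyCost {
    static final double HOURS_PER_MONTH = 720;

    /** Broker instances billed per hour. */
    static double brokers(int count, double hourlyRate) {
        return count * hourlyRate * HOURS_PER_MONTH;
    }

    /** Retained data billed per GB-month. */
    static double storage(double gbPerDay, int retentionDays, double pricePerGbMonth) {
        return gbPerDay * retentionDays * pricePerGbMonth;
    }

    /** Data transfer billed per GB over a 30-day month. */
    static double transfer(double gbPerDay, double pricePerGb) {
        return gbPerDay * 30 * pricePerGb;
    }

    /** Engineering time, as a fraction of a full-time salary. */
    static double opsTime(double fte, double annualSalary) {
        return fte * annualSalary / 12;
    }

    public static void main(String[] args) {
        double infra = brokers(3, 0.192) + storage(100, 7, 0.10);
        double msk = infra + transfer(100, 0.02);
        double selfHosted = infra + opsTime(0.5, 150_000);
        System.out.printf("MSK: $%.2f, self-hosted: $%.2f%n", msk, selfHosted);
    }
}
```

The ops-time term is the one that dominates at this scale, which is why the self-hosted total is an order of magnitude higher despite identical instance and disk prices.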
6. Delivery Semantics Trade-Off Table
The Three Guarantees Explained
| Semantic | Message Delivered | Duplicates Possible | Data Loss Possible | Default For | Suitable For |
|---|---|---|---|---|---|
| At-most-once | 0 or 1 times | No | Yes | Analytics, metrics | Sensor readings, telemetry, non-critical events |
| At-least-once | 1 or more times | Yes | No | All major brokers | Orders, notifications, emails (with idempotent consumer) |
| Exactly-once | Exactly 1 time | No | No | Nowhere (requires special setup) | Financial transactions, billing, inventory deductions |
Implementation Cost of Each Semantic
At-most-once:
Producer: Fire and forget (no acknowledgment wait)
Consumer: Auto-commit before processing
Extra infrastructure: None
Relative cost: 1× (baseline)
Throughput impact: Highest (no waiting for ACKs)
At-least-once:
Producer: acks=1 or acks=all + retries enabled
Consumer: Manual ack after successful processing
Extra infrastructure: DLQ for failed messages
Relative cost: 2× (acknowledgment overhead)
Throughput impact: Medium
Exactly-once (Kafka):
Producer: enable.idempotence=true + transactional.id set + acks=all
Consumer: isolation.level=read_committed + transactional processing
Broker: Kafka transactions enabled, min.insync.replicas=2
Consumer: Idempotent processing logic (DB deduplication) as backup
Extra infrastructure: Kafka transactions, transactional DB
Relative cost: 5–10× (transaction coordination overhead)
Throughput impact: Significant reduction (20-50% vs at-least-once)
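The producer and consumer settings listed for each semantic can be collected as plain `Properties`. This is a sketch only: the keys are Kafka client configuration names, but connection and serializer settings are omitted, the retry count is an illustrative choice, and the `DeliveryConfigs` class itself is hypothetical.

```java
import java.util.Properties;

/** Client settings per delivery semantic; bootstrap servers and serializers are left out. */
final class DeliveryConfigs {
    static Properties atMostOnceProducer() {
        Properties p = new Properties();
        p.setProperty("acks", "0");    // fire and forget
        p.setProperty("retries", "0"); // a retry could duplicate, so never retry
        return p;
    }

    static Properties atLeastOnceProducer() {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE)); // illustrative: retry until delivery timeout
        return p;
    }

    static Properties atLeastOnceConsumer() {
        Properties p = new Properties();
        p.setProperty("enable.auto.commit", "false"); // commit manually after successful processing
        return p;
    }

    static Properties exactlyOnceProducer(String transactionalId) {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("enable.idempotence", "true");
        p.setProperty("transactional.id", transactionalId);
        return p;
    }

    static Properties exactlyOnceConsumer() {
        Properties p = new Properties();
        p.setProperty("enable.auto.commit", "false");
        p.setProperty("isolation.level", "read_committed"); // skip records from aborted transactions
        return p;
    }
}
```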
When Exactly-Once is NOT Worth It
Myth: "We need exactly-once delivery for financial transactions"
Reality: "We need exactly-once PROCESSING, which is different"
Exactly-once delivery is a property of the MESSAGE TRANSPORT.
Exactly-once processing is achieved by making your CONSUMER idempotent.
The idempotent consumer approach:
- Uses at-least-once delivery (simpler, faster, more available)
- Consumer records processed event IDs in a DB with unique constraint
- If duplicate arrives: DB unique constraint prevents re-processing
- Net result: exactly-once PROCESSING without exactly-once DELIVERY
This is simpler, cheaper, and has better availability than Kafka transactions.
Kafka transactions are appropriate when:
- You consume from one Kafka topic AND produce to another Kafka topic
- Both operations must succeed or fail together atomically
- Example: Read order event → compute → write to analytics topic (atomic)
Kafka transactions are NOT appropriate for:
- Consumer → Database (use DB transactions + idempotency key instead)
- Consumer → External API (external APIs are not part of Kafka's transaction boundary)
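The idempotent consumer approach described above can be sketched in a few lines. In this illustration an in-memory set stands in for the database table with a unique constraint on the event ID; the class name and method shape are assumptions, and a production version would record the ID and apply the side effect in the same DB transaction.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Sketch: exactly-once processing on top of at-least-once delivery. */
final class IdempotentConsumer {
    // Stand-in for a DB table with a UNIQUE constraint on event_id.
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Returns true if the payload was processed, false if the event ID was a duplicate. */
    boolean onMessage(String eventId, String payload) {
        if (!processedEventIds.add(eventId)) {
            return false; // duplicate delivery: the "unique constraint" rejected it
        }
        try {
            handler.accept(payload);
            return true;
        } catch (RuntimeException e) {
            processedEventIds.remove(eventId); // forget the ID so a redelivery can retry
            throw e;
        }
    }
}
```

Delivering the same event ID twice runs the handler once; the second call returns `false` and can be acknowledged without side effects.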
7. Queue vs Topic vs Stream: The Full Decision
Definitions Clarified
Queue:
- One message → consumed by exactly ONE consumer
- Message is deleted after consumption
- Consumer position tracked by broker
- Examples: RabbitMQ queue, AWS SQS, Azure Service Bus queue
Topic (Message Bus):
- One message → consumed by ALL subscribed consumers
- Each subscriber gets its own copy
- Message retained for subscription duration
- Examples: AWS SNS, Google Pub/Sub, Kafka topics (with multiple consumer groups)
Stream (Event Log):
- Ordered, persistent log of records
- Multiple consumers can read the same records at different positions
- Records not deleted after consumption (retention-based)
- Consumers manage their own position (offset)
- Examples: Kafka, AWS Kinesis, Redis Streams
Decision Criteria
Use a Queue when:
✓ Each piece of work should be done by exactly one worker
✓ Work items are independent of each other
✓ You want automatic load distribution across workers
✓ Messages do not need to be replayed
✓ Example: Email sending queue, job processing queue
Use a Topic (Pub-Sub) when:
✓ Multiple services all need to know when something happened
✓ Publishers should not need to know who the subscribers are
✓ Adding a new consumer does not change the producer
✓ Subscribers have different interests/filters on the same event stream
✓ Example: OrderPlaced event → (warehouse, email, analytics all subscribe)
Use a Stream (Event Log) when:
✓ You need to replay past events (new service joining, bug fix replay)
✓ Events need to be processed in strict order
✓ Multiple independent consumers at different positions in time
✓ Long-term retention for compliance or analytics
✓ You want to "rewind" and re-process history
✓ Example: Financial transaction log, change data capture, user activity feed
When a Queue AND a Topic Together Make Sense
Pattern: SNS → SQS Fan-Out (AWS)
            SNS Topic (guarantees fan-out to all queues)
                │
    ┌───────────┼───────────┐
    ↓           ↓           ↓
SQS Queue   SQS Queue   SQS Queue
(warehouse)  (email)   (analytics)
Each service gets its own queue:
- Slow analytics consumer does not block warehouse consumer
- Each queue has its own DLQ
- Each queue has its own retry policy
- Adding a new consumer = add a new SQS queue + subscribe to SNS
This is the right pattern for fan-out with independent consumers
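The mechanics of the fan-out can be shown without any AWS calls. The class below is an in-memory sketch with hypothetical names: every subscriber owns a queue, `publish` copies the message into each one, and draining one queue does not touch the others.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** In-memory sketch of topic-to-queue fan-out; it does not talk to SNS or SQS. */
final class FanOutTopic {
    private final Map<String, Queue<String>> queuesBySubscriber = new LinkedHashMap<>();

    /** Adding a consumer means adding a queue; the publisher is not changed. */
    Queue<String> subscribe(String subscriberName) {
        return queuesBySubscriber.computeIfAbsent(subscriberName, name -> new ArrayDeque<>());
    }

    /** Each subscribed queue receives its own copy of the message. */
    void publish(String message) {
        for (Queue<String> queue : queuesBySubscriber.values()) {
            queue.add(message);
        }
    }
}
```

A slow analytics consumer simply accumulates backlog in its own queue while the warehouse queue drains at full speed.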
8. Partition Strategy Trade-Offs
Choosing a Partition Key
| Partition Key Choice | Ordering Guarantee | Distribution | Hotspot Risk | Use When |
|---|---|---|---|---|
| Entity ID (orderId, userId) | All events for same entity ordered | Even (if IDs are random/UUID) | Low | Entity-level ordering required (most common) |
| User ID | All events for same user ordered | Even | Low for UUID; High if Pareto distribution | User activity streams |
| Tenant ID | All events for same tenant ordered | Can be hot for large tenants | High if tenant sizes vary greatly | Multi-tenant with equal tenants |
| Geographic region | Regional ordering | May be uneven | Possible (dense regions) | Regional data sovereignty |
| Custom hash bucket | Within-bucket ordering | Controlled | Low (by design) | Known hot entities needing distribution |
| Round-robin (null key) | No ordering guarantee | Perfect | None | Pure throughput, no ordering needed |
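Two rows of the table are easier to follow with code: plain key hashing and the custom hash bucket for hot entities. This is a sketch under stated assumptions: Kafka's default partitioner hashes the serialized key with murmur2, and `String.hashCode` is used here only as a stand-in; the `#` suffix scheme for buckets is hypothetical.

```java
/** Sketch of key-to-partition mapping and hot-key bucketing; hashing is a stand-in for the broker's. */
final class PartitionKeys {
    /** Same key, same partition: this is what gives per-entity ordering. */
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /**
     * Custom hash bucket: spreads one hot entity across several sub-keys.
     * Ordering then holds within each bucket, not across the whole entity.
     */
    static String bucketedKey(String hotKey, long sequence, int buckets) {
        return hotKey + "#" + Math.floorMod(sequence, buckets);
    }
}
```

For example, `bucketedKey("tenant-1", 7, 4)` yields `tenant-1#3`, so a large tenant's traffic lands on up to four partitions instead of one.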
How Many Partitions?
Under-partitioning (too few):
Problem: Bottleneck throughput at peak load
Problem: Cannot add more consumers than partitions
Problem: One slow message blocks all messages on that partition
Rule: If unsure, start with more partitions (they can be added later but never removed, and adding them remaps keys, which breaks per-key ordering)
Over-partitioning (too many):
Problem: More partitions = more ZooKeeper/controller load
Problem: Leader election takes longer with more partitions
Problem: Consumer rebalancing takes longer
Problem: Replication overhead increases linearly with partition count
Rule: Do not create 10,000 partitions for a low-volume topic
Partition count formula:
target_throughput_mb_per_sec / throughput_per_partition_mb_per_sec
Where throughput_per_partition:
For producer: ~10–40 MB/sec per partition (varies by hardware)
For consumer: ~15–50 MB/sec per partition (depends on processing speed)
Conservative estimate: 10 MB/sec per partition
Example:
Target: 500 MB/sec peak
Per partition: 10 MB/sec (conservative)
Partitions needed: 50
Add 20% headroom: 60 partitions
Round to even number: 60 (or 64 for power-of-2 if custom partitioner used)
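The formula and worked example translate directly into a helper. This is a minimal sketch with hypothetical names; headroom is passed as an integer percentage so the rounding stays in integer arithmetic.

```java
/** Sketch of the partition count formula: ceil(target / perPartition), then headroom. */
final class PartitionPlanner {
    static int partitionsNeeded(double targetMbPerSec, double perPartitionMbPerSec, int headroomPercent) {
        if (targetMbPerSec <= 0 || perPartitionMbPerSec <= 0 || headroomPercent < 0) {
            throw new IllegalArgumentException("throughput must be positive and headroom non-negative");
        }
        long base = (long) Math.ceil(targetMbPerSec / perPartitionMbPerSec);
        // Ceiling of base * (100 + headroom) / 100, done on integers to avoid float rounding.
        long withHeadroom = (base * (100 + headroomPercent) + 99) / 100;
        return (int) withHeadroom;
    }
}
```

With the example's inputs (500 MB/sec target, 10 MB/sec per partition, 20% headroom) the helper returns 60, matching the hand calculation.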
Practical defaults:
Low-traffic topics (< 10 MB/sec): 3–6 partitions
Medium-traffic topics (10–100 MB/sec): 12–24 partitions
High-traffic topics (> 100 MB/sec): 48–120 partitions
Rethink if > 200 partitions: consider separate cluster
Replication Factor Trade-Off
replication.factor=1:
Cost: Lowest (1× storage)
Durability: NONE - single broker failure = data loss
Use for: Development only. Never in production.
replication.factor=2:
Cost: 2× storage
Durability: Tolerates 1 broker failure
Problem: with min.insync.replicas=2, acks=all writes fail as soon as either replica's broker fails
Use for: Low-criticality topics where availability > durability
replication.factor=3 (recommended for production):
Cost: 3× storage
Durability: Tolerates 1 broker failure with no data loss
Availability: Writes succeed as long as 2/3 brokers available
Use for: All production data (standard practice)
replication.factor=3 + min.insync.replicas=2 is the production-standard combination
- Tolerates 1 broker failure with zero data loss
- Remains writable with 2/3 brokers available
- Refuses writes (safely) when only 1 broker available (better than silent data loss)
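The interaction between replication factor and min.insync.replicas is simple arithmetic, sketched below. The class is hypothetical and models only the acks=all write path: a write is accepted while the number of in-sync replicas is at least min.insync.replicas.

```java
/** Sketch of the acks=all write rule for a single partition; names are illustrative. */
final class ReplicationPolicy {
    final int replicationFactor;
    final int minInsyncReplicas;

    ReplicationPolicy(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException("need 1 <= min.insync.replicas <= replication.factor");
        }
        this.replicationFactor = replicationFactor;
        this.minInsyncReplicas = minInsyncReplicas;
    }

    /** With acks=all, the broker rejects writes once in-sync replicas drop below the minimum. */
    boolean acceptsWrites(int inSyncReplicas) {
        return inSyncReplicas >= minInsyncReplicas;
    }

    /** How many replica failures the partition can absorb while staying writable. */
    int failuresToleratedForWrites() {
        return replicationFactor - minInsyncReplicas;
    }
}
```

For replication.factor=3 with min.insync.replicas=2 this gives one tolerated failure; for replication.factor=2 with min.insync.replicas=2 it gives zero, which is the problem noted above.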
9. Message Size Trade-Offs
Message Size Spectrum
| Message Size | Approach | Pros | Cons | Example |
|---|---|---|---|---|
| < 1KB | Include all data inline | Simple, atomic, no external dependency | Fine for most cases | Order status change event |
| 1–100KB | Include inline OR use Claim Check | Inline: simpler; Claim Check: avoids broker overhead | Claim Check adds S3 latency | Product catalog update with images |
| 100KB–1MB | Claim Check (S3/blob reference) | Keeps broker fast and cheap | Extra S3 latency, S3 availability dependency | Large report generation result |
| > 1MB | Must use Claim Check | Kafka max 1MB default (configurable to 10MB) | S3 becomes critical path | Video upload notification, large datasets |
Kafka message.max.bytes Trade-Off
message.max.bytes=1MB (default):
Benefit: Predictable performance, small messages flow fast
Limit: Applications must handle RecordTooLargeException
message.max.bytes=10MB:
Benefit: Larger payloads without external storage
Cost: 10x memory per in-flight message
Cost: Network bandwidth spikes from large messages
Cost: Slower throughput (large messages fill batches faster)
Use only when: Infrequent large messages AND throughput is not critical
message.max.bytes=50MB+:
Almost always wrong choice for a message broker
Use the Claim Check pattern instead:
- Store payload in S3/GCS/Azure Blob
- Put reference URL in message (< 1KB)
- Consumer fetches from S3 when needed
- S3 is optimized for large object storage; Kafka is not
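The Claim Check steps above can be sketched as a wrap/unwrap pair. In this illustration an in-memory map stands in for S3, GCS, or Azure Blob; the 100KB inline limit follows the size table, and the `payloads/` key scheme and class names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Claim Check sketch; the map is a stand-in for an object store. */
final class ClaimCheck {
    static final int INLINE_LIMIT_BYTES = 100 * 1024; // assumption taken from the size table

    /** What actually travels through the broker: the payload itself or a reference to it. */
    static final class Envelope {
        final byte[] inlinePayload; // null when the payload is stored externally
        final String blobKey;       // null when the payload is inline

        Envelope(byte[] inlinePayload, String blobKey) {
            this.inlinePayload = inlinePayload;
            this.blobKey = blobKey;
        }
    }

    private final Map<String, byte[]> blobStore = new HashMap<>();

    /** Producer side: small payloads go inline, large ones are stored and referenced. */
    Envelope wrap(byte[] payload) {
        if (payload.length <= INLINE_LIMIT_BYTES) {
            return new Envelope(payload, null);
        }
        String key = "payloads/" + UUID.randomUUID();
        blobStore.put(key, payload);
        return new Envelope(null, key);
    }

    /** Consumer side: fetch from the store only when the message carries a reference. */
    byte[] unwrap(Envelope envelope) {
        return envelope.blobKey == null ? envelope.inlinePayload : blobStore.get(envelope.blobKey);
    }
}
```

The consumer-side fetch is where the extra latency and the object-store availability dependency from the table show up.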
10. Synchronous vs Asynchronous Decision Framework
The Full Trade-Off Comparison
| Dimension | Synchronous (HTTP/gRPC) | Asynchronous (Message Queue) |
|---|---|---|
| Latency | Low (direct connection) | Higher (queue overhead: 2–20ms) |
| Coupling | Tight (caller knows callee) | Loose (caller knows event, not consumer) |
| Availability | Caller fails if callee is down | Caller succeeds even if consumer is down |
| Consistency | Easier (immediate feedback) | Harder (eventual consistency) |
| Error handling | Immediate error feedback | Delayed (may discover failure much later) |
| Backpressure | Natural (caller blocked) | Explicit (queue depth, consumer lag) |
| Observability | Simple (request-response trace) | Complex (distributed, multi-hop) |
| Scaling | Scale caller and callee together | Scale independently |
| Testing | Simple (mock the HTTP call) | Complex (requires broker in test) |
| Operational complexity | Low | High (broker management, DLQ, replay) |
| Result delivery | Immediate in response | Webhook, polling, or WebSocket needed |
Decision Framework by Use Case
DEFINITELY synchronous (do not use message queue):
- User authentication (must know result immediately)
- Inventory availability check before adding to cart
- Payment authorization result (user waiting at checkout)
- Search queries (user waiting for results)
- Any user interaction requiring sub-500ms response
DEFINITELY asynchronous (message queue):
- Sending confirmation emails (not in the critical user path)
- Updating analytics dashboards (minutes of delay acceptable)
- Generating invoices/reports (can take seconds to minutes)
- Syncing data to external systems (downstream may be slow)
- Processing uploaded files (video, large CSVs)
- Any work item where "accepted, processing" is an acceptable response
DEPENDS (evaluate with the 5 questions from Section 1):
- Inventory reservation (after purchase)
→ Async with Saga for reservation confirmation
- Payment processing notification to warehouse
→ Async (warehouse does not need immediate notification)
- Fraud check during payment
→ Synchronous (must know result before accepting payment)
→ But long-running fraud analysis: async webhook callback
The Hybrid Pattern: Sync Accept, Async Process
// Best of both worlds for long-running operations.
// This method lives inside a Spring @RestController with orderValidator,
// orderRepository, and kafkaTemplate injected; the Order* types are application classes.
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;

@PostMapping("/orders")
public ResponseEntity<OrderAcceptedResponse> placeOrder(@RequestBody OrderRequest request) {
    // STEP 1: Validate synchronously (fast, < 50ms)
    orderValidator.validate(request); // Throws if invalid - caller gets 400 immediately

    // STEP 2: Create order record synchronously (gives user an ID)
    Order order = orderRepository.save(Order.create(request)); // 10ms

    // STEP 3: Publish event asynchronously (non-blocking).
    // send() returns a future; attach a callback or use an outbox table
    // if a silently lost publish is unacceptable.
    kafkaTemplate.send("order-events", order.getId(), OrderPlacedEvent.from(order));

    // Return immediately with order ID and tracking URL
    // User can check status asynchronously
    return ResponseEntity.accepted().body(
        OrderAcceptedResponse.builder()
            .orderId(order.getId())
            .status("PROCESSING")
            .trackingUrl("/orders/" + order.getId())
            .estimatedCompletionSeconds(30)
            .build()
    );
    // Total response time: ~60ms
    // User gets immediate feedback; downstream work happens asynchronously
}
11. Event-Driven vs Command-Driven Trade-Offs
The Semantic Difference
Command (Imperative): "Do this specific thing"
- Sent to a specific recipient
- Recipient is expected to act
- Sender has intent
- Example: "ProcessPayment", "SendEmail", "UpdateInventory"
- Single consumer (one service owns the action)
Event (Declarative): "This thing happened"
- Published to whoever is interested
- Publisher does not know or care who reacts
- Publisher has no intent toward consumers
- Example: "OrderPlaced", "PaymentProcessed", "InventoryDepleted"
- Multiple consumers (anyone can react)
Trade-Off Comparison
| Dimension | Command-Driven | Event-Driven |
|---|---|---|
| Coupling | Medium (knows who does the work) | Low (does not know who reacts) |
| Responsibility clarity | High (clear ownership) | Lower (any service can add consumers) |
| Unintended side effects | Low (explicit intent) | Higher (new consumers can be added silently) |
| Choreography | Harder (explicit chains) | Easier (each service reacts independently) |
| Debugging | Easier (follow command chain) | Harder (who consumed this event?) |
| Adding new consumers | Requires changing publisher | No publisher changes needed |
| Removing consumers | Requires changing publisher | Just stop subscribing |
| Saga implementation | Orchestration (explicit) | Choreography (implicit) |
When to Use Each
Command-Driven is better when:
- Exactly ONE service should handle the operation
- Failure must be reported back to the publisher
- You need clear ownership and accountability
- The operation is a business operation with side effects
- Example: payment-service.command → ProcessPayment (one owner)
Event-Driven is better when:
- Multiple services have independent reactions to the same fact
- You want to add consumers without changing the publisher
- The publisher should not know what reactions occur
- Example: OrderPlaced → warehouse + email + analytics (all react independently)
Mixed approach (most real systems):
- External input to your domain: Commands (explicit, owned)
- Internal domain state changes published outward: Events (observable)
OrderService receives PlaceOrderCommand (command from API layer)
OrderService publishes OrderPlaced event (event for all observers)
WarehouseService subscribes to OrderPlaced (observes, acts independently)
PaymentService subscribes to OrderPlaced (observes, acts independently)
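The semantic difference shows up as two different dispatch rules. The bus below is an in-memory sketch with hypothetical names: a command must have exactly one registered owner and fails loudly without one, while an event may have any number of subscribers, including none.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Sketch contrasting command dispatch (one owner) with event publication (any observers). */
final class CommandEventBus {
    private final Map<String, Consumer<String>> commandHandlers = new HashMap<>();
    private final Map<String, List<Consumer<String>>> eventSubscribers = new HashMap<>();

    /** A command has exactly one owner; a second registration is rejected. */
    void registerCommandHandler(String commandName, Consumer<String> handler) {
        if (commandHandlers.putIfAbsent(commandName, handler) != null) {
            throw new IllegalStateException("command already owned: " + commandName);
        }
    }

    /** Sending a command nobody owns is an error the sender must see. */
    void send(String commandName, String payload) {
        Consumer<String> handler = commandHandlers.get(commandName);
        if (handler == null) {
            throw new IllegalStateException("no handler for command: " + commandName);
        }
        handler.accept(payload);
    }

    /** Subscribing never requires a change to the publisher. */
    void subscribe(String eventName, Consumer<String> subscriber) {
        eventSubscribers.computeIfAbsent(eventName, name -> new ArrayList<>()).add(subscriber);
    }

    /** Publishing with zero subscribers is fine; returns how many subscribers reacted. */
    int publish(String eventName, String payload) {
        List<Consumer<String>> subscribers = eventSubscribers.getOrDefault(eventName, new ArrayList<>());
        for (Consumer<String> subscriber : subscribers) {
            subscriber.accept(payload);
        }
        return subscribers.size();
    }
}
```

In the mixed approach, `PlaceOrderCommand` would go through `send` and `OrderPlaced` through `publish`.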
12. Consumer Concurrency Trade-Offs
Concurrency Models
Model 1: One consumer thread per partition (Kafka default)
- Thread A handles Partition 0
- Thread B handles Partition 1
- Thread C handles Partition 2
Pro: Simple, ordering guaranteed within thread
Con: Concurrency limited by partition count
Use when: Order matters within a partition (most financial use cases)
Model 2: Async processing pool (multiple threads per partition)
- Consumer thread polls Partition 0
- Dispatches to thread pool (10 threads)
- Thread pool processes concurrently
Pro: Higher throughput per partition (10x)
Con: Ordering not guaranteed (multiple threads racing on same partition)
Con: Offset management is complex (cannot commit until all in-flight done)
Use when: Processing is IO-bound and order does not matter (analytics, logging)
Model 3: Virtual threads (Java 21+)
- Each message processed on a virtual thread
- Virtual threads are lightweight (millions possible)
- Blocking IO does not waste OS threads
Pro: Extremely high concurrency for IO-bound processing
Con: Requires Java 21+
Con: Still requires careful ordering management
Use when: High-concurrency, IO-bound processing on Java 21+
Concurrency vs Ordering Trade-Off
High concurrency + strict ordering = impossible
You must choose:
Option A: Strict ordering (lower concurrency)
- Single-threaded per partition
- Kafka: 1 consumer instance per partition
- Throughput: limited by single-thread processing speed
- Use for: Financial state machines, order processing with dependencies
Option B: High concurrency (no ordering guarantee)
- Multiple threads or async processing per partition
- Throughput: scales with thread pool size
- Use for: Independent events (each can be processed in any order)
- Use for: Analytics, logging, notification sending (order does not matter)
Option C: Partial ordering (partition-level ordering, parallel across partitions)
- One thread per partition (parallel partitions, sequential within partition)
- Throughput: scales with partition count
- Use for: Per-entity ordering (all order-123 events in order, but independent of order-456)
- This is the most common correct approach for most systems
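Option C can also be applied inside a single consumer process: route each message to a single-threaded lane chosen by key, so one entity's events stay in order while different entities run in parallel. This is a sketch, not a Kafka API; the lane count and class name are assumptions, and offset commits are deliberately out of scope.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: sequential per key, parallel across keys. */
final class KeyedExecutor {
    private final ExecutorService[] lanes;

    KeyedExecutor(int laneCount) {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    /** Tasks sharing a key land on the same single-threaded lane, so they run in submission order. */
    void submit(String key, Runnable task) {
        lanes[Math.floorMod(key.hashCode(), lanes.length)].execute(task);
    }

    /** Stops accepting work and waits up to the timeout per lane; returns false on timeout or interrupt. */
    boolean shutdownAndAwait(long timeout, TimeUnit unit) {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        boolean finished = true;
        try {
            for (ExecutorService lane : lanes) {
                finished &= lane.awaitTermination(timeout, unit);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
        return finished;
    }
}
```

Throughput scales with the number of lanes, exactly as it scales with partition count in the broker-level version of the same idea.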
13. Retention and Storage Trade-Offs
Retention Policy Decision Matrix
| Data Type | Retention Period | Why | Risk of Too Short | Risk of Too Long |
|---|---|---|---|---|
| Financial transactions | 90 days (log) + 7 years (archive) | Regulatory compliance | Regulatory violation | Storage cost (manageable with compression) |
| Order events | 30 days | Bug fix replay + audit | Cannot replay month-old bugs | Storage cost |
| User activity | 7 days | Analytics pipeline + replay | Cannot rebuild analytics from scratch | Privacy compliance (GDPR right to erasure) |
| Application logs | 3–7 days | Debugging recent incidents | Lose context on slow-developing bugs | Storage cost, compliance complexity |
| Metrics/telemetry | 1–3 days | Near-real-time dashboards | Cannot investigate recent incidents | Storage cost |
| Analytics raw events | 7–14 days | ML training data recency | Too little history to retrain or backfill models | Storage cost |
| System notifications | 1 day | Transient, time-sensitive | Notifications lost if consumers are down longer than the window | Stale or duplicate notifications delivered on replay |
Size-Based vs Time-Based Retention
Time-based retention (retention.ms):
- Keeps messages for a fixed time window
- Predictable deletion schedule
- Good for: regulatory compliance (exact time-based SLA)
- Risk: If message rate surges, storage spikes before time-based cleanup
Size-based retention (retention.bytes):
- Keeps up to a fixed total size per partition
- Predictable disk usage
- Good for: disk capacity management
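- Risk: During a traffic surge, messages are deleted sooner than expected (possibly before slow consumers have read them); during low traffic, older messages survive longer than expected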
Best practice: Use BOTH (whichever limit is reached first triggers cleanup)
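# 30 days and 50 GiB per partition; replace localhost:9092 with your broker address
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my-topic \
  --add-config retention.ms=2592000000,retention.bytes=53687091200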
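To see which limit actually governs a topic, it helps to work through the arithmetic. The sketch below is a rough estimate under a steady per-partition write rate; `effective_retention_seconds` is a hypothetical helper, and real brokers delete whole log segments, so actual retention is somewhat coarser than this figure.

```python
def effective_retention_seconds(bytes_per_second, retention_ms, retention_bytes):
    """Approximate how long a message survives on one partition when both limits are set."""
    time_limit = retention_ms / 1000
    if bytes_per_second <= 0:
        return time_limit  # nothing is written, so only the time limit applies
    size_limit = retention_bytes / bytes_per_second  # seconds until the size cap is hit
    return min(time_limit, size_limit)  # whichever limit is reached first wins


THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000  # 2_592_000_000, as in the command above
FIFTY_GIB = 50 * 1024**3                   # 53_687_091_200 bytes per partition

# At 10 KiB/s the size cap would take about 60 days to fill, so the 30-day limit governs.
quiet = effective_retention_seconds(10 * 1024, THIRTY_DAYS_MS, FIFTY_GIB)

# At 100 KiB/s the size cap fills in about 6 days, so messages are gone long before 30 days.
busy = effective_retention_seconds(100 * 1024, THIRTY_DAYS_MS, FIFTY_GIB)

print(quiet / 86400, busy / 86400)  # retention in days for each scenario
```

If the busy case is unacceptable (for example, a 30-day replay window is a hard requirement), the size limit needs to be raised to match the expected peak write rate.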
14. Schema Evolution Strategy Trade-Offs
Evolution Strategy Comparison
| Strategy | Compatibility | Complexity | Risk Level | Suitable For |
|---|---|---|---|---|
| JSON, no schema | None (drift freely) | Very Low | Very High | Prototypes, POCs, never production |
| JSON with documentation | Informal, trust-based | Low | High | Small teams, trusted internal use |
| JSON Schema + Schema Registry | Formal, enforced | Medium | Medium | Medium teams, multiple consumers |
| Avro + Schema Registry | Enforced, binary | High | Low | High-performance, many consumers |
| Protobuf + Registry | Enforced, binary | High | Low | Cross-language, forward compat critical |
| Consumer-Driven Contracts (Pact) | Contract-tested | High | Low | Many independent teams |
Schema Registry Compatibility Modes
BACKWARD compatibility (most common, recommended default):
- New schema CAN read messages written with OLD schema
- Old schema is NOT guaranteed to read messages written with NEW schema
- Safe pattern: consumers update first, then producers
- Changes allowed: add optional fields (with defaults), remove fields
- Changes NOT allowed: add required fields (no default), rename fields, change field types
FORWARD compatibility:
- Old schema CAN read messages written with NEW schema
- New schema is NOT guaranteed to read messages written with OLD schema
- Safe pattern: producers update first, then consumers
- Changes allowed: add fields (consumers ignore unknown), remove optional fields (those with defaults)
- Changes NOT allowed: remove required fields, rename fields, change field types
FULL compatibility (most restrictive, safest for long-lived schemas):
- Both backward AND forward compatible
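- Consumers and producers on adjacent schema versions can read each other's messages (FULL_TRANSITIVE extends the check to all registered versions)
- Changes allowed: add or remove optional fields (with defaults)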
- Everything else: blocked
- Use for: Public APIs, schemas shared across many teams
NONE (no compatibility checking):
- Any change allowed
- Producer and consumer must be deployed atomically
- Use for: Development only, never production
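The modes above reduce to one question asked in two directions: can a reader using schema X decode data written with schema Y? The sketch below models that with a deliberately simplified rule set (field names, exact type match, and whether a default exists). It is an illustration only, not the algorithm a real Schema Registry uses; `Field`, `can_read`, and `is_compatible` are hypothetical names, and type promotion and aliases are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type: str
    has_default: bool = False


def can_read(reader: dict, writer: dict) -> bool:
    """True if a consumer using `reader` can decode data produced with `writer`."""
    for name, field in reader.items():
        written = writer.get(name)
        if written is None:
            if not field.has_default:
                return False  # reader needs a value the writer never sent
        elif written.type != field.type:
            return False      # simplified: no type promotion
    return True               # extra writer fields are ignored by the reader


def is_compatible(mode: str, new: dict, old: dict) -> bool:
    backward = can_read(reader=new, writer=old)  # new consumers, old data
    forward = can_read(reader=old, writer=new)   # old consumers, new data
    checks = {"BACKWARD": backward, "FORWARD": forward,
              "FULL": backward and forward, "NONE": True}
    return checks[mode]


v1 = {"id": Field("string"), "amount": Field("long")}
add_optional = {**v1, "currency": Field("string", has_default=True)}
add_required = {**v1, "region": Field("string")}
drop_required = {"id": Field("string")}
retyped = {"id": Field("string"), "amount": Field("string")}

for label, candidate in [("add optional", add_optional), ("add required", add_required),
                         ("drop required", drop_required), ("change type", retyped)]:
    print(label, {m: is_compatible(m, candidate, v1) for m in ("BACKWARD", "FORWARD", "FULL")})
```

Under these simplified rules, only the "add optional" change passes all three modes, which is why FULL compatibility effectively limits evolution to optional fields.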
15. Full Architectural Decision Trees
Decision Tree 1: Message Broker Selection
START: I need to communicate between services asynchronously
→ Is this for a single cloud provider environment?
  YES → Am I on AWS?
          YES → High throughput or event sourcing needed?
                  YES → AWS MSK (managed Kafka)
                  NO → AWS SQS (managed queue, much simpler and cheaper)
        Am I on GCP?
          YES → Google Cloud Pub/Sub (managed, native integration)
        Am I on Azure?
          YES → Azure Service Bus or Azure Event Hubs (Kafka-compatible)
  NO → Continue...
→ Do I need message replay (re-read past messages)?
  YES → Kafka (log-based: messages stay readable after consumption, unlike a classic queue that deletes on acknowledgment)
  NO → Continue...
→ Do I need complex routing (by content, headers, priority)?
  YES → RabbitMQ (richest routing model)
  NO → Continue...
→ Do I need > 500,000 messages/second throughput?
  YES → Kafka (RabbitMQ typically cannot sustain this rate)
  NO → Continue...
→ Do I need stream processing (joins, aggregations, windowed operations)?
  YES → Kafka + Kafka Streams or Flink
  NO → Continue...
→ Do I value operational simplicity over features?
  YES → SQS (cloud) or RabbitMQ (self-hosted) - much simpler to operate
  NO → Kafka provides the most power at the cost of complexity
Default recommendation:
Small teams, simple use cases: → SQS (cloud) or RabbitMQ (self-hosted)
Large teams, event-driven architecture, analytics: → Kafka
Decision Tree 2: Partition Key Selection
START: What partition key should I use for this Kafka topic?
→ Do messages need strict ordering?
  NO → Use null key (producer spreads records across partitions: round-robin before Kafka 2.4, sticky batching since; highest throughput)
  YES → Continue...
→ What is the unit of ordering?
  "All events for the same ORDER": → Use orderId as partition key
  "All events for the same USER": → Use userId as partition key
  "All events for the same ENTITY": → Use entityId as partition key
→ Is there a risk of hot partitions (one key carries far more traffic than the others)?
  YES → Is the hot key predictable (known large tenants/merchants)?
          YES → Custom partitioner with multiple slots for hot keys
          NO → Compound key (entityId + random bucket 0-3)
          Either way, ordering for a split key holds only within each slot or bucket
  NO → Simple entity ID hash is fine
→ Are there any keys that MUST be processed in sequence relative to each other?
  YES (e.g., all payment events for a user must be in order across order IDs):
    → Use a higher-level key (userId) that groups all related entities
  NO → Entity-level key is sufficient
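The compound-key branch of the tree can be illustrated as follows. This is a sketch under stated assumptions: `compound_key` and `partition_for` are hypothetical helpers, `zlib.crc32` stands in for Kafka's murmur2 hash, and the four buckets match the "0-3" range mentioned above.

```python
import random
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for the broker-side hash (assumption); Kafka itself uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def compound_key(entity_id: str, buckets: int = 4, rng=random) -> str:
    """Append a random bucket so one hot entity maps to up to `buckets` distinct keys."""
    return f"{entity_id}#{rng.randrange(buckets)}"


rng = random.Random(0)  # seeded only to make the demo repeatable
keys = [compound_key("merchant-1", buckets=4, rng=rng) for _ in range(1000)]

distinct_keys = set(keys)
partitions_used = {partition_for(k, num_partitions=12) for k in distinct_keys}
print(sorted(distinct_keys), sorted(partitions_used))
```

The cost is visible in the key itself: events for `merchant-1` now land under four different keys, so their relative order is only preserved within a bucket. If per-entity ordering matters, derive the bucket from a sub-entity (for example an order ID) rather than drawing it at random.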
Decision Tree 3: Delivery Semantics Selection
START: What delivery guarantee do I need?
→ Can I afford to lose some messages?
  YES (metrics, telemetry, analytics sampling):
    → At-most-once
    → acks=0 (no acknowledgment) or acks=1 (leader only)
    → Auto-commit BEFORE processing
    → Highest throughput
  NO → Continue...
→ Can I make my consumer idempotent (safe to process twice)?
  YES (most well-designed consumers):
    → At-least-once with idempotent consumer
    → acks=all, retries enabled
    → Manual ack AFTER processing
    → Deduplication by event ID in consumer
    → Best balance of reliability and simplicity
  NO (consumer has unavoidable side effects that cannot be deduplicated):
    → Exactly-once via Kafka transactions (high cost; covers only reads and writes within Kafka, not external side effects)
    → Consider redesigning for idempotency first
→ Is the consumer calling an external API?
  → Pass idempotency key to external API (Stripe, Twilio support this)
  → This gives you exactly-once EFFECT without exactly-once DELIVERY
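The at-least-once branch with an idempotent consumer might look like the sketch below. Everything here is illustrative: `IdempotentConsumer` and `FakePaymentApi` are hypothetical classes, the processed-ID set lives in memory (a real system would need a durable store, ideally updated in the same transaction as the side effect), and the event shape with an `event_id` field is an assumption.

```python
class FakePaymentApi:
    """Hypothetical external API that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, amount_cents, idempotency_key):
        # A repeated key returns the original result instead of charging again.
        return self.charges.setdefault(idempotency_key, {"amount_cents": amount_cents})


class IdempotentConsumer:
    def __init__(self, handler):
        self._handler = handler
        self._processed_ids = set()  # in-memory for the demo; assumption, not durable

    def handle(self, event) -> bool:
        """Process the event once; return False if it was a duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self._processed_ids:
            return False
        self._handler(event)
        # Mark as processed only after success, so a failed attempt is retried on redelivery.
        self._processed_ids.add(event_id)
        return True


api = FakePaymentApi()
consumer = IdempotentConsumer(
    lambda e: api.charge(e["amount_cents"], idempotency_key=e["event_id"])
)

event = {"event_id": "evt-001", "amount_cents": 4999}
results = [consumer.handle(event), consumer.handle(event)]  # second call simulates a redelivery
```

The two layers cover different failure windows: the consumer-side check filters ordinary redeliveries, while the idempotency key protects against a crash between the API call and recording the event ID.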