Message Queues Demystified - Part 5: Operations and Performance
A system that works perfectly on your laptop and fails under production load is not a working system.
Operations and performance are where theory meets reality.
Table of Contents
- Throughput Optimization
- Batching Strategies
- Message Compression
- Horizontal Scaling and Partition Strategy
- Consumer Lag - The Most Important Operational Metric
- Key Metrics to Monitor
- High Availability and Replication
- Message TTL and Retention Policies
- Capacity Planning
- Performance Tuning Cheat Sheet
1. Throughput Optimization
Understanding the Bottlenecks
Before optimizing, identify where the bottleneck is:
End-to-end message path:
Producer App -> Network -> Broker (write) -> Disk -> Network -> Consumer App -> Processing
Common bottleneck locations:
1. Producer: serialization speed, batch size, connection count
2. Network (producer to broker): bandwidth, latency
3. Broker write: disk I/O, replication overhead
4. Broker read: disk I/O, network
5. Consumer: deserialization speed, processing time, downstream dependencies
6. Database (consumer side): write throughput, lock contention
Producer Throughput Optimization
Kafka Producer settings:
@Bean
public ProducerFactory<String, Object> highThroughputProducerFactory() {
Map<String, Object> config = new HashMap<>();
// THROUGHPUT settings
config.put(ProducerConfig.LINGER_MS_CONFIG, 20);
// Wait up to 20ms to accumulate messages before sending
// Allows batching at the cost of slightly higher latency
// Default is 0 (send immediately) in clients before Kafka 4.0, which limits batching
config.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
// 64KB batch size (default is 16KB)
// Larger batches = more messages per network round trip
config.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);
// 64MB total buffer memory (default is 32MB)
// Producer blocks when buffer is full
config.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
// Compress batches before sending
// LZ4 is fast with good compression ratio
// RELIABILITY trade-offs (relaxed here for throughput; tighten them for critical data)
config.put(ProducerConfig.ACKS_CONFIG, "1");
// acks=1: only the partition leader acknowledges, so a leader failure
// can lose messages that were already acknowledged
// Use acks=all for critical data
config.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
// 5 in-flight requests per connection
// Without idempotence, a retried batch can land after a later one and break ordering
// For strict per-key ordering, enable idempotence (requires acks=all) or set this to 1
return new DefaultKafkaProducerFactory<>(config);
}
Throughput vs Latency trade-off:
linger.ms=0 (default before Kafka 4.0):
- Each message sent immediately
- Low latency (messages sent fast)
- Low throughput (each message = one network call)
linger.ms=20:
- Wait 20ms to accumulate messages into a batch
- Higher latency (up to 20ms delay)
- High throughput (one network call per batch instead of one per message)
- Right choice for background processing, analytics
linger.ms=100:
- Very high throughput
- 100ms latency acceptable only for bulk operations
Consumer Throughput Optimization
Increase parallelism:
@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> factory() {
ConcurrentKafkaListenerContainerFactory<String, Object> factory =
new ConcurrentKafkaListenerContainerFactory<>();
factory.setConcurrency(6);
// 6 concurrent consumer threads per instance
// Each thread is a separate consumer and is assigned one or more partitions
// Requires at least 6 partitions for full utilization
return factory;
}
Offloading CPU-bound work to a separate executor:
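@KafkaListener(topics = "image-resize-jobs", concurrency = "6")
public void processImage(ImageResizeJob job, Acknowledgment ack) {
    // Requires manual ack mode; the listener returns before the work completes
    CompletableFuture.runAsync(() -> {
        imageService.resize(job.getImageUrl(), job.getTargetDimensions());
    }, imageProcessingExecutor)
    .thenRun(ack::acknowledge)
    .exceptionally(e -> {
        log.error("Image resize failed: {}", job.getImageId(), e);
        // Skipping the ACK does not by itself trigger redelivery: Kafka offsets are
        // cumulative, so a later ACK on the same partition also covers this record.
        // Send failures to a retry or dead-letter topic if they must not be lost.
        return null;
    });
}
Optimize the processing path: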
// Slow - up to 2N database round trips for N messages (one read and one write each)
@KafkaListener(topics = "user-events")
public void handleUserEvent(UserEvent event) {
userRepository.findById(event.getUserId()) // Database call
.ifPresent(user -> {
userRepository.save(user.apply(event)); // Another database call
});
}
// Fast - 2 database round trips for N messages (batch consumer)
@KafkaListener(topics = "user-events", batch = "true")
public void handleUserEvents(List<UserEvent> events) {
    Set<String> userIds = events.stream()
        .map(UserEvent::getUserId)
        .collect(Collectors.toSet());
    // Single query for all users in the batch
    Map<String, User> users = userRepository.findAllById(userIds)
        .stream().collect(Collectors.toMap(User::getId, u -> u));
    // Apply events in arrival order so several events for one user accumulate.
    // Events for unknown users are skipped, matching the single-message version.
    for (UserEvent event : events) {
        users.computeIfPresent(event.getUserId(), (id, user) -> user.apply(event));
    }
    // Single batch save
    userRepository.saveAll(users.values());
}
2. Batching Strategies
Producer Batching
Producer batching groups multiple messages into a single network call to the broker. This is one of the highest-leverage optimizations.
Without batching:
10,000 messages = 10,000 network calls = 10,000 disk writes on broker
Network overhead dominates
With batching (linger.ms=20, batch.size=64KB):
10,000 messages = ~100 network calls = more efficient disk writes
10-100x throughput improvement
The batch.size limit:
Kafka sends a batch when either:
- The batch reaches batch.size bytes, OR
- linger.ms milliseconds have passed
Whichever comes first. So even with a large linger.ms, a busy producer will send batches immediately when they fill up.
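The send rule and the "10,000 messages = ~100 network calls" estimate can be sketched as a small model. This is a simplified illustration, not the Kafka client's implementation: it assumes a single partition, a steady arrival rate, and no compression. The class name `BatchModel` and the 640-byte message size are assumptions chosen for the example.

```java
/** Simplified model of producer batching (single partition, steady arrival rate). */
final class BatchModel {

    /** A batch is sent when it is full or when the linger window has elapsed. */
    static boolean shouldSend(int batchBytes, int batchSizeLimit, long waitedMs, long lingerMs) {
        return batchBytes >= batchSizeLimit || waitedMs >= lingerMs;
    }

    /** Messages per batch, limited by batch.size and by arrivals within linger.ms. */
    static long messagesPerBatch(int messageBytes, int batchSizeLimit,
                                 double messagesPerSec, long lingerMs) {
        long bySize = Math.max(1, batchSizeLimit / messageBytes);
        long byTime = Math.max(1, (long) Math.floor(messagesPerSec * lingerMs / 1000.0));
        return Math.min(bySize, byTime);
    }

    /** Number of produce requests needed for the given message count (ceiling division). */
    static long requestsFor(long totalMessages, long messagesPerBatch) {
        return (totalMessages + messagesPerBatch - 1) / messagesPerBatch;
    }

    public static void main(String[] args) {
        // 640-byte messages, batch.size=64KB, 10,000 msg/sec, linger.ms=20
        long perBatch = messagesPerBatch(640, 65_536, 10_000, 20);
        System.out.println("messages per batch: " + perBatch);
        System.out.println("requests for 10,000 messages: " + requestsFor(10_000, perBatch));
    }
}
```

With these inputs the batch fills before the linger window ends, so batch.size is the limiting factor. At a much lower arrival rate, linger.ms would be.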
Consumer Batch Processing
// Single message processing - 1000 DB calls for 1000 messages
@KafkaListener(topics = "orders")
public void process(OrderEvent event) {
orderRepository.save(new ProcessedOrder(event)); // 1 DB call per message
}
// Batch processing - one saveAll per polled batch (2 calls for 1000 messages at max-poll-records=500)
@KafkaListener(topics = "orders", batch = "true")
public void processBatch(List<ConsumerRecord<String, OrderEvent>> records) {
List<ProcessedOrder> orders = records.stream()
.map(record -> new ProcessedOrder(record.value()))
.collect(Collectors.toList());
orderRepository.saveAll(orders); // 1 DB call for the whole batch
// Acknowledge all records in the batch
// Spring Kafka handles this automatically in batch mode
}
Batch size tuning:
# Maximum records returned per poll (Spring Boot property)
spring.kafka.consumer.max-poll-records=500
# Raw Kafka consumer properties follow
# (under Spring Boot, set them via spring.kafka.consumer.properties.*)
# Maximum bytes returned per fetch request, across all partitions (50MB)
fetch.max.bytes=52428800
# Wait for at least 1KB before returning
fetch.min.bytes=1024
# Wait at most 500ms for fetch.min.bytes
fetch.max.wait.ms=500
RabbitMQ Batch Publishing
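// Manual batching in RabbitMQ
@Service
public class BatchPublisher {
    private static final int BATCH_SIZE = 100;
    private final List<Message> pendingBatch = new ArrayList<>();
    private final RabbitTemplate rabbitTemplate;
    private final ObjectMapper objectMapper;

    public BatchPublisher(RabbitTemplate rabbitTemplate, ObjectMapper objectMapper) {
        this.rabbitTemplate = rabbitTemplate;
        this.objectMapper = objectMapper;
    }

    public synchronized void addToBatch(Object payload) throws JsonProcessingException {
        pendingBatch.add(MessageBuilder.withBody(
            objectMapper.writeValueAsBytes(payload)).build());
        if (pendingBatch.size() >= BATCH_SIZE) {
            flush();
        }
    }

    @Scheduled(fixedDelay = 1000)
    public synchronized void flush() {
        if (pendingBatch.isEmpty()) return;
        rabbitTemplate.execute(channel -> {
            channel.confirmSelect(); // Enable publisher confirms on this channel
            for (Message message : pendingBatch) {
                channel.basicPublish("exchange", "routing.key", null, message.getBody());
            }
            channel.waitForConfirmsOrDie(5000); // Throws if the broker NACKs or 5s elapse
            return null;
        });
        pendingBatch.clear(); // Reached only when the whole batch was confirmed
    }
}
3. Message Compression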
Compression reduces network bandwidth and disk usage at the cost of CPU time on the producer (compress) and consumer (decompress).
Compression Options
| Algorithm | Ratio | Speed | CPU Cost | Best For |
|---|---|---|---|---|
| none | 1.0x | N/A | None | Low-volume, time-critical |
| gzip | 3-5x | Slow | High | Disk-space sensitive, throughput not critical |
| snappy | 2-3x | Fast | Low | General purpose, balanced |
| lz4 | 2-3x | Very Fast | Very Low | High-throughput pipelines |
| zstd | 3-5x | Fast | Medium | Best overall compression + speed balance |
Recommendation for most Kafka deployments:
- High throughput pipelines: lz4
- Storage-sensitive: zstd (best overall)
- Compatibility: snappy
// Kafka compression (batch-level)
config.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
// Entire batch is compressed as one unit
// More messages in batch = better compression ratio
When Compression Helps Most
Best compression scenarios:
- JSON messages (highly redundant, compresses well)
- Log events (repetitive field names and values)
- Large batches (compression works better on larger inputs)
Compression adds little value for:
- Already compressed data (JPEG, MP4, ZIP)
- Very small batches (a lone message under about 1KB leaves little redundancy to exploit)
- Low-volume, latency-sensitive paths
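The effect of batch size on compression can be checked with the JDK alone. The sketch below uses gzip only because it ships with the JDK; lz4, snappy, and zstd need third-party libraries and give different absolute numbers. The event shape and the class name `BatchCompressionDemo` are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

final class BatchCompressionDemo {

    /** Returns the gzip-compressed size of the input in bytes. */
    static int gzipSize(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size(); // Stream is closed (and flushed) by try-with-resources
    }

    /** Compressed size divided by original size; lower is better. */
    static double ratio(byte[] input) {
        return (double) gzipSize(input) / input.length;
    }

    /** Builds a batch of similar JSON events, one per line (hypothetical event shape). */
    static byte[] batch(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("{\"eventType\":\"ORDER_CREATED\",\"orderId\":").append(i)
              .append(",\"status\":\"PENDING\",\"currency\":\"USD\"}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.printf("1 message:    ratio %.2f%n", ratio(batch(1)));
        System.out.printf("100 messages: ratio %.2f%n", ratio(batch(100)));
    }
}
```

A single small message barely compresses (gzip's fixed header can even make it larger), while a batch of similar messages compresses well because the repeated field names are encoded once.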
4. Horizontal Scaling and Partition Strategy
Kafka Partition Count - The Golden Rule
Rule: Number of partitions = maximum consumer parallelism
6 partitions, 3 consumers: each consumer handles 2 partitions
6 partitions, 6 consumers: each consumer handles 1 partition (maximum parallelism)
6 partitions, 10 consumers: 6 consumers active, 4 idle (waste)
To handle 10,000 messages/second and each consumer handles 1,000/second:
Required consumers = 10,000 / 1,000 = 10
Therefore: create at least 10 partitions
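The arithmetic above is a ceiling division. A minimal sketch, with `PartitionSizing` as a hypothetical helper name:

```java
final class PartitionSizing {

    /** Consumers (and therefore partitions) needed to sustain the target rate. */
    static long requiredConsumers(long targetMsgPerSec, long perConsumerMsgPerSec) {
        return (targetMsgPerSec + perConsumerMsgPerSec - 1) / perConsumerMsgPerSec;
    }

    /** Consumers in one group that actually receive partitions. */
    static int activeConsumers(int partitions, int consumers) {
        return Math.min(partitions, consumers);
    }

    /** Consumers in one group that sit idle because there are no partitions left. */
    static int idleConsumers(int partitions, int consumers) {
        return Math.max(0, consumers - partitions);
    }

    public static void main(String[] args) {
        System.out.println("partitions needed: " + requiredConsumers(10_000, 1_000));
        System.out.println("idle with 6 partitions, 10 consumers: " + idleConsumers(6, 10));
    }
}
```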
Partition count guidelines:
Small topic (< 10,000 msg/sec): 3-6 partitions
Medium topic (10K-100K msg/sec): 12-24 partitions
Large topic (> 100K msg/sec): 48-96+ partitions
Start with more partitions than you think you need.
You cannot decrease partitions without recreating the topic.
You CAN add partitions later (but this changes which partition existing keys map to, so per-key ordering is not preserved across the change).
Consumer Group Rebalancing
When a consumer joins or leaves a consumer group, Kafka triggers a rebalance: all partitions are reassigned among the remaining consumers.
During a rebalance:
- ALL consumers stop processing
- Partitions are redistributed
- Consumers start processing from their new partitions
- This can take seconds to minutes depending on configuration
Reducing rebalance impact:
// Set the assignor in the consumer configuration passed to the ConsumerFactory
config.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    CooperativeStickyAssignor.class.getName());
// CooperativeStickyAssignor (Kafka 2.4+):
// - Revokes only the partitions that must move, not all of them
// - Consumers keep their current partitions when possible
// - No stop-the-world rebalance - incremental assignment
// - Greatly reduces impact of scaling events
// Tune session timeout and heartbeat
config.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45000); // 45 seconds
config.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 15000); // 15 seconds
config.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // 5 minutes max between polls
The max.poll.interval.ms trap:
If your consumer takes longer than max.poll.interval.ms to process a batch:
1. Kafka considers the consumer dead
2. Triggers a rebalance
3. The consumer's partitions are reassigned
4. When the consumer finishes and tries to commit, it fails
5. Messages are reprocessed by the new assigned consumer
6. DUPLICATE PROCESSING
Fix: Either increase max.poll.interval.ms OR reduce max.poll.records
so each batch is smaller and faster to process.
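Whether a configuration is at risk can be estimated before deploying. The sketch below assumes a roughly constant per-record processing time, which real workloads rarely have, so it applies a safety factor. `PollBudget` and the 700ms figure are illustrative assumptions.

```java
final class PollBudget {

    /** True if a full batch can be processed before max.poll.interval.ms expires. */
    static boolean fitsInPollInterval(int maxPollRecords, long perRecordMs, long maxPollIntervalMs) {
        return (long) maxPollRecords * perRecordMs < maxPollIntervalMs;
    }

    /** Largest max.poll.records that uses at most safetyFactor of the poll interval. */
    static int safeMaxPollRecords(long perRecordMs, long maxPollIntervalMs, double safetyFactor) {
        return (int) Math.max(1, Math.floor(maxPollIntervalMs * safetyFactor / perRecordMs));
    }

    public static void main(String[] args) {
        // 500 records at 700ms each is 350 seconds, which exceeds the 300 second interval
        System.out.println("fits: " + fitsInPollInterval(500, 700, 300_000));
        System.out.println("safe max.poll.records: " + safeMaxPollRecords(700, 300_000, 0.5));
    }
}
```

A safety factor of 0.5 leaves half the interval as headroom for slow records and downstream hiccups.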
5. Consumer Lag - The Most Important Operational Metric
What Consumer Lag Is
Consumer lag is the difference between the latest produced offset and the latest committed consumer offset for each partition.
Kafka Partition 0 current state:
Latest produced offset: 50,000
Consumer "email-group" committed offset: 49,900
Consumer lag: 100 messages
If consumers are keeping up: lag stays constant or decreases
If consumers are falling behind: lag grows over time
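Lag and its trend come down to two small calculations. A minimal sketch using plain maps keyed by partition number; the helper name `LagMath` and the rates in `main` are assumptions for illustration:

```java
import java.util.Map;

final class LagMath {

    /** Sums (log end offset - committed offset) over partitions that have both values. */
    static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> committed : committedOffsets.entrySet()) {
            Long end = logEndOffsets.get(committed.getKey());
            if (end == null) continue; // No end offset known for this partition
            total += Math.max(0L, end - committed.getValue());
        }
        return total;
    }

    /** Seconds until lag reaches zero, or infinity if consumers are not outpacing producers. */
    static double secondsToDrain(long lag, double consumedPerSec, double producedPerSec) {
        double net = consumedPerSec - producedPerSec;
        return net <= 0 ? Double.POSITIVE_INFINITY : lag / net;
    }

    public static void main(String[] args) {
        long lag = totalLag(Map.of(0, 50_000L), Map.of(0, 49_900L));
        System.out.println("lag: " + lag);
        System.out.println("seconds to drain: " + secondsToDrain(lag, 1_200, 1_000));
    }
}
```

The drain time is often more useful for alerting than the raw count: a lag of 100,000 that clears in 30 seconds is less worrying than a lag of 1,000 that never shrinks.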
Measuring Consumer Lag
# Kafka CLI
kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--group email-service-group \
--describe
# Output:
# TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
# order-events 0 49900 50000 100 email-consumer-1
# order-events 1 50200 50200 0 email-consumer-2
# order-events 2 49800 50000 200 email-consumer-3
# Total lag (sum of the LAG column; the tool does not print it): 300 messages
// Programmatic lag monitoring
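@Service
public class ConsumerLagMonitor {
    private static final String GROUP = "email-service-group";
    private final AdminClient adminClient;
    private final MeterRegistry meterRegistry;
    private final AlertingService alertingService; // Application-specific alert sender
    // Micrometer references gauge values weakly, so hold them here and update in place
    private final Map<TopicPartition, AtomicLong> lagGauges = new ConcurrentHashMap<>();

    public ConsumerLagMonitor(AdminClient adminClient, MeterRegistry meterRegistry,
                              AlertingService alertingService) {
        this.adminClient = adminClient;
        this.meterRegistry = meterRegistry;
        this.alertingService = alertingService;
    }

    @Scheduled(fixedDelay = 30000)
    public void reportLag() throws ExecutionException, InterruptedException {
        Map<TopicPartition, OffsetAndMetadata> offsets = adminClient
            .listConsumerGroupOffsets(GROUP)
            .partitionsToOffsetAndMetadata()
            .get();
        Map<TopicPartition, OffsetSpec> latest = offsets.keySet().stream()
            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
            adminClient.listOffsets(latest).all().get();
        long totalLag = 0;
        for (Map.Entry<TopicPartition, OffsetAndMetadata> entry : offsets.entrySet()) {
            TopicPartition tp = entry.getKey();
            // Skip partitions with no committed offset or no known end offset
            if (entry.getValue() == null || !endOffsets.containsKey(tp)) continue;
            long lag = Math.max(0, endOffsets.get(tp).offset() - entry.getValue().offset());
            // Register each per-partition gauge once, then update its value
            lagGauges.computeIfAbsent(tp, key -> meterRegistry.gauge("kafka.consumer.lag",
                Tags.of("topic", key.topic(),
                        "partition", String.valueOf(key.partition()),
                        "group", GROUP),
                new AtomicLong())).set(lag);
            totalLag += lag;
        }
        // Alert if total lag exceeds threshold
        if (totalLag > 10000) {
            alertingService.sendAlert("Consumer lag exceeds 10,000: current lag = " + totalLag);
        }
    }
}
Consumer Lag Alert Thresholds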
LAG LEVEL ACTION REQUIRED
---------------------------------------------------------------------------
0 - 100 Normal, consumers keeping up
100 - 1,000 Watch - slight backpressure
1,000 - 10,000 Alert - investigate, consider scaling
10,000 - 100,000 Critical - immediate scaling or issue investigation
100,000+ Incident - consumers severely behind, SLA at risk
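The table can be encoded as a small classifier for use in alerting code. A minimal sketch, assuming each band includes its lower bound (the table lists boundary values such as 100 in two rows); `LagSeverity` and `Level` are hypothetical names, and the thresholds should be scaled to the topic's throughput.

```java
final class LagSeverity {

    enum Level { NORMAL, WATCH, ALERT, CRITICAL, INCIDENT }

    /** Maps a total lag (in messages) to the severity bands from the table. */
    static Level classify(long lag) {
        if (lag < 100) return Level.NORMAL;
        if (lag < 1_000) return Level.WATCH;
        if (lag < 10_000) return Level.ALERT;
        if (lag < 100_000) return Level.CRITICAL;
        return Level.INCIDENT;
    }

    public static void main(String[] args) {
        System.out.println(classify(300));     // WATCH
        System.out.println(classify(250_000)); // INCIDENT
    }
}
```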
6. Key Metrics to Monitor
Kafka Metrics
# Producer Metrics
kafka.producer.record-send-rate: Messages sent per second
kafka.producer.record-error-rate: Failed sends per second
kafka.producer.request-latency-avg: Average time for broker to ACK
kafka.producer.batch-size-avg: Average batch size (bigger = more efficient)
kafka.producer.compression-rate: Compression effectiveness (lower = better)
kafka.producer.record-queue-time-avg: Time messages wait in buffer before sending
# Consumer Metrics
kafka.consumer.records-consumed-rate: Messages consumed per second
kafka.consumer.fetch-rate: Fetch requests per second
kafka.consumer.fetch-latency-avg: Time to fetch from broker
kafka.consumer.records-lag: MOST IMPORTANT: messages behind head
kafka.consumer.records-lag-max: Maximum lag across all partitions
kafka.consumer.commit-latency-avg: Time to commit offsets
# Broker Metrics
kafka.server.bytes-in-per-sec: Incoming data rate
kafka.server.bytes-out-per-sec: Outgoing data rate
kafka.server.messages-in-per-sec: Message ingestion rate
kafka.server.request-handler-pool.idle: Request handler idle ratio (low = broker saturated)
kafka.log.flush-time-ms-p99: Disk flush latency (99th percentile)
kafka.replica.under-replicated-partitions: > 0 means replication is lagging (CRITICAL)
kafka.controller.active-controller-count: Must sum to exactly 1 across the cluster
RabbitMQ Metrics
rabbitmq.queue.messages: Total messages in queue
rabbitmq.queue.messages.ready: Messages waiting for consumer
rabbitmq.queue.messages.unacknowledged: Messages in flight (consumer has but not ACKed)
rabbitmq.queue.consumers: Number of active consumers
rabbitmq.queue.message.publish.rate: Messages published per second
rabbitmq.queue.message.deliver.rate: Messages delivered per second
rabbitmq.queue.message.ack.rate: Messages acknowledged per second
rabbitmq.connection.count: Active connections
rabbitmq.channel.count: Active channels
rabbitmq.node.disk.free.alarm: CRITICAL: disk space alarm
rabbitmq.node.mem.alarm: CRITICAL: memory alarm
Alert Runbook
ALERT: under-replicated-partitions > 0
Cause: A broker may be down or falling behind in replication
Action: Check broker health, disk usage, network connectivity
Priority: P1 - data at risk if another broker fails
ALERT: consumer lag > 10,000
Cause: Consumers falling behind producers
Action: Scale consumers, investigate processing bottleneck
Priority: P2 - SLA impact growing
ALERT: DLQ depth > 0
Cause: Messages failing after all retries
Action: Investigate consumer logs, check for data quality issues
Priority: P2 - messages require manual review
ALERT: producer error rate > 0.1%
Cause: Messages failing to publish
Action: Check broker availability, producer configuration
Priority: P1 - data may be lost
ALERT: RabbitMQ memory alarm
Cause: Queue depth too large, memory pressure
Action: Scale consumers, check for consumer slowdown
Priority: P1 - broker will stop accepting messages
7. High Availability and Replication
Kafka High Availability
Replication Factor:
replication.factor = 3: Topic data exists on 3 brokers
Tolerated failures: data survives 2 broker failures; with min.insync.replicas=2, acks=all writes continue through only 1
replication.factor = 2: Data exists on 2 brokers
Tolerated failures: 1 broker can fail
Minimum for production: replication.factor = 3
Never use replication.factor = 1 in production
min.insync.replicas:
With replication.factor=3 and min.insync.replicas=2:
- Producer with acks=all requires at least 2 replicas to ACK
- Prevents writes to a degraded cluster (only 1 replica up)
- Throws NotEnoughReplicasException if only 1 replica is in-sync
- This is INTENTIONAL: better to fail writes than lose data
A common choice for replication.factor=3 is min.insync.replicas=2 (replication.factor - 1)
(Allows one broker down while maintaining data safety)
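The interaction between replication.factor and min.insync.replicas can be expressed in a few lines. This is a simplified model that treats every replica on a failed broker as out of sync and ignores leader election details; `ReplicationMath` is a hypothetical name.

```java
final class ReplicationMath {

    /** With acks=all, a write succeeds only while enough replicas are in sync. */
    static boolean acceptsAcksAllWrites(int inSyncReplicas, int minInSyncReplicas) {
        return inSyncReplicas >= minInSyncReplicas;
    }

    /** Broker failures a partition can absorb while still accepting acks=all writes. */
    static int failuresWritesSurvive(int replicationFactor, int minInSyncReplicas) {
        return Math.max(0, replicationFactor - minInSyncReplicas);
    }

    /** Broker failures after which at least one copy of the data still exists. */
    static int failuresDataSurvives(int replicationFactor) {
        return replicationFactor - 1;
    }

    public static void main(String[] args) {
        System.out.println("RF=3, minISR=2: writes survive "
            + failuresWritesSurvive(3, 2) + " failure(s), data survives "
            + failuresDataSurvives(3));
    }
}
```

Setting min.insync.replicas equal to replication.factor gives zero write tolerance: any single broker outage blocks acks=all producers.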
Rack awareness:
# Assign brokers to different racks/availability zones
# Kafka distributes replicas across racks for AZ resilience
broker.rack=us-east-1a # Broker 1 configuration
broker.rack=us-east-1b # Broker 2 configuration
broker.rack=us-east-1c # Broker 3 configuration
# Result: even if AZ us-east-1a goes down,
# all partitions still have replicas in 1b and 1c
RabbitMQ High Availability with Quorum Queues
// Quorum Queues - Raft-based, strongly consistent replication
@Bean
public Queue highAvailabilityQueue() {
return QueueBuilder.durable("critical-orders")
.withArgument("x-queue-type", "quorum")
// quorum queues:
// - Automatically replicated across cluster nodes
// - Data not lost if a minority of replicas fail
// - A publish is confirmed once a majority of replicas have written it
// - Initial replica count can be set with the x-quorum-initial-group-size argument
.build();
}
Quorum Queue behavior:
3-node cluster: requires 2 nodes to be available (majority)
5-node cluster: requires 3 nodes to be available
Can tolerate: floor((n-1)/2) failures where n = cluster size (an even-sized cluster tolerates no more than the next smaller odd size)
3-node cluster: can tolerate 1 node failure
5-node cluster: can tolerate 2 node failures
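The majority rule is easy to get wrong for even cluster sizes, so it helps to compute it. A minimal sketch with a hypothetical `QuorumMath` helper, assuming the queue has a replica on every node:

```java
final class QuorumMath {

    /** Smallest number of replicas that forms a majority. */
    static int majority(int replicas) {
        return replicas / 2 + 1;
    }

    /** Replica failures the queue can absorb while a majority remains. */
    static int toleratedFailures(int replicas) {
        return replicas - majority(replicas);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " nodes: majority " + majority(n)
                + ", tolerates " + toleratedFailures(n));
        }
    }
}
```

A 4-node cluster needs 3 nodes for a majority and so tolerates one failure, the same as a 3-node cluster. This is why odd cluster sizes are usually preferred.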
SQS High Availability
SQS is fully managed and automatically highly available:
- Messages are stored redundantly across multiple availability zones
- No configuration required
- 99.9% availability SLA for standard queues
- Standard and FIFO queues are replicated by default
8. Message TTL and Retention Policies
Kafka Retention Configuration
# Per-topic retention - set when creating topic
#   retention.ms=604800000       7 days (also the broker default)
#   retention.bytes=10737418240  10GB max per partition
#   segment.ms=86400000          Roll log segment daily
#   cleanup.policy=delete        delete (default) or compact
kafka-topics.sh --create \
  --bootstrap-server kafka:9092 \
  --topic order-events \
  --config retention.ms=604800000 \
  --config retention.bytes=10737418240 \
  --config segment.ms=86400000 \
  --config cleanup.policy=delete
# Modify retention on existing topic (change to 30 days)
kafka-configs.sh \
  --bootstrap-server kafka:9092 \
  --entity-type topics \
  --entity-name order-events \
  --alter \
  --add-config retention.ms=2592000000
Retention strategy guidelines:
| Topic Purpose | Recommended Retention | Reason |
|---|---|---|
| Transaction events (orders, payments) | 30-90 days | Audit, replay, debugging |
| User activity events | 7-14 days | Analytics, replay |
| System logs | 3-7 days | Debugging only |
| Metrics and telemetry | 1-3 days | Only recent data matters |
| Audit log (compliance) | 1-7 years | Regulatory requirement |
| Configuration changes | Forever (compacted) | Need latest state always |
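Millisecond values such as 604800000 are easy to mistype. One option is to derive them from a duration in code or in a deployment script. A minimal sketch using `java.time.Duration`; `RetentionValues` is a hypothetical helper name:

```java
import java.time.Duration;

final class RetentionValues {

    /** Converts a retention period in days to the value expected by retention.ms. */
    static long retentionMs(int days) {
        return Duration.ofDays(days).toMillis();
    }

    /** Formats the setting in the key=value form accepted by --config and --add-config. */
    static String retentionConfig(int days) {
        return "retention.ms=" + retentionMs(days);
    }

    public static void main(String[] args) {
        System.out.println(retentionConfig(7));  // retention.ms=604800000
        System.out.println(retentionConfig(30)); // retention.ms=2592000000
    }
}
```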
Setting Message TTL in RabbitMQ
// Per-message TTL
MessageProperties props = new MessageProperties();
props.setExpiration("60000"); // Milliseconds - expires after 60 seconds
Message message = new Message(payload, props);
rabbitTemplate.send("exchange", "routing.key", message);
// Per-queue TTL - all messages in this queue expire after 5 minutes
@Bean
public Queue timeSensitiveQueue() {
return QueueBuilder.durable("time-sensitive-notifications")
.withArgument("x-message-ttl", 300000) // 5 minutes in milliseconds
.withArgument("x-dead-letter-exchange", "dlx.expired") // Send expired to DLQ
.build();
}
SQS Message Visibility and Retention
// Maximum retention: 14 days
CreateQueueRequest request = CreateQueueRequest.builder()
.queueName("order-processing")
.attributes(Map.of(
QueueAttributeName.MESSAGE_RETENTION_PERIOD, "1209600", // 14 days in seconds
QueueAttributeName.VISIBILITY_TIMEOUT, "300", // 5 minute visibility timeout
QueueAttributeName.RECEIVE_MESSAGE_WAIT_TIME_SECONDS, "20" // Long polling
))
.build();
9. Capacity Planning
How to Size a Kafka Cluster
Step 1: Calculate message volume
Requirement: Handle 50,000 messages per second, each message ~1KB average
Daily volume: 50,000 msg/sec x 86,400 sec/day = 4.32 billion messages
Daily data volume: 4.32B x 1KB = 4.32TB raw
With replication factor 3: 4.32TB x 3 = 12.96TB/day
With 7-day retention: 12.96TB x 7 = ~91TB total storage needed
Step 2: Calculate partition count
Single partition throughput: ~10-20MB/sec read + write
Required throughput: 50,000 msg/sec x 1KB = ~50MB/sec
Partitions needed (write): 50MB / 10MB = 5 partitions minimum
With headroom: 12-24 partitions recommended
Step 3: Size the brokers
Per-broker storage: 91TB / 3 brokers = ~30TB per broker
Per-broker disk I/O: Use SSDs for write-heavy topics
Per-broker RAM: 4-6GB JVM heap + OS page cache (as much as possible)
Kafka relies heavily on OS page cache for reads
Rule: RAM = JVM heap (about 6GB) + page cache (16-32GB)
Per-broker CPU: 8-16 cores for high throughput
Production minimum per broker:
CPU: 16 cores
RAM: 32GB (6GB JVM + 26GB page cache)
Disk: NVMe SSD, RAID-10 for reliability + performance
Network: 10Gbps
Step 4: Determine broker count
Minimum: 3 brokers (for RF=3 with quorum)
Recommended: 5+ brokers (for rolling upgrades without downtime)
If any single broker is saturated:
- Add more brokers
- Kafka does not move existing partitions onto new brokers automatically
- Use kafka-reassign-partitions tool to balance load
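The sizing steps can be collected into a small calculator so the numbers are easy to rerun with different inputs. A minimal sketch using decimal units (1KB = 1,000 bytes) to match the figures above; `KafkaCapacity` is a hypothetical name, and the result excludes headroom for growth, indexes, and uneven partition distribution.

```java
final class KafkaCapacity {

    static final long SECONDS_PER_DAY = 86_400L;

    /** Raw bytes produced per day, before replication. */
    static long dailyBytes(long messagesPerSec, long avgMessageBytes) {
        return messagesPerSec * SECONDS_PER_DAY * avgMessageBytes;
    }

    /** Cluster-wide storage for the retention window, including all replicas. */
    static long totalStorageBytes(long dailyBytes, int replicationFactor, int retentionDays) {
        return dailyBytes * replicationFactor * retentionDays;
    }

    /** Minimum partitions for the write rate, given a per-partition throughput estimate. */
    static long minPartitions(long requiredBytesPerSec, long perPartitionBytesPerSec) {
        return (requiredBytesPerSec + perPartitionBytesPerSec - 1) / perPartitionBytesPerSec;
    }

    /** Storage per broker, assuming data is spread evenly. */
    static long perBrokerBytes(long totalStorageBytes, int brokers) {
        return totalStorageBytes / brokers;
    }

    public static void main(String[] args) {
        long daily = dailyBytes(50_000, 1_000);
        long total = totalStorageBytes(daily, 3, 7);
        System.out.printf("daily raw: %.2f TB%n", daily / 1e12);
        System.out.printf("total storage: %.2f TB%n", total / 1e12);
        System.out.printf("per broker (3 brokers): %.2f TB%n", perBrokerBytes(total, 3) / 1e12);
        System.out.println("min partitions: " + minPartitions(50_000_000, 10_000_000));
    }
}
```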
How to Size RabbitMQ
Memory-based sizing:
RabbitMQ stores messages in memory until memory threshold is hit
Default memory high watermark: 40% of total RAM
When exceeded: Flow control kicks in, producers are throttled
Rule: Total unprocessed messages x average message size < 40% of RAM
Example:
Peak queue depth: 100,000 messages
Average message size: 2KB
Total memory needed: 100,000 x 2KB = 200MB
Total RAM needed: 200MB / 0.4 = 500MB minimum
With buffer: 4-8GB RAM for broker
For disk-backed queues (lazy queues or quorum queues):
RAM for indexes and metadata: ~80 bytes per message
100,000 messages = ~8MB RAM for metadata
Much more memory-efficient for large queues
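The same arithmetic in code, as a minimal sketch. It uses decimal units (1KB = 1,000 bytes) and takes the watermark as an integer percentage to avoid floating-point rounding. `RabbitMemorySizing` is a hypothetical name, and the 80 bytes per message figure is the estimate quoted above, not a measured value.

```java
final class RabbitMemorySizing {

    /** Bytes held in the queue at peak depth. */
    static long queueBytes(long peakDepth, long avgMessageBytes) {
        return peakDepth * avgMessageBytes;
    }

    /** Minimum total RAM so that the queue stays under the memory high watermark. */
    static long minRamBytes(long queueBytes, int watermarkPercent) {
        return (queueBytes * 100 + watermarkPercent - 1) / watermarkPercent; // ceiling division
    }

    /** Approximate RAM for per-message metadata when bodies are kept on disk. */
    static long metadataBytes(long peakDepth, long metadataBytesPerMessage) {
        return peakDepth * metadataBytesPerMessage;
    }

    public static void main(String[] args) {
        long queue = queueBytes(100_000, 2_000);
        System.out.println("queue bytes: " + queue);
        System.out.println("min RAM bytes at 40% watermark: " + minRamBytes(queue, 40));
        System.out.println("metadata bytes (disk-backed): " + metadataBytes(100_000, 80));
    }
}
```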
10. Performance Tuning Cheat Sheet
Kafka Tuning Quick Reference
# Producer - High Throughput Mode
batch.size=131072 # 128KB - larger batches
linger.ms=50 # 50ms batching window
compression.type=lz4 # Fast compression
acks=1 # Leader ACK only (for non-critical data)
max.in.flight.requests.per.connection=5
# Producer - High Reliability Mode
acks=all
enable.idempotence=true
max.in.flight.requests.per.connection=5
retries=2147483647
delivery.timeout.ms=120000 # 2 minute total delivery timeout
# Consumer - High Throughput Mode
fetch.min.bytes=65536 # 64KB - fetch larger chunks
fetch.max.wait.ms=500 # Wait up to 500ms for fetch.min.bytes
max.poll.records=500 # Process 500 messages per poll
enable.auto.commit=false # Always manual commit in production
session.timeout.ms=45000 # Avoid unnecessary rebalances
# Broker
num.io.threads=16 # I/O threads - match to disk count or cores
num.network.threads=8 # Network threads - match to network cores
socket.send.buffer.bytes=1048576 # 1MB
socket.receive.buffer.bytes=1048576 # 1MB
socket.request.max.bytes=104857600 # 100MB max request size
log.retention.check.interval.ms=300000 # Check retention every 5 minutes
RabbitMQ Tuning Quick Reference
# Connection and channel settings
channel_max = 128 # Max channels per connection
tcp_listen_options.backlog = 128 # TCP connection queue depth
heartbeat = 60 # Seconds between heartbeats
# Memory and disk settings
vm_memory_high_watermark.relative = 0.4 # Throttle at 40% RAM
vm_memory_high_watermark_paging_ratio = 0.5 # Page to disk at 50% of watermark
disk_free_limit.absolute = 2GB # Minimum free disk space
# Performance settings
consumer_timeout = 1800000 # 30 minutes max for consumer to ACK
max_message_size = 134217728 # 128MB max message size
Load Testing Before Production
// Load test producer
@Component
public class LoadTestProducer {
public void runLoadTest(int targetTps, Duration duration) {
RateLimiter rateLimiter = RateLimiter.create(targetTps);
Instant endTime = Instant.now().plus(duration);
long messagesSent = 0;
while (Instant.now().isBefore(endTime)) {
rateLimiter.acquire();
String key = "user-" + ThreadLocalRandom.current().nextInt(1000);
kafkaTemplate.send("load-test-topic", key, generateTestMessage());
messagesSent++;
}
log.info("Load test complete: {} messages sent at {} TPS", messagesSent, targetTps);
}
}
Performance Baseline Benchmarks
Kafka:
Single partition, no replication, no compression:
Producer: 800MB/sec, 2 million msg/sec
Consumer: 940MB/sec, 2 million msg/sec
Realistic production (RF=3, acks=all, snappy compression):
Producer: 200-400MB/sec, 500K-1M msg/sec
Consumer: 300-600MB/sec per consumer group
These numbers are on modern SSD hardware. Your results will vary.
RabbitMQ:
Single queue, durable messages, publisher confirms:
Throughput: 20,000-60,000 msg/sec per queue (single node)
Quorum queues: 15,000-40,000 msg/sec (replication overhead)
RabbitMQ is optimized for moderate-throughput, complex routing
Not designed to compete with Kafka at millions of msg/sec
Previous: Part 4 - Advanced Concepts
Next: Part 6 - Pitfalls and Best Practices
Index: Message Queues Demystified - Index