Message Queues Demystified - Part 5: Operations and Performance

A system that works perfectly on your laptop and fails under production load is not a working system.
Operations and performance are where theory meets reality.

Throughput Optimization
Batching Strategies
Message Compression
Horizontal Scaling and Partition Strategy
Consumer Lag - The Most Important Operational Metric
Key Metrics to Monitor
High Availability and Replication
Message TTL and Retention Policies
Capacity Planning
Performance Tuning Cheat Sheet

1. Throughput Optimization

Understanding the Bottlenecks

Before optimizing, identify where the bottleneck is:

End-to-end message path:
Producer App -> Network -> Broker (write) -> Disk -> Network -> Consumer App -> Processing

Common bottleneck locations:
1. Producer: serialization speed, batch size, connection count
2. Network (producer to broker): bandwidth, latency
3. Broker write: disk I/O, replication overhead
4. Broker read: disk I/O, network
5. Consumer: deserialization speed, processing time, downstream dependencies
6. Database (consumer side): write throughput, lock contention

Producer Throughput Optimization

Kafka Producer settings:

@Bean
public ProducerFactory<String, Object> highThroughputProducerFactory() {
    Map<String, Object> config = new HashMap<>();
 
    // THROUGHPUT settings
    config.put(ProducerConfig.LINGER_MS_CONFIG, 20);
    // Wait up to 20ms to accumulate messages before sending
    // Allows batching at the cost of slightly higher latency
    // Default is 0 (send immediately) - terrible for throughput
 
    config.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
    // 64KB batch size (default is 16KB)
    // Larger batches = more messages per network round trip
 
    config.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);
    // 64MB total buffer memory (default is 32MB)
    // Producer blocks when buffer is full
 
    config.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
    // Compress batches before sending
    // LZ4 is fast with good compression ratio
 
    // RELIABILITY settings (do not sacrifice these for throughput)
    config.put(ProducerConfig.ACKS_CONFIG, "1");
    // acks=1 for throughput (leader ACK only)
    // Use acks=all for critical data
 
    config.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
    // 5 in-flight requests per connection
    // Increases throughput but can affect ordering on retry
    // Set to 1 if you need strict per-key ordering with retries
 
    return new DefaultKafkaProducerFactory<>(config);
}

Throughput vs Latency trade-off:

linger.ms=0 (default):
  - Each message sent immediately
  - Low latency (messages sent fast)
  - Low throughput (each message = one network call)

linger.ms=20:
  - Wait 20ms to accumulate messages into a batch
  - Higher latency (up to 20ms delay)
  - High throughput (one network call per batch = 100x efficiency)
  - Right choice for background processing, analytics

linger.ms=100:
  - Very high throughput
  - 100ms latency acceptable only for bulk operations

Consumer Throughput Optimization

Increase parallelism:

@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> factory() {
    ConcurrentKafkaListenerContainerFactory<String, Object> factory =
        new ConcurrentKafkaListenerContainerFactory<>();
 
    factory.setConcurrency(6);
    // 6 concurrent consumer threads per instance
    // Each thread processes one partition
    // Requires at least 6 partitions for full utilization
 
    return factory;
}

Parallel processing within a single message (for CPU-bound work):

@KafkaListener(topics = "image-resize-jobs", concurrency = "6")
public void processImage(ImageResizeJob job, Acknowledgment ack) {
    CompletableFuture.runAsync(() -> {
        imageService.resize(job.getImageUrl(), job.getTargetDimensions());
    }, imageProcessingExecutor)
    .thenRun(ack::acknowledge)
    .exceptionally(e -> {
        log.error("Image resize failed: {}", job.getImageId(), e);
        return null; // Do not ACK - will be redelivered
    });
}

Optimize the processing path:

// Slow - N database round trips for N messages
@KafkaListener(topics = "user-events")
public void handleUserEvent(UserEvent event) {
    userRepository.findById(event.getUserId())  // Database call
        .ifPresent(user -> {
            userRepository.save(user.apply(event));  // Another database call
        });
}
 
// Fast - 1 database round trip for N messages (batch consumer)
@KafkaListener(topics = "user-events", batch = "true")
public void handleUserEvents(List<UserEvent> events) {
    Set<String> userIds = events.stream()
        .map(UserEvent::getUserId)
        .collect(Collectors.toSet());
 
    // Single query for all users in the batch
    Map<String, User> users = userRepository.findAllById(userIds)
        .stream().collect(Collectors.toMap(User::getId, u -> u));
 
    List<User> updatedUsers = events.stream()
        .map(event -> users.get(event.getUserId()).apply(event))
        .collect(Collectors.toList());
 
    // Single batch save
    userRepository.saveAll(updatedUsers);
}

2. Batching Strategies

Producer Batching

Producer batching groups multiple messages into a single network call to the broker. This is one of the highest-leverage optimizations.

Without batching:
  10,000 messages = 10,000 network calls = 10,000 disk writes on broker
  Network overhead dominates

With batching (linger.ms=20, batch.size=64KB):
  10,000 messages = ~100 network calls = more efficient disk writes
  10-100x throughput improvement

The batch.size limit:
Kafka sends a batch when either:

The batch reaches batch.size bytes, OR
linger.ms milliseconds have passed

Whichever comes first. So even with a large linger.ms, a busy producer will send batches immediately when they fill up.

Consumer Batch Processing

// Single message processing - 1000 DB calls for 1000 messages
@KafkaListener(topics = "orders")
public void process(OrderEvent event) {
    orderRepository.save(new ProcessedOrder(event));  // 1 DB call per message
}
 
// Batch processing - 10 DB calls for 1000 messages (100x reduction)
@KafkaListener(topics = "orders", batch = "true")
public void processBatch(List<ConsumerRecord<String, OrderEvent>> records) {
    List<ProcessedOrder> orders = records.stream()
        .map(record -> new ProcessedOrder(record.value()))
        .collect(Collectors.toList());
 
    orderRepository.saveAll(orders);  // 1 DB call for the whole batch
 
    // Acknowledge all records in the batch
    // Spring Kafka handles this automatically in batch mode
}

Batch size tuning:

# Maximum records to fetch per poll
spring.kafka.consumer.max-poll-records=500
 
# Maximum bytes to fetch per poll per partition
fetch.max.bytes=52428800     # 50MB
fetch.min.bytes=1024         # Wait for at least 1KB before returning
fetch.max.wait.ms=500        # Wait at most 500ms for fetch.min.bytes

RabbitMQ Batch Publishing

// Manual batching in RabbitMQ
@Service
public class BatchPublisher {
 
    private final List<Message> pendingBatch = new ArrayList<>();
    private static final int BATCH_SIZE = 100;
 
    public synchronized void addToBatch(Object payload) {
        pendingBatch.add(MessageBuilder.withBody(
            objectMapper.writeValueAsBytes(payload)).build());
 
        if (pendingBatch.size() >= BATCH_SIZE) {
            flush();
        }
    }
 
    @Scheduled(fixedDelay = 1000)
    public synchronized void flush() {
        if (pendingBatch.isEmpty()) return;
 
        rabbitTemplate.execute(channel -> {
            for (Message message : pendingBatch) {
                channel.basicPublish("exchange", "routing.key", null, message.getBody());
            }
            channel.waitForConfirmsOrDie(5000);  // Wait for broker ACK
            return null;
        });
 
        pendingBatch.clear();
    }
}

3. Message Compression

Compression reduces network bandwidth and disk usage at the cost of CPU time on the producer (compress) and consumer (decompress).

Compression Options

Algorithm	Ratio	Speed	CPU Cost	Best For
`none`	1.0x	N/A	None	Low-volume, time-critical
`gzip`	3-5x	Slow	High	Disk-space sensitive, throughput not critical
`snappy`	2-3x	Fast	Low	General purpose, balanced
`lz4`	2-3x	Very Fast	Very Low	High-throughput pipelines
`zstd`	3-5x	Fast	Medium	Best overall compression + speed balance

Recommendation for most Kafka deployments:

High throughput pipelines: lz4
Storage-sensitive: zstd (best overall)
Compatibility: snappy

// Kafka compression (batch-level)
config.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
// Entire batch is compressed as one unit
// More messages in batch = better compression ratio

When Compression Helps Most

Best compression scenarios:
- JSON messages (highly redundant, compresses well)
- Log events (repetitive field names and values)
- Large batches (compression works better on larger inputs)

Compression adds little value for:
- Already compressed data (JPEG, MP4, ZIP)
- Very small messages (< 1KB)
- Low-volume, latency-sensitive paths

4. Horizontal Scaling and Partition Strategy

Kafka Partition Count - The Golden Rule

Rule: Number of partitions = maximum consumer parallelism

6 partitions, 3 consumers: each consumer handles 2 partitions
6 partitions, 6 consumers: each consumer handles 1 partition (maximum parallelism)
6 partitions, 10 consumers: 6 consumers active, 4 idle (waste)

To handle 10,000 messages/second and each consumer handles 1,000/second:
  Required consumers = 10,000 / 1,000 = 10
  Therefore: create at least 10 partitions

Partition count guidelines:

Small topic (< 10,000 msg/sec):   3-6 partitions
Medium topic (10K-100K msg/sec):  12-24 partitions
Large topic (> 100K msg/sec):     48-96+ partitions

Start with more partitions than you think you need.
You cannot decrease partitions without recreating the topic.
You CAN add partitions later (but this changes partition assignment for existing keys).

Consumer Group Rebalancing

When a consumer joins or leaves a consumer group, Kafka triggers a rebalance: all partitions are reassigned among the remaining consumers.

During a rebalance:

ALL consumers stop processing
Partitions are redistributed
Consumers start processing from their new partitions
This can take seconds to minutes depending on configuration

Reducing rebalance impact:

@Bean
public ConcurrentKafkaListenerContainerFactory<String, Object> factory() {
    factory.getContainerProperties().setPartitionAssignmentStrategy(
        List.of(CooperativeStickyAssignor.class)
    );
    // CooperativeStickyAssignor (Kafka 2.4+):
    // - Assigns only the necessary partitions, not all
    // - Consumers keep their current partitions when possible
    // - No stop-the-world rebalance - incremental assignment
    // - Greatly reduces impact of scaling events
}
 
// Tune session timeout and heartbeat
config.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45000);      // 45 seconds
config.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 15000);   // 15 seconds
config.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);   // 5 minutes max between polls

The max.poll.interval.ms trap:

If your consumer takes longer than max.poll.interval.ms to process a batch:
1. Kafka considers the consumer dead
2. Triggers a rebalance
3. The consumer's partitions are reassigned
4. When the consumer finishes and tries to commit, it fails
5. Messages are reprocessed by the new assigned consumer
6. DUPLICATE PROCESSING

Fix: Either increase max.poll.interval.ms OR reduce max.poll.records
     so each batch is smaller and faster to process.

5. Consumer Lag - The Most Important Operational Metric

What Consumer Lag Is

Consumer lag is the difference between the latest produced offset and the latest committed consumer offset for each partition.

Kafka Partition 0 current state:
  Latest produced offset: 50,000
  Consumer "email-group" committed offset: 49,900
  Consumer lag: 100 messages

If consumers are keeping up: lag stays constant or decreases
If consumers are falling behind: lag grows over time

Measuring Consumer Lag

# Kafka CLI
kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group email-service-group \
  --describe
 
# Output:
# TOPIC         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
# order-events  0          49900           50000           100  email-consumer-1
# order-events  1          50200           50200           0    email-consumer-2
# order-events  2          49800           50000           200  email-consumer-3
# TOTAL LAG: 300 messages

// Programmatic lag monitoring
@Service
public class ConsumerLagMonitor {
 
    private final AdminClient adminClient;
    private final MeterRegistry meterRegistry;
 
    @Scheduled(fixedDelay = 30000)
    public void reportLag() {
        Map<TopicPartition, OffsetAndMetadata> offsets = adminClient
            .listConsumerGroupOffsets("email-service-group")
            .partitionsToOffsetAndMetadata()
            .get();
 
        Map<TopicPartition, Long> endOffsets = adminClient
            .listOffsets(offsets.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
            .all()
            .get()
            .entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey,
                e -> e.getValue().offset()));
 
        long totalLag = offsets.entrySet().stream()
            .mapToLong(entry -> {
                long endOffset = endOffsets.getOrDefault(entry.getKey(), 0L);
                long consumerOffset = entry.getValue().offset();
                long lag = Math.max(0, endOffset - consumerOffset);
 
                // Report per-partition lag as a gauge metric
                meterRegistry.gauge("kafka.consumer.lag",
                    Tags.of("topic", entry.getKey().topic(),
                            "partition", String.valueOf(entry.getKey().partition()),
                            "group", "email-service-group"),
                    lag);
 
                return lag;
            }).sum();
 
        // Alert if total lag exceeds threshold
        if (totalLag > 10000) {
            alertingService.sendAlert("Consumer lag exceeds 10,000: current lag = " + totalLag);
        }
    }
}

Consumer Lag Alert Thresholds

LAG LEVEL         ACTION REQUIRED
---------------------------------------------------------------------------
0 - 100           Normal, consumers keeping up
100 - 1,000       Watch - slight backpressure
1,000 - 10,000    Alert - investigate, consider scaling
10,000 - 100,000  Critical - immediate scaling or issue investigation
100,000+          Incident - consumers severely behind, SLA at risk

6. Key Metrics to Monitor

Kafka Metrics

# Producer Metrics
kafka.producer.record-send-rate:          Messages sent per second
kafka.producer.record-error-rate:         Failed sends per second
kafka.producer.request-latency-avg:       Average time for broker to ACK
kafka.producer.batch-size-avg:            Average batch size (bigger = more efficient)
kafka.producer.compression-rate:          Compression effectiveness (lower = better)
kafka.producer.record-queue-time-avg:     Time messages wait in buffer before sending
 
# Consumer Metrics
kafka.consumer.records-consumed-rate:     Messages consumed per second
kafka.consumer.fetch-rate:                Poll calls per second
kafka.consumer.fetch-latency-avg:         Time to fetch from broker
kafka.consumer.records-lag:               MOST IMPORTANT: messages behind head
kafka.consumer.records-lag-max:           Maximum lag across all partitions
kafka.consumer.commit-latency-avg:        Time to commit offsets
 
# Broker Metrics
kafka.server.bytes-in-per-sec:            Incoming data rate
kafka.server.bytes-out-per-sec:           Outgoing data rate
kafka.server.messages-in-per-sec:         Message ingestion rate
kafka.server.request-handler-pool.idle:   Broker CPU availability
kafka.log.flush-time-ms-p99:              Disk flush latency (99th percentile)
kafka.replica.under-replicated-partitions: > 0 means replication is lagging (CRITICAL)
kafka.controller.active-controller-count: Must be exactly 1

RabbitMQ Metrics

rabbitmq.queue.messages:               Total messages in queue
rabbitmq.queue.messages.ready:         Messages waiting for consumer
rabbitmq.queue.messages.unacknowledged: Messages in flight (consumer has but not ACKed)
rabbitmq.queue.consumers:              Number of active consumers
rabbitmq.queue.message.publish.rate:   Messages published per second
rabbitmq.queue.message.deliver.rate:   Messages delivered per second
rabbitmq.queue.message.ack.rate:       Messages acknowledged per second
rabbitmq.connection.count:             Active connections
rabbitmq.channel.count:                Active channels
rabbitmq.node.disk.free.alarm:         CRITICAL: disk space alarm
rabbitmq.node.mem.alarm:               CRITICAL: memory alarm

Alert Runbook

ALERT: under-replicated-partitions > 0
  Cause: A broker may be down or falling behind in replication
  Action: Check broker health, disk usage, network connectivity
  Priority: P1 - data at risk if another broker fails

ALERT: consumer lag > 10,000
  Cause: Consumers falling behind producers
  Action: Scale consumers, investigate processing bottleneck
  Priority: P2 - SLA impact growing

ALERT: DLQ depth > 0
  Cause: Messages failing after all retries
  Action: Investigate consumer logs, check for data quality issues
  Priority: P2 - messages require manual review

ALERT: producer error rate > 0.1%
  Cause: Messages failing to publish
  Action: Check broker availability, producer configuration
  Priority: P1 - data may be lost

ALERT: RabbitMQ memory alarm
  Cause: Queue depth too large, memory pressure
  Action: Scale consumers, check for consumer slowdown
  Priority: P1 - broker will stop accepting messages

7. High Availability and Replication

Kafka High Availability

Replication Factor:

replication.factor = 3: Topic data exists on 3 brokers
Tolerated failures: 2 brokers can fail, system continues

replication.factor = 2: Data exists on 2 brokers
Tolerated failures: 1 broker can fail

Minimum for production: replication.factor = 3
Never use replication.factor = 1 in production

min.insync.replicas:

With replication.factor=3 and min.insync.replicas=2:
- Producer with acks=all requires at least 2 replicas to ACK
- Prevents writes to a degraded cluster (only 1 replica up)
- Throws NotEnoughReplicasException if only 1 replica is in-sync
- This is INTENTIONAL: better to fail writes than lose data

min.insync.replicas should always be: replication.factor - 1
(Allows one broker down while maintaining data safety)

Rack awareness:

# Assign brokers to different racks/availability zones
# Kafka distributes replicas across racks for AZ resilience
broker.rack=us-east-1a  # Broker 1 configuration
broker.rack=us-east-1b  # Broker 2 configuration
broker.rack=us-east-1c  # Broker 3 configuration
 
# Result: even if AZ us-east-1a goes down,
# all partitions still have replicas in 1b and 1c

RabbitMQ High Availability with Quorum Queues

// Quorum Queues - Raft-based, strongly consistent replication
@Bean
public Queue highAvailabilityQueue() {
    return QueueBuilder.durable("critical-orders")
        .withArgument("x-queue-type", "quorum")
        // quorum queues:
        // - Automatically replicated across cluster nodes
        // - Data not lost if minority of nodes fail
        // - Default replication: majority of nodes
        // - No configuration of replica count needed (uses cluster size)
        .build();
}

Quorum Queue behavior:

3-node cluster: requires 2 nodes to be available (majority)
5-node cluster: requires 3 nodes to be available
Can tolerate: floor(n/2) failures where n = cluster size

3-node cluster: can tolerate 1 node failure
5-node cluster: can tolerate 2 node failures

SQS High Availability

SQS is fully managed and automatically highly available:

Messages are stored redundantly across multiple availability zones
No configuration required
99.9% availability SLA for standard queues
Standard and FIFO queues are replicated by default

8. Message TTL and Retention Policies

Kafka Retention Configuration

# Per-topic retention - set when creating topic
kafka-topics.sh --create \
  --topic order-events \
  --config retention.ms=604800000 \    # 7 days by default
  --config retention.bytes=10737418240 \ # 10GB max per partition
  --config segment.ms=86400000 \       # Roll log segment daily
  --config cleanup.policy=delete        # delete (default) or compact
 
# Modify retention on existing topic
kafka-configs.sh \
  --bootstrap-server kafka:9092 \
  --entity-type topics \
  --entity-name order-events \
  --alter \
  --add-config retention.ms=2592000000  # Change to 30 days

Retention strategy guidelines:

Topic Purpose	Recommended Retention	Reason
Transaction events (orders, payments)	30-90 days	Audit, replay, debugging
User activity events	7-14 days	Analytics, replay
System logs	3-7 days	Debugging only
Metrics and telemetry	1-3 days	Only recent data matters
Audit log (compliance)	1-7 years	Regulatory requirement
Configuration changes	Forever (compacted)	Need latest state always

Setting Message TTL in RabbitMQ

// Per-message TTL
MessageProperties props = new MessageProperties();
props.setExpiration("60000");  // Milliseconds - expires after 60 seconds
Message message = new Message(payload, props);
rabbitTemplate.send("exchange", "routing.key", message);
 
// Per-queue TTL - all messages in this queue expire after 5 minutes
@Bean
public Queue timeSensitiveQueue() {
    return QueueBuilder.durable("time-sensitive-notifications")
        .withArgument("x-message-ttl", 300000)   // 5 minutes in milliseconds
        .withArgument("x-dead-letter-exchange", "dlx.expired")  // Send expired to DLQ
        .build();
}

SQS Message Visibility and Retention

// Maximum retention: 14 days
CreateQueueRequest request = CreateQueueRequest.builder()
    .queueName("order-processing")
    .attributes(Map.of(
        QueueAttributeName.MESSAGE_RETENTION_PERIOD, "1209600",  // 14 days in seconds
        QueueAttributeName.VISIBILITY_TIMEOUT, "300",            // 5 minute visibility timeout
        QueueAttributeName.RECEIVE_MESSAGE_WAIT_TIME_SECONDS, "20" // Long polling
    ))
    .build();

9. Capacity Planning

How to Size a Kafka Cluster

Step 1: Calculate message volume

Requirement: Handle 50,000 messages per second, each message ~1KB average

Daily volume: 50,000 msg/sec x 86,400 sec/day = 4.32 billion messages
Daily data volume: 4.32B x 1KB = 4.32TB raw
With replication factor 3: 4.32TB x 3 = 12.96TB/day
With 7-day retention: 12.96TB x 7 = ~91TB total storage needed

Step 2: Calculate partition count

Single partition throughput: ~10-20MB/sec read + write
Required throughput: 50,000 msg/sec x 1KB = ~50MB/sec

Partitions needed (write): 50MB / 10MB = 5 partitions minimum
With headroom: 12-24 partitions recommended

Step 3: Size the brokers

Per-broker storage: 91TB / 3 brokers = ~30TB per broker
Per-broker disk I/O: Use SSDs for write-heavy topics
Per-broker RAM: 4-6GB JVM heap + OS page cache (as much as possible)
  Kafka relies heavily on OS page cache for reads
  Rule: RAM = max(6GB JVM, 16-32GB for page cache)
Per-broker CPU: 8-16 cores for high throughput

Production minimum per broker:
  CPU: 16 cores
  RAM: 32GB (6GB JVM + 26GB page cache)
  Disk: NVMe SSD, RAID-10 for reliability + performance
  Network: 10Gbps

Step 4: Determine broker count

Minimum: 3 brokers (for RF=3 with quorum)
Recommended: 5+ brokers (for rolling upgrades without downtime)

If any single broker is saturated:
  - Add more brokers
  - Kafka will automatically rebalance partitions
  - Use kafka-reassign-partitions tool to balance load

How to Size RabbitMQ

Memory-based sizing:

RabbitMQ stores messages in memory until memory threshold is hit
Default memory high watermark: 40% of total RAM
When exceeded: Flow control kicks in, producers are throttled

Rule: Total unprocessed messages x average message size < 40% of RAM

Example:
  Peak queue depth: 100,000 messages
  Average message size: 2KB
  Total memory needed: 100,000 x 2KB = 200MB
  Total RAM needed: 200MB / 0.4 = 500MB minimum
  With buffer: 4-8GB RAM for broker

For disk-backed queues (lazy queues or quorum queues):
  RAM for indexes and metadata: ~80 bytes per message
  100,000 messages = ~8MB RAM for metadata
  Much more memory-efficient for large queues

10. Performance Tuning Cheat Sheet

Kafka Tuning Quick Reference

# Producer - High Throughput Mode
batch.size=131072           # 128KB - larger batches
linger.ms=50                # 50ms batching window
compression.type=lz4        # Fast compression
acks=1                      # Leader ACK only (for non-critical data)
max.in.flight.requests.per.connection=5
 
# Producer - High Reliability Mode
acks=all
enable.idempotence=true
max.in.flight.requests.per.connection=5
retries=2147483647
delivery.timeout.ms=120000  # 2 minute total delivery timeout
 
# Consumer - High Throughput Mode
fetch.min.bytes=65536       # 64KB - fetch larger chunks
fetch.max.wait.ms=500       # Wait up to 500ms for fetch.min.bytes
max.poll.records=500        # Process 500 messages per poll
enable.auto.commit=false    # Always manual commit in production
session.timeout.ms=45000    # Avoid unnecessary rebalances
 
# Broker
num.io.threads=16           # I/O threads - match to disk count or cores
num.network.threads=8       # Network threads - match to network cores
socket.send.buffer.bytes=1048576     # 1MB
socket.receive.buffer.bytes=1048576  # 1MB
socket.request.max.bytes=104857600   # 100MB max request size
log.retention.check.interval.ms=300000  # Check retention every 5 minutes

RabbitMQ Tuning Quick Reference

# Connection and channel settings
channel_max = 128           # Max channels per connection
tcp_listen_backlog = 128    # TCP connection queue depth
heartbeat = 60              # Seconds between heartbeats
 
# Memory and disk settings
vm_memory_high_watermark.relative = 0.4   # Throttle at 40% RAM
vm_memory_high_watermark_paging_ratio = 0.5  # Page to disk at 50% of watermark
disk_free_limit.absolute = 2GB  # Minimum free disk space
 
# Performance settings
consumer_timeout = 1800000  # 30 minutes max for consumer to ACK
max_message_size = 134217728  # 128MB max message size

Load Testing Before Production

// Load test producer
@Component
public class LoadTestProducer {
 
    public void runLoadTest(int targetTps, Duration duration) {
        RateLimiter rateLimiter = RateLimiter.create(targetTps);
        Instant endTime = Instant.now().plus(duration);
 
        long messagesSent = 0;
        while (Instant.now().isBefore(endTime)) {
            rateLimiter.acquire();
 
            String key = "user-" + ThreadLocalRandom.current().nextInt(1000);
            kafkaTemplate.send("load-test-topic", key, generateTestMessage());
            messagesSent++;
        }
 
        log.info("Load test complete: {} messages sent at {} TPS", messagesSent, targetTps);
    }
}

Performance Baseline Benchmarks

Kafka:

Single partition, no replication, no compression:
  Producer: 800MB/sec, 2 million msg/sec
  Consumer: 940MB/sec, 2 million msg/sec

Realistic production (RF=3, acks=all, snappy compression):
  Producer: 200-400MB/sec, 500K-1M msg/sec
  Consumer: 300-600MB/sec per consumer group

These numbers are on modern SSD hardware. Your results will vary.

RabbitMQ:

Single queue, durable messages, publisher confirms:
  Throughput: 20,000-60,000 msg/sec per queue (single node)
  Quorum queues: 15,000-40,000 msg/sec (replication overhead)

RabbitMQ is optimized for moderate-throughput, complex routing
Not designed to compete with Kafka at millions of msg/sec

Previous: Part 4 - Advanced Concepts
Next: Part 6 - Pitfalls and Best Practices
Index: Message Queues Demystified - Index

Series: Message Queues Demystified